Improving Chinese-Catalan machine translation with Wikipedia parallel corpus

Description

  • Abstract

    Neural Machine Translation (NMT) is the state-of-the-art approach to automatic translation. However, NMT performance on low-resource language pairs remains limited, because the models rely on parallel corpora. The present study aims to improve the Chinese↔Catalan translation of the M2M100 model developed by Facebook; Catalan is a European language with rather limited resources. As training data, we use the Wikipedia parallel corpora built in the collaborative study of Zhou (2022). We fine-tuned all parameters of the model (full fine-tuning) and obtained improvements in both Chinese↔Catalan directions: +0.3-0.5 BLEU with a larger corpus and +0.1-0.2 BLEU with a smaller but higher-quality corpus. The results also show that, while a small dataset is enough to improve the already state-of-the-art baseline system, corpus size determines the training results when data quality is comparable.
  • Description

    Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Maite Melero i Nogués
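The improvements above are reported in BLEU, which scores a system translation by its n-gram overlap with a reference. As a rough illustration of what the metric measures, here is a minimal sentence-level BLEU sketch in plain Python (uniform n-gram weights, brevity penalty, no smoothing); the thesis itself would have used a standard evaluation tool, and this simplified version will not reproduce its exact scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of 1..max_n-gram
    precisions, times a brevity penalty. No smoothing, single reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total if overlap else 1e-9)
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_avg)

# A perfect match scores 1.0; any divergence lowers the score.
perfect = sentence_bleu(list("abcde"), list("abcde"))
```

A gain of +0.3-0.5 BLEU, as reported here, corresponds to a small but consistent increase in this overlap across the whole test set.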