Neural Machine Translation (NMT) is the state-of-the-art approach to automatic translation. However, the performance of NMT models on low-resource language pairs remains limited, because NMT depends on large parallel corpora. The present study aims to improve the machine translation between Chinese and Catalan, a European language with rather limited resources, of the M2M100 model developed by Facebook. As training data, we use the Wikipedia parallel corpora compiled in the collaborative study by Zhou (2022). We trained the model with full fine-tuning and obtained improvements for Chinese↔Catalan of +0.3-0.5 BLEU with a larger corpus and +0.1-0.2 BLEU with a smaller but higher-quality corpus. The results also show that, while a small dataset is enough to improve on the already state-of-the-art baseline system, corpus size determines the training outcome when data quality is comparable.
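To make the fine-tuning setup concrete, the sketch below shows what a full fine-tune of M2M100 on Chinese-Catalan sentence pairs could look like; it is a minimal illustration, not the authors' exact pipeline. It assumes the public Hugging Face checkpoint facebook/m2m100_418M and a hypothetical tab-separated file zh_ca_wiki.tsv of Chinese/Catalan sentence pairs; the hyperparameters are illustrative.

```python
# Minimal full fine-tuning sketch for M2M100 (Chinese -> Catalan).
# Assumptions (not from the paper): checkpoint facebook/m2m100_418M and a
# hypothetical file zh_ca_wiki.tsv with tab-separated zh/ca sentence pairs.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang, tokenizer.tgt_lang = "zh", "ca"  # M2M100 language codes

class PairDataset(Dataset):
    """Loads tab-separated source/target sentence pairs from disk."""
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.pairs = [line.rstrip("\n").split("\t") for line in f]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        enc = tokenizer(src, text_target=tgt, truncation=True,
                        max_length=128, padding="max_length",
                        return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        # Mask padding in the labels so it is ignored (-100) by the loss.
        item["labels"][item["labels"] == tokenizer.pad_token_id] = -100
        return item

loader = DataLoader(PairDataset("zh_ca_wiki.tsv"), batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

for epoch in range(3):              # full fine-tune: all weights are updated
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # cross-entropy over target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

After training, Catalan output for BLEU evaluation can be produced with model.generate, passing tokenizer.get_lang_id("ca") as forced_bos_token_id so the decoder starts in the target language.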