The impact of tokenization on gender bias in Machine Translation

Description

Abstract
This study examines the impact of tokenization methods on gender bias in Neural Machine Translation (NMT). Four tokenization approaches (Unigram, BPE, Character, and Morfessor) are compared in terms of translation quality, measured by BLEU scores, and gender accuracy. Results show that Unigram achieves the highest BLEU scores, closely followed by BPE and Morfessor, while Character scores lower. However, the gender accuracy analysis shows that all models generate masculine forms more frequently than feminine forms, and they overwhelmingly generate masculine forms when no context is provided. The Unigram method exhibits the highest accuracy for both feminine and masculine forms, surpassing BPE and Morfessor. These findings emphasize the need to address gender bias in MT systems and highlight the complex relationship between tokenization methods, translation quality, and gender accuracy. Further research is warranted to explore additional factors influencing gender bias. This study contributes to the development of inclusive and unbiased translation technologies.
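To make the notion of gender accuracy concrete, the following is a minimal sketch of how accuracy could be computed separately for feminine and masculine forms. It is an illustration only: the abstract does not describe the thesis's actual evaluation procedure, and the function name `gender_accuracy`, the `"F"`/`"M"` labels, and the toy data are all hypothetical.

```python
from collections import Counter


def gender_accuracy(examples):
    """Return accuracy per gold gender.

    `examples` is an iterable of (gold_gender, predicted_gender) pairs.
    This input format is an assumption made for illustration.
    """
    correct = Counter()
    total = Counter()
    for gold, predicted in examples:
        total[gold] += 1
        if predicted == gold:
            correct[gold] += 1
    # Report accuracy only for genders that appear in the gold labels.
    return {gender: correct[gender] / total[gender] for gender in total}


# Hypothetical toy data: one feminine referent is rendered as masculine.
examples = [("F", "M"), ("F", "F"), ("M", "M"), ("M", "M")]
scores = gender_accuracy(examples)
```

Reporting the two accuracies separately, rather than as a single pooled figure, is what makes a skew towards masculine forms visible.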
Description
Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Maite Melero Nogues
Collections
Master's in Theoretical and Applied Linguistics. Research master's theses