The impact of tokenization on gender bias in Machine Translation

  • dc.contributor.author Mash, Audrey
  • dc.date.accessioned 2023-09-29T15:08:12Z
  • dc.date.available 2023-09-29T15:08:12Z
  • dc.date.issued 2023-09-29
  • dc.description Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Maite Melero Nogues
  • dc.description.abstract This study examines the impact of tokenization methods on gender bias in Neural Machine Translation (NMT). Unigram, BPE, Character, and Morfessor tokenization approaches are compared in terms of translation quality, measured by BLEU scores, and gender accuracy. Results show that Unigram achieves the highest BLEU scores, closely followed by BPE and Morfessor, while Character scores lower. However, in the gender accuracy analysis, all models display a bias towards generating masculine forms more frequently than feminine forms. They also overwhelmingly generate masculine forms when no context is provided. The Unigram method exhibits the highest accuracy for both feminine and masculine forms, surpassing BPE and Morfessor. These findings emphasize the need to address gender bias in MT systems and the complex relationship between tokenization methods, translation quality, and gender accuracy. Further research is warranted to explore additional factors influencing gender bias. This study contributes to the development of inclusive and unbiased translation technologies.
  • dc.format.mimetype application/pdf
  • dc.identifier.uri http://hdl.handle.net/10230/57998
  • dc.language.iso eng
  • dc.rights CC Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0)
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/deed.ca
  • dc.subject.keyword Machine Translation
  • dc.subject.keyword Neural Machine Translation
  • dc.subject.keyword Sub-word tokenization
  • dc.subject.keyword Gender bias
  • dc.subject.keyword Unigram
  • dc.subject.keyword BPE (Byte Pair Encoding)
  • dc.subject.keyword Character-based tokenization
  • dc.subject.keyword Morfessor
  • dc.title The impact of tokenization on gender bias in Machine Translation
  • dc.type info:eu-repo/semantics/masterThesis