Building a Catalan-Chinese parallel corpus from Wikipedia for use in machine translation

dc.contributor.author: Zhou, Chenyue
dc.contributor.other: Melero i Nogués, Maite
dc.date.accessioned: 2022-09-21T16:55:53Z
dc.date.available: 2022-09-21T16:55:53Z
dc.date.issued: 2022-09-21
dc.description: Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Maite Melero
dc.description.abstract: The lack of parallel corpora is one of the biggest obstacles to progress in Machine Translation (MT) for low-resource languages. In this work, we crawl and filter parallel sentences in Catalan and Chinese from Wikipedia in order to compile a good-quality parallel corpus. This paper describes the process we follow to build the corpus: mining the text data, computing sentence embeddings, extracting sentence alignments, and filtering to improve corpus quality. We manually audit the corpus quality using an error taxonomy. The audit shows that the automatic filtering we applied substantially improves the quality of our web-crawled corpus. The corpus is then used as training data to fine-tune a multilingual MT system in both the CA→ZH and ZH→CA directions. Fine-tuning with our corpus improves the BLEU score in both directions on the Flores-101 public benchmark test sets, which demonstrates both the importance of parallel data in MT and the quality of our Catalan-Chinese parallel corpus. [en]
dc.format.mimetype: application/pdf [*]
dc.identifier.uri: http://hdl.handle.net/10230/54140
dc.language.iso: eng [ca]
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0)
dc.rights.accessRights: info:eu-repo/semantics/openAccess [ca]
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/ [ca]
dc.subject.keyword: Parallel corpus [en]
dc.subject.keyword: Data mining [en]
dc.subject.keyword: Corpus quality [en]
dc.subject.keyword: Machine translation [en]
dc.subject.keyword: Catalan [en]
dc.subject.keyword: Chinese [en]
dc.subject.keyword: Low-resource languages [en]
dc.title: Building a Catalan-Chinese parallel corpus from Wikipedia for use in machine translation [en]
dc.type: info:eu-repo/semantics/masterThesis [ca]

Files

Original bundle

Name: Zhou_2022.pdf
Size: 1.03 MB
Format: Adobe Portable Document Format