Introducing QuBERT: a large monolingual corpus and BERT model for Southern Quechua
- dc.contributor.author Zevallos, Rodolfo
- dc.contributor.author Ortega, John E.
- dc.contributor.author Chen, William
- dc.contributor.author Castro, Richard
- dc.contributor.author Bel Rafecas, Núria
- dc.contributor.author Yoshikawa, Cesar
- dc.contributor.author Ventura, Renzo
- dc.contributor.author Aradiel, Hilario
- dc.contributor.author Melgarejo, Nelsi
- dc.date.accessioned 2023-03-14T07:17:17Z
- dc.date.available 2023-03-14T07:17:17Z
- dc.date.issued 2022
- dc.description Paper presented at the 3rd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2022), held on July 14, 2022 in Seattle, United States.
- dc.description.abstract The lack of resources for languages in the Americas has hampered the creation of digital systems such as machine translation, search engines, and chat bots. The scarcity of digital resources has a greater impact when the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of Quechua, an indigenous South American low-resource language spoken by millions. Specifically, our curated corpus is built from text gathered in the southern region of Peru, where a dialect of Quechua is spoken that has not traditionally been targeted by digital systems. To make our work repeatable by others, we also release a public, pre-trained BERT model called QuBERT, the largest language model ever trained for any variety of Quechua, not only the southern dialect. We further evaluate our corpus and its corresponding BERT model on two major tasks, (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging, using state-of-the-art techniques, and achieve results comparable to work on higher-resource languages. In this article, we describe the methodology, challenges, and results of creating QuBERT, which is on par with other state-of-the-art multilingual models for natural language processing, achieving 71–74% F1 on NER and 84–87% on POS tagging.
- dc.description.sponsorship This work was partially funded by Project PID2019-104512GB-I00 of the Spanish Ministerio de Ciencia, Innovación y Universidades and the Agencia Estatal de Investigación.
- dc.format.mimetype application/pdf
- dc.identifier.citation Zevallos R, Ortega JE, Chen W, Castro R, Bel N, Yoshikawa C, Ventura R, Aradiel H, Melgarejo N. Introducing QuBERT: a large monolingual corpus and BERT model for Southern Quechua. In: Cherry C, Fan A, Foster G, Haffari G, Khadivi S, Peng N, Ren X, Shareghi E, Swayamdipta S, editors. The 3rd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2022): proceedings of the DeepLo Workshop; 2022 Jul 14; Seattle, United States. [Stroudsburg]: Association for Computational Linguistics; 2022. 13 p. DOI: 10.18653/v1/2022.deeplo-1.1
- dc.identifier.doi http://dx.doi.org/10.18653/v1/2022.deeplo-1.1
- dc.identifier.uri http://hdl.handle.net/10230/56223
- dc.language.iso eng
- dc.publisher ACL (Association for Computational Linguistics)
- dc.relation.ispartof Cherry C, Fan A, Foster G, Haffari G, Khadivi S, Peng N, Ren X, Shareghi E, Swayamdipta S, editors. The 3rd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2022): proceedings of the DeepLo Workshop; 2022 Jul 14; Seattle, United States. [Stroudsburg]: Association for Computational Linguistics; 2022. 13 p.
- dc.relation.projectID info:eu-repo/grantAgreement/ES/2PE/PID2019-104512GB-I00
- dc.rights © ACL, Creative Commons Attribution 4.0 License
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.rights.uri http://creativecommons.org/licenses/by/4.0/
- dc.subject.other Southern Quechua -- Machine translation
- dc.title Introducing QuBERT: a large monolingual corpus and BERT model for Southern Quechua
- dc.type info:eu-repo/semantics/conferenceObject
- dc.type.version info:eu-repo/semantics/publishedVersion