Lexical simplification benchmarks for English, Portuguese, and Spanish

  • dc.contributor.author Štajner, Sanja
  • dc.contributor.author Ferrés, Daniel
  • dc.contributor.author Shardlow, Matthew
  • dc.contributor.author North, Kai
  • dc.contributor.author Zampieri, Marcos
  • dc.contributor.author Saggion, Horacio
  • dc.date.accessioned 2023-03-03T07:48:58Z
  • dc.date.available 2023-03-03T07:48:58Z
  • dc.date.issued 2022
  • dc.description.abstract Even in highly-developed countries, as many as 15–30% of the population can only understand texts written using a basic vocabulary. Their understanding of everyday texts is limited, which prevents them from taking an active role in society and making informed decisions regarding healthcare, legal representation, or democratic choice. Lexical simplification is a natural language processing task that aims to make text understandable to everyone by replacing complex vocabulary and expressions with simpler ones, while preserving the original meaning. It has attracted considerable attention in the last 20 years, and fully automatic lexical simplification systems have been proposed for various languages. The main obstacle for the progress of the field is the absence of high-quality datasets for building and evaluating lexical simplification systems. In this study, we present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese, and provide details about data selection and annotation procedures, to enable compilation of comparable datasets in other languages and domains. As the first multilingual lexical simplification dataset, where instances in all three languages were selected and annotated using comparable procedures, this is the first dataset that offers a direct comparison of lexical simplification systems for three languages. To showcase the usability of the dataset, we adapt two state-of-the-art lexical simplification systems with differing architectures (neural vs. non-neural) to all three languages (English, Spanish, and Brazilian Portuguese) and evaluate their performances on our new dataset. For a fairer comparison, we use several evaluation measures which capture varied aspects of the systems' efficacy, and discuss their strengths and weaknesses. 
We find that a state-of-the-art neural lexical simplification system outperforms a state-of-the-art non-neural lexical simplification system in all three languages, according to all evaluation measures. More importantly, we find that the state-of-the-art neural lexical simplification systems perform significantly better for English than for Spanish and Portuguese, which raises the question of whether such an architecture can be used for successful lexical simplification in other languages, especially low-resource ones.
  • dc.description.sponsorship HS and DF acknowledge support from the project Context-aware Multilingual Text Simplification (ConMuTeS) PID2019-109066GB-I00/AEI/10.13039/501100011033 awarded by Ministerio de Ciencia, Innovación y Universidades (MCIU) and by Agencia Estatal de Investigación (AEI) of Spain.
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Štajner S, Ferrés D, Shardlow M, North K, Zampieri M, Saggion H. Lexical simplification benchmarks for English, Portuguese, and Spanish. Front Artif Intell. 2022;5:991242. DOI: 10.3389/frai.2022.991242
  • dc.identifier.doi http://dx.doi.org/10.3389/frai.2022.991242
  • dc.identifier.issn 2624-8212
  • dc.identifier.uri http://hdl.handle.net/10230/56023
  • dc.language.iso eng
  • dc.publisher Frontiers
  • dc.relation.ispartof Frontiers in Artificial Intelligence. 2022;5:991242.
  • dc.relation.isreferencedby https://www.frontiersin.org/articles/10.3389/frai.2022.991242/full#supplementary-material
  • dc.relation.isreferencedby https://github.com/LaSTUS-TALN-UPF/TSAR-2022-Shared-Task
  • dc.relation.projectID info:eu-repo/grantAgreement/ES/2PE/PID2019-109066GB-I00
  • dc.rights © 2022 Štajner, Ferrés, Shardlow, North, Zampieri and Saggion. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.rights.uri http://creativecommons.org/licenses/by/4.0/
  • dc.subject.keyword natural language processing
  • dc.subject.keyword lexical simplification
  • dc.subject.keyword benchmark datasets
  • dc.subject.keyword evaluation methodologies
  • dc.subject.keyword low-resource tasks
  • dc.subject.keyword artificial intelligence for social good
  • dc.title Lexical simplification benchmarks for English, Portuguese, and Spanish
  • dc.type info:eu-repo/semantics/article
  • dc.type.version info:eu-repo/semantics/publishedVersion