Zipf's law for word frequencies: word forms versus lemmas in long texts
- dc.contributor.author Corral, Álvaro
- dc.contributor.author Boleda, Gemma
- dc.contributor.author Ferrer-i-Cancho, Ramon
- dc.date.accessioned 2017-08-24T12:11:59Z
- dc.date.available 2017-08-24T12:11:59Z
- dc.date.issued 2015
- dc.description.abstract Zipf’s law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf’s law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf’s law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf’s law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.
- dc.description.sponsorship This work was supported by FIS2009-09508 from the former Spanish Ministerio de Ciencia e Innovación (www.micinn.es, AC); FIS2012-31324 from the Spanish Ministerio de Economía y Competitividad (www.mineco.gob.es, AC); 2009SGR-164 and 2014SGR-1307 from the Agència de Gestió d’Ajuts Universitaris i de Recerca (www.gencat.cat/agaur, AC); Beatriu de Pinós grants 2010 BP-A 00070 and 2010 BP-A2 00015 from the Agència de Gestió d’Ajuts Universitaris i de Recerca of the Generalitat de Catalunya (www.gencat.cat/agaur, GB); BASMATI, TIN2011-27479-C04-03 and OpenMT-2, TIN2009-14675-C03 from the former Spanish Ministerio de Ciencia e Innovación (www.micinn.es, RFiC); and the grant Iniciació i reincorporació a la recerca from the Universitat Politècnica de Catalunya (www.upc.edu, RFiC).
- dc.format.mimetype application/pdf
- dc.identifier.citation Corral A, Boleda G, Ferrer-i-Cancho R. Zipf’s law for word frequencies: word forms versus lemmas in long texts. PLoS ONE. 2015; 10(7):e0129031. DOI: 10.1371/journal.pone.0129031
- dc.identifier.doi http://dx.doi.org/10.1371/journal.pone.0129031
- dc.identifier.issn 1932-6203
- dc.identifier.uri http://hdl.handle.net/10230/32682
- dc.language.iso eng
- dc.publisher Public Library of Science (PLoS)
- dc.relation.ispartof PLoS ONE. 2015; 10(7):e0129031. DOI: 10.1371/journal.pone.0129031
- dc.rights © 2015 Corral et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.rights.uri https://creativecommons.org/licenses/by/4.0/
- dc.subject.keyword Language and physics
- dc.subject.keyword Language
- dc.subject.keyword Statistical physics
- dc.subject.keyword Zipf's law
- dc.title Zipf's law for word frequencies: word forms versus lemmas in long texts
- dc.type info:eu-repo/semantics/article
- dc.type.version info:eu-repo/semantics/publishedVersion