Technologies and tools for corpus creation, normalization and annotation
- dc.contributor.author Prokopidis, Prokopis
- dc.contributor.author Papavassiliou, Vassilis
- dc.contributor.author Pecina, Pavel
- dc.contributor.author Rimell, Laura
- dc.contributor.author Poibeau, Thierry
- dc.contributor.author Bartolini, Roberto
- dc.contributor.author Caselli, Tommaso
- dc.contributor.author Frontini, Francesca
- dc.contributor.author Aleksic, Vera
- dc.contributor.author Thurmair, Gregor
- dc.contributor.author Poch, Marc
- dc.contributor.author Bel Rafecas, Núria
- dc.contributor.author Hamon, Olivier
- dc.date.accessioned 2014-05-26T10:24:07Z
- dc.date.available 2014-05-26T10:24:07Z
- dc.date.created 2010-07-16
- dc.date.issued 2014-05-26
- dc.description The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. The CAA subsystem therefore includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a Cleanup and Normalization Component (CNC) for these data, and iii) a Text Processing Component (TPC) consisting of NLP tools, including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition. Section 2 of the report presents the terminology used in the document. Sections 3, 4 and 5 describe the state of the art and existing tools for corpus acquisition, corpus normalization and text processing, respectively. Section 6 discusses the resources to be produced in the context of WP4. Section 7 presents the solution path we aim to explore for generating these resources.
- dc.format.mimetype application/pdf
- dc.identifier.citation Prokopidis P, Papavassiliou V, Pecina P, Rimell L, Poibeau Th, Bartolini R, Caselli T, Frontini F, Aleksic V, Thurmair G, Poch M, Bel N, Hamon O. Technologies and tools for corpus creation, normalization and annotation [Internet]. Final report 16 Jul 2010. 63 p. (Panacea Project. Work Package Reports, no. D4.1). Available from:
- dc.identifier.uri http://hdl.handle.net/10230/22510
- dc.language.iso eng
- dc.relation.ispartofseries Panacea Project. Work Package Reports;D4.1
- dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/248064
- dc.rights This document is licensed under a Creative Commons Attribution 3.0 Spain License.
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.rights.uri http://creativecommons.org/licenses/by/3.0/es/
- dc.subject.keyword Panacea Project
- dc.subject.keyword automatic acquisition of lexicon
- dc.subject.keyword natural language processing
- dc.title Technologies and tools for corpus creation, normalization and annotation
- dc.type info:eu-repo/semantics/workingPaper