Welcome to the UPF Digital Repository

Technologies and tools for corpus creation, normalization and annotation

dc.contributor.author Prokopidis, Prokopis
dc.contributor.author Papavassiliou, Vassilis
dc.contributor.author Pecina, Pavel
dc.contributor.author Rimell, Laura
dc.contributor.author Poibeau, Thierry
dc.contributor.author Bartolini, Roberto
dc.contributor.author Caselli, Tommaso
dc.contributor.author Frontini, Francesca
dc.contributor.author Aleksic, Vera
dc.contributor.author Thurmair, Gregor
dc.contributor.author Poch, Marc
dc.contributor.author Bel Rafecas, Núria
dc.contributor.author Hamon, Olivier
dc.date.accessioned 2014-05-26T10:24:07Z
dc.date.available 2014-05-26T10:24:07Z
dc.date.created 2010-07-16
dc.date.issued 2014-05-26
dc.identifier.citation Prokopidis P, Papavassiliou V, Pecina P, Rimell L, Poibeau Th, Bartolini R, Caselli T, Frontini F, Aleksic V, Thurmair G, Poch M, Bel N, Hamon O. Technologies and tools for corpus creation, normalization and annotation [Internet]. Final report 16 Jul 2010. 63 p. (Panacea Project. Work Package Reports, no. D4.1). Available from:
dc.identifier.uri http://hdl.handle.net/10230/22510
dc.description The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. The CAA subsystem therefore includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC) consisting of NLP tools, including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition. Section 2 of this report presents the terminology used. Sections 3, 4 and 5 review the state of the art and existing tools for corpus acquisition, corpus normalization and text processing, respectively. Section 6 discusses the resources to be produced in the context of WP4, and Section 7 presents the solution path we aim to explore for generating these resources.
dc.format.mimetype application/pdf
dc.language.iso eng
dc.relation.ispartofseries Panacea Project. Work Package Reports;D4.1
dc.rights This document is licensed under a Creative Commons Attribution 3.0 Spain License.
dc.rights.uri http://creativecommons.org/licenses/by/3.0/es/
dc.title Technologies and tools for corpus creation, normalization and annotation
dc.type info:eu-repo/semantics/workingPaper
dc.subject.keyword Panacea Project
dc.subject.keyword automatic acquisition of lexicon
dc.subject.keyword natural language processing
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/248064
dc.rights.accessRights info:eu-repo/semantics/openAccess