Report on the revised Corpus Acquisition & Annotation subsystem and its components

Prokopidis, Prokopis; Papavassiliou, Vassilis; Toral, Antonio; Poch, Marc; Frontini, Francesca; Rubino, Francesco; Thurmair, Gregor

dc.contributor.author Prokopidis, Prokopis
dc.contributor.author Papavassiliou, Vassilis
dc.contributor.author Toral, Antonio
dc.contributor.author Poch, Marc
dc.contributor.author Frontini, Francesca
dc.contributor.author Rubino, Francesco
dc.contributor.author Thurmair, Gregor
dc.date.accessioned 2014-05-26T11:32:33Z
dc.date.available 2014-05-26T11:32:33Z
dc.date.created 2011-11-02
dc.date.issued 2014-05-26
dc.description PANACEA WP4 targets the creation of a Corpus Acquisition and Annotation (CAA) subsystem for the acquisition and processing of monolingual and bilingual language resources (LRs). The CAA subsystem consists of tools that have been integrated as web services in the PANACEA platform for LR production. Deliverable D4.2 (Initial functional prototype and documentation, T13) documented the initial functional prototype of this subsystem; this deliverable presents the updates made to the revised subsystem during the second development cycle of the project. The deliverable is structured as follows. Section 2 describes a revised version of the Focused Monolingual Crawler (FMC), implemented according to the results of the first evaluation cycle and the reviewers' comments in the first annual review report. Section 3 details new and revised versions of the tools for corpus normalization (cleaning and deduplication). Section 4 documents the NLP tools introduced into the subsystem for the first time. These tools focus mainly on sentence splitting/tokenization and POS tagging/lemmatization for English (EN), French (FR), Spanish (ES), German (DE), Italian (IT) and Greek (EL).
dc.format.mimetype application/pdf
dc.identifier.citation Prokopidis P, Papavassiliou V, Toral A, Poch M, Frontini F, Rubino F, Thurmair G. Report on the revised Corpus Acquisition & Annotation subsystem and its components [Internet]. Final report 02 Nov 2011. 36 p. (Panacea Project. Work Package Reports, no. D4.4). Available from:
dc.identifier.uri http://hdl.handle.net/10230/22513
dc.language.iso eng
dc.relation.ispartofseries Panacea Project. Work Package Reports;D4.4
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/248064
dc.rights This document is licensed under a Creative Commons Attribution 3.0 Spain License.
dc.rights.accessRights info:eu-repo/semantics/openAccess
dc.rights.uri http://creativecommons.org/licenses/by/3.0/es/
dc.subject.keyword Panacea Project
dc.subject.keyword automatic acquisition of lexicon
dc.subject.keyword natural language processing
dc.title Report on the revised Corpus Acquisition & Annotation subsystem and its components
dc.type info:eu-repo/semantics/workingPaper

Collections

IULA. Documentació del Projecte Panacea