Welcome to the UPF Digital Repository

Final Report on the Corpus Acquisition & Annotation subsystem and its components

dc.contributor.author Prokopidis, Prokopis
dc.contributor.author Papavassiliou, Vassilis
dc.contributor.author Toral, Antonio
dc.contributor.author Poch, Marc
dc.contributor.author Frontini, Francesca
dc.contributor.author Rubino, Francesco
dc.contributor.author Thurmair, Gregor
dc.date.accessioned 2014-05-26T11:44:31Z
dc.date.available 2014-05-26T11:44:31Z
dc.date.created 2012-10-31
dc.date.issued 2014-05-26
dc.identifier.citation Prokopidis P, Papavassiliou V, Toral A, Poch M, Frontini F, Rubino F, Thurmair G. Final Report on the Corpus Acquisition & Annotation subsystem and its components [Internet]. Final report 31 Oct 2012. 52 p. (Panacea Project. Work Package Reports, no. D4.5). Available from:
dc.identifier.uri http://hdl.handle.net/10230/22514
dc.description PANACEA WP4 targets the creation of a Corpus Acquisition and Annotation (CAA) subsystem for the acquisition and processing of monolingual and bilingual language resources (LRs). The CAA subsystem consists of tools that have been integrated as web services in the PANACEA platform of LR production. D4.2 Initial functional prototype and documentation in T13 and D4.4 Report on the revised Corpus Acquisition & Annotation subsystem and its components in T23 provided initial and updated documentation on this subsystem, while this deliverable presents the final documentation of the subsystem as it evolved after the third development cycle of the project. The deliverable is structured as follows. The Corpus Acquisition Component (i.e. the Focused Monolingual and Bilingual Crawlers (FMC/FBC)) is described in section 2. The final list of tools for corpus normalization (cleaning and de-duplication) is detailed in section 3. Section 4 provides documentation on all NLP tools included in the subsystem. Due to its nature, this deliverable aggregates considerable parts of all previous WP4 deliverables. The main new additions include a) new functionalities for, among others, crawling strategy, de-duplication, and detection of parallel document pairs; and b) new NLP tools for syntactic analysis, named entity recognition, tweet processing and anonymization.
dc.format.mimetype application/pdf
dc.language.iso eng
dc.relation.ispartofseries Panacea Project. Work Package Reports;D4.5
dc.rights This document is licensed under a Creative Commons Attribution 3.0 Spain License.
dc.rights.uri http://creativecommons.org/licenses/by/3.0/es/
dc.title Final Report on the Corpus Acquisition & Annotation subsystem and its components
dc.type info:eu-repo/semantics/workingPaper
dc.subject.keyword Panacea Project
dc.subject.keyword automatic acquisition of lexicon
dc.subject.keyword natural language processing
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/248064
dc.rights.accessRights info:eu-repo/semantics/openAccess

