Mining and exploiting domain-specific corpora in the PANACEA platform
- dc.contributor.author Bel Rafecas, Núria
- dc.contributor.author Papavassiliou, Vassilis
- dc.contributor.author Prokopidis, Prokopis
- dc.contributor.author Toral, Antonio
- dc.contributor.author Arranz, Victoria
- dc.date.accessioned 2013-02-26T10:07:56Z
- dc.date.available 2013-02-26T10:07:56Z
- dc.date.issued 2012
- dc.description.abstract The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stages involved in the acquisition, production, updating and maintenance of the large language resources required by, among others, MT systems. The development of a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web is one of the most innovative building blocks of PANACEA. The CAC, which is the first stage in the PANACEA pipeline for building Language Resources, adopts an efficient and distributed methodology to crawl for web documents with rich textual content in specific languages and predefined domains. The CAC includes modules that can acquire parallel data from sites with in-domain content available in more than one language. In order to extrinsically evaluate the CAC methodology, we have conducted several experiments that used crawled parallel corpora for the identification and extraction of parallel sentences using sentence alignment. The corpora were then successfully used for domain adaptation of Machine Translation Systems.
- dc.format.mimetype application/pdf
- dc.identifier.citation Bel N, Papavasiliou V, Prokopidis P, Toral A, Arranz V. Mining and exploiting domain-specific corpora in the PANACEA platform. In: Rapp R, Tadić M, Sharoff S (et al.), editors. Proceedings of the 5th Workshop on Building and Using Comparable Corpora at the Eighth International Conference on Language Resources and Evaluation (LREC-2012); 2012 May 23-25; Istanbul, Turkey. Paris: European Language Resources Association; 2012. p. 24-26.
- dc.identifier.uri http://hdl.handle.net/10230/20416
- dc.language.iso eng
- dc.publisher ELRA (European Language Resources Association)
- dc.relation.ispartof Rapp R, Tadić M, Sharoff S (et al.), editors. Proceedings of the 5th Workshop on Building and Using Comparable Corpora at the Eighth International Conference on Language Resources and Evaluation (LREC-2012); 2012 May 23-25; Istanbul, Turkey. Paris: European Language Resources Association; 2012. p. 24-26.
- dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/248064
- dc.rights © 2012 ELRA - European Language Resources Association. All rights reserved.
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.subject.keyword Web crawling
- dc.subject.keyword Boilerplate removal
- dc.subject.keyword Corpus acquisition
- dc.subject.keyword IPR for language resources
- dc.title Mining and exploiting domain-specific corpora in the PANACEA platform
- dc.type info:eu-repo/semantics/conferenceObject
- dc.type.version info:eu-repo/semantics/publishedVersion