PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles

Ferrés, Daniel; Saggion, Horacio; Ronzano, Francesco; Bravo Serrano, Àlex, 1984-

dc.contributor.author Ferrés, Daniel
dc.contributor.author Saggion, Horacio
dc.contributor.author Ronzano, Francesco
dc.contributor.author Bravo Serrano, Àlex, 1984-
dc.date.accessioned 2018-04-16T10:42:11Z
dc.date.available 2018-04-16T10:42:11Z
dc.date.issued 2018
dc.description Paper presented at the Language Resources and Evaluation Conference (LREC) 2018, held 7-12 May 2018 in Miyazaki, Japan.
dc.description.abstract The availability of automated approaches and tools to extract structured textual content from PDF articles is essential to enable scientific text mining. This paper describes and evaluates the PDFdigest tool, a PDF-to-XML textual content extraction system specially designed to extract scientific articles’ headings and logical structure (title, authors, abstract, ...) and their textual content. The extractor deals with both text-based and image-based PDF articles using custom rule-based algorithms implemented with existing state-of-the-art open-source tools for both PDF-to-HTML conversion and image-based PDF Optical Character Recognition.
dc.description.sponsorship This work was partly funded by the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE) and the Spanish MINECO Ministry (MDM-2015-0502).
dc.format.mimetype application/pdf
dc.identifier.citation Ferrés D, Saggion H, Ronzano F, Bravo À. PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles. In: Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, editors. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018); 2018 May 7-12; Miyazaki, Japan. L18-1298.
dc.identifier.isbn 979-10-95546-00-9
dc.identifier.uri http://hdl.handle.net/10230/34367
dc.language.iso eng
dc.publisher ACL (Association for Computational Linguistics)
dc.relation.ispartof Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, editors. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018); 2018 May 7-12; Miyazaki, Japan. L18-1298.
dc.relation.projectID info:eu-repo/grantAgreement/ES/1PE/TIN2015-65308-C5-5-R
dc.rights.accessRights info:eu-repo/semantics/openAccess
dc.rights.uri http://creativecommons.org/licenses/by-nc/4.0/
dc.subject.keyword Language resources
dc.subject.keyword Scientific text mining
dc.subject.keyword Digital libraries
dc.subject.keyword Information extraction
dc.subject.keyword PDF conversion
dc.title PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles
dc.type info:eu-repo/semantics/conferenceObject
dc.type.version info:eu-repo/semantics/acceptedVersion

Collections

Conferences (Department of Information and Communication Technologies)