PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles

  • dc.contributor.author Ferrés, Daniel
  • dc.contributor.author Saggion, Horacio
  • dc.contributor.author Ronzano, Francesco
  • dc.contributor.author Bravo Serrano, Àlex, 1984-
  • dc.date.accessioned 2018-04-16T10:42:11Z
  • dc.date.available 2018-04-16T10:42:11Z
  • dc.date.issued 2018
  • dc.description Paper presented at the Language Resources and Evaluation Conference (LREC) 2018, held 7 to 12 May 2018 in Miyazaki, Japan.
  • dc.description.abstract The availability of automated approaches and tools to extract structured textual content from PDF articles is essential to enable scientific text mining. This paper describes and evaluates the PDFdigest tool, a PDF-to-XML textual content extraction system specially designed to extract scientific articles’ headings and logical structure (title, authors, abstract, ...) and its textual content. The extractor deals with both text-based and image-based PDF articles using custom rule-based algorithms implemented with existing state-of-the-art open-source tools for both PDF-to-HTML conversion and image-based PDF Optical Character Recognition.
  • dc.description.sponsorship This work was partly funded by the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE) and the Spanish MINECO Ministry (MDM-2015-0502).
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Ferrés D, Saggion H, Ronzano F, Bravo À. PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles. In: Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, editors. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018); 2018 May 7-12; Miyazaki, Japan. L18-1298.
  • dc.identifier.isbn 979-10-95546-00-9
  • dc.identifier.uri http://hdl.handle.net/10230/34367
  • dc.language.iso eng
  • dc.publisher ACL (Association for Computational Linguistics)
  • dc.relation.ispartof Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T, editors. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018); 2018 May 7-12; Miyazaki, Japan. L18-1298.
  • dc.relation.projectID info:eu-repo/grantAgreement/ES/1PE/TIN2015-65308-C5-5-R
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.rights.uri http://creativecommons.org/licenses/by-nc/4.0/
  • dc.subject.keyword Language resources
  • dc.subject.keyword Scientific text mining
  • dc.subject.keyword Digital libraries
  • dc.subject.keyword Information extraction
  • dc.subject.keyword PDF conversion
  • dc.title PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles
  • dc.type info:eu-repo/semantics/conferenceObject
  • dc.type.version info:eu-repo/semantics/acceptedVersion