Welcome to the UPF Digital Repository

A semi-automatic indexing system based on embedded information in HTML documents

dc.contributor.author Vàllez Letrado, Mari
dc.contributor.author Pedraza, Rafael
dc.contributor.author Codina, Lluís
dc.contributor.author Blanco, Saúl
dc.contributor.author Rovira, Cristòfol
dc.date.accessioned 2017-04-25T07:52:30Z
dc.date.available 2017-04-25T07:52:30Z
dc.date.issued 2015
dc.identifier.citation Vallez M, Pedraza-Jiménez R, Blanco S, Codina Ll, Rovira C. A semi-automatic indexing system based on embedded information in HTML documents. Library Hi Tech. 2015;33(2):195-210. DOI: 10.1108/LHT-12-2014-0114
dc.identifier.issn 0737-8831
dc.identifier.uri http://hdl.handle.net/10230/30892
dc.description.abstract Purpose: The purpose of this paper is to describe and evaluate the tool DigiDoc MetaEdit, which allows the semi-automatic indexing of HTML documents. The tool works by identifying and suggesting keywords from a thesaurus according to the embedded information in HTML documents. This enables the parameterization of keyword assignment based on how frequently the terms appear in the document, the relevance of their position, and the combination of both. Design/methodology/approach: In order to evaluate the efficiency of the indexing tool, the descriptors/keywords suggested by the tool are compared to the keywords that have been assigned manually by human experts. To make this comparison, a corpus of HTML documents is randomly selected from a journal devoted to Library and Information Science. Findings: The results of the evaluation show that: first, there is close to a 50 per cent match or overlap between the two indexing systems, although if related terms and narrow terms are taken into consideration the matches can reach 73 per cent; and second, the first terms identified by the tool are the most relevant. Originality/value: The tool presented identifies the most important keywords in an HTML document based on the information embedded in it. Nowadays, representing the contents of documents with keywords is an essential practice in areas such as information retrieval and e-commerce.
dc.description.sponsorship This article is part of the projects "Audiencias activas y periodismo" (Active audiences and journalism), CSO2012-39518-C04-02, and "Comunicación online de los destinos turísticos" (Online communication of tourist destinations), CSO2011-22691. Plan Nacional de I+D+i, Ministerio de Economía y Competitividad (Spain).
dc.format.mimetype application/pdf
dc.language.iso eng
dc.publisher Emerald
dc.relation.ispartof Library Hi Tech. 2015;33(2):195-210.
dc.rights © Emerald Group Publishing Limited
dc.title A semi-automatic indexing system based on embedded information in HTML documents
dc.type info:eu-repo/semantics/article
dc.identifier.doi http://dx.doi.org/10.1108/LHT-12-2014-0114
dc.subject.keyword Digital documents
dc.subject.keyword Information retrieval
dc.subject.keyword Indexing
dc.subject.keyword Search engines
dc.subject.keyword Hypertext markup language
dc.relation.projectID info:eu-repo/grantAgreement/ES/3PN/CSO2012-39518-C04-02
dc.relation.projectID info:eu-repo/grantAgreement/ES/3PN/CSO2011-22691
dc.rights.accessRights info:eu-repo/semantics/openAccess
dc.type.version info:eu-repo/semantics/acceptedVersion
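
Illustrative sketch (not part of the record): the abstract describes keyword assignment parameterized by term frequency, by the relevance of a term's position, and by the combination of both. The Python below is a minimal sketch of that idea under stated assumptions; it is not DigiDoc MetaEdit's actual implementation. The zone weights, the function name suggest_keywords, the mode names, and the naive substring matching are all hypothetical choices made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical position weights; the values used by the real tool are not stated here.
POSITION_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "body": 1.0}


class ZoneTextParser(HTMLParser):
    """Collects lowercased text per zone: title, h1, h2, or everything else ("body")."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.zones = {zone: [] for zone in POSITION_WEIGHTS}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this also discards unclosed void tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the innermost enclosing zone tag, falling back to "body".
        zone = next((t for t in reversed(self.stack) if t in self.zones), "body")
        self.zones[zone].append(data.lower())


def suggest_keywords(html, thesaurus_terms, mode="combined", top_n=5):
    """Rank thesaurus terms found in the document (hypothetical scoring).

    mode="frequency": total number of occurrences.
    mode="position":  weight of the best zone the term appears in.
    mode="combined":  occurrences multiplied by the zone weight, summed.
    """
    parser = ZoneTextParser()
    parser.feed(html)
    scores = Counter()
    for zone, chunks in parser.zones.items():
        text = " ".join(chunks)
        for term in thesaurus_terms:
            count = text.count(term.lower())  # naive substring match
            if count == 0:
                continue
            if mode == "frequency":
                scores[term] += count
            elif mode == "position":
                scores[term] = max(scores[term], POSITION_WEIGHTS[zone])
            else:
                scores[term] += count * POSITION_WEIGHTS[zone]
    return [term for term, _ in scores.most_common(top_n)]


sample_html = (
    "<html><head><title>Indexing tools</title></head><body>"
    "<h1>Search engines</h1>"
    "<p>Information retrieval and indexing. Information retrieval again. "
    "information retrieval</p></body></html>"
)
sample_terms = ["Indexing", "Search engines", "Information retrieval"]
print(suggest_keywords(sample_html, sample_terms, mode="combined"))
```

The three modes can rank the same document differently: a term that appears once in the title may outrank a term repeated several times in the running text once position is taken into account.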
