MultiBooked_Corpora [research data]


  • dc.contributor.author Barnes, Jeremy
  • dc.date.accessioned 2018-02-19T09:52:11Z
  • dc.date.available 2018-02-19T09:52:11Z
  • dc.date.issued 2015-01
  • dc.description The corpora are compiled from hotel reviews taken mainly from booking.com. The corpora are in KAF/NAF format [https://github.com/opener-project/kaf/wiki/KAF-structure-overview] [https://github.com/newsreader/NAF], an XML-style stand-off format that allows for multiple layers of annotation. Each review was sentence- and word-tokenized and lemmatized using Freeling [http://nlp.lsi.upc.edu/freeling/node/1] for Catalan and ixa-pipes [http://ixa2.si.ehu.es/ixa-pipes/] for Basque. Finally, for each language, two annotators annotated opinion holders, opinion targets, and opinion expressions in each review, following the guidelines set out in the OpeNER project [http://www.opener-project.eu/]. Details can be found in the paper. This package includes the two corpora, along with scripts to obtain corpus statistics (corpus_stats.py), reproduce the benchmarks reported in the paper (crf.py), extract only the opinionated units from the text (extract_opinions.py), or map the aspect-level annotations to sentence- or document-level annotated corpora (extract_sentences.py). Requirements for stats and extraction: Python 3, NumPy
  • dc.description.abstract We release two corpora of hotel reviews annotated for aspect-level sentiment analysis in Catalan and Basque. We also include scripts which allow the conversion to sentence-level annotations and provide benchmarks for opinion holder, target, and expression extraction based on conditional random fields.
  • dc.identifier.citation Barnes J. MultiBooked_Corpora [research data]. Repositori Digital de la UPF: Barcelona; 2015. Available at: http://hdl.handle.net/10230/33928
  • dc.identifier.doi https://doi.org/10.34810/data398
  • dc.identifier.uri http://hdl.handle.net/10230/33928
  • dc.language.iso cat
  • dc.language.iso eus
  • dc.relation Related publication: Barnes J, Lambert P, Badia T. Multibooked: a corpus of Basque and Catalan hotel reviews annotated for aspect-level sentiment classification. Paper presented at: Language Resources and Evaluation Conference (LREC); 2018 May 7-12; Miyazaki, Japan.
  • dc.rights Licensed under the terms of the Creative Commons CC-BY public license.
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.rights.uri http://creativecommons.org/licenses/by/3.0/es/
  • dc.subject.keyword Cross-lingual sentiment analysis
  • dc.subject.keyword Sentiment analysis
  • dc.subject.keyword Under-resourced languages
  • dc.subject.keyword Catalan
  • dc.subject.keyword Basque
  • dc.subject.keyword Análisis de sentimiento
  • dc.subject.keyword Catalán
  • dc.subject.keyword Euskera
  • dc.subject.keyword Análisi de sentiment
  • dc.subject.keyword Català
  • dc.subject.keyword Iritzien
  • dc.subject.keyword Analisia
  • dc.subject.keyword Euskara
  • dc.title MultiBooked_Corpora [research data]
  • dc.type info:eu-repo/semantics/other
  • dc.type Dataset
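
The description above says the reviews are stored in KAF/NAF, a stand-off XML format in which opinion holders, targets, and expressions point back to the tokenized text. The following is a minimal sketch of how such a file might be read with the Python 3 standard library. It is not one of the scripts shipped with the package. The sample document, the function names (term_text, span_text, read_opinions), and the element and attribute names (wf, term, opinion, opinion_holder, opinion_target, opinion_expression, polarity) are assumptions based on the KAF structure overview linked in the description, and may need adjusting for the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature KAF document, invented for illustration only.
# Element and attribute names are assumed from the KAF structure overview,
# not verified against the files in this package.
SAMPLE_KAF = """<KAF>
  <text>
    <wf wid="w1" sent="1">Hotel</wf>
    <wf wid="w2" sent="1">molt</wf>
    <wf wid="w3" sent="1">bonic</wf>
  </text>
  <terms>
    <term tid="t1" lemma="hotel"><span><target id="w1"/></span></term>
    <term tid="t2" lemma="molt"><span><target id="w2"/></span></term>
    <term tid="t3" lemma="bonic"><span><target id="w3"/></span></term>
  </terms>
  <opinions>
    <opinion oid="o1">
      <opinion_target><span><target id="t1"/></span></opinion_target>
      <opinion_expression polarity="Positive">
        <span><target id="t2"/><target id="t3"/></span>
      </opinion_expression>
    </opinion>
  </opinions>
</KAF>"""


def term_text(root):
    """Map each term id to the surface text of the word forms it spans."""
    words = {wf.get("wid"): wf.text for wf in root.iter("wf")}
    return {
        term.get("tid"): " ".join(words[t.get("id")] for t in term.iter("target"))
        for term in root.iter("term")
    }


def span_text(element, terms):
    """Resolve the term ids referenced under an element into plain text."""
    if element is None:  # the layer is optional, e.g. no explicit holder
        return ""
    return " ".join(terms[t.get("id")] for t in element.iter("target"))


def read_opinions(root):
    """Collect holder, target, expression, and polarity for every opinion."""
    terms = term_text(root)
    opinions = []
    for opinion in root.iter("opinion"):
        expression = opinion.find("opinion_expression")
        opinions.append({
            "holder": span_text(opinion.find("opinion_holder"), terms),
            "target": span_text(opinion.find("opinion_target"), terms),
            "expression": span_text(expression, terms),
            "polarity": expression.get("polarity") if expression is not None else None,
        })
    return opinions


opinions = read_opinions(ET.fromstring(SAMPLE_KAF))
for opinion in opinions:
    print(opinion)
```

To try the sketch on a real file, replace ET.fromstring(SAMPLE_KAF) with ET.parse(path).getroot(), where path is a placeholder for one of the corpus files. Because the annotations are stand-off, every opinion span is resolved in two steps: opinion to term ids, then term ids to word forms.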