dc.contributor.author |
Barnes, Jeremy |
dc.date.accessioned |
2018-02-19T09:52:11Z |
dc.date.available |
2018-02-19T09:52:11Z |
dc.date.issued |
2015-01 |
dc.identifier.citation |
Barnes J. MultiBooked_Corpora [research data]. Repositori Digital de la UPF: Barcelona; 2015. Available at: http://hdl.handle.net/10230/33928 |
dc.identifier.uri |
http://hdl.handle.net/10230/33928 |
dc.description |
The corpora are compiled from hotel reviews taken mainly from booking.com. They are in KAF/NAF format [https://github.com/opener-project/kaf/wiki/KAF-structure-overview] [https://github.com/newsreader/NAF], an XML-style stand-off format that allows multiple layers of annotation. Each review was sentence- and word-tokenized and lemmatized using FreeLing [http://nlp.lsi.upc.edu/freeling/node/1] for Catalan and ixa-pipes [http://ixa2.si.ehu.es/ixa-pipes/] for Basque. Finally, for each language, two annotators annotated opinion holders, opinion targets, and opinion expressions in each review, following the guidelines set out in the OpeNER project [http://www.opener-project.eu/]. Details can be found in the paper.
This package includes the two corpora, along with scripts to obtain corpus statistics (corpus_stats.py), reproduce the benchmarks reported in the paper (crf.py), extract only the opinionated units from the text (extract_opinions.py), and map the aspect-level annotations to sentence- or document-level annotated corpora (extract_sentences.py).
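As a rough illustration of the stand-off layout described above, the sketch below reads opinion annotations from a KAF-style document using only the Python standard library. It is not the packaged extract_opinions.py. The sample document, the function names (span_ids, extract_opinions), and the polarity label are hypothetical, and the element and attribute names (wf, term, opinion, opinion_holder, opinion_target, opinion_expression, polarity) are assumed from the public KAF overview rather than checked against the corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature KAF document (invented content, assumed element names).
SAMPLE_KAF = """<KAF>
  <text>
    <wf wid="w1" sent="1">Hotel</wf>
    <wf wid="w2" sent="1">molt</wf>
    <wf wid="w3" sent="1">bonic</wf>
  </text>
  <terms>
    <term tid="t1" lemma="hotel"><span><target id="w1"/></span></term>
    <term tid="t2" lemma="molt"><span><target id="w2"/></span></term>
    <term tid="t3" lemma="bonic"><span><target id="w3"/></span></term>
  </terms>
  <opinions>
    <opinion oid="o1">
      <opinion_target><span><target id="t1"/></span></opinion_target>
      <opinion_expression polarity="Positive">
        <span><target id="t2"/><target id="t3"/></span>
      </opinion_expression>
    </opinion>
  </opinions>
</KAF>"""


def span_ids(element):
    """Return the ids referenced by the <target> children of an element's <span>."""
    if element is None:
        return []
    return [target.get("id") for target in element.findall("./span/target")]


def extract_opinions(kaf_xml):
    """Resolve each opinion's term spans back to surface words."""
    root = ET.fromstring(kaf_xml)
    # Word layer: word id -> token text.
    words = {wf.get("wid"): wf.text for wf in root.iter("wf")}
    # Term layer: term id -> list of word ids it covers.
    term_words = {term.get("tid"): span_ids(term) for term in root.iter("term")}

    def text_of(element):
        # Follow opinion span -> term ids -> word ids -> tokens.
        return " ".join(
            words[wid] for tid in span_ids(element) for wid in term_words[tid]
        )

    opinions = []
    for opinion in root.iter("opinion"):
        expression = opinion.find("opinion_expression")
        opinions.append({
            "holder": text_of(opinion.find("opinion_holder")),
            "target": text_of(opinion.find("opinion_target")),
            "expression": text_of(expression),
            "polarity": expression.get("polarity") if expression is not None else None,
        })
    return opinions


if __name__ == "__main__":
    for opinion in extract_opinions(SAMPLE_KAF):
        print(opinion)
```

Because the annotation layers only point at ids in lower layers, the original token sequence is never modified. This is what lets several annotation layers coexist in one file.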
Requirements for stats and extraction: Python 3, NumPy |
dc.description.abstract |
We release two corpora of hotel reviews annotated for aspect-level sentiment analysis in Catalan and Basque. We also include scripts which allow the conversion to sentence-level annotations and provide benchmarks for opinion holder, target, and expression extraction based on conditional random fields. |
dc.language.iso |
cat |
dc.language.iso |
eus |
dc.relation |
Related publication: Barnes J, Lambert P, Badia T. Multibooked: a corpus of Basque and Catalan hotel reviews annotated for aspect-level sentiment classification. Paper presented at: Language Resources and Evaluation Conference (LREC); 2018 May 7-12; Miyazaki, Japan. |
dc.rights |
Licensed under the terms of the Creative Commons CC-BY public license. |
dc.rights.uri |
http://creativecommons.org/licenses/by/3.0/es/ |
dc.title |
MultiBooked_Corpora [research data] |
dc.type |
info:eu-repo/semantics/other |
dc.type |
Dataset |
dc.subject.keyword |
Cross-lingual sentiment analysis |
dc.subject.keyword |
Sentiment analysis |
dc.subject.keyword |
Under-resourced languages |
dc.subject.keyword |
Catalan |
dc.subject.keyword |
Basque |
dc.subject.keyword |
Análisis de sentimiento |
dc.subject.keyword |
Catalán |
dc.subject.keyword |
Euskera |
dc.subject.keyword |
Anàlisi de sentiment |
dc.subject.keyword |
Català |
dc.subject.keyword |
Iritzien |
dc.subject.keyword |
Analisia |
dc.subject.keyword |
Euskara |
dc.rights.accessRights |
info:eu-repo/semantics/openAccess |