MultiBooked_Corpora [research data]
Citation
- Barnes J. MultiBooked_Corpora [research data]. Repositori Digital de la UPF: Barcelona; 2015. Available at: http://hdl.handle.net/10230/33928
Permanent link
Description
Abstract
We release two corpora of hotel reviews annotated for aspect-level sentiment analysis in Catalan and Basque. We also include scripts that convert the annotations to sentence level, and we provide benchmarks for opinion holder, target, and expression extraction based on conditional random fields.
Description
The corpora are compiled from hotel reviews taken mainly from booking.com. They are in KAF/NAF format [https://github.com/opener-project/kaf/wiki/KAF-structure-overview] [https://github.com/newsreader/NAF], an XML-style stand-off format that allows for multiple layers of annotation. Each review was sentence- and word-tokenized and lemmatized using FreeLing [http://nlp.lsi.upc.edu/freeling/node/1] for Catalan and ixa-pipes [http://ixa2.si.ehu.es/ixa-pipes/] for Basque. Finally, for each language, two annotators annotated opinion holders, opinion targets, and opinion expressions for each review, following the guidelines set out in the OpeNER project [http://www.opener-project.eu/]. Details can be found in the paper.

This package includes the two corpora, together with scripts to:
- obtain corpus statistics (corpus_stats.py)
- reproduce the benchmarks reported in the paper (crf.py)
- extract only the opinionated units from the text (extract_opinions.py)
- map the aspect-level annotations to sentence- or document-level annotated corpora (extract_sentences.py)

Requirements for stats and extraction: Python 3, NumPy
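
To illustrate how stand-off opinion annotations of this kind can be read, the following is a minimal sketch using only the Python standard library. It is not the packaged extract_opinions.py. The element and attribute names (wf, term, span/target, opinion_holder, opinion_target, opinion_expression, polarity) follow the public KAF structure overview, and the inline sample document, including its polarity label, is invented for illustration; the files in the corpora may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written KAF-like document (not taken from the corpora).
SAMPLE_KAF = """<KAF xml:lang="ca">
  <text>
    <wf wid="w1" sent="1">L'</wf>
    <wf wid="w2" sent="1">hotel</wf>
    <wf wid="w3" sent="1">és</wf>
    <wf wid="w4" sent="1">fantàstic</wf>
  </text>
  <terms>
    <term tid="t1" lemma="el"><span><target id="w1"/></span></term>
    <term tid="t2" lemma="hotel"><span><target id="w2"/></span></term>
    <term tid="t3" lemma="ser"><span><target id="w3"/></span></term>
    <term tid="t4" lemma="fantàstic"><span><target id="w4"/></span></term>
  </terms>
  <opinions>
    <opinion oid="o1">
      <opinion_target><span><target id="t2"/></span></opinion_target>
      <opinion_expression polarity="Positive">
        <span><target id="t4"/></span>
      </opinion_expression>
    </opinion>
  </opinions>
</KAF>"""


def span_ids(element):
    """Return the ids referenced by an element's <span>, or [] if absent."""
    if element is None:
        return []
    return [target.get("id") for target in element.findall("./span/target")]


def extract_opinions(root):
    """Resolve each opinion's term references back to surface word forms."""
    # Layer 1: word id -> token text.
    words = {wf.get("wid"): wf.text for wf in root.iter("wf")}
    # Layer 2: term id -> list of word ids it covers.
    term_words = {term.get("tid"): span_ids(term) for term in root.iter("term")}

    def text_of(element):
        # Opinions point at terms, terms point at words (stand-off annotation).
        return " ".join(
            words[wid] for tid in span_ids(element) for wid in term_words[tid]
        )

    opinions = []
    for opinion in root.iter("opinion"):
        expression = opinion.find("opinion_expression")
        opinions.append({
            "holder": text_of(opinion.find("opinion_holder")),
            "target": text_of(opinion.find("opinion_target")),
            "expression": text_of(expression),
            "polarity": expression.get("polarity") if expression is not None else None,
        })
    return opinions


opinions = extract_opinions(ET.fromstring(SAMPLE_KAF))
for opinion in opinions:
    print(opinion)
```

Because the annotation layers reference each other by id rather than wrapping the text, the sketch first builds lookup tables for words and terms and only then resolves each opinion span. An opinion without a holder yields an empty string for that field.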