Building a dataset of emotions with distant supervision

Description

  • Abstract

    In Natural Language Processing (NLP), emotion detection is a challenging text classification problem. Tackling it with supervised machine learning requires annotated datasets, which are scarce because they are costly to produce. Moreover, emotions are subjective, and human annotators often disagree in their assessments. Several methods have recently been proposed to reduce annotation costs, including distant supervision. This thesis presents a strategy for annotating emotions in literary works in Brazilian Portuguese. Using a combination of regular expressions for automatic dialogue extraction, spaCy, and a lexicon containing 26 emotions, we classify each piece of dialogue according to the words the narrator uses to introduce and describe it. The results are mixed, given the large set of emotion labels, many of which are underrepresented in the collected data. However, the strategy can still benefit the annotation of literary corpora for more common emotions such as Happiness and Dissatisfaction.
  • Description

    Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Núria Bel
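The strategy described in the abstract (extract dialogue with regular expressions, then label it from the narrator's words using an emotion lexicon) might be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the regular expression, the `TOY_LEXICON` entries, and the `label_dialogue` helper are all hypothetical, and plain lowercase word matching stands in for the spaCy processing the thesis relies on.

```python
import re

# Hypothetical mini-lexicon: lowercase word form -> emotion label.
# The thesis lexicon covers 26 emotions; these entries are illustrative only.
TOY_LEXICON = {
    "sorriu": "Happiness",
    "alegre": "Happiness",
    "resmungou": "Dissatisfaction",
    "irritado": "Dissatisfaction",
}

# Assumed convention: a dialogue line opens with a dash, and a second dash
# (surrounded by whitespace) introduces the narrator's comment.
DASH = "[\u2014\u2013-]"
DIALOGUE_RE = re.compile(
    rf"^\s*{DASH}\s*(?P<speech>.+?)\s+{DASH}\s+(?P<narration>.+)$"
)


def label_dialogue(line):
    """Return (speech, emotions) for a dialogue line with narration, else None."""
    match = DIALOGUE_RE.match(line)
    if match is None:
        return None
    # Look up each word of the narrator's comment in the lexicon.
    words = re.findall(r"\w+", match.group("narration").lower())
    emotions = sorted({TOY_LEXICON[w] for w in words if w in TOY_LEXICON})
    return match.group("speech"), emotions


sample = "\u2014 Que dia lindo! \u2014 sorriu Ana."
print(label_dialogue(sample))  # ('Que dia lindo!', ['Happiness'])
```

Matching surface forms misses inflected variants, so a fuller pipeline would presumably look up lemmas instead. Lines without a narrator comment return `None` here, since this sketch gives them no label source.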