Exploring the integration of large language models for automatic emotion labeling in speech
- dc.contributor.author Yun Chien, Yi
- dc.date.accessioned 2025-10-20T16:01:56Z
- dc.date.available 2025-10-20T16:01:56Z
- dc.date.issued 2025
- dc.description Master's thesis for the Master in Intelligent Interactive Systems
- dc.description Supervisor: Prof. María Inés Torres Barañano
- dc.description.abstract In this work, we present a comprehensive comparison of methodologies for speech emotion recognition (SER), with a focus on evaluating the effectiveness of large language models (LLMs) in this domain. Our study is structured into three parts. First, we extract audio embeddings using models such as WavLM, HuBERT, and Dasheng, and use classical machine learning classifiers, a Support Vector Machine (SVM) and a Multilayer Perceptron (MLP), for emotion prediction. This approach serves as a baseline for comparison. Second, we investigate the capacity of LLMs such as GPT-4o, Qwen2-Audio, and Amazon Nova Sonic to analyze audio features, including speaker attributes such as gender, thereby extending their application beyond traditional natural language processing. Third, we explore a more integrated approach that feeds raw audio directly into an audio-capable LLM, such as Qwen2-Audio-7B-Instruct, for end-to-end emotion classification, without the need for traditional signal-processing-based feature extraction. We evaluate and compare the performance of these methodologies using metrics such as accuracy, precision, recall, and F1-score, with a primary focus on the results obtained from LLM-based models. Our results reveal several key insights: (1) data distribution significantly affects classifier performance; (2) different audio embeddings yield different results even with the same classifier and dataset; and (3) despite their capabilities, current LLMs still underperform compared to classical classifiers such as SVM and MLP in emotion prediction tasks.
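The baseline described in the first part of the abstract (pretrained audio embeddings fed to a classical classifier) can be sketched as follows. This is a minimal illustration only: the random vectors stand in for utterance-level embeddings (e.g. mean-pooled WavLM or HuBERT frame features), and the four-class label set, embedding dimension, and separability shift are all hypothetical choices, not the thesis's actual data or configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-ins for utterance-level audio embeddings
# (e.g. mean-pooled WavLM/HuBERT features, ~768-dim) with
# four hypothetical emotion classes.
rng = np.random.default_rng(0)
n_samples, dim, n_classes = 200, 768, 4
X = rng.normal(size=(n_samples, dim))
y = rng.integers(0, n_classes, size=n_samples)
# Shift each class's embeddings so the toy data is separable.
X += y[:, None] * 0.5

# Train an SVM on the embeddings and score with macro F1,
# one of the metrics mentioned in the abstract.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
score = f1_score(y_te, pred, average="macro")
print(f"macro F1: {score:.3f}")
```

In the real pipeline, `X` would come from a frozen speech encoder rather than a random generator; the classifier code itself is unchanged.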
- dc.identifier.uri http://hdl.handle.net/10230/71584
- dc.language.iso eng
- dc.rights CC Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
- dc.subject.other Emotions
- dc.title Exploring the integration of large language models for automatic emotion labeling in speech
- dc.type info:eu-repo/semantics/masterThesis
