In this work, we present a comprehensive comparison of methodologies for speech emotion recognition (SER), with a focus on evaluating the effectiveness of large language models (LLMs) in this domain. Our study is structured into three parts. First, we extract audio embeddings using models such as WavLM, HuBERT, and Dasheng, and apply classical machine learning classifiers, a Support Vector Machine (SVM) and a Multilayer Perceptron (MLP), for emotion prediction. This approach serves as a baseline for comparison. Second, we investigate the capacity of LLMs such as GPT-4o, Qwen2-Audio, and Amazon Nova Sonic to analyze audio features, including speaker attributes such as gender, thereby extending their application beyond traditional natural language processing. Third, we explore a more integrated approach that feeds raw audio directly into an audio-capable LLM, such as Qwen2-Audio-7B-Instruct, for end-to-end emotion classification, without the need for traditional signal-processing-based feature extraction. We evaluate and compare the performance of these methodologies using metrics such as accuracy, precision, recall, and F1-score, with a primary focus on the results obtained from LLM-based models. Our results reveal several key insights: (1) data distribution significantly affects classifier performance; (2) different audio embeddings yield different results even with the same classifier and dataset; and (3) despite their capabilities, current LLMs still underperform compared to classical classifiers such as SVM and MLP in emotion prediction tasks.
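To make the baseline pipeline concrete, the sketch below mean-pools WavLM hidden states into utterance-level embeddings and fits an SVM on them, then reports the metrics mentioned above. The checkpoint name, the pooling strategy, the SVM hyperparameters, and the `train_paths`/`y_train`-style variables are illustrative assumptions, not the exact setup of this thesis.

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, WavLMModel
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Checkpoint is an assumption; the thesis may use a different WavLM variant.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

def embed(path: str) -> torch.Tensor:
    """Mean-pool WavLM's last hidden states into one utterance embedding."""
    waveform, sr = torchaudio.load(path)
    # Downmix to mono and resample to WavLM's expected 16 kHz input rate.
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)

# Hypothetical file lists and emotion labels; the abstract does not
# specify the dataset or its train/test split.
X_train = torch.stack([embed(p) for p in train_paths]).numpy()
X_test = torch.stack([embed(p) for p in test_paths]).numpy()

clf = SVC(kernel="rbf")  # SVM baseline; kernel and hyperparameters assumed
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

# Per-class precision, recall, and F1-score, plus overall accuracy.
print(classification_report(y_test, preds))
```

The same embedding function can be swapped for a HuBERT or Dasheng encoder, or the SVM for an MLP, to reproduce the other baseline combinations compared in the study.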