Audio-visual gated-sequenced neural networks for affect recognition
- dc.contributor.author Aspandi, Decky
- dc.contributor.author Sukno, Federico Mateo
- dc.contributor.author Schuller, Björn
- dc.contributor.author Binefa i Valls, Xavier
- dc.date.accessioned 2024-06-05T11:06:32Z
- dc.date.available 2024-06-05T11:06:32Z
- dc.date.issued 2022
- dc.description.abstract Interest in automatic emotion recognition and the broader field of affective computing has recently gained momentum. The emergence of large, video-based affect datasets offering rich multi-modal inputs has facilitated the development of deep learning-based models for automatic affect analysis, which currently hold the state of the art. However, recent approaches cannot fully exploit these modalities because they rely on oversimplified fusion schemes. Furthermore, the efficient use of the temporal information inherent to these large datasets remains largely unexplored, hindering further progress. In this work, we propose a multi-modal, sequence-based neural network with gating mechanisms for valence- and arousal-based affect recognition. Our model consists of three major networks: firstly, a latent-feature generator that extracts compact representations of both modalities from inputs that have been artificially degraded to add robustness; secondly, a multi-task discriminator that estimates both the input identity and a first-step emotion quadrant; and thirdly, a sequence-based predictor with attention and gating mechanisms that effectively merges both modalities and exploits this information through sequence modelling. In our experiments on the SEMAINE and SEWA affect datasets, we observe the impact of both proposed mechanisms through progressive increases in accuracy. Our ablation studies further show how the internal attention weights and gating coefficients impact the quality of our models’ estimates. Finally, we demonstrate state-of-the-art accuracy through comparisons with current alternatives on both datasets.
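The abstract describes a fusion scheme that merges per-frame audio and visual features through learned gating coefficients before sequence modelling. As a rough illustration of that idea only, here is a minimal PyTorch sketch of gated audio-visual fusion followed by a recurrent predictor over valence and arousal; the layer sizes, the GRU (standing in for the paper's sequence-based predictor), and the class name are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GatedAudioVisualFusion(nn.Module):
    """Illustrative gated fusion of frame-level audio and visual features.

    All dimensions are hypothetical; this sketches the general idea of a
    gating mechanism, not the exact model described in the paper.
    """

    def __init__(self, audio_dim=128, visual_dim=256, hidden_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Gating coefficients in (0, 1): how much each modality contributes
        # to each fused dimension, per frame.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Sigmoid(),
        )
        # A GRU stands in for the paper's sequence-based predictor.
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Two continuous outputs per frame: valence and arousal.
        self.head = nn.Linear(hidden_dim, 2)

    def forward(self, audio, visual):
        # audio: (batch, time, audio_dim); visual: (batch, time, visual_dim)
        a = torch.tanh(self.audio_proj(audio))
        v = torch.tanh(self.visual_proj(visual))
        g = self.gate(torch.cat([a, v], dim=-1))  # (batch, time, hidden_dim)
        fused = g * a + (1.0 - g) * v             # convex per-dimension mix
        out, _ = self.rnn(fused)
        return self.head(out)                     # (batch, time, 2)

# Example: 4 clips of 100 frames each.
model = GatedAudioVisualFusion()
preds = model(torch.randn(4, 100, 128), torch.randn(4, 100, 256))
print(preds.shape)  # torch.Size([4, 100, 2])
```

Because the sigmoid gate forms a per-dimension convex combination of the two modalities, such a network can down-weight a degraded modality frame by frame, which is the kind of behaviour the paper's gating-coefficient ablation examines.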
- dc.description.sponsorship This work is partly supported by the Spanish Ministry of Science and Innovation under project grant PID2020-114083GB-I00 and by the donation bahi2018-19 to the CMTech group at UPF. Further funding has been received from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 826506 (sustAGE), along with the UDeco project under the German BMBF KMU-innovativ programme.
- dc.format.mimetype application/pdf
- dc.identifier.citation Aspandi D, Sukno F, Schuller BW, Binefa X. Audio-visual gated-sequenced neural networks for affect recognition. IEEE Trans Affect Comput. 2023;14(3):2193-2208. DOI: 10.1109/TAFFC.2022.3156026
- dc.identifier.doi http://dx.doi.org/10.1109/TAFFC.2022.3156026
- dc.identifier.issn 1949-3045
- dc.identifier.uri http://hdl.handle.net/10230/60358
- dc.language.iso eng
- dc.publisher Institute of Electrical and Electronics Engineers (IEEE)
- dc.relation.ispartof IEEE Trans Affect Comput. 2023;14(3):2193-2208.
- dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/826506
- dc.relation.projectID info:eu-repo/grantAgreement/ES/2PE/PID2020-114083GB-I00
- dc.rights © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. http://dx.doi.org/10.1109/TAFFC.2022.3156026
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.subject.keyword Affective computing
- dc.subject.keyword Deep learning
- dc.subject.keyword Multi-modal fusion
- dc.subject.keyword Sequence modelling
- dc.title Audio-visual gated-sequenced neural networks for affect recognition
- dc.type info:eu-repo/semantics/article
- dc.type.version info:eu-repo/semantics/acceptedVersion