Hierarchical and multimodal learning for heterogeneous sound classification


  • dc.contributor.author Anastasopoulou, Panagiota
  • dc.contributor.author Dal Rí, Francesco Ardan
  • dc.contributor.author Serra, Xavier
  • dc.contributor.author Font, Frederic
  • dc.date.accessioned 2025-10-10T11:44:57Z
  • dc.date.available 2025-10-10T11:44:57Z
  • dc.date.issued 2025
  • dc.description.abstract This paper investigates multimodal and hierarchical classification strategies to enhance performance in real-world sound classification tasks, centered on the two-level structure of the Broad Sound Taxonomy. We propose a framework that enables the system to consider high-level sound categories when refining its predictions at the subclass level, thereby aligning with the natural hierarchy of sound semantics. To improve accuracy, we integrate acoustic features with semantic cues extracted from crowdsourced textual metadata of the sounds, such as titles, tags, and descriptions. During training, we utilize and compare pre-trained embeddings across these modalities, enabling better generalization across acoustically heterogeneous yet semantically related categories. Our experiments show that the use of text-audio embeddings improves classification. We also observe that, although hierarchical supervision does not significantly impact accuracy, it leads to more coherent and perceptually structured latent representations. These improvements in classification performance and representation quality make the system more suitable for downstream tasks such as sound retrieval, description, and similarity search.
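The hierarchical, multimodal setup described in the abstract can be illustrated with a minimal sketch: audio and text embeddings are fused, a broad-level head predicts the top-level category, and a subclass head additionally receives the broad-level prediction so high-level context can inform the fine-grained decision. All dimensions, the concatenation-based fusion, and the linear heads below are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 512-d audio and text embeddings, 5 broad classes,
# 20 subclasses. These numbers are placeholders, not taken from the paper.
D_AUDIO, D_TEXT, N_BROAD, N_SUB = 512, 512, 5, 20

audio_emb = rng.standard_normal(D_AUDIO)   # e.g. from a pre-trained audio model
text_emb = rng.standard_normal(D_TEXT)     # e.g. from title/tag/description text

# Fuse the two modalities by simple concatenation (one common choice).
x = np.concatenate([audio_emb, text_emb])

# Broad-level head: a plain linear classifier over the fused embedding.
W_broad = rng.standard_normal((N_BROAD, x.size)) * 0.01
broad_logits = W_broad @ x
broad_probs = np.exp(broad_logits) / np.exp(broad_logits).sum()

# Subclass head also sees the broad-level probabilities, so the top-level
# category can condition the subclass prediction.
W_sub = rng.standard_normal((N_SUB, x.size + N_BROAD)) * 0.01
sub_logits = W_sub @ np.concatenate([x, broad_probs])
sub_probs = np.exp(sub_logits) / np.exp(sub_logits).sum()

print("broad class:", broad_probs.argmax(), "subclass:", sub_probs.argmax())
```

In a trained system the weight matrices would be learned jointly, and the fusion and conditioning mechanisms could be more elaborate; the sketch only shows how broad-level predictions can feed the subclass head.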
  • dc.description.sponsorship This research was partially funded by the Secretaria d’Universitats i Recerca del Departament de Recerca i Universitats de la Generalitat de Catalunya (2024FI-100252) under the Joan Oró program, the IA y Música: Cátedra en Inteligencia Artificial y Música (TSI-100929-2023-1) by the Secretaría de Estado de Digitalización e Inteligencia Artificial and the European Union-Next Generation EU under the program Cátedras ENIA 2022, and the IMPA project (PID2023-152250OBI00) by the Ministry of Science, Innovation and Universities of the Spanish Government, the Agencia Estatal de Investigación (AEI) and co-financed by the European Union.
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Anastasopoulou P, Dal Rí FA, Serra X, Font F. Hierarchical and multimodal learning for heterogeneous sound classification. In: Benetos E, Font F, Fuentes M, Martin Morato I, Rocamora M, editors. Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025); 2025 Oct 30-31; Barcelona, Spain. [Barcelona]: DCASE; 2025. p. 105-9.
  • dc.identifier.uri http://hdl.handle.net/10230/71472
  • dc.language.iso eng
  • dc.publisher Detection and Classification of Acoustic Scenes and Events (DCASE)
  • dc.relation.ispartof Benetos E, Font F, Fuentes M, Martin Morato I, Rocamora M, editors. Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025); 2025 Oct 30-31; Barcelona, Spain. [Barcelona]: DCASE; 2025.
  • dc.relation.projectID info:eu-repo/grantAgreement/ES/3PE/PID2023-152250OB-I00
  • dc.rights This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit: http://creativecommons.org/licenses/by/4.0/
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.rights.uri http://creativecommons.org/licenses/by/4.0/
  • dc.subject.keyword Sound classification
  • dc.subject.keyword Hierarchical
  • dc.subject.keyword Multimodal
  • dc.subject.keyword Taxonomy
  • dc.title Hierarchical and multimodal learning for heterogeneous sound classification
  • dc.type info:eu-repo/semantics/conferenceObject
  • dc.type.version info:eu-repo/semantics/publishedVersion