Evaluation of deep audio representations for semantic sound similarity


  • dc.contributor.author Araz, Recep Oguz
  • dc.contributor.author Bogdanov, Dmitry
  • dc.contributor.author Alonso-Jiménez, Pablo
  • dc.contributor.author Font Corbera, Frederic
  • dc.date.accessioned 2025-02-25T07:24:58Z
  • dc.date.available 2025-02-25T07:24:58Z
  • dc.date.issued 2024
  • dc.description.abstract Navigating large audio collections presents a significant challenge due to the intricate nature of sound properties and the varied needs of users. To enhance user experience, audio-sharing platforms offer a sound similarity function, which leverages vector-based representations of audio clips to facilitate the retrieval of sounds. This study evaluates the retrieval performances of one manually engineered audio representation and thirteen deep audio embeddings in the semantic sound similarity task. By employing a diverse range of models, our research investigates the effects of utilizing different input modalities and training objectives. In the process, we explore various design choices for integrating embeddings into sound similarity systems. Our evaluation is based on objective ranked performance metrics that incorporate sound classes and sound families, complemented by preliminary subjective assessments. We observe that the multimodal models using audio and language modalities outperform audio-only models by a significant margin, which in turn outperform audio and image models. Notably, the state-of-the-art models on the sound event classification task are not the top-performing models on the semantic sound similarity task. In addition, our findings in embedding processing methods and similarity search functions provide insights broadly applicable to information retrieval systems across different modalities.
  • dc.description.sponsorship This work is supported by “IA y Musica: Cátedra en Inteligencia Artificial y Música” (TSI-100929-2023-1) funded by the Secretaría de Estado de Digitalización e Inteligencia Artificial and the European Union-Next Generation EU, under the program Cátedras ENIA.
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Araz RO, Bogdanov D, Alonso-Jiménez P, Font F. Evaluation of deep audio representations for semantic sound similarity. In: 21st International Conference on Content-Based Multimedia Indexing (CBMI) 2024; 2024 Sept 18-20; Reykjavik, Iceland. New Jersey: IEEE; 2024. 7 p. DOI: 10.1109/CBMI62980.2024.10859250
  • dc.identifier.doi http://dx.doi.org/10.1109/CBMI62980.2024.10859250
  • dc.identifier.uri http://hdl.handle.net/10230/69722
  • dc.language.iso eng
  • dc.publisher Institute of Electrical and Electronics Engineers (IEEE)
  • dc.rights © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. http://dx.doi.org/10.1109/CBMI62980.2024.10859250
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.subject.keyword Sound similarity
  • dc.subject.keyword Semantic similarity
  • dc.subject.keyword Audio information retrieval
  • dc.subject.keyword Content-based retrieval
  • dc.subject.keyword Query-by-example
  • dc.subject.keyword QbE
  • dc.subject.keyword Deep embeddings
  • dc.subject.keyword Representation learning
  • dc.subject.keyword Multimodal representation learning
  • dc.title Evaluation of deep audio representations for semantic sound similarity
  • dc.type info:eu-repo/semantics/conferenceObject
  • dc.type.version info:eu-repo/semantics/acceptedVersion
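
The abstract describes a sound similarity function that retrieves sounds by comparing vector-based representations of audio clips (query-by-example). The sketch below illustrates that general idea only, not the authors' system: the record does not say which similarity search function the paper uses, so cosine similarity is an assumption here, and the clip names, the three-dimensional toy vectors, and the helper functions are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def most_similar(query_id, embeddings, k=2):
    """Return the k clips closest to the query clip, best match first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id  # never return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real deep audio embeddings have far more dimensions.
toy_embeddings = {
    "dog_bark_1": [0.9, 0.1, 0.0],
    "dog_bark_2": [0.8, 0.2, 0.1],
    "rain": [0.0, 0.1, 0.9],
    "thunder": [0.1, 0.3, 0.8],
}

results = most_similar("dog_bark_1", toy_embeddings, k=2)
for clip_id, score in results:
    print(f"{clip_id}: {score:.3f}")
```

With these toy vectors the second dog bark ranks first for the `dog_bark_1` query. A ranked list of this kind is what the ranked performance metrics mentioned in the abstract would score, for example by checking whether the retrieved clips share the query's sound class or sound family.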