Araz, Recep Oguz; Bogdanov, Dmitry; Alonso-Jiménez, Pablo; Font Corbera, Frederic
2025-02-25
2025-02-25
2024
Araz RO, Bogdanov D, Alonso-Jiménez P, Font F. Evaluation of deep audio representations for semantic sound similarity. In: 21st International Conference on Content-Based Multimedia Indexing (CBMI) 2024; 2024 Sept 18-20; Reykjavik, Iceland. New Jersey: IEEE; 2024. 7 p. DOI: 10.1109/CBMI62980.2024.10859250
http://hdl.handle.net/10230/69722
Navigating large audio collections presents a significant challenge due to the intricate nature of sound properties and the varied needs of users. To enhance user experience, audio-sharing platforms offer a sound similarity function, which leverages vector-based representations of audio clips to facilitate the retrieval of sounds. This study evaluates the retrieval performances of one manually engineered audio representation and thirteen deep audio embeddings in the semantic sound similarity task. By employing a diverse range of models, our research investigates the effects of utilizing different input modalities and training objectives. In the process, we explore various design choices for integrating embeddings into sound similarity systems. Our evaluation is based on objective ranked performance metrics that incorporate sound classes and sound families, complemented by preliminary subjective assessments. We observe that the multimodal models using audio and language modalities outperform audio-only models by a significant margin, which in turn outperform audio and image models. Notably, the state-of-the-art models on the sound event classification task are not the top-performing models on the semantic sound similarity task. In addition, our findings in embedding processing methods and similarity search functions provide insights broadly applicable to information retrieval systems across different modalities.
application/pdf
eng
© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. http://dx.doi.org/10.1109/CBMI62980.2024.10859250
Evaluation of deep audio representations for semantic sound similarity
info:eu-repo/semantics/conferenceObject
http://dx.doi.org/10.1109/CBMI62980.2024.10859250
Sound similarity; Semantic similarity; Audio information retrieval; Content-based retrieval; Query-by-example; QbE; Deep embeddings; Representation learning; Multimodal representation learning
info:eu-repo/semantics/openAccess