Evaluation of deep audio representations for semantic sound similarity
Citation
- Araz RO, Bogdanov D, Alonso-Jiménez P, Font F. Evaluation of deep audio representations for semantic sound similarity. In: 21st International Conference on Content-Based Multimedia Indexing (CBMI) 2024; 2024 Sept 18-20; Reykjavik, Iceland. New Jersey: IEEE; 2024. 7 p. DOI: 10.1109/CBMI62980.2024.10859250
Description
Abstract
Navigating large audio collections is challenging because sound properties are intricate and user needs vary. To improve the user experience, audio-sharing platforms offer a sound similarity function that uses vector-based representations of audio clips to retrieve similar sounds. This study evaluates the retrieval performance of one manually engineered audio representation and thirteen deep audio embeddings on the semantic sound similarity task. By employing a diverse range of models, we investigate the effects of different input modalities and training objectives. In the process, we explore various design choices for integrating embeddings into sound similarity systems. Our evaluation is based on objective ranked performance metrics that incorporate sound classes and sound families, complemented by preliminary subjective assessments. We observe that multimodal models using audio and language modalities outperform audio-only models by a significant margin, and that audio-only models in turn outperform audio and image models. Notably, the state-of-the-art models on the sound event classification task are not the top-performing models on the semantic sound similarity task. In addition, our findings on embedding processing methods and similarity search functions provide insights broadly applicable to information retrieval systems across different modalities.
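As a rough illustration of the kind of setup the abstract describes, the sketch below ranks clips by the similarity of their embedding vectors and scores the ranking with a class-based metric. It is not the authors' implementation: cosine similarity and precision at k are assumptions chosen for simplicity (the abstract does not name the similarity function or the metrics used), and the function names, two-dimensional embeddings, and class labels are hypothetical toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed similarity function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_idx, embeddings):
    """Return indices of all other clips, most similar to the query first."""
    query = embeddings[query_idx]
    scored = [
        (cosine_similarity(query, emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != query_idx  # the query itself is excluded from the results
    ]
    # Sort by descending similarity; ties are broken by ascending index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


def precision_at_k(query_idx, embeddings, labels, k):
    """Fraction of the top-k retrieved clips that share the query's class label."""
    top_k = rank_by_similarity(query_idx, embeddings)[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for idx in top_k if labels[idx] == labels[query_idx])
    return hits / len(top_k)


# Hypothetical toy data: 2-D "embeddings" with one class label per clip.
embeddings = [
    [1.0, 0.0],  # clip 0
    [0.9, 0.1],  # clip 1
    [0.0, 1.0],  # clip 2
    [0.1, 0.9],  # clip 3
    [0.7, 0.7],  # clip 4
]
labels = ["dog_bark", "dog_bark", "rain", "rain", "dog_bark"]

if __name__ == "__main__":
    print(rank_by_similarity(0, embeddings))
    print(precision_at_k(0, embeddings, labels, k=2))
```

In a real system the embeddings would come from one of the evaluated models and the search would typically run over an indexed collection; the brute-force loop here is only meant to make the ranking and scoring steps concrete.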