Hierarchical and multimodal learning for heterogeneous sound classification
Citation
- Anastasopoulou P, Dal Rí Fa, Serra X, Font F. Hierarchical and multimodal learning for heterogeneous sound classification. In: Benetos E, Font F, Fuentes M, Martin Morato I, Rocamora M, editors. Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025); 2025 Oct 30-31; Barcelona, Spain. [Barcelona]: DCASE; 2025. p. 105-9.
Description
Abstract
This paper investigates multimodal and hierarchical classification strategies to enhance performance on real-world sound classification tasks, centered on the two-level structure of the Broad Sound Taxonomy. We propose a framework that enables the system to take high-level sound categories into account when refining its predictions at the subclass level, thereby aligning with the natural hierarchy of sound semantics. To improve accuracy, we integrate acoustic features with semantic cues extracted from crowdsourced textual metadata of the sounds, such as titles, tags, and descriptions. During training, we use and compare pre-trained embeddings across these modalities, enabling better generalization across acoustically heterogeneous yet semantically related categories. Our experiments show that combining text and audio embeddings improves classification. We also observe that, although hierarchical supervision does not significantly affect accuracy, it yields more coherent and perceptually structured latent representations. These improvements in classification performance and representation quality make the system more suitable for downstream tasks such as sound retrieval, description, and similarity search.
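
As a rough illustration of the approach described in the abstract, the sketch below (in PyTorch) shows one way a two-level classifier could fuse pre-trained audio and text embeddings and condition subclass predictions on the high-level category prediction. All names, dimensions, the concatenation-based fusion, and the softmax conditioning are illustrative assumptions, not the exact architecture from the paper.

    # Minimal sketch of a hierarchical multimodal classifier.
    # Everything here (dimensions, class counts, fusion and conditioning
    # scheme) is an assumption for illustration, not the authors' design.
    import torch
    import torch.nn as nn

    class HierarchicalSoundClassifier(nn.Module):
        def __init__(self, audio_dim=512, text_dim=512,
                     n_coarse=5, n_fine=23, hidden=256):
            super().__init__()
            # Fuse pre-trained audio and text embeddings by concatenation.
            self.fuse = nn.Sequential(
                nn.Linear(audio_dim + text_dim, hidden),
                nn.ReLU(),
            )
            # Head for the broad (top-level) sound categories.
            self.coarse_head = nn.Linear(hidden, n_coarse)
            # Subclass head conditioned on the coarse prediction: its input
            # is the fused representation plus the coarse class posterior.
            self.fine_head = nn.Linear(hidden + n_coarse, n_fine)

        def forward(self, audio_emb, text_emb):
            h = self.fuse(torch.cat([audio_emb, text_emb], dim=-1))
            coarse_logits = self.coarse_head(h)
            fine_logits = self.fine_head(
                torch.cat([h, coarse_logits.softmax(dim=-1)], dim=-1))
            return coarse_logits, fine_logits

    # Hierarchical supervision could then combine losses at both levels,
    # e.g. loss = ce(coarse_logits, y_coarse) + ce(fine_logits, y_fine),
    # where ce is a cross-entropy criterion (again, an assumed setup).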
