Urban sound & sight: dataset and benchmark for audio-visual urban scene understanding

Citation

Fuentes M, Steers B, Zinemanas P, Rocamora M, Bondi L, Wilkins J, Shi Q, Hou Y, Das S, Serra X, Bello JP. Urban sound & sight: dataset and benchmark for audio-visual urban scene understanding. In: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 2022 May 22-27; Singapore. [New Jersey]: The Institute of Electrical and Electronics Engineers; 2022. p. 141-5. DOI: 10.1109/ICASSP43922.2022.9747644

Description

Abstract
Automatic audio-visual urban traffic understanding is a growing area of research with many potential applications of value to industry, academia, and the public sector. Yet, the lack of well-curated resources for training and evaluating models to research in this area hinders their development. To address this we present a curated audio-visual dataset, Urban Sound & Sight (Urbansas), developed for investigating the detection and localization of sounding vehicles in the wild. Urbansas consists of 12 hours of unlabeled data along with 3 hours of manually annotated data, including bounding boxes with classes and unique id of vehicles, and strong audio labels featuring vehicle types and indicating off-screen sounds. We discuss the challenges presented by the dataset and how to use its annotations for the localization of vehicles in the wild through audio models.
Description
Paper presented at: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), held May 22-27, 2022, in Singapore.
DOI
http://doi.org/10.1109/ICASSP43922.2022.9747644
Collections
Articles (Departament de Tecnologies de la Informació i les Comunicacions)