Urban sound & sight: dataset and benchmark for audio-visual urban scene understanding

Fuentes, Magdalena; Steers, Bea; Zinemanas, Pablo; Rocamora, Martín; Bondi, Luca; Wilkins, Julia; Shi, Qianyi; Hou, Yao; Das, Samarjit; Serra, Xavier; Bello, Juan Pablo

dc.contributor.author Fuentes, Magdalena
dc.contributor.author Steers, Bea
dc.contributor.author Zinemanas, Pablo
dc.contributor.author Rocamora, Martín
dc.contributor.author Bondi, Luca
dc.contributor.author Wilkins, Julia
dc.contributor.author Shi, Qianyi
dc.contributor.author Hou, Yao
dc.contributor.author Das, Samarjit
dc.contributor.author Serra, Xavier
dc.contributor.author Bello, Juan Pablo
dc.date.accessioned 2022-06-20T05:54:39Z
dc.date.available 2022-06-20T05:54:39Z
dc.date.issued 2022
dc.description Paper presented at: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), held May 22-27, 2022, in Singapore.
dc.description.abstract Automatic audio-visual urban traffic understanding is a growing area of research with many potential applications of value to industry, academia, and the public sector. Yet, the lack of well-curated resources for training and evaluating models to research in this area hinders their development. To address this we present a curated audio-visual dataset, Urban Sound & Sight (Urbansas), developed for investigating the detection and localization of sounding vehicles in the wild. Urbansas consists of 12 hours of unlabeled data along with 3 hours of manually annotated data, including bounding boxes with classes and unique id of vehicles, and strong audio labels featuring vehicle types and indicating off-screen sounds. We discuss the challenges presented by the dataset and how to use its annotations for the localization of vehicles in the wild through audio models.
dc.format.mimetype application/pdf
dc.identifier.citation Fuentes M, Steers B, Zinemanas P, Rocamora M, Bondi L, Wilkins J, Shi Q, Hou Y, Das S, Serra X, Bello JP. Urban sound & sight: dataset and benchmark for audio-visual urban scene understanding. In: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 2022 May 22-27; Singapore. [New Jersey]: The Institute of Electrical and Electronics Engineers; 2022. p. 141-5. DOI: 10.1109/ICASSP43922.2022.9747644
dc.identifier.doi http://doi.org/10.1109/ICASSP43922.2022.9747644
dc.identifier.uri http://hdl.handle.net/10230/53526
dc.language.iso eng
dc.publisher Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 2022 May 22-27; Singapore. [New Jersey]: The Institute of Electrical and Electronics Engineers; 2022. p. 141-5.
dc.rights © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. http://dx.doi.org/10.1109/ICASSP43922.2022.9747644
dc.rights.accessRights info:eu-repo/semantics/openAccess
dc.subject.keyword audio-visual
dc.subject.keyword urban research
dc.subject.keyword traffic
dc.subject.keyword dataset
dc.title Urban sound & sight: dataset and benchmark for audio-visual urban scene understanding
dc.type info:eu-repo/semantics/article
dc.type.version info:eu-repo/semantics/acceptedVersion

Collections

Articles (Departament de Tecnologies de la Informació i les Comunicacions)