Fuentes, Magdalena; Steers, Bea; Zinemanas, Pablo; Rocamora, Martín; Bondi, Luca; Wilkins, Julia; Shi, Qianyi; Hou, Yao; Das, Samarjit; Serra, Xavier; Bello, Juan Pablo
2022-06-20
2022-06-20
2022
Fuentes M, Steers B, Zinemanas P, Rocamora M, Bondi L, Wilkins J, Shi Q, Hou Y, Das S, Serra X, Bello JP. Urban sound & sight: dataset and benchmark for audio-visual urban scene understanding. In: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 2022 May 22-27; Singapore. [New Jersey]: The Institute of Electrical and Electronics Engineers; 2022. p. 141-5. DOI: 10.1109/ICASSP43922.2022.9747644
http://hdl.handle.net/10230/53526
Paper presented at: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), held May 22-27, 2022, in Singapore.
Automatic audio-visual urban traffic understanding is a growing area of research with many potential applications of value to industry, academia, and the public sector. Yet, the lack of well-curated resources for training and evaluating models to research in this area hinders their development. To address this we present a curated audio-visual dataset, Urban Sound & Sight (Urbansas), developed for investigating the detection and localization of sounding vehicles in the wild. Urbansas consists of 12 hours of unlabeled data along with 3 hours of manually annotated data, including bounding boxes with classes and unique id of vehicles, and strong audio labels featuring vehicle types and indicating off-screen sounds. We discuss the challenges presented by the dataset and how to use its annotations for the localization of vehicles in the wild through audio models.
application/pdf
eng
© 2022 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
http://dx.doi.org/10.1109/ICASSP43922.2022.9747644
Urban sound & sight: dataset and benchmark for audio-visual urban scene understanding
info:eu-repo/semantics/article
http://doi.org/10.1109/ICASSP43922.2022.9747644
audio-visual; urban research; traffic; dataset
info:eu-repo/semantics/openAccess