Automatic audio-visual urban traffic understanding is a growing area
of research with many potential applications of value to industry,
academia, and the public sector. Yet, the lack of well-curated resources for training and evaluating models hinders research in this area. To address this, we present a curated
audio-visual dataset, Urban Sound & Sight (Urbansas), developed
for investigating the detection and localization of sounding vehicles
in the wild. Urbansas consists of 12 hours of unlabeled data along
with 3 hours of manually annotated data, including bounding boxes
with class labels and unique vehicle IDs, and strong audio labels specifying vehicle types and indicating off-screen sounds. We discuss the
challenges presented by the dataset and how to use its annotations
for the localization of vehicles in the wild through audio models.
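To make the annotation structure concrete, below is a minimal sketch of how the labels for one annotated clip could be represented in Python. The field names and types are illustrative assumptions for exposition only; they are not drawn from the dataset's actual release format.

from dataclasses import dataclass, field

# Hypothetical schema for one annotated Urbansas clip.
# All names and types below are assumptions for illustration,
# not the dataset's published annotation format.

@dataclass
class BoundingBox:
    frame: int          # video frame index
    x: float            # top-left corner, normalized to [0, 1]
    y: float
    width: float
    height: float
    vehicle_class: str  # e.g. "car", "bus", "motorcycle", "truck"
    track_id: int       # unique ID linking one vehicle across frames

@dataclass
class AudioEvent:
    start_time: float   # seconds from clip start
    end_time: float
    vehicle_class: str  # vehicle type heard in the audio
    offscreen: bool     # True if the sounding vehicle is not visible

@dataclass
class AnnotatedClip:
    clip_id: str
    boxes: list[BoundingBox] = field(default_factory=list)
    audio_events: list[AudioEvent] = field(default_factory=list)

Under these assumptions, the track_id field is what ties detections of the same vehicle together across frames, while the offscreen flag allows audio models to be evaluated on sound sources the camera never sees.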