This work investigates cross-modal connections
between audio and video sources in the task of musical
instrument recognition. We also address the
interpretation of the representations learned by convolutional
neural networks (CNNs), studying feature correspondence
between the audio and visual components of a multimodal
CNN architecture. For each instrument category,
we select the most activated neurons and investigate
cross-correlations between those neurons of the audio and
video CNNs which activate for the same instrument category.
We analyse two training schemes for multimodal applications
and perform a comparative analysis and visualisation
of model predictions.
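The neuron-selection and cross-correlation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of mean activation for ranking neurons, the value of k, and the use of Pearson correlation are all assumptions for the sake of the example.

```python
import numpy as np

def top_activated_neurons(activations, k=5):
    """Indices of the k neurons with the highest mean activation
    over a set of examples (activations: [n_examples, n_neurons])."""
    return np.argsort(activations.mean(axis=0))[::-1][:k]

def cross_modal_correlation(audio_acts, video_acts, k=5):
    """Pearson correlation between each of the top-k audio neurons
    and each of the top-k video neurons, over the same examples
    (e.g. all clips of one instrument category)."""
    a_idx = top_activated_neurons(audio_acts, k)
    v_idx = top_activated_neurons(video_acts, k)
    corr = np.zeros((k, k))
    for i, a in enumerate(a_idx):
        for j, v in enumerate(v_idx):
            corr[i, j] = np.corrcoef(audio_acts[:, a],
                                     video_acts[:, v])[0, 1]
    return a_idx, v_idx, corr
```

A high off-diagonal entry in `corr` would suggest that a strongly activated audio neuron and a strongly activated video neuron respond to the same instrument category across examples.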