Understanding the reasons behind the predictions of deep neural
networks is a pressing concern as it can be critical in several application
scenarios. In this work, we present a novel interpretable
model for polyphonic sound event detection. It tackles one of the
limitations of our previous work, namely the difficulty of properly
handling a multi-label setting. The proposed architecture incorporates
a prototype layer and an attention mechanism. The network learns a
set of local prototypes in the latent space representing a patch in the
input representation. Besides, it learns attention maps for positioning
the local prototypes and reconstructing the latent space. Then,
the predictions are solely based on the attention maps. Thus, the
explanations provided are the attention maps and the corresponding
local prototypes. Moreover, one can reconstruct the prototypes
to the audio domain for inspection. The results obtained in urban
sound event detection are comparable to those of two opaque baselines,
while using fewer parameters and offering interpretability.
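The prediction pathway described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all dimensions, weights, and the choice of negative squared distance as the similarity measure are hypothetical, chosen only to show how attention maps over local prototypes can alone drive multi-label predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): T time frames,
# D latent dims, K local prototypes, C event classes.
T, D, K, C = 8, 16, 4, 3

latent = rng.normal(size=(T, D))      # latent patches from an encoder
prototypes = rng.normal(size=(K, D))  # learned local prototypes

# Similarity of each latent patch to each prototype: negative squared
# Euclidean distance, a common choice in prototype-based networks.
dists = ((latent[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
sim = -dists  # shape (T, K)

# Attention map per prototype: softmax over time frames, locating
# where each local prototype is active in the input.
attn = np.exp(sim - sim.max(axis=0, keepdims=True))
attn /= attn.sum(axis=0, keepdims=True)  # shape (T, K)

# Predictions based solely on the attention maps: a linear layer
# (hypothetical weights W) maps attention activations to class scores,
# followed by max-pooling over time and a sigmoid for multi-label output.
W = rng.normal(size=(K, C))
frame_scores = attn @ W                                     # (T, C)
clip_scores = 1 / (1 + np.exp(-frame_scores.max(axis=0)))   # (C,)
print(clip_scores.shape)
```

Because the class scores depend only on the attention maps and prototype weights, the attention map for each prototype serves directly as the explanation for the corresponding prediction.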