We present MusAV, a new public benchmark dataset
for comparative validation of arousal and valence (AV) regression
models for audio-based music emotion recognition.
To gather the ground truth, we rely on relative judgments
instead of absolute values to simplify the manual
annotation process and improve its consistency. We build
MusAV by gathering comparative annotations of arousal
and valence on pairs of tracks, using track audio previews
and metadata from the Spotify API. The resulting dataset
contains 2,092 track previews covering 1,404 genres, with
pairwise relative AV judgments by 20 annotators and various
subsets of the ground truth based on different levels
of annotation agreement. We demonstrate the use of the
dataset in an example study evaluating nine models for AV
regression that we train based on state-of-the-art audio embeddings
and three existing datasets of absolute AV annotations.
The results on MusAV offer a view of the performance
of the models complementary to the metrics obtained
during training and provide insights into the impact
of the considered datasets and embeddings on the generalization
abilities of the models.
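
As a rough illustration of what comparative validation against relative judgments could look like, the sketch below scores a regression model by the fraction of annotated track pairs whose predicted ordering matches the annotators' judgment. It is a minimal sketch under stated assumptions, not the evaluation procedure or data format used for MusAV: the function name `pairwise_agreement`, the track identifiers, and the `(higher, lower)` tuple layout are all hypothetical.

```python
from typing import Dict, List, Tuple


def pairwise_agreement(
    predictions: Dict[str, float],
    judgments: List[Tuple[str, str]],
) -> float:
    """Fraction of annotated pairs whose predicted order matches the judgment.

    predictions: hypothetical mapping from track id to a model's predicted
        arousal (or valence) value.
    judgments: hypothetical list of (higher, lower) track-id pairs, meaning
        annotators judged the first track higher than the second.
    """
    if not judgments:
        raise ValueError("at least one annotated pair is required")

    # A pair counts as correct when the model ranks the tracks the same way
    # the annotators did; ties in the predictions count as incorrect.
    correct = sum(
        1 for higher, lower in judgments
        if predictions[higher] > predictions[lower]
    )
    return correct / len(judgments)


# Toy values for illustration only; these are not taken from the dataset.
example_predictions = {"t1": 0.9, "t2": 0.2, "t3": 0.5}
example_judgments = [("t1", "t2"), ("t3", "t1"), ("t3", "t2")]
score = pairwise_agreement(example_predictions, example_judgments)
print(f"pairwise agreement: {score:.3f}")
```

Because only the ordering within each pair matters, a score of this kind does not depend on the scale of the absolute AV annotations a model was trained on, which is one reason relative judgments can complement the metrics obtained during training.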