Supervised music source separation systems using deep
learning are trained by minimizing a loss function between
pairs of predicted separations and ground-truth isolated
sources. However, open datasets comprising isolated
sources are few, small, and restricted to a few music styles.
At the same time, multi-track datasets with source bleeding
are usually larger and easier to compile.
In this work, we address the task of singing voice separation
when the ground-truth signals have bleeding and only
the target vocals and the corresponding mixture are available.
We train a cold diffusion model on the frequency
domain to iteratively transform a mixture into the corresponding
vocals with bleeding. Next, we build the final
separation masks by clustering spectrogram bins according
to their evolution along the transformation steps. We
test our approach on a Carnatic music scenario, for which
only datasets with bleeding exist, while current research
on this repertoire commonly relies on source separation models
trained solely on Western commercial music. Our evaluation
on a Carnatic test set shows that our system improves over
Spleeter in interference removal and is competitive in
terms of signal distortion. Code is open sourced.
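The abstract only sketches the mask-building step. As a rough, hedged illustration of the idea, the snippet below clusters time-frequency bins by how their magnitudes evolve across the transformation steps and turns one cluster into a binary vocal mask. The array shapes, the choice of k-means, the trajectory normalization, and the energy-based heuristic for picking the vocal cluster are all assumptions made for this sketch, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def masks_from_diffusion_steps(step_mags, n_clusters=2):
    """Cluster spectrogram bins by their evolution over the diffusion steps.

    step_mags: array of shape (n_steps, n_freq, n_time), the magnitude
    spectrogram produced at each cold-diffusion transformation step
    (hypothetical layout; adapt to the actual model output).
    Returns a binary mask of shape (n_freq, n_time) for the vocal estimate.
    """
    n_steps, n_freq, n_time = step_mags.shape
    # Describe each time-frequency bin by its magnitude trajectory over steps.
    trajectories = step_mags.reshape(n_steps, -1).T          # (bins, steps)
    # Normalize so clustering reacts to the shape of the evolution,
    # not to the absolute energy of the bin (an assumption of this sketch).
    norms = np.linalg.norm(trajectories, axis=1, keepdims=True) + 1e-8
    trajectories = trajectories / norms
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(trajectories)
    # Heuristic: take the cluster whose bins carry more energy in the final
    # (vocals-with-bleeding) step as the vocal cluster.
    final = step_mags[-1].reshape(-1)
    vocal_cluster = max(range(n_clusters),
                        key=lambda c: final[labels == c].mean())
    return (labels == vocal_cluster).astype(np.float32).reshape(n_freq, n_time)
```

Applied to a mixture spectrogram, the mask would be multiplied element-wise with the complex mixture STFT before inversion; the paper's open-sourced code should be consulted for the actual mask construction.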