A diffusion-inspired training strategy for singing voice extraction in the waveform domain

dc.contributor.authorPlaja-Roglans, Genís
dc.contributor.authorMiron, Marius
dc.contributor.authorSerra, Xavier
dc.date.accessioned2023-04-11T06:43:01Z
dc.date.available2023-04-11T06:43:01Z
dc.date.issued2022
dc.descriptionPaper presented at the 23rd International Society for Music Information Retrieval Conference (ISMIR 2022), held 4-8 December 2022 in Bangalore, India.
dc.description.abstractNotable progress in music source separation has been achieved using multi-branch networks that operate on both temporal and spectral domains. However, such networks tend to be complex and heavy-weighted. In this work, we tackle the task of singing voice extraction from polyphonic music signals in an end-to-end manner using an approach inspired by the training and sampling process of denoising diffusion models. We perform unconditional signal modelling to gradually convert an input mixture signal to the corresponding singing voice or accompaniment. We use fewer parameters than the state-of-the-art models while operating on the waveform domain, bypassing the phase estimation problem. More concisely, we train a non-causal WaveNet using a diffusion-inspired strategy while improving the said network for singing voice extraction and obtaining performance comparable to the end-to-end state-of-the-art on MUSDB18. We further report results on a non-MUSDB-overlapping version of MedleyDB and the multi-track audio of Saraga Carnatic showing good generalization, and run perceptual tests of our approach. Code, models, and audio examples are made available.
dc.description.sponsorshipThis work was carried out under the projects Musical AI - PID2019-111403GB-I00/AEI/10.13039/501100011033 and NextCore - RTC2019-007248-7 funded by the Spanish Ministerio de Ciencia, Innovación y Universidades (MCIU) and the Agencia Estatal de Investigación (AEI).
dc.format.mimetypeapplication/pdf
dc.identifier.citationPlaja-Roglans G, Miron M, Serra X. A diffusion-inspired training strategy for singing voice extraction in the waveform domain. In: Rao P, Murthy H, Srinivasamurthy A, Bittner R, Caro Repetto R, Goto M, Serra X, Miron M, editors. Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR 2022); 2022 Dec 4-8; Bengaluru, India. [Canada]: International Society for Music Information Retrieval; 2022. p. 685-93. DOI: 10.5281/zenodo.7316754
dc.identifier.doihttp://dx.doi.org/10.5281/zenodo.7316754
dc.identifier.isbn978-1-7327299-2-6
dc.identifier.urihttp://hdl.handle.net/10230/56443
dc.language.isoeng
dc.publisherInternational Society for Music Information Retrieval (ISMIR)
dc.relation.ispartofRao P, Murthy H, Srinivasamurthy A, Bittner R, Caro Repetto R, Goto M, Serra X, Miron M, editors. Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR 2022); 2022 Dec 4-8; Bengaluru, India. [Canada]: International Society for Music Information Retrieval; 2022. p. 685-93.
dc.relation.isreferencedbyhttps://github.com/genisplaja/diffusion-vocal-sep
dc.relation.projectIDinfo:eu-repo/grantAgreement/ES/2PE/PID2019-111403GB-I00
dc.relation.projectIDinfo:eu-repo/grantAgreement/ES/2PE/RTC2019-007248-7
dc.rights© G. Plaja-Roglans, M. Miron, and X. Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
dc.rights.accessRightsinfo:eu-repo/semantics/openAccess
dc.rights.urihttp://creativecommons.org/licenses/by/4.0/
dc.subject.otherMusic -- Computer science
dc.subject.otherSinging
dc.titleA diffusion-inspired training strategy for singing voice extraction in the waveform domain
dc.typeinfo:eu-repo/semantics/conferenceObject
dc.type.versioninfo:eu-repo/semantics/publishedVersion

Files

Original bundle

Name:
Serra_Pro_Diff.pdf
Size:
451.98 KB
Format:
Adobe Portable Document Format