Self-supervised learning from automatically separated sound scenes

dc.contributor.author Fonseca, Eduardo
dc.contributor.author Jansen, Aren
dc.contributor.author Ellis, Daniel P. W.
dc.contributor.author Wisdom, Scott
dc.contributor.author Tagliasacchi, Marco
dc.contributor.author Hershey, John R.
dc.contributor.author Plakal, Manoj
dc.contributor.author Hershey, Shawn
dc.contributor.author Moore, R. Channing
dc.contributor.author Serra, Xavier
dc.date.accessioned 2023-03-09T07:26:47Z
dc.date.issued 2021
dc.identifier.citation Fonseca E, Jansen A, Ellis DPW, Wisdom S, Tagliasacchi M, Hershey JR, Plakal M, Hershey S, Moore RC, Serra X. Self-supervised learning from automatically separated sound scenes. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); 2021 Oct 17-20; New Paltz, United States. [Piscataway]: IEEE; 2021. p. 251-5. DOI: 10.1109/WASPAA52581.2021.9632739
dc.identifier.issn 1931-1168
dc.identifier.uri http://hdl.handle.net/10230/56125
dc.description Paper presented at the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), held October 17-20, 2021, in New Paltz, United States.
dc.description.abstract Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
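The core idea in the abstract, using automatically separated channels of a sound mixture as semantically linked views for contrastive learning, can be illustrated compactly. The following is a minimal PyTorch sketch, not the authors' implementation: the names `encoder` and `separator` are hypothetical stand-ins for an embedding network and a pretrained unsupervised separation model (e.g., MixIT-style) used purely as an augmentation front-end, and only the similarity-maximization objective is shown; the coincidence-prediction objective mentioned in the abstract is omitted for brevity.

```python
# Hedged sketch of contrastive learning between an audio mixture and one of
# its automatically separated channels. `encoder` and `separator` are assumed
# components, not code from the paper.
import torch
import torch.nn.functional as F

def nt_xent_loss(z_mix, z_sep, temperature=0.1):
    """Similarity maximization (NT-Xent) between mixture embeddings and
    embeddings of a separated channel; other batch items act as negatives."""
    z_mix = F.normalize(z_mix, dim=1)           # (B, D) unit-norm embeddings
    z_sep = F.normalize(z_sep, dim=1)           # (B, D)
    logits = z_mix @ z_sep.t() / temperature    # (B, B) pairwise similarities
    targets = torch.arange(z_mix.size(0), device=z_mix.device)
    # Symmetric cross-entropy: each mixture should match its own separation.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def train_step(encoder, separator, mixtures):
    """One self-supervised step: separate, pick a random channel as the
    second view, embed both views, and maximize their agreement."""
    with torch.no_grad():
        channels = separator(mixtures)                   # assumed (B, C, T)
        idx = torch.randint(channels.size(1), (mixtures.size(0),))
        views = channels[torch.arange(mixtures.size(0)), idx]  # (B, T)
    z_mix = encoder(mixtures)   # embed the full sound scene
    z_sep = encoder(views)      # embed one separated source
    return nt_xent_loss(z_mix, z_sep)
```

Because the loss treats the mixture and a separated channel as a positive pair, even imperfectly converged separation models can supply useful view transformations, which is consistent with the abstract's observation that optimal separation is not required.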
dc.format.mimetype application/pdf
dc.language.iso eng
dc.publisher Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); 2021 Oct 17-20; New Paltz, United States. [Piscataway]: IEEE; 2021. p. 251-5.
dc.rights © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. http://dx.doi.org/10.1109/WASPAA52581.2021.9632739
dc.title Self-supervised learning from automatically separated sound scenes
dc.type info:eu-repo/semantics/conferenceObject
dc.identifier.doi http://dx.doi.org/10.1109/WASPAA52581.2021.9632739
dc.subject.keyword contrastive learning
dc.subject.keyword audio representation learning
dc.subject.keyword self-supervision
dc.subject.keyword source separation
dc.rights.accessRights info:eu-repo/semantics/openAccess
dc.type.version info:eu-repo/semantics/acceptedVersion
