dc.contributor.author | Fonseca, Eduardo
dc.contributor.author | Jansen, Aren
dc.contributor.author | Ellis, Daniel P. W.
dc.contributor.author | Wisdom, Scott
dc.contributor.author | Tagliasacchi, Marco
dc.contributor.author | Hershey, John R.
dc.contributor.author | Plakal, Manoj
dc.contributor.author | Hershey, Shawn
dc.contributor.author | Moore, R. Channing
dc.contributor.author | Serra, Xavier
dc.date.accessioned | 2023-03-09T07:26:47Z
dc.date.issued | 2021
dc.identifier.citation | Fonseca E, Jansen A, Ellis DPW, Wisdom S, Tagliasacchi M, Hershey JR, Plakal M, Hershey S, Moore RC, Serra X. Self-supervised learning from automatically separated sound scenes. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); 2021 Oct 17-20; New Paltz, United States. [Piscataway]: IEEE; 2021. p. 251-5. DOI: 10.1109/WASPAA52581.2021.9632739
dc.identifier.issn | 1931-1168
dc.identifier.uri | http://hdl.handle.net/10230/56125
dc.description | Paper presented at the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), held October 17-20, 2021, in New Paltz, United States.
dc.description.abstract | Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
dc.format.mimetype | application/pdf
dc.language.iso | eng
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof | 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); 2021 Oct 17-20; New Paltz, United States. [Piscataway]: IEEE; 2021. p. 251-5.
dc.rights | © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. http://dx.doi.org/10.1109/WASPAA52581.2021.9632739
dc.title | Self-supervised learning from automatically separated sound scenes
dc.type | info:eu-repo/semantics/conferenceObject
dc.identifier.doi | http://dx.doi.org/10.1109/WASPAA52581.2021.9632739
dc.subject.keyword | contrastive learning
dc.subject.keyword | audio representation learning
dc.subject.keyword | self-supervision
dc.subject.keyword | source separation
dc.rights.accessRights | info:eu-repo/semantics/openAccess
dc.type.version | info:eu-repo/semantics/acceptedVersion
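
The abstract above describes jointly optimizing a similarity-maximization (contrastive) objective across views formed from an input mixture and its automatically separated channels. The sketch below is only an illustration of that kind of contrastive term, not the authors' implementation; the embedding shapes, temperature value, and the NT-Xent-style loss form are assumptions made for the example.

# Minimal sketch (assumed, not the paper's code): a contrastive loss that pulls
# the embedding of a mixture toward the embedding of its own separated channel
# and pushes it away from embeddings paired with other mixtures in the batch.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    # Unit-normalize embeddings so the dot product is a cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_loss(mix_emb, sep_emb, temperature=0.1):
    """NT-Xent-style similarity-maximization term.

    mix_emb: (batch, dim) embeddings of the input mixtures.
    sep_emb: (batch, dim) embeddings of the corresponding separated channels
             (one channel per mixture, selected or pooled upstream).
    """
    z_mix = l2_normalize(mix_emb)
    z_sep = l2_normalize(sep_emb)
    logits = z_mix @ z_sep.T / temperature              # (batch, batch) similarities
    # Diagonal pairs (mixture, its own separated channel) are the positives;
    # take a softmax cross-entropy with those diagonal targets.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage with random stand-in "embeddings" (correlated positives).
rng = np.random.default_rng(0)
mix = rng.normal(size=(8, 128))
sep = mix + 0.1 * rng.normal(size=(8, 128))
print(contrastive_loss(mix, sep))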