VoViT: low latency graph-based audio-visual voice separation transformer
- dc.contributor.author Montesinos García, Juan Felipe
- dc.contributor.author Kadandale, Venkatesh S.
- dc.contributor.author Haro Ortega, Gloria
- dc.date.accessioned 2025-03-27T07:24:43Z
- dc.date.available 2025-03-27T07:24:43Z
- dc.date.issued 2022
- dc.description.abstract This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present several ablation studies and a comparison with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/.
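The abstract outlines a two-stage architecture: a graph convolutional network over face landmarks extracts motion features, an audio-visual transformer produces a first estimate of the target voice, and an audio-only network refines it. The sketch below is only an illustrative PyTorch rendering of that pipeline; the module names, the 68-landmark input, the placeholder adjacency matrix, the spectrogram-masking formulation, and all layer sizes are assumptions made for this example and do not reproduce the authors' released implementation (see https://ipcv.github.io/VoViT/).

```python
import torch
import torch.nn as nn

class LandmarkGraphEncoder(nn.Module):
    """Stand-in for the lightweight graph convolutional motion encoder.

    Applies a basic graph-convolution step (A @ X @ W) per video frame over
    the face-landmark graph, using a placeholder normalized adjacency matrix."""
    def __init__(self, num_landmarks=68, in_dim=2, out_dim=128):
        super().__init__()
        adj = torch.eye(num_landmarks)            # placeholder adjacency (self-loops only)
        self.register_buffer("adj", adj / adj.sum(-1, keepdim=True))
        self.proj = nn.Linear(in_dim, out_dim)
        self.readout = nn.Linear(num_landmarks * out_dim, out_dim)

    def forward(self, landmarks):                 # (batch, time, landmarks, 2)
        x = self.proj(torch.einsum("ij,btjc->btic", self.adj, landmarks))
        return self.readout(x.flatten(2))         # (batch, time, out_dim) motion features

class AudioVisualSeparator(nn.Module):
    """Stage 1: fuse audio and motion features with a transformer encoder and
    predict a mask over the mixture spectrogram. Stage 2: refine the coarse
    estimate with an audio-only network (here a small convolutional stack)."""
    def __init__(self, freq_bins=256, feat_dim=128, n_layers=4):
        super().__init__()
        self.audio_in = nn.Linear(freq_bins, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.av_transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mask_head = nn.Linear(feat_dim, freq_bins)
        self.refiner = nn.Sequential(             # audio-only enhancement stage
            nn.Conv1d(freq_bins, freq_bins, 3, padding=1), nn.ReLU(),
            nn.Conv1d(freq_bins, freq_bins, 3, padding=1),
        )

    def forward(self, mixture_spec, motion_feats):
        # mixture_spec: (batch, time, freq_bins); motion_feats: (batch, time, feat_dim)
        fused = self.audio_in(mixture_spec) + motion_feats
        mask = torch.sigmoid(self.mask_head(self.av_transformer(fused)))
        coarse = mask * mixture_spec              # stage-1 estimate of the target voice
        refined = self.refiner(coarse.transpose(1, 2)).transpose(1, 2)
        return coarse, refined

if __name__ == "__main__":
    graph_enc = LandmarkGraphEncoder()
    separator = AudioVisualSeparator()
    landmarks = torch.randn(1, 50, 68, 2)         # 50 video frames of 68 face landmarks
    spec = torch.rand(1, 50, 256)                 # magnitude spectrogram of the mixture
    coarse, refined = separator(spec, graph_enc(landmarks))
    print(coarse.shape, refined.shape)            # torch.Size([1, 50, 256]) twice
```

As a usage note, the low-latency character of the approach comes from keeping the visual branch lightweight (landmarks rather than raw video frames); in this sketch that corresponds to the small LandmarkGraphEncoder feeding the heavier audio-visual transformer.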
- dc.description.sponsorship We acknowledge support by MICINN/FEDER UE project PID2021-127643NB-I00; H2020-MSCA-RISE-2017 project 777826 NoMADS. J.F.M. acknowledges support by FPI scholarship PRE2018-083920. We acknowledge NVIDIA Corporation for the donation of GPUs used for the experiments.
- dc.format.mimetype application/pdf
- dc.identifier.citation Montesinos JF, Kadandale VS, Haro G. VoViT: low latency graph-based audio-visual voice separation transformer. In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors. 17th European Conference on Computer Vision, Part XXXVII (ECCV 2022); 2022 Oct 23-27; Tel Aviv, Israel. Cham: Springer; 2022. p. 310-26. (LNCS; no. 13697). DOI: 10.1007/978-3-031-19836-6_18
- dc.identifier.doi 10.1007/978-3-031-19836-6_18
- dc.identifier.uri http://hdl.handle.net/10230/70025
- dc.language.iso eng
- dc.publisher Springer
- dc.relation.ispartof Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors. 17th European Conference on Computer Vision, Part XXXVII (ECCV 2022); 2022 Oct 23-27; Tel Aviv, Israel. Cham: Springer; 2022. p. 310-26. (LNCS; no. 13697). DOI: 10.1007/978-3-031-19836-6_18
- dc.relation.projectID info:eu-repo/grantAgreement/ES/3PE/PID2021-127643
- dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/777826
- dc.rights © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. S. Avidan et al. (Eds.): ECCV 2022, LNCS 13697, pp. 310-326, 2022. https://doi.org/10.1007/978-3-031-19836-6_18
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.subject.keyword Audio-visual
- dc.subject.keyword Source separation
- dc.subject.keyword Speech
- dc.subject.keyword Singing voice
- dc.title VoViT: low latency graph-based audio-visual voice separation transformer
- dc.type info:eu-repo/semantics/conferenceObject
- dc.type.version info:eu-repo/semantics/acceptedVersion