VocaLiST: an audio-visual synchronisation model for lips and voices


  • dc.contributor.author Kadandale, Venkatesh S.
  • dc.contributor.author Montesinos García, Juan Felipe
  • dc.contributor.author Haro Ortega, Gloria
  • dc.date.accessioned 2023-02-23T07:10:51Z
  • dc.date.available 2023-02-23T07:10:51Z
  • dc.date.issued 2022
  • dc.description Paper presented at Interspeech 2022, held September 18-22, 2022 in Incheon, South Korea.
  • dc.description.abstract In this paper, we address the problem of lip-voice synchronisation in videos containing a human face and voice. Our approach is based on determining whether the lip motion and the voice in a video are synchronised, depending on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models on the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of the singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model that was trained end-to-end. The demos, source code, and pre-trained models are available at https://ipcv.github.io/VocaLiST/
  • dc.description.sponsorship We acknowledge support by MICINN/FEDER UE project PID2021-127643NB-I00; H2020-MSCA-RISE-2017 project 777826 NoMADS. V. S. K. has received support through “la Caixa” Foundation (ID 100010434), fellowship code: LCF/BQ/DI18/11660064 and the Marie Skłodowska-Curie grant agreement No. 713673. J. F. M. acknowledges support by FPI scholarship PRE2018-083920.
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Kadandale VS, Montesinos JF, Haro G. VocaLiST: an audio-visual synchronisation model for lips and voices. In: Proc. Interspeech 2022; 2022 Sep 18-22; Incheon, South Korea. [Baixas]: International Speech Communication Association; 2022. p. 3128-32. DOI: 10.21437/Interspeech.2022-10861
  • dc.identifier.doi http://dx.doi.org/10.21437/Interspeech.2022-10861
  • dc.identifier.uri http://hdl.handle.net/10230/55883
  • dc.language.iso eng
  • dc.publisher International Speech Communication Association (ISCA)
  • dc.relation.ispartof Proc. Interspeech 2022; 2022 Sep 18-22; Incheon, South Korea. [Baixas]: International Speech Communication Association; 2022. p. 3128-32.
  • dc.relation.isreferencedby https://ipcv.github.io/VocaLiST/
  • dc.relation.projectID info:eu-repo/grantAgreement/ES/3PE/PID2021-127643NB-I00
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/777826
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/713673
  • dc.relation.projectID info:eu-repo/grantAgreement/ES/2PE/PRE2018-083920
  • dc.rights © 2022 ISCA
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.subject.keyword audio-visual
  • dc.subject.keyword speech
  • dc.subject.keyword singing voice
  • dc.subject.keyword synchronisation
  • dc.subject.keyword source separation
  • dc.subject.keyword self-supervision
  • dc.subject.keyword cross-modal
  • dc.title VocaLiST: an audio-visual synchronisation model for lips and voices
  • dc.type info:eu-repo/semantics/conferenceObject
  • dc.type.version info:eu-repo/semantics/publishedVersion
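The abstract describes scoring audio-visual correspondence with a cross-modal transformer, where one modality attends over the other. The following is a minimal NumPy sketch of that general idea only; the feature shapes, the single attention step, and the cosine-based scoring rule are illustrative assumptions, not the architecture or training objective of VocaLiST itself.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_score(audio_feats, visual_feats):
    """Toy audio-visual correspondence score (illustrative only).

    audio_feats:  (Ta, d) sequence of audio embeddings
    visual_feats: (Tv, d) sequence of lip-motion embeddings

    Audio frames act as queries attending over visual keys; the score
    is the mean cosine similarity between each audio frame and its
    attended visual context. Higher score = better lip-voice agreement.
    """
    d = audio_feats.shape[1]
    # Scaled dot-product cross-attention: audio queries, visual keys/values.
    attn = softmax(audio_feats @ visual_feats.T / np.sqrt(d), axis=1)  # (Ta, Tv)
    context = attn @ visual_feats                                      # (Ta, d)
    # Per-frame cosine similarity between audio and attended visual context.
    num = (audio_feats * context).sum(axis=1)
    den = (np.linalg.norm(audio_feats, axis=1)
           * np.linalg.norm(context, axis=1) + 1e-8)
    return float((num / den).mean())

# A synchronised pair (visual features correlated with audio) should
# score higher than a mismatched pair of independent features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 16))
visual_sync = audio + 0.1 * rng.standard_normal((10, 16))
visual_rand = rng.standard_normal((10, 16))
print(cross_modal_score(audio, visual_sync) > cross_modal_score(audio, visual_rand))
```

In the paper's actual setting the decision is binary (synchronised or not), which would correspond to thresholding or classifying such a correspondence score rather than reading it off directly.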