VocaLiST: an audio-visual synchronisation model for lips and voices


  • dc.contributor.author Kadandale, Venkatesh S.
  • dc.contributor.author Montesinos García, Juan Felipe
  • dc.contributor.author Haro Ortega, Gloria
  • dc.date.accessioned 2023-02-23T07:10:51Z
  • dc.date.available 2023-02-23T07:10:51Z
  • dc.date.issued 2022
  • dc.description Paper presented at Interspeech 2022, held September 18-22, 2022 in Incheon, South Korea.
  • dc.description.abstract In this paper, we address the problem of lip-voice synchronisation in videos containing a human face and voice. Our approach is based on determining whether the lip motion and the voice in a video are synchronised, depending on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models on the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of the singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model that was trained end-to-end. The demos, source code, and pre-trained models are available at https://ipcv.github.io/VocaLiST/
  • dc.description.sponsorship We acknowledge support by MICINN/FEDER UE project PID2021-127643NB-I00; H2020-MSCA-RISE-2017 project 777826 NoMADS. V. S. K. has received support through “la Caixa” Foundation (ID 100010434), fellowship code: LCF/BQ/DI18/11660064 and the Marie Skłodowska-Curie grant agreement No. 713673. J. F. M. acknowledges support by FPI scholarship PRE2018-083920.
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Kadandale VS, Montesinos JF, Haro G. VocaLiST: an audio-visual synchronisation model for lips and voices. In: Proc. Interspeech 2022; 2022 Sep 18-22; Incheon, South Korea. [Baixas]: International Speech Communication Association; 2022. p. 3128-32. DOI: 10.21437/Interspeech.2022-10861
  • dc.identifier.doi http://dx.doi.org/10.21437/Interspeech.2022-10861
  • dc.identifier.uri http://hdl.handle.net/10230/55883
  • dc.language.iso eng
  • dc.publisher International Speech Communication Association (ISCA)
  • dc.relation.ispartof Proc. Interspeech 2022; 2022 Sep 18-22; Incheon, South Korea. [Baixas]: International Speech Communication Association; 2022. p. 3128-32.
  • dc.relation.isreferencedby https://ipcv.github.io/VocaLiST/
  • dc.relation.projectID info:eu-repo/grantAgreement/ES/3PE/PID2021-127643NB-I00
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/777826
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/713673
  • dc.relation.projectID info:eu-repo/grantAgreement/ES/2PE/PRE2018-083920
  • dc.rights © 2022 ISCA
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.subject.keyword audio-visual
  • dc.subject.keyword speech
  • dc.subject.keyword singing voice
  • dc.subject.keyword synchronisation
  • dc.subject.keyword source separation
  • dc.subject.keyword self-supervision
  • dc.subject.keyword cross-modal
  • dc.title VocaLiST: an audio-visual synchronisation model for lips and voices
  • dc.type info:eu-repo/semantics/conferenceObject
  • dc.type.version info:eu-repo/semantics/publishedVersion
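The abstract describes scoring audio-visual correspondence with a cross-modal transformer, where one modality attends over the other. The following is a minimal NumPy sketch of that general idea only; the feature shapes, the single attention step, and the cosine-based scoring rule are illustrative assumptions, not the architecture or training objective of VocaLiST itself.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_score(audio_feats, visual_feats):
    """Toy audio-visual correspondence score (illustrative only).

    audio_feats:  (Ta, d) sequence of audio embeddings
    visual_feats: (Tv, d) sequence of lip-motion embeddings

    Audio frames act as queries attending over visual keys; the score
    is the mean cosine similarity between each audio frame and its
    attended visual context. Higher score = better lip-voice agreement.
    """
    d = audio_feats.shape[1]
    # Scaled dot-product cross-attention: audio queries, visual keys/values.
    attn = softmax(audio_feats @ visual_feats.T / np.sqrt(d), axis=1)  # (Ta, Tv)
    context = attn @ visual_feats                                      # (Ta, d)
    # Per-frame cosine similarity between audio and attended visual context.
    num = (audio_feats * context).sum(axis=1)
    den = (np.linalg.norm(audio_feats, axis=1)
           * np.linalg.norm(context, axis=1) + 1e-8)
    return float((num / den).mean())

# A synchronised pair (visual features correlated with audio) should
# score higher than a mismatched pair of independent features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 16))
visual_sync = audio + 0.1 * rng.standard_normal((10, 16))
visual_rand = rng.standard_normal((10, 16))
print(cross_modal_score(audio, visual_sync) > cross_modal_score(audio, visual_rand))
```

In the paper's actual setting the decision is binary (synchronised or not), which would correspond to thresholding or classifying such a correspondence score rather than reading it off directly.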