Naturalness enhancement with linguistic information in end-to-end TTS using unsupervised parallel encoding

  • dc.contributor.author Peiró Lilja, Àlex
  • dc.contributor.author Farrús, Mireia
  • dc.date.accessioned 2020-11-11T07:30:26Z
  • dc.date.available 2020-11-11T07:30:26Z
  • dc.date.issued 2020
  • dc.description Paper presented at Interspeech 2020, held 25-29 October 2020 in Shanghai, China.
  • dc.description.abstract State-of-the-art end-to-end speech synthesis models have reached levels of quality close to human capabilities. However, there is still room for improvement in terms of naturalness, related to prosody, which is essential for human-machine interaction. Therefore, part of current research has shifted its focus to improving this aspect with many solutions, which mainly involve prosody adaptability or control. In this work, we explored a way to include linguistic features in the sequence-to-sequence Tacotron2 system to improve the naturalness of the generated voice, that is, to make the prosody of the synthesized speech more like that of a real human speaker. Specifically, we used an additional encoder to embed part-of-speech tags and punctuation mark locations of the input text in order to condition Tacotron2 generation. We propose two different architectures for this parallel encoder: one based on a stack of convolutional plus recurrent layers, and another formed by a stack of bidirectional recurrent plus linear layers. To evaluate the similarity between real read speech and synthesized speech, we carried out an objective test using signal processing metrics and a perceptual test. The presented results show that we achieved an improvement in naturalness.
  • dc.description.sponsorship This work is a part of the INGENIOUS project, funded by the European Union’s Horizon 2020 Research and Innovation Programme and the Korean Government under Grant Agreement No 833435. The second author has been funded by the Agencia Estatal de Investigación (AEI), Ministerio de Ciencia, Innovación y Universidades and the Fondo Social Europeo (FSE) under grant RYC-2015-17239 (AEI/FSE, UE). This work has been carried out using an NVIDIA GPU Titan Xp generously provided by NVIDIA Company.
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Peiró-Lilja A, Farrús M. Naturalness enhancement with linguistic information in end-to-end TTS using unsupervised parallel encoding. In: Proceedings of Interspeech 2020; 2020 Oct 25-29; Shanghai, China. [Baixas]: ISCA; 2020. p. 3994-8. DOI: 10.21437/Interspeech.2020-1788
  • dc.identifier.doi http://dx.doi.org/10.21437/Interspeech.2020-1788
  • dc.identifier.issn 1990-9772
  • dc.identifier.uri http://hdl.handle.net/10230/45714
  • dc.language.iso eng
  • dc.publisher International Speech Communication Association (ISCA)
  • dc.relation.ispartof Proceedings of Interspeech 2020; 2020 Oct 25-29; Shanghai, China. [Baixas]: ISCA; 2020.
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/833435
  • dc.rights © 2020 ISCA
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.subject.keyword Naturalness
  • dc.subject.keyword Prosody adaptation
  • dc.subject.keyword End-to-end
  • dc.subject.keyword Sequence-to-sequence
  • dc.subject.keyword Text-to-speech
  • dc.title Naturalness enhancement with linguistic information in end-to-end TTS using unsupervised parallel encoding
  • dc.type info:eu-repo/semantics/conferenceObject
  • dc.type.version info:eu-repo/semantics/publishedVersion
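
The abstract mentions conditioning Tacotron2 on part-of-speech tags and punctuation mark locations through a parallel encoder. As a rough illustration only, the sketch below shows one way such inputs could be aligned with the character sequence so that both encoders see sequences of equal length. The tag inventory `POS_VOCAB`, the function `linguistic_features`, and the per-character alignment are all assumptions made for this example; this record does not describe how the authors actually prepared their features.

```python
import string

# Hypothetical tag inventory; the tag set used in the paper is not given in this record.
POS_VOCAB = {"<pad>": 0, "DET": 1, "NOUN": 2, "VERB": 3, "ADV": 4, "PUNCT": 5}
PUNCTUATION = set(string.punctuation)


def linguistic_features(tagged_tokens):
    """Expand (token, POS tag) pairs into per-character feature sequences.

    Returns the reconstructed text, one POS id per character, and one 0/1
    punctuation flag per character. All three have the same length, so a
    parallel encoder could consume them alongside the character inputs.
    """
    chars, pos_ids, punct_flags = [], [], []
    for index, (token, tag) in enumerate(tagged_tokens):
        is_punct = all(ch in PUNCTUATION for ch in token)
        # Put a space before every word except the first one;
        # punctuation tokens attach directly to the preceding word.
        if index > 0 and not is_punct:
            chars.append(" ")
            pos_ids.append(POS_VOCAB["<pad>"])
            punct_flags.append(0)
        for ch in token:
            chars.append(ch)
            pos_ids.append(POS_VOCAB[tag])
            punct_flags.append(1 if ch in PUNCTUATION else 0)
    return "".join(chars), pos_ids, punct_flags


# Tags are supplied by hand here; in practice they would come from a POS tagger.
text, pos_ids, punct_flags = linguistic_features(
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")]
)
print(text)
print(pos_ids)
print(punct_flags)
```

In this sketch, spaces receive the padding id so that word boundaries stay visible to the encoder, and the punctuation flags mark locations without encoding which mark occurred. Both choices are illustrative, not taken from the paper.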