Naturalness enhancement with linguistic information in end-to-end TTS using unsupervised parallel encoding

Peiró Lilja, Àlex; Farrús, Mireia

dc.contributor.author Peiró Lilja, Àlex
dc.contributor.author Farrús, Mireia
dc.date.accessioned 2020-11-11T07:30:26Z
dc.date.available 2020-11-11T07:30:26Z
dc.date.issued 2020
dc.description Paper presented at Interspeech 2020, held 25-29 October 2020 in Shanghai, China.
dc.description.abstract State-of-the-art end-to-end speech synthesis models have reached levels of quality close to human capabilities. However, there is still room for improvement in terms of naturalness, related to prosody, which is essential for human-machine interaction. Therefore, part of current research has shifted its focus to improving this aspect with many solutions, which mainly involve prosody adaptability or control. In this work, we explored a way to include linguistic features in the sequence-to-sequence Tacotron2 system to improve the naturalness of the generated voice, that is, to make the prosody of the synthesized speech more closely resemble that of a real human speaker. Specifically, we used an additional encoder to embed part-of-speech tags and punctuation mark locations of the input text in order to condition Tacotron2 generation. We propose two different architectures for this parallel encoder: one based on a stack of convolutional plus recurrent layers, and another formed by a stack of bidirectional recurrent plus linear layers. To evaluate the similarity between real read speech and synthesis, we carried out an objective test using signal processing metrics and a perceptual test. The presented results show that we achieved an improvement in naturalness.
dc.description.sponsorship This work is a part of the INGENIOUS project, funded by the European Union’s Horizon 2020 Research and Innovation Programme and the Korean Government under Grant Agreement No 833435. The second author has been funded by the Agencia Estatal de Investigación (AEI), Ministerio de Ciencia, Innovación y Universidades and the Fondo Social Europeo (FSE) under grant RYC-2015-17239 (AEI/FSE, UE). This work has been carried out using an NVIDIA GPU Titan Xp generously provided by NVIDIA Company.
dc.format.mimetype application/pdf
dc.identifier.citation Peiró-Lilja A, Farrús M. Naturalness enhancement with linguistic information in end-to-end TTS using unsupervised parallel encoding. In: Proceedings of Interspeech 2020; 2020 Oct 25-29; Shanghai, China. [Baixas]: ISCA; 2020. p. 3994-8. DOI: 10.21437/Interspeech.2020-1788
dc.identifier.doi http://dx.doi.org/10.21437/Interspeech.2020-1788
dc.identifier.issn 1990-9772
dc.identifier.uri http://hdl.handle.net/10230/45714
dc.language.iso eng
dc.publisher International Speech Communication Association (ISCA)
dc.relation.ispartof Proceedings of Interspeech 2020; 2020 Oct 25-29; Shanghai, China. [Baixas]: ISCA; 2020.
dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/833435
dc.rights.accessRights info:eu-repo/semantics/openAccess
dc.subject.keyword Naturalness
dc.subject.keyword Prosody adaptation
dc.subject.keyword End-to-end
dc.subject.keyword Sequence-to-sequence
dc.subject.keyword Text-to-speech
dc.title Naturalness enhancement with linguistic information in end-to-end TTS using unsupervised parallel encoding
dc.type info:eu-repo/semantics/conferenceObject
dc.type.version info:eu-repo/semantics/publishedVersion

Collections

Conferences (Department of Information and Communication Technologies)
OpenAIRE Documents (Open Access Infrastructure for Research in Europe)