Naturalness enhancement with linguistic information in end-to-end TTS using unsupervised parallel encoding

Citation

  • Peiró-Lilja A, Farrús M. Naturalness enhancement with linguistic information in end-to-end TTS using unsupervised parallel encoding. In: Proceedings of Interspeech 2020; 2020 Oct 25-29; Shanghai, China. [Baixas]: ISCA; 2020. p. 3994-8. DOI: 10.21437/Interspeech.2020-1788

Description

  • Abstract

    State-of-the-art end-to-end speech synthesis models have reached levels of quality close to human capabilities. However, there is still room for improvement in terms of naturalness, related to prosody, which is essential for human-machine interaction. Therefore, part of current research has shifted its focus to improving this aspect with many solutions, which mainly involve prosody adaptability or control. In this work, we explored a way to include linguistic features into the sequence-to-sequence Tacotron2 system to improve the naturalness of the generated voice, that is, to make the prosody of the synthesized speech more like that of a real human speaker. Specifically, we used an additional encoder to embed part-of-speech tags and punctuation mark locations of the input text in order to condition Tacotron2 generation. We propose two different architectures for this parallel encoder: one based on a stack of convolutional plus recurrent layers, and another formed by a stack of bidirectional recurrent plus linear layers. To evaluate the similarity between real read speech and synthesis, we carried out an objective test using signal processing metrics and a perceptual test. The presented results show that we achieved an improvement in naturalness.
  • Description

    Paper presented at Interspeech 2020, held from 25 to 29 October 2020 in Shanghai, China.
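
The abstract mentions conditioning Tacotron2 on part-of-speech tags and punctuation mark locations of the input text. As a rough illustration of what such a linguistic feature sequence could look like before it reaches a parallel encoder, here is a minimal sketch. It is not the authors' pipeline: the toy lexicon `TOY_POS_LEXICON`, the tag inventory `POS_TAGS`, and the function `linguistic_features` are hypothetical names introduced only for this example, and a real system would presumably rely on a trained POS tagger.

```python
import re

# Hypothetical toy lexicon (assumption); a real system would use a trained POS tagger.
TOY_POS_LEXICON = {
    "the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
    "sat": "VERB", "barked": "VERB", "and": "CCONJ", "loudly": "ADV",
}
# Hypothetical tag inventory (assumption); index 0 is reserved for unknown words.
POS_TAGS = ["UNK", "DET", "NOUN", "VERB", "CCONJ", "ADV", "PUNCT"]
POS_TO_ID = {tag: i for i, tag in enumerate(POS_TAGS)}
PUNCTUATION = set(".,;:!?")


def linguistic_features(text):
    """Return one (token, pos_id, is_punct) triple per word or punctuation mark."""
    # Split the text into alphabetic words and single punctuation marks.
    tokens = re.findall(r"[A-Za-z']+|[.,;:!?]", text)
    features = []
    for token in tokens:
        if token in PUNCTUATION:
            tag, is_punct = "PUNCT", 1
        else:
            tag, is_punct = TOY_POS_LEXICON.get(token.lower(), "UNK"), 0
        features.append((token, POS_TO_ID[tag], is_punct))
    return features


features = linguistic_features("The cat sat, and the dog barked loudly.")
for token, pos_id, is_punct in features:
    print(token, POS_TAGS[pos_id], is_punct)
```

In a setup like the one the abstract outlines, integer sequences of this kind would be embedded and passed through the additional encoder alongside the usual text encoder; how the two streams are aligned and combined is not specified in the abstract and is left out here.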