Sequence-to-sequence singing synthesis using the feed-forward transformer


  • dc.contributor.author Blaauw, Merlijn
  • dc.contributor.author Bonada, Jordi, 1973-
  • dc.date.accessioned 2021-02-12T07:23:14Z
  • dc.date.issued 2020
  • dc.description Paper presented at: ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, held online, 4-8 May 2020.
  • dc.description.abstract We propose a sequence-to-sequence singing synthesizer, which avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the more common approach of a content-based attention mechanism combined with an autoregressive decoder, we use a different mechanism suitable for feed-forward synthesis. Given that phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. Then, using a decoder based on a feed-forward variant of the Transformer model, a series of self-attention and convolutional layers refines the result of the initial alignment to reach the target acoustic features. Advantages of this approach include faster inference and avoiding the exposure bias issues that affect autoregressive models trained by teacher forcing. We evaluate the effectiveness of this model compared to an autoregressive baseline, the importance of self-attention, and the importance of the accuracy of the duration model.
  • dc.description.sponsorship This work was funded by TROMPA H2020 No 770376.
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Blaauw M, Bonada J. Sequence-to-sequence singing synthesis using the feed-forward transformer. In: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 2020 May 4-8; Barcelona, Spain. New Jersey: The Institute of Electrical and Electronics Engineers; 2020. p. 7229-33. DOI: 10.1109/ICASSP40776.2020.9053944
  • dc.identifier.doi http://dx.doi.org/10.1109/ICASSP40776.2020.9053944
  • dc.identifier.issn 2379-190X
  • dc.identifier.uri http://hdl.handle.net/10230/46457
  • dc.language.iso eng
  • dc.publisher Institute of Electrical and Electronics Engineers (IEEE)
  • dc.relation.ispartof 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 2020 May 4-8; Barcelona, Spain. New Jersey: The Institute of Electrical and Electronics Engineers; 2020. p. 7229-33
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/770376
  • dc.rights © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. http://dx.doi.org/10.1109/ICASSP40776.2020.9053944
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.subject.keyword Singing synthesis
  • dc.subject.keyword Sequence-to-sequence
  • dc.subject.keyword Self-attention
  • dc.subject.keyword Feed-forward
  • dc.subject.keyword Transformer
  • dc.title Sequence-to-sequence singing synthesis using the feed-forward transformer
  • dc.type info:eu-repo/semantics/conferenceObject
  • dc.type.version info:eu-repo/semantics/acceptedVersion
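The core idea in the abstract, expanding phoneme-level features to frame level using predicted durations (the "initial alignment") and then refining the result with self-attention, can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, dimensions, and single-head attention are illustrative assumptions only, using NumPy in place of a deep-learning framework.

```python
import numpy as np

def expand_by_duration(phoneme_feats, durations):
    """Repeat each phoneme's feature vector by its predicted frame count,
    yielding an approximate frame-level initial alignment (illustrative)."""
    return np.repeat(phoneme_feats, durations, axis=0)

def self_attention(x):
    """Single-head scaled dot-product self-attention over all frames
    (a toy stand-in for the paper's feed-forward Transformer layers)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)               # frame-to-frame similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ x

# Toy example: 3 phonemes with 4-dim features, durations of 2, 3, and 1 frames.
phonemes = np.arange(12, dtype=float).reshape(3, 4)
frames = expand_by_duration(phonemes, np.array([2, 3, 1]))
refined = frames + self_attention(frames)        # residual refinement step
print(frames.shape)                              # (6, 4)
```

Because every frame is produced in one parallel pass rather than conditioned on previously generated frames, this style of decoder avoids the teacher-forcing exposure bias and slow sequential inference of autoregressive models, which is the trade-off the abstract highlights.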