Öktem, Alp; Farrús, Mireia; Bonafonte Cávez, Antonio
2022-05-17
2022-05-17
2021
Öktem A, Farrús M, Bonafonte A. Corpora compilation for prosody-informed speech processing. Lang Resour Eval. 2021;55:925-46. DOI: 10.1007/s10579-021-09556-2
1574-020X
http://hdl.handle.net/10230/53108
Research on speech technologies necessitates spoken data, which is usually obtained through read recorded speech, and specifically adapted to the research needs. When the aim is to deal with the prosody involved in speech, the available data must reflect natural and conversational speech, which is usually costly and difficult to get. This paper presents a machine learning-oriented toolkit for collecting, handling, and visualization of speech data, using prosodic heuristic. We present two corpora resulting from these methodologies: PANTED corpus, containing 250 h of English speech from TED Talks, and Heroes corpus containing 8 h of parallel English and Spanish movie speech. We demonstrate their use in two deep learning-based applications: punctuation restoration and machine translation. The presented corpora are freely available to the research community.
application/pdf
eng
© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Corpora compilation for prosody-informed speech processing
info:eu-repo/semantics/article
http://dx.doi.org/10.1007/s10579-021-09556-2
Speech corpus; Parallel data; Speech transcription; Spoken machine translation; Punctuation; Pause; F0; Intensity
info:eu-repo/semantics/openAccess