Research on speech technologies requires spoken data, which is
usually obtained from recordings of read speech and specifically adapted to the
research needs. When the aim is to deal with prosody in speech, the
available data must reflect natural, conversational speech, which is usually
costly and difficult to obtain. This paper presents a machine learning-oriented toolkit
for collecting, handling, and visualizing speech data using prosodic heuristics.
We present two corpora resulting from these methodologies: the PANTED corpus,
containing 250 h of English speech from TED Talks, and the Heroes corpus, containing
8 h of parallel English and Spanish movie speech. We demonstrate their use in two
deep learning-based applications: punctuation restoration and machine translation.
The presented corpora are freely available to the research community.