Intonation is traditionally considered the most important prosodic feature, and considerable research effort has therefore been devoted to the automatic segmentation and labeling of speech samples to capture intonation cues. A number of studies also show that incorporating duration or intensity further improves automatic prosody labeling. However, combinations of word-level acoustic features still yield poor results when machine learning techniques are applied to annotated corpora to derive intonation for speech synthesis applications. To address this problem, we present an experimental setup for developing a hierarchical prosodic structure model that combines linguistic features, including information structure, with three acoustic cues (intensity, pitch, and duration). We show empirically that this combination yields a considerably more accurate representation of prosody and, consequently, a more reliable automatic labeling of speech corpora for machine learning.
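The feature combination described above can be illustrated with a minimal sketch: word-level acoustic measurements (pitch, intensity, duration) joined with a binary information-structure feature, fed to a simple classifier. All feature names, toy values, and labels here are illustrative assumptions, not the paper's actual corpus, features, or model; a hand-rolled 1-nearest-neighbour rule stands in for the machine learning component.

```python
import math

# Each training row: (pitch_mean_hz, intensity_db, duration_ms, is_focus)
# is_focus is an assumed binary information-structure feature.
# Labels are toy prosodic tags, not a real annotation scheme.
TRAIN = [
    ((220.0, 68.0, 310.0, 1.0), "accented"),
    ((180.0, 60.0, 140.0, 0.0), "unaccented"),
    ((240.0, 70.0, 350.0, 1.0), "accented"),
    ((175.0, 58.0, 120.0, 0.0), "unaccented"),
]

def predict(features):
    """Label a word by its nearest training example (Euclidean distance)."""
    _, label = min(
        ((math.dist(features, x), lab) for x, lab in TRAIN),
        key=lambda pair: pair[0],
    )
    return label

print(predict((230.0, 69.0, 330.0, 1.0)))  # nearest neighbours are accented words
```

In practice the acoustic features would be extracted from forced-aligned speech and the learner would be trained on a prosodically annotated corpus; the sketch only shows how heterogeneous word-level features are combined into a single feature vector for classification.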