Maximizing accuracy of forced alignment for spontaneous child speech
Sociophonetic study of large speech corpora generally requires forced alignment (the automatic process of determining the start and end time of each speech sound within a recording) to facilitate large-scale automated extraction of acoustic measurements of targeted vowels or consonants. There is an extensive literature evaluating the alignment accuracy of a number of forced alignment tools and procedures, applied to speech data from a range of languages and dialects. In general, these evaluations use typical adult speech data, often elicited in a controlled laboratory environment. There is little literature on the effectiveness of forced alignment systems on child speech, and none on speech elicited in field environments. This presents a problem for research at the intersection of language acquisition and sociophonetics, as there is no established best practice for automatically aligning child speech. Child speech presents special challenges to automated tools, as it includes more variation in speech sounds and voice quality, as well as non-standard pronunciation and prosody. We evaluated two toolkits, Kaldi (via the Montreal Forced Aligner, MFA) and the Hidden Markov Model Toolkit (HTK), using different configurations to force-align non-rhotic child speech elicited in a preschool environment. Against many of our expectations, we found that MFA, using rhotic acoustic models pre-trained on adult speech, performed best. This paper provides a clear methodology for other researchers in sociophonetics to evaluate the success or otherwise of phonetic alignment.
Keywords: child speech, language acquisition, sociophonetics, speech corpora, forced alignment