Maximizing accuracy of forced alignment for spontaneous child speech
- Robert Fromont, robert.fromont@canterbury.ac.nz, New Zealand Institute of Language, Brain and Behaviour, University of Canterbury
- Lynn Clark, New Zealand Institute of Language, Brain and Behaviour, University of Canterbury
- Joshua Wilson Black, New Zealand Institute of Language, Brain and Behaviour, University of Canterbury
- Margaret Blackwood, New Zealand Institute of Language, Brain and Behaviour, University of Canterbury
Abstract
Sociophonetic study of large speech corpora generally requires forced alignment (the automatic process of determining the start and end time of each speech sound within a recording) to facilitate large-scale automated extraction of acoustic measurements of targeted vowels or consonants. There is an extensive literature evaluating the alignment accuracy of a number of forced alignment tools and procedures, processing speech data from a range of languages and dialects. In general, these evaluations use typical adult speech data, often elicited in a controlled laboratory environment. There is little literature on the effectiveness of forced alignment systems on child speech, and none on speech elicited in field environments. This presents a problem for research at the intersection of language acquisition and sociophonetics, as there is no established best practice for automatically aligning child speech. Child speech presents special challenges to automated tools, as it includes more variation in speech sounds and voice quality, as well as non-standard pronunciation and prosody. We evaluated two toolkits, Kaldi (via the Montreal Forced Aligner, MFA) and the Hidden Markov Model Toolkit (HTK), using different configurations to force-align non-rhotic child speech elicited in a preschool environment. Against many of our expectations, we found that MFA, using rhotic acoustic models pre-trained on adult speech, performed best. This paper provides a clear methodology that other researchers in sociophonetics can use to evaluate the success or otherwise of phonetic alignment.
Keywords:
- child speech
- language acquisition
- sociophonetics
- speech corpora
- forced alignment
Published on
1 September 2023
Peer Reviewed