Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? - A computational investigation

Khazar Khorrami; Okko Räsänen

doi:10.34842/w3vw-s845

Abstract

Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While the gradual development of such capabilities is unquestionable, the exact nature of these skills and of the underlying mental representations remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent, referentially ambiguous visual input. These models can operate without prior linguistic knowledge, such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question of whether knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, without the units ever being proximal learning goals for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent to which the audiovisual aspect of LLH is supported by existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect the phonetic, syllabic, or lexical structure of input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and temporal characteristics of the representations. We find that representations associated with phonetic, syllabic, and lexical units of speech indeed emerge from the audiovisual learning process. The finding is also robust against variations in model architecture and in the characteristics of model training and testing data. The results suggest that cross-modal and cross-situational learning may, in principle, assist in early language development much beyond just enabling association of acoustic word forms to their referential meanings.

Keywords: neural networks, language representation learning, visually grounded speech, computational modeling, early language acquisition

Published on
2021-08-05

Peer Reviewed

License

Creative Commons Attribution-Noncommercial 4.0 International

Authors

Khazar Khorrami (Unit of Computing Sciences, Tampere University)
Okko Räsänen (Unit of Computing Sciences, Tampere University)

Issue

Issue: Volume 1 • Issue 1 • 2021

Identifiers

DOI: https://doi.org/10.34842/w3vw-s845

Publication details

Pages: 123-191
Accepted on: 2021-07-14

File Checksums (MD5)

PDF: e3d0b2f9a8309f176cb8757d71d6d044
