
Jim Magnuson. Breaking the sound barrier: Toward realistic models of human speech recognition

30/9/2021
- ZOOM ROOM 2

What: Breaking the sound barrier: Toward realistic models of human speech recognition

Where: Zoom room 2

Who: Prof. Jim Magnuson, Department of Psychology, University of Connecticut & Ikerbasque Research Professor, BCBL

When: Thursday, September 30th at 12:30 PM

One of the great unsolved challenges in the cognitive and neural sciences is understanding how human listeners achieve phonetic constancy (seemingly effortless perception of a speaker's intended consonants and vowels under typical conditions) despite the lack of invariant cues to speech sounds. Models of human speech recognition (mathematical, neural network, or Bayesian) have been essential tools in theory development over the last forty years. However, they have been of little help in understanding phonetic constancy, because virtually none of them operates on real speech; instead, they map sequences of abstract consonants and vowels onto words in memory. The few models that do work on real speech borrow elements from automatic speech recognition (ASR), but do not achieve high accuracy and are arguably too complex to provide much theoretical insight. Over the last two decades, advances in deep learning have revolutionized ASR using neural networks that emerged from the same framework as those used in cognitive models, but these models offer little guidance for human speech recognition because of their complexity.

Our team asked whether we could borrow minimal elements from deep learning to construct a simple cognitive neural network that works on real speech. The result is EARSHOT, a neural network model trained on 1000 words produced by 9 talkers and tested on a tenth. It learns to map spectral slice inputs to sparse "pseudo-semantic" vectors via recurrent hidden units. The element we borrowed from deep learning is the use of "long short-term memory" (LSTM) nodes in the hidden layer; LSTM nodes have internal "gates" that allow them to become differentially sensitive to variable time scales. EARSHOT achieves high accuracy on trained items and moderate generalization to excluded talker-word pairs and excluded talkers, while exhibiting human-like over-time phonological competition. Analyses of hidden units, based on approaches used in human electrocorticography, reveal that the model learns a distributed phonetic code to map speech to semantics. I will discuss the implications for cognitive and neural theories of human speech learning and processing, and provide some updates on EARSHOT development since our first publication (Magnuson et al., 2020).
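To make the architecture described above more concrete, here is a minimal sketch of an EARSHOT-style network in PyTorch: spectral slices feed a single LSTM hidden layer whose states are read out as a pseudo-semantic vector at every time step. This is an illustration, not the authors' implementation; the layer sizes (256 spectral channels, 512 LSTM units, 300 semantic units), the sparsity level, and the training setup are assumptions chosen only to show the input-output contract.

```python
# Illustrative sketch of an EARSHOT-style model (not the published code).
# All dimensions and hyperparameters below are assumptions for demonstration.

import torch
import torch.nn as nn

class EarshotLikeNet(nn.Module):
    """Map a sequence of spectral slices to a pseudo-semantic vector
    at every time step via a single recurrent (LSTM) hidden layer."""
    def __init__(self, n_spectral=256, n_hidden=512, n_semantic=300):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_spectral, hidden_size=n_hidden,
                            batch_first=True)
        self.readout = nn.Linear(n_hidden, n_semantic)

    def forward(self, spectral_slices):
        # spectral_slices: (batch, time, n_spectral)
        hidden, _ = self.lstm(spectral_slices)       # (batch, time, n_hidden)
        # Sigmoid outputs approximate the 0/1 entries of a sparse
        # pseudo-semantic target vector.
        return torch.sigmoid(self.readout(hidden))   # (batch, time, n_semantic)

# Toy training step on random data, just to show the shapes involved.
model = EarshotLikeNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

spectra = torch.rand(8, 100, 256)                   # 8 word tokens, 100 frames
targets = (torch.rand(8, 300) < 0.05).float()       # sparse binary "meanings"
targets = targets.unsqueeze(1).expand(-1, 100, -1)  # same target at every frame

optimizer.zero_grad()
pred = model(spectra)
loss = loss_fn(pred, targets)
loss.backward()
optimizer.step()
```

Reading the semantic layer out at every frame is what makes it possible to track over-time phonological competition in the model's output, in the spirit of the analyses described above.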

Magnuson, J.S., You, H., Luthra, S., Li, M., Nam, H., Escabí, M., Brown, K., Allopenna, P.D., Theodore, R.M., Monto, N., & Rueckl, J.G. (2020). EARSHOT: A minimal neural network model of incremental human speech recognition. Cognitive Science, 44, e12823. http://dx.doi.org/10.1111/cogs.12823