Olasagasti, I., Hovsepyan, S., Bouton, S. & Giraud, A.
Department of Fundamental Neuroscience, University of Geneva, Geneva, Switzerland.
Speech processing is inherently multisensory, allowing one sensory system to use predictions about timing and content from another to facilitate parsing and decoding. In this work, we propose a predictive coding model of audiovisual processing of simple syllables that incorporates predictions about both timing and content.
The implementation is based on a generative model in which the activation of abstract units representing individual syllables generates predictions about the dynamic audiovisual sensory features consistent with that syllable: lip aperture in the visual modality, and second-formant transitions and voicing in the acoustic modality.
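As a rough illustration of this kind of architecture (not the actual implementation), the following Python sketch shows how activation of syllable units could generate predicted time-courses of lip aperture, second-formant (F2) transition, and voicing. The token set, template shapes, and noise model are placeholders chosen for illustration.

```python
import numpy as np

# Hypothetical illustration: each syllable unit is associated with assumed
# time-courses (templates) of the sensory features it predicts.
SYLLABLES = ["ba", "ga"]          # assumed token set, for illustration only
T = 50                            # number of time steps per trajectory

rng = np.random.default_rng(0)

# Assumed per-syllable feature templates: lip aperture (visual),
# second-formant (F2) transition and voicing (acoustic).
templates = {
    s: {
        "lip_aperture": np.abs(np.sin(np.linspace(0, np.pi, T)) + 0.1 * i),
        "f2_transition": np.linspace(1.2 + 0.3 * i, 1.8 - 0.2 * i, T),
        "voicing": (np.linspace(0, 1, T) > 0.3).astype(float),
    }
    for i, s in enumerate(SYLLABLES)
}

def generate_sensory_streams(syllable_activation, noise_sd=0.05):
    """Top-down pass of a generative model: a soft activation over
    syllable units produces predicted audiovisual feature trajectories."""
    pred = {k: np.zeros(T) for k in ("lip_aperture", "f2_transition", "voicing")}
    for s, w in zip(SYLLABLES, syllable_activation):
        for k in pred:
            pred[k] += w * templates[s][k]
    # Observed streams are modeled as the predictions plus sensory noise.
    obs = {k: v + noise_sd * rng.standard_normal(T) for k, v in pred.items()}
    return pred, obs
```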
The model provides a novel interpretation of audiovisual integration and specifies two potential ways through which visual information affects speech processing: 1) content-specific information, such as place of articulation, favors specific speech tokens and their related recognition units, and 2) content-unspecific timing information aligns internal rhythms and helps the parsing of continuous speech. It also specifies how content-specific information can affect speech processing by changing the internal representations of speech categories (learning). The model thus provides a unified account of "what" and "when" predictions across modalities, as well as associated sensory learning.
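The sketch below illustrates, under assumed names and precision weights rather than the model's actual update equations, how these two routes and the learning mechanism could look in code: precision-weighted prediction errors from each modality re-weight syllable hypotheses ("what"), a detected visual onset resets an internal parsing phase ("when"), and feature templates are nudged toward observed trajectories (learning).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_syllable_belief(log_prior, audio_err, visual_err,
                           pi_audio=1.0, pi_visual=0.5):
    """Content-specific route: combine precision-weighted prediction errors
    from both modalities. *_err[i] is the summed squared error between the
    observed stream and syllable i's predicted trajectory; pi_* are assumed
    precision weights."""
    log_evidence = -(pi_audio * audio_err + pi_visual * visual_err)
    return softmax(log_prior + log_evidence)

def align_internal_phase(phase, lip_onset_detected, reset_phase=0.0):
    """Content-unspecific timing route: an articulatory onset in the visual
    stream resets the internal rhythm used to parse the acoustic input."""
    return reset_phase if lip_onset_detected else phase

def update_templates(template, observed, belief, lr=0.01):
    """Sketch of sensory learning: templates of strongly believed syllables
    are pulled toward the observed trajectories (gradient-like update)."""
    return template + lr * belief * (observed - template)

# Example: visual evidence favouring the first syllable sharpens the posterior.
belief = update_syllable_belief(np.log([0.5, 0.5]),
                                audio_err=np.array([1.0, 1.1]),
                                visual_err=np.array([0.2, 2.0]))
print(belief)
```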