Frank, S., Fernandez Monsalve, I. & Thompson, R.
University College London
When reading a sentence, words that are less expected to occur require more cognitive effort to process, as is apparent from increased reading times. The extent to which a word is unexpected can be quantified by its ‘surprisal’, an information-theoretic measure that can be estimated by any probabilistic model of the language. Since language models differ in their underlying assumptions, they yield different surprisal estimates, which allows for a comparison between models: a model whose surprisal estimates predict reading times more accurately is based on cognitively more plausible assumptions. However, it is difficult to accurately estimate word surprisal using models that are complex enough to be cognitively interesting. Therefore, surprisal-based model comparison has thus far been performed only on the surprisal of each word’s syntactic category rather than of the word itself. This has the drawback that theory-dependent syntactic categories need to be assigned to the words.
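To make the measure concrete: the surprisal of a word is the negative logarithm of its probability given the preceding words, so less expected words receive higher values. The following is a minimal sketch under strong simplifying assumptions. It conditions on only the single preceding word (a bigram model with add-one smoothing), whereas the models discussed here condition on the entire sentence so far. The toy corpus and the names `train_bigram` and `surprisal` are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams; '<s>' marks the sentence start."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab

def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    context_counts, bigram_counts, vocab = model
    prob = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
    return -math.log2(prob)

# Invented toy corpus (an assumption for illustration, not study data).
toy_corpus = ["the dog barked", "the dog slept", "the cat slept"]
model = train_bigram(toy_corpus)

expected = surprisal("the", "dog", model)    # "dog" follows "the" twice
unexpected = surprisal("the", "cat", model)  # "cat" follows "the" once
```

In this toy corpus "cat" is the less frequent continuation of "the", so it receives the higher surprisal of the two.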
We solved this problem by reducing the vocabulary to 7,754 high-frequency words. The 702,412 sentences (comprising 7.6 million word tokens) that contained only those words were selected from the British National Corpus and used as training data for both a recurrent neural network (RNN) and a probabilistic phrase-structure grammar (PSG). Next, these models estimated word surprisal over a set of sentences that were semi-randomly selected from three novels. Reading times on the same sentences were collected using both self-paced reading and eye-tracking. A comparison of the predictive value of the two models’ surprisal estimates revealed that the RNN accounts for more variance in reading times, even though the PSG has access to linguistically informed syntactic structures and performed better at capturing the statistical patterns of the language. This strongly suggests that the human sentence-comprehension system proceeds more like an RNN than like a PSG, confirming earlier results based on syntactic categories.
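The logic of comparing models by the reading-time variance they account for can be sketched as follows. This is an illustration only: the abstract does not specify the regression analysis, so a simple one-predictor linear regression is assumed here, for which the proportion of variance explained equals the squared Pearson correlation. All numbers are invented toy values, not data from the study, and the generic labels "model A" and "model B" are hypothetical placeholders.

```python
def r_squared(predictor, response):
    """Variance in `response` explained by a one-predictor linear regression."""
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(response) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, response))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in response)
    # For simple linear regression, R^2 is the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)

# Invented per-word values (assumptions for illustration only).
reading_times_ms = [310.0, 295.0, 420.0, 350.0, 505.0]
surprisal_model_a = [4.1, 3.8, 7.9, 5.2, 10.3]
surprisal_model_b = [5.0, 6.5, 6.0, 4.0, 8.0]

r2_a = r_squared(surprisal_model_a, reading_times_ms)
r2_b = r_squared(surprisal_model_b, reading_times_ms)

# The model whose surprisal explains more variance is the better predictor.
better = "A" if r2_a > r2_b else "B"
```

An analysis of real reading-time data would typically also control for factors such as word length and frequency, which this sketch omits.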