Psycholinguistics Workshop - Sathvik Nair / Words, Subwords, and Morphemes: What really matters in the surprisal-reading time relationship?

On Friday, September 15, Sathvik Nair leads the Psycholinguistics Workshop with a discussion of where and how analysis at the morphemic level matters to prediction.
Abstract
An important assumption that comes with using LLMs on psycholinguistic data has gone unverified: LLM-based predictions rely on subword tokenization that does not consider morphology. Does that matter? We carefully test this assumption by comparing surprisal estimates from simpler statistical n-gram language models trained with orthographic, morphological, and LLM-style tokenization against broad-coverage reading time data. In the aggregate, our results replicate previous findings and provide evidence that predictions using LLM-style tokenization do not suffer relative to morphological and orthographic segmentation. However, finer-grained analyses indicate that morpheme-level information is indeed relevant to prediction, and has the potential to provide more psychologically realistic estimates of human reading behavior.
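As background for the discussion, the sketch below illustrates the standard practice the abstract alludes to: when an LLM splits a word into subword tokens, the word's surprisal is usually taken as the sum of its subword surprisals before being related to reading times. This is a minimal illustration, not the speaker's code; it assumes GPT-2 via the Hugging Face transformers library, and the per-word encoding step is a simplification.

```python
# Minimal sketch (assumption: GPT-2 via Hugging Face transformers, not the
# models or pipeline used in the talk) of deriving word-level surprisal from
# an LLM's subword tokenization by summing subword surprisals per word.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def word_surprisals(sentence: str):
    """Return (word, surprisal in bits) pairs, summing over subword tokens."""
    words = sentence.split()
    # Encode each word separately (with a leading space after the first) so we
    # know which subword tokens belong to which word. Simplification: this can
    # occasionally differ from tokenizing the whole sentence at once.
    ids_per_word = [
        tokenizer.encode((" " if i > 0 else "") + w)
        for i, w in enumerate(words)
    ]
    input_ids = torch.tensor([[tokenizer.bos_token_id]
                              + [t for ids in ids_per_word for t in ids]])
    with torch.no_grad():
        logits = model(input_ids).logits
    # Surprisal of token i is -log2 P(token_i | preceding tokens).
    log_probs = torch.log_softmax(logits, dim=-1)
    token_surprisals = -log_probs[0, :-1, :].gather(
        1, input_ids[0, 1:].unsqueeze(1)
    ).squeeze(1) / torch.log(torch.tensor(2.0))
    # Sum subword surprisals back into word-level surprisals.
    out, idx = [], 0
    for w, ids in zip(words, ids_per_word):
        out.append((w, float(token_surprisals[idx: idx + len(ids)].sum())))
        idx += len(ids)
    return out

print(word_surprisals("The children went outside to play."))
```

The open question the talk addresses is whether these summed, morphology-blind subword surprisals behave like surprisals computed over orthographic words or over morphemes when fit to reading times.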