Psycholinguistics Workshop - Sathvik Nair / Words, Subwords, and Morphemes: What really matters in the surprisal-reading time relationship?

Sathvik Nair, PhD student in Linguistics, standing in front of a brick wall painted colorfully with geometric shapes, smiling broadly.
Linguistics | Friday, September 15, 2023 | 12:00 pm - 1:39 pm | Marie Mount Hall, 1108B

On Friday, September 15, Sathvik Nair leads the Psycholinguistics Lab with a discussion of where and how analysis at the morphemic level matters to prediction.


Abstract

An important assumption that comes with using LLMs on psycholinguistic data has gone unverified. LLM-based predictions are based on subword tokenization that does not consider morphology. Does that matter? We carefully test this assumption by comparing surprisal estimates with simpler statistical n-gram language models using orthographic, morphological, and LLM-style tokenization against broad-coverage reading time data. In the aggregate, our results replicate previous findings and provide evidence that predictions using LLM-style tokenization do not suffer relative to morphological and orthographic segmentation. However, finer-grained analyses indicate that morpheme-level information is indeed relevant to prediction, and has the potential to provide more psychologically realistic estimates of human results.
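The abstract compares surprisal estimates under orthographic, morphological, and LLM-style (subword) segmentation. A minimal sketch of the core quantity involved: by the chain rule, a word's surprisal is the sum of the surprisals of its constituent units, whatever the segmentation scheme. The segmentations and probabilities below are illustrative, not drawn from any real model or from the talk's data.

```python
import math

def word_surprisal(unit_probs):
    """Surprisal (in bits) of a word, given the conditional probability
    of each of its units (characters, morphemes, or subword tokens).
    By the chain rule, word surprisal = sum of unit surprisals."""
    return sum(-math.log2(p) for p in unit_probs)

# Hypothetical conditional probabilities for "unhappiness" under three
# segmentation schemes (made-up numbers, for illustration only):
whole_word    = word_surprisal([0.001])            # ["unhappiness"]
morphological = word_surprisal([0.05, 0.04, 0.5])  # ["un", "happi", "ness"]
llm_subwords  = word_surprisal([0.02, 0.1])        # ["unh", "appiness"]
```

The question the talk addresses is whether aggregated surprisal values like these, computed over morphologically naive subword tokens, predict reading times as well as values computed over linguistically motivated units.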
