Joselyn in Philly with the Acoustical Society of America

May 12, 2026 Linguistics

Linguistics PhD student Joselyn Rodriguez, standing in front of a chalkboard, smiling at the camera.

Noise robustness of self-supervised speech models compared to human listeners.

From May 11-15, the Acoustical Society of America holds its 190th meeting in Philadelphia. Featured in the program is work by Joselyn Rodriguez, "Noise robustness of self-supervised speech models compared to human listeners," with co-authors Ahmed Attia and Carol Espy-Wilson. The abstract follows.


Modern self-supervised learning models have driven substantial advances in automatic speech recognition, yet they still struggle in noisy environments. A landmark study by Miller and Nicely investigated the errors human listeners make under different levels and kinds of noise, shaping our understanding of speech perception. In this work, we investigate how close machine perception is to human perception by recreating that study with two self-supervised speech recognition models, Wav2Vec 2.0 and WavLM. We compare model and human results through classification accuracy scores and hierarchical clustering. In a classification task, we find that the models perform worse at phone classification than human listeners across signal-to-noise ratios (SNRs). Notably, the models' performance lags behind humans by 12 dB, illustrating the gap between humans and models. Examining the error patterns for specific consonants, we find that voiceless stop confusions are similar for humans and models: they occur primarily across place of articulation rather than voicing. We also find high accuracy for nasals across SNRs relative to other phones, again consistent with the human results. Substantial differences between human and model classification nonetheless remain.
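
For readers curious about the machinery behind the abstract, here is a minimal sketch, assuming NumPy and SciPy, of two of the ingredients it mentions: mixing noise into clean speech at a target SNR, and building a hierarchical clustering from a phone confusion matrix. The function names and the particular distance construction (one minus the symmetrized confusion probability) are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the target SNR in dB, then add it to `clean`.

    Assumes `noise` is at least as long as `clean`.
    """
    noise = noise[: len(clean)]
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(p_signal / p_noise_scaled); solve for the noise scale factor.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise


def confusion_to_linkage(confusion: np.ndarray):
    """Turn a phone confusion matrix (rows = true phone, cols = predicted phone)
    into a hierarchical clustering: phones that are confused often end up 'close'."""
    probs = confusion / confusion.sum(axis=1, keepdims=True)  # row-normalize to probabilities
    similarity = (probs + probs.T) / 2                        # symmetrize the confusions
    np.fill_diagonal(similarity, 1.0)                         # a phone is identical to itself
    distance = 1.0 - similarity
    return linkage(squareform(distance, checks=False), method="average")


# Example: mix white noise into a 1-second 440 Hz tone at -6 dB SNR.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noisy = mix_at_snr(clean, np.random.randn(sr), snr_db=-6.0)
```

Clusters in the resulting dendrogram group phones that listeners (or models) frequently confuse, which is how error patterns like the place-over-voicing stop confusions described above can be read off.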