Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans

Abstract

Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that increased model size may lead to better performance, but LLMs are still not sensitive to (un)grammaticality as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
