Holmes: Benchmark the Linguistic Competence of Language Models

Abstract

We introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs) - their ability to grasp linguistic phenomena. Unlike prior prompting-based evaluations, Holmes assesses the linguistic competence of LMs via their internal representations using classifier-based probing. In doing so, we disentangle specific phenomena (e.g., part-of-speech of words) from other cognitive abilities, like following textual instructions, and meet recent calls to assess LMs' linguistic competence in isolation. Composing Holmes, we review over 250 probing studies and feature more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of Holmes designed to lower the high computation load while maintaining high-ranking precision.
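
To make the idea of classifier-based probing concrete, the following is a minimal sketch and not the Holmes implementation. It assumes the LM's representations have already been extracted and frozen; here, synthetic vectors stand in for them, and the three classes stand in for labels such as part-of-speech tags. The names `train_linear_probe` and `probe_accuracy` are hypothetical, as are all hyperparameters. A simple linear classifier is trained on the frozen vectors, and its held-out accuracy is read as a signal of how well the representations encode the phenomenon.

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.1):
    """Fit a softmax linear classifier on frozen features (the LM is never updated)."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def probe_accuracy(features, labels, weights, bias):
    """Fraction of examples whose predicted class matches the gold label."""
    predictions = (features @ weights + bias).argmax(axis=1)
    return float((predictions == labels).mean())


# Assumption: synthetic vectors stand in for frozen LM representations;
# the three classes stand in for linguistic labels (e.g., POS tags).
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 16, 200
centers = rng.normal(scale=3.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
features = centers[labels] + rng.normal(size=(labels.size, dim))

# Shuffle, then hold out a quarter of the examples for evaluation.
order = rng.permutation(labels.size)
train_idx, test_idx = order[:450], order[450:]

weights, bias = train_linear_probe(features[train_idx], labels[train_idx], num_classes)
accuracy = probe_accuracy(features[test_idx], labels[test_idx], weights, bias)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Because the probe is deliberately simple and the representations stay fixed, a high held-out score is more plausibly attributed to information already present in the representations than to the capacity of the classifier itself.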
