VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations

Abstract

Despite their remarkable successes, state-of-the-art language models face challenges in grasping certain important semantic details. This paper introduces the VISLA (Variance and Invariance to Semantic and Lexical Alterations) benchmark, designed to evaluate the semantic and lexical understanding of language models. VISLA presents a 3-way semantic (in)equivalence task with a triplet of sentences associated with an image, to evaluate both vision-language models (VLMs) and unimodal language models (ULMs). An evaluation involving 34 VLMs and 20 ULMs reveals surprising difficulties in distinguishing between lexical and semantic variations. Spatial semantics encoded by language models also appear to be highly sensitive to lexical information. Notably, text encoders of VLMs demonstrate greater sensitivity to semantic and lexical variations than unimodal text encoders. Our contributions include the unification of image-to-text and text-to-text retrieval tasks, an off-the-shelf evaluation without fine-tuning, and an assessment of LMs' semantic (in)variance in the presence of lexical alterations. The results highlight strengths and weaknesses across diverse vision and unimodal language models, contributing to a deeper understanding of their capabilities. VISLA enables a rigorous evaluation, shedding light on language models' capabilities in handling semantic and lexical nuances. Data and code will be made available at https://github.com/Sri-Harsha/visla_benchmark.
