Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Abstract

Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and that of vision-only self-supervised learning. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between the visual patterns that challenge CLIP models and those that are problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests that visual representation learning remains an open challenge, and that accurate visual grounding is crucial for future successful multimodal systems.
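
The idea of a CLIP-blind pair can be illustrated with a minimal sketch: given two sets of embeddings for the same images, one from CLIP and one from a vision-only self-supervised model, keep the pairs that are close in CLIP space but far apart in the self-supervised space. The function names and the threshold values `clip_min` and `ssl_max` below are assumptions chosen for illustration and are not taken from the abstract; the embeddings are assumed to be precomputed arrays.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Return the pairwise cosine similarity of the rows of x."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def find_clip_blind_pairs(clip_emb, ssl_emb, clip_min=0.95, ssl_max=0.6):
    """Return index pairs (i, j), i < j, that look alike to CLIP but not to
    the self-supervised encoder.

    clip_min and ssl_max are illustrative thresholds (assumptions).
    Row k of clip_emb and row k of ssl_emb must describe the same image.
    """
    clip_sim = cosine_similarity_matrix(clip_emb)
    ssl_sim = cosine_similarity_matrix(ssl_emb)
    n = clip_sim.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)  # each unordered pair once
    mask = (clip_sim[i_idx, j_idx] > clip_min) & (ssl_sim[i_idx, j_idx] < ssl_max)
    return [(int(i), int(j)) for i, j in zip(i_idx[mask], j_idx[mask])]


if __name__ == "__main__":
    # Toy 2-D embeddings: images 0 and 1 nearly coincide in CLIP space
    # but are orthogonal in the self-supervised space.
    clip_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
    ssl_emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(find_clip_blind_pairs(clip_emb, ssl_emb))
```

The dense similarity matrix keeps the sketch short but costs quadratic memory, so a large image collection would need a nearest-neighbor search instead.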
