SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

Abstract

Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios, which are characterized by the presence of extensive text embedded within images, are ubiquitous in the real world. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world. These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation involving 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision and Claude-3-Opus) and emphasize the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at https://github.com/AILab-CVC/SEED-Bench.
