How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Abstract

In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, which significantly enhances performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.
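
The dynamic high-resolution step in (2) can be illustrated with a minimal sketch. The abstract states only the tile size (448 pixels) and the tile range (1 to 40); the rule used here for choosing a grid (closest aspect ratio, with ties broken by closest pixel area) and the helper names `choose_grid` and `tile_boxes` are assumptions made for illustration, not the released implementation.

```python
from typing import List, Tuple

TILE = 448  # tile side length in pixels, as stated in the abstract


def choose_grid(width: int, height: int,
                min_tiles: int = 1, max_tiles: int = 40) -> Tuple[int, int]:
    """Pick a (cols, rows) grid for an image of the given size.

    Assumed rule (hypothetical): minimise the gap between the grid's
    aspect ratio and the image's, then prefer the grid whose total
    pixel area is closest to the image's area.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target_ratio = width / height
    image_area = width * height

    # Enumerate every grid whose tile count lies in the allowed range.
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    ]

    def score(grid: Tuple[int, int]) -> Tuple[float, int]:
        cols, rows = grid
        ratio_gap = abs(target_ratio - cols / rows)
        area_gap = abs(cols * rows * TILE * TILE - image_area)
        return (ratio_gap, area_gap)

    return min(candidates, key=score)


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crop boxes, row by row, for an
    image already resized to (cols * TILE, rows * TILE) pixels."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = choose_grid(3840, 2160)  # a 4K frame
    print(cols, rows, len(tile_boxes(cols, rows)))
```

Under these assumptions, the image would be resized to `cols * 448` by `rows * 448` pixels and cropped along the returned boxes, so that each tile can be encoded separately by the vision encoder.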
