Language-Image Models with 3D Understanding

Abstract

Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: multi-turn question-answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling yields a strong 3D perception capability without 3D-specific architectural design or training objectives. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted, for example with a 2D box or a set of candidate 3D boxes from specialists. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and by 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios. Cube-LLM also shows competitive results on general MLLM benchmarks such as refCOCO for 2D grounding, with an average score of 87.0, as well as on visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM.
