GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Abstract

Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens. This paradigm lacks pixel-level representations that are important for fine-grained visual understanding and diagnosis. In this work, we introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation. GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone, which then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks. To train GROUNDHOG, we carefully curated M3G2, a grounded visual instruction tuning dataset with Multi-Modal Multi-Grained Grounding, by harvesting a collection of segmentation-grounded datasets with rich annotations. Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning, and significantly reduces object hallucination. GROUNDHOG also demonstrates better grounding towards complex forms of visual input and provides easy-to-understand diagnosis in failure cases.
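
The "retrieving and merging the entity masks" step can be illustrated with a minimal sketch. It assumes each visual entity has a soft mask and that the model assigns each entity a retrieval score for a given phrase; the score-weighted sum, the 0.5 threshold, and the name `merge_entity_masks` are illustrative assumptions, not details confirmed by the abstract.

```python
import numpy as np

def merge_entity_masks(entity_masks, scores, threshold=0.5):
    """Merge per-entity masks into one grounding mask for a phrase.

    entity_masks: (N, H, W) soft masks with values in [0, 1].
    scores: (N,) retrieval scores in [0, 1], one per entity (assumed given).
    Returns the soft merged mask and its thresholded binary version.
    """
    entity_masks = np.asarray(entity_masks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Assumption: merge by a score-weighted sum over entities, clipped to [0, 1].
    merged = np.clip(np.tensordot(scores, entity_masks, axes=1), 0.0, 1.0)
    return merged, merged >= threshold

# Toy example: three entities on a 2x2 image.
masks = np.array([
    [[1, 0], [0, 0]],   # entity 0 covers the top-left pixel
    [[0, 1], [0, 0]],   # entity 1 covers the top-right pixel
    [[0, 0], [1, 1]],   # entity 2 covers the bottom row
])
scores = np.array([0.9, 0.8, 0.1])  # the phrase matches entities 0 and 1
merged, binary = merge_entity_masks(masks, scores)
```

In this toy case the binary mask keeps only the top row. The per-entity scores also indicate which entities contributed to a grounding mask, which is one way the easy-to-understand diagnosis mentioned above could be read off.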
