Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects

Abstract

Large Vision Language Models (LVLMs) have demonstrated impressive zero-shotcapabilities in various vision-language dialogue scenarios. However, theabsence of fine-grained visual object detection hinders the model fromunderstanding the details of images, leading to irreparable visualhallucinations and factual errors. In this paper, we propose Lyrics, a novelmulti-modal pre-training and instruction fine-tuning paradigm that bootstrapsvision-language alignment from fine-grained cross-modal collaboration. Buildingon the foundation of BLIP-2, Lyrics infuses local visual features extractedfrom a visual refiner that includes image tagging, object detection andsemantic segmentation modules into the Querying Transformer, while on the textside, the language inputs equip the boundary boxes and tags derived from thevisual refiner. We further introduce a two-stage training scheme, in which thepre-training stage bridges the modality gap through explicit and comprehensivevision-language alignment targets. During the instruction fine-tuning stage, weintroduce semantic-aware visual feature extraction, a crucial method thatenables the model to extract informative features from concrete visual objects.Our approach achieves robust performance on 13 datasets across variousvision-language tasks, and demonstrates promising multi-modal understanding,perception and conversation capabilities in 11 scenario-based benchmarktoolkits.

Quick Read (beta)

loading the full paper ...