SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Abstract

The rapid evolution of multimodal foundation models has demonstrated significant progress in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between their capability and real-world applicability, primarily due to the models' limited capacity to respond effectively to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap by integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides the competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. The models, code, and datasets will be released at https://github.com/AILab-CVC/SEED-X.
