Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation

Abstract

Referring image segmentation aims to segment an object referred to by a natural language expression from an image. The primary challenge lies in the efficient propagation of fine-grained semantic information from textual features to visual features. Many recent works utilize a Transformer to address this challenge. However, conventional transformer decoders can distort linguistic information as layers get deeper, leading to suboptimal results. In this paper, we introduce CRFormer, a model that iteratively calibrates multi-modal features in the transformer decoder. We start by generating language queries using vision features, emphasizing different aspects of the input language. Then, we propose a novel Calibration Decoder (CDec) in which the multi-modal features can be iteratively calibrated by the input language features. In the Calibration Decoder, we use the output of each decoder layer and the original language features to generate new queries for continuous calibration, which gradually updates the language features. Based on CDec, we introduce a Language Reconstruction Module and a reconstruction loss. This module leverages queries from the final layer of the decoder to reconstruct the input language and compute the reconstruction loss, which further prevents the language information from being lost or distorted. Our experiments consistently show the superior performance of our approach across the RefCOCO, RefCOCO+, and G-Ref datasets compared to state-of-the-art methods.
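
The calibration loop described in the abstract can be illustrated with a minimal NumPy sketch. This is not the CRFormer implementation: the single-head attention, the random query projection `w_query`, the residual-plus-normalization update, and the mean-squared-error reconstruction loss are all assumptions chosen for illustration, and the function names `calibration_decoder` and `reconstruction_loss` are hypothetical. The sketch only shows the data flow: vision-derived language queries, per-layer re-reading of the original language features, and reconstruction of the language from the final queries.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Single-head scaled dot-product attention (an assumed simplification)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def calibration_decoder(vision, language, num_queries=4, num_layers=3, seed=0):
    """vision: (N, d) pixel features; language: (T, d) word features."""
    rng = np.random.default_rng(seed)
    d = language.shape[-1]
    # Hypothetical projection: pooled vision features -> one seed per query.
    w_query = rng.standard_normal((num_queries, d, d)) / np.sqrt(d)
    pooled = vision.mean(axis=0)                        # (d,)
    seeds = np.einsum("d,qde->qe", pooled, w_query)     # (num_queries, d)
    # Language queries: each seed attends to the words, so different
    # queries can emphasize different parts of the expression.
    queries = attend(seeds, language, language)

    for _ in range(num_layers):
        # Decoder layer: queries gather visual evidence.
        decoded = queries + attend(queries, vision, vision)
        # Calibration: build the next queries from the layer output AND the
        # original (not the updated) language features.
        queries = decoded + attend(decoded, language, language)
        # Assumed normalization to keep magnitudes bounded across layers.
        queries = queries / (np.linalg.norm(queries, axis=-1, keepdims=True) + 1e-8)
    return queries

def reconstruction_loss(final_queries, language):
    """Assumed loss: each word is rebuilt from the final queries, scored by MSE."""
    reconstructed = attend(language, final_queries, final_queries)  # (T, d)
    return float(np.mean((reconstructed - language) ** 2))

# Usage with random stand-in features (16 pixels, 5 words, 8 channels).
rng = np.random.default_rng(1)
vision = rng.standard_normal((16, 8))
language = rng.standard_normal((5, 8))
final_queries = calibration_decoder(vision, language)
loss = reconstruction_loss(final_queries, language)
```

In this sketch, the key design point is that each layer attends to the original `language` array rather than to the previous layer's language state, which mirrors the abstract's claim that calibration prevents linguistic information from drifting as the decoder gets deeper.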
