RSCaMa: Remote Sensing Image Change Captioning with State Space Model

Abstract

Remote Sensing Image Change Captioning (RSICC) aims to identify surface changes in multi-temporal remote sensing images and describe them in natural language. Current methods typically rely on an encoder-decoder architecture and focus on designing a sophisticated neck to process bi-temporal features extracted by the backbone. Recently, State Space Models (SSMs), especially Mamba, have demonstrated outstanding performance in many fields, owing to their efficient feature-selective modelling capability. However, their potential in the RSICC task remains unexplored. In this paper, we introduce Mamba into RSICC and propose a novel approach called RSCaMa (Remote Sensing Change Captioning Mamba). Specifically, we utilize Siamese backbones to extract bi-temporal features, which are then processed through multiple CaMa layers consisting of Spatial Difference-guided SSM (SD-SSM) and Temporal Traveling SSM (TT-SSM). SD-SSM uses differential features to enhance change perception, while TT-SSM promotes bi-temporal interactions in a token-wise cross-scanning manner. Experimental results validate the effectiveness of CaMa layers and demonstrate the superior performance of RSCaMa, as well as the potential of Mamba in the RSICC task. Additionally, we systematically compare the effects of three language decoders: Mamba, a GPT-style decoder with a causal attention mechanism, and a Transformer decoder with a cross-attention mechanism. This provides valuable insights for future RSICC research. The code will be available at https://github.com/Chen-Yang-Liu/RSCaMa
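
The abstract does not spell out how the two CaMa components operate, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes each image has already been encoded into an equal-length list of token vectors, that "differential features" can be approximated by an element-wise absolute difference, and that "token-wise cross-scanning" means interleaving the two token sequences so a sequence model visits them alternately. The function names `difference_features`, `interleave_tokens`, and `deinterleave_tokens` are hypothetical and do not come from the RSCaMa code.

```python
from typing import List, Tuple

Token = List[float]


def difference_features(tokens_t1: List[Token], tokens_t2: List[Token]) -> List[Token]:
    """Element-wise absolute difference per token (an assumed stand-in for
    the differential features that guide SD-SSM)."""
    if len(tokens_t1) != len(tokens_t2):
        raise ValueError("both time steps must have the same number of tokens")
    return [
        [abs(b - a) for a, b in zip(tok_a, tok_b)]
        for tok_a, tok_b in zip(tokens_t1, tokens_t2)
    ]


def interleave_tokens(tokens_t1: List[Token], tokens_t2: List[Token]) -> List[Token]:
    """Merge two sequences token by token: a0, b0, a1, b1, ...
    (an assumed reading of token-wise cross-scanning)."""
    if len(tokens_t1) != len(tokens_t2):
        raise ValueError("both time steps must have the same number of tokens")
    merged: List[Token] = []
    for tok_a, tok_b in zip(tokens_t1, tokens_t2):
        merged.append(tok_a)
        merged.append(tok_b)
    return merged


def deinterleave_tokens(merged: List[Token]) -> Tuple[List[Token], List[Token]]:
    """Split an interleaved sequence back into its two time steps."""
    return merged[0::2], merged[1::2]


if __name__ == "__main__":
    before = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
    after = [[0.0, 1.5], [2.0, 3.0], [9.0, 5.0]]
    print(difference_features(before, after))
    print(interleave_tokens(before, after))
```

In a full model, a sequence layer such as a Mamba block would sit between the interleave and de-interleave steps; that layer is omitted here because the abstract gives no detail on its configuration.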
