Acta Aeronautica et Astronautica Sinica ›› 2026, Vol. 47 ›› Issue (10): 532910. doi: 10.7527/S1000-6893.2025.32910

• Special Issue: Intelligent Processing and Analysis of Aerospace Remote Sensing Images

A semantic segmentation method enhanced by multimodal collaboration in remote sensing foundation models

Ao SUN1, Fang XU1, Shuguo JIANG2, Wen YANG1,3, Guisong XIA1

  1. School of Artificial Intelligence, Wuhan University, Wuhan 430072, China
  2. School of Computer Science, Wuhan University, Wuhan 430072, China
  3. Electronic Information School, Wuhan University, Wuhan 430072, China
  • Received: 2025-10-14  Revised: 2025-11-06  Accepted: 2025-12-29  Online: 2026-01-22  Published: 2026-01-15
  • Contact: Guisong XIA  E-mail: guisong.xia@whu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62401406); Postdoctoral Fellowship Program of CPSF (GZB20240562)

Abstract:

Integrating multi-modal remote sensing images greatly improves the completeness of land cover information. As multi-modal remote sensing data become more accessible, multi-modal remote sensing foundation models are increasingly being developed to align different modalities and thereby ease the integration of diverse data sources. However, existing foundation models typically concentrate on learning the characteristics that modalities share. This forces intrinsically unrelated representations to converge and neglects the synergistic information that is revealed only through interaction between modalities, which limits these models in the comprehensive analysis of Earth observation data. To address this, we introduce UPSeg, a multi-modal remote sensing semantic segmentation method designed to explicitly capture and exploit these inter-modal synergies. UPSeg emulates human cognition by using the features of one modality as inspiration for processing another, which reduces the uncertainty inherent in single-modality data. Specifically, features extracted from each modality serve as mutual inspiration and guidance that refine the feature parsing of the other modality, strengthening inter-modal interaction and maximizing synergistic benefits within the underlying model. Because the interaction of modality-unique information is more conducive to generating new insights, we propose a Variance Enhancement Module that employs a cross-modal attention mechanism to accentuate the distinctive features of each modality, making cross-modal interactions more directed and purposeful. Extensive evaluation demonstrates that the proposed algorithm effectively addresses the limitations of existing foundation models in leveraging modal synergies and enables precise land cover classification. Our method outperforms state-of-the-art multi-modal semantic segmentation algorithms, with gains in Mean Pixel Accuracy (mPA) of 2.0% on the WHU-OPT-SAR dataset and 4.2% on the Gaofen Image Dataset (GID).
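
For readers unfamiliar with the metric, the sketch below illustrates how mPA is commonly computed: the pixel accuracy of each class (correctly labelled pixels of that class divided by its ground-truth pixels) is averaged over the classes. This is a minimal illustration under stated assumptions, not the authors' evaluation code. The function name `mean_pixel_accuracy`, the toy label maps, and the choice to skip classes that are absent from the ground truth are all assumptions made here; the abstract does not specify the evaluation protocol.

```python
from typing import Sequence


def mean_pixel_accuracy(pred: Sequence[Sequence[int]],
                        truth: Sequence[Sequence[int]],
                        num_classes: int) -> float:
    """Average the per-class pixel accuracies of a predicted label map.

    Hypothetical helper for illustration only; labels are integers
    in the range [0, num_classes).
    """
    correct = [0] * num_classes  # ground-truth pixels of class c predicted as c
    total = [0] * num_classes    # all ground-truth pixels of class c
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total[t] += 1
            if p == t:
                correct[t] += 1
    # Assumption: classes with no ground-truth pixels are left out of the mean.
    per_class = [correct[c] / total[c] for c in range(num_classes) if total[c] > 0]
    if not per_class:
        raise ValueError("ground truth contains no labelled pixels")
    return sum(per_class) / len(per_class)


# Toy 2x3 label maps with three classes (made-up values, not from any dataset).
truth = [[0, 0, 1],
         [1, 2, 2]]
pred = [[0, 1, 1],
        [1, 2, 0]]

# Per-class accuracies are 1/2, 2/2 and 1/2, so the mean is 2/3.
result = mean_pixel_accuracy(pred, truth, num_classes=3)
print(f"mPA = {result:.4f}")
```

Because every class contributes equally to the mean regardless of how many pixels it covers, mPA is less dominated by large land cover classes than overall pixel accuracy is.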

Key words: semantic segmentation, multi-modal fusion, land cover classification, remote sensing foundation model, visual prompt learning

CLC Number: