Acta Aeronautica et Astronautica Sinica, 2026, Vol. 47, Issue (10): 532910-532910   doi: 10.7527/S1000-6893.2025.32910

Special Issue on Intelligent Processing and Analysis of Aerospace Remote Sensing Images

A semantic segmentation method enhanced by multimodal collaboration in remote sensing foundation models

Ao SUN¹, Fang XU¹, Shuguo JIANG², Wen YANG¹,³, Guisong XIA¹

  1. School of Artificial Intelligence, Wuhan University, Wuhan 430072, China
  2. School of Computer Science, Wuhan University, Wuhan 430072, China
  3. Electronic Information School, Wuhan University, Wuhan 430072, China
  • Received: 2025-10-14 Revised: 2025-11-06 Accepted: 2025-12-29 Online: 2026-01-22 Published: 2026-01-15
  • Contact: Guisong XIA E-mail: guisong.xia@whu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62401406); National Natural Science Foundation of China (U22B2011); Postdoctoral Fellowship Program of CPSF (GZB20240562)

Abstract:

The integration of multi-modal remote sensing images greatly improves the completeness of land cover information. As multi-modal remote sensing data become easier to acquire, multi-modal remote sensing foundation models have developed rapidly, using cross-modal alignment to promote the fusion of multi-source data. However, existing foundation models typically concentrate on learning the features that modalities share. This forces intrinsically unrelated representations to converge and neglects the synergistic information that is revealed only through inter-modal interaction, which limits their capacity for comprehensive analysis of Earth observation data. To address this, we introduce UPSeg, a novel multi-modal remote sensing semantic segmentation method designed to explicitly capture and exploit this inter-modal synergy. UPSeg emulates human cognition by using unimodal features as cues that assist the processing of other modalities, effectively reducing the uncertainty inherent in single-modality data. Specifically, features extracted from one modality serve as mutual inspiration and guidance to refine the feature parsing of the other modality, strengthening inter-modal interaction and maximizing synergy within the foundation model. Because the interaction of modality-specific information is more conducive to generating new insights, we propose a Variance Enhancement Module that uses a cross-modal attention mechanism to accentuate the distinctive features of each modality, making cross-modal interaction more targeted and directed. Extensive experiments demonstrate that the proposed algorithm effectively addresses the limitations of existing foundation models in exploiting inter-modal synergy and enables precise land cover classification.
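To make the idea of cross-modal attention more concrete, the sketch below shows plain scaled dot-product cross-attention in which tokens from one modality (say, optical) attend to tokens from another (say, SAR). It is only an illustration: the abstract does not describe the internals of the Variance Enhancement Module, so the function names, the omission of learned query/key/value projections, and the `distinctive_part` step (which keeps the component of the attended feature orthogonal to the query token) are all assumptions made here, not the authors' design.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_tokens, context_tokens):
    """Each query token (modality A) attends over all context tokens (modality B).

    Learned projections are omitted for brevity; a practical model would
    use a tensor library and trainable weight matrices.
    """
    d = len(query_tokens[0])
    attended, weights = [], []
    for q in query_tokens:
        w = softmax([dot(q, k) / math.sqrt(d) for k in context_tokens])
        weights.append(w)
        attended.append(
            [sum(wi * k[j] for wi, k in zip(w, context_tokens)) for j in range(d)]
        )
    return attended, weights


def distinctive_part(query_tokens, attended):
    """Hypothetical difference step: remove from each attended feature the
    component parallel to its query token, keeping what modality B adds.
    Assumes every query token is non-zero."""
    out = []
    for q, a in zip(query_tokens, attended):
        scale = dot(a, q) / dot(q, q)
        out.append([aj - scale * qj for aj, qj in zip(a, q)])
    return out


# Toy example: 2 optical tokens and 3 SAR tokens with 2-dimensional features.
optical = [[1.0, 0.0], [0.0, 1.0]]
sar = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
attended, weights = cross_modal_attention(optical, sar)
residual = distinctive_part(optical, attended)
```

Each row of `weights` sums to one, and each row of `residual` is orthogonal to the corresponding optical token, which is one simple way to express "information the other modality contributes that this one lacks".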
Our method outperforms state-of-the-art multi-modal semantic segmentation algorithms by 2.0% in Mean Pixel Accuracy (mPA) on the WHU-OPT-SAR dataset and by 4.2% on the Gaofen Image Dataset (GID).
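For readers unfamiliar with the metric, mPA is commonly computed as the per-class pixel accuracy (correctly labeled pixels of a class divided by all ground-truth pixels of that class), averaged over classes. The sketch below follows that common definition. The function name and the choice to skip classes absent from the ground truth are assumptions made here, and the paper's exact evaluation protocol (for example, whether any class is ignored) is not stated in the abstract.

```python
def mean_pixel_accuracy(pred, target, num_classes):
    """Mean of per-class pixel accuracies over 2D label maps.

    pred and target are equally sized lists of rows of integer class ids
    in the range [0, num_classes). Classes with no ground-truth pixels are
    skipped (an assumption; conventions differ between benchmarks).
    Raises ZeroDivisionError if target contains no pixels at all.
    """
    correct = [0] * num_classes
    total = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total[t] += 1
            if p == t:
                correct[t] += 1
    per_class = [c / n for c, n in zip(correct, total) if n > 0]
    return sum(per_class) / len(per_class)


# Toy 2x4 label maps with three classes.
target = [[0, 0, 1, 1],
          [2, 2, 2, 2]]
pred = [[0, 1, 1, 1],
        [2, 2, 0, 2]]
# Per-class accuracies: class 0 -> 1/2, class 1 -> 2/2, class 2 -> 3/4.
print(mean_pixel_accuracy(pred, target, num_classes=3))  # 0.75
```

Because every class contributes equally regardless of its pixel count, mPA is more sensitive to rare land cover classes than overall pixel accuracy.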

Key words: semantic segmentation, multi-modal fusion, land cover classification, remote sensing foundation model, visual prompt learning

CLC number: