A Semantic Segmentation Method Enhanced by Multimodal Collaboration in Remote Sensing Foundation Models - Intelligent Processing and Analysis of Spaceborne Remote Sensing Images

  • SUN Ao,
  • XU Fang,
  • JIANG Shu-Guo,
  • YANG Wen,
  • XIA Gui-Song
  • 1. School of Artificial Intelligence, Wuhan University
    2. Wuhan University
    3. School of Computer Science, Wuhan University

Received date: 2025-10-14

  Revised date: 2026-01-09

  Online published: 2026-01-15

Funding

National Natural Science Foundation of China; National Postdoctoral Researcher Program; China Postdoctoral Science Foundation; Hubei Provincial Postdoctoral Program

Abstract

The integration of multi-modal remote sensing images greatly improves the comprehensiveness of land cover information. With the growing accessibility of multi-modal remote sensing data, multi-modal remote sensing foundation models have developed rapidly, aligning different modalities to facilitate the integrated use of diverse data sources. Nevertheless, existing foundation models typically concentrate on learning the characteristics common across modalities, forcing intrinsically unrelated representations to converge while neglecting the synergistic information that is revealed only through inter-modal interaction, thus hampering comprehensive analysis of Earth observation data. To address this, we introduce UPSeg, a novel multi-modal remote sensing semantic segmentation method designed to explicitly capture and exploit these inter-modal synergies. UPSeg emulates human cognition by using unimodal features as "inspiration" for processing other modalities, effectively reducing the uncertainty inherent in single-modality data. Specifically, features extracted from one modality serve as mutual inspiration and guidance that refine the feature parsing of the other modality, strengthening inter-modal interaction and maximizing synergistic benefits within the foundation model. Since the interaction of modality-unique information is more conducive to generating new insights, we further propose a Variance Enhancement Module that employs cross-modal attention to accentuate the distinctive features of each modality, making cross-modal interaction more targeted and directed. Extensive experiments demonstrate that the proposed method effectively addresses the limitations of existing foundation models in leveraging modal synergies and enables precise land cover classification: it outperforms state-of-the-art multi-modal semantic segmentation algorithms by 2.0% mPA on the WHU-OPT-SAR dataset and by 4.2% on the GID dataset.
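
The abstract names two mechanisms, mutual cross-modal guidance and a Variance Enhancement Module, without giving implementation details. Below is a minimal PyTorch sketch of how such a pair of components could fit together; all names (VarianceEnhancement, CrossModalGuidance, x_opt, x_sar) and the residual-gating formulation are illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn

class VarianceEnhancement(nn.Module):
    # Hypothetical module: amplifies modality-unique residuals, i.e. the part
    # of each modality's tokens that deviates from the shared (mean) content.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x_opt, x_sar):
        shared = 0.5 * (x_opt + x_sar)          # crude estimate of common content
        r_opt, r_sar = x_opt - shared, x_sar - shared
        return x_opt + self.gate(r_opt) * r_opt, x_sar + self.gate(r_sar) * r_sar

class CrossModalGuidance(nn.Module):
    # One modality's tokens act as queries; the other modality's (variance-
    # enhanced) tokens act as keys/values that "inspire" their refinement.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_query, x_guide):
        guided, _ = self.attn(x_query, x_guide, x_guide)
        return self.norm(x_query + guided)      # residual cross-attention

# Toy usage on (batch, tokens, channels) features from two backbone branches.
B, N, C = 2, 1024, 256
x_opt, x_sar = torch.randn(B, N, C), torch.randn(B, N, C)
ve = VarianceEnhancement(C)
g_opt, g_sar = CrossModalGuidance(C), CrossModalGuidance(C)
u_opt, u_sar = ve(x_opt, x_sar)
y_opt = g_opt(x_opt, u_sar)   # SAR-unique cues refine optical features
y_sar = g_sar(x_sar, u_opt)   # optical-unique cues refine SAR features
```

The actual UPSeg gating and attention design may differ; the sketch only makes the described data flow concrete: modality-unique information is emphasized first, then exchanged through cross-attention rather than averaged away.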

Cite this article

SUN Ao, XU Fang, JIANG Shu-Guo, YANG Wen, XIA Gui-Song. A Semantic Segmentation Method Enhanced by Multimodal Collaboration in Remote Sensing Foundation Models - Intelligent Processing and Analysis of Spaceborne Remote Sensing Images[J]. Acta Aeronautica et Astronautica Sinica, 0: 1-0. DOI: 10.7527/S1000-6893.2025.32910
