A semantic segmentation method enhanced by multimodal collaboration in remote sensing foundation models

doi:10.7527/S1000-6893.2025.32910

Special Issue on Intelligent Processing and Analysis of Spaceborne Remote Sensing Images

Ao SUN¹, Fang XU¹, Shuguo JIANG², Wen YANG¹,³, Guisong XIA¹

1. School of Artificial Intelligence, Wuhan University, Wuhan 430072, China
2. School of Computer Science, Wuhan University, Wuhan 430072, China
3. Electronic Information School, Wuhan University, Wuhan 430072, China

Received: 2025-10-14  Revised: 2025-11-06  Accepted: 2025-12-29  Online: 2026-01-22  Published: 2026-01-15
Contact: Guisong XIA  E-mail: guisong.xia@whu.edu.cn
Supported by: National Natural Science Foundation of China (62401406, U22B2011); Postdoctoral Fellowship Program of CPSF (GZB20240562)

Abstract

The integration of multi-modal remote sensing images greatly improves the completeness of land cover information. As multi-modal remote sensing data become easier to acquire, multi-modal remote sensing foundation models have developed rapidly, aligning different modalities to facilitate the fusion of diverse data sources. However, existing foundation models typically concentrate on learning the characteristics shared across modalities. This forces intrinsically unrelated representations toward one another and neglects the synergistic information that is revealed only through interaction between modalities, which limits their capacity for comprehensive analysis of Earth observation data. To address this, we introduce UPSeg, a novel multi-modal remote sensing semantic segmentation method designed to explicitly capture and exploit these inter-modal synergies. UPSeg emulates human cognition by using unimodal features as cues for processing other modalities, effectively reducing the uncertainty inherent in single-modality data. Specifically, features extracted from one modality serve as mutual inspiration and guidance to refine the feature parsing of the other modality, strengthening inter-modal interaction and maximizing synergy within the foundation model. Considering that interaction between modality-specific information is more conducive to generating new insights, we propose a Variance Enhancement Module (VEM) that employs a cross-modal attention mechanism to accentuate the distinctive features of each modality, making cross-modal interaction more targeted and directional. Extensive experiments demonstrate that the proposed algorithm effectively addresses the limitations of existing foundation models in leveraging modal synergies and facilitates precise land cover classification. On the WHU-OPT-SAR dataset and the Gaofen Image Dataset (GID), our method outperforms state-of-the-art multi-modal semantic segmentation algorithms by 2.0% and 4.2%, respectively, in mean pixel accuracy (mPA).

Key words: semantic segmentation, multi-modal fusion, land cover classification, remote sensing foundation model, visual prompt learning
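As a rough illustration of the idea described in the abstract (features of one modality guiding another through cross-modal attention that emphasises modality-specific content), the sketch below uses NumPy. It is not the authors' UPSeg or VEM implementation, whose details are not given on this page. The function names `cross_attention`, `difference_prompt`, and `prompt_target`, the residual formulation, the absence of learned projections, and the `scale` parameter are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_tokens, context_tokens):
    # Single-head scaled dot-product attention without learned projections
    # (a simplification; real models use learned Q/K/V weights).
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context_tokens


def difference_prompt(target_tokens, source_tokens):
    # Hypothetical difference-enhancing step: subtract from each source token
    # the part that can be reconstructed from the target modality, leaving a
    # residual that emphasises source-specific content.
    shared = cross_attention(source_tokens, target_tokens)
    return source_tokens - shared


def prompt_target(target_tokens, source_tokens, scale=1.0):
    # Inject the source-specific residual into the target features.
    # Assumes both modalities are co-registered, so token i of each modality
    # covers the same image patch and the shapes match.
    return target_tokens + scale * difference_prompt(target_tokens, source_tokens)


# Toy usage with random "optical" and "SAR" patch features (16 patches, 8 dims).
rng = np.random.default_rng(0)
optical = rng.normal(size=(16, 8))
sar = rng.normal(size=(16, 8))
optical_prompted = prompt_target(optical, sar)  # SAR guides optical
sar_prompted = prompt_target(sar, optical)      # optical guides SAR
print(optical_prompted.shape, sar_prompted.shape)
```

In this simplified form, when the two modalities carry identical content the residual vanishes and the target features pass through unchanged, which mirrors the abstract's point that the interaction should focus on what each modality uniquely contributes.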

Cite this article:

Ao SUN, Fang XU, Shuguo JIANG, Wen YANG, Guisong XIA. A semantic segmentation method enhanced by multimodal collaboration in remote sensing foundation models[J]. Acta Aeronautica et Astronautica Sinica, 2026, 47(10): 532910.

Figures/Tables: 10 (Figures 1 to 6, Tables 1 to 4)



Table 1
Method	WHU-OPT-SAR mIoU/%	WHU-OPT-SAR mPA/%	GID mIoU/%	GID mPA/%
AMMFuseNet	45.2	57.1	47.9	62.6
MCANet	48.5	61.0	53.0	70.0
CMX	45.9	57.5	59.1	73.0
ASMFNet	48.8	63.7	59.4	74.6
FTransUNet	50.6	64.5	61.3	75.9
SSLViT	43.1	54.2	53.7	67.5
CROMA	43.9	54.5	53.5	67.7
DeCUR	44.3	55.6	57.4	71.1
DOFA	47.5	60.6	61.5	75.6
UPSeg-DOFA	51.1	66.5	66.3	80.1
UPSeg-SAM	51.3	70.5	64.2	80.3
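The table reports mIoU and mPA. This page does not state how they are computed, so the sketch below assumes the usual definitions from a confusion matrix: per-class pixel accuracy is the fraction of a class's true pixels that are predicted correctly, per-class IoU is intersection over union, and both are averaged over classes. The function names and the choice to skip classes that never occur are illustrative assumptions, not taken from the paper.

```python
def per_class_stats(confusion):
    # confusion[i][j] = number of pixels whose true class is i
    # and whose predicted class is j.
    n = len(confusion)
    accuracies, ious = [], []
    for c in range(n):
        true_positive = confusion[c][c]
        true_total = sum(confusion[c])
        predicted_total = sum(confusion[r][c] for r in range(n))
        union = true_total + predicted_total - true_positive
        # Assumption: classes with no pixels are left out of the mean.
        if true_total > 0:
            accuracies.append(true_positive / true_total)
        if union > 0:
            ious.append(true_positive / union)
    return accuracies, ious


def mean_pixel_accuracy(confusion):
    accuracies, _ = per_class_stats(confusion)
    return sum(accuracies) / len(accuracies) if accuracies else 0.0


def mean_iou(confusion):
    _, ious = per_class_stats(confusion)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (values are made up, not from the paper).
toy_confusion = [[8, 2],
                 [1, 9]]
print(f"mPA = {100 * mean_pixel_accuracy(toy_confusion):.1f}%")
print(f"mIoU = {100 * mean_iou(toy_confusion):.1f}%")
```

Because IoU also penalises false positives, mIoU is never higher than mPA under these definitions, which is consistent with every row of the table.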

Table 2
Method	Multimodal	WHU-OPT-SAR mIoU/%	WHU-OPT-SAR mPA/%	GID mIoU/%	GID mPA/%
RGB	×	50.7	64.1	66.2	78.1
SAR/NIR	×	43.6	54.5	60.4	75.8
DOFA	√	47.5	60.6	61.5	75.6
UPSeg	√	51.1	66.5	66.3	80.1

Table 3
Method	Visual prompt	VEM	WHU-OPT-SAR mIoU/%	WHU-OPT-SAR mPA/%	GID mIoU/%	GID mPA/%
w/o VP	×	×	47.3	62.8	61.1	73.7
w/o VEM	√	×	51.0	66.0	65.9	78.7
UPSeg	√	√	51.1	66.5	66.3	80.1

Table 4
Training strategy	Initialization	WHU-OPT-SAR mIoU/%	WHU-OPT-SAR mPA/%	GID mIoU/%	GID mPA/%
Fixed	λ=1	50.7	65.6	65.2	78.8
Learnable	Init=1	50.9	66.1	66.1	79.4
Learnable	Init=10	51.1	66.5	66.3	80.1
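The table contrasts a fixed λ with a learnable one, but this page does not define λ. Purely for illustration, the toy below assumes λ is a scalar that scales a prompt added to features, and shows what "learnable" means in that setting: λ is updated by gradient descent rather than held at a constant. The function names, the squared-error objective, and all numbers are hypothetical and are not taken from the paper.

```python
def mse(features, prompt, target, lam):
    # Mean squared error between the prompted features and the target.
    n = len(features)
    return sum((f + lam * p - t) ** 2
               for f, p, t in zip(features, prompt, target)) / n


def fit_scale(features, prompt, target, init, lr=0.05, steps=200):
    # Plain gradient descent on lam, starting from `init`.
    lam = init
    n = len(features)
    for _ in range(steps):
        grad = sum(2 * (f + lam * p - t) * p
                   for f, p, t in zip(features, prompt, target)) / n
        lam -= lr * grad
    return lam


# Toy data constructed so that the best scale is 3.0.
features = [1.0, 2.0, 3.0]
prompt = [0.5, -1.0, 2.0]
target = [f + 3.0 * p for f, p in zip(features, prompt)]

fixed_error = mse(features, prompt, target, lam=1.0)    # lam held at 1
learned_from_1 = fit_scale(features, prompt, target, init=1.0)
learned_from_10 = fit_scale(features, prompt, target, init=10.0)
print(fixed_error, learned_from_1, learned_from_10)
```

In this convex toy both initial values converge to the same scale, so the initialization does not matter. In a deep network the loss is not convex and the starting value can influence the result, which would be one possible reading of the differing Init=1 and Init=10 rows.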
