A semantic segmentation method enhanced by multimodal collaboration in remote sensing foundation models

doi:10.7527/S1000-6893.2025.32910

Abstract

The integration of multi-modal remote sensing images greatly improves the comprehensiveness of land cover information. As multi-modal remote sensing data become more accessible, multi-modal remote sensing foundation models are being developed to align different modalities and thereby ease the integration of diverse data sources. However, existing foundation models typically concentrate on learning the characteristics shared across modalities. This forces intrinsically unrelated representations to converge and neglects the synergistic information that emerges only through modal interactions, which limits their use in comprehensive analysis of Earth observation data. To address this, we introduce UPSeg, a multi-modal remote sensing semantic segmentation method designed to explicitly capture and exploit these inter-modal synergies. UPSeg emulates human cognition by using unimodal features as inspiration for processing other modalities, which reduces the uncertainty inherent in single-modality data. Specifically, features extracted from one modality serve as mutual inspiration and guidance that refine the feature parsing of the other modality, strengthening inter-modal interactions and maximizing synergistic benefits within the underlying model. Because the interaction of modality-unique information is more conducive to generating new insights, we propose a Variance Enhancement Module (VEM) that employs a cross-modal attention mechanism to accentuate the distinctive features of each modality, making cross-modal interactions more directed and purposeful. Extensive evaluation shows that the proposed method addresses the limitations of existing foundation models in leveraging modal synergies and enables precise land cover classification. It outperforms state-of-the-art multi-modal semantic segmentation algorithms, with gains in Mean Pixel Accuracy (mPA) of 2.0 percentage points on the WHU-OPT-SAR dataset and 4.2 percentage points on the Gaofen Image Dataset (GID).
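For readers unfamiliar with the reported metrics, the following is a minimal sketch of how mPA and mIoU are commonly computed from a confusion matrix. It assumes the standard definitions (per-class recall averaged over classes for mPA, per-class intersection over union averaged over classes for mIoU); the paper's exact evaluation protocol is not given on this page, and the function names and toy labels are illustrative only.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_pixel_accuracy(cm):
    """Average over classes of (correctly labeled pixels) / (ground-truth pixels)."""
    accs = [row[c] / sum(row) for c, row in enumerate(cm) if sum(row) > 0]
    return sum(accs) / len(accs)


def mean_iou(cm):
    """Average over classes of TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                         # class-c pixels missed
        fp = sum(cm[r][c] for r in range(n)) - tp    # other pixels labeled c
        denom = tp + fp + fn
        if denom > 0:                                # skip classes absent from both maps
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy example with flattened label maps (hypothetical data, 3 classes).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(f"mPA = {100 * mean_pixel_accuracy(cm):.1f}%, mIoU = {100 * mean_iou(cm):.1f}%")
```

Because mPA averages per-class accuracy, a rare class counts as much as a dominant one, which is one reason it is often reported alongside mIoU for land cover data with imbalanced classes.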

Key words: semantic segmentation, multi-modal fusion, land cover classification, remote sensing foundation model, visual prompt learning

Ao SUN, Fang XU, Shuguo JIANG, Wen YANG, Guisong XIA. A semantic segmentation method enhanced by multimodal collaboration in remote sensing foundation models[J]. Acta Aeronautica et Astronautica Sinica, 2026, 47(10): 532910.

Figures/Tables: 10 (Fig. 1 to Fig. 6; Table 1 to Table 4)

Method	WHU-OPT-SAR mIoU/%	WHU-OPT-SAR mPA/%	GID mIoU/%	GID mPA/%
AMMFuseNet	45.2	57.1	47.9	62.6
MCANet	48.5	61.0	53.0	70.0
CMX	45.9	57.5	59.1	73.0
ASMFNet	48.8	63.7	59.4	74.6
FTransUNet	50.6	64.5	61.3	75.9
SSLViT	43.1	54.2	53.7	67.5
CROMA	43.9	54.5	53.5	67.7
DeCUR	44.3	55.6	57.4	71.1
DOFA	47.5	60.6	61.5	75.6
UPSeg-DOFA	51.1	66.5	66.3	80.1
UPSeg-SAM	51.3	70.5	64.2	80.3
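
The gains quoted in the abstract can be checked against the comparison table above. The sketch below copies the mPA columns and computes the margin of UPSeg-DOFA over the strongest non-UPSeg method on each dataset. Reading the abstract's figures as referring to the UPSeg-DOFA variant is our interpretation of the table, not something the source states, and the helper name is hypothetical.

```python
# mPA values (%) copied from the comparison table above.
MPA = {
    "AMMFuseNet": {"WHU-OPT-SAR": 57.1, "GID": 62.6},
    "MCANet":     {"WHU-OPT-SAR": 61.0, "GID": 70.0},
    "CMX":        {"WHU-OPT-SAR": 57.5, "GID": 73.0},
    "ASMFNet":    {"WHU-OPT-SAR": 63.7, "GID": 74.6},
    "FTransUNet": {"WHU-OPT-SAR": 64.5, "GID": 75.9},
    "SSLViT":     {"WHU-OPT-SAR": 54.2, "GID": 67.5},
    "CROMA":      {"WHU-OPT-SAR": 54.5, "GID": 67.7},
    "DeCUR":      {"WHU-OPT-SAR": 55.6, "GID": 71.1},
    "DOFA":       {"WHU-OPT-SAR": 60.6, "GID": 75.6},
    "UPSeg-DOFA": {"WHU-OPT-SAR": 66.5, "GID": 80.1},
    "UPSeg-SAM":  {"WHU-OPT-SAR": 70.5, "GID": 80.3},
}


def gain_over_best_baseline(scores, ours, dataset):
    """Return (best baseline name, gain in percentage points) for one dataset."""
    # Baselines are all methods that are not UPSeg variants.
    baselines = [name for name in scores if not name.startswith("UPSeg")]
    best = max(baselines, key=lambda name: scores[name][dataset])
    # Round to one decimal to match the table's precision.
    return best, round(scores[ours][dataset] - scores[best][dataset], 1)


for dataset in ("WHU-OPT-SAR", "GID"):
    best, gain = gain_over_best_baseline(MPA, "UPSeg-DOFA", dataset)
    print(f"{dataset}: +{gain} points mPA over {best}")
```

Under this reading, the differences are absolute percentage points rather than relative improvements.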

Method	Multi-modal	WHU-OPT-SAR mIoU/%	WHU-OPT-SAR mPA/%	GID mIoU/%	GID mPA/%
RGB	×	50.7	64.1	66.2	78.1
SAR/NIR	×	43.6	54.5	60.4	75.8
DOFA	√	47.5	60.6	61.5	75.6
UPSeg	√	51.1	66.5	66.3	80.1

Method	Visual prompt (VP)	VEM	WHU-OPT-SAR mIoU/%	WHU-OPT-SAR mPA/%	GID mIoU/%	GID mPA/%
w/o VP	×	×	47.3	62.8	61.1	73.7
w/o VEM	√	×	51.0	66.0	65.9	78.7
UPSeg	√	√	51.1	66.5	66.3	80.1

Training strategy	Initialization	WHU-OPT-SAR mIoU/%	WHU-OPT-SAR mPA/%	GID mIoU/%	GID mPA/%
Fixed	λ=1	50.7	65.6	65.2	78.8
Learnable	Init=1	50.9	66.1	66.1	79.4
Learnable	Init=10	51.1	66.5	66.3	80.1

Related Articles 7

[1]	Yuzhuo MA, Kan REN, Tao LI, Qian CHEN. Improving remote sensing image semantic segmentation based on distance loss [J]. Acta Aeronautica et Astronautica Sinica, 2026, 47(8): 332780-332780.
[2]	Xu TANG, Feng GU, Jingjing MA, Xiangrong ZHANG. Hyperspectral-LiDAR joint classification method based on vision-language pre-trained models [J]. Acta Aeronautica et Astronautica Sinica, 2026, 47(10): 139-158.
[3]	Runmin CONG, Haoyan SUN, Yuxuan LUO, Hao FANG. Generalized few-shot segmentation for remote sensing image based on class relation mining [J]. Acta Aeronautica et Astronautica Sinica, 2025, 46(23): 631694-631694.
[4]	Jiaxin LI, Shuaishuai LYU, Yezi WANG, Yu YANG, Ziyue LI. Transformer-based intelligent tracking method of aviation structure surface cracks [J]. Acta Aeronautica et Astronautica Sinica, 2025, 46(21): 532355-532355.
[5]	Xudong LUO, Yiquan WU, Jinlin CHEN. Research progress on deep learning methods for object detection and semantic segmentation in UAV aerial images [J]. Acta Aeronautica et Astronautica Sinica, 2024, 45(6): 28822-028822.
[6]	Xin SU, Runcheng GUAN, Qiao WANG, Weizheng YUAN, Xianglian LYU, Yang HE. Ice area and thickness detection method based on deep learning [J]. Acta Aeronautica et Astronautica Sinica, 2023, 44(S2): 729283-729283.
[7]	Xiaohang LI, Jianjiang ZHOU. Multi-scale modality fusion network based on adaptive memory length [J]. Acta Aeronautica et Astronautica Sinica, 2023, 44(22): 628977-628977.