A Semantic Segmentation Method Enhanced by Multimodal Collaboration in Remote Sensing Foundation Models - Intelligent Processing and Analysis of Spaceborne Remote Sensing Images

  • SUN Ao,
  • XU Fang,
  • JIANG Shu-Guo,
  • YANG Wen,
  • XIA Gui-Song
  • 1. School of Artificial Intelligence, Wuhan University
    2. Wuhan University
    3. School of Computer Science, Wuhan University

Received date: 2025-10-14

  Revised date: 2026-01-09

  Online published: 2026-01-15

Funding

National Natural Science Foundation of China; National Postdoctoral Researcher Program; China Postdoctoral Science Foundation; Hubei Provincial Postdoctoral Program

Abstract

The integration of multi-modal remote sensing images greatly improves the comprehensiveness of land cover information. With the growing accessibility of multi-modal remote sensing data, multi-modal remote sensing foundation models have developed rapidly, aligning different modalities to facilitate the integrated use of diverse data sources. Nevertheless, existing foundation models typically concentrate on learning the characteristics common across modalities, forcing intrinsically unrelated representations to converge while neglecting the synergistic information that is revealed only through inter-modal interaction, thus hampering comprehensive analysis of Earth observation data. To address this, we introduce UPSeg, a novel multi-modal remote sensing semantic segmentation method designed to explicitly capture and exploit these inter-modal synergies. UPSeg emulates human cognition by using unimodal features as "inspiration" for processing other modalities, effectively reducing the uncertainty inherent in single-modality data. Specifically, features extracted from one modality serve as mutual inspiration and guidance that refine the feature parsing of the other modality, strengthening inter-modal interaction and maximizing synergistic benefits within the foundation model. Since the interaction of modality-unique information is more conducive to generating new insights, we further propose a Variance Enhancement Module that employs cross-modal attention to accentuate the distinctive features of each modality, making cross-modal interaction more targeted and directed. Extensive experiments demonstrate that the proposed method effectively addresses the limitations of existing foundation models in leveraging modal synergies and enables precise land cover classification: it outperforms state-of-the-art multi-modal semantic segmentation algorithms by 2.0% mPA on the WHU-OPT-SAR dataset and by 4.2% on the GID dataset.
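
The abstract names two mechanisms, mutual cross-modal guidance and a Variance Enhancement Module, without giving implementation details. Below is a minimal PyTorch sketch of how such a pair of components could fit together; all names (VarianceEnhancement, CrossModalGuidance, x_opt, x_sar) and the residual-gating formulation are illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn

class VarianceEnhancement(nn.Module):
    # Hypothetical module: amplifies modality-unique residuals, i.e. the part
    # of each modality's tokens that deviates from the shared (mean) content.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x_opt, x_sar):
        shared = 0.5 * (x_opt + x_sar)          # crude estimate of common content
        r_opt, r_sar = x_opt - shared, x_sar - shared
        return x_opt + self.gate(r_opt) * r_opt, x_sar + self.gate(r_sar) * r_sar

class CrossModalGuidance(nn.Module):
    # One modality's tokens act as queries; the other modality's (variance-
    # enhanced) tokens act as keys/values that "inspire" their refinement.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_query, x_guide):
        guided, _ = self.attn(x_query, x_guide, x_guide)
        return self.norm(x_query + guided)      # residual cross-attention

# Toy usage on (batch, tokens, channels) features from two backbone branches.
B, N, C = 2, 1024, 256
x_opt, x_sar = torch.randn(B, N, C), torch.randn(B, N, C)
ve = VarianceEnhancement(C)
g_opt, g_sar = CrossModalGuidance(C), CrossModalGuidance(C)
u_opt, u_sar = ve(x_opt, x_sar)
y_opt = g_opt(x_opt, u_sar)   # SAR-unique cues refine optical features
y_sar = g_sar(x_sar, u_opt)   # optical-unique cues refine SAR features
```

The actual UPSeg gating and attention design may differ; the sketch only makes the described data flow concrete: modality-unique information is emphasized first, then exchanged through cross-attention rather than averaged away.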

Cite this article

SUN Ao, XU Fang, JIANG Shu-Guo, YANG Wen, XIA Gui-Song. A Semantic Segmentation Method Enhanced by Multimodal Collaboration in Remote Sensing Foundation Models - Intelligent Processing and Analysis of Spaceborne Remote Sensing Images[J]. Acta Aeronautica et Astronautica Sinica, 0: 1-0. DOI: 10.7527/S1000-6893.2025.32910
