[1]. Cao, B., Guo, J., Zhu, P., Hu, Q.: Bi-directional adapter for multimodal tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 927–935 (2024)
[2]. Feng, Z., Song, L., Yang, S., Zhang, X., Jiao, L.: Cross-modal contrastive learning for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing (2023)
[3]. Fuller, A., Millard, K., Green, J.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. Advances in Neural Information Processing Systems 36 (2023)
[4]. Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K.V., Joulin, A., Misra, I.: ImageBind: One embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15180–15190 (2023)
[5]. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27672–27683 (2024)
[6]. Han, B., Zhang, S., Shi, X., Reichstein, M.: Bridging remote sensors with multisensor geospatial foundation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 27852–27862 (2024)
[7]. Hong, D., Gao, L., Yokoya, N., Yao, J., Chanussot, J., Du, Q., Zhang, B.: More diverse means better: Multimodal deep learning meets remote-sensing imagery classification. IEEE Transactions on Geoscience and Remote Sensing 59(5), 4340–4354 (2021)
[8]. Huang, Y., Du, C., Xue, Z., Chen, X., Zhao, H., Huang, L.: What makes multi-modal learning better than single (provably). Advances in Neural Information Processing Systems 34, 10944–10956 (2021)
[9]. Kieu, N., Nguyen, K., Nazib, A., Fernando, T., Fookes, C., Sridharan, S.: Multimodal co-learning meets remote sensing: Taxonomy, state of the art, and future works. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2024)
[10]. Li, W., Yang, W., Liu, T., Hou, Y., Li, Y., Liu, Z., Liu, Y., Liu, L.: Predicting gradient is better: Exploring self-supervised learning for SAR ATR with a joint-embedding predictive architecture. ISPRS Journal of Photogrammetry and Remote Sensing 218, 326–338 (2024)
[11]. Li, X., Zhang, G., Cui, H., Hou, S., Wang, S., Li, X., Chen, Y., Li, Z., Zhang, L.: MCANet: A joint semantic segmentation framework of optical and SAR images for land use classification. International Journal of Applied Earth Observation and Geoinformation 106, 102638 (2022)
[12]. Liang, P.P., Zadeh, A., Morency, L.P.: Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Computing Surveys 56(10), 1–42 (2024)
[13]. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
[14]. Ma, A., Chen, D., Zhong, Y., Zheng, Z., Zhang, L.: National-scale greenhouse mapping for high spatial resolution remote sensing imagery using a dense object dual-task deep learning framework: A case study of China. ISPRS Journal of Photogrammetry and Remote Sensing 181, 279–294 (2021)
[15]. Ma, W., Karakuş, O., Rosin, P.L.: AMM-FuseNet: Attention-based multi-modal image fusion network for land cover mapping. Remote Sensing 14(18), 4458 (2022)
[16]. Ma, X., Zhang, X., Pun, M.O., Liu, M.: A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing (2024)
[17]. Prexl, J., Schmitt, M.: SenPa-MAE: Sensor parameter aware masked autoencoder for multi-satellite self-supervised pre-training. arXiv preprint arXiv:2408.11000 (2024)
[18]. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
[19]. Scheibenreif, L., Hanna, J., Mommert, M., Borth, D.: Self-supervised vision transformers for land-cover segmentation and classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1422–1431 (2022)
[20]. Stojnic, V., Risojevic, V.: Self-supervised learning of remote sensing scene representations using contrastive multiview coding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1182–1191 (2021)
[21]. Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7262–7272 (2021)
[22]. Sun, X., Tian, Y., Lu, W., Wang, P., Niu, R., Yu, H., Fu, K.: From single- to multi-modal remote sensing imagery interpretation: A survey and taxonomy. Science China Information Sciences 66(4), 140301 (2023)
[23]. Tong, X.Y., Xia, G.S., Lu, Q., Shen, H., Li, S., You, S., Zhang, L.: Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sensing of Environment 237, 111322 (2020)
[24]. Wang, L., Fang, S., Meng, X., Li, R.: Building extraction with vision transformer. IEEE Transactions on Geoscience and Remote Sensing 60, 1–11 (2022)
[25]. Wang, L., Li, R., Zhang, C., Fang, S., Duan, C., Meng, X., Atkinson, P.M.: UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing 190, 196–214 (2022)
[26]. Wang, Q., Chen, W., Tang, H., Pan, X., Zhao, H., Yang, B., Zhang, H., Gu, W.: Simultaneous extracting area and quantity of agricultural greenhouses in large scale with deep learning method and high-resolution remote sensing images. Science of The Total Environment 872, 162229 (2023)
[27]. Wang, X., Chen, G., Qian, G., Gao, P., Wei, X.Y., Wang, Y., Tian, Y., Gao, W.: Large-scale multi-modal pre-trained models: A comprehensive survey. Machine Intelligence Research 20(4), 447–482 (2023)
[28]. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
[29]. Xiong, Z., Wang, Y., Zhang, F., Stewart, A.J., Hanna, J., Borth, D., Papoutsis, I., Saux, B.L., Camps-Valls, G., Zhu, X.X.: Neural plasticity-inspired foundation model for observing the Earth crossing modalities. arXiv preprint arXiv:2403.15356 (2024)
[30]. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
[31]. Zhang, J., Liu, H., Yang, K., Hu, X., Liu, R., Stiefelhagen, R.: CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems (2023)
[32]. Zhu, J., Lai, S., Chen, X., Wang, D., Lu, H.: Visual prompt multi-modal tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9516–9526 (2023)