Acta Aeronautica et Astronautica Sinica ›› 2026, Vol. 47 ›› Issue (10): 632630. doi: 10.7527/S1000-6893.2025.32630

• Special Issue: Intelligent Processing and Analysis of Aerospace Remote Sensing Images

Hyperspectral-LiDAR joint classification method based on vision-language pre-trained models

Xu TANG, Feng GU, Jingjing MA, Xiangrong ZHANG

  1. School of Artificial Intelligence, Xidian University, Xi’an 710126, China
  • Received: 2025-07-28 Revised: 2025-09-08 Accepted: 2025-09-30 Online: 2025-10-28 Published: 2025-10-24
  • Contact: Xu TANG E-mail: tangxu128@gmail.com
  • Supported by:
    General Program of National Natural Science Foundation of China (62571387); the Fundamental Research Funds for the Central Universities (YJSJ25014)

Abstract:

To address inaccurate land-cover classification caused by differences in spatial resolution, data heterogeneity, and limited labeled samples in multimodal remote sensing data, we investigate the joint classification of hyperspectral imagery (HSI) and LiDAR data and propose a Semantic-aware Cross-modal Fusion Network (SCF-Net). First, a lightweight patch encoder transforms the input data into RGB-compatible feature maps, which are then fed into a Contrastive Language-Image Pre-training (CLIP)-based visual encoder enhanced with learnable prompts. To integrate multimodal information efficiently, an adaptive cross-modal fusion architecture is employed, featuring grouped linear projection and a relation-aware interaction module that enables dynamic spatial feature exchange at low computational cost. For semantic discrimination, attribute-category textual prompts are generated, and classification is performed by computing the cosine similarity between visual and textual embeddings, followed by a TopK attribute averaging strategy. Experiments on the Houston 2013, MUUFL, and Trento datasets demonstrate that SCF-Net outperforms eight state-of-the-art fusion methods, achieving improvements of over 2.88% in overall accuracy, 2.69% in average accuracy, and 3.02% in Kappa coefficient, while maintaining high parameter efficiency. Ablation studies further validate the effectiveness of each component. SCF-Net offers a novel paradigm for integrating multimodal remote sensing data with large-scale vision-language pre-trained models in complex classification tasks.
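
The classification step described above (cosine similarity between visual and textual embeddings, followed by TopK attribute averaging) might look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' implementation: it assumes embeddings are already computed, that each class has the same number of attribute prompts stored in an array of shape (classes, attributes, dim), and that the class score is the mean of the k highest attribute similarities. The function names and the value of k are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def topk_attribute_scores(visual, text, k):
    """Score each sample against each class (hypothetical helper).

    visual: (N, D) visual embeddings, one per sample.
    text:   (C, A, D) text embeddings, A attribute prompts per class.
    Returns an (N, C) array: mean of the k largest attribute similarities.
    """
    if not 1 <= k <= text.shape[1]:
        raise ValueError("k must be between 1 and the number of attributes")
    v = l2_normalize(visual)
    t = l2_normalize(text)
    # Cosine similarity of every sample with every attribute prompt: (N, C, A).
    sims = np.einsum("nd,cad->nca", v, t)
    # Sort ascending along the attribute axis and keep the k largest values.
    topk = np.sort(sims, axis=-1)[..., -k:]
    return topk.mean(axis=-1)


def classify(visual, text, k=2):
    """Assign each sample to the class with the highest averaged TopK score."""
    return topk_attribute_scores(visual, text, k).argmax(axis=-1)
```

Averaging only the k best-matching attributes, rather than all of them, keeps a class score from being pulled down by attribute prompts that do not apply to a given sample; whether SCF-Net weights or selects attributes in exactly this way is not stated in the abstract.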

Key words: multimodal remote sensing data, land-cover classification, semantic-aware cross-modal fusion network, adaptive cross-modal fusion, vision-language pre-trained model, hyperspectral-LiDAR fusion