Acta Aeronautica et Astronautica Sinica > 2026, Vol. 47 Issue (10): 632630-632630   doi: 10.7527/S1000-6893.2025.32630

Special Issue on Intelligent Processing and Analysis of Aerospace Remote Sensing Images

Hyperspectral-LiDAR joint classification method based on vision-language pre-trained models

Xu TANG, Feng GU, Jingjing MA, Xiangrong ZHANG

  1. School of Artificial Intelligence, Xidian University, Xi’an 710126, China
  • Received: 2025-07-28 Revised: 2025-09-08 Accepted: 2025-09-30 Online: 2025-10-28 Published: 2025-10-24
  • Contact: Xu TANG E-mail: tangxu128@gmail.com
  • Supported by:
    General Program of the National Natural Science Foundation of China (62571387); the Fundamental Research Funds for the Central Universities (YJSJ25014)

Abstract:

To address inaccurate land-cover classification caused by differences in spatial resolution, data heterogeneity, and limited labeled samples in multimodal remote sensing data, we investigate the joint classification of hyperspectral imagery (HSI) and LiDAR data and propose a Semantic-aware Cross-modal Fusion Network (SCF-Net). First, a lightweight patch encoder transforms the input data into RGB-compatible feature maps, which are fed into a Contrastive Language-Image Pre-training (CLIP)-based visual encoder enhanced with learnable prompts. Second, an adaptive cross-modal fusion architecture combines grouped linear projection with a relation-aware interaction module to enable dynamic spatial feature interaction at low computational cost. Finally, attribute-category textual prompts are generated, and classification is performed by computing the cosine similarity between visual features and text embeddings and applying a TopK attribute averaging strategy. Experiments on the Houston 2013, MUUFL, and Trento datasets show that, compared with eight state-of-the-art fusion methods, SCF-Net improves overall accuracy by more than 2.88%, average accuracy by 2.69%, and the Kappa coefficient by 3.02%, while maintaining high parameter efficiency. Ablation studies further validate the effectiveness of each module. SCF-Net offers a new paradigm for combining multimodal remote sensing data with large-scale vision-language pre-trained models in complex classification tasks.
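The final classification step described in the abstract (cosine similarity against attribute-category text embeddings, then TopK attribute averaging) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `topk_attribute_classify`, the array layout (one row of attribute-prompt embeddings per class), and the choice to average the K highest similarities within each class are assumptions made for illustration, and the embeddings are taken as given rather than produced by CLIP.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def topk_attribute_classify(visual, text, k):
    """Hypothetical TopK attribute averaging classifier.

    visual: (N, D) array, one fused visual feature per sample.
    text:   (C, A, D) array, A attribute-prompt embeddings for each of C classes.
    k:      number of best-matching attributes averaged per class (1 <= k <= A).

    Returns (pred, scores): pred has shape (N,), scores has shape (N, C).
    """
    visual = np.asarray(visual, dtype=float)
    text = np.asarray(text, dtype=float)
    num_attributes = text.shape[1]
    if not 1 <= k <= num_attributes:
        raise ValueError("k must be between 1 and the number of attributes per class")

    v = l2_normalize(visual)  # (N, D)
    t = l2_normalize(text)    # (C, A, D)

    # Cosine similarity between every sample and every attribute prompt: (N, C, A).
    sims = np.einsum("nd,cad->nca", v, t)

    # Keep the k largest similarities per (sample, class) and average them.
    topk = np.sort(sims, axis=-1)[..., -k:]
    scores = topk.mean(axis=-1)  # (N, C)

    # Predicted class is the one with the highest averaged similarity.
    return scores.argmax(axis=-1), scores
```

Under these assumptions, averaging only the K best-matching attributes lets a class be recognized from the subset of its attributes that are actually visible in a patch, instead of being penalized by attribute prompts that do not apply to that sample.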

Key words: multimodal remote sensing data, land-cover classification, semantic-aware cross-modal fusion network, adaptive cross-modal fusion, vision-language pre-trained model, hyperspectral-LiDAR fusion

CLC number: