Hyperspectral-LiDAR Joint Classification Method Based on Vision-Language Pre-trained Models (Special Column: Intelligent Processing and Analysis of Aerospace Remote Sensing Images)

TANG Xu, GU Feng, MA Jingjing, ZHANG Xiangrong

  1. Xidian University
  • Received: 2025-07-28 Revised: 2025-10-14 Online: 2025-10-24 Published: 2025-10-24
  • Corresponding author: TANG Xu
  • Supported by:
    Complex Remote Sensing Image Retrieval Based on Multi-Task Few-Shot Learning; the Fundamental Research Funds for the Central Universities

Abstract: To address the challenge of inaccurate land-cover classification caused by differences in spatial resolution, data heterogeneity, and limited labeled samples in multimodal remote sensing data, this study investigates the joint classification of hyperspectral imagery (HSI) and LiDAR data. We propose a Semantic-Aware Cross-Modal Fusion Network (SCF-Net). First, a lightweight patch encoder transforms the input data into RGB-compatible feature maps, which are then fed into a visual encoder based on Contrastive Language-Image Pre-training (CLIP) and enhanced with learnable prompts. To efficiently integrate multimodal information, an adaptive cross-modal fusion architecture is employed, featuring grouped linear projection and a relation-aware interaction module that enables dynamic spatial feature exchange at low computational cost. For semantic discrimination, attribute-category textual prompts are generated, and classification is performed by computing the cosine similarity between visual and textual embeddings, followed by a Top-K attribute averaging strategy. Experiments on the Houston 2013, MUUFL, and Trento datasets demonstrate that SCF-Net outperforms eight state-of-the-art fusion methods, achieving improvements of over 2.88% in overall accuracy, 2.69% in average accuracy, and 3.02% in the Kappa coefficient, while maintaining high parameter efficiency. Ablation studies further validate the effectiveness of each component. This network offers a novel paradigm for integrating multimodal remote sensing data with large-scale vision-language pre-trained models in complex classification tasks.
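The abstract's final step, matching fused visual features against attribute-category text prompts by cosine similarity and averaging the Top-K best-matching attributes per class, is concrete enough to sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the tensor shapes, the value of K, the prompt layout (A attribute prompts per class), and every name here are hypothetical.

```python
# Minimal sketch of the prompt-based classification step described in the
# abstract. Shapes, K, and all names are illustrative assumptions.
import torch
import torch.nn.functional as F

def topk_attribute_classify(visual_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            k: int = 3) -> torch.Tensor:
    """
    visual_emb: (N, D)    fused HSI-LiDAR embeddings from the visual branch
    text_emb:   (C, A, D) text embeddings of A attribute-category prompts
                          per class, e.g. "a patch of grass, which is green"
    Returns:    (N,)      predicted class index for each sample
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    v = F.normalize(visual_emb, dim=-1)          # (N, D)
    t = F.normalize(text_emb, dim=-1)            # (C, A, D)
    sim = torch.einsum("nd,cad->nca", v, t)      # (N, C, A)

    # Top-K attribute averaging: keep only the K best-matching attribute
    # prompts per class and average them into a single class score.
    topk_sim, _ = sim.topk(k, dim=-1)            # (N, C, k)
    class_scores = topk_sim.mean(dim=-1)         # (N, C)
    return class_scores.argmax(dim=-1)

# Toy usage: 8 samples, 10 classes, 5 attribute prompts each, 512-d embeddings.
preds = topk_attribute_classify(torch.randn(8, 512), torch.randn(10, 5, 512))
```

Averaging only the K strongest attribute matches, rather than all A prompts per class, keeps the class score robust to attribute prompts that do not apply to a particular sample.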

Key words: Multimodal Remote Sensing Data, Land Cover Classification, Semantic-Aware Cross-Modal Fusion Network, Adaptive Cross-Modal Fusion
