Object Detection Method Based on Cross-Modal Common-Mode Interaction and Differential Perception

Xiyang YANG1, Jiaquan LIN2

  1. College of Electronic Information and Automation, Civil Aviation University of China
  2. Civil Aviation University of China
  • Received:2026-01-22 Revised:2026-05-18 Online:2026-05-19 Published:2026-05-19
  • Contact: Jiaquan LIN
  • Funding:
    National Key Research and Development Program of China

Abstract: Visible-infrared dual-modal fusion detection has attracted considerable attention for Unmanned Aerial Vehicle (UAV) perception in complex environments, because the two modalities provide complementary information. However, existing methods struggle to balance accuracy and efficiency when faced with heterogeneous modal features, complex background interference, and weak small-object features in aerial scenes. To address these issues, this paper proposes an object detection method based on cross-modal common-mode interaction and differential perception. First, to address cross-modal feature alignment, a Bidirectional-Cross-Modal Common Mode Fusion (BCMF) module is designed, which uses a bidirectional attention mechanism to enable deep interaction between the visible and infrared modalities and to extract their common features. Second, to suppress complex background noise and enhance target saliency, a Context-Gated Differential Block (CGDB) module is constructed, which uses large-receptive-field context information to adaptively gate and select features. Furthermore, to improve the discriminative power of multi-scale features, a dual feature pyramid network (FPN) structure is adopted that independently maintains and fuses the feature streams of the two modalities, which avoids feature confusion. Experiments on the DroneVehicle and VEDAI datasets show that the proposed method achieves high average precision while keeping the model lightweight, and that its overall performance improves substantially over existing mainstream fusion methods.

Key words: multimodal object detection, UAV, feature fusion, lightweight model, attention mechanism
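
The abstract names two mechanisms, bidirectional attention between modalities and context-gated differential enhancement, without giving their internals. The sketch below is only a generic illustration of those two ideas on small token lists, not the authors' BCMF or CGDB modules. It assumes plain scaled dot-product attention with a residual connection and a sigmoid gate driven by a per-token context score. All function names (`cross_attend`, `bidirectional_fusion`, `gated_difference`) and the scalar `context` input are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Each query token attends over the tokens of the other modality.

    Assumption: unparameterized scaled dot-product attention in which the
    other modality's tokens serve as both keys and values.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended


def bidirectional_fusion(rgb, ir):
    """Hypothetical two-way interaction: each stream adds what it attends to
    in the other stream (residual connection); both streams are returned."""
    rgb_from_ir = cross_attend(rgb, ir)
    ir_from_rgb = cross_attend(ir, rgb)
    rgb_out = [[a + b for a, b in zip(t, u)] for t, u in zip(rgb, rgb_from_ir)]
    ir_out = [[a + b for a, b in zip(t, u)] for t, u in zip(ir, ir_from_rgb)]
    return rgb_out, ir_out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_difference(rgb, ir, context):
    """Hypothetical context gate: scale the per-token modality difference by
    sigmoid(context score). A real module would presumably derive the context
    from a large-receptive-field operator rather than take it as an input."""
    return [
        [sigmoid(c) * (a - b) for a, b in zip(t, u)]
        for t, u, c in zip(rgb, ir, context)
    ]


if __name__ == "__main__":
    rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
    ir_tokens = [[0.5, 0.5], [1.0, 0.0]]
    fused_rgb, fused_ir = bidirectional_fusion(rgb_tokens, ir_tokens)
    print(fused_rgb, fused_ir)
    print(gated_difference(rgb_tokens, ir_tokens, [2.0, -2.0]))
```

In this toy form, a large positive context score lets the modality difference pass almost unchanged, while a large negative score suppresses it. This is one simple way to read "adaptive gated feature selection", under the assumptions stated above.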
