
Dual-Branch Feature Aggregation for UAV Visual Place Recognition (Special Column: Multi-Source Perception for UAVs in Interference Environments)

Qi Liu, Zhixiang Pei, Le Hui, Mingyi He, Yuchao Dai

  1. Northwestern Polytechnical University
  • Received: 2025-06-23  Revised: 2025-09-02  Online: 2025-09-05  Published: 2025-09-05
  • Corresponding author: Yuchao Dai



Abstract: UAVs' reliance on Global Navigation Satellite Systems (GNSS) for navigation and positioning is prone to failure due to signal blockage or interference. Visual Place Recognition (VPR) enables geographic localization by matching the visual information captured by UAVs against pre-built map data, providing reliable positioning in GNSS-denied environments; as a result, it has become a research hotspot in recent years. Traditional VPR methods rely on pre-trained networks to extract global features for matching and retrieval, which are often sensitive to changes in visual appearance such as viewpoint, scale, and lighting, and are prone to losing fine-grained information. To address these issues, this paper proposes a UAV visual geo-localization method based on a dual-branch feature aggregation network, combining a pre-trained Vision Transformer model and a state-space model to extract more robust features. Specifically, a dual-branch feature extraction network integrating the DINOv2 and VMamba models is designed. By combining the global semantic understanding of ViT with the local dynamic modeling capability of the visual state-space model, it achieves stronger generalization and fine-grained perception. Additionally, the method introduces an efficient feature fusion framework inspired by the MLP-Mixer architecture to enhance multi-channel feature representation. Experiments conducted on the same-view ALTO dataset and the cross-view VIGOR dataset demonstrate that the proposed method achieves high accuracy in metrics such as R@1 and R@5, outperforming existing methods and retrieving matching images more effectively in both same-view and cross-view scenarios.
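The R@1 and R@5 metrics cited above are standard recall-at-K scores for retrieval-based localization: a query counts as correct if at least one ground-truth match appears among its K nearest database descriptors. The following is a minimal illustrative sketch of this evaluation (not the paper's code; all function and variable names are hypothetical), assuming L2-normalized global descriptors compared by cosine similarity:

```python
import numpy as np

def recall_at_k(query_feats, db_feats, gt_indices, ks=(1, 5)):
    """Compute Recall@K for feature-based image retrieval.

    query_feats: (Q, D) array of L2-normalized query descriptors.
    db_feats:    (N, D) array of L2-normalized database descriptors.
    gt_indices:  length-Q list; gt_indices[i] is the set of database
                 indices that are correct matches for query i.
    Returns {k: recall}, the fraction of queries whose top-k
    retrieved images contain at least one correct match.
    """
    sims = query_feats @ db_feats.T          # cosine similarities (Q, N)
    ranked = np.argsort(-sims, axis=1)       # descending similarity order
    results = {}
    for k in ks:
        hits = sum(
            bool(set(ranked[i, :k]) & set(gt))
            for i, gt in enumerate(gt_indices)
        )
        results[k] = hits / len(gt_indices)
    return results

# Toy example: 3 queries, 4 database images, 2-D descriptors.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = np.array([[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]])
q /= np.linalg.norm(q, axis=1, keepdims=True)
gt = [{0}, {1}, {3}]
print(recall_at_k(q, db, gt))  # -> {1: 1.0, 5: 1.0}
```

In practice the database side holds the pre-built map imagery and the query side holds UAV captures; only the descriptor extractor (here left abstract) changes between methods, so the same evaluation applies to both the same-view and cross-view settings.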

Key words: UAV Visual Place Recognition, Visual Matching Localization, State-Space Model, Dual-Branch Feature Extraction, Image Retrieval

CLC number: