干扰环境下无人机多源感知专栏

基于双分支特征聚合的无人机视觉位置识别

  • 刘奇 ,
  • 裴智翔 ,
  • 惠乐 ,
  • 何明一 ,
  • 戴玉超
展开
  • 西北工业大学 电子信息学院 陕西省信息获取与处理重点实验室,西安 710129
E-mail: daiyuchao@nwpu.edu.cn

收稿日期: 2025-06-23

  修回日期: 2025-07-28

  录用日期: 2025-08-19

  网络出版日期: 2025-09-05

基金资助

国家自然科学基金(62271410);国家自然科学基金(12150007)

Dual-branch feature aggregation for UAV visual place recognition

  • Qi LIU ,
  • Zhixiang PEI ,
  • Le HUI ,
  • Mingyi HE ,
  • Yuchao DAI
  • Shaanxi Key Laboratory of Information Acquisition and Processing, School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710129, China

Received date: 2025-06-23

  Revised date: 2025-07-28

  Accepted date: 2025-08-19

  Online published: 2025-09-05

Supported by

National Natural Science Foundation of China (62271410, 12150007)

摘要

无人机依赖全球导航卫星系统(GNSS)进行导航定位,容易因信号阻挡或干扰而失效。视觉位置识别(VPR)通过将无人机捕获的视觉信息与预先构建的地图数据进行匹配实现地理定位,能够在GNSS信号拒止环境下提供可靠的定位信息,因此成为近年来的研究热点。传统VPR方法依赖预训练网络提取用于匹配、检索的全局特征,通常对视角、尺度、光照等视觉外观变化敏感,并且容易丢失细粒度信息。为此,提出了一种基于双分支特征聚合网络的无人机视觉地理位置识别方法,结合预训练的视觉Transformer模型与状态空间模型以提取更加鲁棒的特征。具体来说,设计了一个集成DINOv2与VMamba模型的双分支特征提取网络,通过结合ViT的全局语义理解能力与视觉状态空间模型的局部动态建模能力,实现更强的泛化与细节感知能力。此外,引入了一个受MLP-Mixer架构启发的高效特征融合框架,以增强多通道特征表示的性能。在同视角的ALTO数据集和跨视角的VIGOR数据集上进行的实验表明,所提出的方法在返回前1个和前5个结果的召回率(R@1、R@5)等指标上具有较高的准确性,且优于现有方法;无论是在同一视角还是跨视角场景中,都能够更有效地识别出匹配图像。

本文引用格式

刘奇, 裴智翔, 惠乐, 何明一, 戴玉超. 基于双分支特征聚合的无人机视觉位置识别[J]. 航空学报, 2025, 46(23): 632457-632457. DOI: 10.7527/S1000-6893.2025.32457

Abstract

UAVs’ reliance on Global Navigation Satellite Systems (GNSS) for navigation and positioning is prone to failure due to signal blockage or interference. Visual Place Recognition (VPR) enables geographic localization by matching the visual information captured by UAVs with pre-built map data, providing reliable positioning in GNSS-denied environments, and has thus become a research hotspot in recent years. Traditional VPR methods typically depend on pre-trained networks to extract global features for matching and retrieval, but these features are sensitive to changes in visual appearance such as viewpoint, scale, and lighting, and are prone to losing fine-grained information. To address these issues, this paper proposes a UAV visual geo-localization method based on a dual-branch feature aggregation network that combines a pre-trained Vision Transformer model and a state-space model to extract more robust features. Specifically, a dual-branch feature extraction network integrating the DINOv2 and VMamba models is designed, which leverages the global semantic understanding of ViT and the local dynamic modeling capability of the visual state-space model to achieve stronger generalization and fine-grained perception. Additionally, the method introduces an efficient feature fusion framework inspired by the MLP-Mixer architecture to enhance the performance of multi-channel feature representation. Experiments on the same-view ALTO dataset and the cross-view VIGOR dataset demonstrate that the proposed method achieves high accuracy on metrics such as R@1 and R@5, outperforming existing methods, and identifies matching images more effectively in both same-view and cross-view scenarios.
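The abstract specifies the fusion module only as "MLP-Mixer-inspired". As a rough illustration of that general idea (not the authors' implementation), the sketch below fuses two branch feature maps by channel concatenation, then applies a token-mixing step across the token axis and a channel-mixing step across the channel axis. The function name `mixer_fuse`, the concatenation step, and the random bias-free weights are all illustrative assumptions; a real Mixer block would also add nonlinearities, layer norm, and residual connections.

```python
# Illustrative sketch of Mixer-style dual-branch fusion (hypothetical, simplified).
# Feature maps are plain Python matrices: rows = tokens, columns = channels.

import random

def matmul(a, b):
    """Multiply matrices a (m x k) and b (k x n) into an m x n result."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def rand_w(d_in, d_out, seed):
    """Random bias-free weight matrix, seeded for reproducibility."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]

def mixer_fuse(feat_a, feat_b):
    """Fuse two (tokens x channels) branch outputs, Mixer-style.

    1. Concatenate the branches along the channel axis.
    2. Token mixing: a linear map applied across the token dimension.
    3. Channel mixing: a linear map applied across the channel dimension.
    """
    tokens = len(feat_a)
    fused = [ra + rb for ra, rb in zip(feat_a, feat_b)]  # tokens x (ca + cb)
    ch = len(fused[0])
    w_tok = rand_w(tokens, tokens, seed=0)               # token-mixing weights
    w_ch = rand_w(ch, ch, seed=1)                        # channel-mixing weights
    mixed = transpose(matmul(transpose(fused), w_tok))   # mix across tokens
    return matmul(mixed, w_ch)                           # mix across channels

# Toy example: 4 tokens with 3 channels per branch -> fused map is 4 x 6.
a = [[1.0, 0.0, 0.0] for _ in range(4)]
b = [[0.0, 1.0, 1.0] for _ in range(4)]
out = mixer_fuse(a, b)
print(len(out), len(out[0]))  # 4 6
```

The key property the sketch preserves is that, unlike per-token pooling, the token-mixing step lets every spatial location influence every other, which is what makes Mixer-style aggregation attractive for building a single global descriptor from patch features.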

参考文献

[1] LYU M Y, ZHAO Y B, HUANG C, et al. Unmanned aerial vehicles for search and rescue: a survey[J]. Remote Sensing, 2023, 15(13): 3266.
[2] 冷佳旭, 莫梦竟成, 周应华, 等. 无人机视角下的目标检测研究进展[J]. 中国图象图形学报, 2023, 28(9): 2563-2586.
  LENG J X, MO M J C, ZHOU Y H, et al. Recent advances in drone-view object detection[J]. Journal of Image and Graphics, 2023, 28(9): 2563-2586 (in Chinese).
[3] CANDIAGO S, REMONDINO F, DE GIGLIO M, et al. Evaluating multispectral images and vegetation indices for precision farming applications from UAV images[J]. Remote Sensing, 2015, 7(4): 4026-4047.
[4] MENOUAR H, GUVENC I, AKKAYA K, et al. UAV-enabled intelligent transportation systems for the smart city: applications and challenges[J]. IEEE Communications Magazine, 2017, 55(3): 22-28.
[5] GYAGENDA N, HATILIMA J V, ROTH H, et al. A review of GNSS-independent UAV navigation techniques[J]. Robotics and Autonomous Systems, 2022, 152: 104069.
[6] 吴成一. GNSS拒止条件下的无人机视觉导航研究[D]. 西安: 西安电子科技大学, 2021.
  WU C Y. GNSS-denied UAV visual navigation research[D]. Xi’an: Xidian University, 2021 (in Chinese).
[7] ARAFAT M Y, ALAM M M, MOH S. Vision-based navigation techniques for unmanned aerial vehicles: review and challenges[J]. Drones, 2023, 7(2): 89.
[8] GUPTA A, FERNANDO X. Simultaneous localization and mapping (SLAM) and data fusion in unmanned aerial vehicles: recent advances and challenges[J]. Drones, 2022, 6(4): 85.
[9] 袁媛, 孙柏, 刘赶超. 景象匹配无人机视觉定位[J]. 自动化学报, 2025, 51(2): 287-311.
  YUAN Y, SUN B, LIU G C. Drone-based scene matching visual geo-localization[J]. Acta Automatica Sinica, 2025, 51(2): 287-311 (in Chinese).
[10] VAN DALEN G J, MAGREE D P, JOHNSON E N. Absolute localization using image alignment and particle filtering[C]∥ AIAA Guidance, Navigation, and Control Conference. Reston: AIAA, 2016: 0647.
[11] MANTELLI M, PITTOL D, NEULAND R, et al. A novel measurement model based on abBRIEF for global localization of a UAV over satellite images[J]. Robotics and Autonomous Systems, 2019, 112: 304-319.
[12] COUTURIER A, AKHLOUFI M A. UAV navigation in GPS-denied environment using particle filtered RVL[C]∥ Situation Awareness in Degraded Environments 2019. Baltimore: SPIE, 2019: 188-198.
[13] MOSKALENKO I, KORNILOVA A, FERRER G. Visual place recognition for aerial imagery: a survey[J]. Robotics and Autonomous Systems, 2025, 183: 104837.
[14] XU W, YAO Y, CAO J, et al. UAV-VisLoc: a large-scale dataset for UAV visual localization[DB/OL]. arXiv preprint: 2405.11936, 2024.
[15] CISNEROS I, YIN P, ZHANG J, et al. ALTO: a large-scale dataset for UAV visual place recognition and localization[DB/OL]. arXiv preprint: 2207.12317, 2022.
[16] ZHU S J, YANG T, CHEN C. VIGOR: cross-view image geo-localization beyond one-to-one retrieval[C]∥ 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2021: 5316-5325.
[17] ZHENG Z D, WEI Y C, YANG Y. University-1652: a multi-view multi-source benchmark for drone-based geo-localization[C]∥ Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1395-1403.
[18] RADENOVIĆ F, TOLIAS G, CHUM O. Fine-tuning CNN image retrieval with no human annotation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(7): 1655-1668.
[19] ARANDJELOVIĆ R, GRONAT P, TORII A, et al. NetVLAD: CNN architecture for weakly supervised place recognition[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 5297-5307.
[20] BERTON G, MASONE C, CAPUTO B. Re-thinking visual geo-localization for large-scale applications[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 4878-4888.
[21] BERTON G, TRIVIGNO G, CAPUTO B, et al. EigenPlaces: training viewpoint robust models for visual place recognition[C]∥ 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE Press, 2023: 11046-11056.
[22] ALI-BEY A, CHAIB-DRAA B, GIGUERE P. MixVPR: feature mixing for visual place recognition[C]∥ 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Piscataway: IEEE Press, 2023: 2998-3007.
[23] LU F, ZHANG L, LAN X, et al. Towards seamless adaptation of pre-trained models for visual place recognition[C]∥ International Conference on Learning Representations (ICLR). Vienna: ICLR, 2024: 11133-11154.
[24] OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[DB/OL]. arXiv preprint: 2304.07193, 2024.
[25] LIU Y, TIAN Y, ZHAO Y, et al. VMamba: visual state space model[J]. Advances in Neural Information Processing Systems, 2024, 37: 103031-103063.
[26] TOLSTIKHIN I O, HOULSBY N, KOLESNIKOV A, et al. MLP-Mixer: an all-MLP architecture for vision[C]∥ Neural Information Processing Systems. New York: Curran Associates Inc., 2021: 24261-24272.
[27] KEETHA N, MISHRA A, KARHADE J, et al. AnyLoc: towards universal visual place recognition[J]. IEEE Robotics and Automation Letters, 2024, 9(2): 1286-1293.
[28] CHEN S F, GE C J, TONG Z, et al. AdaptFormer: adapting vision transformers for scalable visual recognition[C]∥ Proceedings of the 36th International Conference on Neural Information Processing Systems. New York: ACM, 2022: 16664-16678.
[29] YUAN Y, CHEN W Y, YANG Y, et al. In defense of the triplet loss again: learning robust person re-identification with fast approximated triplet loss and label distillation[C]∥ 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Piscataway: IEEE Press, 2020: 1454-1463.
[30] LIU Z, MAO H Z, WU C Y, et al. A ConvNet for the 2020s[C]∥ 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2022: 11966-11976.
[31] SHI Y, LIU L, YU X, et al. Spatial-aware feature aggregation for cross-view image based geo-localization[C]∥ Proceedings of the 33rd International Conference on Neural Information Processing Systems. New York: Curran Associates Inc., 2019: 10090-10100.
[32] ZHU S J, SHAH M, CHEN C. TransGeo: transformer is all you need for cross-view image geo-localization[C]∥ 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2022: 1152-1161.
[33] DEUSER F, HABEL K, OSWALD N. Sample4Geo: hard negative sampling for cross-view geo-localisation[C]∥ 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE Press, 2023: 16801-16810.
[34] WANG Z, SHI D X, QIU C P, et al. Sequence matching for image-based UAV-to-satellite geolocalization[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5607815.