Special Column

Dual-branch feature aggregation for UAV visual place recognition

  • Qi LIU,
  • Zhixiang PEI,
  • Le HUI,
  • Mingyi HE,
  • Yuchao DAI
  • Shaanxi Key Laboratory of Information Acquisition and Processing, School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710129, China

Received date: 2025-06-23

Revised date: 2025-07-28

Accepted date: 2025-08-19

Online published: 2025-09-05

Supported by

National Natural Science Foundation of China (62271410)

Abstract

UAVs’ reliance on Global Navigation Satellite Systems (GNSS) for navigation and positioning is prone to failure due to signal blockage or interference. Visual Place Recognition (VPR) enables geographic localization by matching the visual information captured by UAVs against pre-built map data, providing reliable positioning in GNSS-denied environments, and has thus become a research hotspot in recent years. Traditional VPR methods typically depend on pre-trained networks to extract global features for matching and retrieval, but they are sensitive to changes in visual appearance such as viewpoint, scale, and lighting, and are prone to losing fine-grained information. To address these issues, this paper proposes a UAV visual geo-localization method based on a dual-branch feature aggregation network that combines a pre-trained Vision Transformer (ViT) model and a state-space model to extract more robust features. Specifically, a dual-branch feature extraction network integrating the DINOv2 and VMamba models is designed, which leverages the global semantic understanding of the ViT and the local dynamic modeling capability of the visual state-space model to achieve stronger generalization and fine-grained perception. Additionally, the method introduces an efficient feature fusion framework inspired by the MLP-Mixer architecture to enhance multi-channel feature representation. Experiments on the same-view ALTO dataset and the cross-view VIGOR dataset demonstrate that the proposed method achieves high accuracy on metrics such as R@1 and R@5, outperforming existing methods and proving effective at identifying matching images across different scenarios.
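The retrieval protocol behind the reported R@1 and R@5 figures can be illustrated with a minimal sketch (this is the standard VPR evaluation convention, not the paper's actual pipeline; descriptors are assumed to have already been extracted, e.g. by the dual-branch network, and the toy vectors below are purely illustrative): each query descriptor is ranked against every reference descriptor by cosine similarity, and Recall@K is the fraction of queries whose ground-truth reference appears among the top K matches.

```python
import math

def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_at_k(queries, references, gt, k):
    """Fraction of queries whose ground-truth reference index
    appears among the k most similar reference descriptors."""
    hits = 0
    for q, true_idx in zip(queries, gt):
        # Rank all reference descriptors by similarity to the query.
        ranked = sorted(range(len(references)),
                        key=lambda i: cosine(q, references[i]),
                        reverse=True)
        hits += true_idx in ranked[:k]
    return hits / len(queries)
```

In practice the descriptors are high-dimensional global features and the ranking is done with a nearest-neighbour index rather than an exhaustive sort, but the metric itself is exactly this count.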

Cite this article

Qi LIU, Zhixiang PEI, Le HUI, Mingyi HE, Yuchao DAI. Dual-branch feature aggregation for UAV visual place recognition[J]. Acta Aeronautica et Astronautica Sinica, 2025, 46(23): 632457-632457. DOI: 10.7527/S1000-6893.2025.32457

References

[1] LYU M Y, ZHAO Y B, HUANG C, et al. Unmanned aerial vehicles for search and rescue: a survey[J]. Remote Sensing, 2023, 15(13): 3266.
[2] LENG J X, MO M, ZHOU Y H, et al. Recent advances in drone-view object detection[J]. Journal of Image and Graphics, 2023, 28(9): 2563-2586 (in Chinese).
[3] CANDIAGO S, REMONDINO F, DE GIGLIO M, et al. Evaluating multispectral images and vegetation indices for precision farming applications from UAV images[J]. Remote Sensing, 2015, 7(4): 4026-4047.
[4] MENOUAR H, GUVENC I, AKKAYA K, et al. UAV-enabled intelligent transportation systems for the smart city: applications and challenges[J]. IEEE Communications Magazine, 2017, 55(3): 22-28.
[5] GYAGENDA N, HATILIMA J V, ROTH H, et al. A review of GNSS-independent UAV navigation techniques[J]. Robotics and Autonomous Systems, 2022, 152: 104069.
[6] WU C Y. GNSS-denied UAV visual navigation research[D]. Xi’an: Xidian University, 2021 (in Chinese).
[7] ARAFAT M Y, ALAM M M, MOH S. Vision-based navigation techniques for unmanned aerial vehicles: review and challenges[J]. Drones, 2023, 7(2): 89.
[8] GUPTA A, FERNANDO X. Simultaneous localization and mapping (SLAM) and data fusion in unmanned aerial vehicles: recent advances and challenges[J]. Drones, 2022, 6(4): 85.
[9] YUAN Y, SUN B, LIU G C. Drone-based scene matching visual geo-localization[J]. Acta Automatica Sinica, 2025, 51(2): 287-311 (in Chinese).
[10] VAN DALEN G J, MAGREE D P, JOHNSON E N. Absolute localization using image alignment and particle filtering[C]∥ AIAA Guidance, Navigation, and Control Conference. Reston: AIAA, 2016: 0647.
[11] MANTELLI M, PITTOL D, NEULAND R, et al. A novel measurement model based on abBRIEF for global localization of a UAV over satellite images[J]. Robotics and Autonomous Systems, 2019, 112: 304-319.
[12] COUTURIER A, AKHLOUFI M A. UAV navigation in GPS-denied environment using particle filtered RVL[C]∥ Situation Awareness in Degraded Environments 2019. Baltimore: SPIE, 2019: 188-198.
[13] MOSKALENKO I, KORNILOVA A, FERRER G. Visual place recognition for aerial imagery: a survey[J]. Robotics and Autonomous Systems, 2025, 183: 104837.
[14] XU W, YAO Y, CAO J, et al. UAV-VisLoc: a large-scale dataset for UAV visual localization[DB/OL]. arXiv preprint: 2405.11936, 2024.
[15] CISNEROS I, YIN P, ZHANG J, et al. ALTO: a large-scale dataset for UAV visual place recognition and localization[DB/OL]. arXiv preprint: 2207.12317, 2022.
[16] ZHU S J, YANG T, CHEN C. VIGOR: cross-view image geo-localization beyond one-to-one retrieval[C]∥ 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2021: 5316-5325.
[17] ZHENG Z D, WEI Y C, YANG Y. University-1652: A multi-view multi-source benchmark for drone-based geo-localization[C]∥ Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1395-1403.
[18] RADENOVIĆ F, TOLIAS G, CHUM O. Fine-tuning CNN image retrieval with no human annotation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(7): 1655-1668.
[19] ARANDJELOVIĆ R, GRONAT P, TORII A, et al. NetVLAD: CNN architecture for weakly supervised place recognition[C]∥ Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 5297-5307.
[20] BERTON G, MASONE C, CAPUTO B. Re-thinking visual geo-localization for large-scale applications[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 4878-4888.
[21] BERTON G, TRIVIGNO G, CAPUTO B, et al. EigenPlaces: Training viewpoint robust models for visual place recognition[C]∥2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE Press, 2023: 11046-11056.
[22] ALI-BEY A, CHAIB-DRAA B, GIGUERE P. MixVPR: feature mixing for visual place recognition[C]∥ 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Piscataway: IEEE Press, 2023: 2998-3007.
[23] LU F, ZHANG L, LAN X, et al. Towards seamless adaptation of pre-trained models for visual place recognition[C]∥ International Conference on Learning Representations (ICLR). Vienna: ICLR, 2024: 11133-11154.
[24] OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[DB/OL]. arXiv preprint: 2304.07193, 2024.
[25] LIU Y, TIAN Y, ZHAO Y, et al. VMamba: visual state space model[J]. Advances in Neural Information Processing Systems, 2024, 37: 103031-103063.
[26] TOLSTIKHIN I O, HOULSBY N, KOLESNIKOV A, et al. MLP-Mixer: an all-MLP architecture for vision[C]∥ Neural Information Processing Systems. New York: Curran Associates Inc., 2021: 24261-24272.
[27] KEETHA N, MISHRA A, KARHADE J, et al. AnyLoc: towards universal visual place recognition[J]. IEEE Robotics and Automation Letters, 2024, 9(2): 1286-1293.
[28] CHEN S F, GE C J, TONG Z, et al. AdaptFormer: adapting vision transformers for scalable visual recognition[C]∥ Proceedings of the 36th International Conference on Neural Information Processing Systems. New York: ACM, 2022: 16664-16678.
[29] YUAN Y, CHEN W Y, YANG Y, et al. In defense of the triplet loss again: learning robust person re-identification with fast approximated triplet loss and label distillation[C]∥ 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Piscataway: IEEE Press, 2020: 1454-1463.
[30] LIU Z, MAO H Z, WU C Y, et al. A ConvNet for the 2020s[C]∥ 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2022: 11966-11976.
[31] SHI Y, LIU L, YU X, et al. Spatial-aware feature aggregation for cross-view image based geo-localization[C]∥ Proceedings of the 33rd International Conference on Neural Information Processing Systems. New York: Curran Associates Inc., 2019: 10090-10100.
[32] ZHU S J, SHAH M, CHEN C. TransGeo: transformer is all you need for cross-view image geo-localization[C]∥ 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2022: 1152-1161.
[33] DEUSER F, HABEL K, OSWALD N. Sample4Geo: hard negative sampling for cross-view geo-localisation[C]∥ 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway: IEEE Press, 2023: 16801-16810.
[34] WANG Z, SHI D X, QIU C P, et al. Sequence matching for image-based UAV-to-satellite geolocalization[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5607815.