航空学报 (Acta Aeronautica et Astronautica Sinica) > 2025, Vol. 46, Issue (23): 632017-632017   doi: 10.7527/S1000-6893.2025.32017

Special Column: Multi-Source UAV Perception in Interference Environments


RGB-T UAV object tracking based on feature-cooperative reconstruction

Dong GAO, Pujian LAI, Shilei WANG, Gong CHENG()   

  1. School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
  • Received: 2025-03-25  Revised: 2025-04-16  Accepted: 2025-05-30  Online: 2025-06-30  Published: 2025-06-13
  • Contact: Gong CHENG  E-mail: gcheng@nwpu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61772425); Shaanxi Province Natural Science Foundation (2021JC-16)


Abstract:

RGB-T Unmanned Aerial Vehicle (UAV) object tracking enhances tracking robustness in complex environments by fusing complementary information from the visible and thermal infrared modalities. However, existing methods neglect the noise interference caused by modality gaps, which weakens the effectiveness of cross-modal feature complementarity and degrades feature representation, thereby limiting the performance of RGB-T UAV trackers. To address this issue, a tracker based on feature-cooperative reconstruction is proposed. Its core is a feature-cooperative reconstruction module consisting of a cross-modal interaction encoder and a feature reconstruction decoder. Specifically, the cross-modal interaction encoder employs an adaptive feature interaction mechanism to extract critical complementary information from the auxiliary modality while effectively suppressing cross-modal noise interference. The feature reconstruction decoder then uses the query features output by the encoder to guide modality feature reconstruction, preserving modality-specific information while incorporating cross-modal complementary cues to strengthen feature representation. In addition, to improve target localization accuracy in dynamic scenes, a cross-modal location cue fusion module is proposed that fuses the search regions of different modalities to provide more precise location cues. Finally, extensive experiments are conducted on two RGB-T UAV object tracking benchmarks (VTUAV and HiAL) as well as the LasHeR dataset. The results show that the proposed method significantly outperforms existing methods on the VTUAV and HiAL datasets. Notably, compared with HMFT, the proposed method improves tracking success rate and precision on the VTUAV dataset by 9.9% and 9.0%, respectively.
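To make the encoder–decoder idea above concrete, the following is a minimal NumPy sketch, not the authors' implementation: the token counts, feature dimension, number of learnable queries, and the sigmoid channel gate used to suppress cross-modal noise are all illustrative assumptions. Queries attend over the auxiliary (thermal) features to gather complementary cues, and the main-modality (RGB) tokens are then reconstructed by attending to those gated queries with a residual connection that preserves modality-specific information.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Scaled dot-product attention: each row of q attends over k/v."""
    d = q.shape[-1]
    w = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (Nq, Nk) attention weights
    return w @ v                                # (Nq, d) attended values

def interaction_encoder(f_aux, queries):
    """Cross-modal interaction encoder (sketch): learnable queries pull
    complementary cues from the auxiliary modality; a per-channel sigmoid
    gate (an assumed stand-in for the adaptive interaction mechanism)
    down-weights noisy channels."""
    cues = cross_attention(queries, f_aux, f_aux)     # (Nq, d)
    gate = 1.0 / (1.0 + np.exp(-cues.mean(axis=0)))   # (d,) channel gate
    return cues * gate                                # gated query features

def reconstruction_decoder(f_main, q_enc):
    """Feature reconstruction decoder (sketch): main-modality tokens attend
    to the encoder's query features; the residual add keeps the
    modality-specific content while injecting cross-modal cues."""
    comp = cross_attention(f_main, q_enc, q_enc)      # (N, d) complements
    return f_main + comp                              # reconstructed features

rng = np.random.default_rng(0)
f_rgb = rng.normal(size=(64, 32))   # 64 RGB tokens, 32-dim (assumed sizes)
f_tir = rng.normal(size=(64, 32))   # 64 thermal-infrared tokens
queries = rng.normal(size=(8, 32))  # 8 learnable queries (assumed count)

q_enc = interaction_encoder(f_tir, queries)
f_rec = reconstruction_decoder(f_rgb, q_enc)
print(f_rec.shape)  # reconstructed RGB features, same shape as the input
```

In a trained tracker the attention projections, queries, and gate would be learned end-to-end (e.g. in a Transformer); the sketch only shows the data flow: extract gated cross-modal cues, then reconstruct each modality around them.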

Key words: UAV, object tracking, Transformer, cross-modal feature interaction, feature-cooperative reconstruction, cross-modal location cue fusion

CLC number: