航空学报 > 2025, Vol. 46 Issue (8): 331035-331035   doi: 10.7527/S1000-6893.2024.31035

基于改进TD3算法的无人机动态环境无地图导航

姜凌峰1, 李新凯1(), 张海2, 李涵玮1, 张宏立3   

  1. 新疆大学 电气工程学院,乌鲁木齐 830017
    2.新疆大学 工程训练中心,乌鲁木齐 830017
    3.新疆大学 智能科学与技术学院(未来技术学院),乌鲁木齐 830017
  • 收稿日期:2024-08-02 修回日期:2024-11-04 接受日期:2024-12-06 出版日期:2024-12-13 发布日期:2024-12-12
  • 通讯作者: 李新凯 E-mail:lxk@xju.edu.cn
  • 基金资助:
    国家自然科学基金(62263030);新疆维吾尔自治区自然科学基金(2022D01C86)

Mapless navigation of UAVs in dynamic environments based on an improved TD3 algorithm

Lingfeng JIANG1, Xinkai LI1(), Hai ZHANG2, Hanwei LI1, Hongli ZHANG3   

  1. School of Electrical Engineering,Xinjiang University,Urumqi 830017,China
    2.Engineering Training Center,Xinjiang University,Urumqi 830017,China
    3.School of Intelligent Science and Technology (School of Future Technology),Xinjiang University,Urumqi 830017,China
  • Received:2024-08-02 Revised:2024-11-04 Accepted:2024-12-06 Online:2024-12-13 Published:2024-12-12
  • Contact: Xinkai LI E-mail:lxk@xju.edu.cn
  • Supported by:
    National Natural Science Foundation of China(62263030);Natural Science Foundation of Xinjiang Uygur Autonomous Region(2022D01C86)

摘要:

针对无人机导航系统在未知动态环境中难以进行建图、导航的问题,提出了基于改进双延迟深度确定性策略梯度(TD3)算法的端到端无地图导航方法。为解决无地图环境下无人机感知受限的问题,将导航模型定义为部分可观马尔可夫决策过程(POMDP),并引入门控循环单元(GRU),使策略网络能够利用历史状态的时序信息获取最优策略,避免陷入局部最优;在TD3算法基础上引入softmax算子对值函数进行处理,同时采用双策略网络,以解决TD3算法中策略函数不稳定和值函数低估的问题;设计非稀疏奖励函数,解决强化学习策略在稀疏奖励条件下难以收敛的问题。最后,在AirSim平台上进行仿真实验,结果表明,相比传统深度强化学习算法,改进算法在无人机无地图避障导航问题上具有更快的收敛速度和更高的任务成功率。

关键词: 无地图导航, 深度强化学习, 确定性策略梯度, 无人机, 动态环境

Abstract:

To address the difficulty of mapping and navigation for UAV navigation systems in unknown dynamic environments, an end-to-end mapless navigation method based on an improved Twin Delayed Deep Deterministic policy gradient (TD3) algorithm is proposed. To overcome the UAV's perception limitations in a mapless environment, the navigation model is formulated as a Partially Observable Markov Decision Process (POMDP), and a Gated Recurrent Unit (GRU) is introduced so that the policy network can exploit the temporal information in historical states to obtain an optimal policy and avoid falling into local optima. Building on the TD3 algorithm, a softmax operator is applied to the value function, and dual policy networks are adopted, to address the policy-function instability and value-function underestimation of TD3. A non-sparse reward function is designed to resolve the difficulty of policy convergence in reinforcement learning under sparse reward conditions. Finally, simulation experiments on the AirSim platform show that, compared with traditional deep reinforcement learning algorithms, the improved algorithm achieves faster convergence and a higher task success rate in mapless obstacle-avoidance navigation for UAVs.
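The softmax treatment of the value function described in the abstract can be illustrated with a small numerical sketch. This is not the paper's implementation; it only shows a softmax-weighted value estimate of the kind used in softmax-based variants of deterministic policy gradients, where `beta` is a hypothetical inverse-temperature parameter controlling how close the estimate is to the hard maximum:

```python
import numpy as np

def softmax_value(q_values, beta=1.0):
    """Softmax-weighted value estimate: sum_i softmax(beta * q)_i * q_i.

    Interpolates between the mean (beta -> 0) and the max (beta -> inf)
    of the Q-values, softening the hard min/max that can cause the
    underestimation bias discussed for TD3-style critics.
    """
    z = beta * (q_values - np.max(q_values))  # subtract max for stability
    w = np.exp(z) / np.sum(np.exp(z))         # softmax weights
    return float(np.sum(w * q_values))

# Example: the estimate lies between the mean (2.0) and the max (3.0).
q = np.array([1.0, 2.0, 3.0])
print(softmax_value(q, beta=5.0))
```

As `beta` grows, the weight concentrates on the largest Q-value, so the operator behaves like a smooth maximum while still mixing in information from the other estimates.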

Key words: mapless navigation, deep reinforcement learning, deterministic policy gradient, UAVs, dynamic environment
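The non-sparse reward design mentioned in the abstract can be sketched as follows. The terms and coefficients here are illustrative assumptions, not the paper's actual reward function: a dense progress term toward the goal, a proximity penalty near obstacles, and terminal bonuses for reaching the goal or colliding:

```python
def shaped_reward(dist_prev, dist_now, min_obstacle_dist,
                  reached, collided,
                  k_progress=10.0, k_obstacle=0.5, safe_dist=2.0):
    """Hypothetical non-sparse reward for mapless UAV navigation.

    dist_prev / dist_now: distance to the goal before / after the step.
    min_obstacle_dist: closest obstacle distance from the range sensor.
    reached / collided: terminal flags for the episode.
    """
    if reached:
        return 100.0          # terminal success bonus
    if collided:
        return -100.0         # terminal collision penalty
    r = k_progress * (dist_prev - dist_now)   # dense progress term
    if min_obstacle_dist < safe_dist:         # dense safety penalty
        r -= k_obstacle * (safe_dist - min_obstacle_dist)
    return r
```

With such shaping the agent receives informative feedback at every step rather than only at episode termination, which is what makes policy convergence feasible where a purely sparse (goal-only) reward would not.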
