航空学报 > 2025, Vol. 46 Issue (8): 331035-331035   doi: 10.7527/S1000-6893.2024.31035

基于改进TD3算法的无人机动态环境无地图导航

姜凌峰1, 李新凯1(), 张海2, 李涵玮1, 张宏立3   

  1. 新疆大学 电气工程学院,乌鲁木齐 830017
    2.新疆大学 工程训练中心,乌鲁木齐 830017
    3.新疆大学 智能科学与技术学院(未来技术学院),乌鲁木齐 830017
  • 收稿日期:2024-08-02 修回日期:2024-11-04 接受日期:2024-12-06 出版日期:2024-12-13 发布日期:2024-12-12
  • 通讯作者: 李新凯 E-mail:lxk@xju.edu.cn
  • 基金资助:
    国家自然科学基金(62263030);新疆维吾尔自治区自然科学基金(2022D01C86)

Mapless navigation of UAVs in dynamic environments based on an improved TD3 algorithm

Lingfeng JIANG1, Xinkai LI1(), Hai ZHANG2, Hanwei LI1, Hongli ZHANG3   

  1. School of Electrical Engineering,Xinjiang University,Urumqi 830017,China
    2.Engineering Training Center,Xinjiang University,Urumqi 830017,China
    3.School of Intelligent Science and Technology (School of Future Technology),Xinjiang University,Urumqi 830017,China
  • Received:2024-08-02 Revised:2024-11-04 Accepted:2024-12-06 Online:2024-12-13 Published:2024-12-12
  • Contact: Xinkai LI E-mail:lxk@xju.edu.cn
  • Supported by:
    National Natural Science Foundation of China(62263030);Natural Science Foundation of Xinjiang Uygur Autonomous Region(2022D01C86)

摘要:

针对无人机导航系统在未知动态环境中难以进行建图、导航的问题,提出了基于改进双延迟深度确定性策略梯度(TD3)算法的端到端无地图导航方法。为解决无地图环境下无人机感知受限的问题,将导航模型定义为部分可观马尔可夫决策过程(POMDP),并引入门控循环单元(GRU),使策略网络能够利用历史状态的时序信息获取最优策略,避免陷入局部最优;在TD3算法基础上引入softmax算子对值函数进行处理,同时采用双策略网络,以解决TD3算法中策略函数不稳定和值函数低估的问题;设计非稀疏奖励函数,解决强化学习策略在稀疏奖励条件下难以收敛的问题。最后,在AirSim平台上进行仿真实验,结果表明,相比传统深度强化学习算法,改进算法在无人机无地图避障导航问题上具有更快的收敛速度和更高的任务成功率。

关键词: 无地图导航, 深度强化学习, 确定性策略梯度, 无人机, 动态环境

Abstract:

To address the difficulty of mapping and navigation for UAV navigation systems in unknown dynamic environments, an end-to-end mapless navigation method based on an improved Twin Delayed Deep Deterministic policy gradient (TD3) algorithm is proposed. To overcome the UAV's perception limitations in a mapless environment, the navigation model is formulated as a Partially Observable Markov Decision Process (POMDP), and a Gated Recurrent Unit (GRU) is introduced so that the policy network can exploit the temporal information in historical states to obtain an optimal policy and avoid falling into local optima. Building on the TD3 algorithm, a softmax operator is applied to the value function, and dual policy networks are adopted, to address the policy-function instability and value-function underestimation of TD3. A non-sparse reward function is designed to resolve the difficulty of policy convergence in reinforcement learning under sparse reward conditions. Finally, simulation experiments on the AirSim platform show that, compared with traditional deep reinforcement learning algorithms, the improved algorithm achieves faster convergence and a higher task success rate in mapless obstacle-avoidance navigation for UAVs.
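The softmax treatment of the value function described in the abstract can be illustrated with a small numerical sketch. This is not the paper's implementation; it only shows a softmax-weighted value estimate of the kind used in softmax-based variants of deterministic policy gradients, where `beta` is a hypothetical inverse-temperature parameter controlling how close the estimate is to the hard maximum:

```python
import numpy as np

def softmax_value(q_values, beta=1.0):
    """Softmax-weighted value estimate: sum_i softmax(beta * q)_i * q_i.

    Interpolates between the mean (beta -> 0) and the max (beta -> inf)
    of the Q-values, softening the hard min/max that can cause the
    underestimation bias discussed for TD3-style critics.
    """
    z = beta * (q_values - np.max(q_values))  # subtract max for stability
    w = np.exp(z) / np.sum(np.exp(z))         # softmax weights
    return float(np.sum(w * q_values))

# Example: the estimate lies between the mean (2.0) and the max (3.0).
q = np.array([1.0, 2.0, 3.0])
print(softmax_value(q, beta=5.0))
```

As `beta` grows, the weight concentrates on the largest Q-value, so the operator behaves like a smooth maximum while still mixing in information from the other estimates.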

Key words: mapless navigation, deep reinforcement learning, deterministic policy gradient, UAVs, dynamic environment
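The non-sparse reward design mentioned in the abstract can be sketched as follows. The terms and coefficients here are illustrative assumptions, not the paper's actual reward function: a dense progress term toward the goal, a proximity penalty near obstacles, and terminal bonuses for reaching the goal or colliding:

```python
def shaped_reward(dist_prev, dist_now, min_obstacle_dist,
                  reached, collided,
                  k_progress=10.0, k_obstacle=0.5, safe_dist=2.0):
    """Hypothetical non-sparse reward for mapless UAV navigation.

    dist_prev / dist_now: distance to the goal before / after the step.
    min_obstacle_dist: closest obstacle distance from the range sensor.
    reached / collided: terminal flags for the episode.
    """
    if reached:
        return 100.0          # terminal success bonus
    if collided:
        return -100.0         # terminal collision penalty
    r = k_progress * (dist_prev - dist_now)   # dense progress term
    if min_obstacle_dist < safe_dist:         # dense safety penalty
        r -= k_obstacle * (safe_dist - min_obstacle_dist)
    return r
```

With such shaping the agent receives informative feedback at every step rather than only at episode termination, which is what makes policy convergence feasible where a purely sparse (goal-only) reward would not.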
