Acta Aeronautica et Astronautica Sinica > 2025, Vol. 46 Issue (3): 630553-630553   doi: 10.7527/S1000-6893.2024.30553

Special Column: Deep-Space Optoelectronic Measurement and Intelligent Perception Technology

Control of lunar landers based on safe reinforcement learning

Min YANG, Guanjun LIU(), Ziyuan ZHOU   

  1. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
  • Received: 2024-04-19 Revised: 2024-05-07 Accepted: 2024-07-24 Online: 2024-08-21 Published: 2024-08-20
  • Contact: Guanjun LIU E-mail: liuguanjun@tongji.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62172299); Open Fund of the Space Optoelectronic Measurement and Perception Laboratory, Beijing Institute of Control Engineering (LabSOMP-2023-03); Fundamental Research Funds for the Central Universities (2023-4-YB-05); Shanghai Science and Technology Innovation Action Plan (22511105500)


Abstract:

In lunar landing missions, the lander must perform precise operations in an extreme environment and often faces communication delays, factors that severely limit the real-time control capability of ground stations. To address these challenges, this study proposes a safety-enhancement framework for Deep Reinforcement Learning (DRL) based on the Semi-Markov Decision Process (SMDP), aiming to improve the operational safety of autonomous spacecraft landing. To compress the state space while preserving the key characteristics of the decision-making process, the framework compresses the Markov Decision Process (MDP) of historical trajectories into an SMDP and constructs an abstract SMDP state-transition graph from the compressed trajectory data. It then identifies potentially risky key state-action pairs and applies real-time monitoring and intervention, effectively improving the safety of autonomous landing. A reverse breadth-first search is used to find the state-action pairs that have a decisive impact on mission outcomes, and real-time adjustment of the model is realized through a state-action monitor. Experimental results show that, without additional sensors or significant changes to the existing system configuration, the framework raises the mission success rate of the lunar lander in a simulated environment by up to 22% on pre-trained Deep Q-Network (DQN), Dueling DQN, and DDQN models, and improves safety by up to 42% under the preset safety evaluation criteria. In addition, simulation results in a virtual environment demonstrate the framework's practical potential for complex space missions such as lunar landing, effectively improving operational safety and efficiency.
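The compression step described above — merging consecutive MDP steps that share an abstract state into SMDP segments with a sojourn time, then building an abstract state-transition graph from the segments — can be sketched as follows. This is a minimal illustration under assumed data structures (tuple-based trajectories, a nested-dict graph); the function names and representation are illustrative, not taken from the paper:

```python
from collections import defaultdict

def compress_to_smdp(trajectory, abstract):
    """Compress an MDP trajectory into SMDP segments.

    trajectory: list of (state, action, reward) tuples
    abstract:   function mapping a concrete state to an abstract label
    Consecutive steps sharing an abstract label are merged into one SMDP
    segment carrying a duration (sojourn time) and cumulative reward.
    """
    segments = []
    for state, action, reward in trajectory:
        label = abstract(state)
        if segments and segments[-1]["label"] == label:
            seg = segments[-1]
            seg["duration"] += 1
            seg["reward"] += reward
            seg["last_action"] = action  # action taken most recently in this segment
        else:
            segments.append({"label": label, "duration": 1,
                             "reward": reward, "last_action": action})
    return segments

def build_transition_graph(all_segments):
    """Abstract SMDP state-transition graph over many compressed trajectories:
    graph[label][(action, next_label)] counts observed transitions."""
    graph = defaultdict(lambda: defaultdict(int))
    for segments in all_segments:
        for a, b in zip(segments, segments[1:]):
            graph[a["label"]][(a["last_action"], b["label"])] += 1
    return graph
```

Because each segment keeps only an abstract label, a duration, and an exit action, the graph stays small even when the underlying trajectories are long, which is what makes the downstream search and monitoring tractable.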

Key words: deep reinforcement learning, autonomous landing, abstract SMDP state transition diagram, safety enhancement, real-time monitoring, reverse breadth-first search
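The reverse breadth-first search and the run-time state-action monitor mentioned in the abstract might be sketched as below, using the same assumed nested-dict graph representation. `critical_pairs`, `monitored_action`, and the fallback policy are hypothetical names for illustration, not the authors' implementation:

```python
from collections import deque, defaultdict

def critical_pairs(graph, failure_states):
    """Reverse breadth-first search from failure states.

    graph: {label: {(action, next_label): count}} abstract transition graph
    Returns the set of abstract (state, action) pairs from which a failure
    state is reachable, i.e. the pairs worth monitoring at run time.
    """
    # Reverse adjacency: next_label -> [(label, action)]
    reverse = defaultdict(list)
    for s, edges in graph.items():
        for (a, t), _count in edges.items():
            reverse[t].append((s, a))

    risky, seen = set(), set(failure_states)
    frontier = deque(failure_states)
    while frontier:
        t = frontier.popleft()
        for s, a in reverse[t]:
            risky.add((s, a))
            if s not in seen:
                seen.add(s)
                frontier.append(s)
    return risky

def monitored_action(policy, state, abstract, risky, fallback):
    """Run-time monitor: override the policy's choice when the abstract
    (state, action) pair was flagged by the reverse search."""
    action = policy(state)
    if (abstract(state), action) in risky:
        return fallback(state)  # substitute a safer action
    return action
```

Wrapping a pre-trained policy (e.g. a DQN's greedy action) in `monitored_action` is what lets the framework intervene without retraining the network or changing the sensor suite.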

CLC number: