1 |
WANG Z A, LI H, WU H L, et al. Improving maneuver strategy in air combat by alternate freeze games with a deep reinforcement learning algorithm[J]. Mathematical Problems in Engineering, 2020, 2020: 1-17.
|
2 |
马文, 李辉, 王壮, 等. 基于深度随机博弈的近距空战机动决策[J]. 系统工程与电子技术, 2021, 43(2): 443-451.
|
|
MA W, LI H, WANG Z, et al. Close air combat maneuver decision based on deep stochastic game[J]. Systems Engineering and Electronics, 2021, 43(2): 443-451 (in Chinese).
|
3 |
李宪港, 李强. 典型智能博弈系统技术分析及指控系统智能化发展展望[J]. 智能科学与技术学报, 2020, 2(1): 36-42.
|
|
LI X G, LI Q. Technical analysis of typical intelligent game system and development prospect of intelligent command and control system[J]. Chinese Journal of Intelligent Science and Technology, 2020, 2(1): 36-42 (in Chinese).
|
4 |
POPE A P, IDE J S, MIĆOVIĆ D, et al. Hierarchical reinforcement learning for air-to-air combat[C]∥2021 International Conference on Unmanned Aircraft Systems (ICUAS). Piscataway: IEEE Press, 2021: 275-284.
|
5 |
SUFIYAN D, WIN L T S, WIN S K H, et al. A reinforcement learning approach for control of a nature-inspired aerial vehicle[C]∥2019 International Conference on Robotics and Automation (ICRA). Piscataway: IEEE Press, 2019: 6030-6036.
|
6 |
ZHEN Y, HAO M R, SUN W D. Deep reinforcement learning attitude control of fixed-wing UAVs[C]∥2020 3rd International Conference on Unmanned Systems (ICUS). Piscataway: IEEE Press, 2020: 239-244.
|
7 |
WANG C, YAN C, XIANG X, et al. A continuous actor-critic reinforcement learning approach to flocking with fixed-wing UAVs[C]∥Asian Conference on Machine Learning. Berlin: Springer, 2020: 239-244.
|
8 |
周攀, 黄江涛, 章胜, 等. 基于深度强化学习的智能空战决策与仿真[J]. 航空学报, 2023, 44(4): 126731.
|
|
ZHOU P, HUANG J T, ZHANG S, et al. Intelligent air combat decision making and simulation based on deep reinforcement learning[J]. Acta Aeronautica et Astronautica Sinica, 2023, 44(4): 126731 (in Chinese).
|
9 |
吴宜珈, 赖俊, 陈希亮, 等. 强化学习算法在超视距空战辅助决策上的应用研究[J]. 航空兵器, 2021, 28(2): 55-61.
|
|
WU Y J, LAI J, CHEN X L, et al. Research on the application of reinforcement learning algorithm in decision support of beyond-visual-range air combat[J]. Aero Weaponry, 2021, 28(2): 55-61 (in Chinese).
|
10 |
王欢, 周旭, 邓亦敏, 等. 分层决策多机空战对抗方法[J]. 中国科学: 信息科学, 2022, 52(12): 2225-2238.
|
|
WANG H, ZHOU X, DENG Y M, et al. A hierarchical decision-making method for multi-aircraft air combat confrontation[J]. Scientia Sinica (Informationis), 2022, 52(12): 2225-2238 (in Chinese).
|
11 |
POMERLEAU D A. ALVINN: An autonomous land vehicle in a neural network[C]∥Conference and Workshop on Neural Information Processing Systems. New York: ACM, 1989: 305-313.
|
12 |
BOJARSKI M, DEL TESTA D, DWORAKOWSKI D, et al. End to end learning for self-driving cars[DB/OL]. arXiv preprint: 1604.07316, 2016.
|
13 |
GIUSTI A, GUZZI J, CIREŞAN D C, et al. A machine learning approach to visual perception of forest trails for mobile robots[J]. IEEE Robotics and Automation Letters, 2016, 1(2): 661-667.
|
14 |
NAKANISHI J, MORIMOTO J, ENDO G, et al. Learning from demonstration and adaptation of biped locomotion[J]. Robotics and Autonomous Systems, 2004, 47(2-3): 79-91.
|
15 |
ROSS S, GORDON G J, BAGNELL J A. A reduction of imitation learning and structured prediction to no-regret online learning[C]∥Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. New York: PMLR, 2011: 627-635.
|
16 |
NG A Y, RUSSELL S J. Algorithms for inverse reinforcement learning[C]∥Proceedings of the Seventeenth International Conference on Machine Learning. New York: ACM, 2000: 663-670.
|
17 |
ZIEBART B D, MAAS A, BAGNELL J A, et al. Maximum entropy inverse reinforcement learning[C]∥Proceedings of the 23rd National Conference on Artificial Intelligence. New York: ACM, 2008: 1433-1438.
|
18 |
FINN C, LEVINE S, ABBEEL P. Guided cost learning: Deep inverse optimal control via policy optimization[C]∥Proceedings of the 33rd International Conference on Machine Learning. New York: ACM, 2016: 49-58.
|
19 |
NAIR A, MCGREW B, ANDRYCHOWICZ M, et al. Overcoming exploration in reinforcement learning with demonstrations[C]∥2018 IEEE International Conference on Robotics and Automation (ICRA). Piscataway: IEEE Press, 2018: 6292-6299.
|
20 |
XU H R, ZHAN X Y, YIN H L, et al. Discriminator-weighted offline imitation learning from suboptimal demonstrations[C]∥Proceedings of the 39th International Conference on Machine Learning. New York: ACM, 2022: 24725-24742.
|
21 |
VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.
|
22 |
WANG P, LIU D P, CHEN J Y, et al. Decision making for autonomous driving via augmented adversarial inverse reinforcement learning[C]∥2021 IEEE International Conference on Robotics and Automation (ICRA). Piscataway: IEEE Press, 2021: 1036-1042.
|
23 |
俞扬, 詹德川, 周志华, 等. 基于模仿学习和强化学习算法的无人机飞行控制方法: CN112162564B[P]. 2021-09-28.
|
|
YU Y, ZHAN D C, ZHOU Z H, et al. Unmanned aerial vehicle flight control method based on imitation learning and reinforcement learning algorithms: CN112162564B[P]. 2021-09-28 (in Chinese).
|
24 |
ZHU Z D, LIN K X, DAI B, et al. Self-adaptive imitation learning: Learning tasks with delayed rewards from sub-optimal demonstrations[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(8): 9269-9277.
|
25 |
SCHULMAN J, LEVINE S, MORITZ P, et al. Trust region policy optimization[C]∥International Conference on Machine Learning. New York: ACM, 2015: 1889-1897.
|
26 |
SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[DB/OL]. arXiv preprint: 1707.06347, 2017.
|
27 |
李茹杨, 彭慧民, 李仁刚, 等. 强化学习算法与应用综述[J]. 计算机系统应用, 2020, 29(12): 13-25.
|
|
LI R Y, PENG H M, LI R G, et al. Overview on algorithms and applications for reinforcement learning[J]. Computer Systems and Applications, 2020, 29(12): 13-25 (in Chinese).
|
28 |
OH J, GUO Y, SINGH S, et al. Self-imitation learning[C]∥Proceedings of the 35th International Conference on Machine Learning. New York: ACM, 2018: 3878-3887.
|
29 |
HAARNOJA T, TANG H R, ABBEEL P, et al. Reinforcement learning with deep energy-based policies[C]∥Proceedings of the 34th International Conference on Machine Learning-Volume 70. New York: ACM, 2017: 1352-1361.
|
30 |
LI C, WU F G, ZHAO J S. Accelerating self-imitation learning from demonstrations via policy constraints and Q-ensemble[C]∥2023 International Joint Conference on Neural Networks (IJCNN). Piscataway: IEEE Press, 2023: 1-8.
|
31 |
SCHULMAN J, MORITZ P, LEVINE S, et al. High-dimensional continuous control using generalized advantage estimation[DB/OL]. arXiv preprint: 1506.02438, 2015.
|
32 |
KINGMA D P, BA J. Adam: A method for stochastic optimization[C]∥International Conference on Learning Representations (ICLR). San Diego, 2015.
|
33 |
MCGREW J S, HOW J P, WILLIAMS B, et al. Air-combat strategy using approximate dynamic programming[J]. Journal of Guidance, Control, and Dynamics, 2010, 33(5): 1641-1654.
|
34 |
FUJIMOTO S, GU S S. A minimalist approach to offline reinforcement learning[C]∥Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS). New York: ACM, 2021: 20132-20145.
|