Value-filter based air-combat maneuvering optimization

doi:10.7527/S1000-6893.2023.28871

Yupeng FU¹, Xiangyang DENG¹,², Ziqiang ZHU¹, Limin ZHANG¹

1. School of Aviation Support, Naval Aeronautical University, Yantai 264001, China
2. Department of Automation, Tsinghua University, Beijing 100084, China

Received: 2023-04-14  Revised: 2023-05-30  Accepted: 2023-06-14  Online: 2023-06-29  Published: 2023-06-27
Contact: Xiangyang DENG  E-mail: skl18@mails.tsinghua.edu.cn
Supported by: National Natural Science Foundation of China (91538201); Foundation Program for National Defense High Level Talent (202220539)

Abstract:

To address the low data utilization and poor convergence of traditional reinforcement learning algorithms in air-combat maneuvering decision optimization with a large state space, the concept and principle of the value filter are proposed and analyzed. Based on the value filter, a reinforcement learning algorithm named Demonstration Policy Constraint (DPC) is presented, and a maneuvering decision optimization method and workflow built on the DPC algorithm are designed. The value filter extracts advantageous data from the replay buffer and the demonstration buffer and uses them to constrain the direction of policy optimization on the basis of state value. Simulations use the JSBSim aerodynamic model of the F-16 aircraft. The results show that the convergence efficiency of the algorithm is significantly improved, that the sub-optimality of the demonstration policy is mitigated, and that the resulting maneuvering decision model behaves intelligently.

Key words: value filter, policy constraint, maneuvering decision, reinforcement learning, imitation learning
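
The abstract describes the value filter only at a high level, so the Python sketch below is illustrative and is not the authors' implementation. It assumes one plausible reading: a transition passes the filter when its observed return exceeds the critic's state-value estimate, and the policy is then pulled toward the actions of the transitions that pass. The names `Transition`, `value_filter`, `constraint_loss`, `value_fn`, `policy_mean`, and `margin` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Transition:
    state: Sequence[float]   # observation at this step
    action: Sequence[float]  # action taken (by the agent or the demonstrator)
    ret: float               # discounted return observed from this state


def value_filter(
    batch: List[Transition],
    value_fn: Callable[[Sequence[float]], float],
    margin: float = 0.0,
) -> List[Transition]:
    """Keep transitions whose return beats the critic's estimate by `margin`.

    Assumption: "advantageous data" means ret - V(state) > margin.
    """
    return [t for t in batch if t.ret - value_fn(t.state) > margin]


def constraint_loss(
    batch: List[Transition],
    value_fn: Callable[[Sequence[float]], float],
    policy_mean: Callable[[Sequence[float]], Sequence[float]],
) -> float:
    """Mean squared distance between the policy's action and filtered actions.

    Assumption: the constraint is a simple behavior-cloning term applied only
    to samples that pass the filter; it returns 0.0 when nothing passes.
    """
    kept = value_filter(batch, value_fn)
    if not kept:
        return 0.0
    total = 0.0
    for t in kept:
        mu = policy_mean(t.state)
        total += sum((m - a) ** 2 for m, a in zip(mu, t.action))
    return total / len(kept)
```

Under this reading, a demonstration sample that the critic already rates above its observed return contributes nothing to the constraint, which is one way a sub-optimal demonstration policy could stop limiting the learned policy once the agent surpasses it.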

CLC number: V249.12

Yupeng FU, Xiangyang DENG, Ziqiang ZHU, Limin ZHANG. Value-filter based air-combat maneuvering optimization[J]. Acta Aeronautica et Astronautica Sinica, 2023, 44(22): 628871-628871.

Table 1  Parameter settings of the PPO-DPC algorithm

Name	Value	Name	Value
Actor network	23×(256)₄×4	Number of workers $N$	6
Critic network	23×(256)₄×1	$K$	8
$α$, $β$	$1×10^{-4}$	$D^{n}_{on}$ size	1024
$ε$	0.2	Batch size	256
$γ$	0.998	$D_{E}$ size	$2×10^{5}$
$λ$	0.95	$D_{off}$ size	$1×10^{6}$
$η_{0}$	1	$ζ$	$5×10^{-6}$
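
To make two of the tabulated values concrete, the Python sketch below computes generalized advantage estimates for a short trajectory. This page does not define the symbols, so reading $γ$ as the discount factor and $λ$ as the GAE parameter is an assumption based on common PPO usage. The function name `gae` and its arguments are hypothetical.

```python
from typing import List, Sequence

GAMMA = 0.998  # γ from Table 1, read here as the discount factor (assumption)
LAMBDA = 0.95  # λ from Table 1, read here as the GAE parameter (assumption)


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = GAMMA,
    lam: float = LAMBDA,
) -> List[float]:
    """Generalized advantage estimation over one trajectory segment.

    `values[t]` is the critic's estimate for step t and `last_value` is the
    estimate for the state that follows the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    # Walk backwards so each step reuses the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With $γ$ = 0.998 the effective horizon is roughly 1/(1 − 0.998) = 500 steps, which suits long engagements where the reward for a maneuver arrives late.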