Air combat intelligent decision-making method based on self-play and deep reinforcement learning

doi:10.7527/S1000-6893.2023.28723

Electronics, Electrical Engineering and Control

Shengzhe SHAN¹,², Weiwei ZHANG¹

1. School of Aeronautics, Northwestern Polytechnical University, Xi'an 710072, China
2. 93995 Unit of the Chinese People's Liberation Army, Xi'an 710306, China

Received: 2023-03-21  Revised: 2023-06-12  Accepted: 2023-08-29  Online: 2023-09-01  Published: 2024-02-25
Corresponding author: Weiwei ZHANG  E-mail: aeroelastic@nwpu.edu.cn
Supported by: Science and Technology Foundation of National Defense Key Laboratory (6142219190302)

Abstract

Air combat is a key step in warfare becoming three-dimensional, and intelligent air combat has become a research focus in the military field both domestically and internationally. Deep reinforcement learning is an important technical route to intelligent air combat. Because single-agent training methods struggle to produce high-level air combat opponents, a self-play training method for air combat agents is proposed, and a visualization research platform is built to develop a decision-making agent for close-range air combat. Pilot domain knowledge is embedded in the design of the agent's observations, actions, and rewards, and the agent is trained against itself until convergence. Simulation experiments show that the agent's tactical level improves steadily during self-play training. The final agent achieves a win rate of over 70% against the decision model obtained by single-agent training, and strategies similar to the human "single/double loop" tactics emerge.

Key words: air combat, artificial intelligence, deep reinforcement learning, self-play, agent

CLC number: V249

Shengzhe SHAN, Weiwei ZHANG. Air combat intelligent decision-making method based on self-play and deep reinforcement learning[J]. Acta Aeronautica et Astronautica Sinica, 2024, 45(4): 328723.




Sub-mode	Horizontal range/(°)	Pitch range/(°)	Scan time/s
HS	-10~10	-15~5	3
VS	-5~5	-10~30	3
BS	-1~1	-1.5~-0.5	0.01

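Scan and launch mode	Unlock timing	Search range
Fixed aim, locked launch	Not unlocked	2° cone
Fixed aim, unlocked launch	Unlocked after acquisition	2° cone
Fixed scan, unlocked launch	Unlocked before acquisition	5° cone
Radar-slaved, unlocked launch	Unlocked after the radar acquires the target	Slaved to the radar within a ±20° square region in the horizontal and vertical directions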

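Category	Feature	Value bounds	Dimension
Own-aircraft position	East-west coordinate/m	(0, 30 000)	1
Own-aircraft position	North-south coordinate/m	(0, 30 000)	1
Own-aircraft position	Flight altitude/m	(0, 12 000)	1
Flight state	Airspeed/(m·s^-1)	(0, 500)	2
Flight state	Indicated airspeed/(m·s^-1)	(0, 500)	2
Flight state	Mach number	(0, 1.6)	2
Flight state	Longitudinal load factor	(-1, 2)	2
Flight state	Normal load factor	(-4, 9)	2
Flight state	Lateral load factor	(-1, 1)	2
Flight state	Turn rate/((°)·s^-1)	(0, 50)*	2
Flight state	Attitude quaternion	(-1, 1)	8
Geometric situation	Own-enemy distance vector/m	(-8 000, 8 000)*	4
Geometric situation	Own-enemy velocity vector/(m·s^-1)	(-500, 500)*	6
Geometric situation	Own-enemy altitude difference/m	(-1 000, 1 000)*	1
Geometric situation	Gun aiming coefficient	(0, 1)	2
Geometric situation	Horizontal off-axis angle/(°)	(-180, 180)	2
Geometric situation	Off-axis angle/(°)	(-90, 90)	2
Geometric situation	Radar scan range/(°)	(-20, 20)	8
Geometric situation	Missile scan range/(°)	(-20, 20)	8
Geometric situation	Aspect angle/(°)	(0, 180)	2
Geometric situation	Antenna train angle/(°)	(0, 180)	2
Geometric situation	Missile maximum range/m	(0, 8 000)*	2
Geometric situation	Missile minimum range/m	(0, 3 000)*	2
Geometric situation	Own-enemy scalar distance/m	(0, 8 000)*	1
Total			67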
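
The value bounds listed above lend themselves to min-max scaling before the features are fed to a neural network. The following is a minimal sketch under that assumption: the source does not state which normalization is actually used, the target interval [-1, 1] and the clipping are choices made here, and the names `BOUNDS` and `normalize` are hypothetical. Only four of the tabulated features are included for brevity.

```python
# Hypothetical helper: scale a raw observation feature into [-1, 1]
# using the (low, high) bounds from the observation table.
# The scaling scheme is an assumption, not taken from the paper.
BOUNDS = {
    "altitude_m": (0.0, 12000.0),
    "mach": (0.0, 1.6),
    "normal_load_factor": (-4.0, 9.0),
    "horizontal_off_axis_deg": (-180.0, 180.0),
}


def normalize(name: str, value: float) -> float:
    """Clip value to its tabulated bounds, then map it linearly to [-1, 1]."""
    low, high = BOUNDS[name]
    clipped = min(max(value, low), high)
    return 2.0 * (clipped - low) / (high - low) - 1.0


if __name__ == "__main__":
    print(normalize("altitude_m", 6000.0))   # mid-range altitude
    print(normalize("mach", 5.0))            # out-of-range input is clipped
```

Clipping keeps out-of-range simulator values from producing inputs outside the interval the network was trained on.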

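Reward type	Game class	Reward name	Weight (own side)	Weight (enemy side)	Reward property
Outcome reward	Zero-sum game	Missile kill	1	-1	Sparse
Outcome reward	Zero-sum game	Gun kill	1	-1	Sparse
Outcome reward	Zero-sum game	Enemy aircraft crashes into the ground	1	-1	Sparse
Outcome reward	Zero-sum game	Flying out of bounds	-1	1	Sparse
Outcome reward	Zero-sum game	Collision/mutual kill	0	0	Sparse
Event reward	Zero-sum game	Radar illumination	0.05	-0.05	Sparse
Event reward	Zero-sum game	Radar lock	0.2	-0.2	Sparse
Event reward	Zero-sum game	Missile lock on enemy	0.3	-0.3	Sparse
Event reward	Zero-sum game	Gun aiming	0.5	-0.5	Sparse
Event reward	Zero-sum game	Launch condition achieved	0.55	-0.55	Sparse
Process reward	Zero-sum game	Angle advantage	0.005	-0.005	Dense
Process reward	Zero-sum game	Energy advantage	0.008	-0.008	Dense
Process reward	Common interest	Distance reward	-0.000 1	-0.000 1	Dense
Process reward	Common interest	Altitude-difference reward	-0.000 1	-0.000 1	Dense
Boundary reward	Zero-sum game	Control zone	0.1	-0.1	Continuous
Boundary reward	Non-game	Airspace coordinates	0.8	0.8	Continuous
Boundary reward	Non-game	Flight Mach number	0.8	0.8	Continuous
Boundary reward	Non-game	Indicated airspeed	0.8	0.8	Continuous
Boundary reward	Non-game	Distance between the two aircraft	0.8	0.8	Continuous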
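
The weight pairs in the reward table can be read as a per-step weighted sum in which zero-sum terms take opposite signs for the two sides and common-interest terms take the same sign. The sketch below illustrates that reading only. It assumes rewards combine linearly and that each signal is a non-negative scalar (1.0 for a triggered event); the source does not define the signals, and `WEIGHTS` and `step_rewards` are hypothetical names. Only four of the tabulated terms are included.

```python
from typing import Dict, Tuple

# (own_weight, enemy_weight) pairs copied from the reward table.
WEIGHTS: Dict[str, Tuple[float, float]] = {
    "missile_kill": (1.0, -1.0),        # outcome reward, zero-sum
    "radar_lock": (0.2, -0.2),          # event reward, zero-sum
    "angle_advantage": (0.005, -0.005), # process reward, zero-sum
    "distance": (-0.0001, -0.0001),     # process reward, common interest
}


def step_rewards(signals: Dict[str, float]) -> Tuple[float, float]:
    """Return (own, enemy) rewards as weighted sums of the signal values.

    The linear combination is an assumption made for illustration.
    """
    own = sum(WEIGHTS[name][0] * value for name, value in signals.items())
    enemy = sum(WEIGHTS[name][1] * value for name, value in signals.items())
    return own, enemy


if __name__ == "__main__":
    print(step_rewards({"radar_lock": 1.0, "angle_advantage": 1.0}))
```

With only zero-sum signals active, the two returned rewards cancel; the common-interest terms penalize both sides equally.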

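Name	Hyperparameter set 1	Hyperparameter set 2	Hyperparameter set 3
Population size M	100	100	200
Game probability ε	0.5	0.8	0.5
Save interval u	2×10³	2×10³	4×10³
Opponent reset interval v	100	100	100
Training switch interval w	2×10⁵	2×10⁵	4×10⁵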
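
One plausible reading of these hyperparameters is a snapshot pool of past policies. Under that reading, which is an assumption and not confirmed by the text here, M caps the number of stored snapshots, a snapshot is saved every u steps, the opponent is re-drawn every v steps, and with probability ε the opponent is the current policy rather than a stored snapshot. The sketch below shows that scheme. `OpponentPool` and `run` are hypothetical names, policies are stood in for by integer step ids, and the training switch interval w is not modeled.

```python
import random
from collections import deque
from typing import Tuple


class OpponentPool:
    """Hypothetical pool of past policy snapshots for self-play."""

    def __init__(self, capacity: int, latest_prob: float, seed: int = 0):
        # deque(maxlen=...) silently drops the oldest snapshot when full.
        self.snapshots = deque(maxlen=capacity)
        self.latest_prob = latest_prob
        self.rng = random.Random(seed)

    def save(self, policy_id: int) -> None:
        self.snapshots.append(policy_id)

    def sample(self, current_id: int) -> int:
        # Face the current policy with probability latest_prob (or when the
        # pool is empty); otherwise draw a stored snapshot uniformly.
        if not self.snapshots or self.rng.random() < self.latest_prob:
            return current_id
        return self.rng.choice(list(self.snapshots))


def run(total_steps: int, m: int, eps: float, u: int, v: int) -> Tuple[OpponentPool, int]:
    """Walk through the save/reset schedule; no learning is performed."""
    pool = OpponentPool(capacity=m, latest_prob=eps)
    opponent = 0
    for step in range(1, total_steps + 1):
        if step % u == 0:
            pool.save(step)                      # snapshot every u steps
        if step % v == 0:
            opponent = pool.sample(current_id=step)  # re-draw every v steps
    return pool, opponent


if __name__ == "__main__":
    pool, opponent = run(total_steps=10_000, m=3, eps=0.5, u=2_000, v=100)
    print(list(pool.snapshots), opponent)
```

Sampling from older snapshots as well as the current policy is a common way to keep a self-play learner from overfitting to its most recent self.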
