Acta Aeronautica et Astronautica Sinica (航空学报), 2024, Vol. 45, Issue 17: 529460   doi: 10.7527/S1000-6893.2023.29460


Intelligent maneuvering decision-making in two-UCAV cooperative air combat based on improved MADDPG with hybrid hyper network

Wentao LI1, Feng FANG1, Zhenya WANG2, Yichao ZHU1, Dongliang PENG1

  1. School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
    2. China Academy of Aerospace Science and Innovation, Beijing 100076, China
  • Received: 2023-08-18  Revised: 2023-09-26  Accepted: 2023-10-24  Online: 2023-11-02  Published: 2023-11-01
  • Contact: Feng FANG  E-mail: fangf@hdu.edu.cn
  • Supported by:
    the Fundamental Research Funds for the Provincial Universities of Zhejiang (GK209907299001-021)


Abstract:

In cooperative air combat between two Unmanned Combat Aerial Vehicles (UCAVs) with only local observations, cooperative rewards are hard to quantify, coordination between agents is inefficient, and maneuvering decisions are often poor. To address these problems, an intelligent maneuver decision-making method is proposed based on the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm improved with a hybrid hyper network. A Centralized Training with Decentralized Execution (CTDE) architecture is adopted so that each UCAV agent can be trained toward globally optimal maneuvering decisions from its local observations. A reward function is designed for each UCAV agent that combines a local reward for fast guidance toward an attack advantage with a global reward for winning the engagement. A hybrid hyper network is then introduced to mix the Q values estimated by the individual agents through a monotonic nonlinear mapping into a global policy value function, which guides the parameter updates of the decentralized Actor networks and thereby alleviates the credit assignment problem in multi-agent deep reinforcement learning. Extensive simulation results show that, compared with the standard MADDPG method, the proposed method better guides each UCAV toward globally coordinated optimal maneuver commands and achieves a higher winning rate against the same opponent.
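To make the mixing step concrete, the following is a minimal sketch, not the authors' implementation, of a QMIX-style monotonic mixing module ("hybrid hyper network") in PyTorch that combines the two per-UCAV Q estimates into a single global value. The class name MixingHyperNet, the state and embedding dimensions, and the two-layer mixer structure are assumptions made only for illustration.

```python
# Sketch (assumed names/sizes): QMIX-style hybrid hyper network that mixes
# per-UCAV Q values into a global Q_tot, conditioned on the global state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingHyperNet(nn.Module):
    """Monotonic, nonlinear mixing of per-agent Q values.

    Hyper networks generate the mixing weights from the global state; taking
    the absolute value of those weights keeps Q_tot non-decreasing in every
    individual Q value."""

    def __init__(self, n_agents: int = 2, state_dim: int = 32, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hyper networks: map the global state to the mixer's weights/biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) per-agent Q values; state: (batch, state_dim).
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2      # (batch, 1, 1)
        return q_tot.view(b, 1)

# Usage sketch: mix the two critics' Q values during centralized training.
mixer = MixingHyperNet(n_agents=2, state_dim=32, embed_dim=32)
agent_qs = torch.randn(64, 2)          # Q values from the two UCAVs' critics
global_state = torch.randn(64, 32)     # assumed global state representation
q_tot = mixer(agent_qs, global_state)  # (64, 1) global policy value
```

Because the hyper-network outputs pass through an absolute value, every mixing weight is non-negative, so the global value is monotonic in each agent's Q estimate; during centralized training the gradient of the global value can then flow back through each agent's own Q value to update its decentralized Actor, which is one way the credit assignment described in the abstract can be realized.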

Key words: unmanned combat aerial vehicle, air combat maneuvering decision, Multi-Agent Deep Deterministic Policy Gradient (MADDPG), hybrid hyper network, centralized training with decentralized execution
