Acta Aeronautica et Astronautica Sinica (航空学报), 2024, Vol. 45, Issue 17: 529460   doi: 10.7527/S1000-6893.2023.29460


Intelligent maneuvering decision-making in two-UCAV cooperative air combat based on improved MADDPG with hybrid hyper network

Wentao LI1, Feng FANG1, Zhenya WANG2, Yichao ZHU1, Dongliang PENG1

  1. School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
    2. China Academy of Aerospace Science and Innovation, Beijing 100076, China
  • Received: 2023-08-18  Revised: 2023-09-26  Accepted: 2023-10-24  Online: 2023-11-02  Published: 2023-11-01
  • Contact: Feng FANG  E-mail: fangf@hdu.edu.cn
  • Supported by:
    the Fundamental Research Funds for the Provincial Universities of Zhejiang (GK209907299001-021)


Abstract:

In cooperative air combat between two Unmanned Combat Aerial Vehicles (UCAVs) with only local observations, cooperative rewards are hard to quantify, coordination between agents is inefficient, and maneuvering decisions are often poor. To address these problems, an intelligent maneuver decision-making method is proposed based on the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm improved with a hybrid hyper network. A Centralized Training with Decentralized Execution (CTDE) architecture is adopted so that each UCAV agent can be trained toward globally optimal maneuvering decisions from its local observations. A reward function is designed for each UCAV agent that combines a local reward for fast guidance toward an attack advantage with a global reward for winning the engagement. A hybrid hyper network is then introduced to mix the Q values estimated by the individual agents through a monotonic nonlinear mapping into a global policy value function, which guides the parameter updates of the decentralized Actor networks and thereby alleviates the credit assignment problem in multi-agent deep reinforcement learning. Extensive simulation results show that, compared with the standard MADDPG method, the proposed method better guides each UCAV toward globally coordinated optimal maneuver commands and achieves a higher winning rate against the same opponent.
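To make the mixing step concrete, the following is a minimal sketch, not the authors' implementation, of a QMIX-style monotonic mixing module ("hybrid hyper network") in PyTorch that combines the two per-UCAV Q estimates into a single global value. The class name MixingHyperNet, the state and embedding dimensions, and the two-layer mixer structure are assumptions made only for illustration.

```python
# Sketch (assumed names/sizes): QMIX-style hybrid hyper network that mixes
# per-UCAV Q values into a global Q_tot, conditioned on the global state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingHyperNet(nn.Module):
    """Monotonic, nonlinear mixing of per-agent Q values.

    Hyper networks generate the mixing weights from the global state; taking
    the absolute value of those weights keeps Q_tot non-decreasing in every
    individual Q value."""

    def __init__(self, n_agents: int = 2, state_dim: int = 32, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hyper networks: map the global state to the mixer's weights/biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) per-agent Q values; state: (batch, state_dim).
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2      # (batch, 1, 1)
        return q_tot.view(b, 1)

# Usage sketch: mix the two critics' Q values during centralized training.
mixer = MixingHyperNet(n_agents=2, state_dim=32, embed_dim=32)
agent_qs = torch.randn(64, 2)          # Q values from the two UCAVs' critics
global_state = torch.randn(64, 32)     # assumed global state representation
q_tot = mixer(agent_qs, global_state)  # (64, 1) global policy value
```

Because the hyper-network outputs pass through an absolute value, every mixing weight is non-negative, so the global value is monotonic in each agent's Q estimate; during centralized training the gradient of the global value can then flow back through each agent's own Q value to update its decentralized Actor, which is one way the credit assignment described in the abstract can be realized.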

Key words: unmanned combat aerial vehicle, air combat maneuvering decision, Multi-Agent Deep Deterministic Policy Gradient (MADDPG), hybrid hyper network, centralized training with decentralized execution
