Acta Aeronautica et Astronautica Sinica > 2024, Vol. 45 Issue (12): 329446-329446   doi: 10.7527/S1000-6893.2023.29446

Decision-making in close-range air combat based on reinforcement learning and population game

Baolai WANG1, Xianzhong GAO2, Tao XIE1, Zhongxi HOU2

  1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
  2. College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China
  • Received: 2023-08-15 Revised: 2023-09-06 Accepted: 2023-10-31 Online: 2023-11-02 Published: 2023-11-01
  • Contact: Tao XIE, E-mail: hamishxie@vip.sina.com
  • Supported by:
    National Natural Science Foundation of China (61903369); Natural Science Foundation of Hunan Province (2018JJ3587)

Abstract:

With the development of artificial intelligence and Unmanned Aerial Vehicle (UAV) technologies, intelligent decision-making in close-range air combat has attracted extensive attention worldwide. To address the problems of overfitting and strategy cycling that arise when traditional reinforcement learning is applied to intelligent decision-making in close-range air combat, a training paradigm for air combat decision models based on a population game is proposed. By constructing a population of multiple UAV agents and assigning each agent different reward weight coefficients, diversified risk preferences among the agents are realized. Adversarial training among agents with different risk preferences effectively avoids overfitting and strategy cycling. During training, each UAV agent in the population adaptively optimizes its reward weight coefficients according to the outcomes of confrontations with different opponent strategies. In numerical simulation experiments, Agent 5 and Agent 3 from population game training defeated the intelligent decision models obtained by expert-system adversarial training and by self-play training with win rates of 88% and 85%, respectively, verifying the effectiveness of the algorithm. In addition, further experiments demonstrate the necessity of dynamically adjusting the weight coefficients in the population game training paradigm, and verify the generality of the proposed paradigm on heterogeneous aircraft types.

Key words: close-range air combat, intelligent decision-making, reinforcement learning, population game, SAC algorithm
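The population-game training loop summarized in the abstract can be caricatured in a few lines of code. The sketch below is a minimal, hypothetical illustration, not the paper's SAC-based implementation: each agent carries a single reward weight coefficient standing in for its risk preference, agents play round-robin matches, and after each match the loser adapts its coefficient toward the winner's. The names `Agent`, `play_match`, and `train_population`, and the toy win model with a hidden "ideal" risk value, are all illustrative assumptions.

```python
import random

random.seed(0)

class Agent:
    """Hypothetical population member: holds one reward weight coefficient
    (its 'risk preference') in place of the SAC policy the paper trains."""
    def __init__(self, agent_id, risk_weight):
        self.id = agent_id
        self.risk_weight = risk_weight  # reward weight coefficient
        self.wins = 0

def play_match(a, b):
    """Toy stand-in for a simulated dogfight: the agent whose risk weight
    is closer to a hidden 'ideal' value wins with higher probability."""
    ideal = 0.6
    pa = 1.0 / (1e-6 + abs(a.risk_weight - ideal))
    pb = 1.0 / (1e-6 + abs(b.risk_weight - ideal))
    return a if random.random() < pa / (pa + pb) else b

def train_population(agents, rounds=200, lr=0.05):
    """Round-robin adversarial training: after each match, the loser nudges
    its reward weight toward the winner's (a caricature of the adaptive
    weight-coefficient update described in the abstract)."""
    for _ in range(rounds):
        for i in range(len(agents)):
            for j in range(i + 1, len(agents)):
                winner = play_match(agents[i], agents[j])
                loser = agents[j] if winner is agents[i] else agents[i]
                winner.wins += 1
                loser.risk_weight += lr * (winner.risk_weight - loser.risk_weight)
    return agents

# Five agents with deliberately diverse initial risk preferences.
population = train_population([Agent(k, w) for k, w in enumerate([0.1, 0.3, 0.5, 0.7, 0.9])])
best = max(population, key=lambda a: a.wins)
print(best.id, round(best.risk_weight, 2))
```

The point of the caricature is the mechanism, not the numbers: because losers move their coefficients toward winners', the population's risk preferences adapt based on match outcomes rather than being fixed in advance, which is the dynamic-weight-adjustment property the paper's further experiments show to be necessary.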
