Acta Aeronautica et Astronautica Sinica, 2023, Vol. 44, Issue 22: 628871   doi: 10.7527/S1000-6893.2023.28871

Value-filter based air-combat maneuvering optimization
(original Chinese title: 基于价值滤波的空战机动决策优化方法)

Yupeng FU1, Xiangyang DENG1,2, Ziqiang ZHU1, Limin ZHANG1

  1. School of Aviation Support, Naval Aeronautical University, Yantai 264001, China
  2. Department of Automation, Tsinghua University, Beijing 100084, China
  • Received: 2023-04-14 Revised: 2023-05-30 Accepted: 2023-06-14 Online: 2023-06-29 Published: 2023-06-27
  • Contact: Xiangyang DENG E-mail: skl18@mails.tsinghua.edu.cn
  • Supported by:
    National Natural Science Foundation of China (91538201); Foundation Program for National Defense High Level Talent (202220539)

Abstract:

Traditional reinforcement learning algorithms applied to air-combat maneuvering decision optimization in complex state spaces suffer from low utilization of experience data and poor convergence. To address these problems, this paper analyzes the concept and principle of the value filter, proposes the Demonstration Policy Constraint (DPC) algorithm based on it, and builds a maneuvering decision optimization method and workflow around DPC. The algorithm uses the value filter to extract advantageous data from both the replay buffer and the demonstration buffer, and uses these data to constrain the direction of policy optimization on the basis of state value. Simulations use the F-16 aerodynamic model from the JSBSim platform. The results show that the algorithm converges markedly more efficiently, that the sub-optimality of the demonstration policy is mitigated, and that the resulting maneuvering decision model behaves intelligently.

Key words: value filter, policy constraint, maneuvering decision, reinforcement learning, imitation learning
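
The abstract describes the value filter only in words and does not give its formula. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that a transition counts as "advantageous" when its observed return exceeds the critic's state-value estimate, and that the policy constraint is a squared-error penalty pulling the policy toward the actions of the kept transitions. The names `value_filter`, `constraint_loss`, and `margin` are hypothetical.

```python
from typing import List, Sequence


def value_filter(returns: Sequence[float],
                 values: Sequence[float],
                 margin: float = 0.0) -> List[bool]:
    """Mark samples whose observed return beats the critic's estimate.

    Assumption: "advantage" means return minus state value; the paper's
    exact criterion is not stated in the abstract.
    """
    return [(g - v) > margin for g, v in zip(returns, values)]


def constraint_loss(policy_actions: Sequence[Sequence[float]],
                    buffer_actions: Sequence[Sequence[float]],
                    mask: Sequence[bool]) -> float:
    """Mean squared distance to buffer actions over the kept samples only."""
    kept = [
        sum((p - b) ** 2 for p, b in zip(pa, ba))
        for pa, ba, keep in zip(policy_actions, buffer_actions, mask)
        if keep
    ]
    # With nothing kept, the constraint contributes no penalty.
    return sum(kept) / len(kept) if kept else 0.0


# Toy batch drawn from a replay or demonstration buffer (made-up numbers).
returns = [1.0, 0.2, 0.8]         # observed returns
values = [0.5, 0.5, 0.9]          # critic estimates V(s)
policy_actions = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
buffer_actions = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]

mask = value_filter(returns, values)
loss = constraint_loss(policy_actions, buffer_actions, mask)
```

In this toy batch only the first transition passes the filter, so the penalty is computed from that sample alone. Under this reading, a demonstration sample that the critic already judges to be worse than expected is dropped and therefore cannot pull the policy toward a sub-optimal demonstration.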

CLC number: