航空学报 (Acta Aeronautica et Astronautica Sinica), 2023, Vol. 44, Issue 11: 327596. doi: 10.7527/S1000-6893.2022.27596

Trust region policy optimization guidance algorithm for intercepting maneuvering targets

Wenxue CHEN, Changsheng GAO, Wuxing JING

  1. School of Astronautics, Harbin Institute of Technology, Harbin 150001, China
  • Received: 2022-06-09  Revised: 2022-06-21  Accepted: 2022-07-21  Online: 2023-06-15  Published: 2022-07-25
  • Contact: Changsheng GAO, E-mail: gaocs@hit.edu.cn
  • Supported by:
    National Natural Science Foundation of China (12072090)

Abstract:

Considering the high speed and strong maneuverability of hypersonic vehicles in near space, this paper proposes a deep reinforcement learning guidance algorithm based on Trust Region Policy Optimization (TRPO) to improve the accuracy, robustness, and intelligence of terminal guidance against targets with different initial states and different maneuver modes. The TRPO-based guidance algorithm consists of two policy (action) networks and one critic network, and maps the state of the target-interceptor relative motion system directly to the interceptor's guidance command in an end-to-end manner. During training, the continuous action space and the state space are chosen appropriately, and a reward function that balances energy consumption, relative distance, and other factors is constructed to accelerate convergence. The trained agent model is then tested in interception scenarios with different task settings. Simulation results show that, compared with the traditional Proportional Navigation (PN) guidance law and the Improved Proportional Navigation (IPN) guidance law, the proposed algorithm achieves smaller miss distances, more stable interception performance, and better robustness in both trained and previously unseen scenarios, and can be deployed on computers with a wide range of hardware configurations.
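The abstract describes two concrete ingredients: an agent built from two policy (action) networks plus one critic network, and a reward function that weighs energy consumption against relative distance. The paper publishes no code, so the sketch below is only a minimal, hypothetical PyTorch illustration of those ingredients; the network sizes, the state/action dimensions (STATE_DIM, ACT_DIM), the reward weights (w_r, w_a), and the reading of "two policy networks" as TRPO's current and previous policies are all assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the paper publishes no code. Network sizes,
# state/action dimensions, and all reward weights below are assumptions
# made for demonstration, not the authors' implementation.
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(64, 64)):
    """Small fully connected network used for both policy and critic."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.Tanh()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

STATE_DIM = 6  # e.g., relative position/velocity of target w.r.t. interceptor
ACT_DIM = 2    # e.g., continuous pitch/yaw acceleration commands

# "Two policy networks": TRPO keeps the current policy and a frozen copy of
# the previous policy in order to evaluate the KL-divergence trust-region
# constraint between successive policy updates.
policy = mlp(STATE_DIM, ACT_DIM)
policy_old = mlp(STATE_DIM, ACT_DIM)
policy_old.load_state_dict(policy.state_dict())

# Critic network: state-value estimate used to compute advantages.
critic = mlp(STATE_DIM, 1)

def reward(r, r_prev, action, done, w_r=1.0, w_a=0.01,
           hit_radius=5.0, hit_bonus=100.0):
    """Shaped reward trading off closing distance against control energy.

    r, r_prev : interceptor-target relative distance at this and the
                previous step (m)
    action    : commanded acceleration vector output by the policy (m/s^2)
    done      : True when the engagement terminates
    """
    r_dist = w_r * (r_prev - r)                   # reward shrinking distance
    r_energy = -w_a * sum(a * a for a in action)  # penalize control effort
    r_term = hit_bonus if (done and r < hit_radius) else 0.0
    return r_dist + r_energy + r_term
```

Keeping a frozen copy of the previous policy is what allows TRPO to bound the policy update inside a KL trust region, which is presumably what motivates the "two policy networks, one critic" structure the abstract names.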

Key words: deep reinforcement learning, trust region policy optimization, near-space interception, missile terminal guidance, maneuvering targets, Markov process
