Acta Aeronautica et Astronautica Sinica > 2025, Vol. 46 Issue (S1): 732184-732184   doi: 10.7527/S1000-6893.2025.32184

Outstanding Paper of the 2nd Aerospace Frontiers Conference / 27th Annual Meeting of the China Association for Science and Technology


Design of reward functions for helicopter attitude control in reinforcement learning

Tao ZHANG, Pan LI(), Zixu WANG, Zhenhua ZHU   

  1. National Key Laboratory of Helicopter Dynamics,Nanjing University of Aeronautics and Astronautics,Nanjing 210016,China
  • Received:2025-02-25 Revised:2025-03-08 Accepted:2025-05-06 Online:2025-05-29 Published:2025-05-27
  • Contact: Pan LI E-mail:lipan@nuaa.edu.cn
  • Supported by:
    National Level Project


Abstract:

Reward function design is one of the core technologies for reinforcement-learning-based helicopter attitude control, as it directly determines controller training and performance; designing a complete and efficient reward function has therefore become a key research topic in the field. To this end, a phased reward function framework is proposed that divides the full-horizon control process into two control stages, designs reward sub-terms for each stage, and introduces tunable parameters that allow coarse adjustment of control performance. A neural-network attitude controller with a simple structure is designed based on the Actor-Critic method and trained with the Proximal Policy Optimization (PPO) algorithm. The effectiveness of the proposed method is validated through robustness tests with injected sensor errors and comparative experiments against a baseline reward function. One hundred step-response simulation trials show that, compared with the baseline method, the number of cases with steady-state error below 10% increases by 16%, the number of cases with overshoot below 10% of the command amplitude increases by 9%, and the number of cases with settling time below 4 s increases by 7%. Moreover, the controller still completes the attitude control task under conditions of large sensor error.
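The two-stage idea described above can be sketched as a per-step reward that switches form once the attitude error enters a tolerance band. This is only an illustrative sketch: the threshold `phase_threshold`, the weights `w_track`, `w_damp`, `w_hold`, and the exponential shaping in the hold stage are assumptions for illustration, not the paper's actual sub-terms or tuned values.

```python
import math

def phased_reward(att_error, att_rate, phase_threshold=0.2,
                  w_track=1.0, w_damp=0.1, w_hold=2.0):
    """Hypothetical per-step reward for attitude control (all values illustrative).

    att_error : attitude error relative to the command (rad)
    att_rate  : attitude angular rate (rad/s)
    """
    if abs(att_error) > phase_threshold:
        # Stage 1 (approach): reward driving the attitude error down,
        # with a light penalty on aggressive angular rates.
        return -w_track * abs(att_error) - w_damp * abs(att_rate)
    # Stage 2 (hold): a positive bonus that peaks at zero error,
    # shaping small residual errors sharply toward the command.
    return (w_hold * math.exp(-(att_error / phase_threshold) ** 2)
            - w_damp * abs(att_rate))
```

The tunable weights play the role of the abstract's "adjustable parameters": raising `w_hold` relative to `w_track` trades faster approach for tighter steady-state tracking, which is one way such parameters could macroscopically shift the settling-time/steady-state-error balance.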

Key words: aircraft intelligent control, helicopter attitude control, reinforcement learning, reward function, neural networks

CLC number: