面向直升机姿态控制的强化学习奖励函数设计

  • 章涛,
  • 李攀,
  • 王梓旭,
  • 朱振华
  • 1. Nanjing University of Aeronautics and Astronautics
    2. National Key Laboratory of Helicopter Dynamics, Nanjing University of Aeronautics and Astronautics

收稿日期: 2025-04-30

  修回日期: 2025-05-19

  网络出版日期: 2025-05-27

Design of Reward Functions for Helicopter Attitude Control in Reinforcement Learning

  • ZHANG Tao,
  • LI Pan,
  • WANG Zi-Xu,
  • ZHU Zhen-Hua

Received date: 2025-04-30

  Revised date: 2025-05-19

  Online published: 2025-05-27

摘要

奖励函数设计是基于强化学习的旋翼飞行器姿态控制的核心技术之一,直接决定着控制器的训练与性能,设计完备且高效的奖励函数成为业界重点研究课题。为此,提出阶段奖励函数框架,将全时域控制过程划分为两个控制阶段,针对每一阶段设计奖励函数子项,同时引入可宏观调整控制性能的可调参数;基于Actor-Critic方法简单设计神经网络姿态控制器结构,并使用PPO算法进行训练,通过引入传感器误差的鲁棒性测试以及与基线奖励函数的对比实验,验证了所提出方法的有效性。100次阶跃仿真试验表明:相较于基线方法,系统稳态误差低于10%的算例数量增加16%,系统超调量低于指令幅值10%的算例数量增加9%,系统调节时间低于4s的算例数量增加7%。此外,在传感器误差较大的工况下,控制器依然能够成功完成姿态控制任务。

本文引用格式

章涛, 李攀, 王梓旭, 朱振华. 面向直升机姿态控制的强化学习奖励函数设计[J]. 航空学报, 0: 1-0. DOI: 10.7527/S1000-6893.2025.32184

Abstract

Design of the reward function is one of the core technologies for reinforcement-learning-based rotorcraft attitude control, as it directly determines the training and performance of the controller; designing a comprehensive and efficient reward function has therefore become a key research topic in the field. To this end, a phased reward function framework is proposed that divides the full time-domain control process into two control stages. Reward function sub-terms are designed for each stage, and adjustable parameters are introduced that allow macroscopic tuning of control performance. A simple neural network attitude controller is designed based on the Actor-Critic architecture and trained with the PPO algorithm. The effectiveness of the proposed method is validated through robustness tests that introduce sensor errors and through comparative experiments against a baseline reward function. Across 100 step-response simulation trials, compared with the baseline method, the number of cases with steady-state error below 10% increases by 16%, the number of cases with overshoot below 10% of the command amplitude increases by 9%, and the number of cases with settling time below 4 s increases by 7%. Additionally, under conditions of significant sensor error, the controller can still successfully complete the attitude control task.
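As a concrete illustration of the staged reward idea described in the abstract, the following minimal Python sketch splits each control step's reward according to the current control phase. The switching threshold PHASE_ERR_THRESHOLD and the weights w_err, w_rate, and w_ctrl are illustrative assumptions introduced here for clarity, not the sub-terms or values used in the paper.

```python
import numpy as np

# Assumed switch point between the transient phase and the steady-tracking phase.
PHASE_ERR_THRESHOLD = np.deg2rad(2.0)

def staged_reward(att_err, att_rate, ctrl_delta,
                  w_err=1.0, w_rate=0.1, w_ctrl=0.05):
    """Scalar reward for one control step (all weights are assumptions).

    att_err    : attitude-angle tracking error [rad]
    att_rate   : angular rate [rad/s]
    ctrl_delta : change in actuator command since the previous step
    """
    if abs(att_err) > PHASE_ERR_THRESHOLD:
        # Phase 1 (transient): drive the error toward the command quickly,
        # with only a light penalty on control activity.
        return -w_err * abs(att_err) - w_ctrl * abs(ctrl_delta)
    # Phase 2 (near the command): additionally penalise angular rate to
    # suppress overshoot and settle the response.
    return -w_err * abs(att_err) - w_rate * abs(att_rate) - w_ctrl * abs(ctrl_delta)
```

In a scheme of this kind, the weights play the role of the adjustable parameters mentioned above: for example, increasing w_rate emphasises damping in the second stage, trading settling speed for reduced overshoot.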
