Design of reward functions for helicopter attitude control in reinforcement learning

doi:10.7527/S1000-6893.2025.32184

Outstanding paper of the 2nd Aerospace Frontier Conference / 27th Annual Meeting of the China Association for Science and Technology

Tao ZHANG, Pan LI, Zixu WANG, Zhenhua ZHU

National Key Laboratory of Helicopter Dynamics, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

Received: 2025-02-25  Revised: 2025-03-08  Accepted: 2025-05-06  Online: 2025-05-29  Published: 2025-05-27
Contact: Pan LI  E-mail: lipan@nuaa.edu.cn
Supported by: National Level Project

Abstract

Reward function design is one of the core technologies of reinforcement-learning-based helicopter attitude control and directly determines how the controller trains and performs; designing a complete and efficient reward function has therefore become a key research topic in the field. To this end, a phased reward function framework is proposed that divides the full-time-domain control process into two control stages. Reward sub-terms are designed for each stage, and adjustable parameters are introduced that tune control performance at a macroscopic level. A simple neural network attitude controller is designed on the basis of the Actor-Critic method and trained with the Proximal Policy Optimization (PPO) algorithm. The proposed method is validated through robustness tests with injected sensor errors and through comparison with a baseline reward function. In 100 step-response simulation trials, compared with the baseline method, the share of cases with steady-state error below 10% rises by 16 percentage points, the share with overshoot below 10% of the command amplitude rises by 9 percentage points, and the share with settling time below 4 s rises by 7 percentage points. In addition, the controller still completes the attitude control task under large sensor errors.

Key words: aircraft intelligent control, helicopter attitude control, reinforcement learning, reward function, neural networks
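The abstract scores each step trial against three criteria: steady-state error below 10%, overshoot below 10% of the command amplitude, and settling time below 4 s. The page does not say how these quantities are extracted from a response, so the following is only a minimal sketch under assumed conventions: the steady-state value is taken as the mean of the last 10% of samples, and the settling band is ±10% of the command amplitude. The function name `step_metrics`, the `band` parameter, and the second-order test signal are hypothetical and are not taken from the paper.

```python
import numpy as np


def step_metrics(t, y, command, band=0.10):
    """Return (steady_state_error, overshoot, settling_time) for one step response.

    The first two values are normalised by the command amplitude; settling_time
    is in the units of t. `band` is an assumed settling tolerance.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    amp = abs(command)

    # Steady-state value: mean of the last 10% of samples (assumed convention).
    tail = max(1, len(y) // 10)
    y_ss = y[-tail:].mean()
    steady_state_error = abs(command - y_ss) / amp

    # Overshoot: largest excursion beyond the command, in the command direction.
    overshoot = max(0.0, float(np.max(np.sign(command) * (y - command)))) / amp

    # Settling time: first instant after which |y - command| stays inside the band.
    outside = np.abs(y - command) > band * amp
    if not outside.any():
        settling_time = t[0]
    elif outside[-1]:
        settling_time = np.inf  # never settles within the record
    else:
        settling_time = t[np.nonzero(outside)[0][-1] + 1]

    return steady_state_error, overshoot, settling_time


# Illustrative second-order response to a 10° step (not data from the paper).
t = np.linspace(0.0, 10.0, 1001)
zeta, wn = 0.7, 3.0
wd = wn * np.sqrt(1.0 - zeta**2)
y = 10.0 * (1.0 - np.exp(-zeta * wn * t)
            * (np.cos(wd * t) + zeta * wn / wd * np.sin(wd * t)))

sse, os_, ts = step_metrics(t, y, command=10.0)
print(sse < 0.10, os_ < 0.10, ts < 4.0)
```

Counting how many of the 100 trials pass each test would then give shares of the kind reported in the abstract; a different settling band or steady-state window would shift those counts.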

CLC number: V275.1

Tao ZHANG, Pan LI, Zixu WANG, Zhenhua ZHU. Design of reward functions for helicopter attitude control in reinforcement learning[J]. Acta Aeronautica et Astronautica Sinica, 2025, 46(S1): 732184.

Figures/Tables (18)

Figures 1-6

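Table 1

Summary of reward function hyperparameters

Stage	Parameter	Value
Whole process	$e_{b}$ / (°)	0.5
Whole process	$c_{I}$	0.2
Fast convergence	$\omega_{tc\min}$ / ((°)·s^-1)	0.2
Fast convergence	$\omega_{tc\max}$ / ((°)·s^-1)	10
Fast convergence	$k_{eb}$ / ((°)·s^-1)	0.7
Fast convergence	$c_{fc}$	0.2
Fast convergence	$c_{crash}$	0.2
Stable convergence	$\omega_{ts\min}$ / ((°)·s^-1)	0.2
Stable convergence	$\omega_{ts\max}$ / ((°)·s^-1)	2
Stable convergence	$c_{sc}$	0.2
Stable convergence	$c_{cp}$	0.2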

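Table 2

PPO algorithm hyperparameters

Hyperparameter	This paper	Reference studies [35-36]
Discount factor $\gamma$	0.99	0.99
Clipping coefficient $\varepsilon$	0.2	0.2-0.3
Entropy loss weight $w$	0.02	No entropy regularization needed
Policy network action standard deviation scaling coefficient $A_{\sigma}$	0.6	0.5
Generalized advantage estimator smoothing factor $\lambda$	0.9	0.9
Actor network learning rate	$6\times10^{-4}$	$3\times10^{-4}$
Critic network learning rate	$6\times10^{-3}$	$3\times10^{-4}$
Optimizer	Adam	Adam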
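To make the roles of $\gamma$, $\lambda$ and $\varepsilon$ in Table 2 concrete, the sketch below implements the standard generalized advantage estimator and the standard PPO clipped surrogate objective in NumPy, using the values from the "this paper" column. It illustrates the textbook formulas only; the authors' training code is not shown on this page, and the function names and toy inputs are hypothetical.

```python
import numpy as np

GAMMA, LAM, CLIP_EPS = 0.99, 0.9, 0.2  # "this paper" column of Table 2


def gae_advantages(rewards, values, last_value, gamma=GAMMA, lam=LAM):
    """Generalized advantage estimation over one trajectory (no early termination)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    adv = np.zeros_like(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):
        delta = rewards[k] + gamma * values[k + 1] - values[k]  # TD residual
        running = delta + gamma * lam * running
        adv[k] = running
    return adv


def ppo_clip_objective(ratio, adv, eps=CLIP_EPS):
    """Clipped surrogate objective (to be maximised), averaged over samples."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))


# Toy inputs: unit rewards with a zero value estimate, and two probability ratios
# that both fall outside the clipping interval [0.8, 1.2].
adv = gae_advantages(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0], last_value=0.0)
objective = ppo_clip_objective(ratio=[1.5, 0.5], adv=[1.0, -1.0])
print(adv, objective)
```

With $\gamma\lambda = 0.891$, each TD residual is discounted by about 11% per step when it is accumulated backwards, and with $\varepsilon = 0.2$ a probability ratio outside $[0.8, 1.2]$ contributes no further incentive to move the policy in that direction.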

Figures 7-10

Table 3

Angular rate sensor error values

Error parameter	Error value
$\alpha_{ij}\ (i, j = x, y, z;\ i \neq j)$	0.2°
$\delta k_{i}\ (i = x, y, z)$	$300\times10^{-6}$
$\beta_{i}\ (i = x, y, z)$	30 (°)/h
$\sigma_{i}\ (i = x, y, z)$	1 (°)/h
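Table 3 gives error magnitudes but this page does not reproduce the measurement model they enter. The sketch below assumes a common rate-gyro error model in which $\alpha_{ij}$ are axis-misalignment angles, $\delta k_{i}$ are scale-factor errors, $\beta_{i}$ are constant biases and $\sigma_{i}$ is the standard deviation of additive white noise on the rate. These interpretations, the function name `corrupt_rates`, and the choice to apply the same value on every axis are assumptions for illustration and may differ from the paper's model.

```python
import numpy as np

DEG = np.pi / 180.0

ALPHA = 0.2 * DEG            # assumed: misalignment angle, rad
DELTA_K = 300e-6             # assumed: scale-factor error, dimensionless
BETA = 30.0 * DEG / 3600.0   # assumed: constant bias, 30 (°)/h in rad/s
SIGMA = 1.0 * DEG / 3600.0   # assumed: noise std, 1 (°)/h in rad/s


def corrupt_rates(omega_true, rng):
    """Apply the assumed misalignment / scale-factor / bias / noise model."""
    omega_true = np.asarray(omega_true, dtype=float)
    # Small-angle cross-coupling: every off-diagonal entry equals ALPHA.
    misalign = ALPHA * (np.ones((3, 3)) - np.eye(3))
    gain = np.eye(3) * (1.0 + DELTA_K) + misalign
    noise = rng.normal(0.0, SIGMA, size=3)
    return gain @ omega_true + BETA + noise


rng = np.random.default_rng(0)
omega = np.array([10.0, 0.0, 0.0]) * DEG  # 10 (°)/s roll rate
measured = corrupt_rates(omega, rng)
error_deg = (measured - omega) / DEG       # measurement error in (°)/s
print(error_deg)
```

Under these assumptions the misalignment term dominates during fast rotation: a 10 (°)/s roll rate leaks roughly 0.035 (°)/s into the other two axes, while the bias contributes about 0.008 (°)/s on each axis.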

Figures 11-14

Table 4

References (39)

[1]	GAO Z, CHEN R L. Helicopter flight dynamics[M]. Beijing: Science Press, 2003: 1-232 (in Chinese).
[2]	CHEN R L, LI P, WU W, et al. A review of mathematical modeling of helicopter flight dynamics[J]. Acta Aeronautica et Astronautica Sinica, 2017, 38(7): 520915 (in Chinese).
[3]	LI P. Research on unsteady free wake of rotor and high confidence helicopter flight mechanics modeling[D]. Nanjing: Nanjing University of Aeronautics and Astronautics, 2010: 1-169 (in Chinese).
[4]	BALAS G J, PACKARD A K, RENFROW J, et al. Control of the F-14 aircraft lateral-directional axis during powered approach[J]. Journal of Guidance, Control, and Dynamics, 1998, 21(6): 899-908.
[5]	ZHENG F Y, SHEN Z M, LI Y Q, et al. Gain adaptive multi-mode switching control for coaxial high-speed helicopter[J]. Acta Aeronautica et Astronautica Sinica, 2024, 45(9): 529088 (in Chinese).
[6]	CATAK A, ALTUNKAYA E C, DEMIR M, et al. Enhanced flight envelope protection: A novel reinforcement learning approach[J]. IFAC-PapersOnLine, 2024, 58(30): 207-212.
[7]	WISE K A. Design parameter tuning in adaptive observer-based flight control architectures[C]//2018 AIAA Information Systems-AIAA Infotech @ Aerospace. Reston: AIAA, 2018.
[8]	QIU Y Q, LI Y, LANG J X, et al. Robust adaptive attitude control of high-speed helicopters in transition mode[J]. Acta Aeronautica et Astronautica Sinica, 2024, 45(9): 529927 (in Chinese).
[9]	LAKE B M, BARONI M. Human-like systematic generalization through a meta-learning neural network[J]. Nature, 2023, 623(7985): 115-121.
[10]	SUTTON R S, BARTO A G. Reinforcement learning: An introduction[J]. IEEE Transactions on Neural Networks, 1998, 9(5): 1054.
[11]	SÖNMEZ S, RUTHERFORD M J, VALAVANIS K P. A survey of offline- and online-learning-based algorithms for multirotor UAVs[J]. Drones, 2024, 8(4): 116.
[12]	RICHTER D J, CALIX R A, KIM K. A review of reinforcement learning for fixed-wing aircraft control tasks[J]. IEEE Access, 2024, 12: 103026-103048.
[13]	SHADEED O, HASANZADE M, KOYUNCU E. Deep reinforcement learning based aggressive flight trajectory tracker[C]//AIAA Scitech 2021 Forum. Reston: AIAA, 2021.
[14]	MANUKYAN A, OLIVARES-MENDEZ M A, GEIST M, et al. Deep reinforcement learning-based continuous control for multicopter systems[C]//2019 6th International Conference on Control, Decision and Information Technologies (CoDIT). Piscataway: IEEE Press, 2019: 1876-1881.
[15]	HWANGBO J, SA I, SIEGWART R, et al. Control of a quadrotor with reinforcement learning[J]. IEEE Robotics and Automation Letters, 2017, 2(4): 2096-2103.
[16]	KOCH W, MANCUSO R, WEST R, et al. Reinforcement learning for UAV attitude control[J]. ACM Transactions on Cyber-Physical Systems, 2019, 3(2): 1-21.
[17]	XU J, DU T, FOSHEY M, et al. Learning to fly: Computational controller design for hybrid UAVs with reinforcement learning[J]. ACM Transactions on Graphics, 2019, 38(4): 1-12.
[18]	CANO LOPES G, FERREIRA M, SILVA SIMÕES A DA, et al. Intelligent control of a quadrotor with proximal policy optimization reinforcement learning[C]//2018 Latin American Robotic Symposium, 2018 Brazilian Symposium on Robotics (SBR) and 2018 Workshop on Robotics in Education (WRE). Piscataway: IEEE Press, 2018: 503-508.
[19]	MOLCHANOV A, CHEN T, HÖNIG W, et al. Sim-to-(multi)-real: Transfer of low-level robust control policies to multiple quadrotors[C]//2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Piscataway: IEEE Press, 2019: 59-66.
[20]	LI Z, XUE S R, LIN W Y, et al. Training a robust reinforcement learning controller for the uncertain system based on policy gradient method[J]. Neurocomputing, 2018, 316: 313-321.
[21]	ZHEN Y, HAO M R, SUN W D. Deep reinforcement learning attitude control of fixed-wing UAVs[C]//2020 3rd International Conference on Unmanned Systems (ICUS). Piscataway: IEEE Press, 2020: 239-244.
[22]	BEKAR C, YUKSEK B, INALHAN G. High fidelity progressive reinforcement learning for agile maneuvering UAVs[C]//AIAA Scitech 2020 Forum. Reston: AIAA, 2020.
[23]	KIM J, JUNG S. Enhancing UAV stability: A deep reinforcement learning strategy[C]//2024 International Conference on Electronics, Information, and Communication (ICEIC). Piscataway: IEEE Press, 2024: 1-4.
[24]	AOUN C, MONCAYO H. Disturbance observer-based reinforcement learning control and the application to a nonlinear dynamic system[C]//AIAA Scitech 2020 Forum. Reston: AIAA, 2022.
[25]	WANG Y D, SUN J, HE H B, et al. Deterministic policy gradient with integral compensator for robust quadrotor control[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 50(10): 3713-3725.
[26]	AKHTAR M, MAQSOOD A. Comparative analysis of deep reinforcement learning algorithms for hover-to-cruise transition maneuvers of a tilt-rotor unmanned aerial vehicle[J]. Aerospace, 2024, 11(12): 1040-1042.
[27]	PUTERMAN M L. Markov decision processes[M]//Handbooks in Operations Research and Management Science. New York: Springer Science+Business Media, 1990, 2: 331-434.
[28]	SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[DB/OL]. arXiv preprint: 1707.06347, 2017.
[29]	SCHULMAN J, MORITZ P, LEVINE S, et al. High-dimensional continuous control using generalized advantage estimation[DB/OL]. arXiv preprint: 1506.02438, 2015.
[30]	GRONDMAN I, BUSONIU L, LOPES G A D, et al. A survey of actor-critic reinforcement learning: Standard and natural policy gradients[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2012, 42(6): 1291-1307.
[31]	BORISOV A, MAMAEV I S. Rigid body dynamics[M]. New York: Springer Science+Business Media, 2018: 1-271.
[32]	LI P, CHEN R L. A mathematical model for helicopter comprehensive analysis[J]. Chinese Journal of Aeronautics, 2010, 23(3): 320-326.
[33]	PITT D M, PETERS D A. Theoretical prediction of dynamic inflow derivatives[J]. Vertica, 1981, 5(1): 21-34.
[34]	BALLIN M G. Validation of a real-time engineering simulation of the UH-60A helicopter: NASA-TM-88360[R]. Washington, D.C.: NASA, 1987.
[35]	ANDRYCHOWICZ M, RAICHUK A, STAŃCZYK P, et al. What matters for on-policy deep actor-critic methods? A large-scale study[C]//International Conference on Learning Representations: OpenReview. 2021: 1-10.
[36]	WU Y F, ZHANG W, XU P, et al. A finite-time analysis of two time-scale actor-critic methods[J]. Advances in Neural Information Processing Systems, 2020, 33: 17617-17628.
[37]	WELCER M, SZCZEPAŃSKI C, KRAWCZYK M. The impact of sensor errors on flight stability[J]. Aerospace, 2022, 9(3): 169.
[38]	TRIPATHI S, WAGH P, CHAUDHARY A B. Modelling, simulation & sensitivity analysis of various types of sensor errors and its impact on tactical flight vehicle navigation[C]//2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT). Piscataway: IEEE Press, 2016: 938-942.
[39]	ZHENG T, XU A G, XU X C, et al. Modeling and compensation of inertial sensor errors in measurement systems[J]. Electronics, 2023, 12(11): 2458.


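Table 4

Control performance metric	Proposed method (share of cases)	Baseline method (share of cases)
Steady-state error ≤ 10%	81%	65%
Overshoot ≤ 10%	89%	80%
Settling time ≤ 4 s	93%	86%
Simulation divergence	0%	2%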
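The improvements quoted in the abstract (16%, 9% and 7%) can be checked against the shares in this table. The short sketch below does that arithmetic; the dictionary keys are hypothetical labels, and the numbers are copied from the table above.

```python
# Shares (in %) of the 100 step trials meeting each criterion, from the table above.
proposed = {"steady_state_error": 81, "overshoot": 89, "settling_time": 93}
baseline = {"steady_state_error": 65, "overshoot": 80, "settling_time": 86}

# Absolute differences in percentage points.
gains = {name: proposed[name] - baseline[name] for name in proposed}
print(gains)  # {'steady_state_error': 16, 'overshoot': 9, 'settling_time': 7}
```

The differences match the abstract, which indicates that its figures are absolute differences in percentage points rather than relative increases over the baseline counts.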

