
Trajectory Planning for Solar-Powered UAVs Based on Multi-Objective Reinforcement Learning

Xu Tichao (徐体超) 1, Meng Wenyue (蒙文跃) 1, Zhang Jian (张健) 2

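  1. Institute of Engineering Thermophysics, Chinese Academy of Sciences
  2. Institute of Engineering Thermophysics, Chinese Academy of Sciences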
  • Received: 2025-09-23 Revised: 2025-12-21 Online: 2025-12-23 Published: 2025-12-23
  • Corresponding author: Zhang Jian (张健)

Abstract: For high-altitude long-endurance solar-powered UAVs, the factors that govern solar energy harvesting and gradient wind energy harvesting are strongly coupled, so optimizing the harvesting efficiency of both energy sources at once often produces conflicts. Energy-optimal trajectory planning that accounts for both solar and gradient wind energy is therefore a complex decision-making process that requires balancing multiple energy objectives. To address this problem, this study proposes a trajectory planning method based on multi-objective reinforcement learning. The method adopts a multi-objective Soft Actor-Critic (SAC) algorithm built on the multi-objective Markov decision process, combines the UAV's energy harvesting power and energy consumption power into a reward vector, and adds randomly generated weights at each update step. Once training has converged, the policy network outputs thrust, angle-of-attack, and bank-angle commands from the flight information and a given weight vector, which makes it possible to generate a set of energy-optimal trajectories across the weight space. Simulation results show that, compared with the minimum-energy-consumption strategy and a strategy based on the conventional single-objective SAC algorithm, the method consistently achieves better energy optimization efficiency and adapts to changes in the weights of the energy objectives. Compared with the offline-optimized trajectory solution set obtained with the Non-dominated Sorting Genetic Algorithm II (NSGA-II), the hypervolume of the solution set generated by this method reaches 90.07% of the former, while the time for a single decision stays at the millisecond level. The method also shows a degree of generalization: it can adapt, to some extent, to new wind fields with altered wind gradients that were not seen in training.

Key words: Solar-powered Unmanned Aerial Vehicle, Trajectory planning, Dynamic soaring, Energy optimization, Reinforcement learning, Multi-objective reinforcement learning
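
The abstract names two ingredients that a short sketch can make concrete: a randomly generated weight vector applied to a reward vector, and the hypervolume used to compare trajectory solution sets. The Python below is a minimal illustration under assumptions that the abstract does not state: two objectives, both maximized (an energy consumption term would enter with a negative sign), linear weighted-sum scalarization, weights drawn uniformly from the simplex, and an arbitrary reference point. All function names and numbers are hypothetical and are not taken from the paper.

```python
import random


def sample_weights(n_objectives, rng):
    """Draw a random weight vector: non-negative entries that sum to 1."""
    # Normalized exponential draws are uniform on the simplex (an assumed choice).
    draws = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, weights):
    """Weighted sum of the reward components (assumed scalarization)."""
    return sum(w * r for w, r in zip(weights, reward_vector))


def pareto_front(points):
    """Non-dominated (x, y) tuples when both objectives are maximized."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, reference):
    """Area dominated by the points and bounded below by the reference point."""
    front = [
        p for p in pareto_front(points)
        if p[0] > reference[0] and p[1] > reference[1]
    ]
    # Sweep from the largest first objective; the second objective then
    # increases along the front, so each point adds one rectangular strip.
    front.sort(key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, reference[1]
    for x, y in front:
        area += (x - reference[0]) * (y - prev_y)
        prev_y = y
    return area


if __name__ == "__main__":
    rng = random.Random(0)
    w = sample_weights(2, rng)
    # Hypothetical reward vector: [harvesting term, negated consumption term].
    print("weights:", w, "scalar reward:", scalarize([2.0, -1.0], w))

    offline_set = [(1, 3), (2, 2), (3, 1), (1, 1)]  # made-up baseline set
    online_set = [(1, 2), (2, 1)]                   # made-up policy set
    ratio = hypervolume_2d(online_set, (0, 0)) / hypervolume_2d(offline_set, (0, 0))
    print("hypervolume ratio:", ratio)
```

A ratio computed this way is the kind of quantity behind the reported 90.07% figure, although the reference point, the objective scaling, and the number of objectives used in the paper are not given in the abstract.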
