Acta Aeronautica et Astronautica Sinica > 2021, Vol. 42 Issue (1): 524151-524151   doi: 10.7527/S1000-6893.2020.24151

Trajectory planning of space manipulator based on multi-agent reinforcement learning

ZHAO Yu, GUAN Gongshun, GUO Jifeng, YU Xiaoqiang, YAN Peng

  1. School of Astronautics, Harbin Institute of Technology, Harbin 150001, China
  • Received: 2020-04-28 Revised: 2020-05-21 Published: 2020-07-10
  • Corresponding author: GUO Jifeng E-mail: guojifeng@hit.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61973101); Aeronautical Science Foundation of China (20180577005)

Abstract: An online trajectory planning method based on deep reinforcement learning is studied for a six-degree-of-freedom (DOF) floating space manipulator capturing a moving target. First, the Denavit-Hartenberg (DH) model of the manipulator is presented, and multi-rigid-body kinematic and dynamic models are established that account for the mechanical coupling of the combined system. Next, an improved deep deterministic policy gradient (DDPG) algorithm is proposed, and a multi-agent self-learning system is built with each joint acting as a decision-making agent. A training system following the "offline centralized learning, online distributed execution" scheme is then set up for capturing targets in uniform rectilinear motion, with a reward function parameterized by the relative distance to the target and the total operation time. Finally, numerical simulations show that the manipulator rapidly captures targets moving at constant velocity in various directions, with an average completion time of 5.4 s. Compared with traditional planning algorithms based on random sampling, the proposed autonomous decision-making motion planning method achieves better solution speed and robustness.

Key words: manipulators, trajectory planning, multi-agent, policy gradient, on-orbit capture
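
The abstract describes a reward function built from the relative distance to the target and the total operation time, but does not give its form. The sketch below is one plausible shape for such a reward, not the authors' formula: the weights, the capture tolerance, the terminal bonus, and the function names `capture_reward` and `target_position` are all hypothetical and chosen only for illustration.

```python
import math

# All constants below are assumptions for illustration;
# the abstract does not state the paper's actual values.
W_DISTANCE = 1.0       # penalty per metre between end effector and target
W_TIME = 0.1           # penalty per second of elapsed operation time
CAPTURE_RADIUS = 0.05  # metres; assumed capture tolerance
CAPTURE_BONUS = 10.0   # assumed bonus granted once the target is captured


def target_position(start, velocity, t):
    """Position at time t of a target in uniform rectilinear motion."""
    return tuple(p + v * t for p, v in zip(start, velocity))


def capture_reward(effector_pos, target_pos, elapsed_s):
    """Return (reward, captured) for a single decision step.

    The reward falls as the relative distance or the elapsed time grows,
    so a policy maximising it is pushed toward fast captures.
    """
    distance = math.dist(effector_pos, target_pos)
    reward = -W_DISTANCE * distance - W_TIME * elapsed_s
    captured = distance <= CAPTURE_RADIUS
    if captured:
        reward += CAPTURE_BONUS
    return reward, captured


if __name__ == "__main__":
    # Target starts 1 m ahead and drifts at 0.1 m/s along y.
    target = target_position((1.0, 0.0, 0.0), (0.0, 0.1, 0.0), 2.0)
    print(capture_reward((0.0, 0.0, 0.0), target, 2.0))
```

In a centralized-training setup, one common choice is to feed the same scalar reward to every joint agent so that all of them optimise the shared capture objective; whether the paper does this is not stated in the abstract.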
