
Strategy solution of non-cooperative target pursuit-evasion game based on branching deep reinforcement learning
LIU Bingyan1,2, YE Xiongbing1, GAO Yong2, WANG Xinbo2, NI Lei3
1. Academy of Military Sciences, Beijing 100091, China;
2. 32032 Troops, Beijing 100094, China;
3. Space Engineering University, Beijing 101416, China
Abstract: To solve the space rendezvous problem between a spacecraft and a non-cooperative target, and to ease the limitations of deep reinforcement learning in continuous spaces, this paper proposes a pursuit-evasion game algorithm based on branching deep reinforcement learning to obtain the space rendezvous strategy. The optimal control problem of space rendezvous with a non-cooperative target is formulated as a differential game and described as a pursuit-evasion game under continuous thrust. To avoid the curse of dimensionality that traditional deep reinforcement learning faces in continuous spaces, a fuzzy inference model is constructed to represent the continuous space, and a branching deep reinforcement learning architecture with multiple parallel neural networks and a shared decision module is proposed. The approach combines optimal control with game theory, overcomes the difficulty that classical optimal control theory has in solving the highly nonlinear differential game model, and further improves the training capability of deep reinforcement learning on discrete behaviors. Finally, a numerical example verifies the effectiveness of the algorithm.
Keywords: non-cooperative targets; space rendezvous; pursuit-evasion problem of spacecraft; continuous space; differential game; deep reinforcement learning; branching architectures

1 Dynamic model of spacecraft and non-cooperative target

 Fig. 1 Coordinate frame sketch of spacecraft and non-cooperative target
 $\left\{ \begin{array}{l} {\ddot x_i}(t) = 2\frac{\mu }{{r^3}(t)}{x_i}(t) + 2\omega (t){\dot y_i}(t) + \dot \omega (t){y_i}(t) + {\omega ^2}(t){x_i}(t) + \frac{{T_i}}{{m_i}}\cos {\delta _i}\cos {\theta _i}\\ {\ddot y_i}(t) = - \frac{\mu }{{r^3}(t)}{y_i}(t) + 2\omega (t){\dot x_i}(t) + \dot \omega (t){x_i}(t) + {\omega ^2}(t){y_i}(t) + \frac{{T_i}}{{m_i}}\cos {\delta _i}\sin {\theta _i}\\ {\ddot z_i}(t) = - {\omega ^2}(t){z_i}(t) + \frac{{T_i}}{{m_i}}\sin {\delta _i} \end{array} \right.$ （1）
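
As a reading aid, the right-hand side of Eq. (1) can be transcribed into a short function. This is a minimal sketch, not the authors' implementation: the function name `relative_acceleration` and its argument layout are hypothetical, and the terms and signs are copied exactly as printed in Eq. (1).

```python
import math

def relative_acceleration(pos, vel, mu, r, omega, omega_dot,
                          thrust, mass, delta, theta):
    """Evaluate the right-hand side of Eq. (1) for one spacecraft i.

    pos, vel   : (x, y, z) and (x_dot, y_dot, z_dot) in the rotating frame
    mu, r      : gravitational parameter and reference orbit radius r(t)
    omega      : angular rate of the reference frame, omega_dot its derivative
    thrust/mass: continuous thrust T_i and mass m_i
    delta/theta: thrust direction angles delta_i and theta_i (rad)
    """
    x, y, z = pos
    vx, vy, _ = vel
    a = thrust / mass      # thrust acceleration T_i / m_i
    k = mu / r ** 3        # gravity-gradient coefficient mu / r^3

    # Terms and signs follow Eq. (1) as printed in the text.
    ax = (2 * k * x + 2 * omega * vy + omega_dot * y + omega ** 2 * x
          + a * math.cos(delta) * math.cos(theta))
    ay = (-k * y + 2 * omega * vx + omega_dot * x + omega ** 2 * y
          + a * math.cos(delta) * math.sin(theta))
    az = -omega ** 2 * z + a * math.sin(delta)
    return ax, ay, az
```

With zero position, zero velocity and both thrust angles at zero, the function returns the thrust acceleration along x only, which is a quick sanity check of the thrust projection.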

 $j({u_{\rm{p}}},{u_{\rm{e}}}) = \left\| {[{x_{\rm{p}}} - {x_{\rm{e}}},{y_{\rm{p}}} - {y_{\rm{e}}},{z_{\rm{p}}} - {z_{\rm{e}}}]_{|t = {t_{\rm{f}}}}^{\rm{T}}} \right\|_2^2$ （2）

 $J({u_{\rm{p}}},{u_{\rm{e}}}) = k\int_{{t_0}}^{{t_{\rm{f}}}} {\rm{d}} t + (1 - k)\int_{{t_0}}^{{t_{\rm{f}}}} j ({u_{\rm{p}}},{u_{\rm{e}}}){\rm{d}}t$ （3）

 $J(u_{\rm{p}}^*,{u_{\rm{e}}}) \le J(u_{\rm{p}}^*,u_{\rm{e}}^*) \le J({u_{\rm{p}}},u_{\rm{e}}^*)$ （4）

 ${J^*} = \mathop {{\rm{min}}}\limits_{u_{\rm{p}}^*} \mathop {{\rm{max}}}\limits_{u_{\rm{e}}^*} J = \mathop {{\rm{max}}}\limits_{u_{\rm{e}}^*} \mathop {{\rm{min}}}\limits_{u_{\rm{p}}^*} J$ （5）
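
The saddle-point condition of Eq. (4) and the min-max equality of Eq. (5) can be illustrated on a finite payoff table. The sketch below assumes the continuous strategies have been discretized into a matrix `J[p][e]` (pursuer index `p`, evader index `e`); this discretization and all function names are illustrative assumptions, not part of the paper.

```python
def is_saddle_point(J, p_star, e_star):
    """Check Eq. (4) for a finite payoff matrix J[p][e].

    The pursuer (row player) minimizes J, the evader (column player) maximizes it.
    """
    value = J[p_star][e_star]
    # Left inequality: the evader cannot increase J by leaving e_star.
    evader_ok = all(J[p_star][e] <= value for e in range(len(J[p_star])))
    # Right inequality: the pursuer cannot decrease J by leaving p_star.
    pursuer_ok = all(J[p][e_star] >= value for p in range(len(J)))
    return evader_ok and pursuer_ok

def minimax_value(J):
    """min over pursuer strategies of the max over evader strategies."""
    return min(max(row) for row in J)

def maximin_value(J):
    """max over evader strategies of the min over pursuer strategies."""
    n_p, n_e = len(J), len(J[0])
    return max(min(J[p][e] for p in range(n_p)) for e in range(n_e))

# Hypothetical 2x2 payoff table used only for illustration.
J = [[3.0, 5.0],
     [1.0, 2.0]]
saddle = is_saddle_point(J, 1, 1)
```

For this table the pair (1, 1) satisfies Eq. (4), and the min-max and max-min values coincide as in Eq. (5).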

2 Fuzzy inference model of space behavior

 $\begin{array}{l} {R_l}:\ {\rm{IF}}\ {x_1}\ {\rm{is}}\ A_1^l\ {\rm{AND}}\ {x_2}\ {\rm{is}}\ A_2^l\ {\rm{AND}}\ \cdots \ {\rm{AND}}\ {x_n}\ {\rm{is}}\ A_n^l\\ \quad {\rm{THEN}}\ {u_l} = {c_l} \end{array}$ （6）

 Fig. 2 TSK fuzzy inference model

 $L_l^2 = \prod\limits_{i = 1}^n {{\mu ^{A_i^l}}} ({x_i})$ （7）

 $L_l^3 = {\varPsi ^l} = \frac{{L_l^2}}{{\sum\limits_{l = 1}^L {L_l^2} }} = \frac{{\prod\limits_{i = 1}^n {{\mu ^{A_i^l}}} ({x_i})}}{{\sum\limits_{l = 1}^L {(\prod\limits_{i = 1}^n {{\mu ^{A_i^l}}} ({x_i}))} }}$ （8）

 $L_l^4 = L_l^3{c_l}$ （9）

 ${L^5} = u = \sum\limits_{l = 1}^L {L_l^4} = \frac{{\sum\limits_{l = 1}^L {(\prod\limits_{i = 1}^n {{\mu ^{A_i^l}}} ({x_i}){c_l})} }}{{\sum\limits_{l = 1}^L {(\prod\limits_{i = 1}^n {{\mu ^{A_i^l}}} ({x_i}))} }} = \sum\limits_{l = 1}^L {({\varPsi ^l}{c_l})}$ （10）

 ${\mu ^{A_i^l}}({x_i}) = {\rm{exp}}\left( { - {{\left( {\frac{{{x_i} - m_i^l}}{{\sigma _i^l}}} \right)}^2}} \right)$ （11）
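
The layer-by-layer computation in Eqs. (7) to (11) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the list-of-lists layout for the rule parameters, and the example numbers are hypothetical and do not come from the paper.

```python
import math

def gaussian_membership(x, m, sigma):
    """Gaussian membership function of Eq. (11)."""
    return math.exp(-((x - m) / sigma) ** 2)

def tsk_output(x, centers, widths, consequents):
    """Zero-order TSK inference following Eqs. (7)-(10).

    x           : list of n input values
    centers     : centers[l][i] is m_i^l for rule l and input i
    widths      : widths[l][i] is sigma_i^l
    consequents : consequents[l] is the rule output c_l
    Returns the crisp output u and the normalized firing strengths.
    """
    strengths = []
    for m_l, s_l in zip(centers, widths):
        w = 1.0
        for xi, m, s in zip(x, m_l, s_l):
            w *= gaussian_membership(xi, m, s)   # product over inputs, Eq. (7)
        strengths.append(w)
    total = sum(strengths)
    psi = [w / total for w in strengths]         # normalization, Eq. (8)
    u = sum(p * c for p, c in zip(psi, consequents))  # Eqs. (9)-(10)
    return u, psi

# Two hypothetical rules on a single input, placed symmetrically around x = 0.
u, psi = tsk_output([0.0], [[-1.0], [1.0]], [[1.0], [1.0]], [-2.0, 4.0])
```

Because the input lies midway between the two rule centers, both rules fire equally and the output is the plain average of the two consequents.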

3 Branching deep reinforcement learning for the pursuit-evasion game

3.1 Multiple parallel network branches

 Fig. 3 Schematic diagram of branching deep reinforcement learning architecture
3.2 Shared behavior decision module

 Fig. 4 Schematic diagram of shared behavior decision based on improved reinforcement learning

 $\begin{array}{l} {R_l}:\ {\rm{IF}}\ {x_1}\ {\rm{is}}\ A_1^l\ {\rm{AND}}\ {x_2}\ {\rm{is}}\ A_2^l\ {\rm{AND}}\ \cdots \ {\rm{AND}}\ {x_n}\ {\rm{is}}\ A_n^l\\ \quad {\rm{THEN}}\ {u_l} = {a^l} \end{array}$ （12）

 ${a^l} = \left\{ \begin{array}{ll} {\rm{random\ selection\ from}}\ a, & {\rm{Prob}}(\varepsilon )\\ {\rm{arg}}\mathop {{\rm{max}}}\limits_{{a^l} \in a} q(S,{a^l}), & {\rm{Prob}}(1 - \varepsilon ) \end{array} \right.$ （13）
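
The selection rule of Eq. (13) is the usual epsilon-greedy policy. A minimal sketch follows, assuming the candidate actions of one rule are indexed 0..K-1 and their q values are held in a plain list; the function name and the optional `rng` parameter are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index according to Eq. (13).

    With probability epsilon a random index is drawn from the action set;
    otherwise the index with the largest q value is returned.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])   # exploit

# With epsilon = 0 the choice is purely greedy.
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], 0.0)
```

Passing a seeded `random.Random` instance as `rng` makes the exploration reproducible during debugging.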

 ${q_t}(S,{a^l}) = E[{G_t}|{S_t} = S,{a_t} = {a^l},\varepsilon {\rm{ - greedy}}]$ （14）
 ${v_t}(S) = {E_{a \sim \varepsilon {\rm{ - greedy}}}}[{q_t}(S,{a^l})]$ （15）

 ${o_t}(S,{a^l}) = {q_t}(S,{a^l}) - {v_t}(S)$ （16）

 ${q_t}(S,{a^l}) = {v_t}(S) + ({o_t}(S,{a^l}) - \mathop {{\rm{max}}}\limits_{{a^l} \in a} {o_t}(S,{a^l}))$ （17）

 ${u^*}({\bar x_t}) = {v_t}(S) + \sum\limits_{l = 1}^L {(\varPsi _t^l{o_t}(S,{a^l}) - \varPsi _t^l\mathop {{\rm{max}}}\limits_{{a^l} \in a} {o_t}(S,{a^l}))}$ （18）
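
Equations (16) to (18) separate a state value from per-action advantages and then blend the rule outputs with the fuzzy firing strengths. The sketch below is one possible reading, not the authors' network code: it assumes `adv_selected[l]` is the advantage of the action chosen by rule `l` and `adv_max` is the largest advantage over the action set, and all names are hypothetical.

```python
def advantages(q_values, v):
    """Advantage o_t = q_t - v_t of Eq. (16) for each candidate action."""
    return [q - v for q in q_values]

def dueling_q(v, adv):
    """Recombine value and advantages as in Eq. (17)."""
    best = max(adv)
    return [v + (o - best) for o in adv]

def global_output(v, psi, adv_selected, adv_max):
    """Fuzzy-weighted output u* of Eq. (18).

    psi[l] is the normalized firing strength of rule l at time t.
    """
    return v + sum(p * o - p * adv_max for p, o in zip(psi, adv_selected))

# Hypothetical numbers: two rules with equal firing strength.
q_rebuilt = dueling_q(1.0, [0.2, -0.4])
u_star = global_output(1.0, [0.5, 0.5], [0.2, -0.4], 0.2)
```

Subtracting the maximum advantage in Eq. (17) pins the best action's q value to the state value, so the rebuilt q values never exceed `v`.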

 ${p_t} = {R_{t + 1}} + \gamma {u^*}({\bar x_{t + 1}}) - u({\bar x_t})$ （19）

In the q-function update stage, the q function is updated through autonomous iterative training:

 ${q_{t + 1}}(l,{a^l}) = {q_t}(l,{a^l}) + \eta {p_t}\varPsi _t^l$ （20）
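
The temporal-difference error of Eq. (19) and the update of Eq. (20) can be sketched with a small table in place of the neural network branches. The tabular storage `q[l][a]`, the `selected` list of per-rule action indices, and the function names are simplifying assumptions made only for illustration.

```python
def td_error(reward, gamma, u_next_star, u_current):
    """Temporal-difference error p_t of Eq. (19)."""
    return reward + gamma * u_next_star - u_current

def update_q(q, selected, psi, p_t, eta):
    """Apply Eq. (20) in place.

    q[l][a]     : q value of action a under rule l
    selected[l] : index of the action a^l chosen by rule l at time t
    psi[l]      : normalized firing strength of rule l at time t
    eta         : learning rate
    """
    for l, a in enumerate(selected):
        q[l][a] += eta * p_t * psi[l]
    return q

# Hypothetical numbers for a two-rule, two-action table.
q = [[0.0, 0.0], [0.0, 0.0]]
p_t = td_error(reward=1.0, gamma=0.9, u_next_star=2.0, u_current=0.8)
update_q(q, selected=[1, 0], psi=[0.25, 0.75], p_t=p_t, eta=0.1)
```

Each rule's update is scaled by its firing strength, so rules that contributed little to the executed behavior receive a proportionally small correction.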

3.3 Game interaction between spacecraft and non-cooperative target

4 Numerical example

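| Object | x/km | y/km | z/km | ${\dot x}$/(km·s$^{-1}$) | ${\dot y}$/(km·s$^{-1}$) | ${\dot z}$/(km·s$^{-1}$) |
| --- | --- | --- | --- | --- | --- | --- |
| P | 0 | 0 | 0 | -0.0496 | 0.0418 | 0 |
| E | 50 | 50 | 0 | -0.0371 | 0.0314 | 0 |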

 $\left\{ {\begin{array}{*{20}{l}} {\Delta \delta = {\delta _{\rm{e}}} - {\delta _{\rm{p}}}}\\ {\Delta \theta = {\tan ^{ - 1}}\left( {\frac{{{y_{\rm{e}}} - {y_{\rm{p}}}}}{{{x_{\rm{e}}} - {x_{\rm{p}}}}}} \right) - {\theta _{\rm{p}}}} \end{array}} \right.$ （21）

 $\dot \varphi = \frac{{\varphi - {\varphi ^\prime }}}{T}$ （22）
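
The angle differences of Eq. (21) and the rate of Eq. (22) can be computed as follows. This is a hedged sketch: it assumes that the primed quantity in Eq. (22) is the value from the previous sampling step and that T is the sampling period, and it swaps the plain inverse tangent for `math.atan2` so that the quadrant is resolved when the x separation is negative or zero. The function names are hypothetical.

```python
import math

def relative_angles(pursuer_pos, evader_pos, delta_p, delta_e, theta_p):
    """Angle differences (delta_delta, delta_theta) of Eq. (21)."""
    dx = evader_pos[0] - pursuer_pos[0]
    dy = evader_pos[1] - pursuer_pos[1]
    d_delta = delta_e - delta_p
    # atan2 replaces tan^-1(dy/dx); the two agree whenever dx > 0.
    d_theta = math.atan2(dy, dx) - theta_p
    return d_delta, d_theta

def finite_difference_rate(phi, phi_prev, period):
    """Backward-difference rate of Eq. (22), assuming phi_prev is the previous sample."""
    return (phi - phi_prev) / period

# Initial positions of P and E from the example (km), with zero thrust angles assumed.
d_delta, d_theta = relative_angles((0.0, 0.0, 0.0), (50.0, 50.0, 0.0), 0.0, 0.0, 0.0)
rate = finite_difference_rate(0.5, 0.3, 0.1)
```

For the initial geometry the evader lies on the diagonal of the x-y plane, so the in-plane line-of-sight angle is 45 degrees.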

 Fig. 5 Training curves of two algorithms

 Fig. 6 Trajectory of pursuit-evasion game after learning 0 times
 Fig. 7 Trajectory of pursuit-evasion game after learning 500 times

 Fig. 8 Variation of q-function error rate with number of training times

 Fig. 9 Probability distribution of pursuit-evasion behavior
 Fig. 10 Behavior control quantities after learning 1 000 times
 Fig. 11 Trajectory of pursuit-evasion game after learning 1 000 times
5 Conclusions

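1) A pursuit-evasion motion model of spacecraft in low Earth orbit is constructed, the Nash equilibrium strategy of the pursuit-evasion game is given, and the space rendezvous strategy problem for non-cooperative targets is reformulated as a differential game problem.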

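2) A fuzzy inference model of space behavior is constructed, which maps continuous states through fuzzy inference to continuous behavior outputs and effectively avoids the curse of dimensionality that traditional deep reinforcement learning faces in continuous spaces.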

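3) A new branching deep reinforcement learning architecture is proposed, which realizes branch training and shared decision-making of behavior strategies and effectively resolves the combinatorial growth of the number of behaviors and mapping rules.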

