

Strategy solution of non-cooperative target pursuit-evasion game based on branching deep reinforcement learning
LIU Bingyan1,2, YE Xiongbing1, GAO Yong2, WANG Xinbo2, NI Lei3
1. Academy of Military Sciences, Beijing 100091, China;
2. 32032 Troops, Beijing 100094, China;
3. Space Engineering University, Beijing 101416, China
Abstract: To solve the space rendezvous problem between a spacecraft and a non-cooperative target, and to alleviate the limitations of deep reinforcement learning in continuous spaces, this paper proposes a pursuit-evasion game algorithm based on branching deep reinforcement learning to obtain the space rendezvous strategy. The differential game is used to solve the optimal control problem of space rendezvous with a non-cooperative target, which is described as a pursuit-evasion game under continuous thrust. To avoid the curse of dimensionality that traditional deep reinforcement learning encounters in continuous spaces, this paper constructs a fuzzy inference model to represent the continuous space, and proposes a branching deep reinforcement learning architecture with multiple parallel neural networks and a shared decision module. The combination of optimal control and game theory effectively overcomes the difficulty that classical optimal control theory has in solving highly nonlinear differential game models, and further improves the ability of deep reinforcement learning to train discrete behaviors. Finally, a numerical example verifies the effectiveness of the algorithm.
Keywords: non-cooperative targets    space rendezvous    pursuit-evasion problem of spacecraft    continuous space    differential game    deep reinforcement learning    branching architectures

1 Dynamics Model of the Spacecraft and the Non-cooperative Target

Fig. 1 Coordinate frame sketch of spacecraft and non-cooperative target
$\left\{ \begin{aligned} \ddot{x}_i(t) &= 2\frac{\mu}{r^3(t)}x_i(t) + 2\omega(t)\dot{y}_i(t) + \dot{\omega}(t)y_i(t) + \omega^2(t)x_i(t) + \frac{T_i}{m_i}\cos\delta_i\cos\theta_i \\ \ddot{y}_i(t) &= -\frac{\mu}{r^3(t)}y_i(t) + 2\omega(t)\dot{x}_i(t) + \dot{\omega}(t)x_i(t) + \omega^2(t)y_i(t) + \frac{T_i}{m_i}\cos\delta_i\sin\theta_i \\ \ddot{z}_i(t) &= -\omega^2(t)z_i(t) + \frac{T_i}{m_i}\sin\delta_i \end{aligned} \right.$ （1）
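To make Eq. (1) concrete, the following is a minimal numerical integration sketch of the relative dynamics, assuming a circular reference orbit (constant $\omega$, so $\dot{\omega} = 0$) and illustrative thrust parameters; the function and variable names, orbit radius, and thrust level are our assumptions, not values from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

MU = 3.986e5  # Earth's gravitational parameter, km^3/s^2

def relative_dynamics(t, s, a_thrust, delta, theta, r, omega, omega_dot=0.0):
    """Right-hand side of Eq. (1) for one player (pursuer or evader).

    s = [x, y, z, vx, vy, vz] in km and km/s; a_thrust = T_i/m_i is the
    thrust acceleration in km/s^2; delta/theta are the thrust angles.
    """
    x, y, z, vx, vy, vz = s
    ax = (2 * MU / r**3) * x + 2 * omega * vy + omega_dot * y \
        + omega**2 * x + a_thrust * np.cos(delta) * np.cos(theta)
    ay = -(MU / r**3) * y + 2 * omega * vx + omega_dot * x \
        + omega**2 * y + a_thrust * np.cos(delta) * np.sin(theta)
    az = -omega**2 * z + a_thrust * np.sin(delta)
    return [vx, vy, vz, ax, ay, az]

# Example: propagate one spacecraft for 600 s with a fixed thrust direction.
r = 6878.0                  # assumed circular reference orbit radius, km
omega = np.sqrt(MU / r**3)  # orbital rate; omega_dot = 0 on a circular orbit
sol = solve_ivp(relative_dynamics, (0.0, 600.0),
                [0.0, 0.0, 0.0, -0.0496, 0.0418, 0.0],
                args=(5e-7, 0.0, 0.0, r, omega), max_step=1.0)
```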

$j(u_{\rm p}, u_{\rm e}) = \left\| \left. \left[ x_{\rm p} - x_{\rm e},\ y_{\rm p} - y_{\rm e},\ z_{\rm p} - z_{\rm e} \right]^{\rm T} \right|_{t = t_{\rm f}} \right\|_2^2$ （2）

 $J({u_{\rm{p}}},{u_{\rm{e}}}) = k\int_{{t_0}}^{{t_{\rm{f}}}} {\rm{d}} t + (1 - k)\int_{{t_0}}^{{t_{\rm{f}}}} j ({u_{\rm{p}}},{u_{\rm{e}}}){\rm{d}}t$ （3）

 $J(u_{\rm{p}}^*,{u_{\rm{e}}}) \le J(u_{\rm{p}}^*,u_{\rm{e}}^*) \le J({u_{\rm{p}}},u_{\rm{e}}^*)$ （4）

 ${J^*} = \mathop {{\rm{min}}}\limits_{u_{\rm{p}}^*} \mathop {{\rm{max}}}\limits_{u_{\rm{e}}^*} J = \mathop {{\rm{max}}}\limits_{u_{\rm{e}}^*} \mathop {{\rm{min}}}\limits_{u_{\rm{p}}^*} J$ （5）
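Eqs. (4) and (5) state that the optimal strategy pair forms a saddle point: the pursuer minimizes $J$ while the evader maximizes it, and the two optimization orders yield the same value. A toy check on a hypothetical discretized payoff matrix (the values below are ours, purely for illustration):

```python
import numpy as np

# Hypothetical payoff J[i, j]: the pursuer chooses row i (minimizer),
# the evader chooses column j (maximizer).
J = np.array([[4.0, 3.0, 6.0],
              [8.0, 5.0, 7.0],
              [9.0, 5.0, 8.0]])

upper = J.max(axis=1).min()   # min over u_p of max over u_e
lower = J.min(axis=0).max()   # max over u_e of min over u_p
assert upper == lower == 6.0  # equal values -> a saddle point exists, Eq. (5)
```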

2 Fuzzy Inference Model of Spatial Behavior

$R_l:\ {\rm IF}\ x_1\ {\rm is}\ A_1^l\ {\rm AND}\ x_2\ {\rm is}\ A_2^l\ {\rm AND}\ \cdots\ {\rm AND}\ x_n\ {\rm is}\ A_n^l\ {\rm THEN}\ u_l = c_l$ （6）

Fig. 2 TSK fuzzy inference model of spatial behavior

 $L_l^2 = \prod\limits_{i = 1}^n {{\mu ^{A_i^l}}} ({x_i})$ （7）

$L_l^3 = \varPsi^l = \frac{L_l^2}{\sum\limits_{l = 1}^L L_l^2} = \frac{\prod\limits_{i = 1}^n \mu^{A_i^l}(x_i)}{\sum\limits_{l = 1}^L \prod\limits_{i = 1}^n \mu^{A_i^l}(x_i)}$ （8）

 $L_l^4 = L_l^3{c_l}$ （9）

$L^5 = u = \sum\limits_{l = 1}^L L_l^4 = \frac{\sum\limits_{l = 1}^L \left( \prod\limits_{i = 1}^n \mu^{A_i^l}(x_i)\, c_l \right)}{\sum\limits_{l = 1}^L \prod\limits_{i = 1}^n \mu^{A_i^l}(x_i)} = \sum\limits_{l = 1}^L \varPsi^l c_l$ （10）

 ${\mu ^{A_i^l}}({x_i}) = {\rm{exp}}\left( { - {{\left( {\frac{{{x_i} - m_i^l}}{{\sigma _i^l}}} \right)}^2}} \right)$ （11）
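A minimal sketch of the five-layer inference of Eqs. (6)-(11): Gaussian memberships (layer 2), normalized firing strengths $\varPsi^l$ (layer 3), and the defuzzified continuous action (layers 4-5). Array shapes and names are our assumptions.

```python
import numpy as np

def tsk_infer(x, m, sigma, c):
    """Zero-order TSK inference over L rules and n inputs.

    x: (n,) state; m, sigma: (L, n) Gaussian centres/widths of A_i^l, Eq. (11);
    c: (L,) rule consequents c_l. Returns the crisp action u of Eq. (10).
    """
    mu = np.exp(-((x - m) / sigma) ** 2)   # memberships, Eq. (11)
    w = mu.prod(axis=1)                    # firing strengths L_l^2, Eq. (7)
    psi = w / w.sum()                      # normalised strengths Psi^l, Eq. (8)
    return float(psi @ c)                  # weighted consequents, Eqs. (9)-(10)

# Three rules over one input: a smooth continuous action from discrete consequents.
m, sigma = np.array([[-1.0], [0.0], [1.0]]), np.ones((3, 1))
u = tsk_infer(np.array([0.3]), m, sigma, c=np.array([-0.5, 0.0, 0.5]))
```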

3 Branching Deep Reinforcement Learning for the Pursuit-Evasion Game

3.1 Multiple Parallel Network Branches

Fig. 3 Schematic diagram of the branching deep reinforcement learning architecture
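The architecture of Fig. 3 can be sketched as a shared trunk feeding one state-value head (the shared decision module) and several parallel advantage branches, one per action dimension, aggregated as in Eq. (17) below. Layer sizes and the framework choice (PyTorch) are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BranchingQNet(nn.Module):
    """Shared trunk + parallel advantage branches + shared state-value head."""

    def __init__(self, state_dim, n_branches, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)   # shared decision module, v(S)
        self.branches = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_branches)])

    def forward(self, s):
        h = self.trunk(s)
        v = self.value(h)                    # (batch, 1)
        q = []
        for branch in self.branches:
            o = branch(h)                    # per-branch advantages o_t(S, a^l)
            # dueling aggregation of Eq. (17): q = v + (o - max_a o)
            q.append(v + o - o.max(dim=1, keepdim=True).values)
        return torch.stack(q, dim=1)         # (batch, branches, actions)

q = BranchingQNet(state_dim=6, n_branches=3, n_actions=11)(torch.zeros(1, 6))
```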
3.2 Shared Behavior Decision Module

Fig. 4 Schematic diagram of shared behavior decision based on improved reinforcement learning

$R_l:\ {\rm IF}\ x_1\ {\rm is}\ A_1^l\ {\rm AND}\ x_2\ {\rm is}\ A_2^l\ {\rm AND}\ \cdots\ {\rm AND}\ x_n\ {\rm is}\ A_n^l\ {\rm THEN}\ u_l = a^l$ （12）

$a^l = \begin{cases} \text{randomly selected from } a, & \text{with probability } \varepsilon \\ \mathop{\arg\max}\limits_{a^l \in a} q(S, a^l), & \text{with probability } 1 - \varepsilon \end{cases}$ （13）

${q_t}(S, a^l) = E[G_t \,|\, S_t = S, a_t = a, \varepsilon\text{-greedy}]$ （14）
${v_t}(S) = E_{a \sim \varepsilon\text{-greedy}}[q_t(S, a^l)]$ （15）

 ${o_t}(S,{a^l}) = {q_t}(S,{a^l}) - {v_t}(S)$ （16）

$q_t(S, a^l) = v_t(S) + \left( o_t(S, a^l) - \mathop{\max}\limits_{a^l \in a} o_t(S, a^l) \right)$ （17）

$u^*(\bar{x}_t) = v_t(S) + \sum\limits_{l = 1}^L \left( \varPsi_t^l\, o_t(S, a^l) - \varPsi_t^l \mathop{\max}\limits_{a^l \in a} o_t(S, a^l) \right)$ （18）

 ${p_t} = {R_{t + 1}} + \gamma {u^*}({\bar x_{t + 1}}) - u({\bar x_t})$ （19）

In the q-function update phase, the q values are updated through autonomous iterative training:

 ${q_{t + 1}}(l,{a^l}) = {q_t}(l,{a^l}) + \eta {p_t}\varPsi _t^l$ （20）
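Putting Eqs. (13) and (19)-(20) together, one training step can be sketched as follows: each rule picks a discrete action $\varepsilon$-greedily, the firing strengths $\varPsi$ blend them into a crisp control, and the TD error then updates every rule's q value in proportion to $\varPsi$. Helper names and the greedy-value form of the TD target are our reading of the equations, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_actions(q, psi, actions, eps):
    """Per-rule epsilon-greedy choice, Eq. (13); the crisp control is the
    Psi-weighted blend of the chosen discrete actions."""
    greedy = q.argmax(axis=1)                          # arg max_a q(l, a^l)
    random = rng.integers(0, q.shape[1], size=q.shape[0])
    chosen = np.where(rng.random(q.shape[0]) < eps, random, greedy)
    return chosen, float(psi @ actions[chosen])

def q_step(q, psi, chosen, reward, q_next, psi_next, gamma=0.99, eta=0.05):
    """TD error of Eq. (19) and rule-wise update of Eq. (20)."""
    u = float(psi @ q[np.arange(len(q)), chosen])      # value of the action taken
    u_star = float(psi_next @ q_next.max(axis=1))      # greedy value u*, cf. Eq. (18)
    p = reward + gamma * u_star - u                    # TD error, Eq. (19)
    q[np.arange(len(q)), chosen] += eta * p * psi      # Eq. (20), credit by firing strength
    return p
```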

3.3 Game Interaction between the Spacecraft and the Non-cooperative Target

4 Numerical Example

| Object | x/km | y/km | z/km | $\dot{x}$/(km·s⁻¹) | $\dot{y}$/(km·s⁻¹) | $\dot{z}$/(km·s⁻¹) |
|--------|------|------|------|--------------------|--------------------|--------------------|
| P | 0 | 0 | 0 | -0.0496 | 0.0418 | 0 |
| E | 50 | 50 | 0 | -0.0371 | 0.0314 | 0 |
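For reference, the initial conditions in the table map directly onto the state vectors used by the dynamics sketch after Eq. (1) (units: km and km/s):

```python
s_p0 = [0.0, 0.0, 0.0, -0.0496, 0.0418, 0.0]    # pursuer P
s_e0 = [50.0, 50.0, 0.0, -0.0371, 0.0314, 0.0]  # evader E
```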

$\left\{ \begin{aligned} \Delta\delta &= \delta_{\rm e} - \delta_{\rm p} \\ \Delta\theta &= \tan^{-1}\left( \frac{y_{\rm e} - y_{\rm p}}{x_{\rm e} - x_{\rm p}} \right) - \theta_{\rm p} \end{aligned} \right.$ （21）

 $\dot \varphi = \frac{{\varphi - {\varphi ^\prime }}}{T}$ （22）
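A small sketch of the observation quantities in Eqs. (21)-(22); we use atan2 rather than a plain arctangent so the bearing keeps the correct quadrant (an implementation choice, not stated in the paper):

```python
import math

def relative_angles(delta_p, theta_p, delta_e, pos_p, pos_e):
    """Relative thrust angles of Eq. (21); pos_* = (x, y) in km."""
    d_delta = delta_e - delta_p
    d_theta = math.atan2(pos_e[1] - pos_p[1], pos_e[0] - pos_p[0]) - theta_p
    return d_delta, d_theta

def angle_rate(phi, phi_prev, T):
    """Finite-difference angle rate of Eq. (22) over one decision period T."""
    return (phi - phi_prev) / T
```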

Fig. 5 Training curves of the two algorithms

Fig. 6 Pursuit-evasion game trajectory after 0 learning episodes
Fig. 7 Pursuit-evasion game trajectory after 500 learning episodes

Fig. 8 Variation of the q-function error rate with the number of training episodes

Fig. 9 Probability distribution of pursuit-evasion behaviors
Fig. 10 Behavior control outputs after 1 000 learning episodes
Fig. 11 Pursuit-evasion game trajectory after 1 000 learning episodes
5 Conclusions

1) A pursuit-evasion motion model for spacecraft in low Earth orbit was established, the Nash equilibrium strategy of the pursuit-evasion game was given, and the space rendezvous strategy problem for a non-cooperative target was reformulated as a differential game problem.

2) A fuzzy inference model of spatial behavior was constructed, realizing the mapping from continuous states, through fuzzy inference, to continuous behavior outputs, which effectively avoids the curse of dimensionality that traditional deep reinforcement learning faces in continuous spaces.

3) A new branching deep reinforcement learning architecture was proposed, realizing branched training of behavior strategies with shared decision-making, which effectively solves the combinatorial growth of the number of behaviors and mapping rules.


#### Article Information

LIU Bingyan, YE Xiongbing, GAO Yong, WANG Xinbo, NI Lei

Strategy solution of non-cooperative target pursuit-evasion game based on branching deep reinforcement learning

Acta Aeronautica et Astronautica Sinica, 2020, 41(10): 324040.
http://dx.doi.org/10.7527/S1000-6893.2020.24040