Multiagent opponent modeling with incomplete information
Received date: 2023-10-27
Revised date: 2023-11-20
Accepted date: 2023-12-20
Online published: 2024-01-04
Supported by
National Natural Science Foundation of China (62106172); Xiaomi Young Talents Program
The goal of opponent modeling is to model the opponent's strategy so as to maximize the payoff of the main agent. Most previous work fails to handle situations in which opponent information is limited. To address this problem, we propose an approach for opponent modeling with incomplete information (OMII), which extracts cross-episode opponent strategy representations using only the agent's own observations when opponent information is limited. OMII introduces a novel policy-based data augmentation scheme and learns opponent strategy representations offline through contrastive learning; these representations then serve as an additional input for training a general response policy. During online testing, OMII extracts the opponent's strategy representation from the trajectories of the most recent episodes and combines it with the general policy to respond dynamically to the opponent's strategy. Moreover, OMII guarantees a lower bound on the expected payoff by balancing conservatism and exploitation. Experimental results show that even with limited opponent information, OMII accurately extracts opponent strategy representations, generalizes to unseen opponent strategies to a certain extent, and outperforms existing opponent modeling algorithms.

邓有朋, 范佳宣, 郑岩, 王振亚, 吕勇梁, 李雨霄. Multiagent opponent modeling with incomplete information[J]. Acta Aeronautica et Astronautica Sinica, 2023, 44(S2): 729782. DOI: 10.7527/S1000-6893.2023.29782
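As a rough illustration of the offline representation-learning step described in the abstract, the sketch below assumes that "policy-based data augmentation" treats two self-observation trajectories collected against the same opponent policy as a positive pair, encodes each trajectory with a recurrent network, and optimizes an InfoNCE-style contrastive loss. This is not the authors' implementation; the names (TrajEncoder, info_nce) and all hyperparameters are illustrative.

```python
# Minimal sketch of contrastive opponent-strategy representation learning,
# under the assumptions stated above (not the OMII reference code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrajEncoder(nn.Module):
    """Encode a trajectory of self-observations into an opponent-strategy embedding."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64, embed_dim: int = 32):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, embed_dim)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, T, obs_dim) -> (batch, embed_dim), L2-normalized
        _, h = self.gru(traj)
        return F.normalize(self.head(h[-1]), dim=-1)


def info_nce(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE loss: row i of `positive` is the match for row i of `anchor`;
    the other rows in the batch (other opponent policies) act as negatives."""
    logits = anchor @ positive.t() / tau       # (batch, batch) similarity matrix
    labels = torch.arange(anchor.size(0))      # diagonal entries are the positives
    return F.cross_entropy(logits, labels)


# Usage sketch: traj_a[i] and traj_b[i] were collected against the same opponent policy.
encoder = TrajEncoder(obs_dim=16)
traj_a = torch.randn(8, 20, 16)   # (batch, T, obs_dim), placeholder data
traj_b = torch.randn(8, 20, 16)
loss = info_nce(encoder(traj_a), encoder(traj_b))
loss.backward()
```

At test time, the same encoder could in principle be applied to the observations of the last few episodes, and the resulting embedding concatenated to the response policy's input, matching the online phase sketched in the abstract.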