ACTA AERONAUTICA ET ASTRONAUTICA SINICA
Multiagent opponent modeling with incomplete information
Received date: 2023-10-27
Revised date: 2023-11-20
Accepted date: 2023-12-20
Online published: 2024-01-04
Supported by
National Natural Science Foundation of China (62106172); Xiaomi Young Talents Program
Opponent modeling aims to model the opponent's strategy so as to maximize the payoff of the controlled agent. Most previous work fails to handle situations in which opponent information is limited. To address this problem, we propose an approach for Opponent Modeling with Incomplete Information (OMII), which can extract cross-epoch opponent strategy representations solely from the agent's own observations when opponent information is limited. OMII introduces a novel policy-based data augmentation method that learns opponent strategy representations offline through contrastive learning, and uses these representations as an additional input for training a general response policy. During online testing, OMII extracts the opponent's strategy representation from recent historical trajectories and, together with the general policy, responds dynamically to the opponent's strategy. Moreover, OMII guarantees a lower bound on the expected payoff by balancing conservatism against exploitation. Experimental results demonstrate that even with limited opponent information, OMII accurately extracts opponent strategy representations, generalizes to previously unseen strategies, and outperforms existing opponent modeling algorithms.
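As a concrete illustration of the offline stage described in the abstract, the sketch below shows one way the contrastive representation learning could be set up: self-observation trajectories collected against the same opponent policy are treated as positive pairs (one reading of the "policy-based data augmentation" idea), and an InfoNCE objective pulls their embeddings together while pushing apart embeddings from different opponent policies. The encoder architecture, the InfoNCE loss, and all names (`TrajectoryEncoder`, `info_nce_loss`) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of contrastive opponent-representation learning.
# Assumptions (not from the paper): GRU encoder, InfoNCE objective,
# and positive pairs formed by trajectories gathered against the
# same opponent policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryEncoder(nn.Module):
    """Encodes a self-observation trajectory into a strategy representation."""
    def __init__(self, obs_dim: int, hidden_dim: int = 64, repr_dim: int = 32):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, repr_dim)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, time, obs_dim) -> representation: (batch, repr_dim)
        _, h = self.gru(traj)
        return F.normalize(self.head(h[-1]), dim=-1)

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE: each anchor's positive is the same-index row of `positive`;
    all other rows in the batch serve as negatives."""
    logits = anchor @ positive.t() / temperature   # (batch, batch) similarities
    labels = torch.arange(anchor.size(0))          # diagonal entries = positives
    return F.cross_entropy(logits, labels)

# Toy usage: two batches of trajectories, where row i of each batch was
# collected against the same opponent policy, act as anchor/positive views.
encoder = TrajectoryEncoder(obs_dim=8)
view_a = torch.randn(16, 20, 8)   # 16 trajectories, 20 steps, 8-dim observations
view_b = torch.randn(16, 20, 8)
loss = info_nce_loss(encoder(view_a), encoder(view_b))
loss.backward()
```

Under this reading, the learned representation would be concatenated to the policy's observation input when training the general response policy, and at test time the same encoder would be applied to a sliding window of recent trajectories to produce the representation the policy conditions on.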
Youpeng DENG, Jiaxuan FAN, Yan ZHENG, Zhenya WANG, Yongliang LYU, Yuxiao LI. Multiagent opponent modeling with incomplete information[J]. ACTA AERONAUTICA ET ASTRONAUTICA SINICA, 2023, 44(S2): 729782-729782. DOI: 10.7527/S1000-6893.2023.29782
1. FOERSTER J N, CHEN R Y, AL-SHEDIVAT M, et al. Learning with opponent-learning awareness[DB/OL]. arXiv preprint: 1709.04326, 2017.
2. HE H, BOYD-GRABER J, KWOK K, et al. Opponent modeling in deep reinforcement learning[C]∥ Proceedings of the 33rd International Conference on Machine Learning - Volume 48. New York: ACM, 2016: 1804-1813.
3. HONG Z W, SU S Y, SHANN T Y, et al. A deep policy inference Q-network for multi-agent systems[DB/OL]. arXiv preprint: 1712.07893, 2017.
4. ZHENG Y, MENG Z P, HAO J Y, et al. A deep Bayesian policy reuse approach against non-stationary agents[C]∥ Advances in Neural Information Processing Systems, 2018, 31.
5. ZHENG Y, MENG Z P, HAO J Y, et al. Weighted double deep multiagent reinforcement learning in stochastic cooperative environments[M]∥ Lecture Notes in Computer Science. Cham: Springer International Publishing, 2018: 421-429.
6. ZHENG Y, HAO J Y, ZHANG Z Z, et al. Efficient multiagent policy optimization based on weighted estimators in stochastic cooperative environments[J]. Journal of Computer Science and Technology, 2020, 35(2): 268-280.
7. ZHENG Y, HAO J Y, ZHANG Z Z, et al. A strategy optimization algorithm based on weighted estimation in a multi-agent cooperative environment[J]. Journal of Computer Science and Technology, 2020, 35(2): 268-280 (in Chinese).
8. ZHENG Y, HAO J Y, ZHANG Z Z, et al. Efficient policy detecting and reusing for non-stationarity in Markov games[J]. Autonomous Agents and Multi-Agent Systems, 2020, 35(1): 2.
9. HAO X T, WANG W X, MAO H Y, et al. API: Boosting multi-agent reinforcement learning via agent-permutation-invariant networks[DB/OL]. arXiv preprint: 2203.05285, 2022.
10. SUN J W, ZHENG Y, HAO J Y, et al. Continuous multiagent control using collective behavior entropy for large-scale home energy management[DB/OL]. arXiv preprint: 2005.10000, 2020.
11. LI P Y, TANG H Y, YANG T P, et al. PMIC: Improving multi-agent reinforcement learning with progressive mutual information collaboration[DB/OL]. arXiv preprint: 2203.08553, 2022.
12. RAILEANU R, DENTON E, SZLAM A, et al. Modeling others using oneself in multi-agent reinforcement learning[DB/OL]. arXiv preprint: 1802.09640, 2018.
13. ROSMAN B, HAWASLY M, RAMAMOORTHY S. Bayesian policy reuse[J]. Machine Learning, 2016, 104(1): 99-127.
14. HERNANDEZ-LEAL P, TAYLOR M E, ROSMAN B S, et al. Identifying and tracking switching, non-stationary opponents: A Bayesian approach[C]∥ Workshop on Multiagent Interaction without Prior Coordination (MIPC) at AAAI-16, 2016: 560-566.
15. YANG T P, HAO J Y, MENG Z P, et al. Towards efficient detection and optimal response against sophisticated opponents[DB/OL]. arXiv preprint: 1809.04240, 2018.
16. GANZFRIED S, WANG K A, CHISWICK M. Bayesian opponent modeling in multiplayer imperfect-information games[DB/OL]. arXiv preprint: 2212.06027, 2022.
17. VON KÜGELGEN J, USTYUZHANINOV I, GEHLER P, et al. Towards causal generative scene models via competition of experts[DB/OL]. arXiv preprint: 2004.12906, 2020.
18. LUO J R, ZHANG W P, YUAN W L, et al. Research on opponent modeling framework for multi-agent game confrontation[J]. Journal of System Simulation, 2022, 34(9): 1941-1955 (in Chinese).
19. WU T D, SHI Y. Research on opponent modeling in imperfect information games[J]. Journal of Henan University of Science and Technology (Natural Science), 2019, 40(1): 54-59, 7 (in Chinese).
20. VAN DEN OORD A, LI Y Z, VINYALS O. Representation learning with contrastive predictive coding[DB/OL]. arXiv preprint: 1807.03748, 2018.
21. HE K M, FAN H Q, WU Y X, et al. Momentum contrast for unsupervised visual representation learning[C]∥ 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2020: 9726-9735.
22. CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[DB/OL]. arXiv preprint: 2002.05709, 2020.
23. AUER P, CESA-BIANCHI N, FREUND Y, et al. The nonstochastic multiarmed bandit problem[J]. SIAM Journal on Computing, 2002, 32(1): 48-77.
24. FU H B, TIAN Y, YU H X, et al. Greedy when sure and conservative when uncertain about the opponents[C]∥ Proceedings of the 39th International Conference on Machine Learning (ICML), 2022: 6829-6848.