航空学报 (Acta Aeronautica et Astronautica Sinica) > 2023, Vol. 44, Issue (S2): 729782-729782   doi: 10.7527/S1000-6893.2023.29782

Multiagent Opponent Modeling with Incomplete Information

Youpeng DENG1,2, Jiaxuan FAN1, Yan ZHENG2(), Zhenya WANG1, Yongliang LYU2, Yuxiao LI2

  1. Academy of Aerospace Science and Innovation, Beijing 102600, China
    2. College of Intelligence and Computing, Tianjin University, Tianjin 300072, China
  • Received: 2023-10-27 Revised: 2023-11-20 Accepted: 2023-12-20 Online: 2023-12-25 Published: 2024-01-04
  • Corresponding author: Yan ZHENG E-mail: yanzheng@tju.edu.cn
  • Supported by:
    National Natural Science Foundation of China (62106172); Xiaomi Young Talents Program


Abstract:

The goal of opponent modeling is to model the opponent's strategy so as to maximize the return of the main agent. Most previous work fails to handle situations where opponent information is limited. To address this problem, we propose Opponent Modeling with Incomplete Information (OMII), which extracts cross-episode representations of the opponent's strategy using only the agent's own observations when opponent information is limited. OMII introduces a novel policy-based data augmentation scheme that learns opponent strategy representations offline via contrastive learning and uses them as an additional input to train a general response policy. During online testing, OMII extracts the opponent strategy representation from the trajectories of recent episodes and combines it with the general policy to respond dynamically to the opponent's strategy. Moreover, OMII guarantees a lower bound on the expected return by balancing conservatism and exploitation. Experimental results show that even with limited opponent information, OMII accurately extracts opponent strategy representations, generalizes to previously unseen strategies to a certain extent, and outperforms existing opponent modeling algorithms.

Keywords: decision-making intelligence, reinforcement learning, opponent modeling, contrastive learning, multiagent systems

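The offline step described in the abstract, learning opponent strategy representations by contrastive learning over policy-based positive pairs, can be sketched as follows. This is a minimal illustration under assumed details, not the paper's implementation: a linear trajectory encoder, an InfoNCE loss, and Gaussian toy "trajectory segments" stand in for the actual architecture and data; two segments generated by the same opponent policy are treated as a positive pair.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(traj, W):
    """Embed a flattened trajectory segment with a linear encoder, then L2-normalize."""
    z = traj @ W
    return z / (np.linalg.norm(z) + 1e-8)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE: each anchor should match its own positive against all other positives."""
    logits = anchors @ positives.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                    # always >= 0

# Toy setup: 3 opponent policies, each producing trajectory segments near a
# policy-specific mean. Two segments collected against the same opponent
# policy form a positive pair (the "policy-based data augmentation").
dim, n_policies = 8, 3
W = rng.normal(size=(dim, 4))
policy_means = rng.normal(size=(n_policies, dim)) * 3.0

anchors, positives = [], []
for mu in policy_means:
    seg_a = mu + 0.1 * rng.normal(size=dim)    # view 1: same opponent policy
    seg_b = mu + 0.1 * rng.normal(size=dim)    # view 2: same opponent policy
    anchors.append(encode(seg_a, W))
    positives.append(encode(seg_b, W))

loss = info_nce(np.stack(anchors), np.stack(positives))
```

In a full pipeline, the learned embedding would then be fed to the response policy as an additional input offline, and recomputed online from the last few episodes of self-observed trajectories to select the response dynamically.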
