Human-Agent Coordination in Games under Incomplete Information via Multi-Step Intent

Shenghui Chen*, Ruihan Zhao*, Sandeep Chinchali, Ufuk Topcu
University of Texas at Austin
International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2025

*Indicates Equal Contribution

Example human-robot interaction in our Gnomes at Night environment: The agent and human partner take turns controlling a single token S in a maze. Each player sees a different maze layout, which is hidden from their partner. The goal is for the team to reach the goal state T as quickly as possible. The player in control is highlighted in blue.

Within each turn, the player in control can take multiple movement steps. After moving, they can communicate their intent to their partner and hand over control. The red dots represent the agent's intent, and the green cells represent the human's intent. The agent is powered by our proposed IntentMCTS algorithm, which combines a memory module that infers the human's maze layout with Monte Carlo Tree Search to plan the best action sequence for goal reaching and intent following.

Abstract

Strategic coordination between autonomous agents and human partners under incomplete information can be modeled as turn-based cooperative games. We extend a turn-based game under incomplete information, the shared-control game, to allow players to take multiple actions per turn rather than a single action. The extension enables the use of multi-step intent, which we hypothesize will improve performance in long-horizon tasks. To synthesize cooperative policies for the agent in this extended game, we propose an approach featuring a memory module for a running probabilistic belief of the environment dynamics and an online planning algorithm called IntentMCTS. This algorithm strategically selects the next action by leveraging any communicated multi-step intent via reward augmentation while considering the current belief. Agent-to-agent simulations in the Gnomes at Night testbed demonstrate that IntentMCTS requires fewer steps and control switches than baseline methods. A human-agent user study corroborates these findings, showing an 18.52% higher success rate compared to the heuristic baseline and a 5.56% improvement over the single-step prior work. Participants also report lower cognitive load and frustration, and higher satisfaction with the IntentMCTS agent partner.

Overview

We study a coordination problem between an autonomous agent and a human partner, where they take turns controlling a single token in a maze-like environment. Unlike traditional turn-based games, each player can take multiple actions per turn and transfer control. We also allow players to communicate multi-step intent—a sequence of desired future states. The objective is to develop an agent policy that helps the team reach the goal as quickly as possible while respecting game dynamics and the human partner's intent.

Our approach consists of two main components: a memory module and the IntentMCTS planner.

Memory Module

Our Memory Module builds the agent's belief about the human's maze layout by keeping track of the human's actions during the game. This belief quantifies the feasibility of human actions and enables our planning algorithm to compute expected returns.

  • Take Actions, Not Words: Some previous methods rely on what players say, which can be incorrect. By focusing on observed behavior instead, the agent avoids these pitfalls.
  • Confidence Matters: The agent gives stronger weight to actions it sees (e.g., "The partner crossed a wall on the right—this right path is definitely open!") and weaker weight to actions it doesn't see (e.g., "They didn't go right—maybe it's blocked, or maybe they just chose left").

Given partner history \( h^H = (x_1, a_1, x_2, a_2, \dots, x_m, a_m) \) and action space \(A\), the agent performs the following:

Parameter Updates:

\[ \begin{aligned} \text{For the action $a_t$ taken:}& \quad \alpha(x_t,a_t) = \alpha(x_t,a_t) + c^+ \\ \text{For the actions $a'\in A\setminus \{a_t\}$ not taken:}& \quad \beta(x_t,a') = \beta(x_t,a') + c^- \end{aligned} \]

Belief Computation:

\[ b(x,a) = \frac{\alpha(x,a)}{\alpha(x,a) + \beta(x,a)}, \quad c^+ > c^- \]

* Via Bernoulli-Beta conjugacy (See Section 4.1 and Algorithm 1 in the paper)
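
To make the update rule and belief computation concrete, below is a minimal Python sketch of the memory module. The class and method names (MemoryModule, update, belief), the uniform Beta(1, 1) prior, and the default pseudo-count values are illustrative assumptions; Algorithm 1 in the paper is the authoritative version.

from collections import defaultdict

class MemoryModule:
    """Beta-Bernoulli belief over the feasibility of each (state, action) in the human's maze."""

    def __init__(self, actions, c_plus=2.0, c_minus=1.0):
        assert c_plus > c_minus  # observed actions carry stronger evidence than unobserved ones
        self.actions = actions
        self.c_plus, self.c_minus = c_plus, c_minus
        # Beta(alpha, beta) pseudo-counts per (state, action), starting from a uniform prior.
        self.alpha = defaultdict(lambda: 1.0)
        self.beta = defaultdict(lambda: 1.0)

    def update(self, partner_history):
        """partner_history: iterable of (state, action) pairs observed during the human's turns."""
        for x_t, a_t in partner_history:
            # The action the human actually took must be feasible in their maze.
            self.alpha[(x_t, a_t)] += self.c_plus
            # Actions not taken are only weak evidence of infeasibility.
            for a in self.actions:
                if a != a_t:
                    self.beta[(x_t, a)] += self.c_minus

    def belief(self, x, a):
        """Posterior mean b(x, a): probability that action a is feasible for the human at state x."""
        return self.alpha[(x, a)] / (self.alpha[(x, a)] + self.beta[(x, a)])

With the assumed defaults, calling memory.update([(s1, 'right')]) raises b(s1, 'right') to 0.75 while b(s1, 'left') falls to about 0.33, reflecting the stronger weight given to observed actions.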

IntentMCTS


Our planning algorithm, IntentMCTS, builds on the Monte Carlo Tree Search framework and incorporates three main modifications for applicability to our problem.

  • Agent vs. Human Controlled Nodes: Either the agent (circle) or the human (square) has control at an environment state. When the agent has control, expansion and rollout follow its maze layout, hence all the actions (blue arrows) are feasible. The agent must estimate the feasibility of human actions (dashed green arrows) using the memory module. The "switch turn" action is always feasible.
  • Human Intent Reward Augmentation: We propose a principled and smooth approach to balance goal reaching and multi-step intent following by adding an intent bonus to the sparse goal reward. Specifically, given an intent trajectory (states that the human wants to reach) \(\zeta^h=\{x^h_1, x^h_2, \dots, x^h_m\}\), we assign a discounted intent bonus to provide a smooth gradient of rewards along the intent trajectory: \[ R^\mathrm{int}(x, \zeta^h) = \begin{cases} \lambda^{m-i} &\text{if } x=x^h_i \in \zeta^h\\ 0 &\text{otherwise} \end{cases} \] Intuitively, the intent bonus is higher for states that appear later in the human's intent trajectory, which we assume are closer to the goal location. (A code sketch of this bonus follows the list.)
  • Feasibility-Aware Value Estimation: The agent uses the feasibility estimation from the memory module to compute the expected return during Simulation and Backpropagation.

    In Simulation, the agent imagines the effect of a human action. The transition is executed when a random number drawn uniformly between 0 and 1 is smaller than the feasibility belief. Otherwise, the state remains unchanged.

    In Backpropagation, the rollout return depends on the feasibility of the child node from which backpropagation comes. Suppose the child node \(v'\) gets a sample return \(q'_\mathrm{sample}\). Its parent node's return \(q_\mathrm{sample}\) consists of three terms: (1) the step reward \(r\); (2) with probability \(\delta'\), the transition is feasible, and the discounted future return is \(\gamma q'_\mathrm{sample}\); (3) with probability \((1 - \delta')\), the transition is invalid (the action has no effect), and the discounted future return is the discounted value estimate at the current state: \[ q_\mathrm{sample} = r + \gamma\left[\delta' q'_\mathrm{sample} + (1 - \delta') \frac{Q(v)}{N(v)}\right] \] (See the sketch below for these simulation and backpropagation steps.)

* See Section 4.2 and Algorithm 2 in the paper
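
As a concrete illustration of the intent bonus and the feasibility-aware simulation and backpropagation described above, here is a minimal Python sketch. The helper names (intent_bonus, simulate_human_step, backup_sample) and the env_step / feasibility_belief interfaces are illustrative assumptions rather than the paper's implementation; Algorithm 2 remains the authoritative description.

import random

def intent_bonus(x, intent_traj, lam=0.9):
    """Discounted bonus R^int(x, zeta^h): larger for states later in the human's intent trajectory."""
    m = len(intent_traj)
    for i, x_h in enumerate(intent_traj, start=1):
        if x == x_h:
            return lam ** (m - i)
    return 0.0

def simulate_human_step(x, a, feasibility_belief, env_step):
    """Rollout of a human action: it only takes effect with probability b(x, a)."""
    if random.random() < feasibility_belief:
        return env_step(x, a)  # assumed feasible: apply the transition
    return x                   # assumed blocked: the state is unchanged

def backup_sample(r, delta_child, q_child_sample, q_node, n_node, gamma=0.99):
    """Parent's sample return: mix the child's return with the parent's running
    value estimate Q(v)/N(v), weighted by the child's feasibility delta'."""
    value_est = q_node / n_node if n_node > 0 else 0.0
    return r + gamma * (delta_child * q_child_sample + (1.0 - delta_child) * value_est)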

User Study Procedure


We adopt a within-subject design where participants interact with three agent partners in the Gnomes at Night testbed: Agent Alice (Shortest-Path-Heuristic), Agent Bob (Multi-Step-Intent MCTS), and Agent Charlie (Single-Step-Intent MCTS). We measure steps taken, control switches, and goal achievement, and administer a NASA-TLX survey. Sessions are counterbalanced and configurations randomized to control for biases, ensuring reliable comparisons.

User Study Results

The user study recruits 18 participants (average age 26.28; 83% male, 11% female, 6% non-binary).


Results show that the Multi-Step-Intent agent significantly outperforms the Shortest-Path-Heuristic and Single-Step-Intent agents in coordination efficiency and user satisfaction. Participants completed tasks with fewer steps and control switches, and achieved higher success rates with the Multi-Step-Intent agent. Survey results indicate lower cognitive load and higher satisfaction with this agent. The study also highlights the effectiveness of the memory module in refining the agent's understanding of maze layouts.

BibTeX

@inproceedings{chen2025human, 
    title={Human-Agent Coordination in Games under Incomplete Information via Multi-Step Intent},
    author={Chen, Shenghui and Zhao, Ruihan and Chinchali, Sandeep and Topcu, Ufuk},
    booktitle={International Conference on Autonomous Agents and Multiagent Systems (AAMAS)},
    year={2025}
}