COMMA: A Communicative Multimodal Multi-Agent Benchmark

1University of Wisconsin-Madison, 2Nanjing University

COMMA is a multimodal benchmark designed to assess the collaborative abilities of multimodal agents. Our benchmark is inspired by the cooperative gameplay of Keep Talking and Nobody Explodes. In this game, two players work together to defuse a bomb under time pressure. One player, the defuser, can see the bomb but lacks the instructions to disarm it. The other player, the expert, has access to the bomb's manual but cannot see the bomb itself. The two must rely on effective communication to exchange information, navigate challenges, and defuse the bomb.

Example puzzles presented to the solver in our benchmark. Notably, each puzzle is multimodal and cannot be solved without communicating with the expert.

Abstract

We introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Testing both agent-agent and agent-human collaboration with open-source and closed-source models reveals surprising weaknesses in state-of-the-art models, including proprietary models like GPT-4o. These models struggle to outperform even a simple random agent baseline in agent-agent collaboration and only surpass the random baseline when a human is involved.

Agent Interaction

Agent Setup

Overview of the interaction between the Solver and Expert agents in our benchmark. Both agents operate with structured input corresponding to working and episodic memory. The Solver receives an image of the puzzle state (working memory) and makes decisions based on the available actions described in the task prompt. The Expert, guided by instruction manuals (working memory), provides advice based on the Solver's descriptions, such as indicating which buttons to press. The Solver can choose to execute actions by interacting with the environment or communicate with the Expert for further guidance. Their interaction is documented through a dialogue, showcasing the cooperation required to complete the task. Both agents engage in self-reflection by referencing the conversation history, which is continuously updated and incorporated into their input as episodic memory.
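To make the loop concrete, here is a minimal Python sketch of the Solver-Expert interaction described above. All names (run_episode, env.step, respond, the "ACTION:" prefix, etc.) are hypothetical placeholders rather than the benchmark's actual API; it only illustrates how the two memories and the dialogue fit together.

```python
# Minimal sketch of the Solver-Expert loop, assuming hypothetical interfaces.
def run_episode(env, solver_model, expert_model, max_turns=10):
    """Alternate between the Solver and the Expert until the puzzle is
    solved, fails, or the turn budget runs out."""
    dialogue = []                # episodic memory, shared via prompts
    image = env.render()         # Solver's working memory: puzzle screenshot
    manual = env.manual_text()   # Expert's working memory: instruction manual

    for _ in range(max_turns):
        # The Solver sees the current puzzle image plus the dialogue so far,
        # and either acts on the environment or asks the Expert a question.
        solver_msg = solver_model.respond(image=image, history=dialogue)
        dialogue.append(("solver", solver_msg))

        if solver_msg.startswith("ACTION:"):
            image, done, success = env.step(solver_msg)  # execute the action
            if done:
                return success
        else:
            # The Expert never sees the image; it answers from the manual
            # and the Solver's textual description alone.
            expert_msg = expert_model.respond(manual=manual, history=dialogue)
            dialogue.append(("expert", expert_msg))

    return False  # ran out of turns
```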

Results

Graph of Conversation Length and Success Rate.

We find that even the most powerful closed-source LLMs struggle to communicate without human involvement. Here we plot the success rate as a function of conversation length for different settings. Open-source models such as LLaVA and InternVL actually underperform the random baseline for most conversation lengths, suggesting a limited ability to utilize episodic memory; their performance plateaus after about 4-5 conversation turns.

Pie chart of common failure modes.

We further analyze the underlying reasons for failure based on the conversations from all 1000 puzzles. We examine the conversations of the best-performing closed-source and open-source models, GPT-4o (left) and LLaMA 3.2 (right), and define the following failure modes:

  • Repetition Loop: The solver repeats its past incorrect actions, even when it is in a situation it has encountered before.
  • Misinterpretation: The solver misunderstands the current puzzle state/signal due to poor visual grounding or lack of situational awareness, resulting in failure.
  • Roleplay: The expert thinks it is the solver or vice versa, despite the prompt assigning it a role.
  • Miscommunication: The solver agent occasionally does not listen to the expert's instructions, attempting to solve the puzzle on its own as if it were the expert.
For more detailed examples and results, feel free to check out our paper!

Leaderboard 🏆

Table of performance.

Using COMMA, we benchmark the collaborative capabilities of closed-source and open-source multimodal LLMs, summarized in the table above.

Related Work

There is a lot of excellent work on multimodal agents that inspired ours.

Visual Web Arena introduces the idea of using multimodal agents to perform complex multi-step web tasks in a controlled environment. We are inspired by their environment and extend their framework to involve multi-agent collaboration.

Alane Suhr has published many related works on using LLMs as decision-making agents, such as Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning and Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker. Her work strongly inspired the ideas in our benchmark.

Some works illustrate the potential of agents to collaborate on difficult tasks such as Software Development and Function Generation.

For a more comprehensive list, feel free to check out this survey paper.

BibTeX

@article{ossowski2024comma,
      title={COMMA: A Communicative Multimodal Multi-Agent Benchmark},
      author={Ossowski, Timothy and Chen, Jixuan and Maqbool, Danyal and Cai, Zefan and Bradshaw, Tyler and Hu, Junjie},
      journal={arXiv preprint arXiv:2410.07553},
      year={2024}
    }