Example puzzles presented to the solver in our benchmark. Notably, each puzzle is multimodal and cannot be solved without communicating with the expert.
We introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Testing both agent-agent and agent-human collaborations with open-source and closed-source models, we find surprising weaknesses in state-of-the-art models, including proprietary models like GPT-4o. These models struggle to outperform even a simple random-agent baseline in agent-agent collaboration and only surpass the random baseline when a human is involved.
Overview of the interaction between the Solver and Expert agents in our benchmark. Both agents operate with structured input corresponding to working and episodic memory. The Solver receives an image of the puzzle state (working memory) and makes decisions based on the available actions described in the task prompt. The Expert, guided by instruction manuals (working memory), provides advice based on the Solver's descriptions, such as indicating which buttons to press. The Solver can choose to execute actions by interacting with the environment or communicate with the Expert for further guidance. Their interaction is documented through a dialogue, showcasing the cooperation required to complete the task. Both agents engage in self-reflection by referencing the conversation history, which is continuously updated and incorporated into their input as episodic memory.
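To make this loop concrete, below is a minimal sketch of how a Solver-Expert episode might be orchestrated. All names here (PuzzleEnv, Dialogue, query_solver, query_expert, run_episode) are illustrative placeholders rather than the benchmark's actual API; in the real setup, the Solver and Expert are multimodal LLM calls conditioned on the puzzle image and the instruction manual, respectively, with the shared dialogue serving as episodic memory.

```python
# Minimal sketch of the Solver-Expert interaction loop, assuming hypothetical
# names. This is NOT the benchmark's actual API, only an illustration of the
# protocol described above.
from dataclasses import dataclass, field


@dataclass
class PuzzleEnv:
    """Toy stand-in for a COMMA puzzle environment."""
    solved: bool = False

    def render(self) -> str:
        # In the real benchmark this would return an image of the puzzle state.
        return "<puzzle state image>"

    def step(self, action: str) -> None:
        # Executing the correct action updates the puzzle state.
        if action == "press red button":
            self.solved = True


@dataclass
class Dialogue:
    """Episodic memory: the running conversation shared by both agents."""
    turns: list[str] = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.turns.append(f"{speaker}: {text}")


def query_solver(image: str, history: Dialogue) -> tuple[str, str]:
    """Placeholder for the Solver LLM: sees the puzzle image (working memory)
    and the dialogue (episodic memory); returns (mode, content)."""
    if not history.turns:
        return "ask", "I see a red and a blue button. Which one should I press?"
    return "act", "press red button"


def query_expert(manual: str, history: Dialogue) -> str:
    """Placeholder for the Expert LLM: sees only the manual (working memory)
    and the dialogue, never the puzzle image."""
    return "According to the manual, press the red button."


def run_episode(env: PuzzleEnv, manual: str, max_turns: int = 10) -> bool:
    dialogue = Dialogue()
    for _ in range(max_turns):
        mode, content = query_solver(env.render(), dialogue)
        if mode == "ask":
            # Solver chooses to communicate; the exchange is appended to
            # the dialogue and fed back to both agents next turn.
            dialogue.add("Solver", content)
            dialogue.add("Expert", query_expert(manual, dialogue))
        else:
            # Solver chooses to act on the environment directly.
            env.step(content)
            if env.solved:
                return True
    return False


if __name__ == "__main__":
    print(run_episode(PuzzleEnv(), manual="Press the red button to defuse."))
```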
We find that even the most powerful closed-source LLMs struggle to communicate without human involvement. Here we plot the success rate as a function of conversation length for different settings. Open-source models such as LLaVA and InternVL actually underperform the random baseline for most conversation lengths and plateau after about 4-5 conversation turns, suggesting a limited ability to utilize episodic memory.
We further analyze the underlying reasons for failure based on the conversations from all 1000 puzzles. Looking at the conversations of the best-performing closed-source and open-source models, GPT-4o (left) and LLaMA 3.2 (right), we define the following failure modes:
Using COMMA, we benchmark the collaborative capabilities of closed-source and open-source multimodal LLMs, summarized in the table above.
There is a lot of excellent work related to multimodal agents that inspired ours.
Visual Web Arena introduces the idea of using multimodal agents to perform complex multi-step web tasks in a controlled environment. We are inspired by their environment and extend their framework to support multi-agent collaboration.
Alane Suhr has published many related works on using LLMs as decision-making agents, such as Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning and Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker. Her works strongly inspired the ideas in our benchmark.
Some works illustrate the potential of agents collaborating on difficult tasks such as Software Development and Function Generation.
For a more comprehensive list, feel free to check out this survey paper.
@article{ossowski2024comma,
title={COMMA: A Communicative Multimodal Multi-Agent Benchmark},
author={Ossowski, Timothy and Chen, Jixuan and Maqbool, Danyal and Cai, Zefan and Bradshaw, Tyler and Hu, Junjie},
journal={arXiv preprint arXiv:2410.07553},
year={2024}
}