Pinpointing Failure Sources in LLM Multi-Agent Systems: A New Benchmark and Automated Attribution Methods
Multi-agent systems powered by large language models (LLMs) are increasingly used to tackle complex tasks by dividing work among specialized agents. But when these systems fail, developers often face a daunting puzzle: which agent made the mistake, and at what point did things go wrong? Traditionally, diagnosing failures means manually combing through lengthy interaction logs—a process both slow and error-prone. A team of researchers from Penn State University, Duke University, and collaborators at Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University has introduced a new research direction called automated failure attribution. They built the first dedicated benchmark dataset, Who&When, and developed several automated methods to pinpoint failure sources. Their work, accepted as a Spotlight presentation at ICML 2025, is fully open-source, offering developers a much-needed tool to speed up debugging and improve system reliability.
What is automated failure attribution and why is it important?
Automated failure attribution refers to the process of automatically identifying which agent in a multi-agent LLM system caused a task failure and at what step the error occurred. In complex systems where multiple agents collaborate autonomously, a single misstep—like an incorrect assumption, a miscommunication, or a flawed output—can cascade into a full task failure. Without automated attribution, developers must manually sift through logs, a time-consuming and skill-intensive task. This bottleneck slows down iteration and optimization. By automating the diagnosis, teams can quickly locate root causes, test fixes, and improve system robustness. The research formalizes this problem and provides the first benchmark, enabling systematic evaluation of attribution methods.
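Concretely, the attribution task maps a failed run's interaction log to a "who" and a "when" answer. The sketch below is only an illustrative framing of that interface, not code from the paper; the type and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    index: int    # position of this message in the interaction log
    agent: str    # name of the agent that produced it
    content: str  # the agent's message, tool call, or output


@dataclass
class Attribution:
    failure_agent: str  # "who": the agent judged responsible for the failure
    failure_step: int   # "when": the index of the decisive error step


def attribute_failure(task: str, log: List[Step]) -> Attribution:
    """Given the task description and a failed run's log, return who and when."""
    raise NotImplementedError  # filled in by a concrete attribution method
```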

What challenges do developers face when debugging LLM multi-agent systems?
Developers face two major challenges: manual log archaeology and reliance on deep expertise. When a multi-agent system fails, the interaction logs can be vast and intertwined. Finding the exact moment an agent deviated from its intended behavior is like finding a needle in a haystack. Moreover, understanding the context of each agent's decisions requires intimate knowledge of the system's design and the agents' roles. This process is not only labor-intensive but also prone to human error. The autonomous and non-deterministic nature of LLMs adds another layer of complexity: similar inputs can produce different outputs, making failures hard to reproduce. These challenges motivated the researchers to develop automated methods that can substantially cut the time developers spend on debugging.
How did the researchers create the Who&When benchmark dataset?
The Who&When dataset was built from failure logs of real LLM multi-agent systems rather than from synthetic transcripts. The researchers collected logs from a broad range of systems, including both algorithmically generated agent teams and hand-crafted pipelines, working on tasks that the systems ultimately failed to complete. Expert human annotators then reviewed each failed run and labeled the ground truth: which agent was responsible for the failure (who), the decisive step at which the error occurred (when), and a natural-language explanation of the mistake (why). This fine-grained annotation makes it possible to evaluate attribution methods under realistic conditions, including settings where the correct final answer to the underlying task is and is not known. The benchmark spans varied tasks and failure types, and the dataset is publicly available on Hugging Face.
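For instance, the benchmark can be pulled directly with the Hugging Face datasets library. The snippet below is a minimal sketch; the configuration and column names are assumptions, so check the dataset card for the authoritative schema.

```python
from datasets import load_dataset

# Load the Who&When benchmark from the Hugging Face Hub.
# A configuration/subset name may be required; see the dataset card.
ds = load_dataset("Kevin355/Who_and_When")

split = next(iter(ds.values()))  # take whichever split is available
example = split[0]
print(example.keys())            # inspect the annotated fields

# Expected (assumed) fields include the interaction history plus the labels:
# the failure-responsible agent and the step of the decisive error.
```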
What methods did the researchers develop for automated failure attribution?
The team developed and evaluated three LLM-based attribution strategies. The all-at-once method gives a judge LLM the task description and the complete failure log in a single pass and asks it to name the responsible agent and the decisive error step. The step-by-step method walks the judge through the log incrementally, asking at each step whether an error has occurred and stopping at the first one it finds. The binary search method repeatedly splits the log in half and asks the judge which half contains the decisive mistake, narrowing in on the error step with a logarithmic number of judgments. The researchers also explored combinations of these strategies, which can improve accuracy at the cost of extra computation. None of the methods requires retraining: each repurposes an off-the-shelf LLM as the failure judge, as the sketch below illustrates.
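As a rough illustration, the all-at-once strategy can be implemented with a single judge call over the whole log. This is a hedged reconstruction, not the authors' exact prompt or code; it assumes an OpenAI-compatible chat client and an illustrative log format.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def attribute_all_at_once(task: str, log: list[dict]) -> str:
    """Ask a judge LLM, in one pass, who caused the failure and at which step."""
    transcript = "\n".join(
        f"[step {i}] {step['agent']}: {step['content']}" for i, step in enumerate(log)
    )
    prompt = (
        "The following multi-agent conversation failed to solve the task.\n"
        f"Task: {task}\n\nConversation:\n{transcript}\n\n"
        "Identify the agent most responsible for the failure and the step at which "
        "the decisive error occurred. Answer in the form: agent=<name>, step=<index>."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model; any capable LLM could be substituted
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The step-by-step and binary search variants reuse the same judge but change how much of the log it sees per call, trading context length against the number of queries.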
What were the key findings of the study?
The study revealed that automated failure attribution is both feasible and hard. No single method dominated: the all-at-once strategy tended to be better at identifying the responsible agent, the step-by-step strategy was better at locating the exact error step, and combining strategies improved results at higher computational cost. Even the strongest configurations identified the responsible agent in only about half of the cases and the exact error step far less often, leaving substantial room for improvement. One reason is that failures often propagate non-locally: a mistake by one agent may only surface many steps later, making attribution difficult. The benchmark also showed that failure attribution is challenging even for human experts, underscoring the complexity of the task. The team's work sets a baseline for future research, showing that while automated attribution is far from solved, it can already reduce debugging effort. The paper reports concrete performance metrics on the benchmark, encouraging further improvement; a simple way to compute such metrics is sketched below.
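For reference, benchmark scoring can be as simple as exact-match accuracy on the agent and on the step. The helper below is a minimal sketch with assumed field names, not the paper's evaluation code.

```python
def score(predictions, references):
    """Exact-match accuracy for 'who' (agent) and 'when' (step) predictions."""
    n = len(references)
    agent_hits = sum(p["agent"] == r["agent"] for p, r in zip(predictions, references))
    step_hits = sum(p["step"] == r["step"] for p, r in zip(predictions, references))
    return {"agent_accuracy": agent_hits / n, "step_accuracy": step_hits / n}
```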
How does this research impact the future of multi-agent systems?
By introducing the problem of automated failure attribution and providing an open benchmark, this research enables a new line of work focused on improving the reliability of LLM multi-agent systems. Developers can now systematically evaluate and improve debugging tools. The methods developed can be integrated into existing development pipelines, allowing real-time detection of failure sources during testing or even during live operation. This could accelerate the deployment of multi-agent systems in high-stakes domains like autonomous customer service, code generation, and complex data analysis. The open-source release of code and data encourages the community to build upon these findings, ultimately making multi-agent collaboration more trustworthy and efficient.
Where can developers access the code and dataset?
The research team has made all resources publicly available. The paper is accessible on arXiv at https://arxiv.org/pdf/2505.00212. The code repository is on GitHub: https://github.com/mingyin1/Agents_Failure_Attribution. The Who&When dataset is hosted on Hugging Face: https://huggingface.co/datasets/Kevin355/Who_and_When. Developers can download the dataset to test attribution methods or use the provided code to replicate experiments and build on them.