IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

IBM Research and UC Berkeley have teamed up on a long-standing problem in IT automation: understanding why agent-based systems break down in real-world use. They examined complex IT tasks such as Kubernetes failure incidents and log and metric queries, which involve long-running interactions and high-stakes decisions. Traditional benchmarks typically report agent success as a single number, which does not explain why an agent failed. To address this, the researchers applied MAST (Multi-Agent System Failure Taxonomy) to the ITBench benchmark suite, enabling a more detailed and actionable analysis of agent performance across different models.

By analyzing 310 SRE traces from ITBench, produced by three models (Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B), the researchers identified the failure modes behind agent breakdowns and offered insights into how they could be fixed. Their findings show that frontier models such as Gemini-3-Flash tend to fail in isolated, predictable ways, while large open-source models such as GPT-OSS-120B often suffer cascading failures that compound and escalate over time.

Key Findings from the Study:

Gemini-3-Flash: Typically encounters isolated bottlenecks like verification failures, resulting in fewer failure modes per trace.

GPT-OSS-120B: Struggles with cascading failures where small errors early in the process trigger compounding issues.

Kimi-K2: Exhibits termination issues, either quitting prematurely or failing to recognize when a task is complete.

These findings underline the importance of structured failure diagnosis in improving IT automation agents. It is not enough to simply know an agent failed; developers need to understand why and how to intervene.
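The "failure modes per trace" comparison in the findings above can be sketched as a simple tally over annotated traces. This is a minimal illustration, not the researchers' tooling: the record layout, the `failure_modes_per_trace` helper, the placeholder model names, and the label strings are all assumptions, and the sample records are invented and do not reproduce any figures from the study.

```python
from collections import defaultdict

# Hypothetical annotated traces. The models, labels, and counts are invented
# for illustration only and are NOT data from the study.
TRACES = [
    {"model": "model-a", "failure_modes": ["verification_failure"]},
    {"model": "model-a", "failure_modes": []},
    {"model": "model-b", "failure_modes": ["step_repetition",
                                           "premature_termination",
                                           "verification_failure"]},
    {"model": "model-b", "failure_modes": ["premature_termination"]},
]


def failure_modes_per_trace(traces):
    """Return the mean number of distinct failure modes per trace, by model."""
    totals = defaultdict(int)   # sum of distinct failure modes per model
    counts = defaultdict(int)   # number of traces per model
    for trace in traces:
        totals[trace["model"]] += len(set(trace["failure_modes"]))
        counts[trace["model"]] += 1
    return {model: totals[model] / counts[model] for model in counts}


if __name__ == "__main__":
    for model, mean in sorted(failure_modes_per_trace(TRACES).items()):
        print(f"{model}: {mean:.2f} failure modes per trace")
```

A low average with one dominant label would correspond to the isolated-bottleneck pattern, while a high average with many co-occurring labels would point toward cascading failures.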

What Undercode Says:

The collaboration between IBM and UC Berkeley shines a much-needed light on the complexities of agent failures, particularly in enterprise IT environments. The research emphasizes the necessity of going beyond simple success rate metrics and delving into the root causes of failures in agent systems.

The application of MAST to ITBench allows for an in-depth exploration of failure modes, and this is crucial for enhancing the reliability of agents in complex workflows. One of the most insightful discoveries is the stark contrast between frontier models, like Gemini-3-Flash, and open-source models, such as GPT-OSS-120B. While the former tends to fail in more predictable and isolated ways, making it easier to debug and improve, the latter faces cascading failures that often spiral out of control. This has important implications for developers deciding which model to use for specific tasks.

Additionally, the study illustrates how a failure-mode taxonomy such as MAST can provide a more granular understanding of agentic behavior. By distinguishing between "fatal" failures (such as loss of conversation history or premature termination) and "non-fatal" failures (such as step repetition), developers can prioritize which issues to address first. Addressing fatal failures first could significantly improve performance, especially for complex IT automation tasks.
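One way to picture the fatal versus non-fatal prioritization is a small triage routine that ranks observed failure labels. This is a sketch under stated assumptions: the severity split follows only the examples named in the text, while the snake_case label strings, the `triage` function, and the rule that unlisted labels count as non-fatal are hypothetical choices made for illustration.

```python
from collections import Counter

# Fatal failure modes, following the examples named in the text.
# The snake_case label strings are hypothetical. Any label not listed
# here is treated as non-fatal (an assumption of this sketch).
FATAL = {"loss_of_conversation_history", "premature_termination"}


def triage(observed):
    """Count failure labels and order them fatal-first, then by frequency."""
    counts = Counter(observed)

    def sort_key(item):
        label, n = item
        # False sorts before True, so fatal labels come first;
        # within each group, more frequent labels come first.
        return (label not in FATAL, -n, label)

    return sorted(counts.items(), key=sort_key)


if __name__ == "__main__":
    # Invented sample of labels, not data from the study.
    sample = [
        "step_repetition", "step_repetition", "step_repetition",
        "premature_termination", "loss_of_conversation_history",
        "premature_termination",
    ]
    for label, n in triage(sample):
        severity = "fatal" if label in FATAL else "non-fatal"
        print(f"{label}: {n} ({severity})")
```

In this ordering a frequent but non-fatal issue such as step repetition still ranks below any fatal one, which mirrors the idea of fixing run-ending failures before cosmetic ones.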

As enterprises increasingly turn to AI agents for mission-critical operations, understanding these failure modes will be key to building systems that are both efficient and reliable. While traditional benchmarks focus on success rates, the introduction of MAST offers a roadmap for diagnosing and addressing the specific causes of agent failure.

Fact Checker Results:

✅ Accurate Benchmarking: The analysis confirms that traditional benchmarks only measure success rates, not the reasons behind failures.
✅ MAST as a Reliable Diagnostic Tool: MAST provides a detailed and systematic approach to failure analysis, based on real-world agent performance data.
✅ Actionable Solutions: Recommendations for addressing failure modes in models like Gemini-3-Flash and GPT-OSS-120B are supported by experimental evidence.

Prediction:

📊 Future of Agent Evaluation: As agentic systems continue to evolve, MAST and similar frameworks will become standard tools for diagnosing and improving IT automation agents. With their ability to reveal hidden failure modes and provide targeted recommendations, they will help developers build more robust and reliable agents for enterprise IT workflows.

References:

Reported By: huggingface.co