OpenAI's Deep Research Is Smarter, Faster—But Still Fails Half the Time

Introduction: AI Agents Are Learning to Surf Smarter, But Not Yet Better

OpenAI’s latest research into autonomous AI agents capable of web searching is making waves with impressive, but far from flawless, results. Dubbed Deep Research, the new system demonstrates tireless, parallelized searching capabilities far beyond what a human researcher can maintain. Yet despite its relentless stamina and significantly better performance than even OpenAI’s most advanced models like GPT-4o and GPT-4.5, Deep Research still stumbles nearly 50% of the time.

This speaks volumes about the evolving—but still raw—state of agentic AI, which seeks to push large language models beyond passive text prediction into active information gathering. The big question remains: Can AI truly replace skilled human researchers? This benchmark study reveals that we’re getting closer, but we’re not there yet.

Summary of the Original

OpenAI recently published findings on its experimental system called Deep Research, designed to browse the internet and answer complex, fact-based questions better than humans or previous models. Unlike earlier models, Deep Research isn’t limited to just memory or prediction—it actively uses the web, sifting through large swaths of data to locate obscure facts.

The evaluation benchmark, named BrowseComp, was built to test AI agents’ ability to answer difficult, multi-hop web-based questions that go far beyond standard fact retrieval. Questions included layers of constraints—like author credentials, time of publication, and content themes—making them nearly impossible for standard models like GPT-4 or GPT-4o, even with browsing, to answer. Most scored near zero.

Human performance wasn’t much better. In fact, even participants who were already familiar with the data struggled. Human testers gave up on around 70% of the questions after two hours and produced an answer for only about 30%; roughly 14% of those answers turned out to be incorrect. The authors hypothesized that expert researchers might fare better, but even they would need far more time and effort.

Deep Research, in contrast, achieved 51.5% accuracy. It outperformed all previous models, especially in areas where multiple steps were required or the facts were deeply entangled. The key advantage: Deep Research doesn’t get tired, can handle parallel tasks, and shows remarkable persistence.

However, its biggest flaw was overconfidence in wrong answers, a known issue called calibration error. Even with browsing tools, the model often failed to convey uncertainty properly. To mitigate this, the researchers tried sampling 64 answers per question and then selecting a final answer from that pool, for example by voting or by taking the answer the model was most confident in. This improved accuracy significantly, suggesting that the model often does “know” the correct answer, but fails to express certainty reliably in any single attempt.
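To make the "generate many answers, then pick one" idea concrete, here is a minimal sketch of three common ways to aggregate sampled answers. It is an illustration under stated assumptions, not OpenAI's implementation: the function names, the (answer, confidence) tuple format, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict

# Each sample is a hypothetical (answer, self_reported_confidence) pair.

def majority_vote(samples):
    """Return the answer that appears most often among the samples."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]

def weighted_vote(samples):
    """Return the answer with the largest summed confidence."""
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)

def best_of_n(samples):
    """Return the single answer with the highest confidence."""
    return max(samples, key=lambda sample: sample[1])[0]

# Toy data (made up): three moderately confident samples agree,
# one very confident sample disagrees.
samples = [("Paris", 0.9), ("Lyon", 0.95), ("Paris", 0.8), ("Paris", 0.7)]

print(majority_vote(samples))  # Paris
print(weighted_vote(samples))  # Paris
print(best_of_n(samples))      # Lyon
```

On this toy input the strategies disagree, which shows why the choice of aggregation matters when confidence scores are poorly calibrated. Which strategy works best on a real benchmark is an empirical question that this sketch does not answer.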

The study also found that performance improved as more compute was spent on each question at test time. Still, the test itself was limited: it only included easily verifiable answers and excluded ambiguous or long-form responses.

In essence, Deep Research shows how powerful AI agents are becoming at searching the web and answering hard questions. But it also reminds us that even with web access, AI is not yet a reliable replacement for nuanced, human-driven investigation.

What Undercode Say:

The performance of Deep Research signals a pivotal transformation in AI’s role within the information ecosystem. Unlike passive LLMs that rely solely on training data, Deep Research brings active intelligence into play—an agent that not only interprets language but interacts with the web in real time.

This evolution from static to dynamic AI is monumental. Traditional models hit performance ceilings because they cannot update themselves or validate claims externally. But Deep Research breaks that mold—it actively seeks, verifies, and selects the best information through real-time browsing. It’s not just guessing better; it’s learning to investigate.

That said, this isn’t a free pass to trust AI agents blindly. A 51.5% success rate—while a huge leap forward—is still a failure nearly half the time. When confidence calibration breaks, users are fed incorrect answers with unjustified certainty, which is potentially dangerous in high-stakes contexts like medical, legal, or financial queries. The fact that more answers improve accuracy suggests that Deep Research’s real strength lies in self-evaluation and internal competition, not single-pass inference.
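Calibration can be measured rather than just described. One widely used metric is expected calibration error (ECE), which groups answers by stated confidence and compares each group's average confidence with its actual accuracy. The sketch below is a generic version of that metric, assuming confidences are given as numbers between 0 and 1; the function name and bin count are illustrative choices, and the source does not specify how calibration was computed in the study.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per answer.
    correct: booleans, True where the answer was right.
    """
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each answer to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Toy example (made up): the model claims 90% confidence on four answers
# but only two are right, so the gap is about 0.4.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False]
)
```

A perfectly calibrated system scores 0, so a model that says "90% sure" and is right only half the time shows up as a large, easily tracked number.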

Another takeaway is scalability. Performance improves significantly with more compute and answer variation, revealing a future where AI agents may run multiple inference chains, cross-reference them, and synthesize outputs before presenting an answer. This is a paradigm shift from a “one question, one answer” model to a multi-threaded reasoning architecture.

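But the limitations of the benchmark matter too. BrowseComp only covers questions with short, easily verifiable answers and leaves out ambiguous or long-form tasks, so a strong score there does not by itself establish reliability on real-world research.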

In short: Deep Research is an exciting leap forward in agentic AI, especially for journalists, researchers, and analysts. But it’s not yet a silver bullet—and trusting it blindly could lead to real-world misinformation if not used with care.

🔍 Fact Checker Results:

✅ Deep Research truly outperformed humans and previous models on complex web questions.
✅ Overconfidence in wrong answers remains a documented problem (calibration error).
❌ BrowseComp is not a holistic benchmark—it doesn’t reflect all real-world tasks or ambiguity.

📊 Prediction:

By mid-2026, Deep Research or its successor will likely surpass 75% accuracy on benchmarks like BrowseComp—especially if OpenAI integrates multi-agent reasoning, answer cross-validation, and context-sensitive calibration techniques. Expect to see it embedded into search engines, research tools, and journalistic platforms, redefining how truth is sourced and verified in real time.