The Illusion of AI-Powered Cybercrime: Are Threat Actors Overestimating LLMs?
As artificial intelligence rapidly evolves, many industries are finding innovative ways to leverage its power — but when it comes to cybersecurity threats like vulnerability discovery and exploit development, AI still has a long way to go. A recent study by Forescout Research’s Vedere Labs offers a sobering perspective. The team tested 50 large language models (LLMs), ranging from commercial and open-source tools to those circulating in underground forums, to see whether they could effectively discover software vulnerabilities or generate working exploits. The results? Unimpressive, to say the least.
LLMs Still Struggle to Hack Like Humans
Forescout’s study revealed significant shortcomings across all models tested. When put through basic vulnerability research (VR) and exploit development (ED) tasks, most LLMs failed outright. Nearly half of the models couldn’t complete the first VR challenge, and over half bombed the second. Exploit generation saw even higher failure rates — 66% failed the first task, and a staggering 93% failed the second.
No model was able to complete all tasks successfully. Even when an LLM managed to get close to generating a working exploit, it required extensive human assistance — interpreting errors, debugging code, and guiding the process manually. Worse still, many models were unstable, delivered inconsistent results, or simply timed out. Creating a usable exploit often took hours and multiple reruns.
This consistent underperformance has kept many cybercriminals skeptical. According to forum analyses included in the study, experienced hackers expressed strong doubts about AI’s current capabilities in real-world hacking. The few who showed enthusiasm were often newcomers lacking advanced skills. Despite AI’s strong performance in coding-related tasks like boilerplate generation or automation, the leap to full-blown hacking remains elusive.
Forescout also compared different types of LLMs. Open-source models, especially those available on platforms like HuggingFace, performed the worst. Customized underground models such as WormGPT and GhostGPT did better but struggled with usability, limited context length, and poor output formatting. Commercial models like ChatGPT and Gemini fared the best but were still restricted by alignment safeguards and could only solve a few of the more advanced ED tasks.
Despite these setbacks, there’s a silver lining. Over the course of the study, which spanned three months, researchers observed rapid improvements in generative AI’s ability to handle both VR and ED. While the tools aren’t fully autonomous exploit generators yet, they’re getting better. Forescout warns that a new era — one they call “vibe hacking” — may be on the horizon. In preparation, cybersecurity teams are urged to double down on foundational strategies like least privilege, network segmentation, and zero trust frameworks.
What Undercode Say:
Lack of Autonomy Still a Major Barrier
The most striking insight from the study is how far LLMs still are from operating independently in a cyberattack scenario. Contrary to what many sensational headlines suggest, AI is not currently capable of crafting sophisticated exploits without detailed human direction. This aligns with the practical reality that true hacking — especially zero-day exploit development — requires contextual intelligence, creative problem-solving, and iterative testing that even top-tier LLMs haven't yet mastered.
Underground Hype Fails to Match Reality
While forums on the dark web may buzz with excitement about AI-assisted hacking, seasoned cybercriminals know better. Their skepticism is rooted in direct experience: AI models frequently fail to deliver when pushed beyond basic scripting tasks. Tools like WormGPT and GhostGPT may sound ominous, but their actual performance is riddled with flaws — from instability to poor UX. This contradiction between hype and capability is a clear sign that threat actors aren’t rushing to replace their toolkits with AI anytime soon.
Commercial Models Lead, But With Limits
Commercial models like ChatGPT and Gemini displayed superior performance, yet they are designed with built-in safety restrictions that prevent malicious use. While these safeguards can sometimes be bypassed, they limit the model’s usefulness for advanced exploit development. This demonstrates a fundamental tension: the more advanced and aligned an AI model becomes, the less likely it is to be useful for malicious purposes. This built-in ethical constraint may remain a major barrier for criminal use.
Improvement Curve Is Real, But Slow
One of the most important takeaways is the incremental — but steady — improvement seen over the study period. As LLMs grow in complexity and contextual understanding, their ability to aid in cyber-offensive tasks will likely improve. However, the learning curve appears slow and nonlinear. New iterations may deliver small gains, but true automation of complex ED tasks is still out of reach.
Limited Context Length Bottlenecks Performance
A recurring issue for both open-source and underground models is their inability to manage large or complex codebases due to restricted context windows. In real-world vulnerability research, being able to analyze extensive code is critical. This limitation makes these models unfit for in-depth reverse engineering or multi-step exploitation strategies.
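To make the context-window bottleneck concrete, here is a minimal Python sketch of the lossy chunking a researcher would have to do before a small-context model could "see" a codebase at all. The token budget, the characters-per-token heuristic, and the target path are illustrative assumptions, not figures from the Forescout study. Anything that lands in a different chunk — a caller, a sanitizer, a bounds check — is simply invisible to the model, which is why multi-step vulnerability reasoning tends to break down.

```python
# Illustrative sketch: splitting a source tree into context-sized chunks.
# CHUNK_TOKENS and CHARS_PER_TOKEN are assumed values, not study parameters.
from pathlib import Path

CHUNK_TOKENS = 4_000      # assumed context budget left for code after the prompt
CHARS_PER_TOKEN = 4       # rough heuristic for source code


def chunk_codebase(root: str, exts=(".c", ".py")) -> list[str]:
    """Split all matching source files into chunks that fit the assumed context window."""
    budget = CHUNK_TOKENS * CHARS_PER_TOKEN
    chunks, current = [], ""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        text = f"// FILE: {path}\n" + path.read_text(errors="ignore")
        # Slice each file into pieces no larger than the budget.
        for i in range(0, len(text), budget):
            piece = text[i:i + budget]
            if len(current) + len(piece) > budget:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    parts = chunk_codebase("./target_project")  # hypothetical project path
    print(f"{len(parts)} chunks needed; cross-chunk data flow is invisible to the model")
```

Even with careful chunking, the model only ever reasons over one slice at a time, so any analysis that spans files or long call chains has to be stitched together by a human.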
Usability Still Undermines Potential
Many underground models are simply not user-friendly. From poor formatting to inconsistent behavior, these tools add friction rather than removing it. Threat actors prefer reliability and speed — and AI, in its current state, often offers neither. The hours required to tweak and troubleshoot an exploit generation task make these tools more of a liability than an asset in fast-paced attack scenarios.
Future Potential Exists, But Needs Watchful Eyes
Forescout’s mention of a coming age of “vibe hacking” highlights a trend that defenders can’t afford to ignore. Even if LLMs aren’t yet capable of creating zero-days, they may soon be used to lower the skill barrier for new attackers. Automating simpler phases of an attack chain, such as reconnaissance or exploit script modification, could become common — and dangerously effective.
Defensive Postures Remain Crucial
In the absence of AI-generated super exploits, defenders can take some comfort — but not for long. Standard practices like enforcing least privilege, implementing zero trust, and segmenting networks remain the best bulwarks against intrusions. If LLMs ever do cross the threshold into reliable exploit generation, these basics will be the first line of defense.
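The basics recommended above boil down to a deny-by-default decision at every request. The snippet below is a minimal, illustrative Python sketch of that pattern, combining a least-privilege allow-list, a device-posture check, and a simple segmentation rule; the roles, resources, and segment names are hypothetical examples, not drawn from the article.

```python
# Minimal sketch of a zero-trust style access check: every request is evaluated
# against identity, device posture, and an explicit least-privilege allow-list.
# Roles, resources, and segment names are hypothetical.
from dataclasses import dataclass

# Explicit allow-list: anything not listed here is denied (least privilege).
ALLOWED = {
    ("analyst", "read", "logs"),
    ("admin", "read", "logs"),
    ("admin", "write", "firewall-config"),
}


@dataclass
class Request:
    role: str
    action: str
    resource: str
    device_compliant: bool   # e.g. patched, disk-encrypted, EDR running
    network_segment: str     # segment the request originates from


def authorize(req: Request) -> bool:
    """Deny by default; grant only when identity, posture, and segment all pass."""
    if not req.device_compliant:
        return False
    # Segmentation: management resources are reachable only from the admin segment.
    if req.resource == "firewall-config" and req.network_segment != "mgmt":
        return False
    return (req.role, req.action, req.resource) in ALLOWED


if __name__ == "__main__":
    print(authorize(Request("analyst", "read", "logs", True, "corp")))              # True
    print(authorize(Request("analyst", "write", "firewall-config", True, "corp")))  # False
```

The point of the deny-by-default structure is that an AI-assisted attacker who compromises one identity or one segment still hits an explicit wall at the next resource, regardless of how the initial foothold was generated.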
🔍 Fact Checker Results:
✅ LLMs currently fail most VR and ED tasks tested in controlled studies
✅ Open-source and underground models are significantly less effective than commercial models
✅ Cybercriminal communities remain largely skeptical of LLMs’ real-world hacking potential
📊 Prediction:
Generative AI will not revolutionize cyber exploitation in 2025, but it will start reshaping the lower tiers of the attack chain. Expect LLMs to become tools for reconnaissance, phishing customization, and automated scripting — not yet for high-end vulnerability exploitation. Cyber defenders should prepare now for the next wave of AI-assisted, entry-level threats. 🚨🧠
References:
Reported By: www.infosecurity-magazine.com