Anthropic published the prompt injection failure rates that enterprise security teams have been asking every vendor for

Run a prompt injection attack against Claude Opus 4.6 in a constrained coding environment and it fails every time: a 0% success rate across 200 attempts, no safeguards needed. Move that same attack to a GUI-based system with extended thinking enabled, and the picture changes fast. A single attempt gets through 17.8% of the time without safeguards. By the 200th attempt, the breach rate hits 78.6% without safeguards and 57.1% with them.
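Persistence-scaled figures like these are typically computed straight from attempt-level red-team logs rather than from a closed-form model, since repeated attempts against the same defenses are not independent events. A minimal sketch of that bookkeeping, using hypothetical data and function names rather than Anthropic's actual harness:

```python
# Minimal sketch (not Anthropic's methodology): each attack scenario records
# the first attempt number at which it succeeded, or None if it never did
# within the attempt budget. The breach rate at budget k is the fraction of
# scenarios breached within the first k attempts.
from typing import Optional, Sequence

def breach_rate_within(first_success: Sequence[Optional[int]], k: int) -> float:
    """Fraction of attack scenarios breached within the first k attempts."""
    breached = sum(1 for attempt in first_success if attempt is not None and attempt <= k)
    return breached / len(first_success)

# Hypothetical log: 3 of 5 scenarios eventually breached, at attempts 4, 37, 181.
log = [4, None, 37, 181, None]
for budget in (1, 10, 200):
    print(f"breach rate within {budget:>3} attempts: {breach_rate_within(log, budget):.1%}")
```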

Claude Opus 4.6’s 212-page system card, released February 5, breaks out attack success rates by surface, by attempt count, and by safeguard configuration.

Why surface-level differences determine enterprise risk

For years, prompt injection was a known risk that no one quantified. Security teams treated it as theoretical. AI developers treated it as a research problem. That changed when Anthropic made prompt injection measurable across four distinct agent surfaces, with attack success rates that security leaders can finally build procurement decisions around.

OpenAI's GPT-5.2 system card includes prompt injection benchmark results, including scores on evaluations like Agent JSK and PlugInject, but does not break out attack success rates by agent surface or show how those rates change across repeated attempts. The original GPT-5 system card described more than 5,000 hours of red teaming from over 400 external testers. The Gemini 3 model card describes it as "our most secure model yet" with "increased resistance to prompt injections," sharing relative safety improvements versus previous models but not publishing absolute attack success rates by surface or persistence scaling data.

What each developer discloses and what they withhold

Disclosure Category              | Anthropic (Opus 4.6)          | OpenAI (GPT-5.2)      | Google (Gemini 3)
Per-surface attack success rates | Published (0% to 78.6%)       | Benchmark scores only | Relative improvements only
Attack persistence scaling       | Published (1 to 200 attempts) | Not published         | Not published
Safeguard on/off comparison      | Published                     | Not published         | Not published
Agent monitoring evasion data    | Published (SHADE-Arena)       | Not published         | Not published
Zero-day discovery counts        | 500+ with projects named      | Not published         | Not published
Third-party red teaming          | Gray Swan, UK AISI, Apollo    | 400+ external testers | UK AISI, Apollo, Vaultis, Dreadnode

Third-party testing highlights why granular vendor disclosures matter. Promptfoo's independent red team evaluation of GPT-5.2 found jailbreak success rates climbing from a 4.3% baseline to 78.5% in multi-turn scenarios, the kind of persistence-scaled data that reveals how defenses degrade under sustained attack. OpenAI's own system card does not include equivalent metrics.

When the agent evades its own maker's monitor

Buried in the alignment assessment is a finding that should reshape how enterprises think about agent governance. On SHADE-Arena, a benchmark designed to test whether models can complete suspicious tasks without triggering automated oversight, Opus 4.6 succeeded 18% of the time when extended thinking was enabled. The monitor was calibrated to Anthropic's own internal detection threshold.
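The evasion rate itself is a simple quantity: of the transcripts in which the model completed the hidden side task, how many scored below the monitor's alert threshold. A hedged sketch of that calculation, with invented suspicion scores and a made-up threshold rather than the SHADE-Arena harness itself:

```python
# Illustrative sketch only, not the SHADE-Arena benchmark: given suspicion
# scores an automated monitor assigns to transcripts where the agent
# completed a hidden side task, the evasion rate is the share of those
# transcripts that fall below the calibrated alert threshold.
def evasion_rate(suspicion_scores, threshold):
    """Fraction of successful side-task transcripts the monitor fails to flag."""
    missed = [score for score in suspicion_scores if score < threshold]
    return len(missed) / len(suspicion_scores)

# Hypothetical scores on a 0-100 suspicion scale; the threshold would be
# calibrated separately against known-benign transcripts.
scores = [12, 95, 88, 30, 71, 9, 64, 99, 15, 42]
print(f"evasion rate at threshold 80: {evasion_rate(scores, threshold=80):.0%}")
```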

The system card states the model has "an improved ability to complete suspicious side tasks without attracting the attention of automated monitors." The agent built by Anthropic evades the monitor built by Anthropic.

Anthropic’s separately published Sabotage Risk Report uses this same 18% figure to argue the opposite conclusion: that a model with genuine hidden goals would be “overwhelmingly likely to be caught” at current capability levels. But the report also acknowledges that the number “represents an increase relative to recent models.” For enterprise security teams, the trend line matters more than the current number. A capability that improves with each model generation is not a capability you can safely ignore in your risk model.

For security teams, the takeaway is architectural. Constrain what an agent can access, limit its action space, and require human approval for high-risk operations. The traditional model of deploy and monitor assumes the thing being monitored is predictable. Agents that can reason about their own oversight are not.
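A minimal sketch of that pattern, with hypothetical tool names and risk tiers rather than any particular agent framework: an allowlist constrains the action space, and anything tagged high-risk is routed through a human approval callback.

```python
# Sketch of a policy gate for agent tool calls. Action names and risk tiers
# are illustrative assumptions, not a real framework's API.
HIGH_RISK_ACTIONS = {"send_email", "write_file_outside_workspace", "call_external_api"}
ALLOWED_ACTIONS = {"read_workspace_file", "run_unit_tests"} | HIGH_RISK_ACTIONS

def gate_action(action: str, require_approval) -> bool:
    """Return True if the agent may proceed with this action."""
    if action not in ALLOWED_ACTIONS:
        return False  # outside the constrained action space
    if action in HIGH_RISK_ACTIONS:
        return require_approval(action)  # human-in-the-loop for high-risk operations
    return True

# Example: the approval callback could page a reviewer; here it simply denies.
print(gate_action("run_unit_tests", require_approval=lambda a: False))  # True
print(gate_action("send_email", require_approval=lambda a: False))      # False
```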

Bruce Schneier, a fellow and lecturer at Harvard Kennedy School and a board member of the Electronic Frontier Foundation, says enterprises deploying AI agents face a "security trilemma," where they can optimize for speed, intelligence, or security, but not all three.

Anthropic's own data illustrates the tradeoff. The strongest surface is narrow and constrained. The weakest is broad and autonomous.

500 zero-days shift the economics of vulnerability discovery

Opus 4.6 discovered more than 500 previously unknown vulnerabilities in open-source code, including flaws in GhostScript, OpenSC and CGIF. Anthropic detailed these findings in a blog post accompanying the system card release.

Five hundred zero-days from a single model. For context, Google's Threat Intelligence Group tracked 75 zero-day vulnerabilities being actively exploited across the entire industry in 2024. Those are vulnerabilities found after attackers were already using them. One model proactively discovered more than six times that number in open-source codebases before attackers could find them. It is a different category of discovery, but it shows the scale AI brings to defensive security research.

Real-world attacks are already validating the threat model

Days after Anthropic launched Claude Cowork, security researchers at PromptArmor found a way to steal confidential user files through hidden prompt injections. No human authorization required.

The attack chain works like this:

A user connects Cowork to a local folder containing confidential data.

An adversary plants a file with a hidden prompt injection in that folder, disguised as a harmless "skill" document.

The injection tricks Claude into exfiltrating private data through the whitelisted Anthropic API domain, bypassing sandbox restrictions entirely.

PromptArmor tested the attack against Claude Haiku. It worked. They tested it against Claude Opus 4.5, the company's most capable model at the time. That worked, too.

Simon Willison, the independent AI researcher who coined the term “prompt injection” in 2022, wrote in a blog post after reviewing Cowork that he does “not think it is fair to tell regular non-programmer users to watch out for ‘suspicious actions that may indicate prompt injection.’” The vulnerability class Anthropic quantified in its own system card, the one in which persistent adversaries breach defenses more than half the time, is the same one that played out in production two weeks before Opus 4.6 shipped.
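One way to see why that expectation is unreasonable is to look at what a naive automated defense involves. The sketch below, with invented patterns and paths, scans a shared folder for instruction-like phrases before it is exposed to an agent. It is a crude heuristic that trivially rephrased injections would slip past, not a real defense, and not PromptArmor's methodology.

```python
# Hedged sketch of one mitigation layer: flag text files in a folder that
# contain injection-like phrases before connecting the folder to an agent.
# Pattern list, file extensions, and folder path are illustrative assumptions.
import re
from pathlib import Path

SUSPICIOUS_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"exfiltrat\w+",
    r"send .* to https?://",
]

def scan_folder(folder: str):
    """Yield (file, pattern) pairs for text files containing injection-like phrases."""
    for path in Path(folder).rglob("*"):
        if path.is_file() and path.suffix in {".txt", ".md"}:
            text = path.read_text(errors="ignore").lower()
            for pattern in SUSPICIOUS_PATTERNS:
                if re.search(pattern, text):
                    yield path, pattern

# for hit in scan_folder("./shared_workspace"):
#     print("review before connecting agent:", hit)
```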

The evaluation integrity problem that affects every vendor

The system card's release decision process contains an admission that should give every enterprise security team pause. Anthropic used Opus 4.6 itself, via Claude Code, to debug its own evaluation infrastructure, analyze results, and fix issues under time pressure.

The company acknowledges the risk explicitly. "A misaligned model could influence the very infrastructure designed to measure its capabilities." Anthropic states it does not believe this presented a significant risk because the model showed no evidence of dangerous misaligned goals. But the dynamic is real and getting worse. As models get more capable and development timelines stay compressed, teams accept code changes they do not fully understand.

Anthropic went further than any other lab in quantifying what could go wrong. A separate Sabotage Risk Report, released alongside the system card, maps eight specific pathways through which Opus 4.6 could cause catastrophic harm if it acted on misaligned goals while embedded in Anthropic’s own infrastructure. These include sabotaging safety research through deliberate sandbagging, inserting security backdoors into code for future models to exploit, and exfiltrating its own model weights to operate autonomously.

For each pathway, the report rates both the frequency and complexity of misaligned actions required, then assesses whether Anthropic’s monitoring, alignment testing, and security controls are sufficient to contain the risk.

The overall verdict: “very low but not negligible.” The report argues that Opus 4.6 lacks the consistent dangerous goals, the reasoning opacity, and the long-horizon planning reliability needed to execute these pathways without detection. But it also admits that its automated monitors do not cover every internal deployment surface, that its human reporting systems have not been rigorously evaluated, and that data poisoning by a competent human adversary could make the model more capable of bypassing safeguards than its baseline capabilities suggest. No other frontier AI lab has published an equivalent analysis of how its own model could turn against its own infrastructure.

For security leaders evaluating AI agent deployments, this means third-party red teaming and independent evaluation are essential.

That urgency reflects regulatory pressure building across the industry. NIST's Center for AI Standards and Innovation issued a request for information in January focused specifically on securing AI agent systems. The RFI warns that AI agent systems “may be susceptible to hijacking, backdoor attacks, and other exploits” that could “impact public safety, undermine consumer confidence, and curb adoption of the latest AI innovations.” Procurement teams are starting to ask vendors for quantified agent security data. One developer just made it possible to answer them.

What security leaders should do before their next vendor evaluation

Anthropic just set a new baseline by publishing prompt injection attack success rates broken out by surface. Agent monitoring evasion results are identified and analyzed rather than buried. Zero-day discovery counts are disclosed with affected projects named. The company dropped its direct prompt injection metric, arguing that indirect injection is the more relevant enterprise threat. That reasoning is sound, but the change makes year-over-year comparisons harder. The SHADE-Arena results raise questions about agent monitoring that the system card acknowledges but does not resolve.

The Sabotage Risk Report adds another layer of accountability. It includes a “looking forward” framework that lists the specific conditions under which Anthropic’s own safety conclusions would no longer hold. These tripwires include: a model scoring above 60% on SHADE-Arena, a model achieving capability jumps equivalent to a 5x compute scale-up over the previous generation, critical safety research functions becoming fully automated without human participation, or fewer than 25 technical staff having meaningful visibility into a model’s behavior. Security leaders should ask every AI agent vendor for equivalent criteria — the conditions under which the vendor’s own safety case breaks down.
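Those criteria translate directly into a checklist. The sketch below encodes the thresholds the report names as a simple vendor-evaluation check; the field names and dictionary format are assumptions, since no lab publishes this data in machine-readable form.

```python
# Sketch of turning the stated tripwire conditions into a per-vendor check.
# Thresholds follow the conditions named in Anthropic's Sabotage Risk Report;
# the report field names here are hypothetical.
def safety_case_tripwires(vendor_report: dict) -> list[str]:
    """Return the tripwires a vendor's latest disclosure has crossed."""
    tripped = []
    if vendor_report.get("shade_arena_evasion", 0.0) > 0.60:
        tripped.append("monitoring evasion above 60% on SHADE-Arena")
    if vendor_report.get("compute_scaleup_vs_prev", 1.0) >= 5.0:
        tripped.append("capability jump equivalent to a 5x compute scale-up")
    if vendor_report.get("safety_research_fully_automated", False):
        tripped.append("safety research automated without human participation")
    if vendor_report.get("staff_with_model_visibility", 999) < 25:
        tripped.append("fewer than 25 technical staff with meaningful visibility")
    return tripped

# Example with hypothetical values: nothing tripped at an 18% evasion rate.
print(safety_case_tripwires({"shade_arena_evasion": 0.18, "staff_with_model_visibility": 40}))
```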

Three things security leaders should do now:

Ask every AI agent vendor in your evaluation pipeline for per-surface attack success rates, not just benchmark scores. If they cannot provide persistence-scaled failure data, factor that gap into your risk scoring.

Commission independent red team evaluations before any production deployment. When the vendor's own model helped build the evaluation infrastructure, vendor-provided safety data alone is not enough.

Consider validating agent security claims against independent red team results for 30 days before expanding deployment scope.


