Executive summary
Anthropic released Claude Opus 4.7 this week, and within hours, our research team had it running through the Agent Security League benchmark. The results are the most interesting we've seen since we launched the leaderboard: for the first time, a model has pushed security scores above 20%, a threshold no previous agent+model combination had reached.
Two combinations were tested, and both set records.
The numbers
For context, here's where the previous leaders stood:
The Cursor + Opus 4.7 combination is the first to cross 90% on functional correctness and the first to cross 20% on security. Claude Code + Opus 4.7 also clears the 20% security bar, making Opus 4.7 the first model to break that threshold regardless of which agent framework it's paired with.
In addition to setting new high scores, the Cursor + Opus 4.7 combination solved four previously unsolved functional tests and four previously unsolved security tests:

The fact that four new tests were passed in each category is coincidental; the solved functional and security problems relate to different CVEs across different projects:
New functional tests passed:
New security tests passed:
Why this matters
We launched the Agent Security League earlier this week with a simple finding: agents are getting much better at writing code that works, but they're not getting better at writing safe code. Functional scores climbed from the original SusVibes benchmark's 61% to 84.4% with Cursor + Opus 4.6. Security, meanwhile, barely moved — peaking at 17.3% with Codex + GPT-5.4.
Inconveniently for our digital-ink-still-drying narrative, but conveniently for security practitioners and developers (which is what matters, I suppose), Opus 4.7 breaks that pattern. Not only does it push the functional frontier to 91.1%, but it also delivers the largest single-model jump in security scores we've recorded. Cursor + Opus 4.7 reaches 22.9% SecPass, a 5.6 percentage-point improvement over the previous best — and a nearly 3x improvement over the same agent paired with Opus 4.6 (which scored 7.8%). Claude Code shows the same trajectory: security rises from 8.4% with Opus 4.6 to 20.1% with Opus 4.7.
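For readers who want the arithmetic spelled out, here's a minimal sketch in Python using only the scores quoted above; the variable names are just labels for this post, not benchmark identifiers:

```python
# SecPass scores quoted in this post, in percent.
previous_best = 17.3     # Codex + GPT-5.4, previous security leader
cursor_opus_46 = 7.8     # Cursor + Opus 4.6
cursor_opus_47 = 22.9    # Cursor + Opus 4.7, new leader

# Absolute improvement over the previous best, in percentage points.
pp_gain = cursor_opus_47 - previous_best         # 5.6 points

# Relative improvement over the same agent with the previous model.
relative_gain = cursor_opus_47 / cursor_opus_46  # ~2.9x, i.e. "nearly 3x"

print(f"+{pp_gain:.1f} pp vs previous best; {relative_gain:.1f}x vs Opus 4.6")
```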
This is the first time we've seen meaningful improvement on both axes simultaneously. Until now, the data suggested that optimizing for functional correctness did not transfer to security reasoning. Opus 4.7 is the first model to suggest that this is starting to change.
The full leaderboard
The Agent Security League now tracks 15 agent+model combinations. Here's the updated table, sorted by security score:
The interactive leaderboard is available at endorlabs.com/research/ai-code-security-benchmark.
The gap persists, but it's narrowing
Before anyone declares victory, some perspective. At 22.9% SecPass, Cursor + Opus 4.7 still produces vulnerable code in roughly 77% of the tasks where it generates a functionally correct solution. Its functional-security gap is 68 percentage points, down from roughly 77 points for the previous functional leader (Cursor + Opus 4.6) but still well above the 45-point median across the previous 13 combinations, and still enormous.
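To make the gap arithmetic explicit, here's a small sketch in plain Python. It assumes, per how we read "roughly 77%", that SecPass counts the share of functionally correct solutions that are also secure; the whitepaper defines the metric precisely, so treat this as an approximation:

```python
functional = 91.1  # FuncPass for Cursor + Opus 4.7, in percent
secpass = 22.9     # SecPass for Cursor + Opus 4.7, in percent

# The functional-security gap reported above is the raw difference between the scores.
gap = functional - secpass        # ~68 percentage points

# If SecPass is the share of functionally correct solutions that are also secure
# (an assumption; see the whitepaper), the vulnerable share is its complement.
vulnerable_share = 100 - secpass  # ~77%

print(f"gap = {gap:.1f} pp, vulnerable share = {vulnerable_share:.1f}%")
```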
The pattern remains what we described in our whitepaper: models have been trained on abundant functional correctness signals (test suites pass or fail, CI goes green or red) but comparatively little signal for security. Vulnerable code that functions correctly produces no immediate error — it ships, it runs, and the weakness remains latent until it's exploited. The fact that Opus 4.7 moves the needle on security suggests Anthropic may be investing in security-aware training. However, the absolute numbers confirm we are still far from agents that write secure code by default.
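As a concrete, purely hypothetical illustration of that training asymmetry (nothing here is taken from the benchmark tasks), the snippet below is functionally correct in the sense a test suite would check, yet it carries a classic SQL injection weakness (CWE-89):

```python
import sqlite3

def find_user(db: sqlite3.Connection, username: str):
    """Look up a user by name. Functionally 'correct': a unit test such as
    find_user(db, "alice") returns the expected row, so CI goes green."""
    # The vulnerability: user input is interpolated directly into the SQL string.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return db.execute(query).fetchone()

# The weakness stays latent until someone exploits it, for example:
#   find_user(db, "alice' OR '1'='1")
# A secure version would use a parameterized query instead:
#   db.execute("SELECT id, email FROM users WHERE name = ?", (username,))
```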
What about the agent, not just the model?
One of the findings from the Agent Security League was that the same model behaves somewhat differently depending on the agent framework, and that pattern continues with the latest model. Opus 4.7 paired with Cursor scores 22.9% on security; paired with Claude Code, it reaches 20.1%. Does non-determinism explain this small difference? Probably. In our original study, we concluded that any result should carry an uncertainty of at least ±2–3 percentage points, so this variation is within range. A relative difference of roughly 14% between agents may still be an interesting signal, however.
Does agent architecture (the scaffolding around the model that handles tool calls, context management, and code generation workflows) shape security outcomes in ways that model capability alone doesn't explain? This is a hypothesis we will continue to test as new combinations emerge.
So?
Opus 4.7 is a genuinely noteworthy release: it's the first model to deliver meaningful improvements on both functional correctness and security in our benchmark. The security curve, which had been effectively flat since we started measuring, has moved.
The key finding of this and other security benchmarks remains unchanged, however: AI-generated code still requires independent security review before it reaches production. Even the best-performing combination leaves more than three-quarters of its functionally correct solutions vulnerable. Teams adopting AI coding agents should continue to treat their output the way they would treat code from a prolific but security-unaware developer: likely to work, unlikely to be safe by default, and always in need of review.
We'll keep the leaderboard updated as new models and agents ship. The full methodology and CWE-level analysis are available in our whitepaper, and the benchmark builds on the open SusVibes framework developed at Carnegie Mellon University.
Give Your AI Coding Assistants the Security Tools They Deserve



What's next?
When you're ready to take the next step in securing your software supply chain, here are 3 ways Endor Labs can help:






