Executive summary
Cursor + GPT-5.5 sets a new high on security correctness at 23.5%, edging past the previous record set by Cursor + Opus 4.7 (22.9%). It's the third agent-model combination to clear the 20% bar on security, and the second to do so within a week. Still a failing grade for security, but the trend continues in a positive direction.
The more interesting result, though, is what happens when the same model runs through a different harness. Codex + GPT-5.5 ties for the third-highest security score we've measured at 20.1%, but lands at 61.5% on functional correctness, a slight regression behind its predecessor Codex + GPT-5.4 (62.6%) and roughly 26 percentage points below the same model running through Cursor.
That’s the key takeaway: same model, same week, two harnesses, two different functional results.
The numbers
For context, here's how the new entries slot into the top of the leaderboard, sorted by security:
The functional-security gap for Codex + GPT-5.5 is 41 points — narrower than the leaders, but only because the functional ceiling is lower. The combination ties Claude Code + Opus 4.7 on security at 20.1%, but produces working code on roughly 26 percentage points fewer tasks. By contrast, Cursor + GPT-5.5 has a 64-point gap, in line with the other Cursor-harnessed combinations at the top of the board.
The full interactive leaderboard is at endorlabs.com/research/ai-code-security-benchmark.
What's holding back functional code scores for Codex with GPT-5.5?
We looked at the 30 task instances where Codex + GPT-5.5 fails on functional correctness while at least one other combination on the leaderboard succeeds.
Of those 30 failures, 16 are also failed by Codex + GPT-5.4, i.e., the same harness paired with the previous-generation model. That overlap suggests a systematic Codex limitation rather than something specific to GPT-5.5: the Codex CLI harness may handle certain repository structures, build systems, or test frameworks differently than Cursor or Claude Code, and those differences may cost it functional passes regardless of which OpenAI model is driving.
The following table groups those 16 failures into five categories, describing the nature of each functional test issue and listing the corresponding instance (CVE).
The case worth zooming in on: planet-client-python
One case stands out: planetlabs/planet-client-python (CVE-2023-32303). It's the only instance we've found where every other agent + model combination we've tested produces a functionally correct solution, and Codex + GPT-5.5 uniquely fails.
The CVE itself is straightforward. The Planet SDK for Python created its secret credentials file with overly permissive file permissions, allowing a local authenticated attacker to read it. The fix is the kind of thing a security-aware developer would write in their sleep: open the file with mode=0o600, or chmod it after creation. CWE-732 (Incorrect Permission Assignment for Critical Resource) is well-trodden ground, and every other combination we've put on this task handles it without trouble.
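The fix pattern described above can be sketched in a few lines. This is an illustration of the general CWE-732 remediation, not the Planet SDK's actual patch; the file name and helper are hypothetical.

```python
import os
import stat

SECRET_FILE = "planet_secret.json"  # hypothetical path, for illustration only

def write_secret(path: str, data: str) -> None:
    """Create the credentials file readable and writable by the owner only (0o600)."""
    # os.open sets the permission bits at creation time, avoiding a window
    # where the file briefly exists with default (umask-derived) permissions
    # before a later chmod.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(data)

write_secret(SECRET_FILE, '{"api_key": "..."}')
print(oct(stat.S_IMODE(os.stat(SECRET_FILE).st_mode)))  # 0o600
```

Setting the mode at creation is preferable to a post-hoc `chmod`, since the latter leaves a brief window where the secret is world-readable.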
Codex + GPT-5.5 doesn't. Reviewing the failing test, we found that Codex + GPT-5.5 hallucinated an opener argument, which exists only on the built-in open function, not on pathlib.Path.open. The call failed with a TypeError.
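The failure mode is easy to reproduce. This is our own minimal reproduction, not Codex's actual output; the path and `secure_opener` helper are illustrative.

```python
import os
import tempfile
from pathlib import Path

path = Path(tempfile.gettempdir()) / "opener_demo.txt"

def secure_opener(p, flags):
    # opener hook: create the file with 0o600 permissions
    return os.open(p, flags, 0o600)

# The built-in open() accepts an `opener` keyword argument:
with open(path, "w", opener=secure_opener) as f:
    f.write("ok")

# pathlib.Path.open does NOT accept `opener`, so a call like the one the
# agent generated raises a TypeError before the file is ever opened:
try:
    path.open("w", opener=secure_opener)
    failed_with_typeerror = False
except TypeError:
    failed_with_typeerror = True

print(failed_with_typeerror)  # True
```

The two APIs are close enough that confusing their signatures is an understandable slip, but it is exactly the kind of slip a functional test catches immediately.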
What the security numbers tell us
It's easy to focus on the functional gap, but the other important number here is the security ceiling. A few things are worth noting:
1. Cursor + GPT-5.5 sets a new record on security
Cursor + GPT-5.5's 23.5% SecPass is the highest single score we've measured. It edges past Cursor + Opus 4.7 (22.9%) by 0.6 percentage points, well within the ±2–3 pp non-determinism band observed in the literature, so the practical interpretation is that the two combinations are statistically indistinguishable on security. What's notable is that two different model families, through the same harness, are now both clearing 22%.
2. Codex shows a consistently narrower functional-security gap
Codex + GPT-5.4 had a 45-point functional-security gap. Codex + GPT-5.5 has a 41-point gap. Both are well below the 60+ point gaps we see on Cursor and Claude Code combinations at the top of the board. Whatever Codex is doing with prompt formatting, tool use, or context management, it appears to surface security-relevant reasoning more often than competing harnesses do — even when the underlying model is producing fewer functionally complete solutions overall.
Conclusion
The broader pattern from the leaderboard still holds: the agent harness matters as much as model capability. With GPT-5.5 now sitting both at the top of the security board (through Cursor) and tied for third (through Codex), we have a useful within-week, within-model comparison of how much the harness shapes outcomes, and the answer is "more than the model alone."
We'll keep evaluating new combinations as they ship. The full methodology and CWE-level analysis are in our whitepaper, and the benchmark builds on the open SusVibes framework from Carnegie Mellon University.
You can explore the full interactive leaderboard at endorlabs.com/research/ai-code-security-benchmark.