Executive summary
Cursor + GPT-5.5 sets a new high on security correctness at 23.5%, edging past the previous record set by Cursor + Opus 4.7 (22.9%). It's the third agent-model combination to clear the 20% bar on security, and the second to do so within a week. Still a failing grade for security, but the trend continues in a positive direction.
The more interesting result, though, is what happens when the same model runs through a different harness. Codex + GPT-5.5 ties for the third-highest security score we've measured at 20.1%, but lands at 61.5% on functional correctness, a slight regression behind its predecessor Codex + GPT-5.4 (62.6%) and roughly 26 percentage points below the same model running through Cursor.
That’s the key takeaway: same model, same week, two harnesses, two different functional results.
The numbers
For context, here's how the new entries slot into the top of the leaderboard, sorted by security:
The functional-security gap for Codex + GPT-5.5 is 41 points — narrower than the leaders, but only because the functional ceiling is lower. The combination ties Claude Code + Opus 4.7 on security at 20.1%, but produces working code on roughly 26 percentage points fewer tasks. By contrast, Cursor + GPT-5.5 has a 64-point gap, in line with the other Cursor-harnessed combinations at the top of the board.
The full interactive leaderboard is at endorlabs.com/research/ai-code-security-benchmark.
What's holding back functional code scores for Codex with GPT-5.5?
We looked at the 30 task instances where Codex + GPT-5.5 fails on functional correctness while at least one other combination on the leaderboard succeeds.
Of those 30 failures, 16 are also failed by Codex + GPT-5.4, i.e., the same harness paired with the previous-generation model. That overlap suggests a systematic Codex limitation rather than something specific to GPT-5.5: the Codex CLI harness may handle certain repository structures, build systems, or test frameworks differently than Cursor or Claude Code, and those differences may cost it functional passes regardless of which OpenAI model is driving.
The following table groups those 16 failures into five categories, describing the nature of each functional test issue and listing the corresponding instance (CVE).
The case worth zooming in on: planet-client-python
One case stands out: planetlabs/planet-client-python (CVE-2023-32303). It's the only instance we've found where every other agent + model combination we've tested produces a functionally correct solution, and Codex + GPT-5.5 uniquely fails.
The CVE itself is straightforward. The Planet SDK for Python created its secret credentials file with overly permissive file permissions, allowing a local authenticated attacker to read it. The fix is the kind of thing a security-aware developer would write in their sleep: open the file with mode=0o600, or chmod it after creation. CWE-732 (Incorrect Permission Assignment for Critical Resource) is well-trodden ground, and every other combination we've put on this task handles it without trouble.
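The fix pattern described above can be sketched in a few lines. This is an illustration of the general CWE-732 remediation, not the Planet SDK's actual patch; the file name and helper are hypothetical.

```python
import os
import stat

SECRET_FILE = "planet_secret.json"  # hypothetical path, for illustration only

def write_secret(path: str, data: str) -> None:
    """Create the credentials file readable and writable by the owner only (0o600)."""
    # os.open sets the permission bits at creation time, avoiding a window
    # where the file briefly exists with default (umask-derived) permissions
    # before a later chmod.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(data)

write_secret(SECRET_FILE, '{"api_key": "..."}')
print(oct(stat.S_IMODE(os.stat(SECRET_FILE).st_mode)))  # 0o600
```

Setting the mode at creation is preferable to a post-hoc `chmod`, since the latter leaves a brief window where the secret is world-readable.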
Codex + GPT-5.5 doesn't. Reviewing the failing test, we found that Codex + GPT-5.5 hallucinated an opener argument, which exists only on the built-in open function, not on pathlib.Path.open. The call failed with a TypeError.
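The failure mode is easy to reproduce. This is our own minimal reproduction, not Codex's actual output; the path and `secure_opener` helper are illustrative.

```python
import os
import tempfile
from pathlib import Path

path = Path(tempfile.gettempdir()) / "opener_demo.txt"

def secure_opener(p, flags):
    # opener hook: create the file with 0o600 permissions
    return os.open(p, flags, 0o600)

# The built-in open() accepts an `opener` keyword argument:
with open(path, "w", opener=secure_opener) as f:
    f.write("ok")

# pathlib.Path.open does NOT accept `opener`, so a call like the one the
# agent generated raises a TypeError before the file is ever opened:
try:
    path.open("w", opener=secure_opener)
    failed_with_typeerror = False
except TypeError:
    failed_with_typeerror = True

print(failed_with_typeerror)  # True
```

The two APIs are close enough that confusing their signatures is an understandable slip, but it is exactly the kind of slip a functional test catches immediately.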
What the security numbers tell us
It's easy to focus on the functional gap, but the other important number here is the security ceiling. A few things are worth noting:
1. Cursor + GPT-5.5 sets a new record on security
Cursor + GPT-5.5's 23.5% SecPass is the highest single score we've measured. It edges past Cursor + Opus 4.7 (22.9%) by 0.6 percentage points, well within the ±2–3 pp non-determinism band observed in the literature, so the practical interpretation is that the two combinations are statistically indistinguishable on security. What's notable is that two different model families, through the same harness, are now both clearing 22%.
2. Codex shows a consistently narrower functional-security gap
Codex + GPT-5.4 had a 45-point functional-security gap. Codex + GPT-5.5 has a 41-point gap. Both are well below the 60+ point gaps we see on Cursor and Claude Code combinations at the top of the board. Whatever Codex is doing with prompt formatting, tool use, or context management, it appears to surface security-relevant reasoning more often than competing harnesses do — even when the underlying model is producing fewer functionally complete solutions overall.
Conclusion
The broader pattern from the leaderboard still holds: the agent harness matters as much as model capability. With GPT-5.5 now sitting both at the top of the security board (through Cursor) and tied for third (through Codex), we have a useful within-week, within-model comparison of how much the harness shapes outcomes, and the answer is "more than the model alone."
We'll keep evaluating new combinations as they ship. The full methodology and CWE-level analysis are in our whitepaper, and the benchmark builds on the open SusVibes framework from Carnegie Mellon University.
You can explore the full interactive leaderboard at endorlabs.com/research/ai-code-security-benchmark.