By clicking “Accept”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.
18px_cookie
e-remove

AI Coding Agent Security Benchmark

The Agent Security League is an open benchmark tracking the functional and security correctness of popular AI coding agents across real-world tasks.
84.9%
highest functional correctness score
(tie, Cursor with GPT-5.5 or Claude Opus 4.6)
21
agent + model combos independently benchmarked
24%
highest security correctness score
(Cursor with GPT-5.5)

Agent Security League

Updated as new commercial agents and models ship. Sorted by security correctness.
HARNESS:
All
MODEL:
All Models
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
#
HARNESS
MODEL
FUNCTIONAL (%)*
SECURE (%)
DATE
01
Cursor
GPT-5.5
84.9
24.0
2026-04-25
02
Codex
GPT-5.5
62.6
22.4
2026-04-25
03
Codex
GPT-5.4
63.1
21.8
2026-03-18
04
Cursor
Claude Opus 4.8
75.4
20.7
2026-05-30
05
Cursor
Claude Opus 4.7
79.9
18.4
2026-04-17
06
Cursor
Gemini 3.5 Flash
79.3
17.9
2026-05-20
07
Claude Code
Claude Opus 4.7
76.5
16.8
2026-04-16
08
Cursor
Gemini 3.1 Pro
73.2
15.1
2026-03-24
09
Claude Code
Claude Opus 4.8
73.7
14.5
2026-05-29
10
Cursor
GPT-5.3
48.0
14.5
2026-02-27
11
Cursor
Composer 2.5
75.4
14.0
2026-05-19
12
Claude Code
Claude Sonnet 4.6
69.8
12.3
2026-02-20
13
Cursor
Claude Opus 4.6
84.9
11.2
2026-03-19
14
Claude Code
Claude Opus 4.6
78.2
10.6
2026-03-16
15
Claude Code
Gemini 3 Pro
44.1
9.5
2026-02-23
16
Claude Code
Claude Opus 4.5
68.7
8.4
2026-02-25
17
Cursor
Gemini 3 Pro
27.9
5.6
2026-02-24
18
Claude Code
Claude Sonnet 4
45.3
6.1
2026-02-11
19
Claude Code
Gemini 2.5 Pro
19.6
5.0
2026-02-11
20
SWE-Agent
Claude Sonnet 4
55.3
7.8
2026-02-19
21
SWE-Agent
Gemini 2.5 Pro
20.7
4.5
2026-02-19
*Functional (FuncPass): The generated code passes the task's functional test suite — it works.
Secure (SecPass): The generated code also passes security tests — works and it's safe.

Methodology

Peer-reviewed research

The Agent Security League extends SusVibes, a foundational benchmark developed at Carnegie Mellon University.

Real-world tasks

The benchmark consists of 200 tasks drawn from 108 open-source Python projects spanning 77 CWE vulnerability classes.

Anti-cheating safeguards

Our evaluation pipeline includes prompt hardening, workspace sanitization, and automated cheating detection.

Security in every line, so you can code without compromise.

AI coding agents can write code, but they lack security context. AURI is the security harness your coding agent is missing.
Always free for developers.