AI Coding Agent Security Benchmark

The Agent Security League is an open benchmark tracking the functional and security correctness of popular AI coding agents across real-world tasks.
84.4%: highest functional correctness score (Cursor with Claude Opus 4.6)
17.3%: highest security correctness score (Codex with GPT-5.4)
13: agent + model combos, independently benchmarked

Agent Security League

Updated as new commercial agents and models ship. Grouped by harness, then sorted by security correctness.

| # | HARNESS | MODEL | FUNCTIONAL (%)* | SECURE (%) | DATE |
|---|---|---|---|---|---|
| 01 | Codex | GPT-5.4 | 62.6 | 17.3 | 2026-03-18 |
| 02 | Cursor | Gemini 3.1 Pro | 73.7 | 13.4 | 2026-03-24 |
| 03 | Cursor | GPT-5.3 | 48.0 | 12.8 | 2026-02-27 |
| 04 | Cursor | Claude Opus 4.6 | 84.4 | 7.8 | 2026-03-19 |
| 05 | Cursor | Gemini 3 Pro | 31.8 | 7.3 | 2026-02-24 |
| 06 | Claude Code | Claude Opus 4.5 | 69.8 | 10.1 | 2026-02-25 |
| 07 | Claude Code | Claude Opus 4.6 | 81.0 | 8.4 | 2026-03-16 |
| 08 | Claude Code | Gemini 3 Pro | 41.9 | 8.4 | 2026-02-23 |
| 09 | Claude Code | Claude Sonnet 4.6 | 62.0 | 7.8 | 2026-02-20 |
| 10 | Claude Code | Claude Sonnet 4 | 45.3 | 6.1 | 2026-02-11 |
| 11 | Claude Code | Gemini 2.5 Pro | 19.6 | 5.0 | 2026-02-11 |
| 12 | SWE-Agent | Claude Sonnet 4 | 55.3 | 7.8 | 2026-02-19 |
| 13 | SWE-Agent | Gemini 2.5 Pro | 20.7 | 4.5 | 2026-02-19 |

*Functional (FuncPass): The generated code passes the task's functional test suite, meaning it works.
Secure (SecPass): The generated code passes both the functional test suite and the security tests, meaning it works and it is safe.
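
To make the two metrics concrete, here is a minimal Python sketch of how FuncPass and SecPass percentages could be computed from per-task outcomes. The `TaskResult` record, its field names, and the `pass_rates` helper are hypothetical and are not taken from the benchmark's actual pipeline. The sketch assumes, following the definition above, that a task counts toward SecPass only when it passes the functional suite as well as the security tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical record layout)."""
    task_id: str
    passed_functional: bool  # the functional test suite passed
    passed_security: bool    # the security tests passed


def pass_rates(results: list[TaskResult]) -> tuple[float, float]:
    """Return (FuncPass %, SecPass %) over all tasks in `results`."""
    if not results:
        raise ValueError("no results to score")
    total = len(results)
    func = sum(r.passed_functional for r in results)
    # Assumption: SecPass requires functional AND security tests to pass.
    sec = sum(r.passed_functional and r.passed_security for r in results)
    return 100 * func / total, 100 * sec / total


# Made-up outcomes for four tasks, purely for illustration.
results = [
    TaskResult("t1", True, True),    # works and is safe
    TaskResult("t2", True, False),   # works but is insecure
    TaskResult("t3", False, True),   # does not work, so not counted as secure
    TaskResult("t4", True, False),   # works but is insecure
]
func_pass, sec_pass = pass_rates(results)
print(f"FuncPass: {func_pass:.1f}%  SecPass: {sec_pass:.1f}%")
```

Under this reading, SecPass can never exceed FuncPass for the same run, which matches every row in the table above.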

Methodology

Peer-reviewed research

The Agent Security League extends SusVibes, a foundational benchmark developed at Carnegie Mellon University.

Real-world tasks

The benchmark consists of 200 tasks drawn from 108 open-source Python projects spanning 77 CWE vulnerability classes.

Anti-cheating safeguards

Our evaluation pipeline includes prompt hardening, workspace sanitization, and automated cheating detection.

Security in every line, so you can code without compromise.

AI coding agents can write code, but they lack security context. AURI is the security harness your coding agent is missing.
Always free for developers.