By clicking “Accept”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.

AI Coding Agent Security Benchmark

The Agent Security League is an open benchmark tracking the functional and security correctness of popular AI coding agents across real-world tasks.

Download paper

84.9%

highest functional correctness score
(tie, Cursor with GPT-5.5 or Claude Opus 4.6)

agent + model combos independently benchmarked

24%

highest security correctness score
(Cursor with GPT-5.5)

Agent Security League

Updated as new commercial agents and models ship. Sorted by security correctness.

HARNESS

MODEL

FUNCTIONAL (%)*

SECURE (%)^†

DATE

Cursor

GPT-5.5

84.9

24.0

2026-04-25

Codex

GPT-5.5

62.6

22.4

2026-04-25

Codex

GPT-5.4

63.1

21.8

2026-03-18

Cursor

Claude Opus 4.8

75.4

20.7

2026-05-30

Cursor

Claude Opus 4.7

79.9

18.4

2026-04-17

Cursor

Gemini 3.5 Flash

79.3

17.9

2026-05-20

Claude Code

Claude Opus 4.7

76.5

16.8

2026-04-16

Cursor

Gemini 3.1 Pro

73.2

15.1

2026-03-24

Claude Code

Claude Opus 4.8

73.7

14.5

2026-05-29

Cursor

GPT-5.3

48.0

14.5

2026-02-27

Cursor

Composer 2.5

75.4

14.0

2026-05-19

Claude Code

Claude Sonnet 4.6

69.8

12.3

2026-02-20

Cursor

Claude Opus 4.6

84.9

11.2

2026-03-19

Claude Code

Claude Opus 4.6

78.2

10.6

2026-03-16

Claude Code

Gemini 3 Pro

44.1

9.5

2026-02-23

Claude Code

Claude Opus 4.5

68.7

8.4

2026-02-25

Cursor

Gemini 3 Pro

27.9

5.6

2026-02-24

Claude Code

Claude Sonnet 4

45.3

6.1

2026-02-11

Claude Code

Gemini 2.5 Pro

19.6

5.0

2026-02-11

SWE-Agent

Claude Sonnet 4

55.3

7.8

2026-02-19

SWE-Agent

Gemini 2.5 Pro

20.7

4.5

2026-02-19

*Functional (FuncPass): The generated code passes the task's functional test suite — it works.

^†Secure (SecPass): The generated code also passes security tests — works and it's safe.

Methodology

Peer-reviewed research

The Agent Security League extends SusVibes, a foundational benchmark developed at Carnegie Mellon University.

Real-world tasks

The benchmark consists of 200 tasks drawn from 108 open-source Python projects spanning 77 CWE vulnerability classes.

Anti-cheating safeguards

Our evaluation pipeline includes prompt hardening, workspace sanitization, and automated cheating detection.

Resources

Two glowing green code editor windows side by side connected by flowing lines, representing data or code synchronization.

Research

How AI coding agents cheat benchmarks

Green padlock icon centered on a glowing shield symbol against a dark background representing security.

Whitepaper

Evaluating the Security of AI Coding Agents

Laptop screen showing a table with agent numbers and security percentages, overlaid with a glowing green magnifying glass icon.

Blog Post

Introducing the Agent Code Security Leaderboard

Security in every line, so you can code without compromise.

AI coding agents can write code, but they lack security context. AURI is the security harness your coding agent is missing.
Always free for developers.

Start Now

Book a Demo