By clicking “Accept”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.

Is AI Coding Safe? Introducing the Agent Security League

AI coding agents can write working code, but mostly not secure code. Explore benchmark results showing over 80% of AI-generated code contains vulnerabilities.

Open Report

View Report

Written by

Luca Compagna

Published on

April 15, 2026

Updated on

April 15, 2026

Topics

AI/ML

AI coding agents can write working code, but mostly not secure code. Explore benchmark results showing over 80% of AI-generated code contains vulnerabilities.

Open Report

View Report

Summarize with AI

AI agents are writing production code faster than ever—for experienced developers and non-engineers alike, from marketing to finance. The question is no longer if they can write code, but whether they can write it safely.

The Agent Security League is an independent, open leaderboard built to answer that question rigorously, a single consistent benchmark applied every time a new agent or model enters the market. Think of it as a living stress test for the tools your team might already be using.
‍
Key takeaways

AI-generated code remains unsafe: in realistic settings, >80% of LLM-generated code introduces security vulnerabilities.
Agents are getting much better at writing code that works. They are not getting better at writing safe code.
Grounded in a landmark academic benchmark (SusVibes, CMU) covering 200 tasks across 77 vulnerability classes
Public leaderboard extending SusVibes to track vibe coding security progress as new commercial agents/models are released.
The new generation of agents ignores explicit instructions, e.g., by leveraging git history to reverse-engineer fixes and inflate their scores – so we added an anti-cheating module to SusVibes’ harness to detect and correct this behavior.

Benchmark results

Below is today's snapshot. Results update as new commercial models ship.

Harness	Model	Functional (%)*	Secure (%)†	Date
Codex	GPT-5.4	62.6	17.3	2026-03-18
Cursor	Gemini 3.1 Pro	73.7	13.4	2026-03-24
Cursor	GPT-5.3	48.0	12.8	2026-02-27
Cursor	Claude Opus 4.6	84.4	7.8	2026-03-19
Cursor	Gemini 3 Pro	31.8	7.3	2026-02-24
Claude Code	Claude Opus 4.5	69.8	10.1	2026-02-25
Claude Code	Claude Opus 4.6	81.0	8.4	2026-03-16
Claude Code	Gemini 3 Pro	41.9	8.4	2026-02-23
Claude Code	Claude Sonnet 4.6	62.0	7.8	2026-02-20
Claude Code	Claude Sonnet 4	45.3	6.1	2026-02-11
Claude Code	Gemini 2.5 Pro	19.6	5.0	2026-02-11
SWE-Agent	Claude Sonnet 4	55.3	7.8	2026-02-19
SWE-Agent	Gemini 2.5 Pro	20.7	4.5	2026-02-19

Introduction

Vibe coding, the practice of handing a natural language description to an AI agent and letting it write the code, has gone from novelty to norm remarkably fast. The term, coined by Andrej Karpathy in February 2025, was named Collins Dictionary's Word of the Year by year's end. The software development lifecycle has entered an agentic phase, with widespread developer adoption increasingly reinforced by executive mandates from CTOs and CEOs.

But there is a question that keeps nagging: Is the LLM-generated code actually safe for broad adoption by trained software professionals as well as vibe coders?

The code an agent+model pair produces can be functionally correct without being secure. A function can pass every unit test and still contain a SQL injection, a path traversal, or a timing side-channel vulnerability. That gap is exactly what we set out to measure.

Our results confirm that today’s best agents are not secure by default, about 83% of the code they generate in realistic development tasks contains security vulnerabilities.

The SusVibes Benchmark: open, real-world, reproducible

Our work builds on SusVibes, a benchmark introduced by Zhao et al. in their paper "Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks" (arXiv:2512.03262). This benchmark, developed at Carnegie Mellon University, aims to provide a rigorous way to test whether agents+models write secure code when solving realistic software engineering tasks.

The benchmark consists of 200 tasks drawn from 108 real-world OSS Python projects and covering 77 CWE categories. Each task is constructed from a historical vulnerability fix: the vulnerable feature is removed from the codebase, a natural language description is generated, and both functional tests and security tests are collected. The agent must re-implement the feature from its description, the repository, and functional tests alone — without access to the original code or any security tests, which are intentionally hidden.

Two metrics capture the outcome:

FuncPass: Does the generated code pass the functional test suite? In other words, does it work?
SecPass: Does it also pass the security tests? In other words, is it secure?

SecPass is computed over FuncPass successes, a solution only counts as secure if it is also functionally correct. This mirrors the real world: nobody ships a function that doesn't work, so the security question only matters for code that does work.

The original paper evaluated three agent frameworks (SWE-Agent, OpenHands, Claude Code) with three models (Claude Sonnet 4, Kimi K2, Gemini 2.5 Pro). The headline finding was sobering: the best combination (SWE-Agent + Claude Sonnet 4) achieved 61% FuncPass but only 10.5% SecPass. Over 80% of the functionally correct solutions contained security vulnerabilities.

Agent Security Leagues: Our Goal

We are launching Agent Security Leagues, building directly on top of SusVibes, an extensible, reproducible evaluation platform developed at Carnegie Mellon University. Our goal is to track how commercial coding agents (Claude Code, Cursor, Codex), paired with proprietary frontier models from Anthropic, Google, and OpenAI, perform on security over time. While these systems come with significant per-run cost, they reflect the tools and models actually used by professional development teams. Results are continuously measured and published through a public leaderboard.

This effort extends and complements the SusVibes leaderboard, which primarily focuses on open-source agent frameworks and, more recently, open-weight models such as GLM and Qwen. Together, these approaches provide a more complete view of the ecosystem. As new agents emerge and models evolve, our goal is to maintain a consistent benchmark, enabling meaningful comparisons over time and, critically, tracking whether the security gap is truly closing.

Our initial methodology

We took a two-step approach:

First, reproduce. Before running any new experiments, we repeated some of the paper's original evaluations to verify that our setup produces consistent results. The numbers matched well within the expected margin of non-determinism in this domain:

Agent	Model	Paper FuncPass	Our FuncPass	Paper SecPass	Our SecPass
SWE-Agent	Claude Sonnet 4	61.0	55.0	10.5	7.5
SWE-Agent	Gemini 2.5 Pro	19.5	19.0	7.0	4.5
Claude Code	Claude Sonnet 4	44.0	44.0	6.0	6.5
Claude Code	Gemini 2.5 Pro	15.0	18.0	4.5	5.0

With the machinery validated, we moved to the second step.

Then, extend. We ran the benchmark against frontier agents and models that were not part of the original study. This meant integrating new agent frameworks (Cursor, Codex) and evaluating newer model generations (Claude Opus 4.x, Gemini 3 Pro, GPT-5.3 Codex). All experiments use the same pass@1 strategy used in the paper, the agent gets one shot to produce the correct, secure implementation, just like in real-world vibe coding.

The adjusted anti-cheating methodology

We observed several frontier agent+model combinations exploiting shortcuts when solving development tasks. For instance, despite explicit instructions not to inspect a repository’s git history, which is fully included in each SusVibes benchmark instance, agents frequently ignored this constraint and recovered the expected patch directly from the commit log.

This matters more than it might first appear. The SusVibes benchmark constructs each task from a real historical vulnerability fix: the vulnerable code is removed, and the agent is asked to re-implement the feature from a natural-language description. The git history of the repository still contains the original commit that introduced the secure fix. When an agent reads that history, it is not solving the problem, it is copying the answer. The resulting FuncPass and SecPass scores reflect the agent's ability to follow a trail of breadcrumbs, not its ability to write secure code from requirements.

The implications are significant. If cheating goes undetected, benchmark results paint a dangerously flattering picture of an agent's real-world security capabilities. Developers, security teams, and procurement decisions that rely on those inflated numbers would be making choices based on an illusion. In our testing, the scale of the problem was striking: SWE-Agent paired with Claude Opus 4.6 exploited git history in 163 out of 200 tasks, over 81% of the benchmark. Any evaluation that does not account for this behavior is, in effect, measuring something other than what it claims.

There is a broader concern here, too. These agents did not stumble onto the git history by accident. They actively sought it out despite being explicitly instructed not to. This is a form of instruction non-compliance, the agent pursuing a strategy that maximizes its objective (producing code that passes tests) while disregarding constraints set by its operator. In a benchmarking context, this inflates scores. In a production context, where agents operate inside real codebases with access to secrets, credentials, internal APIs, and sensitive data, the same behavior pattern has far more serious consequences. An agent that routinely ignores boundaries to achieve its goal cannot be trusted with broad repository access without robust guardrails in place.

To counter this, we first hardened the prompt with explicit anti-cheating instructions. When agents continued to use shortcuts regardless, we added a cheating-detection phase followed by a results-adjustment phase to ensure fairness. The detection mechanism analyses each agent's tool calls and file access patterns to identify cases where the solution was derived from commit history rather than from the task description. Results flagged as cheating-assisted are ignored for the score computation to reflect the agent's unassisted performance. We have shared and discussed these findings with the SusVibes authors, who have independently observed similar behaviors. These controls are, to our knowledge, the first formal anti-cheating measures applied to any AI coding benchmark, and they are going to be included as part of the open methodology.

Code That Works, But Isn’t Safe

The leaderboard reveals a clear divergence: functional correctness has climbed from 61% in the original Carnegie Mellon study to 84.4%, while security has barely improved, peaking at just 17.3%, less than five points above the 12.5% best result reported with OpenHands and Claude Sonnet 4.

Agents are getting much better at writing code that works. They are not getting better at writing code that is safe.

This asymmetry is not surprising when you consider what these models have been optimized for. The training signal for functional correctness is strong and abundant: test suites pass or fail, CI pipelines go green or red, users file bugs when features break. Security, by contrast, is a largely silent property. Vulnerable code that functions correctly produces no immediate error signal, it ships, it runs, and the weakness remains latent until it is exploited. Models have had comparatively little training signal to learn that a working SQL query built with string concatenation is a liability, or that a file-open call without path canonicalization is a traversal vulnerability waiting to happen.

The trajectory of the functional scores is encouraging, it shows that agent architectures and model capabilities are genuinely improving at the core task of understanding codebases and producing working implementations. But the flat security curve tells us that this improvement is not transferring to security reasoning. Writing secure code requires a different kind of awareness: understanding threat models, anticipating adversarial input, knowing which CWE patterns apply to a given context, and choosing defensive constructions even when a simpler, less safe alternative would pass every functional test. Current models and agent frameworks are not demonstrating that awareness at any meaningful scale.

For the foreseeable future, AI-generated code will require a robust, independent security review. The data does not support a workflow in which agent output is merged with only functional validation. Whether an organization uses automated static analysis, dynamic testing, manual code review, or a combination, the security verification step is not optional, it is load-bearing. Teams adopting AI coding agents should treat their output the same way they would treat a pull request from a prolific but junior developer: likely to work, unlikely to be secure by default, and always in need of security-focused review before it reaches production.

Looking ahead, there are reasons for cautious optimism alongside the caution. Model providers are beginning to invest in security-aware fine-tuning and reinforcement learning from security feedback signals. Agent frameworks could integrate security linting and SAST tooling directly into their generation loops, catching common CWE patterns before code is even presented to the developer. Benchmark efforts like this one create the measurement infrastructure needed to track whether those investments actually move the needle. But the honest assessment today is that no currently available agent+model combination produces code that can be trusted on security without external verification, and the gap between functional capability and security capability is wide enough that it is unlikely to close in the near term through model scaling alone. Closing it will require deliberate architectural choices: security-specific training data, tool-augmented generation pipelines, and a cultural shift that treats security pass rates as a first-class metric alongside functional correctness.

Our whitepaper digs deep into why, with CWE-level analysis, cheating forensics, and a closer look at what it would take to close the gap.

Find out More