By clicking “Accept”, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Privacy Policy for more information.
18px_cookie
e-remove
Blog

Build vs. Buy Code Security: Same Model, Same Tasks, 12x the Token Bill

Written by
Varun Badhwar
Varun Badhwar
Published on
June 11, 2026
Updated on
June 11, 2026

There's a question I get from security leaders more than any other right now. It usually comes from their CFO or their board, and it goes something like this:

"AI is really good at writing code now. Why are we still buying security tools? Why can't my team just build this themselves?"

It's a fair question. Press coverage about Mythos, Claude Code Security, and Codex Security makes it feel like the entire application security is getting replaced. And the question is landing at the worst possible moment. 

The same board that's asking why you can't build your own tools is also asking how you'll survive an AI-driven threat curve. They've read the headlines: the CVE program published roughly 42,000 new vulnerabilities in 2025, with a projected 50% increase in 2026 over the year prior. Mandiant's M-Trends 2026 now estimates the mean time-to-exploit at negative seven days — attackers are weaponizing flaws before the patch ships. In fact, at Endor Labs, we’ve observed successful exploits taking in place in under 10 hours.

So the build question and the threat question arrive in the same meeting. Build it yourself, and do it faster than the attackers. Here's how I'd actually answer it.

The assumption is backwards

The instinct behind "let's build it" is that you save money and get better capabilities. The reality is more nuanced, however, and it’s been a journey we’ve been at Endor Labs for over a year. We extensively build with and use LLMs in our own products, and we’ve found they work best as a reasoning layer, not a replacement for other security methods.

Agents can act like a security engineer for the code your team writes — reading logic, or spotting subtle authentication gaps and logic flaws. But they cannot be your static analysis tool, software composition analysis, your secrets scanning, your container posture, or your supply chain defense. Those are different problems, and most of them aren't reasoning problems at all.

Someone pays for the tokens

To start, there's a cost nobody puts in the build-vs-buy spreadsheet: someone, somewhere, pays for the tokens. Security might not be paying today, but that bill is real, and it scales with every pull request your developers open. Building a stack entirely from scratch moves the cost, but doesn’t eliminate it.

Agentic workloads are not chat. A chat turn is one round trip; an agent loops, re-sending its entire accumulated context at every reasoning step. The result is that agent workloads burn 5 to 30 times the tokens of a single chat interaction. This is now enough of a problem that the FinOps Foundation reports 98% of organizations are actively managing AI spend — up from 63% a year earlier — and more than half admit they don't understand the full scope of what they're spending.

So what does "build your own" actually cost? Try the exercise yourself: ask Claude to estimate the annual budget for native AI-powered PR and code security at an enterprise with 150 developers, using publicly available pricing. It comes back with a range of $600,000 to $1.3 million a year, midpoint around $900,000. 

And that number has a short shelf life, frontier pricing is moving the wrong direction. Anthropic's newest model, Claude Fable 5, is priced at exactly double the previous generation: $10 per million input tokens and $50 per million output, versus $5 and $25 for Opus 4.8. Run the same 150-developer exercise at Fable 5 rates and the range jumps to roughly $1 million to $2.4 million a year, with a midpoint around $1.6 million. The budget you scoped in January is wrong by June.

And that assumes you build it well. Build it badly and the numbers get worse fast. One 50-engineer team ran a per-PR security agent and watched it generate an $8,400 bill in the first month, the root cause was a 50,000-token security policy getting injected into every single prompt.

Deterministic tools change the equation

The naive way to use an LLM for security is to hand it a codebase and say find the vulnerabilities. It works, sort of, in a demo. At scale it does three bad things:

  1. It burns tokens linearly with the size of your repo
  2. It hallucinates
  3. It floods your triage queue with false positives

An academic baseline that tried exactly this generated nearly 14,000 candidate "sinks" in a single project and confirmed zero real vulnerabilities. 

The better way is to give the agent deterministic tools. Why ask an agent to run a calculation when you give it a calculator? The same applies to security work. You let the deterministic systems do the deterministic work, and you spend the model's expensive reasoning where it enhances the baseline analysis and genuinely needs judgment. This isn't just an Endor Labs opinion; independent work on tool-use architectures (the kind that powers Anthropic's own Model Context Protocol) shows token reductions up to 98% with accuracy preserved or improved.

Benchmark results

To benchmark this, we ran 442 realistic application security prompts through two identical setups. The agent was prompted to determine five common security tasks:

  • Reachability
  • Exploitability
  • Prioritization
  • Remediation
  • License compliance

One agent got compact evidence from Endor Labs: findings, reachability and call paths, upgrade impact analysis, license risk. The other got nothing but the local repo and its own reasoning. Same prompts, same model, same questions. 

These results correspond to a run on 12 large open-source projects spanning six programming languages: JavaScript, Python, Java, Go, Rust, and C#. For each language, we picked one CMS and one widely-used library or framework (for example: Ghost and socket.io for JavaScript, Umbraco and Dapper for C#, OpenCMS and Thymeleaf for Java), plus a wide task across all the codebases at once.

Metric With Endor Labs context LLM reasoning alone Result
Total tokens 6.6M 79.5M 91.7% fewer
Wall-clock time 133 min 370 min 2.8x faster
Tool-loop calls 1,380 6,165 77.6% fewer

Where the token savings come from

The token gap isn't subtle, and the trace data shows exactly where it opens up. The LLM-alone agent averaged ~14 tool calls per question, and they were almost all low-level repository primitives. Across the run, the LLM-only agent issued:

  • 1,985 web searches
  • 849 grep calls
  • 783 manifest reads
  • 689 directory listings
  • 547 file reads

It was hand-rolling software composition analysis from scratch, grepping for version strings, walking package.json files, and counting imports. Because an agent re-sends its entire accumulated context on every reasoning step, each of those repo dumps gets paid for again and again. The token drain doesn’t come from the agent thinking, but the constant re-reading of raw repository data the model had to gather itself.

The Endor Labs-equipped agent answered the same questions with ~2.5 tool calls per question, and they were high-level, security-aware constructs looking for a reachability summary, CVE exposure, or upgrade impact. One call returns the answer because it isn't reasoning, it's a data lookup. The 91.7% token reduction and the 4.5x drop in tool calls are the same phenomenon measured two ways: fewer, higher-level calls against a deterministic security-aware engine instead of many low-level calls against a raw filesystem.

What an agentic reasoning layer can't replace

AI code review does one thing: it analyzes the code you write. That leaves a lot uncovered. SAST, SCA, secrets scanning, container and IaC posture — and most importantly, package firewalls and dependency governance, the controls standing between your build and a poisoned open-source package.

Anthropic agrees. Michael Moore, Cybersecurity Lead at Anthropic, put it this way:

"There's a lot of questions about whether or not this should replace existing security tools, and our recommendation is no… our recommendation, and what we're seeing customers do, is deploying [Claude Code Security] alongside their existing tools, and augmenting them."

The vendor building the best reasoning layer in the world is telling you it's a layer, not a replacement for your entire application security stack.

The teams handling this transition well do two things: they give their AI agents deterministic security tools to work with, and they invest in the capabilities an AI model can't cover. That's why Cursor, OpenAI, and other AI pioneers choose Endor Labs as the security verification and evidence layer for their agentic code security programs.

If you're paying for tokens, make sure you're paying for outcomes, not just inference. Read our team’s technical whitepaper for an in-depth analysis of the benchmark.