
The Token Economics of AI AppSec Agents

A controlled benchmark of AI agents with and without access to deterministic tools for performing common security tasks

Written by Matt Brown, Cris Staicu, and Andrew Stiefel

Published on June 11, 2026
Updated on June 11, 2026

Topics: AI/ML

One architectural decision shapes the cost of every AI AppSec deployment: does the agent derive security facts on its own, or work from precomputed, deterministic evidence?

The Endor Labs research team ran a controlled benchmark to find out. Both agent configurations used the same model, the same 34 AppSec prompts, and the same step budget across 12 large open-source projects. One agent had access to precomputed Endor Labs evidence. The other worked from the local repository and public web research.

The results:

91.7% fewer tokens (6.6M vs 79.5M)
2.8x faster completion
77.6% fewer tool calls (1,380 vs 6,165)
More predictable costs (a 1.8x spread across projects vs a 4x swing)
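The reduction figures can be recomputed from the raw totals. A small Python sketch, assuming each percentage is defined as 1 - (with evidence / without evidence) over the benchmark-wide totals; the constant and function names are illustrative, not taken from the report:

```python
# Benchmark-wide totals as reported above (names are illustrative).
TOKENS_WITHOUT_EVIDENCE = 79.5e6
TOKENS_WITH_EVIDENCE = 6.6e6
TOOL_CALLS_WITHOUT_EVIDENCE = 6_165
TOOL_CALLS_WITH_EVIDENCE = 1_380


def percent_reduction(baseline: float, treated: float) -> float:
    """Percentage by which `treated` is lower than `baseline`."""
    return (1 - treated / baseline) * 100


def ratio(baseline: float, treated: float) -> float:
    """How many times larger `baseline` is than `treated`."""
    return baseline / treated


token_reduction = percent_reduction(TOKENS_WITHOUT_EVIDENCE, TOKENS_WITH_EVIDENCE)
token_ratio = ratio(TOKENS_WITHOUT_EVIDENCE, TOKENS_WITH_EVIDENCE)
call_reduction = percent_reduction(TOOL_CALLS_WITHOUT_EVIDENCE, TOOL_CALLS_WITH_EVIDENCE)

print(f"token reduction: {token_reduction:.1f}%")      # 91.7%
print(f"token ratio: {token_ratio:.1f}x")              # 12.0x
print(f"tool-call reduction: {call_reduction:.1f}%")   # 77.6%
```

The same ratio (79.5M / 6.6M, about 12x) is the token gap the rest of the post attributes to reconnaissance.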

The 12x token gap comes down to reconnaissance. The unequipped agent averaged 14 tool calls per question (grep searches, manifest reads, directory listings), effectively hand-rolling software composition analysis from scratch. The evidence-equipped agent averaged 3, all high-level, security-aware lookups.

Without evidence, cost also scales with codebase size: the same 34 prompts consumed 2.7M tokens on a smaller project and 10.5M on a larger one. With evidence, they stayed between 0.33M and 0.61M tokens per project, regardless of size.
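The spread figures in the results list appear to be max-to-min ratios of these per-project totals. A minimal sketch, assuming 2.7M and 10.5M are the cheapest and most expensive projects without evidence, and 0.33M and 0.61M the equivalent bounds with evidence:

```python
# Per-project token totals for the same 34 prompts, in millions.
# Assumption: each tuple is (cheapest project, most expensive project).
WITHOUT_EVIDENCE_RANGE = (2.7, 10.5)
WITH_EVIDENCE_RANGE = (0.33, 0.61)


def spread(low: float, high: float) -> float:
    """Max-to-min ratio: how far cost swings across projects."""
    return high / low


print(f"without evidence: {spread(*WITHOUT_EVIDENCE_RANGE):.1f}x")  # 3.9x
print(f"with evidence: {spread(*WITH_EVIDENCE_RANGE):.1f}x")        # 1.8x
```

Under that assumption, 10.5 / 2.7 is about 3.9, which rounds to the 4x swing, and 0.61 / 0.33 is about 1.85, the 1.8x spread.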

Without a call graph, agents can also be confidently wrong. One agent spent 22x the tokens on a single reachability question and still produced a well-cited, completely incorrect answer.
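Reachability is, at its core, a graph search: is there a path from an application entry point to the vulnerable function? A minimal Python sketch, assuming a precomputed call graph stored as an adjacency map; the function and package names are hypothetical, and this is not Endor Labs' implementation:

```python
from collections import deque

# Hypothetical call graph: each function maps to the functions it calls directly.
CallGraph = dict[str, list[str]]


def is_reachable(graph: CallGraph, entry_points: list[str], target: str) -> bool:
    """Breadth-first search from the entry points; True if `target` is called transitively."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        if fn == target:
            return True
        for callee in graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False


# Toy repository (hypothetical names): admin_tool calls yaml.load but is never invoked from main.
graph: CallGraph = {
    "app.main": ["app.handle_request"],
    "app.handle_request": ["yaml.safe_load"],
    "app.admin_tool": ["yaml.load"],
}
print(is_reachable(graph, ["app.main"], "yaml.safe_load"))  # True
print(is_reachable(graph, ["app.main"], "yaml.load"))       # False
```

A text search of this toy repository would find the `yaml.load` call site and could suggest exposure, but the graph shows no path to it from `app.main`. Without a precomputed graph, an agent has to reconstruct those edges by reading code file by file.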

The takeaway: agents are good at synthesis once the facts are in front of them. Asking them to excavate those facts themselves is expensive and unreliable.
