EVMbench Signals a Shift in How DeFi Measures AI Security Capabilities

Feb 20, 2026

OpenAI and Paradigm have jointly released EVMbench, an open benchmark that measures how effectively AI agents can detect, patch, and exploit high-severity vulnerabilities in EVM-based smart contracts.

The release, announced February 18, 2026 (OpenAI announcement | Paradigm announcement), provides a standardized framework for tracking AI progress in an area that directly affects DeFi risk. The full technical details are available in the joint paper.

The core narrative is not that AI can now "secure DeFi." It is that smart-contract security is moving from anecdotal demos to measurable performance, and that the same models improving defense are also improving offense.

What EVMbench actually measures

EVMbench is built from 120 high-severity vulnerabilities sourced from 40 real audits, largely drawn from public code audit competitions. It also includes scenarios from the security review of Tempo, a purpose-built Layer-1 optimized for stablecoin payments.

The benchmark evaluates AI agents across three modes:

  • Detect: Scan a full contract repository and identify known high-severity vulnerabilities. Scoring emphasizes recall against ground-truth findings.
  • Patch: Produce functional fixes that resolve vulnerabilities without breaking contract logic.
  • Exploit: Execute end-to-end attacks in isolated sandboxes, including full fund-drain scenarios.

A public interface allows users to upload contract folders and test detection performance against the benchmark's methodology.

The design choice to include exploit mode is notable. This is not framed as a static analysis leaderboard. It measures whether agents can operationalize vulnerabilities, not just describe them.
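As a rough illustration of the detect-mode scoring described above, recall against ground-truth findings might be computed along these lines. The (file, vulnerability-class) matching key and the finding format are assumptions for illustration, not EVMbench's actual schema:

```python
# Sketch of detect-mode scoring: recall against ground-truth findings.
# The (file, vulnerability-class) matching key is an illustrative
# assumption, not EVMbench's published methodology.

def detection_recall(ground_truth: list[tuple[str, str]],
                     agent_findings: list[tuple[str, str]]) -> float:
    """Fraction of known high-severity findings the agent rediscovered."""
    truth = set(ground_truth)
    found = truth & set(agent_findings)
    return len(found) / len(truth) if truth else 0.0

# Hypothetical run: the agent rediscovers 2 of 3 seeded vulnerabilities.
truth = [("Vault.sol", "reentrancy"),
         ("Oracle.sol", "stale-price"),
         ("Admin.sol", "access-control")]
findings = [("Vault.sol", "reentrancy"),
            ("Admin.sol", "access-control"),
            ("Vault.sol", "gas-griefing")]  # false positive; recall ignores it
print(round(detection_recall(truth, findings), 3))  # → 0.667
```

Recall-weighted scoring of this kind rewards rediscovering every known issue, which is why false positives (the extra finding above) do not lower the score in this sketch.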

Early results: exploitation is improving faster than defense

Initial testing shows frontier models making significant gains in exploit capability. GPT-5.3-Codex reportedly scored 72.2% in exploit mode, compared with 31.9% for GPT-5, released six months earlier.

Detection recall and patching performance remain materially lower.

If those numbers hold under broader scrutiny, they suggest a familiar pattern: offensive capability scaling faster than robust remediation. That asymmetry matters more than the headline score.

Before drawing strong conclusions, several methodological questions need clarification:

  • Are vulnerabilities evenly distributed across categories (reentrancy, access control, logic errors, oracle misuse)?
  • How complex are the contract repositories relative to production DeFi systems?
  • How sensitive are results to prompt engineering and agent scaffolding?

Without that context, raw percentage gains risk overstating real-world readiness.
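The first of those questions, category balance, is easy to check once per-vulnerability metadata is published. A minimal sketch, assuming findings carry a category tag (both the field names and the labels below are hypothetical):

```python
from collections import Counter

# Hypothetical per-finding metadata; EVMbench's real schema may differ.
findings = [
    {"id": "F-001", "category": "reentrancy"},
    {"id": "F-002", "category": "access-control"},
    {"id": "F-003", "category": "reentrancy"},
    {"id": "F-004", "category": "oracle-misuse"},
    {"id": "F-005", "category": "logic-error"},
]

counts = Counter(f["category"] for f in findings)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category:15s} {n:3d}  ({n / total:.0%})")
```

A heavily skewed distribution would mean headline scores mostly reflect performance on one or two vulnerability classes.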

Why this matters for EVM-native DeFi

EVM-based ecosystems — Ethereum mainnet plus Layer-2s such as Base, Arbitrum, and Optimism — still custody the overwhelming majority of DeFi capital: approximately $95 billion in total value locked as of mid-February 2026 (DeFiLlama).

Smart-contract logic flaws remain one of the few attack surfaces that turn code directly into immediate capital loss. In January 2026 alone, DeFi protocols on EVM chains lost roughly $86 million across seven incidents exceeding $1 million each (Halborn, CertiK consensus). For context, 2025 industry-wide hack losses totaled $2.8–3.4 billion, with a material (though declining) share still tied to EVM smart-contract exploits even as operational and social-engineering attacks grew.

EVMbench introduces a public, reproducible metric for AI-assisted security work. That changes a few things:

  1. Security tooling becomes benchmarkable. Protocol teams can evaluate internal AI workflows against a standardized dataset instead of relying on anecdotal performance.
  2. Audit conversations shift from "AI-assisted" to quantified claims. Vendors and internal teams will need to reference measurable recall, patch success, and exploit containment rates.
  3. Security timelines compress. If detection and patching performance meaningfully improve, the gap between code freeze and vulnerability discovery could narrow, especially for high-severity, pattern-recognizable flaws.

In short, EVMbench may become less about model marketing and more about procurement discipline inside serious DeFi teams.

The dual-use problem is structural, not theoretical

The inclusion of exploit mode underscores the obvious but uncomfortable reality: any system that improves automated detection of vulnerabilities will likely improve automated exploitation as well.

The open publication of the benchmark is implicitly a bet that transparency accelerates defensive capability faster than it meaningfully lowers the barrier for attackers.

That assumption needs scrutiny.

Attackers do not need perfect recall across 120 curated vulnerabilities. They need one working exploit on a high-TVL protocol. If exploit-mode capability scales faster than patch reliability, the near-term risk surface could widen before it narrows.

Protocols integrating AI into CI/CD pipelines will need guardrails:

  • Strict isolation for exploit testing.
  • Clear thresholds for automated fixes.
  • Human review for patch correctness and unintended side effects.

AI-assisted remediation that introduces new logic flaws is not hypothetical; it is an engineering risk that must be managed explicitly.
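A minimal sketch of what gating those guardrails in a CI pipeline could look like. Every threshold, field name, and decision rule here is an assumption for illustration, not something EVMbench or any vendor prescribes:

```python
# Sketch of a CI gate for AI-generated patches. All report fields and
# thresholds are illustrative assumptions, not a standard.

MIN_PATCH_CONFIDENCE = 0.90   # below this, route to human review
REQUIRE_HUMAN_REVIEW = True   # never auto-merge exploit-adjacent fixes

def gate_patch(report: dict) -> str:
    """Decide the disposition of an AI-generated patch."""
    if not report.get("tests_passed", False):
        return "reject"            # the fix broke contract logic
    if report.get("touches_exploit_path", True) and REQUIRE_HUMAN_REVIEW:
        return "human-review"      # isolation plus human sign-off required
    if report.get("confidence", 0.0) < MIN_PATCH_CONFIDENCE:
        return "human-review"
    return "auto-merge"

print(gate_patch({"tests_passed": True,
                  "touches_exploit_path": False,
                  "confidence": 0.95}))  # → auto-merge
```

Note the defaults are conservative: a report missing a field is treated as failing that check, so an incomplete report can never auto-merge.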

Second-order effects for the DeFi security stack

If EVMbench becomes a widely referenced benchmark, several downstream shifts are plausible:

Audit firms may reposition. Rather than competing on raw vulnerability discovery, firms could emphasize verification of AI-generated findings, adversarial testing, and economic attack modeling.

Insurance underwriting could change. Underwriters may begin requesting benchmark-aligned metrics on internal security workflows, similar to how SOC 2 or penetration testing reports function in traditional finance.

Disclosure norms may evolve. Protocols might eventually publish AI-assisted security scores as part of transparency reports, though this raises game-theoretic concerns around signaling weakness.

Smaller teams could close the security gap. If marginal cost per vulnerability found declines, high-quality security processes may become accessible to teams that cannot afford multiple top-tier audit rounds.

Whether that democratizes safety or accelerates deployment velocity without proportional oversight remains an open question.

What this is and what it is not

EVMbench is not proof that AI can replace audits. Current detection and patching results remain incomplete, and real-world contracts are entangled with off-chain components, governance processes, and economic design constraints that benchmarks cannot fully model.

It is, however, a credible attempt to move from speculative AI security claims to reproducible measurement in the domain that secures most of DeFi's capital.

The next 12 months will determine whether this becomes a procurement footnote or the reference dataset every serious protocol cites in its security stack.

Editor’s notes

Unresolved questions

  • How quickly will major DeFi teams adopt EVMbench in their review processes, and what thresholds will they set for acceptable agent performance?
  • Will detection and patching scores improve enough within 12 months to meaningfully displace human-led audits for routine high-severity issues?
  • Given January 2026's $86M in EVM-focused losses, how will protocols weigh AI tools against the dual-use risk of accelerated exploit discovery?