A few weeks ago I argued that AI coding benchmarks are eval theater. The thesis was qualitative. Now the receipt is on a public leaderboard, and the lab that built the benchmark we all quote has quietly stopped quoting it themselves.
"We are kind of starting to measure the agent's ability to correctly guess how to name a specific function."
— Mia Glaese, OpenAI, Frontier Evals podcast
That is the head of OpenAI’s evals work explaining why they stopped reporting SWE-bench Verified.
The 35-Point Cliff
Pull up the SWE-bench Pro public leaderboard. The top model as of writing is GPT-5.4 in xHigh mode at 59.10%. The reigning Anthropic flagship, Claude Opus 4.5 (2025-11-01), comes in at 45.89%. That same Opus 4.5 sits at 80.9% on SWE-bench Verified. Same model. Same task class. Different benchmark.
| Model | Verified | Pro (public) | Delta (pts) |
|---|---|---|---|
| Opus 4.5 | 80.9% | 45.89% | -35 |
| Sonnet 4.5 | ~77% | 43.60% | -33 |
| GPT-5 (Aug 2025) | ~70% | 41.78% | -28 |
| Top of original Sep 2025 frontier | 70%+ | 23.3% | -47 |
Mythos, the model Anthropic gated as too dangerous to release at $20/month, is not on the public leaderboard at all. Anthropic's self-reported card has it around 77.8% on Pro versus 93.9% on Verified, an internally disclosed 16-point gap. That is the best case, on Anthropic's own scoreboard, on the cleanest model in their lineup. The cheaper models drop closer to half their Verified score.
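To make the cliff concrete, here is a minimal sketch that just redoes the arithmetic on the figures quoted above (the Mythos pair is Anthropic's self-reported card, not the public leaderboard); nothing here is a new measurement.

```python
# (Verified, Pro) score pairs as quoted above; Sonnet and GPT-5 Verified
# figures are the approximate values from the table, Mythos is self-reported.
reported = {
    "Opus 4.5": (80.9, 45.89),
    "Sonnet 4.5": (77.0, 43.60),
    "GPT-5 (Aug 2025)": (70.0, 41.78),
    "Mythos (Anthropic card)": (93.9, 77.8),
}

for model, (verified, pro) in reported.items():
    delta = pro - verified        # absolute gap, in points
    retention = pro / verified    # share of the Verified score that survives
    print(f"{model:26s} {delta:+6.1f} pts  retains {retention:.0%}")

# Mid-tier models retain roughly 57-60% of their Verified score: that is the
# "drop closer to half" claim as plain arithmetic. Mythos retains about 83%.
```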
What Pro Actually Is
SWE-bench Pro is 1,865 tasks across 41 repositories, broken into a 731-task public split, an 858-task held-out split, and 276 instances from 18 private partner codebases. The methodology change that matters is not the size. It is the licensing.
The public split is sourced exclusively from repositories under copyleft licenses like the GPL. Those licenses create a legal deterrent against inclusion in proprietary training corpora. The private split sits behind NDAs. The held-out split is reserved for internal evaluation. None of this is a technical guarantee against contamination, but it is a much steeper legal bill than scraping permissively licensed public GitHub.
The other change is the shape of the problems. Verified problems average around one file modified and ten or so lines per patch. Pro problems average 4.1 files and 107 lines. Pro is harder because the data isn't in the model's belly, and because the problems look more like real prod work.
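If it helps to see the selection logic rather than read it, here is a rough sketch of a Pro-style routing rule, treating each candidate task as a record carrying its repo license and patch stats. The dataclass fields, the SPDX set, and the size thresholds are hypothetical stand-ins for the description above, not Scale's actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical task record; field names are illustrative, not Scale's schema.
@dataclass
class Task:
    repo: str
    license: str          # SPDX identifier, e.g. "GPL-3.0-only"
    files_changed: int
    lines_changed: int
    under_nda: bool = False

COPYLEFT = {"GPL-2.0-only", "GPL-3.0-only", "AGPL-3.0-only", "LGPL-3.0-only"}

def route(task: Task) -> str:
    """Route a candidate task into a Pro-style split.

    Copyleft repos go to the public split (legal deterrent against training
    ingestion); NDA partner code goes to the private split; a permissively
    licensed public repo is presumed to already be in someone's corpus.
    """
    if task.under_nda:
        return "private"
    if task.license in COPYLEFT:
        return "public"
    return "reject"

# The content shift from Verified to Pro, in the same spirit: favor multi-file,
# hundred-line patches over one-file, ten-line ones (thresholds are made up).
def looks_like_pro(task: Task) -> bool:
    return task.files_changed >= 2 and task.lines_changed >= 50
```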
Why OpenAI Walked Away
The thing that should land is not a leaderboard. It is OpenAI quitting the benchmark they built.
Glaese and Watkins, on the Frontier Evals podcast, walked through OpenAI’s audit of Verified. Forty-nine of its tasks were too narrowly defined. Twenty-six demanded behaviour the issue text did not specify. Worse, models could regurgitate the ground-truth patch from the task ID alone, with no other context. The “SWE-Bench Illusion” paper (Jun 2025) measured the contamination directly: top models identify the buggy file from issue text alone with 76% accuracy on Verified versus 53% on out-of-distribution repos. Five-gram verbatim overlap with training data sits at 35% on Verified versus 18% elsewhere.
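That five-gram figure is a standard contamination proxy: the share of a problem's consecutive five-word windows that also appear verbatim in a reference corpus. A minimal sketch of the general technique (not the "SWE-Bench Illusion" authors' exact code) looks like this:

```python
from typing import Iterable

def five_grams(text: str) -> set[tuple[str, ...]]:
    """All consecutive 5-word windows in a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4)}

def overlap_ratio(problem_text: str, corpus_texts: Iterable[str]) -> float:
    """Fraction of the problem's 5-grams that appear verbatim in the corpus.

    Higher overlap on one benchmark than another (e.g. 35% vs 18%) suggests
    the benchmark's text was in the training data, not that its tasks are
    intrinsically easier.
    """
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_texts:
        corpus_grams |= five_grams(doc)
    problem_grams = five_grams(problem_text)
    if not problem_grams:
        return 0.0
    return len(problem_grams & corpus_grams) / len(problem_grams)
```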
OpenAI’s response was not to fix Verified. It was to stop reporting it. Their Frontier Evals push moved to harder, fresher, less contaminated harnesses, and Verified quietly fell off the press cards. The benchmark you keep seeing is the one its own architects have already decided was leaking.
The Structural Fix Is Legal, Not Technical
Every benchmark scraped from public GitHub eventually drifts into the next training set. There is no algorithmic fix for this. Scrubbing repository names, hashing file paths, paraphrasing issue text - all of it has been tried, and all of it gets undone by larger and hungrier crawlers. Pro's defense is not technical. It's that the underlying repos cannot legally enter a proprietary corpus without paying lawyers, and the private split cannot enter at any price without partner consent.
This is not permanent. The public 731-problem set has been out for roughly seven months. The clock is ticking on it the same way it ticked on Verified. The held-out and partner splits are the only parts with real long-term contamination resistance, and you can’t audit those because that is what makes them resistant.
The buyer takeaway: numbers from a benchmark you can’t see are at least as trustworthy as numbers from a benchmark whose contents are now in the training data.
If a press release just says “SWE-bench” without a suffix, treat it as Verified and apply the gap. The Anthropic-internal Mythos number suggests at least 16 points down on Pro. Mid-tier models suggest 30+. A “leading on SWE-bench” claim that won’t say Pro is a claim you can move past.
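If you want that heuristic as arithmetic, the sketch below haircuts a quoted Verified score by the gaps observed in this post; the 16-point and 30-point discounts are just the figures above packaged as a rule of thumb, not anything official.

```python
def estimated_pro_score(verified_score: float, frontier: bool = False) -> float:
    """Rule-of-thumb discount for an unsuffixed "SWE-bench" claim.

    The frontier flagship showed roughly a 16-point gap on Anthropic's own
    card; everything else in the table above dropped 28-35 points. These are
    the numbers from this post, not a fitted model.
    """
    gap = 16.0 if frontier else 30.0
    return max(verified_score - gap, 0.0)

# "Leading on SWE-bench at 80.9" from a mid-tier card reads as roughly 51 on Pro.
print(estimated_pro_score(80.9))                  # 50.9
print(estimated_pro_score(93.9, frontier=True))   # 77.9
```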
What This Doesn’t Solve
The Pro number isn't a clean signal either. Scaffolding choice swings results 4 to 10 points across harnesses. Scale's leaderboard uses a standardized mini-swe-agent harness that strips the assists vendors layer in for their own runs, which makes vendor-card numbers and leaderboard numbers an apples-to-oranges comparison. The private partner split is unaudited by anyone outside Scale. And there is no published study yet showing Pro scores predict production performance better than Verified did. The argument for Pro is a priori (less contamination should mean more signal), not yet empirical.
The honest framing is not “Pro is the new Verified.” It is “Verified was leaking and at least Pro raises the legal bar.”
Receipts
The thesis was that benchmarks are eval theater. The receipt is that the company that built the marquee benchmark stopped quoting it, because its own audit came back and said the model was guessing function names. Glaese said it on a podcast. Scale shipped the comparison. The leaderboard is public.
When a vendor card next quotes a 90-something on SWE-bench, ask which one. If they don’t say Pro, assume the gap.


