The model I reach for every day doesn’t top any major benchmark. It doesn’t win SWE-bench. It doesn’t win HumanEval. On paper, several models should be better. In practice, the model that works best is not the model that scores best.

Every developer I know who has tried everything arrives at some version of this observation. Their preferred model isn’t the leaderboard winner. The gap between the score and the experience is the entire problem.

Berkeley Just Proved It

Today, Berkeley’s Center for Responsible, Decentralized Intelligence published a paper titled “How We Broke Top AI Agent Benchmarks: And What Comes Next.” The Hacker News thread hit 264 points and more than 250 comments.

They built an agent that explicitly games agent benchmarks by exploiting the structure of the evaluation, not by being better at the underlying task. The scores go up. The capability doesn’t. This is the Goodhart’s Law paper the field needed: when a measure becomes a target, it ceases to be a good measure.

The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it.

— Andrej Karpathy, January 2026

Karpathy said this in January. 15,500 likes. The observation isn’t new. Berkeley built the proof.

The Chinese Model Problem

This part sounds xenophobic. It isn’t. It’s about incentive structures.

Chinese models (DeepSeek, Qwen, Yi) consistently post impressive benchmark numbers. On paper, they should be serious coding contenders. In practice:

  • Poor instruction following on complex tasks. Simple prompts work. Multi-step refactors that require holding architectural constraints across a long conversation fall apart.
  • Benchmark-shaped outputs. The code looks like a test response, not production work. It passes the unit test but misses the intent, and it hallucinates APIs at noticeably higher rates.
  • Censorship interference. Safety filtering that blocks legitimate engineering tasks around security, networking, and system administration.

The incentive structure explains it. Benchmark scores are the primary currency for funding, partnerships, and government approval in the Chinese AI ecosystem. The optimization target is the leaderboard, not the user. Western labs face the same investor pressure, but a larger base of paying developer users provides a corrective feedback loop.

This isn’t about a capability ceiling

Chinese labs have world-class researchers and enormous compute budgets. The problem is the incentive structure rewards leaderboard performance over practical usability. When enterprise adoption grows and the incentives shift, expect this gap to close fast.

What Benchmarks Miss

The structural rot runs deeper than any individual benchmark being gameable:

  • HumanEval has 164 problems. Trivially overfittable.
  • GSM8K saturated: scores jumped from 50% to 95%+ in two years, far faster than math reasoning actually improved.
  • MMLU scores vary by 5-15% depending on the evaluation harness. Same model, same benchmark, different numbers. Companies pick the highest one.
  • SWE-bench rewards passing tests, not code quality. A terrible patch that makes the test green scores the same as a clean one.
  • Chatbot Arena has verbosity bias. Users prefer longer, more confident responses in quick A/B comparisons, even when shorter answers are more accurate.

None of these measure the things that actually matter for coding work: understanding ambiguous intent, maintaining coherence across a large codebase, knowing when to push back, adapting to the developer’s context, or holding quality across a long multi-turn session. A model that scores 95% on a fresh prompt and 60% after 80 turns is worse than one that scores 85% and holds steady. Nobody measures this.
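The degradation point can be made concrete with a session-level metric. A minimal sketch, with illustrative numbers and function names (nothing here comes from a real benchmark):

```python
def session_score(per_turn_scores):
    """Average performance across a whole session, not just turn 1.

    per_turn_scores: list of scores, one per conversation turn.
    Using the mean means sustained quality beats a strong start
    followed by steady degradation.
    """
    return sum(per_turn_scores) / len(per_turn_scores)

# Model A: 95% on a fresh prompt, degrading linearly to 60% by turn 80.
model_a = [0.95 - 0.35 * t / 79 for t in range(80)]
# Model B: steady 85% throughout.
model_b = [0.85] * 80

print(session_score(model_a))  # ≈ 0.775
print(session_score(model_b))  # ≈ 0.85
```

Under a single-turn benchmark, Model A wins 95 to 85. Under the session-level view, Model B is clearly better, which matches how the models actually feel in daily use.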

MIT Technology Review ran “AI Benchmarks Are Broken” on March 31. Zvi Mowshowitz wrote he’s sympathetic to “not even looking at the scores anymore because current benchmarks are terrible.” The labs know it too. They keep publishing scores because press and investors demand numbers.

What Would Actually Work

  • Private, rotating test sets. ARC-AGI got this right: if the test set is public, it will be trained on. Meaningful benchmarks need fresh evaluations that rotate frequently.
  • Long-session evaluations. Measure performance after 50 turns, not 1. Measure context retention, not context ingestion.
  • Practitioner preference at scale. Chatbot Arena was the right idea with the wrong execution. A coding-specific version with professional developers as evaluators would be more useful than every static benchmark combined.
  • Task-specific disaggregation. A single score is meaningless. Show me coding performance, long-context retention, instruction following, and refusal quality separately.
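What a disaggregated report could look like in practice. A hedged sketch: the axis names and scores are illustrative, not from any published benchmark, and the deliberate omission of an overall average is the whole point.

```python
from dataclasses import dataclass

@dataclass
class ModelReport:
    """Disaggregated evaluation: separate axes, no single headline number.

    Axis names are illustrative assumptions, not a real benchmark's schema.
    """
    coding: float
    long_context_retention: float
    instruction_following: float
    refusal_quality: float

    def summary(self) -> str:
        # Deliberately no overall average: collapsing these into one
        # score hides exactly the trade-offs that matter in daily use.
        return (f"coding={self.coding:.2f}  "
                f"long-context={self.long_context_retention:.2f}  "
                f"instructions={self.instruction_following:.2f}  "
                f"refusals={self.refusal_quality:.2f}")

report = ModelReport(coding=0.82, long_context_retention=0.61,
                     instruction_following=0.88, refusal_quality=0.74)
print(report.summary())
```

A model with this profile might top a coding leaderboard while being unusable for long refactoring sessions, and that is visible here in a way no single score can show.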

The vibe check is real

Karpathy coined “vibe check” as an evaluation method and people laughed. But the aggregated judgment of experienced practitioners who use models for real work all day is a more valid signal than any published benchmark. It doesn’t fit in a press release. That doesn’t make it wrong.

The benchmark leaderboard is the tech industry’s version of university rankings. Everyone knows it’s broken. Everyone looks at it anyway. The institutions optimize for the ranking instead of the thing the ranking claims to measure.

Stop reading leaderboards. Start reading code.