Last week I argued that agents don’t refactor. They add. They don’t subtract. The codebase accretes. That post made the case qualitatively. This one puts numbers to the unbilled hours.

Speed without stability is just accelerated chaos.

— DORA 2025 State of AI-Assisted Software Development

The Headline Number

Lightrun’s 2026 State of AI-Powered Engineering surveyed 200 senior SRE and DevOps leaders across the US, UK, and EU. The number that should land: 43% of AI-generated code changes require debugging in production after passing QA and staging. The same survey: AI-created PRs ship 75% more logic and correctness errors than human-authored ones, totalling roughly 194 error instances per 100 PRs once you fold in dependency, config, and control-flow errors.

This is self-reported, enterprise-skewed, and worth caveating. It is also not an outlier.

The Texture

CodeRabbit’s State of AI vs Human Code Generation analysed 470 GitHub PRs (320 AI-co-authored, 150 human-only) and normalised the comparison per 100 PRs:

  • 1.7x more issues overall (10.83 vs 6.45 per PR)
  • 2.74x more security issues
  • ~8x more performance issues (excessive I/O the dominant class)
  • ~2x more concurrency bugs
  • ~2x more error-handling gaps (missing checks)
  • 1.4-1.7x more critical or major severity issues

The pattern is consistent: AI doesn’t invent new failure modes. It overproduces the existing ones. The CodeRabbit framing is sharper than the numbers:

Humans and AI make the same kinds of mistakes. AI just makes many of them more often and at a larger scale.

— CodeRabbit, State of AI vs Human Code Generation

The Perception Gap

METR’s July 2025 study put 16 experienced open-source devs on real tasks in their own repos with and without AI assistance. The result: AI made them 19% slower. The result the participants expected before the study: 24% faster. The result they believed afterwards: 20% faster. So the same population was wrong about the direction by ~40 points and didn’t update.

The Feb 2026 follow-up tried to redo the experiment and didn’t land. Selection bias collapsed the sample: the devs most sceptical of AI dropped out. METR’s own conclusion is that 2026 devs are probably faster than 2025 devs but they have “only very weak evidence.” The 19% number from the original study has not been credibly overturned.

If you only believed one piece of data in this post, this would be a defensible candidate. The slow-down was measured. The speed-up was felt.

The Steelman

DORA’s 2025 State of AI-Assisted Software Development is the bull case and deserves an honest read.

  • +21% tasks completed per developer
  • +98% PRs merged per individual

Those numbers are real, and they are the reason the AI productivity narrative will not die for years. DORA’s own caveat: organisational delivery metrics are flat, and stability decreases. The framing they land on is “amplifier,” not accelerator. AI magnifies high-performing organisations. It magnifies dysfunction with the same multiplier.

The bull and the bear are looking at the same data. The bull counts merged PRs. The bear counts the bug-fix commit that lands two weeks later.

The Senior-Versus-Junior Split

Fastly’s 2025 senior-developer study: seniors ship 2.5x more AI code into production than juniors. 32% of senior production code is AI-authored, vs 13% for juniors. A separate 160,000-dev / 30M-commit study went further: juniors use AI 37% more than seniors, but only the seniors got measurably faster.

The framing that survives this data: AI is a skill amplifier, not a skill equaliser. It pays back in proportion to the judgement applied at the prompt, the diff, and the review. The cohort with the least judgement is using it the most and netting out flat or negative.

This is the part that should worry hiring committees more than it does.

The Denominator

The numerator of “AI productivity” is legible. PRs merged. Tasks closed. Lines shipped. Velocity dashboards. Each one is a number a manager can put in a deck.

The denominator is illegible. Verification time. Rework. Bug-fix commits in the 14 days after merge. On-call incidents triggered by logic and concurrency bugs the review missed. Cognitive overhead from reading drift-prone diffs. The 45% of devs who told Stack Overflow that debugging AI code takes longer than writing it.

You can run the velocity report. You cannot, in most orgs, run the rework report. So the denominator is invisible by default, and the productivity story keeps its pristine numerator.

Measure Rework or You’re Guessing

The cheapest version of the rework report is the 14-day post-merge bug-fix commit count, scoped to files touched by AI-authored PRs. Compare it to the same metric on human-authored PRs over a matched window. If the AI cohort’s bug-fix rate isn’t lower, the velocity gain is being eaten somewhere else.
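A minimal sketch of that report, assuming a local checkout and two exported lists of merge-commit SHAs, one per cohort, labelled however your org tags AI-authored PRs (a Co-authored-by trailer, a PR label, an export from your forge). The file names, fix-marker list, and 14-day window are all placeholders to adjust.

  #!/usr/bin/env python3
  # Hypothetical sketch of the 14-day post-merge rework report.
  # Input: two text files of merge-commit SHAs, one per cohort.
  import subprocess
  import sys
  from datetime import datetime, timedelta, timezone

  FIX_MARKERS = ("fix", "bug", "revert", "hotfix")  # crude bug-fix heuristic

  def git(*args):
      """Run a git command in the current repo and return stdout."""
      return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout

  def files_touched(merge_sha):
      """Files changed by the PR, taken as the diff against the first parent."""
      return [f for f in git("diff", "--name-only", f"{merge_sha}^1", merge_sha).splitlines() if f]

  def merged_at(merge_sha):
      """Merge timestamp as an aware datetime."""
      return datetime.fromtimestamp(int(git("show", "-s", "--format=%ct", merge_sha).strip()),
                                    tz=timezone.utc)

  def rework_commits(merge_sha, window_days=14):
      """Count bug-fix-looking commits in the next window that touch the same files."""
      files = files_touched(merge_sha)
      if not files:
          return 0
      start = merged_at(merge_sha)
      end = start + timedelta(days=window_days)
      log = git("log", "--no-merges", f"--since={start.isoformat()}", f"--until={end.isoformat()}",
                "--pretty=%s", "--", *files)
      return sum(1 for subject in log.splitlines()
                 if any(marker in subject.lower() for marker in FIX_MARKERS))

  def cohort_rate(merge_shas):
      """Bug-fix commits per PR, averaged over the cohort."""
      return sum(rework_commits(sha) for sha in merge_shas) / max(len(merge_shas), 1)

  if __name__ == "__main__":
      # usage: python rework_report.py ai_merge_shas.txt human_merge_shas.txt
      ai_shas, human_shas = (open(path).read().split() for path in sys.argv[1:3])
      print(f"AI cohort:    {cohort_rate(ai_shas):.2f} bug-fix commits per PR, 14-day window")
      print(f"Human cohort: {cohort_rate(human_shas):.2f} bug-fix commits per PR, 14-day window")

It is deliberately crude: it over-counts fixes that touch the same files for unrelated reasons and under-counts rework that ships without a fix-like word in the subject. Crude is fine. The point is to have a denominator at all.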

If your team’s AI productivity story is “we shipped more PRs,” ask the next question. Net of rework. Net of the bug-fix commits in the second week. Net of the on-call page that traced back to a logic error in a generated diff. The numerator usually survives those questions. Sometimes it doesn’t. Either way, the answer is information you didn’t have before.

Speed without stability is just accelerated chaos. The math has a denominator now.