In 2025, METR ran one of the few honest experiments in this whole field. Experienced open-source developers did real tasks with and without AI assistance, timed. The result was uncomfortable: they felt roughly 20% faster with AI, and were measured roughly 19% slower. They spent the saved time finding and fixing the AI’s errors, steering it, and waiting on it.
In early 2026, METR went to run the study again, to see how a year of better models had moved the number. It couldn’t. Developers wouldn’t take part, because they will no longer work without AI, even for a few tasks, even for science.
The control group quit.
Why That Sentence Should Bother You
A control group is the whole game in measurement. It’s the “compared to what.” METR’s 2025 study mattered precisely because it had one: the same engineers, doing the same kind of work, with the tool taken away. That contrast is what exposed the gap between feeling faster and being faster.
Strip the control group out and you can’t run the experiment at all. Not “the result was inconclusive.” You can’t ask the question. The population that would answer it has opted out of the conditions required to answer it.
— The 2025 METR findingDevelopers felt about 20% faster. Measured, they were about 19% slower.
That gap is the most important number in agentic coding, and it’s the one we just lost the ability to remeasure.
Dependence Outran Measurement
This is the part I keep circling. The dependency didn’t grow gradually enough for our instruments to keep pace. It went total in about a year, and it took the experiment with it.
What replaced the experiment is telling. METR fell back to a survey, letting technical staff self-report their gains, and they reported feeling twice as valuable. We’ve kept the most unreliable signal, perception, and lost the reliable one, measurement, in the same motion. The 2025 study existed specifically to show that perception is wrong here. Now perception is most of what we have left.
I’ve argued before that the developer productivity story is mostly a feeling. This is worse than a wrong feeling. It’s a wrong feeling that can no longer be corrected by data, because the data requires a behavior nobody will perform.
The dangerous move is treating “everyone says they’re faster” as evidence they are. METR’s own work is the proof it isn’t. If your org is justifying AI spend on self-reported productivity, you’re standing exactly where the 2025 study said you’d be fooled. The honest version measures shipped outcomes against a baseline, and the baseline is the thing getting harder to construct every month.
What This Isn’t
I want to be careful, because the doom reading overshoots:
- This is not proof AI slows people down today. The 2025 result was early-2025 models on specific tasks. Models have improved a lot since. The point is that we can no longer check, not that the old number still holds.
- Refusing to work without AI can be rational. If the tool genuinely helps on your real work, declining to handicap yourself for a study is a sane individual choice. The tragedy is collective: every rational refusal removes one more data point.
- Self-report isn’t worthless. Developer sentiment matters for retention and adoption. It just can’t answer the productivity question, which is the one it’s being used to answer.
The Takeaway
We have reached the point where the most-used productivity tool in software cannot be evaluated against its own absence, because its absence is no longer something practitioners will tolerate, even briefly, even to find out.
That should make everyone less certain, not more. We didn’t measure our way to confidence about AI’s impact. We lost the ability to measure and kept the confidence anyway. The control group quit, and we let it, because none of us really wanted to be in it.


