This closes a trilogy. Post one said chill on craft and let the team build. Post two said but make agents clean up, because they won’t. The last question in that series is the one I’ve been avoiding: if the agent writes both the code and the tests, how do you know any of it works?
Uncle Bob himself, the father of TDD, posted on X recently with the cleanest answer yet:
> TDD is very inefficient for AIs. Testing is essential for them but not in the micro steps that the three laws of TDD recommend. Principles remain the same but techniques must be adjusted to fit the different “mind” of the AI. Think of the AI as a highly focused idiot savant.
>
> — Robert C. Martin, on X
That’s the whole thesis in one tweet. TDD’s three laws, sacred for twenty years, don’t survive first contact with an agent. Testing itself didn’t die. It got more important. And the technique flipped.
The Idiot Savant Problem
Start with what actually goes wrong. An agent writes a feature. The feature passes a suite of tests the same agent generated. The PR is green. The reviewer skims the diff, sees green, merges. A week later production breaks in a way no committed test could have caught, because every one of them was confirming the code’s assumptions rather than challenging them.
This is not hypothetical. It’s the normal case now. The failure modes cluster:
- Confirmation tests. The agent wrote the code, then wrote tests that match the assumptions baked into the code. Green bar means “my model is self-consistent.” Not “the thing works.”
- Mock proliferation. Every boundary mocked, every dependency stubbed, every side effect intercepted. The test runs in a fictional world and reports success about the fictional world.
- Coverage theatre. 100% of lines hit, zero meaningful assertions: `expect(result).toBeDefined()`, `assert result is not None`. The suite is a tick-box, not a filter.
- Shape-tested implementation detail. Tests that assert on exact strings, exact JSON shapes, or exact call orderings. They break on regeneration without ever having caught a real bug. Brittle without being protective.
UC Berkeley recently ran an automated auditor against eight major AI coding benchmarks, including SWE-bench, WebArena, OSWorld, GAIA, and Terminal-Bench. Every one could be gamed to near-perfect scores without solving a single task. Coverage theatre at civilisation scale. If the benchmarks can be fooled, so can your suite.
Review the Tests, Not the Code
The practical flip is simple. You used to review the code and glance at the test diff. Invert it.
- The code is long, regenerates often, and was written by an agent already. Line-by-line review is low-leverage. You cannot scale your attention to the volume the agent produces.
- The tests are short, encode intent, and are the contract the next regeneration has to honour. Misread the test diff and you miss the one thing determining whether the system is actually doing what you think.
Every PR, start with the test changes. If the test diff is wrong, stop there. Fix the tests. Whatever the code does below is only trustworthy to the degree the tests assert correctly.
When the same agent writes the function and the test in the same session, the test will almost always pass. This is not a sign of quality. It’s a sign of self-consistency. Any time one agent wrote both sides of the contract, assume the tests are confirming rather than verifying until you check the assertions yourself. If the tests wouldn’t catch the failure mode a naive reader would worry about, they are not tests. They are decoration.
The Gates That Still Pay
Concrete changes that make agent-written tests actually earn their keep:
- Ban mocks at system boundaries. Real HTTP. Real databases (test instances, real servers). Real file systems where the code touches disk. Mocked boundaries let the agent test its own fiction. Integration is the signal. Unit tests that stub the world are not.
- Integration and E2E over unit. Unit tests test the shape of the code, which regenerates every sprint. Integration tests test behaviour, which the user actually cares about. The pyramid inverts in the agent era. Invest up the pyramid, not down.
- Behaviour names, not shape names. `should_reject_expired_session` beats `test_session_handler_3`. The name is the spec. The next agent reads the name and knows what must remain true.
- Mutation testing as a sanity check. Run a tool like Stryker or Pitest once a week. It deletes lines, flips operators, inverts conditions. If the tests still pass, your tests are decoration. Cheap way to catch the 100%-coverage-zero-value case at scale.
- Write the test intent; let the agent write the test code. Specify in plain English what behaviour matters and what failure looks like, and let the agent fill in the syntax. This is TDD’s surviving core once the three laws retire: humans own the contract, agents do the typing.
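Mutation testing is easy to demo by hand. The sketch below is a toy, not Stryker or Pitest (real tools mutate a whole codebase automatically); the discount rule and both suites are invented for illustration. It flips one comparison operator and asks whether each suite notices:

```python
# Toy mutation test: flip '>' to '>=' in a discount rule and check
# whether each test suite kills the mutant.

def make_discount(op):
    # op is the comparison operator under mutation
    def discount(total: float) -> float:
        return total * 0.9 if op(total, 100) else total
    return discount

original = make_discount(lambda a, b: a > b)   # spec: discount strictly over 100
mutant   = make_discount(lambda a, b: a >= b)  # mutation: boundary flipped

def decoration_suite(fn):
    return fn(150.0) is not None               # asserts nothing real

def behaviour_suite(fn):
    return fn(100.0) == 100.0 and fn(150.0) == 135.0  # pins the boundary

print(decoration_suite(mutant))  # True  -> mutant survives: suite is decoration
print(behaviour_suite(mutant))   # False -> mutant killed: suite protects the spec
```

A surviving mutant is the 100%-coverage-zero-value case made visible: the code changed behaviour and the suite stayed green.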
What This Doesn’t Fix
Testing is not salvation. Honest limits:
- Tests don’t catch design errors. A well-tested bad design is still a bad design.
- Performance cliffs, UX regressions, and subjective quality are still human jobs. The suite won’t tell you the flow feels wrong.
- Tests written against the wrong intent are useless no matter how rigorous. Intent comes from the product side and has to be stated. An agent cannot invent intent.
- The idiot savant can still fool the guardrails. If the agent writes a test that passes by accident of its own implementation choice, no amount of process catches it until production does. Testing is a filter, not a guarantee.
The Trilogy Close
The last three posts are one argument.
Let the savant build, because the tower is theirs to make. Remind it to clean up, because it won’t. Fence the output with tests that weren’t written to verify the savant’s own assumptions, because it’s brilliant at what you asked and blind to what you didn’t.
Uncle Bob’s line is the right ending. The principles remain. The techniques change. Testing is essential, and thinking of the agent as a highly focused idiot savant is the most useful mental model I’ve seen this year. Treat the tests as the guardrails. Review them first. Ban the mocks. Watch the savant.


