Eleven days ago I wrote The Trust Tax, about a Redditor who loaded Claude Code’s closed-source binary into Ghidra and proved Anthropic had silently dropped the cache TTL from one hour to five minutes. Uncached tokens cost 10-20x more. Nobody announced it. The community had to reverse-engineer the evidence.
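To make the cost mechanics concrete, here is a toy model. The dollar figures and the 10x cache discount are placeholder assumptions for illustration, not Anthropic’s actual price sheet; the point is only that whether a turn lands inside the TTL window flips which rate applies.

```python
# Toy model of why a shorter cache TTL raises cost. Prices are
# placeholder assumptions; cached reads are commonly an order of
# magnitude cheaper than uncached input tokens.

UNCACHED_PER_MTOK = 3.00  # assumed $ per million uncached input tokens
CACHED_PER_MTOK = 0.30    # assumed $ per million cached-read tokens

def session_cost(context_mtok: float, turns: int,
                 gap_min: float, ttl_min: float) -> float:
    """Cost of re-sending a fixed context every turn, gap_min apart.

    If the gap between turns exceeds the cache TTL, every turn pays
    the uncached rate; otherwise turns after the first hit the cache.
    """
    rate = CACHED_PER_MTOK if gap_min <= ttl_min else UNCACHED_PER_MTOK
    first_turn = context_mtok * UNCACHED_PER_MTOK  # first write is never a cache hit
    return first_turn + context_mtok * (turns - 1) * rate

# A user who pauses ten minutes between turns, 100k-token context:
long_ttl = session_cost(0.1, 20, gap_min=10, ttl_min=60)  # inside TTL: cache hits
short_ttl = session_cost(0.1, 20, gap_min=10, ttl_min=5)  # outside TTL: cache misses
print(f"60-min TTL: ${long_ttl:.2f}, 5-min TTL: ${short_ttl:.2f}")
```

Same session, same tokens; the TTL change alone multiplies the bill several times over for anyone who pauses between turns.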
Yesterday Anthropic published a postmortem confirming three Claude Code regressions between March and April. One of them is that cache bug. The sleuths were right.
That alone would be worth a short “I told you so.” The more interesting thing is what the postmortem admits about how the bugs got caught, or didn’t.
The Three Bugs
The postmortem names them cleanly. Dates are approximate:
- Reasoning effort downgrade (March 4 to April 7). Default reasoning dropped from high to medium to tame latency. Users complained Claude felt “less intelligent.” Reverted after feedback stacked up.
- Cache clearing bug (March 26 to April 10). A prompt-caching implementation error caused repeated discard of prior reasoning history within a session, not just once. Claude felt “forgetful and repetitive.” This is the bug the community reverse-engineered.
- Verbosity constraint (April 16 to April 20). A system prompt instruction capping Claude to 25 words between tool calls inadvertently degraded coding performance by roughly 3%. It passed internal tests.
Three independent regressions, overlapping windows, all landing on coding quality. The aggregate feel from the user side was “Claude Code got worse this spring.” It did. Three different ways at once.
The 25-Word Parable
The verbosity constraint is the one to sit with.
Somebody inside Anthropic decided Claude was being too chatty between tool calls. They added a one-line instruction to the system prompt: stay under 25 words. This is the kind of change a reasonable person ships in an afternoon. It is precisely the kind of change a reasonable reviewer approves in thirty seconds.
It passed internal evals.
It dropped coding performance by around 3% across four production days.
A 3% regression on coding benchmarks is hard to spot in aggregate noise but obvious in a ten-minute session. This is the signature of the modern eval gap: averaged benchmarks pass, long-horizon agentic work degrades, users feel it first.
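A back-of-the-envelope sketch of that gap, with made-up per-step numbers (these are illustrative assumptions, not measurements from the postmortem): a 3% relative hit per tool call barely registers on a one-shot benchmark but compounds over a long dependent chain.

```python
# Why a ~3% per-step hit hides in per-turn benchmarks but dominates
# long agent sessions. Success rates are illustrative assumptions.

BASE_STEP_SUCCESS = 0.95  # assumed per-tool-call success rate
REGRESSION = 0.97         # assumed 3% relative per-step degradation

def task_completion(per_step: float, steps: int) -> float:
    """Probability a session survives `steps` dependent tool calls."""
    return per_step ** steps

for steps in (1, 10, 40):
    before = task_completion(BASE_STEP_SUCCESS, steps)
    after = task_completion(BASE_STEP_SUCCESS * REGRESSION, steps)
    print(f"{steps:>3} steps: relative drop {1 - after / before:.0%}")
```

Under these assumptions, a 3% single-step drop becomes roughly a 70% relative drop in end-to-end completion of a 40-step task. That is why users in long sessions scream while dashboards stay green.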
The failure mode here is not a bug in the usual sense. No null pointer. No race. A plain-English instruction that made sense in isolation subtly starved the model of working-thought bandwidth during tool-heavy loops. You cannot lint for this. You cannot unit-test for it. You discover it by running the model on actual workloads long enough for the small delta to accumulate into something users notice.
Which, of course, is what happened. Users noticed. Evals didn’t.
Opus 4.7 Caught What Opus 4.6 Shipped
Here is the sentence from the postmortem I cannot stop thinking about:
“Opus 4.7 found the bug, while Opus 4.6 didn’t.” — April 23, 2026 postmortem
Anthropic ships Claude Code. Code Review (Anthropic’s own product, built on Claude) runs on pull requests that change Claude Code itself. The reviewer agent running on the offending PRs was an earlier model, Opus 4.6. It signed off. Humans signed off. Automated tests signed off. The regression reached production.
When the postmortem team re-ran review with the newer Opus 4.7, it caught the bug.
Read that loop carefully:
- The reviewer is a version of the product. Code Review (Claude) runs on PRs that change Claude Code.
- Reviewer quality is model-version-dependent. An older reviewer misses things a newer reviewer catches.
- Regressions ship while the reviewer is still the regression-prone version. By the time you have a reviewer good enough to catch a given bug, the bug is already in production.
This is the vibing-the-tool-with-the-tool problem. It is not unique to Anthropic. Every serious agent shop is running some version of this. But Anthropic is the one with the postmortem in public, because Anthropic is the one whose product quality is directly, visibly tied to the thing reviewing the product.
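One cheap mitigation for that loop, sketched below: record which reviewer model version approved each PR, so that when a newer model ships you can re-queue old approvals for re-review. The field names and the audit policy here are hypothetical, not anything Anthropic describes.

```python
# Sketch: tracking reviewer model versions so approvals can be
# re-audited when the reviewer upgrades. Hypothetical schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class Approval:
    pr: int
    reviewer_model: str  # e.g. "opus-4.6" at approval time

def needs_reaudit(approvals: list[Approval], current_model: str) -> list[int]:
    """PRs approved by an older reviewer than the one now available."""
    return [a.pr for a in approvals if a.reviewer_model != current_model]

log = [Approval(101, "opus-4.6"), Approval(102, "opus-4.7")]
print(needs_reaudit(log, "opus-4.7"))  # [101]
```

It does not prevent the regression from shipping, but it turns “the new reviewer would have caught this” from a postmortem finding into a routine batch job.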
What Evals Cannot See
The deeper admission is the phrase “made it past multiple human and automated code reviews” plus “internal testing.” Three systems failed at once, for three different bugs, across six weeks.
The reasoning-effort change was a deliberate tradeoff users rejected after the fact. That is a product decision, not a process failure. Fine.
The cache bug and the 25-word constraint are process failures of a specific type:
- Aggregate metrics mask long-horizon degradation. A 3% coding drop disappears in benchmark noise. It does not disappear in a four-hour agent session.
- Within-session forgetting is invisible to isolated turn evals. If your evals are per-prompt, you will miss a bug where Claude drops its own reasoning state mid-conversation.
- Human review reads the diff, not the behavior. A 25-word rule looks harmless in a diff. It is not harmless in the loop.
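The second failure mode is easy to demonstrate with a toy. Below, BuggyAgent is my stand-in for a model whose prompt cache drops prior history every few turns (a deliberate simplification, not Anthropic’s implementation): per-prompt evals stay green while a session-level check fails immediately.

```python
# Why per-prompt evals miss within-session forgetting. BuggyAgent is
# a toy simplification of the cache-clearing bug, not the real thing.

class BuggyAgent:
    def __init__(self, clear_every: int = 3):
        self.history: list[str] = []
        self.clear_every = clear_every
        self.turn = 0

    def reply(self, prompt: str) -> str:
        self.turn += 1
        if self.turn % self.clear_every == 0:
            self.history.clear()  # the "cache clearing" bug
        self.history.append(prompt)
        # Toy behavior: echo whatever history survived.
        return " | ".join(self.history)

def per_turn_eval(agent_cls) -> bool:
    """Isolated single-turn check: fresh agent per prompt."""
    return all(p in agent_cls().reply(p) for p in ("a", "b", "c", "d"))

def session_eval(agent_cls) -> bool:
    """Long-horizon check: does turn 4 still see turn 1's context?"""
    agent = agent_cls()
    last = ""
    for p in ("a", "b", "c", "d"):
        last = agent.reply(p)
    return "a" in last

print(per_turn_eval(BuggyAgent))  # True  — per-turn evals stay green
print(session_eval(BuggyAgent))   # False — the session check catches it
```

Every isolated turn is fine. The session is broken. If your eval harness only ever constructs fresh agents, this class of bug is structurally invisible to it.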
None of these are novel failure modes. They are the failure modes every team shipping agents hits. The difference is that Anthropic ships the thing other teams are using to ship the thing. The blast radius is everyone.
Credit Where It Is Due
I have been critical of Anthropic three times this month. It would be dishonest not to mark what they got right here.
- They published. Named dates, named bugs, named the review gap. Many companies would have rolled fixes silently.
- They confirmed what users suspected. The cache bug was dismissed as “vibes” for weeks. Now it is in writing.
- They flagged the model-version-dependent review finding. That is an awkward admission, and they included it anyway.
The postmortem is the behavior you want to see more of. The way to get more of it is to say so when it happens.
What I Am Taking Away
For anyone building on agent harnesses, including me:
- Your eval suite is probably lying to you about long-horizon work. Per-turn evals cannot catch the regressions users feel most. You need soak tests: real tasks, real sessions, real hours.
- Reviewer drift is a thing. If your PR reviewer is an agent, its blind spots are your blind spots, and those blind spots move with model versions in ways you will not notice until the next version catches something.
- Plain-English system prompt changes are load-bearing code. Treat “add one line to the system prompt” the way you treat “change a hot-path default.” It is.
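On that last point, here is one minimal way to operationalize it: a tiny "prompt lint" that flags numeric behavioral caps in a prompt diff so they trigger a soak-test run rather than a thirty-second approval. The pattern and the policy are my invention, not a known practice at Anthropic.

```python
# Sketch: lint system-prompt diffs for numeric behavioral constraints
# ("under 25 words") so they route to soak testing. Hypothetical policy.

import re

CONSTRAINT = re.compile(
    r"\b(under|at most|no more than|within|max(?:imum)?)\s+\d+\s+\w+",
    re.IGNORECASE,
)

def flag_constraints(added_lines: list[str]) -> list[str]:
    """Return added prompt lines that impose a numeric behavioral cap."""
    return [line for line in added_lines if CONSTRAINT.search(line)]

diff_added = [
    "Prefer concise explanations.",
    "Keep responses under 25 words between tool calls.",
]
print(flag_constraints(diff_added))  # only the 25-word cap is flagged
```

A regex will not catch every load-bearing prompt edit, obviously. The point is the routing: anything that looks like a quantitative constraint on model behavior should take the slow path through real workloads, the same way a hot-path default change would.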
Anthropic will ship more regressions. So will everyone. The question is whether the postmortems keep coming, and whether the lessons end up in the review process instead of just the blog post. I am cautiously optimistic, which, given the month I have had with this company, is more than I expected to be writing today.


