For a blog that regularly fanboys Claude, this is going to be uncomfortable.

I’ve written posts about Claude Code’s plan mode, parallel subagents, context engineering. I use it daily. But when the evidence stacks up, intellectual honesty demands I follow it.

The evidence: LLM quality drops are real. Denials come first. Admissions come later.

The Pattern

August 2024. Reddit fills with complaints about Claude 3.5 Sonnet. “Forgetting tasks mid-conversation,” struggling with basic coding it handled weeks before. Developers cancel subscriptions.

Anthropic’s response from Alex Albert:

Our initial investigation does not show any widespread issues. We’d also like to confirm that we’ve made no changes to the 3.5 Sonnet model or inference pipeline.

— Alex Albert, Anthropic (August 2024)

A year later, history repeats. August 2025, complaints surge again. This time Anthropic publishes a postmortem admitting three infrastructure bugs degraded Claude between August 5 and September 4, 2025.

The Bugs

Context Window Routing Error: Short-context requests routed to servers configured for 1M token contexts. Started at 0.8% of Sonnet 4 requests, peaked at 16%. Around 30% of Claude Code users experienced at least one misrouted request.

TPU Output Corruption: Misconfiguration caused unexpected character generation. English answers occasionally sprouted Thai or Chinese text mid-sentence.

XLA:TPU Compiler Bug: The approximate top-k operation returned incorrect token candidates. Mixed-precision arithmetic conflicts sometimes dropped the highest-probability token entirely.
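
That third bug combines two ingredients worth seeing side by side: approximate top-k selection and reduced-precision arithmetic. Here’s a toy sketch in Python with JAX (assuming jax is installed); it illustrates those ingredients, not Anthropic’s actual serving stack.

```python
# Toy illustration, not a reproduction of Anthropic's bug.
import jax
import jax.numpy as jnp

logits = jax.random.normal(jax.random.PRNGKey(0), (50_257,))  # fake vocab-sized logits

# Exact vs. approximate top-k (the approximate variant is the TPU-optimized path).
exact_vals, exact_idx = jax.lax.top_k(logits, k=5)
approx_vals, approx_idx = jax.lax.approx_max_k(logits, k=5)
print("exact top-5 token ids: ", exact_idx.tolist())
print("approx top-5 token ids:", approx_idx.tolist())

# Reduced precision can erase the gap between the top two candidates.
a, b = jnp.float32(10.00), jnp.float32(10.02)
print(bool(a != b))                                            # True in float32
print(bool(a.astype(jnp.bfloat16) == b.astype(jnp.bfloat16)))  # True: both round to 10.0
# Once two logits collapse to the same value, which token "wins" is a tie-break,
# and an approximate selection can drop the genuinely highest-probability token.
```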

Six weeks to diagnose

These bugs ran from August 5 to September 4. Anthropic’s postmortem came September 17. Their own evaluations didn’t capture what users were reporting.

The postmortem includes this line:

We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone.

— Anthropic Postmortem (September 2025)

Fair. But the denial-then-admission pattern had already played out.

Why Detection Is Hard

Anthropic acknowledged that their evaluations “simply didn’t capture the degradation users were reporting.” Part of the reason: Claude often recovers well from isolated mistakes. A single bad response gets masked by subsequent good ones.

Privacy constraints also slowed diagnosis. Engineers couldn’t access user conversations directly, limiting their ability to reproduce issues.

The uncomfortable truth: subjective quality is hard to benchmark. When users say “it feels dumber,” there’s no eval for that. Until there is, companies will default to “we found nothing.”

The User Side

Not all perceived degradation is real. Jon Stokes wrote a compelling counterargument after experiencing what felt like a Claude Code quality collapse.

His conclusion: it was his behavior, not the model.

I had gotten comfortable, lazy, and overconfident in the model’s capabilities, and as I did that, the output quality started to collapse.

— Jon Stokes

He had stopped breaking tasks into small chunks, started using auto-accept mode, given Claude “bigger and bigger bites to chew on with less and less active direction.” When he returned to disciplined prompting, quality “jumped all the way back up.”

This is real. You and the model form a coupled system. User drift explains some complaints.

There’s another possible explanation: context rot. Research from Chroma shows LLM performance dropping from ~95% accuracy on short inputs to 60-70% on longer contexts. The “lost in the middle” problem is well documented: models perform best when relevant information sits at the beginning or end of the context, with significant degradation when it’s buried in the middle of long sequences.

Power users accumulate chat history. They hit degradation thresholds first. This aligns with the 12 Factor Agents principle: own your context window. Fill past 40% and you enter the “dumb zone.”
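
If you want a rough sense of where a session sits, a back-of-the-envelope check is enough. A minimal sketch: the 4-characters-per-token ratio and the 200K-token window are assumptions for illustration, not a real tokenizer or any particular model’s limit.

```python
# Crude context-usage estimate; swap in a real tokenizer for anything serious.
CONTEXT_WINDOW_TOKENS = 200_000  # assumed window size
DUMB_ZONE = 0.40                 # the "fill past 40%" threshold cited above

def estimated_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

def context_usage(messages: list[str]) -> float:
    return sum(estimated_tokens(m) for m in messages) / CONTEXT_WINDOW_TOKENS

history = ["(imagine a long transcript chunk here) " * 40] * 400
usage = context_usage(history)
print(f"Context roughly {usage:.0%} full")
if usage > DUMB_ZONE:
    print("Past the ~40% mark: compact the conversation or start a fresh session.")
```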

But context rot doesn’t explain Thai characters in English output. Some degradation is real infrastructure failure.

The New Flagship

November 24, 2025. Opus 4.5 launches. Nine days later, status.claude.com shows elevated errors on both Opus 4.5 and Haiku 4.5.

Power users are reporting familiar patterns. One documented his experience with receipts: massive API spend, two Max plans, 14-hour coding sessions.

His findings:

  • Model substitution bugs: Gave Opus code specifying Gemini 3; Opus changed it to “Flash 2,” a model that doesn’t exist. Told it five times to fix it. It refused each time. Switched to Sonnet, which worked immediately.
  • Reward hacking behavior: Commenting out important code, adding fake fallbacks, print-logging garbage instead of fixing issues.
  • More model swapping: Set GPT-5 Nano as the model; Opus changed it to GPT-4 mid-task, claiming “this model doesn’t exist.”

I’ve hit all three. Explicit model IDs in quotes, ignored repeatedly. Code commented out with “we’ll fix this later.” Models swapped without asking.

At 6 a.m., a low-traffic hour, he couldn’t use Opus. He had to fall back to Sonnet.

The uncomfortable theory

Companies optimize for uptime metrics over model intelligence. Launch hot, then handle the load by quietly dialing quality down. 99% uptime looks better on a status page than “best reasoning quality.”

The pattern repeats

Stanford researchers documented the same with GPT-4 in 2023. Accuracy on prime number identification dropped from 97.6% to 2.4% between March and June. OpenAI denied changes. The researchers concluded: “We don’t fully understand what causes these changes because these models are opaque.”

What This Means

Monitor your own workflows. Don’t trust “no changes” at face value.

When you notice degradation:

  • Document specific examples with timestamps and prompts
  • Check status pages for elevated errors or partial outages
  • Try a different model to isolate whether it’s the flagship or your context
  • Review your own behavior: are you vibe coding or actively supervising?
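
The first and third items are the ones most people skip. A minimal sketch of both, assuming the official anthropic Python SDK; the model IDs, prompt, and log path are illustrative placeholders, not a prescription.

```python
# Send the same prompt to two models and append timestamped results to a JSONL log.
import datetime
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_and_log(prompt: str, models: list[str], log_path: str = "degradation_log.jsonl") -> None:
    with open(log_path, "a") as log:
        for model in models:
            response = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            log.write(json.dumps({
                "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
                "model": model,
                "prompt": prompt,
                "output": response.content[0].text,
            }) + "\n")

# Same prompt, flagship vs. fallback: if only one of them struggles,
# that is evidence worth keeping next to the status page history.
run_and_log(
    "Refactor this function without changing behavior: ...",
    ["claude-opus-4-5", "claude-sonnet-4-5"],  # placeholder model IDs
)
```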

Research shows 91% of ML models degrade over time. LLMs are no exception. The question isn’t whether quality drops. It’s whether the vendor admits it before or after you’ve wasted weeks debugging your own code.

The uncomfortable truth

I still use Claude daily. I still think it’s the best coding model available. But “best” doesn’t mean “immune to degradation.” Trust the tool. Verify the output.