You Can’t Delete a Hallucination

A team building a social-signal model hit a haunting. Their model kept “hearing” a specific phrase, “Yeah, Friday at five,” in video clips that had no audio at all. Not occasionally. Reliably. So they did the responsible thing and went looking for the source, the way you’d track down any bug: 30,000 training records, 4,600 transcripts, 800 inference probes. And they found it. The phrase was sitting in a worked example inside their own system prompt.

Mystery solved, except it wasn’t. They removed the example, and the model kept hallucinating. It just picked a different phrase to invent. The haunting didn’t have a source you could delete. It had a source you could only relocate.

Two Bugs, Stacked

The ablations told the real story. There were two things going on, not one:

The prompt supplied the script. The worked example told the model which words to produce. Remove it and you remove that particular phrase.
Post-training supplied the compulsion. Somewhere in how the model was tuned, it had learned that silence must be filled with something plausible. That tendency had nothing to do with the example. The example just gave the tendency its first set of words.

Delete the script and the compulsion is still there, so it reaches for new words. You didn’t fix the bug. You changed its costume.

The Model Didn’t Learn a Phrase. It Learned to Confabulate.

This is the distinction that matters, and almost every “the AI is hallucinating, clean the training data” conversation misses it. The team’s model had not memorised a false fact you could locate and excise. It had acquired a behaviour: when there’s a gap, produce a confident answer to fill it. That behaviour is general. Point it at silent video and it invents speech. Point it at a question it can’t answer and it invents a citation. Same reflex, different gap.

I’ve made versions of this argument from other angles. When Claude confidently tells you it’s DeepSeek, it isn’t reporting a fact about itself; it’s autocompleting a template, filling an identity slot the same way this model fills an audio slot. And the reason these tools resist being made provably safe is that the non-determinism is structural, not a list of defects waiting to be patched out. Confabulation is the same story: it’s a property of the machine, not an entry in the dataset.

They went looking for a hallucination the way you’d look for a typo, expecting to find it, delete it, and be done. What they found was that the model hadn’t memorised a lie. It had learned the habit of lying confidently into a vacuum, and you cannot grep for a habit.

Why “Clean the Data” Keeps Disappointing

There’s a comforting model of hallucination where every false output traces to a bad row in training, and enough data hygiene drives it to zero. Sometimes that’s even true: a specific wrong fact, repeated in the corpus, that you can identify and remove. Those are the easy ones, and they’re real.

But the tendency to fill gaps isn’t a row. It comes from how the model is trained to always produce a fluent, helpful, confident continuation. A model rewarded for never leaving you hanging is a model that will invent rather than abstain, because abstaining reads as unhelpful. You can’t clean that out of the data because it isn’t in the data. It’s in the objective.
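A toy calculation makes the incentive concrete. This is a sketch under invented assumptions, not the team’s training setup: suppose an answer scores +1 when right, loses a hypothetical `wrong_penalty` when wrong, and abstaining scores 0. The function names and the numbers are illustrative only.

```python
# Toy scoring rule (an assumption for illustration, not a real training objective):
# a correct answer earns +1, a wrong answer costs `wrong_penalty`, abstaining earns 0.
ABSTAIN_SCORE = 0.0


def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the model is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Answering is the better policy whenever it beats the abstain score."""
    return expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE


# With no penalty for being wrong, even a 5% guess beats abstaining.
print(should_answer(0.05))                     # True
# Penalise wrong answers as much as right ones are rewarded and the guess stops paying.
print(should_answer(0.05, wrong_penalty=1.0))  # False
```

Under this toy rule, as long as a wrong answer costs nothing, guessing is never worse than abstaining, however unlikely the guess is to be right. That is the sense in which the reflex lives in the objective rather than in any row of data.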

Engineer Around It, Don’t Purify It

If the source can’t be patched, stop trying to patch the source. You defend against a confabulating model the way you defend against any unreliable component: with gates, not faith. I’ve written about using two models from different families and abstaining when they disagree, because a model that can’t agree with an independent model on a hard case doesn’t actually know the answer; it’s filling a gap. Force an “I don’t know” path that the training never gave it. The model won’t volunteer uncertainty. Your system has to manufacture it.
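A minimal sketch of that gate, with everything in it labelled as an assumption: `Model` is a hypothetical stand-in for whatever client you use to call a model, the two stubs are placeholders for models from different families, and exact match after normalisation is a deliberately crude definition of “agree” that a real system would likely replace with something task-specific.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interface: any function that takes a prompt and returns text.
Model = Callable[[str], str]


@dataclass
class GateResult:
    answer: Optional[str]  # None means the system abstained
    reason: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not counted as disagreement."""
    return " ".join(text.lower().split())


def consensus_gate(prompt: str, model_a: Model, model_b: Model) -> GateResult:
    """Return an answer only when both models independently produce the same one."""
    a = normalize(model_a(prompt))
    b = normalize(model_b(prompt))
    if a and a == b:
        return GateResult(answer=a, reason="models agree")
    # The "I don't know" path: manufactured by the system, not volunteered by a model.
    return GateResult(answer=None, reason="models disagree; abstaining")


# Stub models for illustration only; real ones would call two different providers.
def stub_a(prompt: str) -> str:
    return "Yeah, Friday at five"


def stub_b(prompt: str) -> str:
    return "[no speech detected]"


result = consensus_gate("Transcribe the audio in this clip.", stub_a, stub_b)
print(result.answer, "|", result.reason)
```

The design choice worth noticing is that abstention is the default branch. The gate never asks either model whether it is sure; it treats disagreement as the uncertainty signal and refuses to pass anything downstream.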

What This Doesn’t Claim

Keeping it honest, because “you can’t fix hallucination,” overstated, is just doom:

Some hallucinations are genuinely fixable at the source. Data-driven false facts exist and data work removes them. The point isn’t that cleaning data is useless. It’s that it has a ceiling, and the gap-filling reflex sits above that ceiling.
Newer models confabulate less, and that’s real progress. Training models to flag uncertainty instead of bluffing measurably moves the needle. The reflex can be dampened. It just hasn’t been, and arguably can’t be, driven to zero while fluency is the reward.
This is one team’s write-up. The specific numbers are theirs, not a controlled study. Treat the anecdote as an illustration of a mechanism you can already see elsewhere, not as proof on its own.

Closing

The team went looking for a ghost and found a mirror. The phrase in the prompt was real, and deleting it felt like a fix, right up until the model proved that the phrase was never the problem. The problem was a machine built to never leave a silence unfilled, doing exactly what it was built to do.

You can’t delete that. It isn’t in a file. The most useful thing the story teaches is to stop hunting for the bad row and start building the system that assumes the model will confidently make something up, because the one that can’t say “I don’t know” always will.
