Two Models or Nothing: LLM Consensus for Dirty Data

I’m building a price-intelligence system that scrapes competitor catalogues and matches their products to mine. The matching is the whole game: if I can line up the same product across five retailers, I can track price, drift, and freshness. The problem is that scraped data is filthy. The same product shows up under a different name at every retailer - reworded, abbreviated, packed differently - and of the 90,000-odd products I pull, most are missing the fields matching depends on anyway: brand, manufacturer, pack quantity, size.

The master databases that would fill those gaps are gated, expensive, or thin on the long tail. So the obvious move is to point an LLM at the mess and ask it to extract what’s missing.

The obvious move is also where most of the danger lives.

The Single-LLM Trap

A fast, cheap model is great at the easy 90% and quietly catastrophic on the hard 10%. Ask it “who manufactures this brand?” and for a household name it answers correctly almost every time. For a contested or obscure brand, where acquisition history matters and the answer isn’t sitting in the model’s weights, it hallucinates. Worse, it hallucinates differently on each run. Run the same prompt three times and you get three confident, conflicting company names.

That non-determinism is usually treated as noise to be suppressed. I treat it as signal. If a model can’t agree with itself across runs, it doesn’t know the answer. And if it doesn’t know, I’d rather it said so.

Confidence is not knowledge

LLMs return the same fluent, self-assured tone whether they’re recalling a fact or inventing one. A pipeline that trusts the output’s confidence is trusting the one signal the model is worst at calibrating. You have to derive trust from somewhere the model can’t fake.

Deterministic First, LLM Last

Before any model runs, regex does the boring work. Pack counts (“60 tablets”), volumes (“150 ml”), and weights (“50g”) are all parseable with a few careful patterns. This is free, instant, and never hallucinates.

That single step recovered over 40,000 pack-quantity values at zero cost, roughly 70% of the gap, before the LLM saw a single row. The model only gets the genuine tail: the cases parsing can’t crack. Deterministic extraction isn’t a fallback for when the LLM is too expensive. It’s the primary path. The LLM is the fallback.
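A minimal sketch of what that deterministic pass could look like. The patterns, unit lists, and the `parse_title` name are illustrative assumptions, not the production rules; a real catalogue needs more units and edge cases.

```python
import re

# Hypothetical patterns for illustration only.
PACK_RE = re.compile(r"\b(\d+)\s*(tablets?|capsules?|sachets?|pcs|pack)\b", re.IGNORECASE)
VOLUME_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(ml|l)\b", re.IGNORECASE)
WEIGHT_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|kg|g)\b", re.IGNORECASE)


def parse_title(title: str) -> dict:
    """Return whichever of pack_count, volume, weight can be parsed from a title."""
    out = {}
    if m := PACK_RE.search(title):
        out["pack_count"] = int(m.group(1))
    if m := VOLUME_RE.search(title):
        out["volume"] = (float(m.group(1)), m.group(2).lower())
    if m := WEIGHT_RE.search(title):
        out["weight"] = (float(m.group(1)), m.group(2).lower())
    return out


if __name__ == "__main__":
    print(parse_title("Vitamin C 500mg 60 tablets"))
    print(parse_title("Hand Cream 150 ml"))
```

Only rows where this returns nothing for a needed field would go on to the model.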

Consensus Over Conviction

For the genuinely hard field - brand ownership - I run a two-model panel. It is the same instinct as asking a second model for its opinion in interactive work, but automated, with an answer accepted only when the two agree. Two models from different families (different vendors, different training, decorrelated failure modes) get the same question:

What company is the ultimate owner of the brand “X”? Reply with only the company name. If genuinely unsure, reply “UNKNOWN”.

Then the agreement logic decides what to do with the two answers:

Both agree (exact or clean substring match): accept it, mark it grounded.
Either says UNKNOWN, or they disagree: abstain. Leave the field empty and flag it for a human.

The agreement test is deliberately strict. An early version accepted a shared first word, which let “Nature’s Way” and “Nature’s Own” forge agreement on “Nature’s”. That’s a different company. I tore first-word matching out entirely. Now only distinctive, near-complete matches count, and anything borderline abstains. Precision over recall: a missed enrichment costs me nothing, but a wrong one silently corrupts every match downstream.
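One way the strict agreement test could be sketched. The suffix list, the two-thirds coverage threshold, and the `consensus` name are assumptions chosen for illustration; the real rule for what counts as "distinctive, near-complete" may differ.

```python
import re
from typing import List, Optional

# Hypothetical corporate suffixes to ignore when comparing names.
SUFFIXES = {"inc", "ltd", "llc", "plc", "corp", "corporation",
            "company", "co", "limited", "group"}


def normalise(name: str) -> List[str]:
    """Lowercase, drop apostrophes and punctuation, strip trailing suffixes."""
    cleaned = name.lower().replace("'", "").replace("\u2019", "")
    tokens = re.findall(r"\w+", cleaned)
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens


def contains_run(short: List[str], long_: List[str]) -> bool:
    """True if `short` appears in `long_` as a contiguous run of whole tokens."""
    n = len(short)
    return any(long_[i:i + n] == short for i in range(len(long_) - n + 1))


def consensus(answer_a: str, answer_b: str) -> Optional[str]:
    """Return the agreed name, or None to abstain."""
    a, b = normalise(answer_a), normalise(answer_b)
    if not a or not b or a == ["unknown"] or b == ["unknown"]:
        return None
    if a == b:
        return answer_a.strip()
    short, long_ = sorted((a, b), key=len)
    # Assumed threshold: the shorter name must cover at least two thirds
    # of the longer one, so a shared first word alone never counts.
    if contains_run(short, long_) and 3 * len(short) >= 2 * len(long_):
        return answer_a.strip()
    return None


if __name__ == "__main__":
    print(consensus("Procter & Gamble", "The Procter & Gamble Company"))
    print(consensus("Nature's Way", "Nature's Own"))
```

Comparing whole tokens rather than raw characters is what keeps "Nature's Way" and "Nature's Own" from agreeing: they share a word, not a name.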

A single model is a coin flip you can’t see. Two models from different families turn disagreement into a visible, actionable signal: I don’t know.

The other reason a two-model panel is affordable: I deduplicate before I ask. There are 1,700 distinct brands across those 90,000 rows, so I ground each brand once, not each row. Per-row grounding would have cost over a thousand dollars; per-brand grounding costs a couple of dollars. Entity-level deduplication made a two-model panel roughly 40x cheaper than a naive one-model-per-row loop.
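A sketch of entity-level deduplication, assuming rows are dicts with a `brand` field and that `ask_panel` is a hypothetical callable wrapping the two-model question (returning an owner name, or `None` on abstention). With the figures above, the panel is asked 1,700 questions instead of 90,000.

```python
from typing import Callable, Dict, Iterable, Optional


def brand_key(brand: str) -> str:
    """Normalise a brand string so trivial variants share one cache entry."""
    return brand.strip().casefold()


def ground_distinct_brands(
    rows: Iterable[dict],
    ask_panel: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Ask the panel once per distinct brand; abstentions are cached as None."""
    owners: Dict[str, Optional[str]] = {}
    for row in rows:
        brand = row.get("brand")
        if not brand or not brand.strip():
            continue  # nothing to ground on this row
        key = brand_key(brand)
        if key not in owners:
            owners[key] = ask_panel(brand.strip())
    return owners


def owner_for(row: dict, owners: Dict[str, Optional[str]]) -> Optional[str]:
    """Look up the grounded owner for a row without calling any model."""
    brand = row.get("brand")
    return owners.get(brand_key(brand)) if brand else None
```

Caching abstentions matters as much as caching answers: a brand the panel could not agree on should not be re-asked on every row that carries it.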

Abstention Is a Feature

The instinct when building these pipelines is to maximise coverage: get an answer for every row. That instinct is wrong. The most important output of the consensus panel isn’t the answer, it’s the refusal. Roughly a fifth of brands abstain, and that fifth is exactly the set a single model would have confidently poisoned the database with.

An abstained row isn’t a failure. It’s the system correctly identifying the boundary of what it can know cheaply, and routing it to the one resource that can resolve it: a person. The pipeline’s job is to be honest about that boundary, not to paper over it.

Freshness Must Be Content, Not Time

Here’s the bug that taught me the most. The first freshness gate was the obvious one: re-enrich a row if it was scraped more recently than it was last enriched (enrichedAt < lastScrapedAt). Clean, simple, wrong.

Every scrape updates lastScrapedAt, even when only the price changed. So every scrape cycle re-flagged 42,000 rows as stale and re-charged the LLM to re-extract fields that hadn’t moved. The gate was watching the wrong signal: time, when it should have watched content.

The fix is a content hash. Stamp each enrichment with md5(name + description) and only re-enrich when that hash changes. Price churn no longer triggers a thing. The extraction inputs are the only thing that gates re-extraction.
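A minimal sketch of the content-hash gate. The field names (`name`, `description`, `enrichment_hash`) and the helper names are assumptions. The separator between the two fields is a small addition to the plain `md5(name + description)` described above, so that ("ab", "c") and ("a", "bc") do not hash the same.

```python
import hashlib


def content_hash(name: str, description: str) -> str:
    """MD5 over the extraction inputs only; price and timestamps are excluded."""
    # The unit-separator byte is an assumption added to avoid boundary collisions.
    payload = f"{name}\x1f{description}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def needs_enrichment(row: dict) -> bool:
    """True when the row was never enriched or its inputs have changed since."""
    current = content_hash(row["name"], row.get("description") or "")
    return row.get("enrichment_hash") != current


def stamp(row: dict) -> None:
    """Record the hash of the inputs the enrichment was computed from."""
    row["enrichment_hash"] = content_hash(row["name"], row.get("description") or "")


if __name__ == "__main__":
    row = {"name": "Vitamin C 60 tablets", "description": "Immune support", "price": 9.99}
    stamp(row)
    row["price"] = 8.49           # price churn only
    print(needs_enrichment(row))  # inputs unchanged
```

MD5 is fine here because the hash is a change detector, not a security boundary.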

Hash the inputs, not the clock

Any expensive, idempotent transformation needs a freshness gate keyed on the bytes that actually feed it, not on a timestamp that moves for unrelated reasons. If your cache invalidates on activity instead of on change, you don’t have a cache.

Let the Model Raise the Floor, Never the Ceiling

LLM-derived values are second-class citizens in the matcher, by design:

A model value is used only when the scraped field is missing. It can fill a gap; it can never override real data.
It can only raise match confidence, never lower it. A hallucinated value can’t sink a correct match.
Hard contradictions are final. “10 tablets” versus “100 tablets” is a different product, not a fuzzy variant, so a quantity mismatch is an automatic reject regardless of what any model thinks.
The grounded manufacturer is only trusted when both models agreed. Ungrounded guesses never touch the matcher.
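The rules above could be sketched roughly as follows. The `Product` fields, the `adjust_score` name, and the size of the bonus are hypothetical; `base_score` stands in for whatever the matcher already computes from scraped data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    name: str
    pack_qty: Optional[int] = None               # scraped or regex-parsed
    manufacturer: Optional[str] = None           # scraped
    grounded_manufacturer: Optional[str] = None  # set only when both models agreed


def effective_manufacturer(p: Product) -> Optional[str]:
    """Scraped data always wins; the grounded value only fills a gap."""
    return p.manufacturer or p.grounded_manufacturer


def adjust_score(a: Product, b: Product, base_score: float,
                 bonus: float = 0.1) -> Optional[float]:
    """Return the adjusted score, or None for a hard reject."""
    # Hard contradiction: differing pack quantities mean a different product.
    if a.pack_qty is not None and b.pack_qty is not None and a.pack_qty != b.pack_qty:
        return None
    ma, mb = effective_manufacturer(a), effective_manufacturer(b)
    model_involved = a.manufacturer is None or b.manufacturer is None
    # A model-derived value may add confidence when it agrees,
    # and is ignored (never penalised) when it does not.
    if model_involved and ma and mb and ma.casefold() == mb.casefold():
        return min(1.0, base_score + bonus)
    return base_score
```

The asymmetry is the point of the design: agreement from a grounded value adds a little, disagreement changes nothing, and the quantity check runs before any model-derived field is consulted.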

Each pipeline stage also owns its own storage. The scraper writes its columns; enrichment writes a separate blob; nothing clobbers anything else on re-run. A re-scrape can’t wipe an expensive grounding, and a re-enrichment can’t overwrite a verified scrape.

The Pattern

None of this is about a clever prompt or a frontier model. It’s the opposite: assume the model is unreliable on the cases that matter, and build the structure that makes it safe anyway - parse before you prompt, make models vote and abstain when they disagree, dedup at the entity level so consensus stays cheap, gate re-work on content rather than time, and let model output fill gaps without ever overriding truth or demoting confidence.

Trust in an LLM pipeline isn’t something the model gives you. It’s something you engineer around it. The interesting work is in the abstentions, the gates, and the decisions you refuse to hand to the model at all.
