Lazy Context: Why Retrieval Beats Compression

This month an open-source project called Headroom went from nowhere to 40,000 GitHub stars and the #1 trending spot, by doing one unglamorous thing: cutting the number of tokens an AI agent reads. The narrative everyone repeats is the timing. It shipped right as a new wave of reasoning models landed and token burn got worse, and it was already sitting there when the bill arrived. Built the cure, waited for the disease.

The timing was sharper than most people realize. The newer Anthropic models also changed their tokenizer, and the same input text now counts as meaningfully more tokens than it did a generation ago - up to roughly a third more depending on the content. So every prompt you were already sending got quietly more expensive without you touching a line of code. The bloat problem didn’t just stay the same and meet a fashionable fix. It actively got worse at exactly the moment a fix went viral.

The tool itself will age out fast - providers will absorb half of what it does. But underneath the hype is an architectural idea worth keeping, and it has nothing to do with the star count.

The Bill Nobody Budgeted For

Tool-heavy agents have a structural cost problem. Every tool call returns something: a wall of JSON, a stack trace, a file you asked it to read, a page of search results. That output lands in the context window and stays there. The API is stateless, so on the next turn you resend the entire conversation, stale tool output included. By turn ten of a real coding session you are paying, every single request, to re-transmit the same 30 KB build log, the same dumped API response, the same file you read on turn two - all of it data the model already digested and will never look at again.
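To put rough numbers on that, here is a minimal sketch of the arithmetic. The function name and every token count are illustrative assumptions (a 30 KB log is taken as about 7,500 tokens, at roughly four bytes per token), not measurements from a real session.

```python
def billed_input_tokens(turn_sizes):
    """Total input tokens billed when every request resends the full history.

    turn_sizes[i] is the number of new tokens added to the conversation on
    turn i. Because the API is stateless, request i carries turns 0..i.
    """
    total = 0
    history = 0
    for size in turn_sizes:
        history += size   # the conversation only ever grows
        total += history  # and the whole thing is sent again
    return total


# Hypothetical session: ten turns of 500 tokens each.
baseline = [500] * 10

# Same session, but turn two also returns a ~7,500-token build log.
with_log = list(baseline)
with_log[1] += 7_500

print(billed_input_tokens(baseline))  # 27500
print(billed_input_tokens(with_log))  # 95000
```

In this toy session the log is read once but billed on nine requests (turns two through ten), which is where the extra 67,500 tokens come from.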

It compounds quietly. Nobody writes the line of code that says “resend everything forever.” It is just the default shape of an agent loop. At company scale that default is what burned through Uber’s annual AI budget in four months.

Compression’s Dirty Secret

The obvious fix is compression: shrink the old tool output before it goes back. And the obvious way to compress is lossy - truncate the middle, or summarize a 400-line log down to “build failed, 3 errors.”

That works right up until the model needs the part you threw away. Three turns later it wants the exact stack frame, the specific line number, the field you trimmed. It is gone, and the model cannot get it back. Lossy compression trades a token bill now for a capability cliff later, and you cannot see the cliff until you walk off it.

Summarization is a one-way door

The moment you replace real tool output with a summary, you have made a bet that the model will never need the detail again. In a long agentic run, that bet loses often. The failure is invisible - the model doesn’t error, it just quietly works from a worse picture of the world.

Compress, Cache, Retrieve

Headroom’s central idea, which it calls Compress-Cache-Retrieve, refuses the trade. Instead of throwing detail away, it compresses the output aggressively for the context window, caches the full original locally, and injects a retrieve tool into the model’s toolset. If the compressed version is enough, great, you paid for the small version. If the model needs the full thing, it calls retrieve and gets it back in under a millisecond.

That is lazy evaluation applied to context. You do not pay for the expensive, complete representation until something actually demands it. Compression stops being a destructive summarization step and becomes a reversible cache miss.
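A minimal sketch of that loop, to make the shape concrete. This is not Headroom's code: the class name, the method names, and the head-and-tail truncation standing in for a real compressor are all assumptions made for illustration.

```python
import hashlib


class LazyContext:
    """Toy compress-cache-retrieve store. Not Headroom's implementation."""

    def __init__(self, head=5, tail=5):
        self.head = head
        self.tail = tail
        self._originals = {}  # key -> full tool output, kept out of the prompt

    def compress(self, output):
        """Return a short stand-in for `output`, caching the original."""
        lines = output.splitlines()
        if len(lines) <= self.head + self.tail:
            return output  # already small; nothing to save
        key = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = output
        omitted = len(lines) - self.head - self.tail
        marker = f"[{omitted} lines omitted; call retrieve('{key}') for the full output]"
        kept_tail = lines[len(lines) - self.tail:]
        return "\n".join(lines[:self.head] + [marker] + kept_tail)

    def retrieve(self, key):
        """What a retrieve tool would call: returns the cached original unchanged."""
        return self._originals[key]


ctx = LazyContext()
build_log = "\n".join(f"step {i}: ok" for i in range(400))  # made-up tool output
compact = ctx.compress(build_log)  # 11 lines go into the context window
```

The marker line is the important part: it tells the model that detail exists and how to ask for it, so dropping the middle of the log is a deferral rather than a loss.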

The fix for context bloat isn’t a smaller summary. It’s making the full version one tool call away, so the model can stay cheap by default and reach for detail only when it needs it.

It is a genuinely different framing. Most context-management work treats the window as something you prune. CCR treats it as a cache hierarchy: a cheap hot tier the model reads by default, a full cold tier it pulls from on demand. The compression ratio stops being a quality knob you agonize over, because being wrong is recoverable.

The Cache Nobody Knows They’re Breaking

The piece I found most quietly useful is the part Headroom calls cache alignment, because it targets a mistake almost everyone makes without noticing.

Provider prompt caches work by prefix matching. The API hashes your prompt up to a checkpoint and, on the next request, reuses the work for any identical leading bytes. The economics are dramatic: on Anthropic’s API a cache read costs roughly a tenth of the normal input price, and entries live five minutes by default. For an agent resending a long history every turn, the repeated prefix is most of the input bill, so that discount applies to most of what you pay.
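As a back-of-the-envelope check on those economics, here is a small sketch. It is deliberately simplified and rests on stated assumptions: costs are in units of one base-price input token, every request lands inside the cache lifetime, and any surcharge for writing the cache entry is ignored. The function name and the 50,000-token prefix are hypothetical.

```python
def prefix_cost(prefix_tokens, requests, cached, read_multiplier=0.1):
    """Cost of a shared prefix over `requests` calls, in base-price token units.

    read_multiplier=0.1 encodes "a cache read costs roughly a tenth".
    Simplification: the first request is charged at the plain input price.
    """
    if requests <= 0:
        return 0.0
    if not cached:
        return float(prefix_tokens * requests)
    # First request populates the cache; the remaining ones read from it.
    return prefix_tokens + prefix_tokens * read_multiplier * (requests - 1)


uncached = prefix_cost(50_000, requests=10, cached=False)
cached = prefix_cost(50_000, requests=10, cached=True)
print(uncached, cached)
```

Under these assumptions ten requests over the same 50,000-token prefix cost about 95,000 token-equivalents with a working cache against 500,000 without one.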

The catch is the word identical. Change a single byte anywhere in the prefix and everything after it is invalidated. And agent prompts are full of dynamic noise at the front: a timestamp in the system header, a per-request UUID, a tool list serialized in non-deterministic order. Each of those silently shatters the cache. You are paying full price on every request and the dashboard says nothing.
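One way to audit this by hand is to build the front of the prompt deterministically and fingerprint it between requests. The sketch below assumes tools are plain dicts with a `name` key. The helper names are hypothetical, and the SHA-256 fingerprint is only a local stand-in for "are these bytes identical", not how any provider computes its cache key.

```python
import hashlib
import json


def stable_prefix(system_prompt, tools):
    """Serialize the cacheable part of a prompt so equal inputs give equal bytes."""
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(
        {"system": system_prompt, "tools": ordered_tools},
        sort_keys=True,         # fixed key order inside every object
        separators=(",", ":"),  # no whitespace drift
        ensure_ascii=False,
    )


def fingerprint(prefix):
    """Hash of the prefix bytes; changes if any single byte changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


tools_a = [
    {"name": "read_file", "description": "Read a file"},
    {"name": "grep", "description": "Search files"},
]
tools_b = list(reversed(tools_a))  # same tools, different order

stable_1 = fingerprint(stable_prefix("You are a coding agent.", tools_a))
stable_2 = fingerprint(stable_prefix("You are a coding agent.", tools_b))
# A made-up timestamp interpolated into the system prompt:
noisy = fingerprint(
    stable_prefix("You are a coding agent. Now: 2025-01-01T00:00:01Z", tools_a)
)

print(stable_1 == stable_2)  # True: tool order no longer matters
print(stable_1 == noisy)     # False: the timestamp changes the prefix
```

Logging the fingerprint on every request makes a shifting prefix visible immediately, instead of leaving it to show up later as a weak discount.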

Check your cache-read tokens

If you run agents against a caching API, look at cache_read_input_tokens on the response. If it stays near zero across requests that should share a prefix, something dynamic is at the front of your prompt nuking the discount. A cache aligner fixes this by stabilizing those volatile prefixes; you can also just stop interpolating timestamps and UUIDs into your system prompt.
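A small sketch of that check, assuming you have collected the usage block from each response as a plain dict. `cache_read_input_tokens` is the field named above; `cache_creation_input_tokens` and `input_tokens` are assumed to follow the same usage object, so verify them against your SDK version. The sample numbers are invented.

```python
def cache_read_ratio(usages):
    """Share of all input tokens that were served from the prompt cache."""
    read = 0
    total = 0
    for usage in usages:
        cached = usage.get("cache_read_input_tokens") or 0
        written = usage.get("cache_creation_input_tokens") or 0
        fresh = usage.get("input_tokens") or 0
        read += cached
        total += cached + written + fresh
    return read / total if total else 0.0


# Invented example: the prefix is written once, then read on later requests.
healthy = [
    {"input_tokens": 200, "cache_creation_input_tokens": 9800, "cache_read_input_tokens": 0},
    {"input_tokens": 300, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 9800},
    {"input_tokens": 400, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 9800},
]

# Invented example: something volatile in the prefix forces a rewrite each time.
broken = [
    {"input_tokens": 200, "cache_creation_input_tokens": 9800, "cache_read_input_tokens": 0},
    {"input_tokens": 300, "cache_creation_input_tokens": 9800, "cache_read_input_tokens": 0},
    {"input_tokens": 400, "cache_creation_input_tokens": 9800, "cache_read_input_tokens": 0},
]

print(round(cache_read_ratio(healthy), 2))
print(cache_read_ratio(broken))
```

A ratio stuck at zero across requests that should share a prefix is the signature of the problem described here: the cache is being written on every request and never read.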

That insight is worth more than the tool. Plenty of teams turn on prompt caching, see a smaller-than-promised saving, and shrug. The reason is usually a dynamic prefix they never thought to audit.

What It Doesn’t Solve

It is worth being honest about where the hype outruns the evidence, because the gap is real.

The headline numbers are measured generously. “60-95% fewer tokens” is the marketing line. The accuracy tests showing no quality degradation were run at far gentler compression ratios, in the 19-32% range. Nobody has cleanly shown that the aggressive setting and the no-degradation setting are the same setting.

For long conversations, native caching may beat it. If your workload is one long thread rather than many tool calls, the provider’s own prompt caching is simpler and often saves more than an external compression layer. CCR shines on tool-output bloat specifically, not on every token problem.

It is not free infrastructure. Real reviews flag hundreds of megabytes of memory overhead on long sessions, a heavy native dependency for the code-aware compression, and an English bias in the prose model. This is a running service with its own costs, not a magic flag.

The skeptics were not wrong. Hacker News, notably, was unimpressed across several submissions even as the stars piled up elsewhere. “Does this actually hold in production?” is still a fair question.

Why the Pattern Outlives the Tool

Context engineering is becoming the real bottleneck in agentic systems, more than raw model quality. As that happens, providers will keep absorbing these tricks - native caching already exists, and native compression and context editing are arriving. A standalone tool sitting in the middle has a shelf life.

But the ideas transfer regardless of who ships them. Treat compression as reversible retrieval rather than destructive summarization. Keep the cheap representation hot and the expensive one a single call away. Audit your prompt prefix for the dynamic bytes silently breaking your cache. Those are principles you can apply by hand, in your own agent loop, today, whether or not Headroom is the thing that taught them to you.

The tool is riding a moment. The pattern is the part to keep.
