Six days ago, OpenAI shipped GPT-5.3-Codex. Strong model. 77.3% Terminal-Bench. Got overshadowed by Sam Altman’s Super Bowl rant.

Yesterday, they shipped something more interesting. GPT-5.3-Codex-Spark: a stripped-down variant that runs at 1,000+ tokens per second. 15x faster than the flagship. The model is interesting. The chip underneath it is the story.

Codex-Spark is OpenAI’s first model running on non-Nvidia hardware.

The Wafer

Codex-Spark runs on the Cerebras WSE-3: Wafer-Scale Engine, third generation. A single chip the size of a dinner plate. 4 trillion transistors. On-chip SRAM that’s roughly 1,000x faster than the HBM memory found on Nvidia’s best GPUs.

Traditional GPU inference has a bottleneck: moving data between compute and memory. Every token generated requires reading model weights from memory, running the computation, writing back. With large models, the memory bandwidth wall hits hard. You can have all the compute in the world - if you’re waiting on memory reads, you’re slow.

Cerebras sidesteps this entirely. The entire model sits on-chip. No off-chip memory reads. No interconnect latency between GPU clusters. One massive, tightly connected fabric where compute and memory coexist. The architecture is fundamentally different from anything Nvidia sells.
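A back-of-envelope calculation shows why the memory wall dominates. During single-stream decoding, every generated token has to stream the model's weights through memory once, so token rate is capped at roughly memory bandwidth divided by weight size. The sketch below uses purely illustrative numbers (a hypothetical 70B-parameter model at 16-bit precision; neither Spark's size nor its precision has been disclosed) just to show how far the ceiling moves when bandwidth jumps by orders of magnitude.

```python
def decode_ceiling_tokens_per_sec(params_billion: float,
                                  bytes_per_param: float,
                                  bandwidth_tb_per_sec: float) -> float:
    """Upper bound on single-stream decode speed when each token must
    stream every model weight through memory once (bandwidth-bound)."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return (bandwidth_tb_per_sec * 1e12) / weight_bytes

# Hypothetical 70B model with 16-bit weights (illustrative, not Spark's actual config):
print(decode_ceiling_tokens_per_sec(70, 2, 3.3))    # ~24 tok/s from ~3.3 TB/s of HBM
print(decode_ceiling_tokens_per_sec(70, 2, 1000))   # ~7,000 tok/s from PB/s-class on-chip SRAM
```

Compute barely enters the equation. At these scales the ceiling is set almost entirely by how fast weights can reach the arithmetic units.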

The result: Llama 3.1 405B at 969 tokens/second. Llama 4 Scout at 2,600. GPT-5.3-Codex-Spark at 1,000+. These aren’t incremental gains over GPU inference. They’re an order of magnitude.

The $10 billion bet

OpenAI signed a reported $10 billion deal with Cerebras in January 2025. This isn’t a research experiment. It’s a strategic infrastructure commitment to a non-Nvidia future.

The Tradeoff

Codex-Spark deliberately trades accuracy for speed.

| Benchmark | Codex-Spark | Full GPT-5.3-Codex |
| --- | --- | --- |
| Terminal-Bench 2.0 | 58.4% | 77.3% |
| SWE-Bench Pro | ~2-3 min/task | ~15-17 min/task |
| Context window | 128K | 400K |

That’s a meaningful accuracy drop. Nearly 20 percentage points on Terminal-Bench. OpenAI isn’t hiding it. They’re framing it as a feature.

The pitch: Spark is for interactive coding, not autonomous marathons. It makes minimal, targeted changes. It won’t run tests unless asked. It’s designed so you can interrupt and redirect in real time. Think pair programmer, not background agent.

A less accurate AI that acts right away is worth more than a perfect one that waits.

— OpenAI

This is a genuine philosophical split. The full GPT-5.3-Codex is your “fire and forget” model: hand it a task, wait 15 minutes, review the result. Spark is conversational: you think out loud, it responds before your train of thought derails.

Two Philosophies of Fast

Anthropic’s approach is the polar opposite.

Claude’s fast mode runs the same Opus 4.6 model at 2.5x speed. Same weights. Same intelligence. Optimized inference backend - likely speculative decoding and attention optimization, though Anthropic hasn’t disclosed specifics.
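For intuition on how "same weights, faster" can work at all, here is a minimal sketch of speculative decoding, one common technique and only a guess at Anthropic's actual stack: a small draft model proposes several tokens cheaply, the large target model checks them in a single pass, and an accept/reject rule keeps the output distribution identical to the target model's. The toy models and four-token vocabulary below are stand-ins, not anyone's real system.

```python
import random

VOCAB = ["a", "b", "c", "d"]

def draft_model(tokens):
    """Cheap, fast guesser (toy stand-in): returns a next-token distribution."""
    return {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

def target_model(tokens):
    """Expensive, accurate model (toy stand-in): the distribution we must preserve."""
    return {"a": 0.35, "b": 0.35, "c": 0.2, "d": 0.1}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def speculative_step(tokens, k=4):
    """Draft k tokens cheaply, then accept/reject against the target model so
    the output is distributed exactly as if the target had generated it alone."""
    drafted = []
    for _ in range(k):                          # 1. draft proposes k tokens autoregressively
        drafted.append(sample(draft_model(tokens + drafted)))

    accepted = []
    for i, tok in enumerate(drafted):           # 2. target verifies (one batched pass in practice)
        q = draft_model(tokens + drafted[:i])
        p = target_model(tokens + drafted[:i])
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                # accepted drafts are "free" extra tokens
            continue
        # Rejected: resample from the residual distribution max(p - q, 0), then stop.
        residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
        norm = sum(residual.values())
        accepted.append(sample({t: v / norm for t, v in residual.items()}))
        break
    else:
        # All k drafts accepted: the target's verification pass also yields one bonus token.
        accepted.append(sample(target_model(tokens + drafted)))
    return tokens + accepted

print(speculative_step(["a"]))
```

Because rejected drafts are resampled from the residual of the target distribution, output quality is mathematically unchanged; the speedup comes entirely from accepted drafts being cheap. Whether Anthropic uses exactly this, a variant, or something else is not public.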

The difference matters:

  • OpenAI: Smaller model + custom silicon = 1,000+ tps, lower accuracy
  • Anthropic: Same model + software optimization = ~2.5x speed, same accuracy

OpenAI is saying: for interactive work, speed matters more than getting it right the first time. You’ll catch errors in real time because the model is fast enough to feel conversational.

Anthropic is saying: you shouldn’t have to choose. Same brain, faster mouth.

Both are defensible. They’re also betting on different futures. OpenAI is investing in purpose-built hardware. Anthropic is investing in inference software. The first approach scales with chip manufacturing. The second scales with engineering talent.

The pricing gap

Claude fast mode costs 6x standard Opus rates ($30/$150 per million tokens). OpenAI hasn’t disclosed Spark-specific pricing, but it’s currently bundled with ChatGPT Pro ($200/month). Neither approach is cheap. Speed is a premium feature.

Speed as a Weapon

There’s a persistent counterargument: model quality trumps speed. Who cares if it’s fast if it’s wrong?

During a live vibe check of Codex-Spark, Dan Shipper put it simply: “speed is a form of intelligence.” He’s been repeating this across his Vibe Check series for over a year: “I will judge the same model as smarter - and be able to make more progress using it - if it’s 10 times faster.” It sounds like marketing. The math backs him up.

More Attempts, Better Outcomes

The Large Language Monkeys paper (Stanford, 2024) tested a simple hypothesis: what if you just let the model try more times? DeepSeek-Coder went from 15.9% on SWE-bench Lite with a single attempt to 56% with 250 attempts. Five attempts from a cheap model cost less and solved more issues than one attempt from GPT-4o.

Test-time compute research (Snell et al., Berkeley) pushed this further: a 3B parameter model with compute-optimal test-time scaling outperformed a 405B model. 135x smaller.

The implication for Codex-Spark is direct. At 1,000 tokens per second, you can run twenty attempts in the time a single attempt takes at 50 tps. The model is nearly 20 points worse on Terminal-Bench. But give it even five shots instead of one and the math starts to favor speed.
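A toy version of that math, using the Terminal-Bench scores as a rough per-attempt success probability. The assumptions do heavy lifting: real attempts are correlated, a benchmark score is not a per-task coin flip, and you need a way to recognize which attempt worked, which is exactly the caveat the verification section below addresses.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds, assuming independent attempts."""
    return 1 - (1 - p_single) ** k

spark = 0.584   # Spark's Terminal-Bench 2.0 score, used as a rough per-attempt proxy
full = 0.773    # full GPT-5.3-Codex

print(f"Spark, 5 attempts:     {pass_at_k(spark, 5):.1%}")   # ~98.8%
print(f"Full model, 1 attempt: {pass_at_k(full, 1):.1%}")    # 77.3%
```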

Boyd’s Law

Colonel John Boyd studied F-86 vs MiG-15 dogfights in the Korean War. The MiG-15 was the superior aircraft: faster climb rate, tighter turn radius. The F-86 won 9 out of 10 engagements. The decisive advantage: hydraulic flight controls versus the MiG’s manual stick. A small per-cycle advantage compounded into dominance over dozens of iterations.

The pilot who goes through the OODA cycle in the shortest time prevails because his opponent is responding to situations that have already changed.

— John Boyd

Same pattern in AI coding. Each cycle doesn’t need to be right. It needs to be fast enough that wrong answers get caught and corrected before momentum is lost.

The Missing Piece: Verification

There’s a critical caveat in the Large Language Monkeys paper. Best-of-N scaling only works with automated verifiers. Without a way to tell good attempts from bad ones, methods “plateau beyond several hundred samples and fail to fully scale with the sample budget.” More attempts without verification is just more noise.

This is where guardrails enter the picture. The formula isn’t just “fast model.” It’s:

  • Fast model (more iterations per unit time)
  • Automated verification (tests, linting, type checking tell you which attempt worked)
  • Tight feedback loop (errors feed back into the next attempt automatically)

OpenAI’s own Codex engineering guide says the quiet part out loud: “Since agents can run the test suite and iterate based on the output, defining high quality tests is often the first step to allowing an agent to build a feature.” Speed is the engine. Guardrails are the steering.
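Put together, the loop looks something like the sketch below. `generate_patch` and `apply_patch` are hypothetical stand-ins for the model call and the repo edit; the test suite is the verifier that separates signal from noise.

```python
import subprocess

def fast_model_loop(generate_patch, apply_patch, max_attempts=5):
    """Best-of-N with a verifier: keep attempting until the test suite passes.
    generate_patch(feedback) and apply_patch(patch) are hypothetical stand-ins."""
    feedback = ""                                   # failure output from the previous attempt
    for _ in range(max_attempts):
        patch = generate_patch(feedback)            # fast model: cheap to call many times
        apply_patch(patch)
        result = subprocess.run(["pytest", "-q"],   # automated verification
                                capture_output=True, text=True)
        if result.returncode == 0:
            return patch                            # first verified attempt wins
        feedback = result.stdout + result.stderr    # tight feedback loop: errors drive the retry
    return None                                     # nothing passed: escalate to a human or a slower model
```

Remove the verification step in the middle and this loop degenerates into exactly the noise the paper warns about.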

The Flow State Case

The METR study (July 2025) found AI coding tools made experienced developers 19% slower, despite those developers expecting a 24% speedup. Root cause: context switching. Every prompt-and-wait cycle pulls you out of flow. The researchers themselves noted that “higher capability, lower latency” AI could change this. The tools weren’t fast enough.

At 50 tokens per second, the AI is a delegation tool. You send a task, you wait, you review. At 1,000 tokens per second, the AI is a conversation partner. The response arrives before your attention drifts. That’s not the same workflow made faster. It’s a different workflow entirely.
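The arithmetic behind that difference is simple. Assuming a typical reply of around 500 tokens (an illustrative figure, not a measured one):

```python
REPLY_TOKENS = 500                      # assumed typical reply length, for illustration only
for tps in (50, 1000):
    print(f"{tps} tok/s -> {REPLY_TOKENS / tps:.1f} s to finish the reply")
# At 50 tok/s the reply takes 10 seconds: long enough to tab away.
# At 1,000 tok/s it takes half a second: inside a conversational pause.
```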

The threshold hypothesis

There’s likely a critical tps threshold where AI tools flip from “interrupt your flow” to “extend your flow.” Below it, every prompt is a context switch. Above it, the interaction feels like thinking out loud. Nobody knows exactly where this threshold is. But 1,000 tps is a strong candidate.

The Bigger Picture

Zoom out from Codex-Spark and the pattern gets interesting.

For years, the AI race was about models. Who has the best weights. Who leads SWE-Bench. Who wins Terminal-Bench. The assumption: better model = better product. Ship a smarter model, everything downstream improves.

That assumption is breaking down. Models are converging. Opus 4.6 and GPT-5.3-Codex launched on the same day with comparable benchmarks. Gemini 3 Flash hits 78% on SWE-Bench Verified. The gap between frontier models is shrinking quarter over quarter.

When model quality becomes table stakes, differentiation moves to infrastructure. Speed. Cost. Distribution. Ecosystem. And infrastructure means hardware.

OpenAI went to Cerebras. Meta powers its Llama API through Cerebras (2,600 tps on Llama 4 Scout). Groq has its own LPU architecture. SambaNova is building inference chips. Google runs everything on TPUs. The GPU monoculture is fracturing. Purpose-built silicon for specific workloads is replacing “just rent more A100s.”

This is the real significance of Codex-Spark. Not a faster coding model. A signal that the AI race has entered its hardware era. The companies that win the next phase won’t just have the best models. They’ll have the best chips to run them on.

What It Doesn’t Solve

Custom silicon doesn’t fix the hard problems.

Context understanding matters more than raw speed. The best AI coding tool isn’t the fastest autocomplete. It’s the one that understands your project well enough to make coordinated changes across files without breaking things. Spark’s 128K context window is less than a third of the full model’s 400K. You’re trading context for speed.

The accuracy gap is real. 58.4% vs 77.3% on Terminal-Bench is a 25% relative drop. For complex, multi-step refactors, that gap compounds. Each wrong intermediate step cascades. Interactive correction helps, but it shifts cognitive load back to the developer.

Developer productivity is still unproven. The METR study showed the problem. Nobody has shown that faster models solve it. The hypothesis is compelling. The evidence is thin.

Cost and access. Cerebras wafers are expensive and supply-constrained. This isn’t infrastructure that scales to every developer overnight. Fast inference is currently a premium tier, not a default.

Where This Leaves Us

One week. Two OpenAI launches. The first (GPT-5.3-Codex) was the best model they’ve shipped. The second (Codex-Spark) might be more important. Not because the model is better - it’s explicitly worse. Because the infrastructure underneath it points to where AI is going.

The model race is becoming a silicon race. Cerebras wafer-scale engines. Groq LPUs. Google TPUs. Purpose-built inference hardware that doesn’t look anything like the Nvidia GPUs that trained these models. The training stack and the inference stack are diverging. Train on GPUs. Run on something else entirely.

For developers, the practical implication is simple: AI coding tools are about to get much faster across the board. OpenAI rewrote their inference stack alongside the Cerebras launch - 80% reduction in per-roundtrip overhead, 50% reduction in time-to-first-token. Those improvements roll out to all their models, not just Spark. Anthropic’s fast mode is live now. The floor is rising.

Whether 1,000 tokens per second actually changes how you code, or just makes the same workflow feel nicer, is the open question. The chip war is just getting started.