In January, OpenAI Codex had roughly 600,000 weekly active developers. By Apr 22, the number was 4M. The growth curve, per official OpenAI updates: ~600K in Jan, 1M in early Feb (desktop launch), 1.6M in early Mar, 2M in late Mar, 3M on Apr 8, 4M about two weeks later. Nearly sevenfold growth in four months. Token usage on Codex grew roughly 5x in Q1 while user count grew 3x, which means existing users integrated deeper rather than just signing up.
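A back-of-envelope check on that intensity claim, treating the quoted quarterly multipliers as exact (they are rounded figures, so this is a sketch, not a measurement):

```python
# Back-of-envelope: if total Codex token usage grew ~5x in Q1 while the
# user count grew ~3x, then tokens per user grew by the ratio of the two.
token_growth = 5.0  # quarterly multiplier for total token usage (quoted figure)
user_growth = 3.0   # quarterly multiplier for weekly active developers (quoted figure)

per_user_growth = token_growth / user_growth
print(f"per-user token growth: ~{per_user_growth:.2f}x")  # ~1.67x
```

Per-user consumption rising ~1.67x in a quarter is the "integrated deeper" signal: the marginal user is doing more with the tool, not less.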

Anthropic publishes no comparable number for Claude Code. They publish revenue instead: Claude Code run-rate revenue of $2.5B, doubled since January; business subscriptions quadrupled year to date; total Anthropic ARR up from roughly $9B at the end of 2025 to $30B in March.

That asymmetry is the post.

ARR Versus WAU

OpenAI publishes weekly active developers because that number is going up. Anthropic publishes annual recurring revenue because that number is going up. Both companies are choosing the metric they can defend. Both are correct.

What this means for the discourse: when you read “Claude Code is better than Codex” in your feed, the writer is making a quality argument. Quality is what is left to argue about when one side is winning on volume and the other is winning on margin. The argument isn’t wrong. It just isn’t the same argument as “which one more devs are using on Tuesday.”

The fact that Anthropic almost certainly has a smaller WAU than Codex and won’t publish their number is itself information. Companies publish what’s flattering. Anthropic’s choice to lead with revenue and benchmarks instead of users is an honest disclosure of what their internal scoreboard reads.

Distribution Is the Moat

Codex’s growth curve is not GPT-5.5 the model. It is ChatGPT’s billion-user installed base and a coding tab inside the app most devs already have open. Claude Code is a separate CLI you install, configure, authenticate, learn the slash commands of, and remember to launch. Codex is a sidebar that appears next to a chat window. The 4M was a distribution number before it was a product number. The tab beats the CLI on convenience by a margin that anyone who has ever shipped a B2C product will recognise.

This is also why “GPT-5.5 ships features X, Y, Z faster” lands harder than the equivalent Anthropic ship list. OpenAI shipped GPT-5.5 itself, in-app browser use, automatic approval reviews, and Computer Use on macOS in two weeks. Anthropic shipped /recap, /team-onboarding, and skills as a first-class primitive in the same window. Both teams ship aggressively. Only one ships into a product the user is already inside.

The 72-Hour Pricing Flip

The narrative window for “Claude Code is the premium quality choice at a premium price” closed in the 72 hours spanning April 21-23.

  • Apr 21: Anthropic pulled Claude Code from the $20 Pro tier. The new floor for serious Claude Code use was Max at $100 or $200.
  • Apr 22: Simon Willison wrote that if Codex has a free tier and Claude Code starts at $100, he should “obviously switch to Codex.” His trust in Anthropic’s pricing transparency, in his words, was “shaken.”
  • Apr 23: OpenAI shipped GPT-5.5 with Computer Use, in-app browser, and automatic approval reviews into Codex.

Should I obviously switch to Codex?

— Simon Willison, Apr 22, on Anthropic pulling Claude Code from $20 Pro

That is one of the most influential AI-tooling writers on the public internet asking out loud. The 72 hours flipped the narrative from “Claude Code is the craft tool that costs more because it’s worth more” to “Claude Code is the craft tool that costs more, full stop.”

Skills Versus Plugins

The product-shape difference under the headline numbers is open versus closed.

  • Codex ships ~90 OpenAI-curated plugins. The marquee partners are an enterprise-SaaS rolodex: Atlassian Rovo, CircleCI, CodeRabbit, GitLab Issues, Microsoft Suite, Neon, Render, Superpowers. Lock-in by partnership.
  • Claude Code runs on a marketplace plus community skills. One third-party plugin repo alone lists 423 plugins and 2,849 skills. Lock-in by composition.

Different lock-in shapes win different segments. Codex’s enterprise-partnership lock-in matches the procurement model that buys for ten thousand seats. Claude’s compose-your-own lock-in matches the dev who lovingly tunes their own harness. Both are real moats. Only one of them maps cleanly to the slide deck a CIO presents at a budget review.

Benchmarks Are Mixed

The benchmark scoreboard is genuinely mixed and worth not overclaiming on either side.

  • SWE-bench Verified: Opus 4.7 around 87.6%, GPT-5-Codex 74.9%
  • SWE-bench Pro: Opus 4.7 64.3%, GPT-5.4 57.7%
  • Terminal-Bench 2.0: GPT-5.4 75.1%, GPT-5.3-Codex 77.3%, Opus 4.7 69.4%
  • Blind paired tests across 36 trials: Claude Code wins ~67%
  • Token efficiency: Claude uses roughly 4x more tokens per task than Codex

Claude wins on the larger and more publicised benchmarks. Codex wins on Terminal-Bench 2.0 and on cost per task by a meaningful margin. The honest read is that the model gap between frontier labs has compressed enough that the right tool for any given org is now mostly about distribution and price, not capability ceiling.
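One way to see why the capability argument loses force: put the benchmark delta next to the token delta. A sketch using the SWE-bench Verified scores and the ~4x token ratio quoted above; equating token count with cost assumes comparable per-token pricing, which is a simplification, not a quoted fact:

```python
# Relative quality gain vs relative token cost, from the figures above.
claude_score = 87.6  # Opus 4.7, SWE-bench Verified (%)
codex_score = 74.9   # GPT-5-Codex, SWE-bench Verified (%)
token_ratio = 4.0    # Claude tokens per task / Codex tokens per task (quoted)

quality_gain = claude_score / codex_score - 1  # relative score advantage
cost_premium = token_ratio - 1                 # relative token-cost premium
print(f"quality gain: ~{quality_gain:.0%}, token cost premium: ~{cost_premium:.0%}")
# quality gain: ~17%, token cost premium: ~300%
```

A ~17% relative score advantage against a ~300% token premium is exactly the shape of trade that gets decided by distribution and price, not by the capability ceiling.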

If You Can Only Look at One Number

WAU is a verdict from the people doing the work. Benchmarks are a vendor’s homework. They are not the same kind of evidence. When the benchmark gap is a few points and the WAU gap is 5x, the WAU number is doing more work in the decision than the benchmark.

Habits Beat Tools

Anthropic built a craft tool. OpenAI built a habit. Habits beat tools at scale, even if tools score higher on benchmarks. That isn’t a moral judgment about either company. It’s the same dynamic that has decided every developer-tooling category in the last two decades. The IDE that ships in the box wins on Tuesday. The IDE that requires a deliberate setup wins on Saturday.

The open question for Claude Code is whether the craft narrative can hold a $100 floor when the habit is bundled into a $20 plan that already includes a billion-user chat client. The 72-hour flip says the narrative is starting to wobble. The ARR number says paying customers haven’t moved yet. Both are true at the same time, and the next two quarters are about which one wins.

If Anthropic wants to keep the “best in category” frame past 2026, they will need to publish a WAU number. The longer they don’t, the more the “which is better” discourse will read as the metric a smaller company brags about because it’s the only one it can.