GLM-5.2: The Receipts Came In

Eleven days ago, when GLM-5.2 shipped, I put an asterisk on it: Zhipu published no benchmarks at launch, so the right move was to take the claims on credit, not faith. “Powerful coding” is a press release until someone independent runs it.

Someone independent ran it. The receipts are in, and they hold up better than I expected.

The benchmarks verified

GLM-5.2 now sits at number one on Design Arena’s coding leaderboard, ten Elo points above Claude Fable 5 in human-preference head-to-heads (it’s second on the WebDev board, behind Fable). Fireworks independently reproduced its 91.4% on GPQA-Diamond. The launch numbers that had no third-party backing now have it.

That’s not “GLM beats the frontier.” It doesn’t. On the hardest reasoning, Opus 4.8 is still ahead, and Design Arena is human preference, not a saturating capability test. But “good enough, open, and a fraction of the price” was always the claim that moves a market, and that claim is now verified rather than asserted.
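To put that ten-point lead in perspective, here is a minimal sketch of what it implies head-to-head. It assumes the leaderboard uses the standard logistic Elo curve with a 400-point scale; Design Arena's exact rating method isn't stated here, so the function name and the scale are my assumptions, not theirs.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win share for the higher-rated model under the standard
    Elo logistic curve. The 400-point scale is an assumption."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Ten Elo points is the lead over Fable 5 cited in this post.
share = elo_expected_score(10)
print(f"Expected preference share at +10 Elo: {share:.1%}")
```

Under that assumption, a ten-point gap works out to a preference share only a little above a coin flip. That is consistent with reading the result as "competitive with the frontier," not "beats it."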

The security receipt is the one I care about

Yesterday I argued that over-refusal taxes defenders and lets attackers walk, and that I’d moved authorized security work to an open-weight model because the frontier kept refusing it. That was my anecdote. Now there’s a third-party version.

Semgrep ran GLM-5.2 against Claude Code on real Insecure Direct Object Reference (IDOR) detection across open-source codebases. GLM-5.2 scored 39% F1 to Claude Code’s 32%, at roughly $0.17 per vulnerability found, a fraction of frontier pricing. Their headline, which I wish I’d thought of first, was “We Have Mythos at Home.”

They were also honest about the limit, and so should I be:

The harness still matters more than the model.
— Semgrep, on their GLM-5.2 vs Claude Code benchmark

Their own scaffolded pipeline beat both models outright (53-61% F1), which is the real lesson: the orchestration around the model is doing more work than the model swap. It’s one task on one dataset. But it’s a credible security shop finding the open model does the job better and cheaper, and that is exactly the receipt my post needed and didn’t have.
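For readers who don't live in detection metrics, here is a rough sketch of how an F1 score and a cost-per-finding figure are computed. The precision, recall, spend, and finding counts below are hypothetical values I picked to land near the headline numbers; Semgrep's underlying figures aren't reproduced in this post.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cost_per_finding(total_cost_usd: float, true_positives: int) -> float:
    """Total spend divided by the number of confirmed vulnerabilities."""
    return total_cost_usd / true_positives


# Hypothetical inputs, chosen only to illustrate the arithmetic.
print(f"F1: {f1_score(0.30, 0.557):.2f}")
print(f"Cost per finding: ${cost_per_finding(17.00, 100):.2f}")
```

Because F1 is a harmonic mean, a model can't buy a high score by flagging everything or by flagging almost nothing. That makes a seven-point gap on the same dataset more meaningful than the same gap in raw recall would be.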

It runs on a Mac Studio

Here is the fact that made me grin. Unsloth shipped quantized builds within days of the weights dropping. The 2-bit dynamic quant compresses the full 1.5TB model down to about 239GB, which fits on a single 256GB Mac Studio (or a rig with four RTX 3090s and 192GB of system RAM), running through llama.cpp at three to nine tokens a second.

It’s slow. It’s a tinkerer’s setup, not a production deployment. But a model topping a leaderboard above Fable 5, running on a workstation you could buy from the Apple Store today if it weren’t perpetually sold out, is the whole you-cannot-recall-a-weights-file argument made literal. The weights are free and unbannable now; it’s the silicon to run them that turned into the scarce part. The community framing wrote itself: “the open model nobody can ban.”
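The compression numbers can be sanity-checked with back-of-envelope arithmetic. This sketch assumes the 1.5TB figure is a 16-bit checkpoint and that all sizes are decimal gigabytes; neither is confirmed above, so treat the effective bit width as an estimate.

```python
FULL_MODEL_GB = 1500.0   # full model size cited in this post (1.5TB)
QUANT_GB = 239.0         # 2-bit dynamic quant size cited in this post
MAC_STUDIO_GB = 256.0    # unified memory on the Mac Studio cited in this post
ASSUMED_SOURCE_BITS = 16.0  # assumption: original weights are 16-bit

compression_ratio = FULL_MODEL_GB / QUANT_GB
# Average bits per weight after quantization, under the 16-bit assumption.
effective_bits = ASSUMED_SOURCE_BITS * QUANT_GB / FULL_MODEL_GB
# Memory left over for the OS, KV cache, and everything else.
headroom_gb = MAC_STUDIO_GB - QUANT_GB

print(f"Compression ratio: {compression_ratio:.1f}x")
print(f"Effective bits per weight: {effective_bits:.2f}")
print(f"Headroom on a 256GB machine: {headroom_gb:.0f}GB")
```

If the 16-bit assumption holds, the average lands somewhat above two bits per weight, which is what you'd expect from a scheme that keeps some layers at higher precision. The thin headroom also goes some way toward explaining the single-digit token rate.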

Trained without a single NVIDIA chip

The other receipt is upstream of all of them. GLM-5.2 was trained on roughly 100,000 Huawei Ascend 910B chips on the MindSpore stack, with zero NVIDIA involvement. The tradeoff is real: about 15% more compute time than an equivalent NVIDIA run, plus slower inference throughput. But the headline survives the asterisk: a frontier-tier model was built entirely off the US chip supply chain.

That should unsettle the export-control crowd more than any benchmark. The lever that’s supposed to throttle China’s frontier training assumes the chips are the choke point. This is a proof of concept that the choke point leaks.

The receipts have limits

None of this makes GLM-5.2 a frontier-killer. It trails Opus 4.8 on the hardest reasoning, the Semgrep result is one dataset, the Mac Studio setup runs at single-digit tokens per second, and the hosted Z.ai API still applies its own content filtering plus the data-residency questions that come with where it runs. “Open” still carries asterisks. The point isn’t that GLM won. It’s that the claims I told you to discount turned out to be cashable.

The trash talk wrote itself

The market noticed. Zhipu’s stock spiked as much as 42% intraday on June 22 and crossed a trillion Hong Kong dollars (about US$128B) in market cap, on a JPMorgan note projecting 534% revenue growth this year. It’s up roughly 1,700% since its January IPO.

And the confidence is showing. When Elon Musk predicted China was “probably” a year away from a Fable 5 rival, which would put it in early 2027, the GLM team’s Jie Tang replied on X that it “won’t take that long,” under a banner that doubles as the entire thesis of the open-weight pack:

GLM-5.2 is fully open. Frontier intelligence belongs to everyone.
— Jie Tang, GLM team, on X

That’s a swagger you earn by shipping, not by filing an S-1 and calling for a brake pedal. Eleven days ago the case for GLM-5.2 rested on Zhipu’s word. Today it rests on Design Arena, Semgrep, Fireworks, Unsloth, and a hundred thousand Huawei chips. The asterisks expired. The model kept shipping through the panic, exactly like the open pack always does.
