Who Does the Refusal Actually Stop?

A few weeks ago I was deep in an authorized security engagement - scoped, signed off, the kind of work I’m paid and permitted to do - and Claude Opus had decided I was the enemy.

Not for anything exotic. The refusals crept in on the mundane stuff: hardening an auth flow, reviewing an architecture for weak points, asking the model to reason about how an attacker might approach a system so I could close the gap first. Each time, the same shape - a benign request, pattern-matched to “security,” quietly downgraded or refused outright. I spent more of the day fighting the tool than the problem.

So I did what a growing number of security people are doing as Anthropic tightens the walls. I stopped reaching for the frontier model and pointed Pi, a minimal coding harness I actually control, at an open-weight model instead: GLM-5.2. The work got unblocked.

That switch is a small story. The reason it keeps happening is a much bigger one, and it should bother anyone who cares about security.

The asymmetry nobody at the safety table wants to name

Here is the uncomfortable shape of it. The people who get refused are the ones operating in the open: paying for a subscription, using the official API, working inside the terms of service, applying to the access programs. The people who don’t get refused are the ones who were never going to ask permission in the first place.

Real threat actors use self-hosted models anyway, so legitimate defensive security researchers - not attackers - are the ones being affected by these blocks.
— Tim Becker, security practitioner

An attacker downloads an open-weight model, strips the safety tuning (the International AI Safety Report notes open-weight safeguards are far easier to remove than closed ones), and runs it on their own hardware. No refusals, no logs, no gatekeeper. The guardrail that just cost me an afternoon cost them nothing, because they were never standing at the gate. Safety alignment, calibrated this bluntly, taxes precisely the people who follow the rules and waves through the ones who don’t.

It isn’t me, and it isn’t one model

When I first hit this I assumed it was me: bad prompting, a request that looked spookier than it was. It isn’t.

Opus 4.7 earned The Register headline “overzealous query cop” in April after a surge of complaints - more than thirty formal ones in a single month. Developers reported it refusing to proofread a cybersecurity lab “containing simple crypto exercises,” refusing to analyze its own source code, refusing to write a toy illustration of a bug it had just flagged. A filed Claude Code issue catalogues debuggers, profilers, and binary-format parsers being misread as cyber-intent and killed mid-session.

Then Fable 5 arrived in June and made it worse. This time the complaints came with names attached. Valentina Palmiotti of IBM X-Force said it “rejects any request that could be tangentially cyber related. Even innocuous tasks like reading a blog post.” An immunologist found the word “cancer” flagged as a biosecurity risk. One user was refused on the first turn of a session whose only input was the word “hello.” And Matt Suiche of Tolmo described the exact failure I’d been living with:

If you ask it to write secure code, it assumes it is cybersecurity related work instead of software engineering best practices, and you get downgraded.
— Matt Suiche, Tolmo

Read that again. Writing secure code - the most defensive act in the entire discipline - trips the wire. This is the part that should end the debate: there is no attacker-uplift story when the request is “help me make this harder to attack.” That’s not dual-use. That’s the model refusing to let you defend.

The official exits are locked or ghosted

Anthropic isn’t unaware of the problem. They built an escape hatch: the Cyber Verification Program, a free, application-based path for security professionals that unblocks exactly the dual-use work the default safeguards refuse - exploit review, payload generation, vulnerability research. They aim to respond within two business days.

I applied. I never heard back.

Maybe it’s backlogged. But the contrast with the flagship effort is hard to miss. Project Glasswing - Anthropic’s headline defensive-security initiative - launched with eleven companies: AWS, Google, Microsoft, JPMorgan, CrowdStrike, and friends. A hundred million dollars in credits. The frontier Mythos model. It is, by design, a room you get invited into, not one you apply to. The program for the rest of us has a two-day SLA it apparently doesn’t always honour; the real capability sits behind a velvet rope sized for the Fortune 500. If you’re an individual practitioner, the message is quietly consistent: you are not the customer this was built for.

What I actually switched to

GLM-5.2 is Zhipu’s open-weight model - 753B parameters, about 40B active per token, MIT licensed, a million-token context window. On Pi, pointed at it, it engaged with the work the frontier model wouldn’t touch. I run it through OpenRouter rather than Zhipu’s first-party API - routed to US-based providers with zero data retention, which keeps client work out of China and off anyone’s logs - paying per token, at a fraction of frontier pricing. The results have been good enough to keep paying.

But the point of this post is not “open model good, Claude bad,” so let me be honest about what I switched to.

GLM-5.2 is not a magic unlock

GLM-5.2 still isn’t a no-guardrails tool. The weights carry some refusals of their own, and Zhipu’s first-party API stacks more on top, along with the data-residency questions that come with a China-hosted endpoint - which is exactly why the route matters. Going through OpenRouter to US-based, zero-retention providers sidesteps the China concern; the genuinely unrestricted path is self-hosting the MIT weights, which means real hardware. It’s also slower than the frontier: it’ll burn about 45,000 reasoning tokens on a task GPT-5.5 does in 16,000, and on hard architecture it lags by a noticeable margin. Z.ai even acknowledged some reward-hacking during its own evaluations. It’s a workhorse, not a miracle.

None of that makes it a frontier-killer. It makes it a tool that lets me do my job - which, lately, is more than I could say for the model I was paying twice as much for. The open model’s benchmark claims were thin at launch, so I came back to them once the independent receipts landed - including a security eval that beat Claude Code outright.

The fix isn’t fewer guardrails

It would be cheap to end on “kill the safety theater.” The honest position is harder than that.

The labs aren’t wrong that this is dual-use. “I’m authorized” is unverifiable from a prompt - the model can’t see your engagement letter. Exploit code that helps a defender helps an attacker just as well. Anthropic carries genuine liability, and there’s a defensible case that some friction is the price of not handing frontier capabilities to anyone who asks nicely. I don’t want to pretend that tension isn’t real.

But blunt refusal is the wrong instrument, and Anthropic now says so itself. After the Fable backlash, they apologized:

We made the wrong tradeoff and we apologize for not getting the balance right.
— Anthropic, June 2026

Their own system cards tell the same story across generations: Opus 4.8 shipped with “over-refusals substantially reduced,” which is a quiet admission that 4.7 over-refused. They are iterating in the right direction. The trouble is that iteration moves on the timescale of model releases, while the asymmetry is live every single day in between - and every day it’s live, it’s the defenders eating the cost.

The actual fix is the thing they half-built and under-resourced: verified access that works. Prove you’re a professional once, get treated like one. A two-business-day SLA that’s actually two business days. A path for individuals, not just a consortium for incumbents. Until that exists, the calibration we have doesn’t stop the attackers - they self-hosted their way around it before breakfast. It just quietly decides which defenders are connected enough to do their jobs, and hands everyone else a reason to go download a model that never says no.

Who Does the Refusal Actually Stop?

The asymmetry nobody at the safety table wants to name

It isn’t me, and it isn’t one model

The official exits are locked or ghosted

What I actually switched to

The fix isn’t fewer guardrails

Share this article

Related Posts

The Window Closed: Kimi K3 Shipped and the Ban Never Came

Paragraph Nine Was the Payload: Nvidia's Open-Weights Letter and Anthropic Alone

The One Thing Washington Can't Ban: Kimi K3 and the Open-Weight Problem