Opus 4.7 shipped yesterday. The benchmarks are real. The bills are bigger. And for the first time, Claude ships with an automated cybersecurity chaperone that decides whether your request is allowed to reach the model at all.

Three true things at once. Let’s take them in order.

What Actually Got Better

On Anthropic’s own 93-task coding eval, 4.7 resolves 13% more tasks than 4.6. Cursor reports +12% on CursorBench (70% vs 58%). XBOW’s visual-acuity benchmark went from 54.5% to 98.5%, which is the kind of jump that changes what you can actually build with computer-use agents. Rakuten-SWE-Bench reports 3x more production tasks resolved.

The partner quotes in the announcement read like a greatest-hits reel:

  • Devin: “works coherently for hours, pushes through hard problems rather than giving up”
  • Notion: “first model to pass our implicit-need tests… a third of the tool errors”
  • Vercel: “it even does proofs on systems code before starting work”
  • Warp: “worked through a tricky concurrency bug Opus 4.6 couldn’t crack”

The adaptive thinking mess from 4.6 has been partly addressed by a new xhigh effort level sitting between high and max. In Claude Code, xhigh is now the default. Anthropic’s internal numbers show xhigh giving most of max’s reasoning quality at meaningfully lower latency.

There’s also a /ultrareview slash command that spawns a dedicated review session against your diffs, and auto mode (previously limited to smaller cohorts) is now available to all Max users: a middle ground between “approve every tool call” and “skip all permissions”.

This is a genuine capability release. If you hand off your hardest coding work to 4.7, you will probably ship faster than you did with 4.6.

The effort dial matters now

If you’re on the API and your workload is not agentic coding, start with high, not xhigh. The xhigh default in Claude Code exists because coding agents benefit from deeper reasoning on later turns. For one-shot prompts and short tasks, xhigh just burns tokens. Measure before you pick.
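If you do pick an effort level explicitly, it is worth encoding the decision rather than hardcoding it. A minimal sketch, assuming the effort knob is a top-level request field named "effort" and the model id follows the usual naming pattern — both are assumptions, not confirmed API details; check the current API reference before shipping:

```python
def build_request(prompt: str, agentic: bool) -> dict:
    """Build a Messages-style request body, defaulting to 'high' effort
    for one-shot prompts and reserving 'xhigh' for agentic coding runs,
    where deeper reasoning on later turns actually pays off."""
    return {
        "model": "claude-opus-4-7",  # hypothetical model id
        "max_tokens": 2048,
        "effort": "xhigh" if agentic else "high",  # field name assumed
        "messages": [{"role": "user", "content": prompt}],
    }

# One-shot summarisation: no reason to pay the xhigh premium.
one_shot = build_request("Summarise this changelog.", agentic=False)
```

The point is the branch, not the field name: make the effort level a function of the workload, and A/B the two settings on your own traffic before trusting either default.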

The Quiet Story: Vision Caught Up

The coding benchmarks are leading the press release, but the vision jump is the part that quietly redraws the map of what you can build. Claude had a reputation for being second-tier on multimodal. Gemini 3.1 Pro made OCR, screenshots, PDFs, and charts feel like first-class inputs. Claude 4.6 wanted you to pre-process everything into text. That gap closed yesterday.

The raw numbers:

  • Image resolution: up to 2,576 px on the long edge (~3.75 MP), a 3.3x increase in pixel area over 4.6. No API parameter, just pass the image.
  • XBOW visual acuity (pick the right button on a real auth screen): 54.5% → 98.5%. A 44-point jump on a benchmark that was actively regressing across Opus versions.
  • CharXiv-R (reasoning over dense scientific charts, no tools): 68.7% → 82.1%. Mythos sits at 86.1%.
  • Visual navigation without tools (Anthropic internal): 57.7% → 79.5%.
  • OSWorld-Verified (computer-use across real desktop apps): 72.7% → 78.0%, ahead of GPT-5.4 (75.0%) and within 1.6 points of Mythos.
  • OfficeQA Pro (Databricks): 21% fewer errors than 4.6 on document reasoning grounded in source material.

The practical unlock is that you can now pass high-resolution screenshots, architectural diagrams, patent drawings, and scientific figures directly and get back answers that are grounded in what’s actually on the page. Solve Intelligence is using it for life-sciences patent workflows (invalidity charting, infringement detection) that needed pixel-perfect chemical structure reads. XBOW’s computer-use agent now clicks the right button on auth screens roughly 49 times out of 50 instead of 1 in 2.
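“No API parameter, just pass the image” means the request shape is the ordinary base64 image content block; the only thing that changed is that you can skip the downscaling step. A sketch of what that message construction looks like (the content-block shape matches the existing Messages API; the no-resize claim is this release’s, not mine):

```python
import base64


def screenshot_message(png_bytes: bytes, question: str) -> dict:
    """Pair a raw screenshot with a question in a single user message,
    using the base64 image content-block shape the Messages API accepts.
    No client-side resize: the 2,576 px long-edge ceiling is the point."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }


msg = screenshot_message(b"\x89PNG...", "Which button submits the form?")
```

If your pipeline currently has a “shrink to fit” step before this call, that step is now the thing costing you accuracy.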

So Where Does That Leave Gemini 3.1?

Honest comparison, not a victory lap:

  • OCR and dense document extraction: Gemini 3.1 Pro is still genuinely excellent here. Google’s native aspect-ratio preservation and media_resolution parameter give it fine-grained control Claude doesn’t expose, and its agentic “zoom and annotate” pattern (where Gemini writes Python to crop its own inputs) remains a nice trick.
  • MMMLU: Opus 4.7 91.5% vs Gemini 3.1 Pro 92.6%. Basically tied.
  • Computer-use / OSWorld: Opus 4.7 now leads the public tier at 78.0%. The XBOW number is the one that matters: when your agent has to actually click a thing on a real screen, 4.7 is now reliable in a way Gemini never advertised against.
  • Chart and diagram reasoning: 4.7’s CharXiv jump lands it in the same neighbourhood as Mythos. This is the category where Gemini 3’s advantage was most visible. It’s gone now.

What this enables that 4.6 could not

Agents that actually drive a browser or desktop. OCR pipelines that don’t need a pre-processing step. PDF reasoning that doesn’t lose the figures. Code review that can read a screenshot of a failing UI. Scientific paper analysis that can handle the formulas. Every one of these was theoretically possible on 4.6 and practically unreliable. That changed overnight.

The dual-use tension applies here too. The same vision jump that makes legitimate computer-use agents viable also makes CAPTCHA solving and screen-scraping agents viable. Expect the next round of bot-defence vendors to feel this within weeks.

The Tokenizer Tax

Here’s the detail buried in the migration guide. Opus 4.7 uses an updated tokenizer. Same text, more tokens. Anthropic quotes the multiplier as 1.0x to 1.35x depending on content type.

Combine that with xhigh being the new default, and a model that thinks more on later turns in agentic settings, and you get the picture:

  • Input tokens go up (tokenizer)
  • Thinking tokens go up (xhigh default, more turns deeper)
  • Output tokens go up (more thorough responses)
  • Per-token price stays at $5/$25 per million

Do the math. A workload that cost you $100/day on 4.6 at medium effort could easily run $150-180/day on 4.7 at xhigh, for a +13% benchmark lift. That ratio is not obviously good.
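Doing the math concretely: take an illustrative workload that split its $100/day evenly between input and output spend on 4.6 (the split is an assumption for the sake of arithmetic, not a measured figure), then apply the 1.35x tokenizer multiplier to input and a plausible thinking-inflation multiplier to output:

```python
def daily_cost(input_mtok: float, output_mtok: float,
               in_mult: float = 1.0, out_mult: float = 1.0) -> float:
    """Daily spend in dollars at $5/M input and $25/M output tokens,
    scaled by a tokenizer multiplier on input and an effort/thinking
    multiplier on output."""
    return 5 * input_mtok * in_mult + 25 * output_mtok * out_mult

# Baseline: 10M input + 2M output per day on 4.6 -> $100/day.
base = daily_cost(10, 2)
# 4.7 at xhigh: 1.35x tokenizer on input, ~1.8x thinking/output growth
# (the output multiplier is a guess; measure your own traffic).
worst = daily_cost(10, 2, in_mult=1.35, out_mult=1.8)  # 157.50
```

That lands the example squarely in the quoted $150-180/day band without any exotic assumptions. The lever you control is `out_mult`: effort level and task budgets move it, the tokenizer multiplier you just eat.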

A reviewer at Yahoo Tech reports burning through an entire token quota in a single session, calling 4.7 “a token eating machine”. Hacker News commenters are flagging the same thing: the same prompt now bills more, and the default effort level does nothing to help.

Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

— Anthropic migration guide

Anthropic’s answer is task budgets, now in public beta. You set a hard token ceiling per autonomous run and Claude prioritises work against it. This is the right primitive. It’s also a tacit admission that without a budget, 4.7 will happily spend all the money you let it spend.
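The primitive itself is worth understanding independently of the beta: a hard ceiling that ends the run when cumulative spend crosses it, finished or not. A client-side sketch of the same idea — the beta enforces this server-side, and the per-turn usage shape here is assumed for illustration:

```python
def run_with_budget(turns: list[dict], budget_tokens: int) -> tuple[list[str], int]:
    """Execute turns in order until the next one would exceed the token
    budget, then stop. Returns the names of completed turns and spend.
    A deliberately blunt ceiling: no turn may push spend past budget."""
    spent = 0
    completed = []
    for turn in turns:
        cost = turn["tokens"]
        if spent + cost > budget_tokens:
            break  # the budget, not the task list, decides when we stop
        spent += cost
        completed.append(turn["name"])
    return completed, spent

plan = [{"name": "triage", "tokens": 40},
        {"name": "fix", "tokens": 50},
        {"name": "polish", "tokens": 30}]
done, spent = run_with_budget(plan, budget_tokens=100)  # drops "polish"
```

The instructive part is the break: a real budget means work gets cut, which is why Claude has to prioritise against it rather than just append to a plan.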

For Max plan users on Claude Code, task budgets don’t help: you pay a flat subscription and the meter is usage-against-quota, not dollars. What you’ll feel instead is a faster refill cycle on “you’ve reached your usage limit” messages. The trust tax compounds.

The Chaperone

Opus 4.7 is the first Claude model shipping with automated safeguards that detect and block requests that indicate prohibited or high-risk cybersecurity uses. Not post-hoc review. Not a warning banner. Automated blocking, at request time, before the model sees your prompt.

The rationale is Mythos. Anthropic’s unreleased top model demonstrated autonomous cyber capabilities that rival skilled human researchers. 4.7 is the cheaper, weaker shim they’re comfortable shipping publicly, after they “experimented with efforts to differentially reduce these capabilities” during training. In other words: they trained the cyber ability down on purpose, then added a filter on top to catch what leaked through.

If you’re a security professional who wants to use 4.7 for vulnerability research, pentesting, or red-teaming, you can apply to the new Cyber Verification Program. Vetted access to a capability you didn’t need permission for last week.

The obvious problems:

  • False positives are inevitable. Every keyword-adjacent security classifier refuses legitimate work. Opus 4.5’s benign-request refusal rate was already 0.23%, higher than Sonnet 4.5’s 0.05%. A dedicated cyber blocker can only push that number up.
  • Verification programs are gated. The criteria for “verified” status are vague, and the cadence is whatever Anthropic decides. Independent researchers, students, and anyone doing security work outside a known firm gets slower access to a product they pay for.
  • The public model is the “lite” model now. Mythos exists. It’s the one actually trained for frontier work. The version you can buy is the de-risked one with a chaperone bolted on.

XBOW, one of Anthropic’s partner firms for offensive security tooling, published their first-look writeup and found 4.7 took smaller, more atomic steps than 4.6. Given the same token budget, it got further. Given the same number of actions, it found fewer vulnerabilities. The capability is still there. But you have to coach it into being ambitious, and you have to pay for more turns to get it.

If you rely on Claude for security work

Check your prompts against 4.7 before you migrate. The automated blocker runs on intent heuristics, not your reputation. Legitimate work gets refused. Apply to the Cyber Verification Program now if you haven’t, and budget for the possibility that your current pipeline needs rewriting.

What Didn’t Get Fixed

A few things 4.7 does not solve:

  • Adaptive thinking is still uneven. The xhigh level helps, but HN commenters are still reporting “adaptive thinking chooses to not think when it should”, mirroring the 4.6 complaints. xhigh-as-default is a hammer, not a fix.
  • Thinking transparency went backwards. Reasoning summaries are now hidden by default unless you pass "display": "summarized". You’re paying for thinking tokens you can’t see without asking for them.
  • Harm-reduction advice regressed. Anthropic’s own alignment write-up says 4.7 is “modestly weaker” than 4.6 on “tendency to give overly detailed harm-reduction advice on controlled substances”. So the chaperone is tight on cyber, looser on drugs. Optimise for your threat model.
  • The trust tax keeps compounding. After April’s cache-TTL silent change, the $600 surprise bills, and the 1.45M banned accounts, launching a model that bills more for the same prompt and blocks a class of paid work is not a trust-restoring move.
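On the transparency regression specifically: if summaries are opt-in, the opt-in belongs in your request builder, not scattered across call sites. A sketch, assuming "display": "summarized" lives on the thinking config block — the field placement is my assumption; only the key and value come from the migration guide as quoted above:

```python
def thinking_config(show_summary: bool = True) -> dict:
    """Build a thinking config that opts back in to reasoning summaries.
    The 'display' placement inside the thinking block is assumed, not
    confirmed; the budget field mirrors the existing extended-thinking
    API shape."""
    cfg = {"type": "enabled", "budget_tokens": 8000}
    if show_summary:
        cfg["display"] = "summarized"  # opt back in to visible reasoning
    return cfg
```

You are billed for those thinking tokens either way; the only question is whether you can see what you paid for.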

The Shape of the Deal

Anthropic is now running two products: the public Claude, which is capable, hungry, and chaperoned; and the internal model, which is the actual frontier. Every release from here looks like this. The public tier moves forward on coding and vision and agentic coherence. The capabilities that scare people get gated, trained down, or both.

If you write code for a living, Opus 4.7 is probably a direct upgrade. Budget for the meter and re-tune your prompts for a model that now takes instructions literally instead of loosely.

If you do security work, keep 4.6 in your pocket and apply to the verification program. If your prompts start bouncing, you’ll know why.

Our best model yet. It is more agentic, more intelligent, runs for longer.

— Boris Cherny, Anthropic, on 4.6

All of that is true for 4.7 too. The rest of the sentence is the part that’s new: and it costs more to run, and it will say no more often, and the model that doesn’t have those constraints isn’t for sale.