The stock market reacted before the benchmarks even loaded.
Thomson Reuters fell 16%. LegalZoom dropped nearly 20%. The software sector ETF had its worst day since the April 2025 tariff crash. All because Anthropic demoed its Cowork plugins doing the kind of research and analysis that entire SaaS verticals charge enterprise rates for.
Opus 4.6 is a good model. But the market reaction tells you more about where this is heading than any benchmark.
What Actually Shipped
Released February 5, 2026. The headlines:
- 1M token context window (beta): First for Opus-class models. API-only for now. More on what this means for the dumb zone below.
- Agent teams: Multiple Claude Code agents working in parallel. One on frontend, one on the API, one on migration. Each owns its piece and coordinates directly with the others. Research preview.
- Adaptive thinking: The model decides when extended reasoning helps. No more toggling between thinking modes manually.
- Context compaction (beta): Auto-summarizes older conversation context during long-running tasks. The model sustains multi-hour sessions without hitting limits.
- Effort controls: Four levels (low, medium, high, max) for intelligence/speed/cost tradeoffs at the API level (see the sketch after this list).
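For the API-side features, here's roughly what a call looks like. This is a sketch, not lifted from Anthropic's docs: the model ID, the 1M-context beta flag, and the effort field are placeholders for whatever the current documentation actually specifies.

```python
# Minimal sketch of calling Opus 4.6 with an effort setting and the 1M-token
# context beta. The model ID, the beta flag name, and the "effort" field are
# assumptions; check the current API docs before relying on them.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-opus-4-6",              # assumed model ID
    betas=["context-1m-2026-02-05"],      # hypothetical beta flag for the 1M window
    max_tokens=4096,
    extra_body={"effort": "high"},        # hypothetical effort control: low/medium/high/max
    messages=[
        {"role": "user", "content": "Review this 600k-token monorepo dump and list the risky migrations."}
    ],
)
print(response.content[0].text)
```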
Pricing: Unchanged at $5/$25 per million input/output tokens. Premium pricing kicks in above 200k tokens.
Benchmarks:
| Benchmark | Opus 4.5 | Opus 4.6 | Notes |
|---|---|---|---|
| GDPval-AA (knowledge work) | ~1,416 Elo | 1,606 Elo | +190 Elo, also beats GPT-5.2 by 144 |
| Terminal-Bench 2.0 (agentic coding) | 59.3% | 65.4% | Highest industry score |
| SWE-bench Verified | 80.9% | ~80.8% | Regressed 0.1%, benchmark saturated |
| Humanity’s Last Exam | - | Leading | Complex multi-discipline reasoning |
SWE-bench actually regressed 0.1% from Opus 4.5. HN commenters rightly note the benchmark is saturated: “A regression of such small magnitude could mean many things or nothing.” The real gains are in sustained agentic work, not one-shot patches.
Shrinking the Dumb Zone
The dumb zone has been a recurring theme on this blog: fill past 40% of your context window and reasoning quality degrades. Models pay attention to the beginning and end but lose the middle. It’s why /clear between tasks became gospel, why Ralph Wiggum starts each iteration fresh, why context management mattered more than raw intelligence.
Opus 4.6 attacks this directly. The MRCR v2 score - 76% vs Sonnet 4.5’s 18.5% - measures exactly what the dumb zone describes: retrieval accuracy deep in context. That’s a 4x improvement. Developers on HN testing it against 900+ documents report near-perfect precision where previous models failed entirely.
Combined with context compaction (lossy summarization of older turns) and the 1M token window, the practical effect is that the dumb zone threshold moved. You can fill more context before quality degrades. Long-running agentic sessions that previously required manual /clear checkpoints can now sustain themselves.
But 76% still means 1-in-4 retrievals from deep context fail. Context compaction is summarization, not perfect recall. Information still gets lost. The dumb zone got narrower, not eliminated. For critical multi-step workflows where every retrieval matters, the 12 Factor Agents principle still holds: own your context window, stay lean, don’t assume the model remembers everything you told it.
The 40% rule of thumb probably moves to 55-60% with Opus 4.6. But the principle hasn’t changed: context is an attention budget, not infinite storage. Compaction buys you runway, not immunity.
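If you run your own agent loop, that principle reduces to a small guardrail: estimate how full the window is and compact or clear before you hit the degraded range. A minimal sketch, with illustrative thresholds and a crude token estimate rather than anything Anthropic publishes:

```python
# Rough guardrail for "own your context window": estimate fill, then compact
# older turns before quality degrades. Thresholds are rules of thumb, not
# published numbers.
CONTEXT_LIMIT = 200_000        # standard window; 1_000_000 on the beta tier
SOFT_LIMIT = 0.55              # old rule of thumb was ~0.40; 4.6 buys some headroom

def estimate_tokens(messages: list[dict]) -> int:
    # Crude heuristic: ~4 characters per token. Use a real tokenizer in practice.
    return sum(len(m["content"]) for m in messages) // 4

def manage_context(messages: list[dict], summarize) -> list[dict]:
    """Compact older turns once the window passes the soft limit."""
    if estimate_tokens(messages) / CONTEXT_LIMIT < SOFT_LIMIT:
        return messages
    # Keep the system prompt and the most recent turns verbatim;
    # replace everything in between with a lossy summary.
    head, middle, tail = messages[:1], messages[1:-6], messages[-6:]
    summary = {"role": "user", "content": "[Summary of earlier work]\n" + summarize(middle)}
    return head + [summary] + tail
```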
“Vibe Working”
Anthropic’s head of enterprise product, Scott White, told CNBC: “I think that we are now transitioning almost into vibe working.”
Not vibe coding. Vibe working.
The distinction matters. Vibe coding was about non-developers building software by describing what they wanted. Vibe working is about knowledge workers delegating entire workflows: financial analysis, legal research, document generation, data processing. Define the work, provide the inputs, let the model run.
“Vibe coding started to exist as a concept in software engineering, and people could now do things with their ideas. I think we are now transitioning almost into vibe working.”
— Scott White, Anthropic
This is the delegation thesis I wrote about with GPT-5.2, but Anthropic is making it concrete. GPT-5.2 proved delegation works for knowledge tasks. Opus 4.6 pairs that with the agentic infrastructure (agent teams, context compaction, adaptive thinking) to sustain those tasks across hours, not minutes.
Eighty percent of Anthropic’s business is enterprise customers. They’re not building for indie hackers anymore. They’re building for the workflows that spooked the stock market.
The timing is striking. Yesterday I wrote about Salesforce retreating from autonomous AI agents because Agentforce couldn’t handle deterministic enterprise workflows. Today Anthropic announces “vibe working” for those same enterprises. Same ambition, different architecture. Salesforce tried to make LLMs execute. Anthropic is making them delegate to tools that execute. The hybrid pattern wins again.
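The distinction fits in a few lines of code. In the hybrid pattern the model never executes anything itself: it picks a tool and supplies arguments, and deterministic code does the work. A stripped-down sketch, with a made-up tool (and an assumed model ID) for illustration:

```python
# Hybrid pattern sketch: the model chooses a tool and its arguments; plain
# deterministic code executes it. The tool and its schema are invented for
# illustration.
import anthropic

def run_revenue_report(quarter: str) -> str:
    # Deterministic business logic lives here: SQL, spreadsheets, whatever.
    return f"Revenue report for {quarter}: ..."

TOOLS = [{
    "name": "run_revenue_report",
    "description": "Generate the quarterly revenue report.",
    "input_schema": {
        "type": "object",
        "properties": {"quarter": {"type": "string"}},
        "required": ["quarter"],
    },
}]

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-opus-4-6",   # assumed model ID
    max_tokens=1024,
    tools=TOOLS,
    messages=[{"role": "user", "content": "Pull together the Q4 revenue report."}],
)

for block in msg.content:
    if block.type == "tool_use" and block.name == "run_revenue_report":
        print(run_revenue_report(**block.input))   # the execution step stays deterministic
```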
Agent Teams
Remember TeammateTool? The fully-implemented multi-agent system hiding in Claude Code’s binary, feature-flagged off? The one the community found by running strings on the binary?
It’s official now. Anthropic flipped the switch. “Agent teams” is TeammateTool with a product name. The 13 operations, the directory structures, the environment variables: all live. Enable it with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 in your settings or environment.
The architecture is what we predicted: a team lead coordinates, teammates work independently in their own context windows, a shared task list tracks progress, and a mailbox system handles inter-agent messaging. But the key upgrade over hub-and-spoke subagents: teammates communicate directly with each other. Peer-to-peer, not just reporting back to the orchestrator.
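To make the shape concrete, here's a toy version of that architecture: a shared task list plus per-agent mailboxes. None of this is Anthropic's implementation, just the pattern reduced to data structures:

```python
# Toy model of the agent-teams shape: a shared task list for coordination and
# per-agent mailboxes for direct peer-to-peer messages.
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Task:
    description: str
    owner: str | None = None
    status: str = "pending"      # pending -> in_progress -> done

@dataclass
class Team:
    tasks: list[Task] = field(default_factory=list)
    mailboxes: dict[str, Queue] = field(default_factory=dict)

    def add_agent(self, name: str) -> None:
        self.mailboxes[name] = Queue()

    def send(self, sender: str, recipient: str, message: str) -> None:
        # Peer-to-peer: any teammate can message any other, not just the lead.
        self.mailboxes[recipient].put((sender, message))

team = Team()
for name in ("lead", "frontend", "api", "migration"):
    team.add_agent(name)

team.tasks.append(Task("Add pagination to /users endpoint", owner="api"))
team.send("api", "frontend", "New page/per_page params are live on /users.")
print(team.mailboxes["frontend"].get())
```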
TechCrunch notes this shifts from “one agent working through tasks sequentially” to “multiple agents, each owning its piece.” Split pane mode via tmux or iTerm2 lets you watch all agents working simultaneously.
The practical question from HN: “Does anyone know if costs are falling enough to make multi-agent workflows economical?” Fair question. Running three Opus-class agents in parallel at $25/M output tokens adds up fast. It’s still experimental with known limitations: no session resumption, task status can lag, no nested teams.
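Back-of-envelope math, with assumed token volumes; only the $5/$25 per-million rates come from the announcement:

```python
# Rough cost for three Opus-class agents in parallel. Token volumes are
# assumptions for illustration; the $5/$25 rates are the published sub-200k tier.
INPUT_RATE, OUTPUT_RATE = 5 / 1e6, 25 / 1e6   # dollars per token

agents = 3
input_tokens_per_agent = 400_000     # assumed: multi-hour session, repeated context reads
output_tokens_per_agent = 150_000    # assumed

cost = agents * (input_tokens_per_agent * INPUT_RATE + output_tokens_per_agent * OUTPUT_RATE)
print(f"${cost:.2f}")   # ~$17.25 per run, before premium long-context rates kick in
```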
But the direction is clear. The 19-agent trap I warned about was over-fragmentation. Agent teams are coarser-grained: 2-4 specialized agents, not 19 microservices. That’s the right granularity.
500 Zero-Days
The headline that should concern everyone: Opus 4.6 autonomously found 500+ zero-day vulnerabilities in open-source code. Buffer overflows in OpenSC, crash bugs in GhostScript, memory corruption in CGIF. No specific instructions. Just access to Python and standard security tools.
The model didn’t just find bugs. It wrote its own proof-of-concept exploits to confirm they were real.
Anthropic added six new cybersecurity probes specifically because Opus 4.6 shows “enhanced cybersecurity abilities.” They’re directing the capability toward defensive use. But the same model that finds zero-days defensively can find them offensively. The gap between “security research tool” and “vulnerability factory” is narrowing.
This is the when not to use AI problem at a new scale. The model’s 2x improvement on computational biology benchmarks raises similar questions. Capabilities that are transformative for drug discovery are concerning for biosecurity. Anthropic’s safety team ran their most extensive testing ever for this release. Whether that’s enough is an open question.
What It Doesn’t Solve
- Claude Code UX gripes persist: HN developers complained about 3-4 second startup times and inefficiencies in the React-based terminal UI. Model intelligence is ahead of the tooling experience.
- 1M context is beta and API-only: Subscription users don’t get it yet. The killer feature is gated behind developer access.
- The skill gap widens: As I wrote about GPT-5.2’s delegation problem, these models amplify good scoping and bad scoping equally. Most people still can’t define six hours of work clearly enough to hand it to a model.
- Enterprise adoption is slower than demos: The stock market panicked about replacement, but security concerns, compliance requirements, and institutional inertia are real friction. Demos aren’t deployments.
- Where’s Sonnet 5?: A Vertex AI log leak revealed “Fennec” (Sonnet 5) with a Feb 3 checkpoint. Early evals showed 82.1% on SWE-bench and coding stronger than Opus 4.5 in some workflows, at $3/M input. Everyone expected it this week. Instead we got Opus 4.6. For API developers paying per token, a cheaper Sonnet that rivals Opus would matter more than a better Opus.
Where This Leaves Us
44% of enterprises now use Anthropic in production. For developers, agent teams move us from “AI assists my coding” to “AI runs my codebase while I review.” The developers on HN who describe being “in reviewer mode more often than coding mode” aren’t complaining. They’re describing the new job.
“It feels like horizontally scaling yourself. Agents implement features while you review previous work.”
— Developer on Hacker News
The delegation era I wrote about two months ago just got its infrastructure. Context that doesn’t rot. Agents that coordinate. Thinking that adapts to the problem. The model is ready. The question is whether we are: can you scope work clearly enough to hand it to a system that’s now capable of running it?


