Opus 4.5 scores 80.9% on SWE-bench Verified. The same model scores 45.89% on the contamination-free Pro split. OpenAI has quietly stopped reporting Verified at all. Vendor benchmark cards are marketing.
Read more →
Lightrun: 43% of AI-generated code changes need debugging in production after passing QA. CodeRabbit: 1.7x the bugs, 2.74x the security issues, 8x the I/O problems. METR: 19% slower while feeling 20% faster. The numerator is what gets reported. The denominator is what nobody puts in the deck.
Read more →
Uncle Bob, TDD's most famous evangelist, posted on X that TDD is 'very inefficient for AIs' and that the agent is best thought of as 'a highly focused idiot savant.' Testing didn't die. It got more important. And the review target flipped.
Read more →
Traditional coders left every file they touched a little cleaner: the Boy Scout Rule. Agents broke it. They add, they don't subtract, and the codebase accretes faster than ever. A technique for putting cleanup back in as an explicit gate, not a virtue you hope for.
Read more →
A CTO once told me not to take people's Legos away. I ignored him, solved the team's problems myself, and got exactly what I optimised for: a sound plan and a team that couldn't stand me. In 2026, with agents laying the bricks, this is the lesson that matters.
Read more →
Berkeley just built an agent that games AI benchmarks. Karpathy called it months ago. The best coding model doesn't top the charts, the highest-ranked Chinese models disappoint in practice, and the entire leaderboard industry optimizes for the wrong thing.
Read more →
Enterprise architecture patterns were designed for a world where code was expensive to write and expensive to change. That world ended. The patterns didn't get the memo.
Read more →
The bottleneck isn't AI capability - it's that developers lack design vocabulary. Impeccable bridges the gap, and the Tessl benchmarks prove it: 1.59x improvement over baseline.
Read more →
Frontier models top out at 68% compliance with 500 instructions. Every rule you add makes every other rule less likely to be followed. The research explains why.
Read more →
AI coding tools create a legal paradox: the code you ship likely can't be copyrighted, but it might infringe someone else's. All the liability, none of the protection.
Read more →
Tokens are nouns. Patterns are verbs. The missing layer is grammar: a shared vocabulary that spans Figma, web, and native without breaking when someone ships a 'small' change.
Read more →
Salesforce quietly walked back autonomous AI agents to deterministic scripting. The pattern reveals when LLMs work - and when they don't.
Read more →
Boris Cherny followed up his personal workflow with tips from across the team. Same tool, different people, different approaches. The patterns worth stealing.
Read more →
Boris Cherny shared his workflow for the tool he built. The setup is surprisingly vanilla. The philosophy is worth studying.
Read more →
Factory AI's Luke predicts the future isn't more powerful models - it's AI that enforces software engineering best practices by default. Here's why that matters more than you think.
Read more →
The arguments about vibe coding and junior developers miss what software engineering was always about: shipping products, not typing code.
Read more →
HumanLayer's 12-factor agents codifies what works in production AI: own your context, keep agents small, stay out of the dumb zone.
Read more →
Opus 4.5 shipped Plan Mode as a core workflow. The workarounds are obsolete. And the case for auto-compact finally tips in favor of enabling it.
Read more →
At around 30 employees, growing companies either mature or become toxic. Here's the playbook for organizational dysfunction - and why your engineering leaders keep leaving.
Read more →
What happens when you lose external validation and discover what actually matters: the work itself.
Read more →
The latest C# release continues its quiet war on ceremony with field-backed properties, extension members, and smarter spans. Here's what matters and what doesn't.
Read more →
.NET 10's shebang support and file-based apps turn C# into a scripting language. No more context-switching to Python for quick scripts.
Read more →
Why coding interviews optimized for 2010 fail to identify great engineers in 2025, and why orgs can't adapt fast enough.
Read more →
Real-time AI generation vs curated libraries: lessons from building the same product twice with radically different architectures.
Read more →
Building a multi-stage AI content pipeline where each generation depends on the last. Lessons from generating thousands of hybrid creatures with resilient error handling.
Read more →
MCPs, subagents, and automation are tempting. But the developers getting the most from Claude Code aren't rushing to advanced features - they're mastering the fundamentals.
Read more →