GitHub agent-authored PRs went from 4 million to 17 million in six months. A developer pairing with an agent ships five to six PRs a day. Stripe ships 1,145 a day. The same throughput asymmetry that DDoSed open-source review (covered in The Speedrun That Broke Open Source) is now hitting internal release engineering. Volume went up. Review, QA, change-approval, and the deploy queue did not.
DORA 2025 names the failure mode: Acceleration Whiplash. Throughput up. Stability down. MTTR barely moves. Bad days still land on humans.
The Numbers
Receipts to anchor the framing.
- GitHub Octoverse 2025: 518.7M PRs merged, up 29% YoY. Agent-authored alone jumped 4M to 17M in six months.
- DX Q4 2025 (135K devs): 22% of merged code is AI-authored. Daily AI users merge ~60% more PRs.
- CodeRabbit, 13M PRs: AI-co-authored PRs ship 1.7x more findings, 1.4x critical, ~75% more logic and correctness issues.
- Lightrun 2026: 43% of AI-generated changes need debugging in production after passing QA and staging.
- DORA 2025: increased AI adoption correlates with a 7.2% drop in delivery stability.
The numerator (PRs merged) is up. The denominator (rework, incidents, on-call) is up faster.
The Receipts
Three recent reminders that the release layer is where the bill lands.
- Amazon, March 2026: two outages traced to AI-assisted code deployed without proper approval. A 6-hour disruption on March 2 cost ~120,000 orders. A second outage on March 5 dropped US order volume by 99% for ~6 hours, ~6.3M orders. Response: a 90-day code safety reset and a hard requirement that AI-authored changes get senior approval across 335 critical systems.
- Anthropic Claude Code, March 2026: a 59.8MB source map shipped inside `@anthropic-ai/claude-code` v2.1.88 on npm, exposing 512,000 lines of TypeScript across 1,906 files. Anthropic’s own framing: “a release packaging issue caused by human error, not a security breach.” A manual step in a release pipeline shipping at npm scale - the same shape of failure that a `*.map` ignore rule or a production-build gate would have made impossible. The same mechanism had leaked an earlier version 13 months prior (deeper analysis). The fix didn’t stick because the gate was still a person.
- SaaStr, July 2025: an autonomous coding agent ran `DROP DATABASE` against production during a code freeze, then fabricated 4,000 fake user accounts and false logs to obscure the trail.
These are dramatic, but the per-PR pattern is the boring one. A 1.7x bump in critical findings against ten times more PRs is not a small absolute number. It is a rolling tax on the on-call.
What Scales
The teams shipping at machine speed are not relying on heroics at the release window. They are taking it apart.
- Decouple merge from release. Feature flags become the safety net. The agent merges; code lands behind a flag; a human controls exposure. Reverting a flag is one click. Reverting a release is a paged team. LaunchDarkly’s 2025 platform added explicit AI Configs. Cloudflare launched Flagship as “feature flags built for the age of AI.”
- Tiny PRs, automated rollout. Stripe’s 1,145 PRs a day is not bravado. It is the only model that survives the rate. Each PR reviewable in under five minutes, revertible without drama, almost every service auto-deployed.
- Auto-rollback on the four golden signals. Argo Rollouts and Flagger watch latency, errors, saturation, and traffic, and revert without human prompting. The rollback budget was always finite. Make it the system’s job.
- Risk score the PR before merge. Greptile v3 builds a repo-wide dependency graph and assigns a blast-radius score to every diff. High-risk routes to humans, low-risk rides the auto-merge train. The flood gets sorted instead of triaged.
- Auto-batch the deploy queue. Shopify’s Shipit batches merges into deploys to break the logjam when ten agents ship at once. “Wait for the next deploy window” stops working at this volume.
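The rollback item above reduces to a threshold check run on a loop. A minimal sketch of the decision that tools like Argo Rollouts and Flagger automate; the metric names and threshold values here are illustrative, not any tool's defaults:

```python
from dataclasses import dataclass

# Compare a canary's four golden signals against limits and decide
# promote vs. rollback with no human in the loop. All limits below
# are illustrative placeholders.

@dataclass
class GoldenSignals:
    p99_latency_ms: float
    error_rate: float       # fraction of requests that failed
    saturation: float       # resource utilization, 0.0-1.0
    rps: float              # traffic, requests per second

LIMITS = GoldenSignals(p99_latency_ms=500, error_rate=0.01, saturation=0.85, rps=0)

def verdict(canary: GoldenSignals, baseline: GoldenSignals) -> str:
    """'rollback' if the canary breaches any limit, or if its traffic
    craters relative to the stable baseline; otherwise 'promote'."""
    if canary.error_rate > LIMITS.error_rate:
        return "rollback"
    if canary.p99_latency_ms > LIMITS.p99_latency_ms:
        return "rollback"
    if canary.saturation > LIMITS.saturation:
        return "rollback"
    if canary.rps < baseline.rps * 0.5:  # traffic cratered: routing or health checks failing
        return "rollback"
    return "promote"
```

In the real tools this check runs against Prometheus or a similar backend on an interval while traffic shifts in steps; the sketch captures only the decision, which is the part that used to be a human.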
> “Automatically initiated and monitored deployments are more reliable than those babysat by humans.” — Patrick Collison, Stripe Sessions 2025
The Pattern, Borrowed
The same playbook applies as in cascading AI pipelines: treat each stage as independently recoverable. Status tracking, fail-fast dependency checks, preview mode for expensive stages, rollback when something breaks. Replace “image” with “PR” and “video” with “deploy” and the architecture is identical. Both pipelines are too fast for a human in the middle of every step. Both work when the human owns the checkpoints and the system owns the transitions.
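The "independently recoverable stages" pattern above can be sketched in a few lines. Stage names, the status vocabulary, and the structure are illustrative, not any particular pipeline's API:

```python
# Each stage records its status, refuses to run if its dependency didn't
# succeed (fail fast), and carries its own rollback hook. On failure,
# unwinding runs in reverse over the stages that actually completed.

class Stage:
    def __init__(self, name, run, rollback, depends_on=None):
        self.name = name
        self.run = run                # callable doing the work
        self.rollback = rollback      # callable undoing it
        self.depends_on = depends_on  # upstream Stage or None
        self.status = "pending"

def execute(stages):
    completed = []
    for stage in stages:
        if stage.depends_on and stage.depends_on.status != "ok":
            stage.status = "skipped"  # fail fast: upstream never succeeded
            continue
        try:
            stage.run()
            stage.status = "ok"
            completed.append(stage)
        except Exception:
            stage.status = "failed"
            for prev in reversed(completed):  # unwind only what finished
                prev.rollback()
            break
    return {s.name: s.status for s in stages}
```

Swap the stage bodies for "merge", "deploy", "verify" and the status map is exactly the checkpoint the human owns while the system owns the transitions.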
Charity Majors made this argument years before agents existed: “One of the most powerful things you can do is have a short, fast enough deploy cycle that you can ship one commit per deploy.” More right now than when she wrote it.
What This Doesn’t Solve
Honest pushback.
- Stripe and Shopify aren’t your shop. 1,145 PRs per day requires test infra most teams don’t have and won’t have this year. Cargo-culting continuous deploy without the underlying investment is the failure mode DORA is naming.
- Feature flags accrete. Long-lived flags become permanent forks. Agentic teams will rack up flag debt faster than they retire it. Cleanup is real work, and nobody's name sits next to it on the sprint board.
- Auto-rollback assumes telemetry exists. Most teams do not have the four signals at user-journey granularity. Without them, “auto-rollback” is “delayed manual rollback with extra steps.”
- MTTR is the next bottleneck. As deploys per day climb, on-call load climbs linearly. Humans still own the bad day, and AI tools have done the least to help there.
- Risk scores are advisory, not authoritative. A model judging another model’s diff is not a closed loop. Useful filter. Not a verdict.
If your team is shipping more PRs but the same number of releases, the gap is filling with risk. Pick a single repo. Add one feature flag. Decouple one merge from one release. The rest of the playbook gets cheaper from there.
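That first flag can be genuinely small. A minimal sketch of the pattern, with an in-memory dict standing in for a flag service and hypothetical handler names (`new_checkout`, `legacy_checkout`):

```python
# Agent code merges dark behind a flag; a human controls exposure at
# runtime. FLAGS stands in for a service like LaunchDarkly; the flag
# name and handlers are illustrative.

FLAGS = {"new-checkout-flow": {"enabled": False, "rollout_pct": 0}}

def flag_on(name: str, user_id: int) -> bool:
    flag = FLAGS.get(name)
    if not flag or not flag["enabled"]:
        return False
    # Deterministic percentage rollout: the same user always gets the
    # same variant, and rollout_pct controls how many users see it.
    return (user_id % 100) < flag["rollout_pct"]

def legacy_checkout(user_id: int) -> str:
    return f"legacy:{user_id}"

def new_checkout(user_id: int) -> str:
    return f"new:{user_id}"  # the merged-but-dark agent-authored path

def checkout(user_id: int) -> str:
    if flag_on("new-checkout-flow", user_id):
        return new_checkout(user_id)
    return legacy_checkout(user_id)
```

Reverting is flipping `enabled` back to False: no deploy, no paged team.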
Closing
The release bottleneck is the next thing AI breaks. It already broke open-source review. The internal version is quieter and more expensive: outages from a hundred agent merges, on-call burnout, the 14-day rework tail, and the deploy that should have been ten deploys.
Agents will keep merging. The work that lasts is the work that turns the release window from a meeting into a metric. Friday’s deploy is over. Shipping is now a system, and the system has to be built.


