Anthropic’s engineering team just published “Effective Harnesses for Long-Running Agents”. It addresses a fundamental challenge: agents struggle to maintain progress across context windows. Each new session starts without memory of prior work.

This isn’t just an academic problem. It’s the core infrastructure challenge for anyone building agentic products. The patterns Anthropic describes - progress files, feature lists, session protocols - are exactly what I’ve been building into SpecPilot. Their DIY approach validates the architecture, but also reveals why this needs to become product, not just patterns.

The Problem: Context Window Amnesia

Every time an agent session ends, the work context disappears. The next session starts fresh: no memory of what was built, what failed, what decisions were made. For short tasks, this doesn’t matter. For multi-hour or multi-day projects, it’s fatal.

AI agents struggle to maintain progress across multiple context windows. Each new session starts without memory of prior work, requiring agents to bridge gaps between “coding shifts” despite limited context windows.

— Anthropic Engineering

The symptoms are familiar to anyone who’s tried to use Claude Code for real projects:

  • Agent declares victory prematurely (didn’t read the full requirements)
  • Agent redoes work from a previous session (no progress awareness)
  • Agent leaves buggy code because it can’t remember the broader context
  • Agent wastes time on setup that was already completed

Anthropic’s Two-Agent Pattern

Their solution splits the work into two agent types:

Initializer Agent (first session):

  • Creates init.sh for running development environments
  • Writes claude-progress.txt documenting work history
  • Generates comprehensive feature lists (200+ requirements in JSON; sketched after this list)
  • Makes initial git commit showing project structure
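
In practice, those artifacts are small and boring, which is the point. Here is a minimal Python sketch of what the initializer might write; Anthropic doesn’t publish an exact schema, so the field names below are my assumptions:

```python
import json
from pathlib import Path

# Hypothetical feature-list entry. Anthropic describes JSON feature lists
# with 200+ requirements; this particular schema is an assumption.
features = [
    {
        "id": "F-001",
        "description": "User can reset password via email link",
        "status": "not_started",    # not_started | in_progress | done
        "verified_by_test": False,  # flipped only after tests pass
    },
]
Path("features.json").write_text(json.dumps(features, indent=2))

# The progress file every later session reads before doing any work.
Path("claude-progress.txt").write_text(
    "Session 1: scaffolded project, wrote features.json, made initial commit.\n"
)
```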

Coding Agent (subsequent sessions):

  • Reads progress files and git logs first (always)
  • Implements one feature at a time
  • Commits changes with descriptive messages
  • Leaves code in a clean, mergeable state

The key insight: external artifacts become the agent’s memory. Progress files, git history, and structured feature lists persist across sessions. Each agent session reconstructs context from these artifacts before doing any work.

Session startup protocol

Every coding agent begins by running pwd to confirm its location, reading the progress documentation, reviewing the feature list, and running the existing tests. Only then does it start implementation. This ritualized startup prevents the “where was I?” problem.
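
Expressed as code, the ritual is a few deterministic steps. A sketch, assuming the hypothetical features.json and claude-progress.txt from the initializer example above (the real harness drives this through prompts rather than a fixed script):

```python
import json
import subprocess
from pathlib import Path

def start_session() -> dict:
    """Rebuild working context from persistent artifacts before coding."""
    context = {}
    # 1. Confirm where we are.
    context["cwd"] = subprocess.run(
        ["pwd"], capture_output=True, text=True, check=True
    ).stdout.strip()
    # 2. Read what previous sessions recorded.
    context["progress"] = Path("claude-progress.txt").read_text()
    # 3. Review the feature list and pick the next incomplete item.
    features = json.loads(Path("features.json").read_text())
    context["next_feature"] = next(
        (f for f in features if f["status"] != "done"), None
    )
    # 4. Run existing tests so regressions surface before new work begins.
    context["tests_pass"] = subprocess.run(["pytest", "-q"]).returncode == 0
    return context
```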

From Patterns to Product

These patterns work for engineers who can wire up shell scripts and maintain progress files. But they don’t scale to non-technical users or teams that need reliability without ops overhead.

That’s where products come in. Here’s how Anthropic’s patterns map to product features:

Anthropic Pattern → Product Implementation

  • claude-progress.txt → Real-time agent run feed with structured events
  • Feature lists (JSON) → Spec import, task breakdown, board integration
  • Session startup protocol → Agent orchestration with context injection
  • Incremental commits → Automated branch/commit/PR workflows
  • Human intervention triggers → Pause/resume controls, terminal takeover
  • End-to-end testing → Integrated test runners with session reuse

The product abstracts the harness. Users write specs and review PRs. The infrastructure - progress tracking, context management, session handoffs - becomes invisible.

Why This Matters for the SDLC Collapse

I wrote about the SDLC collapsing into continuous flow with human checkpoints. Anthropic’s harness patterns are the implementation detail that makes this possible.

When agents can:

  • Read specs from a feature list
  • Track their own progress in persistent files
  • Commit incrementally and recover from failures
  • Hand off cleanly between sessions

…the traditional phase boundaries dissolve. Planning feeds directly into implementation. Testing happens inline. Documentation updates as code changes. The harness enables the collapse.

The phases collapse because the harness maintains continuity. Without persistent progress tracking, each session would restart the entire SDLC.

What the DIY Approach Misses

Anthropic’s patterns are engineering-focused. They assume you can:

  • Write and maintain shell scripts for session initialization
  • Design JSON schemas for feature tracking
  • Debug agent behavior when startup protocols fail
  • Handle edge cases in progress file parsing

For engineers building their own agents, this is table stakes. But the market opportunity is the 90% of teams who need agentic workflows without the ops burden.

What products must add:

  • Visual progress tracking: Not just logs - timeline views, status indicators, screenshot captures
  • Structured intervention: Not just “human intervenes” - specific controls for pause, resume, redirect, abort
  • Recovery automation: When agents fail mid-session, automatic rollback and retry logic (see the sketch after this list)
  • Multi-agent coordination: Specialized agents (testing, review, documentation) working in concert
  • Non-technical interfaces: Specs in natural language, not JSON; boards, not git commands
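
For the recovery bullet, the enabling trick is that incremental commits double as restore points. A hedged sketch of rollback-and-retry, assuming git checkpoints and a hypothetical task callable:

```python
import subprocess

def run_with_recovery(task, max_attempts: int = 3) -> bool:
    """Retry a failed agent task from the last clean git checkpoint."""
    for attempt in range(max_attempts):
        # The incremental-commit discipline guarantees HEAD is a clean state.
        checkpoint = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        try:
            task()  # e.g. one agent session implementing one feature
            return True
        except Exception:
            # Discard the failed attempt so the next try starts clean.
            subprocess.run(["git", "reset", "--hard", checkpoint], check=True)
            subprocess.run(["git", "clean", "-fd"], check=True)
    return False
```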

The Human-in-the-Loop Design

Both Anthropic’s patterns and product implementations share a critical constraint: humans must stay in the loop at judgment points.

Anthropic handles this with:

  • Feature lists that prevent scope creep
  • Explicit “don’t mark complete without testing” rules
  • Git commits as checkpoints for human review

Products handle this with:

  • Approval gates before deployment (a minimal sketch follows this list)
  • PR reviews before merge
  • Pause controls when agents hit blockers
  • Alert systems for decisions requiring human judgment
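
At its smallest, an approval gate is just a blocking prompt; products wire the same idea to a UI. A toy sketch (the function and its wording are hypothetical):

```python
def approval_gate(summary: str) -> bool:
    """Block until a human approves or rejects the agent's proposed action."""
    print(f"Agent requests approval:\n{summary}")
    return input("Approve? [y/N] ").strip().lower() == "y"

# Gate a deployment behind explicit human judgment.
if approval_gate("Deploy feature F-001 to staging (12 files changed)"):
    print("proceeding")
else:
    print("paused; awaiting human redirection")
```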

The judgment layer doesn’t automate

The harness automates execution, not decisions. Which features to build, whether code is correct, when to ship - these remain human responsibilities. The harness just ensures agents don’t lose context between the decision points.

Practical Takeaways

If you’re building agentic products or workflows:

  • Progress files are essential: Some form of persistent progress tracking must survive across sessions. Structured logs, database records, git history - the format matters less than the fact that it exists.
  • Session startup rituals matter: Agents should always read context before acting. Make this explicit in prompts and verify it in logs.
  • Incremental commits enable recovery: Don’t wait for “done” to commit. Each completed unit of work should be checkpointed (sketched below).
  • Feature lists prevent premature completion: Comprehensive requirements, tracked explicitly, stop agents from declaring victory too early.
  • Human intervention needs structure: “Pause and ask for help” isn’t enough. Define when, how, and what information to surface.
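
To make the incremental-commit takeaway concrete, a checkpoint helper can pair each commit with a progress-log entry so both forms of memory stay in sync. A sketch, reusing the hypothetical file names from earlier:

```python
import subprocess
from pathlib import Path

def checkpoint(feature_id: str, message: str) -> None:
    """Record completed work so any later session can resume from it."""
    # Append to the progress log first, so the log travels with the commit.
    with Path("claude-progress.txt").open("a") as log:
        log.write(f"{feature_id}: {message}\n")
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"{feature_id}: {message}"], check=True)
```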

Where This Is Heading

Anthropic hints at future directions: specialized agents for testing, QA, and code cleanup. Multi-agent systems where different agents handle different phases. Extensions beyond web development into scientific research and financial modeling.

The harness becomes the orchestration layer. Products built on these patterns will compete on:

  • How seamlessly they handle session handoffs
  • How much context they can maintain across long projects
  • How naturally humans can intervene and redirect
  • How reliably agents recover from failures

The patterns are published. The infrastructure is known. The race is to productize it well.


The core insight: agents need external memory to work on real projects. Anthropic’s patterns show how engineers can build this. Products like SpecPilot show how to make it accessible. The harness is the bridge between stateless AI and stateful work.

For a deeper dive into spec-driven workflows, see Product Engineering: The New Superpower, which covers GitHub’s Spec Kit approach to the same problem.

Read the original

Anthropic’s full article is worth reading: Effective Harnesses for Long-Running Agents. It includes specific techniques for feature list design, testing strategies, and failure mode handling.