Anthropic’s engineering team just published “Effective Harnesses for Long-Running Agents”. It addresses a fundamental challenge: agents struggle to maintain progress across context windows. Each new session starts without memory of prior work.

This isn’t just an academic problem. It’s the core infrastructure challenge for anyone building agentic products. The patterns Anthropic describes - progress files, feature lists, session protocols - are exactly what I’ve been building into SpecPilot. Their DIY approach validates the architecture, but also reveals why this needs to become product, not just patterns.

The Problem: Context Window Amnesia

Every time an agent session ends, the work context disappears. The next session starts fresh: no memory of what was built, what failed, what decisions were made. For short tasks, this doesn’t matter. For multi-hour or multi-day projects, it’s fatal.

AI agents struggle to maintain progress across multiple context windows. Each new session starts without memory of prior work, requiring agents to bridge gaps between “coding shifts” despite limited context windows.

— Anthropic Engineering

The symptoms are familiar to anyone who’s tried to use Claude Code for real projects:

  • Agent declares victory prematurely (didn’t read the full requirements)
  • Agent redoes work from a previous session (no progress awareness)
  • Agent leaves buggy code because it can’t remember the broader context
  • Agent wastes time on setup that was already completed

Anthropic’s Two-Agent Pattern

Their solution splits the work into two agent types:

Initializer Agent (first session):

  • Creates init.sh for running development environments
  • Writes claude-progress.txt documenting work history
  • Generates comprehensive feature lists (200+ requirements in JSON; sketched after this list)
  • Makes initial git commit showing project structure
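
In practice, those artifacts are small and boring, which is the point. Here is a minimal Python sketch of what the initializer might write; Anthropic doesn’t publish an exact schema, so the field names below are my assumptions:

```python
import json
from pathlib import Path

# Hypothetical feature-list entry. Anthropic describes JSON feature lists
# with 200+ requirements; this particular schema is an assumption.
features = [
    {
        "id": "F-001",
        "description": "User can reset password via email link",
        "status": "not_started",    # not_started | in_progress | done
        "verified_by_test": False,  # flipped only after tests pass
    },
]
Path("features.json").write_text(json.dumps(features, indent=2))

# The progress file every later session reads before doing any work.
Path("claude-progress.txt").write_text(
    "Session 1: scaffolded project, wrote features.json, made initial commit.\n"
)
```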

Coding Agent (subsequent sessions):

  • Reads progress files and git logs first (always)
  • Implements one feature at a time
  • Commits changes with descriptive messages
  • Leaves code in a clean, mergeable state

The key insight: external artifacts become the agent’s memory. Progress files, git history, and structured feature lists persist across sessions. Each agent session reconstructs context from these artifacts before doing any work.

Session startup protocol

Every coding agent begins by running pwd to confirm its location, reading the progress documentation, reviewing the feature list, and running the existing tests. Only then does it start implementation. This ritualized startup prevents the “where was I?” problem.
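
Expressed as code, the ritual is a few deterministic steps. A sketch, assuming the hypothetical features.json and claude-progress.txt from the initializer example above (the real harness drives this through prompts rather than a fixed script):

```python
import json
import subprocess
from pathlib import Path

def start_session() -> dict:
    """Rebuild working context from persistent artifacts before coding."""
    context = {}
    # 1. Confirm where we are.
    context["cwd"] = subprocess.run(
        ["pwd"], capture_output=True, text=True, check=True
    ).stdout.strip()
    # 2. Read what previous sessions recorded.
    context["progress"] = Path("claude-progress.txt").read_text()
    # 3. Review the feature list and pick the next incomplete item.
    features = json.loads(Path("features.json").read_text())
    context["next_feature"] = next(
        (f for f in features if f["status"] != "done"), None
    )
    # 4. Run existing tests so regressions surface before new work begins.
    context["tests_pass"] = subprocess.run(["pytest", "-q"]).returncode == 0
    return context
```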

From Patterns to Product

These patterns work for engineers who can wire up shell scripts and maintain progress files. But they don’t scale to non-technical users or teams that need reliability without ops overhead.

That’s where products come in. Here’s how Anthropic’s patterns map to product features:

Anthropic Pattern → Product Implementation

  • claude-progress.txt → Real-time agent run feed with structured events
  • Feature lists (JSON) → Spec import, task breakdown, board integration
  • Session startup protocol → Agent orchestration with context injection
  • Incremental commits → Automated branch/commit/PR workflows
  • Human intervention triggers → Pause/resume controls, terminal takeover
  • End-to-end testing → Integrated test runners with session reuse

The product abstracts the harness. Users write specs and review PRs. The infrastructure - progress tracking, context management, session handoffs - becomes invisible.

Why This Matters for the SDLC Collapse

I wrote about the SDLC collapsing into continuous flow with human checkpoints. Anthropic’s harness patterns are the implementation detail that makes this possible.

When agents can:

  • Read specs from a feature list
  • Track their own progress in persistent files
  • Commit incrementally and recover from failures
  • Hand off cleanly between sessions

…the traditional phase boundaries dissolve. Planning feeds directly into implementation. Testing happens inline. Documentation updates as code changes. The harness enables the collapse.

The phases collapse because the harness maintains continuity. Without persistent progress tracking, each session would restart the entire SDLC.

What the DIY Approach Misses

Anthropic’s patterns are engineering-focused. They assume you can:

  • Write and maintain shell scripts for session initialization
  • Design JSON schemas for feature tracking
  • Debug agent behavior when startup protocols fail
  • Handle edge cases in progress file parsing

For engineers building their own agents, this is table stakes. But the market opportunity is the 90% of teams who need agentic workflows without the ops burden.

What products must add:

  • Visual progress tracking: Not just logs - timeline views, status indicators, screenshot captures
  • Structured intervention: Not just “human intervenes” - specific controls for pause, resume, redirect, abort
  • Recovery automation: When agents fail mid-session, automatic rollback and retry logic (see the sketch after this list)
  • Multi-agent coordination: Specialized agents (testing, review, documentation) working in concert
  • Non-technical interfaces: Specs in natural language, not JSON; boards, not git commands
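
For the recovery bullet, the enabling trick is that incremental commits double as restore points. A hedged sketch of rollback-and-retry, assuming git checkpoints and a hypothetical task callable:

```python
import subprocess

def run_with_recovery(task, max_attempts: int = 3) -> bool:
    """Retry a failed agent task from the last clean git checkpoint."""
    for attempt in range(max_attempts):
        # The incremental-commit discipline guarantees HEAD is a clean state.
        checkpoint = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        try:
            task()  # e.g. one agent session implementing one feature
            return True
        except Exception:
            # Discard the failed attempt so the next try starts clean.
            subprocess.run(["git", "reset", "--hard", checkpoint], check=True)
            subprocess.run(["git", "clean", "-fd"], check=True)
    return False
```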

The Human-in-the-Loop Design

Both Anthropic’s patterns and product implementations share a critical constraint: humans must stay in the loop at judgment points.

Anthropic handles this with:

  • Feature lists that prevent scope creep
  • Explicit “don’t mark complete without testing” rules
  • Git commits as checkpoints for human review

Products handle this with:

  • Approval gates before deployment (a minimal sketch follows this list)
  • PR reviews before merge
  • Pause controls when agents hit blockers
  • Alert systems for decisions requiring human judgment
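
At its smallest, an approval gate is just a blocking prompt; products wire the same idea to a UI. A toy sketch (the function and its wording are hypothetical):

```python
def approval_gate(summary: str) -> bool:
    """Block until a human approves or rejects the agent's proposed action."""
    print(f"Agent requests approval:\n{summary}")
    return input("Approve? [y/N] ").strip().lower() == "y"

# Gate a deployment behind explicit human judgment.
if approval_gate("Deploy feature F-001 to staging (12 files changed)"):
    print("proceeding")
else:
    print("paused; awaiting human redirection")
```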

The judgment layer doesn’t automate

The harness automates execution, not decisions. Which features to build, whether code is correct, when to ship - these remain human responsibilities. The harness just ensures agents don’t lose context between the decision points.

Practical Takeaways

If you’re building agentic products or workflows:

  • Progress files are essential: Some form of persistent progress tracking must survive across sessions. Structured logs, database records, git history - the format matters less than the fact that it exists.
  • Session startup rituals matter: Agents should always read context before acting. Make this explicit in prompts and verify it in logs.
  • Incremental commits enable recovery: Don’t wait for “done” to commit. Each completed unit of work should be checkpointed (sketched below).
  • Feature lists prevent premature completion: Comprehensive requirements, tracked explicitly, stop agents from declaring victory too early.
  • Human intervention needs structure: “Pause and ask for help” isn’t enough. Define when, how, and what information to surface.
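
To make the incremental-commit takeaway concrete, a checkpoint helper can pair each commit with a progress-log entry so both forms of memory stay in sync. A sketch, reusing the hypothetical file names from earlier:

```python
import subprocess
from pathlib import Path

def checkpoint(feature_id: str, message: str) -> None:
    """Record completed work so any later session can resume from it."""
    # Append to the progress log first, so the log travels with the commit.
    with Path("claude-progress.txt").open("a") as log:
        log.write(f"{feature_id}: {message}\n")
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"{feature_id}: {message}"], check=True)
```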

Where This Is Heading

Anthropic hints at future directions: specialized agents for testing, QA, and code cleanup. Multi-agent systems where different agents handle different phases. Extensions beyond web development into scientific research and financial modeling.

The harness becomes the orchestration layer. Products built on these patterns will compete on:

  • How seamlessly they handle session handoffs
  • How much context they can maintain across long projects
  • How naturally humans can intervene and redirect
  • How reliably agents recover from failures

The patterns are published. The infrastructure is known. The race is to productize it well.


The core insight: agents need external memory to work on real projects. Anthropic’s patterns show how engineers can build this. Products like SpecPilot show how to make it accessible. The harness is the bridge between stateless AI and stateful work.

For a deeper dive into spec-driven workflows, see Product Engineering: The New Superpower, which covers GitHub’s Spec Kit approach to the same problem.

Read the original

Anthropic’s full article is worth reading: Effective Harnesses for Long-Running Agents. It includes specific techniques for feature list design, testing strategies, and failure mode handling.