Anthropic’s engineering team just published “Effective Harnesses for Long-Running Agents”. It addresses a fundamental challenge: agents struggle to maintain progress across context windows. Each new session starts without memory of prior work.
This isn’t just an academic problem. It’s the core infrastructure challenge for anyone building agentic products. The patterns Anthropic describes - progress files, feature lists, session protocols - are exactly what I’ve been building into SpecPilot. Their DIY approach validates the architecture, but also reveals why this needs to become product, not just patterns.
The Problem: Context Window Amnesia
Every time an agent session ends, the work context disappears. The next session starts fresh: no memory of what was built, what failed, what decisions were made. For short tasks, this doesn’t matter. For multi-hour or multi-day projects, it’s fatal.
“AI agents struggle to maintain progress across multiple context windows. Each new session starts without memory of prior work, requiring agents to bridge gaps between ‘coding shifts’ despite limited context windows.” — Anthropic Engineering
The symptoms are familiar to anyone who’s tried to use Claude Code for real projects:
- Agent declares victory prematurely (didn’t read the full requirements)
- Agent redoes work from a previous session (no progress awareness)
- Agent leaves buggy code because it can’t remember the broader context
- Agent wastes time on setup that was already completed
Anthropic’s Two-Agent Pattern
Their solution splits the work into two agent types:
Initializer Agent (first session):
- Creates `init.sh` for running development environments
- Writes `claude-progress.txt` documenting work history
- Generates comprehensive feature lists (200+ requirements in JSON)
- Makes an initial git commit showing the project structure
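To make these artifacts concrete, here is a minimal sketch of what an initializer session might leave on disk. The file names follow Anthropic’s article; the feature-list fields (`id`, `description`, `status`) are my own illustrative assumption, not a published schema.

```python
import json
from pathlib import Path

# Hypothetical feature-list entries; Anthropic describes 200+ of these per project.
features = [
    {"id": "F-001", "description": "User can sign in with email", "status": "todo"},
    {"id": "F-002", "description": "Session persists across page reloads", "status": "todo"},
]
Path("features.json").write_text(json.dumps(features, indent=2))

# Plain-text progress notes the next session reads before doing anything.
Path("claude-progress.txt").write_text(
    "Session 1 (initializer):\n"
    "- Created init.sh, features.json, and the initial commit\n"
    "- Next: implement F-001\n"
)
```

Everything here is deliberately boring: plain files and git are the memory, so any future session (or any human) can inspect them.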
Coding Agent (subsequent sessions):
- Reads progress files and git logs first (always)
- Implements one feature at a time
- Commits changes with descriptive messages
- Leaves code in a clean, mergeable state
The key insight: external artifacts become the agent’s memory. Progress files, git history, and structured feature lists persist across sessions. Each agent session reconstructs context from these artifacts before doing any work.
Every coding agent begins the same way: run `pwd` to confirm its location, read the progress documentation, review the feature list, and run the existing tests. Only then does it start implementation. This ritualized startup prevents the “where was I?” problem.
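That ritual can be enforced by the harness instead of trusted to the prompt. Here is a minimal sketch of a startup step that reconstructs context from the artifacts above before any coding happens; the function and its details are mine, not Anthropic’s code.

```python
import subprocess
from pathlib import Path

def build_session_context() -> str:
    """Reconstruct working context from external artifacts, not model memory."""
    cwd = subprocess.run(["pwd"], capture_output=True, text=True).stdout.strip()
    progress = Path("claude-progress.txt").read_text()
    features = Path("features.json").read_text()
    git_log = subprocess.run(
        ["git", "log", "--oneline", "-10"], capture_output=True, text=True
    ).stdout
    # Run the existing tests so the agent knows the current state, not just the history.
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return (
        f"Working directory: {cwd}\n\n"
        f"Progress notes:\n{progress}\n"
        f"Feature list:\n{features}\n"
        f"Recent commits:\n{git_log}\n"
        f"Test status (exit {tests.returncode}):\n{tests.stdout[-2000:]}"
    )
```

The returned string gets injected at the top of the new session’s prompt, which is the whole trick: context is rebuilt, never assumed.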
From Patterns to Product
These patterns work for engineers who can wire up shell scripts and maintain progress files. But they don’t scale to non-technical users or teams that need reliability without ops overhead.
That’s where products come in. Here’s how Anthropic’s patterns map to product features:
| Anthropic Pattern | Product Implementation |
|---|---|
| `claude-progress.txt` | Real-time agent run feed with structured events |
| Feature lists (JSON) | Spec import, task breakdown, board integration |
| Session startup protocol | Agent orchestration with context injection |
| Incremental commits | Automated branch/commit/PR workflows |
| Human intervention triggers | Pause/resume controls, terminal takeover |
| End-to-end testing | Integrated test runners with session reuse |
The product abstracts the harness. Users write specs and review PRs. The infrastructure - progress tracking, context management, session handoffs - becomes invisible.
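As an illustration of what “structured events” could mean in the first row of that table: an append-only event log that a UI renders as a timeline. The event shape below is a hypothetical sketch, not SpecPilot’s actual schema.

```python
import json
import time

def emit_event(kind: str, detail: str, feature_id: str | None = None) -> None:
    """Append one structured run event; a product UI can render these as a feed."""
    event = {
        "ts": time.time(),
        "kind": kind,  # e.g. "commit", "test_run", "blocker", "handoff"
        "feature_id": feature_id,
        "detail": detail,
    }
    with open("run-events.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

emit_event("commit", "Implemented email sign-in, tests passing", feature_id="F-001")
```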
Why This Matters for the SDLC Collapse
I wrote about the SDLC collapsing into continuous flow with human checkpoints. Anthropic’s harness patterns are the implementation detail that makes this possible.
When agents can:
- Read specs from a feature list
- Track their own progress in persistent files
- Commit incrementally and recover from failures
- Hand off cleanly between sessions
…the traditional phase boundaries dissolve. Planning feeds directly into implementation. Testing happens inline. Documentation updates as code changes. The harness enables the collapse.
The phases collapse because the harness maintains continuity. Without persistent progress tracking, each session would restart the entire SDLC.
What the DIY Approach Misses
Anthropic’s patterns are engineering-focused. They assume you can:
- Write and maintain shell scripts for session initialization
- Design JSON schemas for feature tracking
- Debug agent behavior when startup protocols fail
- Handle edge cases in progress file parsing
For engineers building their own agents, this is table stakes. But the market opportunity is the 90% of teams who need agentic workflows without the ops burden.
What products must add:
- Visual progress tracking: Not just logs - timeline views, status indicators, screenshot captures
- Structured intervention: Not just “human intervenes” - specific controls for pause, resume, redirect, abort
- Recovery automation: When agents fail mid-session, automatic rollback and retry logic (see the sketch after this list)
- Multi-agent coordination: Specialized agents (testing, review, documentation) working in concert
- Non-technical interfaces: Specs in natural language, not JSON; boards, not git commands
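Here is a sketch of that recovery-automation idea: roll back to the last clean checkpoint and retry. The `task` callable stands in for one agent work unit, and the retry policy is an assumption for illustration.

```python
import subprocess
from typing import Callable

def run_with_rollback(task: Callable[[], None], max_retries: int = 2) -> bool:
    """Run one agent work unit; on failure, reset to the last commit and retry."""
    for attempt in range(1 + max_retries):
        try:
            task()  # hypothetical callable that drives a single agent step
            return True
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            # Discard uncommitted changes so the next attempt starts from the
            # last checkpoint instead of a half-broken working tree.
            subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)
            subprocess.run(["git", "clean", "-fd"], check=True)
    return False
```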
The Human-in-Loop Design
Both Anthropic’s patterns and product implementations share a critical constraint: humans must stay in the loop at judgment points.
Anthropic handles this with:
- Feature lists that prevent scope creep
- Explicit “don’t mark complete without testing” rules
- Git commits as checkpoints for human review
Products handle this with:
- Approval gates before deployment
- PR reviews before merge
- Pause controls when agents hit blockers
- Alert systems for decisions requiring human judgment
The harness automates execution, not decisions. Which features to build, whether code is correct, when to ship - these remain human responsibilities. The harness just ensures agents don’t lose context between the decision points.
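At the harness level, an approval gate can be nothing more than a blocking prompt: execution halts until a human decides. In a product this would be a UI control or webhook rather than stdin; the sketch below is purely illustrative.

```python
def request_approval(summary: str) -> bool:
    """Block the run until a human approves or rejects the next step."""
    print(f"Agent requests approval:\n{summary}")
    return input("Approve? [y/N] ").strip().lower() == "y"

if request_approval("Merge feature branch for F-001: all tests green"):
    print("Proceeding to merge.")
else:
    print("Paused: awaiting human redirection.")
```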
Practical Takeaways
If you’re building agentic products or workflows:
- Progress files are essential: Some form of persistent progress tracking must survive across sessions. Structured logs, database records, git history - the format matters less than the fact that one exists.
- Session startup rituals matter: Agents should always read context before acting. Make this explicit in prompts and verify it in logs.
- Incremental commits enable recovery: Don’t wait for “done” to commit. Each completed unit of work should be checkpointed (see the sketch after this list).
- Feature lists prevent premature completion: Comprehensive requirements, tracked explicitly, stop agents from declaring victory too early.
- Human intervention needs structure: “Pause and ask for help” isn’t enough. Define when, how, and what information to surface.
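For the incremental-commit takeaway, the checkpoint helper can be a few lines of git plumbing; the helper name and commit-message convention here are my own.

```python
import subprocess

def checkpoint(message: str) -> None:
    """Commit the current tree so a later failure can recover from this point."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

# Call after each completed unit of work, not only when the feature is "done":
checkpoint("F-001: add email sign-in form (tests passing)")
```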
Where This Is Heading
Anthropic hints at future directions: specialized agents for testing, QA, and code cleanup. Multi-agent systems where different agents handle different phases. Extensions beyond web development into scientific research and financial modeling.
The harness becomes the orchestration layer. Products built on these patterns will compete on:
- How seamlessly they handle session handoffs
- How much context they can maintain across long projects
- How naturally humans can intervene and redirect
- How reliably agents recover from failures
The patterns are published. The infrastructure is known. The race is to productize it well.
The core insight: agents need external memory to work on real projects. Anthropic’s patterns show how engineers can build this. Products like SpecPilot show how to make it accessible. The harness is the bridge between stateless AI and stateful work.
For a deeper dive into spec-driven workflows, see Product Engineering: The New Superpower, which covers GitHub’s Spec Kit approach to the same problem.
Anthropic’s full article is worth reading: Effective Harnesses for Long-Running Agents. It includes specific techniques for feature list design, testing strategies, and failure mode handling.