Andrej Karpathy pointed an AI agent at his own well-tuned GPT-2 training code last week. He gave it one GPU, one git branch, and a simple loop: change the code, train for five minutes, check if the number went down, keep or discard, repeat. Then he went to sleep.
By morning, the agent had run about 100 experiments. Over two days, across multiple sessions, it ran 125 experiments and found genuine improvements to code Karpathy had already optimized by hand. Each dot on the chart he posted is a complete training run. The ones that beat the baseline get committed. The ones that don’t get reset. 28,000 GitHub stars in five days.
He called it autoresearch. The README opens with a sci-fi framing:
“Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that’s right or wrong as the ‘code’ is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began.”

— Andrej Karpathy, autoresearch README
Part code, part sci-fi, and - in his words - a pinch of psychosis.
The Three-File Lab
The entire system is three files:
- `prepare.py` - Data prep and evaluation utilities. Locked. Nobody touches this.
- `train.py` - The model and training loop. ~630 lines. The agent modifies this freely.
- `program.md` - Research instructions, constraints, taste criteria. The human writes this.
The 630-line constraint is deliberate. The entire training script fits inside a single context window, so the agent can read and reason about the whole codebase at once. No multi-file confusion, no lost context, no hallucinated imports.
The human writes .md. The agent writes .py. Karpathy frames this as: “the human programs the organization; the agent programs the model.”
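The article doesn’t reproduce program.md itself. Based only on the details quoted here, a sketch of the kind of instructions it carries might look like this (the wording and structure below are illustrative, not the actual file):

```markdown
# Goal
Reduce validation loss on the 5-minute GPT-2 training run.

# Constraints
- Only modify train.py. Never touch prepare.py.
- Each experiment gets at most 5 minutes of GPU time.
- Commit improvements, reset failures, log everything to results.tsv.

# Taste
A tiny improvement that adds 20 lines of hacky code? Probably not worth it.
A tiny improvement from deleting code? Definitely keep.

# Loop
NEVER STOP. Do not pause to ask the human anything.
```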
Git as Agent Memory
This is the architectural decision that makes the whole thing work. The agent creates a branch (autoresearch/mar5), and every experiment follows the same cycle:
- Read `program.md` for instructions
- Modify `train.py` and commit the change
- Run training for up to 5 minutes
- Check the validation score
- If improved: keep the commit. If not: `git reset`
- Log everything to `results.tsv`
- Repeat
The branch history becomes the agent’s persistent memory. It can read git log to see what worked. It can read results.tsv to see what it already tried. No vector database. No embeddings. Just commits and a TSV file.
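Reading that memory back needs nothing beyond the standard library. The column names below are assumptions for illustration; the article doesn’t show the actual TSV schema.

```python
import csv
import io

def tried_before(tsv_text, description):
    """Check a results.tsv log (as text) for a previously-run experiment."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return any(row["description"] == description for row in reader)

log = ("id\tdescription\tval_loss\n"
       "1\thalve batch size\t2.91\n"
       "2\tchange seed to 137\t2.90\n")
tried_before(log, "halve batch size")  # True: skip this idea, it's in the log
```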
The program.md file includes one non-negotiable instruction in all caps: NEVER STOP. Once the loop begins, the agent must not pause to ask the human anything. It runs until manually killed. This is not a dialogue. It’s an infinite research loop.
Notably, this is the one thing Codex can’t do. Karpathy filed an issue noting that Codex “ignores instruction to never stop,” making it incompatible with autoresearch. Claude handles this natively. OpenAI engineers responded that they’re working on lifecycle hooks to support it.
What the Agent Actually Found
Karpathy posted session reports across GitHub Discussions and PRs. First session: 89 experiments in 7.5 hours. 15 kept, 74 discarded, zero crashes. Then a second overnight run stacked on those wins.
You don’t need to understand the ML specifics to appreciate what happened. The agent found three categories of improvement in code its author had already hand-optimized:
- Better resource allocation under constraints. The biggest win was halving the batch size. Counterintuitive: smaller batches mean less data per step. But within a fixed 5-minute window, smaller batches mean more steps. The agent discovered that iteration speed beats batch size when time is the bottleneck. An engineer would recognize this as the same tradeoff in any CI pipeline: sometimes more small deploys beat fewer large ones.
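The batch-size tradeoff is back-of-the-envelope arithmetic. The step times below are made up for illustration; real numbers depend on the model and GPU.

```python
BUDGET_S = 5 * 60  # fixed 5-minute training window, in seconds

def steps_in_budget(step_time_s):
    """How many optimizer steps fit in the wall-clock budget."""
    return BUDGET_S // step_time_s

full_batch = steps_in_budget(0.50)  # hypothetical 0.5 s/step -> 600 steps
half_batch = steps_in_budget(0.26)  # halved batch, a bit over half the
                                    # step time -> 1153 steps
# Half the data per step, but nearly twice the steps in the same window.
```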
- Narrow sweet spots humans don’t have patience to find. Several improvements sat in tight ranges. Changing an initialization parameter worked at 0.68x but failed at 0.66x. Adding a tiny amount of regularization helped, but slightly more made it worse. The agent found these by brute iteration across hundreds of runs. No human is going to manually test 0.68x vs 0.67x vs 0.66x at 5 minutes each.
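In code, that brute iteration is just a loop over a tight range of candidates. The loss function here is a toy stand-in for a 5-minute training run, with an artificial dip near 0.68:

```python
def fine_sweep(candidates, evaluate):
    """Evaluate every candidate; return (best_score, best_value), lower is better."""
    return min((evaluate(c), c) for c in candidates)

# Toy objective with a narrow sweet spot, standing in for real training runs.
toy_loss = lambda x: abs(x - 0.68) + 2.5
best_score, best_x = fine_sweep([0.66, 0.67, 0.68, 0.69], toy_loss)
# best_x == 0.68 for this toy objective
```

At 5 minutes per real evaluation, four candidates is 20 minutes of GPU time; the agent runs sweeps like this unattended.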
- Actual bugs. The agent found a missing multiplier that was making part of the model too diffuse, and suboptimal optimizer settings. These weren’t taste calls. They were mistakes. And they transferred to larger models, confirming they were real fixes, not artifacts of the small-scale setup.
The dead ends were instructive too. Some changes that sound reasonable in theory (sharing weights between components, adding a popular activation function) either broke the evaluation metric entirely or added complexity that ate the time budget. The agent learned this by trying and failing, not by reasoning about it.
Not Parameter Tuning
Automated parameter sweeps have existed for years. You define a search space (learning rate between 0.001 and 0.1, batch size 32 or 64 or 128), and the system tries combinations. The search is bounded by what the human defines upfront.
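For contrast, a classic bounded sweep looks like this: the human enumerates the space up front, and nothing outside it is reachable.

```python
from itertools import product

# The search space is fixed before the sweep starts.
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]

def grid_search(evaluate):
    """Exhaustively score every (lr, batch) pair; return the best config."""
    return min(product(learning_rates, batch_sizes), key=evaluate)

# Toy objective standing in for a real training run.
best = grid_search(lambda cfg: abs(cfg[0] - 0.01) + abs(cfg[1] - 64) / 1000)
# best == (0.01, 64)
```

However good the evaluator, this search can never restructure the model or delete a component; it only picks among nine predefined points.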
Autoresearch is fundamentally different. The agent modifies source code. It can restructure the architecture, swap algorithms, add entirely new techniques, or delete components. As Karpathy put it on Hacker News: “Agents can modify code arbitrarily, the notion of a ‘hyperparameter’ dissolves.” It’s closer to giving a junior engineer the codebase and saying “make it faster, I’ll check your work in the morning.”
The Lutke Experiment
Shopify CEO Tobi Lutke adapted the pattern overnight for an internal search quality task. After 37 experiments in 8 hours, a smaller model scored 19% higher than a model twice its size that had been manually configured.
This is the result that should make people pay attention. Not because the numbers are definitive (they aren’t - both Karpathy’s and Lutke’s results are preliminary, small-scale, and uncontrolled), but because the pattern is so simple to replicate. Three files, one GPU, one night.
What Autoresearch Doesn’t Solve
- Goodhart’s Law applies immediately. The agent optimizes for one metric. If that metric is gameable, the model looks better on paper and fails in production. Karpathy acknowledged this is fixable via `program.md` constraints, but it reveals how quickly agents find shortcuts.
- The agent runs out of ideas. Late in long sessions, experiments degrade to random seed changes and micro-adjustments. Diminishing returns hit fast.
- No isolation between improvements. Changes are stacked sequentially. There’s no way to know if improvement #15 still matters after #16 through #20 changed the surrounding code. Same problem as merging 20 PRs without testing them independently.
- Results aren’t portable. Community contributors found that the same optimization improves performance on one GPU but hurts on another. The 5-minute budget optimizes for your hardware, not generalizable findings.
- Costs are undisclosed. Neither Karpathy nor early adopters have published cost breakdowns for GPU time plus agent API usage.
In one run, the agent changed the random seed from 42 to 137 for a tiny gain. Its own report noted: “Seed 7 was worse. Make of that what you will.” This is eval overfitting in miniature. Any metric you expose to an autonomous loop will eventually get gamed.
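One mitigation the article doesn’t describe, but which follows naturally from the seed anecdote: hold out a second metric the loop never sees, and flag experiments where the two diverge. A minimal sketch of that idea, not part of the autoresearch repo:

```python
def gaming_suspect(opt_delta, heldout_delta, tolerance=0.0):
    """Flag an experiment where the optimized metric improved but a
    held-out metric the agent never sees got worse beyond tolerance.

    Deltas are (new - old); negative means improvement for a loss.
    """
    return opt_delta < 0 and heldout_delta > tolerance

gaming_suspect(-0.02, +0.05)  # True: looks like metric gaming
gaming_suspect(-0.02, -0.01)  # False: both metrics improved
```

This only detects gaming after the fact; it doesn’t prevent the loop from finding the shortcut in the first place.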
The Pattern, Not the Project
Autoresearch itself is narrow: one GPU, one training script, one metric. What’s interesting is the pattern underneath.
An agent with a fixed time budget, a clear evaluation metric, permission to modify code freely, and git as memory. That’s a general-purpose experimental loop. In five days, the community has adapted it for MuJoCo robotics simulation, adversarial protocol hardening (where it found compound edge cases missed by 359 hand-written tests), and multiple hardware platforms from H100s to Apple Silicon. Karpathy has floated a SETI@home-style distributed version where agents on different GPUs collaborate asynchronously. Someone already built it.
The program.md simplicity criterion captures the philosophy: “A tiny improvement that adds 20 lines of hacky code? Probably not worth it. A tiny improvement from deleting code? Definitely keep.” Taste, encoded as text, applied at scale.
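Encoded literally, that taste rule might look like the function below. This is a hypothetical scoring rule for illustration; in the real system the criterion stays as prose in program.md and the agent applies it by judgment, not by code.

```python
def keep_change(loss_delta, lines_delta):
    """Crude taste rule: reject tiny wins that add lots of code,
    always keep wins that delete code.

    loss_delta: change in validation loss (negative = improvement).
    lines_delta: change in line count of train.py.
    """
    if loss_delta >= 0:
        return False      # no improvement at all
    if lines_delta < 0:
        return True       # improvement from deleting code: definitely keep
    if lines_delta > 20 and loss_delta > -0.001:
        return False      # tiny win, 20+ lines of hacky code: not worth it
    return True

keep_change(-0.0005, 25)   # False: tiny win, too much new code
keep_change(-0.0005, -10)  # True: same tiny win, but the code got shorter
```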
If you’ve been keeping an agent running in the background, autoresearch is the logical next step. Not just delegating tasks. Delegating the entire hypothesis-experiment-evaluate loop.


