Eight days ago, Karpathy open-sourced a training loop that runs itself. One file, one metric, git as memory. 32,800 stars. 4,400 forks.

The forks are where it gets interesting. The community didn’t just clone the repo. They extracted the pattern and applied it to problems Karpathy never intended.

The Pattern Is Portable

Autoresearch’s design constraints - one mutable file, one evaluation metric, a fixed time budget, git as memory - were originally about fitting a training script into a context window. But those same constraints make the loop applicable to anything with a measurable outcome.
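Those four constraints map almost directly onto code. Here's a minimal sketch of the loop, where hypothetical `propose` and `evaluate` callables stand in for the agent and the eval harness; this is an illustration of the pattern, not Karpathy's actual implementation:

```python
def autoresearch_loop(artifact, propose, evaluate, budget):
    """One mutable artifact, one scalar metric (lower is better),
    a fixed budget, and revert-on-regression -- the whole pattern."""
    best_score = evaluate(artifact)
    log = [(artifact, best_score)]        # stands in for the git log
    for _ in range(budget):
        candidate = propose(artifact)     # the agent edits the artifact
        score = evaluate(candidate)       # run the single metric
        if score < best_score:            # improvement: "commit" it
            artifact, best_score = candidate, score
            log.append((artifact, score))
        # regression: candidate is discarded (the automatic revert)
    return artifact, best_score, log
```

Everything domain-specific lives in `propose` and `evaluate`, which is exactly why the derivatives below can swap in Triton kernels, agent code, or Terraform configs without touching the loop.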

Teams are now pointing it at:

  • GPU kernel optimization. Autokernel (608 stars) feeds the loop any PyTorch model and discovers faster Triton/CUDA kernels overnight. ~40 experiments per hour, prioritized by Amdahl’s Law. Recently added AMD ROCm support.
  • Apple Silicon. autoresearch-mlx (701 stars) runs natively on Mac via MLX. An M4 Max hit val_bpb 1.294 from a 2.667 baseline overnight. Interesting finding: smaller hardware favored different winning strategies than the H100 runs, reinforcing the “results aren’t portable” caveat from the original.
  • Agent code. Harrison Chase built autoresearch-agents (72 stars): an agent that iteratively improves another agent’s implementation, using LangSmith eval scores as the metric. Agents optimizing agents.
  • Everything else. autoresearch by Udit Goenka (216 stars) is domain-agnostic. Test coverage, bundle size, SEO scores, accessibility, Terraform compliance, email copy. If there’s a number, it loops.

The Constraint Is the Feature

The 630-line, single-file cap looked like a handicap. It's actually what makes the pattern portable. Any problem you can express as “modify this file, check this number” fits the loop. The constraint forces clarity about what you're actually optimizing.

Multi-Agent Direction

Karpathy hasn’t stopped at the solo loop. He’s demonstrated a multi-agent setup with a “chief scientist” in plan mode directing “junior engineers” running experiments in parallel tmux sessions. He’s also teased “SETI@home style” distributed autoresearch where agents on different GPUs collaborate asynchronously.

The git-as-memory pattern translates naturally to coordination: agents already publish results to GitHub Discussions that other agents can read. The infrastructure for distributed experimentation is mostly there. It’s a different kind of scaling - not more compute on one problem, but more agents exploring different hypotheses simultaneously and sharing what works.
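The sharing half of that protocol is also small. A toy sketch of one collaboration round, assuming a shared results board standing in for git/GitHub Discussions (the names here are mine, not from any of the repos):

```python
def distributed_round(agents, board, evaluate):
    """One round of async-style collaboration: each agent proposes from
    the best published result and publishes only improvements.
    `board` is a dict standing in for the shared git/Discussions state."""
    for name, propose in agents.items():
        base, base_score = board["best"]   # read what others published
        candidate = propose(base)          # explore a private hypothesis
        score = evaluate(candidate)
        if score < base_score:             # share only what works
            board["best"] = (candidate, score)
            board["log"].append((name, score))
    return board
```

The design choice worth noting: agents never block on each other. They read whatever is currently published and write only improvements, which is why stale reads cost wasted experiments rather than correctness.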

The Weird Ones

Not every derivative is serious ML research:

  • agent-factory (70 stars) scrapes Reddit, Hacker News, and Twitter for real problems, builds AI agents to solve them, and ships overnight. 20+ agents deployed so far, covering tax deductions, wage rights, and data broker opt-outs.
  • pi-autoresearch (1,377 stars) adds persistent sessions that survive restarts and context resets, a dashboard UI, and branch-aware experiment tracking. The most popular derivative by stars, largely because it makes the loop usable for non-ML tasks with a proper interface.

Autonomy scales when you constrain scope, clarify success, and mechanize verification.

— uditgoenka/autoresearch README

What’s Actually New Here

The individual components aren’t novel. Automated testing loops, CI/CD pipelines, hyperparameter sweeps - these patterns are decades old. What’s new is the combination with code-modifying agents.

A traditional CI loop runs the same code with different inputs. An autoresearch loop changes the code itself. The search space isn’t “which parameters work best” but “which code works best.” That’s a categorically different kind of automation.
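The distinction fits in a few lines. In this toy illustration (mine, not from the repo), the candidates are expression strings, i.e. code, scored against data; a hyperparameter sweep would instead vary numeric inputs to one fixed function:

```python
def score(expr, data):
    """Squared error of a candidate *code string* on the data.
    eval() is fine for a toy; a real loop edits and runs a file."""
    f = eval("lambda x: " + expr)
    return sum((f(x) - y) ** 2 for x, y in data)

# The search space is code itself, not parameters to fixed code.
data = [(x, 2 * x + 1) for x in range(5)]
candidates = ["x", "x + 1", "2 * x", "2 * x + 1", "3 * x"]
best = min(candidates, key=lambda e: score(e, data))
```

A real autoresearch loop generates candidates with an LLM instead of enumerating them, but the search space is the same kind of object: programs, ranked by one number.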

The 8-day explosion suggests this is hitting a nerve. Not because the pattern is complicated (it isn’t), but because it’s the first concrete workflow where “let the agent run overnight” produces reliably useful results. The constraint-heavy design - one file, one metric, fixed budget, automatic revert on failure - turns out to be exactly what makes agents productive unsupervised.

Same Caveats Apply

Every derivative inherits the original's limitations. Goodhart's Law still kicks in: optimize a proxy metric hard enough and it stops tracking what you actually care about. Agents still run out of ideas and resort to random changes. Results from one hardware setup still don't transfer to another. Distributed versions multiply these problems across nodes.

The Loop as Primitive

The trajectory is clear. Autoresearch started as “AI trains a neural network overnight.” Eight days later, it’s “AI optimizes anything measurable overnight.” The pattern - constrain scope, define success numerically, automate verification, loop - is becoming a building block.

If you have a codebase with a clear metric and a tolerance for overnight experiments, you already have everything you need. The question isn’t whether the loop works. It’s what you point it at.