Karpathy's Autoresearch: AI That Runs Its Own Experiments

In Brief
Andrej Karpathy, the former director of AI at Tesla and a co-founder of OpenAI, left an AI agent running overnight in early March 2026. By morning, it had run hundreds of experiments on his code and found improvements he had missed. The project is called autoresearch, and it is open source with three core files and roughly 630 lines of code. In a two-day run, the agent completed around 700 experiments and made the code 11% faster. Within a week, the repository had gained over 30,000 stars on GitHub. Karpathy has since sketched out an idea he calls AgentHub, a platform designed to let entire swarms of AI agents collaborate on research together.
What is autoresearch?
Imagine a chef who wants to perfect a recipe. Normally, the chef tries one variation at a time: a little more salt, a different oven temperature, a longer baking time. Each experiment takes effort, and the chef can only try so many before getting tired. Now imagine the chef could clone themselves 700 times, let each clone try a different variation overnight, and wake up to the best version.
That is the basic idea behind autoresearch. You give an AI agent three things: a piece of code it can modify, a measurable goal (like "make this run faster"), and a fixed time limit per experiment. The agent then runs in a loop. It changes something, tests whether the change helped, keeps improvements, throws away failures, and repeats. No human involvement needed.
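The loop fits in a few lines of Python. This is a toy illustration of the pattern, not Karpathy's actual code; `score_fn` and `mutate_fn` are hypothetical stand-ins for "test whether the change helped" and "change something."

```python
import random

def autoresearch_loop(score_fn, mutate_fn, config, steps=100):
    """Minimal keep-what-works loop: mutate, measure, keep improvements."""
    best_score = score_fn(config)
    for _ in range(steps):
        candidate = mutate_fn(config)       # change something
        score = score_fn(candidate)         # measure the result
        if score < best_score:              # lower is better, like validation loss
            config, best_score = candidate, score  # keep the improvement
        # otherwise: throw the change away and try again
    return config, best_score

# Toy demo: "tune" a single setting toward a target value of 3.0
random.seed(0)
score = lambda cfg: abs(cfg["x"] - 3.0)
mutate = lambda cfg: {"x": cfg["x"] + random.uniform(-0.5, 0.5)}
best, loss = autoresearch_loop(score, mutate, {"x": 0.0}, steps=500)
```

The agent version swaps the toy `score` for a real training run and the toy `mutate` for an LLM editing code, but the control flow is the same.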
Karpathy tested this on AI training code, but the pattern works for anything with a measurable score: a website's load time, a delivery route's total distance, a database query's speed.
Autoresearch is not machine learning. Machine learning is when a model learns patterns from data. Autoresearch is simpler: change something, measure the result, keep what works. It is systematic trial-and-error. The website example above needs no machine learning at all: the agent tries code changes, measures load time, and keeps what is faster. Karpathy happened to use the pattern on machine learning code, but the method itself is just a loop.
Karpathy describes it bluntly in the project's README: "One day, frontier AI research used to be done by meat computers... That era is long gone." By "meat computers," he means human brains. The point is not that AI is smarter than researchers. It is that AI can try hundreds of things while the researcher sleeps.
How it works
Three files, one loop
The project has three core files. This is deliberate. Karpathy wanted something anyone could understand and adapt, not a massive software framework. The full project is freely available on GitHub.
prepare.py downloads the training data and sets up the scoring system. Think of it as the exam paper. It defines the test that every experiment must pass, and it never changes. The agent cannot touch this file.
train.py is the code the agent actually modifies. It contains a complete AI model and all the settings that control how it learns. This is where the experimentation happens. The agent might change how fast the model learns, how the layers connect, or how numbers flow through the system.
program.md is the agent's job description, written in plain English. It tells the agent what to do, how to evaluate results, and includes one important rule: "NEVER STOP." As Karpathy puts it, "The human might be asleep."
The 5-minute time budget
Every experiment gets exactly 5 minutes. This is the design choice that makes the whole system work. At first, 5 minutes might sound short. But by fixing the time, every experiment becomes directly comparable, like giving every student the same exam duration. Whether the agent tweaks one small setting or rewrites an entire section, the 5-minute clock means the results can be compared apples-to-apples.
At this pace, the agent runs about 12 experiments per hour. Leave it running overnight and that is roughly 100 experiments. Karpathy left his running for two days and got around 700.
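A fixed budget is straightforward to enforce with a subprocess timeout. The sketch below is an assumption about how such a harness could look, not the project's actual code; the commands and budget values are illustrative.

```python
import subprocess
import sys

TIME_BUDGET_S = 5 * 60  # every experiment gets the same 5-minute clock

def run_experiment(cmd, budget=TIME_BUDGET_S):
    """Run one experiment under a fixed wall-clock budget.

    Returns the experiment's output, or None if it overran the budget,
    so a too-slow variant simply counts as a failed experiment.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=budget)
        return result.stdout
    except subprocess.TimeoutExpired:
        return None  # over budget: discard, same as a worse score

# Demo with tiny budgets: a fast command finishes, a slow one is cut off
fast = run_experiment([sys.executable, "-c", "print('done')"], budget=10)
slow = run_experiment([sys.executable, "-c", "import time; time.sleep(5)"],
                      budget=1)
```

Treating a timeout as a failure is what keeps results apples-to-apples: no experiment can buy a better score with extra time.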
How the agent decides
After each 5-minute run, the agent checks a score called validation loss. This is a number that measures how well the AI model handles data it has never seen before. Lower is better. Think of it as a test score where 0 is perfect.
If the score improved, the agent saves the change using Git, a version control system that tracks every modification (like "track changes" in a word processor). If the score got worse, the agent throws away the change and starts fresh. Every kept improvement stacks on top of the last one, building a staircase of progress over time.
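The keep-or-discard step maps onto two plain Git operations: commit when the score improves, restore the last commit when it regresses. A minimal sketch, assuming Git is installed; the helper names are hypothetical, not from the project.

```python
import os
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command inside the experiment's working copy."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True)

def keep_change(repo, message):
    """Score improved: commit, so the win stacks on earlier ones."""
    git(repo, "add", "-A")
    git(repo, "commit", "-m", message)

def discard_change(repo):
    """Score got worse: restore the last committed (known-good) state."""
    git(repo, "checkout", "--", ".")

# Demo in a throwaway repository
repo = tempfile.mkdtemp()
git(repo, "init")
git(repo, "config", "user.email", "agent@example.com")
git(repo, "config", "user.name", "agent")

path = os.path.join(repo, "train.py")
with open(path, "w") as f:
    f.write("lr = 0.001\n")   # known-good setting
keep_change(repo, "baseline")

with open(path, "w") as f:
    f.write("lr = 99\n")      # a bad experiment
discard_change(repo)           # throw it away

with open(path) as f:
    restored = f.read()
```

Because every kept improvement is a commit, the full history of the run reads like a lab notebook: one commit per discovery.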
Results that surprised even Karpathy
700 experiments, 20 discoveries
Over two days, the agent ran roughly 700 experiments on a model Karpathy had already optimized himself. It found around 20 improvements that stuck. Together, these improvements cut the training time from 2.02 hours to 1.80 hours, an 11% speedup.
What is remarkable is what it found: bugs and missed optimizations in code written by one of AI's best researchers. The improvements included fixes to scaling, regularization, and attention settings that had been sitting unnoticed in the code. The agent found them because it systematically tested everything, not because it understood the code better than Karpathy.
The Shopify CEO's overnight test
Tobias LΓΌtke, CEO of Shopify, tried autoresearch on his own AI model. He ran 37 experiments overnight and woke up to a 19% performance improvement. A smaller 0.8 billion parameter model, trained overnight with the agent's optimizations, outperformed his previous 1.6 billion parameter model.
AgentHub: GitHub for AI agents
After watching autoresearch work, Karpathy identified the next problem. One agent working alone is powerful, but what if hundreds of agents could collaborate? He wrote on X: "The goal is not to emulate a single PhD student, it's to emulate a research community of them."
The result is AgentHub, a concept he describes simply: "GitHub is for humans. AgentHub is for agents."
How it would differ from GitHub
GitHub is the world's most popular platform for programmers to store and collaborate on code. It is built around human workflows: branches, pull requests, code reviews, merge conflicts. Karpathy's idea is to strip all of that away because agents do not need it.
In his X posts, he described how Git is "almost but not really suited" for agent collaboration. It assumes one main branch that humans merge into. Agents, on the other hand, could work in a sprawling tree of parallel experiments, sharing results through something like a message board rather than pull requests.
The SETI@home vision
Karpathy compared his goal to SETI@home, a famous project from the early 2000s where ordinary people donated their computer's spare processing power to search for signals from alien civilizations. His vision for autoresearch is similar: anyone with a powerful GPU (a computer chip used for AI training) could contribute experiments to a shared research effort.
As he wrote: "Any metric... reasonably efficient to evaluate... can be autoresearched by an agent swarm." AgentHub is the infrastructure that would make this possible. Karpathy calls it "a sketch" and "an idea in development."
Common misconceptions
"This is the same as Deep Research"
No. When companies like OpenAI and Perplexity say "deep research," they mean an AI that searches the web, reads articles, and writes a summary. Autoresearch does not search the web at all. The word "research" here means scientific experimentation: changing code, running tests, measuring results. It is closer to a laboratory than a library.
"Anyone can run this on their laptop"
Not yet. Autoresearch requires a powerful NVIDIA GPU (Graphics Processing Unit), a specialized computer chip used for AI training. Karpathy tested it on an H100, a professional GPU that costs thousands of dollars. This is not something you can run on a regular home computer. Community members have started adapting it for less powerful hardware, but the full version needs serious computing power.
"It only works for machine learning"
Karpathy himself pushed back on this. Autoresearch is not a ready-made app. It is a pattern: a repeatable method you copy and adapt to your own problem. The three files show how it works for one specific AI model, but the idea applies to anything with a measurable score. Community members have already adapted the pattern for other tasks, including optimizing AI prompts (the instructions given to an AI). The concept is not limited to training AI models.
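As a concrete non-ML example, any measurable number can serve as the score. The toy sketch below uses runtime as the metric and compares two candidate implementations of the same task; it is an illustration of the pattern, not code from the project.

```python
import timeit

def score(candidate_code):
    """The 'validation loss' here is plain wall-clock runtime: lower is better."""
    return timeit.timeit(candidate_code, number=1000)

# Two candidate "experiments" for the same task: summing 0..999
baseline = "total = 0\nfor i in range(1000):\n    total += i"
variant = "total = sum(range(1000))"

# Keep the variant only if it measurably beats the baseline
keep_variant = score(variant) < score(baseline)
```

Swap the metric for page load time, memory usage, or a prompt-quality score and the same comparison drives the loop.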
What this means
For regular people
Autoresearch shows where AI development is heading. Today, it optimizes AI training code. Tomorrow, the same loop could optimize anything measurable: energy usage, delivery routes, product designs. When Karpathy says "all LLM frontier labs will do this," he means that autonomous experimentation will become standard practice, not an experiment itself.
For researchers and developers
The pattern is both a tool and a wake-up call. An agent found bugs in code written by one of the most respected AI researchers alive. That does not mean researchers are obsolete. It means the tedious, repetitive work of trying hundreds of variations is better handled by machines. Researchers can focus on the creative work: asking the right questions, designing the experiments, and interpreting surprising results.
For AI labs
Karpathy called this "the final boss battle." If autonomous agents can improve AI systems without human involvement, the labs that master this approach will move faster than those that do not. Karpathy also flagged a vulnerability he calls intelligence brownouts: an OAuth outage once wiped out his running experiments. When critical research depends on AI infrastructure, any downtime means lost progress.
Glossary
| Term | Definition |
|---|---|
| Autoresearch | Karpathy's project that lets AI agents run autonomous experiments in a loop: modify code, test, keep or discard, repeat. |
| AgentHub | Karpathy's concept for a collaboration platform for AI agents, like GitHub but designed for machines instead of humans. Still an early sketch. |
| Validation loss | A score measuring how well an AI model handles data it has not seen before. Lower is better. Think of it as a test grade where 0 is perfect. |
| Hyperparameters | Settings chosen before training starts, like learning rate and batch size. The "oven temperature and baking time" of AI training. |
| LLM (Large Language Model) | An AI system trained on massive amounts of text that can understand and generate language. Examples: Claude, GPT. |
| GPU (Graphics Processing Unit) | A computer chip originally designed for graphics but now widely used for AI training because it handles many calculations at once. |
| Git | A version control system that tracks every change to code. Like "track changes" in a word processor, but for programmers. |
| Open source | Software whose code is publicly available for anyone to use, study, and modify. Like a recipe that is freely shared. |
| Bits per byte (bpb) | A measure of how efficiently an AI model compresses text. Lower means the model understands the text better. |
| SETI@home | A famous distributed computing project where volunteers donated spare computing power to search for extraterrestrial signals. Karpathy uses it as a metaphor for collaborative agent research. |
Sources and resources
- Andrej Karpathy (personal site)
- autoresearch on GitHub (source code, README, and MIT license)
- Karpathy on X (launch announcement)
- Karpathy on X (SETI@home vision for agents)
- Karpathy on X (700 experiments and results)
- VentureBeat: Karpathy's open-source autoresearch
- Fortune: The Karpathy Loop
- Latent Space: Sparks of Recursive Self Improvement