Andrej Karpathy’s autoresearch Spurs Shopify’s 53% ThemeRunner Gain

Andrej Karpathy released autoresearch in early March 2026, and the three-file GitHub repository quickly became the template behind Shopify pull request #2056. The clearest result so far is a reported 53% cut in ThemeRunner parse-plus-render time, but the code still sits in an unmerged pull request rather than in production.

The repository keeps a fixed prepare.py file that the agent may not edit, and it uses a roughly 630-line train.py file that the agent may modify. Each run lasts five minutes on a single Nvidia GPU, and the setup produces about 12 experiments per hour.
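The setup above can be sketched as a keep-or-revert loop. This is a minimal illustration under stated assumptions, not Karpathy's actual code: `evaluate`, `propose_edit`, and the toy parameters are hypothetical stand-ins for the frozen evaluation code and for the agent editing train.py, and the metric is assumed to be lower-is-better.

```python
import random

RUN_MINUTES = 5  # per-experiment time budget stated in the article


def experiments_per_hour(run_minutes: float) -> float:
    """Sequential experiments that fit in one hour at a fixed budget."""
    return 60 / run_minutes


def evaluate(params: dict) -> float:
    """Hypothetical frozen evaluator: one scalar, lower is better.

    Stands in for the fixed code the agent is not allowed to edit.
    """
    return (params["lr"] - 0.3) ** 2 + (params["width"] - 8) ** 2 / 100


def propose_edit(params: dict, rng: random.Random) -> dict:
    """Hypothetical agent step: return a modified copy of the editable state."""
    candidate = dict(params)
    key = rng.choice(sorted(candidate))
    step = 0.1 if key == "lr" else 1.0
    candidate[key] += rng.uniform(-1, 1) * step
    return candidate


def run_loop(params: dict, n_experiments: int, seed: int = 0):
    """Try one edit per experiment; keep it only if the metric improves."""
    rng = random.Random(seed)
    best = evaluate(params)
    history = [best]
    for _ in range(n_experiments):
        candidate = propose_edit(params, rng)
        score = evaluate(candidate)
        if score < best:  # otherwise the edit is discarded (reverted)
            params, best = candidate, score
        history.append(best)
    return params, best, history


if __name__ == "__main__":
    print(experiments_per_hour(RUN_MINUTES))  # 12.0
    _, best, _ = run_loop({"lr": 1.0, "width": 2.0}, n_experiments=24)
    print(f"best metric after 24 experiments: {best:.4f}")
```

Because the evaluator is fixed and returns a single number, each experiment reduces to one comparison, which is what lets a five-minute budget translate directly into 12 sequential experiments per hour.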

Tobi Lütke’s Shopify PR #2056

Tobi Lütke opened Shopify pull request #2056 in March 2026 after roughly 120 automated experiments on a branch named autoresearch/liquid-perf-2026-03-11. On ThemeRunner, parse-plus-render time fell from 7,469 microseconds to 3,534 microseconds, a reduction of about 53%, while object allocations fell from 62,620 to 24,530, a drop of about 61%.
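The headline percentage follows directly from the two reported timings. The short calculation below uses only the figures quoted above; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Figures as reported for the pull request.
time_cut = percent_reduction(7469, 3534)     # parse-plus-render, microseconds
alloc_cut = percent_reduction(62620, 24530)  # object allocations
speedup = 7469 / 3534                        # same result as a speed ratio

if __name__ == "__main__":
    print(f"time reduction:       {time_cut:.1f}%")   # 52.7%
    print(f"allocation reduction: {alloc_cut:.1f}%")  # 60.8%
    print(f"speed ratio:          {speedup:.2f}x")    # 2.11x
```

Note that a 53% reduction in time corresponds to running roughly 2.1 times as fast, so "53%" describes time saved rather than a throughput multiplier.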

All 974 unit tests passed, and the pull request carried 93 commits. That is the kind of evidence a performance patch needs before anyone can trust a benchmark claim, because it shows both speed and basic correctness in the same run.

Overfit Warning From Lütke

Lütke also wrote, "This is probably somewhat overfit." That caveat matters more than the headline speedup, because the result may fit the benchmark setup better than it fits real-world Liquid workloads.

The pull request has not been merged, so the 53% reduction remains a benchmark result rather than a shipped change for Shopify users. Simon Willison documented that Lütke ran the loop using pi-autoresearch, the Pi extension developed with Shopify engineer David Cortés.

What Autoresearch Is Showing

Autoresearch uses a single editable file, a frozen evaluator, and a scalar metric, which is why it can generate so many runs so quickly. Karpathy’s own two-day run on code he had already hand-tuned produced around 20 stacking improvements and an 11% training speedup. The Vector Institute ran 910 experiments across 16 GPUs in eight hours and reached the same validation loss that a sequential single-GPU run would have needed 72 hours to find.
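The parallel figures above can be put on a common footing with a little arithmetic. The values below are derived only from the numbers quoted in this article; the function names are illustrative, and the per-GPU efficiency is an inference from those numbers rather than a reported result.

```python
def wall_clock_speedup(sequential_hours: float, parallel_hours: float) -> float:
    """How many times sooner the parallel run reached the same result."""
    return sequential_hours / parallel_hours


def experiments_per_gpu_hour(experiments: int, gpus: int, hours: float) -> float:
    """Average experiment throughput per GPU per hour."""
    return experiments / (gpus * hours)


speedup = wall_clock_speedup(72, 8)               # 72 h sequential vs 8 h parallel
per_gpu_rate = experiments_per_gpu_hour(910, 16, 8)
efficiency = speedup / 16                         # speedup achieved per GPU used

if __name__ == "__main__":
    print(f"wall-clock speedup:       {speedup:.1f}x")       # 9.0x
    print(f"experiments per GPU-hour: {per_gpu_rate:.2f}")   # 7.11
    print(f"speedup per GPU:          {efficiency:.2%}")     # 56.25%
```

On these figures, 16 GPUs bought roughly a 9x reduction in wall-clock time, which is a reminder that parallel search does not scale one-to-one with hardware.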

For readers tracking whether this approach is ready for production, the unresolved point is not the benchmark math. It is whether the Shopify result can survive a merge into real code, and that decision still sits with the unmerged pull request.
