AlphaLab is a multi-agent research system built on frontier LLMs (GPT-5.2 and Claude Opus 4.6) that runs full experimental cycles without human guidance. A paper published this week documents results across three domains: CUDA kernel optimisation (4.4x faster than torch.compile on average, with a maximum of 91x), LLM pretraining (22% lower validation loss compared to a single-shot baseline), and traffic forecasting (23-25% better than standard baselines after the system independently researched and implemented published model families). None of these results required a human to specify the optimisation strategy, choose the architecture, or review intermediate results.

The system runs a Strategist/Worker loop where one agent maintains a growing playbook of domain knowledge while worker agents execute experiments and report back. The Strategist synthesises results across rounds, updates its hypothesis about what’s working, and directs the next set of experiments. This is not prompt chaining with a fixed plan — the Strategist genuinely updates its theory of the domain based on empirical feedback and redirects accordingly. The playbook persists across rounds, so the system can distinguish between techniques that worked once and techniques that work reliably across varied inputs.
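The paper does not publish the loop's internals, but the described behaviour can be sketched in a few lines. Everything below is a hypothetical illustration: `Strategist`, `record`, `reliable_techniques`, and `research_loop` are invented names, and the worker callables stand in for what would be LLM agent calls in the real system.

```python
from dataclasses import dataclass, field

@dataclass
class Strategist:
    # Persistent playbook: technique name -> scores observed across rounds.
    playbook: dict = field(default_factory=dict)

    def record(self, technique: str, score: float) -> None:
        self.playbook.setdefault(technique, []).append(score)

    def reliable_techniques(self, min_trials: int = 2, min_score: float = 0.5):
        # Distinguish techniques that worked once from techniques that
        # work reliably across varied inputs (multiple trials, good mean).
        return sorted(
            t for t, scores in self.playbook.items()
            if len(scores) >= min_trials
            and sum(scores) / len(scores) >= min_score
        )

def research_loop(strategist, workers, rounds):
    # Each round, every worker runs one experiment and reports back;
    # the strategist accumulates evidence in the playbook, which
    # persists across rounds rather than being rebuilt per round.
    for _ in range(rounds):
        for run_experiment in workers:
            technique, score = run_experiment()
            strategist.record(technique, score)
    return strategist.reliable_techniques()
```

The key design point this sketch captures is that the playbook outlives any single round, which is what lets the system separate one-off wins from robust techniques.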

The finding that different frontier models discover “qualitatively different solutions” is the most practically significant detail in the paper. GPT-5.2 and Claude Opus 4.6, running the same framework on the same problem, produced different kernel implementations that each outperformed torch.compile but by different margins and through different mechanisms. This is not a marginal difference attributable to sampling randomness — they found genuinely distinct approaches. For anyone running research or optimisation pipelines, this suggests that using multiple frontier models as parallel hypothesis generators, rather than a single model as the sole oracle, is not just a redundancy measure but a coverage strategy.
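The parallel-hypothesis pattern is simple to operationalise. Here is a minimal sketch under stated assumptions: each “model” is any callable that proposes a candidate solution, and `evaluate` is your own scoring function (for kernels, a measured speedup). None of these names come from the paper.

```python
from typing import Callable, Dict, Tuple

def best_of_models(
    models: Dict[str, Callable[[str], str]],
    problem: str,
    evaluate: Callable[[str], float],
) -> Tuple[str, str, float]:
    # Ask every model for a candidate, score each one independently,
    # and keep whichever scores highest. Coverage, not redundancy:
    # the value comes from the models exploring different regions
    # of the solution space, not from voting on the same answer.
    results = []
    for name, propose in models.items():
        candidate = propose(problem)
        results.append((evaluate(candidate), name, candidate))
    score, name, candidate = max(results)
    return name, candidate, score
```

Because each candidate is scored independently, this also parallelises trivially across API providers.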

The CUDA kernel result deserves specific attention. torch.compile is PyTorch’s own optimising compiler. Beating it by 4.4x on average, and up to 91x on specific kernels, means the system is finding non-obvious memory layout decisions, fusion opportunities, and execution patterns that the compiler’s general-purpose heuristics miss. GPU kernel optimisation has historically required deep hardware knowledge and manual profiling. That this is now achievable through an autonomous LLM research loop changes the economics of inference optimisation for teams that lack a dedicated GPU performance engineer.
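Speedup claims like 4.4x hinge on careful measurement. The sketch below is a CPU-only illustration of the timing discipline involved (warmup runs, repeated trials, median to suppress noise); on GPU you would additionally synchronise the device and use CUDA event timing rather than wall-clock timing. The function names are placeholders, not the paper's harness.

```python
import time
from statistics import median

def measure(fn, *args, repeats=20, warmup=3):
    # Warmup runs let caches fill and any JIT compilation finish
    # before we start timing.
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    # Median is more robust than mean against scheduler hiccups.
    return median(times)

def speedup(baseline, candidate, *args, **kw):
    # >1.0 means the candidate implementation is faster.
    return measure(baseline, *args, **kw) / measure(candidate, *args, **kw)
```

Against torch.compile the baseline would be the compiled module and the candidate the system's generated kernel, with identical inputs and a device synchronise inside the timed region.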

The limitation to flag is that AlphaLab uses frontier models at scale, running many experimental rounds. The compute cost of the research loop is not trivial. The paper does not provide a cost breakdown, which matters for teams evaluating whether to apply this approach to their own optimisation problems. The results are genuine, but they should not be read as “LLMs will optimise your CUDA kernels for free.” What the paper demonstrates is that autonomous AI research loops can reach expert-level results in narrow technical domains when given enough iterations, and that the ceiling for what those loops can find is higher than previous single-shot baselines suggested.
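Since the paper gives no cost breakdown, any evaluation has to start from a back-of-envelope model. The sketch below is exactly that: every number a caller plugs in is an assumption about their own workload and provider pricing, not a figure from the paper.

```python
def loop_cost(rounds, experiments_per_round, tokens_per_experiment,
              usd_per_million_tokens):
    # Rough cost of an autonomous research loop: total tokens consumed
    # across all experimental rounds, priced at the provider's
    # per-million-token rate. Ignores caching discounts and the
    # input/output price split; refine as needed.
    total_tokens = rounds * experiments_per_round * tokens_per_experiment
    return total_tokens / 1_000_000 * usd_per_million_tokens
```

For example, 50 rounds of 8 experiments at 60k tokens each, priced at a hypothetical $15 per million tokens, comes to $360 — small next to a performance engineer's time, but the real token counts per experiment are the unknown the paper leaves open.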