The standard coding agent loop — read the codebase, generate a hypothesis, write a patch, run tests — has a structural limitation that most agent benchmarks do not measure. It can only find optimisations that are visible from the source code itself. A senior engineer working on the same problem would also read relevant papers, study how other implementations solved the same challenge, and bring that context into their design decisions. SkyPilot’s research-driven agent paper tests what happens when you give an agent that same research phase.

The setup extends a standard coding agent with a literature search step: the agent reads arXiv papers and examines competing project codebases before forming optimisation hypotheses. Experiments run in parallel across cloud VMs via SkyPilot, with each VM building the project, running benchmarks, and validating correctness. Applied to llama.cpp CPU inference optimisation, the agent ran 30+ experiments over approximately three hours, producing five successful optimisations. On x86, the result was a 15.1% improvement in text generation throughput for TinyLlama 1.1B. The total cost was $29: $20 in compute and $9 in API calls.
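
The per-VM workflow described above can be sketched as a SkyPilot task definition. Everything below — cloud choice, CPU count, model path, benchmark flags — is an illustrative assumption, not the actual configuration from the experiment; only the overall shape (resources, setup, run) follows SkyPilot's task format.

```yaml
# Hypothetical SkyPilot task: build llama.cpp, then benchmark one candidate patch.
# All values here are placeholders, not the paper's real settings.
resources:
  cloud: aws
  cpus: 16        # x86 CPU instance for the CPU-inference experiments

setup: |
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  git apply ~/candidate.patch          # the optimisation under test
  cmake -B build && cmake --build build -j

run: |
  cd llama.cpp
  # llama-bench reports tokens/s; correctness is checked separately
  # against reference outputs before a patch is accepted.
  ./build/bin/llama-bench -m ~/tinyllama-1.1b.gguf
```

Launching one such task per hypothesis (e.g. `sky launch -c exp-1 task.yaml`) is what makes the 30+ experiments run in parallel rather than serially.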

The most instructive detail is which knowledge source proved most valuable. The agent found more applicable optimisations by studying competing backends and forks of llama.cpp than by reading arXiv papers. Specifically, it discovered RMS_NORM + MUL kernel fusion patterns that existed in the CUDA and Metal backends but had not been ported to the CPU path. That gap was invisible from the source code alone — you only find it by reading sibling implementations. Papers described general techniques; the codebase of a competing project contained the specific applied solution.
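
To make the fusion concrete, here is a minimal sketch of the semantics in Python. This is not llama.cpp's actual kernel code, and the function names are illustrative; the point is that RMS_NORM followed by MUL can be computed in one traversal of the data, which on a real backend eliminates an intermediate buffer and a memory round-trip between two kernel launches.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Pass 1: normalise each row by its root mean square.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

def mul(a, b):
    # Pass 2: element-wise multiply (in llama.cpp, by the norm's weight tensor).
    return a * b

def fused_rms_norm_mul(x, w, eps=1e-6):
    # Fused version: one traversal of x, no intermediate tensor.
    # NumPy still materialises temporaries internally; in a real kernel
    # the saving is the avoided memory round-trip, not the arithmetic.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * w
```

The fused and unfused paths compute identical results, which is what makes the port from the CUDA/Metal graphs to the CPU path a pure performance change.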

If the task is bug fixing or feature implementation in well-understood code, a code-only agent with a tight feedback loop is appropriate. If the task is performance engineering, architectural design, or any domain where existing literature or competing implementations encode substantial tacit knowledge, a research phase expands the solution space the agent can explore. The agent is not just smarter because it read papers; it is smarter because it had access to solutions that already existed in applied form elsewhere.

The $29 figure should not be treated as a general benchmark. It reflects a specific three-hour run on a constrained task with clear correctness metrics. Tasks with less structured feedback will cost more and converge less reliably. What transfers from this result is the architecture: literature retrieval and competing-implementation study as a first phase, followed by parallelised experimentation with automated validation. The SkyPilot infrastructure handles the parallelism; the research phase is what changes what the agent looks for.
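
The automated-validation half of that architecture can be sketched as an acceptance gate: a patch survives only if it is both measurably faster and produces correct output. The function below is a hypothetical illustration — names, the median-based comparison, and the 1% speedup threshold are all assumptions, not details from the experiment.

```python
import statistics

def accept_patch(baseline_runs, patched_runs, ref_output, patched_output,
                 min_speedup=1.01):
    # Hypothetical gate used after each parallel experiment finishes.
    # baseline_runs / patched_runs: repeated throughput measurements (tokens/s).
    # Median of repeated runs damps noise from shared cloud VMs.
    speedup = statistics.median(patched_runs) / statistics.median(baseline_runs)
    # Correctness check: patched output must match the reference exactly
    # (a numeric tolerance would be used for non-deterministic kernels).
    correct = patched_output == ref_output
    return correct and speedup >= min_speedup
```

With a gate like this, a fast-but-wrong patch and a correct-but-neutral patch are both rejected automatically, which is what lets 30+ experiments run without a human reviewing each one.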