A new paper from researchers at Apple and Google shows that you can meaningfully improve a model’s code generation ability using only the model’s own outputs. No teacher model, no reinforcement learning, no external test harness to verify correctness. Sample from the model at a range of temperatures and truncation settings, then fine-tune on those samples with standard supervised fine-tuning. That is the entire method, and it pushes Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6 — a gain of nearly 13 percentage points concentrated on harder problems.

The method, called Simple Self-Distillation (SSD), works because of how decoding distributions behave. LLM decoding involves a tradeoff between precision (committing to the right token in contexts where there is one right answer) and exploration (preserving diversity where it helps). Standard sampling configurations do not navigate this tradeoff well at the token level. SSD’s fine-tuning step reshapes those distributions in a context-dependent way: it suppresses what the authors call “distractor tails” where precision matters (the kinds of mistakes that produce subtly wrong code), while preserving diversity where exploration is beneficial. The result is a model that makes fewer precision errors without becoming less creative.
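To make the precision/exploration tradeoff concrete, here is a minimal numpy sketch (my illustration, not code from the paper) of how temperature scaling and nucleus (top-p) truncation reshape a single token distribution that has one dominant correct token plus a distractor tail:

```python
import numpy as np

def sample_distribution(logits, temperature=1.0, top_p=1.0):
    """Turn logits into a sampling distribution under temperature
    scaling and nucleus (top-p) truncation."""
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # stable softmax
    probs /= probs.sum()
    # Nucleus truncation: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, zero out the rest.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = cum - probs[order] < top_p  # always keeps the top token
    truncated = np.zeros_like(probs)
    truncated[order[keep]] = probs[order[keep]]
    return truncated / truncated.sum()

# A context with one clearly right token and a distractor tail:
logits = np.array([5.0, 2.0, 1.5, 1.0, 0.5])

open_dist  = sample_distribution(logits, temperature=1.0, top_p=1.0)
sharp_dist = sample_distribution(logits, temperature=0.7, top_p=0.9)

# Lower temperature plus tighter top-p concentrates mass on the top
# token, shrinking the chance of sampling a distractor.
print(open_dist.round(3), sharp_dist.round(3))
```

The point of SSD's fine-tuning step, as the authors describe it, is that the model learns to do this sharpening itself, and only in the contexts where it helps.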

The gains generalise across Qwen and Llama model families at 4B, 8B, and 30B scales, and work for both instruct and “thinking” variants. That breadth is the more interesting result. This is not a technique that only works at a specific scale or model family. The mechanism appears to be addressing something fundamental about how instruction-tuned models sample code, rather than patching a quirk of a particular architecture.

For practitioners, this has a clear implication. If you are running a code-focused fine-tuning project and do not have access to a verified dataset of correct solutions, this gives you a principled path to improvement using only the model's own outputs. The compute cost is modest: generate samples, filter out pathological cases, run SFT. The approach also suggests that the gap between a capable instruct model and a code specialist is partly addressable without complex RL pipelines or expensive human annotation.

The honest limitation is that the sampling configuration (temperature and truncation settings) matters significantly, and the paper does not give a universal recipe. You will need to tune these for your model and task distribution. The method also improves most on hard problems, which is exactly where you want gains; easy problems are already near-solved by standard decoding, and SSD does not regress them. That is a reasonable profile, but do not expect it to close the gap entirely on competition-level problems at the hardest end.
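Absent a universal recipe, the pragmatic move is a small grid search on a held-out slice. A minimal sketch, where `estimate_pass_at_1` is a hypothetical stand-in for your own harness (generate samples at that config, run the tests, average):

```python
import itertools

def estimate_pass_at_1(temperature, top_p):
    # Placeholder score peaking at an arbitrary config; replace with
    # real pass@1 measured on a held-out validation set.
    return 1.0 - abs(temperature - 0.8) - abs(top_p - 0.95)

# Sweep a coarse grid of decoding configs and keep the best one
# for the sample-generation step.
grid = itertools.product([0.6, 0.8, 1.0, 1.2], [0.85, 0.9, 0.95, 1.0])
best = max(grid, key=lambda cfg: estimate_pass_at_1(*cfg))
print(best)
```

Even a coarse sweep like this is cheap relative to the fine-tuning run, and it is where most of the method's sensitivity lives.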