The Chinchilla scaling laws, published by DeepMind in 2022, became the dominant framework for deciding how to split a training compute budget between model size and training tokens. The short version: for a fixed budget, train a smaller model on more tokens rather than a larger model on fewer. That guidance has shaped most major training runs since. A new arXiv paper introduces Train-to-Test (T²) scaling laws, which add a variable Chinchilla left out: inference-time sampling. Once you account for the cost of pass@k sampling at inference, the optimal pretraining decision shifts away from the Chinchilla point and into what the authors call the overtraining regime.
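
For orientation, the Chinchilla allocation is often summarised with two rules of thumb: training compute of roughly 6 FLOPs per parameter per token, and roughly 20 training tokens per parameter. The sketch below turns a training budget into a model size and token count under those two approximations. Both constants and the function name `chinchilla_allocation` are assumptions made for illustration, not values taken from the T² paper.

```python
import math

# Rules of thumb (assumptions, not exact fits):
#   training FLOPs          C ~= 6 * N * D
#   Chinchilla-style ratio  D ~= 20 * N
FLOPS_PER_PARAM_TOKEN = 6.0
TOKENS_PER_PARAM = 20.0


def chinchilla_allocation(train_flops: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) for a training budget in FLOPs.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2,
    so N = sqrt(C / 120) and D = 20 * N.
    """
    if train_flops <= 0:
        raise ValueError("train_flops must be positive")
    n_params = math.sqrt(train_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = chinchilla_allocation(5.88e21)
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

With a budget of 5.88e21 FLOPs this gives about 7 billion parameters and 140 billion tokens. Anything trained on substantially more tokens per parameter than that sits in the overtraining regime.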

The mechanics are straightforward once you see them. Chinchilla optimises for the model that performs best at a given training compute budget, assuming in effect a single sample per query. But modern language models, particularly reasoning models, often spend far more than that per query through repeated sampling, beam search, or chain-of-thought approaches. A less thoroughly trained model that needs more inference samples to reach a given accuracy threshold is not actually cheaper, end to end, than an overtrained model that needs fewer. When you fold inference cost into the optimisation, the optimum shifts. The paper reports validating this across eight downstream tasks and shows the effect persists through post-training stages, which suggests it is not an artefact of the base model evaluation setup.
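
A minimal sketch of that accounting, assuming the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token (both ignore attention and memory costs). The function names, the two model configurations, and the workload numbers are hypothetical and chosen only to make the arithmetic visible; this is not the paper's formulation.

```python
def training_flops(n_params: int, train_tokens: int) -> int:
    """Approximate training cost: ~6 FLOPs per parameter per token."""
    return 6 * n_params * train_tokens


def inference_flops(n_params: int, queries: int,
                    samples_per_query: int, tokens_per_sample: int) -> int:
    """Approximate generation cost: ~2 FLOPs per parameter per token."""
    return 2 * n_params * queries * samples_per_query * tokens_per_sample


def affordable_samples(n_params: int, inference_budget: int,
                       queries: int, tokens_per_sample: int) -> int:
    """Largest samples-per-query that fits inside the inference budget."""
    cost_of_one_sample_per_query = inference_flops(
        n_params, queries, 1, tokens_per_sample)
    return inference_budget // cost_of_one_sample_per_query


if __name__ == "__main__":
    # Two hypothetical models with identical training cost:
    # (parameters, training tokens)
    configs = {
        "larger, ~20 tokens/param": (7_000_000_000, 140_000_000_000),
        "smaller, overtrained":     (3_500_000_000, 280_000_000_000),
    }
    queries, tokens_per_sample = 1_000_000, 1_000

    # Inference budget sized so the larger model gets 8 samples per query.
    budget = inference_flops(7_000_000_000, queries, 8, tokens_per_sample)

    for name, (n, d) in configs.items():
        k = affordable_samples(n, budget, queries, tokens_per_sample)
        print(f"{name}: train={training_flops(n, d):.3e} FLOPs, "
              f"samples/query within budget={k}")
```

Both configurations cost the same to train, but the smaller model can draw twice as many samples per query from the same inference budget. Whether those extra samples win depends on per-sample accuracy, which this sketch deliberately does not model.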

The practical implication is significant for anyone making training decisions today. If your deployment pattern involves any form of test-time compute scaling (best-of-N sampling, repeated reasoning attempts, majority voting), you are probably undertraining your models relative to the optimum once inference cost is counted. The T² framework gives you a principled way to recompute that optimum from your actual inference budget. The paper claims that models trained in the T²-optimal region substantially outperform Chinchilla-optimal baselines when inference cost is included in the accounting.

There is a reason this was not obvious earlier: the inference cost term was largely invisible while sampling at test time was not standard practice. Base model evaluations typically use greedy decoding or a single sample. The moment you shift to a deployment pattern that spends inference compute to improve output quality, the training optimum moves. Reasoning models, which almost by definition run extended test-time compute, are the clearest case. Under this accounting, training a reasoning model to the Chinchilla optimum and then burning compute at inference to compensate is a dominated strategy.
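
To see how quickly "compensating at inference" adds up, here is a small sketch under a simplifying assumption: each sample solves a given problem independently with the same probability `p`, so pass@k is `1 - (1 - p)**k`. Real benchmarks average over problems with different success rates, so treat this as a toy model rather than the paper's estimator. The function names are illustrative.

```python
import math


def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


def samples_needed(p: float, target: float) -> int:
    """Smallest k with pass@k >= target, for 0 < p < 1 and 0 < target < 1.

    Solving 1 - (1 - p)**k >= target for k gives
    k >= log(1 - target) / log(1 - p).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    for p in (0.2, 0.4):
        k = samples_needed(p, target=0.9)
        print(f"per-sample success {p:.0%}: need k={k} "
              f"(pass@k={pass_at_k(p, k):.3f})")
```

In this toy model, raising per-sample success from 20% to 40% cuts the samples needed for a 90% pass rate from 11 to 5. That is the kind of trade the inference cost term makes visible.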

The limitation worth noting is that T² scaling laws require you to know your inference sampling budget in advance, which is not always straightforward to specify. Different tasks will use different amounts of test-time compute, and the optimum shifts accordingly. This is a framework for principled decision-making, not a single number to replace Chinchilla. But for teams building models where extended reasoning or best-of-N is core to the product, the paper is worth reading carefully before the next training run.