Most approaches to LLM inference cost optimisation work by replacing the expensive model: route simple queries to a small model, hard queries to a large one. ExecTune takes a different approach. Instead of routing, it adds a small trained “guide” model that sits in front of your existing frontier model and generates structured execution strategies. The big model stays; the guide shapes what it does. The paper reports 9.2% accuracy gains on math and code benchmarks alongside 22.4% cost reductions over baseline approaches. More concretely: Claude Haiku 3.5 guided by an ExecTune guide model matches unguided Sonnet 3.5 performance, and reaches within 1.7% of Sonnet 4 accuracy at 38% lower cost.
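To make the guide-core call pattern concrete, here is a minimal sketch of what it might look like in code. Every name here (`Strategy`, `guide`, `run_guided`) is a stand-in I've invented for illustration; the paper does not publish an API, and the real guide is a fine-tuned small model rather than the toy rules below.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the guide-core pipeline described above.
# None of these names come from the paper; they illustrate the shape only.

@dataclass
class Strategy:
    steps: list = field(default_factory=list)  # ordered plan for the core model
    output_format: str = "final answer on the last line"

def guide(query: str) -> Strategy:
    """Small, cheap model: maps a query to a structured execution strategy.
    Stand-in rule-based logic; in ExecTune this is a trained model."""
    if "solve" in query or "integral" in query:
        steps = ["restate the problem", "work symbolically step by step",
                 "verify by substitution"]
    else:
        steps = ["outline an approach", "implement it", "test on an example"]
    return Strategy(steps=steps)

def run_guided(query: str, core_model_call) -> str:
    """The expensive core model executes the guide's strategy verbatim."""
    strategy = guide(query)
    prompt = (
        f"Task: {query}\n"
        "Follow this strategy exactly:\n"
        + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(strategy.steps))
        + f"\nFormat: {strategy.output_format}"
    )
    return core_model_call(prompt)

# Usage with a stub standing in for a real frontier-model API call:
answer = run_guided("solve x^2 - 4 = 0",
                    core_model_call=lambda p: f"[core received {len(p)} chars]")
```

The point of the shape: the core model's weights and API are untouched; the guide only changes the prompt the core sees.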
The core concept is a Guide-Core Policy (GCoP). The guide model is trained with a combination of teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning. The key objective the paper introduces is “guide-averaged executability,” which measures whether the strategies the guide generates can actually be followed by the core model under real deployment constraints. This matters because a naive guide model might produce theoretically optimal strategies that the core model consistently fails to execute correctly. The training process explicitly optimises for the guide-core pair working together, not for the guide in isolation.
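The executability idea can be sketched as a Monte Carlo estimate: sample strategies from the guide, check whether the core actually executes each one, and average the success indicator. This is my hedged reading of the objective, not the paper's exact definition, and the toy guide/core below are stand-ins.

```python
import random

def guide_averaged_executability(queries, sample_strategy, core_executes,
                                 n_samples=8):
    """Estimate E_q E_{s ~ guide(q)} [ core successfully executes s on q ].
    A sketch of the idea behind "guide-averaged executability", not the
    paper's formal definition."""
    scores = []
    for q in queries:
        successes = sum(
            core_executes(q, sample_strategy(q)) for _ in range(n_samples)
        )
        scores.append(successes / n_samples)
    return sum(scores) / len(scores)

# Toy stand-ins: a "guide" that sometimes emits an over-long plan, and a
# "core" that fails on plans longer than it can reliably follow — so the
# score sits below 1.0, mirroring the naive-guide failure mode above.
random.seed(0)
sample_strategy = lambda q: ["step"] * random.choice([3, 12])
core_executes = lambda q, strategy: len(strategy) <= 8
score = guide_averaged_executability(["q1", "q2"], sample_strategy, core_executes)
```

Optimising this quantity pushes the guide toward strategies the *paired* core can follow, which is exactly why training the guide in isolation would miss the target.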
For practitioners managing recurring inference budgets, the architectural advantage is modularity. You can update or fine-tune the guide model — which is small and cheap to train — without retraining the expensive core model. If your task distribution shifts or you want to specialise for a new domain, you retrain the guide, not the frontier model. This is architecturally cleaner than distillation or full fine-tuning approaches, where the capability and the cost optimisation are baked together. The tradeoff is that this only applies to tasks where the guide can produce structured, followable strategies: math, code, and structured reasoning are natural fits; open-ended generation is less amenable.
The benchmark results use Haiku 3.5 and Sonnet-class models, which makes the numbers feasible to reproduce: these are real, API-accessible models, not custom research checkpoints. The paper doesn't report numbers on longer-context tasks or tasks with ambiguous success criteria, which is where the guide's structured strategy generation is most likely to break down. For production deployments, the relevant question is how much of your inference load fits the structured-task profile the guide is trained on.
For any team currently spending heavily on Sonnet-class or above models for math, code generation, or structured data tasks, this is worth evaluating. The 38% cost reduction while staying within 1.7% of Sonnet 4 accuracy is a better trade than most routing strategies can offer for those task types. The paper is on arXiv and the approach builds on publicly available models, so the path to a proof-of-concept is shorter than most research papers suggest.
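Before a proof-of-concept, it's worth running the cost arithmetic against your own prices. The sketch below uses entirely hypothetical per-call costs (arbitrary units, not Anthropic's actual pricing) to show how the guided-stack saving depends on the guide's price and the token overhead of the injected strategy.

```python
# Back-of-envelope cost model with HYPOTHETICAL prices — arbitrary units,
# not real API pricing. It checks whether a "cheaper than Sonnet-class"
# trade holds under your own assumptions.

def guided_cost(guide_price, core_price, strategy_overhead_ratio=0.1):
    """Cost of one guided call: a cheap guide pass plus the core pass,
    including extra core input tokens for the injected strategy."""
    return guide_price + core_price * (1 + strategy_overhead_ratio)

# Assumed per-call costs (illustrative only):
sonnet4_cost = 10.0   # large unguided model
haiku_cost = 2.0      # small core model
guide_price = 0.3     # tiny guide model

guided = guided_cost(guide_price, haiku_cost)
savings = 1 - guided / sonnet4_cost
# With these assumed numbers: guided stack = 2.5 units/call, a 75% saving.
# Your real figure depends entirely on actual prices and strategy length;
# the paper's reported 38% sits well inside this kind of envelope.
```

The useful exercise is sensitivity: if the guide's strategies bloat the core's input (raise `strategy_overhead_ratio`), the saving erodes quickly, which is another reason executability, not strategy elaborateness, is the right training target.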