A 1.3M-parameter model beats LLMs 92,000 times its size at real-time game control

A paper from this week’s arXiv batch pits a 1.3M-parameter task-specific model against LLMs up to 92,000 times its size on real-time game control, and the small model wins by a factor of roughly 14. The model, a ModernBERT encoder with depth-aware token representations trained on 31,000 human gameplay demonstrations, scores 178 frags in DOOM against 13 combined across all tested LLMs, including GPT-4o-mini. It makes each decision in 31 milliseconds and runs on consumer hardware. The LLMs, constrained by their architecture and inference latency, cannot keep pace with real-time control demands regardless of their reasoning capability.
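The headline figures are easier to hold in mind once the arithmetic is written out. The snippet below is only a back-of-envelope check using the numbers quoted above; the variable names are illustrative, and the implied size of the largest LLM is derived from the stated ratio rather than reported directly.

```python
# Figures quoted in the paragraph above.
SMALL_MODEL_PARAMS = 1.3e6      # 1.3M parameters
SIZE_RATIO = 92_000             # largest LLM is "up to 92,000 times" bigger
SMALL_MODEL_FRAGS = 178
LLM_FRAGS_COMBINED = 13         # all tested LLMs together
DECISION_LATENCY_S = 0.031      # 31 ms per decision

# Implied parameter count of the largest LLM (derived, not quoted).
largest_llm_params = SMALL_MODEL_PARAMS * SIZE_RATIO

# Performance gap: the "factor of roughly 14".
frag_ratio = SMALL_MODEL_FRAGS / LLM_FRAGS_COMBINED

# Upper bound on sequential decisions per second at 31 ms each.
decisions_per_second = 1 / DECISION_LATENCY_S

print(f"Implied largest LLM: ~{largest_llm_params / 1e9:.1f}B parameters")
print(f"Frag ratio: {frag_ratio:.1f}x")
print(f"Decision rate: ~{decisions_per_second:.0f} per second")
```

So the size ratio corresponds to a model of roughly 120B parameters, and 31 ms per decision allows about 32 decisions per second.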

The result is not surprising given what LLMs are and are not good at. Language models trained on internet text contain broad world knowledge and flexible reasoning, but they lack the inductive biases needed for fast sensorimotor control. A domain-specific model trained on the right data, with an architecture suited to the task, tends to beat a generalist model when the task is narrow and well-defined. What is notable here is the size of the performance gap combined with the difference in scale: the specialised model is orders of magnitude smaller, and it is also faster, cheaper to run, and far more effective at the task.

The production implications are straightforward but routinely ignored. Most teams default to frontier LLMs for new tasks because the API is easy and the out-of-the-box quality is good enough to prototype. For tasks with tight latency requirements (real-time control, high-frequency decision loops, edge inference), a domain-specific model trained on representative data is not just preferable; it is often the only feasible option. A 31 ms decision latency is not realistically achievable through a cloud API call, where the network round trip and token generation each add latency before any action can be taken. The DOOM result makes this concrete, but the pattern generalises to anything with strict timing constraints: robotic control, interactive applications, on-device inference.
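The latency argument can be made concrete with a simple budget check. The sketch below is illustrative only: the `Pipeline` class and its stage names are hypothetical, and the cloud-side numbers are placeholder assumptions rather than measurements. Only the 31 ms local inference figure comes from the paper as described above.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    """A decision pipeline described as named stages with latencies in ms."""
    name: str
    stages_ms: dict[str, float]

    def total_ms(self) -> float:
        # Stages run sequentially, so their latencies add up.
        return sum(self.stages_ms.values())

    def fits(self, budget_ms: float) -> bool:
        return self.total_ms() <= budget_ms


# Target: match the 31 ms per-decision latency quoted above.
TARGET_MS = 31.0

# Local specialised model: inference time as reported in the article.
local = Pipeline("local specialised model", {"inference": 31.0})

# Cloud LLM call: every number here is a PLACEHOLDER ASSUMPTION for
# illustration, not a benchmark. Substitute your own measurements.
cloud = Pipeline(
    "cloud LLM API",
    {"network_round_trip": 40.0, "queueing": 10.0, "generation": 200.0},
)

for p in (local, cloud):
    verdict = "fits" if p.fits(TARGET_MS) else "exceeds"
    print(f"{p.name}: {p.total_ms():.0f} ms, {verdict} the {TARGET_MS:.0f} ms target")
```

With measured values in place of the placeholders, the same check shows how much of the budget is consumed before the model has produced anything.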

The 31,000 training demonstrations are the other number worth sitting with. That is a small dataset by current standards, and it produced a model that substantially outperforms much larger generalists. Domain-appropriate training data matters more than model size when the task is well-defined. This is the same lesson that instruction fine-tuning research has been pointing at for years, applied to a domain where the feedback signal is unambiguous: did the character move or did it get shot.

The practical takeaway is a question worth asking before the next LLM integration: is this task domain-constrained and latency-sensitive? If both are true, a specialised model trained on your actual task data is worth the investment. The LLM default makes sense for open-ended tasks with flexible timing. For closed-ended tasks with real-time requirements, it is the wrong choice: in this paper, the right choice was 92,000 times smaller and scored roughly 14 times as many frags.
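The two-question check above is simple enough to write down as a function. This is a minimal sketch; the function name and return strings are hypothetical, and the handling of the mixed cases (only one of the two conditions holds) is an assumption, since the text only addresses the two clear-cut cases.

```python
def recommend_model(domain_constrained: bool, latency_sensitive: bool) -> str:
    """Encode the two-question heuristic from the paragraph above."""
    if domain_constrained and latency_sensitive:
        # Closed-ended task with real-time requirements.
        return "specialised model"
    if not domain_constrained and not latency_sensitive:
        # Open-ended task with flexible timing.
        return "general-purpose LLM"
    # Mixed cases are not covered by the text; flagged here as an assumption.
    return "judgment call"


print(recommend_model(domain_constrained=True, latency_sensitive=True))
print(recommend_model(domain_constrained=False, latency_sensitive=False))
```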