OpenAI’s voice mode is not running on the same model as the rest of ChatGPT. Simon Willison flagged this week that the voice interface uses a GPT-4o era model with a knowledge cutoff of April 2024 — roughly two years behind the current frontier. Andrej Karpathy has called it “slightly orphaned,” noting it will “fumble the dumbest questions” while premium coding models handle complex tasks with ease. For anyone who has noticed voice responses feeling off in ways that text responses don’t, this is the explanation.

The reason is structural, not a matter of neglect. Karpathy identifies two conditions that drive model improvement: verifiable success metrics and B2B value. Coding tasks have both — unit tests either pass or fail, and enterprise software teams pay for better code. Voice conversation has neither. “Did the AI sound natural?” is not a metric a reward model can optimise against, and the market willing to pay specifically for better voice is narrower than the market paying for better code. The result is that research and fine-tuning resources flow toward products where progress is measurable.
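The “verifiable” criterion can be made concrete with a minimal sketch (the function name and structure here are my own illustration, not anything from Karpathy's post): a coding task admits a mechanical reward — run the candidate against its tests and score pass/fail — whereas there is no analogous oracle that could score whether a voice reply sounded natural.

```python
import os
import subprocess
import sys
import tempfile


def code_reward(solution: str, tests: str) -> float:
    """Verifiable reward for a coding task: execute the candidate
    solution together with its unit tests in a subprocess.

    The signal is binary and mechanical -- the tests either pass
    (exit code 0) or fail -- so it can drive automated training.
    No equivalent function can be written for 'sounded natural'.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n" + tests + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=10
        )
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)
```

A correct solution scores 1.0 and a buggy one scores 0.0, with no human judgment in the loop — which is exactly the property conversational naturalness lacks.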

This matters beyond the specific ChatGPT case. It is a reminder that AI products are not uniform — the same brand name can sit on top of very different model tiers. OpenAI maintains multiple production models simultaneously, and which one a given interface calls is a product decision, not a capability ceiling. The voice interface has its own latency constraints, serving infrastructure, and model fine-tuning requirements that make swapping in the latest model non-trivial even when newer models are available. What ships as “ChatGPT” in voice is the version of GPT-4o that was optimised for speech synthesis, not the version optimised for reasoning.

The practical consequence is simple: voice mode is not the right surface for tasks that require recent knowledge or complex reasoning. If you are using it for anything beyond dictation, reminders, or simple lookups, you are likely hitting the gap between what feels like a premium interface and the model actually running underneath. Text mode gives you the better model. Voice gives you the more natural interaction. Right now those two things are not the same product.

The deeper pattern is that capability improvements at frontier labs do not propagate uniformly across all surfaces or modalities. Speech-to-speech has its own research track, and it is behind the text track by a meaningful margin. The April 2024 knowledge cutoff is not a temporary gap — it reflects how development resources are actually allocated, which follows revenue and verifiability, not interface parity.