Most approaches to uncertainty quantification in LLMs require access to token logprobs or model activations. That rules out the majority of production deployments, which run against proprietary APIs that expose neither. SELFDOUBT is a paper from this week’s arXiv batch that takes a different approach: it estimates uncertainty from the generated text alone, using a metric called the Hedge-to-Verify Ratio — comparing the frequency of hedging language in a response against how much of the output can be independently verified. The result is a single-pass confidence score that requires nothing beyond the model’s own output.
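The core computation is simple enough to sketch. The hedge lexicon and the claim-counting step below are illustrative stand-ins, not the paper's actual definitions (the paper presumably uses a richer lexicon and a real claim-extraction step), but they show the shape of the metric:

```python
import re

# Illustrative hedge lexicon; the paper's actual list is not reproduced here.
HEDGE_PHRASES = [
    "i think", "i'm not sure", "it might", "it may be", "possibly",
    "perhaps", "probably", "i believe", "as far as i know",
]

def count_hedges(text: str) -> int:
    """Count hedge-phrase occurrences, case-insensitively."""
    lower = text.lower()
    return sum(lower.count(phrase) for phrase in HEDGE_PHRASES)

def count_verifiable_claims(text: str) -> int:
    """Crude stand-in for claim extraction: treat each sentence as one
    verifiable claim. The paper's extraction step is presumably richer."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return max(len(sentences), 1)  # avoid division by zero

def hedge_to_verify_ratio(text: str) -> float:
    """Higher ratio = more hedging per verifiable claim = lower confidence."""
    return count_hedges(text) / count_verifiable_claims(text)
```

On a heavily hedged response the ratio climbs; on plainly declarative text with no hedges it drops to zero.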

The significance is that this is API-compatible by design. If your application routes requests to GPT, Claude, or Gemini without logprob access, you currently have no principled way to flag low-confidence outputs beyond checking for obvious uncertainty phrases (“I think,” “I’m not sure”). SELFDOUBT formalises that intuition into a measurable signal: the ratio of hedge phrases to verifiable claims correlates with output correctness, and the correlation holds across model families. The reported results, 90% accuracy at 71% coverage without any task-specific labels, clear a meaningful bar for building routing or fallback logic.
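Turning a hedge ratio into routing logic means picking an acceptance threshold. The sweep below is a generic selective-prediction calibration, not the paper's procedure, and it assumes you have a small labeled validation set of (ratio, correct) pairs, whereas the paper's headline numbers are achieved without task-specific labels:

```python
def calibrate_threshold(scored, target_accuracy=0.90):
    """Pick the loosest hedge-ratio threshold whose accepted outputs
    still meet the target accuracy, and report the resulting coverage.

    `scored` is a list of (hedge_verify_ratio, was_correct) pairs from
    a labeled validation set -- a hypothetical input format for this
    sketch, since the paper itself works without such labels.
    """
    best = None
    for candidate in sorted({ratio for ratio, _ in scored}):
        # Accept every output whose hedge ratio is at or below the candidate.
        accepted = [ok for ratio, ok in scored if ratio <= candidate]
        accuracy = sum(accepted) / len(accepted)
        coverage = len(accepted) / len(scored)
        if accuracy >= target_accuracy:
            best = (candidate, accuracy, coverage)
    return best  # (threshold, accuracy, coverage), or None if unreachable
```

Raising the threshold accepts more outputs (higher coverage) at the cost of accuracy; the sweep finds the most permissive setting that still hits your accuracy target, which is exactly the accuracy/coverage trade-off the paper's 90%/71% figure describes.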

Where this matters most is selective generation — flagging which model outputs are reliable enough to act on directly versus which should be routed to a human reviewer or a more capable model. If you are building an agent that takes consequential actions based on model outputs, the ability to attach a confidence estimate to each output changes the safety architecture significantly. Rather than accepting all outputs equally or running expensive verification loops on everything, you can triage by confidence.
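The triage described above can be as small as one function. The three-way split and the band boundaries here are illustrative placeholders, not values from the paper; in practice you would tune them for your application:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str       # "act", "escalate_to_stronger_model", or "human_review"
    confidence: float

def triage(confidence: float,
           act_above: float = 0.85,
           review_below: float = 0.50) -> Decision:
    """Three-way triage on a per-output confidence score.

    High confidence: act on the output directly. Middling confidence:
    re-run with a more capable model. Low confidence: send to a human.
    The band boundaries are illustrative, not from the paper.
    """
    if confidence >= act_above:
        return Decision("act", confidence)
    if confidence < review_below:
        return Decision("human_review", confidence)
    return Decision("escalate_to_stronger_model", confidence)
```

The point of the middle band is cost control: only the genuinely ambiguous outputs pay for a second, more expensive model call, and only the worst ones consume reviewer time.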

The paper does not claim the hedge ratio is a perfect proxy for correctness. It is a heuristic that holds statistically but will misfire on confident incorrect outputs, which are exactly the failure mode that representation engineering research earlier this week identified as systematically hard to catch. SELFDOUBT covers the cases where models signal uncertainty through language; it does not help with the cases where models are confidently wrong. Those are different problems.

The takeaway: if you are building a system that uses model outputs to make decisions, whether that is a RAG pipeline, an agent, or a code generation tool, a hedge-based confidence filter is something you can add today with no model access requirements. The specific SELFDOUBT implementation will require reading the paper for tuning details, but the underlying approach is straightforward enough to implement against any API.
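Because the filter needs only the generated text, it can wrap any existing client with a few lines. The `generate` and `score` parameters below are hypothetical placeholders for whatever API call and text-only confidence function you already have, not a specific vendor interface:

```python
from typing import Any, Callable, Dict

def guarded_call(generate: Callable[[str], str],
                 score: Callable[[str], float],
                 prompt: str,
                 threshold: float = 0.8) -> Dict[str, Any]:
    """Wrap an arbitrary text-in/text-out API call with a confidence gate.

    `generate` stands in for your existing client call (OpenAI, Anthropic,
    Google, ...); `score` is any text-only confidence function, such as one
    derived from a hedge-to-verify ratio. Both parameters and the default
    threshold are hypothetical, chosen for illustration.
    """
    output = generate(prompt)
    confidence = score(output)
    return {
        "output": output,
        "confidence": confidence,
        "accepted": confidence >= threshold,
    }
```

Rejected outputs carry their score with them, so the caller can feed them straight into whatever fallback path the application already has.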