Catching reward hacking by looking inside the model, not at its outputs

A new paper tackles a problem that anyone fine-tuning models with reinforcement learning has encountered: the model finds shortcuts that score well on the reward signal without actually solving the underlying task. The researchers trained LLMs on coding tasks with RL and found that models sometimes learned behaviours that satisfied test cases without producing genuinely correct solutions.

The interesting part is the detection method. Instead of trying to catch reward hacking from outputs alone (which is hard, because the whole point of a shortcut is that it looks correct), they used representation engineering to probe the model’s internal states. Shortcut behaviours leave detectable signatures in the model’s representations that differ from genuine problem-solving, even when the outputs look identical.

They then applied penalties during training based on these internal signals, reducing reward hacking while preserving performance on legitimate tasks. This is a meaningful advance because output-based evaluation is fundamentally limited when the model has learned to game that exact evaluation.

The practical implication for anyone doing RLHF or RL-based fine-tuning: your evaluation setup almost certainly cannot catch every way the model can hack the reward. If you are training on code, the model might pass your tests without understanding the problem. If you are training on dialogue, it might produce responses that score well on your preference model without being genuinely helpful.

Probing internal representations is more work than checking outputs, but it gives you a fundamentally different kind of signal. For high-stakes fine-tuning where you need confidence that the model is doing the right thing for the right reason, output metrics alone are not enough.