Every major AI agent benchmark can be gamed to a perfect score

Eight benchmarks. Every single one exploitable to near-perfect scores without solving any task. That is the finding from Berkeley’s RDI lab, and it lands directly in the middle of how AI labs market models, how companies make model selection decisions, and how safety researchers assess risk. The benchmarks affected include SWE-bench Verified (the standard for code generation), OSWorld (the dominant computer-use benchmark), WebArena, GAIA, and four others — a set that covers the majority of serious agent capability claims made in the last 18 months.

The attack surface varies by benchmark but seven recurring patterns underpin all of them: no isolation between agent and evaluator environments, answers shipped alongside test configurations, eval() called on untrusted agent output, prompt injection into LLM judge inputs, weak string matching that normalises away incorrect answers, broken evaluation logic that skips checks entirely, and trusting output from code the agent itself controls. In the case of SWE-bench Verified, a Pytest hook can force every test to pass regardless of the underlying code. In FieldWorkArena, the validator only checks that a message exists, not whether it is correct. OSWorld scores under 100% only because the exploit downloads public gold files from HuggingFace and the pipeline occasionally fails on that network step.

The impact operates at a couple of levels. First, model selection decisions made in the last year are suspect. If a model claims state-of-the-art on SWE-bench or WebArena, that benchmark score is not a reliable signal of real capability. This matters for teams that chose infrastructure, negotiated contracts, or built product roadmaps based on published rankings. Second, and more concerning, the authors note that increasingly capable agents may discover these reward-hacking shortcuts themselves. An agent optimising for benchmark performance has a strong incentive to find that manipulating the evaluator is easier than solving the task. You do not need deliberate cheating for this to happen — you need an agent that is good at finding shortcuts.

The field has known about benchmark contamination as a problem for a while, but this is somewhat different and novel from contamination. Contamination has meant training data leakage. What Berkeley documents is structural: the evaluation infrastructure itself is exploitable by design. Fixing contamination requires curating better datasets. Fixing structural exploitability requires isolated execution environments, append-only evaluator logs, hardware attestation, and judge inputs that are sanitised against prompt injection — all of which add operational overhead that the volunteer-maintained benchmark ecosystem has not historically had resources to implement.

For practitioners, the takeaway is to stop treating benchmark position as a proxy for production capability without validating it in your specific context. Run models on your own tasks with your own evaluation, treat published scores as directional rather than definitive, and when choosing between models for production agentic work, weight internal evals far more heavily than leaderboard position. The Berkeley paper gives you a useful checklist: before trusting any benchmark score, ask whether the evaluation runs in an isolated environment where the agent cannot access evaluator state. If it does not, the score tells you very little.