Eight major AI agent benchmarks can be gamed to near-perfect scores
Researchers at Berkeley found that eight of the most widely cited AI agent benchmarks — including SWE-bench Verified and OSWorld — can each be exploited to achieve near-perfect scores without solving a single task.
- research
- tools