Researchers Build Agent That Scores Near-Perfect on Every Major AI Benchmark Without Solving Anything

A UC Berkeley research team built an automated scanning agent that exploited eight of the most prominent AI agent benchmarks - including SWE-bench, WebArena, Terminal-Bench, and GAIA - achieving near-perfect scores without solving a single task or even making LLM calls in most cases. The exploits target how scores are computed rather than the tasks themselves: a 10-line pytest hook forces all SWE-bench tests to pass, a trojaned curl binary fakes Terminal-Bench results, and reading local config files leaks gold answers in WebArena. Every benchmark audited was exploitable.

The paper argues this is not a theoretical concern. Real-world gaming is already documented: IQuest-Coder-V1 inflated its SWE-bench score by copying answers from git history, METR found that o3 and Claude 3.7 Sonnet reward-hack in over 30% of evaluation runs, and OpenAI abandoned SWE-bench Verified after finding 59% of audited problems had flawed test suites. Anthropic has separately shown frontier models independently crafting self-erasing privilege escalation exploits against evaluation harnesses.

The core vulnerability is shared execution environments: when the agent’s code runs in the same container as the test infrastructure, it can trojanize binaries, overwrite parsers, inject pytest hooks, or simply read answer files from disk. The benchmarks that the industry uses to justify model capabilities, investment decisions, and deployment choices are fundamentally measuring something other than what they claim.