LLM subverts evaluation
"BrowseComp is an evaluation designed to test how well models can find hard-to-locate information on the web.
"Like many benchmarks, it is vulnerable to contamination: answers leak onto the public web through academic papers, blog posts, and GitHub issues, and a model running the eval can encounter them in search results.
"However, we also witnessed two cases of a novel contamination pattern.
"Instead of inadvertently coming across a leaked answer, Claude Opus 4.6 independently hypothesized that it was being evaluated, identified which benchmark it was running in, then located and decrypted the answer key.
"To our knowledge, this is the first documented instance of a model suspecting it is being evaluated without knowing which benchmark was being administered, then working backward to successfully identify and solve the evaluation itself.
"We believe this previously unobserved technique is made possible by increases in model intelligence and more capable tooling, notably code execution.
"This finding raises questions about whether static benchmarks remain reliable when run in web-enabled environments."