The limits of traditional testing
If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long.
One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.
Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition—but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)
A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.
But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly.
Where things break down
Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”
Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”