The release of each new AI video generator now prompts an unofficial, yet telling, benchmark: can it realistically depict actor Will Smith eating spaghetti? The trend, which has become both a meme and a practical test, was even parodied by Smith himself. Recently, Google's Veo 2 successfully rendered the now-iconic scene, highlighting a shift in how AI capabilities are being measured.
This unconventional approach is part of a larger movement in 2024 in which the AI community has embraced bizarre "unofficial" benchmarks, from a teenager's app that lets AI models control Minecraft to a platform where AIs face off in games like Pictionary. The traction these playful tests are gaining raises an obvious question: why are they so popular?
Part of the appeal of these odd benchmarks is their accessibility and relevance to everyday users. Standard industry benchmarks, like performance on Math Olympiad exams or doctoral-level problems, rarely resonate with the average person, who mostly uses AI for tasks such as drafting emails and basic research. That disconnect points to a real gap in how we assess AI's practical utility.
Moreover, crowdsourced platforms like Chatbot Arena, while popular, have their own limitations. Their evaluators skew heavily toward the tech industry rather than the broader population, introducing bias and making these leaderboards a subjective tool rather than an objective one.
This preference for unconventional benchmarks may also reflect a critical failure of industry-standard ones. As Wharton professor Ethan Mollick has noted, standard assessments rarely compare an AI's performance against that of an average human, leaving a significant gap in evaluations of real-world applications such as medicine, law, and general advice quality.
Unusual tests like AI-controlled Minecraft, Connect 4, and Will Smith eating spaghetti are not empirically rigorous, but they are far more digestible and entertaining for the general public. Given the industry's struggle to convey the complexity of AI to a wider audience, these quirky metrics, despite their limited generalizability, seem poised to take on a more prominent role in how the technology is judged. The search for the next viral AI benchmark is almost certainly already underway.