Debates over AI benchmarks, and over how AI labs report them, are increasingly spilling into public view.
Recent Controversy Over Benchmark Reporting
An OpenAI employee recently accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its new AI model, Grok 3. Igor Babushkin, one of xAI's co-founders, defended the company's practices on social media.
The reality of the situation appears to be nuanced.
xAI shared a blog post detailing Grok 3's performance on AIME 2025, a set of challenging questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark; nevertheless, AIME 2025 and earlier editions of the exam are commonly used to gauge a model's mathematical ability.
The graph xAI published showed two versions of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's most capable available model, o3-mini-high, on AIME 2025. However, OpenAI representatives quickly noted that the graph omitted o3-mini-high's AIME 2025 score at "cons@64."
So, what does cons@64 refer to? It stands for "consensus@64": the model is given 64 attempts at each problem in a benchmark, and the answer it generates most frequently is taken as its final answer for that problem. This method tends to boost models' benchmark scores considerably, and omitting it from a graph can make one model appear to surpass another when in reality it does not.
When scored at "@1," which counts only a model's first attempt at each problem, Grok 3 Reasoning Beta and Grok 3 mini Reasoning both fall short of o3-mini-high's score. Grok 3 Reasoning Beta also trails slightly behind OpenAI's o1 model set to "medium" computing. Nonetheless, xAI is promoting Grok 3 as the "world's smartest AI."
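To make these two metrics concrete, here is a minimal Python sketch of how cons@64 and @1 could be computed from a model's sampled answers. The data layout, function names, and toy numbers are hypothetical illustrations, not the evaluation harness xAI or OpenAI actually used.

```python
from collections import Counter

def cons_at_64(samples_per_problem, answer_key):
    """consensus@64: grade the answer a model produced most often across 64 attempts.

    samples_per_problem: problem id -> list of sampled answers (64 per problem in the real metric)
    answer_key: problem id -> correct answer
    (Hypothetical layout for illustration only.)
    """
    correct = 0
    for pid, samples in samples_per_problem.items():
        # The most frequently generated answer becomes the final answer for this problem.
        majority_answer, _ = Counter(samples).most_common(1)[0]
        if majority_answer == answer_key[pid]:
            correct += 1
    return correct / len(samples_per_problem)

def score_at_1(samples_per_problem, answer_key):
    """@1: grade only the first sampled answer for each problem."""
    correct = sum(
        1
        for pid, samples in samples_per_problem.items()
        if samples[0] == answer_key[pid]
    )
    return correct / len(samples_per_problem)

# Toy usage with two problems and three samples each (64 in the real metric).
# AIME answers are integers from 0 to 999, so exact-match grading and majority
# voting are straightforward.
samples = {"p1": [204, 204, 17], "p2": [5, 12, 12]}
key = {"p1": 204, "p2": 12}
print(cons_at_64(samples, key))   # 1.0 (the majority answer is correct for both problems)
print(score_at_1(samples, key))   # 0.5 (only p1's first attempt is correct)
```

Because majority voting over many attempts smooths out unlucky samples, a model's cons@64 score is usually well above its @1 score, which is why comparing one model's cons@64 result against another's @1 result is not a like-for-like comparison.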
Babushkin countered on social media that OpenAI has itself published benchmark charts in the past that could be seen as misleading, though those charts compared OpenAI's own models to one another. An independent party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64.
Many see my plot as an attack on OpenAI while others view it as a criticism of Grok; in truth, it's merely DeepSeek propaganda. I believe Grok performs well, yet OpenAI's tactics with o3-mini-high merit further investigation.
— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
However, AI researcher Nathan Lambert noted that a crucial metric remains undisclosed: the computational (and monetary) cost each model incurred to achieve its best score. That omission underscores how little most AI benchmarks reveal about models' capabilities and limitations.