xAI's Grok-3 Shows Dominance Over Competitors in Benchmark Tests

Source: Manufactry | 2025-02-18 18:54

xAI today released its new-generation large language model, Grok-3, along with a streamlined version, Grok-3 mini. The latest benchmark results show Grok-3 holding a significant lead in direct comparisons with DeepSeek.

In the mathematical ability test (AIME '24), Grok-3 scored 52 points, well ahead of DeepSeek-V3's 39 points. In the scientific knowledge assessment (GPQA), Grok-3 led with 75 points to DeepSeek-V3's 65 points. In the programming ability test (LCB Oct - Feb), Grok-3 scored 57 points, again surpassing DeepSeek-V3's 36 points.

In the newly announced AIME 2025 results, the Grok-3 Reasoning Beta version scored 93 points on a composite measure of reasoning and computation time, and the streamlined Grok-3 mini reached 90 points. By contrast, DeepSeek-R1 scored 75 points and Gemini-2 Flash Thinking scored only 54 points. The result further highlights Grok-3's advantage in complex mathematical reasoning and computational efficiency.

Notably, DeepSeek's recently released DeepSeek-R1 also failed to outperform Grok-3 in the other reasoning tests. In mathematical reasoning, Grok-3 scored 93 points to DeepSeek-R1's 73 points; in scientific reasoning, 85 points to 74 points; and in programming reasoning, 79 points to 65 points.

In addition, in the LMSYS Chatbot Arena assessment, Grok-3 scored approximately 1400 points, exceeding not only the DeepSeek series but also other mainstream large models, including GPT-4 and Claude.

These data indicate that, despite DeepSeek's strong momentum over the past few months, Grok-3 still holds a leading position in overall performance. Its advantages are most pronounced in mathematical reasoning and computational efficiency, which reflects xAI's technical strength in model development and underscores how intense competition in the AI field has become.