Grok-3: A new challenger to OpenAI, DeepSeek, Google?

According to xAI’s own data, Grok-3 outperformed several other major AI models on benchmarks related to math (AIME), science (GPQA), and coding (LCB).

Arun Padmanabhan

June 03, 2025 / 16:28 IST

Just days after reports surfaced that Elon Musk had made an offer to buy OpenAI, the billionaire's AI venture, xAI, unveiled its highly anticipated Grok-3 model.

As Musk's AI ambitions grow, Grok-3 marks a major step in his quest to compete with industry rivals. But how does Grok-3 stack up against the likes of OpenAI's GPT, Google's Gemini, Anthropic's Claude, and others in terms of performance and capabilities? Let’s break it down.

The real star isn't the AI itself, but its incredible engine: Colossus. The system underwent two phases of training: an initial 122-day synchronous training period using 100,000 GPUs, followed by a scaling-up phase over 92 days, bringing the total GPU count to 200,000. During the launch event, xAI developers said that constructing the infrastructure proved to be even more challenging than the development of the AI model itself.

The company has even bigger plans, with Musk hinting that it is aiming for a system five times as powerful, which would make it the largest GPU cluster.

Benchmarking Grok-3

According to xAI’s own data, Grok-3 outperformed several other major AI models on benchmarks related to math (AIME), science (GPQA), and coding (LCB). Grok-3 scored 52 in math, 75 in science, and 57 in coding, while the Grok-3 Mini variant scored 40, 65, and 41, respectively. These scores surpass those of major models like Google DeepMind’s Gemini-2 Pro, Anthropic’s Claude 3.5 Sonnet, and OpenAI’s GPT-4o, all of which scored lower across these tests.

Grok-3’s standout performance wasn’t limited to the usual benchmarks. During blind evaluations in the Chatbot Arena, an open-source platform for comparing AI models, Grok-3's earlier test version, codenamed "Chocolate," topped the Elo rankings, signifying that users consistently preferred its answers over those of other leading models.

The Chatbot Arena platform pits models against each other in blind tests, where users are unaware of which model is giving the answers.
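To make the idea of a ranking built from blind head-to-head votes concrete, here is a minimal sketch of the textbook Elo update in Python. It is an illustration only: the model names, the starting rating of 1000, the K-factor of 32, and the sample votes are all assumptions made for this example, and Chatbot Arena's actual scoring method may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if users preferred A, 0.0 if they preferred B,
    and 0.5 for a tie. k (assumed 32 here) controls the step size.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical models and votes, for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),  # voter preferred model_x
    ("model_x", "model_y", 1.0),  # voter preferred model_x
    ("model_x", "model_y", 0.0),  # voter preferred model_y
]

for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

# Sort models from highest to lowest rating.
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

A model that wins more blind comparisons than its rating predicts climbs the table, which is why topping the ranking is read as a sign of consistent user preference.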

The “Reasoning Beta” variant, which integrates a chain-of-thought processing system and uses additional test-time computation, achieved 93 percent on the AIME 2025 benchmark, while other models, like GPT-4 and Gemini-2.0, scored below 87 percent on the same test.

Interestingly, Grok-3 Mini Reasoning Beta, despite its smaller size, sometimes outperformed the full version of Grok-3.

Playing catch-up?

Despite these impressive results, Grok-3's live demo might make you wonder whether it truly brings something new to the table. While Grok-3 solved physics problems and wrote game code, these are capabilities that other AI models, including OpenAI's GPT-4, Claude, and Google's Gemini, have already showcased.

Meanwhile, Grok-3 introduced DeepSearch, a research assistant capable of searching the web and generating detailed reports. This functionality is similar to what we’ve seen from OpenAI and Google with their own AI-driven research agents.

xAI also previewed upcoming features like voice capabilities, allowing for more natural, expressive speech interactions. This is expected to rival OpenAI’s "Advanced Voice Mode," but Musk emphasised that Grok-3’s voice model is much more sophisticated than simple text-to-speech technology.

xAI is also building an AI-powered gaming studio to help developers create games using Grok-3.

Grok-3’s place in the AI race?

So, where does Grok-3 stand in comparison to other leading AI models? Right now, it seems capable of matching and, in some cases, outpacing the best models available.

However, with OpenAI preparing to release GPT-4.5, and the growing emphasis on affordable AI models following the launch of China's DeepSeek, the real test will be how Grok-3 performs against the next wave of innovation.

Andrej Karpathy, founder of Eureka Labs, and formerly with OpenAI and Tesla, was given early access to Grok-3 and said it performed well on complex tasks, such as creating a hex grid for the popular board game Settlers of Catan.

“Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but all of DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude do not,” he said on X (formerly Twitter).

“The impression overall I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though, of course, we need actual, real evaluations to look at,” he added.

Meanwhile, Gary Marcus, CEO of Geometric Intelligence and an outspoken critic of the current AI hype, was more sceptical.

"Elon Musk promised that Grok 3 would be the smartest AI ever. One of his fans even predicted earlier today that it would be AGI! Spoiler alert: it wasn’t," he wrote on Substack.
