
Google’s own benchmark shows AI chatbots still get one in three answers wrong

Google has put today’s most popular AI chatbots, including its own Gemini, under a harsher spotlight, and the results suggest confidence still far outpaces correctness.

December 17, 2025 / 13:05 IST

Google has published a strikingly blunt assessment of how reliable modern AI chatbots really are, and the numbers make for uneasy reading. Using its newly launched FACTS Benchmark Suite, the company found that even the strongest AI models struggle to cross a 70 percent factual accuracy threshold. In plain terms, today’s chatbots still get roughly one out of every three answers wrong, even when they sound completely sure of themselves.

The highest score came from Google’s own Gemini 3 Pro, which achieved 69 percent overall accuracy. Other major systems from OpenAI, Anthropic, and xAI performed worse, reinforcing a point many researchers have been making quietly for years. Fluency is not the same as truth.

What makes this benchmark stand out is what it chooses to measure. Most AI evaluations focus on task completion, such as whether a model can summarise text or write code. FACTS instead asks a more uncomfortable question: is the information actually correct? For sectors like finance, healthcare, and law, that distinction is not academic. A confident but incorrect answer can lead to bad decisions, regulatory trouble, or real-world harm.

The FACTS Benchmark Suite was developed by Google’s FACTS team in partnership with Kaggle and is designed around four practical use cases. One test looks at parametric knowledge, checking whether a model can answer factual questions using only what it learned during training. Another evaluates search performance, measuring how well models retrieve and use information from the web. A third focuses on grounding, testing whether a model sticks to a provided document instead of inventing extra details. The fourth examines multimodal understanding, such as interpreting charts, diagrams, and images.

Results varied sharply across these categories. After Gemini 3 Pro, Gemini 2.5 Pro and OpenAI’s GPT-5 clustered around 62 percent accuracy. Anthropic’s Claude Opus 4.5 scored roughly 51 percent, while xAI’s Grok 4 landed near 54 percent. Multimodal tasks were consistently the weakest area, with many models dipping below 50 percent accuracy. That matters because misreading a chart or pulling the wrong figure from an image can easily slip past users who assume the system is reliable.

Google’s conclusion is measured rather than alarmist. Chatbots are improving, and they are undeniably useful, but for now human oversight, strong guardrails, and healthy scepticism are still doing much of the heavy lifting.


Sarthak Singh is an experienced writer who has covered personal and consumer tech, gadget news, social media trends, and more for several years.
