
Google’s own benchmark shows AI chatbots still get one in three answers wrong

Google has put today’s most popular AI chatbots, including its own Gemini, under a harsher spotlight, and the results suggest confidence still far outpaces correctness.

December 17, 2025 / 13:05 IST

Google has published a strikingly blunt assessment of how reliable modern AI chatbots really are, and the numbers make for uneasy reading. Using its newly launched FACTS Benchmark Suite, the company found that even the strongest AI models struggle to cross a 70 percent factual accuracy threshold. In plain terms, today’s chatbots still get roughly one out of every three answers wrong, even when they sound completely sure of themselves.

The highest score came from Google’s own Gemini 3 Pro, which achieved 69 percent overall accuracy. Other major systems from OpenAI, Anthropic, and xAI performed worse, reinforcing a point many researchers have been making quietly for years. Fluency is not the same as truth.

What makes this benchmark stand out is what it chooses to measure. Most AI evaluations focus on task completion, such as whether a model can summarise text or write code. FACTS instead asks a more uncomfortable question: is the information actually correct? For sectors like finance, healthcare, and law, that distinction is not academic. A confident but incorrect answer can lead to bad decisions, regulatory trouble, or real-world harm.

The FACTS Benchmark Suite was developed by Google’s FACTS team in partnership with Kaggle and is designed around four practical use cases. One test looks at parametric knowledge, checking whether a model can answer factual questions using only what it learned during training. Another evaluates search performance, measuring how well models retrieve and use information from the web. A third focuses on grounding, testing whether a model sticks to a provided document instead of inventing extra details. The fourth examines multimodal understanding, such as interpreting charts, diagrams, and images.

Results varied sharply across these categories. After Gemini 3 Pro, Gemini 2.5 Pro and OpenAI’s GPT-5 clustered around 62 percent accuracy. Anthropic’s Claude Opus 4.5 scored roughly 51 percent, while xAI’s Grok 4 landed near 54 percent. Multimodal tasks were consistently the weakest area, with many models dipping below 50 percent accuracy. That matters because misreading a chart or pulling the wrong figure from an image can easily slip past users who assume the system is reliable.

Google’s conclusion is measured rather than alarmist. Chatbots are improving, and they are undeniably useful, but for now human oversight, strong guardrails, and healthy scepticism are still doing much of the heavy lifting.


Sarthak Singh is an experienced writer who has covered personal and consumer tech, gadget news, social media trends, and more for several years.
