Moneycontrol PRO

As AI content floods the internet, are we risking a collapse of intelligence?

For India, the stakes are especially high. Startups such as Gan.ai, Gnani.ai and Soket AI Labs, selected to build foundation models under the IndiaAI Mission, face a unique challenge: how to curate clean human-created datasets in an increasingly synthetic web

July 21, 2025 / 11:55 IST

In Greek mythology, the ouroboros, a serpent eating its own tail, is a symbol of infinity, recursion and, at times, even self-destruction. In today’s artificial intelligence (AI) ecosystem, that metaphor may be turning into reality.

In the early days of the internet, human-made content was king. But in 2025, that reality shifted. Generative AI models like GPT-4 and Claude are not only writing our emails but also shaping how the next generation of AI learns.

The problem? AI may be learning more from itself than from us.

The infinite feedback loop

As content created by AI floods the internet, future models are at risk of being trained on synthetic data rather than human content. According to a report from AI content detection firm Copyleaks, there has been an 8,362 percent jump in AI-generated web content since the launch of ChatGPT in late 2022.

A 2023 study from researchers at Oxford and Cambridge calls this “model collapse”. The paper describes model collapse as a degenerative loop where models trained primarily on AI-generated content begin to lose accuracy and originality.

“We discover that indiscriminately learning from data produced by other models causes ‘model collapse’ — a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time,” the study said.

The idea is simple. If models are repeatedly trained on synthetic data, errors and biases compound, leading to degraded performance and less “human-like” understanding.
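That compounding can be illustrated with a toy simulation (a deliberately simplified sketch, not how production LLMs are trained): fit a word-frequency "model" to a small corpus, sample a new corpus from it, refit, and repeat. Rare words vanish first, and once a word's estimated probability hits zero it can never come back.

```python
import random
from collections import Counter

def next_generation(counts, n_samples, rng):
    """'Train' the next model purely on the previous model's output:
    sample n_samples tokens from the frequency estimate implied by
    counts, then refit counts to those samples alone."""
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return Counter(rng.choices(tokens, weights=weights, k=n_samples))

# Toy corpus: one very common token plus a few rare ones - the
# low-probability 'tails' that model collapse erodes first.
rng = random.Random(42)
counts = Counter({"the": 900, "ouroboros": 4, "provenance": 3, "watermark": 3})

vocab_sizes = [len(counts)]
for _ in range(30):
    counts = next_generation(counts, n_samples=200, rng=rng)
    vocab_sizes.append(len(counts))

# The vocabulary can only shrink: a token the model stops producing
# is never sampled again, so each loss is permanent.
print("vocabulary size per generation:", vocab_sizes)
```

The one-way direction of the loss is the point: each generation can forget parts of the distribution but never rediscover them, which is the degenerative dynamic the Oxford and Cambridge paper describes at far larger scale.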

Guardrails against collapse

To counter this, AI companies are now building tools that can tell a machine from a human.

One approach is provenance tagging: embedding invisible watermarks in AI-generated content. Companies such as Google, OpenAI, and Meta are already piloting such systems to filter out AI-generated content during training.

Another method is synthetic detection, where models are trained to spot telltale signs of AI-generated text, such as overly formal language or unusually uniform sentence patterns.
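As a toy illustration of that idea (real detectors are trained classifiers; the phrase list and scoring rule below are invented for illustration, not any vendor's actual method), a crude heuristic scorer might count stock connective phrases and reward uniform sentence lengths:

```python
import re

# Hypothetical list of stock phrases often associated with generated text.
STOCK_PHRASES = [
    "in conclusion", "it is important to note",
    "furthermore", "delve into",
]

def synthetic_score(text):
    """Crude score for 'AI-sounding' prose: stock-phrase hits plus a
    bonus when sentence lengths are unusually uniform. Higher means
    more machine-like; illustrative only, not a real detector."""
    lowered = text.lower()
    phrase_hits = sum(lowered.count(p) for p in STOCK_PHRASES)

    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) > 1:
        mean = sum(lengths) / len(lengths)
        variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
        uniformity = 1.0 / (1.0 + variance)  # near 1.0 for uniform lengths
    else:
        uniformity = 0.0
    return phrase_hits + uniformity

print(synthetic_score("Furthermore, it is important to note this point."))
# prints 2.0: two stock phrases, only one sentence so no uniformity bonus
```

Production detectors replace hand-picked cues like these with statistical models, but the principle is the same: generated text leaves measurable surface regularities.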

Some companies are also using human-in-the-loop curation by gathering trusted data from books, research papers or verified user submissions.

“One approach is to detect statistical watermarks in online content, which can help identify whether the text was generated by another model. Data provenance is another approach… These methods help filter out synthetic content before it becomes part of the training dataset,” said Srikanth Velamakanni, CEO of Fractal Analytics, which recently released a reasoning-heavy model called Fathom-R1-14B.

“Model collapse is mostly a theoretical concern. The idea that we’re running out of training data comes from focusing only on what's easily available. In reality, there's still a massive amount of untapped data out there,” Velamakanni added.
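The statistical watermarks Velamakanni mentions can be sketched roughly as follows (a simplified illustration of the published "green list" technique, not the internal scheme of any particular company): the generator is nudged toward a pseudo-random "green" half of the vocabulary keyed on the previous token, and a detector later checks whether a suspiciously large share of tokens landed in their green sets.

```python
import hashlib

def green_set(prev_token, vocab, fraction=0.5):
    """Deterministically pick a pseudo-random 'green' subset of the
    vocabulary, keyed on the previous token. A watermarking generator
    would bias sampling toward this subset at every step."""
    ranked = sorted(
        vocab,
        key=lambda t: hashlib.sha256(f"{prev_token}|{t}".encode()).hexdigest(),
    )
    return set(ranked[: int(len(ranked) * fraction)])

def green_fraction(tokens, vocab):
    """Detector side: fraction of tokens that land in the green set of
    their predecessor. Unwatermarked text hovers near `fraction` (0.5
    here); a value well above that suggests watermarked output."""
    if len(tokens) < 2:
        return 0.0
    hits = sum(
        1 for prev, cur in zip(tokens, tokens[1:])
        if cur in green_set(prev, vocab)
    )
    return hits / (len(tokens) - 1)

vocab = ["the", "cat", "sat", "on", "mat", "a", "dog", "ran"]
print(green_fraction("the cat sat on the mat".split(), vocab))
```

A real deployment differs in many details (a soft sampling bias rather than a hard list, significance tests over long passages), but the detection principle is the same: watermarked text is statistically distinguishable without carrying any visible mark, which is what makes it usable as a training-data filter.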

India’s risk and opportunity

For India, the stakes are especially high. Startups like Gan.ai, Gnani.ai, and Soket AI Labs, all selected to build homegrown foundation models under the IndiaAI Mission, face a unique challenge: how to curate clean human-created datasets in an increasingly synthetic web.

Gan.ai founder Suvrat Bhooshan said model collapse is not an immediate risk. “While model collapse from synthetic data isn’t a major concern right now as real data still dominates, we know it can become a problem,” he said. “Our models include watermarking capabilities to detect if content was generated by us, so we can remove that from training sets.”

At Gnani.ai, which is building a 14-billion-parameter voice-first LLM, co-founder Ganesh Gopalan calls model collapse “a critical challenge”.

“Our pipeline employs automated speech recognition (ASR) metadata tagging and provenance tracking to eliminate synthetic content, maintaining a 20 percent human-sourced data baseline to counter collapse risks,” Gopalan said, adding that reinforcement learning and human-in-the-loop validation are key to ensuring quality.

Soket AI Labs, another IndiaAI Mission startup, is taking a multi-pronged approach. “Learning from synthetic data only becomes problematic if it leads the model to derive conclusions or patterns that fall outside the desired data distribution for its intended application,” founder Abhishek Upperwal said.

While Upperwal agrees that AI-generated content could "poison" future training data, he added, “As AI research advances, I expect significant improvements in mechanisms to distinguish AI-generated content from human-generated content. We use web snapshots from pre-2018 Common Crawl datasets to ensure foundational data remains largely human-written.”

Diverging views

However, not everyone believes model collapse is a crisis in the making. Independent tech expert Tanuj Bhojwani dismisses it as “one research paper that got overblown”.

“Most model builders rely on closed data pipelines, not open data, so this scenario happening is rather unlikely,” Bhojwani said.

He argued that synthetic data, when done right, can be cleaner than human datasets. “TTS models like Kokoro, for example, gave ground-breaking results being trained mostly with synthetic data,” he added.

Sravan Kumar Aditya, co-founder of Toystack AI, disagrees. “With the vast amount of content AI has created, we can’t use that to train models,” he said.

“I don’t recommend training on synthetic data because it may not be correct or accurate. If you want a good model, train it on real-world examples created by humans, not AI itself.”

Tarun Pathak, research director at Counterpoint, said model collapse is already a concern in the AI research community. “When synthetic content dominates, hallucinations increase and the line between fact and fiction blurs. People stop fact-checking, and misinformation rises,” he added.

Is the window closing for newcomers?

According to the Oxford and Cambridge paper, early movers in the LLM race, like OpenAI, had an advantage as they had access to a cleaner and more human internet.

Yet, Velamakanni of Fractal said “the window for clean models is far from closed”, especially for a country like India. “A significant portion of Indian-language content remains untouched,” he said. “Newspapers, books, government archives, all of it can serve as clean sources.” India, he argued, must focus not on scale, but on cultural relevance and domain-focused models.

At Gnani, Gopalan agrees. “India has a distinct opportunity to create AI models that authentically capture its vast linguistic diversity, cultural nuance, and local context, rather than simply mirroring Western internet patterns,” Gopalan said.

“By investing in careful data curation, collaborating with local publishers, crowdsourcing speech and text data, and tapping government-backed initiatives like the IndiaAI Mission, players can create models grounded in authentic linguistic and cultural contexts,” he added.

Pathak of Counterpoint said while data is still available, acquiring and processing it at scale is resource-intensive. “As a result, many companies are opting to build on existing foundation models rather than reinvent the wheel,” he said.

Is smaller better?

One common theme among industry players is the push towards smaller models trained on trusted human data.

“English content has seen the biggest flood of synthetic text,” said Fractal’s Velamakanni. “But Hindi, Tamil, Bengali, among others remain relatively clean. We should take advantage of that.”

Gopalan said India’s strength is its diversity. “This local-first, quality-over-quantity approach can help them (Indian AI startups) stand out globally, not by competing head-on in sheer scale, but by offering models that excel in relevance, accuracy, and real-world utility for diverse Indian users,” he said.

“Indian startups and labs can still compete globally by focusing on multilingual LLMs tailored to Indian languages. The key lies in curating high-quality regional datasets, fine-tuning models with human oversight, and leveraging partnerships with local institutions to access authentic, trusted data,” Counterpoint’s Pathak said.

Arun Padmanabhan
first published: Jul 21, 2025 10:14 am


