OPINION | Mind your language

Voice is fast becoming the default interface for the next wave of internet users. Indians rarely stay within one language when conversing. Most speech systems are not built for this and treat linguistic mixing as an error. The economic cost is sizeable.

Moneycontrol Opinion

April 15, 2026 / 16:14 IST

Supriya Paul

There is a quiet mismatch at the heart of Artificial Intelligence in India. On paper, speech recognition systems are getting better. In practice, they are often not improving where it matters most.

The numbers look reassuring. Leading systems now report near-human accuracy in controlled environments. But those environments are carefully constructed. Clean audio. Standard accents. One language at a time. Step outside that frame and the confidence begins to slip.

India speaks in ways that are unique

India does not speak in controlled environments.

It speaks across languages, often within the same sentence. It speaks in accents shaped by geography, education, and mobility. It speaks in crowded markets, moving vehicles, shared homes, and uneven network conditions. What is measured in the lab is only a fraction of what is encountered in the real world. The gap between the two is not incidental. It is structural.

This would be a technical footnote in most countries. In India, it is foundational.

Voice as the default interface

Voice is fast becoming the default interface for the next wave of internet users. With more than 950 million Indians online and millions more entering from non-English backgrounds, speaking is often easier than typing. Industry estimates suggest that a large majority of new users prefer local language interactions, and voice search in Indian languages has surged sharply in recent years. In sectors ranging from banking and e-commerce to public services, companies are redesigning interfaces around speech.

The promise is straightforward. Speak naturally, and the system will understand.

The reality is more complicated.

Speech patterns and language switching

The problem is often framed as one of data. India, it is said, is a low-resource language environment. The assumption is that once enough data is collected, performance will converge. This is a convenient explanation. It is also incomplete.

India is not short on language. It is saturated with it.

The country operates across 22 officially recognised languages and hundreds of others, spanning multiple linguistic families. Each comes with its own phonetic logic, grammatical structure, and rhythm. Even within a single language, variation is immense. The Hindi of a news anchor is not the Hindi of a shopkeeper in Kanpur or a student in Delhi. Indian English itself carries distinct patterns that do not always align with global norms.

More importantly, Indians rarely stay within one language. Conversations move fluidly between Hindi and English, Tamil and English, Bengali and English, often mid-sentence. This is not confusion. It is a stable, rule-governed way of speaking that reflects context and identity.

Most speech systems are not built for this.

They are trained and evaluated on monolingual datasets where linguistic mixing is treated as an error. As a result, they perform well in controlled demonstrations but struggle in everyday use. They fragment sentences, misinterpret intent, and default to the closest familiar pattern. The failure is subtle, but cumulative. Each small error introduces friction. Each moment of friction reduces trust.

What makes this harder to address is not the modelling. It is the measurement.

Criticality of benchmarks

Benchmarks in AI are often presented as neutral scorecards. In reality, they shape the direction of progress. When benchmarks reward performance on clean, standardised, single-language speech, they encourage systems to optimise for exactly that. Everything outside this frame becomes secondary.

This creates a loop that is easy to overlook. Systems improve against the benchmark. The benchmark reflects a narrow slice of reality. Real-world usage remains underserved, but also unmeasured. The gap persists, even as the scores rise.

The stakes are rising quickly. India’s speech and voice recognition market is projected to grow at over 15 percent annually through the decade, driven by applications in customer service, digital payments, education, and governance. Investment is flowing into voice-led interfaces designed to reach first-time users. Yet performance in real-world Indian conditions continues to lag behind headline claims. Error rates increase sharply with accent variation, background noise, and mixed-language input, the very conditions that define everyday communication.

The economic cost of this is diffuse but significant. A failed voice command in a smart assistant is an inconvenience. A failed interaction in a banking or healthcare context is something else. It is exclusion, experienced as a breakdown.

This is where the conversation moves beyond performance and into sovereignty.

Sovereignty in speech systems

Sovereignty in speech systems is not about isolation or control. It is about the ability to define what good performance actually means. Today, most benchmarks that shape AI development are designed around conditions that do not reflect Indian speech. As long as success continues to be measured against these external standards, systems will remain optimised for the wrong reality.

A sovereign approach would begin by reclaiming measurement itself. Benchmarks would need to reflect how Indians actually speak, across accents, dialects, and social contexts. Code-switching would be treated as a core feature of language, not an exception. Evaluation would move beyond a single accuracy score and instead reveal where systems succeed and where they fail.
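To make the last point concrete, here is a minimal sketch of what reporting by condition, rather than as one headline number, could look like. It uses word error rate (word-level edit distance divided by the number of reference words), a common speech recognition metric. The function names, the condition labels, and the three sample utterances are all invented for illustration and are not drawn from any real benchmark or system.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, kept to one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_condition(samples):
    """Aggregate word error rate separately for each condition label."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for condition, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[condition] += e
        words[condition] += n
    return {c: errors[c] / words[c] for c in errors if words[c] > 0}


# Hypothetical (condition, what was said, what the system heard) triples.
samples = [
    ("monolingual_clean", "please check my account balance",
     "please check my account balance"),
    ("code_switched", "mera account balance check karo",
     "mera account balance check"),
    ("code_switched", "payment fail ho gaya hai",
     "payment fell ho gaya"),
]

if __name__ == "__main__":
    for condition, wer in sorted(wer_by_condition(samples).items()):
        print(f"{condition}: {wer:.2f}")
```

On this toy data, pooling everything gives 3 errors in 15 words, a 20 percent error rate that hides the fact that every error falls on the mixed-language utterances. The same breakdown could be extended to accent, noise level, or region, provided the test set is labelled that way.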

The gap is not theoretical. A farmer in rural Maharashtra calling a government helpline to report crop damage may describe the symptoms in Marathi, only to receive advice mapped to a standard category that does not quite match what he is seeing. A small business owner in Surat reporting a suspicious transaction through a bank’s voice system may explain the urgency in a mix of Gujarati and Hindi; the system registers keywords but not intent, and guides him through routine steps rather than escalation. By the time the interaction reaches a human agent, the outcome is already determined.

This is not about representation for its own sake. It is about capability that holds in the real world.

Systems trained and tested against realistic conditions are more resilient, more adaptive, and more useful. In a country where digital infrastructure is expanding rapidly and AI is being embedded into essential services, this distinction matters.

There is a tendency in technology to assume that scale will smooth out complexity. In India, scale amplifies it.

If AI is to become a reliable interface for everyday life, it will need to learn to listen differently. Not just more, but better. Not just to language as it is standardised, but to language as it is lived.

The sovereignty of Indian speech lies in this shift. Not in building separate systems for the sake of it, but in ensuring that the systems being built are grounded in the reality they are meant to serve. Only then does listening become understanding.

Supriya Paul is Co-founder, Josh Talks. Views are personal and do not represent the stand of this publication.
