
The Indian language riddle: How global tech firms are using machine learning to translate the vernacular

With the evolution of the digital ecosystem in the country, the adoption of Indic-language scripts has acquired greater importance, both as a matter of convenience for customers as well as a means to expand reach and scope of business

October 13, 2018 / 06:48 PM IST
Languages (Source: Wikimedia Commons)

About two years ago, Subrata Bhowmick landed in Kolkata and hailed a cab through a popular taxi-hailing app. A couple of days later, he got a promotional message from the company in Bengali as it is spoken, except that it was written in the English script.

This, in a lot of ways, sums up the language problem in India, a country of 22 major languages, 13 different scripts and over 720 dialects.

Popular Bollywood director Imtiaz Ali said in a 2013 interview: “What fascinates me about India, or rather northern India, is that every 20 kilometres, the language, dialect, music, food, clothes... everything changes.”

While Ali is a person of the creative arts, engineers have faced the same conundrum while developing local language capabilities for their technology offerings, tailor-made for the Indian market.

With the evolution of the digital ecosystem in the country, the adoption of Indic-language scripts has acquired greater importance, both as a matter of convenience for customers as well as a means of expanding the reach and scope for businesses. So far, several companies have done what the cab aggregator mentioned above did: write the native language in the English script.


Some years ago, the government embarked on a research project under the then Department of Electronics and Information Technology, called the Technology Development for Indian Languages (TDIL).

Over the years, interdisciplinary research resulted in the fragmentation of the extant branches of study, creating  “a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty”.

Research in the field of computational linguistics has gone a long way in developing a standard for Indian languages. In 2014, the then information technology secretary Ram Sewak Sharma was working on formalising a platform called e-Bhasha to make online content available for Indian citizens in their own language. The platform was launched as a mission mode project (MMP) in 2015.

Other initiatives were subsequently undertaken to make digital interactions in local languages a possibility for most Indians. One among these was a dedicated global group led by a language technology expert who would help Indians access web addresses in their local scripts.

The work on this project is nearing completion, and it is one part of the larger technology effort around the complex problem of Indian languages.
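
To give a rough sense of what a web address in a local script involves under the hood, the sketch below converts a Devanagari domain name into the ASCII form that the domain name system can carry, and back again. It is only an illustration and not the method used by the group mentioned above: it relies on Python's built-in `punycode` codec and the standard `xn--` prefix, the domain `उदाहरण.भारत` and the helper names are assumptions made up for this example, and it skips the normalisation and validation steps that a full internationalised domain name implementation performs.

```python
ACE_PREFIX = "xn--"  # standard prefix marking an ASCII-encoded internationalised label


def to_ascii_label(label: str) -> str:
    """Encode one dot-separated label; plain ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_unicode_label(label: str) -> str:
    """Decode one label back to its native script if it carries the prefix."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def to_ascii_domain(domain: str) -> str:
    # Hypothetical helper: encodes each label separately, with no validation.
    return ".".join(to_ascii_label(part) for part in domain.split("."))


def to_unicode_domain(domain: str) -> str:
    return ".".join(to_unicode_label(part) for part in domain.split("."))


if __name__ == "__main__":
    native = "उदाहरण.भारत"  # illustrative address, not taken from the article
    ascii_form = to_ascii_domain(native)
    print(ascii_form)
    print(to_unicode_domain(ascii_form))
```

The point of the round trip is that the user sees and types the address in their own script, while the network underneath continues to work with plain ASCII.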

Global developments

In 2016, the annual Internet Trends report by venture capitalist and technology analyst Mary Meeker said that if India was excluded, the global Internet growth rate dipped to 7 percent for the year, marking the country as one with high growth potential.

Last year, her report said the number of internet users in India grew more than 28 percent in 2016, in spite of the fact that only one in four Indians has access to the Internet. In 2017, internet penetration in India stood at 27 percent.

A study commissioned last year by the Universal Acceptance Steering Group, which is working to ensure all web addresses in different languages are compatible with each other, found a potential $9.8 billion growth opportunity in online revenue.

The study, carried out by Analysys Mason, looked at Russian, Chinese, Arabic, Vietnamese and Indic language groups, and said that online spending from these users could start at $6.2 billion per year.

This strengthens the case for building technology and systems in the native language of non-English speakers. In 2016, Google Translate made the news for translating Japanese with uncanny precision. Google's translation service, which uses machine learning, managed to recognise and automatically translate complex characters in the Japanese script.

As this piece in the New York Times Magazine explained, Google has expedited its quest to fine-tune its algorithms so that they attain something like human flexibility in learning from real-world situations. Google Brain, a new department in the Alphabet stable, has been developing artificial neural networks that acquaint themselves with the world via trial and error, much in the same way as toddlers pick up the rudiments of life, such as language.

The crack team that works at creating a seamless experience for users of Google's Assistant and Translate products leverages artificial intelligence and machine learning to gradually refine the underlying programs, so that they pick up individual quirks and idiosyncrasies such as the accent and intonation of speakers and deliver the best possible results in minimal time.

Essentially, Google taught machines to learn language like humans. And that was a breakthrough for the most consumer-facing translation product in the world: in 2016, it was serving 500 million monthly users and translating 140 billion words per day.

Other big tech companies, too, were intensifying efforts to build local language capability, with Microsoft, Amazon and Facebook among the major players with fingers in the pie.

The turning point

It was around 2013-14 that e-commerce became the rage in urban India. From books to clothes, shoes to perfumes, everything was now available online. Even grocery shopping could be conducted at the touch of a button, marking an inflection point in the way Indians procured their basic necessities.

However, this boom catered to a largely English-speaking population. Major e-commerce firms were quick to gauge the scope of the Indian market. The Indian hinterland presented a largely untapped market for e-tailers, who soon realised that the next wave of growth for them would come from the non-English speaking audience in tier-2 and tier-3 cities.

Snapdeal dabbled in some local language content in 2015, when it introduced Hindi and Telugu translations on its mobile website.

The same year in January, Google introduced lane guidance in Google Maps in Hindi. By August, it intensified efforts to get people to create more content in Hindi and also advertise more aggressively through its AdSense platform.

“But why does Hindi matter? If you have a large user base in India; or you’re looking to grow in this strategic emerging market; catering your content to Hindi speakers is key,” it said in a blogpost at the time.

Last year, Bangalore-based Reverie Language Technologies, which provides localization services for Indic languages and has an Indic keyboard app called Swalekh, released a report about local language content usage perspective in India.

It found that Hindi, Marathi, Gujarati, Telugu and Bengali represent over 75 percent of all Indian language words typed in by users between January and June 2017.

A recent Google study also showed a 61 percent growth in Indian language e-commerce searches in 2017. The same report also said that 45 percent of shopping queries came from non-metro cities compared to 36 percent in 2016. The highest search volume came from Lucknow, Jaipur and Indore.

In August this year, the Internet and Mobile Association of India (IAMAI) and Kantar IMRB said in a joint report that a potential 205 million internet non-users are likely to go digital if essential services on the internet are provided in a language of their choice.

The business imperative is thus becoming clear: the English-speaking market is saturated, and the next wave of growth and revenue for online businesses will come from small towns where the native language, as Imtiaz Ali said, changes every 20 kilometres.

Why is it still work in progress?

One reason language technology is still very much an evolving field in India is the lack of a unifying platform for all stakeholders to set a common agenda.

Google Translate, perhaps the most commonly used resource for translation and transliteration on the Internet, “learnt” different languages as far back as 2010 from transcripts of United Nations and European Parliament meetings, which were translated by humans into six languages.

India, unfortunately, does not have similar texts that can be used to train machines to learn Indian languages. Understandably, creating these even now would require large investments of time and money, thereby dulling the profit motive for private players.

“Machine translation algorithms are often trained using parallel corpora created by human translators in order to achieve high-quality output. If such parallel corpora in Indian language pairs (Tamil to Marathi, for example) are made available by the government under open source licenses, it could spawn the growth of translation technology companies,” said Venkatesh Hariharan, a former Google India Policy head and currently Director - FinTech at the software think tank iSPIRT.

Several small firms in India have language capabilities in different areas. These include translation services company Process9, Reverie Language Technologies, and IndusOs, an Android-based operating system that uses Indian languages.

There are also independent, volunteer-led initiatives like the non-profit Indic Project, which “creates Indian language applications and solutions for the majority of India that does not speak English,” and supports all 22 Indian languages.

“The government has historically thought it will build language technology on their own, but most of it has not been available to most people… While the government has been looking to build language technology, it has never considered civil society or language community as stakeholders,” said Anivar Aravind, executive director of the Indic Project.

Another issue is that the monetisation of non-English content online has still not happened.

“While the progress in languages is welcome, monetisation from language sites has to be in par with English. I must say monetisation from Indic has come a long way. For language to become monetisable, creatives also need to be in the local language. Many local language sites still have ads in English,” said BG Mahesh, founder and managing director at, a popular Indian language portal.

When Mahesh set up the portal, in the late 1990s and early 2000s, the focus was a lot more on non-resident Indians, who accounted for 80-85 percent of its traffic.

With time, the Internet has become accessible and affordable for most Indians, literacy has improved and portals like his now see users from India as well.

What still needs to be addressed is an end-to-end flow in the customer's local language: right from searching for a product, through navigating options, to the entire payment cycle.

In August this year, at its annual event for India, Google announced a slew of initiatives and product upgrades targeted specifically at the different local languages spoken in India. A few days later, Amazon launched its e-commerce platform in Hindi.

“Tech giants patronising Indian languages only validates how huge the Indian language space is. They have seen the traction that Hindi evolved for them. After Hindi they are focusing on all other Indian languages,” said Mahesh.

Facebook has allowed users to post in 12 Indian languages, beginning with eight languages as far back as 2012. The social network does not have a specific product around regional languages like Google or Amazon do, but allows users across its offerings, such as stories and videos, to use the 12 Indian languages.

This is mainly content generation, as opposed to driving commerce, though it does have a Marketplace that allows buyers and sellers to connect directly through the social network.

Challenges ahead

Working and developing in Indian languages is not as simple as mere translation.

“The important thing is to think of not just the technology part, but also the experience part of it,” said Pani, co-founder and chief executive of Reverie Language Technologies.

As an example, Pani said, the image of a shopping cart on an e-commerce website, which an urban customer understands well, may not make much sense to a customer in a tier-2 or tier-3 market, because they haven't seen a shopping cart the way the English-speaking market recognises it.

Similarly, the microphone icon on most websites does not mean much to that market and they often confuse it for a "shivling," a common symbol of Lord Shiva in Hindu mythology.

“On e-commerce websites, email registration is a barrier. Cash-on-delivery not being the default mode also becomes a barrier to the growth of e-commerce in tier-2 and tier-3 markets.  In our experience, 80 percent customers drop out of the shopping cart for such reasons,” added Pani.

Big tech firms entering the Indian language space is certainly a positive, but they will have to address the challenge of localising content for different geographies, and maybe even work alongside some of these smaller but more experienced players who understand the market and its challenges better.

Neha Alawadhi