India is a challenging market for tech companies and startups looking to make the web more accessible to Indians. Dialects change every few hundred kilometres, and people are more comfortable conversing than writing. This has led to a dearth of natural language corpora needed to build large AI language models that understand these language and dialect nuances.
Google, however, believes it may be a step closer to solving this through its collaboration with the Indian Institute of Science (IISc) and ARTPARK (Artificial Intelligence & Robotics Technology Park) on an initiative called Project Vaani, which was launched in December last year.
On June 28, the internet giant announced that IISc is open-sourcing the first set of speech data comprising over 4,000 hours across 38 languages to developers, with more data sets expected to be added in the future. The announcement was made at the company’s developer event held in Bengaluru.
The initiative aims to collect and transcribe anonymised, open-source speech data from all of India's 773 districts in three phases, while ensuring linguistic, educational, urban-rural, age, and gender diversity. The first phase focuses on 80 districts across 10 states.
This move is expected to boost the development of technologies such as automatic speech recognition (ASR), speech-to-speech translation (SST), and natural language understanding (NLU) and enable developers to leverage this speech data to build language solutions for the Indian market.
In an interview with Moneycontrol, Manish Gupta, Director at Google Research India, said that they consciously followed a region-based approach rather than a language-based approach, wherein the team went to each region and asked people to describe, in their own languages, the images shown to them.
"In this process, we have seen the first-ever digital samples being collected for many of these rare languages," Gupta said.
"We also realise there are also a lot of challenges, when you're doing an operation at scale. How do you ensure good quality control? How do you ensure there is appropriate metadata that you're capturing along with this? It has led to some good learnings for us that we are going to apply as we expand this programme to the next set of districts" he added.
The initiative aims to eventually create a corpus of over 150,000 hours of speech. In due course, the speech data is also expected to be made available through Bhashini, an AI-led language translation platform under the Ministry of Electronics and Information Technology (MeitY).
Google has previously announced its ambition to build a single, unified AI model that can handle over 100 Indian languages across speech and text. The company believes that this will pave the way for more inclusive experiences for many Indians.
New AI developer tools
Google is also making PaLM 2, the company's next-generation large language model (LLM) with improved multilingual, reasoning, and code-generation capabilities, available to developers in India through the PaLM API and MakerSuite, a tool that lets developers start prototyping quickly and easily. PaLM 2 currently powers the company's AI chatbot, Bard.
The PaLM API is also integrated into Google's developer tools such as Firebase and Colab, while Google Cloud customers will be able to use it in Vertex AI, the company's machine learning platform. This development comes at a time when startups and tech companies are rushing to incorporate generative AI capabilities into their products.
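For developers curious what getting started looks like, a request might resemble the minimal sketch below. It assumes the google-generativeai Python client and a text-bison PaLM model; the exact model names, parameters, and availability in India are assumptions and may differ by release.

```python
# A minimal sketch, assuming the google-generativeai Python client and a
# text-bison PaLM model; model name, parameters, and availability are
# assumptions and may differ by region and release.
import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key

# Ask the model to generate text from a prompt.
completion = palm.generate_text(
    model="models/text-bison-001",   # assumed PaLM 2 text model name
    prompt="Translate 'Welcome to Bengaluru' into Hindi and Kannada.",
    temperature=0.2,
    max_output_tokens=128,
)

print(completion.result)  # generated text, or None if the request was blocked
```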
The tech giant is also open-sourcing its Open Buildings dataset in India, which covers over 200 million buildings across the country. The dataset was built using deep learning models that detect buildings in satellite imagery, and it includes a unique plus code for every building.
Gupta said they hope this will enable developers to build tools that power more efficient urban planning, distribution, and humanitarian crisis response efforts.
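As an illustration of how developers might explore the Open Buildings data, the sketch below loads one CSV shard and filters building footprints by detection confidence. The file path and the example plus code are hypothetical, and the column names (latitude, longitude, area_in_meters, confidence, full_plus_code) are assumed from the dataset's public description.

```python
# A minimal sketch for exploring an Open Buildings CSV shard with pandas.
# The file path is hypothetical; column names are assumed from the dataset's
# public description and may differ between releases.
import pandas as pd

# Each shard covers one region and lists detected building footprints.
buildings = pd.read_csv("open_buildings_shard.csv.gz")

# Keep only high-confidence detections (threshold chosen for illustration).
confident = buildings[buildings["confidence"] >= 0.75]

# Summarise footprint areas and look up a building by its plus code.
print(confident["area_in_meters"].describe())
print(confident[confident["full_plus_code"] == "7J4VXJGG+XX"])  # example code
```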
Apart from this, the company said it will soon roll out a "Trusted Tester" programme for developers to access its healthcare AI model APIs in private preview. The capability, first unveiled at the Google for India event in 2022, identifies medicine names within handwritten prescriptions.
Google is also open-sourcing the SeeGull dataset, which contains stereotypes about identity groups spanning 178 countries, as well as state-level identities within India and the United States, to help evaluate and mitigate biases in natural language processing (NLP). This builds on the company's efforts under Project Bindi (Bias Interventions for NLP and Data in the Indian context), which was introduced last year.
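As a rough sketch of how developers might examine such a resource, the snippet below loads a stereotype file and counts entries per identity group. The file name and column names (identity, attribute) are hypothetical placeholders, not the dataset's actual schema.

```python
# A rough sketch for exploring a stereotype dataset such as SeeGull.
# The file name and column names used here are hypothetical placeholders;
# consult the released dataset for its actual schema.
import pandas as pd

stereotypes = pd.read_csv("seegull_stereotypes.csv")

# Count recorded stereotype tuples per identity group to gauge coverage.
counts = stereotypes.groupby("identity")["attribute"].count()
print(counts.sort_values(ascending=False).head(10))
```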