India is a challenging market for tech companies and startups looking to make the web more accessible to Indians. Dialects change every few hundred kilometres, and people are more comfortable conversing than writing. This has led to a dearth of natural language corpora needed to build large AI language models that understand these language and dialect nuances.
Google, however, believes it may be a step closer to solving this through its collaboration with the Indian Institute of Science (IISc) and ARTPARK (Artificial Intelligence & Robotics Technology Park) on an initiative called Project Vaani, which was launched in December last year.
On June 28, the internet giant announced that IISc is open-sourcing the first set of speech data comprising over 4,000 hours across 38 languages to developers, with more data sets expected to be added in the future. The announcement was made at the company’s developer event held in Bengaluru.
The initiative aims to collect and transcribe anonymised, open-source speech data from all of India's 773 districts in three phases, while ensuring linguistic, educational, urban-rural, age, and gender diversity. The first phase focuses on 80 districts across 10 states.
This move is expected to boost the development of technologies such as automatic speech recognition (ASR), speech-to-speech translation (SST), and natural language understanding (NLU) and enable developers to leverage this speech data to build language solutions for the Indian market.
In an interview with Moneycontrol, Manish Gupta, Director at Google Research India, said that they consciously followed a region-based approach rather than a language-based approach, wherein the team went to each region and asked people to describe, in their own languages, the images shown to them.
"In this process, we have seen the first-ever digital samples being collected for many of these rare languages," Gupta said.
"We also realise there are also a lot of challenges, when you're doing an operation at scale. How do you ensure good quality control? How do you ensure there is appropriate metadata that you're capturing along with this? It has led to some good learnings for us that we are going to apply as we expand this programme to the next set of districts" he added.
The initiative aims to eventually create a corpus of over 150,000 hours of speech. In due course, the speech data is also expected to be made available through Bhashini, an AI-led language translation platform under the Ministry of Electronics and Information Technology (MeitY).
Google has previously announced its ambition to build a single, unified AI model that can handle over 100 Indian languages across speech and text. The company believes that this will pave the way for more inclusive experiences for many Indians.
New AI developer tools
Google is also making PaLM 2, the company's next-generation large language model (LLM) with improved multilingual, reasoning, and code-generation capabilities, available to developers in India through the PaLM API and MakerSuite, a tool that lets developers start prototyping quickly and easily. PaLM 2 currently powers the company's AI chatbot, Bard.
The PaLM API is also integrated into Google's developer tools such as Firebase and Colab, while Google Cloud customers will be able to use it in Vertex AI, the company's machine learning platform. This development comes at a time when startups and tech companies are rushing to incorporate generative AI capabilities into their products.
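For developers curious what getting started looks like, a request might resemble the minimal sketch below. It assumes the google-generativeai Python client and a text-bison PaLM model; the exact model names, parameters, and availability in India are assumptions and may differ by release.

```python
# A minimal sketch, assuming the google-generativeai Python client and a
# text-bison PaLM model; model name, parameters, and availability are
# assumptions and may differ by region and release.
import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key

# Ask the model to generate text from a prompt.
completion = palm.generate_text(
    model="models/text-bison-001",   # assumed PaLM 2 text model name
    prompt="Translate 'Welcome to Bengaluru' into Hindi and Kannada.",
    temperature=0.2,
    max_output_tokens=128,
)

print(completion.result)  # generated text, or None if the request was blocked
```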
The tech giant is also open-sourcing its Open Buildings dataset in India, which covers over 200 million buildings across the country. The dataset was built using deep learning models that detect buildings in satellite imagery, and it includes a unique plus code for every building.
Gupta said they hope this will enable developers to build tools that power more efficient urban planning, distribution, and humanitarian crisis response efforts.
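As an illustration of how developers might explore the Open Buildings data, the sketch below loads one CSV shard and filters building footprints by detection confidence. The file path and the example plus code are hypothetical, and the column names (latitude, longitude, area_in_meters, confidence, full_plus_code) are assumed from the dataset's public description.

```python
# A minimal sketch for exploring an Open Buildings CSV shard with pandas.
# The file path is hypothetical; column names are assumed from the dataset's
# public description and may differ between releases.
import pandas as pd

# Each shard covers one region and lists detected building footprints.
buildings = pd.read_csv("open_buildings_shard.csv.gz")

# Keep only high-confidence detections (threshold chosen for illustration).
confident = buildings[buildings["confidence"] >= 0.75]

# Summarise footprint areas and look up a building by its plus code.
print(confident["area_in_meters"].describe())
print(confident[confident["full_plus_code"] == "7J4VXJGG+XX"])  # example code
```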
Apart from this, the company said it will soon roll out a "Trusted Tester" programme for developers to access its healthcare AI model APIs in private preview. The capability, first unveiled at the Google for India event in 2022, identifies medicine names within handwritten prescriptions.
Google is also open-sourcing the SeeGull dataset, which contains stereotypes about identity groups spanning 178 countries, as well as state-level identities within India and the United States, to help evaluate and mitigate biases in natural language processing (NLP). This builds on the company's efforts under Project Bindi (Bias Interventions for NLP and Data in the Indian context), which was introduced last year.
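As a rough sketch of how developers might examine such a resource, the snippet below loads a stereotype file and counts entries per identity group. The file name and column names (identity, attribute) are hypothetical placeholders, not the dataset's actual schema.

```python
# A rough sketch for exploring a stereotype dataset such as SeeGull.
# The file name and column names used here are hypothetical placeholders;
# consult the released dataset for its actual schema.
import pandas as pd

stereotypes = pd.read_csv("seegull_stereotypes.csv")

# Count recorded stereotype tuples per identity group to gauge coverage.
counts = stereotypes.groupby("identity")["attribute"].count()
print(counts.sort_values(ascending=False).head(10))
```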