Indic LLMs need a lot of datasets to train models, says IIT Madras professor

Speech-based language technology is also needed because typed input in the country is error-prone and very English-intensive, said Mitesh Khapra, Associate Professor of Computer Science at IIT-Madras

Khapra explained that it is also very cost-intensive to use foreign LLMs for translations. File photo

While the country needs India-specific large language models (LLMs), these "Indic LLMs", as experts term them, need large datasets for training, Mitesh Khapra, Associate Professor of Computer Science at IIT-Madras and a key player in the government's Bhashini initiative, said on December 12.

Khapra, who was speaking on the inaugural day of the Global Partnership on Artificial Intelligence Summit being held in New Delhi, defined 'Indic LLMs' as those which can be used for Indian use cases and also for India-specific contexts.

"For example, GPT4 or other models would not know what Kanda Poha (a Marathi marriage ceremony) is, because it is unlikely there is a Wikipedia page or blogpost in English on the subject. But on the same subject, there will be documentation in vernacular languages," Khapra said.

He added that as part of the Bhashini initiative, a project which leverages AI to translate languages via text and voice, there is a huge plan to collect data.
"Such a knowledge repository for India doesn't exist in the digital form. Imagine every library in India getting digitised. All India Radio broadcasts three bulletins per day. Can you upload these broadcasts for the last 20 years and upload them in Indian languages?" he asked.

Second, speech-based language technology is also needed because typed input in the country is error-prone and very English-intensive.

"Today, one problem that is there in GPT4 or Llama 2 is that, suppose you ask a question in Hindi, and it gives a very good answer. But the technical detail is that, while for you and me the answer is 100 words or tokens, in GPT4's interpretation it is close to 1000 tokens. That is because GPT4 treats every Indian character as one token," he added.
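The token inflation Khapra describes can be illustrated with a rough sketch. It does not use the actual GPT4 or Llama 2 tokenizer; it assumes a worst-case fallback in which every character (or every UTF-8 byte) becomes its own token, and the sample sentences and the helper names `text_units` and `inflation` are hypothetical, chosen only for illustration.

```python
def text_units(text: str) -> dict:
    """Count words, Unicode code points, and UTF-8 bytes (whitespace excluded)."""
    words = text.split()
    joined = "".join(words)  # drop whitespace so only script characters are counted
    return {
        "words": len(words),
        "code_points": len(joined),
        "utf8_bytes": len(joined.encode("utf-8")),
    }


def inflation(text: str, unit: str = "code_points") -> float:
    """Units per word under the ASSUMED one-token-per-unit fallback."""
    stats = text_units(text)
    return stats[unit] / stats["words"]


# Hypothetical sample sentences with roughly the same meaning.
hindi = "भारत एक विशाल देश है"
english = "India is a vast country"

for label, sample in (("Hindi", hindi), ("English", english)):
    stats = text_units(sample)
    print(
        f"{label}: {stats['words']} words, "
        f"{stats['code_points']} code points, "
        f"{stats['utf8_bytes']} UTF-8 bytes, "
        f"{inflation(sample, 'utf8_bytes'):.1f} bytes per word"
    )
```

Devanagari characters take three bytes each in UTF-8, so a tokenizer that falls back to characters or bytes for Hindi produces several times more tokens per word than one that has learned whole-word or subword units, as tokenizers trained mostly on English text tend to have for English. Actual counts depend on the specific tokenizer and its vocabulary.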

Khapra was speaking at a panel on "Building Scalable Large Language Models (LLMs)". Other panellists included Mohit Bansal, a John R & Louis S Parker Professor at the University of North Carolina; Rohini Srivastha, Chief Technology Officer (India and South Asia), Microsoft; Manohar Paluri, Senior Director, Meta; and Americo Carvalho, Head AI and ML, Amazon Web Services, among others.

When asked about his views on creating sovereign AI capabilities, Carvalho said, "Reuse whatever has been created. Making sure that we can operate in the cloud, is inherently something that helps with a lot of considerations."

Copyright © Network18 Media & Investments Limited. All rights reserved. Reproduction of news articles, photos, videos or any other content in whole or in part in any form or medium without express written permission of moneycontrol.com is prohibited.
