Indic LLMs need a lot of datasets to train models, says IIT Madras professor

Speech-based language tools are also needed because typed input in the country is error-prone and very English-intensive, said Mitesh Khapra, Associate Professor of Computer Science at IIT-Madras

December 12, 2023 / 15:40 IST
Khapra explained that it is also very cost-intensive to use foreign LLMs for translations. File photo

While the country needs India-specific large language models (LLMs), these "Indic LLMs", as experts term them, need large datasets for training, Mitesh Khapra, Associate Professor of Computer Science at IIT-Madras and a key player in the government's Bhashini initiative, said on December 12.

Khapra, who was speaking on the inaugural day of the Global Partnership on Artificial Intelligence Summit being held in New Delhi, defined 'Indic LLMs' as those which can be used for Indian use cases and also for India-specific contexts.

"For example, GPT4 or other models would not know what Kanda Poha (a Marathi marriage ceremony) is, because it is unlikely there is a Wikipedia page or blogpost in English on the subject. But on the same subject, there will be documentation in vernacular languages," Khapra said.