India needs its own large language models (LLMs), but these "Indic LLMs", as experts call them, require large datasets for training, Mitesh Khapra, Associate Professor of Computer Science at IIT-Madras and a key player in the government's Bhashini initiative, said on December 12.
Khapra, who was speaking on the inaugural day of the Global Partnership on Artificial Intelligence Summit being held in New Delhi, defined 'Indic LLMs' as those which can be used for Indian use cases and also for India-specific contexts.
"For example, GPT4 or other models would not know what Kanda Poha (a Marathi marriage ceremony) is, because it is unlikely there is a Wikipedia page or blogpost in English on the subject. But on the same subject, there will be documentation in vernacular languages," Khapra said.
He added that as part of the Bhashini initiative, a project which leverages AI to translate languages via text and voice, there is a huge plan to collect data.
"Such a knowledge repository for India doesn't exist in the digital form. Imagine every library in India getting digitised. All India Radio broadcasts three bulletins per day. Can you upload these broadcasts for the last 20 years and upload them in Indian languages?" he asked.
Second, speech-based language technology is also needed because typed input in the country is error-prone and heavily reliant on English.
Third, Khapra explained that using foreign LLMs for translation is also very cost-intensive.
"Today, one problem that is there in GPT4 or Llama 2 is that, suppose you ask a question in Hindi, and it gives a very good answer. But the technical detail is that, while for you and me the answer is 100 words or tokens, in GPT4's interpretation it is close to 1000 tokens. That is because GPT4 treats every Indian character as one token," he added.
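The inflation Khapra describes can be roughly illustrated without any particular model. The sketch below is not GPT4's or Llama 2's actual tokenizer; it only assumes a byte-level tokenizer that has learned few merges for Devanagari, in which case the UTF-8 byte count is an approximate upper bound on the token count. The sample sentences and the `text_stats` helper are illustrative assumptions, not taken from the panel.

```python
def text_stats(text: str) -> dict:
    """Return word, code-point and UTF-8 byte counts for a string."""
    return {
        "words": len(text.split()),
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Hypothetical sample sentences chosen for illustration only.
english = "What is the weather today?"
hindi = "आज मौसम कैसा है?"

en = text_stats(english)
hi = text_stats(hindi)

# Bytes per word is a crude proxy for how many tokens a byte-level
# tokenizer could need when it has no learned merges for the script.
for label, stats in (("English", en), ("Hindi", hi)):
    per_word = stats["utf8_bytes"] / stats["words"]
    print(f"{label}: {stats} -> {per_word:.1f} bytes per word")
```

Each Devanagari code point takes three bytes in UTF-8, while basic Latin letters take one, so a tokenizer trained mostly on English text can end up spending several tokens on what a Hindi reader sees as a single word. Since most hosted LLMs bill by the token, this feeds directly into the cost concern raised above.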
Khapra was speaking at a panel on "Building Scalable Large Language Models (LLMs)". Other panellists included Mohit Bansal, a John R & Louis S Parker Professor at the University of North Carolina; Rohini Srivastha, Chief Technology Officer (India and South Asia), Microsoft; Manohar Paluri, Senior Director, Meta; and Americo Carvalho, Head AI and ML, Amazon Web Services, among others.
When asked about his views on creating sovereign AI capabilities, Carvalho said, "Reuse whatever has been created. Making sure that we can operate in the cloud is inherently something that helps with a lot of considerations."