OpenAI CEO Sam Altman, during his visit to the country earlier this year, stirred the proverbial pot by seemingly dismissing the prospect of a groundbreaking Large Language Model (LLM) emerging from India.
As the world watches the high-stakes LLM arena heat up, one question burns bright: where does India stand in this digital marathon? While questions linger, recent developments paint a different picture. Indian companies are quietly building LLMs, each with ambitious goals and unique strengths.
Let's take a look at who's building what...
Krutrim
Ola co-founder Bhavish Aggarwal has thrown his hat in the Indian LLM ring with Krutrim Si Designs, his new AI venture. The company's debut offering is a family of multilingual AI models named Krutrim, which means artificial in Sanskrit.
Krutrim comes in two sizes: the base model, trained on 2 trillion tokens and unique datasets, and the larger multimodal Krutrim Pro, slated for release next quarter and designed for advanced problem-solving and task execution.
In the world of LLMs, tokens are the essential building blocks for processing and representing text data. Think of them as the bricks that LLMs use to understand and generate human-like language.
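To make the idea of tokens concrete, here is a minimal sketch in Python. It assumes a toy scheme that splits text on words and punctuation; production LLMs typically use subword tokenizers instead, and the function names below are illustrative, not taken from any of the models in this article.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (toy scheme, not a real LLM tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn text into the list of token IDs a model would actually consume."""
    return [vocab[tok] for tok in tokenize(text)]

sentence = "Krutrim means artificial in Sanskrit."
tokens = tokenize(sentence)
vocab = build_vocab(tokens)
ids = encode(sentence, vocab)

print(tokens)  # the six pieces the sentence is split into
print(ids)     # the integer IDs for those pieces
```

A training corpus measured in "trillions of tokens" is simply a very long sequence of such IDs.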
Beyond its impressive size, Krutrim stands out for its multilingual capabilities. It can understand 22 Indian languages and can generate outputs in 10, including Marathi, Hindi, Telugu, Kannada, and Odia. The company even claims its AI models outperform OpenAI's GPT-4 in handling Indic languages.
Sarvam AI’s OpenHathi
AI startup Sarvam AI made waves earlier this month with the release of OpenHathi-Hi-v0.1, the first Hindi LLM in the OpenHathi series. Built on Meta AI's Llama2-7B architecture, OpenHathi-Hi-v0.1 promises performance on par with GPT-3.5 for Indic languages.
Sarvam AI's unique approach involves a 48,000-token extension to Llama2-7B's tokenizer and a two-phase training process. The first phase involves embedding alignment, which aligns randomly initialised Hindi embeddings. The second phase is bilingual language modelling, where the model is trained to attend cross-lingually across tokens.
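The tokenizer extension and the embedding-alignment phase can be pictured with a small Python sketch. This is an illustration under stated assumptions, not Sarvam AI's code: the helper names, the tiny 3-dimensional vectors, and the choice to update only the newly added rows are all hypothetical simplifications.

```python
import random

def extend_vocab(vocab, embeddings, new_tokens, dim, seed=0):
    """Add new tokens to a vocabulary, giving each a randomly initialised embedding row."""
    rng = random.Random(seed)
    vocab = dict(vocab)                          # copy so the base vocabulary is untouched
    embeddings = [row[:] for row in embeddings]  # copy existing rows
    new_ids = []
    for tok in new_tokens:
        if tok in vocab:
            continue                             # token already known, nothing to add
        vocab[tok] = len(embeddings)
        new_ids.append(vocab[tok])
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings, new_ids

def apply_update(embeddings, grads, trainable_ids, lr=0.1):
    """One gradient step that touches only the rows listed in trainable_ids."""
    for i in trainable_ids:
        embeddings[i] = [w - lr * g for w, g in zip(embeddings[i], grads[i])]

# Hypothetical base model: two English tokens with 3-dimensional embeddings.
base_vocab = {"hello": 0, "world": 1}
base_emb = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Extend the vocabulary with Hindi tokens (the real extension is far larger).
vocab, emb, new_ids = extend_vocab(base_vocab, base_emb, ["नमस्ते", "दुनिया"], dim=3)

# Assumed alignment step: placeholder gradients, applied to the new rows only.
grads = [[1.0, 1.0, 1.0] for _ in emb]
before = [row[:] for row in emb]
apply_update(emb, grads, new_ids)
```

In this sketch the original rows stay fixed while the new Hindi rows move, which is one plausible reading of "aligning" freshly initialised embeddings with an already trained model.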
The five-month-old startup founded by Vivek Raghavan and Pratyush Kumar evaluated OpenHathi's performance beyond standard Natural Language Generation (NLG) tasks, ensuring real-world applicability.
The startup also partnered with KissanAI to fine-tune its base model using conversational data they gathered. This dataset comprises conversations from a GPT-powered bot engaging with farmers in different languages.
BharatGPT by CoRover.ai
Earlier this month, Bengaluru-based CoRover.ai unveiled its indigenous LLM named BharatGPT. Built in collaboration with Bhashini, the National Language Translation Mission (NLTM) under the Ministry of Electronics and Information Technology (MeitY), BharatGPT supports over 12 Indian languages.
CoRover provides AI virtual assistants, including chatbots, voicebots, and videobots, to numerous organisations such as IRCTC, LIC, IGL, KSRTC, NPCI, and the Government of India, boasting a user base exceeding 1 billion.
CoRover claims BharatGPT empowers developers and business users to create text and voice-enabled multilingual Virtual Assistants in mere seconds.
Zoho
Indian software-as-a-service (SaaS) giant Zoho announced in June its plans to build its own LLM, similar to OpenAI's GPT and Google's PaLM 2.
During the CNBC-TV18 and Moneycontrol Global AI Conclave on December 16, Zoho's founder, Sridhar Vembu, shared that the company is working on smaller AI models, with 7 billion to 20 billion parameters, to solve specific domain problems for its customers.
The SaaS giant recently unveiled a suite of 13 generative AI extensions and integrations for its applications, all powered by ChatGPT.
Indus Project by Tech Mahindra
Tech Mahindra CEO CP Gurnani was quick to respond to Altman's statement about it being "hopeless to compete" in the LLM race, stating, "challenge accepted". Soon after, the company unveiled its LLM named the Indus Project.
The Indus Project aims to establish a foundation model for Indian languages. In the first phase, the company is focusing on creating an LLM for the Hindi language, covering 40 dialects. Tech Mahindra reportedly plans to launch the LLM early next year.
On its website, the company expresses its goal to "build an Open Source Large Language AI model to serve the needs of 25 percent of the world's population".
Reliance-NVIDIA
Global chip design major NVIDIA and Reliance Industries on September 8 announced a partnership to develop a foundation LLM trained on the nation's diverse languages and tailored for generative AI applications.
Under this partnership, NVIDIA will grant access to its cutting-edge chips and AI supercomputing services in the cloud. The responsibility for execution and implementation will be overseen by Reliance Jio, leveraging its extensive expertise in mobile telephony, 5G spectrum, fibre networks, and other related domains.
Tata-NVIDIA
NVIDIA and the Tata Group have joined forces in a strategic partnership with a threefold agenda. First, the partnership aims to develop and process generative AI applications while providing AI upskilling to over six lakh employees at TCS.
Second, the partnership involves working closely with Tata Motors to implement AI across various domains such as design, styling, engineering, simulation testing, and autonomous vehicle capabilities.
Third, NVIDIA will assist Tata Communications in constructing AI infrastructure.
AI4Bharat
Backed by Nandan Nilekani and based at IIT Madras, AI4Bharat focuses on the development of open-source language AI for Indian languages, offering datasets, models, and applications.
One of their models is IndicBart, a multilingual, sequence-to-sequence pre-trained model designed for Indic languages and English. Based on the mBART architecture, it currently supports 11 Indian languages and is trained on a vast corpus of 452 million sentences and 9 billion tokens, spanning both Indic languages and Indian English content.
Another significant model from AI4Bharat is IndicBERT, a multilingual ALBERT model trained on large-scale corpora covering 12 major Indian languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. Despite having fewer parameters than other public models like mBERT and XLM-R, IndicBERT, according to AI4Bharat, performs on par with or better than them across various tasks.
Project Vaani
The Indian Institute of Science (IISc), AI and Robotics Technology Park (ARTPARK), and Google have joined forces to create Project Vaani, an initiative to build a comprehensive language model.
The initiative is designed to gather and transcribe open-source anonymised speech data from all 773 districts in India. The project places emphasis on ensuring linguistic, educational, urban-rural, age, and gender diversity in three distinct phases, with the initial phase concentrating on 80 districts across 10 states.
Disclaimer: Moneycontrol is a part of the Network18 group. Network18 is controlled by Independent Media Trust, of which Reliance Industries is the sole beneficiary.