Lightspeed-backed artificial intelligence (AI) startup Sarvam AI launched its Large Language Model (LLM), Sarvam 1, on October 24.
The startup claims it is India’s first homegrown multilingual LLM, trained from scratch on domestic AI infrastructure in 10 Indian languages and English, according to a post on X (formerly Twitter).
Sarvam 1 supports English along with 10 major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The two-billion-parameter model was trained on Nvidia H100 Graphics Processing Units (GPUs).
Sarvam AI also uses a range of Nvidia services and products, including its microservices, conversational AI tools, LLM software, and inference server, to optimise and deploy conversational AI agents with sub-second latency.
Apart from Nvidia, the model was trained using Yotta’s data centres for compute infrastructure and drew on AI4Bharat’s open-source technology and language resources.
“This combination of high performance and computational efficiency makes Sarvam-1 particularly well-suited for practical applications, including deployment on edge devices,” the AI company wrote in a blog post.
In December 2023, the AI startup launched OpenHathi, India’s first Hindi LLM. That model was built on Meta AI's Llama2-7B architecture, with a 48,000-token vocabulary extension. Sarvam 1, by contrast, was trained from scratch on a corpus of two trillion tokens.
The LLM was trained on two trillion tokens of synthetic Indic data, produced by a custom data pipeline capable of generating high-quality, diverse text while maintaining factual accuracy, and paired with an efficient tokenizer. Sarvam said the latest model from its stable matches or beats much larger models such as Llama 3.1 8B, while being four to six times faster during inference.
In AI, inference refers to the process by which a trained model makes predictions or draws conclusions from new data based on the patterns it learned during training.
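To make that concrete, the sketch below shows what inference with a small causal language model typically looks like when loaded through the Hugging Face transformers library. The model identifier used here is an assumption for illustration, not a confirmed release name.

```python
# Minimal sketch of inference with a ~2B-parameter causal language model
# using Hugging Face transformers. The identifier "sarvamai/sarvam-1" is
# hypothetical; substitute the actual published checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-1"  # assumed identifier for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Inference: the trained model predicts a continuation for new, unseen text.
prompt = "भारत की राजधानी"  # "The capital of India" in Hindi
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```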
Sarvam-2T, the startup’s pretraining corpus, contains documents that are twice as long, of three times higher quality, and with eight times more scientific content than existing Indic datasets. In total, Sarvam-2T holds around two trillion Indic tokens.
The data is almost evenly split between the 10 supported languages, with the exception of Hindi, which comprises about 20 percent of the data.