Lightspeed-backed artificial intelligence (AI) startup Sarvam AI launched its Large Language Model (LLM), Sarvam 1, on October 24.
The startup claims it is India’s first homegrown multilingual LLM, trained from scratch on domestic AI infrastructure in 10 Indian languages and English, according to a post on X (formerly Twitter).
Sarvam 1 supports English along with 10 major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The two-billion-parameter model was trained on Nvidia H100 Graphics Processing Units (GPUs).
Sarvam AI also uses a range of Nvidia services and products, including its microservices, conversational AI tools, LLM software, and inference server, to optimise and deploy conversational AI agents with sub-second latency.
Apart from Nvidia, the model was trained using Yotta’s data centres for compute infrastructure and drew on AI4Bharat’s open-source technology and language resources.
“This combination of high performance and computational efficiency makes Sarvam-1 particularly well-suited for practical applications, including deployment on edge devices,” the AI company wrote in a blog post.
In December 2023, the AI startup launched OpenHathi, India’s first Hindi LLM. That model was built on Meta AI's Llama2-7B architecture, with a 48,000-token vocabulary extension. Sarvam 1, by contrast, was trained from scratch on a corpus of two trillion tokens.
The LLM was trained on two trillion tokens of synthetic Indic data, produced by a custom data pipeline capable of generating high-quality, diverse text while maintaining factual accuracy, and paired with an efficient tokenizer. Sarvam said the latest model from its stable matches or beats much larger models such as Llama 3.1 8B, while being four to six times faster during inference.
In AI, inference refers to the process by which a trained model makes predictions or draws conclusions from new data based on the patterns it learned during training.
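To make that concrete, the sketch below shows what inference with a small causal language model typically looks like when loaded through the Hugging Face transformers library. The model identifier used here is an assumption for illustration, not a confirmed release name.

```python
# Minimal sketch of inference with a ~2B-parameter causal language model
# using Hugging Face transformers. The identifier "sarvamai/sarvam-1" is
# hypothetical; substitute the actual published checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-1"  # assumed identifier for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Inference: the trained model predicts a continuation for new, unseen text.
prompt = "भारत की राजधानी"  # "The capital of India" in Hindi
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```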
Sarvam-2T, the startup’s pretraining corpus, contains documents that are twice as long, of three times higher quality, and with eight times more scientific content than existing Indic datasets. In total, Sarvam-2T holds around two trillion Indic tokens.
The data is almost evenly split between the 10 supported languages, with the exception of Hindi, which comprises about 20 percent of the data.