Sarvam AI unveils OpenHathi, the first Hindi large language model

The model is built on Meta AI's Llama2-7B architecture, and according to Sarvam AI, it delivers performance on par with GPT-3.5 for Indic languages.

December 13, 2023 / 15:05 IST

Homegrown AI startup Sarvam AI has released OpenHathi-Hi-v0.1, the first Hindi large language model (LLM) in the OpenHathi series.

Sarvam AI extended Llama2-7B’s tokenizer to a 48,000-token vocabulary and trained the model in two phases. The first phase is embedding alignment, which aligns the randomly initialised Hindi embeddings. The second phase is bilingual language modeling, in which the model is trained to attend cross-lingually across tokens.

"We show that our model works as well as, if not better than GPT-3.5 on various Hindi tasks while maintaining its English performance," the company said in a post on X (formerly Twitter).

The company said that it evaluated the model's performance on real-world tasks beyond standard Natural Language Generation (NLG) tasks.

The five-month-old AI startup also partnered with KissanAI to fine-tune its base model on conversational data that KissanAI had gathered. The dataset comprises conversations between a GPT-powered bot and farmers in different languages.

"The first step in adding Hindi skills to Llama-2 is decreasing the fertility score (the average number of tokens a word is split into) of its tokeniser on Hindi text. This would make both training and inferencing faster and more efficient," the company said in a blog post.
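The fertility score described in the quote can be illustrated with a minimal Python sketch. It is not Sarvam AI's code: the function name `fertility` and the two toy tokenisers are hypothetical stand-ins, and words are split on whitespace and tokenised one at a time, which simplifies how a real SentencePiece tokeniser processes text.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens each whitespace-separated word is split into."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        for word in text.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


# Hypothetical toy tokenisers, used only to show the two extremes.
def char_tokenize(word: str) -> List[str]:
    # Worst case: every character becomes its own token.
    return list(word)


def whole_word_tokenize(word: str) -> List[str]:
    # Best case: every word is a single token.
    return [word]


sample = ["ab cd", "efgh"]
print(fertility(sample, char_tokenize))        # 8 tokens over 3 words
print(fertility(sample, whole_word_tokenize))  # 3 tokens over 3 words
```

A lower score means fewer tokens for the same text, which is why the company links it to faster training and inference.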

"We train a sentence-piece tokeniser from a subsample of 100K documents from the Sangraha corpus, created at AI4Bharat, with a vocabulary size of 16K. We then merge this with the Llama2 tokeniser and create a new tokeniser with a 48K vocabulary (32K original vocabulary plus our added 16K)," it added.
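The vocabulary arithmetic in the quote (32K plus 16K giving 48K) can be sketched in Python as follows. This is an illustration under assumptions, not the company's merge procedure: `merge_vocab` is a hypothetical helper, the token strings are placeholders, and skipping tokens already present in the base vocabulary is an assumption the source does not describe.

```python
from typing import List


def merge_vocab(base_vocab: List[str], new_vocab: List[str]) -> List[str]:
    """Append new tokens to the base vocabulary, keeping base token IDs unchanged."""
    merged = list(base_vocab)
    seen = set(base_vocab)
    for token in new_vocab:
        # Assumption: a token already in the base vocabulary is not added twice.
        if token not in seen:
            merged.append(token)
            seen.add(token)
    return merged


# Placeholder vocabularies with the sizes quoted in the article.
base = [f"base_{i}" for i in range(32_000)]   # stands in for the Llama2 vocabulary
hindi = [f"hi_{i}" for i in range(16_000)]    # stands in for the new Hindi tokens

merged = merge_vocab(base, hindi)
print(len(merged))  # 32,000 + 16,000 entries when the two sets do not overlap
```

Because the original tokens keep their positions, the existing embeddings can be reused as-is, and only the appended entries need new, randomly initialised embeddings.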

Sarvam AI, founded in July 2023 by Vivek Raghavan and Pratyush Kumar, secured $41 million in a funding round earlier this month. Lightspeed Ventures led the investment, with participation from Peak XV Partners and Khosla Ventures.

Moneycontrol News
first published: Dec 13, 2023 03:05 pm
