The Digital India Bhashini Division (DIBD) under the Ministry of Electronics and Information Technology has called on agencies to annotate and label datasets in 22 Indian languages necessary for training artificial intelligence (AI) models.
"Data annotation and labelling are crucial for machine learning because they provide the necessary context for algorithms to learn effectively," DIBD CEO Amitabh Nag told Moneycontrol.
"By adding meaningful tags or labels to raw data (like images, text, or audio), these processes enable models to understand patterns, make accurate predictions and ultimately, perform desired tasks," Nag added.
Nag said that without properly labelled data, "machine learning models struggle to learn, leading to poor performance and unreliable results."
In this regard, DIBD has floated a request for empanelment (RFE), inviting companies to annotate and label Indian datasets. "The RFE is providing a huge opportunity to those in the data industry to be part of the AI revolution," Nag added.
According to the RFE, vendors will be expected to cover five core AI/ML language tasks: Automatic Speech Recognition (ASR), Machine Translation (MT), Text-to-Speech (TTS), Optical Character Recognition (OCR), and Transliteration.
The RFE said that vendors will be expected to annotate raw data with domain- and task-specific metadata. For ASR, for instance, selected vendors will have to produce both verbatim and cleaned transcripts, along with timestamps and speaker details such as age and gender.
For Machine Translation, translations will need to be validated for context, fluency, and alignment.
All annotation work must be performed on Bhashini’s in-house Data Capture and Curation Framework (DCCF) platform, the RFE said.
The RFE sets strict quality benchmarks to validate consistency. Industry experts emphasise how crucial high-quality labelled data is for effective AI systems, particularly in low-resource languages.
Bhashini currently supports 22 official Indian languages, the same languages listed under the Eighth Schedule of the Indian Constitution.
The empanelment will be valid for one year and can be extended to two years, and the government body is inviting bids until August 28.