
AI Summit: Language data gaps limit AI progress beyond dominant languages, says Alice Oh

Researcher flags structural and cost barriers in building inclusive multilingual models

February 18, 2026 / 20:16 IST
Snapshot AI
  • AI language models still focus on a few dominant languages.
  • AI inclusivity for many languages hindered by data scarcity, high costs.
  • Governments, companies, and universities must all address language gaps.

Despite rapid advances in language models, progress remains limited to a narrow set of dominant languages, Alice Oh, professor, Korea Advanced Institute of Science and Technology, said. She pointed to structural data constraints — including languages without written forms and undigitised texts — as a persistent bottleneck for AI inclusivity.

“I think definitely as people pay more attention to these languages, the language models do get much better. It’s really great to see those improvements,” Alice Oh told Moneycontrol in an interview, while cautioning that such gains remain uneven. “It’s just that it’s still limited to a small number of languages.”

Data availability

At the centre of the challenge lies the difficulty of obtaining datasets large enough to train modern AI systems. Oh acknowledged that data scarcity is not an occasional obstacle but a defining limitation for many languages.

“Oh yeah, of course,” Oh said when asked whether large-scale data collection poses a challenge. “Sometimes it’s just not possible to get large-scale data.”

Unlike widely used global languages that benefit from vast digital text repositories, many languages lack usable machine-readable resources. This directly affects how effectively AI systems can learn, generalise, and deliver reliable outputs across linguistic contexts, she said.

Cost pressures

Beyond linguistic limitations, Oh underscored the financial implications of building inclusive datasets. “Again, it just becomes expensive. But somebody has to do that work,” she said.

Oh emphasised that the responsibility for addressing language inequities cannot rest solely with large technology firms. “It’s going to be companies, but also governments and universities — all of these people and institutions have to pay attention to this problem,” she said.

The combination of limited written archives, digitisation gaps, and high costs, she suggested, continues to slow the expansion of language models beyond a small set of dominant languages despite growing research attention.


Meghna Mittal
Meghna Mittal is Deputy News Editor at Moneycontrol. She has experience across television, print, online and wire media, and has been covering the Indian economy, monetary and fiscal policies, and the Finance and Trade ministries. She tweets at @Meghnamittal23. Contact: meghna.mittal@nw18.com
first published: Feb 18, 2026 08:15 pm
