IIT Bombay Develops Multilingual AI Model Supporting 22 Indian Languages for Rural Access

Amit Yadav

Mar 7, 2026 · 2 min read
Researchers at IIT Bombay's Centre for AI Research have released BharatLM 2.0, an open-source large language model trained natively on all 22 scheduled Indian languages — the most comprehensive Indic AI model ever built, designed to serve the 900 million Indians who prefer to communicate in languages other than English.

IIT Bombay's Centre for AI Research has released BharatLM 2.0, a large language model trained natively on all 22 languages listed in the Eighth Schedule of the Indian Constitution — including Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Odia, Punjabi, and 12 more. The 13-billion-parameter model is fully open-source under the Apache 2.0 licence and is available on Hugging Face, GitHub, and the AI4Bharat platform.

Unlike earlier Indic language models built by adapting English-first architectures with translated data, BharatLM 2.0 was pre-trained from scratch on a 2.4-trillion-token multilingual corpus sourced from digitised books, government records, regional news archives, and transcribed oral histories, with each language given proportional representation based on its native-speaker population. The training corpus was curated over 18 months in collaboration with C-DAC, IIT Madras, and seven state governments.
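Proportional representation by native-speaker population amounts to splitting the token budget in proportion to speaker counts. A minimal sketch of such an allocation, using rough illustrative figures — the speaker counts and the `token_budget` helper below are assumptions for illustration, not details from the IIT Bombay team's recipe:

```python
# Illustrative sketch: splitting a training-token budget across languages
# in proportion to native-speaker population. Speaker counts (in millions)
# are rough illustrative figures, not taken from the article.
speakers_millions = {
    "Hindi": 528,
    "Bengali": 97,
    "Marathi": 83,
    "Telugu": 81,
    "Tamil": 69,
    "Santali": 7,
}

def token_budget(speakers, total_tokens):
    """Allocate total_tokens across languages proportionally to
    each language's share of the total speaker population."""
    total_speakers = sum(speakers.values())
    return {
        lang: round(total_tokens * n / total_speakers)
        for lang, n in speakers.items()
    }

# The article's corpus size: 2.4 trillion tokens.
budget = token_budget(speakers_millions, total_tokens=2_400_000_000_000)
for lang, tokens in budget.items():
    print(f"{lang}: {tokens / 1e9:.0f}B tokens")
```

Note that a purely proportional split leaves a low-resource language like Santali with a small sliver of the budget; multilingual training recipes often counteract this with temperature-based up-sampling of low-resource languages, though the article does not say whether BharatLM 2.0 applied any such adjustment.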

In evaluations on IndicGLUE, the standard benchmark for Indian-language NLP tasks, BharatLM 2.0 outperforms Google's mT5, Meta's mBERT, and all previous AI4Bharat models across every tested language. It is particularly strong on low-resource languages such as Santali, Dogri, and Bodo, which were largely absent from previous multilingual models.

The practical applications are extensive. The team has already built a prototype voice-based healthcare advisory system in Marathi and Bhojpuri that is being piloted with ASHA workers in rural Maharashtra and Bihar. The system allows frontline health workers to query medical guidelines, report patient data, and receive appointment reminders in their native languages, without requiring any English literacy.

Prof. Preethi Jyothi, who led the project, described the model as "infrastructure for a more equitable AI future in India." She added that the team has applied for funding under the IndiaAI Mission to scale the model to 70 billion parameters, which would bring it to GPT-4-class performance across Indic languages. IndiaAI has indicated it will fast-track the application given the model's alignment with national priorities on linguistic inclusion.