Chennai: In a unique attempt in academia, the Indian Institute of Technology-Madras (IIT-M) faculty have developed Artificial Intelligence (AI) models and datasets to process texts in 11 Indian regional languages. This was taken up jointly with ‘AI4Bharat’, a platform for building AI solutions for problems of relevance to India.
This is a unique attempt in academia to develop and publicly release such large-scale multilingual AI models, containing millions of parameters and trained on billions of tokens from Indian languages. The researchers from IIT Madras and AI4Bharat have released AI models and datasets for 11 languages: Tamil, Hindi, Malayalam, Telugu, Kannada, Punjabi, Bengali, Odia, Assamese, Gujarati and Marathi, an IIT-M press release said today.
The multilingual AI models and datasets developed through this initiative will give students, faculty, start-ups and industry the essential building blocks to work on Indian language tools and push the frontiers of technology. These models are freely available and can be downloaded from a GitHub repository (https://indicnlp.ai4bharat.org/).
An accompanying research paper describing the research methodologies and evaluation has been accepted at EMNLP-Findings (a companion publication of one of the top Natural Language Processing conferences). Dr Mitesh M Khapra, Assistant Professor, Department of Computer Science and Engineering, IIT-M, said, “We have a very rich diversity of languages in our country. As we move towards a digital economy, it is important that our languages find a space online. This requires a lot of innovation in creating input tools, datasets, and AI models for Indian languages.”
For example, imagine a learner who posts a question on an e-learning platform in Tamil, Hindi or any of the numerous other Indian regional languages. There is a need for tools that can automatically process such questions written in Indian languages and classify them into specific topics, he said.
“While such tools are available for English and other foreign languages, there are hardly any tools for Indian languages, and this is the critical gap that we are trying to address through this initiative,” he added.