A common repository of Indian language data for the promotion and creation of language resources for Human Language Technology

Under BHASHINI, Funded by MeitY, Govt. of India


About Us

Mission Bhashini, an initiative of the Ministry of Electronics and Information Technology (MeitY), Government of India, has recently recognized and funded the establishment of "COIL-D: Centre of Indian Language Data." The project is led by Prof. Asif Ekbal, Associate Professor at IIT Patna, in collaboration with IIT Delhi, IIIT Delhi, IIT Guwahati, IGDTUW, MIT Manipal, and DIBD.


Project
Process

Aims

Create language resources such as parallel corpora for MT between Indian languages, as well as benchmark datasets for PoS tagging, NER, ASR, and TTS

Primary Focus

Curate core Hindi data—50% from science, healthcare, agriculture, climate, tourism, judiciary; 30% education; 20% governance & 20% conversational content

Key Focus

To create, preserve, and standardize language resources to support NLP research and applications



Applications

This dataset will power Hindi-to-21 language translation and multilingual chatbots, enabling efficient cross-sector communication

Initiative

Establish a central repository for Indian language data and provide a platform for developing and benchmarking HLT applications

Results

With minimal post-editing, you can quickly translate large volumes of text, making it ready for publication on print and digital platforms

Development

Develop leaderboards to evaluate MT, PoS tagging, NER, NLG, sentiment analysis, ASR, and TTS, ensuring systematic NLP assessment

Delivery

Benefit from these advanced tools at no cost, with the aim of surpassing existing solutions like Google Translate while also enriching our linguistic heritage

As a data partner, your organization will be acknowledged on the MeitY portal for its valuable contributions to this important national project. Upon completion, due credit will be given for your support, highlighting your role in this significant initiative.



COIL-D Project Objectives

The COIL-D (Centre of Indian Language Data) project aims to establish a unified repository of Indian language data and create a platform for systematic evaluation of Machine Translation (MT) and other Natural Language Processing (NLP) systems. It promotes the creation and preservation of language resources for Human Language Technology (HLT) applications and defines standards for linguistic benchmarking.

Step 1: Identification of Hindi language resources

Identify and list existing Hindi datasets and tools. This helps understand current resource availability.

Step 2: Acquisition of resources across target domains

Collect data from key domains like health, education, and governance. It ensures coverage of real-world language use.

Step 3: Creation of language resources and benchmarks

Develop new data for Machine Translation and NLP. Set benchmarks to evaluate tool performance.

Step 4: Development of MT evaluation leaderboards

Create leaderboards to assess translation systems. They help track progress and encourage improvements.

Step 5: Leaderboards for ASR and TTS technologies

Develop scoreboards to evaluate speech recognition and synthesis systems. These track accuracy, clarity, and performance.

Step 6: Benchmarks for PoS and NER taggers

Define evaluation tests for tools like PoS and NER taggers. This ensures fair and consistent evaluation.
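
As an illustration of the kind of evaluation test Step 6 describes, the sketch below computes token-level accuracy of a PoS tagger against gold-standard tags. The function name, the tag labels, and the choice of accuracy as the metric are assumptions made for this example; the project page does not specify which metrics the benchmarks will use.

```python
from typing import List, Sequence


def tagging_accuracy(gold: Sequence[Sequence[str]],
                     predicted: Sequence[Sequence[str]]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Both arguments hold one list of tags per sentence, in the same order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")

    correct = 0
    total = 0
    for gold_tags, predicted_tags in zip(gold, predicted):
        if len(gold_tags) != len(predicted_tags):
            raise ValueError("each sentence must have one predicted tag per gold tag")
        total += len(gold_tags)
        correct += sum(g == p for g, p in zip(gold_tags, predicted_tags))

    if total == 0:
        raise ValueError("cannot score an empty test set")
    return correct / total


# Hypothetical two-sentence test set (tag names are illustrative only).
gold_tags: List[List[str]] = [["NOUN", "VERB"], ["PRON", "VERB", "ADJ"]]
system_tags: List[List[str]] = [["NOUN", "VERB"], ["PRON", "NOUN", "ADJ"]]
accuracy = tagging_accuracy(gold_tags, system_tags)
```

Scoring every system against the same gold file with the same function is what keeps the comparison consistent across submissions.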

Motivation

The primary motivation behind gathering Hindi language data is to facilitate translation into 17 other Indian languages. Additionally, we are focusing on translation from Tamil into other Dravidian languages.

A significant emphasis is placed on the creation of high-quality language resources. Efforts will be made to generate parallel corpora, aiming for 50k–100k sentence pairs for each language pair, with 2k additional sentence pairs for evaluation purposes. Hindi will serve as a central language, enabling efficient translation between Hindi and other Indian languages. Furthermore, the creation of leaderboards for MT, ASR, and TTS will provide standardized platforms for performance benchmarking. We will also establish benchmarks for various linguistic tools, such as PoS taggers, Named Entity Recognition (NER) systems, Natural Language Generation (NLG), and Sentiment Analysis tools, which will enhance resources available for Indian languages.
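
The corpus sizes mentioned above can be pictured with a small sketch that holds out a fixed evaluation set from a pool of sentence pairs. It assumes the pairs are already collected as (Hindi, target-language) tuples; the function name, the random hold-out strategy, and the seed are illustrative assumptions, not the project's documented procedure.

```python
import random
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]  # (Hindi sentence, target-language sentence)


def split_parallel_corpus(pairs: Sequence[SentencePair],
                          eval_size: int = 2000,
                          seed: int = 0) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Shuffle the pairs reproducibly and hold out `eval_size` of them.

    Returns (training_pairs, evaluation_pairs); the two lists do not overlap
    as long as the input pairs are distinct.
    """
    if eval_size > len(pairs):
        raise ValueError("eval_size exceeds the number of available pairs")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]


# Hypothetical pool: 50k training pairs plus the 2k additional evaluation pairs.
pool = [(f"hindi sentence {i}", f"target sentence {i}") for i in range(52_000)]
training_pairs, evaluation_pairs = split_parallel_corpus(pool, eval_size=2000)
```

Fixing the seed keeps the evaluation set stable between runs, which matters when results from different systems are compared over time.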

Corpora


Machine Translation (MT)
Machine Translation (MT) is the automatic conversion of text from one language to another using computer algorithms. It enables fast, large-scale translation without human intervention. Modern MT systems often use AI and deep learning to improve accuracy and fluency.
Natural Language Generation (NLG)
Natural Language Generation (NLG) is a branch of AI that automatically produces human-like text from structured data or input. It helps systems write summaries, reports, or responses in natural language. NLG is widely used in chatbots, content creation, and data-driven reporting.
Part-of-Speech (PoS) Tagging
Part-of-Speech (PoS) tagging is the process of labeling each word in a sentence with its grammatical category, such as noun, verb, adjective, etc. It helps machines understand sentence structure and is a key step in many NLP tasks like parsing and information extraction.
Named Entity Recognition (NER)
Named Entity Recognition (NER) is a natural language processing technique used to identify and classify entities like names of people, organizations, locations, dates, and more in text. It helps in extracting structured information from unstructured data, enabling better understanding and analysis of content.
Automatic Speech Recognition (ASR)
Automatic Speech Recognition (ASR) is a technology that converts spoken language into written text. It enables voice-based interaction with devices and is used in applications like virtual assistants, transcription services, and voice commands.
Text-to-Speech (TTS)
Text-to-Speech (TTS) is a technology that converts written text into spoken voice output. It allows computers and devices to "read aloud" text, making digital content accessible to users, especially in assistive and multilingual applications.
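
For the ASR task described above, a commonly used accuracy measure is word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal implementation under the assumption that both transcripts are whitespace-tokenized; the project page does not state which ASR metric its leaderboards will report.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] = edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                       # deletion
                current_row[j - 1] + 1,                    # insertion
                previous_row[j - 1] + substitution_cost,   # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word and one dropped word out of four reference words.
example_wer = word_error_rate("a b c d", "a x c")
```

Lower values are better, and the value can exceed 1.0 when the hypothesis contains many inserted words.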

Leaderboard

A leaderboard is a platform or framework used to benchmark and compare the performance of different models and algorithms on specific natural language processing tasks. In this case, the leaderboard compares the performance of Machine Translation systems for Indian languages. Its users fall into three broad groups: admins, developers, and viewers. Admins maintain the leaderboard, updating it with new benchmarking datasets and adding new tasks and other features to the existing framework. Developers are users who want to compare their MT hypotheses against those of state-of-the-art models. Viewers are visitors to the website.
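
The three user roles can be illustrated with a small in-memory sketch. Everything here is hypothetical: the class and method names, the rule that only developers submit, and the assumption that each submission arrives with a single precomputed score where higher is better. It is meant to show the division of responsibilities, not the actual COIL-D implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Role(Enum):
    ADMIN = "admin"
    DEVELOPER = "developer"
    VIEWER = "viewer"


@dataclass
class Submission:
    system_name: str
    score: float  # assumed: a precomputed MT quality score, higher is better


@dataclass
class Leaderboard:
    # Maps a benchmark name to the submissions received for it.
    benchmarks: Dict[str, List[Submission]] = field(default_factory=dict)

    def add_benchmark(self, role: Role, name: str) -> None:
        """Admins register new benchmarking datasets or tasks."""
        if role is not Role.ADMIN:
            raise PermissionError("only admins can add benchmarks")
        self.benchmarks.setdefault(name, [])

    def submit(self, role: Role, benchmark: str, submission: Submission) -> None:
        """Developers submit a scored system for an existing benchmark."""
        if role is not Role.DEVELOPER:
            raise PermissionError("only developers can submit systems")
        if benchmark not in self.benchmarks:
            raise KeyError(f"unknown benchmark: {benchmark}")
        self.benchmarks[benchmark].append(submission)

    def ranking(self, benchmark: str) -> List[Submission]:
        """Any user, including viewers, can read the ranking (best first)."""
        return sorted(self.benchmarks[benchmark], key=lambda s: s.score, reverse=True)


board = Leaderboard()
board.add_benchmark(Role.ADMIN, "hindi-odia-mt")
board.submit(Role.DEVELOPER, "hindi-odia-mt", Submission("system-a", 31.2))
board.submit(Role.DEVELOPER, "hindi-odia-mt", Submission("system-b", 35.0))
top_system = board.ranking("hindi-odia-mt")[0].system_name
```

Keeping write access with admins and developers while leaving the ranking readable by everyone mirrors the role split described above.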

Consortium Members

Dr. Asif Ekbal

Principal Investigator

Associate Professor
Computer Science and Engineering
IIT Patna, India

Dr. Ashwini Vaidya
 

Assistant Professor
HSS
IIT Delhi, India

Dr. Tanmoy Chakraborty
 

Associate Professor
Electrical Engineering
IIT Delhi, India

Dr. Samit Bhattacharyya
 

Associate Professor
Computer Science and Engineering
IIT Guwahati, India

Dr. Shad Akhtar
 

Assistant Professor
Computer Science and Engineering
IIIT Delhi, India

Dr. Poonam Bansal
 

Professor
AI & Data Science
IGDTUW, India

Dr. Amita Dev
 

Vice-Chancellor

IGDTUW, India

Dr. Muralikrishna S N
 

Assistant Professor (Senior)

Manipal Institute of Technology, MAHE
Manipal, India

Career

Temporary positions are available on the COIL-D project (sponsored by MeitY). Applicants should send their resume (PDF) and required documents to the email address provided below.

Senior Research Associate (Technical)

Posts: 01 • Salary: Rs. 70,000/month (with 10% annual increment)
Project No. 1155

Minimum eligibility

  • M.Tech / M.E. / MS in CSE / AI / DS / ML / IT / Electronics / EE with at least 1 year experience.
  • Exposure to NLP, Machine Translation, Large Language Models, etc.

Junior Research Associate (Technical)

Posts: 02 • Salary: Rs. 50,000/month (with 10% annual increment)
Project No. 1155

Minimum eligibility

  • B.Tech/BE in CSE, IT, AI & DS, Electronics/ECE/EE OR MCA OR M.Sc (CS).
  • Experience of building systems in AI, ML and NLP.
  • Experience in creating and maintaining GUIs and webpages, and knowledge of data management.

Junior Research Associate (Odia Language)

Posts: 01 • Salary: Rs. 45,000/month (with 10% annual increment)
Project No. 1155

Minimum eligibility

  • B.A./M.A./M.Phil/PhD in Language/Linguistics.
  • Knowledge of the Odia and Hindi languages.
  • Experience in parallel corpus creation, data annotation, etc.
  • Experience of working in a language technology project.

Office Assistant

Posts: 01 • Salary: Rs. 30,000/month (with 10% annual increment)
Project No. 1155

Minimum eligibility

  • B.Com / B.Sc. / BCA / Diploma (with 1 year experience), OR B.Tech
  • Proficiency in MS Word, Excel, Tally
  • Knowledge of Notesheets, Utilization Certificates, SOE, Purchase Indents
  • Experience in R&D/Accounts of IITs, NITs, IISERs, or other Central Govt. institutions (≥1 year)
  • Knowledge of webpage design and/or accounting

Note: All the above positions are temporary in nature. The CI may terminate the contract with 15 days' notice if performance is not found satisfactory.

Advertisement Date: 26-08-2025

Contact

Address

IIT Patna, Bihta Kanpa Rd, Patna,
Dayalpur Daulatpur, Bihar 801106

Call Us

+91 6115 233 338

   

Email Us

coild.bhashini@gmail.com