COIL-D

A common repository of Indian language data for the creation and promotion of language resources for Human Language Technology

Under BHASHINI, Funded by MeitY, Govt. of India

About Us

Mission Bhashini, an initiative of the Ministry of Electronics and Information Technology (MeitY), Government of India, has recently recognized and funded the establishment of "COIL-D: Centre of Indian Language Data." The project is led by Prof. Asif Ekbal, Associate Professor at IIT Patna, in collaboration with IIT Delhi, IIIT Delhi, IIT Guwahati, IGDTUW, MIT Manipal, and DIBD.


Aims

Create language resources like parallel corpora for MT between Indian languages, as well as benchmark datasets for PoS tagging, NER, ASR, and TTS

Primary Focus

Curate core Hindi data from science, healthcare, agriculture, climate, tourism, judiciary, education, governance & conversational content

Key Focus

To create, preserve, and standardize language resources to support NLP research and applications

Applications

This dataset will power translation from Hindi into 21 languages and multilingual chatbots, enabling efficient cross-sector communication

Initiative

Establish a central repository for Indian language data and provide a platform for developing and benchmarking HLT applications

Results

With minimal post-editing, you can quickly translate large volumes of text, making it ready for publication on print and digital platforms

Development

Develop leaderboards to evaluate MT, PoS tagging, NER, NLG, sentiment analysis, ASR, and TTS, ensuring systematic NLP assessment

Delivery

Benefit from these advanced tools at no cost, surpassing existing solutions like Google Translate, while also enriching our linguistic heritage

As a data partner, your organization will be acknowledged on the MeitY portal for its valuable contributions to this important national project. Upon completion, due credit will be given for your support, highlighting your role in this significant initiative.

COIL-D Project Objectives

The COIL-D (Centre of Indian Language Data) project aims to establish a unified repository of Indian language data and create a platform for systematic evaluation of Machine Translation (MT) and other Natural Language Processing (NLP) systems. It promotes the creation and preservation of language resources for Human Language Technology (HLT) applications and defines standards for linguistic benchmarking.

Step 1: Identification of Hindi language resources

Identify and list existing Hindi datasets and tools. This establishes which resources are currently available.

Step 2: Acquisition of resources across target domains

Collect data from key domains like health, education, and governance. It ensures coverage of real-world language use.

Step 3: Creation of language resources and benchmarks

Develop new data for Machine Translation and NLP. Set benchmarks to evaluate tool performance.

Step 4: Development of MT evaluation leaderboards

Create leaderboards to assess translation systems. They help track progress and encourage improvements.

Step 5: Leaderboards for ASR and TTS technologies

Develop scoreboards to evaluate speech recognition and synthesis systems. These track accuracy, clarity, and performance.

Step 6: Benchmarks for PoS and NER taggers

Define evaluation tests for tools like PoS and NER taggers. This ensures fair and consistent evaluation.
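
As an illustration of the kind of accuracy measure an ASR leaderboard (Step 5) could track, the sketch below computes word error rate (WER), a widely used speech recognition metric. The project description does not say which metrics COIL-D will use, so the choice of WER and the function name word_error_rate are assumptions made for this example only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences, not project data: one of five words is wrong.
print(word_error_rate("मैं घर जा रहा हूँ", "मैं घर आ रहा हूँ"))  # 0.2
```

Lower values are better, and a value of 0.0 means the transcript matches the reference word for word.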

Motivation

The primary motivation behind gathering Hindi language data is to facilitate translation into 17 other Indian languages. Additionally, we are focusing on Tamil-to-Dravidian language translations.

A significant emphasis is placed on the creation of high-quality language resources. Efforts will be made to generate parallel corpora, aiming for 50k–100k sentence pairs for each language pair, with 2k additional sentence pairs for evaluation purposes. Hindi will serve as a central language, enabling efficient translation between Hindi and other Indian languages. Furthermore, the creation of leaderboards for MT, ASR, and TTS will provide standardized platforms for performance benchmarking. We will also establish benchmarks for various linguistic tools, such as PoS taggers, Named Entity Recognition (NER) systems, Natural Language Generation (NLG), and Sentiment Analysis tools, which will enhance resources available for Indian languages.
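
The corpus sizes above can be pictured with a small sketch: given the sentence pairs collected for one language pair, hold out 2,000 for evaluation and keep the rest as the main parallel corpus. The function name split_parallel_corpus, the random shuffling, and the synthetic placeholder pairs are assumptions for illustration; the project does not describe how its evaluation sets are selected.

```python
import random


def split_parallel_corpus(pairs, eval_size=2000, seed=0):
    """Shuffle sentence pairs and hold out `eval_size` of them for evaluation."""
    if len(pairs) <= eval_size:
        raise ValueError("corpus must be larger than the evaluation set")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    return shuffled[eval_size:], shuffled[:eval_size]


# Synthetic placeholder pairs: 50,000 for the corpus plus 2,000 for evaluation.
pairs = [(f"hindi sentence {i}", f"target sentence {i}") for i in range(52_000)]
corpus, evaluation = split_parallel_corpus(pairs)
print(len(corpus), len(evaluation))  # 50000 2000
```

Keeping the evaluation pairs separate from the main corpus means systems are benchmarked on sentences they were not trained on.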

Corpora

Machine Translation (MT)

Machine Translation (MT) is the automatic conversion of text from one language to another using computer algorithms. It enables fast, large-scale translation without human intervention. Modern MT systems often use AI and deep learning to improve accuracy and fluency.

Natural Language Generation (NLG)

Natural Language Generation (NLG) is a branch of AI that automatically produces human-like text from structured data or input. It helps systems write summaries, reports, or responses in natural language. NLG is widely used in chatbots, content creation, and data-driven reporting.

Part-of-Speech (PoS) Tagging

Part-of-Speech (PoS) tagging is the process of labeling each word in a sentence with its grammatical category, such as noun, verb, adjective, etc. It helps machines understand sentence structure and is a key step in many NLP tasks like parsing and information extraction.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is a natural language processing technique used to identify and classify entities like names of people, organizations, locations, dates, and more in text. It helps in extracting structured information from unstructured data, enabling better understanding and analysis of content.

Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) is a technology that converts spoken language into written text. It enables voice-based interaction with devices and is used in applications like virtual assistants, transcription services, and voice commands.

Text-to-Speech (TTS)

Text-to-Speech (TTS) is a technology that converts written text into spoken voice output. It allows computers and devices to "read aloud" text, making digital content accessible to users, especially in assistive and multilingual applications.

Leader Board

A leaderboard is a platform or framework used to benchmark and compare the performance of different models and algorithms on specific natural language processing tasks. In this case, the leaderboard compares the performance of Machine Translation systems for Indian languages. Its users fall into three broad groups: admins, developers, and viewers. Admins maintain the leaderboard, updating it with new benchmarking datasets and adding new tasks and other features to the existing framework. Developers are users who want to compare their MT hypotheses against the hypotheses of state-of-the-art models. Viewers are visitors to the website.
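
A minimal sketch of the ranking step behind such a leaderboard is shown below. It assumes each developer submission has already been scored by some metric where higher is better; the Submission class, the rank_submissions function, the language-pair labels, and the scores are all hypothetical and do not describe the actual COIL-D implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    system: str         # name of the submitted MT system (hypothetical)
    language_pair: str  # e.g. "hi-ta" for Hindi to Tamil (label format assumed)
    score: float        # metric value, higher is better (metric not specified)


def rank_submissions(submissions, language_pair):
    """Return submissions for one language pair, best score first."""
    rows = [s for s in submissions if s.language_pair == language_pair]
    return sorted(rows, key=lambda s: s.score, reverse=True)


# Made-up entries for illustration only.
submissions = [
    Submission("system-a", "hi-ta", 31.5),
    Submission("system-b", "hi-ta", 34.2),
    Submission("system-c", "hi-bn", 29.8),
]

for position, entry in enumerate(rank_submissions(submissions, "hi-ta"), start=1):
    print(position, entry.system, entry.score)
```

In this picture, admins would add new benchmark datasets and tasks, developers would contribute Submission entries, and viewers would only read the ranked table.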

Consortium Members

Our Collaborators

NCERT

Domain: Education

All India Radio

Free Tamil Books

Integrated Good Agricultural Practices in Seedless Grapes

Tamil Nationalized and Public Domain Books Collection

Press Information Bureau

Blog: Dr. Suneel Krishnan

Educational Blog: Venkatesh Jambulingam

Shanlax International Journal of Tamil Research

Social Welfare and Women Empowerment Dept.

Roaming Owls (Blog)

Madras High Court

PM India

Youtube channel priyaafilms247

प्रणाम पर्यटन magazine

NIMHANS, Bangalore

विज्ञानं प्रकाश

Vigyan Garima Sindhu

Prakriti Darshan

Geography and You

Down to Earth

Union Budget

UP Budget

Loksabha speeches

Ministry of Education

Career

Freelancer Engagement

Application Link:

Senior Research Associate (Technical)

Minimum eligibility

Junior Research Associate (Technical)

Minimum eligibility

Junior Research Associate (Odia Language)

Minimum eligibility

Office Assistant

Minimum eligibility

Contact

Address

Call Us

Email Us
