Microsoft releases its largest publicly available Telugu, Tamil and Gujarati speech data for research

Microsoft India has announced the availability of its largest publicly available Indian language speech data for research in three languages - Telugu, Tamil and Gujarati.

Microsoft has announced the availability of its largest publicly available Indian language speech data for research in three languages: Telugu, Tamil and Gujarati. The dataset, which includes audio and corresponding transcripts, is aimed at helping researchers and academia build Indian language speech recognition for all applications where speech is used, said the American firm.

The Indian Language Speech Corpus content is provided by the Microsoft Research Open Data initiative, a collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing and domain-specific sciences, said the company.

According to the Redmond-based firm, there is currently a scarcity of adequate digital data for text, speech and linguistic resources, which are imperative in building large machine learning models for many vernacular languages across the world. Moreover, the differences in enunciation, accent, diction, and slang across various regions in India are very subtle. As a result of these complexities, development of accurate digital tools in Indian languages has been slow.

The company said it is working to address this lack of data and catalyze the development of machine-learning-based models for low-resource languages, enabling the ecosystem of researchers, academia and tech companies working on Indian language models to better serve the needs of Indian users.

“Microsoft Indian Language Speech Corpus is an extension of our on-going efforts to reduce language barriers and empower Indians to harness the full potential of the Internet. Using our technology expertise, we want to accelerate innovation in voice-based computing for India by supporting researchers and academia,” said Sundar Srinivasan, General Manager, Artificial Intelligence & Research, Microsoft India.

The company said that its Indian Language Speech Corpus was tested at Interspeech 2018, the world’s largest and most comprehensive conference on the science and technology of spoken language processing. In a Low Resource Speech Recognition Challenge, participants used data from the Microsoft Indian Language Speech Corpus to build Automatic Speech Recognition (ASR) systems. They were able to create high-quality speech recognition models using this data, thus validating the efficacy of the corpus.

Tech Observer Desk
Tech Observer Desk at TechObserver.in is a team of technology reporters led by a senior editor who brings the latest updates and developments from the world of technology.