Better Data For Better AI: New Speech Datasets And Benchmarks For Data

December 17, 2021

2 min read

AI researchers and engineers need better data to build better AI solutions. The quality of an AI solution is determined by both the learning algorithm (such as a deep neural network model) and the datasets used to train and evaluate that algorithm. Historically, AI research has focused much more on algorithms than on datasets, despite the vital importance of data. As a result, many algorithms are freely available as starting points, but many important problems lack large, high-quality open datasets. Further, creating new datasets is expensive and error-prone.

Recently, the data-centric AI movement has emerged to address this problem by developing new methodologies and tools for constructing better datasets. Conferences, workshops, challenges, and platforms are being launched to support improving data quality and to foster data excellence. Thought leaders such as Andrew Ng at Landing.AI and Chris Ré at Stanford University are encouraging AI developers to focus more on iterative data engineering than on tuning their learning algorithms. Our CHI best-paper-award-winning paper, “Everyone wants to do the model work, not the data work,” highlighted the significance of data quality in the practice of ML.

At Google, we are excited to contribute to data-centric AI. Today, Google Cloud is adding a new high-value dataset to the Public Datasets program, and Google researchers are announcing DataPerf, a new multi-organizational effort to develop benchmarks for data quality and data-centric algorithms.

Google Cloud is committed to helping users improve their data quality, starting with supporting better public data. The Public Datasets program provides high-quality datasets pre-configured on GCP for easy access. Google Cloud is adding a new high-value dataset developed by the MLCommons™ Association (which Google co-founded) to the Public Datasets program: the Multilingual Spoken Words Corpus, a rich audio speech dataset with more than 340,000 keywords in 50 languages and upwards of 23.4 million examples.

This new public dataset is aligned with the MLCommons Association's vision for datasets that are “open” (accessible by all) and “living” (continually being improved to raise quality and increase representation and diversity).

Google researchers, in collaboration with multiple organizations, are announcing the DataPerf effort at the NeurIPS Data-Centric AI workshop today, to develop benchmarks to improve data quality. Much like the MLPerf™ benchmarking effort, which is now the industry standard for machine learning hardware/software speed, DataPerf brings together the originators of prior efforts, including CATS4ML, the Data-Centric AI Competition, DCBench, Dynabench, and the MLPerf benchmarks, to define clear metrics that catalyze rapid innovation. DataPerf will measure the utility of training and test data for common problems, as well as algorithms for working with datasets, such as selecting core sets, correcting errors, identifying under-optimized data slices, and valuing datasets prior to labeling.
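
To make one of these dataset algorithms concrete, here is a rough illustration of core-set selection. This post does not describe DataPerf's actual task definitions, so the sketch below is only one common interpretation: greedy k-center selection, which repeatedly picks the example that is farthest from everything chosen so far. The function name `k_center_greedy` and the toy 2-D points are assumptions made for illustration and do not come from DataPerf.

```python
import math


def k_center_greedy(points, k):
    """Return indices of k points chosen by greedy k-center selection.

    Hypothetical helper for illustration; not part of DataPerf.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    selected = [0]  # start from the first point (an arbitrary choice)
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[0]) for p in points]

    while len(selected) < k:
        # Pick the point that is currently the worst covered.
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        # Update coverage distances with the newly selected point.
        nearest = [
            min(d, math.dist(p, points[idx]))
            for d, p in zip(nearest, points)
        ]
    return selected


# Toy data: two tight clusters plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (5.0, -5.0)]
core_set = k_center_greedy(points, 3)
print(core_set)
```

In practice the points would be feature embeddings of training examples rather than raw 2-D coordinates, and the starting index would typically be chosen at random.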

Together, support for open, living datasets for core ML tasks and the development of benchmarks to direct the rapid evolution of those datasets will empower the researchers and engineers who use Google Cloud to do even more amazing things, and we can’t wait to see what they create!

Acknowledgements: In collaboration with Lora Aroyo and Praveen Paritosh.

By Peter Mattson, Staff Engineer
Source: Google Cloud
