A New Open Data Set For Multilingual Speech Research

January 26, 2021

What it is:

Facebook AI is releasing Multilingual LibriSpeech (MLS), a large-scale, open source data set designed to help advance research in automatic speech recognition (ASR). MLS is meant to support the speech research community’s work in languages beyond English, so that people around the world can benefit from improvements in a wide range of AI-powered services.

MLS provides more than 50,000 hours of audio across eight languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. It also provides language-model training data and pretrained language models along with baselines to help researchers compare different ASR systems. Because it leverages public domain audiobooks from the LibriVox project, MLS offers a large data set with a broad range of different speakers, and it can be released with a nonrestrictive license.

How it works:

MLS is a read-speech data set that leverages LibriVox audiobook data. It builds on the widely used LibriSpeech ASR benchmark, making it larger scale and extending it from English-only to the seven other languages noted above.

To create it, we segmented the audio and aligned it with the text of the audiobooks to retrieve the best-matching transcript for each audio segment. Because the audiobooks can be very long, we used Facebook AI’s open source wav2letter@anywhere framework to perform streaming inference and alignment. Inspired by the success of Libri-Light, a benchmark for ASR with limited or no supervision, we also provide subsets with limited labeled data (10 minutes, 1 hour, and 10 hours) for all the included languages. This makes MLS suitable for training in settings where only a small amount of labeled data is available, such as self-supervised and semisupervised learning. To prepare language modeling data, we used public domain books from the Project Gutenberg digital library. We then carefully filtered out the books that overlapped with the development and test sets and performed language-specific text normalization to create the language model corpus.
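The language model corpus preparation described above (dropping books that overlap with the development and test sets, then normalizing the text) might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the function names, the book identifiers, and the specific normalization rules (lowercasing, keeping only letters and apostrophes) are hypothetical and are not taken from the MLS recipes, where normalization is language specific.

```python
import re
import unicodedata
from typing import Dict, List, Set


def normalize_text(text: str) -> str:
    """Illustrative normalization (an assumption, not the MLS rules):
    Unicode NFKC, lowercase, keep letters and apostrophes, collapse spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace anything that is not a letter, apostrophe, or whitespace with a space.
    text = "".join(
        ch if (ch.isalpha() or ch == "'" or ch.isspace()) else " " for ch in text
    )
    return re.sub(r"\s+", " ", text).strip()


def build_lm_corpus(books: Dict[str, str], held_out_ids: Set[str]) -> List[str]:
    """Skip books that appear in the dev/test sets, then normalize line by line."""
    corpus = []
    for book_id, raw_text in sorted(books.items()):
        if book_id in held_out_ids:
            continue  # this book overlaps with dev/test, so it must not reach the LM
        for line in raw_text.splitlines():
            normalized = normalize_text(line)
            if normalized:
                corpus.append(normalized)
    return corpus


# Hypothetical toy data: book_b stands in for a book used in the test set.
books = {
    "book_a": "Hello, World!\nIt's 5 o'clock.",
    "book_b": "Dieses Buch ist im Testset.",
    "book_c": "¿Dónde está la biblioteca?",
}
corpus = build_lm_corpus(books, held_out_ids={"book_b"})
print(corpus)
```

Filtering by book before normalizing keeps held-out text from leaking into the language model, which would otherwise make decoding results on the development and test sets look better than they should.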

We trained baseline acoustic models and decoded with a 5-gram language model for each of the languages. When we evaluated the model trained on MLS’s English subset on the standard noisy test set of LibriSpeech, it achieved a 20 percent improvement in word error rate compared with the same model trained on LibriSpeech data.
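For readers less familiar with the metric, word error rate is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The Python sketch below computes it for a single utterance and shows how an improvement between two systems could be expressed. The example sentences and the WER values 10.0 and 8.0 are hypothetical, and treating the improvement as a relative reduction is an assumption, since the post does not say whether the 20 percent figure is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical utterance: one substitution ("sat" -> "sit") and one deleted "the".
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")

# Hypothetical WER values, chosen only to illustrate a 20 percent relative gain.
print(f"Relative improvement: {relative_improvement(10.0, 8.0):.0%}")
```

Benchmark results are normally reported at the corpus level, by summing the edit distances over all utterances and dividing by the total number of reference words rather than averaging per-utterance rates.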

Why it matters:

Open data sets and benchmarks have been key drivers of recent advances across AI. MLS provides a valuable resource for research in large-scale training of ASR systems. Its English-language data set is about 47x larger than LibriSpeech’s training data. While data sets and benchmarks exist for non-English languages, they are often relatively small or scattered across different sources, and they are rarely available under an open, permissive license. We believe that by providing a large multilingual data set with a nonrestrictive license and establishing a common benchmark, MLS will promote open and collaborative research in multilingual ASR and improve speech recognition systems in more languages around the world.

Get it here:

MLS is available on OpenSLR and can be downloaded here. All the pretrained models and recipes to train and evaluate the models can be found here.

Read the paper:

MLS: A large-scale multilingual dataset for speech research

By Vineel Pratap, Research Engineer | Qiantong Xu, Research Engineer | Anuroop Sriram, Research Engineer | Gabriel Synnaeve, Research Scientist | Ronan Collobert, Research Scientist

Source: Facebook AI Research