200 Languages Within A Single AI Model: A Breakthrough In High-Quality Machine Translation

July 8, 2022

Meta AI has built a single AI model, NLLB-200, that is the first to translate across 200 different languages with state-of-the-art quality that has been validated through extensive evaluations for each of them.
We’ve also created a new evaluation dataset, FLORES-200, and measured NLLB-200’s performance in each language to confirm that the translations are high quality. NLLB-200 exceeds the previous state of the art by an average of 44 percent.
We’re now using modeling techniques and learnings from the project to improve and extend translations on Facebook, Instagram, and Wikipedia.
We’re open-sourcing NLLB-200 models, FLORES-200, model training code, and code for re-creating the training dataset in order to help other researchers improve their translation tools and build on our work.

Language is our culture, identity, and lifeline to the world. But because high-quality translation tools don’t exist for hundreds of languages, billions of people today can’t access digital content or participate fully in conversations and communities online in their preferred or native languages. This is especially true for hundreds of millions of people who speak the many languages of Africa and Asia.

To help people connect better today and be part of the metaverse of tomorrow, Meta AI researchers created No Language Left Behind (NLLB), an effort to develop high-quality machine translation capabilities for most of the world’s languages. Today, we’re announcing an important breakthrough in NLLB: We’ve built a single AI model called NLLB-200, which translates 200 different languages with state-of-the-art results. Many of these languages, such as Kamba and Lao, were supported poorly or not at all by even the best existing translation tools. Widely used translation tools currently support fewer than 25 African languages, many of them with poor quality. In contrast, NLLB-200 supports 55 African languages with high-quality results. In total, this single model can provide high-quality translations for languages spoken by billions of people around the globe. NLLB-200’s BLEU scores improve on the previous state of the art by an average of 44 percent across all 10k directions of the FLORES-101 benchmark. For some African and Indian languages, the increase is greater than 70 percent over recent translation systems.

We are now open-sourcing the NLLB-200 model and publishing a slew of research tools to enable other researchers to extend this work to more languages and build more inclusive technologies. Meta AI is also providing up to $200,000 of grants to nonprofit organizations for real-world applications of NLLB-200.

The research advancements from NLLB will support more than 25 billion translations served every day on Facebook News Feed, Instagram, and our other platforms. Imagine visiting a favorite Facebook group, coming across a post in Igbo or Luganda, and being able to understand it in your own language with just a click of a button. Highly accurate translations in more languages could also help to spot harmful content and misinformation, protect election integrity, and curb instances of online sexual exploitation and human trafficking. Modeling techniques and learnings from our NLLB research are now also being applied to translation systems used by Wikipedia editors.

Translation is one of the most exciting areas in AI because of its impact on people’s everyday lives. NLLB is about much more than just giving people better access to content on the web. It will make it easier for people to contribute and share information across languages. We have more work ahead, but we are energized by our recent progress and how it is moving us closer to fulfilling Meta’s mission.

You can explore a demo of NLLB-200 here, showing how the model can translate stories from around the world, and read the research paper here.

Unlocking translation tools for billions more people

We’ve partnered with the Wikimedia Foundation, the nonprofit organization that hosts Wikipedia and other free knowledge projects, to help improve translation systems on Wikipedia. There are versions of Wikipedia in more than 300 languages, but most have far fewer articles than the 6+ million available in English. This disparity is especially large for languages primarily spoken outside of Europe and North America. For example, there are around 3,260 Wikipedia articles in Lingala, a language spoken by 45 million people in the Democratic Republic of the Congo, Republic of the Congo, Central African Republic, and South Sudan. Contrast that with a language like Swedish, which has 10 million speakers in Sweden and Finland and more than 2.5 million articles.

Wikipedia editors are now using the technology behind NLLB-200, via the Wikimedia Foundation’s Content Translation Tool, to translate articles in more than 20 low-resource languages (those that don’t have extensive datasets to train AI systems), including 10 that previously were not supported by any machine translation tools on the platform.

The challenges of building a single model for hundreds of languages

Machine translation systems, like all AI models, are trained on data. For text translation systems, this typically consists of millions of sentences carefully matched between languages. But there simply aren’t large volumes of parallel sentences across, say, English and Fula. Current translation models try to overcome this by mining data from the web. But the results are often of poor quality because the source text is different for each of the languages. Furthermore, it is often full of incorrect or inconsistent spellings and is missing accent marks and other diacritical marks.

Another significant challenge is optimizing a single model to work across hundreds of languages without compromising performance or translation quality. Traditionally, the best translation quality has come from having a separate model for each language direction. But it’s difficult to scale this approach, as performance and translation quality suffer as more languages are added.

Translation models also produce errors that can be difficult to catch. These systems are built on neural networks used for text generation, so they can naturally produce errors such as hallucinations (confidently stating something as true even when it’s not), misstatements, and unsafe content. In general, there are simply fewer benchmarks and datasets for low-resource languages, which makes it much more difficult to test and improve models.

Innovating in architecture, data sourcing, benchmarking, and more

In recent years, we’ve made steady progress to overcome the challenges described above. In 2020, we announced our 100-language M2M-100 translation model, which leveraged new methods to acquire training data, new architectures to scale model size without compromising performance, and new ways to evaluate and improve the results. To scale to another 100 languages, we’ve made further advances in all three of these areas.

Expanded training resources

To collect highly accurate parallel texts in more languages, we improved LASER, our toolkit for zero-shot transfer in natural language processing (NLP). Instead of an LSTM, the new version, LASER3, uses a Transformer model that is trained in a self-supervised manner with a masked language modeling objective. We further boosted performance by using a teacher-student training procedure and creating language-group-specific encoders, which enabled us to scale LASER3’s language coverage and produce massive quantities of sentence pairs, even for low-resource languages. We are open-sourcing the LASER3 multilingual embedding method to make it available to other researchers, and we’re also making available billions of parallel sentences in different language pairs, which have been mined and cleaned using the techniques described here.
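To make the mining step more concrete, here is a minimal sketch of how sentence pairs could be selected once sentences from two languages are embedded in a shared vector space. It is not the LASER3 implementation: the toy vectors, the placeholder target sentences, the `mine_pairs` helper, and the 0.9 threshold are all assumptions made for illustration, and plain cosine similarity stands in for whatever scoring the real pipeline uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src, tgt, threshold=0.9):
    """Pair each source sentence with its most similar target sentence.

    src and tgt map sentence text to an embedding (hypothetical toy vectors
    here; a real system would get them from a multilingual encoder).
    Pairs whose best score falls below the threshold are discarded.
    """
    pairs = []
    for s_text, s_vec in src.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_text is not None and best_score >= threshold:
            pairs.append((s_text, best_text, round(best_score, 3)))
    return pairs

# Toy data: embeddings are invented, target sentences are placeholders.
english = {
    "The river is high today.": [0.9, 0.1, 0.2],
    "Where is the market?": [0.1, 0.8, 0.3],
}
other_language = {
    "sentence_a": [0.88, 0.12, 0.21],
    "sentence_b": [0.0, 0.1, 0.95],
}
mined = mine_pairs(english, other_language)
```

In this toy run only the first English sentence finds a close enough neighbor, so the second is dropped rather than paired with a poor match. A stricter threshold trades coverage for precision.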

Since we cast a wider net when sourcing training examples in more languages, it was important to make sure the quality of the examples remained high. We completely overhauled our data cleaning pipeline to scale to 200 languages, adding major filtering steps. First, we used our LID-200 language identification models to filter data and remove noise from internet-scale corpora with high confidence. We then developed toxicity lists for the full set of 200 languages and used those lists to assess and filter potential hallucinated toxicity. These steps ensured that we have cleaner and less toxic datasets with correctly identified languages. This is important for improving translation quality and reducing the risk of what is known as hallucinated toxicity, where the system mistakenly introduces toxic content during the translation process.
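The two filtering steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline: `lid_predict` is a hypothetical stand-in for a language identification model, the word lists contain placeholder tokens, the 0.9 confidence cutoff is invented, and comparing toxic-term counts on the two sides is one simple heuristic for flagging possible hallucinated toxicity.

```python
def lid_predict(text):
    """Hypothetical stand-in for a language identification model.

    Returns (language_code, confidence). A real pipeline would call a
    trained classifier; this toy version only looks for a marker word.
    """
    lowered = text.lower()
    if "bonjour" in lowered:
        return "fra", 0.98
    if "hello" in lowered:
        return "eng", 0.97
    return "und", 0.30

def count_toxic(text, toxic_terms):
    """Count tokens that appear in a toxicity word list."""
    tokens = text.lower().split()
    return sum(1 for tok in tokens if tok.strip(".,!?") in toxic_terms)

def keep_pair(src, tgt, src_lang, tgt_lang, toxicity, min_conf=0.9):
    """Keep a pair only if both sides pass LID and toxicity counts match."""
    for text, expected in ((src, src_lang), (tgt, tgt_lang)):
        lang, conf = lid_predict(text)
        if lang != expected or conf < min_conf:
            return False
    # A mismatch in toxic-term counts suggests toxicity was added or lost.
    src_tox = count_toxic(src, toxicity[src_lang])
    tgt_tox = count_toxic(tgt, toxicity[tgt_lang])
    return src_tox == tgt_tox

# Placeholder word lists; real lists would be curated per language.
toxicity = {"eng": {"badword"}, "fra": {"grosmot"}}
candidates = [
    ("Hello friend", "Bonjour ami"),        # clean on both sides
    ("Hello friend", "Bonjour grosmot"),    # toxic term only in the target
    ("Hello badword", "Bonjour grosmot"),   # toxic on both sides
    ("Hello friend", "xyz abc"),            # target fails LID
]
kept = [p for p in candidates if keep_pair(p[0], p[1], "eng", "fra", toxicity)]
```

The pair that is toxic on both sides is kept because the translation is faithful to its source. Only the pair where toxicity appears on one side is treated as suspect.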

Scaling model size while maintaining high performance

Multilingual translation systems offer two major benefits. They enable similar languages — such as Assamese and Bengali, which are both written in Bengali script — to share data during training. This helps improve translation quality significantly for low-resource languages when trained together with similar high-resource languages. Also, researchers can iterate, scale, and experiment with a single multilingual model much more easily than with hundreds or even thousands of different bilingual models.

But there are still significant challenges when expanding a model from 100 to 200 languages. With more low-resource language pairs in the training data, the multilingual systems start to overfit as we train the models for longer periods. We tackled these issues by innovating on three fronts: regularization and curriculum learning, self-supervised learning, and diversifying back-translation.

First, we developed mixture-of-experts networks that have shared and specialized capacity so that low-resource languages without much data could be automatically routed to the shared capacity. This, combined with better designed regularization systems, avoids overfitting. We also followed a two-step curriculum learning approach, where we first trained the high-resource languages for a few epochs, before introducing the low-resource language pairs, which again reduced the overfitting problem. Then, given low quantities of parallel bitext data for low-resource languages, we leveraged self-supervised learning on monolingual data for both the low-resource and similar high-resource languages to improve the overall model performance.
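As a rough illustration of shared and specialized capacity, the sketch below routes an input to its two highest-scoring experts and adds the output of a shared expert that every input passes through. This is one common way to structure a mixture-of-experts layer, not necessarily the design used in NLLB-200. The scalar toy experts, the gate scores, and the function names are assumptions chosen to keep the example small.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def moe_layer(x, gate_scores, experts, shared_expert, k=2):
    """Combine a shared expert with a weighted mix of the routed experts."""
    weights = route_top_k(gate_scores, k)
    routed = sum(w * experts[i](x) for i, w in weights.items())
    return shared_expert(x) + routed

# Toy scalar "experts"; real experts would be feed-forward sublayers.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
shared = lambda x: 0.5 * x
gate_scores = [2.0, 1.0, -1.0, 0.0]  # hypothetical gate output for one token

weights = route_top_k(gate_scores)
output = moe_layer(1.0, gate_scores, experts, shared)
```

Because only two of the four experts run for a given input, total capacity can grow without a matching increase in per-token compute, while the shared path gives every language a common set of parameters.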

Finally, we analyzed how to best generate back-translation data and found that mixing back-translated data generated from both bilingual statistical machine translation and multilingual neural machine translation models helped improve performance for low-resource languages due to the increased diversity of the generated synthetic data. To train the NLLB-200 model, which has 54B parameters, we leveraged our newly built Research SuperCluster (RSC), which is among the fastest AI supercomputers in the world.
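The idea of mixing back-translated data from two kinds of systems can be sketched as a simple merge. The snippet below is an illustration only: the `mix_back_translations` helper, the `smt` and `mnmt` origin tags, and the placeholder strings are assumptions, and a real pipeline would also control sampling ratios and quality filtering.

```python
def mix_back_translations(smt_pairs, mnmt_pairs):
    """Merge synthetic pairs from two systems, tagging each with its origin.

    Each pair is (synthetic_source, real_target): the target is genuine
    monolingual text and the source was produced by translating it back.
    Exact duplicates are kept once so the merged set holds distinct phrasings.
    """
    merged, seen = [], set()
    for origin, pairs in (("smt", smt_pairs), ("mnmt", mnmt_pairs)):
        for synthetic_src, real_tgt in pairs:
            key = (synthetic_src, real_tgt)
            if key not in seen:
                seen.add(key)
                merged.append(
                    {"src": synthetic_src, "tgt": real_tgt, "origin": origin}
                )
    return merged

# Placeholder strings stand in for real sentences.
smt = [("synthetic source A1", "real target A"),
       ("synthetic source B", "real target B")]
mnmt = [("synthetic source A2", "real target A"),
        ("synthetic source B", "real target B")]

mixed = mix_back_translations(smt, mnmt)
variants_for_a = {m["src"] for m in mixed if m["tgt"] == "real target A"}
```

Here the same real target sentence ends up with two different synthetic sources, one from each system, which is the kind of added diversity the paragraph above describes.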

Evaluation and mitigation tools for 200 languages

To evaluate and improve NLLB-200, we built FLORES-200, a unique many-to-many evaluation dataset that enables researchers to assess performance in 40,000 different language directions. We’re open-sourcing this new dataset to help other researchers rapidly test and improve their translation models. FLORES-200 can be used to evaluate translation systems for a wide range of applications, including health pamphlets, films, books, and online content within countries or regions where a number of low-resource languages are spoken.
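The direction counts quoted in this article follow from simple counting: with n languages, each one can be translated into the n - 1 others. The short sketch below assumes exactly 101 and 200 languages, which gives figures close to the rounded "10k" and "40,000" mentioned in the text. The function name is illustrative.

```python
def translation_directions(num_languages):
    """Number of ordered (source, target) pairs among distinct languages."""
    return num_languages * (num_languages - 1)

flores_101_directions = translation_directions(101)  # 101 * 100
flores_200_directions = translation_directions(200)  # 200 * 199
```

Doubling the number of languages roughly quadruples the number of directions, which is part of why a many-to-many benchmark becomes so much larger at this scale.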

Scaling to 200 languages meant addressing the risks of generating toxic content, which can be difficult to manage within a multidirectional translation system. We did this by building toxicity lists for all the supported languages to make it possible to detect and filter out profanity and other potentially offensive content. We’re releasing toxicity evaluation lists and benchmarks for all 200 languages to give other researchers the tools to reduce risks in their models.

And to ensure that we are expanding our efforts in a responsible manner, we are working with an interdisciplinary team that includes linguists, sociologists, and ethicists to learn more about each of the languages we consider.

This graphic shows the average BLEU score on FLORES-101 for translations into and out of English across 100 languages. On the left there are two published state-of-the-art models, M2M and Delta LM, that support 100 languages. Models on the right support 200 languages: a baseline Transformer model with 3.3B parameters, the baseline model with self-supervised learning (SSL), the baseline model with back-translation (BT), and NLLB-200, a large mixture-of-experts based model that leverages both self-supervised learning and back-translation.

Expanded translation and greater inclusion

High-quality translation tools can be transformative. The reality today is that a handful of languages, including English, Mandarin, Spanish, and Arabic, dominate the web. Native speakers of these very widely spoken languages may lose sight of how meaningful it is to read something in their own mother tongue. We believe NLLB will help preserve language as it was intended to be shared rather than always requiring an intermediary language that often gets the sentiment or content wrong.

It can also help advance other NLP tasks, beyond translation. This could include building assistants that work well in languages such as Javanese and Uzbek or creating systems to take Bollywood movies and add accurate subtitles in Swahili or Oromo. As the metaverse begins to take shape, the ability to build technologies that work well in hundreds or even thousands of languages will truly help to democratize access to new, immersive experiences in virtual worlds.

A few short years ago, high-quality machine translation worked in only a handful of languages. With NLLB-200, we are closer to one day having systems that enable people to communicate with whomever they choose. We’re excited by what this unlocks in the present and what it could mean for the future as we continue to push the boundaries of machine translation.

This work is being undertaken by a multidisciplinary team at Meta AI that includes Bapi Akula, Pierre Andrews, Necip Fazil Ayan, Loic Barrault, Shruti Bhosale, Marta Ruiz Costa-jussa, James Cross, Onur Çelebi, Sergey Edunov, Maha Elbayad, Angela Fan, Cynthia Gao, Gabriel Mejia Gonzalez, Vedanuj Goswami, Francisco Guzmán, Prangthip Hansanti, Kenneth Heafield, Kevin Heffernan, John Hoffman, Semarley Jarrett, Elahe Kalbassi, Philipp Koehn, Janice Lam, Daniel Licht, Jean Maillard, Alexandre Mourachko, Christophe Ropers, Kaushik Ram Sadagopan, Safiyyah Saleem, Holger Schwenk, Shannon Spruit, Anna Sun, Chau Tran, Skyler Wang, Guillaume Wenzek, Jeff Wang, and Al Youngblood.

Source: Meta AI