
The First-Ever Multilingual Model To Win WMT, Beating Out Bilingual Models

November 14, 2021


Building a universal translation system to help everyone access information and better connect with one another is the ultimate goal of the machine translation (MT) field. But the MT field needs to solve fundamental limitations in order to make that future a reality.

Most MT systems today use groups of bilingual models, which typically require extensive labeled examples for each language pair and task. Unfortunately, this approach fails many languages with scarce training data (e.g., Icelandic and Hausa). Its high complexity also makes it impractical to scale to real-world applications on Facebook, where billions of people post in hundreds of languages every day.

To build a universal translator, we believe the MT field should shift away from bilingual models and advance toward multilingual translation — where a single model translates many language pairs at once, including both low-resource (e.g., Icelandic to English) and high-resource (e.g., English to German). Multilingual translation is an appealing approach — it’s simpler, more scalable, and better for low-resource languages. But until now, this approach couldn’t provide results for high-resource language pairs that were as good as specially trained bilingual models for those language pairs. As a result, delivering quality translations across many languages has generally involved using a combination of individual bilingual models, and low-resource languages have lagged behind.

Now we’ve achieved an exciting breakthrough: For the first time, a single multilingual model has outperformed the best specially trained bilingual models across 10 out of 14 language pairs to win WMT, a prestigious MT competition. Our single multilingual model provided the best translations for both low- and high-resource languages, showing that the multilingual approach is indeed the future of MT.

We show how translation quality has progressed over time for English to German (En-De) at the WMT competition, where a multilingual model has now surpassed the bilingual model. En-De is commonly recognized as the most competitive translation direction. We report performance of all models on Newstest 2021.

This work builds on previous breakthroughs that improved the quality of translations for low-resource languages. Prior work, however, faces fundamental capacity challenges when languages with varying amounts of data are added: one model becomes overwhelmed as more languages are added, each with unique linguistic properties, scripts, and vocabularies. While high-resource languages benefit from large multilingual models, low-resource language pairs risk overfitting.

Our winning model is an exciting tipping point in MT because it shows that — through new advancements in large-scale data mining, scaling model capacity, and more efficient infrastructure — it’s possible for multilingual models to achieve high performance on both high- and low-resource languages. It brings us one step closer to building a universal translator that connects people in all languages around the world, regardless of how much translation data exists.

Large-scale data mining

To train our WMT 2021 model, we built two multilingual systems: any-to-English and English-to-any. We leveraged parallel data mining techniques by identifying translations in large web crawl data sets to overcome limitations of standard training documents that are manually translated, like European Parliamentary speeches, which are not always available for all translation directions.
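As a rough illustration of the mining idea, the sketch below pairs up sentences from two languages when their embeddings are close under cosine similarity. The toy vectors, the `mine_pairs` helper, and the 0.9 threshold are all assumptions made for illustration; the post does not say which sentence encoder or scoring rule was used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src, tgt, threshold=0.9):
    """Hypothetical miner: src and tgt are lists of (sentence, embedding).

    For each source sentence, keep its nearest target sentence if the
    similarity clears the (assumed) threshold.
    """
    pairs = []
    for s_text, s_vec in src:
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt:
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_text is not None and best_score >= threshold:
            pairs.append((s_text, best_text, best_score))
    return pairs

# Toy embeddings, invented for this example; a real system would get them
# from a multilingual sentence encoder.
icelandic = [
    ("Góðan daginn", [0.9, 0.1, 0.0]),
    ("Hvar er bókasafnið?", [0.1, 0.9, 0.2]),
]
english = [
    ("Where is the library?", [0.1, 0.8, 0.3]),
    ("Good morning", [1.0, 0.1, 0.0]),
    ("Stock prices fell", [0.0, 0.1, 1.0]),
]

mined = mine_pairs(icelandic, english)
for src_text, tgt_text, score in mined:
    print(f"{score:.3f}  {src_text}  <->  {tgt_text}")
```

At web scale the exhaustive inner loop would be replaced by approximate nearest-neighbor search, but the pairing logic stays the same.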

Comparison of the performance of our model vs. the best model submitted to WMT ’21. The numbers reported are BLEU scores on the final WMT ’21 test set.

Since the amount of monolingual data for any language vastly exceeds the amount of parallel data, it’s crucial that we leverage available monolingual data to maximize performance of MT systems. One of the most common techniques for using monolingual data is back-translation, which we used to win both the 2018 and 2019 editions of the English-to-German WMT news translation task. In our work, we added large-scale monolingual data with hundreds of millions of sentences from all eight languages. We filtered the available monolingual data to reduce the amount of noise, and then back-translated it with an ensemble of the strongest multilingual models available.
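The filter-then-back-translate recipe can be sketched in a few lines. Everything named here is a hypothetical stand-in: `looks_clean` is an invented noise filter, and `toy_reverse_model` is a word-lookup placeholder for the ensemble of multilingual models, whose details the post does not give.

```python
def looks_clean(sentence, min_words=3, max_words=80):
    """Hypothetical noise filter: plausible length and mostly letters."""
    words = sentence.split()
    if not (min_words <= len(words) <= max_words):
        return False
    letters = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    return letters / len(sentence) >= 0.8

def back_translate(target_sentences, reverse_model):
    """Turn monolingual target-side text into synthetic (source, target) pairs.

    reverse_model translates target -> source; the original sentence is
    kept as the (human-written) target side of the training pair.
    """
    pairs = []
    for tgt in target_sentences:
        if looks_clean(tgt):
            pairs.append((reverse_model(tgt), tgt))
    return pairs

# Placeholder German -> English "model": a word-level lookup, for illustration only.
TOY_LEXICON = {"das": "the", "wetter": "weather", "ist": "is",
               "heute": "today", "gut": "good"}

def toy_reverse_model(sentence):
    return " ".join(TOY_LEXICON.get(w.lower(), w) for w in sentence.split())

monolingual_german = [
    "Das Wetter ist heute gut",   # kept
    "$$$ 1234 >>> 5678",          # dropped: mostly non-letters
    "Gut",                        # dropped: too short
]

synthetic_pairs = back_translate(monolingual_german, toy_reverse_model)
print(synthetic_pairs)
```

The synthetic pairs would then be mixed with genuine parallel data to train the forward (here, English-to-German) direction.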

Scaling model capacity

In addition to scaling data size using back-translation, we also scaled model size from 15 billion parameters to 52 billion parameters, in order to add capacity to multilingual model architectures. None of these scaling efforts would have been possible without Facebook’s recent GPU memory-saving tool called Fully Sharded Data Parallel, which makes large-scale training up to 5x faster than previous methods.
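To give a feel for why sharding matters at this scale, here is a back-of-the-envelope calculation of the memory needed just to hold the weights. The 2 bytes per parameter (half precision) and the 64-way shard count are assumptions chosen for illustration, not figures from the post, and the estimate ignores gradients, optimizer state, and activations.

```python
def param_memory_gb(num_params, bytes_per_param=2, num_shards=1):
    """Memory (in GB, 1 GB = 1e9 bytes) to store the weights on one device
    when they are split evenly across num_shards devices."""
    return num_params * bytes_per_param / num_shards / 1e9

NUM_PARAMS = 52e9  # model size stated in the post

replicated = param_memory_gb(NUM_PARAMS)              # every GPU holds all weights
sharded = param_memory_gb(NUM_PARAMS, num_shards=64)  # assumed 64 GPUs

print(f"replicated: {replicated:.1f} GB per GPU")
print(f"sharded:    {sharded:.3f} GB per GPU")
```

Under these assumptions, a fully replicated copy of the weights would not fit on a single GPU, while an evenly sharded copy leaves most of each device's memory free for the rest of training.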

More efficient infrastructure

Since multilingual models inherently compete for capacity, they must strike a balance between sharing parameters and specialization for different languages. Scaling model size in proportion results in unsustainable computational cost.

Ablation of the effect of each modeling technique that builds our final submission. We use the final row (in bold) as our submission to WMT2021 as it has the strongest performance across all languages. The numbers reported are BLEU scores on WMT ’21 development set.

Instead, we used conditional compute approaches, which activate only a subset of the model for each training example. Specifically, we train Sparsely Gated Mixture-of-Experts (MoE) models, in which each token is routed to the top-k expert FeedForward blocks based on a learned gating function. We use a Transformer architecture in which the FeedForward block in every alternate Transformer layer is replaced with a Sparsely Gated Mixture-of-Experts layer with top-2 gating, in both the encoder and the decoder. As a result, only a subset of all the model’s parameters is used per input sequence.

These models help strike a balance between allowing high-resource directions to benefit from increased expert model capacity, while also allowing transfer to low-resource directions through shared model capacity.
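A minimal sketch of top-2 routing for a single token, in plain Python, may make the mechanism concrete. The gate weights and the four toy "experts" are invented for this example, and renormalizing the two selected gate probabilities so they sum to 1 is one common choice that the post does not confirm; load balancing and expert capacity limits are left out entirely.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(token, gate_weights):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    gate_weights holds one weight vector per expert (the learned gate).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]  # assumed renormalization

def moe_layer(token, gate_weights, experts):
    """Weighted sum of the two selected experts; the others never run."""
    out = [0.0] * len(token)
    for idx, weight in top2_route(token, gate_weights):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy setup: each "expert" just scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

routes = top2_route(token, gate_weights)
output = moe_layer(token, gate_weights, experts)
print(routes)
print(output)
```

Only two of the four experts are evaluated for this token, which is how the layer adds parameters without a matching increase in compute per token.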

The ‘last mile’ challenge in machine translation

Machine translation as a field has had impressive advances in bridging barriers, but most have centered on a handful of widely spoken languages. Low-resource translation remains a “last mile” problem for MT and the biggest open challenge for the subfield today.

We believe our success at WMT 2021 cements multilingual translation as an important path toward building a single universal translation system that serves high-quality translations for everyone around the world. We’ve shown that a single multilingual model can deliver better-quality translations than bilingual models can for both high- and low-resource languages and is still easier to fine-tune to specific tasks, such as translating news articles.

This approach of “one model for many languages” may also simplify the development of translation systems in real-world applications — with the potential to replace thousands of models with just one, making it easier to bring new applications and services for everyone around the world.

We’re now working on the next set of challenges to adapt these techniques to languages beyond those featured in the WMT competition. For example, how can we develop new techniques to support scarce languages with even less monolingual data, where tried-and-true techniques like back-translation are not possible?

By Chau Tran, Research Engineer | Shruti Bhosale, Research Engineer | James Cross, Research Scientist | Angela Fan, Research Scientist
Source Facebook AI

