Building a universal translation system to help everyone access information and better connect with one another is the ultimate goal of the machine translation (MT) field. But the MT field needs to solve fundamental limitations in order to make that future a reality.
Most MT systems today use groups of bilingual models, which typically require extensive labeled examples for each language pair and task. Unfortunately, this approach fails many languages with scarce training data (e.g., Icelandic and Hausa). Its high complexity also makes it impractical to scale to practical applications on Facebook, where billions of people post in hundreds of languages every day.
To build a universal translator, we believe the MT field should shift away from bilingual models and advance toward multilingual translation — where a single model translates many language pairs at once, including both low-resource (e.g., Icelandic to English) and high-resource (e.g., English to German). Multilingual translation is an appealing approach — it’s simpler, more scalable, and better for low-resource languages. But until now, this approach couldn’t provide results for high-resource language pairs that were as good as specially trained bilingual models for those language pairs. As a result, delivering quality translations across many languages has generally involved using a combination of individual bilingual models, and low-resource languages have lagged behind.
Now we’ve achieved an exciting breakthrough: For the first time, a single multilingual model has outperformed the best specially trained bilingual models across 10 out of 14 language pairs to win WMT, a prestigious MT competition. Our single multilingual model provided the best translations for both low- and high-resource languages, showing that the multilingual approach is indeed the future of MT.
This work builds on top of previous breakthroughs, which have improved the quality of translations for low-resource languages. Prior work, however, has fundamental capacity challenges when languages with various resources are added — one model becomes overwhelmed as more languages are added, each with unique linguistic properties, scripts, and vocabularies. When high-resource languages benefit from large multilingual models, low-resource language pairs risk overfitting.
Our winning model is an exciting tipping point in MT because it shows that — through new advancements in large-scale data mining, scaling model capacity, and more efficient infrastructure — it’s possible for multilingual models to achieve high performance on both high- and low-resource languages. It brings us one step closer to building a universal translator that connects people in all languages around the world, regardless of how much translation data exists.
Large-scale data mining
To train our WMT 2021 model, we built two multilingual systems: any-to-English and English-to-any. We leveraged parallel data mining techniques by identifying translations in large web crawl data sets to overcome limitations of standard training documents that are manually translated, like European Parliamentary speeches, which are not always available for all translation directions.
Since the amount of monolingual data for any language vastly exceeds the amount of parallel data, it’s crucial that we leverage available monolingual data to maximize performance of MT systems. One of the most common techniques to use monolingual data is called back- translation, which we used to win both the 2018 and 2019 edition of the English-to-German WMT news translation task. In our work, we added large-scale monolingual data with hundreds of millions of sentences from all eight languages. We filtered the available monolingual data to reduce the amount of noise, and then back-translated them with an ensemble of the strongest multilingual models available.
Scaling model capacity
IIn addition to scaling data size using back-translation, we also scaled model size from 15 billion parameters to 52 billion parameters, in order to add capacity to multilingual model architectures. All of these scaling efforts wouldn’t have been possible without Facebook’s recent GPU memory-saving tool called Fully Sharded Data Parallel, which enables large-scale training by up to 5x faster than previous methods.
More efficient infrastructure
Since multilingual models inherently compete for capacity, they must strike a balance between sharing parameters and specialization for different languages. Scaling model size in proportion results in unsustainable computational cost.
We used an alternative approach to leverage conditional compute approaches, which activate only a subset of the model for each training example. Specifically, we train Sparsely Gated Mixture-of-Expert (MoE) models, in which each token is routed to the top-k expert FeedForward blocks based on a learned gating function. We use a Transformer architecture with the FeedForward block in every alternate Transformer layer replaced with a Sparsely Gated Mixture-of-Experts layer with top-2 gating in the encoder and decoder. As a result, only a subset of all the model’s parameters is used per input sequence.
These models help strike a balance between allowing high-resource directions to benefit from increased expert model capacity, while also allowing transfer to low-resource directions through shared model capacity.
The ‘last mile’ challenge in machine translation
Machine translation as a field has had impressive advances in bridging barriers, but most have centered on a handful of widely spoken languages. Low-resource translation remains a “last mile” problem for MT and the biggest open challenge for the subfield today.
We believe our success at WMT 2021 cements multilingual translation as an important path toward building a single universal translation system that serves high-quality translations for everyone around the world. We’ve shown that a single multilingual model can deliver better-quality translations than bilingual models can for both high- and low-resource languages and is still easier to fine-tune to specific tasks, such as translating news articles.
This approach of “one model for many languages” may also simplify the development of translation systems in real-world applications — with the potential to replace thousands of models with just one, making it easier to bring new applications and services for everyone around the world.
We’re now working on the next set of challenges to adapt these techniques to languages beyond those featured in the WMT competition. For example, how can we develop new techniques to support scarce languages with even less monolingual data, where tried-and-true techniques like back-translation are not possible?