
CCMatrix: A Billion-scale Bitext Data Set For Training Translation Models

  • February 11, 2020
  • relay

CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year. Gathering a data set of this size required modifying the bitext mining approach we previously used for WikiMatrix: we now assume that the translation of a sentence could be found anywhere on CommonCrawl, which functions as an open archive of the internet. Comparing billions of sentences to determine which ones are mutual translations poses significant computational challenges, which we addressed with massively parallel processing and our highly efficient FAISS library for fast similarity searches.

We’re sharing details about how we created CCMatrix, along with the tools other researchers need to reproduce our results and use this corpus in their own work. To demonstrate the value of automatically generating such a large number of parallel texts, we trained neural machine translation (NMT) systems on CCMatrix and compared their performance with established baselines. The resulting models outperformed the state-of-the-art single NMT systems evaluated in the 2019 Conference on Machine Translation (WMT’19) competition in four language directions, including Russian to English, despite using only mined translations (rather than human-provided ones). When tested against the TED corpus, CCMatrix also enabled us to significantly improve NMT performance for many language pairs, compared with other approaches.

What it does:

Parallel texts, which pair sentences in one language with their corresponding translations in another, are the backbone of most NMT training methods. While more bitext examples typically lead to better translation performance, gathering large parallel corpora across a wide range of languages is a resource-intensive task. Our method automates and parallelizes this bitext mining process, handling multiple batches of 50 million examples at a time on an 8-GPU server. Using the FAISS library, we’re able to calculate the distance between all the sentence embeddings in each batch, with every calculation performed in parallel. This enables rapid extraction of sentence pairs, pulled from a greater variety of publicly available texts than similar data sets, including our Wikipedia-based WikiMatrix.
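To make the idea of comparing sentence embeddings concrete, here is a minimal sketch in plain Python. It is not the CCMatrix pipeline and does not use FAISS: the toy vectors, the mutual-nearest-neighbor rule, the `threshold` value, and the function names are all assumptions chosen for illustration. It only shows the general shape of the task, which is to score every source sentence against every target sentence and keep the pairs that look like translations of each other.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_matrix(src, tgt):
    """Return sims[i][j], the cosine similarity of src[i] and tgt[j]."""
    src_n = [normalize(v) for v in src]
    tgt_n = [normalize(v) for v in tgt]
    return [[sum(a * b for a, b in zip(s, t)) for t in tgt_n] for s in src_n]

def mine_mutual_pairs(src, tgt, threshold=0.8):
    """Keep (i, j, score) when src[i] and tgt[j] are each other's best match.

    The mutual-best rule and the threshold are illustrative assumptions,
    not the scoring criterion described in the CCMatrix paper.
    """
    sims = cosine_matrix(src, tgt)
    # Best target index for each source sentence.
    best_tgt = [max(range(len(tgt)), key=lambda j: row[j]) for row in sims]
    # Best source index for each target sentence.
    best_src = [max(range(len(src)), key=lambda i: sims[i][j])
                for j in range(len(tgt))]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs

# Toy "embeddings": three source and three target sentences (made-up values).
src_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt_embeddings = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

pairs = mine_mutual_pairs(src_embeddings, tgt_embeddings)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine similarity {score:.3f})")
```

This brute-force version compares every source vector with every target vector, so its cost grows with the product of the two collection sizes. That is why a similarity search library and parallel hardware become necessary at the scale of billions of sentences.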


CCMatrix’s parallelized approach to bitext mining maps the similarities between millions of sentences in many different languages at once, searching for pairs that can function as training examples for translation models.

Why it matters:

CCMatrix enables the NMT research community to leverage much larger bitext data sets than was previously possible for scores of language pairs. This can accelerate creation of more effective NMT models that work with more languages, particularly low-resource ones that have relatively limited corpora.

Because of its large scale and its use of a broad array of public texts, we believe that CCMatrix will become one of the most commonly used resources for building and evaluating systems across the field of NMT. We also hope that the technique we used to create CCMatrix will help the research community develop new ways to create large-scale data sets that will improve translation tools used by people around the globe.

Get it on GitHub:

Paper: https://arxiv.org/abs/1911.04944

GitHub: https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix

 

Source: Facebook AI Blog
