Data2vec 2.0: Highly Efficient Self-Supervised Learning For Vision, Speech And Text

December 20, 2022


Many recent breakthroughs in AI have been powered by self-supervised learning, which enables machines to learn without relying on labeled data. But current algorithms have significant limitations: they are often specialized for a single modality (such as images or text) and require lots of computational power. This contrasts with human learning: People appear to learn much more efficiently than current AI, and also learn from different kinds of information in a similar way, rather than relying on separate learning mechanisms for text, speech, and other modalities.

Meta AI addressed one of these limitations earlier this year when we released data2vec, the first high-performance self-supervised algorithm to learn the same way for three different modalities: speech, vision, and text. Data2vec made it much easier to apply research advances in, say, text understanding to an image segmentation or speech translation task.

Today, we’re sharing data2vec 2.0, a new algorithm that is vastly more efficient than its predecessor and surpasses its already strong performance. It achieves the same accuracy as the most popular existing self-supervised algorithm for computer vision but does so 16x faster.

To make our research accessible to other researchers, we are now sharing the code and pretrained models.

How data2vec 2.0 works

The general idea of self-supervised learning is for machines to learn the structure of images, speech, and text simply by observing the world. Advances in this area have led to many breakthroughs in speech (e.g., wav2vec 2.0), computer vision (e.g., masked autoencoders), and natural language processing (e.g., BERT). But modern systems can be computationally demanding, as training very large models requires many GPUs.

Illustration of how data2vec 2.0 training works. It can be trained separately on text, speech, or images.

Similar to the original data2vec algorithm, data2vec 2.0 predicts contextualized representations of the data (that is, the activations of a neural network's layers) instead of the pixels of an image, the words of a text passage, or the sounds of speech. Unlike with most other algorithms, these so-called target representations are contextualized, meaning they take the entire training example into account. For instance, the representation of the word bank is based on the entire sentence that the word appears in, and it is therefore easier to represent the correct meaning of the word (“financial institution” or “ground next to river”). We believe that contextualized targets lead to a richer learning task and enable data2vec 2.0 to learn faster than other algorithms.
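To make the idea of a contextualized target concrete, here is a minimal sketch in plain Python. It is not the data2vec 2.0 implementation: `embed` and `contextualize` are hypothetical toy functions, and blending each word with the sentence average merely stands in for what a real Transformer does. It shows only that a context-free target for "bank" is identical in both sentences, while a contextualized target differs.

```python
def embed(word):
    """Hypothetical context-free embedding: one number derived from the letters."""
    return sum(ord(c) for c in word) / 100.0


def contextualize(words):
    """Toy stand-in for a contextual encoder: blend each word with the sentence mean."""
    vecs = [embed(w) for w in words]
    mean = sum(vecs) / len(vecs)
    return [0.5 * v + 0.5 * mean for v in vecs]


money = "she deposited cash at the bank".split()
river = "they fished from the river bank".split()

# Context-free targets: the same word always maps to the same value.
local_money = embed(money[-1])
local_river = embed(river[-1])

# Contextualized targets: the value for "bank" depends on the whole sentence.
ctx_money = contextualize(money)[-1]
ctx_river = contextualize(river)[-1]

print("context-free targets equal:  ", local_money == local_river)
print("contextualized targets equal:", ctx_money == ctx_river)
```

In this toy setting, a model trained to predict the contextualized value has to infer something about the surrounding sentence, which is the intuition behind the richer learning task described above.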

We improved the efficiency of the original data2vec algorithm in several ways: First, we take target representations built for a particular training example and reuse them for masked versions (where we hide different random parts of the training example). We take each version and feed it into the student model, which predicts the same contextualized target representation for the different masked versions. This effectively amortizes the computational effort required to create target representations. Second, and similar to masked autoencoders, we do not run the student encoder network for the parts of the training examples that are blanked out (which is about 80 percent of an image in our case), thereby saving significant compute cycles. Finally, we use a more efficient decoder model that relies not on Transformer networks but on a multilayer convolutional network.
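The first two efficiency ideas above can be sketched as follows. This is a toy illustration under stated assumptions, not the released code: `encode`, `make_masks`, and `training_step` are hypothetical names, the "encoder" is a simple averaging function, and the decoder is a placeholder (the post describes a multilayer convolutional decoder, which is not reproduced here). The sketch only tracks how many tokens each network processes.

```python
import random


def encode(tokens):
    """Toy stand-in for an encoder: blend each token with the mean of its inputs."""
    mean = sum(tokens) / len(tokens)
    return [0.5 * t + 0.5 * mean for t in tokens]


def make_masks(num_tokens, num_masks, mask_ratio, rng):
    """Sample several different random sets of hidden positions."""
    # Always keep at least one token visible so the student has some input.
    num_hidden = min(num_tokens - 1, round(num_tokens * mask_ratio))
    return [set(rng.sample(range(num_tokens), num_hidden)) for _ in range(num_masks)]


def training_step(example, num_masks=4, mask_ratio=0.8, seed=0):
    rng = random.Random(seed)
    cost = {"teacher_tokens": 0, "student_tokens": 0}

    # 1) Targets are computed once from the full, unmasked example ...
    targets = encode(example)
    cost["teacher_tokens"] += len(example)

    losses = []
    # ... and reused for every masked version of that example.
    for hidden in make_masks(len(example), num_masks, mask_ratio, rng):
        # 2) The student encoder runs only on the visible tokens.
        visible = encode([x for i, x in enumerate(example) if i not in hidden])
        cost["student_tokens"] += len(visible)

        # 3) Placeholder decoder (assumption): predict every hidden position
        #    as the mean of the visible encodings.
        guess = sum(visible) / len(visible)
        losses.append(sum((guess - targets[i]) ** 2 for i in hidden) / len(hidden))
    return losses, cost


losses, cost = training_step([float(i) for i in range(10)])
print(len(losses), "masked versions;", cost)
```

With 10 tokens, 4 masked versions, and 80 percent masking, the targets are built from 10 tokens once, while the student encoder processes only 2 visible tokens per version.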

Relative training time improvements when training data2vec 2.0 to the same accuracy as popular existing algorithms on the same hardware.

Efficiency gains with data2vec 2.0

To get a better understanding of how much more efficient data2vec 2.0 is than its predecessor and other algorithms, we tested it on computer vision, speech, and text tasks on widely used benchmarks. We looked at the final accuracy and the time it took to pretrain the model. We measured the speed of algorithms on the same hardware (number of GPUs, etc.).

For computer vision, we evaluated data2vec 2.0 on the standard ImageNet-1K image classification benchmark, where it learned to represent images. Data2vec 2.0 can equal the accuracy of masked autoencoders (MAE) but is 16x faster (measured in wall clock time in a like-for-like setting). If we give the algorithm more time, it can achieve even higher accuracy while still being faster than MAE.

Data2vec 2.0 for computer vision: The graph shows speed vs. image classification accuracy for different algorithms on the popular ImageNet-1K benchmark.

For speech, we tested it on the LibriSpeech speech recognition benchmark, where it trained more than 11 times faster than wav2vec 2.0 with similar accuracy. For natural language processing (NLP), we evaluated data2vec 2.0 on the popular General Language Understanding Evaluation (GLUE) benchmark, where it achieved the same accuracy as RoBERTa, a reimplementation of BERT, in half the training time.

Data2vec 2.0 for speech and NLP: The top graph shows speed vs. speech recognition word error rate for models pretrained on LibriSpeech, fine-tuned on 10 hours of Libri-light data, and then evaluated on dev-other. The bottom graph shows natural language understanding accuracy on the GLUE benchmark when using the original BERT setup.

Toward machines that learn efficiently

We are on a journey to build more general and efficient self-supervised algorithms that can use a single learning objective to learn from different modalities. The ability to learn more efficiently is particularly important for modalities such as video, which require a lot of computational effort to process. We hope that more efficient self-supervised learning algorithms such as data2vec 2.0 will lead to machines that can deeply understand extremely complex data, such as the contents of an entire movie.

Access the open source code and pretrained models here, and read the paper here.

This blog post was made possible by the work of Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli.

Source: Meta AI
