The First High-Performance Self-Supervised Algorithm That Works For Speech, Vision, And Text

January 25, 2022


Self-supervised learning — where machines learn by directly observing the environment rather than being explicitly taught through labeled images, text, audio, and other data sources — has powered many significant recent advances in AI. But while people appear to learn in a similar way regardless of how they get information — whether they use sight or sound, for example — there are currently big differences in the way self-supervised learning algorithms learn from images, speech, text, and other modalities.

This discrepancy has been a significant barrier to applying advances in self-supervised learning more broadly. Because a powerful algorithm designed for, say, understanding images can’t be directly applied to another modality, such as text, it is difficult to push several modalities ahead at the same rate.

This is why Meta AI developed and is excited to announce data2vec, the first high-performance self-supervised algorithm that works for multiple modalities. We applied data2vec separately to speech, images, and text, and it outperformed the previous best single-purpose algorithms for computer vision and speech while remaining competitive on NLP tasks. It also represents a new paradigm of holistic self-supervised learning, where new research improves multiple modalities rather than just one, and it does not rely on contrastive learning or on reconstructing the input example. In addition to helping accelerate progress in AI, data2vec brings us closer to building machines that learn seamlessly about different aspects of the world around them. It will enable us to develop more adaptable AI, which we believe will be able to perform tasks beyond what today’s systems can do.

As part of this announcement, we are sharing code and pretrained models for data2vec so that others in the research community can build upon our work.

How data2vec works

Much of AI is still based on supervised learning, which works exclusively with labeled data. But it’s simply not possible to collect labeled data for all the things we would like machines to do. For example, while researchers have done a lot of work in creating large-scale labeled data sets for English speech and text, it is not feasible to do this for the literally thousands of languages spoken on the planet.

Self-supervision enables computers to learn about the world just by observing it and then figuring out the structure of images, speech, or text. Having machines that don’t need to be explicitly taught to classify images or understand spoken language is simply much more scalable.

Research in self-supervised learning today is almost always focused on one particular modality, so researchers working on one modality often take a very different approach from those working on another. For text, researchers train models to fill in blanks in sentences. Speech models, however, need to learn an inventory of the basic sounds of speech in order to predict missing sounds. In computer vision, models are often trained to assign similar representations to a color image of a cow and the same image flipped upside down, so that the model associates the two much more closely than it would either of them with an unrelated image, such as that of a duck.

Algorithms also predict different units for each modality: pixels or visual tokens for images, words for text, and learned inventories of sounds for speech. A collection of pixels is very different from an audio waveform or a passage of text, and because of this, algorithm design has been tied to a specific modality. This means that algorithms are still functioning differently in each modality.

Data2vec simplifies this by training models to predict their own representations of the input data, regardless of the modality. By focusing on these representations (the activations of a neural network's layers) instead of predicting visual tokens, words, or sounds, a single algorithm can work with completely different types of input. This removes the dependence on modality-specific targets in the learning task. Directly predicting representations is not straightforward, and it required defining a robust normalization of the features for the task that would be reliable in different modalities.
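The post does not give the exact recipe for these normalized targets. The sketch below assumes, as one plausible reading, that targets are built by normalizing the teacher's features at each position and averaging them over the last few layers, and that the student is scored with a smooth L1 regression loss. The names `normalize`, `build_targets`, `smooth_l1`, `top_k`, and `beta` are hypothetical and chosen only for illustration.

```python
import math

def normalize(vec, eps=1e-5):
    """Rescale one feature vector to zero mean and unit variance.

    Assumption: a parameter-free normalization; the post only says that a
    "robust normalization" was needed, not which one.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def build_targets(layer_outputs, top_k):
    """Average the normalized outputs of the last `top_k` layers.

    layer_outputs[l][t] is the feature vector of layer l at position t.
    Returns one target vector per position.
    """
    chosen = layer_outputs[-top_k:]
    targets = []
    for t in range(len(chosen[0])):
        normed = [normalize(layer[t]) for layer in chosen]
        dim = len(normed[0])
        targets.append([sum(n[d] for n in normed) / len(normed)
                        for d in range(dim)])
    return targets

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)

# Two layers whose raw scales differ by 10x yield the same normalized features,
# which is why the target no longer depends on the scale of any single layer.
layers = [[[1.0, 2.0, 3.0]], [[10.0, 20.0, 30.0]]]
targets = build_targets(layers, top_k=2)
```

Because the target is a vector of normalized features rather than a word, pixel, or sound unit, the same loss can in principle be applied to any modality.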

Our method uses a teacher network to first compute target representations from an image, a piece of text, or a speech utterance. Next, we mask part of the input and repeat the process with a student network, which then predicts the latent representations of the teacher. The student model has to predict representations of the full input data even though it has a view of only some of the information. The teacher network is identical to the student model but with weights that are slightly out of date.
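The teacher-student loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released implementation: the "network" is a hypothetical three-weight filter (`encode`), masking replaces inputs with zero, the loss is a plain squared error, and the "slightly out of date" teacher is modeled as an exponential moving average of the student weights with a hypothetical `decay` parameter.

```python
import random

def encode(weights, tokens):
    """Toy 3-tap filter standing in for a real encoder: each output mixes a
    position with its two neighbors, so context can fill in masked positions."""
    padded = [0.0] + list(tokens) + [0.0]
    return [sum(weights[k] * padded[i + k] for k in range(3))
            for i in range(len(tokens))]

def mask_input(tokens, mask_prob, mask_value, rng):
    """Replace a random subset of positions; return the copy and the indices."""
    masked, indices = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            masked[i] = mask_value
            indices.append(i)
    return masked, indices

def ema_update(teacher, student, decay):
    """Move each teacher weight a small step toward the student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def train_step(student, teacher, tokens, rng, lr=0.05, decay=0.99, mask_prob=0.5):
    # 1. The teacher sees the full, unmasked input and produces the targets.
    targets = encode(teacher, tokens)
    # 2. The student sees a masked view of the same input.
    masked, idx = mask_input(tokens, mask_prob, 0.0, rng)
    if not idx:  # nothing was masked this round, so there is nothing to learn
        return student, teacher, 0.0
    preds = encode(student, masked)
    # 3. Squared error between student and teacher, on masked positions only.
    loss = sum((preds[i] - targets[i]) ** 2 for i in idx) / len(idx)
    # 4. Hand-derived gradient step for the toy filter (student only).
    padded = [0.0] + masked + [0.0]
    grads = [sum(2.0 * (preds[i] - targets[i]) * padded[i + k] for i in idx) / len(idx)
             for k in range(3)]
    student = [w - lr * g for w, g in zip(student, grads)]
    # 5. The teacher slowly tracks the student instead of being trained directly.
    teacher = ema_update(teacher, student, decay)
    return student, teacher, loss

rng = random.Random(0)
student = [0.1, 0.2, 0.3]
teacher = list(student)
for _ in range(10):
    student, teacher, loss = train_step(student, teacher, [1.0, 2.0, 3.0, 4.0], rng)
```

Only the student receives gradient updates; the teacher changes solely through the moving average, which is what keeps its weights slightly behind the student's.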

We tested the method on the popular ImageNet computer vision benchmark, where it performed better than existing methods for popular model sizes. On speech, we found that it performed better than both wav2vec 2.0 and HuBERT, two previous Meta AI self-supervised algorithms for speech. For text, we tested it on the popular GLUE benchmark suite, and it performed as well as RoBERTa, a reimplementation of BERT.

Data2vec for computer vision: performance on the popular ImageNet benchmark for ViT-B models compared with other recent methods.

Data2vec for speech: performance for Base models on the LibriSpeech benchmark with 10h labeled data compared with other recent methods. Lower error rate indicates better performance.

Data2vec for text: performance on the GLUE natural language understanding benchmark for Base models compared with RoBERTa when retrained with the original BERT settings. Higher score indicates better performance.

Toward machines that learn from observing the world around them

While self-supervised learning has made great progress in computer vision, videos, and other individual modalities through different learning objectives, the core idea of this approach is to learn more generally: AI should be able to learn to do many different tasks, including those that are entirely unfamiliar. We want a machine to not only recognize animals shown in its training data but also adapt to recognize new creatures if we tell it what they look like. Data2vec demonstrates that the same self-supervised algorithm can work well in different modalities, and often better than the best existing algorithms. This paves the way for more general self-supervised learning and brings us closer to a world where AI might use videos, articles, and audio recordings to learn about complicated subjects, such as the game of soccer or different ways to bake bread.

We also hope data2vec will bring us closer to a world where computers need very little labeled data in order to accomplish tasks. Since it is difficult and sometimes impossible to collect annotated examples (to train speech recognition models for thousands of languages, for example), data2vec is an important step toward more general AI. This project complements research on general model architectures, and we hope that in the future we can remove the need for modality-specific feature extractors by combining these two lines of work.

Access the open source code and released pretrained models here, and read the paper here.

This blog post was made possible by the work of Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli.
Source: Facebook AI Blog

