What the research is:

What if you could search for specific video content — “sheepdogs on speedboats,” say, or “child in green hat singing the ABC song” — and receive an accurate, complete list of results, regardless of how the videos were tagged? Or, picture the reverse: You describe a scene — or upload atmospheric music, perhaps with a little dialogue — and a software program generates a video based on your suggestion, without the need for cameras, lights, actors, or editors. These applications are among the goals of multimodal research, in which AI models are trained to find relationships between different types of data, typically images, video, audio, and text. Although the AI community has made significant advances on image-text tasks, progress in the video-text domain lags. This is due in part to the difficulty inherent in modeling complicated spatiotemporal information, as well as the massive reserves of storage space and computing power that this work demands.

To push the field forward, we are releasing MUGEN (short for Multimodal Understanding and GENeration), a dataset of 375K linked video, audio, and text samples. Drawn from an enhanced version of the platform game CoinRun, the videos in MUGEN show the main character navigating environments and interacting with other characters in diverse and complex ways. Each video is annotated with text descriptions written by humans, as well as automatically generated templated text descriptions, frame-level pixel-accurate semantic segmentation maps, and synchronized audio. This combination of simplified visuals and rich annotations will enable the community to make progress on multimodal video-audio-text tasks without requiring prohibitively large computing resources.


One of the obstacles to advancing multimodal technology is that multimodal datasets tend to be either too complex or too simple to be useful. Existing datasets fall into two categories: open world, collected in the wild, and closed world, depicting simple, often artificially generated environments. Most video-text datasets are open world, but the complex dynamics in these live-action videos have proved too challenging for current text-to-video generation systems to represent accurately. In response, researchers have turned to more constrained datasets for the training and evaluation of their models. However, the videos in these closed-world collections portray only limited actions and interactions between entities, aspects crucial to modeling real-world videos.

With MUGEN, we sought a happy medium: a compilation of closed-world videos that is narrow along some dimensions but rich along others, to enable focused advances in multimodal research. As with Meta’s other recent multimodal work, including our open source Pythia framework for vision and language research, the Situated and Interactive Multimodal Conversations (SIMMC) dataset, the Halo annotation platform, the Learning from Videos project, and CommerceMM, a new approach to multimodal understanding for online shopping, we wanted to use our resources to help improve AI’s understanding of complex data.

How it works:

We based our dataset on OpenAI’s CoinRun, a platform game about an alien — whom we call Mugen — who tries to gather coins without being killed by monsters. We modified CoinRun to make our videos more diverse and delightful — adding audio, slowing game physics, adjusting camera zoom and stabilization, and enabling new interactions between characters. In our version of CoinRun, Mugen can take 16 different actions. The core actions — walk, jump, collect coin, kill monster, power-up, climb ladder, bump head, and die — trigger different sound effects, which are layered with background music to produce the full audio track. Ten monsters interact with Mugen; some walk, others hop, and one flies.

To collect videos, we trained 14 reinforcement learning (RL) agents — AI systems that learn through trial and error to navigate an environment in search of a predetermined reward — to play and record the game. We assigned each agent a different objective in order to increase the diversity of our dataset. For example, if we programmed an agent to value immediate rewards over future gain, the agent would take more risks and, as a result, die more often.
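The trade-off between immediate and future reward described above is conventionally controlled by a discount factor, gamma. The sketch below is purely illustrative (it is not MUGEN's actual training code): a low gamma makes a risky payoff-now strategy look attractive, while a high gamma favors steady, safer returns.

```python
def discounted_return(rewards, gamma):
    """Sum of per-step rewards weighted by gamma**t.

    A low gamma makes the agent short-sighted (immediate rewards dominate);
    a high gamma makes future consequences matter.
    """
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A risky path pays off immediately but ends in death (a large penalty later);
# a safe path pays a little at every step. (Toy numbers, for illustration.)
risky = [5.0, 0.0, -10.0]
safe = [1.0, 1.0, 1.0]

# A short-sighted agent (gamma = 0.1) prefers the risky path...
assert discounted_return(risky, 0.1) > discounted_return(safe, 0.1)
# ...while a far-sighted agent (gamma = 0.99) prefers the safe one.
assert discounted_return(safe, 0.99) > discounted_return(risky, 0.99)
```

Varying a knob like this across the 14 agents is one simple way such a fleet could produce qualitatively different gameplay, including agents that die more often.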

Our RL agents recorded 233,000 unique videos, which we split into three-second clips. (We’ve included both the longer, uncut videos and the clips in the dataset.) To demonstrate associations between text and video, we then asked human annotators to describe, in one to two sentences, what happened during those snippets of gameplay. We also developed a template-based algorithm to generate text descriptions based on game engine metadata about Mugen’s actions. After filtering low-quality annotations, we added 378,902 text descriptions for the 375,368 video clips to the MUGEN dataset. (Some clips have more than one description.)
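A template-based caption generator of the kind described above can be sketched in a few lines. The event schema and template strings here are hypothetical stand-ins, not MUGEN's actual metadata format:

```python
def describe(events):
    """Turn a list of (action, object) metadata events into a templated caption.

    The (action, object) schema and the template wording are illustrative only.
    """
    templates = {
        "collect": "Mugen collects a {obj}",
        "kill": "Mugen kills a {obj}",
        "jump": "Mugen jumps",
        "walk": "Mugen walks to the {obj}",
    }
    phrases = [templates[action].format(obj=obj) for action, obj in events]
    return ", then ".join(phrases) + "."

caption = describe([("walk", "right"), ("jump", None), ("collect", "coin")])
print(caption)
# "Mugen walks to the right, then Mugen jumps, then Mugen collects a coin."
```

Because the game engine logs every action Mugen takes, captions like this can be produced automatically at scale, complementing the free-form human annotations.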

Each video is also paired with synced audio, generated from background music and sound effects. In many open-world video-text datasets, the audio and video are not well aligned: Irrelevant background noise is common, and the people shown may be talking about something unrelated to the current scene. In contrast, the video and audio in MUGEN are synchronized based on Mugen’s actions, which will facilitate experiments with underexplored tasks, like audio generation from video or text. Semantic segmentation maps delineate the objects in each frame pixel by pixel, helping models learn to associate the visual features in a clip with the coinciding sounds and text, and with similar content in other video frames.
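A pixel-accurate segmentation map is simply a per-frame grid of class IDs, one per pixel. This toy example (the class IDs and label names are hypothetical, not MUGEN's actual label set) shows how a model or evaluation script can pull out the exact pixels belonging to one entity:

```python
import numpy as np

# Hypothetical class IDs; MUGEN's real label set may differ.
CLASSES = {0: "background", 1: "mugen", 2: "coin", 3: "monster"}

# A tiny 4x6 frame's segmentation map: one integer class ID per pixel.
seg = np.array([
    [0, 0, 2, 0, 0, 0],
    [0, 1, 1, 0, 3, 0],
    [0, 1, 1, 0, 3, 0],
    [0, 0, 0, 0, 0, 0],
])

# Boolean mask selecting exactly the pixels labeled "mugen".
mugen_mask = seg == 1
print(mugen_mask.sum())  # 4 pixels belong to Mugen in this frame
```

At real frame resolutions the same indexing gives a model an exact, per-pixel grounding for each entity it hears about in the audio or reads about in the caption.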

Taken together, our annotations — the semantic maps, synced audio, and written descriptions — help illuminate the connections between video, sound, and language, which we hope will guide multimodal models to refine their understanding of these relationships.

We are also releasing the updated game engine, which can enable more customized data collection, and the trained RL agents. In our paper, we have benchmarked the performance of retrieval and generation between every pair of modalities. The code for these baselines is available here.

Why it matters:

We believe the research community can benefit from datasets that are narrow but rich — offering challenges that are achievable for current systems while reflecting the variety of activity depicted in real-world videos. Research done with the MUGEN dataset will have many real-world applications down the line. Imagine if you could, for example, find all the TV shows that have featured a particular tune (audio-to-video retrieval) or type a description of an unidentified bird call to search a library of sound effects (text-to-audio retrieval). MUGEN can also help researchers build new text-to-video generation systems — automatically creating videos from scratch based on written instructions. This type of cross-modal generation task is comparatively under-researched, largely due to a lack of feasible datasets.
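Cross-modal retrieval tasks like those above are commonly framed as nearest-neighbor search in a shared embedding space: each modality has an encoder, and matching pairs are trained to land close together. The sketch below assumes such embeddings already exist (the vectors are toy values, not outputs of any real MUGEN encoder) and ranks candidates by cosine similarity:

```python
import numpy as np

def retrieve(query_emb, candidate_embs):
    """Rank candidates by cosine similarity to the query, best match first.

    In practice the embeddings would come from trained modality encoders;
    here they are hand-picked toy vectors.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)

# A "text" query against three "video" embeddings in a shared space.
text_query = np.array([1.0, 0.0, 0.5])
videos = np.array([
    [0.0, 1.0, 0.0],   # unrelated clip
    [1.0, 0.1, 0.4],   # close match
    [0.5, 0.5, 0.5],   # partial match
])
print(retrieve(text_query, videos))  # [1 2 0]
```

The same machinery works in any direction — audio-to-video, text-to-audio, and so on — simply by swapping which encoder produces the query and which produces the candidates.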

The interactions between entities in MUGEN are more diverse than in other closed-world datasets. At the same time, compared with an in-the-wild dataset, the videos in MUGEN have a limited set of stripped-down objects and scenes: The physics are simplified, the camera angle is fixed, and the lighting is consistent. That narrowness will allow researchers to make steady progress on more feasible, bite-size challenges. We have also developed a storage-efficient online pipeline that can render videos and semantic maps at different resolutions on the fly, based on metadata stored in a JSON file. By reducing data and compute requirements, MUGEN will allow a much wider swath of the AI community to join in the work to advance multimodal understanding and generation.
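The storage trick behind an on-the-fly pipeline like this is to store only compact per-frame metadata and reconstruct pixels at whatever resolution is requested. A minimal sketch, under assumed conventions (the metadata schema, normalized coordinates, and field names here are hypothetical, not MUGEN's actual format):

```python
import json

def render_frame(meta_json, resolution):
    """Sketch of on-the-fly rendering: scale stored entity positions
    (normalized [0, 1] coordinates, a hypothetical schema) to the target
    resolution. A real renderer would draw sprites; here we only compute
    where each entity lands in pixel space.
    """
    meta = json.loads(meta_json)
    width, height = resolution
    return {
        e["name"]: (round(e["x"] * width), round(e["y"] * height))
        for e in meta["entities"]
    }

frame_meta = json.dumps({
    "entities": [
        {"name": "mugen", "x": 0.25, "y": 0.5},
        {"name": "coin", "x": 0.8, "y": 0.3},
    ]
})

# The same tiny metadata record yields frames at any resolution,
# with no stored video required.
print(render_frame(frame_meta, (64, 64)))
print(render_frame(frame_meta, (256, 256)))
```

Since a JSON record like this is orders of magnitude smaller than the rendered frames, the dataset's storage footprint stays modest while still supporting arbitrary-resolution video and segmentation maps.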

Read the paper
Get the code


By Xi Yin, Research Scientist | Songyang Zhang, Research Intern | Thomas Hayes, Research Engineer | Devi Parikh, Research Director
Source Meta AI
