What the research is:

Meta AI is sharing new research on using Vision Transformers (ViTs) for object detection. Our approach, ViTDet, outperforms previous alternatives on the Large Vocabulary Instance Segmentation (LVIS) benchmark. Meta AI researchers released the LVIS dataset in 2019 to facilitate research on low-shot object detection, a task in which the model must learn to recognize a much wider variety of objects than conventional computer vision systems can. ViTDet recognizes objects in LVIS more accurately than previous ViT-based models. The dataset includes not just standard items like tables and chairs, but also bird feeders, wreaths, doughnuts, and much more.

To enable the research community to reproduce and build upon these advancements, we are now releasing the ViTDet code and training recipes as new baselines in our open source Detectron2 object detection library.

How it works:

Over the past year, ViTs have been established as a powerful backbone for visual recognition. Unlike typical convolutional neural networks, the original ViT is a plain, non-hierarchical architecture that maintains a single-scale feature map throughout its processing. Challenges arise when applying ViTs to object detection, however. For example, how can we detect multiscale objects effectively with a plain backbone? And is a ViT too inefficient to use for object detection in high-resolution images?

Unlike recent hierarchical ViT designs such as Swin and MViTv2, ViTDet uses only plain, nonhierarchical ViT backbones. It builds a simple feature pyramid from the single-scale feature map output by the ViT and relies primarily on simple, nonoverlapping window attention to extract features from high-resolution images efficiently. This design decouples the pretraining of the ViT from the fine-tuning demands of detection, so the object detector can benefit from readily available pretrained masked autoencoder (MAE) models.
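
To see why nonoverlapping window attention helps at high resolution, the sketch below counts how many query-key pairs self-attention has to score, first over the whole token grid and then inside fixed windows. It is only back-of-the-envelope arithmetic, not the ViTDet implementation. The 1024-pixel input, the 16-pixel patch size, the 16-token window, and the function name `attention_pairs` are all assumptions chosen for illustration.

```python
def attention_pairs(grid_h, grid_w, window=None):
    """Count query-key pairs scored by self-attention on a grid of tokens.

    window=None models global attention (every token attends to every token).
    Otherwise the grid is cut into nonoverlapping window x window blocks and
    each token attends only to tokens inside its own block.
    """
    if window is None:
        num_tokens = grid_h * grid_w
        return num_tokens * num_tokens
    if grid_h % window or grid_w % window:
        # This sketch skips padding; real implementations usually pad the grid.
        raise ValueError("grid size must be a multiple of the window size")
    num_windows = (grid_h // window) * (grid_w // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


# Illustrative values only (assumptions, not taken from the article).
image_size = 1024   # input image side, in pixels
patch_size = 16     # ViT patch side, in pixels
window_size = 16    # window side, in tokens

grid = image_size // patch_size                      # 64 x 64 tokens
global_pairs = attention_pairs(grid, grid)
window_pairs = attention_pairs(grid, grid, window_size)
savings = global_pairs // window_pairs

print(f"token grid: {grid} x {grid}")
print(f"global attention pairs: {global_pairs}")
print(f"window attention pairs: {window_pairs}")
print(f"reduction factor: {savings}x")
```

Under these assumed settings the windowed variant scores 16 times fewer pairs. The trade-off is that windows do not exchange information with each other, which is why the text says ViTDet uses window attention "primarily" rather than exclusively.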

ViTDet builds a simple feature pyramid from the output of a plain, nonhierarchical vision transformer. The decoupling of its detector-specific designs from the ViT backbone enables it to benefit from Masked Autoencoder (MAE) pretraining.
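
The shape bookkeeping behind such a pyramid can be sketched as follows. The code only computes the side length and stride of each level when one single-scale map is resampled by a set of scale factors; it does no actual resampling. The scale factors (4, 2, 1, 0.5), the 16-pixel patch size, the 1024-pixel input, and the function name `simple_pyramid_shapes` are assumptions for illustration, not values stated in this article.

```python
def simple_pyramid_shapes(image_size, patch_size=16,
                          scale_factors=(4.0, 2.0, 1.0, 0.5)):
    """Return (scale, side, stride) for each level of a simple feature pyramid.

    The plain ViT emits one feature map whose side is image_size / patch_size.
    Each pyramid level rescales that single map by one scale factor.
    """
    base_side = image_size // patch_size
    levels = []
    for scale in scale_factors:
        side = int(base_side * scale)      # spatial side of this level
        stride = image_size // side        # input pixels per feature cell
        levels.append((scale, side, stride))
    return levels


# Illustrative input size (an assumption).
levels = simple_pyramid_shapes(image_size=1024)
for scale, side, stride in levels:
    print(f"scale x{scale}: {side} x {side} map, stride {stride}")
```

Every level is derived directly from the last ViT feature map, so the backbone itself never has to produce multiple resolutions, which is the point of keeping it plain.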


We start by training ViTDet detectors following the Mask R-CNN framework with ViT backbones of base (B), large (L), and huge (H) sizes. We evaluate two pretraining strategies: supervised pretraining, and self-supervised MAE pretraining (supervised pretrained ViT-H model weights are not available). We measure the accuracy on LVIS by average precision of masks (Mask AP) and average precision of masks on the rare categories (Mask AP-rare). Achieving good performance on rare categories is challenging as there are 10 or fewer training samples per rare category. We have two primary observations:

  1. Compared with supervised pretraining, MAE pretraining delivers improved LVIS results as we scale ViTDet’s ViT backbone size.

  2. We observe strong Mask AP gains for rare category detection, which is at the heart of the low-shot detection problem posed by LVIS.
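
Mask AP, the metric behind both observations, scores a predicted mask as correct only if it overlaps a ground-truth mask closely enough, and that overlap is measured by intersection over union (IoU). The sketch below shows the IoU step on two tiny binary masks. It is a minimal illustration, not the LVIS evaluation code: the 3 x 3 masks, the 0.5 threshold, and the helper name `mask_iou` are assumptions, and the full metric also ranks predictions by confidence and averages over several thresholds and categories.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks have no overlap to measure; treat that case as 0.
    return intersection / union if union else 0.0


# Toy masks, made up for illustration.
predicted = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
ground_truth = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
]

iou = mask_iou(predicted, ground_truth)   # 3 shared pixels / 5 covered pixels
is_match = iou >= 0.5                     # assumed single IoU threshold
print(f"IoU = {iou:.2f}, counted as a match: {is_match}")
```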

Mask AP improves from 38.1 to 43.5 (+5.4) when the MAE-pretrained ViT backbone is scaled from base to large, whereas scaling the supervised-pretrained ViT backbone yields a gain of only +1.1 Mask AP. With a ViT-H backbone in Mask R-CNN, ViTDet achieves a remarkable 45.9 Mask AP and 37.9 Mask AP-rare.


We also benchmark Mask R-CNN using other recently proposed hierarchical ViT backbones, including Swin and MViTv2. Swin and MViTv2 are pretrained with supervision on ImageNet-1K and ImageNet-21K. We search for optimal recipes separately for each backbone of base (B), large (L), and huge (H) sizes whenever available. Out of all the benchmarked backbones, ViTDet with MAE pretraining has the best scaling behavior and delivers the best performance on LVIS.

Benchmark results on LVIS for object detectors with different backbones in Mask R-CNN. ViTDet-H achieved 45.9 Mask AP with ImageNet-1K self-supervised MAE pretraining.


Why it matters:

Object detection is an important computer vision task with applications ranging from autonomous driving to e-commerce to augmented reality. To make object detection more useful, CV systems need to recognize uncommon objects, including those that appear only rarely in their training data. With ViTDet, we now see a tipping point: performance on LVIS, the benchmark dataset for the low-shot object detection challenge, benefits strongly from larger backbones and better pretraining. We hope that by open-sourcing our newly established strong baselines with ViTDet, we will help the research community push the state of the art further and build more effective CV systems.

By Hanzi Mao, Research Scientist | Yanghao Li, Research Engineer | Kaiming He, Research Scientist | Ross Girshick, Research Scientist
Source Meta AI
