We’ve built and are now sharing the largest available data set designed to help AI researchers develop new systems that can identify image manipulation at scale. The first-of-its-kind Image Similarity data set provides a global benchmark for combating harmful image manipulation and abuses online.
The Image Similarity data set, built by Facebook AI, contains over 1 million images: a reference collection of 1 million images and 50,000 query images.
We’ve also launched the Image Similarity Challenge, a first-of-its-kind online competition hosted by DrivenData with a $200,000 total prize pool. The challenge is being supported by Pinterest, BBC, Getty Images, iStock and Shutterstock.
While most online users manipulate images in ways that are benign or entertaining, some create images to spread misinformation, violence, or hate that has the potential to result in online or offline harm. These images often evade automated detection systems that are not robust enough to handle image manipulations at scale.
Today, we are releasing the Image Similarity data set to the broader research community and announcing the launch of an associated competition, hosted by DrivenData, with a $200,000 prize pool. The competition opened to researchers on June 19, 2021, and ends on October 28, 2021. Pinterest, BBC, Getty Images, iStock, and Shutterstock are supporting the challenge. The Image Similarity data set consists of 1 million reference images and 50,000 query images, some of which are manipulated versions of a reference image. With this data set and challenge, we hope to enable new systems, based on machine learning (ML) or other approaches, that can help predict the similarity of two pieces of visual content and bring the industry closer to at-scale detection of manipulated images.
Harmful images: an industry-wide challenge
The threat of misinformation and abuse on social media has elicited multiple responses across the industry, particularly in the area of data provenance, with several leading organizations coming together to form the Content Authenticity Initiative and Project Origin. As with our work on initiatives like the Deepfake Detection Challenge, the Hateful Memes Challenge, and the ML Code Completeness Checklist, Facebook AI believes the best solutions will come from open collaboration by experts across the AI community. Our Image Similarity data set is the largest known data set on image similarity, and includes human and automated edits that are representative of on-platform behavior.
Many social media networks use content tracing and image similarity detection to block or slow down the spread of content (images and video) that has a clearly negative social impact. To protect online communities, these networks combine manual content moderation with automated matching tools, such as those designed to detect viral nudity, graphic violence, or known misinformation.
What is image similarity?
Image similarity consists of identifying the source of an altered image within a large collection of unrelated images. This technology is applied to a range of content moderation domains, including misinformation, copyright infringement, scams and others.
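To make the task concrete, here is a minimal sketch of source lookup, assuming every image has already been reduced to a fixed-length descriptor vector. The function names, the toy three-dimensional descriptors, and the 0.9 threshold are illustrative assumptions, not part of the data set or the challenge; a real system would use learned descriptors and an approximate nearest-neighbor index rather than a linear scan.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two descriptors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def find_source(query_desc, reference_descs, threshold=0.9):
    """Return (reference_id, score) for the best match, or (None, score)
    when no reference is similar enough to count as the source.
    The threshold value is an assumption chosen for this toy example."""
    best_id, best_score = None, -1.0
    for ref_id, ref_desc in reference_descs.items():
        score = cosine_similarity(query_desc, ref_desc)
        if score > best_score:
            best_id, best_score = ref_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Hypothetical descriptors for three reference images.
references = {
    "ref_a": [1.0, 0.0, 0.0],
    "ref_b": [0.0, 1.0, 0.0],
    "ref_c": [0.0, 0.0, 1.0],
}

edited_copy_of_b = [0.1, 0.95, 0.05]  # slightly perturbed version of ref_b
unrelated = [0.6, 0.6, 0.6]           # not derived from any reference

print(find_source(edited_copy_of_b, references))
print(find_source(unrelated, references))
```

The threshold matters as much as the ranking: a query that is not derived from any reference image should return no match at all instead of its nearest unrelated neighbor.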
Image similarity detection, a method of content tracing for visual data, has not received enough attention within the computer vision community despite its growing importance for detecting manipulation and improving the detection of data provenance. Models developed for other tasks, like classification or object instance recognition, deliver subpar results when faced with the sheer scale of images and videos shared each day. Furthermore, the lack of a large and standardized data set to measure the performance of image similarity algorithms discourages researcher involvement.
The image similarity data set
We designed the Image Similarity data set to serve as a benchmark for work in image similarity detection, providing a reference collection of 1 million images and a set of 50,000 query images. The query images are versions of reference images transformed through human and non-human edits to include various types of image editing, collages, and re-encoding.
To provide researchers with a data set with open licensing terms, we selected certain images with broad licenses from the YFCC100M data set as well as still images from our Deepfake Detection Challenge data set (raw frames of real people, no deepfake technique applied). We also used some images from our Casual Conversations data set, with the CIAGAN deepfake technique applied by our collaborators Maxim Maximov, Ismail Elezi, and Laura Leal-Taixé at the Technical University of Munich to modify the faces and make it harder for AI to identify source images.
We transformed these source images in several different ways. We applied a wide range of automated transformations to a subset of the 50,000 query images using the AugLy library, which was developed here at Facebook AI and recently open sourced. The augmentations provided in AugLy are all inspired by real transformations we see on our platforms, so many were useful for our use case: for example, overlaying text and emojis on top of the source images, running the images through JPEG compression, cropping part of the image out, and pasting the source image onto a background image. We also worked with trained third-party annotators to manually transform a smaller subset of the images, so that the data set includes more examples representative of the way a human user would transform images. The annotators used the image manipulation software GIMP to manually alter images in diverse ways that we cannot easily automate, such as handwriting or drawing on the images, or cropping to leave only the part of the image most salient to the human eye.
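The kinds of automated edits described above can be illustrated with a toy sketch. This is not AugLy's API and not the pipeline used to build the data set: the image is a small grid of grayscale values, all function names are hypothetical, and `quantize` is only a crude stand-in for the information loss caused by lossy re-encoding such as JPEG.

```python
def crop(image, top, left, height, width):
    """Keep only a rectangular region of the image."""
    return [row[left:left + width] for row in image[top:top + height]]


def paste_onto_background(image, background, top, left):
    """Copy the image into a background at the given offset.
    The background is copied first, so the original is not modified."""
    out = [row[:] for row in background]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            out[top + r][left + c] = value
    return out


def quantize(image, step):
    """Crude stand-in for lossy re-encoding: round every pixel
    down to a multiple of `step`, discarding fine detail."""
    return [[(v // step) * step for v in row] for row in image]


# A 4x4 grayscale "source image" and a blank 6x6 background.
source = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]
background = [[0] * 6 for _ in range(6)]

# Chain three edits: crop the center, degrade it, paste it onto a background.
cropped = crop(source, top=1, left=1, height=2, width=2)
query = paste_onto_background(quantize(cropped, step=32), background, top=2, left=2)

for row in query:
    print(row)
```

Even in this toy version, the resulting query shares no exact pixel values with its source, which is why exact-match hashing is not enough and similarity detection has to be robust to chained edits.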
The image similarity challenge
The Image Similarity Challenge invites participants to test their image matching techniques on the Image Similarity data set. More information for researchers is available here, and the accompanying paper is available here. For researchers considering attending NeurIPS 2021 in December, we’re also pleased to announce that the Image Similarity Challenge has been accepted for the NeurIPS 2021 competition track, where we will be announcing the winners of this challenge. (The competition is subject to official rules. See the competition website for eligibility, entry dates, submission requirements, evaluation metrics, and prizing.)
Participants in the Image Similarity Challenge are tasked with finding, for each query image, the source reference image within the data set. Baseline methods include techniques from the instance matching literature, such as keypoint matching and global descriptor extraction. We worked with image matching experts Giorgos Tolias, Tomas Jenicek, and Ondrej Chum from the Czech Technical University in Prague to choose the right evaluation metrics and calibrate the difficulty of the transformations.
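The official evaluation metrics are defined on the competition website. As a generic illustration only, the sketch below scores a list of predicted (query, reference) matches with average precision computed over a single ranking pooled across all queries. The function name, the tuple layout, and the sample IDs are assumptions made for this example, and each predicted pair is assumed to appear at most once.

```python
def pooled_average_precision(predictions, ground_truth):
    """Average precision over one ranking pooled across all queries.

    predictions:  list of (query_id, reference_id, score) tuples,
                  each pair assumed to appear at most once
    ground_truth: set of (query_id, reference_id) pairs that are true matches
    """
    if not ground_truth:
        return 0.0
    # Rank every predicted pair by confidence, highest first.
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (query_id, reference_id, _score) in enumerate(ranked, start=1):
        if (query_id, reference_id) in ground_truth:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
    # Dividing by the number of true pairs penalizes missed matches.
    return precision_sum / len(ground_truth)


# Hypothetical example: q1 and q2 have sources; q3 is a distractor query.
ground_truth = {("q1", "r7"), ("q2", "r3")}
predictions = [
    ("q1", "r7", 0.95),  # correct, confident
    ("q3", "r1", 0.80),  # false match for a query with no source
    ("q2", "r3", 0.60),  # correct, but ranked below the false match
]

print(pooled_average_precision(predictions, ground_truth))
```

Because all queries share one ranking, a confident false match on a distractor query lowers the score for every correct match ranked beneath it, which rewards systems whose confidence scores are comparable across queries.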
Human review alone will not solve the online integrity challenges caused by image manipulation at scale. Getting scalable visual similarity detection right can reduce the amount of exposure to harmful content that online communities, businesses, and content moderators face from bad actors. Facebook AI views this first-of-its-kind benchmark as a step forward in the fight against malicious image manipulation.
Like deepfakes, hateful memes and other adversarial challenges, manipulated content will continue to evolve. By providing a data set expressly made to help researchers tackle this problem, along with a common benchmark and a community competition, we are confident that the Image Similarity Challenge will spur faster progress across the industry in dealing with harmful content and help advance the similarity detection domain.