Benchmarks — from MNIST to ImageNet to GLUE — have played a hugely important role in driving progress in AI research. They provide a target for the community to work toward; a common objective to exchange ideas around; and a clear, quantitative way to compare model performance. It is hard to imagine the progress we have made in AI in a world without these shared data sets to focus our efforts.
However, benchmarks have been saturating faster and faster — especially in natural language processing (NLP). While it took the research community about 18 years to achieve human-level performance on MNIST and about six years to surpass humans on ImageNet, it took only about a year to beat humans on the GLUE benchmark for language understanding.
Researchers in NLP will readily concede that while we have made good progress, we are far from having machines that can truly understand natural language. So something is amiss: While models quickly achieve human-level performance on specific NLP benchmarks, we are still far from AI that can understand language at a human level. What’s more, static benchmarks pose other challenges:
They may contain inadvertent biases and annotation artifacts. For example, a model can do well on the Stanford Natural Language Inference corpus, a well-known reasoning task about whether a hypothesis is true for a given context, by using just the hypothesis and ignoring the context altogether. In visual question answering, the answer to a “how much” or “how many” question is frequently “2.” Modern machine learning algorithms are excellent at exploiting biases in benchmark data sets, and researchers must be vigilant against overfitting any particular set of examples.
Static benchmarks force the community to focus too much on one particular thing. We ultimately care about not one specific metric or task, but rather how well AI systems can do what they’re supposed to do when people interact with them. The real metric for AI should not be accuracy or perplexity, but model error rate when interacting with people — whether directly or indirectly.
When introducing a new benchmark, it is important not to make it too easy, as it will quickly become outdated, but also not to make it too hard, as everyone will fail. And when benchmarks saturate, researchers must engage in the time-consuming work of creating new ones. This process is already happening implicitly with many of our benchmarks, as illustrated by different iterations of the ImageNet large-scale visual recognition challenge, SQuAD 1.1 and 2.0, GLUE and SuperGLUE, and WMT, which introduces a new set of challenges every year.
We believe the time is ripe to radically rethink the way AI researchers do benchmarking and to break free of the limitations of static benchmarks: We introduce a novel platform called Dynabench, which puts humans and state-of-the-art AI models “in the loop” together and measures how often models make mistakes when humans attempt to fool them. And by adapting to a model’s responses, Dynabench can challenge it in ways that a static test can’t. For example, a college student might try to ace an exam by just memorizing a large set of facts. But that strategy wouldn’t work in an oral exam, where the student must display true understanding when asked probing, unanticipated questions.
Dynabench allows us to more accurately measure how good NLP models are today — and how far we still have to go. In doing so, the process yields valuable insights about the mistakes that current models make, which can in turn serve to train the next generation of state-of-the-art AI models in the loop. Dynamic benchmarking with Dynabench overcomes the main limitations of static benchmarks: The process cannot saturate, it will be less prone to bias and artifacts, and it allows us to measure performance in ways that are closer to the real-world applications we care most about.
Leveraging human creativity to challenge machines more effectively
With Dynabench, human annotators try to find examples that fool even state-of-the-art models into making an incorrect prediction. For example, the annotator could deliberately write a positive restaurant review — “The tacos are to die for! It stinks I won’t be able to go back there anytime soon!” — so that the model might misunderstand and then miscategorize this as a negative review. The harder the model is to fool, the better it is at its job. Moreover, the platform can encourage annotators to come up with ever more creative examples to test models in new ways.
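To make "how often models make mistakes when humans attempt to fool them" concrete, here is a minimal sketch of that measurement. It is not Dynabench's implementation: the `Attempt` record, the `keyword_model` stand-in, and the example reviews (other than the taco review quoted above) are hypothetical, chosen only to illustrate the calculation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    """One annotator-written example with its intended (true) label."""
    text: str
    true_label: str


def keyword_model(text: str) -> str:
    """Toy stand-in for a sentiment model (hypothetical).

    It calls a review negative whenever a negative cue word appears,
    which is exactly the kind of shortcut an annotator can exploit.
    """
    negative_cues = ("stinks", "awful", "terrible")
    lowered = text.lower()
    return "negative" if any(cue in lowered for cue in negative_cues) else "positive"


def model_error_rate(model: Callable[[str], str], attempts: List[Attempt]) -> float:
    """Fraction of annotator attempts on which the model's prediction is wrong."""
    if not attempts:
        raise ValueError("need at least one attempt")
    fooled = sum(1 for a in attempts if model(a.text) != a.true_label)
    return fooled / len(attempts)


attempts = [
    Attempt("The tacos are to die for! It stinks I won't be able to go back "
            "there anytime soon!", "positive"),
    Attempt("Great service and friendly staff.", "positive"),
    Attempt("The soup was awful and cold.", "negative"),
    Attempt("Not bad at all, I would happily return.", "positive"),
]

error_rate = model_error_rate(keyword_model, attempts)
print(f"model error rate under adversarial attempts: {error_rate:.2f}")
```

In this toy run only the taco review fools the stand-in model, so a lower number here means a model that is harder to fool.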
Dynamic benchmarking happens over multiple rounds. In each, the researcher or engineer using Dynabench selects one or more state-of-the-art models to serve as the target to be tested. Dynabench then collects examples using these models and periodically releases updated data sets to the community. When new state-of-the-art models catch most or all of the examples that fooled previous models, we can then start a new round with these better models in the loop. This cyclical process can be frequently and easily repeated, so that if biases appear over time, Dynabench can be used to identify them and create new examples that test whether the model has overcome them.
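The decision to start a new round can be sketched as a simple check: how many of the examples that fooled the previous target does a newer model now get right? The code below is an illustration under stated assumptions, not Dynabench's logic. The function names, the toy models, the example texts, and the numeric threshold are all hypothetical; the text above only says "most or all."

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (text, true_label)
Model = Callable[[str], str]     # maps text to a predicted label


def fraction_caught(model: Model, fooling_examples: List[Example]) -> float:
    """Share of previously model-fooling examples that `model` labels correctly."""
    if not fooling_examples:
        raise ValueError("need at least one fooling example")
    correct = sum(1 for text, label in fooling_examples if model(text) == label)
    return correct / len(fooling_examples)


def ready_for_new_round(candidate: Model,
                        fooling_examples: List[Example],
                        threshold: float = 0.9) -> bool:
    """True when the candidate catches enough old failures to become the new target.

    The threshold is an assumed knob for this sketch.
    """
    return fraction_caught(candidate, fooling_examples) >= threshold


def old_model(text: str) -> str:
    """Toy previous-round target: always predicts 'positive'."""
    return "positive"


def new_model(text: str) -> str:
    """Toy candidate: predicts 'negative' when it sees one of a few cue words."""
    cues = ("cold", "waited", "never")
    return "negative" if any(cue in text.lower() for cue in cues) else "positive"


# Examples collected in the previous round because they fooled old_model.
fooling_examples = [
    ("The food was cold and bland.", "negative"),
    ("I waited an hour for nothing.", "negative"),
    ("Never again.", "negative"),
    ("Lovely patio, shame about everything else.", "negative"),
]

print(fraction_caught(old_model, fooling_examples))
print(fraction_caught(new_model, fooling_examples))
print(ready_for_new_round(new_model, fooling_examples, threshold=0.7))
```

The example the candidate still misses is the kind of case annotators would keep probing once that candidate becomes the next target.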
Dynabench’s approach also tackles data set artifacts and biases. Taking the question answering example noted above, models must quickly figure out that “2” is not always the right answer to a “how much” or “how many” question. Otherwise, it would be easy for human annotators to generate lots of examples to fool them. When Dynabench adds those examples to the models’ training data over time, they should become more robust to this vulnerability and the human annotators can then go in search of other weaknesses.
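One rough way to check a data set for this kind of artifact is a majority-answer baseline: ignore the image or context entirely and always give the most common answer for a question type. The sketch below uses a handful of made-up question-answer pairs purely for illustration; the function name and the data are hypothetical and do not come from any real benchmark.

```python
from collections import Counter
from typing import List, Tuple


def majority_answer_baseline(
    qa_pairs: List[Tuple[str, str]],
    prefixes: Tuple[str, ...] = ("how many", "how much"),
) -> Tuple[str, float]:
    """Return the most common answer to questions starting with `prefixes`,
    and the accuracy of always giving that answer on those questions."""
    answers = [a for q, a in qa_pairs if q.lower().startswith(prefixes)]
    if not answers:
        raise ValueError("no questions match the given prefixes")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


# Made-up toy data; a real check would run over the full benchmark.
qa_pairs = [
    ("How many dogs are in the picture?", "2"),
    ("How many people are sitting?", "2"),
    ("How much milk is in the glass?", "half"),
    ("How many wheels are visible?", "2"),
    ("How many birds are flying?", "3"),
    ("What color is the car?", "red"),
]

answer, rate = majority_answer_baseline(qa_pairs)
print(f"always answering {answer!r} is right on {rate:.0%} of count questions")
```

If a context-free baseline like this scores well, annotators have an obvious opening: write count questions whose answer is anything but the majority one.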
Question answering, sentiment analysis, and more
In its first iteration, Dynabench centers on four official core tasks. We focused on NLP because it currently suffers most from rapid benchmark saturation. We selected tasks the community knows and cares about, that have clear evaluation metrics, and that have real-world applicability. In each case, we’ve partnered with researchers from academic institutions including UNC–Chapel Hill, UCL, and Stanford, who will each be the “owner” of a particular task. We will fund crowdsourced and/or expert annotators for these tasks, and we’ll encourage everyone to generate as many new and challenging examples as possible.
The ideas underlying Dynabench have been explored before. Recently, Facebook AI employed a similar process to collect the Adversarial NLI (ANLI) data set and applied it to dialogue safety. Facebook AI introduced this idea in our work on Mechanical Turker Descent, but there are many related papers from elsewhere in the AI research community, including Build It, Break It: The Language Edition; Beat the AI; and Trick Me if You Can. What is new in Dynabench is that it works at a large scale, on many tasks, and through a single platform. With ANLI, we showed this process generates examples that are very challenging for even very powerful models. (For instance, the GPT-3 model did not perform better than chance.) ANLI also demonstrates how useful this approach is for training: Examples generated this way are more sample-efficient than statically collected data.
For each task in Dynabench, we have multiple rounds of evaluation. Each has one or more target models in the loop. The models are served in the cloud via TorchServe (with interpretability via Captum, Facebook AI’s open source model interpretability library for PyTorch). Crowdsourced annotators will be connected to the platform via Mephisto. Humans interacting with the model receive almost instantaneous feedback on the model’s response. They can employ tactics such as making the system focus on the wrong word and using clever references to real-world knowledge that the machine does not have access to.
An open, evolving approach to benchmarking
Dynabench is ultimately a scientific experiment to accelerate progress in AI research. We hope it will help show the world what state-of-the-art AI models can achieve today as well as how much work we have yet to do.
After starting with the four NLP tasks described above, we plan to open Dynabench so that anyone can create their own tasks and get human annotators to find weaknesses in their models. Dynabench focuses on English for now, but we’d love to add other languages and other modalities.
We hope Dynabench will help the AI community build systems that make fewer mistakes, are less subject to harmful biases, and are more useful and beneficial to people in the real world.
Get started at dynabench.org