Flashlight: Fast And Flexible Machine Learning In C++

April 20, 2021


What it is:

Flashlight is a new open-source machine learning (ML) library, written entirely in C++ and built by FAIR (Facebook AI Research), that supports research by letting teams rapidly and easily modify deep learning and ML frameworks to better fit their needs.

Deep learning and ML frameworks are good at what they do, but altering their internals has traditionally proved difficult. Finding the right code to change is time-consuming and error-prone, as low-level internals can be unintentionally obfuscated, closed-source, or hand-tuned for particular purposes. And once you’ve made changes, recompiling the framework is both time- and compute-intensive.

We designed Flashlight to be customizable to the core. It contains only the most basic building blocks needed for research, making it simple and intuitive to navigate. And when you change Flashlight’s core components, it takes just seconds to rebuild the entire library and its training pipelines, thanks to its minimalist design and freedom from language bindings.

We wrote Flashlight from the ground up in modern C++ because the language is a powerful tool for doing research in high-performance computing environments. Flashlight has incredibly low framework overhead, as modern C++ enables first-class parallelism and out-of-the-box speed. Flashlight also provides simple bridges to integrate code from low-level domain-specific languages and libraries.

We are open-sourcing Flashlight to make it easier for the AI community to tinker with the low-level code underpinning deep learning and ML frameworks, taking better advantage of the hardware at hand and pushing the limits of performance.

What it does:

Flashlight is built on top of a shallow stack of basic abstractions that are modular and easy to use. We started with the ArrayFire tensor library, which supports dynamic tensor shapes and types, removing the need for rigid compile-time specifications and C++ templates. ArrayFire also optimizes operations on the fly with an efficient just-in-time compiler.
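To make the idea of dynamic shapes and types concrete, here is a minimal sketch of a tensor whose element type and shape are ordinary runtime values rather than template parameters. Every name in it (`DType`, `DynTensor`, `elementCount`) is hypothetical and chosen for illustration; it is not the ArrayFire or Flashlight API, and it only tracks metadata and a raw byte buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the ArrayFire or Flashlight API.
enum class DType { F32, F64, S32 };

// Size in bytes of one element of the given runtime type.
inline std::size_t bytesPerElement(DType type) {
  switch (type) {
    case DType::F32: return sizeof(float);
    case DType::F64: return sizeof(double);
    case DType::S32: return sizeof(std::int32_t);
  }
  return 0;  // not reached for valid enum values
}

// Number of elements implied by a shape (1 for an empty shape, i.e. a scalar).
inline std::size_t elementCount(const std::vector<std::size_t>& shape) {
  return std::accumulate(shape.begin(), shape.end(), std::size_t{1},
                         std::multiplies<std::size_t>());
}

// A tensor whose shape and element type are plain runtime data.
class DynTensor {
 public:
  DynTensor(std::vector<std::size_t> shape, DType type)
      : shape_(std::move(shape)),
        type_(type),
        data_(elementCount(shape_) * bytesPerElement(type_)) {}

  // Changes the shape at runtime; the element count must stay the same.
  void reshape(std::vector<std::size_t> newShape) {
    assert(elementCount(newShape) == elementCount(shape_));
    shape_ = std::move(newShape);
  }

  const std::vector<std::size_t>& shape() const { return shape_; }
  DType type() const { return type_; }
  std::size_t elements() const { return elementCount(shape_); }
  std::size_t bytes() const { return data_.size(); }

 private:
  std::vector<std::size_t> shape_;
  DType type_;
  std::vector<unsigned char> data_;  // zero-initialized raw storage
};
```

Because `DynTensor` is a single non-template class, code that accepts it does not need to be recompiled or re-instantiated for each new shape or element type. A real tensor library additionally dispatches its operations on the runtime type, which this sketch leaves out.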

Building on these base components, Flashlight includes custom, tunable memory managers and APIs for distributed and mixed-precision training. It also provides a fast, lightweight autograd (a deep learning staple that automatically computes derivatives of the chained operations common in deep neural networks), along with modular abstractions for working with data and training at scale. These components are built to support general research directions, whether in deep learning or elsewhere.
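For readers unfamiliar with autograd, the following is a minimal sketch of reverse-mode automatic differentiation on scalars, written with only the C++ standard library. The names (`Node`, `Var`, `leaf`, `add`, `mul`, `backward`) are hypothetical and are not Flashlight's API; Flashlight's own autograd operates on tensors and is considerably more capable.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical scalar autograd node, for illustration only.
struct Node {
  double value = 0.0;
  double grad = 0.0;
  std::vector<std::shared_ptr<Node>> inputs;
  // Adds this node's contribution to the gradients of its inputs.
  std::function<void(Node&)> backwardFn;
};

using Var = std::shared_ptr<Node>;

// Creates an input variable with no upstream dependencies.
Var leaf(double value) {
  auto node = std::make_shared<Node>();
  node->value = value;
  return node;
}

Var add(const Var& a, const Var& b) {
  auto out = std::make_shared<Node>();
  out->value = a->value + b->value;
  out->inputs = {a, b};
  out->backwardFn = [](Node& self) {
    // d(a + b)/da = 1 and d(a + b)/db = 1
    self.inputs[0]->grad += self.grad;
    self.inputs[1]->grad += self.grad;
  };
  return out;
}

Var mul(const Var& a, const Var& b) {
  auto out = std::make_shared<Node>();
  out->value = a->value * b->value;
  out->inputs = {a, b};
  out->backwardFn = [](Node& self) {
    // d(a * b)/da = b and d(a * b)/db = a
    self.inputs[0]->grad += self.grad * self.inputs[1]->value;
    self.inputs[1]->grad += self.grad * self.inputs[0]->value;
  };
  return out;
}

// Depth-first post-order: every node is appended after all of its inputs.
void topoSort(const Var& node, std::unordered_set<Node*>& seen,
              std::vector<Var>& order) {
  if (!seen.insert(node.get()).second) return;
  for (const auto& input : node->inputs) topoSort(input, seen, order);
  order.push_back(node);
}

// Applies the chain rule from the output back to the leaves.
void backward(const Var& root) {
  std::unordered_set<Node*> seen;
  std::vector<Var> order;
  topoSort(root, seen, order);
  root->grad = 1.0;
  // Reverse post-order visits each node before any of its inputs.
  for (auto it = order.rbegin(); it != order.rend(); ++it) {
    if ((*it)->backwardFn) (*it)->backwardFn(**it);
  }
}
```

For example, with `x = leaf(3.0)` and `y = leaf(4.0)`, building `f = add(mul(x, y), x)` and calling `backward(f)` leaves the derivative of `f` with respect to each input in `x->grad` and `y->grad`. Gradients are accumulated with `+=` so that a variable used more than once, like `x` here, receives the sum of its contributions.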

Flashlight’s lightweight domain applications (shown in the image above) support research across a variety of modalities, including speech recognition, language modeling, and image classification and segmentation, all in a single codebase. This design removes the need to combine many separate domain-specific libraries and makes multimodal research simpler: validating a new idea across several setups requires only a single incremental rebuild rather than changes to, and rebuilds of, individual upstream domain-specific frameworks.

While common primitives in deep learning are implemented via well-optimized kernels from hardware-specific vendor libraries, custom high-performance code can be difficult to integrate and iterate on quickly. Flashlight makes it easy to build new low-level computational abstractions. You can cleanly integrate CUDA or OpenCL kernels, Halide AOT pipelines, or other custom C/C++ code with minimal effort.
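As a rough illustration of what integrating custom low-level code looks like when the whole stack is C++, the sketch below wraps a raw-pointer kernel in a small, size-checked function. Both `fused_scale_add` (a stand-in for a hand-written or externally compiled kernel) and the `scaleAdd` wrapper are hypothetical names, and the wrapper works on `std::vector<float>` rather than on Flashlight tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a low-level kernel with a C-style signature. In practice this
// could be hand-tuned C code or a launcher for a GPU kernel (assumption).
void fused_scale_add(const float* x, const float* y, float alpha, float* out,
                     std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = alpha * x[i] + y[i];
  }
}

// Thin wrapper: checks sizes, owns the output buffer, and hands raw pointers
// to the kernel. No binding or marshalling layer sits in between.
std::vector<float> scaleAdd(const std::vector<float>& x,
                            const std::vector<float>& y, float alpha) {
  assert(x.size() == y.size());
  std::vector<float> out(x.size());
  fused_scale_add(x.data(), y.data(), alpha, out.data(), out.size());
  return out;
}
```

The point of the sketch is the call path: the caller and the kernel share a language and a memory model, so adding or changing a kernel means editing one function and rebuilding, not regenerating bindings.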

Modern C++ also obviates the need for tasks like manual memory management while providing powerful tools for functional programming. Flashlight supports doing research in C++ with no need to adjust external fixtures or bindings and no need for adapters to do things like threading, memory mapping, or interoperating with low-level hardware. As a result, integrating fast, parallel code becomes simple and direct.
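The sketch below shows the kind of code this refers to, using only the C++ standard library: buffers are owned by `std::vector` (no manual `new` or `delete`), the per-element work is passed in as a lambda, and threads come from `std::thread` with no adapter layer. The helper name `parallelMap` is hypothetical and is not part of Flashlight.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: applies fn to every element of `in`, splitting the
// range into contiguous chunks handled by up to numThreads worker threads.
std::vector<float> parallelMap(const std::vector<float>& in,
                               const std::function<float(float)>& fn,
                               std::size_t numThreads) {
  assert(numThreads > 0);
  std::vector<float> out(in.size());  // freed automatically when out of scope
  std::vector<std::thread> workers;

  // Ceiling division so that every element is assigned to some chunk.
  const std::size_t chunk = (in.size() + numThreads - 1) / numThreads;
  for (std::size_t t = 0; t < numThreads; ++t) {
    const std::size_t begin = std::min(in.size(), t * chunk);
    const std::size_t end = std::min(in.size(), begin + chunk);
    if (begin == end) break;  // nothing left to hand out
    // Each worker writes to a disjoint slice of `out`, so no locking is needed.
    workers.emplace_back([&in, &out, &fn, begin, end] {
      for (std::size_t i = begin; i < end; ++i) {
        out[i] = fn(in[i]);
      }
    });
  }
  for (auto& worker : workers) worker.join();
  return out;
}
```

A call such as `parallelMap(values, [](float v) { return v * v; }, 4)` squares each element on up to four threads. On some toolchains, programs that use `std::thread` need to be linked with a threading flag such as `-pthread`.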

Why it matters:

With deep learning and ML models growing more and more complex, progress depends on optimization. Advanced AI models and research require high-performance code that efficiently utilizes available hardware. Flashlight’s modular internals make it a powerful research framework for research frameworks. Its fast rebuilds make it practical to do research on the library itself, and the results can then be applied downstream to other frameworks. By making it easier to rapidly iterate on custom low-level code, Flashlight opens the door to research that pushes the limits of performance.

We’re already using Flashlight at Facebook in our research focused on developing a fast speech recognition pipeline, a threaded and customizable train-time relabeling pipeline for iterative pseudo-labeling, and a differentiable beam search decoder. Our ongoing research is further accelerated by the ability to integrate external platform APIs for new hardware or compiler toolchains and achieve instant interoperability with the rest of Flashlight.

We hope that open-sourcing Flashlight will make it easier to modify the code underpinning AI models and integrate new low-level languages and libraries — and ultimately help those in the AI community iterate faster on their ideas.