Shortly after the new year 2021, the Media Synthesis community at Reddit began to become more than usually psychedelic.
The board became saturated with unearthly images depicting rivers of blood, Picasso’s King Kong, a Pikachu chasing Mark Zuckerberg, Synthwave witches, acid-induced kittens, an inter-dimensional portal, the industrial revolution and the possible child of Barack Obama and Donald Trump.
These bizarre images were generated by inputting short phrases into Google Colab notebooks (web pages from which a user can access the formidable machine learning resources of the search giant), and letting the trained algorithms compute possible images based on that text.
In most cases, the optimal results were obtained in minutes. Various attempts at the same phrase would usually produce wildly different results.
In the image synthesis field, this free-ranging facility of invention is something new; not just a bridge between the text and image domains, but an early look at comprehensive AI-driven image generation systems that don’t need hyper-specific training in very limited domains (i.e. NVIDIA’s landscape generation framework GauGAN [on which, more later], which can turn sketches into landscapes, but only into landscapes; or the various sketch>face Pix2Pix projects, that are likewise ‘specialized).
The quality is rudimentary at best, like a half-formed remembrance of dreams, and the technology clearly nascent; but the long-term implications for VFX are significant.
So where did these technologies come from?
A New Wave Of Multimodal Neural Networks
In early 2021 artificial intelligence research laboratory OpenAI, fresh from the previous year’s press sensation regarding the power of its GPT-3 autoregressive language model , released details of two new multimodal neural networks capable of straddling two diverse computing domains: text and imagery.
CLIP (Contrastive Language-Image Pre-Training) is a zero-shot neural network that evaluates the relevance of a text snippet for any given image without the same optimization that preceding networks had to undertake in order to achieve this.
CLIP is trained not on a fixed and pre-labeled dataset, but on internet-obtained images (image content), which it attempts to pair with the most apposite of 32,768 randomly-sampled text snippets.
With this approach, CLIP breaks away from previous systems which – the company claims – are typically over-optimized for benchmark tests, and have inferior general applicability to ‘unseen’ image datasets.
In tests, CLIP was able to match the performance of ImageNet’s benchmark ResNet5017 challenge without the benefit of referring to any of the dataset’s 1.28 million labeled images – an unprecedented advance. Subsequently OpenAI made CLIP accessible via a Colab.
DALL-E uses 12 billion out of the 175 billion parameters of the GPT-3 dataset to generate text-image pairings capable of producing relatively photorealistic images — depending on the availability of image source material that will be summoned up from the data set by the text prompt:
The system is prone to the occasional semantic glitch:
As the creators note, DALL-E is disposed to confuse colors and objects when the number of requested objects in the text prompt are increased. Also, rephrasing the prompt into grammatically complex or unfamiliar forms can kill the model’s interpretive capabilities (for instance, where the object of the possessive is unclear).
If the web-derived image data for a text prompt is frequent and detailed19, it’s possible to create a genuine sense of instrumentality, in the style of a 3D modelling interface, albeit not remotely in real time:
Presumably the inclusion of the previously-generated image in a sequence of this nature would improve continuity across the frames; but in the above example, as far as we know, DALL-E is ‘starting fresh’ with each image, based on the text prompt, and on what it can see of the top of Plato’s head.
DALL-E’s transformative and interpretive capabilities are currently capturing the imagination of image synthesis enthusiasts, not least because the systems have been quickly commoditized by researchers and enthusiasts (see ‘Derivative Works’ below):
DALL-E builds on OpenAI’s Image GPT generator, a proof-of-concept transformer model capable of finishing incomplete images (albeit with a wide range of possible responses).
Sadly, there is no general access to these extraordinary transformative capabilities; OpenAI will only release the dVAE decoder and convolutional neural network (CNN) encoder from DALL-E, and there is no official plan to release the transformer model that powers the higher quality images seen in the ‘official’ posts (see above). However, a comment from an OpenAI developer on GitHub in February states that the company is ‘considering’ API access to DALL-E, which would extend the interpretive power of third-party image synthesis systems.
In the meantime, developers have attempted to recreate the functionality of DALL-E with such APIs as are available. Though the results are not as delineated or cohesive as those produced by the cloistered-off DALL-E, they hint at what can be achieved by leveraging text-to-image technologies that are powered by wide-ranging image datasets.
CLIP-GlaSS operates across a set of distinct GAN frameworks, of which the most generalized is DeepMind’s BigGAN. The other dedicated GANs offer domains for face generation, a car, a church, and for text generation via GPT-2, as well as higher quality variants (for the image-based GANs).
The overnight popularity of CLIP-GlaSS may have something to do with the very accessible Colab notebook that the Italian researchers have made available.
Unlike Big Sleep (see below), the quality of CLIP-GlaSS images usually improves when allowed to train a little longer. But since these are ‘zero shot’ generators, they tend to reach their quality ceiling (a perceived ‘minimum loss’) quickly.
Big Sleep and Aleph2Image
Programmer and University of Utah PhD student Ryan Murdock maintains a number of CLIP-derived projects, including Aleph2Image and BigSleep – the latter of which has so entranced the Reddit media synthesis group.
‘I created BigSleep and Aleph because I love making neural art,’ Murdock told me. ‘and I wanted to see if it was possible to get a backdoor around DALL-E not being released.’
BigSleep has the most user-friendly Colab notebook out of any of the current offerings:
BigSleep has two specific eccentricities: firstly, BigGAN specializes in 1000 particular ImageNet categories (a great many of which are animal species). A prompt that includes any of those categories is likely to fare better than one outside that scope.
Secondly, BigSleep’s ‘seed’ image is always a random picture of a dog breed, and sometimes the iterations have difficulty letting go of that canine element. Subsequently, people in BigSleep images can end up wearing an unfashionable number of fur coats (see the ‘Humphrey Bogart’ images later in the article).
A more recent project from Murdock is Aleph2Image, which exploits DALL-E’s Vector Quantised-Variational AutoEncoder (VQ-VAE).
The text-prompted results are more impressionistic even than BigSleep, but tend to have more visual consistency:
The Reddit Media Synthesis group maintains a list of other image synthesis Colabs and GitHub repositories that leverage CLIP for image generation, though some have been broken by framework updates that the older notebooks don’t support.
Murdock believes that the use of CLIP+GAN deployments like this in production environments, even for general design exploration, is ‘pretty rare’ at the moment, and concedes that there are some hard roadblocks to overcome in developing these systems further.
‘I think hardware is a huge bottleneck,’ he told me. ‘I doubt it’s possible for anyone with any software to emulate something like CLIP without at least a number of GPUs and plenty of time.
‘But data is the most serious bottleneck and the unsung hero of most advances. Software may be a bottleneck, but it already has the brunt of people’s attention.’
Navigating The Strange Obsessions Of CLIP
Even where text is not directly involved in image synthesis, as it is with the CLIP Colabs, it’s deeply bound up in the semantic segmentation that defines objects and entities, and which underpins many of the generative systems described here.
CLIP’s neural network is a perfect construct trained by data from an imperfect world. Since the world is very interested in Spider-Man, CLIP’s multimodal neurons will fire for even the vaguest reference to your friendly neighborhood web-slinger:
Leaving aside whether CLIP should be more interested in Spider-Man than Shakespeare or Gandhi, it should be considered that image synthesis systems which rely on it will also have to suffer CLIP’s eccentricities.
Neural networks that have been trained by the internet are prone to demonstrate obsessions and biases in accordance with the volume and breadth of material that they might find on any particular topic. The more pervasive an entity is in an image synthesis system’s contributing dataset, the more likely it is to appear unbidden, or to be triggered by apparently unrelated words or images:
If you experiment long enough with CLIP-based text-to-image tools, you’ll quickly build up a picture of its ‘interests’, associations and obsessions. It’s not a neutral system – it has a lot of opinions, and it didn’t necessarily form them from the best sources, as OpenAI recently conceded in a blog post.
Therefore, in terms of the development of professional VFX image synthesis systems, CLIP implementations should perhaps be seen as a proof-of-concept playground for future and more rigorous data architectures.
Pure Image Synthesis in Professional VFX?
However, it doesn’t seem that VFX professionals are likely to either feel threatened by the psychedelic ramblings of CLIP-based GAN image generators such as BigSleep, or even necessarily feel able to satisfy their curiosity and enthusiasm about image synthesis, so long as the instrumentality is so constrained and the ability to control the output is so limited.
Perhaps Amara’s Law is applicable here; the late American futurist and scientist stated ‘We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.’
Continuing the analogy with CGI, we can see that the evolution of disruptive VFX technologies is sporadic and inconsistent; and that new technologies often languish as industry curiosities and niche pursuits until an unpredictable breakthrough renders a cherished VFX role extinct, and makes it necessary to catch up and adapt, as stop-motion master Phil Tippett did during the making of Jurassic Park (1993).
In the second regard – let’s take a look at new approaches that could bring a greater measure of control and instrumentality to pure image synthesis, facilitating the development of new tools, and perhaps even new professions within the sector.
The Power And Pitfalls Of Paired Datasets
“Once you get to the stage where you’re no longer even necessarily designing a tool, but you’re asking the AI to do something for you, and it does that…that opens up a lot of possibilities.”
Valentine Kozin is the lead technical artist at UK games studio Rare. His passion for the possibilities that machine learning can open up for VFX pipelines led him to host a popular presentation on practical deep learning for the industry in December 2020.
“With GPT-3,” he told me. “you can even use it for game systems. Instead of codifying the rules of what happens when a flaming torch hits a body of water, you don’t have to code in the fact that the fire gets extinguished – you can just ask the AI ‘What happens if I combine fire and water?’, and the AI will tell you ‘The fire gets extinguished’, because it has some of that sort of interaction knowledge.”
One of Kozin’s key research interests at the moment is the potential of paired datasets to help generative adversarial networks transform sketches (or other forms of rudimentary, indicative imagery) into fully-realized, AI-generated images.
To accomplish this, two datasets are needed for training – one where the character faces are traditionally-rendered CGI, and one where the same faces appear in the form of crude sketches. Generating the latter can be done through open source GitHub projects such as PhotoSketch.
But the current general standard for sketch>image inference is Pix2Pix, a project out of the Berkeley AI Research (BAIR) Laboratory, which uses conditional adversarial networks as a Rosetta stone for image translation.
Contributing Datasets That Know Too Little Or Too Much
Though the potential for previz and asset generation using these techniques is obvious, the path to true instrumentality and fine-grained control for high-resolution, photorealistic output is almost as constrained by considerations of data design and architecture as by current hardware limitations for neural networks.
For example, casting Humphrey Bogart in Blade Runner is a predictably impressionistic experience in the new range of text-to-image Colab playgrounds such as BigSleep and Aleph2Image.
Partly the problem is semantic, as many image synthesis enthusiasts report of the popular new T2I Colabs: if you prompt the AI with ‘A man in a dark suit’, you could, despite the ‘obvious’ intent of the phrase, end up with a) a courtroom scene b) a single-occupant apartment made out of gabardine or c) Darth Vader.
Apart from the aforementioned issues with the frequency of ‘popular’ high level concepts in the contributing database/s, this is a problem linked to inference of intent in natural language processing (NLP); and even, arguably, to the field of sentiment analysis. Improvements in this respect are likely to come out of upstream NLP research that has loftier aims (or at least more immediately commercial ones).
But in terms of developing VFX-centered machine learning workflows, the more pressing issue is one of the scope and breadth of the contributing image datasets.
If your generator uses vast and generalized data repositories such as ImageNet or the Common Crawl, you’ll be summoning up just a small subsection of available imagery related to your intent – and these representative datasets do not have enough space or remit to exhaustively catalog every Humphrey Bogart or Blade Runner image on the internet.
Therefore why not curate your own Bogart and Blade Runner dataset, and train the model exclusively on these?
Brittle Image Synthesis Output From Overfitted Datasets
As Valentine Kozin notes, the excessive specificity of such datasets, whilst improving image quality, makes for brittle and inflexible output, without the ‘random seed’ that characterizes the more vivacious output of BigSleep and its fellow T2I Colabs:
In the end Kozin was forced to compromise on a more generalized and diversely trained model, resulting in an algorithm that’s more robust and interpretive; for instance, it will produce faces with mouths even where the user did not draw a mouth:
But Kozin observes that now the model is incapable of drawing a face without a mouth, if the user should desire that, and that, in a sense, this restored ‘flexibility’ comes with its own set of frustrating constraints.
Potential Data Architectures For Image Synthesis
“The biggest unsolved part of what OpenAI are doing with things like GPT-3,” Kozin says. “is that it’s very much a static knowledge bank. It’s got a thousand, two thousand character attention window…it’s not an agent that can dynamically learn more about the task and then adapt, so that there’s a ‘snapshot’ that we can apply to different things.
“The solution to that would be AI architectures that are able to have this large data set, but then are also able to learn more about the specifics of the problem they’re trying to solve, and able to apply both that generalized and specific knowledge. And then, that’s kind of getting close to human level creativity, I suppose.
“I think that’s a model that we don’t really have yet. Once we do have it, that’s probably going to be very powerful.”
Ryan Murdock believes that for the moment, using far-ranging, high-volume image sets remains necessary for generating coherent and novel structures, and that training a separate GAN for every element of a potential image (movement, pose structure, lighting, etc.) would be labor-intensive.
‘The benefit of the huge data,’ he says, ‘is allowing the system to fill in gaps through its large associative knowledge. Hopefully it’ll be possible someday to get a very specific , but for now I think the closest thing is to guide CLIP by giving it a sort of structure.
‘You could basically just give any image a structure and then have CLIP fill in the rest, either altering the image or giving it a specific area to fill, allowing it to have some potentially useful constraint.
‘I haven’t done this yet, because I’m super busy, but it’s the type of thing that is surprisingly low-hanging fruit for putting humans closer into the loop.’
A number of research projects are currently addressing the challenge of training generative adversarial networks with more limited data, without producing brittle results or overfitting. These include initiatives from NVIDIA, MIT, and the MetaGAN few-shot learning study.
In 2019 NVIDIA’s research arm released details of GauGAN, an AI art application that allows the user to develop photorealistic landscapes from crude sketches.
GauGAN is arguably a perfect example of the kind of topically-constrained media synthesis system that can be developed by training paired datasets on a small number of specific domains – in this case, landscape-related imagery, pair-trained against sketch imagery to develop an interpretive interchange between crude daubs and photo-real images.
Though it’s not clear how one can create anything other than landscape imagery with GauGAN, NVIDIA has promoted VFX concept artist and modeler Colie Wertz’s use of the tool as the basis of a more complex spaceship sketch.
A later promotional video features output created with GauGAN by VFX practitioners from Lucasfilm, Ubisoft and Marvel Studios:
In the live interactive demo, GauGAN allows the user to define a segmentation map, where colors are anchored to different types of output, such as sea, rocks, trees, and other facets of landscape imagery.
The map can either be drawn directly into the interface, or generated elsewhere by the user and uploaded as a direct input into the synthesis system.
The neural network in the SPADE generator (see ‘Google’s Infinite Flyovers’ below) controls the layer activations via the segmentation mask, bypassing the downsampling layers used by the Pix2PixHD model (pictured upper center in the image below), achieving better results and improved performance with fewer parameters.
Extended Interpretation In Image And Video Synthesis
Semantic Segmentation: The Key To AI World Building?
The year before GauGAN was released, NVIDIA showed the potential of semantic segmentation to generate randomized urban street scenes, rather than idyllic pastures:
This system, called Vid2Vid, was trained on hundreds of hours of car-level street footage, and then against a separate segmentation neural network, in order to derive object classes for the contents of ‘fictional’ street videos – cars, roads, markings, street lamps, and other environmental details.
Once segmentation is established, it’s possible to attach many different aspects to generated footage, including surface changes…
…time of day…
…and time of year.
Achieving Temporal Consistency In Video Synthesis
With NVIDIA’s Vid2Vid, as with so many of the emerging image synthesis tools, VFX professionals have been prone to express concern about the lack of control, and the ability to apply consistent logic or continuity to the output.
In the summer of 2020 new research from NVIDIA, titled World-Consistent Video-to-Video Synthesis, addressed this issue with a new neural network architecture capable of achieving long-term temporal consistency.
The new architecture generates ‘guidance images’ – a kind of ‘key-frame’ that’s evaluated from the motion data in the scene flow, and which collates all the previous viewpoints instead of merely calculating the next image from the previous frame, as with the original Vid2Vid, or with Google’s ‘Infinite Nature’ system (which we will look at shortly).
Attention and focus are critical areas in general AI research, and while it’s not clear how much cohesive and consistent output NVIDIA’s new implementation of Vid2Vid can achieve for longer video synthesis footage, this is exactly the kind of effective corralling of semantic logic that AI in VFX needs in order to move forward into production environments.
In August of 2020 researchers from UC Berkeley, UC San Diego and Google released details of a new machine learning method capable called Neural Radiance Fields (NeRF) – an innovative take on transforming 2D imagery into a navigable, AI-driven 3D space:
NeRF notably improves upon four predecessors in this field:
This is not mere in-betweening or image interpolation, but equivalent to the creation of a points cloud that’s so consistent it can even be used to generate a 3D mesh (see image below), and allows of panning, tilting and craning, with highly accurate lighting transitions and occlusion depth maps.
‘I think the NeRF paper might be the first domino in a new breed of machine learning papers that will bear more fruit for visual effects.’ says VFX supervisor and machine learning enthusiast Charlie Winter. ‘The spinoffs so far demonstrate the ability to do things that would simply be impossible with a traditional GAN.’
‘The way that a convolutional neural network generates an image.’ Winter told me, ‘is by progressively up-scaling and running computations on a 2D piece of information. Given a similar task, the limitation to this is that if you don’t have many samples around a particular angle, the results get much worse and even unstable, as it can only use tricks learned in 2D to roughly approximate what might be there.
‘With NeRF, what’s being learned is an approximation of 3D shape and light reflectance, and an image is only generated by marching rays through the scene and sampling the network at each point.
‘It’s similar to the way a CG scene is rendered. Using this technique, there can still be angles with less quality. However, because this technique doesn’t “cut the corner” of learning in 3D, it’s much more likely that coherent information comes out.’
Winter notes that NeRF is very demanding in terms of processing time, wherein one complicated scene might take a month to render. This can be alleviated through the use of multiple TPUs/TPUs, but this only shortens render time – it doesn’t cut the processing costs.
DyNeRF: Facebook’s Neural 3D Video Synthesis
In March 2021, a subsequent collaboration between Facebook and USC, entitled Neural 3D Synthesis, evolved the principles of NeRF into a new iteration called Dynamic Neural Radiance Field (DyNeRF). The system is capable of producing navigable 3D environments from multi-camera video inputs.
With 8 GPUs fielding 24,576 samples for each iteration and 300,000 total iterations required, training would normally require 3-4 days – notably more for successive iterations to improve the quality of the video.
To improve these calculation times, DyNeRF evaluates the priority of each pixel, and assigns it a weight at the expense of other pixels that will contribute less to the rendered output. In this way, convergence times (how long it takes the training session to achieve an acceptable minimum loss result) are kept within workable bounds.
Nonetheless, the researchers admit that the results are ‘near’ to photorealism; that the computational power needed to generate this level of interpretability is formidable; and that extrapolating objects beyond the bounds of the camera views will be an additional challenge.
DyNeRF is able to cope well with traditionally problematic visual elements such as fire, steam and water, and with anisotropic elements such as flat material reflections, and other light-based environmental considerations.
Once the synthesis is achieved, it is remarkably versatile. The frame rate can be upscaled from the native 30fps to 150fps or more, and reversing that mechanism allows extreme slow-motion and ‘bullet-time’ effects to be achieved.
Facebook SynSin – View Synthesis From A Single Image
A 2020 research paper from Facebook AI Research (FAIR) offers a less exhausting approach to dynamic view synthesis, albeit a less dazzling one.
SynSin derives moving images from one static, single image through inference and context.
Here the model must incorporate an understanding of 3D structure and semantics, since it must deduce and in-paint the parts of the scene that are revealed as the motion tracks beyond the bounds of the original photo.
Similar work in recent years relies on multi-camera views (such as DyNeRF), or on images that are accompanied by depth map data, synthesized externally (to the training model) or else captured at the time the image was taken.
Instead, SynSin estimates a more limited point cloud than DyNeRF, and uses a depth regressor to cover the amount of movement necessary. The features learned by the neural network are projected into the calculated 3D space, and those rendered features are then passed to a refinement network for final output.
Google’s Infinite Flyovers
While this kind of ‘moving image’ extends a little beyond Apple’s iOS Live Photos feature62, (and Adobe also moved into this territory in 2018 with its Project Moving Stills initiative63) it’s not exactly the ‘infinite zoom’ we were promised in Blade Runner.
Google Research offers something nearer to this in the form of Infinite Nature64, a machine learning system capable of generating ‘perpetual views’ – endless flyovers which are seeded by a single image, and which use domain knowledge of aerial views to visualize a flight right through the initial picture and beyond, into and over imaginary landscapes:
Infinite Nature is trained on 700 videos of drone footage from coastlines and nature scenes, a database of more than two million frames. The researchers ran a structure-from-motion (SfM65)
pipeline to establish the 3D camera trajectories. The resultant database, called the Aerial Coastline Imagery Dataset (ACID), has been made publicly available.
Though the system takes some inspiration from SynSin’s complete computation of a motion path, it instead performs an infinite loop, dubbed by the researchers as Render – Refine – Repeat.
The system’s differentiable renderer compares the existing frame to a projected next frame and generates a ‘disparity map’, which reveals where the holes are in the landscape.
Google provides an interactive Colab notebook, wherein you can take a (geographically inaccurate!) journey from your own uploaded starting image.
However, the sequence will transform very quickly into aerial landscape imagery, should you choose to provide any other kind of image.
There is nothing to stop future developers from applying this logic to other types of ‘infinite journey’ in different domains, such as Vid2Vid-style urban street imagery, or to use artificial datasets to produce randomly-generated voyages through other worlds or alien cultures.
If you can gather together enough gritty photos of early 1980s downbeat New York tenements, you could even finish Deckard’s Esper analysis of his retro Polaroid, and finally find out just how far that ‘enhance’ function goes.
The development costs for GPT-3 and DALL-E range in the low millions, in contrast to big budget VFX outings such as Avengers: Infinity War, which is reported to have spent $220 million USD on visual effects alone. Therefore it seems likely that the major studios will funnel some research money into comparable image synthesis frameworks, if only to ascertain the current limitations of this kind of architecture – but hopefully, eventually, to radically advance the state of the art.
As an analogy to how long CGI took to gain a true foothold in the VFX industry, image synthesis is currently nearer to Westworld (1973) than Wrath Of Khan (1982); but there is the sound of distant thunder.
Martin Anderson is a freelance writer on machine learning and AI.
This article is republished from hackernoon.com