NVIDIA Corporation’s foray into the already crowded generative AI field could be game-changing. Results reported by the company’s researchers show that its new model can produce higher-quality images, with output that follows the given text prompt more closely than that of other text-to-image generators.
eDiffi (short for ensemble diffusion for images), NVIDIA’s new text-to-image model, sets itself apart from other generative models built on a standard diffusion model by using an ensemble of expert denoisers. Standard diffusion models generate images through iterative denoising: a random noise image is passed repeatedly through a denoising neural network, and each pass removes some of the noise, guided by the text prompt, until a high-quality image emerges.
However, as the NVIDIA researchers show, standard models rely heavily on the prompt at the beginning of the process, but this reliance gradually fades until the denoiser almost entirely ignores the text conditioning and focuses only on producing high-fidelity images.
In contrast, eDiffi does away with using the same denoiser throughout, and instead uses specialised denoisers, each trained for a particular stage of the iterative generation process. The idea comes from the observation that diffusion models behave differently at different noise levels.
(Illustration of denoising system varying in eDiffi and Standard Diffusion models)
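To make the contrast concrete, here is a minimal toy sketch of a sampling loop that hands each step to a different expert depending on the current noise level. It is not NVIDIA’s implementation: the three experts, their thresholds, the `make_denoiser` helper and the use of a fixed `target` list as a stand-in for "what the prompt asks for" are all assumptions made purely for illustration.

```python
import random

def make_denoiser(strength):
    """Return a toy 'denoiser' that moves a sample toward a target by `strength`.
    A real denoiser is a neural network; this is only a stand-in."""
    def denoise(x, target):
        return [xi + strength * (ti - xi) for xi, ti in zip(x, target)]
    return denoise

# Hypothetical experts, each covering one interval of noise levels.
# Entries are (lower bound of the interval, name, denoiser).
EXPERTS = [
    (0.66, "high-noise expert", make_denoiser(0.5)),
    (0.33, "mid-noise expert", make_denoiser(0.3)),
    (0.00, "low-noise expert", make_denoiser(0.2)),
]

def pick_expert(noise_level):
    """Route a step to the expert whose interval contains the noise level."""
    for lower_bound, name, denoise in EXPERTS:
        if noise_level >= lower_bound:
            return name, denoise
    raise ValueError("noise level must be non-negative")

def sample(target, steps=9, seed=0):
    """Start from random noise and denoise step by step, switching experts."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise to begin with
    used = []
    for step in range(steps):
        noise_level = 1.0 - step / steps  # falls from 1.0 toward 0
        name, denoise = pick_expert(noise_level)
        x = denoise(x, target)
        used.append(name)
    return x, used
```

Calling `sample([1.0, -1.0, 0.5])` returns the final sample together with the list of experts used at each step. A standard diffusion model would correspond to a single entry in `EXPERTS` that handles every noise level.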
eDiffi uses an ensemble of encoders (the T5 text encoder, the CLIP text encoder, and the CLIP image encoder) to provide inputs to the model. The two text encoders, the authors say, bring together the ability of CLIP text embeddings to get the foreground object right and that of T5 text embeddings to produce better compositions. The image encoder, on the other hand, offers style transfer: a user can supply a reference image, and the model will produce output in a similar style.
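The data flow described above can be sketched roughly as follows. The encoders here are deterministic stubs, not real T5 or CLIP models, and the names `stub_encoder`, `Conditioning` and `build_conditioning`, as well as the embedding sizes, are hypothetical. The only point is that the prompt feeds two text encoders while the style reference is optional.

```python
from dataclasses import dataclass
from typing import List, Optional

def stub_encoder(name: str, dim: int):
    """Stand-in encoder: returns a deterministic pseudo-embedding.
    This is NOT a real T5 or CLIP model."""
    def encode(data: str) -> List[float]:
        total = sum(ord(c) for c in name + data)
        return [((total * (i + 1)) % 97) / 97.0 for i in range(dim)]
    return encode

# Hypothetical encoders with made-up embedding sizes.
t5_text = stub_encoder("t5-text", 4)
clip_text = stub_encoder("clip-text", 3)
clip_image = stub_encoder("clip-image", 3)

@dataclass
class Conditioning:
    t5_text: List[float]
    clip_text: List[float]
    clip_image: Optional[List[float]]  # None when no style reference is given

def build_conditioning(prompt: str, style_image: Optional[str] = None) -> Conditioning:
    """Encode the prompt with both text encoders; encode the reference
    image only if the user supplied one."""
    return Conditioning(
        t5_text=t5_text(prompt),
        clip_text=clip_text(prompt),
        clip_image=clip_image(style_image) if style_image is not None else None,
    )
```

For example, `build_conditioning("a puppy in an office")` leaves `clip_image` as `None`, while passing a second argument such as `"reference.png"` (a made-up file name) fills it in without changing the two text embeddings.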
Users were particularly impressed by the stylistic images the model was able to produce.
In their paper, the NVIDIA researchers also compared the images generated from the same prompt by Stable Diffusion, DALL-E 2, and eDiffi. Here is one example:
(AI-generated output for the prompt “A photo of a golden retriever puppy wearing a green shirt. The shirt has text that says ‘NVIDIA rocks’. Background office. 4k dslr.”
Left: Stable Diffusion; Center: DALL-E 2; Right: eDiffi)
According to the researchers, NVIDIA’s model handles customised prompts better than the others because its expert denoising system trains denoisers to stay faithful to the text prompt even in the later stages of the generation process.
Departure from GAN
However, this is not the first time NVIDIA has stepped into the waters of text-to-image modelling. Before eDiffi, NVIDIA used deep learning to build several versions of its GauGAN model. The second version, GauGAN2, released in November 2021, was trained on 10 million high-quality landscape images. Its demo application let users generate images from any text they typed in.
The GauGAN model is based on generative adversarial networks (GAN), unlike eDiffi, which uses diffusion modelling for generating images.
So why did NVIDIA move away from GANs for its text-to-image feature?
Arash Vahdat and Karsten Kreis, two of the researchers behind eDiffi, explained in a blog post dated April 2022 that for generative models to be widely useful in the real world, they should satisfy three key requirements:
- High quality sampling
- Mode coverage and sample diversity
- Fast and computationally inexpensive sampling
However, existing models always involved a trade-off, since no single model could meet all three requirements. They referred to this as the “generative learning trilemma”.
(Generative learning trilemma)
Diffusion models offer high sample quality and diversity, but they lack the sampling speed of GANs. One reason sampling in a diffusion model is slow, they said, is that “mapping from a simple Gaussian noise distribution to a challenging multimodal data distribution is complex”. To address this, they introduced the Latent Score-based Generative Model (LSGM), a framework that maps input data to a latent space and works there rather than directly in data space.
In discussing the advantages of their approach over traditional GANs, the researchers pointed to the training instability and mode collapse issues of GANs. The possible reasons for these issues, they said, include “the difficulty of directly generating samples from a complex distribution in one shot, as well as overfitting problems when the discriminator only looks at clean samples.”
Hence, according to them, denoising diffusion systems are better suited than traditional GANs to overcoming the generative learning trilemma.
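As a rough illustration of what working in a latent space means, the toy sketch below samples noise in a small latent space, denoises it there, and only maps the final result back to data space. It is not the LSGM method itself: the averaging `encode`, the repeating `decode`, the sizes, and the reference-pulling `denoise_in_latent` are all hypothetical stand-ins for learned components.

```python
import random

DATA_DIM = 8
LATENT_DIM = 2  # hypothetical sizes, chosen only for illustration

def encode(x):
    """Stand-in encoder: average each block of data values into one latent value."""
    block = DATA_DIM // LATENT_DIM
    return [sum(x[i:i + block]) / block for i in range(0, DATA_DIM, block)]

def decode(z):
    """Stand-in decoder: repeat each latent value across its block."""
    block = DATA_DIM // LATENT_DIM
    return [zi for zi in z for _ in range(block)]

def denoise_in_latent(z, z_ref, steps=10, rate=0.5):
    """Toy reverse process: every step touches only LATENT_DIM values."""
    for _ in range(steps):
        z = [zi + rate * (ri - zi) for zi, ri in zip(z, z_ref)]
    return z

def generate(x_example, seed=0):
    rng = random.Random(seed)
    # z_ref stands in for what a trained model would have learned from data.
    z_ref = encode(x_example)
    # Noise is sampled in the latent space, not in data space.
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    z = denoise_in_latent(z, z_ref)
    # Only the final latent is mapped back to data space.
    return decode(z)
```

In this toy setup, the iterative part of generation operates on two values per step instead of eight, and the decoder runs once at the end.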
Paint with words
Besides text-to-image generation, the new model has an additional feature called ‘paint with words’. It lets users doodle on a canvas to specify where objects should appear. Even from a very rough sketch, the model produces a polished image.
In comparison, segmentation-to-image methods, such as those based on GANs, are likely to fail, the authors said, when the sketch drawn on the canvas is vastly different from the shapes of real objects.
Final thoughts
This has been the year of AI-based image generators, and NVIDIA, although late to the party, has arrived with a bang. Expert denoising systems, style transfer capabilities, and painting with words each add to the repertoire of what AI art can do. The image synthesis quality of the new model has improved substantially, but more importantly, its output is more closely aligned with the input text than that of other diffusion models we have seen.