Part A: The Power of Diffusion Models!

Overview

Part A of this two-part project shows what you can do with a diffusion model: how to use it to restore noisy images and to generate new ones! To do this, we use the DeepFloyd IF diffusion model, a two-stage model trained by Stability AI. Out of the box, it can generate images of an oil painting of a snowy mountain village, a man wearing a hat, and a rocket ship, as shown below:

Generated using 20 inference steps

Generated using 30 inference steps

(I used seed 180 for this project, but the seed wasn’t always set, so replicating exact images may not work perfectly)

Sampling Loops

Just generating images isn't very interesting, however; we want to understand how they are generated. A diffusion model predicts the noise in an image and aims to remove it.

Implementing the Forward Process

In order to remove noise from an image, we must first be able to add it. The forward process takes a clean image and produces a noisy version of it at timestep t: the image is scaled down slightly and Gaussian noise is mixed in, with larger t meaning more noise. Below is the test image of the Campanile at a few noise levels.
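In formula form, the forward process is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps drawn from a standard Gaussian. A minimal sketch of it in PyTorch (the `forward` helper name and the precomputed `alphas_cumprod` schedule are my own shorthand, not the exact scaffold used in the project):

```python
import torch

def forward(im, t, alphas_cumprod):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    # `alphas_cumprod` holds the cumulative products of the noise schedule,
    # indexed by timestep t.
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps, eps
```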

Original Image | Noise Level 250 | Noise Level 500 | Noise Level 750

Classical Denoising

As a baseline, we then try a simple, more classical denoising technique, Gaussian blur filtering, to see how much better our model actually is.
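As a rough sketch, the baseline is nothing more than a Gaussian blur of the noisy image; the kernel size and sigma below are arbitrary choices for illustration, not tuned values:

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=7, sigma=2.0):
    # Classical baseline: smooth the noisy image with a Gaussian kernel.
    # This suppresses some noise but also destroys high-frequency detail.
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```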

Noise Level 250 | Noise Level 500 | Noise Level 750

One-Step Denoising

This is OK, but very clearly not great. So now it's time for our model: we pass the noisy image through the pretrained UNet, which estimates the noise, and use that estimate to recover the clean image in a single step.
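Concretely, one-step denoising inverts the forward-process equation using the UNet's noise estimate. A sketch, where `estimate_noise` is a stand-in for the prompt-conditioned stage-1 UNet call rather than the project's exact interface:

```python
def one_step_denoise(estimate_noise, x_t, t, alphas_cumprod):
    # Get the model's noise prediction, then solve the forward-process
    # equation for the clean image:
    #   x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_bar_t)
    eps_hat = estimate_noise(x_t, t)
    alpha_bar = alphas_cumprod[t]
    return (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```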

Noise Level 250 | Noise Level 500 | Noise Level 750

This does much better! The result now resembles the Campanile instead of confetti passed through YouTube's compression algorithm. But we can still do better: one-step denoising predicts the total noise on the image and removes all of it at once, which struggles when the noise level is high.

Iterative Denoising

While one-step denoising gives decent-looking results, the image becomes much better when we denoise it iteratively, stepping from high noise down to low noise a little at a time. The difference is visible below.
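A sketch of the loop, assuming a decreasing list of strided timesteps (e.g. 990, 960, ..., 0) and the same `estimate_noise` stand-in as before; for simplicity the added noise uses the fixed DDPM posterior variance rather than the model's own predicted variance:

```python
import torch

def iterative_denoise(estimate_noise, x, timesteps, alphas_cumprod):
    # `timesteps` is a strided, decreasing list; `estimate_noise(x, t)` is
    # assumed to return the UNet's noise prediction for image x at timestep t.
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        ab_t, ab_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = ab_t / ab_next                      # effective alpha across the stride
        beta = 1 - alpha
        eps = estimate_noise(x, t)
        # Current estimate of the clean image from the noise prediction.
        x0_hat = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        # DDPM posterior mean: blend the clean estimate with the current x_t.
        x = (ab_next.sqrt() * beta / (1 - ab_t)) * x0_hat \
            + (alpha.sqrt() * (1 - ab_next) / (1 - ab_t)) * x
        if t_next > 0:
            # Simplification: fixed posterior variance instead of the
            # model's predicted variance.
            var = beta * (1 - ab_next) / (1 - ab_t)
            x = x + var.sqrt() * torch.randn_like(x)
    return x
```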

Original | Iteratively Denoised | One-Step Denoised | Gaussian Blurred

Diffusion Model Sampling

Repeating the same process starting from pure random noise now allows us to generate completely new images. We use the prompt "a high quality photo" and get these:

Classifier-Free Guidance (CFG)

These look good, but not great. To fix some of the "nonsense" we see, we can use CFG. This process essentially nudges each denoising step further toward the input prompt, giving us these images.
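CFG runs the UNet twice per step, once with the prompt and once with a null prompt, and pushes the estimate past the conditional one. A sketch, with a guidance scale of 7 as one reasonable choice and `estimate_noise` again a stand-in for the embedding-conditioned UNet call:

```python
def cfg_noise_estimate(estimate_noise, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    # Classifier-free guidance: extrapolate from the unconditional estimate
    # toward (and past) the conditional one. gamma > 1 strengthens the prompt.
    eps_cond = estimate_noise(x_t, t, cond_emb)
    eps_uncond = estimate_noise(x_t, t, uncond_emb)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```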

I'm not entirely sure why they're all faces. While developing and testing the code I was getting landscapes and other subjects, but now it's only generating faces.

Image-to-image Translation

Now, let's try to turn one image into another. As before, we take an image, add noise to it, and then denoise. However, instead of trying to recover the original image exactly, we vary how much noise we add: the more noise, the further the result drifts from the original, giving us a new image that is only loosely based on it.
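A sketch of the procedure, reusing the `iterative_denoise` loop and `estimate_noise` stand-in from the sketches above: a smaller i_start means the image is noised more heavily, so the result departs further from the input.

```python
import torch

def sdedit(estimate_noise, im, i_start, timesteps, alphas_cumprod):
    # Noise the input up to timestep timesteps[i_start] (the forward process)...
    t = timesteps[i_start]
    ab = alphas_cumprod[t]
    x_t = ab.sqrt() * im + (1 - ab).sqrt() * torch.randn_like(im)
    # ...then run the usual iterative denoising loop from that point onward.
    return iterative_denoise(estimate_noise, x_t, timesteps[i_start:], alphas_cumprod)
```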

i_start=1 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | Original

Editing Hand-Drawn and Web Images

Now, let’s do the same but with images off the internet!

i_start=1 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | Original

And now using some that were very beautifully drawn by yours truly!

i_start=1 | i_start=3 | i_start=5 | i_start=7 | i_start=10 | i_start=20 | Original

Inpainting

Rather than altering the whole image, we can also edit just part of it and have the new content blend in with the rest.
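One way to do this is to run the usual denoising loop and, after every step, force the pixels outside the mask back to the original image noised to the current timestep, so only the masked region is actually generated. A sketch of that projection step (the helper name and mask convention are my own; the mask is 1 inside the hole to fill):

```python
import torch

def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    # Keep the model's output only inside the mask; everywhere else, paste
    # back the original image noised to timestep t so the untouched region
    # stays consistent with the current noise level.
    ab = alphas_cumprod[t]
    x_orig_t = ab.sqrt() * x_orig + (1 - ab).sqrt() * torch.randn_like(x_orig)
    return mask * x_t + (1 - mask) * x_orig_t
```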

Original | Mask | Hole to Fill | Inpainted Image

Text-Conditional Image-to-image Translation

Now, we will do the same thing as SDEdit, but guide the projection with a text prompt. This is no longer pure "projection to the natural image manifold" but also adds control using language.

Visual Anagrams

Now we are finally ready to implement Visual Anagrams and create optical illusions with diffusion models. We create an image that looks like "an oil painting of people around a campfire", but when flipped upside down reveals "an oil painting of an old man". We also pair "a photo of a dog" with "a man wearing a hat", and "a photo of the amalfi coast" with "a lithograph of a skull".
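The trick is to compute two noise estimates at every step, one for each prompt, evaluating the second on the vertically flipped image and flipping its estimate back, then averaging the two. A sketch, with `estimate_noise` again standing in for the prompt-conditioned UNet call:

```python
import torch

def anagram_noise_estimate(estimate_noise, x_t, t, emb_upright, emb_flipped):
    # Noise estimate for the upright prompt on the image as-is...
    eps1 = estimate_noise(x_t, t, emb_upright)
    # ...and for the second prompt on the vertically flipped image, with the
    # result flipped back so both estimates share the same orientation.
    eps2 = torch.flip(estimate_noise(torch.flip(x_t, dims=[-2]), t, emb_flipped), dims=[-2])
    # Denoise with the average of the two estimates.
    return (eps1 + eps2) / 2
```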

Hybrid Images

In this part we implement Factorized Diffusion and create hybrid images. To do this, we combine the low-frequency component of the noise estimate for one prompt with the high-frequency component of the estimate for the other. Below we have "a lithograph of waterfalls" and "a lithograph of a skull", "a photo of the amalfi coast" and "an oil painting of a snowy mountain village", and finally "a photo of a dog" and "an oil painting of an old man".
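A sketch of the combined estimate; a Gaussian low-pass separates the two frequency bands, and the kernel size and sigma below are assumed values rather than the exact ones used here:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(estimate_noise, x_t, t, emb_low, emb_high,
                          kernel_size=33, sigma=2.0):
    # Keep the low frequencies of the estimate for one prompt and the high
    # frequencies of the estimate for the other, then recombine them.
    eps_low = estimate_noise(x_t, t, emb_low)
    eps_high = estimate_noise(x_t, t, emb_high)
    lowpass = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    highpass = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return lowpass + highpass
```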


Part B: Diffusion Models from Scratch!

Overview

The second and final part is about training a brand new diffusion model on the MNIST dataset.

Training a Single-Step Denoising UNet

To start, a single-step denoiser was built. It takes in a noisy image and is trained to minimize the L2 loss between its denoised output and the original clean image.
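A sketch of that objective, where the training noise level sigma is an assumed value: the clean digits are corrupted with Gaussian noise and the UNet is trained to map them back.

```python
import torch
import torch.nn.functional as F

def denoiser_loss(denoiser, x, sigma=0.5):
    # Corrupt the clean MNIST batch with Gaussian noise of level sigma, then
    # penalize the L2 distance between the denoised output and the originals.
    z = x + sigma * torch.randn_like(x)
    return F.mse_loss(denoiser(z), x)
```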

Implementing the UNet

Loss

Epoch 1

Epoch 5

Sample results on the test set with out-of-distribution noise levels after the model is trained.

Training a Diffusion Model (Using a Time-Conditioned UNet)
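Next, the UNet is conditioned on the timestep and trained to predict the noise that was added, rather than the clean image directly. A sketch of the objective, assuming 300 diffusion timesteps and a UNet that takes a normalized timestep in [0, 1]:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(unet, x0, alphas_cumprod, num_timesteps=300):
    # Sample a random timestep per image, noise the batch to that timestep,
    # and train the UNet to predict the injected noise.
    t = torch.randint(0, num_timesteps, (x0.shape[0],), device=x0.device)
    ab = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return F.mse_loss(unet(x_t, t / num_timesteps), eps)
```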

Loss

Epoch 1

Epoch 20

Adding Class-Conditioning to UNet
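Finally, the digit class is fed in as a one-hot vector and dropped (zeroed out) roughly 10% of the time during training, so the model also learns an unconditional estimate and can later be sampled with classifier-free guidance. A sketch, with the exact conditioning interface assumed:

```python
import torch
import torch.nn.functional as F

def class_conditioned_loss(unet, x0, labels, alphas_cumprod,
                           num_timesteps=300, p_uncond=0.1):
    # One-hot encode the digit class and randomly zero it out ~10% of the
    # time so the model also learns an unconditional noise estimate.
    c = F.one_hot(labels, num_classes=10).float()
    drop = (torch.rand(c.shape[0], device=c.device) < p_uncond).float()
    c = c * (1 - drop).unsqueeze(1)
    # Same noise-prediction objective as before, now also class-conditioned.
    t = torch.randint(0, num_timesteps, (x0.shape[0],), device=x0.device)
    ab = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return F.mse_loss(unet(x_t, t / num_timesteps, c), eps)
```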

Loss

Epoch 5

Epoch 20