# Demystifying AI Generated Art | Stable Diffusion

Generative AI has advanced by leaps and bounds in the past year with the release of LLMs (large language models) like ChatGPT from OpenAI and Llama 2 from Meta, as well as image generation models like DALL-E from OpenAI and Stable Diffusion from [stability.ai](http://stability.ai).

These image generation models can convert a simple image description like

> **“A car on a road, sci-fi style”**

into detailed, photorealistic images like this:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169224769/5ed4d697-0f9e-474a-b381-b9e355d21e2f.png align="left")

In this article, we are going to take a deep dive into how Stable Diffusion, a state-of-the-art image generation model at the time of writing, works.

# How do diffusion models work?

Diffusion models are a type of generative model, which means they are built to generate outputs similar to the data they were trained on.

There are two different types of diffusion processes:

## Forward Diffusion

This process progressively adds noise to the image until it is converted into **uncharacteristic** noise. Uncharacteristic noise means you can’t tell whether the original image was a dog, a cat or maybe even a car. This step matters because it produces the noisy images that the model later learns to clean up.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169186582/e4db7a2f-c9d9-4b50-ab83-c3fcc7a91941.png align="center")
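Here is a minimal NumPy sketch of forward diffusion. It assumes a DDPM-style linear noise schedule, where adding a little Gaussian noise at every step is equivalent to one jump straight to a given step. The function names, the 1,000-step schedule and the beta range are illustrative assumptions, not values taken from Stable Diffusion itself.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal that survives up to each step (assumed linear schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(image, step, alpha_bars, rng):
    """Jump straight to `step` of forward diffusion and return (noisy image, noise used)."""
    noise = rng.standard_normal(image.shape)
    a = alpha_bars[step]
    noisy = np.sqrt(a) * image + np.sqrt(1.0 - a) * noise
    return noisy, noise

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
image = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # random stand-in for a real training image

for step in (0, 499, 999):
    noisy, _ = add_noise(image, step, alpha_bars, rng)
    corr = np.corrcoef(image.ravel(), noisy.ravel())[0, 1]
    print(f"step {step:4d}: correlation with original = {corr:.3f}")
```

The correlation with the original should fall toward zero as the step number grows, which is the "uncharacteristic noise" end state described above.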

## Reverse Diffusion

This is the fun part! The reverse diffusion process tries to reconstruct the original training image from the noisy image we got from **Forward Diffusion**.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169168562/e8937370-4c8c-496e-aec0-369f6f3cab66.png align="center")

The reconstructed image isn’t always the same as the original as there is randomness that comes into play here. Now we know what the model needs to do. But the question is **HOW** 🤔

## Training Process

To reverse the diffusion process, we need to know how much noise was added to the image. To find out, we train a neural network to predict the noise that was added. It is called the **Noise Predictor** and is a U-Net model.

**The steps are as follows:**

* Pick a random training image
    
* Generate some noise
    
* Add this noise to the training image for a certain number of steps
    
* Teach the noise predictor to estimate how much noise was added, by comparing its prediction with the noise we actually used
    

> 🔥 Now we have a fully trained noise predictor
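The four training steps listed above can be sketched in a few lines of NumPy. This is a toy under loud assumptions: the "images" are 4-pixel random vectors, the noise schedule is an assumed linear one, and the noise predictor is a single hypothetical linear layer rather than a U-Net.

```python
import numpy as np

rng = np.random.default_rng(0)
PIXELS = 4  # tiny flattened "images" keep the example fast

# Toy stand-ins (assumed): a small image dataset and a linear noise schedule.
images = rng.uniform(-1.0, 1.0, size=(16, PIXELS))
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def make_batch(batch_size):
    """Steps 1-3: pick random images, generate noise, mix them at a random step."""
    x0 = images[rng.integers(len(images), size=batch_size)]
    noise = rng.standard_normal((batch_size, PIXELS))
    a = alpha_bars[rng.integers(len(alpha_bars), size=batch_size)][:, None]
    noisy = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return noisy, noise

# Hypothetical noise predictor: one linear layer instead of a U-Net.
W = np.zeros((PIXELS, PIXELS))
b = np.zeros(PIXELS)

def predict_noise(noisy):
    return noisy @ W.T + b

def mse(noisy, noise):
    return float(np.mean((predict_noise(noisy) - noise) ** 2))

val_noisy, val_noise = make_batch(2000)
loss_before = mse(val_noisy, val_noise)

# Step 4: gradient descent on the mean squared error of the predicted noise.
lr = 0.1
for _ in range(500):
    noisy, noise = make_batch(64)
    err = predict_noise(noisy) - noise  # shape (batch, PIXELS)
    W -= lr * 2.0 * err.T @ noisy / err.size
    b -= lr * 2.0 * err.sum(axis=0) / err.size

loss_after = mse(val_noisy, val_noise)
print(f"validation loss: {loss_before:.3f} -> {loss_after:.3f}")
```

The loss on the held-out batch should drop noticeably after training. A real noise predictor is also told which step it is looking at, which this sketch leaves out to stay short.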

## Inference

To use this noise predictor, we start from a completely random, pure-noise image. The noise predictor estimates the noise in the image, and we then subtract that estimate from the image. We repeat this process for a specified number of sampling steps.

![Noise removal visualisation](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169112702/3baf4db1-7f31-4754-822e-3bdf91fbc8f7.png align="center")
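The sampling loop can be sketched like this. To keep it runnable without a trained network, `predict_noise` is a hypothetical oracle that knows the single image in its "dataset", and the update rule is a DDIM-style deterministic step, which is only one of several samplers in use. The schedule is the same assumed linear one as before.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
target = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # the only "image" our toy model knows

def predict_noise(noisy, step):
    """Hypothetical stand-in for a trained U-Net: the exact noise for a one-image dataset."""
    a = alpha_bars[step]
    return (noisy - np.sqrt(a) * target) / np.sqrt(1.0 - a)

def sample(num_sampling_steps=20):
    x = rng.standard_normal(target.shape)  # start from pure noise
    steps = np.linspace(len(alpha_bars) - 1, 0, num_sampling_steps).astype(int)
    for i, step in enumerate(steps):
        a = alpha_bars[step]
        # Noise level we are heading to; 1.0 means "no noise left".
        a_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps = predict_noise(x, step)                                  # estimate the noise
        x0_guess = (x - np.sqrt(1.0 - a) * eps) / np.sqrt(a)          # remove it
        x = np.sqrt(a_prev) * x0_guess + np.sqrt(1.0 - a_prev) * eps  # move to the lower noise level
    return x

image = sample()
print("max error vs. target:", float(np.abs(image - target).max()))
```

Because the oracle is perfect, the loop lands on the target image. A trained U-Net only approximates the noise, so in practice the starting noise decides which image comes out.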

The process discussed above runs in **pixel space**, which holds red, green and blue values for every single pixel. A single 512x512 image has 512 x 512 x 3 = **786,432** dimensions. Running these computations in pixel space is very, very slow and demands a lot of GPU memory and compute 🤯. But Stable Diffusion has a trick up its sleeve 👇

# Stable Diffusion

## Latent Space

To overcome these speed issues, we have **Latent Diffusion Models**, such as the Stable Diffusion family of models, which compress images from the high-dimensional **pixel space** into a much smaller space called **latent space**.

Latent space is **48x** smaller than pixel space, which makes the model much faster to run and makes inference practical on a single GPU at decent speeds.
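The 48x figure is easy to check with a little arithmetic. The sketch below assumes the latent shape used by Stable Diffusion v1 for 512x512 images, which is 4 channels at 64x64.

```python
# Pixel space: height x width x RGB channels for a 512x512 image.
pixel_dims = 512 * 512 * 3

# Latent space: assumed 4 channels at 64x64 (the Stable Diffusion v1 shape for 512x512 inputs).
latent_dims = 64 * 64 * 4

print(pixel_dims)                 # 786432
print(latent_dims)                # 16384
print(pixel_dims // latent_dims)  # 48
```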

## Variational Autoencoder (VAE)

To perform the compression we use a technique called the variational autoencoder (VAE). It consists of two parts:

1. **Encoder -** It handles compressing the image from pixel space to latent space
    
2. **Decoder -** It handles converting the image from latent space back to pixel space
    
    ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169105982/a7c210e2-4e2a-469b-ae28-5d1f5e239609.png align="center")
    

The image is compressed into latent space with very little perceptible loss of information. This is possible since natural images are **not random.** For example, faces follow a certain spatial relationship between eyes, nose and other features. You can read more about this here → [Manifold Hypothesis](https://en.wikipedia.org/wiki/Manifold_hypothesis).

## Inference in Latent Space

Inference in latent space works mostly the same way as in pixel space, except that we start from a random latent matrix instead of a random noise image. After the sampling steps finish, an additional **VAE Decoder** step converts the latent matrix back into a regular image in pixel space, which is our final generated image.

## Conditioning

In the steps we discussed above, we never specified what we wanted the model to generate. Steering the model toward a desired kind of result is known as ***“conditioning”***.

There are several types of conditioning such as:

* Text conditioning (aka prompting)
    
* Inpainting
    
* Outpainting
    
* ControlNets
    
* and more…
    

In this article, we will only look at text conditioning, which is the most widely used conditioning method and is also a building block of several others.

### Text Conditioning

Here is a high-level view of how text prompts are processed and fed into the noise predictor (U-Net). This might look familiar to those who know the transformer architecture used in LLMs.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169033447/60b4663b-3204-41f9-870f-721bec5c1611.png align="center")

**Tokenizer**

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169022513/0d49057b-b4f7-4280-b3d0-5ffd6283673f.png align="center")

The text prompt is **tokenized** using the [CLIP](https://openai.com/research/clip) tokenizer. Tokenization turns the prompt into a sequence of numeric tokens, so the model never has to understand “words” directly. A word **doesn’t** always correspond to a single token; one word may be split into multiple tokens.
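To see how one word can become several tokens, here is a toy tokenizer. It uses a greedy longest-match split over a tiny made-up vocabulary. CLIP's real tokenizer uses byte-pair encoding with a far larger vocabulary, so treat this only as an illustration of the idea.

```python
# Made-up vocabulary for illustration; "</w>" marks the end of a word.
VOCAB = {"a</w>": 0, "car</w>": 1, "on</w>": 2, "road</w>": 3,
         "dream": 4, "scape</w>": 5, "sci": 6, "-": 7, "fi</w>": 8}

def tokenize_word(word):
    """Split one word into the longest vocabulary pieces, left to right."""
    text = word + "</w>"
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in VOCAB:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:
            raise ValueError(f"cannot tokenize {word!r} with this toy vocabulary")
    return tokens

def tokenize(prompt):
    return [tok for word in prompt.lower().split() for tok in tokenize_word(word)]

tokens = tokenize("A car on a dreamscape")
print(tokens)                      # ['a</w>', 'car</w>', 'on</w>', 'a</w>', 'dream', 'scape</w>']
print([VOCAB[t] for t in tokens])  # [0, 1, 2, 0, 4, 5]
```

Here "dreamscape" is not in the vocabulary as a whole word, so it is split into two tokens, while "car" maps to exactly one.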

**Embedding Model**

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169030288/5ec9bbe6-c137-4c28-a7e7-3dfe81fa174e.png align="center")

The embedding model converts these tokens into vectors. Each token has a unique, fixed embedding vector, which the embedding model learned during training. Embeddings let computers measure how semantically similar two tokens are using the distance between their vectors. Stable Diffusion v1 uses OpenAI’s ViT-L/14 CLIP model.
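Here is a small sketch of what "distance between vectors" means in practice, using cosine similarity. The three-dimensional vectors below are made up for illustration; the CLIP ViT-L/14 text encoder produces 768-dimensional embeddings.

```python
import math

# Made-up embeddings for illustration only.
EMBEDDINGS = {
    "car":   [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "cat":   [0.0, 0.9, 0.4],
}

def cosine_similarity(u, v):
    """1.0 means the vectors point the same way, 0.0 means they are unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

car_truck = cosine_similarity(EMBEDDINGS["car"], EMBEDDINGS["truck"])
car_cat = cosine_similarity(EMBEDDINGS["car"], EMBEDDINGS["cat"])
print(f"car vs truck: {car_truck:.2f}")
print(f"car vs cat:   {car_cat:.2f}")
```

With these made-up numbers, "car" comes out much closer to "truck" than to "cat", which is the kind of relationship a trained embedding model encodes.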

**Text Transformer**

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694169018451/4bcb69e3-4d5f-489a-8705-6280ebadf9a2.png align="center")

The Text Transformer is the final step in the pipeline for processing the text prompt. It also serves as an adapter for other conditioning methods: its inputs are not limited to text and can include images, depth maps and a variety of other conditioning inputs.

**Cross-Attention Mechanism**

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694168873168/632da9ca-ded0-42e1-9c4b-bc3411421431.png align="center")

The Noise Predictor (U-Net) ingests the output of the text transformer via a cross-attention mechanism. It has two parts:

**(a) Self Attention (within the prompt)**

Assume the text prompt is as follows:

> ***“A blue car on the road”***

The self-attention mechanism pairs up **“blue”** and **“car”** so the model generates images with a **“blue car”** and not a **“blue road”**. For an in-depth look into this, read the [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper.

**(b) Cross Attention (between prompt and image)**

The model then uses the information from **(a)** to guide the reverse diffusion process to generate images containing blue cars.
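Both parts rely on the same operation, scaled dot-product attention, and differ only in where the queries, keys and values come from. The NumPy sketch below uses random vectors and random projection matrices as stand-ins for learned ones, and the sizes (6 tokens, 16 image patches, 8 dimensions) are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: each query takes a weighted average of the values."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
dim = 8
text_tokens = rng.standard_normal((6, dim))     # "a", "blue", "car", "on", "the", "road"
image_patches = rng.standard_normal((16, dim))  # a flattened 4x4 latent feature map

# Projection matrices are learned in a real model; random stand-ins here.
Wq_s, Wk_s, Wv_s = (rng.standard_normal((dim, dim)) for _ in range(3))
Wq_c, Wk_c, Wv_c = (rng.standard_normal((dim, dim)) for _ in range(3))

# (a) Self-attention: queries, keys and values all come from the prompt.
text_ctx, _ = attention(text_tokens @ Wq_s, text_tokens @ Wk_s, text_tokens @ Wv_s)

# (b) Cross-attention: queries come from the image, keys and values from the prompt.
out, weights = attention(image_patches @ Wq_c, text_ctx @ Wk_c, text_ctx @ Wv_c)

print(out.shape)      # (16, 8)
print(weights.shape)  # (16, 6)
```

Each row of `weights` says how strongly one image patch attends to each prompt token, which is how the prompt gets to influence every region of the image.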

This is a very important part of the conditioning pipeline, so much so that modifying it can change the style of the generated images. Hypernetworks are one technique that fine-tunes model outputs this way; you can read about them [here](https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac).

## End-to-End Pipeline Overview

Based on everything we have discussed above, here is what the finished pipeline looks like.

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1694170003584/b78cdd44-f2ee-4be3-8dca-eb467643e151.png align="center")

And this is a visualization of how the noise is converted into an image:

![stable diffusion euler](https://i0.wp.com/stable-diffusion-art.com/wp-content/uploads/2022/12/cat_euler_15.gif?resize=512%2C512&ssl=1 align="center")

If you made it this far, 👏. Thank you for reading the article. I hope you found the information useful. Please post any feedback you have in the comments below. Peace ✌️.

![](https://y.yarn.co/4b40d729-6ee0-4245-8f97-97904651c279_text.gif align="center")