Stable Diffusion 3 on Colab (Run the Full Model Without Quantization)

 

Run the full Stable Diffusion 3 model on Colab (T4 GPU) without quantization, with long prompts / extended context length and prompt weighting.


The Stable Diffusion 3 Hugging Face page states:

“SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using fp16 precision.”

and offers some options such as using a quantized version of the T5 text encoder or dropping it entirely. CPU offload does not work in the free version of Colab, and sequential CPU offload takes a long time to generate an image.

Fortunately for us, the Colab T4 GPU has enough memory to load all three text encoders at once without any quantization and compute the text embeddings, after which we can free just enough GPU space to load the transformer and VAE and perform the remaining steps of image generation.

So the basic steps to prepare the pipeline look like this (a rough sketch of the flow follows the list):

  • Load all three text encoders with their tokenizers on the GPU.
  • Load the transformer and VAE on the CPU.
  • Compute the text embeddings while all the encoder models are on the GPU (with extended context length, which allows encoding prompts longer than the 77-token limit of the CLIP encoders, and with prompt weighting).
  • To save time, move only enough modules of text encoder 3 (T5) to the CPU so that the transformer and VAE can be loaded onto the GPU.
  • Complete the image generation process.
  • Move the text encoder 3 modules back to the GPU and the transformer and VAE back to the CPU to start a new inference run.
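
As a rough illustration, here is a minimal sketch of this flow using the diffusers StableDiffusion3Pipeline. It simplifies things by moving the whole T5 encoder between devices instead of selected modules in batches (which is what the repository actually does), and the model ID, step count and guidance scale are only placeholders:

    import gc
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16,
    )

    # Text encoders on the GPU, transformer and VAE on the CPU
    pipe.text_encoder.to("cuda")
    pipe.text_encoder_2.to("cuda")
    pipe.text_encoder_3.to("cuda")
    pipe.transformer.to("cpu")
    pipe.vae.to("cpu")

    # Compute the embeddings while the encoders are on the GPU
    prompt = "a photo of an astronaut riding a horse on mars"
    with torch.no_grad():
        (prompt_embeds, negative_prompt_embeds,
         pooled_prompt_embeds, negative_pooled_prompt_embeds) = pipe.encode_prompt(
            prompt=prompt, prompt_2=prompt, prompt_3=prompt, device="cuda"
        )

    # Free the GPU memory held by T5, then bring in the transformer and VAE
    pipe.text_encoder_3.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()
    pipe.transformer.to("cuda")
    pipe.vae.to("cuda")

    # Run denoising and decoding with the precomputed embeddings
    image = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]

To run another prompt, reverse the last device moves (T5 back to the GPU, transformer and VAE back to the CPU) before encoding again.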

This repository implements the approach described above. Instead of deleting individual models or moving the whole pipeline, the code transfers selected modules between devices in batches to avoid out-of-memory (OOM) errors on the T4 machine. Because the transfer between devices is minimal, the pipeline can be used for continuous inference, and the inference time stays close to what the model would take without this setup.
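
Purely as an illustration of the idea (not the repository's actual code), moving the T5 stack's transformer blocks in small batches could look something like the sketch below; move_t5_blocks and batch_size are hypothetical names:

    import gc
    import torch

    def move_t5_blocks(text_encoder_3, device, batch_size=6):
        """Move the T5 encoder's transformer blocks to `device` a few at a time."""
        blocks = text_encoder_3.encoder.block  # nn.ModuleList of T5 blocks
        for i in range(0, len(blocks), batch_size):
            for block in blocks[i:i + batch_size]:
                block.to(device)
            # Release the memory freed by this batch before moving the next one
            gc.collect()
            torch.cuda.empty_cache()

Calling something like move_t5_blocks(pipe.text_encoder_3, "cpu") before loading the transformer and VAE onto the GPU, and move_t5_blocks(pipe.text_encoder_3, "cuda") before the next prompt, mirrors the T5 steps in the list above.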

For the extended context length and prompt weighting I have taken the code from the sd_embed repository, with some modifications to reduce memory usage, such as wrapping the encoding in torch.no_grad().

To add weights to your prompts, refer to the sd_embed repository mentioned above. With this in place we can run the full Stable Diffusion 3 model with long, weighted prompts without compromising on the text encoder models.
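
For reference, the sd_embed README shows SD3 usage roughly along these lines, with a (word:weight) syntax for weighting; the exact function name and signature may have changed, so check that repository (pipe here is assumed to be the SD3 pipeline prepared as described above):

    import torch
    from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd3

    prompt = "A vintage (red:1.3) car parked by the sea, golden hour lighting, highly detailed"
    negative_prompt = "blurry, low quality, (watermark:1.5)"

    # torch.no_grad() avoids storing activations for backpropagation, which is
    # the memory-saving modification mentioned above.
    with torch.no_grad():
        (prompt_embeds, negative_prompt_embeds,
         pooled_prompt_embeds, negative_pooled_prompt_embeds) = get_weighted_text_embeddings_sd3(
            pipe, prompt=prompt, neg_prompt=negative_prompt
        )

    image = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    ).images[0]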

You will need a minimum of 12.5 GB of CPU RAM, which is available on Colab and also on T4 machines like the AWS g4dn.xlarge (16 GB of CPU RAM).

Installation and Usage:

Refer to the repository page for details.

Note: The pipeline works well for long prompts, but very long prompts can still result in an OOM error. In such cases, try running the pipeline with a slightly lower context length as a warm-up step and then run inference again with your long prompt.

Consider giving the repository a star if it was of any help to you.

Thank You for reading!!
