Recently, I found out that Machine Learning/Artificial Intelligence is a very interesting topic. The fact that more and more applications take advantage of this technology to create value and make human life better inspired me a lot. One of the topics mentioned many times this year is Text-to-Image models and applications, which is the main subject I will cover in this series. This is the first part of the series; the second part is here. P/S: The avatar of my website was also generated by one of the models mentioned in this series.

Overview

What is Text-to-Image about?

When you hear "Text-to-Image", you can imagine a kind of application or software that takes a statement as input and produces an image or picture as output. For example, if you input the string “sailors in the big boat met the storm, mystic” to the Midjourney API (Midjourney is a powerful platform in the Text-to-Image world; I will give a quick guide so that you can try it in part two of this series), you will get this image:

Generated image from Midjourney

Looks cool, right? But what is behind these “smart” applications? What core algorithms do researchers use to develop them?

The answer varies, but in general these applications are built around a core Text-to-Image machine learning model. That model has two components:

  • A language model converts the input, which is usually text (a statement like the string “sailors in the big boat met the storm, mystic” above), into a latent representation.
  • A generative model takes that latent representation as input and creates images based on it.
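To make this two-stage division of labor concrete, here is a purely illustrative Python sketch. The `encode_prompt` and `generate_image` functions are toy stand-ins I made up for this example; in a real model, they would be replaced by a trained text encoder and a diffusion-based image generator.

```python
# Toy sketch of the two-stage Text-to-Image flow. These functions are
# placeholders, not real models or library calls.
import hashlib

def encode_prompt(prompt: str) -> list[float]:
    """Toy 'language model': map a prompt to a fixed-size latent vector."""
    digest = hashlib.sha256(prompt.encode()).digest()
    # Normalize the first bytes into floats in [0, 1) as a stand-in latent.
    return [b / 256 for b in digest[:8]]

def generate_image(latent: list[float], size: int = 4) -> list[list[float]]:
    """Toy 'generative model': expand the latent into a size x size grid."""
    return [[latent[(x + y) % len(latent)] for x in range(size)]
            for y in range(size)]

latent = encode_prompt("sailors in the big boat met the storm, mystic")
image = generate_image(latent)
print(len(latent), len(image), len(image[0]))  # 8 4 4
```

The important point is the interface between the two stages: different prompts produce different latents, and the image generator only ever sees the latent, never the raw text.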

Common models at the moment

2022 was a remarkable year: text-to-image models can now generate high-quality, creative, and artistic images that rival human paintings or actual photographs. Some of the strongest candidates in this area are Stable Diffusion, which I will talk about in the following sections, and Midjourney and DALL-E 2, which I will talk about in part two of this series.

Stable Diffusion

Introduction

Stable Diffusion is one of the most common names in the text-to-image world. First released in 2022, it is used not only for generating images from text, but also for image-to-image enhancement and variation tasks.

Stable Diffusion originated in the CompVis research lab at Ludwig Maximilian University of Munich. The researchers developed it as a latent diffusion model, and then, in collaboration with Runway and Stability AI, the model was released as an open-source project under the CreativeML OpenRAIL-M license (this is the best thing!). All information about this project can be found in its main GitHub repository.

Getting started with Stable Diffusion and Google Colab

In this guide, I used Google Colab because it has a nice UI for developing Jupyter notebooks. In particular, Google Colab's paid plans provide access to premium GPUs with a high-RAM option, which makes it easy to run this demo. If you have a laptop or workstation with a GPU that has at least 10 GB of VRAM, you can run it locally as well; the commands will be largely the same. In this demo, I use a monthly Google Colab Pro subscription; you can do the same or choose another plan that suits you here.

Change Runtime in Google Colab

Stable Diffusion requires extra RAM to work, so you need to change the Runtime in Google Colab before running the code. After purchasing the plan, you can change the runtime by choosing Runtime > Change runtime type.

Change runtime type

Then, set Hardware accelerator to GPU, GPU class to Premium, and Runtime shape to High-RAM.

Choose runtime type

Then, click the Connect button at the top right of the window. If you see Runtime information like this, the Runtime has been successfully changed to a Premium High-RAM GPU.
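You can also confirm the GPU from inside the notebook itself. A quick sanity check (run it in a Colab cell with a leading `!`) might look like this:

```shell
# nvidia-smi lists the GPU attached to the runtime; the fallback message
# covers CPU-only sessions so the cell never errors out.
nvidia-smi || echo "No GPU detected - check the runtime settings"
```

If the runtime change worked, the table printed by nvidia-smi should show the premium GPU and its available VRAM.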

Successfully changed

Setup environment

Miniconda is a lightweight version of Anaconda, a package management and environment deployment system that is a common tool if you work in Machine Learning or Data Science. You can find more information here. To install Miniconda, download the installation script into your notebook and run it with these commands.

!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.12.0-Linux-x86_64.sh
!bash ./Miniconda3-py37_4.12.0-Linux-x86_64.sh -b -f -p /usr/local

Install miniconda

Next, clone the project's repository and use conda to install the requirements.

!git clone https://github.com/CompVis/stable-diffusion.git
import os
os.chdir('stable-diffusion')
!conda env update -n base -f environment.yaml

Clone source code

Then, download the Stable Diffusion checkpoint. The latest checkpoint at the moment is sd-v1-4.ckpt, which can be downloaded using curl and verified with the ls command.

!curl https://f004.backblazeb2.com/file/aai-blog-files/sd-v1-4.ckpt > sd-v1-4.ckpt
!ls -lh sd-v1-4.ckpt

Download checkpoints

This is the most interesting part: running your own Stable Diffusion model. In the command below, --prompt is the input text, --seed makes the result reproducible, --ddim_steps sets the number of sampling steps, and --n_samples and --n_iter control how many images are produced.

!python scripts/txt2img.py \
--dpm_solver \
--ckpt sd-v1-4.ckpt \
--skip_grid \
--n_samples 1  \
--n_iter 1 \
--outdir . \
--seed 119 \
--ddim_steps 100 \
--prompt "sailors in the big boat met the storm, mystic"

Run model

Finally, you can view the output in the samples folder by running the code below, or simply download it from the Folder view.

from IPython.display import Image
Image('/content/stable-diffusion/samples/00000.png')

Output

And this is the output.

Output

You can adjust the seed parameter to get different images. Here is the output of the model with --seed 120.

Output

Or --seed 130.

Output
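If you want to explore more seeds, you can script the sweep instead of editing the command by hand. The sketch below uses the same script, checkpoint, and prompt as above, but is written as a dry run: it only prints one command per seed, so you can review each line before executing it.

```shell
# Build one txt2img.py command per seed (dry run: commands are only printed).
# Paste any printed line into a Colab cell with a leading "!" to run it.
cmds=""
for seed in 119 120 130; do
  cmd="python scripts/txt2img.py --dpm_solver --ckpt sd-v1-4.ckpt --skip_grid --n_samples 1 --n_iter 1 --outdir seed_${seed} --seed ${seed} --ddim_steps 100 --prompt 'sailors in the big boat met the storm, mystic'"
  cmds="${cmds}${cmd}
"
  echo "${cmd}"
done
```

Writing each run to its own seed_NNN folder keeps the outputs from overwriting each other, so you can compare the results side by side afterwards.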

Updated: the Stable Diffusion model used in this blog is v1. After this blog was published, Stability AI announced the release of Stable Diffusion v2 here. I will write another blog about it soon.

Conclusion

In this blog, I gave a general introduction to the Text-to-Image concept in Machine Learning, along with some common models on the market. I also ran a demo of Stable Diffusion, an open-source Text-to-Image model, on Google Colab, a convenient platform for running Machine Learning or Data Science code. In part two of this series, I will talk about the other powerful platforms in the Text-to-Image world: Midjourney and DALL-E 2. Stay tuned!

Many thanks for taking the time to read this. If you have any questions or suggestions about my blog, please reach out to me via LinkedIn or email on my homepage.

Research references