What are the 3 parts of the Stable Diffusion model?
A language model, a diffusion model, and a decoder
At a high level, what does the language model in Stable Diffusion do?
Transforms the text prompt you enter into a representation (an embedding) that can be fed to the diffusion model
What is SD’s diffusion model?
Basically a time-conditional U-Net
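"Time-conditional" means the U-Net is told which denoising step it is on. A common way to do this in diffusion models is a sinusoidal timestep embedding. The sketch below is a minimal illustration under that assumption, not SD's actual code; the function name, dim and max_period are hypothetical choices.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Map an integer timestep t to a vector of length dim (dim assumed even).

    Illustrative only: a real U-Net would pass this vector through learned
    layers before using it.
    """
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

emb_early = timestep_embedding(5)
emb_late = timestep_embedding(900)
```

Different timesteps give different vectors, so the same network can behave differently early and late in denoising.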
What does SD’s diffusion model take as input?
Some Gaussian noise and the representation of the text prompt
What does SD’s diffusion model do with its inputs?
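Denoises the Gaussian noise step by step (over several iterations), guided by the representation of your text prompt, so the result moves closer to an image that matches the prompt. The decoder then turns the denoised result into the final image. [The Gaussian noise is one of the inputs]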
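The three parts can be tied together in a toy sketch. Everything here is a stand-in written for illustration: encode_prompt, predict_noise, denoise and decode are hypothetical names, and the "U-Net" is a one-line rule rather than a trained network. Only the overall flow (prompt to representation, noise denoised over several steps, result decoded) follows the cards above.

```python
import random

def encode_prompt(prompt, dim=4):
    """Stand-in for the language model: text -> fixed-length vector."""
    rng = random.Random(sum(ord(ch) for ch in prompt))  # same prompt, same vector
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def predict_noise(latent, t, cond):
    """Stand-in for the time-conditional U-Net.

    Toy assumption: anything that differs from cond counts as noise.
    A real model learns this prediction and actually uses t.
    """
    return [x - c for x, c in zip(latent, cond)]

def denoise(cond, steps=20, seed=0):
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in cond]  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = predict_noise(latent, t, cond)
        # Remove a fraction of the predicted noise at each step.
        latent = [x - 0.3 * e for x, e in zip(latent, eps)]
    return latent

def decode(latent):
    """Stand-in for the decoder: latent values -> 'pixel' values in 0..255."""
    return [max(0, min(255, round((x + 1.0) * 127.5))) for x in latent]

cond = encode_prompt("a photo of a cat")
latent = denoise(cond)
image = decode(latent)
```

In this toy the latent simply converges onto the prompt vector; in the real model it converges to an image representation consistent with the prompt, which the decoder then renders.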