Diffusion Models are a class of generative AI models that create images or video by learning to reverse a gradual noising process. They work by starting with pure random noise and iteratively refining it through learned denoising steps until a coherent image emerges that matches the characteristics of the training data and any provided conditioning inputs such as text prompts.
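The iterative refinement described above can be sketched in a few lines. This is a minimal toy, not a real model: `toy_denoise_step` is a hypothetical stand-in for a trained denoising network, and it simply shrinks the sample toward zero (standing in for "the data manifold") at each step, so the shape of the loop — start from pure noise, apply many small learned corrections — is what matters here.

```python
import numpy as np

def toy_denoise_step(x, t, num_steps):
    """Hypothetical stand-in for a trained denoiser. A real model would
    predict and subtract the noise at step t; this toy just nudges the
    sample toward zero so the loop structure is visible."""
    return x * (1.0 - 1.0 / num_steps)

def sample(shape=(8, 8), num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)        # start from pure Gaussian noise
    for t in reversed(range(num_steps)):  # iteratively refine, t = T-1 ... 0
        x = toy_denoise_step(x, t, num_steps)
    return x

img = sample()  # after 50 denoising steps the noise has been tamed
```

In a real system each step also re-injects a small amount of noise and is conditioned on the text prompt, but the control flow — a loop from maximum noise down to a clean sample — is the same.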
Training rests on two complementary processes: a fixed forward diffusion process gradually adds noise to training images until they become indistinguishable from random static, and a neural network learns to reverse this process, predicting how to remove noise at each step to recover the original image structure. Only the reverse process is learned; the forward noising follows a predefined schedule. During generation, the model starts with random noise and applies the learned denoising process, guided by text prompts or other conditioning signals, gradually sculpting the noise into a meaningful image. This approach has proven highly effective at producing diverse, high-quality outputs and forms the foundation of models such as Stable Diffusion, DALL-E 2, Imagen, and many other contemporary image generation systems.
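The forward noising process can be written in closed form, which is what makes training efficient: any noise level can be sampled directly without stepping through all the intermediate ones. The sketch below uses the DDPM-style formulation x_t = sqrt(ᾱ_t)·x₀ + sqrt(1−ᾱ_t)·ε; the linear beta schedule is one common illustrative choice, and the returned ε is exactly the quantity a denoising network would be trained to predict.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Closed-form forward noising at step t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of (1 - beta)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps  # eps is the network's regression target during training

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # linear schedule (illustrative)
x0 = rng.standard_normal((8, 8))       # toy "image"
xt_early, _ = forward_diffuse(x0, 10, betas, rng)   # mostly signal
xt_late, _ = forward_diffuse(x0, 999, betas, rng)   # indistinguishable from static
```

At small t the noised sample still correlates strongly with the original image; by the final step the cumulative product ᾱ_t has collapsed toward zero and the sample is essentially pure noise, which is the "random static" endpoint the text describes.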
Diffusion models represent a fundamental shift in how generative AI works compared to earlier approaches like GANs. Their ability to produce high-fidelity, diverse outputs while being relatively stable to train has made them the dominant architecture in modern AI image and video generation, and understanding how they work helps creators develop better intuition for how to guide and control generation outcomes.