Diffusion Models
What are Diffusion Models?
Diffusion models learn to make images by starting with random noise and gradually cleaning it up, step by step, until a coherent picture emerges that matches a text prompt or other instructions.
At a glance
- Also known as: denoising diffusion models, score-based generative models, latent diffusion models (for the latent space variant)
- Used for: text-to-image generation, image editing and inpainting, video generation, audio generation, custom model fine-tuning
- Common tools: Stable Diffusion, DALL-E 2, DALL-E 3, Midjourney, Imagen, AI video generation platforms
Ready to create?
Direct scenes, design characters, and ship full films
All-in-one AI creative platform with simple, transparent pricing, no speed throttles, and an infinite Canvas for max creativity.
How it compares
Generative Adversarial Networks, or GANs, were the dominant image generation architecture before diffusion models. GANs use two competing networks, a generator and a discriminator, trained adversarially. While capable of producing sharp images, GANs are often unstable to train and prone to mode collapse, where the generator settles on a narrow range of outputs, so their results tend to be less diverse. Diffusion models are more stable to train, produce greater diversity, handle conditioning more reliably, and scale better with additional compute, which is why they have replaced GANs as the dominant approach for high-quality image and video generation.
Pro tip
When using diffusion-based tools, the number of denoising steps, often called inference steps or sampling steps in the interface, directly affects both quality and generation time. More steps give the model more opportunities to refine the image, generally producing better detail and coherence, but each step takes time. For rapid concept exploration, lower step counts produce usable results quickly. For final-quality generations, higher step counts extract more detail from the model. Finding the minimum step count that produces acceptable quality for your use case is a practical way to balance speed and output quality.
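The step-count trade-off described above can be sketched in a few lines. This is a minimal illustration, not how any particular tool implements it. It assumes a model trained with 1000 noise levels and a sampler that visits an evenly spaced subset of them. The function names and the `seconds_per_step` figure are hypothetical.

```python
def sampling_timesteps(train_steps, inference_steps):
    """Pick an evenly spaced subset of noise levels, highest noise first.

    Assumption: the sampler simply strides through the training schedule.
    Real tools may space their steps differently.
    """
    if not 1 <= inference_steps <= train_steps:
        raise ValueError("inference_steps must be between 1 and train_steps")
    stride = train_steps // inference_steps
    return [i * stride for i in range(inference_steps)][::-1]


def estimated_seconds(inference_steps, seconds_per_step):
    """Each step is one pass through the network, so time grows with step count."""
    return inference_steps * seconds_per_step


# A quick draft visits 20 noise levels; a final render visits 50.
draft = sampling_timesteps(1000, 20)
final = sampling_timesteps(1000, 50)

# Hypothetical cost of 0.1 s per step on some machine.
draft_time = estimated_seconds(len(draft), 0.1)
final_time = estimated_seconds(len(final), 0.1)
```

Under these assumptions the final render takes 2.5 times as long as the draft. In practice, the useful exercise is to raise the step count until the extra detail stops being visible.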
Types and variations
- Pixel-space diffusion models operate directly on full-resolution image pixels, requiring significant computational resources.
- Latent diffusion models, including Stable Diffusion, operate in a compressed latent space rather than on pixels directly, substantially reducing computational requirements while maintaining output quality.
- Score-based models are a mathematically related formulation in which the network learns the score, the gradient of the log-density of noisy data, and follows it to turn noise into samples, reaching similar generation quality.
- Video diffusion models extend the architecture to the temporal dimension, generating coherent sequences of frames rather than individual images.
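To give a rough sense of why the latent variant listed above is cheaper than the pixel-space one, the sketch below counts how many values the denoiser handles per step. It assumes an autoencoder that shrinks each side by a factor of 8 and uses 4 latent channels. These figures are commonly cited for Stable Diffusion v1 and are used here only as an illustration; the helper names are hypothetical.

```python
def element_count(height, width, channels):
    """Number of values in a height x width x channels array."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the compressed representation (assumed 8x downsampling, 4 channels)."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return height // downsample, width // downsample, latent_channels


# A 512 x 512 RGB image versus its assumed latent representation.
pixel_values = element_count(512, 512, 3)
latent_values = element_count(*latent_shape(512, 512))
reduction = pixel_values / latent_values

print(pixel_values, latent_values, reduction)  # 786432 16384 48.0
```

The element count is only a proxy for compute, but it shows the scale of the saving: under these assumptions the denoiser sees 48 times fewer values per step.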
Common use cases
- Generating images from text prompts across creative, commercial, and research applications.
- Inpainting and outpainting existing images by replacing or extending regions using diffusion-based generation.
- Fine-tuning pre-trained diffusion models on custom datasets to produce specialized character models, style-consistent generators, or domain-specific tools.
- Video generation using temporal diffusion model architectures that produce coherent motion across multiple frames.
- Research into generative AI capabilities, alignment, and safety using diffusion model frameworks.
FAQs
What is a diffusion model?
A diffusion model is a type of generative AI that creates images by learning to reverse a noise-adding process. Starting from random noise, it progressively removes noise step by step until a coherent image emerges, guided by a text prompt or other conditioning input.
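The noise-adding process that the model learns to reverse can be sketched as follows. This is a minimal illustration assuming a DDPM-style formulation with a linear noise schedule, which the text above does not specify. The schedule values, function names, and the four-number stand-in for an image are all assumptions.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (start and end values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """alpha_bar[t]: running product of (1 - beta), the share of signal variance left."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar_t, rng):
    """Jump directly to a noisy sample: sqrt(a) * x0 + sqrt(1 - a) * noise."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    noisy = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return noisy, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal(betas)

rng = random.Random(0)
image = [0.5, -0.25, 1.0, 0.0]  # stand-in for pixel values
slightly_noisy, _ = add_noise(image, alpha_bars[0], rng)   # almost the original
mostly_noise, _ = add_noise(image, alpha_bars[-1], rng)    # almost pure noise
```

In this formulation, training typically shows the network a noisy sample and asks it to predict the noise that was added. Generation then runs the process backwards from pure noise.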
Why have diffusion models become the dominant approach?
Diffusion models produce high-quality, diverse outputs, are more stable to train, and follow text conditioning better than earlier generative architectures like GANs. Their ability to scale with compute and handle a wide range of conditioning inputs made them the dominant architecture in modern AI image and video generation.
What is a latent diffusion model?
A latent diffusion model operates in a compressed representation of the image called latent space rather than on the full-resolution pixels directly. This significantly reduces computational requirements while maintaining output quality, and is the approach used by Stable Diffusion and many other production image generation systems.
How does a text prompt guide the image?
A text encoder converts the written prompt into a numerical representation that is provided to the denoising network at each step, guiding which direction the denoising process should move to produce an image consistent with the prompt rather than just any statistically plausible image.
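The text above does not say how the prompt's influence is applied at each step. One widely used mechanism, shown here purely as an illustration, is classifier-free guidance. The denoiser makes two noise predictions, one with the prompt and one without, and the sampler pushes the result further in the direction the prompt suggests. The function name and toy numbers are hypothetical.

```python
def guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the prompt-conditioned one.

    guidance_scale = 0 ignores the prompt, 1 uses the conditioned prediction
    as is, and larger values follow the prompt more strongly.
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_without_prompt, noise_with_prompt)
    ]


# Toy two-value "noise predictions" standing in for full-size arrays.
unconditional = [0.0, 1.0]
conditioned = [1.0, 1.0]

print(guided_noise(unconditional, conditioned, 0.0))  # [0.0, 1.0]
print(guided_noise(unconditional, conditioned, 1.0))  # [1.0, 1.0]
print(guided_noise(unconditional, conditioned, 2.0))  # [2.0, 1.0]
```

Many tools expose this weight as a "guidance scale" or similar setting. Whether a given product uses this exact mechanism is an assumption.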
What are denoising steps?
Denoising steps are the individual iterations of noise removal that the diffusion model performs to produce a final image. More steps give the model more opportunities to refine the image, generally improving quality and detail, but each step requires computation time. Lower step counts generate faster but may produce less refined results.
Which AI tools use diffusion models?
Most major text-to-image tools use diffusion model architectures, including Stable Diffusion, DALL-E 2, DALL-E 3, Midjourney, and Imagen. Most contemporary AI video generation models are also diffusion-based or heavily influenced by diffusion model principles.
How do diffusion models differ from GANs?
GANs use competing generator and discriminator networks trained adversarially and were the dominant approach before diffusion models. GANs are prone to instability and limited diversity. Diffusion models are more stable to train, produce more diverse outputs, and handle text conditioning more reliably, which is why they have replaced GANs for most high-quality generation applications.
Can diffusion models generate video?
Yes. Video diffusion models extend the architecture to include the temporal dimension, generating coherent sequences of frames rather than individual images. Most modern AI video generation systems are built on or significantly influenced by diffusion model principles applied to temporal sequences.