Diffusion Models

What are diffusion models?

Diffusion models learn to make images by starting with random noise and gradually cleaning it up, step by step, until a coherent picture emerges that matches a text prompt or other instructions.

At a glance

Also known as
  • Denoising diffusion models
  • Score-based generative models
  • Latent diffusion models (for the latent space variant)
Used for
  • Text-to-image generation
  • Image editing and inpainting
  • Video generation
  • Audio generation
  • Custom model fine-tuning
Common tools
  • Stable Diffusion
  • DALL-E 2
  • DALL-E 3
  • Midjourney
  • Imagen
  • AI video generation platforms

How it compares

Diffusion models vs. GANs

Generative Adversarial Networks, or GANs, were the dominant image generation architecture before diffusion models. GANs use two competing networks, a generator and a discriminator, trained adversarially. While capable of producing sharp images, GANs can be unstable to train and are prone to mode collapse, where the generator covers only part of the data distribution and output diversity suffers. Diffusion models are more stable to train, produce greater diversity, handle conditioning more reliably, and scale better with additional compute, which is why they have replaced GANs as the dominant approach for high-quality image and video generation.


Pro tip

When using diffusion-based tools, the number of denoising steps, often called inference steps or sampling steps in the interface, directly affects both quality and generation time. More steps give the model more opportunities to refine the image, generally producing better detail and coherence, but each step takes time. For rapid concept exploration, lower step counts produce usable results quickly. For final-quality generations, higher step counts extract more detail from the model. Finding the minimum step count that produces acceptable quality for your use case is a practical way to balance speed and output quality.
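
The trade-off can be seen in a toy setting. The sketch below is an illustration under stated assumptions, not how any particular tool works: the "images" are single numbers drawn from a Gaussian, the trained network is replaced by an exact formula (`predict_x0`), the noise schedule (`alpha_bar`) is a simple cosine curve, and the sampler is a deterministic DDIM-style update. All names and constants are made up for the example.

```python
import math
import random
import statistics

MU, SIGMA = 3.0, 0.5  # toy "dataset": scalars drawn from N(MU, SIGMA^2) (assumed)

def alpha_bar(t):
    """Share of signal variance remaining at time t in [0, 1] (assumed cosine schedule)."""
    return math.cos(0.5 * math.pi * t) ** 2

def predict_x0(x_t, t):
    """Stand-in for a trained network: exact E[x0 | x_t] for Gaussian data."""
    a = alpha_bar(t)
    var_t = a * SIGMA**2 + (1.0 - a)
    return MU + math.sqrt(a) * SIGMA**2 * (x_t - math.sqrt(a) * MU) / var_t

def sample(num_steps, rng):
    """Deterministic DDIM-style sampling from t=1 (pure noise) down to t=0."""
    x = rng.gauss(0.0, 1.0)
    for i in range(num_steps, 0, -1):
        t, t_prev = i / num_steps, (i - 1) / num_steps
        a, a_prev = alpha_bar(t), alpha_bar(t_prev)
        x0_hat = predict_x0(x, t)  # one "network" call per step
        eps_hat = (x - math.sqrt(a) * x0_hat) / math.sqrt(1.0 - a)
        x = math.sqrt(a_prev) * x0_hat + math.sqrt(1.0 - a_prev) * eps_hat
    return x

def sample_std(num_steps, n=2000, seed=0):
    """Spread of n generated samples; it should approach SIGMA as steps increase."""
    rng = random.Random(seed)
    return statistics.pstdev([sample(num_steps, rng) for _ in range(n)])

for steps in (1, 2, 5, 20, 50):
    print(f"{steps:>3} steps -> sample std {sample_std(steps):.3f} (target {SIGMA})")
```

In this toy, very low step counts give samples that cluster too tightly around the average value, and the spread moves toward the target as steps are added, while the number of denoiser calls (and so the cost) grows in proportion to the step count. Real models behave less cleanly, so the useful minimum still has to be found by experiment.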

Types and variations

  • Pixel-space diffusion models operate directly on full-resolution image pixels, requiring significant computational resources.
  • Latent diffusion models, including Stable Diffusion, operate in a compressed latent space rather than on pixels directly, substantially reducing computational requirements while maintaining output quality.
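  • Score-based models learn the gradient of the log data density (the score) and generate samples by following it from noise toward data. The formulation differs, but it is mathematically closely related to denoising diffusion and reaches similar generation quality.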
  • Video diffusion models extend the architecture to the temporal dimension, generating coherent sequences of frames rather than individual images.

Common use cases

  • Generating images from text prompts across creative, commercial, and research applications.
  • Inpainting and outpainting existing images by replacing or extending regions using diffusion-based generation.
  • Fine-tuning pre-trained diffusion models on custom datasets to produce specialized character models, style-consistent generators, or domain-specific tools.
  • Video generation using temporal diffusion model architectures that produce coherent motion across multiple frames.
  • Research into generative AI capabilities, alignment, and safety using diffusion model frameworks.

FAQs

What is a diffusion model?

A diffusion model is a type of generative AI that creates images by learning to reverse a noise-adding process. Starting from random noise, it progressively removes noise step by step until a coherent image emerges, guided by a text prompt or other conditioning input.
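
The noise-adding process that the model learns to reverse can be sketched in a few lines. This is a minimal illustration, assuming a cosine-shaped schedule (`alpha_bar`) and a made-up four-value "image"; the function names are hypothetical and not taken from any library.

```python
import math
import random

def alpha_bar(t):
    """Share of signal variance kept at time t in [0, 1] (assumed cosine schedule)."""
    return math.cos(0.5 * math.pi * t) ** 2

def add_noise(x0, t, rng):
    """Forward process: mix the clean sample with Gaussian noise according to t."""
    a = alpha_bar(t)
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return x_t, noise

rng = random.Random(0)
x0 = [0.9, -0.4, 0.1, 0.7]  # made-up four-"pixel" image

# At t=0 the sample is untouched; by t=1 it is essentially pure noise.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    x_t, _ = add_noise(x0, t, rng)
    weight = math.sqrt(alpha_bar(t))
    print(f"t={t:.2f}  signal weight={weight:.3f}  x_t={[round(v, 2) for v in x_t]}")
```

During training, a network is typically shown the noisy `x_t` and the time `t` and asked to predict the noise that was added (or, equivalently, the clean sample). Generation then runs that prediction repeatedly, starting from pure noise.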

Why are diffusion models so widely used today?

Diffusion models produce high-quality, diverse outputs that are more stable to train and better at following text conditioning than earlier generative architectures like GANs. Their ability to scale with compute and handle a wide range of conditioning inputs made them the dominant architecture in modern AI image and video generation.

What is a latent diffusion model?

A latent diffusion model operates in a compressed representation of the image called latent space rather than on the full-resolution pixels directly. This significantly reduces computational requirements while maintaining output quality, and is the approach used by Stable Diffusion and many other production image generation systems.
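
A quick back-of-the-envelope calculation shows where the savings come from. The figures below are an assumption based on the configuration commonly reported for Stable Diffusion v1 (a 512x512 RGB image, 8x downsampling per side, 4 latent channels); other models use different sizes.

```python
def num_values(height, width, channels):
    """Number of values the denoising network must process at each step."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)             # full-resolution RGB image
latent_values = num_values(512 // 8, 512 // 8, 4)  # compressed latent (assumed 8x, 4 channels)

print(pixel_values)                  # 786432
print(latent_values)                 # 16384
print(pixel_values / latent_values)  # 48.0
```

Under these assumptions, every denoising step works on roughly 48 times fewer values than it would in pixel space. The latent is decoded back to pixels only once, after the final step.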

How does text conditioning work in diffusion models?

A text encoder converts the written prompt into a numerical representation that is provided to the denoising network at each step, guiding which direction the denoising process should move to produce an image consistent with the prompt rather than just any statistically plausible image.
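
The flow described above can be sketched with a deliberately tiny example. Everything here is a stand-in: `encode_prompt` is a hypothetical lookup table rather than a learned text encoder, the "image" is a single brightness value, and `predict_x0` is an exact formula in place of a trained denoising network. The point is only the data flow: the prompt is encoded once, and the same encoding is handed to the denoiser at every step.

```python
import math
import random

# Hypothetical "text encoder": a lookup table standing in for a learned model.
PROMPT_EMBEDDINGS = {"bright": 0.8, "dark": -0.8}
SIGMA = 0.1  # assumed spread of toy "images" around each prompt's target value

def encode_prompt(prompt):
    return PROMPT_EMBEDDINGS[prompt]

def alpha_bar(t):
    """Assumed cosine noise schedule on t in [0, 1]."""
    return math.cos(0.5 * math.pi * t) ** 2

def predict_x0(x_t, t, cond):
    """Toy conditional denoiser: E[x0 | x_t] when data ~ N(cond, SIGMA^2)."""
    a = alpha_bar(t)
    var_t = a * SIGMA**2 + (1.0 - a)
    return cond + math.sqrt(a) * SIGMA**2 * (x_t - math.sqrt(a) * cond) / var_t

def generate(prompt, num_steps=30, seed=0):
    cond = encode_prompt(prompt)              # encode the prompt once
    x = random.Random(seed).gauss(0.0, 1.0)   # start from random noise
    for i in range(num_steps, 0, -1):
        t, t_prev = i / num_steps, (i - 1) / num_steps
        a, a_prev = alpha_bar(t), alpha_bar(t_prev)
        x0_hat = predict_x0(x, t, cond)       # conditioning supplied at every step
        eps_hat = (x - math.sqrt(a) * x0_hat) / math.sqrt(1.0 - a)
        x = math.sqrt(a_prev) * x0_hat + math.sqrt(1.0 - a_prev) * eps_hat
    return x

# Same starting noise, different prompts, different results.
print(generate("bright"), generate("dark"))
```

Because both calls start from the same noise, any difference in the output comes from the conditioning alone. Production systems use learned encoders and attention layers to inject the prompt, which this sketch does not attempt to model.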

What are denoising steps and why do they matter?

Denoising steps are the individual iterations of noise removal that the diffusion model performs to produce a final image. More steps give the model more opportunities to refine the image, generally improving quality and detail, but each step requires computation time. Lower step counts generate faster but may produce less refined results.

Which image generation tools use diffusion models?

Most major text-to-image tools use diffusion model architectures, including Stable Diffusion, DALL-E 2, DALL-E 3, Midjourney, and Imagen. Most contemporary AI video generation models are also diffusion-based or heavily influenced by diffusion model principles.

What is the difference between diffusion models and GANs?

GANs use competing generator and discriminator networks trained adversarially and were the dominant approach before diffusion models. GANs are prone to instability and limited diversity. Diffusion models are more stable to train, produce more diverse outputs, and handle text conditioning more reliably, which is why they have replaced GANs for most high-quality generation applications.

Do diffusion models work for video as well as images?

Yes. Video diffusion models extend the architecture to include the temporal dimension, generating coherent sequences of frames rather than individual images. Most modern AI video generation systems are built on or significantly influenced by diffusion model principles applied to temporal sequences.
