Inference
What is Inference?
Inference is what happens when you click 'generate': the AI applies everything it learned during training to produce a new image or video based on your prompt.
At a glance
- Also known as
- Model inference, Generation, Forward pass
- Used for
- Generating images and video from prompts; running AI models to produce new outputs; applying trained model knowledge to user inputs
- Common tools
- Stable Diffusion, Midjourney, Runway, Kling, any AI generation platform
- Related terms
- Diffusion models, Sampling, CFG scale, Latent space, Model distillation
- How it works in simple terms
- A trained AI model contains learned patterns and parameters. During inference, the model takes your input (a text prompt, a reference image, or other conditioning) and runs it through those learned parameters in one or more forward passes; diffusion models repeat the pass once per denoising step. The output reflects both the training data's patterns and the specific guidance you provided.
- Where you encounter this
- Inference is what occurs every time you generate content using an AI tool. The wait time between submitting a prompt and receiving a result is the inference time. Cost-per-generation pricing on AI platforms reflects the computational cost of running inference. When platforms offer speed options, such as draft quality versus high quality or different model sizes, they are offering different inference configurations.
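The idea that a generation is built from repeated forward passes can be illustrated with a toy sketch. This is not a real diffusion model: `toy_forward_pass`, `run_inference`, and the `target` values are hypothetical stand-ins, chosen only to show that each step costs one pass and that more steps bring the result closer to what was asked for.

```python
import random


def toy_forward_pass(sample, target, strength=0.5):
    """Hypothetical stand-in for one model forward pass.

    A real model predicts and removes noise; here each value is simply
    nudged part of the way toward the target.
    """
    return [s + strength * (t - s) for s, t in zip(sample, target)]


def run_inference(target, num_steps, seed=0):
    """Start from random noise and refine it over num_steps passes."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise
    passes = 0
    for _ in range(num_steps):
        sample = toy_forward_pass(sample, target)
        passes += 1  # every step costs one full forward pass
    return sample, passes


def max_error(sample, target):
    """Largest remaining distance between the output and the target."""
    return max(abs(s - t) for s, t in zip(sample, target))


target = [0.2, 0.8, 0.5]  # assumed stand-in for "what the prompt asks for"
draft, draft_passes = run_inference(target, num_steps=4)
final, final_passes = run_inference(target, num_steps=30)

print(f"draft: {draft_passes} passes, error {max_error(draft, target):.4f}")
print(f"final: {final_passes} passes, error {max_error(final, target):.10f}")
```

The draft run finishes in far fewer passes but lands further from the target, which mirrors the speed and quality options described above.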
Ready to create?
Direct scenes, design characters, and ship full films
All-in-one AI creative platform with simple, transparent pricing, no speed throttles, and an infinite Canvas for max creativity.
How it compares
Compared with related concepts
Inference is the operational counterpart to training. Training is the computationally massive, one-time process of building a model's capabilities over millions of examples; inference is the comparatively smaller computation that runs the trained model to produce individual outputs. A model trained once can then be used for countless inference runs, which is why large companies invest heavily in training but can offer inference at relatively low per-generation costs.
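The arithmetic behind that last point can be sketched as follows. All figures are invented for illustration and do not describe any real model or platform; `cost_per_generation` is a hypothetical helper.

```python
def cost_per_generation(training_cost, inference_cost, num_generations):
    """Spread a one-time training cost across many inference runs.

    training_cost: one-time cost of training the model
    inference_cost: compute cost of a single generation
    num_generations: how many generations share the training cost
    """
    if num_generations <= 0:
        raise ValueError("num_generations must be positive")
    return training_cost / num_generations + inference_cost


# Assumed, illustrative numbers only.
TRAINING_COST = 1_000_000.0
INFERENCE_COST = 0.01

for runs in (1_000, 1_000_000, 100_000_000):
    total = cost_per_generation(TRAINING_COST, INFERENCE_COST, runs)
    print(f"{runs:>11,} generations -> {total:,.4f} per generation")
```

As the number of generations grows, the training share of each generation shrinks toward zero and the per-generation cost approaches the inference cost alone.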
Pro tip
When you encounter slow generation times or want to reduce costs, look for settings that control inference steps or quality levels. Reducing steps from the default can produce faster, lower-fidelity outputs suitable for concept exploration, while maximising steps and resolution uses more compute to produce the highest quality result for final production.
Types and variations
- Inference configurations vary by the number of sampling steps used (more steps generally produce higher quality but take longer), the guidance scale applied (how closely the model follows the prompt), the image resolution requested, and the underlying model architecture.
- Batch inference allows multiple generations to run simultaneously, improving throughput.
- Real-time inference optimises for speed above quality, enabling near-instantaneous generation for interactive applications.
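The variations listed above can be pictured as a small settings object plus a batching helper. This is a minimal sketch: `InferenceConfig`, the `DRAFT` and `FINAL` presets, and their values are hypothetical and do not correspond to any particular platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    """Hypothetical bundle of the settings that shape one inference run."""
    num_steps: int          # sampling steps: more is slower, usually higher quality
    guidance_scale: float   # how closely the output follows the prompt
    width: int
    height: int
    batch_size: int = 1     # generations processed together


# Assumed presets for illustration; real defaults vary by model and platform.
DRAFT = InferenceConfig(num_steps=12, guidance_scale=5.0, width=512, height=512, batch_size=4)
FINAL = InferenceConfig(num_steps=50, guidance_scale=7.5, width=1024, height=1024)


def make_batches(prompts, batch_size):
    """Split prompts into groups that would be generated together."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]


prompts = [f"concept sketch {n}" for n in range(1, 7)]
draft_batches = make_batches(prompts, DRAFT.batch_size)
final_batches = make_batches(prompts, FINAL.batch_size)

print(len(draft_batches), "draft batches;", len(final_batches), "final batches")
```

With six prompts, the draft preset groups them into two batches, while the final preset runs each prompt on its own.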
Common use cases
- Inference is central to every AI generation workflow.
- It is what occurs when generating images from prompts, creating video from text or reference images, running style transfers, performing inpainting, upscaling images, or using any AI model to produce new content.
- Understanding inference helps creators manage generation costs, interpret speed and quality tradeoffs, and make informed choices about which models and settings to use for different tasks.
FAQs
Inference is the process of running a trained AI model to generate new outputs (images, video, text, or other content) from user inputs such as prompts or reference images. It is the operational phase that follows training and represents what actually happens when a creator requests a generation.
Training is the process of building a model's capabilities by exposing it to large datasets and adjusting its parameters over many iterations, which makes it a computationally massive, one-time process. Inference is the process of using the already-trained model to generate new outputs, which is comparatively less computationally demanding but still requires significant GPU resources for large models.
Inference time is determined by the number of processing steps the model performs, the resolution of the output, and the size of the model itself. Diffusion models, which iteratively refine noise over multiple denoising steps, are particularly computationally intensive because each step requires a full forward pass through the model, and that pass must be repeated tens or hundreds of times per generation.
The main factors are model size (larger models require more compute per step), the number of denoising steps (more steps mean better quality but longer generation time), output resolution (higher resolution requires more memory and computation), and the hardware available (better GPUs significantly reduce inference time).
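These factors can be combined into a rough first-order estimate. The sketch below assumes, purely for illustration, that inference time scales linearly with steps, pixel count, and model size, and inversely with hardware speed. Real models do not scale this cleanly, and `relative_inference_time` is a hypothetical helper.

```python
def relative_inference_time(num_steps, width, height, model_params_b, hardware_speedup=1.0):
    """Unitless cost estimate under an assumed linear scaling model.

    num_steps: number of denoising steps
    width, height: output resolution in pixels
    model_params_b: model size in billions of parameters
    hardware_speedup: how many times faster the GPU is than a reference GPU
    """
    if hardware_speedup <= 0:
        raise ValueError("hardware_speedup must be positive")
    pixels = width * height
    return num_steps * pixels * model_params_b / hardware_speedup


# Assumed settings for illustration only.
baseline = relative_inference_time(30, 1024, 1024, model_params_b=2.0)
draft = relative_inference_time(15, 512, 512, model_params_b=2.0)
faster_gpu = relative_inference_time(30, 1024, 1024, model_params_b=2.0, hardware_speedup=2.0)

print(f"draft vs baseline: {draft / baseline:.3f}")
print(f"2x GPU vs baseline: {faster_gpu / baseline:.3f}")
```

Under these assumptions, halving the steps and halving each side of the image cuts the estimate to one eighth of the baseline, while a GPU that is twice as fast halves it.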
Most platforms charge per generation based on the computational cost of running inference, which varies with model quality, output resolution, and, for video, the duration of the generated clip. Premium models with higher output quality typically cost more per generation because they consume more compute during inference.
Model distillation is a technique for creating smaller, faster models that approximate the behaviour of larger, more capable ones. Distilled models run inference significantly faster and at lower cost while attempting to maintain most of the quality of the original. Many platforms offer distilled model variants for use cases where speed is more important than maximum quality.
Yes, inference quality is usually adjustable. On most platforms, users can control it through parameters such as the number of sampling steps, the guidance scale, and the choice of sampler. More steps generally produce higher quality at the cost of longer generation times. Some platforms abstract these controls into simple quality presets (draft, standard, and high quality) that adjust the underlying inference settings automatically.
Real-time inference refers to configurations optimised to produce outputs fast enough for interactive applications, in some cases near-instantaneously. Achieving real-time inference typically requires using smaller, distilled models and reducing output resolution or quality, making it suitable for live previews, interactive experiences, or rapid iteration rather than final production.