Dataset
What is a dataset?
A dataset is the collection of examples an AI learns from during training. The quality, diversity, and content of the dataset directly determine what the model knows and what it can generate.
At a glance
- Also known as: Training dataset, Training data, Training set
- Used for: Training AI models from scratch; Fine-tuning models on specific styles or subjects; Evaluating model performance; Understanding the sources of model bias and capability
- Common tools: Data annotation platforms, Web scraping pipelines, Stock image libraries, Synthetic data generation tools
- Related terms: AI model training, Fine-tuning, LoRA, DreamBooth, Overfitting
How it compares
A dataset is the collection of examples used to train a model; the model is the learned system that emerges from the training process. The dataset defines what the model learns from; the model is what applies that learning to new inputs. Changing the dataset produces a different model even if the model architecture stays the same, and training a different architecture on the same dataset also produces different results. The two are interdependent: neither the data nor the architecture alone determines how the final model behaves.
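The dependence on both data and architecture can be illustrated with a deliberately tiny sketch. The two "architectures" here (a line through the origin and a constant predictor) and both toy datasets are assumptions made up for illustration; they stand in for real models and real training data.

```python
def fit_slope(dataset):
    """Toy 'architecture' 1: y = w * x, with w fit by least squares."""
    return sum(x * y for x, y in dataset) / sum(x * x for x, _ in dataset)


def fit_constant(dataset):
    """Toy 'architecture' 2: always predict the mean of the targets."""
    return sum(y for _, y in dataset) / len(dataset)


# Two hypothetical datasets of (input, target) pairs.
dataset_a = [(1, 2), (2, 4), (3, 6)]
dataset_b = [(1, 3), (2, 6), (3, 9)]

# Same architecture, different datasets: the learned parameter differs.
slope_a = fit_slope(dataset_a)
slope_b = fit_slope(dataset_b)
print(slope_a, slope_b)  # 2.0 3.0

# Same dataset, different architectures: predictions for x = 4 differ.
constant_a = fit_constant(dataset_a)
print(slope_a * 4, constant_a)  # 8.0 4.0
```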
Pro tip
When curating a fine-tuning dataset for a custom character or style model, prioritize quality and variation over volume. Ten to thirty high-quality images showing the subject from varied angles, in different lighting conditions, and at different distances will train a more robust and flexible model than a hundred near-identical images from the same angle. Diversity within the dataset produces diversity in what the model can generate.
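One rough way to check a small image set for variation is to tag each image by hand and count how many images share the same combination of tags. The sketch below assumes hypothetical `angle`, `lighting`, and `distance` tags and made-up file names; a real pipeline might compare the images themselves instead of relying on manual tags.

```python
from collections import Counter


def combination(record):
    """The (angle, lighting, distance) tags that describe one image."""
    return (record["angle"], record["lighting"], record["distance"])


def coverage_report(records):
    """Count how many images share each tag combination."""
    return Counter(combination(r) for r in records)


def keep_one_per_combination(records):
    """Keep the first image seen for each tag combination, drop the rest."""
    seen = set()
    kept = []
    for record in records:
        key = combination(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Hypothetical, hand-tagged image set: three images are near-identical.
records = [
    {"file": "a.png", "angle": "front", "lighting": "studio", "distance": "close"},
    {"file": "b.png", "angle": "front", "lighting": "studio", "distance": "close"},
    {"file": "c.png", "angle": "front", "lighting": "studio", "distance": "close"},
    {"file": "d.png", "angle": "side", "lighting": "daylight", "distance": "medium"},
    {"file": "e.png", "angle": "back", "lighting": "dusk", "distance": "far"},
]

report = coverage_report(records)
kept = keep_one_per_combination(records)
```

A combination with a high count in `report` points to redundant images, while `kept` is a trimmed set with one image per combination.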
Types and variations
- A pre-training dataset is the large-scale collection used to train a foundation model from scratch, typically containing billions of examples.
- A fine-tuning dataset is a smaller, curated collection used to specialize an already-trained model on a specific domain, style, or subject.
- A synthetic dataset consists of artificially generated examples rather than real-world data, used when collecting real examples at sufficient scale is impractical.
- A labelled dataset contains explicit annotations, such as text descriptions paired with images, that allow supervised learning.
- An unlabelled dataset contains raw examples without annotations, used in unsupervised and self-supervised learning approaches.
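The difference between labelled and unlabelled examples can be sketched as a simple record type. The `Example` class, its field names, and the file names below are hypothetical; real datasets use many different storage formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    image_path: str                # hypothetical path to an image file
    caption: Optional[str] = None  # annotation; None means unlabelled


def split_by_label(examples):
    """Separate examples that carry an annotation from those that do not."""
    labelled = [e for e in examples if e.caption is not None]
    unlabelled = [e for e in examples if e.caption is None]
    return labelled, unlabelled


examples = [
    Example("img_001.png", "a red bicycle leaning on a wall"),
    Example("img_002.png"),
    Example("img_003.png", "a portrait in soft window light"),
]

labelled, unlabelled = split_by_label(examples)
print(len(labelled), len(unlabelled))  # 2 1
```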
Common use cases
- Training large foundation models on diverse web-scraped image-text pairs to give them broad generative capability across many subjects and styles.
- Fine-tuning existing models on curated small datasets to create specialized character models, style-consistent generators, or brand-specific visual tools.
- Evaluating model performance by testing on held-out examples not seen during training.
- Understanding why a model produces certain outputs, biases, or failure modes by examining the characteristics of its training data.
- Building custom LoRA or DreamBooth models from a personal image set of a specific subject.
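Held-out evaluation, mentioned in the list above, depends on setting some examples aside before training. A minimal sketch, assuming a simple random split; the function name `split_held_out`, the 20% fraction, and the placeholder example names are illustrative choices, not a prescribed method.

```python
import random


def split_held_out(examples, held_out_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and reserve a fraction for evaluation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    # The first n_held_out shuffled items are never shown to the model
    # during training; the remainder is the training set.
    return shuffled[n_held_out:], shuffled[:n_held_out]


examples = [f"example_{i}" for i in range(10)]
train, held_out = split_held_out(examples)
print(len(train), len(held_out))  # 8 2
```

The key property is that the two sets do not overlap, so performance measured on the held-out set reflects examples the model has not seen.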