Tokenization
What is Tokenization?
Tokenization is how AI models split your text into small pieces, called tokens, before processing it. Each token maps to an integer, which gives the model manageable units it can work with mathematically.
At a glance
- Also known as
- Text tokenization
- Subword tokenization
- Byte-pair encoding (BPE)
- Lexical analysis
- Tokenisation
- Used for
- Converting raw text into numerical token sequences for AI model processing
- Handling rare or unusual words through subword decomposition
- Balancing vocabulary size against sequence length in model architecture
- Diagnosing prompt interpretation issues caused by unexpected token splits
- Key features
- Converts text to integer token sequences before model processing
- Subword schemes handle rare words by decomposing them into familiar fragments
- Token boundaries affect how models associate related terms and concepts
- Language, spelling, and formatting choices interact with tokenizer behaviour
Ready to create?
Direct scenes, design characters, and ship full films
All-in-one AI creative platform with simple, transparent pricing, no speed throttles, and an infinite Canvas for max creativity.
How it compares
Compared with related concepts
Tokenization is distinct from but closely related to the concept of vocabulary in language models. A model's vocabulary is the complete set of token types it knows: the fixed list of integer indices and their corresponding text fragments that the tokenizer can produce and the model can process. Tokenization is the process of mapping input text onto sequences drawn from this vocabulary. A model with a larger vocabulary can represent more distinct concepts as single tokens, while a model with a smaller vocabulary may split the same concepts across multiple tokens. Tokenization is also distinct from embedding, the next step in processing: embedding converts each token integer into a high-dimensional numerical vector that encodes its meaning, while tokenization merely converts text into a sequence of integer indices with no semantic information encoded.
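The three concepts above (vocabulary, tokenization, embedding) can be sketched in a few lines. This is a toy illustration only: the five-entry vocabulary, the `<unk>` fallback, and the two-dimensional vectors are made-up placeholders, and the input is assumed to be already split into fragments.

```python
# Vocabulary: a fixed mapping from text fragments to integer indices.
# Hypothetical entries; real vocabularies hold tens of thousands of items.
VOCAB = {"token": 0, "ization": 1, "is": 2, "fun": 3, "<unk>": 4}

# Embedding table: one vector per vocabulary index.
# The numbers are arbitrary placeholders, not learned values.
EMBEDDINGS = [
    [0.1, 0.3],  # "token"
    [0.7, 0.2],  # "ization"
    [0.5, 0.5],  # "is"
    [0.9, 0.1],  # "fun"
    [0.0, 0.0],  # "<unk>"
]


def tokenize(fragments):
    """Map pre-split text fragments to integer indices (no meaning attached)."""
    return [VOCAB.get(fragment, VOCAB["<unk>"]) for fragment in fragments]


def embed(token_ids):
    """Look up the vector for each token index (where meaning is encoded)."""
    return [EMBEDDINGS[token_id] for token_id in token_ids]


ids = tokenize(["token", "ization", "is", "fun"])
vectors = embed(ids)
print(ids)      # [0, 1, 2, 3]
print(vectors)  # one 2-dimensional vector per token
```

The tokenizer output is just a list of indices; only the lookup in the embedding table turns each index into a vector the model can reason over.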
Think of it like…
Imagine reading a hand-written letter where some words are entirely legible and others are smudged or written in an unfamiliar script. Your brain handles the legible words as whole units, understood instantly. For smudged or unfamiliar words, you break them down letter by letter and piece together a best guess from the fragments you can make out. This is roughly how subword tokenization works: familiar common words are processed as single tokens; unusual, rare, or malformed words are split into their component pieces and reconstructed from familiar subword fragments, with the model doing its best to infer the intended meaning from the parts.
Pro tip
When a prompt term is not producing the expected result, consider whether the issue might be tokenization rather than model knowledge. Try replacing unusual spellings, creative compounds, or technical jargon with more standard alternatives that are likely to tokenize as single, well-represented tokens. For example, if a stylistic reference to an obscure technique is not landing, try describing the visual qualities of that technique in plain words rather than using its name: the descriptive language may tokenize and associate more reliably than the name itself. This reframing from labels to descriptions is one of the most effective prompt debugging techniques for tokenization-related interpretation failures.
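To see why a plain description can fare better than an obscure name, it can help to look at how a term splits. The sketch below uses a greedy longest-match rule over a tiny hypothetical vocabulary (`TOY_VOCAB`); this is a simplification, and a real model's tokenizer will split text differently, so treat the output as an illustration of the idea rather than a prediction.

```python
# Hypothetical vocabulary of fragments the toy tokenizer "knows".
TOY_VOCAB = {"water", "color", "paint", "ing", "chiar", "os", "curo"}


def split_into_tokens(word, vocab):
    """Greedy longest-match split; unknown text falls back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            if candidate in vocab or end == start + 1:
                pieces.append(candidate)
                start = end
                break
    return pieces


for term in ["water", "painting", "chiaroscuro", "sfumato"]:
    pieces = split_into_tokens(term, TOY_VOCAB)
    print(f"{term}: {len(pieces)} token(s) {pieces}")
```

With this toy vocabulary, `water` stays whole, `painting` splits into two familiar fragments, and `sfumato` falls apart into single characters, which is the kind of fragmentation the tip suggests working around.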
Types and variations
- The main tokenization approaches represent different trade-offs between vocabulary size, sequence length, and handling of novel vocabulary.
- Word-level tokenization maps each distinct word to a single token, producing short, intuitive sequences but requiring enormous vocabularies and failing entirely on unknown words.
- Character-level tokenization uses individual characters as tokens, minimising vocabulary to a few hundred items but producing very long sequences that are expensive to process.
- Subword tokenization, the dominant approach in modern language models, sits between these extremes: byte-pair encoding (BPE) iteratively merges the most frequent adjacent symbol pairs into composite tokens; WordPiece chooses merges by how much they increase the likelihood of the training data rather than by raw frequency; SentencePiece is a language-agnostic implementation that works directly on raw text, treating whitespace as an ordinary symbol instead of relying on language-specific word splitting, making it more robust across languages and character sets.
- Each scheme produces a different balance of token granularity, vocabulary coverage, and sequence length, which in turn affects how efficiently a model processes prompts and how it handles the boundaries between familiar and novel language.
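The byte-pair encoding idea described in the list above can be sketched as a short training loop. This is a minimal toy version under stated assumptions: it works on a small word list, starts from single characters rather than bytes, and the function name `learn_bpe_merges` is illustrative, not taken from any library.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges pair merges from a list of words (toy BPE)."""
    # Start at character level: every word is a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


words = ["low", "low", "lower", "lowest", "new", "newer"]
merges, corpus = learn_bpe_merges(words, num_merges=2)
print(merges)
print(dict(corpus))
```

After two merges on this word list, the frequent word `low` has become a single symbol, while rarer words such as `newer` are still spelled out from smaller pieces. More merges mean a larger vocabulary and shorter sequences, which is the trade-off the list describes.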
Common use cases
- Tokenization underpins every interaction with a text-based AI system, operating invisibly in the background of all language model usage from conversational AI to generation prompts.
- It becomes explicitly relevant when troubleshooting prompt performance: if a specific term is being ignored, misinterpreted, or conflated with an unrelated concept despite appearing clearly in the prompt, tokenization is one possible cause worth checking.
- Practitioners building AI applications on top of model APIs need to run the model's own tokenizer, or a library that matches it, in their code to estimate token counts accurately for cost management and context window planning.
- For AI video generation creators, tokenization awareness is a diagnostic skill: understanding why an unusual word might not prompt the expected visual association helps guide prompt revision toward terms that the model's tokenizer and training jointly handle more reliably.
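The token-counting use case above can be sketched as a small budget check. Everything here is an assumption for illustration: the function names, the context window size, and the price are made up, and `whitespace_tokenize` is only a stand-in. In practice you would pass in the tokenizer that matches your model, since whitespace splitting usually undercounts.

```python
def whitespace_tokenize(text):
    """Stand-in tokenizer; swap in the tokenizer that matches your model."""
    return text.split()


def count_tokens(text, tokenize):
    """Count tokens using whichever tokenizer callable is supplied."""
    return len(tokenize(text))


def fits_context(prompt, tokenize, context_window, reserved_for_output):
    """Return (fits, used): whether prompt plus reserved output fits the window."""
    used = count_tokens(prompt, tokenize)
    return used + reserved_for_output <= context_window, used


def estimate_cost(token_count, price_per_1k_tokens):
    """Estimate cost from a token count and a per-1,000-token price."""
    return token_count / 1000 * price_per_1k_tokens


prompt = "A slow dolly shot through a rain-soaked neon alley at night"
fits, used = fits_context(
    prompt,
    whitespace_tokenize,
    context_window=4096,      # hypothetical window size
    reserved_for_output=512,  # hypothetical room left for the response
)
print(fits, used)
print(estimate_cost(used, price_per_1k_tokens=0.5))  # hypothetical price
```

Passing the tokenizer in as a parameter keeps the budgeting logic unchanged when you switch models; only the callable needs replacing.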