Token

What is a Token?

A token is the small chunk of text (roughly a word or part of a word) that AI models use as their basic unit of processing, like the individual bricks a model builds its understanding from.

At a glance

Also known as

  • Text token
  • Input token
  • Output token
  • Visual token

Used for

  • Measuring prompt length and context window consumption in AI models
  • Calculating the cost of AI API usage based on tokens processed
  • Representing image patches as visual tokens in multimodal architectures
  • Understanding how model attention is distributed across prompt content

Key features

  • Basic unit of text processing: roughly one word or part of a word
  • Token limits define maximum prompt length, output length, and session memory
  • Extended to visual tokens in multimodal models for image and video inputs
  • Token position and proximity influence how strongly concepts are associated

How it compares

Compared with related concepts

Tokens are related to but distinct from words, characters, and parameters. Words are the human unit of language that tokens approximate; characters are the raw letter-level units that tokens aggregate; parameters are the learned weights within a model's neural network, an entirely different concept that is sometimes confused with tokens in casual discussion. A model's parameter count describes its size and learning capacity, while its context window, measured in tokens, describes how much text it can process at once. A model with more parameters does not necessarily have a larger context window, and a larger context window does not imply more model knowledge or capability. The distinction matters when evaluating AI tools: parameter count is a rough measure of how much a model can learn, while token limits measure how much it can attend to at once.


Think of it like…

Think of a token as a puzzle piece in a very large jigsaw. A word is often one piece, but an unusual or technical word might need to be broken into two or three smaller pieces that the model assembles into meaning from context. The model can only hold a certain number of pieces on the table at once: its context window. If you pour too many pieces onto the table, the oldest ones slide off the edge and are forgotten. This is why very long prompts or conversations can lose track of instructions given far from the current generation point: those tokens have either fallen outside the context window or receive less attention than nearer content.


Pro tip

When writing prompts for AI video or image generation, treat the opening twenty to thirty tokens as prime real estate. Lead with the most critical creative decisions (subject, camera treatment, visual style, lighting) before adding secondary details like background elements, colour temperature, or mood. Models tend to weight earlier tokens more consistently than later ones, and a long prompt that buries the key instruction in paragraph three will often under-execute on that instruction while faithfully following the details described early. If your prompts are consistently long, try a trimming pass that removes any phrase that could be inferred from context, freeing tokens for the genuinely specific creative direction that the model cannot guess.

Types and variations

  • Tokens take different forms depending on the modality and context in which they are used.
  • Text tokens are the standard form: units of language produced by a tokenizer from input text and processed as an ordered sequence by the model's attention layers.
  • Input tokens are those submitted by the user as part of the prompt; output tokens are those generated by the model as its response. The two are often priced differently in commercial AI APIs because generating output one token at a time is typically more computationally intensive than processing input.
  • Visual tokens extend the concept to image data, where an image is divided into fixed-size spatial patches and each patch is converted into a numerical vector that the model processes alongside text tokens.
  • In video models, temporal tokens represent sequences of frames, adding a time dimension to the spatial patch structure.
  • Special tokens, such as those marking the beginning or end of a sequence or separating different content types, are used internally by models to manage context structure.

Common use cases

  • Token awareness is most directly relevant when working with AI models through APIs, where usage is billed per token and where context window limits require careful management of prompt length and conversation history.
  • Developers building AI-powered applications must track cumulative token counts across a session to avoid exceeding context limits and to manage API costs.
  • For creators using AI generation interfaces directly, token considerations become relevant when constructing long, detailed prompts, particularly for complex scenes with multiple subjects, specific stylistic references, and detailed technical instructions, where there is a risk that the prompt's later content will be under-attended by the model.
  • Understanding token allocation also helps explain why multi-subject scenes sometimes under-specify one subject: if the prompt spends many tokens establishing the first subject in detail, fewer tokens remain to describe the second, resulting in unequal generation quality across the composition.

FAQs

What is a token in AI, and why does it matter?

A token is the basic unit of text that an AI model processes. Rather than reading raw characters or complete words, models work on token sequences produced by breaking input text into standardised units using a tokenizer. Token counts matter because they determine prompt length limits, session memory size, and API usage costs, and because a model's ability to attend to content across a very long token sequence decreases for content far from the current generation point, which affects generation quality for long or complex prompts.
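
As a rough illustration of how a tokenizer breaks text into pieces, the sketch below uses a greedy longest-match rule over a tiny, made-up vocabulary. The `VOCAB` set and the `tokenize` function are hypothetical and exist only for this example; production tokenizers typically use learned subword schemes such as byte-pair encoding, with vocabularies of tens of thousands of pieces, and usually handle spaces differently.

```python
# Toy greedy longest-match tokenizer.
# VOCAB is an invented vocabulary for illustration only, not any real model's.
VOCAB = {"the", "cat", "token", "iz", "ation", "un", "break", "able", " "}


def tokenize(text: str) -> list[str]:
    """Split text into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("the cat"))       # ['the', ' ', 'cat']
print(tokenize("tokenization"))  # ['token', 'iz', 'ation']
```

The common words come out as single tokens, while the longer word splits into three pieces, which is the behaviour the answer above describes.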

How many words is a token, roughly?

A useful rule of thumb is that one hundred tokens correspond to approximately seventy-five words in English, meaning one word averages about one and a third tokens. Common short words like "the" or "and" are typically single tokens, while longer or rarer words may split into two or more tokens. Punctuation, spaces, and special characters also consume tokens, so actual word-to-token ratios vary with writing style, vocabulary complexity, and the specific tokenization scheme a model uses.
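
The rule of thumb can be turned into a quick estimator. This is only a sketch: `estimate_tokens` is a hypothetical helper, it counts whitespace-separated words, and it assumes ordinary English prose. An exact count requires the tokenizer of the specific model being used.

```python
import math


def estimate_tokens(text: str) -> int:
    """Estimate token count using the 100-tokens-per-75-words rule of thumb."""
    words = len(text.split())
    # Round up so the estimate errs on the side of a larger budget.
    return math.ceil(words * 100 / 75)


prompt = "A slow dolly shot of a lighthouse at dusk"
print(estimate_tokens(prompt))  # 9 words -> 12
```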

What is a context window, and how does it relate to tokens?

A context window is the maximum number of tokens an AI model can process at once, in effect its working memory. All input tokens (the prompt) and output tokens (the response) count toward this limit. When a conversation or prompt exceeds the context window, earlier content is truncated or down-weighted, meaning the model loses access to information it was given earlier. Context window sizes vary significantly between models, from a few thousand tokens in smaller systems to hundreds of thousands in frontier models.
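
One common way applications cope with this limit is to drop the oldest messages until the conversation fits. The sketch below assumes each message already carries a token count, and reserves part of the window for the response because output tokens count toward the limit too. `fit_to_window` and its parameters are hypothetical names for this example, not any particular API.

```python
def fit_to_window(
    messages: list[tuple[str, int]],
    context_window: int,
    reserved_for_output: int,
) -> list[tuple[str, int]]:
    """Keep the newest (text, token_count) messages that fit the input budget."""
    budget = context_window - reserved_for_output
    kept = []
    used = 0
    # Walk from newest to oldest so the most recent messages survive.
    for text, tokens in reversed(messages):
        if used + tokens > budget:
            break
        kept.append((text, tokens))
        used += tokens
    kept.reverse()  # restore chronological order
    return kept


history = [("intro", 40), ("scene notes", 50), ("latest request", 30)]
print(fit_to_window(history, context_window=120, reserved_for_output=30))
# [('scene notes', 50), ('latest request', 30)]
```

Here the input budget is 90 tokens, so the oldest message is the one that falls out, which mirrors the truncation of earlier content described above.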

Do visual inputs like images also consume tokens?

Yes. In multimodal models that accept image inputs, images are divided into spatial patches and each patch is converted into a visual token. A typical image might generate several hundred visual tokens depending on its resolution and the model's patch size. Higher-resolution images consume more tokens, which means using high-resolution reference images in a multimodal prompt can significantly reduce the remaining token budget for text instructions. Being mindful of image resolution when using visual inputs helps manage context window usage in image-conditioned generation workflows.
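
For a sense of scale, the patch arithmetic can be sketched as below, assuming one token per square patch and no resizing. The function name and the example sizes are illustrative assumptions; real models may resize, tile, or pool patches, so their actual counts can differ.

```python
import math


def visual_token_count(width: int, height: int, patch_size: int) -> int:
    """Count square patches needed to cover an image, one token per patch."""
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return cols * rows


print(visual_token_count(224, 224, 16))  # 14 x 14 = 196
print(visual_token_count(448, 448, 16))  # 28 x 28 = 784
```

Doubling each side of the image quadruples the patch count, which is why resolution has such a large effect on the token budget.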

Why do AI models sometimes ignore instructions buried in a long prompt?

Models distribute attention across the full token sequence, but that attention is not perfectly uniform. Content near the beginning of a prompt and content immediately before the generation point tend to receive the most consistent attention. Instructions buried deep in a long prompt (many hundreds of tokens from the start and well before the end) are at greater risk of being under-weighted, particularly if the prompt is approaching the model's context window limit. Placing the most critical creative instructions early in the prompt and keeping prompts concise reduces this effect.

What is the difference between input tokens and output tokens?

Input tokens are the tokens that make up the prompt submitted to the model: all the text, image patches, or other content provided by the user. Output tokens are the tokens the model generates as its response. In commercial AI APIs, these are typically priced differently because generating output requires a separate forward pass through the model for each token produced, whereas input tokens can be processed in parallel. For generation tasks with long outputs, such as generating a full script or a lengthy creative treatment, output token costs can exceed input token costs significantly.
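
A minimal sketch of how separate input and output pricing combines into a per-request cost is shown below. The prices in the example are placeholders chosen for easy arithmetic, not any provider's actual rates, and `request_cost` is a hypothetical helper.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the cost of one request given per-million-token prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices: 1.00 per million input tokens, 4.00 per million output tokens.
short_prompt_long_script = request_cost(1_000, 4_000, 1.0, 4.0)
print(round(short_prompt_long_script, 6))
```

With these placeholder prices, the 4,000 output tokens account for sixteen times the cost of the 1,000 input tokens, which shows how long outputs can dominate the bill.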

How should I think about tokens when writing video generation prompts?

For video and image generation prompts, token awareness means leading with the most important creative and compositional decisions (subject framing, camera movement, visual style, lighting) before adding secondary details. Models tend to attend most consistently to early tokens, so burying the key instruction in the middle or end of a dense paragraph risks inconsistent execution. Aim for concise, precise prompts that front-load creative specifics and avoid redundant phrasing that consumes tokens without adding new information. Shorter, well-structured prompts often outperform longer, more exhaustive ones for this reason.

Are tokens the same as model parameters?

No. Tokens and parameters describe entirely different aspects of an AI model. Tokens are the units of text or visual input that a model processes at inference time; they describe what goes into and comes out of the model during use. Parameters are the learned numerical weights stored within the model's neural network that encode its knowledge and capabilities; they describe what the model knows and how it processes information. A model with more parameters has more learned capacity, while a model with a larger token context window can process more information at once. These are independent properties that vary separately across different models.
