Hear Seed Audio 1.0
- Documentary narration: speech, warm and measured
- Thriller voice-over: speech, hushed and tense
- Spice-market ambience: sound effects, open-air bed
- Thunderstorm: sound effects, storm to a clap
- Orchestral cue: music, rising strings and brass
- Lo-fi beat: music, soft keys and vinyl
Seed Audio 1.0 use cases
One-pass video audio
Give a video clip its narration, sound design, and music in one generation. Describe the scene, who speaks, what happens, and the mood, and the model handles the full audio track.

Narrated explainers and tutorials
A composed voice with room tone and a light music bed in one output. The narration carries the content, and the model fills the acoustic space so it sounds placed and finished.

Short ads and promos
Spoken line, sound effects, and music as one ready-to-use track. Write the timing into the prompt, and the model hits the beat on the right word and fades the music on cue.

Scripted dialogue and audio drama
Multi-character scenes with distinct voices, accurate emotional delivery, and matching ambience, all in a single prompt. Write the script, label the speakers, and the model casts and directs.

Consistent voice across a series
Clone a character or narrator voice from a reference clip and carry it across every episode or chapter. A single short sample keeps the voice consistent across hours of content.

Audio editing and repair
Extend a take, fill a gap, swap a line, or stitch two segments. The same model that generates original audio handles revision without re-recording the whole track.

How to write a Seed Audio 1.0 prompt
A strong prompt reads like a short scene brief, not a text-to-speech line, so the model fits voice, music, and effects into one scene. Run through SPACE before you send.
| SPACE | Include | Example |
|---|---|---|
| Speaker | Voice character, age, emotion | Calm male narrator, mid-30s, warm |
| Phrasing | The exact line, in quotes | 'Combine the flour and the butter.' |
| Ambience | Acoustic space and background | Soft kitchen ambience, a low oven-fan hum |
| Composition | Music mood, genre, or tempo | Light acoustic guitar, under the voice |
| Extra cues | Timing, effects, transitions | A brief chime at the end, then silence |
Two habits separate strong prompts from generic ones: name the setting, since with no place the model defaults to flat room tone, and cue music timing, where "fades in after the first line" beats a bare "upbeat music."
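As a rough illustration of the SPACE checklist, the sketch below collects the five fields and joins them into one scene brief. The source does not document a required prompt syntax, so the `SceneBrief` class, its field names, and the sentence layout in `to_prompt` are all assumptions made for this example; only the example values come from the table above.

```python
from dataclasses import dataclass


@dataclass
class SceneBrief:
    """One field per SPACE row. The class and field names are illustrative."""
    speaker: str      # voice character, age, emotion
    phrasing: str     # the exact line to be spoken
    ambience: str     # acoustic space and background
    composition: str  # music mood, genre, or tempo, ideally with timing
    extra_cues: str   # timing, effects, transitions

    def missing(self) -> list[str]:
        """Return the names of SPACE fields that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def to_prompt(self) -> str:
        """Join the fields into one plain-text brief (layout is an assumption)."""
        gaps = self.missing()
        if gaps:
            raise ValueError(f"SPACE fields left blank: {', '.join(gaps)}")
        return (
            f"{self.speaker} says: '{self.phrasing}' "
            f"Setting: {self.ambience}. "
            f"Music: {self.composition}. "
            f"Cues: {self.extra_cues}."
        )


brief = SceneBrief(
    speaker="Calm male narrator, mid-30s, warm",
    phrasing="Combine the flour and the butter.",
    ambience="Soft kitchen ambience, a low oven-fan hum",
    composition="Light acoustic guitar, fades in after the first line",
    extra_cues="A brief chime at the end, then silence",
)
print(brief.to_prompt())
```

Refusing to build a prompt with a blank field mirrors the two habits above: an empty `ambience` is exactly the case where the model would fall back to flat room tone.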
Voice cloning with Seed Audio 1.0
Zero-shot voice cloning works from up to three reference clips of about 30 seconds each, with no training. Prepare clips against the CLEAR checklist:
- Clean recording, with little background noise
- Length under 30 seconds per clip
- Emotion aligned to the delivery you want
- Accent consistent within each clip
- Room tone steady across clips
The model reads the vocal character and carries it across the whole generation.
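Two CLEAR items can be checked mechanically before upload: the number of clips and the length of each. The sketch below does only that, using Python's standard `wave` module to read the duration of uncompressed WAV files. The function names and constants are hypothetical. The text above gives both "about 30 seconds" and "under 30 seconds", so this sketch assumes the stricter checklist wording.

```python
import wave

MAX_CLIPS = 3        # "up to three reference clips"
MAX_SECONDS = 30.0   # "Length under 30 seconds per clip" (assumed strict bound)


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file, computed from its header."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def check_reference_clips(durations: list[float]) -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if not durations:
        problems.append("no reference clips supplied")
    if len(durations) > MAX_CLIPS:
        problems.append(f"{len(durations)} clips supplied, limit is {MAX_CLIPS}")
    for index, seconds in enumerate(durations, start=1):
        if seconds >= MAX_SECONDS:
            problems.append(
                f"clip {index} is {seconds:.1f} s, should be under {MAX_SECONDS:.0f} s"
            )
    return problems
```

Clean recording, emotion, accent, and room tone still have to be judged by ear; nothing in this sketch measures them.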
With no clip, describe the voice in text, giving age, accent, and pace rather than "nice" or "professional." A character image also works: the model derives a matching voice from apparent age and character, useful for fictional or animated speakers.
How to use Seed Audio 1.0
Getting a finished track takes four steps, and none of them needs a separate editor.
1. Write the scene brief. Describe who speaks, what they say, the setting, and the mood, following the SPACE checklist above.
2. Set the voice. Clone it from a short reference clip, or define it with a text description or a character image.
3. Generate. One pass returns the voice, music, and sound effects together, already mixed, up to two minutes long.
4. Refine in place. Extend the clip, swap a line, or fill a gap with the editing modes, with no re-recording.
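Because the Generate step is capped at two minutes and the Refine step can extend a clip, a longer piece has to be planned as one initial pass plus extensions. The sketch below works out that split. It assumes each extension is also limited to two minutes, which the text does not state, and the `plan_passes` name and the `"generate"`/`"continue"` labels are hypothetical, not product terminology.

```python
MAX_PASS_SECONDS = 120  # "up to two minutes" per pass (assumed for extensions too)


def plan_passes(total_seconds: int) -> list[tuple[str, int]]:
    """Split a target runtime into an initial pass plus extension passes."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    passes = []
    remaining = total_seconds
    while remaining > 0:
        length = min(remaining, MAX_PASS_SECONDS)
        # The first entry is a fresh generation; every later one extends it.
        mode = "generate" if not passes else "continue"
        passes.append((mode, length))
        remaining -= length
    return passes


print(plan_passes(300))
# [('generate', 120), ('continue', 120), ('continue', 60)]
```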
FAQs
What is inpainting in Seed Audio 1.0?
Inpainting fills a gap between two existing audio segments without re-generating the content around it. You provide the surrounding audio as context, and the model generates only the missing part, matched in voice character and acoustic space to what surrounds it.
Which languages does Seed Audio 1.0 support?
English and Chinese at launch, with broader language support planned. For voice cloning, matching the reference clip language to the output language gives the most consistent result.
Can Seed Audio 1.0 edit existing audio?
Yes. Beyond generating from scratch, the same model extends a clip, fills a gap, swaps a single line, or stitches two takes into one continuous piece, so you can revise a track without re-recording it.
Can Seed Audio 1.0 generate several speakers in one clip?
Yes. Label each line in the prompt, for example Host: ... and Guest: ..., and the model gives each speaker a distinct voice, emotion, and pacing in a single generation. Define additional voices by reference clip, text description, or character image.
How long can a single generation be?
Up to two minutes in a single pass. For longer productions, continuation mode extends the output while preserving voice character, musical style, and consistency with what came before.
How different is Seed Audio 1.0 from text-to-speech?
Significantly. Text-to-speech produces one voice track from written text. Seed Audio 1.0 generates the full scene (voice, background music, and sound effects) together in one output, with editing tools to revise specific sections afterward. The difference in scope is the entire audio production versus only the voice.
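The speaker-labelling convention mentioned in the FAQs (Host: ... and Guest: ...) is easy to produce from structured data. The sketch below renders a list of turns as labelled lines and lists the distinct speakers, so you know how many voices need defining. The function names are hypothetical, and one turn per line is an assumption; the text only says that each line should carry a label.

```python
def format_script(turns: list[tuple[str, str]]) -> str:
    """Render (speaker, text) pairs as 'Speaker: text' lines, one per turn."""
    rendered = []
    for speaker, text in turns:
        speaker, text = speaker.strip(), text.strip()
        if not speaker or not text:
            raise ValueError("every turn needs a speaker label and a line")
        if ":" in speaker:
            raise ValueError(f"speaker label {speaker!r} must not contain a colon")
        rendered.append(f"{speaker}: {text}")
    return "\n".join(rendered)


def distinct_speakers(turns: list[tuple[str, str]]) -> list[str]:
    """Speaker labels in order of first appearance, without repeats."""
    return list(dict.fromkeys(speaker.strip() for speaker, _ in turns))


turns = [
    ("Host", "Welcome back to the show."),
    ("Guest", "Glad to be here."),
    ("Host", "Let's get started."),
]
print(format_script(turns))
print(distinct_speakers(turns))
```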
