Grok Imagine Video 1.5 is xAI's image-to-video model. It generates synchronized audio natively alongside the motion, so dialogue, sound effects, and music arrive in the same pass. This guide covers what the model does, its technical specs, how to get the best results, and how to prompt it. Everything here runs on Morphic, alongside image, music, and audio generation.
Grok Imagine Video 1.5 features and capabilities
| Feature | What it does | Best for |
|---|---|---|
| Native synchronized audio | Dialogue, effects, ambience, and music generate with the motion, in sync | Talking heads, product spots, music clips |
| Animate a still image | Turns a still into motion while holding its light, color, and texture | Photos, product shots, artwork |
| Video extension | Continues from the last frame into a longer sequence | Story sequences, longer beats |
| Reference-guided generation | Holds a character or style across clips via reference images | Consistent characters, style-locked series |
| Prompt following + camera control | Shot type, camera move, and timing cues land as written | Storyboard previews, precise shots |
Native synchronized audio
Audio is generated together with the video in a single pass: spoken dialogue with lip-sync, sound effects, ambient background, and music. You describe the sound in the same prompt as the motion, so there is no separate audio step.
Animate a still image
An image you provide is used as the first frame, and the model animates outward from it. The original lighting, color, and detail are kept rather than regenerated, which suits animating photos, product shots, or finished artwork.

Video extension
An existing clip can continue from its final frame to make a longer shot, keeping the same subject, lighting, and motion. Repeat the step to build a multi-part sequence from one starting clip.
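
As a rough planning aid, the arithmetic behind a multi-part sequence can be sketched in Python. The helper name `plan_segments` is hypothetical, and the sketch assumes each extension step is bound by the same 1 to 15 second limit as a single clip, which this guide does not state explicitly.

```python
from __future__ import annotations


def plan_segments(total_seconds: int, max_clip_seconds: int = 15) -> list[int]:
    """Split a target runtime into clip lengths: one starting clip plus extensions.

    Assumption: every segment, including extensions, may run at most
    max_clip_seconds (15 s is the per-clip limit quoted in this guide).
    """
    if total_seconds < 1:
        raise ValueError("total_seconds must be at least 1")
    if max_clip_seconds < 1:
        raise ValueError("max_clip_seconds must be at least 1")

    full_clips, remainder = divmod(total_seconds, max_clip_seconds)
    segments = [max_clip_seconds] * full_clips
    if remainder:
        segments.append(remainder)  # final, shorter extension
    return segments


if __name__ == "__main__":
    # A 40-second sequence: one starting clip, then two extensions.
    print(plan_segments(40))
```

In practice shorter segments may work better, since one action per clip tends to give cleaner results; passing a smaller `max_clip_seconds` models that choice.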

Reference-guided generation
Reference images guide the style and character without fixing the first frame. The model carries that look into new shots, which keeps a character or visual style consistent across separate generations.

Strong prompt following and camera control
The model follows detailed direction, including the shot type, a specific camera move such as a dolly or pan, and timing for when an action happens. That makes a planned shot easier to reproduce.
Grok Imagine Video 1.5 technical specs
| Spec | Grok Imagine Video 1.5 |
|---|---|
| Provider | xAI |
| Modes | Image-to-video, text-to-video, reference-to-video, video editing, video extension |
| Audio | Native, synchronized (dialogue, effects, ambience, music) |
| Resolution | 480p or 720p |
| Duration | 1 to 15 seconds |
| Frame rate | 24 fps |
| Aspect ratios | 16:9, 9:16, 4:3, 3:4, 3:2, 2:3, 1:1 |
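
The limits in the table can be checked before a request is sent. The following is a minimal local sketch, not part of any Morphic or xAI API: `ClipSettings`, `validate`, and `frame_count` are hypothetical names, and whole-second durations are an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

# Values copied from the spec table above.
ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "4:3", "3:4", "3:2", "2:3", "1:1"}
MIN_SECONDS, MAX_SECONDS = 1, 15
FRAME_RATE = 24  # frames per second


@dataclass(frozen=True)
class ClipSettings:
    """Hypothetical container for one clip's settings."""
    duration_seconds: int
    resolution: str = "720p"
    aspect_ratio: str = "16:9"


def validate(settings: ClipSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings fit the specs."""
    problems = []
    if not MIN_SECONDS <= settings.duration_seconds <= MAX_SECONDS:
        problems.append(
            f"duration must be {MIN_SECONDS} to {MAX_SECONDS} seconds, "
            f"got {settings.duration_seconds}"
        )
    if settings.resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {settings.resolution}")
    if settings.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {settings.aspect_ratio}")
    return problems


def frame_count(settings: ClipSettings) -> int:
    """Frames in the clip at the fixed 24 fps frame rate."""
    return settings.duration_seconds * FRAME_RATE


if __name__ == "__main__":
    clip = ClipSettings(duration_seconds=5)
    print(validate(clip), frame_count(clip))
```

Returning a list of problems rather than raising on the first one lets a caller show every issue at once.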
How to get the best out of Grok Imagine Video 1.5
The model rewards a strong starting frame and a clear, motion-focused brief. A few practices carry most of the quality:
- Start from a still. Generate or attach a 16:9 image first, then animate it. A good first frame is the single biggest lever on the result.
- Keep the motion prompt short and specific. Name the action and one camera move; let the image carry the composition and style.
- Always name the audio. Describe the dialogue, sound effects, ambience, or music in plain language so the model generates sound with the motion instead of a silent clip.
- One action per clip. Pack a single beat into a few seconds and use video extension for longer sequences.
- For talking characters, use a front-facing portrait with the mouth in frame and keep lines short for clean lip-sync.
- Use reference images when a look or a character has to stay steady across clips.
Grok Imagine Video 1.5 prompt guide
A strong prompt reads like a short shot brief, not a caption. Two things drive the result: a clear list of what the shot contains, and concrete wording instead of vague wording.
What goes in a prompt
| Element | What to include | Example |
|---|---|---|
| Subject | Who or what is in frame, described concretely | a presenter in a charcoal sweater |
| Motion | What moves, and how | she smiles and looks to camera |
| Camera | Shot type plus one move | medium shot, slow push-in |
| Audio | Dialogue, effects, ambience, or music | she says, 'Welcome'; soft room tone |
| Duration and aspect ratio | Clip length and aspect ratio | 5 seconds, 16:9 |
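
The elements in the table can be assembled into a brief with a small helper. This is only a sketch: `build_prompt` is a hypothetical function, and the ordering and the `Audio:` label are illustrative choices, not a format the model is known to require.

```python
def _sentence(text: str) -> str:
    """Capitalize the first letter and make sure the text ends with punctuation."""
    text = text.strip()
    text = text[0].upper() + text[1:]  # keep the rest as written, e.g. 'Welcome'
    if not text.rstrip("'\"").endswith((".", "!", "?")):
        text += "."
    return text


def build_prompt(subject: str, motion: str, camera: str, audio: str,
                 duration_seconds: int, aspect_ratio: str) -> str:
    """Join the prompt elements into one short shot brief."""
    fields = {"subject": subject, "motion": motion, "camera": camera,
              "audio": audio, "aspect_ratio": aspect_ratio}
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing prompt elements: {', '.join(missing)}")

    unit = "second" if duration_seconds == 1 else "seconds"
    parts = [
        _sentence(camera),
        _sentence(subject),
        _sentence(motion),
        _sentence("Audio: " + audio.strip()),
        f"{duration_seconds} {unit}, {aspect_ratio.strip()}.",
    ]
    return " ".join(parts)


if __name__ == "__main__":
    # Uses the example column from the table above.
    print(build_prompt(
        subject="a presenter in a charcoal sweater",
        motion="she smiles and looks to camera",
        camera="medium shot, slow push-in",
        audio="she says, 'Welcome'; soft room tone",
        duration_seconds=5,
        aspect_ratio="16:9",
    ))
```

Raising on a missing element makes a silent or camera-less prompt fail early instead of producing a weaker clip.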
Weak vs strong prompts
Name the camera, the motion and its timing, and the audio rather than leaving them to chance.
| Focus | Weak | Strong |
|---|---|---|
| Camera | A woman in a city at night | Handheld tracking shot following a woman through rain-slicked streets, neon reflections, shallow depth of field |
| Motion and timing | The door opens and someone walks in | The door swings open slowly, a figure steps through after a beat, then the camera settles |
| Audio | A chef plating a dish | Close-up of a chef plating a dish, steam rising. Audio: pan sizzle, soft kitchen ambience, and 'Service.' |
Settings to know
| Setting | Notes |
|---|---|
| Duration | 1 to 15 seconds; keep one action per clip |
| Resolution | 480p or 720p |
| Aspect ratio | Follows your input image, or set one explicitly, such as 16:9, 9:16, or 1:1 |
| Reference images | Add them to hold a style or character across clips |
| Longer sequences | Use video extension to continue from the last frame |
Common mistakes
- Leaving the prompt silent: always write at least one sound cue.
- Vague camera: "cinematic" tells the model nothing; name the shot and the move.
- Too much in one clip: one action per clip, then extend.
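
The first two mistakes can be caught with a rough keyword check before generating. The sketch below is a heuristic only: `lint_prompt` and its keyword lists are hypothetical and do not reflect any vocabulary the model is known to use. It does not try to detect too many actions in one clip, which needs human judgment.

```python
from __future__ import annotations

import re

# Illustrative keyword lists (assumptions); extend them to match your own prompts.
AUDIO_PATTERN = re.compile(
    r"\b(?:audio|says?|sounds?|music|ambience|ambient|dialogue|voiceover)\b"
)
CAMERA_PATTERN = re.compile(
    r"\b(?:shot|close-up|push-in|dolly|pan|tilt|zoom|tracking|handheld)\b"
)


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for a missing sound cue or a missing shot or camera move."""
    text = prompt.lower()
    warnings = []
    if not AUDIO_PATTERN.search(text):
        warnings.append("no audio cue")
    if not CAMERA_PATTERN.search(text):
        if "cinematic" in text:
            warnings.append("'cinematic' is vague; name the shot and the move")
        else:
            warnings.append("no named shot or camera move")
    return warnings


if __name__ == "__main__":
    print(lint_prompt("A chef plating a dish"))
    print(lint_prompt(
        "Close-up of a chef plating a dish, steam rising. "
        "Audio: pan sizzle, soft kitchen ambience, and 'Service.'"
    ))
```

A keyword match is a prompt to reread the brief, not proof that the cue is good; an empty result only means nothing obvious is missing.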
Frequently asked questions
How do I get the best results from Grok Imagine Video 1.5?
Start from a strong 16:9 still, keep the motion prompt short and specific, name one camera move, and always include an audio cue. Keep one action per clip and use video extension for longer sequences.
Does Grok Imagine Video 1.5 generate audio?
Yes. Audio generates natively with the video and stays in sync with the motion. A single generation can include lip-synced dialogue, sound effects, ambience, and music, with no separate audio pass.
What inputs does Grok Imagine Video 1.5 accept?
An image plus a text prompt for image-to-video, or a text prompt alone for text-to-video. You can also pass reference images to guide style and character, and continue or modify an existing clip with video extension and editing.
How long can clips be, and at what resolution?
Clips run from 1 to 15 seconds at 480p or 720p, 24 fps. For image-to-video the aspect ratio follows your input image, and you can set a ratio for landscape, square, or vertical delivery.
How is Grok Imagine Video 1.5 different from the original Grok Imagine?
The original Grok Imagine is xAI's cross-modal model spanning text-to-image, image edits, and several video paths. Grok Imagine Video 1.5 is the dedicated video release, tuned for image-to-video with native synchronized audio, lip-synced dialogue, and video extension.

