Happy Horse 1.1 features and capabilities
Happy Horse 1.1 is Alibaba's video model, served on fal and available on Morphic. It generates video and audio together in a single pass, with native lip-sync across seven languages. It also supports reference-to-video with up to nine subjects, nine aspect ratios, and output at 720p or 1080p.
| Feature | What it does | Best for |
|---|---|---|
| Joint audio and video | Generates the clip and its synchronized audio in one pass, with no separate audio step | Dialogue scenes, music clips, talking heads |
| Multilingual lip-sync | Speaks and lip-syncs across 7 languages, with mouth shapes that match the phonetics | Localized ads, multilingual presenters |
| Reference-to-video, up to 9 | Carries up to nine reference subjects into a new scene, each called by index | Ensemble scenes, character-consistent series |
| Image-to-video | Animates a still first frame into a moving 1080p clip with audio | Product shots, key art, photo animation |
| Nine aspect ratios | Delivers in nine ratios, from standard 16:9 and vertical 9:16 to ultrawide 21:9 | Cinematic, vertical, and square delivery |
Joint audio and video in one pass
Happy Horse generates the picture and its sound together rather than adding audio afterward. Spoken dialogue with lip-sync, ambient room tone, sound effects, and music all come out of the same generation, so motion and sound line up from the first frame. You describe the sound in the same prompt as the action.
Multilingual native lip-sync
The model speaks and lip-syncs across English, Mandarin, Cantonese, Japanese, Korean, German, and French. The mouth shapes follow the phonetics of the spoken language rather than being approximated, which makes it a fit for dialogue scenes and localized versions of the same shot.
Reference-to-video with up to 9 subjects
Pass up to nine reference images and refer to each by index in the prompt, as character1 through character9 matching the order you supply them. With up to nine subjects, a full cast can stay recognizable across shots. Describe each subject, then the scene and the action.
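The indexing rule can be sketched in a few lines of Python. This is only an illustration of the naming convention described above: `index_references` and the file names are hypothetical, and nothing here calls fal or Morphic.

```python
MAX_REFERENCES = 9  # limit stated for reference-to-video


def index_references(image_paths):
    """Map reference images to character1..characterN in the order supplied."""
    if not 1 <= len(image_paths) <= MAX_REFERENCES:
        raise ValueError(
            f"expected 1 to {MAX_REFERENCES} reference images, got {len(image_paths)}"
        )
    # enumerate from 1 so the first image becomes character1
    return {f"character{i}": path for i, path in enumerate(image_paths, start=1)}


# Hypothetical file names, used only to show the ordering.
labels = index_references(["anchor.png", "guest.png", "dog.png"])
```

Keeping the mapping in one place makes it easier to check that the label used in the prompt matches the position of the image you actually uploaded.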
Image-to-video
Provide a still first frame, such as a product shot or a character frame, add a prompt describing the motion and the sound, and the model animates outward from that image while holding its lighting and detail. It also runs text-to-video when you have no starting image.
Nine aspect ratios
Deliver in nine ratios: 16:9, 9:16, 1:1, 4:3, 3:4, 21:9, 9:21, 5:4, and 4:5. The same prompt framework produces an ultrawide cinematic cut and a vertical social cut without a separate workflow per format.
Happy Horse 1.1 technical specs
| Spec | Happy Horse 1.1 |
|---|---|
| Provider | Alibaba (served on fal) |
| Modes | Text-to-video, image-to-video, reference-to-video |
| Audio | Native, synchronized, with multilingual lip-sync |
| Languages | 7 (English, Mandarin, Cantonese, Japanese, Korean, German, French) |
| Resolution | 720p or 1080p |
| Duration | 3 to 15 seconds (default 5) |
| Aspect ratios | 16:9, 9:16, 1:1, 4:3, 3:4, 21:9, 9:21, 5:4, 4:5 |
| Reference images | Up to 9 (character1 to character9) |
| Prompt length | Up to 2,500 characters |
| Released | June 2026 |
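As a rough sketch, the limits in the table can be checked before a request is sent. The `ClipRequest` container and the `validate` function are hypothetical names for illustration, not part of any fal or Morphic API; only the limits themselves come from the table.

```python
from dataclasses import dataclass

# Limits copied from the spec table above.
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "21:9", "9:21", "5:4", "4:5"}
MIN_SECONDS, MAX_SECONDS, DEFAULT_SECONDS = 3, 15, 5
MAX_PROMPT_CHARS = 2500
MAX_REFERENCES = 9


@dataclass
class ClipRequest:  # hypothetical container, not a real SDK type
    prompt: str
    resolution: str
    aspect_ratio: str
    duration_seconds: int = DEFAULT_SECONDS
    reference_count: int = 0


def validate(req):
    """Return a list of problems; an empty list means the request is within spec."""
    problems = []
    if req.resolution not in RESOLUTIONS:
        problems.append(f"resolution {req.resolution!r} is not 720p or 1080p")
    if req.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"aspect ratio {req.aspect_ratio!r} is not one of the nine supported")
    if not MIN_SECONDS <= req.duration_seconds <= MAX_SECONDS:
        problems.append(f"duration {req.duration_seconds}s is outside {MIN_SECONDS}-{MAX_SECONDS}s")
    if not 0 <= req.reference_count <= MAX_REFERENCES:
        problems.append(f"{req.reference_count} reference images exceeds {MAX_REFERENCES}")
    if len(req.prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt is {len(req.prompt)} characters, limit is {MAX_PROMPT_CHARS}")
    return problems


ok = ClipRequest(prompt="A chef plates a dish. Audio: pan sizzle.", resolution="1080p", aspect_ratio="21:9")
bad = ClipRequest(prompt="x" * 2501, resolution="4k", aspect_ratio="2:1",
                  duration_seconds=20, reference_count=10)
```

Here `validate(ok)` returns an empty list, while `validate(bad)` reports one problem per violated limit.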
Happy Horse 1.1 use cases
Dialogue and talking-head scenes
Characters speak with synced lip movement, room tone, and timing, generated in one pass. Write the line in the prompt and the audio comes back with the motion.
Multi-character ensemble scenes
Carry up to nine subjects from reference images into a single scene, calling each by index so the whole cast stays recognizable from shot to shot.
Music videos and performance clips
Because video and audio generate together, motion lands on beat from the first pass. Build a performance clip with a score and synced movement in one generation.
Ultrawide cinematic cuts
Use the 21:9 ratio for a widescreen, cinematic frame, then deliver the same scene as a 9:16 vertical from the same prompt.
Multilingual ad localization
Keep the same scene and characters and swap the dialogue across languages with native lip-sync, so one treatment ships in several markets.
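One way to organize a localization run is to hold the scene text fixed and vary only the spoken line. The sketch below is an assumption about workflow, not a documented feature: `localized_prompts` is a hypothetical helper, and the "says in <language>" phrasing simply mirrors the example prompt in the prompt guide.

```python
# The seven languages listed in the spec table.
SUPPORTED_LANGUAGES = {
    "English", "Mandarin", "Cantonese", "Japanese", "Korean", "German", "French",
}


def localized_prompts(scene, speaker, lines_by_language, ambience):
    """Build one prompt per language, changing only the dialogue line."""
    prompts = {}
    for language, line in lines_by_language.items():
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"{language} is not in the supported lip-sync languages")
        prompts[language] = f'{scene} {speaker} says in {language}, "{line}" {ambience}'
    return prompts


prompts = localized_prompts(
    scene="A presenter faces the camera in a bright studio, medium shot.",
    speaker="The presenter",
    lines_by_language={"English": "Welcome back.", "German": "Willkommen zurück."},
    ambience="Soft studio room tone.",
)
```

Each prompt shares the same scene and ambience text, so the only difference between markets is the dialogue.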
How to get the best out of Happy Horse 1.1
Happy Horse rewards a brief that names the motion and the sound together, and a clean set of reference images when characters have to stay consistent. A few practices carry most of the quality:
- Always name the audio. Describe the dialogue, sound effects, ambience, or music in plain language, so the model generates sound that fits the motion instead of a silent clip.
- Write motion, not a photo. Describe how the subject and camera move over the clip, not just how the frame looks at a single instant.
- Index your references. For reference-to-video, refer to each subject as character1, character2, and so on, matching the order you supply the reference images.
- Keep lines short for clean lip-sync. For talking characters, use a front-facing frame with the mouth visible and keep each spoken line brief.
- One beat per clip. Pack a single action into a few seconds rather than crowding several into one generation.
- Pick the ratio up front. Choose 21:9 for a cinematic cut or 9:16 for vertical, since the framing changes how you stage the action.
Happy Horse 1.1 prompt guide
A strong prompt reads like a short shot brief, not a caption. Two things drive the result: a clear list of what the shot contains, and concrete wording in place of vague wording.
What goes in a prompt
| Element | What to include | Example |
|---|---|---|
| Subject | Who or what is in frame, described concretely | a news anchor in a navy suit at a glass desk |
| Motion | What moves, and how | he turns to a second camera and gestures |
| Camera | Shot type plus one move | medium shot, slow push-in |
| Audio | Dialogue, effects, ambience, or music | he says, 'Good evening'; soft studio room tone |
| Format | Duration and aspect ratio | 10 seconds, 16:9 |
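The five elements in the table can be assembled mechanically. The helper below is a minimal sketch with a hypothetical name (`build_prompt`); the sentence order and the "Audio:" label are assumptions modeled on the examples in this guide, not a required syntax.

```python
def _clean(text):
    """Trim whitespace and a trailing period, then capitalize the first letter."""
    text = text.strip().rstrip(".")
    return text[:1].upper() + text[1:]


def build_prompt(subject, motion, camera, audio, duration_seconds, aspect_ratio):
    """Join subject, motion, camera, audio, and format into one shot brief."""
    fields = {"subject": subject, "motion": motion, "camera": camera, "audio": audio}
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError("missing prompt elements: " + ", ".join(missing))
    parts = [
        _clean(subject),
        _clean(motion),
        _clean(camera),
        "Audio: " + _clean(audio),
        f"{duration_seconds} seconds, {aspect_ratio}",
    ]
    return ". ".join(parts) + "."


prompt = build_prompt(
    subject="a news anchor in a navy suit at a glass desk",
    motion="he turns to a second camera and gestures",
    camera="medium shot, slow push-in",
    audio="he says, 'Good evening'; soft studio room tone",
    duration_seconds=10,
    aspect_ratio="16:9",
)
```

Raising an error on an empty element is a deliberate choice here: it stops a prompt from going out without an audio cue or a camera direction.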
Reference and dialogue syntax
For reference-to-video, refer to each subject as character1, character2, and so on, matching the order you supply the reference images. For timed dialogue, mark the spoken lines against the clip's timeline so the lip-sync lands where you want it.
character1 and character2 sit across a café table, warm window light. 0-4s: character1 says in French, "Tu as vu ça?"; 4-8s: character2 laughs and replies, "Incroyable." Soft café ambience, gentle handheld.
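The timed-dialogue part of a prompt like the one above can be generated from a list of beats. This is a sketch under the assumption that the `0-4s:` notation shown in the example is the intended way to mark time windows; `timed_dialogue` is a hypothetical helper.

```python
def timed_dialogue(beats, clip_seconds):
    """Format (start, end, text) beats as '0-4s: ...; 4-8s: ...'.

    Beats must be in order, must not overlap, and must fit inside the clip.
    """
    previous_end = 0
    chunks = []
    for start, end, text in beats:
        if start < previous_end or end <= start or end > clip_seconds:
            raise ValueError(f"bad time window {start}-{end}s for a {clip_seconds}s clip")
        chunks.append(f"{start}-{end}s: {text}")
        previous_end = end
    return "; ".join(chunks)


beats = [
    (0, 4, 'character1 says in French, "Tu as vu ça?"'),
    (4, 8, 'character2 laughs and replies, "Incroyable."'),
]
dialogue = timed_dialogue(beats, clip_seconds=8)
```

The result can be placed between the scene description and the ambience cue, as in the café example.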
Weak vs strong prompts
Name the camera, the motion and its timing, and the audio rather than leaving them to chance.
| Focus | Weak | Strong |
|---|---|---|
| Camera | A woman in a city at night | Handheld tracking shot following a woman through rain-slicked streets, shop lights reflecting on the pavement, shallow depth of field |
| Motion and timing | The door opens and someone walks in | The door swings open slowly, a figure steps through after a beat, then the camera settles into a medium shot |
| Audio | A chef plating a dish | Close-up of a chef plating a dish, steam rising. Audio: pan sizzle, soft kitchen ambience, and the chef calling, 'Service.' |
Common mistakes
- Leaving the prompt silent: always write at least one sound cue, since the model generates audio with the video.
- Vague camera: "cinematic" tells the model nothing; name the shot and the move.
- Unindexed references: for reference-to-video, label each subject as character1, character2, rather than "use these references."
- Too much in one clip: keep one action per clip, and keep spoken lines short for clean lip-sync.
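A simple pre-flight check can catch the first three mistakes before a generation is spent. The keyword lists below are rough, hypothetical heuristics based on plain substring matching, so treat the warnings as prompts to re-read the brief rather than as a reliable classifier.

```python
import re

# Heuristic keyword lists (assumptions, not an official vocabulary).
AUDIO_CUES = ("says", "audio", "sound", "music", "ambience", "room tone", "dialogue")
SHOT_TERMS = ("shot", "close-up", "push-in", "tracking", "handheld", "dolly", "zoom")


def lint_prompt(prompt, reference_count=0):
    """Return warnings for a silent prompt, a vague camera, or unindexed references."""
    text = prompt.lower()
    warnings = []
    if not any(cue in text for cue in AUDIO_CUES):
        warnings.append("no sound cue")
    if not any(term in text for term in SHOT_TERMS):
        warnings.append("no named shot or camera move")
    # Collect every characterN index that the prompt mentions.
    used = {int(n) for n in re.findall(r"character(\d+)", text)}
    missing = [i for i in range(1, reference_count + 1) if i not in used]
    if missing:
        warnings.append(
            "unindexed references: " + ", ".join(f"character{i}" for i in missing)
        )
    return warnings


weak = lint_prompt("A cinematic city at night")
strong = lint_prompt(
    'character1 and character2 sit across a café table, warm window light. '
    'character1 says in French, "Tu as vu ça?" Soft café ambience, gentle handheld.',
    reference_count=2,
)
```

The weak prompt triggers both the sound and camera warnings, while the indexed café prompt passes cleanly.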
FAQs
How do I get the best results from Happy Horse 1.1?
Name the audio in every prompt, since Happy Horse 1.1 generates sound with the video. Describe motion rather than a still frame, and give a shot type with one camera move. For multi-character scenes, index each subject as character1, character2, and so on, and keep spoken lines short for clean lip-sync. Draft at 720p, then re-run the keeper at 1080p.
Does Happy Horse 1.1 generate audio?
Yes. Happy Horse 1.1 generates audio with the video in a single pass, so it stays in sync with the motion. A generation can include lip-synced dialogue, sound effects, ambience, and music, with native lip-sync across seven languages and no separate audio step.
How do I use reference images with Happy Horse 1.1?
Pass up to nine reference images and refer to each by index, as character1 through character9, matching the order you supply them. State which subject comes from which image, then describe the scene and action. Happy Horse 1.1 carries each subject into the new scene so a cast stays recognizable from shot to shot.
What resolutions, durations, and aspect ratios does Happy Horse 1.1 support?
Happy Horse 1.1 outputs 720p or 1080p in clips of 3 to 15 seconds, with a 5-second default. It supports nine aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4, 21:9, 9:21, 5:4, and 4:5. Choose the ratio first, since framing changes how you stage the action.
How do I use Happy Horse 1.1 on Morphic?
Open Morphic, switch the prompt bar to Video mode, and pick Happy Horse 1.1. Describe the scene, attach a still for image-to-video or up to nine reference images for reference-to-video, choose a resolution and aspect ratio, then run the prompt. Audio generates in the same pass.

