🐎 Happy Horse 1.0

Alibaba's top-ranked 15B open model. It debuted anonymously on the Artificial Analysis Video Arena in April 2026 and took #1 in both Text-to-Video and Image-to-Video on blind human preference votes. Its signature trick: joint video and synchronized audio in a single pass, with near-perfect lip-sync across seven languages. It is also fast, averaging about 10 s per generation.

Developer: Alibaba (Taotian / Tongyi) · Launched: Apr 2026 · Capabilities: T2V, I2V, R2V, video edit · Open source

Why it stands out

#1 ranked, blind-tested: 1333 Elo (T2V) and 1392 Elo (I2V), top of public leaderboards by real human preference.
Joint audio-video: a unified single-stream Transformer generates picture and sound together, so audio fits the scene.
7-language lip-sync with industry-low word error rate: English, Mandarin, Cantonese, Japanese, Korean, German, French.
Fast & cheap: ~10 s average generation; the 720p tier costs roughly half as much as 1080p.

Specs at a glance

Parameters: ~15B · 40-layer
Resolution: 1080p (720p tier)
Duration: 3–15 s
Gen speed: ~10 s avg
Prompt length: up to 2,500 chars
Aspect ratios: 16:9 · 9:16 · 1:1 · 4:3 · 3:4

How to access

Available via API on fal (T2V, I2V, R2V, video-edit endpoints), Alibaba Cloud Bailian, and assorted wrappers (MuAPI, etc.). Open-source release includes base model, distilled model, super-resolution module, and inference code, with commercial-use rights.

curl -X POST 'https://happyhorse.app/api/generate' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "happyhorse-1.0/video",
    "prompt": "A cinematic shot of mountains at sunrise",
    "mode": "pro",
    "duration": 5,
    "aspect_ratio": "16:9"
  }'

Modes: pro or std. Turning audio on or off changes the credit cost. The multi_shots option supports multi-prompt sequences, where the total duration is the sum of the individual shot durations.
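
A minimal Python sketch of the request body for the call above, with the spec limits checked before any credits are spent. The fields model, prompt, mode, duration, and aspect_ratio come from the curl example. The shape of multi_shots (a list of objects with prompt and duration keys) and the idea that the 3–15 s range applies to the summed clip are assumptions, so check your provider's docs before relying on them.

```python
import json

# Limits taken from the "Specs at a glance" list.
MIN_SECONDS, MAX_SECONDS = 3, 15
MAX_PROMPT_CHARS = 2500
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MODES = {"std", "pro"}


def build_payload(shots, mode="std", aspect_ratio="16:9"):
    """Build a request body from a list of (prompt, seconds) shots.

    One shot yields the same fields as the curl example. Several shots
    are packed into a "multi_shots" list; the per-shot keys "prompt" and
    "duration" are assumed, not taken from provider documentation.
    """
    if not shots:
        raise ValueError("at least one shot is required")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {sorted(MODES)}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    for prompt, _ in shots:
        if len(prompt) > MAX_PROMPT_CHARS:
            raise ValueError("prompt exceeds 2,500 characters")

    # Total duration is the sum of the per-shot durations.
    total = sum(seconds for _, seconds in shots)
    # Assumption: the 3-15 s range applies to the whole clip.
    if not MIN_SECONDS <= total <= MAX_SECONDS:
        raise ValueError(f"total duration {total}s is outside 3-15 s")

    payload = {
        "model": "happyhorse-1.0/video",
        "mode": mode,
        "duration": total,
        "aspect_ratio": aspect_ratio,
    }
    if len(shots) == 1:
        payload["prompt"] = shots[0][0]
    else:
        payload["multi_shots"] = [
            {"prompt": prompt, "duration": seconds} for prompt, seconds in shots
        ]
    return payload


if __name__ == "__main__":
    body = build_payload(
        [("Wide shot of mountains at sunrise", 4),
         ("Close-up of frost melting on pine needles", 6)],
        mode="pro",
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the -d body of the curl call; a single-shot list produces the same fields as the original example.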

What Happy Horse is best for

T2V

Top-tier general quality, fast. The #1 blind-preference ranking makes it a safe default when you want the best-looking clip fast.

I2V + audio

Talking characters & dubbing. Best-in-class multilingual lip-sync for explainers, presenters, localized ads, and virtual anchors.

R2V

Reference-driven shots with native sound, via fal's reference-to-video endpoint.

Optimal prompt pattern

Same director formula. Because audio is native and prompts can run long (up to 2,500 chars), describe the soundscape and, for talking heads, the exact dialogue line and its language so the lip-sync engine has both.

Medium close-up of a friendly female presenter in a bright modern studio, soft key light from the left, shallow depth of field. She looks into the lens and says warmly in English: "Welcome back, today we're keeping it simple." Natural blinking and subtle head movement, precise lip-sync. Quiet room tone, faint keyboard clack.

A majestic eagle soars through golden sunlit clouds, camera tracking alongside in slow motion, individual feathers catching the light, wind rush and a distant cry. Epic, awe-inspiring tone.
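
The pattern in the two prompts above can be assembled programmatically. This is only a sketch: build_prompt and its argument names are hypothetical helpers, and the "The speaker says in <language>:" phrasing is one way to state the dialogue, not a required syntax. The language list and the 2,500-character cap come from this page.

```python
MAX_PROMPT_CHARS = 2500
LIP_SYNC_LANGUAGES = {
    "English", "Mandarin", "Cantonese", "Japanese",
    "Korean", "German", "French",
}


def build_prompt(visuals, soundscape, dialogue=None, language=None):
    """Join visuals, an optional spoken line, and the soundscape.

    The dialogue is quoted and tagged with its language so the
    lip-sync target is explicit in the prompt text.
    """
    parts = [visuals.strip()]
    if dialogue is not None:
        if language not in LIP_SYNC_LANGUAGES:
            raise ValueError(f"unsupported lip-sync language: {language!r}")
        parts.append(f'The speaker says in {language}: "{dialogue.strip()}"')
    parts.append(soundscape.strip())

    # Skip empty pieces so there are no doubled spaces.
    prompt = " ".join(part for part in parts if part)
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt is {len(prompt)} chars; limit is 2,500")
    return prompt


if __name__ == "__main__":
    print(build_prompt(
        visuals="Medium close-up of a presenter in a bright modern studio.",
        soundscape="Quiet room tone, faint keyboard clack.",
        dialogue="Welcome back, today we're keeping it simple.",
        language="English",
    ))
```

Keeping the dialogue as its own argument makes it easy to enforce short lines or swap the language when localizing the same shot.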

Pro tips

Draft on the 720p tier (~half cost), finalize at 1080p.
Use std for iteration, pro for the keeper.
For dialogue, name the language and keep lines short for the cleanest lip-sync.
Multi-shot: when using multi_shots, make per-shot durations sum to your total.
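
To see what the draft-then-finalize tip saves, here is a rough back-of-the-envelope sketch. It measures cost in units of one 1080p render and takes the "~half" figure at face value as 0.5; real credit prices, and any std/pro or audio differences, are not modeled.

```python
def relative_cost(drafts, finals=1, draft_ratio=0.5):
    """Cost in units of one 1080p render.

    draft_ratio=0.5 encodes the "~half cost" figure for the 720p tier.
    It is an approximation; std/pro and audio pricing are ignored.
    """
    if drafts < 0 or finals < 0:
        raise ValueError("counts must be non-negative")
    return drafts * draft_ratio + finals * 1.0


if __name__ == "__main__":
    drafts = 6
    mixed = relative_cost(drafts)              # 6 drafts at 720p + 1 final at 1080p
    all_1080p = relative_cost(0, drafts + 1)   # the same 7 renders, all at 1080p
    print(f"mixed: {mixed:.1f} units, all 1080p: {all_1080p:.1f} units")
```

Under these assumptions, six drafts plus one keeper cost about 4 units instead of 7.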

Content policy

Happy Horse is geared toward mainstream/commercial use; hosted APIs apply standard filters and permissiveness varies by provider. Open-source availability makes self-hosting possible, but mature-content tooling lags far behind the Wan ecosystem. For adult work, prefer Seedance 2.0 or Wan 2.7; reach for Happy Horse when you want the best-looking SFW clip fast, especially talking characters.
