T2V · I2V · R2V Explained
These models support several generation "modes." Picking the right one is your biggest lever on quality and control. This page covers what each mode does and when to use it.
Text-to-Video (T2V)
You describe a scene in words; the model invents the rest: subject, environment, motion, camera, and (on most 2026 models) audio. No input image anchors the result, so you trade control for creative freedom.
- Use it for: concepting, B-roll, abstract or fantastical scenes, anything where you don't already have a subject image.
- Strength: maximum imagination, fastest to start.
- Weakness: least control over exact appearance; the same prompt yields different faces/looks each run.
Pro move: If you need a specific look, generate a still image first (T2I), then feed it into I2V. You get the imagination of text plus the lock-in of an image.
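The two-step workflow above can be sketched as a small pipeline. This is a minimal illustration, not any platform's API: `still_then_animate`, `text_to_image`, and `image_to_video` are hypothetical names, and the two generator functions are passed in as callables so you can plug in whatever client you actually use.

```python
from typing import Callable


def still_then_animate(
    look_prompt: str,
    motion_prompt: str,
    text_to_image: Callable[[str], bytes],
    image_to_video: Callable[[bytes, str], bytes],
) -> bytes:
    """Lock the look with a still (T2I), then animate that still (I2V).

    Both callables are hypothetical stand-ins for your own T2I and I2V
    clients; nothing here calls a real service.
    """
    # Step 1: the look prompt decides appearance, composition, and lighting.
    still = text_to_image(look_prompt)
    # Step 2: the motion prompt only needs to describe movement and camera,
    # because the still already fixes what the subject looks like.
    return image_to_video(still, motion_prompt)
```

Keeping the look prompt and the motion prompt separate mirrors the split described above: the image carries appearance, the I2V prompt carries motion.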
Image-to-Video (I2V)
You supply a starting image (and sometimes an ending frame); the model animates it. Appearance, composition, lighting, and identity are largely "locked" by the image, so your prompt mostly controls motion and camera.
- Use it for: bringing a specific character/product/photo to life, consistent looks, controlled results, most mature content workflows.
- Strength: highest fidelity to a known subject; far fewer "lottery" re-rolls.
- Weakness: input quality caps output quality. Blurry image in, blurry video out.
First/last frame: Kling 3, Seedance 2.0, and Wan 2.7 let you set both a start frame and an end frame. This is the most reliable way to choreograph a precise transformation (pose A → pose B).
Reference-to-Video (R2V)
The most powerful and most misunderstood mode. Instead of one start frame, you feed the model a library of references (character images, a motion clip, a style board, an audio track) and state in your prompt what to take from each. The model extracts those elements and builds a new video.
This is where Seedance 2.0 and Wan 2.7 shine. A single generation can combine, for example: the character from Image 1, the camera move from Video 1, the lighting style from Image 2, and the voice timbre from Audio 1.
- Use it for: character consistency across shots, motion transfer (copy a dance/camera move onto a new subject), style borrowing, voice cloning, multi-character scenes.
- Strength: director-level control; combine elements no single image could capture.
- Weakness: steeper learning curve. You must say which reference governs which dimension, or the model guesses.
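One way to avoid leaving the model to guess is to assemble the prompt from an explicit reference-to-dimension mapping. The sketch below is illustrative only: `Reference` and `build_r2v_prompt` are hypothetical helpers, the `@Image1` / `@Audio1` tags are assumed by analogy with the `@Video1` notation, and the exact phrasing a given model prefers may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    tag: str      # how the prompt refers to the upload, e.g. "@Image1" (assumed syntax)
    governs: str  # the single dimension this reference controls, e.g. "character"


def build_r2v_prompt(scene: str, references: list[Reference]) -> str:
    """Append one explicit "the X from @Tag" clause per reference."""
    if not references:
        raise ValueError("R2V needs at least one reference")
    claimed: dict[str, str] = {}
    for ref in references:
        # Two references claiming the same dimension would force the model to guess.
        if ref.governs in claimed:
            raise ValueError(
                f"{ref.governs!r} is claimed by both {claimed[ref.governs]} and {ref.tag}"
            )
        claimed[ref.governs] = ref.tag
    clauses = ", ".join(f"the {ref.governs} from {ref.tag}" for ref in references)
    return f"{scene} Use {clauses}."


prompt = build_r2v_prompt(
    "A detective walks through a rainy alley.",
    [
        Reference("@Image1", "character"),
        Reference("@Video1", "camera move"),
        Reference("@Image2", "lighting style"),
        Reference("@Audio1", "voice timbre"),
    ],
)
print(prompt)
```

The duplicate check is the useful part: each dimension has exactly one governing reference, which is the rule the weakness above describes.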
R2V sub-tasks (Seedance terminology)
ByteDance's own guide splits reference workflows into three task types, a breakdown that is useful whichever model you use:
| Task | What it does | Prompt pattern |
|---|---|---|
| Reference | Extract elements (subject, style, motion, sound) to make a new video. | "Refer to the [action/style/sound] in @Video1 to generate…" |
| Edit | Modify part of an existing video; everything unmentioned stays the same. | "Strictly edit @Video1, changing its [original feature] to [new feature]…" |
| Extend | Continue a clip forward (or backward) with consistent identity and style. | "Extend @Video1, generate…" |
When editing or extending, reference the clip directly (e.g. @Video1) rather than saying "Reference Video 1," which the model can misread as a new referencing task.
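The three prompt patterns can be kept as reusable templates. This is a minimal sketch under stated assumptions: `TASK_TEMPLATES` and `build_task_prompt` are hypothetical names, the field names (`feature`, `original`, `new`, `instruction`) are placeholders chosen here, and the wording is copied from the patterns in the table rather than from any official API.

```python
TASK_TEMPLATES = {
    "reference": "Refer to the {feature} in {clip} to generate {instruction}",
    "edit": "Strictly edit {clip}, changing its {original} to {new}",
    "extend": "Extend {clip}, generate {instruction}",
}


def build_task_prompt(task: str, clip: str, **fields: str) -> str:
    """Fill in the pattern for one task type, addressing the clip by its @ tag."""
    # Address the clip directly (e.g. "@Video1"), not as "Reference Video 1".
    if not clip.startswith("@"):
        raise ValueError(f"expected a tag such as '@Video1', got {clip!r}")
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_TEMPLATES)}")
    try:
        return TASK_TEMPLATES[task].format(clip=clip, **fields)
    except KeyError as missing:
        raise ValueError(f"task {task!r} needs field {missing}") from None


edit_prompt = build_task_prompt("edit", "@Video1", original="red coat", new="blue coat")
extend_prompt = build_task_prompt(
    "extend", "@Video1", instruction="the camera pulling back to reveal the city"
)
print(edit_prompt)
print(extend_prompt)
```

Rejecting a clip name that lacks the `@` prefix enforces the note above, so an edit or extend request is less likely to be read as a new referencing task.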
Watch: Reference-to-Video in Venice Studio
Full library: Video Guides.