MultiTalk
MultiTalk creates videos of two people talking from just audio and one image. Natural lip sync, full-body motion and actual back-and-forth dialogue.
Overview
Most AI video stuff makes one person talk. MultiTalk does two. At the same time. In the same frame. And it actually sounds and looks like a real convo.
You feed it one image with two people, two audio tracks and a text prompt. It figures out who’s who and makes a full scene with lip movements, facial expressions, even body gestures.
Who built this? Some smart folks at Meituan, Sun Yat-sen University and HKUST. They’ve worked on similar stuff before, like EchoMimic and Wan2.1, so they’re not new to this.
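To make the input list concrete, here's a rough sketch of how you might bundle and sanity-check those three inputs before handing them to a generation pipeline. The class, its field names and the file names are all made up for illustration. This is not MultiTalk's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConversationRequest:
    """Hypothetical container for the inputs described above (not MultiTalk's real API)."""
    image_path: str         # one image that shows both speakers
    audio_paths: List[str]  # one audio track per speaker
    prompt: str             # text describing the scene and actions

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the request looks complete."""
        found = []
        if not self.image_path:
            found.append("missing reference image")
        if len(self.audio_paths) != 2:
            found.append("expected exactly two audio tracks")
        if not self.prompt.strip():
            found.append("missing text prompt")
        return found


# Placeholder file names, purely for illustration.
request = ConversationRequest(
    image_path="two_people.png",
    audio_paths=["speaker_1.wav", "speaker_2.wav"],
    prompt="Two people chat in a car.",
)
print(request.problems())  # prints []
```

Checking for exactly two tracks up front catches the most likely mistake (passing a single mixed recording) before any heavy model code runs.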
So what makes MultiTalk different?
Other tools glitch when you give them two voices. They’ll animate both people at once or mess up who’s speaking. MultiTalk doesn’t do that. It uses a technique called L-RoPE (Label Rotary Position Embedding) to lock each voice to the right person.
Even cooler? It follows prompts like “the woman puts down her coffee and hugs the man” and actually does it.
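If you're curious how a label-based rotary embedding can tie a voice to a person, here's a toy sketch of the general idea. Each token's vector gets rotated by an angle proportional to its label, so the attention score between an audio token and a video token depends on how far apart their labels are. The label values, the single rotation frequency and the 2-D vectors are all assumptions picked to keep it readable. The real model works on much bigger vectors with many frequencies, so treat this as a cartoon, not the authors' implementation.

```python
import math


def rotate(vec, label, theta=0.05):
    """Rotate a 2-D vector by an angle proportional to its label."""
    angle = label * theta
    x, y = vec
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def attention_score(query, query_label, key, key_label, theta=0.05):
    """Dot product of the rotated query and key; depends only on the label gap."""
    q = rotate(query, query_label, theta)
    k = rotate(key, key_label, theta)
    return q[0] * k[0] + q[1] * k[1]


# Assumed labels: person 1's audio and video region share one label,
# person 2's share another that sits far away.
PERSON_1_LABEL = 2
PERSON_2_LABEL = 22

token = (1.0, 0.0)  # same content vector everywhere, so only labels differ

matched = attention_score(token, PERSON_1_LABEL, token, PERSON_1_LABEL)
mismatched = attention_score(token, PERSON_1_LABEL, token, PERSON_2_LABEL)

print(f"audio 1 -> person 1: {matched:.2f}")
print(f"audio 1 -> person 2: {mismatched:.2f}")
```

With these toy numbers the matched pair scores higher than the mismatched one, which is the whole trick: audio leans toward the region of the frame that carries the same label.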
Start with an image. Then toss in two voices and a prompt. The model lines it all up and brings it to life.
It was first trained on single-speaker videos, then leveled up with convo scenes. They fine-tuned only part of the model, so training doesn’t need tons of compute.
What can you use it for?
Here’s the fun part:
- Make convos for animated shows or videos
- Auto-generate customer service hosts or streamers
- Turn podcast audio into talking scenes
- Teach language with realistic dialogue visuals
It’s released under Apache 2.0 so you're free to use it for work or side gigs. Just don’t use it for anything shady.
You can:
- Use it in your app
- Tweak or retrain it
- Share it as long as you include the license stuff
How much VRAM do you need to run MultiTalk?
MultiTalk only works well with the native WAN model (which is pretty large). Distilled models such as FusionX, CausVid and others hurt its output because they drop classifier-free guidance (CFG) entirely. So while you might be able to run it with as little as 12 GB of VRAM, it'll likely never come close to the best examples you're seeing from it.
The MultiTalk team has done an outstanding job with this tool; however, using those distilled models significantly diminishes the quality that can be attained.
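For context on why dropping CFG matters: classifier-free guidance runs the model twice per step, once with the conditioning and once without, then pushes the result away from the unconditional prediction. Here's a minimal sketch of that combination step in plain Python. The function name and the numbers are made up for illustration and are not taken from MultiTalk's or WAN's code.

```python
def cfg_combine(uncond, cond, scale):
    """Standard CFG mix: start at the unconditional prediction and move
    `scale` times the way toward (and past) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions standing in for model outputs at one denoising step.
uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 3.0]

no_guidance = cfg_combine(uncond_pred, cond_pred, scale=1.0)
with_guidance = cfg_combine(uncond_pred, cond_pred, scale=5.0)

print(no_guidance)    # prints [1.0, 3.0]
print(with_guidance)  # prints [5.0, 11.0]
```

At a scale of 1 the unconditional pass cancels out and you just get the conditional prediction back, which is effectively what running without CFG means. Higher scales lean harder on the conditioning, at the cost of a second model pass per step.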
Tags
Freeware, Apache License 2.0, PC-based, #Video & Animation
This tool is free to use when installed locally and is offered under Apache License 2.0.
In a casual, intimate setting, a man and a woman are engaged in a heartfelt conversation inside a car. The man, sporting a denim jacket over a blue shirt, sits attentively with a seatbelt fastened, his gaze fixed on the woman beside him. The woman, wearing a black tank top and a denim jacket draped over her shoulders, smiles warmly, her eyes reflecting genuine interest and connection. The car's interior, with its beige seats and simple design, provides a backdrop that emphasizes their interaction. The scene captures a moment of shared understanding and connection, set against the soft, diffused light of an overcast day. A medium shot from a slightly angled perspective, focusing on their expressions and body language
Generated on July 30, 2025:
Useful Links
ComfyUI WanVideoWrapper
kijai's ComfyUI wrapper nodes for WanVideo and related models, with support for MultiTalk.
This page was last updated on July 30, 2025 at 7:24 AM