MultiTalk

MultiTalk creates videos of two people talking from just audio and one image. Natural lip sync, full-body motion and actual back-and-forth dialogue.


Overview

Most AI video stuff makes one person talk. MultiTalk does two. At the same time. In the same frame. And it actually sounds and looks like a real convo.

You feed it one image with two people, two audio tracks, and a text prompt. It figures out who’s who and makes a full scene with lip movements, facial expressions, even body gestures.
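To make those inputs concrete, here is a minimal Python sketch that bundles them up and checks the basics. The function name `build_request` and every field name in it are hypothetical, made up for illustration; they are not the project's actual input format.

```python
from pathlib import Path

# Assumption: a few common audio formats, picked only for this illustration.
AUDIO_EXTS = {".wav", ".mp3", ".flac"}


def build_request(image, audio_tracks, prompt):
    """Bundle one image, exactly two audio tracks, and a prompt (hypothetical schema)."""
    if len(audio_tracks) != 2:
        raise ValueError("expected exactly two audio tracks, one per speaker")
    for track in audio_tracks:
        if Path(track).suffix.lower() not in AUDIO_EXTS:
            raise ValueError(f"unsupported audio file: {track}")
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {
        "image": str(image),
        # One entry per speaker, in the order the tracks were given.
        "speakers": {f"person{i}": str(t) for i, t in enumerate(audio_tracks, start=1)},
        "prompt": prompt.strip(),
    }


request = build_request(
    "couple_in_car.png",
    ["man.wav", "woman.wav"],
    "A man and a woman talk inside a car.",
)
print(request)
```

Keeping the two audio tracks as separate, named entries mirrors the core idea: each voice has to stay tied to one specific person in the image.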

Who built this? Some smart folks at Meituan, Sun Yat-sen University, and HKUST. They’ve worked on similar stuff before, like EchoMimic and Wan2.1, so they’re not new to this.

So what makes MultiTalk different?

Other tools glitch when you give them two voices. They’ll animate both at once or mess up who’s speaking. MultiTalk doesn’t do that. It uses something called L-RoPE to lock each voice to the right person.
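A rough way to picture the label idea: give each person's video tokens and that person's audio the same label, then rotate their features by an angle that depends on the label, the way rotary position embeddings (RoPE) do. Features that share a label line up and score higher against each other. The Python sketch below shows only that property. The label values (2 and 22), the 4-number feature vector, and the function names are assumptions for illustration, not the model's real configuration or code.

```python
import math


def rotate(vec, label, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that scale with `label` (RoPE-style)."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = label * base ** (-i / half)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def score(query, key):
    """Plain dot product, standing in for an attention score."""
    return sum(q * k for q, k in zip(query, key))


feature = [1.0, 0.0, 1.0, 0.0]   # toy feature shared by video and audio
MAN_LABEL, WOMAN_LABEL = 2, 22    # hypothetical labels, one per speaker

man_video = rotate(feature, MAN_LABEL)
man_audio = rotate(feature, MAN_LABEL)
woman_audio = rotate(feature, WOMAN_LABEL)

matched = score(man_video, man_audio)       # same label on both sides
mismatched = score(man_video, woman_audio)  # labels differ
print(matched > mismatched)
```

In the real model this would happen on learned features inside attention layers; the sketch only shows why matching labels pull a voice toward the right face.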

Even cooler? It follows prompts like “the woman puts down her coffee and hugs the man” and actually does it.

Start with an image. Then toss in two voices and a prompt. The model lines it all up and brings it to life.

At first it trained with one-talker videos. Then it leveled up with convo scenes. They fine-tuned only part of the model so it doesn’t need tons of power.
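Here is a tiny Python sketch of why tuning only part of a model is cheaper. Everything in it is hypothetical: the layer names, the parameter counts, and the guess that the audio-related layers are the trainable part. It just shows the bookkeeping of splitting a model into frozen and trainable pieces.

```python
# Hypothetical layer names and parameter counts, for illustration only.
layers = {
    "video_backbone.block_0": 400_000_000,
    "video_backbone.block_1": 400_000_000,
    "audio_cross_attention.block_0": 30_000_000,
    "audio_cross_attention.block_1": 30_000_000,
}


def split_trainable(layers, trainable_prefix):
    """Return (trainable, frozen) dicts, selecting trainable layers by name prefix."""
    trainable = {n: c for n, c in layers.items() if n.startswith(trainable_prefix)}
    frozen = {n: c for n, c in layers.items() if n not in trainable}
    return trainable, frozen


# Assumption: only the audio-related layers get updated during fine-tuning.
trainable, frozen = split_trainable(layers, "audio_cross_attention")
share = sum(trainable.values()) / sum(layers.values())
print(f"trainable share of parameters: {share:.1%}")
```

Frozen layers need no gradients or optimizer state, which is where the savings in memory and compute come from.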

What can you use it for?

Here’s the fun part:

Make convos for animated shows or videos
Auto-generate customer service hosts or streamers
Turn podcast audio into talking scenes
Teach language with realistic dialogue visuals

It’s released under Apache 2.0, so you’re free to use it for work or side gigs. Just don’t use it for anything shady.

You can:

Use it in your app
Tweak or retrain it
Share it as long as you include the license stuff

How much VRAM do you need to run MultiTalk?

MultiTalk functions effectively only with the native Wan model (which is pretty large). Distilled models such as FusionX, CausVid, and others hurt its output because they eliminate CFG (classifier-free guidance) entirely. So while you might be able to run it with as little as 12 GB of VRAM, it'll likely never come close to the best examples you're seeing from it.

The MultiTalk team has done an outstanding job with this tool; however, using those distilled models significantly diminishes the quality that can be attained.
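For context on why losing CFG matters, here is the generic classifier-free guidance formula as a small Python sketch. It is the textbook version, not MultiTalk's actual sampler, and the numbers are toy values. It also assumes that running without CFG behaves like a guidance scale of 1.

```python
def cfg_combine(uncond, cond, scale):
    """Generic classifier-free guidance: push the prediction away from the
    unconditional output, toward the conditional one, by `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy prediction with the conditioning dropped
cond = [1.0, 3.0]    # toy prediction with the conditioning applied

guided = cfg_combine(uncond, cond, scale=5.0)    # conditioning is amplified
unguided = cfg_combine(uncond, cond, scale=1.0)  # collapses to `cond`
print(guided, unguided)
```

With a scale above 1, the sampler leans harder on the prompt and audio. At a scale of 1 the formula returns the conditional prediction unchanged, so there is nothing left to turn up.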

Links

Educators and Trainers
Creative Professionals
Content Creators
Media and Film Makers
Marketing and Branding Specialists
Voice and Audio Professionals
Developers and Tech Creators
Nonprofit and Advocacy Creators
Small Business Owners
Entertainment and Performance Artists
Professional Content Creators

Prompt:

In a casual, intimate setting, a man and a woman are engaged in a heartfelt conversation inside a car. The man, sporting a denim jacket over a blue shirt, sits attentively with a seatbelt fastened, his gaze fixed on the woman beside him. The woman, wearing a black tank top and a denim jacket draped over her shoulders, smiles warmly, her eyes reflecting genuine interest and connection. The car's interior, with its beige seats and simple design, provides a backdrop that emphasizes their interaction. The scene captures a moment of shared understanding and connection, set against the soft, diffused light of an overcast day. A medium shot from a slightly angled perspective, focusing on their expressions and body language

Generated on July 30, 2025:

Demo generation released by the developers.


Useful Links

Wan2GP enables MultiTalk on very low VRAM


Run MultiTalk on very low-VRAM hardware (8 GB of VRAM) inside Wan2GP by deepbeepmeep.

Added on: August 12, 2025

ComfyUI WanVideoWrapper


kijai's ComfyUI wrapper nodes for WanVideo and related models support MultiTalk.

Added on: July 30, 2025

This page was last updated on August 12, 2025 at 3:56 AM
