
HuMo Video Model

Name: HuMo
Creator: ByteDance

HuMo (Human-Centric Video Generation via Collaborative Multi-Modal Conditioning) is a video generation framework focused on creating human-centric videos, i.e. videos where people are the main subject, from multi-modal inputs. It can take text prompts, reference images, and/or audio, and produce videos with finer control over a subject's appearance, motion, and so on.

HuMo can handle different inputs:

You can use text plus an image to guide a character’s look.

You can use text with audio to make videos where the motion matches the sound.

Or you can mix text, image, and audio for more control.

It’s said to follow text prompts closely.

If you share a reference photo of a person or their clothing, the video should keep that look consistent across frames.

Audio can drive motion too: lip movements and gestures should match the sound you provide.
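
The project's repository defines its own entry points, so purely as a minimal sketch, here is how the three conditioning modes might be selected; `generate_video` and every argument name below are invented for illustration, not HuMo's real API:

```python
# Hypothetical sketch of HuMo's three conditioning modes.
# The function and argument names are invented for illustration,
# not HuMo's actual interface.
from typing import Optional

def generate_video(
    prompt: str,                             # text guidance is always present
    reference_image: Optional[str] = None,   # photo that fixes the subject's look
    audio: Optional[str] = None,             # waveform that drives lips/gestures
) -> None:
    if reference_image and audio:
        mode = "text + image + audio"        # most control: appearance and motion
    elif reference_image:
        mode = "text + image"                # lock a character's appearance
    elif audio:
        mode = "text + audio"                # motion synced to the sound
    else:
        mode = "text only"
    print(f"Conditioning mode: {mode}")

generate_video("a chef plating a dish", reference_image="chef.png", audio="voiceover.wav")
```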

It supports 480p and 720p output, with better quality at 720p. The model was trained on clips of 97 frames at 25 FPS, so longer videos may lose some flow or consistency.
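
The native clip length follows directly from those training figures; a quick check:

```python
# Clip length implied by the training setup: 97 frames at 25 FPS.
frames, fps = 97, 25
print(f"{frames / fps:.2f} s per clip")  # 3.88 s, just under four seconds
```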

On September 16, 2025, HuMo released its 1.7B model, now officially supported in ComfyUI. It generates a 480p video in about 8 minutes on a 32 GB GPU. The output doesn't look as good as the 17B version's, but audio and visuals mostly stay in sync.
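
From those stated figures, a rough per-frame throughput for the 1.7B model can be estimated; this assumes the 8-minute figure covers one standard 97-frame clip, which the text does not state explicitly:

```python
# Back-of-envelope throughput for the 1.7B model at 480p on a 32 GB GPU.
# Assumes the ~8-minute run produces one 97-frame clip (an assumption).
minutes, frames = 8, 97
print(f"~{minutes * 60 / frames:.1f} s per frame")  # ~4.9 s per frame
```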

No sample outputs available for this model yet.

Where To Find HuMo

If you'd like to access this model, you can explore the following possibilities:

Other Models by ByteDance