VoxCPM2 is the open-source model in the VoxCPM line from OpenBMB, released in mid-April 2026. It follows earlier versions such as VoxCPM-0.5B and VoxCPM1.5. The official model card describes it as a 2B-parameter text-to-speech model. It skips the speech tokenizer entirely and runs on a diffusion-autoregressive design. It was trained on over 2 million hours of multilingual speech, supports 30 languages, and outputs audio at 48 kHz.
OpenBMB is not a one-off group. Its GitHub organization shows ongoing work on foundation models, with projects like MiniCPM and MiniCPM-o. VoxCPM2 connects directly to that line: the repo says it uses a MiniCPM-4 backbone. It fits into a larger model stack rather than standing alone as a single test project.
One big idea here is skipping the speech tokenizer. Earlier VoxCPM research notes that token pipelines can discard voice detail in exchange for stability. The model family instead uses a layered semantic-and-acoustic design to preserve a more natural sound. That research covers VoxCPM as a whole rather than VoxCPM2 specifically, so it works more like the base idea behind the series.
From a user's view it does more than simple text-to-speech. The model runs in three main modes. Standard TTS is the basic one. Voice design lets you create a new voice from a text description, such as age or tone. Then there are two cloning levels: one uses a short audio clip with control over style and pace, while the higher level uses audio plus a transcript for closer voice matching. The repo also includes streaming and deployment paths, so it looks built for real applications, not just demos.
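The mode split above can be sketched as a small dispatcher. This is not the actual VoxCPM2 API; the `TTSRequest` fields and mode names are hypothetical, chosen only to show how the inputs (text description, reference clip, transcript) map onto the three modes described.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TTSRequest:
    text: str
    voice_description: Optional[str] = None     # voice design: text description of the voice
    reference_audio: Optional[str] = None       # cloning: path to a short clip
    reference_transcript: Optional[str] = None  # cloning: transcript of that clip

def select_mode(req: TTSRequest) -> str:
    """Pick a generation mode from what the request provides (hypothetical logic)."""
    if req.reference_audio and req.reference_transcript:
        return "clone_with_transcript"  # closest voice match: audio + transcript
    if req.reference_audio:
        return "clone"                  # clip only, with style/pace control
    if req.voice_description:
        return "voice_design"           # new voice from a text description
    return "standard_tts"               # plain text-to-speech
```

For example, `select_mode(TTSRequest(text="hi", voice_description="calm elderly narrator"))` would route to the voice-design path.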
There are some limits to keep in mind. Voice design and style control can vary between runs. Quality can shift by language depending on training data coverage. Very long or highly expressive inputs can become unstable. OpenBMB also prohibits uses such as impersonation, fraud, and disinformation, and says AI-generated audio should be labeled.
The model outputs speech audio through an AudioVAE V2 pipeline to reach 48 kHz. It can accept 16 kHz reference audio and upsample it internally.
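The 16 kHz to 48 kHz step is a 3x upsampling. The real pipeline does this inside the AudioVAE, but the idea can be illustrated with a minimal linear-interpolation resampler in plain Python (a sketch only; production systems use proper band-limited resampling):

```python
def upsample(samples, factor=3):
    """Naive linear-interpolation upsampling, e.g. 16 kHz -> 48 kHz (factor 3).

    Inserts `factor - 1` interpolated points between each pair of samples.
    """
    out = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])  # keep the final sample
    return out
```

For instance, `upsample([0.0, 3.0], 3)` yields `[0.0, 1.0, 2.0, 3.0]`: the gap between the two input samples is filled with evenly spaced values.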
Some notes on its features:
Controllable cloning. Clones a voice from a short clip and lets you adjust tone or pace; adding a transcript alongside the audio gives closer voice detail.
Streaming. Supports chunked generation through the API, though it is more accurate to call it developer-level streaming or near-real-time TTS than fully optimized real-time out of the box.
Fine-tuning. Supports full SFT and LoRA with roughly 5 to 10 minutes of audio.
Deployment. Uses Nano-vLLM-VoxCPM for higher throughput serving.
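The streaming item above is worth a concrete picture. Chunked generation means the consumer receives fixed-size audio slices as they are produced instead of waiting for the full utterance. This generator is a hypothetical sketch, not the VoxCPM2 API; the chunk size of 4800 samples corresponds to 0.1 s at 48 kHz.

```python
def stream_chunks(samples, chunk_size=4800):
    """Yield fixed-size slices of synthesized audio, as a chunked API might.

    At 48 kHz, chunk_size=4800 means each yielded slice is 0.1 s of audio,
    so playback can begin before synthesis finishes.
    """
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]
```

A consumer would loop over the generator and feed each slice to an audio output buffer, which is what makes the experience near-real-time rather than batch.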
Use cases line up with those features. It fits multilingual narration, custom voices for apps or games, consent-based voice cloning, quick prototyping of voice interfaces, and local TTS setups for teams that do not want a paid closed API.
For a local setup you need Python 3.10 or newer, PyTorch 2.5 or newer, and CUDA 12.0 or newer. The model card lists about 8 GB of VRAM for inference. The weights file itself is around 4.96 GB, but real usage takes more memory due to runtime overhead, buffers, and activations.
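The gap between the 4.96 GB weights file and the ~8 GB VRAM figure can be sanity-checked with simple arithmetic. The overhead factor below is an assumption for illustration, not a published number; real overhead varies with batch size and sequence length.

```python
def estimate_vram_gb(weight_file_gb=4.96, overhead_factor=1.6):
    """Rough VRAM estimate: weights scaled by an assumed runtime overhead
    factor covering buffers, activations, and framework bookkeeping."""
    return weight_file_gb * overhead_factor
```

With these assumed values the estimate lands near 7.9 GB, consistent with the ~8 GB figure on the model card.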
If you'd like to access this model, you can explore the following possibilities: