VoXtream is a research text-to-speech (TTS) system built for fast, real-time speech. The model turns text into audio and starts speaking almost immediately, even while more words are still arriving. That makes it useful for live assistants and chat-style AI, where waiting for a full sentence slows things down.
The system belongs to the audio generator, or TTS model, category. It works as a streaming zero-shot voice model: a short voice sample, just a few seconds long, guides the voice style, and the model then produces speech that follows that voice. It can also run locally, which means users can keep generation private if they want.
VoXtream comes from researchers at KTH Royal Institute of Technology in Stockholm: Nikita Torgashov, Gustav Eje Henter, and Gabriel Skantze. Their paper, "VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency," appeared on arXiv in September 2025 and was later given an oral presentation slot at ICASSP 2026. The team focuses on spoken interaction systems, so the project fits their research direction.
The big idea is full-stream speech generation. Many TTS tools wait for a full sentence before speaking; VoXtream starts from the first word and keeps going as new text arrives. In tests, the system produced audio in chunks of about 80 ms and reached a first-packet latency of around 102 ms on GPU. That is very quick compared with many open streaming TTS tools.
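To put those figures in perspective, the short calculation below converts a chunk duration into samples per chunk and chunks per second, and expresses the first-packet latency in units of chunk length. It is only an illustrative sketch: the 24 kHz sample rate is an assumption made for the example and is not stated in the text above, and the function name `chunk_stats` is hypothetical.

```python
def chunk_stats(chunk_ms: float, sample_rate_hz: int, first_packet_ms: float) -> dict:
    """Derive simple streaming figures from a chunk length and a latency."""
    return {
        # Number of PCM samples carried by one audio chunk.
        "samples_per_chunk": round(sample_rate_hz * chunk_ms / 1000),
        # How many chunks are needed to cover one second of audio.
        "chunks_per_second": 1000 / chunk_ms,
        # First-packet latency expressed as a multiple of the chunk duration.
        "latency_in_chunks": first_packet_ms / chunk_ms,
    }


# 80 ms chunks and 102 ms first-packet latency come from the text;
# 24_000 Hz is an ASSUMED sample rate used only for illustration.
stats = chunk_stats(chunk_ms=80, sample_rate_hz=24_000, first_packet_ms=102)
print(stats["samples_per_chunk"])   # 1920
print(stats["chunks_per_second"])   # 12.5
```

Under that assumption, the first audio arrives a little more than one chunk duration after the request, which is why the system appears to start speaking almost at once.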
The project also supports zero-shot voice adaptation. A short audio prompt, about three to five seconds long, gives the model a voice example. The system can accept up to ten seconds of prompt audio along with its matching text of up to about 250 characters. Target speech text can reach about 1,000 characters, and output audio can last roughly one minute.
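The limits above can be checked before a request is sent to the model. The sketch below is a hypothetical pre-flight check, not part of the VoXtream code: the function name `check_request` and the constant names are invented for illustration, and the thresholds are the approximate figures quoted above.

```python
from typing import List

# Approximate limits quoted in the text; treat them as soft guidelines.
MAX_PROMPT_SECONDS = 10.0
MAX_PROMPT_CHARS = 250
MAX_TARGET_CHARS = 1000


def check_request(prompt_seconds: float, prompt_text: str, target_text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if prompt_seconds > MAX_PROMPT_SECONDS:
        problems.append(f"prompt audio is {prompt_seconds:.1f} s, limit is {MAX_PROMPT_SECONDS:.0f} s")
    if len(prompt_text) > MAX_PROMPT_CHARS:
        problems.append(f"prompt text has {len(prompt_text)} characters, limit is {MAX_PROMPT_CHARS}")
    if len(target_text) > MAX_TARGET_CHARS:
        problems.append(f"target text has {len(target_text)} characters, limit is {MAX_TARGET_CHARS}")
    return problems


# A four-second prompt with short texts passes every check.
print(check_request(4.0, "Hello there.", "This is the sentence to speak."))  # []
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.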
Another point is efficiency. The repo says inference can run with around 2 GB of VRAM, which is light compared with many modern speech models. Training is heavier and was tested on large-memory GPUs such as RTX 3090 and A100 cards.
Users mainly get speech audio as output. Example runs save WAV files, and the streaming setup sends small audio chunks as the speech is generated. The public material does not emphasize consumer-style export tools such as dubbing packs, multi-voice scenes, or specialized audio formats.
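The chunk-by-chunk output described above maps naturally onto incremental WAV writing. The sketch below uses only the Python standard library and does not call VoXtream: `fake_chunks` is a hypothetical stand-in that yields a sine tone in 80 ms pieces, and the 24 kHz, 16-bit mono format is an assumption made for the example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # ASSUMED sample rate, for illustration only
CHUNK_MS = 80         # chunk length quoted in the text


def fake_chunks(n_chunks: int, freq_hz: float = 220.0):
    """Stand-in for a streaming model: yields 16-bit mono PCM chunks."""
    samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000
    for c in range(n_chunks):
        start = c * samples_per_chunk
        samples = [
            int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * (start + i) / SAMPLE_RATE))
            for i in range(samples_per_chunk)
        ]
        # "=" selects native byte order, which is what the wave module expects.
        yield struct.pack(f"={samples_per_chunk}h", *samples)


def write_stream_to_wav(chunks, dest) -> None:
    """Append each chunk to a WAV file (path or file-like object) as it arrives."""
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)


# Write 25 chunks (2 seconds of audio) into an in-memory buffer.
buffer = io.BytesIO()
write_stream_to_wav(fake_chunks(25), buffer)
```

Passing a file path instead of the `BytesIO` buffer would save the same audio to disk; a real integration would replace `fake_chunks` with whatever iterator the model exposes.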
If you'd like to access this model, you can explore the following possibilities: