AI creator tools

LuxTTS audio model

Name: LuxTTS
Also Known As: Lux-TTS
Licence: Apache License 2.0
Creator: Yatharth Sharma

LuxTTS is a text-to-speech and voice cloning model. It takes text and a reference voice clip and generates speech in the reference voice. The public repo and Hugging Face page both use the name LuxTTS.

No clear public version number is shown. The repo mentions a future LuxTTS v1.5, so the current public release appears to predate that, though it is not clearly tagged. The main public forms are the GitHub code, the Hugging Face weights and a hosted fal.ai deployment.

The model is linked to Yatharth Sharma. His public pages focus on speech and audio work. Other projects tied to his profiles include LavaSR, NovaSR, MiraTTS, LinaCodec, FlashSR and GLM ASR Nano.

The code and model weights are open source under Apache-2.0. The hosted fal.ai version is also presented as available for paid or commercial use.

LuxTTS is a voice cloning TTS model built for people who want strong speech results without a big GPU or a slow setup. Its main selling points are 48 kHz audio, very fast generation and speaker cloning from a short voice clip. It aims to sit in the practical local-model space rather than the heavy-model space.

In use, a person gives it text and a short sample of a target voice, and it generates new speech in that voice. The GitHub and Hugging Face pages emphasize speed and efficiency almost as much as audio quality, which sets the project apart from many TTS releases.
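The input-to-output flow can be sketched as a generic cloning pipeline. This is a hypothetical call shape with a stub model standing in for the real network; LuxTTS's actual API may differ, so consult its README for the real entry points:

```python
def clone_speech(model, text, reference_clip):
    """Text plus a short reference clip in; new speech in that voice out."""
    speaker_embedding = model.embed(reference_clip)   # capture the voice
    return model.generate(text, speaker_embedding)    # synthesize speech


class StubModel:
    """Toy stand-in for the real model so the flow is runnable."""

    def embed(self, clip):
        # Fake "speaker embedding": just an average of the samples.
        return sum(clip) / len(clip)

    def generate(self, text, embedding):
        # Fake "audio": one sample per input character.
        return [embedding] * len(text)


audio = clone_speech(StubModel(), "hello", [0.1, 0.2, 0.3])
```

The point is the two-stage shape, embed then generate, which is common to cloning TTS systems regardless of the specific library.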

LuxTTS looks like part of a wider personal audio research line, not just a one-off demo. The same developer also works on speech cleanup, audio upsampling, expressive TTS and neural audio compression tools. That gives the project a more connected feel.

On the technical side, the model is described as ZipVoice-based. It is distilled to 4 sampling steps and paired with a custom 48 kHz vocoder instead of the 24 kHz setup named in the comparison. The repo also says it uses a better sampling method than standard Euler sampling. The project thus builds on an existing base while pushing speed and packaging in a useful way.
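The repo does not document its improved sampler, but the standard Euler sampling it claims to beat is easy to state. A few-step distilled model replaces a long sampling schedule with a handful of large fixed Euler steps; this is a generic sketch, not LuxTTS's code, and `velocity_fn` stands in for the distilled network:

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    With num_steps=4 this mirrors the 4-step generation the repo
    describes: four network evaluations per sample instead of dozens.
    """
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: with constant velocity v(x, t) = 1, the ODE moves x
# from 0 to 1 exactly, regardless of step count.
result = euler_sample(lambda x, t: np.ones_like(x), np.zeros(3), num_steps=4)
```

Fixed-step Euler accrues error on curved trajectories, which is why few-step models either distill the trajectory to be nearly straight or swap in a higher-order sampler.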

The biggest practical claim is that LuxTTS can run in about 1 GB of VRAM. It can also run on CPU and is said to reach about 150x realtime on a single GPU. Those are strong numbers for local voice cloning, but the bigger claims, such as being on par with models 10x larger, should be treated as the developer's own statements rather than widely confirmed results.
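For scale, the 150x realtime figure unpacks into simple arithmetic (the audiobook-chapter numbers below are illustrative, not from the repo):

```python
# Realtime factor: seconds of audio produced per second of compute.
# A 150x realtime claim means 1 second of wall-clock generation
# yields about 150 seconds of speech.
audio_seconds = 150.0
compute_seconds = 1.0
rtf = audio_seconds / compute_seconds

# At that rate, a 10-minute chapter of narration would take about
# 4 seconds of GPU time to generate.
chapter_seconds = 10 * 60
generation_time = chapter_seconds / rtf
```
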

The model supports voice cloning from reference audio. It produces 48 kHz speech output, and the public code saves audio at 48,000 Hz. The project also claims fast inference, 4-step generation, local use with low VRAM needs, and support for GPU, CPU and Apple MPS.

The repo suggests using a reference clip of at least 3 seconds for cloning. The main output is speech audio, and the shown output format is WAV.
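Those two constraints, a minimum 3-second reference and 48,000 Hz WAV output, can be sketched with only the Python standard library. The function names here are illustrative helpers, not LuxTTS's API:

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000     # output rate the public code uses
MIN_REF_SECONDS = 3.0    # minimum reference length the repo suggests

def check_reference_length(num_samples, sample_rate):
    """Reject reference clips shorter than the suggested 3 seconds."""
    seconds = num_samples / sample_rate
    if seconds < MIN_REF_SECONDS:
        raise ValueError(
            f"reference clip is {seconds:.2f}s; need >= {MIN_REF_SECONDS}s"
        )
    return seconds

def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM audio, matching the shown WAV output format."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(samples)}h", *samples))

# Example: a 1-second 440 Hz tone standing in for generated speech.
tone = [
    int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]
save_wav("out.wav", tone)
```

A real integration would hand the reference clip to the model after the length check and write the model's samples instead of the test tone.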

Key Features
No performance evaluations available for this model yet.
No sample outputs available for this model yet.