The neuTTS-JP-150m model, published on Hugging Face at the end of March 2026, generates spoken audio from Japanese text and a reference voice sample.
It has roughly 150 million parameters (sometimes listed as close to 0.2B) and is based on llm-jp-3-150m, fine-tuned for speech generation. Output audio is 24 kHz.
It can be downloaded freely from Hugging Face under the Apache 2.0 license, which was updated in late February 2026 to match the base model. Apache 2.0 generally permits commercial use, provided its conditions are met.
The model targets Japanese only; it is not designed to be multilingual. It uses a modified tokenizer and an audio-token scheme built on NeuCodec. The goal is to turn a small language model into a speech system that outputs 24 kHz audio while remaining relatively lightweight.
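The "audio-token setup" above amounts to extending a text vocabulary with extra token ids that stand for codec codes. The sketch below illustrates the idea; the vocabulary and codebook sizes, and the mapping itself, are assumptions for illustration, not the model's actual configuration.

```python
# Hedged sketch: how a text-only vocabulary might be extended with audio
# codec tokens so a single LM can emit both text and speech tokens.
# Sizes here are illustrative placeholders, not neuTTS-JP-150m's real config.

TEXT_VOCAB_SIZE = 99_000    # assumed size of the base text vocabulary
NUM_CODEC_TOKENS = 65_536   # assumed codec codebook size (illustrative)

def codec_token_id(code: int) -> int:
    """Map a codec code (0..NUM_CODEC_TOKENS-1) to an LM token id
    appended after the text vocabulary."""
    if not 0 <= code < NUM_CODEC_TOKENS:
        raise ValueError(f"codec code out of range: {code}")
    return TEXT_VOCAB_SIZE + code

def is_audio_token(token_id: int) -> bool:
    """True if an LM token id falls in the appended audio range."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + NUM_CODEC_TOKENS

def codec_code(token_id: int) -> int:
    """Invert the mapping before handing codes back to the codec decoder."""
    if not is_audio_token(token_id):
        raise ValueError(f"not an audio token id: {token_id}")
    return token_id - TEXT_VOCAB_SIZE
```

With this layout, generation is ordinary next-token prediction; only ids in the audio range are routed to the codec for waveform decoding.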
It was shared by a developer named tsukemono, who appears to be an independent builder.
The base model comes from a research group at Japan's National Institute of Informatics, so this system builds on existing work rather than starting from scratch. It modifies the tokenizer, changes the input and output format, and updates the weights so the model can handle both text and audio tokens.
Because the tokenizer is heavily modified, the model effectively works only with Japanese. It also targets voice cloning, which is notable for a model this small. The process is straightforward: you provide a voice clip, the system converts it into codec tokens, appends the Japanese text, and the model generates new audio tokens that are decoded back into speech.
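The cloning flow just described can be sketched end to end. In this sketch the codec and LM steps are stubbed out, and every function name and the prompt layout are assumptions for illustration, not the model's actual API.

```python
# Hedged sketch of the voice-cloning flow: reference clip -> codec tokens,
# plus target text -> prompt, then the LM continues with new audio tokens.
# encode_reference / generate_audio_tokens are stubs; a real system would
# run NeuCodec and the fine-tuned LM here.

def encode_reference(voice_clip: bytes) -> list[int]:
    # Stub standing in for codec encoding of the reference audio.
    return [101, 102, 103]

def build_prompt(ref_tokens: list[int], text: str) -> list:
    # Assumed layout: reference audio tokens, then the Japanese text,
    # then a marker telling the LM to continue with audio tokens.
    return ["<ref>"] + ref_tokens + ["</ref>", text, "<audio>"]

def generate_audio_tokens(prompt: list) -> list[int]:
    # Stub standing in for LM sampling of new audio tokens.
    return [201, 202, 203, 204]

def synthesize(voice_clip: bytes, text: str) -> list[int]:
    ref = encode_reference(voice_clip)
    prompt = build_prompt(ref, text)
    # The returned tokens would be decoded by the codec into 24 kHz audio.
    return generate_audio_tokens(prompt)
```

The key design point is that cloning needs no speaker embedding model: the reference voice is conditioned on purely through the codec tokens placed at the start of the prompt.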
There is some interest in the model because it is small and open, so people can run it locally. Documentation is still limited, however: audio samples exist, but there are no clear benchmark results, no strong comparisons, and little detail about performance or hardware requirements.
If you'd like to try the model, the following options are available: