Kani-TTS-2 audio model

Name: Kani-TTS
Version: 2
Variant: en
Licence: Apache License 2.0
Creator: nineninesix

Kani-TTS-2-en is an English text-to-speech model that generates 22 kHz WAV audio. It has around 400 million parameters and is released under the Apache 2.0 license, so you can use it for commercial work. The nineninesix team released it in mid-February 2026 as a free, open-source model.

It focuses on real-time conversation. The model pairs a transformer-style text model with a neural audio codec to produce natural-sounding speech. It is also fast: on newer NVIDIA GPUs it can generate five seconds of audio in about one second, roughly five times faster than real time.
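Speed figures like this are often quoted as a real-time factor (RTF): processing time divided by audio duration, where anything below 1.0 is faster than real time. Here is a minimal sketch of that arithmetic using the five-seconds-in-one-second figure from the description. The function name is just for illustration and is not part of any Kani-TTS API.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in the description: about 1 s of compute for 5 s of audio.
rtf = real_time_factor(1.0, 5.0)
speedup = 1.0 / rtf
print(f"RTF = {rtf:.2f}, about {speedup:.0f}x faster than real time")
```

To measure this on your own hardware, time one synthesis call and divide by the length of the audio it produced.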

This model is the English version of the Kani-TTS-2 series.

With about 400M parameters, it sits in the mid range for TTS models: not tiny, but not huge either. It aims to keep speech clear and natural without the large VRAM budgets of billion-parameter systems; a GPU with roughly 4–6 GB is likely enough.
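A quick back-of-the-envelope check shows why a 400M-parameter model fits in a few gigabytes. The sketch below estimates memory for the weights alone. The precisions shown are assumptions (the description does not say which one the model ships in), and real usage is higher because of activations, the audio codec, and framework overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 400e6  # approximate parameter count from the description

fp32_gb = weight_memory_gb(PARAMS, 4)  # 32-bit floats (assumed precision)
fp16_gb = weight_memory_gb(PARAMS, 2)  # 16-bit floats (assumed precision)

print(f"Weights at 32-bit: ~{fp32_gb:.1f} GB")
print(f"Weights at 16-bit: ~{fp16_gb:.1f} GB")
```

Either way the weights stay well under 2 GB, which leaves headroom for everything else inside a 4–6 GB card.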

Core features. It turns text into speech and saves the result as WAV files. It supports preset voices and voice cloning, lets you adjust speaking speed, and runs entirely locally through a Python interface.

Voices. It includes regionally styled English voices such as Boston, Oakland, and Glasgow. You can also clone a custom voice by extracting speaker embeddings.

Output details. It produces mono WAV audio at 22 kHz and runs faster than real time on modern GPUs. It typically uses around 3 GB of VRAM on RTX 50 series cards.
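If you want to confirm that a generated file matches these output details, Python's standard wave module can read the header. The sketch below builds a silent in-memory WAV as a stand-in for real model output and then inspects it. It assumes "22 kHz" means the common 22,050 Hz rate and 16-bit PCM samples, neither of which the description states outright.

```python
import io
import wave

SAMPLE_RATE = 22050  # assumption: "22 kHz" means the common 22.05 kHz rate
SAMPLE_WIDTH = 2     # assumption: 16-bit PCM samples (2 bytes each)


def make_silent_wav(seconds: float) -> bytes:
    """Build an in-memory mono WAV of silence as a stand-in for model output."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        n_frames = int(seconds * SAMPLE_RATE)
        w.writeframes(b"\x00" * (n_frames * SAMPLE_WIDTH))
    return buf.getvalue()


def describe_wav(data: bytes) -> dict:
    """Read channel count, sample rate, and duration from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate": w.getframerate(),
            "duration_s": w.getnframes() / w.getframerate(),
        }


info = describe_wav(make_silent_wav(5.0))
print(info)
```

For a file on disk, pass the path to wave.open directly instead of wrapping bytes in io.BytesIO.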

Hardware needs. A GPU with about 4 to 6 GB of VRAM works well. You can also run it on CPU, though it will be slower. Inference uses a Python-based pipeline.

Use cases. It works for chatbot speech, voice assistants, game NPC dialogue, audiobooks, and quick voice prototypes. Because it runs fast, it fits interactive systems where response time matters.

The Apache 2.0 license means you can use it in commercial products. So if you want a local-first TTS option instead of paying for closed services, this one is a simple, practical choice.

Kani-TTS-2 Examples

Quite basic but very good for open source. Prebuilt voice: Andrew from San Francisco. Generated on February 16, 2026.

Prebuilt voice Robert is a bit more expressive and confident-sounding. Generated on February 16, 2026.