pocket‑tts (TTS that fits in your CPU) is an open-source text-to-speech model from Kyutai Labs, released in January 2026. It is designed to generate natural-sounding speech locally and efficiently, without requiring a GPU, even on ordinary laptops and desktops.
- Lightweight, CPU-focused TTS engine that runs in real time on standard hardware (e.g., laptop CPUs).
- Voice cloning support: generate speech that imitates a voice from a short audio sample.
- Targeted at developers who want TTS without cloud APIs or GPU requirements.
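For a feel of how a CLI-driven workflow might look from Python, here is a minimal sketch. The subcommand and flag names (`generate`, `--output`) are assumptions for illustration, not the documented interface; consult the tool's own `--help` output for the real ones. The sketch only invokes the binary if it is actually installed.

```python
import shutil
import subprocess

def synthesize(text: str, out_path: str) -> list[str]:
    """Build a pocket-tts CLI invocation and run it if the binary exists.
    NOTE: the subcommand and flag names below are assumptions, not the
    documented interface -- check `pocket-tts --help` for the real ones."""
    cmd = ["pocket-tts", "generate", text, "--output", out_path]
    if shutil.which(cmd[0]):  # only run when the CLI is installed
        subprocess.run(cmd, check=True)
    return cmd

cmd = synthesize("Hello from a local model.", "hello.wav")
```

Wrapping the CLI this way keeps the integration loose: the Python code degrades gracefully when the tool is absent instead of crashing at import time.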
Model & Performance
- ~100 million parameters, making it very small for a modern speech model.
- Low latency: roughly 200 ms to the first audio chunk, with synthesis typically faster than real time on CPUs.
- Runs on as few as 2 CPU cores and does not require GPU-enabled PyTorch.
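"Faster than real time" means the real-time factor (RTF), i.e. synthesis time divided by the duration of the audio produced, is below 1.0. A small helper for checking that from your own measurements:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means audio is produced faster than it plays back."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Example: 2.5 s of compute producing 10 s of audio is 4x real time.
rtf = real_time_factor(2.5, 10.0)
```

Measuring RTF on your own hardware is the quickest way to verify the real-time claim for your workload.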
APIs & Interfaces
- Command-line interface (CLI) for quick text-to-speech generation.
- Python library API for embedding TTS in Python applications.
- Serve mode: a local HTTP service that generates speech via REST calls.
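A serve-mode client is an ordinary HTTP POST. The sketch below constructs (but does not send) such a request with the standard library; the `/tts` path, the port, and the JSON field names are assumptions for illustration, not the documented API of the local server.

```python
import json
import urllib.request

def build_tts_request(text: str, voice: str,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Construct a POST request for a local TTS server.
    NOTE: the /tts path, port, and JSON field names are assumptions,
    not the documented API -- check the server's own docs."""
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_tts_request("Hello, world.", "default")
# With a server running, sending it would be:
#   audio_bytes = urllib.request.urlopen(req).read()
```

Because it is plain HTTP, any language with an HTTP client can use the serve mode, not just Python.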
Voice & Language
- Includes a small catalog of built-in voices.
- Voice cloning: provide a WAV sample to personalize the output voice.
- Primarily English in the core project; third-party tools may supply additional voices.
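Since cloning takes a WAV file as input, it helps to know what a well-formed sample looks like. The sketch below writes a mono 16-bit PCM WAV with the standard library; the 440 Hz tone is only a placeholder for a real recorded speech clip, and the 24 kHz sample rate is an assumption, not a documented requirement of the model.

```python
import math
import struct
import wave

def write_reference_wav(path: str, seconds: float = 1.0, rate: int = 24_000) -> None:
    """Write a mono, 16-bit PCM WAV file. The sine tone is a placeholder;
    a real cloning sample would be a few seconds of clean recorded speech.
    NOTE: the 24 kHz rate is an assumption, not a documented requirement."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.2 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(frames)

write_reference_wav("reference.wav")
```

If your recording is stereo or uses a different bit depth, converting it to mono 16-bit PCM first (e.g., with `ffmpeg` or `sox`) is a safe normalization step.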
Usage Scenarios
- Local TTS for accessibility tools, desktop assistants, and embedded applications.
- Ad-hoc speech synthesis (e.g., reading text aloud).
- Prototyping speech apps without cloud dependency.
Limitations
- Currently CPU-only: no GPU or browser builds yet.
- Primarily English, with limited support for other languages out of the box.
- Does not yet support some advanced features, such as silence control or quantized int8 models.
No sample outputs available for this model yet.