MOSS-TTS is an open-source text-to-speech model that turns written text into natural, human-like audio. It is built for both practical use and research. It belongs to the MOSS-TTS family, which also includes tools for voice cloning, long-form narration, live streaming speech and sound effect generation.

The model was released in mid-February 2026 by the MOSI.AI & OpenMOSS Team. The group works with the Shanghai Innovation Institute, Fudan University and MOSI Intelligence. It focuses on large language models and on multimodal systems that handle text, audio and more.

MOSS-TTS creates high-quality speech. The audio sounds natural and keeps a consistent voice tone. It can clone a person’s voice from a short sample with no extra training: you give it a clip and it mimics the speaker.

It also handles very long speech. You can generate hour-long narration without the voice drifting. It supports mixed languages, such as Chinese and English in the same sentence. You also get fine-grained control over pronunciation and timing, down to the phoneme and token level.

Under the hood, the system uses an autoregressive discrete-token method. In simple terms, it builds audio step by step, predicting each learned sound token from the ones that came before it. This helps keep speech stable over long stretches and gives you finer control.
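The step-by-step idea can be illustrated with a toy loop. This is only a sketch of autoregressive generation over discrete tokens, not the MOSS-TTS implementation: the tiny vocabulary, the END_TOKEN value and the next_token_weights function are all hypothetical stand-ins for a trained model and its audio codebook.

```python
import random

VOCAB_SIZE = 8   # hypothetical tiny codebook of "sound tokens"
END_TOKEN = 0    # hypothetical end-of-audio marker


def next_token_weights(history):
    """Stand-in for a trained model: one weight per candidate next token."""
    last = history[-1] if history else 1
    # Favor tokens close to the previous one so the sequence changes smoothly.
    weights = [1.0 / (1 + abs(tok - last)) for tok in range(VOCAB_SIZE)]
    # Stopping becomes more likely as the sequence grows.
    weights[END_TOKEN] = len(history) / 50.0
    return weights


def generate(max_steps=40, seed=0):
    """Sample tokens one at a time, each conditioned on all earlier tokens."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_steps):
        weights = next_token_weights(tokens)
        tok = rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]
        if tok == END_TOKEN:
            break
        tokens.append(tok)
    return tokens


if __name__ == "__main__":
    print(generate())
```

In a real system a neural network would replace next_token_weights, and a separate audio decoder would turn the finished token sequence into a waveform. The loop structure is the part that carries over: every new token depends on everything generated so far.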

The MOSS-TTS family includes several models, each aimed at a different task or hardware budget.

Released Models (MOSS-TTS Family)

All models are developed by the OpenMOSS Team and are available on Hugging Face.

MOSS-TTS (8B)

Architecture: MossTTSDelay
Size: 8B parameters
Flagship full-scale TTS model
Designed for high-quality, long-form speech synthesis
Focus on stability and voice consistency over extended narration

MossTTSLocal (1.7B)

Architecture: MossTTSLocal
Size: 1.7B parameters
Lightweight version for research and local deployment
Lower hardware requirements than the 8B model

MOSS-TTSD-V1.0 (8B)

Architecture: MossTTSDelay
Size: 8B parameters
Enhanced variant of MOSS-TTS
Focused on improved long-form stability and synthesis quality

MOSS-VoiceGenerator (1.7B)

Architecture: MossTTSDelay
Size: 1.7B parameters
Optimized for controlled voice generation and cloning tasks
Smaller footprint for flexible deployment

MOSS-SoundEffect (8B)

Architecture: MossTTSDelay
Size: 8B parameters
Designed for AI-generated sound effects rather than speech
Extends the MOSS ecosystem beyond standard TTS

MOSS-TTS-Realtime (1.7B)

Architecture: MossTTSRealtime
Size: 1.7B parameters
Optimized for low-latency, streaming speech synthesis
Intended for real-time applications such as voice agents

Supported Languages (20 Total)

MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime support:

Chinese (zh), English (en), German (de), Spanish (es), French (fr), Japanese (ja), Italian (it), Hebrew (he), Korean (ko), Russian (ru), Persian/Farsi (fa), Arabic (ar), Polish (pl), Portuguese (pt), Czech (cs), Danish (da), Swedish (sv), Hungarian (hu), Greek (el), Turkish (tr).
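If you are routing text to one of these models, it can help to check the language code first. The snippet below is a minimal sketch built only from the list above; the names SUPPORTED_LANGUAGES, is_supported and unsupported are illustrative and are not part of any MOSS-TTS API.

```python
# Language codes and names copied from the supported-language list above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German", "es": "Spanish",
    "fr": "French", "ja": "Japanese", "it": "Italian", "he": "Hebrew",
    "ko": "Korean", "ru": "Russian", "fa": "Persian/Farsi", "ar": "Arabic",
    "pl": "Polish", "pt": "Portuguese", "cs": "Czech", "da": "Danish",
    "sv": "Swedish", "hu": "Hungarian", "el": "Greek", "tr": "Turkish",
}


def is_supported(code):
    """Return True if the two-letter code is in the supported list."""
    return code.strip().lower() in SUPPORTED_LANGUAGES


def unsupported(codes):
    """Return the codes from an iterable that are not in the supported list."""
    return [c for c in codes if not is_supported(c)]


if __name__ == "__main__":
    print(len(SUPPORTED_LANGUAGES))
    print(unsupported(["zh", "en", "nl"]))
```

A check like this lets a mixed-language pipeline reject or reroute text before it reaches the model, rather than finding out from poor audio.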

You can use these models for audiobooks, game characters, voice agents that switch languages, accessibility reading tools, and dubbing projects where you want tight control over speech.
