MOSS-TTS audio model

Name: MOSS-TTS
Licence: Apache License 2.0
Creator: MOSI.AI

MOSS-TTS is an open-source text-to-speech model that turns written text into natural, human-like audio. It is built for both production use and research. It belongs to the MOSS-TTS family, which also includes tools for voice cloning, long-form narration, live streaming speech and sound effect generation.

The model was released in mid-February 2026 by the MOSI.AI & OpenMOSS Team. This group works with the Shanghai Innovation Institute, Fudan University and MOSI Intelligence. They focus on large language models and multimodal systems that handle text, audio and other kinds of data.

MOSS-TTS produces high-quality speech that sounds natural and keeps a consistent voice. It can also clone a person’s voice from a short sample with no extra training: you give it a reference clip and it mimics the speaker.

It also handles very long speech: you can generate hour-long narration and the voice stays steady. It supports mixed-language input, such as Chinese and English in the same sentence, and gives you fine-grained control over pronunciation and timing, down to the phoneme and token level.

Under the hood, the system uses an autoregressive discrete-token method. In simple terms, it builds audio step by step from learned sound tokens, with each new token predicted from the text and the tokens generated so far. This helps keep speech stable over long stretches and gives you finer control.
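To make the "step by step" idea concrete, here is a toy sketch of autoregressive sampling over discrete tokens. It is not the MOSS-TTS architecture: the vocabulary size, the end-of-sequence token and the `toy_next_token_weights` function are all hypothetical stand-ins for a trained neural network, and the sketch ignores the text prompt and the codec that would turn tokens back into a waveform.

```python
import random

VOCAB_SIZE = 8   # hypothetical codebook size; real audio codecs use far more
EOS = 0          # hypothetical end-of-sequence token id


def toy_next_token_weights(history):
    """Stand-in for a trained model: returns one weight per token id.

    A real system would run a neural network over the text prompt and all
    previously generated audio tokens. Here we simply favour tokens close
    to the previous one, so the dependence on history is visible.
    """
    prev = history[-1] if history else 1
    weights = [1.0 / (1 + abs(tok - prev)) for tok in range(VOCAB_SIZE)]
    weights[EOS] = 0.05 * len(history)  # stopping gets likelier over time
    return weights


def generate(max_steps=50, seed=0):
    """Sample tokens one at a time, each conditioned on the ones before."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_steps):
        weights = toy_next_token_weights(tokens)
        tok = rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


if __name__ == "__main__":
    print(generate())
```

The point of the loop is that every token is chosen with the full history in view, which is the property the description credits for stable long-form output.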

There are different versions within the MOSS-TTS family.

Released Models (MOSS-TTS Family)

All models are developed by the OpenMOSS Team and are available via Hugging Face.

MOSS-TTS (8B)

  • Architecture: MossTTSDelay

  • Size: 8B parameters

  • Flagship full-scale TTS model

  • Designed for high-quality, long-form speech synthesis

  • Focus on stability and voice consistency over extended narration

MossTTSLocal (1.7B)

  • Architecture: MossTTSLocal

  • Size: 1.7B parameters

  • Lightweight version for research and local deployment

  • Lower hardware requirements than the 8B model

MOSS-TTSD-V1.0 (8B)

  • Architecture: MossTTSDelay

  • Size: 8B parameters

  • Enhanced variant of MOSS-TTS

  • Focused on improved long-form stability and synthesis quality

MOSS-VoiceGenerator (1.7B)

  • Architecture: MossTTSDelay

  • Size: 1.7B parameters

  • Optimized for controlled voice generation and cloning tasks

  • Smaller footprint for flexible deployment

MOSS-SoundEffect (8B)

  • Architecture: MossTTSDelay

  • Size: 8B parameters

  • Designed for AI-generated sound effects rather than speech

  • Extends the MOSS ecosystem beyond standard TTS

MOSS-TTS-Realtime (1.7B)

  • Architecture: MossTTSRealtime

  • Size: 1.7B parameters

  • Optimized for low-latency, streaming speech synthesis

  • Intended for real-time applications such as voice agents
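The family listed above can be summarised as a small lookup table, for example to filter models by parameter count before choosing one for local hardware. This is only a sketch: the names, architectures and sizes are copied from the list, while `ModelInfo`, `FAMILY` and `models_up_to` are hypothetical helpers and not part of any MOSS-TTS package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    architecture: str
    params_billions: float
    focus: str


# Entries transcribed from the list above.
FAMILY = [
    ModelInfo("MOSS-TTS", "MossTTSDelay", 8.0, "flagship long-form speech synthesis"),
    ModelInfo("MossTTSLocal", "MossTTSLocal", 1.7, "research and local deployment"),
    ModelInfo("MOSS-TTSD-V1.0", "MossTTSDelay", 8.0, "long-form stability and quality"),
    ModelInfo("MOSS-VoiceGenerator", "MossTTSDelay", 1.7, "voice generation and cloning"),
    ModelInfo("MOSS-SoundEffect", "MossTTSDelay", 8.0, "sound effects"),
    ModelInfo("MOSS-TTS-Realtime", "MossTTSRealtime", 1.7, "low-latency streaming"),
]


def models_up_to(max_params_billions):
    """Return the names of models at or below the given size, in list order."""
    return [m.name for m in FAMILY if m.params_billions <= max_params_billions]


if __name__ == "__main__":
    for name in models_up_to(2.0):
        print(name)
```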

Supported Languages (20 Total)

MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime support:

Chinese (zh), English (en), German (de), Spanish (es), French (fr), Japanese (ja), Italian (it), Hebrew (he), Korean (ko), Russian (ru), Persian/Farsi (fa), Arabic (ar), Polish (pl), Portuguese (pt), Czech (cs), Danish (da), Swedish (sv), Hungarian (hu), Greek (el), Turkish (tr).
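If you plan to feed multilingual text to one of these models, a quick check against the list above can catch unsupported languages early. The sketch below only encodes the 20 codes given here; `SUPPORTED_LANGUAGES` and `unsupported` are hypothetical names, not an API from the MOSS-TTS project.

```python
# Language codes and names transcribed from the list above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German", "es": "Spanish",
    "fr": "French", "ja": "Japanese", "it": "Italian", "he": "Hebrew",
    "ko": "Korean", "ru": "Russian", "fa": "Persian/Farsi", "ar": "Arabic",
    "pl": "Polish", "pt": "Portuguese", "cs": "Czech", "da": "Danish",
    "sv": "Swedish", "hu": "Hungarian", "el": "Greek", "tr": "Turkish",
}


def unsupported(codes):
    """Return the sorted language codes that are not in the supported list."""
    requested = {code.lower() for code in codes}
    return sorted(requested - set(SUPPORTED_LANGUAGES))


if __name__ == "__main__":
    print(unsupported(["en", "ZH", "nl"]))
```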

You can use these models for audiobooks, game characters, voice agents that switch languages, accessibility reading tools, and dubbing projects where you want tight control over speech.

Where To Find MOSS-TTS

If you'd like to access this model, you can explore the following possibilities: