xAI TTS is a paid text-to-speech API that turns written text into spoken audio. The model comes from xAI, the company behind Grok.
The official docs mostly call it Text to Speech or Text to Speech Beta.
It is built as a cloud tool for developers who want natural voice output without local setup. Its appeal is practical: a small voice set, expressive speech tags, several output formats, and support for many languages. It reads more like an API for assistants, IVR, narration, and app voice features than a voice cloning studio.
For MP3 output, bit rates range from 32 kbps to 192 kbps, with a default of 128 kbps. For sample rates, xAI says 44.1 kHz suits CD-quality work and 48 kHz suits studio-grade audio.
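To make the bit rate range concrete, here is a rough size estimate in Python. It is a back-of-the-envelope sketch that assumes constant bit rate encoding and ignores headers and metadata; the function name is illustrative and not part of any xAI SDK.

```python
# Bit rate bounds and default as stated above for MP3 output.
MIN_KBPS = 32
MAX_KBPS = 192
DEFAULT_KBPS = 128


def estimate_mp3_bytes(duration_seconds: float, bitrate_kbps: int = DEFAULT_KBPS) -> int:
    """Approximate MP3 payload size, assuming a constant bit rate."""
    if not MIN_KBPS <= bitrate_kbps <= MAX_KBPS:
        raise ValueError(f"bit rate must be between {MIN_KBPS} and {MAX_KBPS} kbps")
    if duration_seconds < 0:
        raise ValueError("duration must not be negative")
    # 1 kbps = 1000 bits per second, and 8 bits make one byte.
    return int(duration_seconds * bitrate_kbps * 1000 / 8)
```

Under these assumptions, one minute of audio is about 960,000 bytes at the 128 kbps default, 240,000 bytes at 32 kbps, and 1,440,000 bytes at 192 kbps.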
xAI places this tool inside a wider voice system. In its docs it appears next to speech to text tools and real time voice agent tools. That suggests the company sees it as one part of a larger conversation platform, not a solo creator product.
The API takes plain text, a chosen voice, and a language code, then returns raw audio bytes that a developer can save or stream. It supports five built-in voices: eve, ara, rex, sal, and leo. xAI gives each one a different style, from upbeat and casual to more formal and businesslike. The voice range is smaller than some rival TTS tools offer, but that keeps the product simple to test.
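The request shape described above can be sketched in Python as follows. The three inputs and the five voice names come from the description; the JSON field names ("text", "voice_id", "language") and both helper names are assumptions made for illustration, so check them against the official reference before sending anything. The sketch deliberately stops short of the HTTP call, since the endpoint and authentication details are not covered here.

```python
from pathlib import Path
from typing import Dict

# The five built-in voices named above.
VOICES = ("eve", "ara", "rex", "sal", "leo")


def build_tts_payload(text: str, voice: str, language: str) -> Dict[str, str]:
    """Validate the inputs and assemble a request body (field names are hypothetical)."""
    voice_id = voice.lower()
    if voice_id not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, "voice_id": voice_id, "language": language}


def save_audio(audio_bytes: bytes, path: str) -> int:
    """Write the raw audio bytes from a response to disk; returns bytes written."""
    return Path(path).write_bytes(audio_bytes)
```

Validating the voice name on the client side catches typos before a billable request is made.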
One useful feature is support for expressive control tags. A developer can add tags for pauses, laughter, sighs, whispers, faster or slower delivery, stronger emphasis, and pitch changes right inside the text. That gives more control than a basic read-aloud setup, which is a real advantage for app voice features.
The maximum input per request is 15,000 characters. xAI lists 20 language codes plus auto detection. These include English, Arabic variants, Bengali, Chinese, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese variants, Russian, Spanish variants, Turkish, and Vietnamese.
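Text longer than the 15,000 character limit has to be split across several requests. Below is a minimal Python sketch of one way to do that, breaking on sentence boundaries where possible. The function name is illustrative, and the sketch assumes the limit counts every character in the request text, which the description above does not spell out.

```python
import re
from typing import List

MAX_CHARS = 15_000  # per-request input limit stated above


def split_for_tts(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence breaks."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request and the resulting audio joined in order. Note that this simple splitter is unaware of expressive tags, so a tagged span that crosses a chunk boundary would need extra handling.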
The pricing page lists rate limits of 600 requests per minute (RPM) and 10 concurrent requests per team.
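A client can stay under both limits with a small throttle. The Python sketch below is one possible approach, not something xAI prescribes: it caps in-flight requests with a semaphore and spaces request starts 0.1 seconds apart, which is stricter than 600 RPM because it allows no bursts. The class name is hypothetical, and since the limits apply per team, a throttle inside one process only helps if that process is the team's only caller.

```python
import asyncio
import time

MAX_CONCURRENT = 10            # concurrent request limit stated above
MAX_RPM = 600                  # requests per minute limit stated above
MIN_INTERVAL = 60.0 / MAX_RPM  # 0.1 s between request starts


class Throttle:
    """Caps concurrent requests and spaces out their start times."""

    def __init__(self, max_concurrent: int = MAX_CONCURRENT,
                 min_interval: float = MIN_INTERVAL) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._lock = asyncio.Lock()
        self._min_interval = min_interval
        self._next_start = 0.0

    async def run(self, make_request):
        """Await `make_request()` (a zero-argument coroutine function) within the limits."""
        async with self._sem:
            # The lock serializes start-time bookkeeping across tasks.
            async with self._lock:
                now = time.monotonic()
                wait = self._next_start - now
                if wait > 0:
                    await asyncio.sleep(wait)
                self._next_start = max(now, self._next_start) + self._min_interval
            return await make_request()
```

In use, a single Throttle would be created inside the running event loop and each API call wrapped as `await throttle.run(call)`, where `call` performs one request.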
The clearly supported features include API access, text to speech, voices with emotional or expressive control, and speed control through tags like <slow> and <fast>. It also handles multilingual speech through BCP 47 language codes and automatic language detection. Custom accent design is not a prominently marketed feature, so any claim about it should be made cautiously.
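As a small illustration of the speed tags, the Python helper below wraps a phrase in <slow> or <fast>. Only the tag names appear above; the paired closing form (</slow>, </fast>) and the helper name are assumptions, so confirm the exact syntax in the official docs.

```python
SPEED_TAGS = ("slow", "fast")  # tag names mentioned above


def with_speed(text: str, speed: str) -> str:
    """Wrap text in a speed tag (closing-tag form is an assumption)."""
    if speed not in SPEED_TAGS:
        raise ValueError(f"speed must be one of {SPEED_TAGS}")
    return f"<{speed}>{text}</{speed}>"


# Example: slow down only the part a caller needs to write down.
message = "Your code is " + with_speed("4 7 2 9", "slow") + ". Thank you."
```

Building the tagged string in code keeps the markup consistent and avoids unbalanced tags in hand-written prompts.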
If you'd like to access this model, you can explore the following possibilities: