The Text-to-Speech WebSockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the WebSockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:
However, it may not be the best choice when:
Your single-use token. Use this if you want to initiate a session from the client. When this parameter is provided, xi-api-key is no longer required for authentication.
The ISO 639-1 language code (for specific models).
Inactivity timeout, in seconds, after which a context is closed. The maximum is 180 seconds.
Reduces latency by disabling the chunk schedule and buffers. Recommended when sending full sentences or phrases.
If specified, the system makes a best-effort attempt to sample deterministically. Must be an integer between 0 and 4294967295.
Controls text normalization, with three modes: ‘auto’, ‘on’, and ‘off’. With ‘auto’, the system decides whether to apply text normalization (e.g., spelling out numbers). With ‘on’, text normalization is always applied; with ‘off’, it is skipped. For the ‘eleven_flash_v2_5’ model, text normalization can only be enabled on Enterprise plans. Defaults to ‘auto’.
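The constraints described above can be checked on the client before a connection is opened. The following is a minimal sketch, not the official client: the parameter names (`authorization`, `language_code`, `inactivity_timeout`, `auto_mode`, `seed`, `apply_text_normalization`) and the lowercase `true`/`false` encoding of booleans are assumptions, since this section lists only the descriptions, and `build_query` is a hypothetical helper. Verify the names against the API reference before relying on them.

```python
from urllib.parse import urlencode

MAX_INACTIVITY_TIMEOUT = 180          # seconds, upper bound stated above
MAX_SEED = 4294967295                 # upper bound stated above (2**32 - 1)
NORMALIZATION_MODES = {"auto", "on", "off"}


def build_query(language_code=None, inactivity_timeout=None, auto_mode=None,
                seed=None, apply_text_normalization="auto", authorization=None):
    """Validate the documented limits and return a URL query string.

    All parameter names here are assumptions made for illustration.
    """
    if inactivity_timeout is not None and inactivity_timeout > MAX_INACTIVITY_TIMEOUT:
        raise ValueError("inactivity timeout can be at most 180 seconds")
    if seed is not None:
        is_int = isinstance(seed, int) and not isinstance(seed, bool)
        if not is_int or not 0 <= seed <= MAX_SEED:
            raise ValueError("seed must be an integer between 0 and 4294967295")
    if apply_text_normalization not in NORMALIZATION_MODES:
        raise ValueError("text normalization must be 'auto', 'on', or 'off'")

    params = {
        "language_code": language_code,
        "inactivity_timeout": inactivity_timeout,
        "auto_mode": auto_mode,
        "seed": seed,
        "apply_text_normalization": apply_text_normalization,
        "authorization": authorization,   # single-use token, if any
    }
    cleaned = {}
    for name, value in params.items():
        if value is None:
            continue                      # leave unset options out of the URL
        if isinstance(value, bool):
            value = "true" if value else "false"   # assumed boolean encoding
        cleaned[name] = value
    return urlencode(cleaned)


print(build_query(language_code="en", inactivity_timeout=60, seed=42))
# language_code=en&inactivity_timeout=60&seed=42&apply_text_normalization=auto
```

The resulting string would be appended after a `?` to the WebSocket endpoint URL. Rejecting out-of-range values locally gives a clearer error than a failed handshake, but the server remains the authority on what it accepts.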