WebSocket
The Text-to-Speech WebSockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the WebSockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:
- The input text is being streamed or generated in chunks.
- Word-to-audio alignment information is required.
However, it may not be the best choice when:
- The entire input text is available upfront. Because generations are partial, some buffering is involved, which can result in slightly higher latency than a standard HTTP request.
- You want to quickly experiment or prototype. Working with WebSockets is more complex than using a standard HTTP API, which can slow down rapid development and testing.
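For streamed input, a client typically sends a sequence of JSON messages over the socket: an opening message, one message per text chunk, and a closing message with empty text. The sketch below builds that sequence; the exact field names (here just `"text"`) are assumptions based on common streaming TTS protocols, not a verbatim copy of this API's schema.

```python
import json

def build_messages(chunks):
    """Build an ordered, hypothetical JSON message sequence for one
    streaming session (field names are illustrative assumptions)."""
    messages = []
    # An initial message opens the stream (here: a single space).
    messages.append(json.dumps({"text": " "}))
    # Each partial chunk of text is sent as it becomes available.
    for chunk in chunks:
        messages.append(json.dumps({"text": chunk}))
    # An empty text field signals that no more input will follow.
    messages.append(json.dumps({"text": ""}))
    return messages

msgs = build_messages(["Hello ", "world."])
```

Each message would be sent with the WebSocket library of your choice as text becomes available, rather than waiting for the full input.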
Handshake
Headers
Path parameters
Query parameters
The ISO 639-1 language code (for specific models).
Timeout (in seconds) for inactivity before a context is closed; can be up to 180 seconds.
Reduces latency by disabling chunk scheduling and buffering. Recommended when sending full sentences or phrases.
This parameter controls text normalization with three modes: 'auto', 'on', and 'off'. When set to 'auto', the system automatically decides whether to apply text normalization (e.g., spelling out numbers). With 'on', text normalization is always applied; with 'off', it is skipped. Cannot be turned on for the 'eleven_turbo_v2_5' or 'eleven_flash_v2_5' models. Defaults to 'auto'.
If specified, the system will make a best-effort attempt to sample deterministically. Must be an integer between 0 and 4294967295.
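Query parameters like the ones described above are encoded into the handshake URL. The sketch below shows how they might be serialized; the parameter names (`language_code`, `inactivity_timeout`, `auto_mode`, `apply_text_normalization`, `seed`) are hypothetical identifiers chosen to match the descriptions above, not confirmed names from the API reference.

```python
from urllib.parse import urlencode

# Hypothetical parameter names matching the descriptions in this section.
params = {
    "language_code": "en",               # ISO 639-1 language code
    "inactivity_timeout": 180,           # seconds; maximum allowed value
    "auto_mode": "true",                 # disable chunk scheduling/buffering
    "apply_text_normalization": "auto",  # 'auto' | 'on' | 'off'
    "seed": 12345,                       # 0..4294967295; best-effort determinism
}

# The query string would be appended to the WebSocket handshake URL.
query = urlencode(params)
```

The resulting string (e.g. `language_code=en&inactivity_timeout=180&...`) is appended after `?` in the connection URL.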