Multi-Context WebSockets
The Multi-Context Text-to-Speech WebSockets API allows for generating audio from text input while managing multiple independent audio generation streams (contexts) over a single WebSocket connection. This is useful for scenarios requiring concurrent or interleaved audio generations, such as dynamic conversational AI applications.
Each context, identified by a context_id, maintains its own state. You can send text to specific contexts, flush them, or close them independently. A close_socket message can be used to terminate the entire connection gracefully.
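A minimal sketch of this flow in Python, using the websockets library, is shown below. The endpoint path, the xi-api-key header, and the message fields (text, context_id, flush, close_context, close_socket) are assumptions inferred from this page and the standard TTS WebSocket API; treat them as illustrative rather than authoritative.

```python
# Sketch: drive two independent TTS contexts over one WebSocket connection.
# Endpoint path, header name, and message/response field names are assumptions.
import asyncio
import base64
import json

import websockets  # pip install websockets

VOICE_ID = "your-voice-id"  # placeholder
API_KEY = "your-api-key"    # placeholder

URI = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/multi-stream-input"
    "?model_id=eleven_flash_v2_5"
)


async def main() -> None:
    # websockets >= 13 uses additional_headers; older versions use extra_headers.
    async with websockets.connect(
        URI, additional_headers={"xi-api-key": API_KEY}
    ) as ws:
        # Sending text with a new context_id is assumed to start that context.
        await ws.send(json.dumps(
            {"text": "Hello from context A. ", "context_id": "conv_a"}))
        await ws.send(json.dumps(
            {"text": "Hello from context B. ", "context_id": "conv_b"}))

        # Flush context A so any buffered text is generated immediately.
        await ws.send(json.dumps({"context_id": "conv_a", "flush": True}))

        # Close context B independently; context A remains usable.
        await ws.send(json.dumps({"context_id": "conv_b", "close_context": True}))

        # Read audio messages until the server goes quiet (short demo timeout).
        # Each chunk is assumed to carry base64 audio tagged with its context.
        try:
            while True:
                msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=5))
                if msg.get("audio"):
                    audio = base64.b64decode(msg["audio"])
                    print(msg.get("contextId"), len(audio), "audio bytes")
        except asyncio.TimeoutError:
            pass

        # Terminate the entire connection gracefully.
        await ws.send(json.dumps({"close_socket": True}))


asyncio.run(main())
```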
For more information on how to use this API for conversational agents, see the conversational agents guide.
Handshake
Headers
xi-api-key: Your API key, used to authenticate the connection.
Path parameters
voice_id: The unique identifier for the voice to use in the TTS process.
Query parameters
model_id: The model ID to use.
language_code: The ISO 639-1 language code (supported only by certain models).
enable_logging: Whether to enable logging of the request.
enable_ssml_parsing: Whether to enable SSML parsing.
output_format: The output audio format.
inactivity_timeout: Timeout for inactivity before a context is closed, in seconds. Can be up to 180 seconds.
sync_alignment: Whether to include timing data with every audio chunk.
auto_mode: Reduces latency by disabling the chunk schedule and buffers. Recommended when sending full sentences or phrases.
apply_text_normalization: Controls text normalization with three modes: 'auto', 'on', and 'off'. When set to 'auto', the system automatically decides whether to apply text normalization (e.g., spelling out numbers). With 'on', text normalization is always applied; with 'off', it is skipped. Cannot be turned on for the 'eleven_turbo_v2_5' or 'eleven_flash_v2_5' models. Defaults to 'auto'.
seed: If specified, the system will make a best-effort attempt to sample deterministically. Must be an integer between 0 and 4294967295.
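Putting the handshake together, a connection URL that sets several of these query parameters might look like the example below. The parameter names mirror the descriptions above; all values are illustrative.

```
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/multi-stream-input
    ?model_id=eleven_flash_v2_5
    &output_format=pcm_44100
    &inactivity_timeout=180
    &auto_mode=true
    &seed=12345
```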