Multi-Context WebSocket

The Multi-Context Text-to-Speech WebSockets API generates audio from text input while managing multiple independent audio generation streams (contexts) over a single WebSocket connection. This is useful when audio generations must run concurrently or be interleaved, as in dynamic conversational AI applications. Each context is identified by a context ID and maintains its own state, so you can send text to a specific context, flush it, or close it without affecting the others. A `close_socket` message terminates the entire connection gracefully. For best practices on using this API, see the [multi-context WebSocket guide](/docs/developers/guides/cookbooks/multi-context-web-socket).
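The context model described above can be pictured as client-side bookkeeping. This is a minimal sketch, not part of the API: the `ContextTracker` class and its method names are hypothetical, and it only models the documented rule that contexts close independently while `close_socket` ends all of them.

```python
from dataclasses import dataclass, field


@dataclass
class ContextTracker:
    """Hypothetical client-side record of the contexts open on one socket."""

    open_contexts: set = field(default_factory=set)
    socket_open: bool = True

    def open(self, context_id: str) -> None:
        # Each context is keyed only by its context ID.
        if not self.socket_open:
            raise RuntimeError("socket is already closed")
        self.open_contexts.add(context_id)

    def close_context(self, context_id: str) -> None:
        # Closing one context leaves all other contexts untouched.
        self.open_contexts.discard(context_id)

    def close_socket(self) -> None:
        # close_socket ends the connection, so every remaining context ends with it.
        self.open_contexts.clear()
        self.socket_open = False


tracker = ContextTracker()
tracker.open("greeting")
tracker.open("answer")
tracker.close_context("greeting")
print(sorted(tracker.open_contexts))  # ['answer']
tracker.close_socket()
print(tracker.socket_open)  # False
```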

Handshake

WSS /v1/text-to-speech/:voice_id/multi-stream-input

Headers

xi-api-key (string, optional)
Your API key for authentication. Not needed when a single_use_token is provided.

Path parameters

voice_id (string, required)
The unique identifier for the voice to use in the TTS process.
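Filling the `:voice_id` path parameter can be sketched as follows. The host `wss://api.elevenlabs.io` is an assumption (this page gives only the path), and `multi_stream_url` is a hypothetical helper name.

```python
from urllib.parse import quote

BASE_URL = "wss://api.elevenlabs.io"  # assumption: host is not stated on this page


def multi_stream_url(voice_id: str) -> str:
    """Hypothetical helper: substitute voice_id into the multi-context endpoint path."""
    if not voice_id:
        raise ValueError("voice_id is required")
    # Percent-encode every reserved character (including "/") so the ID stays one path segment.
    return f"{BASE_URL}/v1/text-to-speech/{quote(voice_id, safe='')}/multi-stream-input"


print(multi_stream_url("my-voice-id"))
# wss://api.elevenlabs.io/v1/text-to-speech/my-voice-id/multi-stream-input
```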

Query parameters

authorization (string, optional)
Your authorization bearer token.
single_use_token (string, optional)
Your single-use token. Use this to initiate a session from the client. When this parameter is provided, xi-api-key is no longer required for authentication.

model_id (string, optional)
The model ID to use.
language_code (string, optional)
The ISO 639-1 language code (supported only by specific models).

enable_logging (boolean, optional, defaults to true)
Whether to enable logging of the request.
enable_ssml_parsing (boolean, optional, defaults to false)
Whether to enable SSML parsing.
output_format (enum, optional)
The output audio format.
inactivity_timeout (integer, optional, defaults to 20)
Inactivity timeout, in seconds, before a context is closed. The maximum is 180 seconds.

sync_alignment (boolean, optional, defaults to false)
Whether to include timing data with every audio chunk.
auto_mode (boolean, optional, defaults to false)
Reduces latency by disabling the chunk schedule and buffers. Recommended when sending full sentences or phrases.

apply_text_normalization (enum, optional, defaults to auto)
Controls text normalization, with three modes: 'auto', 'on', and 'off'. With 'auto', the system decides whether to apply text normalization (e.g., spelling out numbers); with 'on', normalization is always applied; with 'off', it is skipped. For the 'eleven_turbo_v2_5' and 'eleven_flash_v2_5' models, text normalization can only be enabled on Enterprise plans.
Allowed values: auto, on, off
seed (integer, optional, >= 0)
If specified, the system makes a best-effort attempt to sample deterministically. Must be an integer between 0 and 4294967295.

Send

Each message sent to the server is one of the following objects:
Initialize Connection Multi (object)
Initialise Context (object)
Send Text Multi (object)
Flush Context (object)
Close Context (object)
Close Socket (object)
Keep Context Alive (object)
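Several of the send messages can be serialized as small JSON payloads. This sketch is hedged: only `close_socket` is named in this reference, so the other field names (`text`, `context_id`, `flush`, `close_context`) and the empty-text keep-alive behavior are assumptions to be checked against each message schema. Initialize Connection Multi and Initialise Context are omitted because their fields are not described here, and all function names are hypothetical.

```python
import json

# Assumed payload shapes: only close_socket appears in this reference;
# the other field names are illustrative and should be checked against each schema.


def send_text(context_id: str, text: str) -> str:
    # Text is routed only to the context named by context_id.
    return json.dumps({"text": text, "context_id": context_id})


def flush_context(context_id: str) -> str:
    # Assumed: asks the server to generate audio for text buffered in this context.
    return json.dumps({"context_id": context_id, "flush": True})


def keep_context_alive(context_id: str) -> str:
    # Assumed: an empty text message resets the inactivity timer without adding speech.
    return json.dumps({"text": "", "context_id": context_id})


def close_context(context_id: str) -> str:
    # Closes one context; other contexts on the socket keep running.
    return json.dumps({"context_id": context_id, "close_context": True})


def close_socket() -> str:
    # Ends the whole connection, and with it every open context.
    return json.dumps({"close_socket": True})


print(send_text("answer", "Hello there. "))
# {"text": "Hello there. ", "context_id": "answer"}
print(close_socket())
# {"close_socket": true}
```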

Receive

Each message received from the server is one of the following objects:
Audio Output Multi (object)
Final Output Multi (object)
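Because audio for every context arrives over the same socket, a client has to route each received message back to its context. The sketch below assumes field names `audio` (base64-encoded audio), `contextId`, and `isFinal`; none of these is confirmed on this page, so check them against the Audio Output Multi and Final Output Multi schemas. `AudioCollector` is a hypothetical helper.

```python
import base64
import json
from collections import defaultdict

# Assumed field names and base64 encoding; confirm against the
# Audio Output Multi and Final Output Multi schemas before use.
AUDIO_KEY = "audio"
CONTEXT_KEY = "contextId"
FINAL_KEY = "isFinal"


class AudioCollector:
    """Hypothetical helper that demultiplexes received messages by context."""

    def __init__(self) -> None:
        # Maps context ID -> accumulated audio bytes for that context.
        self.audio = defaultdict(bytearray)
        # Context IDs for which a final message has been received.
        self.finished = set()

    def handle(self, raw: str) -> None:
        message = json.loads(raw)
        context_id = message.get(CONTEXT_KEY)
        if context_id is None:
            return  # not tied to a context; ignored in this sketch
        chunk = message.get(AUDIO_KEY)
        if chunk:
            # Decode the base64 chunk and append it to that context's buffer only.
            self.audio[context_id].extend(base64.b64decode(chunk))
        if message.get(FINAL_KEY):
            self.finished.add(context_id)


collector = AudioCollector()
collector.handle(json.dumps({"audio": base64.b64encode(b"\x01\x02").decode(), "contextId": "a"}))
collector.handle(json.dumps({"audio": base64.b64encode(b"\x03").decode(), "contextId": "b"}))
collector.handle(json.dumps({"isFinal": True, "contextId": "a"}))
print(bytes(collector.audio["a"]), sorted(collector.finished))  # b'\x01\x02' ['a']
```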