Multi-Context WebSocket | ElevenLabs Documentation

The Multi-Context Text-to-Speech WebSockets API allows for generating audio from text input while managing multiple independent audio generation streams (contexts) over a single WebSocket connection. This is useful for scenarios requiring concurrent or interleaved audio generations, such as dynamic conversational AI applications. Each context, identified by a context id, maintains its own state. You can send text to specific contexts, flush them, or close them independently. A `close_socket` message can be used to terminate the entire connection gracefully. For more information on best practices for how to use this API, please see the [multi context websocket guide](/docs/eleven-api/guides/how-to/websockets/multi-context-web-socket).

Handshake

WSS

/v1/text-to-speech/:voice_id/multi-stream-input

Headers

xi-api-key string Optional

Path parameters

voice_id string Required

The unique identifier for the voice to use in the TTS process.

Query parameters

authorization any Optional

Your authorization bearer token.

single_use_token any Optional

Your single use token. Use this if you want to initiate a session from the client. When providing this parameter, xi-api-key is no longer required for authentication.

model_id any Optional

Identifier of the model to use. You can query the available models using GET /v1/models.

language_code any Optional

Language code (ISO 639-1) used to enforce a language for the model and text normalization. If the model does not support the provided language code, it will be ignored. This parameter is not supported for multilingual_v2 models.

enable_logging any Optional

When enable_logging is set to false, zero retention mode is used for the request. History features, including request stitching, are then unavailable for this request. Zero retention mode may only be used by enterprise customers.

output_format any Optional

Output format of the generated audio, formatted as codec_sample_rate_bitrate. For example, MP3 with a 22.05 kHz sample rate at 32 kbps is represented as mp3_22050_32. MP3 with a 192 kbps bitrate requires a subscription to the Creator tier or above. PCM with a 44.1 kHz sample rate requires a subscription to the Pro tier or above. Note that the μ-law format (sometimes written mu-law, often approximated as u-law) is commonly used for Twilio audio inputs.
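The codec_sample_rate_bitrate naming scheme described above can be sketched as a pair of helpers. This is an illustration only: the function names are hypothetical, and treating the bitrate as optional (for values with only two parts) is an assumption that this page does not confirm.

```python
from typing import Optional


def build_output_format(codec: str, sample_rate_hz: int,
                        bitrate_kbps: Optional[int] = None) -> str:
    """Join the parts into a codec_sample_rate_bitrate string."""
    parts = [codec, str(sample_rate_hz)]
    # Assumption: some formats may omit the bitrate segment entirely.
    if bitrate_kbps is not None:
        parts.append(str(bitrate_kbps))
    return "_".join(parts)


def parse_output_format(value: str) -> dict:
    """Split an output_format string back into its named parts."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected output_format: {value!r}")
    return {
        "codec": parts[0],
        "sample_rate_hz": int(parts[1]),
        "bitrate_kbps": int(parts[2]) if len(parts) == 3 else None,
    }


# The example from the description: MP3, 22.05 kHz, 32 kbps.
print(build_output_format("mp3", 22050, 32))
```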

inactivity_timeout any Optional

The number of seconds that the connection can be inactive before it is automatically closed. The default timeout is set to 20, with a maximum allowed value of 180.

sync_alignment any Optional

Sync the text alignment to every returned response.

auto_mode any Optional

Whether to use auto mode for this request. This setting reduces latency by disabling the chunk schedule and all buffers. It is recommended only when sending full sentences; sending partial sentences will result in greatly reduced quality.

apply_text_normalization any Optional

This parameter controls text normalization with three modes: ‘auto’, ‘on’, and ‘off’. When set to ‘auto’, the system will automatically decide whether to apply text normalization (e.g., spelling out numbers). With ‘on’, text normalization will always be applied, while with ‘off’, it will be skipped.

seed any Optional

If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed.

enable_ssml_parsing any Optional

Whether to parse SSML tags within the provided text. For best results, we recommend sending SSML tags as fully contained messages to the websockets endpoint; otherwise, additional latency may result. Please note that the rendered text in normalizedAlignment will be altered in support of SSML tags. The rendered text uses a period (.) as a placeholder for breaks, and syllables are reported using the CMU Arpabet alphabet where SSML phoneme tags are used to specify pronunciation. IMPORTANT: When using phoneme-based pronunciation dictionaries (IPA/CMU), SSML parsing is automatically enabled if this parameter is not set. Setting this to false with phoneme dictionaries is deprecated and will be ignored in a future release, as phoneme dictionaries require SSML parsing to work correctly.
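As a rough illustration of how the path and query parameters above combine into a connection URL, here is a minimal sketch using only the Python standard library. The helper name is hypothetical, the host is the one shown in the handshake URL on this page, and serializing booleans as lowercase true/false is an assumption. The only validation performed is the documented inactivity_timeout maximum of 180.

```python
from urllib.parse import quote, urlencode

BASE_URL = "wss://api.elevenlabs.io"
MAX_INACTIVITY_TIMEOUT = 180  # documented maximum, in seconds


def build_multi_context_url(voice_id: str, **query) -> str:
    """Build the multi-stream-input URL from a voice id and query parameters."""
    timeout = query.get("inactivity_timeout")
    if timeout is not None and timeout > MAX_INACTIVITY_TIMEOUT:
        raise ValueError("inactivity_timeout may not exceed 180 seconds")

    params = {}
    for key, value in query.items():
        if value is None:
            continue  # omit unset parameters so server defaults apply
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase strings.
            value = "true" if value else "false"
        params[key] = value

    path = f"/v1/text-to-speech/{quote(voice_id, safe='')}/multi-stream-input"
    query_string = urlencode(params)
    return f"{BASE_URL}{path}?{query_string}" if query_string else f"{BASE_URL}{path}"


# "abc" and "m1" are placeholder values, not real identifiers.
print(build_multi_context_url("abc", model_id="m1", inactivity_timeout=60))
```

Authentication is left out of this sketch: the xi-api-key header, or the authorization or single_use_token query parameter, would be supplied by whichever WebSocket client opens the connection.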

Send

initializeConnectionMulti object Required

Message to initialize a new TTS context in a multi-context stream.

initialiseContext object Required

Message to initialize or re-initialize a TTS context with text and settings for multi-stream connections.

sendTextMulti object Required

Message to send text for synthesis to a specific context.

flushContextClient object Required

Message to flush the audio buffer for a specific context.

closeContextClient object Required

Message to close a specific TTS context.

closeSocketClient object Required

Message to gracefully close the entire WebSocket connection.

keepContextAlive object Required

Message to keep a specific context alive by resetting its inactivity timeout.
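The client messages listed above could be serialized along the following lines. This is a sketch under stated assumptions: this page names the message types but not their JSON fields, so every field name used here (text, context_id, flush, close_context, close_socket) and the single-space and empty-string text conventions are assumptions to verify against the message schemas before use. The helper function names are hypothetical.

```python
import json


def initialize_context(context_id: str, text: str = " ") -> str:
    # Assumption: a context is opened by its first text message.
    return json.dumps({"text": text, "context_id": context_id})


def send_text(context_id: str, text: str) -> str:
    return json.dumps({"text": text, "context_id": context_id})


def flush_context(context_id: str) -> str:
    # Assumption: a boolean "flush" field flushes the context's buffer.
    return json.dumps({"context_id": context_id, "flush": True})


def close_context(context_id: str) -> str:
    # Assumption: a boolean "close_context" field closes one context.
    return json.dumps({"context_id": context_id, "close_context": True})


def keep_context_alive(context_id: str) -> str:
    # Assumption: empty text resets the inactivity timeout without synthesis.
    return json.dumps({"text": "", "context_id": context_id})


def close_socket() -> str:
    # Closes the whole connection rather than a single context.
    return json.dumps({"close_socket": True})


# Two independent contexts interleaved over one connection.
outgoing = [
    initialize_context("greeting"),
    initialize_context("answer"),
    send_text("greeting", "Hello there. "),
    send_text("answer", "The answer is forty-two. "),
    flush_context("greeting"),
    close_context("greeting"),
    close_socket(),
]
```

Each string in outgoing would be sent as one text frame on the open WebSocket, in order.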

Receive

audioOutputMulti object Required

Server message containing an audio chunk for a specific context.

finalOutputMulti object Required

Server message indicating the final output for a specific context.
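Because audio chunks for several contexts arrive over the same connection, the receiver has to route each server message to the right stream. The sketch below shows one way to do that. The class name is hypothetical, and the field names contextId, audio (assumed to be base64-encoded), and isFinal are assumptions, since this page does not list the fields of the server messages.

```python
import base64
from collections import defaultdict


class ContextAudioRouter:
    """Collect decoded audio per context and track which contexts are done."""

    def __init__(self) -> None:
        self.buffers = defaultdict(bytearray)
        self.finished = set()

    def handle(self, message: dict) -> None:
        context_id = message.get("contextId")  # assumed field name
        audio = message.get("audio")           # assumed base64 payload
        if audio:
            self.buffers[context_id].extend(base64.b64decode(audio))
        if message.get("isFinal"):              # assumed final-output flag
            self.finished.add(context_id)


# Synthetic messages standing in for parsed JSON frames from the server.
router = ContextAudioRouter()
router.handle({"contextId": "a", "audio": base64.b64encode(b"abc").decode()})
router.handle({"contextId": "b", "audio": base64.b64encode(b"xyz").decode()})
router.handle({"contextId": "a", "audio": base64.b64encode(b"def").decode()})
router.handle({"contextId": "a", "isFinal": True})
```

Keeping one buffer per context id means interleaved chunks never mix, and the finished set tells the caller when a context's audio is complete.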

URL	wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/multi-stream-input
Method	GET
Status	101 Switching Protocols
