WebSockets

The Text-to-Speech WebSockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the WebSockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:

  • The input text is being streamed or generated in chunks.
  • Word-to-audio alignment information is required.

However, it may not be the best choice when:

  • The entire input text is available upfront. Because generation is partial, some buffering is involved, which can result in slightly higher latency than a standard HTTP request.
  • You want to quickly experiment or prototype. Working with WebSockets is more complex than using a standard HTTP API, which can slow down rapid development and testing.

Handshake

GET /v1/text-to-speech/:voice_id/stream-input

Path parameters

voice_id (string, required)

The unique identifier for the voice to use in the TTS process.

Query parameters

model_id (string, optional)

The model ID to use

language_code (string, optional)

The ISO 639-1 language code (for Turbo v2.5 and Flash v2.5 models only)

enable_logging (string, optional)

Whether to enable logging of the request

enable_ssml_parsing (boolean, optional, defaults to false)

Whether to enable SSML parsing

optimize_streaming_latency (enum, optional, defaults to 0, deprecated)

Latency optimization level (deprecated)

Allowed values:
output_format (enum, optional, defaults to mp3_44100)

The output audio format

inactivity_timeout (double, optional, defaults to 20)

Timeout for inactivity before connection is closed

sync_alignment (boolean, optional, defaults to false)

Whether to include timing data with every audio chunk

auto_mode (boolean, optional, defaults to false)

Reduces latency by disabling the chunk schedule and all buffers. It is recommended only when sending full sentences or phrases; sending partial phrases will significantly reduce quality.
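Since auto_mode is recommended only for full sentences or phrases, a client that receives text in arbitrary fragments (for example, tokens from a language model) may want to regroup them before sending. The following is a minimal sketch of one way to do that, not part of the API itself. The function name `sentences_from_stream` and the rule "a sentence ends at `.`, `!` or `?` followed by whitespace" are assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumption: a sentence boundary is ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(fragments: Iterable[str]) -> Iterator[str]:
    """Buffer streamed text fragments and yield only complete sentences."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one ended at a sentence boundary.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence + " "
        # The last part may still be an unfinished sentence; keep buffering it.
        buffer = parts[-1]
    if buffer.strip():
        # Flush whatever remains once the input stream is exhausted.
        yield buffer


chunks = list(sentences_from_stream(["Hello wor", "ld. How are", " you? Fine"]))
```

Each yielded string could then be sent as one text message. A heuristic this simple will split on abbreviations such as "e.g. ", so treat it as a starting point only.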

apply_text_normalization (enum, optional, defaults to auto)

Controls text normalization with three modes: ‘auto’, ‘on’, and ‘off’. With ‘auto’, the system decides whether to apply text normalization (e.g., spelling out numbers). With ‘on’, text normalization is always applied; with ‘off’, it is skipped. It cannot be turned on for the ‘eleven_turbo_v2_5’ model.

Allowed values: auto, on, off
seed (integer, optional, >= 0)

If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed. Must be an integer between 0 and 4294967295.
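To show how the path and query parameters above combine into a handshake URL, here is a minimal sketch using only the Python standard library. The host in `BASE_URL`, the helper name `build_stream_input_url`, and the choice to serialize booleans as `true`/`false` are assumptions not stated in this section. The two validation checks mirror the documented constraints on `seed` and `apply_text_normalization`.

```python
from urllib.parse import quote, urlencode

# Assumption: the host is not given in this section; substitute your own.
BASE_URL = "wss://api.elevenlabs.io"
MAX_SEED = 4294967295


def build_stream_input_url(voice_id: str, **params) -> str:
    """Build the stream-input handshake URL from optional query parameters."""
    seed = params.get("seed")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    if (params.get("apply_text_normalization") == "on"
            and params.get("model_id") == "eleven_turbo_v2_5"):
        raise ValueError("text normalization cannot be 'on' for eleven_turbo_v2_5")

    query = {}
    for key, value in params.items():
        if value is None:
            continue  # omit unset parameters so server defaults apply
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase strings.
            value = "true" if value else "false"
        query[key] = value

    path = f"/v1/text-to-speech/{quote(voice_id, safe='')}/stream-input"
    query_string = urlencode(query)
    return BASE_URL + path + (f"?{query_string}" if query_string else "")


url = build_stream_input_url(
    "abc", model_id="eleven_turbo_v2_5", auto_mode=True, seed=7
)
```

The resulting string can be passed to any WebSocket client. Authentication is not covered in this section, so it is left out of the sketch.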

Send

Initialize Connection (object, required)
OR
Send Text (object, required)
OR
Close Connection (object, required)
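The three client messages are sent in order: one to initialize the connection, one per piece of text, and one to close. The sketch below shows that ordering as JSON frames. This section does not list the message fields, so the payload shapes used here (a `text` field holding a single space to initialize, the chunk itself to send text, and an empty string to close) are assumptions that should be checked against the message schemas. The helper name `outgoing_messages` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def outgoing_messages(chunks: Iterable[str]) -> Iterator[str]:
    """Yield JSON frames in the order: initialize, send text..., close."""
    # Assumed "Initialize Connection" payload: a single space as text.
    yield json.dumps({"text": " "})
    for chunk in chunks:
        if chunk:
            # Assumed "Send Text" payload. Empty chunks are skipped because an
            # empty string is assumed to mean "Close Connection".
            yield json.dumps({"text": chunk})
    # Assumed "Close Connection" payload: an empty string.
    yield json.dumps({"text": ""})


frames = list(outgoing_messages(["Hello ", "", "world. "]))
```

Each frame would be passed to the WebSocket client's send method, in order, after the handshake completes.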

Receive

Audio Output (object, required)
OR
Final Output (object, required)
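On the receiving side, a client typically accumulates audio from each Audio Output message until a Final Output message arrives. The following is a minimal sketch of that loop over already-received JSON frames. The field names `audio` (base64-encoded bytes) and `isFinal` are assumptions, since this section does not list the message fields, and `collect_audio` is a hypothetical helper.

```python
import base64
import json
from typing import Iterable


def collect_audio(messages: Iterable[str]) -> bytes:
    """Concatenate decoded audio from frames until a final frame is seen."""
    audio = bytearray()
    for raw in messages:
        message = json.loads(raw)
        if message.get("audio"):
            # Assumed: audio arrives as a base64-encoded string.
            audio.extend(base64.b64decode(message["audio"]))
        if message.get("isFinal"):
            # Assumed: the Final Output message sets isFinal to true.
            break
    return bytes(audio)


example_frames = [
    json.dumps({"audio": base64.b64encode(b"ab").decode("ascii")}),
    json.dumps({"audio": base64.b64encode(b"cd").decode("ascii"), "isFinal": None}),
    json.dumps({"isFinal": True}),
    json.dumps({"audio": base64.b64encode(b"zz").decode("ascii")}),
]
result = collect_audio(example_frames)
```

In a real client the `messages` iterable would be the incoming frames of the open WebSocket connection rather than a prepared list.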