Websockets
This API provides real-time text-to-speech conversion using WebSockets. It lets you send text messages and receive audio data back in real time.
Endpoint: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}
When to use
The Text-to-Speech Websockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the Websockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:
- The input text is being streamed or generated in chunks.
- Word-to-audio alignment information is required.
For a practical demonstration in a real world application, refer to the Example of voice streaming using ElevenLabs and OpenAI section.
When not to use
However, it may not be the best choice when:
- The entire input text is available upfront. Given that the generations are partial, some buffering is involved, which could potentially result in slightly higher latency compared to a standard HTTP request.
- You want to quickly experiment or prototype. Working with Websockets can be harder and more complex than using a standard HTTP API, which might slow down rapid development and testing.
In these cases, use the Text to Speech API instead.
Protocol
The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.
Streaming input text
The client can send messages with text input to the server. The messages can contain the following fields:
Should always end with a single space string " ". In the first message, the text should be a space " ".
This is an advanced setting that most users shouldn’t need to use. It relates to our generation schedule explained in the "Understanding how our websockets buffer text" section.
Use this to attempt to immediately trigger the generation of audio, overriding the chunk_length_schedule. Unlike flush, try_trigger_generation will only generate audio if our buffer contains more than a minimum threshold of characters; this is to ensure a higher quality response from our model.
Note that overriding the chunk schedule to generate small amounts of text may result in lower quality audio, so only use this parameter if you really need text to be processed immediately. We generally recommend keeping the default value of false and adjusting the chunk_length_schedule in the generation_config instead.
This property should only be provided in the first message you send.
properties
Defines the stability for voice settings.
Defines the similarity boost for voice settings.
Defines the style for voice settings. This parameter is available on V2+ models.
Defines whether speaker boost is used for voice settings. This parameter is available on V2+ models.
This property should only be provided in the first message you send.
properties
This is an advanced setting that most users shouldn’t need to use. It relates to our generation schedule explained in the "Understanding how our websockets buffer text" section.
Determines the minimum amount of text that needs to be sent and present in our buffer before audio starts being generated. This is to maximise the amount of context available to the model to improve audio quality, whilst balancing latency of the returned audio chunks.
The default value is: [120, 160, 250, 290].
This means that the first chunk of audio will not be generated until you have sent text totalling at least 120 characters. The next chunk of audio will only be generated once a further 160 characters have been sent. The third audio chunk will be generated after the next 250 characters. The fourth chunk, and every chunk after it, will be generated in sets of at least 290 characters.
Customize this array to suit your needs. If you want to generate audio more frequently to optimise latency, you can reduce the values in the array. Note that setting the values too low may result in lower quality audio. Please test and adjust as needed.
Each item should be in the range 50-500.
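The schedule described above can be sketched as a small simulation. This is only an illustration of the documented thresholds, not the server's implementation: it assumes that once the buffered text reaches the current threshold, the whole buffer is sent for generation and emptied. The function names are hypothetical.

```python
from typing import Iterable, List

DEFAULT_SCHEDULE = [120, 160, 250, 290]  # default from the text above


def validate_schedule(schedule: List[int]) -> None:
    """Check that every item is in the documented 50-500 range."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    for value in schedule:
        if not 50 <= value <= 500:
            raise ValueError(f"schedule item {value} is outside 50-500")


def simulate_schedule(piece_lengths: Iterable[int],
                      schedule: List[int] = DEFAULT_SCHEDULE) -> List[int]:
    """Return the size of each text batch that would trigger generation.

    Assumption: a generation consumes the entire buffer once the
    buffer length reaches the threshold for the current chunk.
    """
    validate_schedule(schedule)
    batches: List[int] = []
    buffered = 0
    for length in piece_lengths:
        buffered += length
        # The last schedule entry is reused for the fourth chunk and beyond.
        threshold = schedule[min(len(batches), len(schedule) - 1)]
        if buffered >= threshold:
            batches.append(buffered)
            buffered = 0
    return batches


# Twelve 60-character pieces: three generations are triggered, and the
# final 120 characters stay buffered until a flush or close.
batches = simulate_schedule([60] * 12)
```

Lowering the values in the schedule makes the simulated batches smaller and more frequent, which mirrors the latency and quality trade-off described above.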
Flush forces the generation of audio. Set this value to true when you have finished sending text, but want to keep the websocket connection open.
This is useful when you want to ensure that the last chunk of audio is generated even when the length of text sent is smaller than the value set in chunk_length_schedule (e.g. 120 or 50).
To understand more about how our websockets buffer text before audio is generated, please refer to the "Understanding how our websockets buffer text" section.
Provide the XI API Key in the first message if it’s not in the header.
Authorization bearer token. Should be provided only in the first message if not present in the header and the XI API Key is not provided.
For best latency, we recommend streaming word by word. This way we will start generating as soon as we reach the predefined number of un-generated characters.
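As a rough sketch of the input messages described in this section, the helper below builds the JSON payloads for a word-by-word stream. The key names `text`, `voice_settings`, `stability`, `similarity_boost`, and the nesting of `chunk_length_schedule` under `generation_config` are assumptions inferred from the descriptions above, and `build_text_messages` is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Iterator, List, Optional


def build_text_messages(text: str,
                        voice_settings: Optional[dict] = None,
                        chunk_length_schedule: Optional[List[int]] = None
                        ) -> Iterator[str]:
    """Yield JSON messages: one initial message, then one per word."""
    # The first message carries a single space as its text, plus any
    # settings that may only be provided in the first message.
    first: dict = {"text": " "}
    if voice_settings is not None:
        first["voice_settings"] = voice_settings  # assumed key name
    if chunk_length_schedule is not None:
        first["generation_config"] = {
            "chunk_length_schedule": chunk_length_schedule
        }
    yield json.dumps(first)

    # Every following text ends with a single space, as required above.
    for word in text.split():
        yield json.dumps({"text": word + " "})


messages = list(build_text_messages(
    "Hello world",
    voice_settings={"stability": 0.5, "similarity_boost": 0.8},
    chunk_length_schedule=[120, 160, 250, 290],
))
```

Each yielded string would be sent as one WebSocket text frame; authentication is left out here because it can be supplied in the header instead of the first message.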
Close connection
In order to close the connection, the client should send an End of Sequence (EOS) message. The text of the EOS message should always be an empty string:
Should always be an empty string "".
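The difference between flushing and closing can be sketched as two payloads. This assumes the text field is keyed `text` and the flush flag is keyed `flush`; both helper names are hypothetical.

```python
import json


def flush_message(text: str) -> str:
    """Force generation of the buffered text while keeping the socket open."""
    # Text should always end with a single space.
    if not text.endswith(" "):
        text += " "
    return json.dumps({"text": text, "flush": True})


def eos_message() -> str:
    """End of Sequence: an empty text string asks the server to close."""
    return json.dumps({"text": ""})


last_chunk = flush_message("Goodbye")
closing = eos_message()
```

Sending the flush payload is the option to pick when more text may follow later on the same connection; the EOS payload ends the stream.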
Streaming output audio
The server will always respond with a message containing the following fields:
A generated partial audio chunk, encoded using the selected output_format. By default this is MP3, encoded as a base64 string.
Indicates if the generation is complete. If set to True, audio will be null.
Alignment information for the generated audio given the input normalized text sequence.
properties
A list of starting times (in milliseconds) for each character in the normalized text as it corresponds to the audio. For instance, the character ‘H’ starts at time 0 ms in the audio. Note these times are relative to the returned chunk from the model, and not the full audio response. See the "Example - Getting word start times using alignment values" section for how to use this.
A list providing the duration (in milliseconds) for each character’s pronunciation in the audio. For instance, the character ‘H’ has a pronunciation duration of 3 ms.
The list of characters in the normalized text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations.
Alignment information for the generated audio given the original text sequence.
properties
A list of starting times (in milliseconds) for each character in the original text as it corresponds to the audio. For instance, the character ‘H’ starts at time 0 ms in the audio. Note these times are relative to the returned chunk from the model, and not the full audio response. See the "Example - Getting word start times using alignment values" section for how to use this.
A list providing the duration (in milliseconds) for each character’s pronunciation in the audio. For instance, the character ‘H’ has a pronunciation duration of 3 ms.
The list of characters in the original text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations.
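A minimal sketch of handling one server message follows. The key `audio` is named above; the keys `isFinal` and `alignment` are assumptions about how the completion flag and the original-text alignment are named, and `parse_server_message` is a hypothetical helper.

```python
import base64
import json
from typing import Optional, Tuple


def parse_server_message(raw: str) -> Tuple[bytes, bool, Optional[dict]]:
    """Decode one JSON message into (audio bytes, is_final, alignment)."""
    message = json.loads(raw)

    # The audio chunk is base64 encoded and is null on the final message.
    audio_b64 = message.get("audio")
    audio = base64.b64decode(audio_b64) if audio_b64 else b""

    is_final = bool(message.get("isFinal"))  # assumed key name
    alignment = message.get("alignment")     # assumed key name
    return audio, is_final, alignment


# A made-up payload standing in for a real server response.
sample = json.dumps({
    "audio": base64.b64encode(b"fake-mp3-bytes").decode("ascii"),
    "isFinal": None,
    "alignment": None,
})
audio, is_final, alignment = parse_server_message(sample)
```

A receiving loop would append each decoded chunk to a file or playback buffer and stop once the completion flag is true.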
Path parameters
Voice ID to be used. You can use Get Voices to list all the available voices.
Query parameters
Identifier of the model that will be used. You can query the available models using Get Models.
Language code (ISO 639-1) used to enforce a language for the model. Currently only our v2.5 Flash & Turbo v2.5 models support language enforcement. For other models, an error will be returned if language code is provided.
Whether to enable request logging, if disabled the request will not be present in history nor bigtable. Enabled by default. Note: simple logging (aka printing) to stdout/stderr is always enabled.
Whether to enable/disable parsing of SSML tags within the provided text. For best results, we recommend sending SSML tags as fully contained messages to the websockets endpoint, otherwise this may result in additional latency. Please note that rendered text, in normalizedAlignment, will be altered in support of SSML tags. The rendered text will use a . as a placeholder for breaks, and syllables will be reported using the CMU arpabet alphabet where SSML phoneme tags are used to specify pronunciation. Disabled by default.
You can turn on latency optimizations at some cost of quality. The best possible final latency varies by model. Possible values:
Defaults to 0
Output format of the generated audio. Must be one of:
Defaults to mp3_44100
The number of seconds that the connection can be inactive before it is automatically closed.
Defaults to 20 seconds, with a maximum allowed value of 180 seconds.
The audio for each text sequence is delivered in multiple chunks. By default when it’s set to false, you’ll receive all timing data (alignment information) with the first chunk only. However, if you enable this option, you’ll get the timing data with every audio chunk instead. This can help you precisely match each audio segment with its corresponding text.
This parameter focuses on reducing the latency by disabling the chunk schedule and all buffers. It is only recommended when sending full sentences or phrases; sending partial phrases will result in highly reduced quality. By default it’s set to false.
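The path and query parameters combine into the connection URL. The sketch below uses the endpoint template from the top of this page; `model_id` appears in that template, while passing `output_format` as a query key is an assumption based on the parameter description above. `build_stream_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

BASE_URL = "wss://api.elevenlabs.io/v1/text-to-speech"


def build_stream_url(voice_id: str, model_id: str, **query: str) -> str:
    """Build the stream-input URL with URL-encoded query parameters."""
    params = {"model_id": model_id, **query}
    # Encode the voice ID so it is safe to use as a single path segment.
    path = f"{BASE_URL}/{quote(voice_id, safe='')}/stream-input"
    return f"{path}?{urlencode(params)}"


# VOICE_ID and MODEL_ID are placeholders, not real identifiers.
url = build_stream_url("VOICE_ID", "MODEL_ID", output_format="mp3_44100")
```

Other query parameters from this section could be passed the same way as extra keyword arguments, once their exact key names are confirmed.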
Example - Voice streaming using ElevenLabs and OpenAI
The following example demonstrates how to leverage the ElevenLabs Websockets API to stream input from OpenAI’s GPT model, while the answer is being generated, thereby minimizing the overall latency of the operation.
Example - Other examples for interacting with our Websocket API
Some examples for interacting with the Websocket API in different ways are provided below
Example - Getting word start times using alignment values
This code example shows how the start times of words can be retrieved using the alignment values returned from our API.
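One possible way to derive word start times is sketched below. It takes the three alignment lists directly, so it does not depend on the exact key names in the response. Because start times are relative to each returned chunk, it assumes the offset of a chunk equals the summed character durations of all earlier chunks. `word_start_times` is a hypothetical helper.

```python
from typing import Iterable, List, Sequence, Tuple

# One alignment chunk: (characters, start times in ms, durations in ms).
Chunk = Tuple[Sequence[str], Sequence[int], Sequence[int]]


def word_start_times(chunks: Iterable[Chunk]) -> List[Tuple[str, int]]:
    """Return (word, start time in ms) pairs across all chunks."""
    words: List[List] = []
    offset_ms = 0     # assumed total duration of the earlier chunks
    in_word = False   # carried across chunks so split words are joined
    for chars, starts_ms, durations_ms in chunks:
        for char, start in zip(chars, starts_ms):
            if char.isspace():
                in_word = False
            elif not in_word:
                # First character of a new word: record its start time.
                words.append([char, offset_ms + start])
                in_word = True
            else:
                words[-1][0] += char
        offset_ms += sum(durations_ms)
    return [(word, start) for word, start in words]


# Illustrative values only, not real API output.
chunk_1 = (list("Hello world"),
           [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
           [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3])
chunk_2 = (list(" again"), [0, 2, 4, 6, 8, 10], [2, 2, 2, 2, 2, 2])
result = word_start_times([chunk_1, chunk_2])
```

Splitting on whitespace is a simplification; punctuation stays attached to the neighbouring word in this sketch.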
Understanding how our websockets buffer text
Our websocket service incorporates a buffer system designed to optimize the Time To First Byte (TTFB) while maintaining high-quality streaming.
All text sent to the websocket endpoint is added to this buffer and only when that buffer reaches a certain size is an audio generation attempted. This is because our model provides higher quality audio when the model has longer inputs, and can deduce more context about how the text should be delivered.
The buffer ensures smooth audio data delivery and is automatically emptied with a final audio generation either when the stream is closed, or upon sending a flush command. We have advanced settings for changing the chunk schedule, which can improve latency at the cost of quality by generating audio more frequently with smaller text inputs.