This API provides real-time text-to-speech conversion using WebSockets. You send text messages and receive audio data back in real time.

Endpoint:
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}
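
The endpoint URL can also be assembled in code. The following is a minimal sketch rather than an official client: `build_stream_url` is a hypothetical helper name, it relies only on Python's standard `urllib.parse`, and the way booleans are serialized in the query string is an assumption.

```python
from urllib.parse import quote, urlencode

BASE_URL = "wss://api.elevenlabs.io/v1/text-to-speech"


def build_stream_url(voice_id, model_id, **query_params):
    """Return the stream-input WebSocket URL for a voice and model.

    build_stream_url is a hypothetical helper, not part of any SDK.
    Extra keyword arguments become additional query parameters.
    """
    params = {"model_id": model_id}
    for name, value in query_params.items():
        # Assumption: booleans are sent as lowercase "true"/"false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = value
    return f"{BASE_URL}/{quote(voice_id, safe='')}/stream-input?{urlencode(params)}"


url = build_stream_url(
    "21m00Tcm4TlvDq8ikWAM",
    "eleven_flash_v2_5",
    output_format="pcm_16000",
    inactivity_timeout=60,
)
print(url)
```

Keeping URL construction in one place makes it easier to add or remove the query parameters described under Query parameters without hand-editing the string.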

When to use

The Text-to-Speech Websockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the Websockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:

  • The input text is being streamed or generated in chunks.
  • Word-to-audio alignment information is required.

For a practical demonstration in a real-world application, refer to the section “Example - Voice streaming using ElevenLabs and OpenAI”.

When not to use

However, it may not be the best choice when:

  • The entire input text is available upfront. Given that the generations are partial, some buffering is involved, which could potentially result in slightly higher latency compared to a standard HTTP request.
  • You want to quickly experiment or prototype. Working with Websockets can be harder and more complex than using a standard HTTP API, which might slow down rapid development and testing.

In these cases, use the Text to Speech API instead.

Protocol

The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.

Streaming input text

The client can send messages with text input to the server. The messages can contain the following fields:

{
  "text": "This is a sample text ",
  "voice_settings": {
    "stability": 0.8,
    "similarity_boost": 0.8
  },
  "generation_config": {
    "chunk_length_schedule": [120, 160, 250, 290]
  },
  "xi_api_key": "<XI API Key>",
  "authorization": "Bearer <Authorization Token>"
}

text
string, required

Should always end with a single space string " ". In the first message, the text should be a space " ".

try_trigger_generation
boolean, deprecated

This is an advanced setting that most users shouldn’t need. It relates to the generation schedule explained in the section “Understanding how our websockets buffer text”.

Use this to attempt to trigger audio generation immediately, overriding the chunk_length_schedule. Unlike flush, try_trigger_generation only generates audio if the buffer contains more than a minimum number of characters, which ensures a higher-quality response from the model.

Note that overriding the chunk schedule to generate audio from small amounts of text may result in lower-quality audio, so only use this parameter if you really need text to be processed immediately. We generally recommend keeping the default value of false and adjusting the chunk_length_schedule in the generation_config instead.

voice_settings
object

This property should only be provided in the first message you send.

stability
number

Defines the stability for voice settings.

similarity_boost
number

Defines the similarity boost for voice settings.

style
number

Defines the style for voice settings. This parameter is available on V2+ models.

use_speaker_boost
boolean

Defines whether speaker boost is used for voice settings. This parameter is available on V2+ models.

generation_config
object

This property should only be provided in the first message you send.

chunk_length_schedule
array

This is an advanced setting that most users shouldn’t need. It relates to the generation schedule explained in the section “Understanding how our websockets buffer text”.

Determines the minimum amount of text that needs to be sent and present in our buffer before audio starts being generated. This is to maximise the amount of context available to the model to improve audio quality, whilst balancing latency of the returned audio chunks.

The default value is: [120, 160, 250, 290].

This means that the first chunk of audio will not be generated until you send text that totals at least 120 characters long. The next chunk of audio will only be generated once a further 160 characters have been sent. The third audio chunk will be generated after the next 250 characters. Then the fourth, and beyond, will be generated in sets of at least 290 characters.

Customize this array to suit your needs. If you want to generate audio more frequently to optimise latency, you can reduce the values in the array. Note that setting the values too low may result in lower quality audio. Please test and adjust as needed.

Each item should be in the range 50-500.
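
To make the schedule arithmetic concrete, the sketch below computes how many characters must have been sent in total before each audio chunk is generated. It is only an illustration of the description above: `validate_schedule` and `cumulative_thresholds` are hypothetical helper names, and the real server may buffer differently in detail.

```python
from itertools import accumulate, chain, islice, repeat

DEFAULT_SCHEDULE = [120, 160, 250, 290]


def validate_schedule(schedule):
    """Raise ValueError unless every item is within the documented 50-500 range."""
    for item in schedule:
        if not 50 <= item <= 500:
            raise ValueError(f"schedule item {item} is outside the range 50-500")
    return list(schedule)


def cumulative_thresholds(schedule, count):
    """Return the total characters sent before each of the first `count` chunks.

    The last schedule entry is reused for every chunk past the end of the
    array, mirroring "the fourth, and beyond" in the description above.
    """
    schedule = validate_schedule(schedule)
    sizes = chain(schedule, repeat(schedule[-1]))
    return list(islice(accumulate(sizes), count))


print(cumulative_thresholds(DEFAULT_SCHEDULE, 6))
# [120, 280, 530, 820, 1110, 1400]
```

With the default schedule, the first audio chunk needs at least 120 characters and the second needs at least 280 in total, which is why lowering the first entries reduces the time to first audio.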

flush
boolean

Flush forces the generation of audio. Set this value to true when you have finished sending text, but want to keep the websocket connection open.

This is useful when you want to ensure that the last chunk of audio is generated even when the length of text sent is smaller than the value set in chunk_length_schedule (e.g. 120 or 50).

To understand more about how our websockets buffer text before audio is generated, please refer to the section “Understanding how our websockets buffer text”.

xi_api_key
string

Provide the XI API Key in the first message if it’s not in the header.

authorization
string

Authorization bearer token. Should be provided only in the first message if not present in the header and the XI API Key is not provided.

For best latency, we recommend streaming word by word. This way, generation starts as soon as the predefined number of un-generated characters is reached.
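
Word-by-word streaming can be sketched as follows. This is an illustrative helper only (`word_messages` is a hypothetical name): it splits text on whitespace and produces one JSON text message per word, each ending with a single space as required for the text field.

```python
import json
import re


def word_messages(text):
    """Yield one JSON text message per word, each ending in a single space."""
    for word in re.findall(r"\S+", text):
        yield json.dumps({"text": word + " "})


for message in word_messages("Hello   streaming world"):
    print(message)
```

Each yielded string can be passed directly to the WebSocket send call after the first message has been sent.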

Close connection

To close the connection, the client should send an End of Sequence (EOS) message. The text field of the EOS message should always be an empty string:

End of Sequence (EOS) message
{
  "text": ""
}

text
string, required

Should always be an empty string "".
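
Putting the client-side rules together, a session consists of a first message whose text is a single space, any number of text messages, and a closing EOS message. The sketch below builds that sequence as JSON strings. `session_messages` is a hypothetical helper, and it assumes the API key is sent in the first message rather than in a header.

```python
import json


def session_messages(api_key, text_pieces, voice_settings=None):
    """Return the JSON messages for one session, in the order they are sent."""
    first = {"text": " ", "xi_api_key": api_key}
    if voice_settings is not None:
        # voice_settings should only be provided in the first message.
        first["voice_settings"] = voice_settings
    messages = [first]
    for piece in text_pieces:
        # Every text message should end with a single space.
        messages.append({"text": piece.rstrip(" ") + " "})
    messages.append({"text": ""})  # End of Sequence (EOS)
    return [json.dumps(message) for message in messages]


for line in session_messages("<XI API Key>", ["Hello", "world"], {"stability": 0.8}):
    print(line)
```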

Streaming output audio

The server will always respond with a message containing the following fields:

Response message
{
  "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZSA6KQ==",
  "isFinal": false,
  "normalizedAlignment": {
    "charStartTimesMs": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
    "charDurationsMs": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
    "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
  },
  "alignment": {
    "charStartTimesMs": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
    "charDurationsMs": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
    "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
  }
}

audio
string

A generated partial audio chunk, encoded using the selected output_format. By default, this is MP3 encoded as a base64 string.

isFinal
boolean

Indicates if the generation is complete. If set to true, audio will be null.
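
A response message can be handled along these lines. This is a sketch with a hypothetical helper name (`handle_response`). It decodes the base64 audio field when present and reports whether the message marks the end of the generation.

```python
import base64
import json


def handle_response(raw_message):
    """Return (audio_bytes, is_final) for one server message.

    audio_bytes is b"" when the message carries no audio, for example
    when isFinal is true and audio is null.
    """
    data = json.loads(raw_message)
    audio = data.get("audio")
    audio_bytes = base64.b64decode(audio) if audio else b""
    return audio_bytes, bool(data.get("isFinal"))


sample = json.dumps({
    "audio": base64.b64encode(b"\x00\x01").decode("ascii"),
    "isFinal": False,
})
print(handle_response(sample))
print(handle_response('{"audio": null, "isFinal": true}'))
```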

normalizedAlignment
object

Alignment information for the generated audio given the input normalized text sequence.

charStartTimesMs
array

A list of starting times (in milliseconds) for each character in the normalized text as it corresponds to the audio. For instance, the character ‘H’ starts at time 0 ms in the audio. Note that these times are relative to the returned chunk from the model, not to the full audio response. See the section “Example - Getting word start times using alignment values” for how to use this.

charDurationsMs
array

A list providing the duration (in milliseconds) for each character’s pronunciation in the audio. For instance, the character ‘H’ has a pronunciation duration of 3 ms.

chars
array

The list of characters in the normalized text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations.

alignment
object

Alignment information for the generated audio given the original text sequence.

charStartTimesMs
array

A list of starting times (in milliseconds) for each character in the original text as it corresponds to the audio. For instance, the character ‘H’ starts at time 0 ms in the audio. Note that these times are relative to the returned chunk from the model, not to the full audio response. See the section “Example - Getting word start times using alignment values” for how to use this.

charDurationsMs
array

A list providing the duration (in milliseconds) for each character’s pronunciation in the audio. For instance, the character ‘H’ has a pronunciation duration of 3 ms.

chars
array

The list of characters in the original text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations.
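
The three alignment arrays are parallel, so they can be zipped into per-character timings. The sketch below does this for a single message, using the values from the sample response above. `char_timings` is a hypothetical helper, and the resulting times remain relative to the chunk, as noted in the field descriptions.

```python
def char_timings(alignment):
    """Return (char, start_ms, end_ms) tuples for one alignment object."""
    return [
        (char, start, start + duration)
        for char, start, duration in zip(
            alignment["chars"],
            alignment["charStartTimesMs"],
            alignment["charDurationsMs"],
        )
    ]


sample_alignment = {
    "charStartTimesMs": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
    "charDurationsMs": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
    "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"],
}
timings = char_timings(sample_alignment)
print(timings[0])   # ('H', 0, 3)
print(timings[-1])  # ('d', 21, 24)
```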

Path parameters

voice_id
string

The voice ID to use. You can use Get Voices to list all the available voices.

Query parameters

model_id
string

Identifier of the model to use. You can query the available models using Get Models.

language_code
string

Language code (ISO 639-1) used to enforce a language for the model. Currently only our v2.5 Flash & Turbo v2.5 models support language enforcement. For other models, an error will be returned if language code is provided.

enable_logging
string

Whether to enable request logging. If disabled, the request will not be present in history nor bigtable. Enabled by default. Note: simple logging (aka printing) to stdout/stderr is always enabled.

enable_ssml_parsing
boolean

Whether to enable/disable parsing of SSML tags within the provided text. For best results, we recommend sending SSML tags as fully contained messages to the websockets endpoint, otherwise this may result in additional latency. Please note that rendered text, in normalizedAlignment, will be altered in support of SSML tags. The rendered text will use a . as a placeholder for breaks, and syllables will be reported using the CMU arpabet alphabet where SSML phoneme tags are used to specify pronunciation. Disabled by default.

optimize_streaming_latency
string, deprecated

You can turn on latency optimizations at some cost of quality. The best possible final latency varies by model. Possible values:

Value   Description
0       default mode (no latency optimizations)
1       normal latency optimizations (about 50% of possible latency improvement of option 3)
2       strong latency optimizations (about 75% of possible latency improvement of option 3)
3       max latency optimizations
4       max latency optimizations, but also with text normalizer turned off for even more latency savings (best latency, but can mispronounce e.g. numbers and dates)

Defaults to 0.

output_format
string

Output format of the generated audio. Must be one of:

Value       Description
mp3_44100   default output format, MP3 with 44.1kHz sample rate
pcm_16000   PCM format (S16LE) with 16kHz sample rate
pcm_22050   PCM format (S16LE) with 22.05kHz sample rate
pcm_24000   PCM format (S16LE) with 24kHz sample rate
pcm_44100   PCM format (S16LE) with 44.1kHz sample rate
ulaw_8000   μ-law format (mulaw) with 8kHz sample rate (note that this format is commonly used for Twilio audio inputs)

Defaults to mp3_44100.
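
For the PCM formats, the sample rate and sample width are enough to estimate how much playback time a decoded audio chunk represents. The following is a rough sketch: `pcm_duration_ms` is a hypothetical helper, and the default of one channel (mono) is an assumption, since the table does not state the channel count.

```python
# Sample rates taken from the output_format table above.
PCM_SAMPLE_RATES = {
    "pcm_16000": 16000,
    "pcm_22050": 22050,
    "pcm_24000": 24000,
    "pcm_44100": 44100,
}
BYTES_PER_SAMPLE = 2  # S16LE: signed 16-bit little-endian


def pcm_duration_ms(num_bytes, output_format, channels=1):
    """Estimate the playback duration of raw PCM audio in milliseconds.

    channels=1 (mono) is an assumption; the table above does not state
    the channel count.
    """
    sample_rate = PCM_SAMPLE_RATES[output_format]
    samples = num_bytes / (BYTES_PER_SAMPLE * channels)
    return 1000.0 * samples / sample_rate


print(pcm_duration_ms(32000, "pcm_16000"))  # 1000.0
```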

inactivity_timeout
number

The number of seconds that the connection can be inactive before it is automatically closed.

Defaults to 20 seconds, with a maximum allowed value of 180 seconds.

sync_alignment
boolean

The audio for each text sequence is delivered in multiple chunks. By default when it’s set to false, you’ll receive all timing data (alignment information) with the first chunk only. However, if you enable this option, you’ll get the timing data with every audio chunk instead. This can help you precisely match each audio segment with its corresponding text.

auto_mode
boolean

This parameter reduces latency by disabling the chunk schedule and all buffers. It is only recommended when sending full sentences or phrases, because sending partial phrases will result in significantly reduced quality. By default it’s set to false.
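
When text arrives in arbitrary fragments, for example from an LLM, it can be regrouped into full sentences before it is sent with auto_mode enabled. The sketch below is a deliberately naive illustration: `sentences_from_stream` is a hypothetical helper, and the regular expression treats any whitespace after '.', '!' or '?' as a sentence boundary, so abbreviations such as "e.g." will be split incorrectly.

```python
import re

# A sentence boundary is whitespace that follows '.', '!' or '?'.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(pieces):
    """Regroup an iterable of text fragments into complete sentences."""
    buffer = ""
    for piece in pieces:
        buffer += piece
        parts = _SENTENCE_BOUNDARY.split(buffer)
        # All parts except the last are complete sentences.
        for sentence in parts[:-1]:
            yield sentence + " "
        buffer = parts[-1]
    # Whatever remains at the end is sent as the final phrase.
    if buffer.strip():
        yield buffer.strip() + " "


fragments = ["Hello wor", "ld. How are", " you? Fine"]
print(list(sentences_from_stream(fragments)))
# ['Hello world. ', 'How are you? ', 'Fine ']
```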

Example - Voice streaming using ElevenLabs and OpenAI

The following example demonstrates how to leverage the ElevenLabs Websockets API to stream input from OpenAI’s GPT model, while the answer is being generated, thereby minimizing the overall latency of the operation.

import asyncio
import websockets
import json
import base64
import shutil
import subprocess
from openai import AsyncOpenAI

# Define API keys and voice ID
OPENAI_API_KEY = '<OPENAI_API_KEY>'
ELEVENLABS_API_KEY = '<ELEVENLABS_API_KEY>'
VOICE_ID = '21m00Tcm4TlvDq8ikWAM'

# Create the async OpenAI client
aclient = AsyncOpenAI(api_key=OPENAI_API_KEY)


def is_installed(lib_name):
    return shutil.which(lib_name) is not None


async def text_chunker(chunks):
    """Split text into chunks, ensuring to not break sentences."""
    splitters = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")
    buffer = ""

    async for text in chunks:
        if buffer.endswith(splitters):
            yield buffer + " "
            buffer = text
        elif text.startswith(splitters):
            yield buffer + text[0] + " "
            buffer = text[1:]
        else:
            buffer += text

    if buffer:
        yield buffer + " "


async def stream(audio_stream):
    """Stream audio data using mpv player."""
    if not is_installed("mpv"):
        raise ValueError(
            "mpv not found, necessary to stream audio. "
            "Install instructions: https://mpv.io/installation/"
        )

    mpv_process = subprocess.Popen(
        ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"],
        stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )

    print("Started streaming audio")
    async for chunk in audio_stream:
        if chunk:
            mpv_process.stdin.write(chunk)
            mpv_process.stdin.flush()

    if mpv_process.stdin:
        mpv_process.stdin.close()
    mpv_process.wait()


async def text_to_speech_input_streaming(voice_id, text_iterator):
    """Send text to ElevenLabs API and stream the returned audio."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_flash_v2_5"

    async with websockets.connect(uri) as websocket:
        # First message: a single space, plus voice settings and the API key
        await websocket.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def listen():
            """Listen to the websocket for audio data and stream it."""
            while True:
                try:
                    message = await websocket.recv()
                    data = json.loads(message)
                    if data.get("audio"):
                        yield base64.b64decode(data["audio"])
                    elif data.get('isFinal'):
                        break
                except websockets.exceptions.ConnectionClosed:
                    print("Connection closed")
                    break

        listen_task = asyncio.create_task(stream(listen()))

        async for text in text_chunker(text_iterator):
            await websocket.send(json.dumps({"text": text}))

        # End of Sequence (EOS) message
        await websocket.send(json.dumps({"text": ""}))

        await listen_task


async def chat_completion(query):
    """Retrieve text from OpenAI and pass it to the text-to-speech function."""
    response = await aclient.chat.completions.create(
        model='gpt-4', messages=[{'role': 'user', 'content': query}],
        temperature=1, stream=True)

    async def text_iterator():
        async for chunk in response:
            delta = chunk.choices[0].delta
            # Some chunks carry no content (it can be None), so skip them
            if delta.content:
                yield delta.content

    await text_to_speech_input_streaming(VOICE_ID, text_iterator())


# Main execution
if __name__ == "__main__":
    user_query = "Hello, tell me a very long story."
    asyncio.run(chat_completion(user_query))

Example - Other examples for interacting with our Websocket API

Some examples of interacting with the Websocket API in different ways are provided below.

import asyncio
import websockets
import json
import base64


async def text_to_speech(voice_id):
    model = 'eleven_flash_v2_5'
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}"

    async with websockets.connect(uri) as websocket:

        # Initialize the connection
        bos_message = {
            "text": " ",
            "voice_settings": {
                "stability": 0.5,
                "similarity_boost": 0.8
            },
            "xi_api_key": "api_key_here",  # Replace with your API key
        }
        await websocket.send(json.dumps(bos_message))

        # Send "Hello World" input
        input_message = {
            "text": "Hello World "
        }
        await websocket.send(json.dumps(input_message))

        # Send EOS message with an empty string instead of a single space
        # as mentioned in the documentation
        eos_message = {
            "text": ""
        }
        await websocket.send(json.dumps(eos_message))

        # Handle server responses and print the data received
        while True:
            try:
                response = await websocket.recv()
                data = json.loads(response)
                print("Server response:", data)

                # Use .get() so a message without an "audio" key does not raise
                if data.get("audio"):
                    chunk = base64.b64decode(data["audio"])
                    print("Received audio chunk")
                else:
                    print("No audio data in the response")
                    break
            except websockets.exceptions.ConnectionClosed:
                print("Connection closed")
                break


asyncio.run(text_to_speech("voice_id_here"))

Example - Getting word start times using alignment values

This code example shows how the start times of words can be retrieved using the alignment values returned from our API.

import asyncio
import websockets
import json
import base64

# Define API keys and voice ID
ELEVENLABS_API_KEY = "INSERT HERE"  # <- INSERT YOUR API KEY HERE
VOICE_ID = 'nPczCjzI2devNBz1zQrb'  # Brian


def calculate_word_start_times(alignment_info):
    # Alignment start times are indexed from the start of the audio chunk that generated them
    # In order to analyse runtime over the entire response we keep a cumulative count of played audio
    full_alignment = {'chars': [], 'charStartTimesMs': [], 'charDurationsMs': []}
    cumulative_run_time = 0
    for old_dict in alignment_info:
        full_alignment['chars'].extend([" "] + old_dict['chars'])
        full_alignment['charDurationsMs'].extend([old_dict['charStartTimesMs'][0]] + old_dict['charDurationsMs'])
        full_alignment['charStartTimesMs'].extend([0] + [time+cumulative_run_time for time in old_dict['charStartTimesMs']])
        cumulative_run_time += sum(old_dict['charDurationsMs'])

    # We now have the start times of every character relative to the entire audio output
    zipped_start_times = list(zip(full_alignment['chars'], full_alignment['charStartTimesMs']))
    # Get the start time of every character that appears after a space and match this to the word
    # (a trailing space has no following character, so it is skipped)
    words = ''.join(full_alignment['chars']).split(" ")
    word_start_times = list(zip(words, [0] + [
        zipped_start_times[i+1][1]
        for (i, (a, b)) in enumerate(zipped_start_times)
        if a == ' ' and i + 1 < len(zipped_start_times)
    ]))
    print(f"total duration:{cumulative_run_time}")
    print(word_start_times)


async def text_to_speech_alignment_example(voice_id, text_to_send):
    """Send text to ElevenLabs API and stream the returned audio and alignment information."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_flash_v2_5"
    async with websockets.connect(uri) as websocket:
        await websocket.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8, "use_speaker_boost": False},
            "generation_config": {
                "chunk_length_schedule": [120, 160, 250, 290]
            },
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def text_iterator(text):
            """Split text into chunks to mimic streaming from an LLM or similar"""
            split_text = text.split(" ")
            words = 0
            to_send = ""
            for chunk in split_text:
                to_send += chunk + ' '
                words += 1
                if words >= 10:
                    print(to_send)
                    yield to_send
                    words = 0
                    to_send = ""
            # Skip an empty remainder: an empty string would be read as the EOS message
            if to_send:
                yield to_send

        async def listen():
            """Listen to the websocket for audio data and write it to a file."""
            audio_chunks = []
            alignment_info = []
            received_final_chunk = False
            print("Listening for chunks from ElevenLabs...")
            while not received_final_chunk:
                try:
                    message = await websocket.recv()
                    data = json.loads(message)
                    if data.get("audio"):
                        audio_chunks.append(base64.b64decode(data["audio"]))
                    if data.get("alignment"):
                        alignment_info.append(data.get("alignment"))
                    if data.get('isFinal'):
                        received_final_chunk = True
                        break
                except websockets.exceptions.ConnectionClosed:
                    print("Connection closed")
                    break
            print("Writing audio to file")
            with open("output_file.mp3", "wb") as f:
                f.write(b''.join(audio_chunks))

            calculate_word_start_times(alignment_info)

        listen_task = asyncio.create_task(listen())

        async for text in text_iterator(text_to_send):
            await websocket.send(json.dumps({"text": text}))
        await websocket.send(json.dumps({"text": " ", "flush": True}))
        await listen_task


# Main execution
if __name__ == "__main__":
    text_to_send = "The twilight sun cast its warm golden hues upon the vast rolling fields, saturating the landscape with an ethereal glow."
    asyncio.run(text_to_speech_alignment_example(VOICE_ID, text_to_send))

Understanding how our websockets buffer text

Our websocket service incorporates a buffer system designed to optimize the Time To First Byte (TTFB) while maintaining high-quality streaming.

All text sent to the websocket endpoint is added to this buffer and only when that buffer reaches a certain size is an audio generation attempted. This is because our model provides higher quality audio when the model has longer inputs, and can deduce more context about how the text should be delivered.

The buffer ensures smooth audio data delivery and is automatically emptied with a final audio generation either when the stream is closed, or upon sending a flush command. We have advanced settings for changing the chunk schedule, which can improve latency at the cost of quality by generating audio more frequently with smaller text inputs.
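
The buffering behaviour described in this section can be pictured with a toy model. The class below is purely illustrative and is not the server's implementation: `TextBuffer` is a hypothetical name, and it assumes that once the threshold is reached all buffered text is released for generation at once.

```python
class TextBuffer:
    """Toy model of the buffering described above (not the real server logic)."""

    def __init__(self, schedule=(120, 160, 250, 290)):
        self.schedule = list(schedule)
        self.generated = 0   # number of generations triggered so far
        self.pending = ""    # text received but not yet generated

    def _threshold(self):
        # Reuse the last schedule entry once the schedule is exhausted.
        index = min(self.generated, len(self.schedule) - 1)
        return self.schedule[index]

    def _release(self):
        chunk, self.pending = self.pending, ""
        self.generated += 1
        return chunk

    def add(self, text):
        """Add text; return the chunks released for generation (possibly none)."""
        self.pending += text
        if len(self.pending) >= self._threshold():
            return [self._release()]
        return []

    def flush(self):
        """Release whatever is buffered, regardless of its length."""
        return [self._release()] if self.pending else []


buffer = TextBuffer()
print(len(buffer.add("a" * 100)))  # 0: below the first threshold of 120
print(len(buffer.add("b" * 20)))   # 1: 120 characters reached
print(len(buffer.add("c" * 100)))  # 0: the next threshold is 160
print(len(buffer.flush()))         # 1: flush forces generation
```

The model shows why short final fragments need a flush or a closed stream: without it, text below the current threshold stays in the buffer and produces no audio.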
