This API provides real-time text-to-speech conversion using WebSockets. Clients can send a text message and receive audio data in real-time.

Endpoint: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}

When to use

The Text-to-Speech Websockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the Websockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:

Word-to-audio alignment information is required.

The input text is being streamed or generated in chunks.

However, it may not be the best choice when:

The entire input text is available upfront. Given that the generations are partial, some buffering is involved, which could potentially result in slightly higher latency compared to a standard HTTP request.

You want to quickly experiment or prototype. Working with Websockets is more complex than using a standard HTTP API, which might slow down rapid development and testing.

For a practical demonstration in a real-world application, refer to the Example of voice streaming using ElevenLabs and ChatGPT section.

The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.

Streaming input text

The client can send messages with text input to the server. The messages can contain the following fields:

{
  "text": "This is a sample text ",
  "voice_settings": {
    "stability": 0.8,
    "similarity_boost": 0.8
  },
  "generation_config": {
    "chunk_length_schedule": [120, 160, 250, 290]
  },
  "xi_api_key": "<XI API Key>",
  "authorization": "Bearer <Authorization Token>"
}

text string required Should always end with a single space (" "). In the first message, the text should be just a single space (" ").

voice_settings object Should only be provided in the first message.

    stability number Defines the stability for voice settings.
    similarity_boost number Defines the similarity boost for voice settings.
    style number Defines the style for voice settings. This parameter is available on V2+ models.
    use_speaker_boost boolean Defines whether speaker boost is used for voice settings. This parameter is available on V2+ models.

try_trigger_generation boolean required Whether to try to trigger the generation.

generation_config object Should only be provided in the first message.

    chunk_length_schedule array Determines how text is chunked for processing. Default: [120, 160, 250, 290]. Each item should be in the range [50, 500].

xi_api_key string Provide the XI API Key in the first message if it is not in the header. See Authentication for more details.

authorization string Authorization bearer token. Should be provided only in the first message, and only if it is not present in the header and the XI API Key is not provided.

For the best latency, we recommend streaming word by word. Generation then starts as soon as the predefined number of un-generated characters is reached.
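As a rough illustration of the fields above, the sketch below builds the first message and the follow-up text messages as JSON strings. The helper names `build_first_message` and `build_text_message` are hypothetical, and the default values are taken from the examples on this page rather than from any server-side default. It assumes the API key is sent in the message body rather than in a header.

```python
import json


def build_first_message(api_key, stability=0.5, similarity_boost=0.8,
                        chunk_length_schedule=(120, 160, 250, 290)):
    """Build the first message: a single space plus the one-time settings."""
    # Each schedule item should be in the range [50, 500].
    for length in chunk_length_schedule:
        if not 50 <= length <= 500:
            raise ValueError(f"chunk length {length} is outside [50, 500]")
    return json.dumps({
        "text": " ",  # the first message carries only a single space
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
        "generation_config": {
            "chunk_length_schedule": list(chunk_length_schedule),
        },
        "xi_api_key": api_key,  # assumption: key is not sent in the header
    })


def build_text_message(text, try_trigger_generation=True):
    """Build a follow-up message whose text ends with exactly one space."""
    return json.dumps({
        "text": text.rstrip(" ") + " ",
        "try_trigger_generation": try_trigger_generation,
    })
```

Settings such as `voice_settings` and `generation_config` appear only in the first builder because they should only be provided in the first message.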

End of input

To close the connection, the client should send an End of Sequence (EOS) message. The text of the EOS message should always be an empty string:

End of Sequence (EOS) message

{ "text": "" }

text string required Should always be an empty string ("").
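Putting the input rules together, a complete session is an opening single-space message, a series of text messages that each end with a space, and a closing EOS message. The following is a minimal sketch of that sequence; `message_sequence` is a hypothetical helper, and it assumes authentication is handled in the connection headers so that no key appears in the first message.

```python
import json


def message_sequence(text):
    """Yield the JSON messages for one utterance, word by word."""
    # First message: a single space.
    yield json.dumps({"text": " "})
    for word in text.split():
        # Every text message ends with a single space.
        yield json.dumps({"text": word + " ", "try_trigger_generation": True})
    # EOS message: the text is always an empty string.
    yield json.dumps({"text": ""})


texts = [json.loads(m)["text"] for m in message_sequence("Hello world")]
print(texts)  # [' ', 'Hello ', 'world ', '']
```

Each yielded string would be passed to the WebSocket client's send call in order.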

Streaming output audio

The server will always respond with a message containing the following fields:

Response message

{
  "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZSA6KQ==",
  "isFinal": false,
  "normalizedAlignment": {
    "char_start_times_ms": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
    "chars_durations_ms": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
    "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
  },
  "alignment": {
    "char_start_times_ms": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
    "chars_durations_ms": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
    "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
  }
}

audio string A generated partial MP3 audio chunk encoded as a base64 string.

isFinal boolean Indicates if the generation is complete. If set to true, audio will be null.

normalizedAlignment object Alignment information for the generated audio given the input normalized text sequence.

    char_start_times_ms array A list of starting times (in milliseconds) for each character in the normalized text as it corresponds to the audio. For instance, the character ‘H’ starts at time 0 ms in the audio.
    chars_durations_ms array A list providing the duration (in milliseconds) for each character’s pronunciation in the audio. For instance, the character ‘H’ has a pronunciation duration of 3 ms.
    chars array The list of characters in the normalized text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations.

alignment object Alignment information for the generated audio given the original text sequence.

    char_start_times_ms array A list of starting times (in milliseconds) for each character in the original text as it corresponds to the audio. For instance, the character ‘H’ starts at time 0 ms in the audio.
    chars_durations_ms array A list providing the duration (in milliseconds) for each character’s pronunciation in the audio. For instance, the character ‘H’ has a pronunciation duration of 3 ms.
    chars array The list of characters in the original text sequence that corresponds with the timings and durations. This list is used to map the characters to their respective starting times and durations.
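To make the response fields concrete, the sketch below shows one way a client might consume them: decoding the base64 audio chunks until `isFinal` is set, and grouping the per-character alignment into word-level timings. Both helper names (`collect_audio`, `word_timings`) are hypothetical, and splitting words on whitespace is an assumption made for illustration, not something the API prescribes.

```python
import base64
import json


def collect_audio(raw_messages):
    """Concatenate decoded audio chunks until a message with isFinal arrives."""
    audio = bytearray()
    for raw in raw_messages:
        data = json.loads(raw)
        if data.get("audio"):  # audio is null in the final message
            audio.extend(base64.b64decode(data["audio"]))
        if data.get("isFinal"):
            break
    return bytes(audio)


def word_timings(alignment):
    """Group per-character timings into (word, start_ms, end_ms) tuples."""
    words = []
    current, start, end = "", None, None
    for ch, t0, dur in zip(alignment["chars"],
                           alignment["char_start_times_ms"],
                           alignment["chars_durations_ms"]):
        if ch.isspace():
            # Whitespace closes the word collected so far, if any.
            if current:
                words.append((current, start, end))
            current, start, end = "", None, None
        else:
            if not current:
                start = t0
            current += ch
            end = t0 + dur
    if current:
        words.append((current, start, end))
    return words


sample = {
    "char_start_times_ms": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
    "chars_durations_ms": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
    "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"],
}
print(word_timings(sample))  # [('Hello', 0, 12), ('world', 13, 24)]
```

The sample values are the ones from the response message above; a word's end time is taken as the start time of its last character plus that character's duration.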

Path parameters

voice_id string Voice ID to be used. You can use Get Voices to list all the available voices.

Query parameters

model_id string Identifier of the model that will be used. You can query the available models using Get Models.

enable_logging string Whether to enable request logging. If disabled, the request will not be present in history or in bigtable. Enabled by default. Note: simple logging (i.e. printing) to stdout/stderr is always enabled.

optimize_streaming_latency string You can turn on latency optimizations at some cost of quality. The best possible final latency varies by model. Defaults to 0. Possible values:

    Value  Description
    0      default mode (no latency optimizations)
    1      normal latency optimizations (about 50% of possible latency improvement of option 3)
    2      strong latency optimizations (about 75% of possible latency improvement of option 3)
    3      max latency optimizations
    4      max latency optimizations, but also with text normalizer turned off for even more latency savings (best latency, but can mispronounce e.g. numbers and dates)

output_format string Output format of the generated audio. Defaults to mp3_44100. Must be one of:

    Value      Description
    mp3_44100  default output format, MP3 with 44.1kHz sample rate
    pcm_16000  PCM format (S16LE) with 16kHz sample rate
    pcm_22050  PCM format (S16LE) with 22.05kHz sample rate
    pcm_24000  PCM format (S16LE) with 24kHz sample rate
    pcm_44100  PCM format (S16LE) with 44.1kHz sample rate
    ulaw_8000  μ-law format (mulaw) with 8kHz sample rate (note that this format is commonly used for Twilio audio inputs)
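The path and query parameters can be assembled into a connection URI before opening the socket. The sketch below is one possible way to do so with the standard library; `build_uri` is a hypothetical helper, and the client-side validation simply mirrors the value lists above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "wss://api.elevenlabs.io/v1/text-to-speech"
OUTPUT_FORMATS = {
    "mp3_44100", "pcm_16000", "pcm_22050",
    "pcm_24000", "pcm_44100", "ulaw_8000",
}


def build_uri(voice_id, model_id, optimize_streaming_latency=0,
              output_format="mp3_44100"):
    """Build the stream-input URI from the path and query parameters."""
    if optimize_streaming_latency not in range(5):
        raise ValueError("optimize_streaming_latency must be 0, 1, 2, 3 or 4")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format}")
    query = urlencode({
        "model_id": model_id,
        "optimize_streaming_latency": optimize_streaming_latency,
        "output_format": output_format,
    })
    # Percent-encode the voice ID so it is safe as a single path segment.
    return f"{BASE_URL}/{quote(voice_id, safe='')}/stream-input?{query}"


print(build_uri("21m00Tcm4TlvDq8ikWAM", "eleven_monolingual_v1"))
```

Validating locally is a design choice that surfaces typos before the connection attempt; the server remains the authority on which values it accepts.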

Example of voice streaming using ElevenLabs and ChatGPT

The following example demonstrates how to leverage the ElevenLabs Websockets API to stream input from ChatGPT while the answer is being generated, thereby minimizing the overall latency of the operation.

import asyncio
import base64
import json
import shutil
import subprocess

import websockets
from openai import AsyncOpenAI

OPENAI_API_KEY = '<OPENAI_API_KEY>'
ELEVENLABS_API_KEY = '<ELEVENLABS_API_KEY>'
VOICE_ID = '21m00Tcm4TlvDq8ikWAM'

aclient = AsyncOpenAI(api_key=OPENAI_API_KEY)


def is_installed(lib_name):
    """Return True if the given executable is available on PATH."""
    return shutil.which(lib_name) is not None


async def text_chunker(chunks):
    """Split text into chunks, ensuring to not break sentences."""
    splitters = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")
    buffer = ""

    async for text in chunks:
        if buffer.endswith(splitters):
            yield buffer + " "
            buffer = text
        elif text.startswith(splitters):
            yield buffer + text[0] + " "
            buffer = text[1:]
        else:
            buffer += text

    # Flush whatever is left once the input is exhausted.
    if buffer:
        yield buffer + " "


async def stream(audio_stream):
    """Stream audio data using mpv player."""
    if not is_installed("mpv"):
        raise ValueError(
            "mpv not found, necessary to stream audio. "
            "Install instructions: https://mpv.io/installation/"
        )

    mpv_process = subprocess.Popen(
        ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"],
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )

    print("Started streaming audio")
    async for chunk in audio_stream:
        if chunk:
            mpv_process.stdin.write(chunk)
            mpv_process.stdin.flush()

    if mpv_process.stdin:
        mpv_process.stdin.close()
    mpv_process.wait()


async def text_to_speech_input_streaming(voice_id, text_iterator):
    """Send text to ElevenLabs API and stream the returned audio."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_monolingual_v1"

    async with websockets.connect(uri) as websocket:
        # First message: a single space plus the one-time settings and API key.
        await websocket.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def listen():
            """Listen to the websocket for audio data and stream it."""
            while True:
                try:
                    message = await websocket.recv()
                    data = json.loads(message)
                    if data.get("audio"):
                        yield base64.b64decode(data["audio"])
                    elif data.get('isFinal'):
                        break
                except websockets.exceptions.ConnectionClosed:
                    print("Connection closed")
                    break

        # Play audio in the background while text is still being sent.
        listen_task = asyncio.create_task(stream(listen()))

        async for text in text_chunker(text_iterator):
            await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))

        # EOS message: an empty string signals the end of input.
        await websocket.send(json.dumps({"text": ""}))

        await listen_task


async def chat_completion(query):
    """Retrieve text from OpenAI and pass it to the text-to-speech function."""
    response = await aclient.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': query}],
        temperature=1,
        stream=True,
    )

    async def text_iterator():
        async for chunk in response:
            delta = chunk.choices[0].delta
            # Some chunks (e.g. the last one) carry no content; skip them
            # so text_chunker never receives None.
            if delta.content:
                yield delta.content

    await text_to_speech_input_streaming(VOICE_ID, text_iterator())


if __name__ == "__main__":
    user_query = "Hello, tell me a very long story."
    asyncio.run(chat_completion(user_query))

Other examples

Some examples for interacting with the Websocket API in different ways are provided below