This API provides real-time text-to-speech conversion over WebSockets, allowing you to stream text to the server and receive audio data back as it is generated.

Endpoint:
wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}

When to use

The Text-to-Speech Websockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the Websockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:

  • The input text is being streamed or generated in chunks.
  • Word-to-audio alignment information is required.

For a practical demonstration in a real-world application, refer to the Example of voice streaming using ElevenLabs and OpenAI section.

When not to use

However, it may not be the best choice when:

  • The entire input text is available upfront. Given that the generations are partial, some buffering is involved, which could potentially result in slightly higher latency compared to a standard HTTP request.
  • You want to quickly experiment or prototype. Working with Websockets can be more complex than using a standard HTTP API, which might slow down rapid development and testing.

In these cases, use the Text to Speech API instead.

Protocol

The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.
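
For orientation, here is a minimal session sketch (assuming the Python websockets library and placeholder credentials; the full examples below are more complete): the client opens the connection, sends JSON text messages ending with an empty-string EOS, and reads JSON audio messages until isFinal is received.

import asyncio
import base64
import json

import websockets  # pip install websockets

VOICE_ID = "<voice_id>"          # placeholder
MODEL_ID = "eleven_turbo_v2_5"
XI_API_KEY = "<XI API Key>"      # placeholder

async def minimal_session():
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input?model_id={MODEL_ID}"
    async with websockets.connect(uri) as websocket:
        # First message: a single space, plus settings and the API key.
        await websocket.send(json.dumps({"text": " ", "xi_api_key": XI_API_KEY}))
        # Stream text (each message ends with a space), then send the empty-string EOS.
        await websocket.send(json.dumps({"text": "Hello world! "}))
        await websocket.send(json.dumps({"text": ""}))
        # Read audio messages until the server marks the generation as final.
        async for message in websocket:
            data = json.loads(message)
            if data.get("audio"):
                audio_chunk = base64.b64decode(data["audio"])  # play or buffer this
            if data.get("isFinal"):
                break

if __name__ == "__main__":
    asyncio.run(minimal_session())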

Streaming input text

The client can send messages with text input to the server. The messages can contain the following fields:

{
  "text": "This is a sample text ",
  "voice_settings": {
    "stability": 0.8,
    "similarity_boost": 0.8
  },
  "generation_config": {
    "chunk_length_schedule": [120, 160, 250, 290]
  },
  "xi_api_key": "<XI API Key>",
  "authorization": "Bearer <Authorization Token>"
}

text
string
required

Should always end with a single space string " ". In the first message, the text should be a space " ".

try_trigger_generation
boolean
default: false
deprecated

This is an advanced setting that most users shouldn’t need to use. It relates to our generation schedule explained here.

Use this to attempt to immediately trigger the generation of audio, overriding the chunk_length_schedule. Unlike flush, try_trigger_generation will only generate audio if our buffer contains more than a minimum threshold of characters; this is to ensure a higher quality response from our model.

Note that overriding the chunk schedule to generate small amounts of text may result in lower quality audio; therefore, only use this parameter if you really need text to be processed immediately. We generally recommend keeping the default value of false and adjusting the chunk_length_schedule in the generation_config instead.

voice_settings
object

This property should only be provided in the first message you send.

generation_config
object

This property should only be provided in the first message you send.

chunk_length_schedule
array

This is an advanced setting that most users shouldn’t need to use. It relates to our generation schedule explained here.

Determines the minimum amount of text that needs to be sent and present in our buffer before audio starts being generated. This is to maximise the amount of context available to the model to improve audio quality, whilst balancing latency of the returned audio chunks.

The default value is: [120, 160, 250, 290].

This means that the first chunk of audio will not be generated until you send text that totals at least 120 characters long. The next chunk of audio will only be generated once a further 160 characters have been sent. The third audio chunk will be generated after the next 250 characters. Then the fourth, and beyond, will be generated in sets of at least 290 characters.

Customize this array to suit your needs. If you want to generate audio more frequently to optimise latency, you can reduce the values in the array. Note that setting the values too low may result in lower quality audio. Please test and adjust as needed.

Each item should be in the range 50-500.
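
The following is a small illustrative sketch (not part of the API) that simulates when generations would be attempted for a given schedule as text accumulates in the buffer; the actual server-side behaviour may differ in detail.

def simulate_generations(text_chunks, schedule=(120, 160, 250, 290)):
    """Print when each audio generation would be attempted for a given chunk schedule."""
    buffer = ""
    generation = 0
    for chunk in text_chunks:
        buffer += chunk
        # Thresholds beyond the end of the schedule reuse its last value.
        threshold = schedule[min(generation, len(schedule) - 1)]
        if len(buffer) >= threshold:
            print(f"Generation {generation + 1}: {len(buffer)} buffered characters (threshold {threshold})")
            buffer = ""
            generation += 1
    if buffer:
        print(f"{len(buffer)} characters remain buffered until a flush or the connection is closed")

# Twelve 50-character chunks: generations trigger after 150, 200 and 250 characters.
simulate_generations(["word " * 10 for _ in range(12)])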

flush
boolean

Flush forces the generation of audio. Set this value to true when you have finished sending text, but want to keep the websocket connection open.

This is useful when you want to ensure that the last chunk of audio is generated even when the length of text sent is smaller than the value set in chunk_length_schedule (e.g. 120 or 50).

To understand more about how our websockets buffer text before audio is generated, please refer to this section.
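
For instance (mirroring the alignment example later in this document), once the last piece of text has been sent you can force generation of any buffered text while keeping the connection open:

import json

async def flush_remaining_text(websocket):
    """Force generation of whatever text is still buffered, keeping the socket open."""
    await websocket.send(json.dumps({"text": " ", "flush": True}))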

xi_api_key
string

Provide the XI API Key in the first message if it’s not in the header.

authorization
string

Authorization bearer token. Should be provided only in the first message if not present in the header and the XI API Key is not provided.

For best latency, we recommend streaming word by word; this way, we will start generating as soon as the buffer reaches the predefined number of un-generated characters.

Close connection

In order to close the connection, the client should send an End of Sequence (EOS) message. The EOS message should always be an empty string:

End of Sequence (EOS) message
{
    "text": ""
}
text
string
required

Should always be an empty string "".

Streaming output audio

The server will always respond with a message containing the following fields:

Response message
{
    "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZSA6KQ==",
    "isFinal": false,
    "normalizedAlignment": {
        "charStartTimesMs": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
        "charDurationsMs": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
        "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
    },
    "alignment": {
        "charStartTimesMs": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
        "charDurationsMs": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3],
        "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
    }
}
audio
string

A generated partial audio chunk, encoded using the selected output_format. By default this is MP3, encoded as a base64 string.

isFinal
boolean

Indicates if the generation is complete. If set to true, audio will be null.

normalizedAlignment
object

Alignment information for the generated audio given the input normalized text sequence.

alignment
object

Alignment information for the generated audio given the original text sequence.
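
Taken together, a typical receive loop decodes the base64 audio, optionally collects the alignment objects, and stops when isFinal arrives. This is a condensed sketch of the pattern used in the full examples below.

import base64
import json

async def receive_audio(websocket):
    """Collect decoded audio bytes and alignment info until the generation is complete."""
    audio = bytearray()
    alignments = []
    async for message in websocket:
        data = json.loads(message)
        if data.get("audio"):
            audio.extend(base64.b64decode(data["audio"]))
        if data.get("alignment"):
            alignments.append(data["alignment"])
        if data.get("isFinal"):
            break
    return bytes(audio), alignments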

Path parameters

voice_id
string

Voice ID to be used. You can use Get Voices to list all the available voices.

Query parameters

model_id
string

Identifier of the model that will be used. You can query the available models using Get Models.

language_code
string

Language code (ISO 639-1) used to enforce a language for the model. Currently only Turbo v2.5 supports language enforcement; for other models, an error will be returned if a language code is provided.

enable_logging
string

Whether to enable request logging. If disabled, the request will not be present in history or bigtable. Enabled by default. Note: simple logging (aka printing) to stdout/stderr is always enabled.

enable_ssml_parsing
boolean

Whether to enable/disable parsing of SSML tags within the provided text. For best results, we recommend sending SSML tags as fully contained messages to the websockets endpoint, otherwise this may result in additional latency. Please note that rendered text, in normalizedAlignment, will be altered in support of SSML tags. The rendered text will use a . as a placeholder for breaks, and syllables will be reported using the CMU arpabet alphabet where SSML phoneme tags are used to specify pronunciation. Disabled by default.

optimize_streaming_latency
string
deprecated

You can turn on latency optimizations at some cost of quality. The best possible final latency varies by model. Possible values:

Value  Description
0      default mode (no latency optimizations)
1      normal latency optimizations (about 50% of possible latency improvement of option 3)
2      strong latency optimizations (about 75% of possible latency improvement of option 3)
3      max latency optimizations
4      max latency optimizations, but also with text normalizer turned off for even more latency savings (best latency, but can mispronounce e.g. numbers and dates)

Defaults to 0

output_format
string

Output format of the generated audio. Must be one of:

Value      Description
mp3_44100  default output format, MP3 with 44.1 kHz sample rate
pcm_16000  PCM format (S16LE) with 16 kHz sample rate
pcm_22050  PCM format (S16LE) with 22.05 kHz sample rate
pcm_24000  PCM format (S16LE) with 24 kHz sample rate
pcm_44100  PCM format (S16LE) with 44.1 kHz sample rate
ulaw_8000  μ-law format (mulaw) with 8 kHz sample rate (commonly used for Twilio audio inputs)

Defaults to mp3_44100

inactivity_timeout
number

The number of seconds that the connection can be inactive before it is automatically closed.

Defaults to 20 seconds, with a maximum allowed value of 180 seconds.

sync_alignment
boolean

The audio for each text sequence is delivered in multiple chunks. By default when it’s set to false, you’ll receive all timing data (alignment information) with the first chunk only. However, if you enable this option, you’ll get the timing data with every audio chunk instead. This can help you precisely match each audio segment with its corresponding text.
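
To illustrate how these query parameters attach to the endpoint (the specific values below are only examples), a connection URI can be assembled like this:

from urllib.parse import urlencode

VOICE_ID = "<voice_id>"  # placeholder

params = {
    "model_id": "eleven_turbo_v2_5",
    "output_format": "pcm_44100",
    "inactivity_timeout": 180,
    "sync_alignment": "true",
}
uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input?{urlencode(params)}"
print(uri)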

Example - Voice streaming using ElevenLabs and OpenAI

The following example demonstrates how to leverage the ElevenLabs Websockets API to stream input from OpenAI’s GPT model, while the answer is being generated, thereby minimizing the overall latency of the operation.

import asyncio
import websockets
import json
import base64
import shutil
import os
import subprocess
from openai import AsyncOpenAI

# Define API keys and voice ID
OPENAI_API_KEY = '<OPENAI_API_KEY>'
ELEVENLABS_API_KEY = '<ELEVENLABS_API_KEY>'
VOICE_ID = '21m00Tcm4TlvDq8ikWAM'

# Set OpenAI API key
aclient = AsyncOpenAI(api_key=OPENAI_API_KEY)

def is_installed(lib_name):
    return shutil.which(lib_name) is not None


async def text_chunker(chunks):
    """Split text into chunks, ensuring to not break sentences."""
    splitters = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")
    buffer = ""

    async for text in chunks:
        if buffer.endswith(splitters):
            yield buffer + " "
            buffer = text
        elif text.startswith(splitters):
            yield buffer + text[0] + " "
            buffer = text[1:]
        else:
            buffer += text

    if buffer:
        yield buffer + " "


async def stream(audio_stream):
    """Stream audio data using mpv player."""
    if not is_installed("mpv"):
        raise ValueError(
            "mpv not found, necessary to stream audio. "
            "Install instructions: https://mpv.io/installation/"
        )

    mpv_process = subprocess.Popen(
        ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"],
        stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )

    print("Started streaming audio")
    async for chunk in audio_stream:
        if chunk:
            mpv_process.stdin.write(chunk)
            mpv_process.stdin.flush()

    if mpv_process.stdin:
        mpv_process.stdin.close()
    mpv_process.wait()


async def text_to_speech_input_streaming(voice_id, text_iterator):
    """Send text to ElevenLabs API and stream the returned audio."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_turbo_v2_5"

    async with websockets.connect(uri) as websocket:
        await websocket.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def listen():
            """Listen to the websocket for audio data and stream it."""
            while True:
                try:
                    message = await websocket.recv()
                    data = json.loads(message)
                    if data.get("audio"):
                        yield base64.b64decode(data["audio"])
                    elif data.get('isFinal'):
                        break
                except websockets.exceptions.ConnectionClosed:
                    print("Connection closed")
                    break

        listen_task = asyncio.create_task(stream(listen()))

        async for text in text_chunker(text_iterator):
            await websocket.send(json.dumps({"text": text}))

        await websocket.send(json.dumps({"text": ""}))

        await listen_task


async def chat_completion(query):
    """Retrieve text from OpenAI and pass it to the text-to-speech function."""
    response = await aclient.chat.completions.create(
        model='gpt-4',
        messages=[{'role': 'user', 'content': query}],
        temperature=1,
        stream=True,
    )

    async def text_iterator():
        async for chunk in response:
            delta = chunk.choices[0].delta
            if delta.content:  # skip chunks without text content (e.g. role-only or final chunks)
                yield delta.content

    await text_to_speech_input_streaming(VOICE_ID, text_iterator())


# Main execution
if __name__ == "__main__":
    user_query = "Hello, tell me a very long story."
    asyncio.run(chat_completion(user_query))


Example - Other examples for interacting with our Websocket API

Some examples of interacting with the Websocket API in different ways are provided below.

Example - Getting word start times using alignment values

This code example shows how the start times of words can be retrieved using the alignment values returned from our API.

import asyncio
import websockets
import json
import base64

# Define API keys and voice ID
ELEVENLABS_API_KEY = "INSERT HERE"  # <- Insert your API key here
VOICE_ID = 'nPczCjzI2devNBz1zQrb'  # Brian

def calculate_word_start_times(alignment_info):
    # Alignment start times are indexed from the start of the audio chunk that generated them
    # In order to analyse runtime over the entire response we keep a cumulative count of played audio
    full_alignment = {'chars': [], 'charStartTimesMs': [], 'charDurationsMs': []}
    cumulative_run_time = 0
    for old_dict in alignment_info:
        full_alignment['chars'].extend([" "] + old_dict['chars'])
        full_alignment['charDurationsMs'].extend([old_dict['charStartTimesMs'][0]] + old_dict['charDurationsMs'])
        full_alignment['charStartTimesMs'].extend([0] + [time+cumulative_run_time for time in old_dict['charStartTimesMs']])
        cumulative_run_time += sum(old_dict['charDurationsMs'])
    
    # We now have the start times of every character relative to the entire audio output
    zipped_start_times = list(zip(full_alignment['chars'], full_alignment['charStartTimesMs']))
    # Get the start time of every character that appears after a space and match this to the word
    words = ''.join(full_alignment['chars']).split(" ")
    word_start_times = list(zip(words, [0] + [zipped_start_times[i+1][1] for (i, (a,b)) in enumerate(zipped_start_times) if a == ' ']))
    print(f"total duration:{cumulative_run_time}")
    print(word_start_times)


async def text_to_speech_alignment_example(voice_id, text_to_send):
    """Send text to ElevenLabs API and stream the returned audio and alignment information."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_turbo_v2_5"
    async with websockets.connect(uri) as websocket:
        await websocket.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.8, "use_speaker_boost": False},
            "generation_config": {
                "chunk_length_schedule": [120, 160, 250, 290]
            },
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def text_iterator(text):
            """Split text into chunks to mimic streaming from an LLM or similar"""
            split_text = text.split(" ")
            words = 0
            to_send = ""
            for chunk in split_text:
                to_send += chunk  + ' '
                words += 1
                if words >= 10:
                    print(to_send)
                    yield to_send
                    words = 0
                    to_send = ""
            yield to_send

        async def listen():
            """Listen to the websocket for audio data and write it to a file."""
            audio_chunks = []
            alignment_info = []
            received_final_chunk = False
            print("Listening for chunks from ElevenLabs...")
            while not received_final_chunk:
                try:
                    message = await websocket.recv()
                    data = json.loads(message)
                    if data.get("audio"):
                        audio_chunks.append(base64.b64decode(data["audio"]))
                    if data.get("alignment"):
                        alignment_info.append(data.get("alignment"))
                    if data.get('isFinal'):
                        received_final_chunk = True
                        break
                except websockets.exceptions.ConnectionClosed:
                    print("Connection closed")
                    break
            print("Writing audio to file")
            with open("output_file.mp3", "wb") as f:        
                f.write(b''.join(audio_chunks))

            calculate_word_start_times(alignment_info)


        listen_task = asyncio.create_task(listen())

        async for text in text_iterator(text_to_send):
            await websocket.send(json.dumps({"text": text}))
        await websocket.send(json.dumps({"text": " ", "flush": True}))
        await listen_task


# Main execution
if __name__ == "__main__":
    text_to_send = "The twilight sun cast its warm golden hues upon the vast rolling fields, saturating the landscape with an ethereal glow."
    asyncio.run(text_to_speech_alignment_example(VOICE_ID, text_to_send))

Understanding how our websockets buffer text

Our websocket service incorporates a buffer system designed to optimize the Time To First Byte (TTFB) while maintaining high-quality streaming.

All text sent to the websocket endpoint is added to this buffer, and only when that buffer reaches a certain size is an audio generation attempted. This is because our model produces higher quality audio when it has longer inputs and can deduce more context about how the text should be delivered.

The buffer ensures smooth audio data delivery and is automatically emptied with a final audio generation either when the stream is closed, or upon sending a flush command. We have advanced settings for changing the chunk schedule, which can improve latency at the cost of quality by generating audio more frequently with smaller text inputs.