This API provides real-time text-to-speech conversion over WebSockets. Clients can stream text to the server and receive audio data back in real time.

Endpoint: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}

When to use

The Text-to-Speech Websockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the Websockets API isn’t a one-size-fits-all solution. It’s well-suited for scenarios where:

  • Word-to-audio alignment information is required.
  • The input text is being streamed or generated in chunks.

However, it may not be the best choice when:

  • The entire input text is available upfront. Because the generations are partial, some buffering is involved, which can result in slightly higher latency than a standard HTTP request.
  • You want to quickly experiment or prototype. Working with Websockets is more complex than using a standard HTTP API, which can slow down rapid development and testing.

For a practical demonstration in a real-world application, refer to the Example of voice streaming using ElevenLabs and ChatGPT section.

Protocol

The WebSocket API uses a bidirectional protocol that encodes all messages as JSON objects.
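
To make the flow concrete, here is a minimal sketch of the exchange using the Python websockets library (the same library used in the examples below); the header-based authentication and placeholder values are illustrative:

import asyncio
import json
import websockets

async def demo(uri, api_key):
    # Authenticate via the xi-api-key header so the key does not need to
    # be repeated inside the first message.
    async with websockets.connect(uri, extra_headers={"xi-api-key": api_key}) as ws:
        await ws.send(json.dumps({"text": " "}))             # begin the stream
        await ws.send(json.dumps({"text": "Hello world "}))  # stream input text
        await ws.send(json.dumps({"text": ""}))              # EOS: close input
        async for message in ws:                             # responses are JSON too
            print(json.loads(message))

# asyncio.run(demo(uri, api_key)) with the endpoint URI shown above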

Streaming input text

The client can send messages with text input to the server. The messages can contain the following fields:

{
  "text": "This is a sample text ",
  "voice_settings": {
    "stability": 0.8,
    "similarity_boost": true
  },
  "generation_config": {
    "chunk_length_schedule": [120, 160, 250, 290]
  },
  "xi_api_key": "<XI API Key>",
  "authorization": "Bearer <Authorization Token>"
}

text
string
required

Should always end with a single space character " ". In the first message, the text must be a single space " ".

voice_settings
object

Should only be provided in the first message

try_trigger_generation
boolean

Whether the server should attempt to trigger audio generation with the text buffered so far.

generation_config
object

Should only be provided in the first message

xi_api_key
string

Provide the XI API Key in the first message if it’s not in the header. See Authentication for more details.

authorization
string

Authorization bearer token. Should be provided only in the first message if not present in the header and the XI API Key is not provided.

For best latency we recommend streaming word-by-word; this way, generation starts as soon as the buffered text reaches the predefined number of un-generated characters.
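
A minimal sketch of such a word-by-word sender, assuming an open websocket connection as in the examples below:

import json

async def send_word_by_word(websocket, sentence):
    # Stream one word per message; every chunk ends with the required
    # trailing space, and try_trigger_generation nudges the server to
    # start generating once enough text has been buffered.
    for word in sentence.split():
        await websocket.send(json.dumps({
            "text": word + " ",
            "try_trigger_generation": True,
        }))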

End of input

To close the connection, the client should send an EOS message. The text field of the EOS message should always be an empty string:

End of Sequence (EOS) message
{
    "text": ""
}
text
string
required

Should always be an empty string "".
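
In practice, the client sends the EOS message and then keeps reading until the server reports completion. A minimal sketch, assuming an open websocket connection and a hypothetical handle_chunk callback:

import base64
import json

async def finish_stream(websocket, handle_chunk):
    await websocket.send(json.dumps({"text": ""}))  # EOS message
    # Drain the remaining audio until the server marks the generation final.
    while True:
        data = json.loads(await websocket.recv())
        if data.get("audio"):
            handle_chunk(base64.b64decode(data["audio"]))
        if data.get("isFinal"):
            break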

Streaming output audio

The server will always respond with a message containing the following fields:

Response message
{
    "audio": "Y3VyaW91cyBtaW5kcyB0aGluayBhbGlrZSA6KQ==",
    "isFinal": false,
    "normalizedAlignment": {
        "char_start_times_ms": [0, 3, 7, 9, 11, 12, 13, 15, 17, 19, 21],
        "chars_durations_ms": [3, 4, 2, 2, 1, 1, 2, 2, 2, 2, 3]
        "chars": ["H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
    }
}
audio
string

A generated partial MP3 audio chunk encoded as a base64 string.
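
For instance, successive chunks can be decoded and appended to a single file (a sketch, assuming the default mp3_44100 output format, where data is a parsed response message as above):

import base64

# Append each decoded chunk; concatenated MP3 frames remain playable.
with open("output.mp3", "ab") as f:
    f.write(base64.b64decode(data["audio"]))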

isFinal
boolean

Indicates if the generation is complete. If set to true, audio will be null.

normalizedAlignment
object

Alignment information for the generated audio, relative to the normalized input text sequence.
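
As an illustration, the per-character start times and durations can be folded into word-level timestamps (a sketch; the alignment describes the normalized text, which may differ from the raw input):

def word_timings(alignment):
    """Group character alignment into (word, start_ms, end_ms) tuples."""
    words = []
    word, start, end = "", None, None
    for ch, t, d in zip(alignment["chars"],
                        alignment["char_start_times_ms"],
                        alignment["chars_durations_ms"]):
        if ch == " ":
            if word:
                words.append((word, start, end))
            word, start = "", None
        else:
            if start is None:
                start = t
            end = t + d
            word += ch
    if word:
        words.append((word, start, end))
    return words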

Path parameters

voice_id
string

Voice ID to be used. You can use Get Voices to list all available voices.

Query parameters

model_id
string

Identifier of the model to be used. You can query available models using the Get Models endpoint.

enable_logging
boolean

Whether to enable request logging. If disabled, the request will be present neither in the history nor in Bigtable. Enabled by default. Note: simple logging (i.e. printing) to stdout/stderr is always enabled.

optimize_streaming_latency
string

You can turn on latency optimizations at some cost of quality. The best possible final latency varies by model. Possible values:

Value | Description
------|------------
0     | Default mode (no latency optimizations).
1     | Normal latency optimizations (about 50% of the possible latency improvement of option 3).
2     | Strong latency optimizations (about 75% of the possible latency improvement of option 3).
3     | Max latency optimizations.
4     | Max latency optimizations, but with the text normalizer turned off for even more latency savings (best latency, but can mispronounce e.g. numbers and dates).

Defaults to 0

output_format
string

Output format of the generated audio. Must be one of:

Value     | Description
----------|------------
mp3_44100 | Default output format: MP3 with a 44.1 kHz sample rate.
pcm_16000 | PCM format (S16LE) with a 16 kHz sample rate.
pcm_22050 | PCM format (S16LE) with a 22.05 kHz sample rate.
pcm_24000 | PCM format (S16LE) with a 24 kHz sample rate.
pcm_44100 | PCM format (S16LE) with a 44.1 kHz sample rate.
ulaw_8000 | μ-law format (mulaw) with an 8 kHz sample rate. (Commonly used for Twilio audio inputs.)

Defaults to mp3_44100
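
Putting the path and query parameters together, the connection URL can be built like this (a sketch; the parameter values are illustrative):

from urllib.parse import urlencode

voice_id = "21m00Tcm4TlvDq8ikWAM"
params = urlencode({
    "model_id": "eleven_monolingual_v1",
    "optimize_streaming_latency": 3,
    "output_format": "pcm_44100",
})
uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?{params}"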

Example of voice streaming using ElevenLabs and ChatGPT

The following example demonstrates how to leverage the ElevenLabs Websockets API to stream input from ChatGPT while the answer is being generated, thereby minimizing the overall latency of the operation.

import asyncio
import websockets
import json
import openai
import base64
import shutil
import os
import subprocess

# Define API keys and voice ID
OPENAI_API_KEY = '<OPENAI_API_KEY>'
ELEVENLABS_API_KEY = '<ELEVENLABS_API_KEY>'
VOICE_ID = '21m00Tcm4TlvDq8ikWAM'

# Set OpenAI API key
openai.api_key = OPENAI_API_KEY


def is_installed(lib_name):
    return shutil.which(lib_name) is not None


async def text_chunker(chunks):
    """Split text into chunks, ensuring to not break sentences."""
    splitters = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")
    buffer = ""

    async for text in chunks:
        if buffer.endswith(splitters):
            yield buffer + " "
            buffer = text
        elif text.startswith(splitters):
            yield buffer + text[0] + " "
            buffer = text[1:]
        else:
            buffer += text

    if buffer:
        yield buffer + " "


async def stream(audio_stream):
    """Stream audio data using mpv player."""
    if not is_installed("mpv"):
        raise ValueError(
            "mpv not found, necessary to stream audio. "
            "Install instructions: https://mpv.io/installation/"
        )

    mpv_process = subprocess.Popen(
        ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"],
        stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )

    print("Started streaming audio")
    async for chunk in audio_stream:
        if chunk:
            mpv_process.stdin.write(chunk)
            mpv_process.stdin.flush()

    if mpv_process.stdin:
        mpv_process.stdin.close()
    mpv_process.wait()


async def text_to_speech_input_streaming(voice_id, text_iterator):
    """Send text to ElevenLabs API and stream the returned audio."""
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id=eleven_monolingual_v1"

    async with websockets.connect(uri) as websocket:
        await websocket.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": True},
            "xi_api_key": ELEVENLABS_API_KEY,
        }))

        async def listen():
            """Listen to the websocket for audio data and stream it."""
            while True:
                try:
                    message = await websocket.recv()
                    data = json.loads(message)
                    if data.get("audio"):
                        yield base64.b64decode(data["audio"])
                    elif data.get('isFinal'):
                        break
                except websockets.exceptions.ConnectionClosed:
                    print("Connection closed")
                    break

        listen_task = asyncio.create_task(stream(listen()))

        async for text in text_chunker(text_iterator):
            await websocket.send(json.dumps({"text": text, "try_trigger_generation": True}))

        await websocket.send(json.dumps({"text": ""}))

        await listen_task


async def chat_completion(query):
    """Retrieve text from OpenAI and pass it to the text-to-speech function."""
    response = await openai.ChatCompletion.acreate(
        model='gpt-4', messages=[{'role': 'user', 'content': query}],
        temperature=1, stream=True
    )

    async def text_iterator():
        async for chunk in response:
            delta = chunk['choices'][0]["delta"]
            # Role-only or empty deltas carry no 'content'; skip them
            # rather than stopping, so the full answer is streamed.
            if 'content' in delta:
                yield delta["content"]

    await text_to_speech_input_streaming(VOICE_ID, text_iterator())


# Main execution
if __name__ == "__main__":
    user_query = "Hello, tell me a very long story."
    asyncio.run(chat_completion(user_query))


Other examples

Some examples of interacting with the Websockets API in different ways are provided below.

import asyncio
import websockets
import json
import base64

async def text_to_speech(voice_id):
    model = 'eleven_monolingual_v1'
    uri = f"wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input?model_id={model}"

    async with websockets.connect(uri) as websocket:

        # Initialize the connection
        bos_message = {
            "text": " ",
            "voice_settings": {
                "stability": 0.5,
                "similarity_boost": True
            },
            "xi_api_key": "api_key_here",  # Replace with your API key
        }
        await websocket.send(json.dumps(bos_message))

        # Send "Hello World" input
        input_message = {
            "text": "Hello World ",
            "try_trigger_generation": True
        }
        await websocket.send(json.dumps(input_message))

        # Send the EOS message: an empty string, not a single space
        eos_message = {
            "text": ""
        }
        await websocket.send(json.dumps(eos_message))

        # Handle server responses and print the data received
        while True:
            try:
                response = await websocket.recv()
                data = json.loads(response)
                print("Server response:", data)

                if data["audio"]:
                    chunk = base64.b64decode(data["audio"])
                    print("Received audio chunk")
                else:
                    print("No audio data in the response")
                    break
            except websockets.exceptions.ConnectionClosed:
                print("Connection closed")
                break

asyncio.run(text_to_speech("voice_id_here"))