Reducing Latency

Seven methods for reducing streaming latency, in order of highest to lowest effectiveness:

1. Use the v2.5 Flash Model

Our cutting-edge Eleven v2.5 Flash Model is ideally suited for tasks demanding extremely low latency. The new flash model_id is eleven_flash_v2_5.

2. Use the streaming API

ElevenLabs provides three text-to-speech endpoints:

  1. A regular text-to-speech endpoint
  2. A streaming text-to-speech endpoint
  3. A websockets text-to-speech endpoint

The regular endpoint renders the audio file before returning it in the response. The streaming endpoint streams back the audio as it is being generated, resulting in much lower response time from request to first byte of audio received. For applications that require low latency, the streaming endpoint is therefore recommended.

3. Use the input streaming Websocket

For applications where the text prompts can be streamed to the text-to-speech endpoints (such as LLM output), this allows for prompts to be fed to the endpoint while the speech is being generated. You can also configure the streaming chunk size when using the websocket, with smaller chunks generally rendering faster. As such, we recommend sending content word by word, our model and tooling leverages context to ensure that sentence structure and more are persisted to the generated audio even if we only receive a word at a time.

4. Update to the Enterprise plan

Enterprise customers receive top priority in the rendering queue, which ensures that they always experience the lowest possible latency, regardless of model usage load.

5. Use Default Voices, Synthetic Voices, & IVCs rather than PVCs

Based on our testing, we’ve observed that default voices (formerly called ‘premade’ voices), synthetic voices, and Instant Voice Clones tend to have lower latency compared to Professional Voice Clones. A Professional Voice Clone allows for a much more accurate clone of the original samples, albeit with a slight increase in latency at present. Unfortunately, it is a slight tradeoff and we’re currently working on optimizing the latency of the Professional Voice Clones for v2.5 Flash.

6. Reuse HTTPS Sessions When Streaming

When streaming through the websocket, reusing an established SSL/TLS session helps reduce latency by skipping the handshake process. This improves latency for all requests after the session’s first.

6a. Limit the Number of Websocket Connection Closures

Similarly, for websockets we leverage the WSS protocol and so an SSL/TLS handshake takes place at the beginning of a connection, which adds overhead. As such, we recommend to limit the number of times a connection is closed and reopened to the extent possible.

7. Leverage Servers Closer to the US

Today, our APIs are served from the US, and as such users may experience latency from increased network routing when communicating with these APIs outside of the United States.