Latency optimization
Learn how to optimize text-to-speech latency.
This guide covers the core principles for improving text-to-speech latency.
While there are many individual techniques, we’ll group them into four principles.
Four principles
Enterprise customers benefit from increased concurrency limits and priority access to our rendering queue. Contact sales to learn more about our enterprise plans.
Use Flash models
Flash models deliver ~75ms inference speeds, making them ideal for real-time applications. The trade-off is a slight reduction in audio quality compared to Multilingual v2.
75ms refers to model inference time only. Actual end-to-end latency will vary with factors such as your location and the endpoint type used.
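As a rough sketch, the request below explicitly asks for a Flash model. It uses Python's requests library; the API key and voice ID are placeholders, and you should confirm the model ID against the models available on your plan.

```python
import requests

API_KEY = "YOUR_API_KEY"    # placeholder: your ElevenLabs API key
VOICE_ID = "YOUR_VOICE_ID"  # placeholder: any voice you have access to

# Requesting a Flash model (eleven_flash_v2_5) keeps inference time low.
response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY},
    json={
        "text": "Hello! This is a low-latency test.",
        "model_id": "eleven_flash_v2_5",
    },
)
response.raise_for_status()

with open("output.mp3", "wb") as f:
    f.write(response.content)
```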
Leverage streaming
There are three types of text-to-speech endpoints available in our API Reference:
- Regular endpoint: Returns a complete audio file in a single response.
- Streaming endpoint: Returns audio chunks progressively using Server-sent events.
- Websockets endpoint: Enables bidirectional streaming for real-time audio generation.
Streaming
The streaming endpoint progressively returns audio as it is generated, reducing time-to-first-byte. It is recommended for cases where the full input text is available up-front.
Streaming is supported for the Text to Speech API, Voice Changer API & Audio Isolation API.
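As an illustrative sketch of the streaming endpoint (again with placeholder credentials, and assuming the /stream path variant of the text-to-speech endpoint), audio chunks can be consumed as they arrive:

```python
import requests

API_KEY = "YOUR_API_KEY"    # placeholder
VOICE_ID = "YOUR_VOICE_ID"  # placeholder

# The /stream endpoint returns audio chunks as they are generated,
# so playback or forwarding can begin before the full file is ready.
with requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
    headers={"xi-api-key": API_KEY},
    json={
        "text": "Streaming reduces time-to-first-byte.",
        "model_id": "eleven_flash_v2_5",
    },
    stream=True,
) as response:
    response.raise_for_status()
    with open("output.mp3", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
```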
Websockets
The text-to-speech websocket endpoint supports bidirectional streaming, making it perfect for applications with real-time text input (e.g. LLM outputs).
Setting auto_mode to true automatically handles generation triggers, removing the need to manually manage chunk strategies. If auto_mode is disabled, the model will wait for enough text to match the chunk schedule before starting to generate audio.
For instance, if you set a chunk schedule of 125 characters but only 50 arrive, the model stalls until additional characters come in—potentially increasing latency.
For implementation details, see the text-to-speech websocket guide.
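The sketch below shows the general shape of a websocket session with auto_mode enabled, using the third-party websockets package. The query parameters and message fields (text, xi_api_key, audio, isFinal) are an approximation; treat the websocket guide as the authoritative reference.

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

API_KEY = "YOUR_API_KEY"    # placeholder: your ElevenLabs API key
VOICE_ID = "YOUR_VOICE_ID"  # placeholder

# auto_mode is passed as a query parameter here; see the websocket guide
# for the authoritative parameter placement and message schema.
URI = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input"
    "?model_id=eleven_flash_v2_5&auto_mode=true"
)

async def speak(text_chunks):
    async with websockets.connect(URI) as ws:
        # The first message opens the stream and carries the API key.
        await ws.send(json.dumps({"text": " ", "xi_api_key": API_KEY}))

        # Send text as it becomes available (e.g. tokens from an LLM).
        for chunk in text_chunks:
            await ws.send(json.dumps({"text": chunk}))

        # An empty string signals that no more text is coming.
        await ws.send(json.dumps({"text": ""}))

        # Collect base64-encoded audio chunks as they arrive.
        audio = b""
        async for message in ws:
            data = json.loads(message)
            if data.get("audio"):
                audio += base64.b64decode(data["audio"])
            if data.get("isFinal"):
                break
        return audio

audio = asyncio.run(speak(["Hello ", "from the ", "websocket endpoint."]))
with open("output.mp3", "wb") as f:
    f.write(audio)
```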
Choose appropriate voices
We have observed that in some cases, voice selection can impact latency. Here’s the order from fastest to slowest:
- Default voices (formerly premade), Synthetic voices, and Instant Voice Clones (IVC)
- Professional Voice Clones (PVC)
Higher audio quality output formats can increase latency. Be sure to balance your latency requirements with audio fidelity needs.
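For illustration, the snippet below assumes output_format is supplied as a query parameter and uses mp3_22050_32 as an example of a lower-bitrate format; check the API reference for the formats available on your plan.

```python
import requests

API_KEY = "YOUR_API_KEY"    # placeholder
VOICE_ID = "YOUR_VOICE_ID"  # placeholder

# A lower-bitrate output format is typically faster to deliver than
# high-fidelity PCM; pick the lowest quality your use case tolerates.
response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY},
    params={"output_format": "mp3_22050_32"},
    json={
        "text": "Balancing latency and fidelity.",
        "model_id": "eleven_flash_v2_5",
    },
)
response.raise_for_status()
```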
Consider geographic proximity
We serve our models from multiple regions to optimize latency based on your geographic location. By default all self-serve users use our US region.
For example, using Flash models with Websockets, you can expect the following TTFB latencies via our US region:
*European customers can access our dedicated European tech stack for optimal latency of 150-200ms. Contact your sales representative to get onboarded to our European infrastructure.
Global TTS API preview
ElevenLabs is launching inference servers in additional geographical regions to reduce latency for clients outside of the US. This section describes how to use the early preview of this feature.
This feature is still under development and is being provided solely on an “AS IS” and “AS AVAILABLE” basis, without warranties of any kind. It may be modified or discontinued at our sole discretion.
Use of this feature may result in request processing outside of the USA, specifically in the Netherlands and Singapore, to reduce latency where possible. This feature does not provide data residency guarantees, and all data will continue to be stored and backed up in our USA-based servers. If data residency is required, please refer to our data residency documentation.
How to use
Simply replace api.elevenlabs.io with api-global-preview.elevenlabs.io in your TTS API calls. When using the SDK, you need to override the base URL.
The geographically closest region will handle your request to minimize latency. You can check which region is serving your request by inspecting the x-region header.
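A minimal sketch, assuming the same request shape as the earlier examples with placeholder credentials:

```python
import requests

API_KEY = "YOUR_API_KEY"    # placeholder
VOICE_ID = "YOUR_VOICE_ID"  # placeholder

# Identical request body; only the hostname points at the global preview.
response = requests.post(
    f"https://api-global-preview.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY},
    json={
        "text": "Hello from the nearest region.",
        "model_id": "eleven_flash_v2_5",
    },
)
response.raise_for_status()

# The x-region response header reports which region served the request.
print("Served by region:", response.headers.get("x-region"))
```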
Expected latency improvements
Based on our benchmarking, for the Turbo and Flash models we expect the following improvements to TTFB (time to first byte), depending on your location:
- Europe: 80-100 ms
- Japan: 80-100 ms
- India: 150-200 ms
- Singapore: 150-200 ms
Limitations and known issues
- Limited product support: Only TTS requests (/v1/text-to-speech) are supported at this time.
  - Requests to other API products will seamlessly fall back to the US servers, but we don't recommend using api-global-preview for those.
  - Additional product support will be added in the future.
- Cache misses: Some initial requests might be slower than expected due to cache misses. We recommend running each request several times if you are benchmarking. During normal operation with steady traffic, latency improves.
- Limited capacity: Capacity is limited during this preview phase. If you see an increased number of 429 errors, please retry later. We will add more capacity as we move out of preview.
- Model compatibility: Latency improves for the low-latency models (Turbo and Flash). For some of the slower models, like Multilingual, you might see worse latency. We only recommend using api-global-preview for Turbo and Flash at this time.
For requests which are slow or failing, please provide the value of the x-trace-id header if possible (we recommend logging it for all requests you make).
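A minimal sketch of such logging, assuming the requests-based examples above; the helper name is illustrative:

```python
import logging

import requests

logging.basicConfig(level=logging.INFO)

def log_trace_id(response: requests.Response) -> None:
    # Record the x-trace-id header for every request so it can be shared
    # when reporting slow or failing calls.
    logging.info("x-trace-id=%s", response.headers.get("x-trace-id"))
```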