Voice agent latency optimization: Step-by-step guide
The responsiveness of a voice agent is determined by the total delay between when a user finishes speaking and when the agent begins to reply. That delay is rarely caused by a single slow component. It accumulates across several independent stages, each contributing a few tens or hundreds of milliseconds, and reducing it requires knowing how much each stage spends.
Voice agent latency optimization is the work of finding where that time hides and recovering it stage by stage.
This article acts as a companion to the conceptual latency overview. Where that page explains what latency is, this one covers architecture and measurement, so you’ll leave with a latency budget you can measure against and a set of concrete actions to take.
TL;DR
- Time-to-first-audio represents the whole pipeline, not a single model’s inference time.
- The LLM’s time-to-first-token and endpointing are the two largest line items.
- Overlapping stages, rather than running them in series, recovers most of the budget.
- Streaming, codec choice, and player buffer tuning each shave measurable milliseconds.
- You should measure per region against your own deployment, reporting P50 and P95.
Defining the voice agent latency budget
A latency budget is a total time-to-first-audio target divided across the pipeline stages, with each stage given an allowance, and the allowances have to sum to less than your target. Defining it is the first step, and it is also where latency work most often goes wrong, because engineers conflate two numbers that look similar but mean different things.
The first is model inference latency: the time a model spends generating output. For our Flash models this is approximately 75ms for typical short inputs, excluding network and application overhead. It is an internal figure, and it is useful for comparing one model against another. It is not the number your user experiences.
What the user experiences is time-to-first-audio (TTFA): the elapsed time from when the user stops speaking to when they hear the first sample of the agent's reply. TTFA is always larger than any single model's inference latency, because it sums the whole pipeline.
A cascaded voice agent is a chain of five stages:
- capture (mic) -> STT -> LLM -> TTS -> playback
Audio is captured from the microphone, transcribed to text, sent to a language model, the model's text is synthesized back to speech, and that speech is buffered and played. Each stage adds latency, and in several stages the largest cost is not the one you would expect.
Here is a worked example for an English-language agent with servers reasonably close to the user. The numbers are illustrative ranges, not guarantees.
Typically, the two largest latency line items are the LLM's time-to-first-token and the endpointing delay at the start of the chain.
The table is a useful way to visualize the pipeline, but it implies the stages run strictly in series, which they don’t. Several of the most significant voice agent latency optimizations come from overlapping them, and the sections below show where that overlap recovers most of the budget.
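The budget arithmetic can be made explicit in code. The sketch below sums per-stage allowances against a TTFA target and ranks the stages by size. The stage names follow the pipeline above, but every millisecond value is a placeholder chosen for illustration, not a measurement and not a figure from this article; `checkBudget` and its types are hypothetical names. A serial sum like this is the worst case, since overlapping stages brings the real total down.

```typescript
// Per-stage allowance in milliseconds. All values below are placeholders.
type StageAllowance = { stage: string; ms: number };

interface BudgetReport {
  totalMs: number;
  headroomMs: number; // target minus total; negative means over budget
  withinTarget: boolean;
  largestFirst: StageAllowance[];
}

// Sum the allowances as if the stages ran strictly in series (worst case).
function checkBudget(allowances: StageAllowance[], targetMs: number): BudgetReport {
  const totalMs = allowances.reduce((sum, a) => sum + a.ms, 0);
  const largestFirst = [...allowances].sort((a, b) => b.ms - a.ms);
  return {
    totalMs,
    headroomMs: targetMs - totalMs,
    withinTarget: totalMs <= targetMs,
    largestFirst,
  };
}

// Hypothetical allowances; replace them with your own measurements.
const exampleBudget: StageAllowance[] = [
  { stage: "endpointing", ms: 300 },
  { stage: "stt_final", ms: 100 },
  { stage: "llm_first_token", ms: 350 },
  { stage: "tts_first_audio", ms: 150 },
  { stage: "player_buffer", ms: 100 },
];

const report = checkBudget(exampleBudget, 1200);
console.log(report.totalMs, report.withinTarget, report.largestFirst[0].stage);
```

Sorting by size makes the next optimization target obvious: the stage at the top of `largestFirst` is the one to attack first.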
Speech to Text: transcription and endpointing latency optimization
Transcription is the second stage in the pipeline, and its real cost is not the transcription itself but deciding when the user has stopped talking. This section covers both aspects to help you optimize voice agent latency.
The user's speech is transcribed before it reaches the LLM. Scribe v2 Realtime (scribe_v2_realtime) returns partial transcriptions in approximately 150ms and processes audio in streamed chunks, so the transcript takes shape while the user is still speaking. It supports PCM at 8kHz to 48kHz and mu-law encoding, which matters for the codec section below. In latency terms, the 150ms partials are inexpensive.
The larger latency cost is endpointing: the moment your system decides the user has actually finished their turn.
Voice Activity Detection (VAD) segments speech on silence, and that is where the time accumulates. If you wait for, say, 700ms of silence before declaring the turn over, you have added 700ms to every turn, on top of the transcription itself. That delay is invisible in a transcription-accuracy benchmark but very present in a real conversation. It is frequently the largest controllable latency in the whole pipeline, and because it is controllable, it is a good place to start.
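To make that cost concrete, here is a minimal sketch of silence-based endpointing. It assumes an upstream VAD that labels fixed-length frames as speech or silence. The `SilenceEndpointer` class, the frame format, and the 20ms frame length are illustrative assumptions, not part of any SDK.

```typescript
// One VAD decision per fixed-length audio frame (hypothetical format).
interface VadFrame {
  isSpeech: boolean;
}

class SilenceEndpointer {
  private silentMs = 0;
  private heardSpeech = false;
  private readonly silenceThresholdMs: number;
  private readonly frameMs: number;

  constructor(silenceThresholdMs: number, frameMs = 20) {
    this.silenceThresholdMs = silenceThresholdMs;
    this.frameMs = frameMs;
  }

  // Returns true once accumulated trailing silence reaches the threshold.
  push(frame: VadFrame): boolean {
    if (frame.isSpeech) {
      this.heardSpeech = true;
      this.silentMs = 0; // any speech resets the silence timer
      return false;
    }
    if (!this.heardSpeech) return false; // leading silence is not an end of turn
    this.silentMs += this.frameMs;
    return this.silentMs >= this.silenceThresholdMs;
  }
}

interface EndpointDecision {
  frameIndex: number; // frame on which the turn was declared over
  delayMs: number; // time between the last speech frame and that decision
}

// Replays frames through the endpointer and reports when the turn ended.
function findEndpoint(
  frames: VadFrame[],
  thresholdMs: number,
  frameMs = 20
): EndpointDecision | null {
  const endpointer = new SilenceEndpointer(thresholdMs, frameMs);
  let lastSpeechIndex = -1;
  for (let i = 0; i < frames.length; i++) {
    if (frames[i].isSpeech) lastSpeechIndex = i;
    if (endpointer.push(frames[i])) {
      return { frameIndex: i, delayMs: (i - lastSpeechIndex) * frameMs };
    }
  }
  return null; // the turn never ended within the frames given
}
```

The returned `delayMs` always equals the threshold rounded up to a whole frame, which is the point: whatever silence you wait for is added to every turn.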
Endpointing is a tradeoff between responsiveness and interruption. A short silence threshold makes the agent reply quickly but risks cutting the user off mid-sentence on a natural pause. A long threshold is safe but sluggish. In practice, the three changes that optimize latency in speech to text are:
- Fine-tune the silence threshold: Tighten the silence threshold to the smallest value that does not truncate your users' natural pauses, then measure interruption rate in production rather than guessing.
- Embed a physical control event: Use manual commit control when your application knows the turn is over from another signal (a push-to-talk release, a UI event), instead of waiting for the VAD timer.
- Overlap with LLM processes: Run partials downstream early. Feed stable partials into the LLM and revise if the final transcript differs, a form of speculative execution that hides the endpointing delay behind LLM prompt processing.
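The third change, running partials downstream early, can be sketched as follows. This is one possible shape for it, not a prescribed implementation: it treats a partial as "stable" once the same text has arrived a set number of times in a row, which is an assumption of this example. `SpeculativeTurn`, `StartLlm`, and `LlmHandle` are hypothetical names, and the LLM call is injected so the logic stays independent of any particular client.

```typescript
// Handle to an in-flight LLM request (hypothetical interface).
interface LlmHandle {
  cancel(): void;
}
type StartLlm = (prompt: string) => LlmHandle;

class SpeculativeTurn {
  private lastPartial = "";
  private repeats = 0;
  private speculated: { text: string; handle: LlmHandle } | null = null;
  private readonly startLlm: StartLlm;
  private readonly stableAfter: number;

  constructor(startLlm: StartLlm, stableAfter = 2) {
    this.startLlm = startLlm;
    this.stableAfter = stableAfter;
  }

  // Called for every partial transcript; starts the LLM once a partial is stable.
  onPartial(text: string): void {
    const normalized = text.trim();
    this.repeats = normalized === this.lastPartial ? this.repeats + 1 : 1;
    this.lastPartial = normalized;
    const stable = normalized.length > 0 && this.repeats >= this.stableAfter;
    if (stable && this.speculated?.text !== normalized) {
      this.speculated?.handle.cancel(); // an older guess is now stale
      this.speculated = { text: normalized, handle: this.startLlm(normalized) };
    }
  }

  // Called with the final transcript; returns the request whose output to speak.
  onFinal(text: string): LlmHandle {
    const finalText = text.trim();
    if (this.speculated && this.speculated.text === finalText) {
      return this.speculated.handle; // hit: the LLM already has a head start
    }
    this.speculated?.handle.cancel(); // miss: discard the guess and restart
    const handle = this.startLlm(finalText);
    this.speculated = { text: finalText, handle };
    return handle;
  }
}
```

On a hit, the LLM has been working for the whole endpointing window. On a miss, the cost is one cancelled request, so the approach pays off when final transcripts usually match the last stable partial.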
Scribe v2 Realtime is described in more detail on the speech to text capabilities page and the realtime speech to text product page.
The LLM latency contribution
The language model is usually the largest single contributor to TTFA, so it is also where overlap pays off most in voice agent latency optimization. The key insight here is that the agent does not need the whole answer before it starts speaking.
The pattern that recovers the most latency budget is to stream tokens out of the LLM and feed them into TTS as they arrive, chunked at sentence or clause boundaries. The logic is to buffer tokens until a sentence boundary, then synthesize that sentence while the next one is still being generated.
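A minimal sketch of that buffering logic, assuming tokens arrive as plain strings. `SentenceChunker` is a hypothetical helper, and its boundary rule (terminal punctuation followed by whitespace) is a deliberate simplification: it leaves decimals such as "3.5" intact but will split after abbreviations such as "Dr.", and it does not attempt clause-level chunking.

```typescript
class SentenceChunker {
  private buffer = "";

  // Add one streamed token; return any sentences that are now complete.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Terminal punctuation, an optional closing quote or bracket, then whitespace.
    const boundary = /[.!?]["')\]]?\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(0, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      this.buffer = this.buffer.slice(end);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call when the LLM stream ends to release whatever is still buffered.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}

// Usage: each returned sentence would be handed to TTS immediately.
const chunker = new SentenceChunker();
const ready: string[] = [];
for (const token of ["Sure", ".", " I", " can", " help", "!", " What", " time", "?"]) {
  for (const sentence of chunker.push(token)) ready.push(sentence);
}
const tail = chunker.flush();
console.log(ready, tail);
```

Because a boundary is only confirmed when the following whitespace arrives, each sentence is released one token late. That is a small price for not cutting a number or an ellipsis in half.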
For long-running conversations, prefer the TTS WebSocket so that an open connection can receive text incrementally without re-paying connection setup on every sentence. Only the time the model is actively generating audio counts toward your concurrency limit, so an idle open WebSocket is nearly free.
Text to Speech: streaming and voice choice
Text to speech is the stage where you can pin down latency most precisely. It has two main levers: how you stream the audio out and which voice you choose.
Flash v2.5 (eleven_flash_v2_5) is the model to use in an agent. It delivers approximately 75ms of model inference for short inputs, supports 32 languages, and accepts up to 40,000 characters per request.
The 75ms figure is inference only. The TTS TTFA line in the budget above is larger because it adds the network round-trip and server scheduling on top of inference.
The largest lever here is streaming. If you request the full audio and wait for it, the user waits for the entire clip to synthesize before hearing anything. If you stream, the user hears the first chunk as soon as it is generated, and the rest arrives while they are already listening. Streaming does not make the model faster; it simply starts outputting to the user while it is still generating.
The streaming how-to guide covers HTTP streaming, and the realtime WebSocket guide covers the WebSocket path you will want when feeding tokens from an LLM.
Initialize the client once and reuse it for every call. Then request a stream and forward each chunk to the player as it arrives, rather than waiting for the full clip.
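The forwarding step can be sketched without tying it to a specific client. The example below assumes the audio arrives as an `AsyncIterable<Uint8Array>`, which is an assumption about how you wrap the streaming response, and `forwardAudio` and `AudioSink` are hypothetical names. In a real agent the sink would write to a player, a WebSocket, or a telephony media stream.

```typescript
// Receives each audio chunk as soon as it is available (hypothetical type).
type AudioSink = (chunk: Uint8Array) => void;

interface ForwardResult {
  firstChunkMs: number | null; // time from the start of forwarding to the first chunk
  totalBytes: number;
  chunks: number;
}

async function forwardAudio(
  stream: AsyncIterable<Uint8Array>,
  sink: AudioSink
): Promise<ForwardResult> {
  const startedAt = performance.now();
  let firstChunkMs: number | null = null;
  let totalBytes = 0;
  let chunks = 0;
  for await (const chunk of stream) {
    if (firstChunkMs === null) firstChunkMs = performance.now() - startedAt;
    sink(chunk); // hand off immediately; never accumulate the whole clip
    totalBytes += chunk.byteLength;
    chunks += 1;
  }
  return { firstChunkMs, totalBytes, chunks };
}
```

The important property is that `sink` runs inside the loop. Collecting chunks into an array and playing them afterwards would bring back the full-clip wait that streaming is meant to remove.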
The other lever is the choice of voice, which also has a latency cost. Default voices, synthetic voices, and Instant Voice Clones (IVCs) synthesize faster than Professional Voice Clones (PVCs), because PVCs carry additional model complexity that adds per-generation overhead. For an agent with strict latency requirements, the combination of Flash plus an IVC or a default voice is the lowest-latency option.
Streaming chunk size choices
With tokens flowing into TTS and audio flowing back, the next decision is how large to make the pieces and how much the player buffers before it starts.
Smaller chunks reach the player sooner, lowering first-byte latency, at the cost of more messages and slightly more per-chunk overhead. Larger chunks are more efficient to transport but make the user wait longer for the first one. For interactive agents, bias toward smaller chunks early in the utterance, because the first chunk is the one the user is waiting on; later chunks arrive while audio is already playing, and their size matters less.
The player accounts for a significant amount of the remaining latency. Most audio players do not begin playback at the first byte. They buffer a small amount to avoid stuttering if the stream briefly slows. A 500ms default buffer is common, and it is added directly to perceived latency. Reducing it trades a small increase in stutter risk for lower TTFA, and the right value depends on the network jitter between your server and the client:
- On a stable connection (server-side playback, a co-located client), a buffer of 50 to 150ms is usually safe and shaves a noticeable amount off TTFA.
- On a jittery mobile or cross-region connection, a larger buffer prevents audible gaps that are worse than the latency they cost.
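The tradeoff can be explored with a toy model before touching a real player. The sketch below assumes chunks are listed in arrival order with a known arrival time and audio duration, that playback starts once the buffered audio reaches the threshold, and that the player stalls whenever the next chunk has not arrived. `simulatePlayback` and the arrival times in the usage example are invented for illustration.

```typescript
interface ArrivedChunk {
  arrivalMs: number; // when the chunk reached the player, measured from the request
  durationMs: number; // how much audio the chunk contains
}

interface PlaybackResult {
  startMs: number; // when playback begins, measured from the request
  underruns: number; // times the player ran out of audio and stalled
}

function simulatePlayback(chunks: ArrivedChunk[], bufferMs: number): PlaybackResult {
  if (chunks.length === 0) return { startMs: 0, underruns: 0 };

  // Playback starts on the chunk that brings buffered audio up to bufferMs,
  // or on the last chunk if the stream is shorter than the buffer.
  let buffered = 0;
  let startIndex = chunks.length - 1;
  for (let i = 0; i < chunks.length; i++) {
    buffered += chunks[i].durationMs;
    if (buffered >= bufferMs) {
      startIndex = i;
      break;
    }
  }
  const startMs = chunks[startIndex].arrivalMs;

  // playhead is the wall-clock time at which the next chunk is needed.
  let playhead = startMs;
  let underruns = 0;
  for (const chunk of chunks) {
    if (chunk.arrivalMs > playhead) {
      underruns += 1; // audible gap: the player waits for the chunk
      playhead = chunk.arrivalMs;
    }
    playhead += chunk.durationMs;
  }
  return { startMs, underruns };
}

// Six 100ms chunks with a hypothetical jitter spike before the fourth.
const arrivals = [100, 150, 200, 450, 500, 550].map((arrivalMs) => ({
  arrivalMs,
  durationMs: 100,
}));
console.log(simulatePlayback(arrivals, 100)); // starts earlier, stalls once
console.log(simulatePlayback(arrivals, 300)); // starts later, plays cleanly
```

Feeding the model with arrival times captured from your own clients gives a rough way to pick the smallest buffer that survives the jitter you actually see.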
The right configuration depends on your use case: favor a small buffer when responsiveness matters most, and a larger one when uninterrupted audio matters more.
Codec choices
Where the audio is going should dictate the codec you request. We return formats such as mp3_44100_128, mp3_22050_32, pcm_16000, pcm_24000, and ulaw_8000. Matching the transport’s native format removes a transcoding step, helping with voice agent latency optimization.
For telephony, such as Twilio and similar providers, use ulaw_8000. The telephony network is 8kHz mu-law end-to-end, so requesting it directly avoids a transcoding step in your pipeline and matches what the carrier expects. There is no benefit to synthesizing higher-fidelity audio that the phone network will immediately downsample; you would only add latency and gain nothing audible.
For WebRTC and browser playback, use PCM (pcm_24000 or pcm_16000) or an MP3 format. PCM is uncompressed, so there is no decode step on the client, which removes a small amount of per-chunk latency and is convenient when you are feeding a Web Audio pipeline directly. MP3 is more compact on the wire, which helps on constrained connections, at the cost of a lightweight client-side decode.
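The codec rule above fits in a small lookup, and the same table helps when sizing chunks. The sketch below assumes mono audio, one byte per sample for mu-law, and 16-bit samples for the PCM formats; confirm the PCM sample width against the API reference before relying on the byte counts. `pickOutputFormat` and `bytesForDuration` are hypothetical helpers, and choosing pcm_24000 for the browser is one of the options named above, not the only valid one.

```typescript
type Transport = "telephony" | "browser";
type OutputFormat = "ulaw_8000" | "pcm_16000" | "pcm_24000";

// Match the transport's native format so nothing has to be transcoded.
function pickOutputFormat(transport: Transport): OutputFormat {
  return transport === "telephony" ? "ulaw_8000" : "pcm_24000";
}

// Bytes per second of mono audio. Assumes 8-bit mu-law and 16-bit PCM samples.
const BYTES_PER_SECOND: Record<OutputFormat, number> = {
  ulaw_8000: 8000 * 1,
  pcm_16000: 16000 * 2,
  pcm_24000: 24000 * 2,
};

// How many bytes hold `ms` milliseconds of audio in the given format.
function bytesForDuration(format: OutputFormat, ms: number): number {
  return Math.round((BYTES_PER_SECOND[format] * ms) / 1000);
}

console.log(pickOutputFormat("telephony"), bytesForDuration("ulaw_8000", 20));
console.log(pickOutputFormat("browser"), bytesForDuration("pcm_24000", 20));
```

Under these assumptions a 20ms frame is 160 bytes of mu-law but 960 bytes of 24kHz PCM, which is the wire-size difference to weigh against the cost of decoding a compressed format on the client.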
Geography and network distances
Every optimization above assumes the bytes have a short distance to travel. Geography sets the floor on your latency budget, so it is worth examining before you tune anything else.
We serve requests from clusters in North America, Europe, and Southeast Asia and route each request to the nearest cluster automatically. The network round-trip over the public internet is typically 20 to 200ms depending on geographic proximity, and it is irreducible without changing where your infrastructure runs.
An agent that feels instant in San Francisco, a short hop from a North American cluster, can feel sluggish to a user in South Asia whose traffic crosses an ocean twice per turn.
The fix is to co-locate your application servers with your users, not only with us. If your users are in Europe, run your agent backend in Europe so that the user-to-your-server leg is short; our routing then handles the your-server-to-model leg from a nearby cluster.
Measuring voice agent latency yourself
The numbers in the latency budget table above are illustrative ranges to plan against. The numbers you ship against should come from a script like this one, run against your own deployment.
The instrumentation below measures TTFA for the TTS stage in isolation, the time from request to first audio chunk, across many trials, and reports the percentiles. Run it from the same region your servers run in, not from your development machine. It assumes the elevenlabs client from earlier:
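A minimal sketch of such a harness follows. To stay self-contained it takes the TTS request as an injected function that resolves when the first audio chunk arrives, so wrapping the client's streaming call in that function is left to you. `measureTtfa`, `percentile`, the default of 50 trials, and the 500ms gap are illustrative assumptions.

```typescript
// Resolves when the first audio chunk of one TTS request has arrived.
type FirstChunkFn = () => Promise<void>;

// Nearest-rank percentile over an ascending array of samples.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p * sortedMs.length) / 100);
  const index = Math.min(sortedMs.length - 1, Math.max(0, rank - 1));
  return sortedMs[index];
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

interface TtfaSummary {
  trials: number;
  p50: number;
  p95: number;
}

async function measureTtfa(
  requestFirstChunk: FirstChunkFn,
  trials = 50,
  gapMs = 500
): Promise<TtfaSummary> {
  const samples: number[] = [];
  for (let i = 0; i < trials; i++) {
    const t0 = performance.now();
    await requestFirstChunk(); // one request at a time, never in parallel
    samples.push(performance.now() - t0);
    await sleep(gapMs); // stagger requests so you do not measure your own queue
  }
  samples.sort((a, b) => a - b);
  return { trials, p50: percentile(samples, 50), p95: percentile(samples, 95) };
}
```

A typical call would be `measureTtfa(() => firstChunkOfOneRequest())`, where `firstChunkOfOneRequest` is your own wrapper around a streaming request. Run it once per region and keep each summary separate.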
A few things to remember:
- Report P50 and P95: Focus on these, rather than mean. The mean hides the tail, and the tail is what makes an agent feel unreliable. P95 is the experience of one turn in twenty.
- Location-based experimentation: Run the same script from each region you serve and keep the results separate.
- Stagger for accuracy: Space your requests (the setTimeout above). If you fire them all at once, you measure your own queuing instead of the service. When the concurrency limit is exceeded, requests queue by priority, which typically adds about 50ms, and beyond capacity you receive HTTP 429.
- Measure the entire latency chain: Extend the same timing pattern to the other stages. Wrap your STT finalization, your LLM first-token, and your player startup in the same performance.now() brackets, and you can populate the full budget table with your own numbers and see which stage to attack first.
By following these tips, you’ll be able to measure voice agent latency yourself. From there, you’ll have a clear path of priorities to tackle first.
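The bracket pattern for the other stages can be sketched as a small timer. `StageTimer` is a hypothetical helper and the stage names are placeholders; the clock is injectable so the logic can be checked without real delays, and it ranks stages by median rather than mean for the reason given above.

```typescript
class StageTimer {
  private readonly samples = new Map<string, number[]>();
  private readonly now: () => number;

  constructor(now: () => number = () => performance.now()) {
    this.now = now;
  }

  // Opens a bracket; call the returned function to close it and record the time.
  start(stage: string): () => number {
    const t0 = this.now();
    return () => {
      const elapsed = this.now() - t0;
      const list = this.samples.get(stage) ?? [];
      list.push(elapsed);
      this.samples.set(stage, list);
      return elapsed;
    };
  }

  // Nearest-rank median of the recorded samples for one stage.
  median(stage: string): number | null {
    const list = this.samples.get(stage);
    if (!list || list.length === 0) return null;
    const sorted = [...list].sort((a, b) => a - b);
    return sorted[Math.ceil(sorted.length / 2) - 1];
  }

  // The stage with the largest median, which is the one to attack first.
  slowestStage(): string | null {
    let worst: string | null = null;
    let worstMs = -1;
    this.samples.forEach((_list, stage) => {
      const m = this.median(stage);
      if (m !== null && m > worstMs) {
        worst = stage;
        worstMs = m;
      }
    });
    return worst;
  }
}
```

In use, call `const stop = timer.start("llm_first_token")` just before sending the prompt and `stop()` when the first token arrives, then repeat the pattern around STT finalization and player startup.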
What reduces voice agent latency the most?
If you want a short list of action items, these are the highest-leverage changes, roughly in order of impact:
- Start LLM work on stable STT partials to hide the endpointing delay.
- Stream LLM tokens into TTS at sentence boundaries so the synthesis of sentence one overlaps the generation of sentence two.
- Stream TTS audio to the player and trim the player buffer to the smallest value your network jitter tolerates.
- Use Flash plus a default voice or IVC for the lowest-latency TTS, and match the codec to the transport (ulaw_8000 for telephony, PCM or MP3 for browser/WebRTC).
- Co-locate your servers with your users and measure per region, because the network legs are real and unequal.
For more specific techniques, see the latency optimization how-to developer guide. For a full runnable starting point, the API quickstart and the streaming how-to have complete examples.
Want faster access to fine-tuned agent cascades? ElevenAgents implements this pipeline with overlap optimizations already in place.
Build low-latency voice agents with ElevenAgents
Voice agent latency optimization requires measuring each stage and then overlapping stages so the slowest ones run behind work that’s already happening. You can build and tune that cascade by hand over several iterations using the patterns above, or start from a pipeline that already has latency optimizations in place.
ElevenAgents implements this full cascade, from streaming STT to token-by-token LLM handoff to Flash TTS, with overlap techniques already built in. Rather than starting from scratch, you’ll tune thresholds for the performance that matters most to you.
Get started by using ElevenAgents to create an agent today or contact sales for more information.


