
Unpacking ElevenAgents’ Orchestration Engine
A look under the hood at how ElevenAgents manages context, tools, and workflows to deliver real-time, enterprise-grade conversations.
A breakdown of the five main voice agent architectures and the tradeoffs between reasoning, control, and naturalness.
Most people think voice agents are built using either a cascaded or fused model. However, in practice, agents are designed along a spectrum between the two, with five architectures typically used, depending on the application.
The agent’s architecture helps determine how natural, intelligent, and consistent its responses are, and whether it behaves predictably over time. For example, an agent built using a fusion-based architecture might sound especially lifelike in short exchanges but struggle with reasoning or staying consistent during longer conversations.
At ElevenLabs, we use a cascade-based architecture that chains together specialized components for speech recognition, reasoning, and speech generation. In contrast, OpenAI’s Realtime model adopts a fusion-based approach that consolidates those stages into a single network.
In this post, we walk through the five main conversational agent architectures we see today, outlining their core designs, tradeoffs, and how teams choose between them depending on their goals.
Teams building conversational agents typically optimize their agent’s behavior across several key dimensions:

- Naturalness: how lifelike and expressive the agent sounds, including prosodic cues like intonation, rhythm, and emotion
- Reasoning and tool use: how well the agent handles complex requests, tool calls, and API integrations
- Control: whether guardrails, deterministic flows, and business logic can be enforced predictably
- Observability: how easily each stage can be tested, audited, and upgraded
While teams also care about factors such as concurrency, integrations, and voice quality, the dimensions above can be more directly influenced by the agent’s architecture. The most successful teams tailor their architecture to optimize these dimensions for their specific use case.
Cascade-based architectures are built by chaining together specialized components: Speech to Text, a Large Language Model, and Text to Speech. Each stage can be optimized, tested, and upgraded independently.
This modularity allows teams to plug in the latest frontier LLMs for stronger reasoning, apply explicit guardrails at the text layer, and precisely control how the agent speaks through contextual TTS. The main tradeoff is that cascaded architectures tend to lose more prosodic cues - such as intonation, rhythm, and emotion - because speech is broken down into text before being regenerated. These cues can be partially recovered through explicit modeling, but they’re not captured as naturally as in fused approaches. Other dimensions, such as latency and turn-taking, can typically be optimized to comparable performance levels in both approaches.
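The cascaded flow above can be sketched as a simple pipeline. The three stages below are hypothetical stand-ins for real STT, LLM, and TTS services (not any specific API); the point is that every handoff is plain text, so guardrails sit at the text layer and each stage can be swapped or tested independently.

```python
# Minimal sketch of one cascaded voice-agent turn. Each stage is a
# stand-in: in a real system these would be calls to STT, LLM, and TTS
# services, communicating via plain text.

def speech_to_text(audio: bytes) -> str:
    # Stand-in for a streaming ASR model; here the "audio" is already words.
    return audio.decode("utf-8")

def llm_reply(transcript: str) -> str:
    # Stand-in for a frontier LLM; returns the exact words to speak.
    return f"You said: {transcript}"

def apply_guardrails(text: str) -> str:
    # Guardrails operate on the text layer, before any speech is generated.
    blocked = {"password", "ssn"}
    return "[redacted]" if any(w in text.lower() for w in blocked) else text

def text_to_speech(text: str) -> bytes:
    # Stand-in for a TTS model that speaks the text it is given, verbatim.
    return text.encode("utf-8")

def cascaded_turn(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)
    reply = apply_guardrails(llm_reply(transcript))
    return text_to_speech(reply)
```

Because the intermediate transcript and reply are exposed, each stage here can be unit-tested or upgraded without touching the others.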
Meanwhile, fused approaches combine these steps into a single multimodal model. Audio goes in and audio comes out, with speech recognition, reasoning, and generation happening inside the same network.

This design allows fusion-based architectures to preserve and reproduce prosody more effectively, since the model processes pronunciation and intonation directly. However, fused models are harder to test and control, since intermediate outputs aren’t exposed. They also tend to rely on lighter-weight LLM cores, which limits reasoning and tool-calling performance compared to cascaded approaches that can pair with the strongest models available.
While cascade-based and fusion-based architectures define the two sides of the design spectrum, most agents fall somewhere between them in practice. We see five core architectures being explored today that balance reasoning, reliability, and naturalness in different ways.

In basic cascaded architectures, audio is transcribed, the LLM produces a text reply, and then TTS speaks the exact words it is given. Because every stage operates on plain text, teams get full visibility and control. Guardrails can be enforced at the text layer, tool calls and API integrations are handled by the LLM directly, and deterministic flows can route conversations and enforce business logic in a more predictable way.
However, because the LLM only ever sees a transcript, the agent misses nuances in the input speech - tone, rhythm, and emotion - which can limit how natural the conversation feels.
Potential use cases include:

Advanced cascaded architectures introduce contextual TTS, where the LLM not only decides what to say but also how to say it, passing delivery instructions such as "say this reassuringly" or "respond with emphasis" to the TTS model. The agent speaks in a more realistic tone and style, while retaining the same guardrails, deterministic flows, tool use, and auditability of a basic cascaded system.
This is the approach behind Expressive Mode in ElevenAgents, which pairs a context-aware TTS model that adapts tone and emotion across turns with an advanced turn-taking system built on signals from Scribe v2 Realtime. Together, they enable more expressive and emotionally nuanced delivery without sacrificing modularity or control.
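One way to picture contextual TTS is an LLM that emits both the words and a delivery instruction, with the TTS stage consuming both. A minimal sketch, assuming a hypothetical structured-output format (not a specific ElevenLabs API):

```python
# Sketch of an advanced cascaded turn with contextual TTS: the LLM decides
# both what to say and how to say it. Both functions are illustrative
# stand-ins.

def llm_reply_with_delivery(transcript: str) -> dict:
    # Stand-in for an LLM prompted to return the words plus a delivery
    # instruction (hard-coded here for illustration).
    return {
        "text": "Your refund is on its way.",
        "delivery": "say this reassuringly",
    }

def contextual_tts(text: str, delivery: str) -> bytes:
    # Stand-in for a context-aware TTS model: the instruction shapes tone
    # and emotion, while the words themselves are spoken verbatim.
    return f"[{delivery}] {text}".encode("utf-8")

def expressive_turn(transcript: str) -> bytes:
    reply = llm_reply_with_delivery(transcript)
    return contextual_tts(reply["text"], reply["delivery"])
```

Note that the text layer is still fully visible, so the same guardrails and deterministic flows from the basic cascade apply unchanged.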
Potential use cases include more expressive versions of:

Some cascaded architectures feed acoustic features (e.g. pronunciation, emotion, tone) from the input speech directly into the LLM as embeddings. This preserves more of the user’s original intent while still keeping TTS modular. Tool use and guardrails are still possible, but the fused ASR+LLM block is harder to audit than a clean text handoff, and the LLM can no longer be swapped as easily as in a fully text-based cascade.
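A rough sketch of this hybrid, with a hypothetical feature extractor and multimodal LLM standing in for the fused ASR+LLM block:

```python
# Sketch of feeding acoustic features into the reasoning stage alongside
# the transcript. Both components are stand-ins; the point is that prosody
# reaches the LLM as features, while TTS remains a separate module.

from dataclasses import dataclass

@dataclass
class AudioFeatures:
    pitch_hz: float   # coarse prosodic cues extracted from input speech
    energy: float
    emotion: str

def extract_features(audio: bytes) -> AudioFeatures:
    # Stand-in acoustic front end (values hard-coded for illustration).
    return AudioFeatures(pitch_hz=180.0, energy=0.7, emotion="frustrated")

def multimodal_llm(transcript: str, features: AudioFeatures) -> str:
    # Stand-in fused ASR+LLM block: the reply can condition on how the
    # user sounded, not just on what they said.
    if features.emotion == "frustrated":
        return "I hear this has been frustrating - let me fix it now."
    return "Happy to help with that."
```

Unlike the text-only cascade, the reply here depends on signals that never appear in the transcript, which is also what makes this block harder to audit.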
Potential use cases include:

In sequential fused architectures, a single multimodal model handles recognition, reasoning, and speech generation. Operating one turn at a time, the model listens until the user finishes, then produces audio directly. By processing audio end to end, these architectures naturally capture cues like pronunciation, pacing, and intonation, often resulting in more fluid and expressive speech delivery.
However, the tradeoff is that guardrails are harder to enforce without a text layer, tool use is limited by lighter-weight reasoning cores, and there’s limited observability without clear intermediate outputs.
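A minimal sketch of the sequential fused turn loop, with a single stand-in function for the multimodal network; note there is no intermediate transcript to guard or inspect:

```python
# Sketch of a sequential fused turn: one model maps audio to audio with no
# exposed text layer. The model below is a stand-in black box - recognition,
# reasoning, and generation all happen inside one network, which is exactly
# why guardrails and observability are harder here than in a cascade.

def fused_model(audio_in: bytes) -> bytes:
    # Stand-in for a single multimodal network; returns audio directly.
    return b"<audio reply>"

def sequential_fused_turn(audio_in: bytes) -> bytes:
    # The model waits for the user's turn to end, then emits its reply.
    return fused_model(audio_in)
```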
Potential use cases include:

In duplex fused architectures, the model processes input and output simultaneously. This can produce the most human-like conversational flow, with more genuine overlapping speech during short conversations, but it also introduces significant complexity. Guardrails are harder to enforce, crosstalk and interruptions can cause errors, and observability is minimal compared to cascade-based architectures.
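The difference from the sequential case can be sketched as frame-level interleaving: instead of consuming a whole turn and then replying, a duplex model streams output frames while input frames are still arriving. The generator below is a hypothetical stand-in, not a real duplex model:

```python
# Sketch of duplex fused processing: input and output audio frames are
# interleaved, so the model may start speaking before the user finishes.
# This is what enables overlapping speech - and what makes crosstalk and
# interruption handling hard.

from typing import Iterator

def duplex_fused(frames_in: Iterator[bytes]) -> Iterator[bytes]:
    # Stand-in duplex model: emit a reply frame for each incoming frame,
    # rather than waiting for the full turn.
    for frame in frames_in:
        yield b"<reply-frame for " + frame + b">"
```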
Potential use cases include:
There is no one-size-fits-all architecture for conversational agents. Each variant carries strengths and tradeoffs, from the predictability and control of cascaded models to the natural prosody of fused ones.
At ElevenLabs, we favor modular architectures that leverage the strongest Speech to Text, LLM, and Text to Speech models to optimize for intelligent, customizable, and reliable agents. We then incorporate prosodic cues, latency optimizations, and a turn-taking model for natural-sounding agent responses.
As conversational AI continues to expand into customer support, education, marketing, personal assistants, and more, the agents that succeed will be those whose architectures are well suited to their specific use cases.
