Cascaded vs Fused Models: Comparing the architectures behind conversational agents

A breakdown of the five main voice agent architectures and the tradeoffs between reasoning, control, and naturalness.


Most people think voice agents are built using either a cascaded or fused model. In practice, however, agents sit on a spectrum between the two, with five architectures in common use depending on the application.

The agent’s architecture helps determine how natural, intelligent, and consistent its responses are, and whether it behaves predictably over time. For example, an agent built using a fusion-based architecture might sound especially lifelike in short exchanges but struggle with reasoning or staying consistent during longer conversations.

At ElevenLabs, we use a cascade-based architecture that chains together specialized components for speech recognition, reasoning, and speech generation. In contrast, OpenAI’s Realtime model adopts a fusion-based approach that consolidates those stages into a single network.

In this post, we walk through the five main conversational agent architectures we see today, outlining their core designs, tradeoffs, and how teams choose between them depending on their goals.

What teams optimize for when building agents

Teams building conversational agents typically optimize their agent’s behavior across several key dimensions:

  1. Reasoning and tool use: How effectively the agent understands context, performs complex reasoning, and calls external tools or APIs to complete tasks.
  2. Reliability: How predictably the agent behaves - including its ability to enforce guardrails, maintain consistent tone and personality, and provide transparency through transcripts, test results, and monitoring.
  3. Prosody: How naturally the agent interprets speech and responds - delivering the correct rhythm, stress, and intonation that make interactions feel human.
  4. Latency: How quickly the agent generates a response.
  5. Turn-taking: How accurately the agent detects when to respond, pause, or yield in overlapping speech.

While teams also care about factors such as concurrency, integrations, and voice quality, the dimensions above can be more directly influenced by the agent’s architecture. The most successful teams tailor their architecture to optimize these dimensions for their specific use case.

The tradeoffs between cascaded and fused architectures

Cascade-based architectures are built by chaining together specialized components: Speech to Text, a Large Language Model, and Text to Speech. Each stage can be optimized, tested, and upgraded independently. 

Cascaded Architecture

Cascaded (Overview) Diagram

This modularity allows teams to plug in the latest frontier LLMs for stronger reasoning, apply explicit guardrails at the text layer, and precisely control how the agent speaks through contextual TTS. The main tradeoff is that cascaded architectures tend to lose more prosodic cues - such as intonation, rhythm, and emotion - because speech is broken down into text before being regenerated. These cues can be partially recovered through explicit modeling, but they’re not captured as naturally as in fused approaches. Other dimensions, such as latency and turn-taking, can typically be optimized to comparable performance levels in both approaches.
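The modular chain described above can be sketched in a few lines. This is an illustrative stub, not ElevenLabs code: each stage is just a swappable function, which is what lets a team drop in a stronger LLM or a newer TTS model without touching the rest of the pipeline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadedAgent:
    # Each stage is an independent component, so any one can be
    # upgraded (e.g. a newer frontier LLM) without changing the others.
    speech_to_text: Callable[[bytes], str]
    llm: Callable[[str], str]
    text_to_speech: Callable[[str], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.speech_to_text(audio_in)  # audio -> text
        reply_text = self.llm(transcript)           # text -> text (reasoning, tools)
        return self.text_to_speech(reply_text)      # text -> audio

# Stub components stand in for real STT/LLM/TTS models.
agent = CascadedAgent(
    speech_to_text=lambda audio: "what are your opening hours?",
    llm=lambda text: "We're open 9am to 5pm, Monday to Friday.",
    text_to_speech=lambda text: text.encode("utf-8"),  # placeholder "audio"
)

reply_audio = agent.respond(b"<caller audio>")
```

Because every handoff between stages is plain text, each boundary is also a natural place to log, test, or intercept the conversation.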

Meanwhile, fused approaches combine these steps into a single multimodal model. Audio goes in and audio comes out, with speech recognition, reasoning, and generation happening inside the same network.

Fused Model

Fused (Overview) Diagram

This design allows fusion-based architectures to preserve and reproduce prosody more effectively, since the model processes pronunciation and intonation directly. However, fused models are harder to test and control, since intermediate outputs aren’t exposed. They also tend to rely on lighter-weight LLM cores, which limits reasoning and tool-calling performance compared to cascaded approaches that can pair with the strongest models available.

While cascade-based and fusion-based architectures define the two sides of the design spectrum, most agents fall somewhere between them in practice. We see five core architectures being explored today that balance reasoning, reliability, and naturalness in different ways.

The five potential architectures

1. Basic Cascaded

Basic Cascaded Diagram

In basic cascaded architectures, audio is transcribed, the LLM produces a text reply, and then TTS speaks the exact words it is given. Because every stage operates on plain text, teams get full visibility and control. Guardrails can be enforced at the text layer, tool calls and API integrations are handled by the LLM directly, and deterministic flows can route conversations and enforce business logic in a more predictable way.

However, the agent doesn't recognize nuances in speech like tone, rhythm, and emotion, which can limit how natural the conversation feels.
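The text-layer control this paragraph describes can be sketched as a guardrail check sitting between the LLM and TTS. The topic list and function names are hypothetical; the point is that the agent's reply exists as inspectable text before it is ever spoken.

```python
# Illustrative guardrail: because every stage hands off plain text,
# the LLM's draft reply can be inspected and blocked or rewritten
# before the TTS stage ever sees it.
BLOCKED_TOPICS = ("internal pricing", "refund policy override")

def guardrail(reply: str) -> str:
    lowered = reply.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "Let me connect you with a human agent for that."
    return reply

def respond(transcript: str, llm) -> str:
    draft = llm(transcript)   # text in, text out: fully auditable
    return guardrail(draft)   # enforced before speech generation

safe = respond("hi", lambda t: "Our internal pricing sheet says ...")
```

A fused model offers no equivalent interception point, which is why guardrails are harder to enforce there.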

Potential use cases include:

  • Customer support
  • Sales assistants
  • AI receptionists
  • Entertainment and gaming NPCs
  • IVR replacements
  • FAQ handling and documentation onboarding
  • Outbound notifications (reminders, alerts, appointment confirmations)

2. Advanced Cascaded

Advanced Cascaded Diagram

Advanced cascaded architectures introduce contextual TTS, where the LLM not only decides what to say but also how to say it, passing delivery instructions such as "say this reassuringly" or "respond with emphasis" to the TTS model. The agent speaks in a more realistic tone and style, while retaining the same guardrails, deterministic flows, tool use, and auditability of a basic cascaded system.

This is the approach behind Expressive Mode in ElevenAgents, which pairs a context-aware TTS model that adapts tone and emotion across turns with an advanced turn-taking system built on signals from Scribe v2 Realtime. Together, they enable more expressive and emotionally nuanced delivery without sacrificing modularity or control.
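Contextual TTS can be sketched as the LLM emitting both the words and a delivery hint, with the TTS stage conditioning on the hint. The field names below are illustrative, not the ElevenLabs API:

```python
import json

def llm_with_delivery(transcript: str) -> dict:
    # The LLM decides both *what* to say and *how* to say it.
    return {
        "text": "I understand, let's sort this out together.",
        "delivery": "say this reassuringly",
    }

def contextual_tts(text: str, delivery: str) -> bytes:
    # A real TTS model would condition its prosody on `delivery`;
    # here we just tag the output to show the data flow.
    return json.dumps({"speech_for": text, "style": delivery}).encode("utf-8")

turn = llm_with_delivery("my order never arrived")
audio = contextual_tts(turn["text"], turn["delivery"])
```

Note that the reply is still plain text on its way to TTS, so the guardrails and auditability of the basic cascade are preserved.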

Potential use cases include more expressive versions of:

  • Customer support
  • Sales assistants
  • AI receptionists
  • Entertainment and gaming NPCs

3. Hybrid Cascaded and Fused

Hybrid Cascaded Diagram

Some cascaded architectures feed acoustic features (e.g. pronunciation, emotion, tone) from the input speech directly into the LLM as embeddings. This architecture preserves more of the user's original intent while still keeping TTS modular. Tool use and guardrails are still possible, but the fused ASR+LLM block is harder to audit than a clean text handoff, and the LLM can no longer be swapped as easily as in a cascaded model.
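The hybrid handoff can be sketched as the LLM receiving the transcript plus extracted acoustic features. In a real system the features would be continuous embeddings fused inside the model; the dictionary rendering and all values below are made up to show the idea that the agent can react to *how* something was said:

```python
def extract_acoustic_features(audio: bytes) -> dict:
    # Stand-in for an acoustic encoder; a real one would emit embeddings.
    return {"pitch_mean_hz": 220.0, "energy": 0.8, "emotion": "frustrated"}

def hybrid_llm(transcript: str, features: dict) -> str:
    # The reply depends on acoustic cues as well as the words,
    # which a plain text handoff would have discarded.
    if features["emotion"] == "frustrated":
        return "I'm sorry about that - let me fix it right away."
    return "Happy to help with that."

features = extract_acoustic_features(b"<caller audio>")
reply = hybrid_llm("my card was charged twice", features)
```

The cost, as noted above, is that the fused ASR+LLM block is harder to audit and the LLM is no longer freely swappable.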


Potential use cases include:

  • Language learning & coaching focused on pronunciation
  • Tone-sensitive, low-complexity customer support

4. Sequential Fused

Sequential Fused Diagram

In sequential fused architectures, a single multimodal model handles recognition, reasoning, and speech generation. Operating one turn at a time, the model listens until the user finishes, then produces audio directly. By processing audio end to end, these architectures naturally capture cues like pronunciation, pacing, and intonation, often resulting in more fluid and expressive speech delivery.

However, the tradeoff is that guardrails are harder to enforce without a text layer, tool use is limited by lighter-weight reasoning cores, and there’s limited observability without clear intermediate outputs.
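The observability tradeoff is easiest to see in code: a sequential fused model is a single opaque mapping from an audio turn to an audio reply. This sketch (all names hypothetical) buffers audio until end of turn, then calls the model once, and there is no transcript or reply text anywhere to log, test, or guard:

```python
def fused_model(audio_turn: bytes) -> bytes:
    # Recognition, reasoning, and generation all happen inside one
    # network; no intermediate text is ever exposed.
    return b"<reply-audio>"

def run_turn(mic_chunks, end_of_turn_marker: bytes = b"") -> bytes:
    buffered = b""
    for chunk in mic_chunks:            # listen until the user finishes
        if chunk == end_of_turn_marker:  # end of turn detected
            break
        buffered += chunk
    return fused_model(buffered)        # audio in -> audio out, one shot

reply = run_turn([b"hel", b"lo", b""])
```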


Potential use cases include:

  • Personal companions
  • Entertainment chatbots

5. Duplex Fused

Duplex Fused Diagram

In duplex fused architectures, the model processes input and output simultaneously. This can produce the most human-like conversational flow, with more genuine overlapping speech during short conversations, but it also introduces significant complexity. Guardrails are harder to enforce, crosstalk and interruptions can cause errors, and observability is minimal compared to cascade-based architectures.
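The distinguishing feature of duplex operation is that listening and speaking run concurrently rather than turn by turn. A minimal sketch of that control flow, using `asyncio` stand-ins rather than any real model:

```python
import asyncio

async def listen(incoming: asyncio.Queue, heard: list):
    # Input is consumed continuously, even while audio is being emitted.
    while (chunk := await incoming.get()) is not None:  # None = hang-up
        heard.append(chunk)

async def speak(spoken: list):
    # Output is produced concurrently - there is no "wait your turn" step.
    for _ in range(3):
        spoken.append(b"<agent-audio>")
        await asyncio.sleep(0)  # yield so listening never stalls

async def duplex_session():
    incoming: asyncio.Queue = asyncio.Queue()
    for chunk in (b"user-1", b"user-2", None):
        incoming.put_nowait(chunk)
    heard, spoken = [], []
    await asyncio.gather(listen(incoming, heard), speak(spoken))
    return heard, spoken

heard, spoken = asyncio.run(duplex_session())
```

The same concurrency that enables natural overlap is what makes crosstalk and interruptions hard to handle: both streams are live at once, with no clean turn boundary to reason about.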


Potential use cases include:

  • Experimental companions, chatbots & social voice apps

Choosing the right architecture for your use case

There is no one-size-fits-all architecture for conversational agents. Each variant carries strengths and tradeoffs, from the predictability and control of cascaded models to the natural prosody of fused ones.

| Architecture | Reliability | Reasoning & Tool Use | Prosody & Naturalness | Potential Use Cases |
| --- | --- | --- | --- | --- |
| Basic Cascaded | ●●● | ●●● | ● | IVR systems, FAQs, reminders, notifications |
| Advanced Cascaded | ●●● | ●●● | ●● | Customer support, AI receptionists, sales assistants |
| Hybrid (Cascaded + Fused) | ●● | ●● | ●●● | Language learning, tone-sensitive support, coaching |
| Sequential Fused | ●● | ● | ●●● | Personal companions, entertainment chatbots |
| Duplex Fused | ●● | ● | ●●● | Real-time social apps, experimental companions |

At ElevenLabs, we favor modular architectures that leverage the strongest Speech to Text, LLM, and Text to Speech models to optimize for intelligent, customizable, and reliable agents. We then incorporate prosodic cues, latency optimizations, and a turn-taking model for natural-sounding agent responses.

As conversational AI continues to expand into customer support, education, marketing, personal assistants and more, the agents that succeed will be those whose architectures are well suited for their specific use cases.
