
Building voice agents that last: some lessons learned from forward-deployed engineering

A framework for deploying and scaling enterprise voice agents that resolve customer issues rather than just deflect them, drawn from real-world deployments.


For most organizations, point solutions for the support function have long been measured on deflection: reducing call volume and minimizing live agent interactions. But deflection doesn't equal resolution, and the gap between the two is where a customer's experience breaks down. Closing that gap requires agents with access not just to data, but to the systems needed to act on it. With that access, agents can process refunds, guide customers through a checkout flow, and hand off to a human agent with full context whenever the situation calls for it. This enables enterprises to handle customer interactions at scale, meaningfully reducing the load on human support teams while improving the experience on both sides of the call. In a recent deployment with Revolut, a fintech company serving 70 million customers globally, this translated to an 8x reduction in time to resolution and a 99.7% call success rate.

Organizations must approach changes of this magnitude iteratively, tied closely to the company's core mission and driven by strong executive sponsorship. At the technical level, reasoning over an unstructured environment comes with inherent risks that must be carefully managed. Giving an agent the ability to act across the Customer Relationship Management (CRM) system, modify an order in the point-of-sale system, or escalate a case means the governance model matters just as much as the model itself. The focus then becomes not whether agents can handle real work, but what mechanisms are required to deploy them safely and repeatably.

In this post, we draw on our experience to share what makes agents successful, from the first deployment to scaling across an organization’s entire customer operation.

Shipping agents vs. shipping software

Before diving deeper into agent building, it's worth contrasting the deployment of voice agents with that of traditional software, something enterprises have been doing for decades. Through this lens, agents can be separated into two distinct components: traditional software and the core orchestrator.

Software

A set of deployment channels for voice and messaging agents, spanning telephony, contact center platforms, digital surfaces, and messaging apps through to flexible SDK and API integrations.
A full suite of observability and governance tools for managing agent quality in production, from evaluations, testing, and simulations to compliance, PII redaction, and continuous improvement.

Core orchestrator

A diagram showing how the Voice Engine handles audio orchestration (speech-to-text, turn taking, interruption detection) and passes transcripts to the Agent Orchestration layer, where an LLM reasons over a system prompt, knowledge base, and RAG to drive workflows and routing.

The traditional software components are primarily aimed at improving the delivery and performance of the agent. For ElevenAgents, this includes features such as versioning, A/B testing, telephony and first message configuration, among others. These components exhibit little to no drift after deployment, making their behavior highly predictable. Through robust engineering practices, organizations can build on these features quickly and maintain a deep understanding of their production performance through a rigorous set of metrics, traces, and logs. Latency improvements in this layer follow well-understood patterns: caching, connection pooling, infrastructure scaling, and protocol optimization are all reliable levers with deterministic outcomes.

Core Orchestrator components are harder to predict by nature, but they dictate the runtime performance of the agent both in terms of answer quality and perceived latency. Unlike traditional software, these components operate over natural language and audio, where the input space is effectively unbounded and small changes in phrasing, context, background noise, or user behavior can produce meaningfully different outputs over time. This makes conventional testing insufficient on its own: an agent may perform flawlessly across hundreds of test cases and still fail in production in ways that are difficult to anticipate. 

Latency in this layer is also less deterministic, driven by model inference times, injection of auditory artifacts, tool call chains, and the variability inherent to generative systems. Managing these components well requires a different discipline, one built around evaluation frameworks, production monitoring, and a willingness to iterate continuously based on real conversation data rather than pre-deployment assumptions alone.

This distinction shapes how organizations should approach adoption: starting with use cases that are organizationally relevant but low in risk, then scaling deliberately as confidence in the system grows.

Release Cycle

Selecting path-finders

For teams starting to adopt voice agents, selecting the right path-finders is one of the most consequential early decisions. It also has less to do with technology than most expect. Teams that rack up early successes and avoid the endless POC abyss tend to share something in common: they can answer the following questions with clarity.

  • How does this use case drive measurable business value? The right use case to start with is not the most technically interesting one, but the one most likely to move the needle on an outcome the business already cares about. That value is measured in revenue impact, cost reduction, customer satisfaction, or other metrics leaders are already tracking and accountable for. Without that direct line to business value, it becomes difficult to justify the iteration cycles required to get the agent right, and momentum is likely to stall before the technology has a chance to prove itself.
  • Is it immediately clear to users what the agent's scope and purpose are? Ambiguity in scope is one of the most common sources of drift from development to production. Users who do not understand what an agent can and cannot do will test its boundaries in ways the evaluation suite never anticipated. A well-scoped agent sets expectations from the first message and handles out-of-scope requests gracefully. 
  • What do good and bad interactions look like, and can they be codified into a concrete set of evaluation criteria? A good interaction is not simply one where the agent completes the task, but one where the user feels heard, escalation happens at the right moment, and the outcome is aligned with business intent. Evaluation criteria fall into two categories: quantitative metrics captured by the platform such as task completion rate and escalation rate, and transcript-based criteria that require analyzing the conversation itself. Defining the transcript-based criteria early gives the team a concrete target to build toward. They also set a natural go-live threshold. When your agent is passing its evaluation criteria consistently and platform metrics have stabilized, you have the confidence to move to production. Without defined criteria, going live is a judgment call.
  • What are the tradeoffs between performance and control, and which matters more at this stage? The more autonomy an agent is given, the more natural and flexible interactions become, but the greater the risk of operating outside validated boundaries. Tighter control through constrained prompting and stricter escalation logic reduces that risk but can make the agent feel rigid. Neither extreme is right. Organizations that lock down too early end up with a glorified IVR. Those that move too fast before establishing trust create support burdens that outweigh the gains. Understanding where this dial should sit at each stage of maturity will shape model configuration, escalation logic, and how much of the agent's knowledge lives in the prompt versus in retrieved or structured sources.
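The evaluation criteria described above can be made concrete as an artifact the team checks against before go-live. A minimal sketch in Python, where all field names and thresholds are illustrative (the 80% completion and 20% escalation figures echo the targets discussed later in this post), not an ElevenAgents API:

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationCriteria:
    """Go-live criteria; names and thresholds are illustrative."""
    min_task_completion: float = 0.80   # quantitative, platform-captured
    max_escalation_rate: float = 0.20   # quantitative, platform-captured
    # Transcript-based criteria require analyzing the conversation itself.
    transcript_checks: list = field(default_factory=lambda: [
        "user's stated intent is acknowledged within the first two turns",
        "escalation is offered when the user expresses repeated frustration",
        "no account action is taken without explicit confirmation",
    ])

    def go_live_ready(self, task_completion: float, escalation_rate: float,
                      transcript_pass_rate: float) -> bool:
        # Gate: platform metrics within bounds AND transcript-based
        # criteria passing consistently across runs.
        return (task_completion >= self.min_task_completion
                and escalation_rate <= self.max_escalation_rate
                and transcript_pass_rate >= 0.95)

criteria = EvaluationCriteria()
print(criteria.go_live_ready(0.85, 0.15, 0.97))  # True
print(criteria.go_live_ready(0.85, 0.35, 0.97))  # False: escalation too high
```

Encoding the threshold this way makes "going live" a check rather than a judgment call: the same gate runs after every iteration.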

With these questions answered, an organization is ready to move from strategy to execution and begin scoping the build.

Grounding the initial build 

When moving to execution, teams can draw on methodologies almost as old as software itself. Test-Driven Development (TDD) provides the scaffolding for keeping agents aligned to core metrics throughout the build.

The agent development lifecycle, where scoping feeds into a continuous cycle of defining tests, building, and deploying, with both pre-production and production failures looping back to expand the test suite over time.

Concretely, development teams and business stakeholders should jointly define and build two foundational artifacts: Success Evaluation Criteria, which establish what good looks like at both the individual call and aggregate level, and Agent Tests, which repeatedly verify specific behaviors the agent is expected to exhibit. The former is best informed by reviewing real calls handled by human agents. The latter is built incrementally, starting with an initial set of expected behaviors and expanding as new ones are introduced and edge cases are discovered.
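To make the Agent Tests artifact concrete, here is one possible shape for a turn-level behavior test: given prior conversation context and a user message, assert properties of the agent's reply. This is a hypothetical sketch, not the actual ElevenAgents test format, and the refund scenario and helper names are invented for illustration:

```python
# Illustrative turn-level test harness; the real ElevenAgents test
# format may differ. All names below are examples, not an API.
def make_test(name, history, user_message, must_contain=(), must_not_contain=()):
    return {
        "name": name,
        "history": history,              # prior turns as (role, text) pairs
        "user_message": user_message,    # the turn under test
        "must_contain": list(must_contain),
        "must_not_contain": list(must_not_contain),
    }

def run_test(test, agent_reply: str) -> bool:
    reply = agent_reply.lower()
    ok = all(s.lower() in reply for s in test["must_contain"])
    return ok and not any(s.lower() in reply for s in test["must_not_contain"])

# Expected behavior: large refunds are handed off, never self-approved.
refund_test = make_test(
    name="refund_over_limit_escalates",
    history=[("assistant", "Hi, how can I help?")],
    user_message="I want a refund of $5,000 on my last order.",
    must_contain=["connect you"],                 # expect a handoff
    must_not_contain=["refund has been processed"],
)

print(run_test(refund_test, "That amount needs review; let me connect you to a specialist."))  # True
```

Each newly discovered edge case becomes one more `make_test` entry, so the suite grows alongside the agent rather than being written once.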

With an initial set of tests in place, agent development begins with the system prompt. This is where the rules, tone, and approach of the agent are defined: what it should do, what it should not do, and how it should behave at the edges of its role. A well-crafted system prompt is as much about structure as it is about content. Separating instructions into clearly labeled sections, keeping related guidance together, and avoiding conditional phrasing all make a meaningful difference in how consistently the agent behaves. We often return to the prompting guide at this stage.
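As an illustration of that structure, a sectioned system prompt might look like the skeleton below. Section names, scope, and rules are invented examples, not a recommended template from the prompting guide:

```
# Role
You are the support agent for <company>. You help with order status,
returns, and account questions only.

# Tone
Warm, concise, and plain-spoken. Ask one question at a time.

# Rules
- Never take an account action without explicit confirmation.
- If a request is outside your scope, say so and offer a handoff.

# Escalation
Hand off to a human agent when the user asks for one, after two failed
resolution attempts, or for any payment dispute.
```

Keeping escalation rules in their own labeled section, rather than scattered through conditional phrasing, is the kind of structural choice that tends to make behavior more consistent.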

Alongside the system prompt, the core components of the agent are configured: the LLM, the text-to-speech (TTS) model, and the voice. LLM selection is primarily a latency-versus-performance trade-off where models optimized for speed typically sacrifice some reasoning capability, and vice versa. For TTS, the right choice depends on what the use case demands most, whether that's expressive delivery, low latency, or multilingual support. The voice, however, is as much a brand decision as a technical one. It shapes how an organization comes across to every caller, making it one of the few configuration decisions that belongs as much to brand and marketing teams as it does to the engineers building the agent. This means voice selection can happen in parallel to the rest of the development process, rather than becoming a bottleneck at the start or end. ElevenAgents offers access to over 10,000 voices, and if none fit, teams can clone or create their own.

From here, agents can optionally be extended with a Knowledge Base, tools, and channel configurations. Each addition unlocks new capabilities but also introduces new surface area to test. Whether that means telephony integration, access to external databases, or the ability to take action on behalf of a customer, these decisions are worth pressure-testing against the evaluation criteria before expanding scope. When tools are added, the system prompt and tool description provide explicit guidance on when and how to invoke each of them, so the agent uses them consistently and in the right context.
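When tools are added, the "when and how to invoke" guidance lives largely in the tool description itself. A sketch in the JSON-Schema style used by most function-calling LLM APIs; the tool name, fields, and wording are assumptions for illustration, not the exact ElevenAgents schema:

```python
# Hypothetical tool definition; field names follow common
# function-calling conventions, not a confirmed ElevenAgents format.
lookup_order = {
    "name": "lookup_order",
    "description": (
        "Look up an order by its ID. Use ONLY after the customer has "
        "provided or confirmed an order ID. Never guess an ID; if it "
        "is missing, ask for it instead of calling this tool."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, e.g. 'ORD-12345'",
            },
        },
        "required": ["order_id"],
    },
}
```

Note that the description encodes a negative instruction ("never guess an ID") as well as a trigger condition; in our experience both are needed for the agent to use the tool in the right context.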

With these foundations in place, the agent is ready to be put to the test.

Towards production readiness

With the tests and evaluation criteria defined during the grounding phase now running against a built agent, development becomes a tight loop: add more tests, identify failures, update the system prompt or configuration, and run again. Most failures at this stage are not model failures, but rather prompt failures. An instruction that seemed clear in isolation turns out to be ambiguous when the agent encounters it mid-conversation. Edge cases surface that the initial test suite didn't anticipate. Each one becomes a new Next Turn test that can be created from the conversation itself. The question of when to stop iterating has a concrete answer: when the agent is passing its evaluation criteria consistently across multiple runs, and platform metrics such as task completion rate and escalation rate have stabilized within acceptable ranges. This is why defining those criteria before building matters so much. Without them, readiness becomes a judgment call and the finish line keeps moving.

In practice, most teams find that a small set of recurring failure patterns account for the majority of issues. The most common are prompt ambiguity, where the agent receives conflicting or underspecified instructions and defaults to unpredictable behavior; tool misuse, where the agent invokes a tool in the wrong context or fails to invoke it when it should; and escalation drift, where the agent either escalates too aggressively or holds on to conversations it should have handed off. Each of these has a prompt-level fix. Tightening the relevant instruction, adding an explicit example, or adjusting the escalation threshold is usually sufficient. The risk is in not catching them before go-live.

The most common way teams get this wrong is by treating a passing test suite as a guarantee rather than a signal. A suite that only covers the happy path will pass easily and mean very little. Coverage across refusals, mid-conversation pivots, ambiguous inputs, and tool-heavy interactions is what gives the results weight. Similarly, teams that skip simulation testing and rely solely on turn-level tests miss a class of failures that only emerge across a full conversation, such as context drift, where the agent loses track of earlier turns, or compounding errors, where a small misstep early in the call compounds into a bad outcome. Once recurring failure patterns are resolved and the agent handles the long tail of edge cases gracefully, rather than perfectly, the marginal value of further iteration in staging diminishes. At that point, the more valuable signal comes from real conversations.

Going live does not mean iteration is over. It means the locus of learning shifts from synthetic tests to production transcripts. The evaluation criteria that defined go-live become the baseline against which live performance is measured, and the cycle continues from there.

Feedback loops, evaluation, and knowing when to stop iterating

Once tests are defined and running, gaps in the pipeline become visible quickly. Through Conversation Analysis, teams can pinpoint the exact moment an interaction went wrong and use that signal to create a new test and inform what needs to change. The most common interventions are prompt-level: tightening tool call descriptions, adding more explicit instructions for edge cases, or clarifying escalation conditions that turned out to be ambiguous in practice. In some cases, the issue sits deeper, and the underlying model configuration needs to be revisited if latency or reasoning quality is falling short of what the use case demands. 

The most important discipline at this stage is validating changes rather than assuming them. A fix that solves one failure can quietly introduce another. ElevenAgents supports versioning, allowing teams to test new iterations against a small percentage of users before rolling out to the broader population. This makes it possible to confirm that improvements are actually improving outcomes rather than shifting the failure mode elsewhere.
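The mechanics of a branched rollout reduce to a deterministic traffic split: the same user always lands on the same version, and only a small fraction sees the candidate. ElevenAgents versioning handles this natively; the hash-based sketch below is only to make the idea concrete, with the 5% split and function names chosen for illustration:

```python
import hashlib

def assign_branch(user_id: str, candidate_pct: float = 0.05) -> str:
    """Deterministically route a small share of users to the candidate
    version; the same user always gets the same branch, which keeps
    branch metrics comparable across the rollout window."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform in [0, 1]
    return "candidate" if bucket < candidate_pct else "control"

counts = {"candidate": 0, "control": 0}
for i in range(10_000):
    counts[assign_branch(f"user-{i}")] += 1
print(counts["candidate"] / 10_000)  # roughly 0.05
```

Because assignment is a pure function of the user ID, any regression seen in the candidate branch can be attributed to the change itself rather than to which users happened to be sampled.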

What can go wrong

The single most consequential mistake at this stage is skipping branched rollouts and pushing changes directly to the full user population. Without staged rollouts, you lose the ability to isolate the impact of any given change, and at scale, this makes it nearly impossible to understand what is actually driving improvements or regressions in your platform metrics. Treating the full user base as your test environment isn't just risky; it eliminates the observability you need to make confident decisions going forward. 

Beyond rollout strategy, two other failure modes are worth guarding against. The first is over-indexing on recent failures. When a high-profile conversation goes wrong, there is a natural impulse to patch it immediately and broadly, but reactive prompt changes made without running the full test suite frequently cause regressions in behaviors that were previously stable. Every change, however minor, should be treated as a new iteration and tested accordingly. The second is evaluation drift. Over time, teams can unconsciously lower the bar for what counts as a passing test, particularly under pressure to ship. The evaluation criteria defined during scoping should remain the anchor. If they start to feel too strict, the right response is to revisit and deliberately update them, not let standards erode informally.

Scaling with confidence

Increasing traffic is a confidence decision, not a time-based one. The signal to expand is when the agent is passing its evaluation criteria consistently across multiple test runs, platform metrics have stabilized, and branched rollouts have shown no meaningful regression against the control group. 

A common question at this stage is how much traffic is enough to draw a conclusion. Batches of fewer than 100 calls per branch produce too much variance to evaluate outcomes reliably. A 60% pass rate on 25 calls and a 60% pass rate on 100 calls represent very different levels of confidence. Beyond that minimum count, the batch should also be large enough to surface the full range of realistic inputs, including likely edge cases, uncommon intents, and failure modes that only appear at volume and rarely show up in small samples.
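The difference between those two 60% pass rates can be quantified with a standard binomial confidence interval. A short sketch using the Wilson score interval (one reasonable choice among several):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for an observed pass rate."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

for n in (25, 100):
    lo, hi = wilson_interval(int(0.6 * n), n)
    print(f"n={n}: 60% pass rate -> 95% CI [{lo:.2f}, {hi:.2f}]")
```

At 25 calls the interval spans roughly 36 percentage points; at 100 calls it narrows to roughly 19. The same observed pass rate is compatible with a much wider range of true performance in the small batch, which is why per-branch volume matters before drawing conclusions.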

More traffic amplifies both what is working and what is not. Expanding before core failure patterns are addressed creates a support burden that is difficult to walk back.

Rinse and repeat

Knowing where to stop is as important as knowing what to fix. Iteration has diminishing returns, and the right signal to pause is when the agent is consistently meeting the evaluation criteria set during scoping. At that point, further changes carry more risk than reward. 

What "consistently meeting criteria" looks like varies by context. Teams with limited data access or incomplete integrations may find that escalation rates cannot realistically fall below around 50% until those constraints are resolved. Where data access is strong, the best-performing deployments typically target task completion above 80% and escalation below 20%. More important than any single number is stability: consistent performance across several weeks of production traffic, with no meaningful regression across test runs, is the real signal. When the marginal gain from the next iteration is smaller than the risk of regression, it is time to stop.
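That stability signal can itself be expressed as a simple check over weekly production metrics. A sketch, using the illustrative 80%/20% targets from the text and an assumed 5-point cap on week-over-week swing:

```python
def stable(weekly_metrics, completion_floor=0.80, escalation_ceiling=0.20,
           max_swing=0.05):
    """Stop-iterating signal: every week within target bounds, and no
    week-over-week swing beyond max_swing. Thresholds are the
    illustrative figures discussed in the text."""
    completions = [m["task_completion"] for m in weekly_metrics]
    escalations = [m["escalation_rate"] for m in weekly_metrics]
    in_bounds = (all(c >= completion_floor for c in completions)
                 and all(e <= escalation_ceiling for e in escalations))
    steady = all(abs(b - a) <= max_swing
                 for series in (completions, escalations)
                 for a, b in zip(series, series[1:]))
    return in_bounds and steady

weeks = [
    {"task_completion": 0.84, "escalation_rate": 0.17},
    {"task_completion": 0.86, "escalation_rate": 0.15},
    {"task_completion": 0.85, "escalation_rate": 0.16},
]
print(stable(weeks))  # True
```

A team with weaker data access would simply run the same check with a higher escalation ceiling; the discipline of an explicit, agreed threshold matters more than the particular numbers.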

That does not mean the work is finished. When new requirements emerge, the process starts again from the top. The scoping questions from the first build remain just as relevant for the second. The difference is that teams entering a second cycle do so with a test suite, an evaluation baseline, and operational experience the first cycle had to build from scratch. That compounding advantage is what separates organizations that get lasting value from voice agents from those that remain stuck in proof of concept.

Conclusion

The teams we've seen close the gap between deflection and resolution are the ones that define what good looks like before they start building, maintain discipline through the iteration cycle, and treat each deployment as the foundation for the next. Conversational agents are not a one-time deployment: real conversations surface edge cases no test suite fully anticipates, and the work of improvement does not stop at go-live.

ElevenAgents is built around this reality. Agent Testing, Conversation Analysis, and branched rollouts are the foundation that turns a proof of concept into a system that actually resolves customer issues at scale, not just deflects them. That is the gap worth closing.
