Testing Conversational AI Agents

Discover how to effectively test and improve conversational AI agents using robust evaluation criteria and conversation simulations.

Abstract

When conversational agents go live, how do you monitor them at scale? How do you catch when they’re not behaving as intended? And once you’ve made changes, how do you test them?

These questions shaped our work on Alexis — our documentation assistant powered by Conversational AI. As Alexis evolved, we built a system for monitoring, evaluating, and testing agents, grounded in evaluation criteria and conversation simulations.

Laying the Foundation: Reliable Evaluation Criteria

Improving any agent starts with understanding how it behaves in the wild. That meant refining our evaluation criteria — ensuring they were accurate and reliable enough to monitor agent performance. We define a failed conversation as one where the agent either gives incorrect information or doesn’t help the user achieve their goal.

[Figure: flow chart]

We developed the following evaluation criteria:

  • Interaction: Is this a valid conversation? Did the user ask relevant questions? Did the conversation make sense?
  • Positive interaction: Did the user walk away satisfied, or were they confused or frustrated?
  • Understanding the root cause: Did the agent correctly identify the user’s underlying issue?
  • Solving the user’s enquiry: Did the agent solve the user’s problem or provide an alternative support method?
  • Hallucination: Did the agent hallucinate information that isn’t in the knowledge base?

If the Interaction criterion fails, the conversation itself isn’t valid. If any other criterion fails, we investigate further. The investigation guides how we improve the agent. Sometimes that means refining tool usage or timing; other times it means adding guardrails to prevent unsupported actions.
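To make the criteria concrete, here is a minimal sketch of how they could be expressed for an LLM-as-judge pass over a transcript. The `EvaluationCriterion` class, its field names, and the `build_judge_prompt` helper are illustrative assumptions, not part of the ElevenLabs platform.

```python
from dataclasses import dataclass

# Hypothetical sketch: each criterion is a yes/no question an LLM judge
# answers against the full conversation transcript.
@dataclass
class EvaluationCriterion:
    name: str
    prompt: str             # question posed to the judge model
    blocking: bool = False  # if True, a failure invalidates the conversation

CRITERIA = [
    EvaluationCriterion(
        "interaction",
        "Is this a valid conversation in which the user asked relevant, coherent questions?",
        blocking=True,
    ),
    EvaluationCriterion(
        "positive_interaction",
        "Did the user walk away satisfied rather than confused or frustrated?",
    ),
    EvaluationCriterion(
        "root_cause",
        "Did the agent correctly identify the user's underlying issue?",
    ),
    EvaluationCriterion(
        "solved_enquiry",
        "Did the agent solve the user's problem or offer an alternative support method?",
    ),
    EvaluationCriterion(
        "hallucination",
        "Did the agent state information that is not supported by the knowledge base?",
    ),
]

def build_judge_prompt(criterion: EvaluationCriterion, transcript: str) -> str:
    """Format one criterion as a prompt for an LLM judge."""
    return (
        f"Conversation transcript:\n{transcript}\n\n"
        f"Question: {criterion.prompt}\n"
        "Answer strictly with PASS or FAIL, followed by a one-sentence rationale."
    )
```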

Iterating with Confidence: Conversation Simulation API

Once we’ve identified what to improve, the next step is testing. That’s where our Conversation Simulation API comes in. It simulates realistic user scenarios, both end-to-end and in targeted segments, and automatically evaluates the results using the same criteria we apply in production. It supports tool mocking and custom evaluation, making it flexible enough to test specific behaviors.
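As a rough sketch of what a simulation request might look like over HTTP: the endpoint path, payload fields, and response shape below are assumptions for illustration, so check the current Conversational AI API reference for the exact contract.

```python
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]
AGENT_ID = os.environ["AGENT_ID"]

# Assumed endpoint and payload shape, for illustration only.
url = f"https://api.elevenlabs.io/v1/convai/agents/{AGENT_ID}/simulate-conversation"

payload = {
    # The simulated user: persona and goal drive the generated turns.
    "simulation_specification": {
        "simulated_user_config": {
            "prompt": {
                "prompt": (
                    "You are a developer trying to stream text-to-speech "
                    "over websockets and you ask for code examples."
                )
            }
        }
    },
    # Evaluate the simulated conversation with the same criteria used in production.
    "extra_evaluation_criteria": [
        {
            "id": "solved_enquiry",
            "name": "Solved the user's enquiry",
            "conversation_goal_prompt": (
                "Did the agent solve the user's problem or offer an "
                "alternative support method?"
            ),
        }
    ],
}

response = requests.post(url, json=payload, headers={"xi-api-key": API_KEY})
response.raise_for_status()
result = response.json()

# Inspect the evaluation outcome attached to the simulated conversation.
for name, outcome in result.get("analysis", {}).get("evaluation_criteria_results", {}).items():
    print(name, outcome.get("result"))
```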

We use two approaches:

  • Full simulations — Test entire conversations from start to finish.
  • Partial simulations — Start mid-conversation to validate decision points or sub-flows (see the sketch after this list). This is our go-to method for unit testing, enabling rapid iteration and targeted debugging.
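The sketch below illustrates the partial-simulation idea: prior turns seed the conversation so the simulated user picks up right at the decision point under test. The `partial_conversation_history` field is a hypothetical name used for illustration.

```python
# Sketch of a partial simulation: the conversation is seeded with earlier turns
# so the simulated user starts at the decision point we want to exercise.
# Field names here are illustrative assumptions, not a documented schema.
seeded_payload = {
    "simulation_specification": {
        "simulated_user_config": {
            "prompt": {
                "prompt": "Follow up by asking whether the agent can escalate to a human."
            }
        },
        # Prior turns establish the state the agent should already be in.
        "partial_conversation_history": [
            {"role": "user", "message": "My API key stopped working this morning."},
            {"role": "agent", "message": "I can help with that. Have you rotated the key recently?"},
            {"role": "user", "message": "No, and the dashboard shows no issues."},
        ],
    }
}
```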

Clear, focused scenarios let us control exactly what the LLM is being tested on, ensuring coverage of edge cases, tool usage, and fallback logic.
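To show what a focused, tool-mocking scenario could look like, the sketch below stubs a documentation-search tool so the test deterministically exercises fallback logic. The `tool_mock_config` field and the `search_docs` tool name are hypothetical.

```python
# Illustrative sketch of mocking a tool inside a simulation, so the scenario
# exercises fallback behavior without depending on the real tool.
# The field names and tool name below are assumptions, not a documented schema.
mocked_tool_payload = {
    "simulation_specification": {
        "simulated_user_config": {
            "prompt": {
                "prompt": "Ask how to fine-tune a voice, then push back if the answer is vague."
            }
        },
        # Force the documentation-search tool to return nothing, so the agent
        # must fall back to offering an alternative support channel.
        "tool_mock_config": {
            "search_docs": {"default_return_value": {"results": []}}
        },
    }
}
```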

Automating for Scale: Embedding Tests in CI/CD

The final piece is automation. We used ElevenLabs’ open APIs to connect with our GitHub DevOps flow — embedding evaluation and simulation into our CI/CD pipeline. Every update is automatically tested before deployment. This prevents regressions and gives us fast feedback on real-world performance.
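As one way to wire this into a pipeline, the pytest sketch below runs every checked-in scenario through a simulation and fails the build if any criterion does not pass. The `agent_testing.run_simulation` helper, the scenario files, and the result shape are hypothetical stand-ins for a team’s own wrapper around the API.

```python
import json
import pathlib
import pytest

# Hypothetical in-house wrapper around the simulation endpoint shown earlier;
# in CI this is called once per scenario file checked into the repository.
from agent_testing import run_simulation  # assumed helper, not a real package

SCENARIO_DIR = pathlib.Path("tests/scenarios")

@pytest.mark.parametrize("scenario_path", sorted(SCENARIO_DIR.glob("*.json")))
def test_scenario_passes_all_criteria(scenario_path):
    scenario = json.loads(scenario_path.read_text())
    result = run_simulation(scenario)

    # Every evaluation criterion attached to the scenario must pass; a single
    # failure fails the build before the agent update is deployed.
    failures = {
        name: outcome
        for name, outcome in result["evaluation_criteria_results"].items()
        if outcome["result"] != "success"
    }
    assert not failures, f"Failed criteria in {scenario_path.name}: {failures}"
```

Run on every pull request, for example from a GitHub Actions workflow, a suite like this surfaces regressions before an agent update ships.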

Results: A Stronger, Smarter Alexis

This process transformed how we build and maintain Alexis. We’ve created a feedback loop that connects real usage with structured evaluation, targeted testing, and automated validation — allowing us to ship improvements faster and with greater confidence.

And it’s a framework we can now apply to any agent we build.
