Discover how to effectively test and improve conversational AI agents using robust evaluation criteria and conversation simulations.
When conversational agents go live, how do you monitor them at scale? How do you catch when they’re not behaving as intended? And once you’ve made changes, how do you test them?
These questions shaped our work on Alexis — our documentation assistant powered by Conversational AI. As Alexis evolved, we built a system for monitoring, evaluating, and testing agents, grounded in evaluation criteria and conversation simulations.
Improving any agent starts with understanding how it behaves in the wild. That meant refining our evaluation criteria — ensuring they were accurate and reliable enough to monitor agent performance. We define a failed conversation as one where the agent either gives incorrect information or doesn’t help the user achieve their goal.
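As an illustration, criteria of this kind can be expressed as prompt-based checks over the conversation transcript. The sketch below is illustrative only; the names, fields, and prompt wording are hypothetical, not our exact production set:

```python
# Hypothetical evaluation criteria, expressed as prompt-based checks over a
# transcript. Names, fields, and prompt wording are illustrative assumptions,
# not the exact production configuration.
EVALUATION_CRITERIA = [
    {
        "id": "interaction",
        "name": "Interaction",
        "prompt": (
            "Mark as failed if the conversation contains no meaningful "
            "user-agent exchange (e.g. an empty or noise-only transcript)."
        ),
    },
    {
        "id": "solved_user_inquiry",
        "name": "Solved user inquiry",
        "prompt": (
            "Mark as failed if the agent gives incorrect information or the "
            "user does not achieve their stated goal."
        ),
    },
]
```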
If the Interaction criterion fails, the conversation itself isn't valid. If any other criterion fails, we investigate further. The investigation guides how we improve the agent. Sometimes that means refining tool usage or timing; other times, it means adding guardrails to prevent unsupported actions.
Once we’ve identified what to improve, the next step is testing. That’s where our Conversation Simulation API comes in. It simulates realistic user scenarios, both end-to-end and in targeted segments, and automatically evaluates the results using the same criteria we apply in production. It supports tool mocking and custom evaluation, making it flexible enough to test specific behaviors.
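As a minimal sketch of what a single simulation run can look like, the call below posts a scenario to the simulate-conversation endpoint. The request and response field names reflect our understanding of the API and should be treated as assumptions rather than a verified schema:

```python
import os

import requests

# Minimal sketch of one simulation run against the Conversation Simulation
# API. The request/response field names are assumptions and may differ from
# the current schema.
API_KEY = os.environ["ELEVENLABS_API_KEY"]
AGENT_ID = os.environ["AGENT_ID"]  # the agent under test

resp = requests.post(
    f"https://api.elevenlabs.io/v1/convai/agents/{AGENT_ID}/simulate-conversation",
    headers={"xi-api-key": API_KEY},
    json={
        "simulation_specification": {
            "simulated_user_config": {
                # Persona and goal of the simulated user for this scenario.
                "prompt": {
                    "prompt": (
                        "You are a developer asking how to add a document "
                        "to your agent's knowledge base."
                    )
                }
            }
        }
    },
    timeout=120,
)
resp.raise_for_status()
result = resp.json()

# The response carries the simulated transcript plus per-criterion results,
# evaluated with the same criteria used in production.
for turn in result.get("simulated_conversation", []):
    print(turn.get("role"), ":", turn.get("message"))
print(result.get("analysis", {}).get("evaluation_criteria_results"))
```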
Clear, focused scenarios let us control what the LLM is being tested on, ensuring coverage for edge cases, tool usage, and fallback logic.
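Tool mocking narrows a scenario further. In the sketch below, `search_docs` is a hypothetical tool name and the mock-configuration fields are assumptions about the request shape; the idea is to pin a tool's output so the scenario isolates the agent's fallback behavior:

```python
# Sketch of a targeted scenario with a mocked tool. `search_docs` is a
# hypothetical tool name, and `tool_mock_config` / `default_return_value`
# are assumptions about the request shape.
simulation_specification = {
    "simulated_user_config": {
        "prompt": {
            "prompt": (
                "You are a user asking about a feature the product "
                "does not support."
            )
        }
    },
    # Whenever the agent calls `search_docs`, return an empty result so the
    # test exercises the agent's no-results fallback path.
    "tool_mock_config": {
        "search_docs": {"default_return_value": "[]"}
    },
}
```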
The final piece is automation. We used ElevenLabs’ open APIs to connect with our GitHub DevOps flow — embedding evaluation and simulation into our CI/CD pipeline. Every update is automatically tested before deployment. This prevents regressions and gives us fast feedback on real-world performance.
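A CI gate of this kind can be a short test suite that runs simulated scenarios against the candidate agent and fails the build if any criterion fails. The sketch below uses pytest; the environment variable names, scenarios, and response fields are illustrative assumptions, not our exact pipeline:

```python
# Sketch of a CI gate: run simulated scenarios against the candidate agent
# and fail the build when any evaluation criterion fails. Env var names,
# scenarios, and response fields are illustrative assumptions.
import os

import pytest
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]
AGENT_ID = os.environ["AGENT_ID"]

SCENARIOS = [
    "You are a developer asking how to stream text to speech.",
    "You are a user asking the agent to perform an unsupported action.",
]


def run_simulation(user_prompt: str) -> dict:
    """Run one simulated conversation and return the parsed response."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/convai/agents/{AGENT_ID}/simulate-conversation",
        headers={"xi-api-key": API_KEY},
        json={
            "simulation_specification": {
                "simulated_user_config": {"prompt": {"prompt": user_prompt}}
            }
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()


@pytest.mark.parametrize("user_prompt", SCENARIOS)
def test_agent_passes_all_criteria(user_prompt):
    results = (
        run_simulation(user_prompt)
        .get("analysis", {})
        .get("evaluation_criteria_results", {})
    )
    failed = {k: v for k, v in results.items() if v.get("result") != "success"}
    assert not failed, f"Failed criteria for {user_prompt!r}: {failed}"
```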
This process transformed how we build and maintain Alexis. We’ve created a feedback loop that connects real usage with structured evaluation, targeted testing, and automated validation — allowing us to ship improvements faster and with greater confidence.
And it’s a framework we can now apply to any agent we build.