Experiments
Run controlled A/B tests on production traffic to optimize agent performance with data, not intuition
Experiments let you run controlled A/B tests across any aspect of agent configuration — prompt structure, workflow logic, voice, personality, tools, knowledge base — by routing a defined slice of traffic to a variant, measuring the impact on key outcomes, and promoting winners to production.
Experiments are built on top of agent versioning. Versioning must be enabled on your agent before you can run experiments.
Without structured experimentation, optimization relies on intuition. A prompt tweak “feels” better. A workflow adjustment “should” improve containment. A new escalation path “seems” more efficient.
Experiments replace guesswork with evidence. You test changes against live traffic, measure real outcomes, and promote what works.
Experiments follow a four-step workflow:
Start from your current agent configuration and create a new branch. Modify anything — system prompt, workflow, voice, tools, knowledge base, guardrails, or evaluation criteria. Each change is tracked as a versioned configuration.
Navigate to the Branches tab in your agent settings and click Create branch.
Define what percentage of live conversations should go to your variant. Start small (5–10%) to limit risk, then increase as confidence grows.
Click Edit traffic split and set the percentages for each branch. Percentages must total exactly 100%.

Compare variant performance against your baseline using the analytics dashboard. Click See analytics from the branches panel to jump directly to a branch-filtered view.

Teams can measure outcomes such as:
Traffic is split between branches by percentage. Routing is deterministic based on the conversation ID, so the same user consistently reaches the same branch across sessions.
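The platform's actual routing algorithm is not documented here. As a rough illustration of how deterministic percentage-based routing can work in general, the sketch below hashes an ID to a point in [0, 100) and picks the branch whose cumulative share covers that point. The function name `assign_branch`, the branch names, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def assign_branch(conversation_id: str, split: dict[str, float]) -> str:
    """Map an ID to a branch, always returning the same branch for the same ID.

    `split` maps hypothetical branch names to traffic percentages.
    """
    if not split:
        raise ValueError("split must contain at least one branch")

    # Hash the ID and scale the first 8 bytes to a point in [0, 100).
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * 100

    # Walk the branches in order and return the first one whose
    # cumulative share covers the point.
    cumulative = 0.0
    last_branch = ""
    for branch, pct in split.items():
        cumulative += pct
        last_branch = branch
        if point < cumulative:
            return branch

    # Guard against floating-point rounding when shares total 100.
    return last_branch
```

Because the branch depends only on the hashed ID and the split, repeating the call with the same inputs always yields the same branch, and no per-conversation assignment needs to be stored.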
By default, traffic is randomized across the user base. If you use the API to initiate conversations, you can route specific cohorts to specific branches by controlling which conversations are initiated with which branch configuration.
All traffic percentages must sum to exactly 100%. A deployment will fail if they don’t.
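If you manage traffic splits from your own scripts, a local pre-check can catch a bad split before you attempt a deployment. The following is a minimal sketch, not part of the platform: `validate_traffic_split` and the branch names are hypothetical, and the split is assumed to be a plain mapping of branch name to percentage.

```python
def validate_traffic_split(split: dict[str, float]) -> None:
    """Raise ValueError unless each share is within 0-100 and the shares total 100."""
    if not split:
        raise ValueError("traffic split is empty")

    for branch, pct in split.items():
        if not 0 <= pct <= 100:
            raise ValueError(f"{branch}: {pct} is outside the 0-100 range")

    total = sum(split.values())
    # Small tolerance so fractional shares such as 33.3 + 33.3 + 33.4 pass.
    if abs(total - 100) > 1e-9:
        raise ValueError(f"traffic split totals {total}, expected 100")
```

For example, `validate_traffic_split({"main": 90, "variant": 10})` returns without error, while `{"main": 90, "variant": 5}` raises a `ValueError` because the shares total 95.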
Experiments support continuous optimization across customer-facing and operational workflows.
Test whether a revised escalation flow improves CSAT without increasing handling time. Compare different greeting styles, empathy levels, or resolution strategies.
Test whether a more direct tone or different qualification logic increases conversion. Experiment with objection handling, pricing presentation, or follow-up timing.
Measure whether tool logic changes reduce average handling time or infrastructure cost. Test different knowledge base configurations or workflow structures.
Each experiment is tied to a specific agent version, so every performance shift is attributable to a defined configuration change.
Any aspect of agent configuration can be varied between branches:
Define what you expect to improve and how you’ll measure it before creating a variant. For example: “Changing the escalation prompt to include a summary of the issue will improve our resolution-rate evaluation criterion by 10%.”
Isolating a single variable makes it clear what caused any performance difference. If you change the prompt, voice, and workflow simultaneously, you won’t know which change drove the result.
Configure success evaluation criteria before running experiments. These provide the structured metrics you need to compare variants objectively.
Begin with 5–10% of traffic on the variant. This limits exposure if something goes wrong while still generating meaningful data.
Allow enough conversations to accumulate before drawing conclusions. Small sample sizes lead to unreliable results. Monitor the analytics dashboard and wait for trends to stabilize.
Merge or discard experiments promptly. Long-running branches become harder to merge and may drift from the main configuration.
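Two of the practices above, stating a measurable hypothesis up front and waiting for enough conversations, can be made concrete with standard statistics. The sketch below is one common approach and is not a feature of the analytics dashboard: it estimates how many conversations each branch needs to detect an expected change in a success rate, then applies a two-proportion z-test to the observed counts. The function names, the 5% significance level, the 80% power, and the example rates are all assumptions.

```python
import math
from statistics import NormalDist


def conversations_per_branch(baseline_rate: float, expected_rate: float,
                             alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per branch for comparing two success rates."""
    if baseline_rate == expected_rate:
        raise ValueError("baseline_rate and expected_rate must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    n = (z_alpha + z_beta) ** 2 * variance / (expected_rate - baseline_rate) ** 2
    return math.ceil(n)


def two_proportion_z_test(success_a: int, total_a: int,
                          success_b: int, total_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

As a usage example, `conversations_per_branch(0.60, 0.66)` estimates the sample needed per branch to detect a resolution rate moving from 60% to 66%, and `two_proportion_z_test(600, 1000, 660, 1000)` tests whether an observed difference of that size is likely to be noise. With only 5–10% of traffic on the variant, the variant branch is usually the one that takes longest to reach the required count.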
Learn the underlying versioning system — branches, versions, and API reference
Monitor experiment performance with the analytics dashboard
Define custom success criteria to measure experiment outcomes
Set up automated tests before branching to establish a baseline