The agent testing framework enables you to move from slow, manual phone calls to a fast, automated, and repeatable testing process. Create comprehensive test suites that verify both conversational responses and tool usage, ensuring your agents behave exactly as intended before deploying to production.
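The framework consists of two complementary testing approaches: scenario testing, which evaluates conversational responses, and tool call testing, which verifies tool usage.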
Both test types can be created from scratch or directly from existing conversations, allowing you to quickly turn real-world interactions into repeatable test cases.
Scenario testing evaluates your agent’s conversational abilities by simulating interactions and assessing responses against defined success criteria.

Create context for the test. This can be multiple turns of interaction that set up the specific scenario you want to evaluate. Our testing framework currently supports evaluating only a single next step in the conversation. To simulate entire conversations, see our simulate conversation endpoint and conversation simulation guide.
Example scenario:
Describe in plain language what the agent’s response should achieve. Be specific about the expected behavior, tone, and actions.
Example criteria:
Supply both success and failure examples to help the evaluator understand the nuances of your criteria.
Success Example:
“I understand how frustrating duplicate charges can be. Let me look into this right away for you. I can see there were indeed two charges this month - I’ll process a refund for the duplicate charge immediately. Would you still like to proceed with cancellation, or would you prefer to continue once this is resolved?”
Failure Example:
“You need to contact billing department for refund issues. Your subscription will be cancelled.”
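As a rough illustration of how the pieces above fit together (context, success criteria, and examples), here is a minimal Python sketch. The `Turn` and `ScenarioTest` classes, their field names, the user message, and the criteria text are all hypothetical and are not part of the framework's API; only the quoted success and failure examples come from this page. The `validate` helper reflects the single-next-step limitation: the context should end on the user turn that the agent must answer.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str      # "user" or "agent"
    message: str


@dataclass
class ScenarioTest:
    """Hypothetical container for a single-turn scenario test."""
    name: str
    context: list[Turn]        # prior turns that set up the scenario
    success_criteria: str      # plain-language description of a good reply
    success_examples: list[str] = field(default_factory=list)
    failure_examples: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems that would make the test ambiguous."""
        problems = []
        if not self.context:
            problems.append("context needs at least one turn")
        elif self.context[-1].role != "user":
            # Only the agent's next response is evaluated, so the context
            # should end with the user turn the agent has to answer.
            problems.append("context should end with a user turn")
        if not self.success_criteria.strip():
            problems.append("success criteria must not be empty")
        return problems


# Illustrative values: the user message and criteria are invented for this sketch.
refund_test = ScenarioTest(
    name="duplicate-charge-cancellation",
    context=[Turn("user", "I was charged twice this month and I want to cancel.")],
    success_criteria="Acknowledge the frustration, offer to fix the duplicate "
                     "charge, and ask whether the user still wants to cancel.",
    success_examples=["I understand how frustrating duplicate charges can be. "
                      "Let me look into this right away for you."],
    failure_examples=["You need to contact billing department for refund issues. "
                      "Your subscription will be cancelled."],
)
print(refund_test.validate())
```

Keeping the examples next to the criteria mirrors the guidance above: the evaluator sees both what a pass and what a fail look like.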
Transform real conversations into test cases with a single click. This creates a feedback loop for continuous improvement based on actual performance.

When reviewing call history, if you identify a conversation where the agent didn’t perform well:
Tool call testing verifies that your agent correctly uses tools and passes the right parameters in specific situations. This is critical for actions like call transfers, data lookups, or external integrations.

Choose which tool you expect the agent to call in the given scenario (e.g., transfer_to_number, end_call, lookup_order).
Specify what data the agent should pass to the tool. You have three validation methods:
Exact Match: The parameter must exactly match your specified value.
Regex Pattern: The parameter must match a specific pattern.
LLM Evaluation: An LLM evaluates whether the parameter is semantically correct based on context.
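The three validation methods can be pictured with a small Python sketch. This is not the framework's implementation: `ParamCheck`, `check_param`, `check_tool_call`, and the `phone_number` parameter name are hypothetical, the regex is applied with `re.fullmatch` as an assumption (the framework may anchor patterns differently), and the LLM evaluation is reduced to a caller-supplied function so the example stays self-contained.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# A judge takes (actual value, plain-language criterion) and returns pass/fail.
Judge = Callable[[str, str], bool]


@dataclass
class ParamCheck:
    method: str     # "exact", "regex", or "llm"
    expected: str   # literal value, regex pattern, or criterion text


def check_param(value: str, check: ParamCheck, llm_judge: Optional[Judge] = None) -> bool:
    if check.method == "exact":
        return value == check.expected
    if check.method == "regex":
        # Assumption: the whole value must match the pattern.
        return re.fullmatch(check.expected, value) is not None
    if check.method == "llm":
        if llm_judge is None:
            raise ValueError("llm check requires a judge callable")
        return llm_judge(value, check.expected)
    raise ValueError(f"unknown method: {check.method}")


def check_tool_call(expected_tool: str, checks: dict[str, ParamCheck],
                    actual_tool: str, actual_params: dict,
                    llm_judge: Optional[Judge] = None) -> bool:
    """Pass only if the right tool was called and every checked parameter passes."""
    if actual_tool != expected_tool:
        return False
    return all(
        name in actual_params
        and check_param(str(actual_params[name]), chk, llm_judge)
        for name, chk in checks.items()
    )


# Example: expect a transfer to any phone number in international format.
checks = {"phone_number": ParamCheck("regex", r"\+\d{7,15}")}
ok = check_tool_call("transfer_to_number", checks,
                     "transfer_to_number", {"phone_number": "+15551234567"})
wrong_tool = check_tool_call("transfer_to_number", checks, "end_call", {})
print(ok, wrong_tool)
```

Exact match suits fixed values such as a department identifier, regex suits values with a known shape, and LLM evaluation is the fallback for free-text parameters where only the meaning matters.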
Tool call testing is essential for high-stakes scenarios:
Simulation testing evaluates your agent across a full, multi-turn conversation with a simulated AI user. Unlike single-turn evaluations, this test type checks whether the complete interaction reaches your defined outcome.
Simulation testing is currently in public alpha. Functionality and UI behavior may change.

Describe the user’s context, intent, and behavior in natural language. The simulator uses this scenario to drive the conversation.
Example scenario:
“A tourist who is not fluent in English is trying to place an order at a restaurant.”
Define the outcome that should count as a pass. This prompt is used to evaluate whether the full conversation succeeded.
Example success condition:
“The agent confirmed the order details, handled clarifying questions, and completed the order without misunderstandings.”
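Conceptually, a simulation test alternates between a simulated user and the agent, then judges the whole transcript against the success condition. The Python sketch below shows that loop under stated assumptions: `run_simulation`, the callable signatures, and the `max_turns` cap are hypothetical, and the scripted user, echo agent, and keyword judge are simple stand-ins for the LLM-driven components the framework uses.

```python
from typing import Callable, Optional

Message = tuple[str, str]  # (role, text)


def run_simulation(
    agent_reply: Callable[[list[Message]], str],
    user_reply: Callable[[list[Message]], Optional[str]],
    judge: Callable[[list[Message]], bool],
    max_turns: int = 10,
) -> tuple[bool, list[Message]]:
    """Drive a multi-turn conversation, then evaluate the full transcript."""
    transcript: list[Message] = []
    for _ in range(max_turns):
        user_text = user_reply(transcript)
        if user_text is None:  # the simulated user has nothing more to say
            break
        transcript.append(("user", user_text))
        transcript.append(("agent", agent_reply(transcript)))
    return judge(transcript), transcript


# Stand-ins for the simulated user, the agent, and the evaluator.
script = iter(["One soup, please.", "Yes, that is right."])


def scripted_user(transcript: list[Message]) -> Optional[str]:
    return next(script, None)


def echo_agent(transcript: list[Message]) -> str:
    return "Order confirmed: " + transcript[-1][1]


def keyword_judge(transcript: list[Message]) -> bool:
    return any(role == "agent" and "confirmed" in text for role, text in transcript)


passed, transcript = run_simulation(echo_agent, scripted_user, keyword_judge)
print(passed, len(transcript))
```

The key difference from a scenario test is where the evaluation happens: the judge sees the complete conversation rather than a single next response.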
You can refine simulation behavior in the test configuration panel:
Simulation tests support tool mocking so your agent can receive controlled responses during a run instead of calling live systems.
System tools and workflow tools are never mocked.
Choose how the test behaves when a mocked tool is called and no matching mock response is found:
The fallback setting appears only when at least one tool is mocked.
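To make the mocking rules concrete, here is a minimal Python sketch of a mock registry with a fallback. Everything in it is an assumption for illustration: the `ToolMocker` class, the parameter-matching rule, the contents of `SYSTEM_TOOLS`, and in particular the two fallback names `raise_error` and `call_real_tool`, which are invented here and may not correspond to the options the product offers.

```python
from typing import Any, Callable

# Assumed set of system tool names, used only for this illustration.
SYSTEM_TOOLS = {"end_call", "transfer_to_number"}


class MissingMockError(Exception):
    """Raised when a mocked tool is called with no matching mock response."""


class ToolMocker:
    def __init__(self, real_tools: dict[str, Callable[[dict], Any]],
                 fallback: str = "raise_error"):
        if fallback not in ("raise_error", "call_real_tool"):  # hypothetical options
            raise ValueError(f"unknown fallback: {fallback}")
        self.real_tools = real_tools
        self.fallback = fallback
        self.mocks: dict[str, list[tuple[dict, Any]]] = {}

    def mock(self, tool: str, when: dict, response: Any) -> None:
        if tool in SYSTEM_TOOLS:
            raise ValueError(f"{tool} is a system tool and cannot be mocked")
        self.mocks.setdefault(tool, []).append((when, response))

    def call(self, tool: str, params: dict) -> Any:
        if tool not in self.mocks:
            return self.real_tools[tool](params)  # unmocked tools run normally
        for when, response in self.mocks[tool]:
            # A mock matches when every parameter in `when` has the same value.
            if all(params.get(k) == v for k, v in when.items()):
                return response
        if self.fallback == "call_real_tool":
            return self.real_tools[tool](params)
        raise MissingMockError(f"no mock for {tool} with {params}")


real = {"lookup_order": lambda p: {"status": "live"}}
mocker = ToolMocker(real, fallback="call_real_tool")
mocker.mock("lookup_order", {"order_id": "A1"}, {"status": "shipped"})
print(mocker.call("lookup_order", {"order_id": "A1"}))
print(mocker.call("lookup_order", {"order_id": "B2"}))
```

The fallback only matters once a tool has at least one mock registered, which matches the note above that the setting appears only when at least one tool is mocked.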
The framework supports an iterative development cycle that accelerates agent refinement:
Define the desired behavior by creating tests for new features or identified issues.
Run tests instantly without saving changes. Watch them fail, then adjust your agent’s prompts or configuration.
Navigate to the Tests tab in your agent’s interface. From there, you can run individual tests or execute your entire test suite at once using the “Run All Tests” button.

Execute all tests at once to ensure comprehensive coverage:
Test that your agent maintains its defined personality, tone, and behavioral boundaries across diverse conversation scenarios and emotional contexts.
Create scenarios that test the agent’s ability to maintain context, follow conditional logic, and handle state transitions across extended conversations.
Evaluate how your agent responds to attempts to override its instructions or extract sensitive system information through adversarial inputs.
Test how effectively your agent clarifies vague requests, handles conflicting information, and navigates situations where user intent is unclear.