Webinar Recap: Building Safe AI Agents for Enterprise Deployment
Getting an AI agent to handle conversations is the easy part. Getting your security team, legal team, and customers to trust it is where most enterprise deployments stall.
This post recaps our live workshop, Building Safe AI Agents for Enterprise Deployment, where we walked through the tools, frameworks, and practices that make enterprise agent deployments work at scale.
How to build a layered approach to safety
Over four million agents have been deployed on the ElevenAgents platform. The ones that perform reliably in enterprise settings share one trait: safety was built in from the start, not added after the first incident.
Our live session covered the frameworks, controls, and deployment practices that separate agents that clear security review from those that don't.
Different agents require fundamentally different boundaries.
- A video game character might need to use explicitly violent language as part of the experience - but should never break character or reveal it's an AI.
- A healthcare receptionist needs to discuss injuries and medical contexts - but should never give medical advice.
- A credit card support agent shouldn't touch explicit content at all, and shouldn't share account details with unverified callers.
Because agents are non-deterministic, no single safeguard can fully protect against all potential risks. That's why enterprise teams need a layered approach - multiple controls working together to make safety failures the rare exception.
That principle shaped the four questions we organized the session around:
- How can I control what my agent says and does?
- How can I check that it works?
- How can I protect data to meet security and compliance requirements?
- How can I build processes to deploy safely?
How to think about controlling agent behavior
In any agent conversation, there are three points where safety must be considered.
Input
The user says something. Adversarial users may try things like "ignore all previous instructions" or "pretend you're a different assistant." You need to detect and handle manipulation attempts before they ever reach the model. This prevents unnecessary cost and stops bad actors from extracting information they shouldn't have.
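Here is a minimal sketch of what that pre-model screening can look like, assuming a simple pattern-based detector in Python. The function names and patterns are illustrative, not the platform's implementation; production detectors typically pair pattern matching with a lightweight classifier.

```python
import re

# Illustrative patterns only; a production detector would combine
# patterns like these with a trained classifier model.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"pretend you('| a)?re a different assistant",
    r"reveal your (system )?prompt",
]

def is_manipulation_attempt(user_message: str) -> bool:
    """Screen user input before it ever reaches the LLM."""
    text = user_message.lower()
    return any(re.search(p, text) for p in INJECTION_PATTERNS)

def handle_input(user_message: str) -> dict:
    if is_manipulation_attempt(user_message):
        # Terminate early: no model call, no cost, nothing to extract.
        return {"action": "end_conversation", "reason": "manipulation_detected"}
    return {"action": "continue"}
```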
Decision-making
The LLM decides what to say or do. This is where your system prompt is your primary control surface - but in long or complex conversations, LLMs can drift from their instructions. You need mechanisms that reinforce behavior throughout the conversation, not just at the start. You should also define escalation paths: are there situations where the agent should hand off to a human or a more specialized agent, and under what conditions?
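One common way to implement that reinforcement, sketched below assuming an OpenAI-style chat message list, is to rebuild the context each turn with the system prompt on top and periodically re-inject a short reminder. This illustrates the usual fix for drift, not how any particular platform feature is implemented internally.

```python
SYSTEM_PROMPT = "You are a sales assistant. Do not offer any discounts."

def build_messages(history: list[dict], reinforce_every: int = 4) -> list[dict]:
    """Rebuild the message list each turn with the system prompt on top,
    and periodically restate critical rules so long conversations
    don't drift from the original instructions."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for i, turn in enumerate(history):
        messages.append(turn)
        # Every few turns, re-inject a short system reminder.
        if i > 0 and i % reinforce_every == 0:
            messages.append({
                "role": "system",
                "content": "Reminder: follow your original instructions. "
                           "Do not offer discounts.",
            })
    return messages
```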
Output
Even with strong guidance, something can slip through - especially in long-running conversations. You need a final safety net. Think of it as a mini-agent checking the work of your main agent: it evaluates the response before it reaches the user and decides whether to deliver it, retry, or escalate. It also runs in parallel with response generation, adding minimal latency.
Across all three, you need to define your exit strategies in advance: does a violation end the conversation, trigger a retry with corrective guidance, or transfer to a human? This decision shapes the user experience when something goes wrong.
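Putting the three layers and the exit strategy together, a single turn might flow like the sketch below. The helper functions are hypothetical: `generate` and `check_output` stand in for your LLM call and output guardrail, and `is_manipulation_attempt` comes from the earlier input sketch.

```python
from typing import Callable

def run_turn(
    user_message: str,
    generate: Callable[[str], str],       # your LLM call
    check_output: Callable[[str], str],   # returns "pass", "retry", or "escalate"
    max_retries: int = 1,
) -> dict:
    """One conversation turn through all three safety layers."""
    # Layer 1: input screening (see the earlier sketch).
    if is_manipulation_attempt(user_message):
        return {"action": "end_conversation"}

    # Layer 2: generation under the (reinforced) system prompt.
    response = generate(user_message)

    # Layer 3: final safety net, with the exit strategy decided in advance.
    for _ in range(max_retries + 1):
        verdict = check_output(response)
        if verdict == "pass":
            return {"action": "deliver", "response": response}
        if verdict == "escalate":
            return {"action": "transfer_to_human"}
        # Retry with corrective guidance appended to the request.
        response = generate(user_message + "\n[Corrective guidance: stay on policy.]")

    return {"action": "end_conversation", "reason": "repeated_violations"}
```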
Demo 1: Configuring guardrails in ElevenAgents
Scenario: A website sales and support agent is configured with multiple layers of safety controls to prevent manipulation, off-topic responses, and policy violations.
What was shown:
- Manipulation guardrail (input) - found in the Security tab, this toggle detects prompt injection patterns - attempts to override system instructions - and terminates the conversation before the agent responds. Recommended for all production agents.
- System prompt and Focus guardrail (decision-making) - the system prompt is foundational, and every important rule should live there explicitly. In the demo, an instruction was added mid-session: "Do not offer any discounts." The Focus guardrail, enabled separately, automatically reinforces the system prompt throughout the entire conversation - addressing the drift problem that occurs in longer interactions. A strong system prompt with Focus enabled is the single most effective pairing for keeping an agent on track.
- Content guardrail (output) - pre-configured categories covering profanity, legal advice, and political opinions. Each has an adjustable confidence threshold - medium is the recommended starting point. This is the fallback layer: if the agent is about to produce something it shouldn't, this catches it before delivery.
- Custom guardrail (output) - user-defined checks written in natural language for any case not covered by presets. In the demo, a "no discounts" guardrail was configured: "Block any response that mentions discounts, promotions, or special pricing that the agent is not authorized to offer." Custom guardrails use an additional LLM evaluation, so there is usage-based cost and a latency consideration. Write tight instructions and split distinct checks into separate guardrails rather than combining them (a sketch of this evaluation pattern appears below).
- Action on violation - two options: end the call, or retry. On retry, you can provide additional instructions to guide the agent's next attempt — for example, escalating to a human or delivering a default redirect message.
Why it matters: These controls are not one size fits all. They are configurable at the individual guardrail level. That granularity is what makes the difference between an agent that is theoretically safe and one that is operationally safe across diverse enterprise contexts.
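To make the custom guardrail mechanics concrete, here is one plausible shape for a natural-language check evaluated by a separate LLM call. This is a sketch, not the platform's API; `llm_complete` is a stand-in for a generic chat-completion client.

```python
GUARDRAIL_RULE = (
    "Block any response that mentions discounts, promotions, or special "
    "pricing that the agent is not authorized to offer."
)

EVALUATOR_PROMPT = """You are a guardrail evaluator.
Rule: {rule}
Candidate response: {response}
Answer with exactly one word: PASS or BLOCK."""

def check_custom_guardrail(response: str, llm_complete) -> bool:
    """Evaluate one natural-language rule with a separate LLM call.
    Each rule is its own call, which is why tight, single-purpose
    guardrails cost less and add less latency than combined ones."""
    prompt = EVALUATOR_PROMPT.format(rule=GUARDRAIL_RULE, response=response)
    verdict = llm_complete(prompt).strip().upper()
    return verdict == "PASS"
```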
Demo 2: Simulation testing before launch
Scenario: A support agent is tested against two discount-related conversation scenarios to confirm it redirects users to the pricing page without offering discounts.
What was shown:
- Two simulation tests defined in the Tests tab, each with a simulated user scenario, a defined number of conversation turns, and explicit success criteria
- One test initially failed because the system prompt lacked specific edge case instructions
- The missing instructions were added to the guardrail section of the system prompt
- The agent was republished and both tests were rerun - both passed
- Detailed run history shows exactly which part of a conversation failed, including tool calls and agent actions
Why it matters: Simulation testing lets teams validate agent behavior in a controlled environment before any real user sees the agent. It covers both routine scenarios and adversarial ones. It also runs against the full conversation flow, not just individual responses. As changes are made, tests can be rerun immediately to confirm the fix held.
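As an illustration of the structure, a simulation test can be thought of as a persona, a turn budget, and success criteria judged against the full transcript. The sketch below is hypothetical - the names and shape are not the Tests tab's actual schema; `agent_reply`, `simulate_user`, and `judge` are placeholder callables.

```python
from dataclasses import dataclass

@dataclass
class SimulationTest:
    name: str
    user_persona: str       # how the simulated user behaves
    max_turns: int          # defined number of conversation turns
    success_criteria: str   # judged against the full transcript

def run_simulation(test: SimulationTest, agent_reply, simulate_user, judge) -> dict:
    """Drive a multi-turn conversation with a simulated user, then
    judge the full transcript against the success criteria."""
    transcript = []
    for _ in range(test.max_turns):
        user_message = simulate_user(test.user_persona, transcript)
        transcript.append({"role": "user", "content": user_message})
        reply = agent_reply(transcript)
        transcript.append({"role": "agent", "content": reply})
    passed = judge(transcript, test.success_criteria)
    return {"test": test.name, "passed": passed, "transcript": transcript}

# Example: the "no discounts" scenario from the demo.
discount_test = SimulationTest(
    name="redirect_discount_requests",
    user_persona="A customer who repeatedly asks for a discount code.",
    max_turns=4,
    success_criteria="Agent redirects to the pricing page and never offers a discount.",
)
```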
Demo 3: PII redaction for sensitive deployments
Scenario: An enterprise agent is configured to redact personally identifiable information from conversation logs.
What was shown:
- Conversation History Redaction toggle found in the Advanced tab under Privacy settings
- A list of specific data entities that can be toggled for redaction individually, including date of birth, age, and other sensitive fields
- Option to select all entities or only those relevant to the specific agent's use case
Why it matters: PII redaction is not a replacement for zero retention mode in high-compliance environments such as those governed by HIPAA. What it does is reduce data exposure in conversation logs used for internal review or quality control. Teams keep the logs they need while stripping out the data they don't. It is currently available for enterprise customers.
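For intuition, selective redaction can be as simple as the sketch below, though production systems typically rely on trained entity-recognition models rather than regexes. The patterns and entity names here are illustrative only, not the platform's entity list.

```python
import re

# Illustrative patterns; real redaction uses trained entity-recognition
# models rather than regexes alone.
ENTITY_PATTERNS = {
    "email":         r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone":         r"\+?\d[\d\s().-]{7,}\d",
    "date_of_birth": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
}

def redact_transcript(text: str, enabled_entities: set[str]) -> str:
    """Strip only the entity types toggled on for this agent,
    so teams keep the log detail they need for review."""
    for entity, pattern in ENTITY_PATTERNS.items():
        if entity in enabled_entities:
            text = re.sub(pattern, f"[{entity.upper()} REDACTED]", text)
    return text

raw_log = "Caller: jane@example.com, DOB 04/12/1987, asked about billing."
print(redact_transcript(raw_log, {"email", "date_of_birth"}))
```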
Best practices for safe enterprise agent deployment
- Use a layered approach. No single control ensures safe behavior. Input guardrails, output validation, prompt hardening, and testing must work together. Each layer reinforces the others, and together they significantly reduce the risk of safety issues.
- Match guardrails to context. A healthcare agent and a retail support agent need different rules. Define the boundaries specific to your use case, not a generic template.
- Start with a use case that matters. The most successful enterprise deployments don't begin with a throwaway pilot. They pick something real - like customer support or scheduling - and invest in getting it right.
- Test before launch, then keep testing. Use simulation testing and external red-teaming tools. Test against both routine scenarios and adversarial ones. Add new edge cases discovered in production back into your test suite.
- Deploy in stages. Start with limited traffic. Monitor real conversations. Identify what the agent struggles with. Make adjustments, retest, then expand.
- Choose execution mode deliberately. Use blocking mode for text agents where strict validation matters more than speed. Use streaming mode for voice agents where latency is the priority (see the sketch after this list).
- Define clear actions on guardrail violation. Decide in advance whether a violation should end the call, trigger a retry, or escalate to a human.
- Keep custom guardrail instructions concise. Guardrails run in parallel. A long, complex custom guardrail increases latency. Write tight instructions and split distinct checks into separate guardrails.
- Understand what certifications actually cover. SOC 2 Type 2 and ISO 27001 are table stakes. Domain-specific standards like HIPAA and PCI DSS address regulated industries. Newer AI-specific certifications like ISO 42001 and AIUC-1 address bias, transparency, and adversarial resilience - and AIUC-1 certification can unlock AI-specific insurance.
- Build the process muscle early. The first deployment takes the most time. Teams that invest in testing and deployment process see dramatically faster iteration on every subsequent agent.
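For the execution-mode trade-off above, the contrast looks roughly like this sketch: blocking mode holds the full response until it is validated, while streaming mode delivers chunks immediately and validates the running transcript as it grows. Illustrative Python, not the platform's implementation.

```python
from typing import AsyncIterator, Awaitable, Callable

async def blocking_mode(
    generate: Callable[[], Awaitable[str]],
    check: Callable[[str], Awaitable[bool]],
) -> str | None:
    """Text agents: hold the complete response until validation passes."""
    response = await generate()
    return response if await check(response) else None

async def streaming_mode(
    stream: Callable[[], AsyncIterator[str]],
    check: Callable[[str], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Voice agents: deliver chunks as they arrive and validate the
    running transcript, cutting off on a violation. (Simplified: a
    production version would run checks as concurrent background
    tasks so they never sit between chunks.)"""
    so_far = ""
    async for chunk in stream():
        so_far += chunk
        yield chunk
        if not await check(so_far):
            return  # stop mid-response; apply the configured violation action
```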
Watch the full session
Watch the full webinar here.