Prompting guide
System design principles for production-grade conversational AI
Effective prompting transforms ElevenLabs Agents from robotic to lifelike.

A system prompt is the personality and policy blueprint of your AI agent. In enterprise use, it tends to be elaborate—defining the agent’s role, goals, allowable tools, step-by-step instructions for certain tasks, and guardrails describing what the agent should not do. The way you structure this prompt directly impacts reliability.
The system prompt controls conversational behavior and response style, but does not control conversation flow mechanics like turn-taking, or agent settings like which languages an agent can speak. These aspects are handled at the platform level.

The following principles form the foundation of production-grade prompt engineering:
Separating instructions into dedicated sections with markdown headings helps the model prioritize and interpret them correctly. Use whitespace and line breaks to separate instructions.
Why this matters for reliability: Models are tuned to pay extra attention to certain headings (especially # Guardrails), and clear section boundaries prevent instruction bleed where rules from one context affect another.
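As a minimal sketch of this sectioning, the snippet below renders a prompt from named sections, one markdown heading per section with a blank line between them. The helper name `build_system_prompt` and the example instructions are illustrative assumptions, not part of the platform.

```python
def build_system_prompt(sections: dict[str, list[str]]) -> str:
    """Render each section as a markdown heading followed by one instruction per line."""
    blocks = []
    for heading, instructions in sections.items():
        lines = [f"# {heading}"] + [f"- {text}" for text in instructions]
        blocks.append("\n".join(lines))
    # A blank line between blocks keeps section boundaries visually distinct.
    return "\n\n".join(blocks)


# Illustrative content only; replace with your own instructions.
prompt = build_system_prompt({
    "Personality": ["You are a concise, friendly support agent."],
    "Goal": ["Help the customer check the status of an order."],
    "Guardrails": ["Never guess or make up information."],
})
print(prompt)
```

Keeping sections in a dictionary also makes it easy to update one section without touching the others.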
Keep every instruction short, clear, and action-based. Remove filler words and restate only what is essential for the model to act correctly.
Why this matters for reliability: Concise instructions reduce ambiguity and token usage. Every unnecessary word is a potential source of misinterpretation.
If you need the agent to maintain a specific tone, define it explicitly and concisely in the # Personality or # Tone section. Avoid repeating tone guidance throughout the prompt.
Highlight critical steps by adding “This step is important” at the end of the line. Repeating the most important 1-2 instructions twice in the prompt can help reinforce them.
Why this matters for reliability: In complex prompts, models may prioritize recent context over earlier instructions. Emphasis and repetition ensure critical rules aren’t overlooked.
Text-to-speech models, especially faster ones, are best at generating speech from alphabetical text. Therefore, digits and symbols such as “@” or “£” are more likely to cause incorrect pronunciations or voice hallucinations.
To address this, we normalize non-alphabetical text into words before it reaches the TTS model (e.g., 123 -> one hundred and twenty-three, john@gmail.com -> john at gmail dot com), and allow you to choose from different normalization strategies with different trade-offs.
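To make the idea concrete, here is a toy illustration of the email example above. It is not the ElevenLabs normalizer and covers only “@” and “.”; the function name `spoken_email` is a hypothetical helper.

```python
# Toy mapping for the email example only; a real normalizer handles far more cases.
SPOKEN_SYMBOLS = {"@": " at ", ".": " dot "}


def spoken_email(address: str) -> str:
    """Replace symbols with words, then collapse any repeated spaces."""
    for symbol, word in SPOKEN_SYMBOLS.items():
        address = address.replace(symbol, word)
    return " ".join(address.split())


print(spoken_email("john@gmail.com"))  # john at gmail dot com
```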
We support two normalization strategies via the text_normalisation_type agent configuration:
system_prompt (default) — Adds instructions to the system prompt telling the LLM to write out numbers and symbols as words before the text reaches the TTS model.
If you do not want to use the TTS normalizer and you notice the LLM still occasionally responds with unnormalized text, consider switching to a more intelligent LLM or adding additional normalization instructions to the system prompt.
elevenlabs — Uses our TTS normalizer to normalize text after LLM generation, before it reaches the TTS model.
If transcript readability matters for your use case, consider using the elevenlabs normalizer. It keeps transcripts clean with natural symbols and numbers while still producing correctly spoken audio.
Find this configuration in our platform under the “Agent” tab by clicking the cog icon in the “Voices” section to open the common voice settings sheet, and configuring it at the bottom.
When using the system_prompt normalization setting, the LLM writes out symbols and numbers as words in its responses (e.g., john at gmail dot com instead of john@gmail.com). User transcriptions from speech-to-text can also arrive in a non-standard form. This means that when using these details as parameters in tool calls, the LLM may use the unstructured version present in the conversation context.
If a tool parameter expects a correctly formatted value (e.g., john@gmail.com not john at gmail dot com), the LLM needs to know this. Include the expected format directly in the tool parameter description with an example.
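A parameter description that states the expected format might look like the sketch below. The dictionary layout and field names are assumptions chosen for illustration, not the platform's schema, and `looks_like_written_email` is a hypothetical sanity check you could run in your own tool backend.

```python
# Hypothetical parameter definition; the field names are assumptions.
email_parameter = {
    "name": "email",
    "type": "string",
    "required": True,
    "description": (
        "The customer's email address in standard written format, "
        "e.g. 'john@gmail.com', not 'john at gmail dot com'."
    ),
}


def looks_like_written_email(value: str) -> bool:
    """Cheap check: exactly one '@', no spaces, and a dot in the domain part."""
    local, separator, domain = value.partition("@")
    return (
        bool(local)
        and separator == "@"
        and "@" not in domain
        and "." in domain
        and " " not in value
    )


print(looks_like_written_email("john@gmail.com"))         # True
print(looks_like_written_email("john at gmail dot com"))  # False
```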
List all non-negotiable rules the model must always follow in a dedicated # Guardrails section. Models are tuned to pay extra attention to this heading.
Why this matters for reliability: Guardrails prevent inappropriate responses and ensure compliance with policies. Centralizing them in a dedicated section makes them easier to audit and update.
To learn more about designing effective guardrails, see our guide on Guardrails.
Agents capable of handling transactional workflows can be highly effective. To enable this, they must be equipped with tools that let them perform actions in other systems or fetch live data from them.
How you describe the tools available to your agent is just as important as prompt structure. Clear, action-oriented tool definitions help the model invoke them correctly and recover gracefully from errors.
When creating a tool, add descriptions to all parameters. This helps the LLM construct tool calls accurately.
Tool description: “Looks up customer order status by order ID and returns current status, estimated delivery date, and tracking number.”
Parameter descriptions:
order_id (required): “The unique order identifier, formatted as written characters (e.g., ‘ORD123456’)”
include_history (optional): “If true, returns full order history including status changes”
Why this matters for reliability: Parameter descriptions act as inline documentation for the model. They clarify format expectations, required vs. optional fields, and acceptable values.
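The order lookup example above can be written as a structured definition, sketched below. The tool name `get_order_status`, the dictionary layout, and the helper `parameters_missing_descriptions` are assumptions for illustration; only the description strings come from the example.

```python
# Hypothetical tool definition; the layout is an assumption, not the platform's schema.
order_status_tool = {
    "name": "get_order_status",
    "description": (
        "Looks up customer order status by order ID and returns current status, "
        "estimated delivery date, and tracking number."
    ),
    "parameters": {
        "order_id": {
            "type": "string",
            "required": True,
            "description": (
                "The unique order identifier, formatted as written characters "
                "(e.g., 'ORD123456')"
            ),
        },
        "include_history": {
            "type": "boolean",
            "required": False,
            "description": "If true, returns full order history including status changes",
        },
    },
}


def parameters_missing_descriptions(tool: dict) -> list[str]:
    """Return the names of parameters whose description is absent or blank."""
    return [
        name
        for name, spec in tool["parameters"].items()
        if not spec.get("description", "").strip()
    ]


print(parameters_missing_descriptions(order_status_tool))  # []
```

A check like this can run in review or CI so that no tool ships with an undocumented parameter.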
Clearly define in your system prompt when and how each tool should be used. Don’t rely solely on tool descriptions—provide usage context and sequencing logic.
When tools require structured identifiers (emails, phone numbers, codes), make the expected format explicit in the parameter description with an example. This is especially important because normalization and speech-to-text transcription can produce spoken-form values in the conversation context. See structured data for tool inputs for background.
Tools can sometimes fail due to network issues, missing data, or other errors. Include clear instructions in your system prompt for recovery.
Why this matters for reliability: Tool failures are inevitable in production. Without explicit handling instructions, agents may hallucinate responses or provide incorrect information.
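Prompt instructions work best when the tool itself reports failures clearly. The sketch below assumes you control the tool's backend and wraps a call so that any failure comes back as a structured result. The wrapper `call_tool_safely`, the result fields, and the failing `flaky_lookup` function are all hypothetical.

```python
from typing import Any, Callable


def call_tool_safely(tool: Callable[..., Any], **arguments: Any) -> dict:
    """Run a tool and always return a structured result the agent can react to."""
    try:
        return {"ok": True, "data": tool(**arguments)}
    except Exception as error:  # broad on purpose: every failure must surface as a failure
        return {
            "ok": False,
            "error": f"{type(error).__name__}: {error}",
            # Hypothetical hint that mirrors the recovery instructions in the prompt.
            "instruction": (
                "Tell the user the lookup failed and offer to retry or escalate. "
                "Do not guess the result."
            ),
        }


def flaky_lookup(order_id: str) -> dict:
    """Stand-in for a tool whose backing service is unavailable."""
    raise TimeoutError("order service did not respond")


result = call_tool_safely(flaky_lookup, order_id="ORD123456")
print(result["ok"], result["error"])
```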
For detailed guidance on building reliable tool integrations, see our documentation on Client tools, Server tools, and MCP tools.
While strong prompts and tools form the foundation of agent reliability, production systems require thoughtful architectural design. Enterprise agents handle complex workflows that often exceed the scope of a single, monolithic prompt.
Overly broad instructions or large context windows increase latency and reduce accuracy. Each agent should have a narrow, clearly defined knowledge base and set of responsibilities.
Why this matters for reliability: Specialized agents have fewer edge cases to handle, clearer success criteria, and faster response times. They’re easier to test, debug, and improve.
A general-purpose “do everything” agent is harder to maintain and more likely to fail in production than a network of specialized agents with clear handoffs.
For complex tasks, design multi-agent workflows that hand off tasks between specialized agents—and to human operators when needed.
Architecture pattern:
Benefits of this pattern:
When designing multi-agent workflows, specify exactly when and how control should transfer between agents or to human operators.
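One way to make transfer conditions explicit is to write them down as data, as in the toy sketch below. The agent names, keywords, and keyword matching itself are assumptions for illustration; in a real workflow the transfer decision would typically be made by the model or the workflow configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoffRule:
    target: str                       # agent or human queue that receives control
    trigger_keywords: tuple[str, ...]  # phrases that trigger the transfer


# Hypothetical rules; order matters because the first matching rule wins.
RULES = (
    HandoffRule("billing_agent", ("refund", "invoice", "charge")),
    HandoffRule("technical_support_agent", ("error", "crash", "not working")),
    HandoffRule("human_operator", ("speak to a person", "human", "complaint")),
)


def route(utterance: str, default: str = "general_agent") -> str:
    """Return the first target whose keywords appear in the utterance."""
    text = utterance.lower()
    for rule in RULES:
        if any(keyword in text for keyword in rule.trigger_keywords):
            return rule.target
    return default


print(route("I want a refund"))               # billing_agent
print(route("Can I speak to a person?"))      # human_operator
print(route("What are your opening hours?"))  # general_agent
```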
For detailed guidance on building multi-agent workflows, see our documentation on Workflows.
Selecting the right model depends on your performance requirements—particularly latency, accuracy, and tool-calling reliability. Different models offer different tradeoffs between speed, reasoning capability, and cost.
Latency: Smaller models (fewer parameters) generally respond faster, making them suitable for high-frequency, low-complexity interactions.
Accuracy: Larger models provide stronger reasoning capabilities and better handle complex, multi-step tasks, but with higher latency and cost.
Tool-calling reliability: Not all models handle tool/function calling with equal precision. Some excel at structured output, while others may require more explicit prompting.
Based on deployments across millions of agent interactions, the following patterns emerge:
GPT-4o or GLM 4.5 Air (recommended starting point): Best for general-purpose enterprise agents where latency, accuracy, and cost must all be balanced. Offers low-to-moderate latency with strong tool-calling performance and reasonable cost per interaction. Ideal for customer support, scheduling, order management, and general inquiry handling.
Gemini 2.5 Flash Lite (ultra-low latency): Best for high-frequency, simple interactions where speed is critical. Provides the lowest latency with broad general knowledge, though with lower performance on complex tool-calling. Cost-effective at scale for initial routing/triage, simple FAQs, appointment confirmations, and basic data collection.
Claude Sonnet 4 or 4.5 (complex reasoning): Best for multi-step problem-solving, nuanced judgment, and complex tool orchestration. Offers the highest accuracy and reasoning capability with excellent tool-calling reliability, though with higher latency and cost. Ideal for tasks where mistakes are costly, such as technical troubleshooting, financial advisory, compliance-sensitive workflows, and complex refund/escalation decisions.
Model performance varies significantly based on prompt structure and task complexity. Before committing to a model:
For detailed model configuration options, see our Models documentation.
Reliability in production comes from continuous iteration. Even well-constructed prompts can fail in real use. What matters is learning from those failures and improving through disciplined testing.
Attach concrete evaluation criteria to each agent to monitor success over time and check for regressions.
Key metrics to track:
For detailed guidance on configuring evaluation criteria in ElevenLabs, see Success Evaluation.
When agents underperform, identify patterns in problematic interactions:
Review conversation transcripts where user satisfaction was low or tasks weren’t completed.
Update specific sections of your prompt to address identified issues:
Avoid making multiple prompt changes simultaneously. This makes it impossible to attribute improvements or regressions to specific edits.
Configure your agent to summarize data from each conversation. This allows you to analyze interaction patterns, identify common user requests, and continuously improve your prompt based on real-world usage.
For detailed guidance on configuring data collection in ElevenLabs, see Data Collection.
Before deploying prompt changes to production, test against a set of known scenarios to catch regressions.
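A small regression harness can be sketched as follows. It assumes you can call your agent as a function that takes a user message and returns a reply; `stub_agent` stands in for that call, and the scenarios and phrase checks are illustrative, not a recommended test set.

```python
from typing import Callable

# Illustrative scenarios: phrases the reply must and must not contain.
SCENARIOS = [
    {
        "user": "Where is my order ORD123456?",
        "must_include": ["ORD123456"],
        "must_not_include": ["probably"],
    },
    {
        "user": "What's the weather?",
        "must_include": ["order"],
        "must_not_include": ["sunny"],
    },
]


def run_regression(agent: Callable[[str], str], scenarios: list[dict]) -> list[str]:
    """Return one failure message per violated expectation; an empty list means no regressions."""
    failures = []
    for scenario in scenarios:
        reply = agent(scenario["user"])
        for phrase in scenario["must_include"]:
            if phrase not in reply:
                failures.append(f"{scenario['user']!r}: missing {phrase!r}")
        for phrase in scenario["must_not_include"]:
            if phrase in reply:
                failures.append(f"{scenario['user']!r}: unexpected {phrase!r}")
    return failures


def stub_agent(message: str) -> str:
    """Placeholder for a real agent call."""
    if "ORD123456" in message:
        return "Order ORD123456 is out for delivery."
    return "I can help with order questions."


print(run_regression(stub_agent, SCENARIOS))  # []
```

Running the same scenarios before and after each single prompt change makes it possible to attribute a regression to that change.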
For guidance on testing agents programmatically, see Simulate Conversations.
Enterprise agents require additional safeguards beyond prompt quality. Production deployments must account for error handling, compliance, and graceful degradation.
Every external tool call is a potential failure point. Ensure your prompt includes explicit error handling for:
The following examples demonstrate how to apply the principles outlined in this guide to real-world enterprise use cases. Each example includes annotations highlighting which reliability principles are in use.
Principles demonstrated:
# Personality, # Goal, # Tools, etc.)
# Goal numbered steps)
Principles demonstrated:
# Goal section
How you format your prompt impacts how effectively the language model interprets it:
# for main sections, ## for subsections
# Goal not # GOAL
Create shared prompt templates for common sections like character normalization, error handling, and guardrails. Store these in a central repository and reference them across specialist agents. Use the orchestrator pattern to ensure consistent routing logic and handoff procedures.
At minimum, include: (1) Personality/role definition, (2) Primary goal, (3) Core guardrails, and (4) Tool descriptions if tools are used. Even simple agents benefit from explicit section structure and error handling instructions.
When deprecating a tool, add a new tool first, then update the prompt to prefer the new tool while keeping the old one as a fallback. Monitor usage, then remove the old tool once usage drops to zero. Always include error handling so agents can recover if a deprecated tool is called.
Generally, prompts structured with the principles in this guide work across models. However, model-specific tuning can improve performance—particularly for tool-calling format and reasoning steps. Test your prompt with multiple models and adjust if needed.
No universal limit exists, but prompts over 2000 tokens increase latency and cost. Focus on conciseness: every line should serve a clear purpose. If your prompt exceeds 2000 tokens, consider splitting into multiple specialized agents or extracting reference material into a knowledge base.
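To keep an eye on prompt size, a rough estimate is often enough. The sketch below assumes about four characters per token, a common rule of thumb for English text; actual counts depend on the model's tokenizer, so treat the result as an approximation. The function names are hypothetical.

```python
PROMPT_TOKEN_BUDGET = 2000  # threshold mentioned in the answer above
CHARS_PER_TOKEN = 4         # rough assumption for English; real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Approximate token count using ceiling division on character length."""
    return -(-len(text) // CHARS_PER_TOKEN)


def over_budget(prompt: str) -> bool:
    """True when the estimated size exceeds the budget."""
    return estimate_tokens(prompt) > PROMPT_TOKEN_BUDGET


print(estimate_tokens("# Guardrails\nNever guess or make up information."))
```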
Define core personality traits, goals, and guardrails firmly while allowing flexibility in tone and verbosity based on user communication style. Use conditional instructions: “If the user is frustrated, acknowledge their concerns before proceeding.”
Yes. System prompts can be modified at any time to adjust behavior. This is particularly useful for addressing emerging issues or refining capabilities as you learn from user interactions. Always test changes in a staging environment before deploying to production.
Include explicit error handling instructions for every tool. Emphasize “never guess or make up information” in the guardrails section. Repeat this instruction in tool-specific error handling sections. Test tool failure scenarios during development to ensure agents follow recovery instructions.
This guide establishes the foundation for reliable agent behavior through prompt engineering, tool configuration, and architectural patterns. To build production-grade systems, continue with:
For enterprise deployment support, contact our team.