Retrieval-Augmented Generation

Enhance your agent with large knowledge bases using RAG.

Overview

Retrieval-Augmented Generation (RAG) enables your agent to access and use large knowledge bases during conversations. Instead of loading entire documents into the context window, RAG retrieves only the most relevant information for each user query, allowing your agent to:

  • Access much larger knowledge bases than would fit in a prompt
  • Provide more accurate, knowledge-grounded responses
  • Reduce hallucinations by referencing source material
  • Scale knowledge without creating multiple specialized agents

RAG is ideal for agents that need to reference large documents, technical manuals, or extensive knowledge bases that would exceed the context window limits of traditional prompting. RAG adds a small amount of latency to your agent's responses, typically around 500 ms.

How RAG works

When RAG is enabled, your agent processes user queries through these steps:

  1. Query processing: The user’s question is analyzed and reformulated for optimal retrieval.
  2. Embedding generation: The processed query is converted into a vector embedding that represents the user’s question.
  3. Retrieval: The system finds the most semantically similar content from your knowledge base.
  4. Response generation: The agent generates a response using both the conversation context and the retrieved information.

This process ensures that information relevant to the user's query is passed to the LLM, so it can generate a factually grounded answer.
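To make these steps concrete, here is a deliberately simplified sketch of the same flow in Python. It is illustrative only and is not the ElevenLabs implementation: the embed function is a toy bag-of-words stand-in for a real embedding model, and the chunks and product names are invented.

import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real RAG system uses a
    # trained embedding model such as the one selected in the agent settings.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Query processing: here we simply use the raw question.
query = "How long is the warranty on the X200?"

# 2. Embedding generation: convert the query into a vector.
query_vector = embed(query)

# 3. Retrieval: rank knowledge-base chunks by similarity and keep the top matches.
chunks = [
    "The X200 ships with a two-year limited warranty.",
    "The X200 battery lasts up to 12 hours on a single charge.",
    "Our support office is closed on public holidays.",
]
retrieved = sorted(chunks, key=lambda c: cosine_similarity(query_vector, embed(c)), reverse=True)[:2]

# 4. Response generation: the retrieved chunks are passed to the LLM together
# with the conversation context (represented here as a single prompt string).
prompt = "Answer using only this context:\n" + "\n".join(retrieved) + "\n\nQuestion: " + query
print(prompt)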

When RAG is enabled, the maximum size of knowledge base items that can be assigned to an agent increases from 300KB to 10MB.

Guide

Prerequisites

You will need a Conversational AI agent with at least one document added to its knowledge base.

1. Enable RAG for your agent

In your agent’s settings, navigate to the Knowledge Base section and toggle on the Use RAG option.

[Screenshot: Toggle switch to enable RAG in the agent settings]
2. Configure RAG settings (optional)

After enabling RAG, you’ll see additional configuration options:

  • Embedding model: Select the model that will convert text into vector embeddings
  • Maximum document chunks: Set the maximum amount of retrieved content per query
  • Maximum vector distance: Set the maximum distance between the query and the retrieved chunks

These parameters can affect both latency and LLM cost (which in the future will be passed on to you). For example, retrieving more chunks increases cost, and a larger maximum vector distance passes more, but potentially less relevant, context to the model, which may affect response quality. Experiment with different settings to find the best balance for your use case.
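For reference, these settings map onto the RAG block of the agent configuration when set through the API (see the full example in the API implementation section below). The max_vector_distance field name in this sketch is an assumption for illustration; the other fields appear in the API example later in this guide.

# Illustrative RAG configuration block (see the full API example below).
# "max_vector_distance" is an assumed field name for the dashboard's
# "Maximum vector distance" setting; check the current API reference.
rag_config = {
    "enabled": True,
    "embedding_model": "e5_mistral_7b_instruct",  # embedding model selection
    "max_documents_length": 10000,                # caps retrieved content per query
    "max_vector_distance": 0.6,                   # assumed: looser values pass more, but less relevant, context
}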

[Screenshot: RAG configuration options including embedding model selection]
3. Knowledge base indexing

Each document in your knowledge base needs to be indexed before it can be used with RAG. This process happens automatically when a document is added to an agent with RAG enabled.

Indexing may take a few minutes for large documents. You can check the indexing status in the knowledge base list.
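If you manage documents through the API rather than the dashboard, you can poll the indexing status with the rag_index_status call shown in full in the API implementation section below; a condensed sketch:

import time
from elevenlabs import ElevenLabs, EmbeddingModelEnum

client = ElevenLabs(api_key="your-api-key")

# Poll until indexing of the document finishes (see the full example below).
response = client.conversational_ai.rag_index_status(
    documentation_id="your-document-id",
    model=EmbeddingModelEnum.E5_MISTRAL_7B_INSTRUCT
)
while response.status not in ["SUCCEEDED", "FAILED"]:
    time.sleep(5)  # large documents can take a few minutes to index
    response = client.conversational_ai.rag_index_status(
        documentation_id="your-document-id",
        model=EmbeddingModelEnum.E5_MISTRAL_7B_INSTRUCT
    )
print(response.status)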

4. Configure document usage modes (optional)

For each document in your knowledge base, you can choose how it’s used:

  • Auto (default): The document is only retrieved when relevant to the query
  • Prompt: The document is always included in the system prompt, regardless of relevance, but can also be retrieved by RAG

[Screenshot: Document usage mode options in the knowledge base]

Setting too many documents to “Prompt” mode may exceed context limits. Use this option sparingly for critical information.
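The usage mode can also be set per document through the API by editing the agent's knowledge base entries, following the same pattern as the API implementation section below. A minimal sketch, assuming the "prompt" value corresponds to the dashboard's Prompt mode (the API example later uses "auto"):

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="your-api-key")
agent_id = "your-agent-id"
document_id = "your-document-id"

# Fetch the current configuration, change one document's usage mode, and save.
agent_config = client.conversational_ai.get_agent(agent_id=agent_id)
for doc in agent_config.agent.prompt.knowledge_base:
    if doc.id == document_id:
        doc.usage_mode = "prompt"  # assumed value: always inject into the system prompt

client.conversational_ai.update_agent(
    agent_id=agent_id,
    conversation_config=agent_config.agent
)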

5. Test your RAG-enabled agent

After saving your configuration, test your agent by asking questions related to your knowledge base. The agent should now be able to retrieve and reference specific information from your documents.

API implementation

You can also implement RAG through the API:

from elevenlabs import ElevenLabs, EmbeddingModelEnum
import time

# Initialize the ElevenLabs client
client = ElevenLabs(api_key="your-api-key")

# First, index a document for RAG
document_id = "your-document-id"
embedding_model = EmbeddingModelEnum.E5_MISTRAL_7B_INSTRUCT

# Trigger RAG indexing
response = client.conversational_ai.rag_index_status(
    documentation_id=document_id,
    model=embedding_model
)

# Check indexing status
while response.status not in ["SUCCEEDED", "FAILED"]:
    time.sleep(5)  # Wait 5 seconds before checking status again
    response = client.conversational_ai.rag_index_status(
        documentation_id=document_id,
        model=embedding_model
    )

# Then update agent configuration to use RAG
agent_id = "your-agent-id"

# Get the current agent configuration
agent_config = client.conversational_ai.get_agent(agent_id=agent_id)

# Enable RAG in the agent configuration
agent_config.agent.prompt.rag = {
    "enabled": True,
    "embedding_model": "e5_mistral_7b_instruct",
    "max_documents_length": 10000
}

# Update document usage mode if needed
for i, doc in enumerate(agent_config.agent.prompt.knowledge_base):
    if doc.id == document_id:
        agent_config.agent.prompt.knowledge_base[i].usage_mode = "auto"

# Update the agent configuration
client.conversational_ai.update_agent(
    agent_id=agent_id,
    conversation_config=agent_config.agent
)