Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) enables your agent to access and use large knowledge bases during conversations. Instead of loading entire documents into the context window, RAG retrieves only the most relevant information for each user query, so your agent can draw on far more knowledge than would fit in its prompt.
RAG is ideal for agents that need to reference large documents, technical manuals, or extensive knowledge bases that would exceed the context window limits of traditional prompting. RAG adds a small amount of latency to your agent's responses, around 250 ms.
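When RAG is enabled, your agent converts each user query into a vector embedding, compares it against the indexed chunks of your knowledge base, and retrieves the closest matches, limited by the configured maximum number of document chunks and maximum vector distance. The retrieved chunks are then passed to the LLM along with the conversation, so its answer is grounded in information relevant to the user's query.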
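The retrieval step can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not ElevenLabs' implementation: the toy bag-of-words `embed` function is a hypothetical stand-in for a real embedding model, cosine distance is assumed as the metric, and `retrieve`, `build_prompt`, `max_chunks`, and `max_distance` are hypothetical names chosen to mirror the "maximum document chunks" and "maximum vector distance" settings.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Hypothetical stand-in for an embedding model: a bag-of-words term count
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    # 1 - cosine similarity: 0.0 for identical term profiles, 1.0 for no shared terms
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], max_chunks: int, max_distance: float) -> list[str]:
    q = embed(query)
    scored = [(cosine_distance(q, embed(c)), c) for c in chunks]
    # Drop chunks farther than max_distance, then keep the closest max_chunks
    within = [pair for pair in scored if pair[0] <= max_distance]
    within.sort(key=lambda pair: pair[0])
    return [c for _, c in within[:max_chunks]]


def build_prompt(query: str, retrieved: list[str]) -> str:
    # Retrieved chunks are passed to the LLM alongside the user's question
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


chunks = [
    "The warranty covers manufacturing defects for two years.",
    "To reset the device, hold the power button for ten seconds.",
    "Our office is closed on public holidays.",
]
query = "How do I reset the device?"
top = retrieve(query, chunks, max_chunks=2, max_distance=0.8)
print(build_prompt(query, top))
```

With `max_distance=0.8`, only the reset instruction is returned even though `max_chunks` allows two: the warranty sentence shares only the word "the" with the query (distance about 0.86), so the distance threshold filters it out. The two settings work together, one capping how many chunks are sent and the other capping how dissimilar a chunk may be.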
In your agent’s settings, navigate to the Knowledge Base section and toggle on the Use RAG option. Configure the embedding model, maximum document chunks, and maximum vector distance under the Advanced tab as needed.


Each document in your knowledge base needs to be indexed before it can be used with RAG. This process happens automatically when a document is added to an agent with RAG enabled.
Indexing may take a few minutes for large documents. You can check the indexing status in the knowledge base list.
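Indexing generally means splitting each document into chunks and computing an embedding for every chunk, which is why large documents take longer. The chunking strategy ElevenLabs uses is not described here; the sketch below assumes fixed-size word windows with overlap, and `chunk_text`, `index_document`, `chunk_size`, and `overlap` are hypothetical names for illustration only.

```python
from typing import Callable, TypeVar

E = TypeVar("E")  # whatever type the embedding function returns


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    # Split into windows of chunk_size words; consecutive windows share `overlap` words
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks


def index_document(
    text: str,
    embed: Callable[[str], E],
    chunk_size: int = 200,
    overlap: int = 50,
) -> list[tuple[str, E]]:
    # Pair every chunk with its embedding so it can be searched at query time
    return [(chunk, embed(chunk)) for chunk in chunk_text(text, chunk_size, overlap)]


doc = " ".join(f"w{i}" for i in range(10))
print(chunk_text(doc, chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a chunk boundary partly visible in both neighbouring chunks, at the cost of slightly more chunks to embed.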
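For each document in your knowledge base, you can choose how it is used: retrieved through RAG only when it is relevant to a query, or included in full in the prompt (“Prompt” mode).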

Setting too many documents to “Prompt” mode may exceed context limits. Use this option sparingly for critical information.
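To see why many “Prompt” documents can overflow the context window, the sketch below totals an estimated token count for the documents that are always included. The mode labels, the 4-characters-per-token heuristic, the context limit, and the reserved budget are all assumptions for illustration; real limits depend on the LLM your agent uses, and `Document`, `estimate_tokens`, and `check_prompt_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Document:
    name: str
    text: str
    mode: str  # hypothetical labels: "prompt" (always in context) or "rag" (retrieved when relevant)


def estimate_tokens(text: str) -> int:
    # Rough assumption of about 4 characters per token; real tokenizers vary
    return (len(text) + 3) // 4


def check_prompt_budget(docs: list[Document], context_limit: int, reserved: int) -> tuple[bool, int]:
    # Prompt-mode documents occupy context on every turn; `reserved` leaves room
    # for the system prompt, conversation history, and retrieved chunks
    used = sum(estimate_tokens(d.text) for d in docs if d.mode == "prompt")
    return used <= context_limit - reserved, used


docs = [
    Document("faq", "x" * 400, "prompt"),
    Document("manual", "y" * 100_000, "rag"),
]
print(check_prompt_budget(docs, context_limit=8_000, reserved=2_000))
```

Switching the manual to "prompt" mode in this example pushes the estimate to about 25,000 tokens, far beyond the assumed 6,000-token budget, whereas in "rag" mode only its retrieved chunks would consume context.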
After saving your configuration, test your agent by asking questions related to your knowledge base. The agent should now be able to retrieve and reference specific information from your documents.
To ensure fair resource allocation, ElevenLabs enforces limits on the total size of documents that can be indexed for RAG per workspace, based on subscription tier.
The limits are as follows:
You can also implement RAG through the API: