Conversational agents
Learn how to build real-time conversational AI agents using our multi-context WebSocket API for dynamic and responsive interactions.
Advanced
Orchestrating conversational agents using this multi-context WebSocket API is a complex task recommended for advanced developers. For a more managed solution, consider exploring our Conversational AI product, which simplifies many of these challenges.
Overview
Building responsive conversational AI agents requires the ability to manage audio streams dynamically, handle interruptions gracefully, and maintain natural-sounding speech across conversational turns. Our multi-context WebSocket API for Text to Speech (TTS) is specifically designed for these scenarios.
This API extends our standard TTS WebSocket functionality by introducing the concept of “contexts.” Each context operates as an independent audio generation stream within a single WebSocket connection. This allows you to:
- Manage multiple lines of speech concurrently (e.g., agent speaking while preparing a response to a user interruption).
- Seamlessly handle user barge-ins by closing an existing speech context and initiating a new one.
- Maintain prosodic consistency for utterances within the same logical context.
- Optimize resource usage by selectively closing contexts that are no longer needed.
The multi-context WebSocket API is optimized for conversational applications and is not intended for generating multiple unrelated audio streams simultaneously. Each connection is limited to 5 concurrent contexts to reflect this.
This guide will walk you through connecting to the multi-context WebSocket, managing contexts, and applying best practices for building engaging conversational agents.
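To preview the shape of the interaction, the sketch below shows two independent utterances multiplexed over one connection, each tagged with its own context identifier. The context_id field name is an assumption for illustration; consult the WebSocket API reference for the exact message schema.

```python
import json

# Two independent utterances sharing one connection, each tagged with its own
# context identifier (field name illustrative; see the API reference).
messages = [
    {"text": "Let me look that up for you. ", "context_id": "turn_1"},
    {"text": "Sure, one moment. ", "context_id": "turn_2"},
]

# Each message would be serialized and sent over the WebSocket connection.
for message in messages:
    print(json.dumps(message))
```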
Best practices
These best practices are essential for building responsive, efficient conversational agents with our multi-context WebSocket API.
Use a single WebSocket connection
Establish one WebSocket connection for each end-user session. This reduces overhead and latency compared to creating multiple connections. Within this single connection, you can manage multiple contexts for different parts of the conversation.
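As a minimal sketch, a single per-session connection using Python's websockets library might look like the following. The endpoint path, query parameter, and the xi-api-key header are assumptions modeled on the standard TTS WebSocket; confirm the exact multi-context URL in the API reference.

```python
import asyncio
import os

import websockets  # pip install websockets

VOICE_ID = "your-voice-id"      # placeholder: the voice to synthesize with
MODEL_ID = "eleven_flash_v2_5"  # placeholder: any streaming-capable model

# Assumed endpoint shape for the multi-context WebSocket; confirm the exact
# path and query parameters in the API reference.
WS_URL = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
    f"/multi-stream-input?model_id={MODEL_ID}"
)


async def run_session() -> None:
    # One connection per end-user session; all contexts for the conversation
    # are multiplexed over this single socket.
    async with websockets.connect(
        WS_URL,
        # On websockets < 13 this keyword is `extra_headers` instead.
        additional_headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    ) as ws:
        ...  # send and receive context-tagged messages here


asyncio.run(run_session())
```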
Stream responses in chunks, generate sentences
When generating long responses, stream the text in smaller chunks and send the flush: true flag at the end of each complete sentence. This improves both the quality of the generated audio and the responsiveness of your agent.
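For illustration, a chunked-streaming helper might look like the sketch below. The function name and the text, flush, and context_id fields are illustrative assumptions; check the multi-context WebSocket reference for the exact schema.

```python
import json


async def stream_sentence(ws, context_id: str, sentence_chunks: list[str]) -> None:
    """Send a sentence in small chunks, then flush at the sentence boundary."""
    for chunk in sentence_chunks:
        await ws.send(json.dumps({"text": chunk, "context_id": context_id}))

    # Flushing at the end of a complete sentence improves audio quality and
    # lets playback begin without waiting for the rest of the response.
    await ws.send(json.dumps({"text": "", "flush": True, "context_id": context_id}))
```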
Handle interruptions gracefully
Stream text into one context until an interruption occurs, then create a new context and close the existing one. This approach ensures smooth transitions when the conversation flow changes.
Handling interruptions
When a user interrupts your agent, you should close the current context and create a new one:
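A minimal sketch of that flow is shown below. The close_context field is an assumption based on this guide's description of closing a context; confirm the exact field name in the API reference.

```python
import json


async def handle_interruption(ws, old_context_id: str, new_context_id: str,
                              first_response_text: str) -> None:
    # Stop generating audio for the interrupted turn by closing its context.
    # The close_context field is an assumption; see the API reference.
    await ws.send(json.dumps({"context_id": old_context_id, "close_context": True}))

    # Start a fresh context for the agent's response to the interruption.
    await ws.send(json.dumps({
        "text": first_response_text,
        "context_id": new_context_id,
    }))
```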
Keeping a context alive
Contexts automatically time out after 20 seconds of inactivity by default. If you need to keep a context alive without generating text (for example, during a processing delay), you can send an empty text message to reset the timeout clock.
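A keep-alive might look like the following sketch, assuming the same illustrative message fields as above:

```python
import json


async def keep_context_alive(ws, context_id: str) -> None:
    # An empty text message resets the context's inactivity timeout
    # (20 seconds by default) without generating any audio.
    await ws.send(json.dumps({"text": "", "context_id": context_id}))
```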
Closing the WebSocket connection
When your conversation ends, you can clean up all contexts by closing the socket:
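One way to do this is sketched below. The close_socket field is an assumption based on this guide's description; confirm the exact field name in the API reference.

```python
import json


async def end_conversation(ws) -> None:
    # Closing the socket tears down all remaining contexts in one step.
    # The close_socket field is an assumption; see the API reference.
    await ws.send(json.dumps({"close_socket": True}))
```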
Complete conversational agent example
Requirements
- An ElevenLabs account with an API key (learn how to find your API key).
- Python or Node.js (or another JavaScript runtime) installed on your machine.
- Familiarity with WebSocket communication. We recommend reading our guide on standard WebSocket streaming for foundational concepts.
Setup
Install the necessary dependencies for your chosen language:
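For the Python sketches in this guide, one possible setup is shown below (assuming the websockets client library and python-dotenv for loading the .env file):

```bash
pip install websockets python-dotenv
```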
Create a .env file in your project directory to store your API key:
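For example, assuming the examples read the key from an ELEVENLABS_API_KEY variable:

```
ELEVENLABS_API_KEY=your_api_key_here
```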