Conversational agents

Learn how to build real-time conversational AI agents using our multi-context WebSocket API for dynamic and responsive interactions.

Advanced

Orchestrating conversational agents using this multi-context WebSocket API is a complex task recommended for advanced developers. For a more managed solution, consider exploring our Conversational AI product, which simplifies many of these challenges.

Overview

Building responsive conversational AI agents requires the ability to manage audio streams dynamically, handle interruptions gracefully, and maintain natural-sounding speech across conversational turns. Our multi-context WebSocket API for Text to Speech (TTS) is specifically designed for these scenarios.

This API extends our standard TTS WebSocket functionality by introducing the concept of “contexts.” Each context operates as an independent audio generation stream within a single WebSocket connection. This allows you to:

  • Manage multiple lines of speech concurrently (e.g., agent speaking while preparing a response to a user interruption).
  • Seamlessly handle user barge-ins by closing an existing speech context and initiating a new one.
  • Maintain prosodic consistency for utterances within the same logical context.
  • Optimize resource usage by selectively closing contexts that are no longer needed.
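Concretely, every message sent over the connection carries a context_id that tells the server which stream a text, flush, or close instruction belongs to. As a minimal sketch (the field names follow the examples later in this guide), the payloads look like this:

import json

# Each message carries a context_id, so independent utterances can be
# interleaved over a single socket.
start_greeting = json.dumps({"text": "Hello there!", "context_id": "greeting"})
start_answer = json.dumps({"text": "Here is that information...", "context_id": "answer"})

# Force any buffered audio in a context to be generated.
flush_answer = json.dumps({"context_id": "answer", "flush": True})

# Close a context that is no longer needed (e.g. after an interruption).
close_greeting = json.dumps({"context_id": "greeting", "close_context": True})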

The multi-context WebSocket API is optimized for conversational applications and is not intended for generating multiple unrelated audio streams simultaneously. Each connection is limited to 5 concurrent contexts to reflect this.

This guide will walk you through connecting to the multi-context WebSocket, managing contexts, and applying best practices for building engaging conversational agents.

Best practices

These best practices are essential for building responsive, efficient conversational agents with our multi-context WebSocket API.

1

Use a single WebSocket connection

Establish one WebSocket connection for each end-user session. This reduces overhead and latency compared to creating multiple connections. Within this single connection, you can manage multiple contexts for different parts of the conversation.
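As a rough sketch (the class and method names here are illustrative, not part of the API), a thin per-session wrapper can hold the single connection and route every context through it:

import json

class AgentSession:
    """Illustrative wrapper: one WebSocket connection per end-user session."""

    def __init__(self, websocket):
        self.websocket = websocket  # the single shared connection for this user

    async def say(self, text, context_id):
        # Every context for this user is multiplexed over the same socket.
        await self.websocket.send(json.dumps({"text": text, "context_id": context_id}))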

2

Stream responses in chunks, generate sentences

When generating long responses, stream the text in smaller chunks and send the flush: true flag at the end of each complete sentence. This improves both the quality of the generated audio and the responsiveness of your agent.
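For example, a response produced sentence by sentence (say, from a streaming LLM) might be sent like this; the helper name is illustrative:

import json

async def speak_sentences(websocket, sentences, context_id):
    """Stream each sentence as its own chunk, flushing at sentence boundaries."""
    for sentence in sentences:
        # Send the sentence text to the context.
        await websocket.send(json.dumps({"text": sentence, "context_id": context_id}))
        # Flush at the end of the complete sentence so audio generation starts promptly.
        await websocket.send(json.dumps({"context_id": context_id, "flush": True}))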

3

Handle interruptions gracefully

Stream text into one context until an interruption occurs, then create a new context and close the existing one. This approach ensures smooth transitions when the conversation flow changes.

4

Manage context lifecycle

Close unused contexts promptly. The server can maintain up to 5 concurrent contexts per connection, but you should close contexts when they are no longer needed.
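Closing a context is a single message; a small helper (the function name is illustrative) keeps this tidy:

import json

async def close_context(websocket, context_id):
    # Frees one of the (at most 5) concurrent context slots on this connection.
    await websocket.send(json.dumps({"context_id": context_id, "close_context": True}))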

5

Prevent context timeouts

By default, contexts time out and are closed automatically after 20 seconds of inactivity. The inactivity timeout is a WebSocket-level parameter that applies to all contexts on the connection and can be raised to a maximum of 180 seconds if needed. Send an empty text message on a context to reset its timeout clock.
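If you expect longer silent gaps, you can request a longer window when opening the connection. The sketch below assumes an inactivity_timeout query parameter (in seconds, up to 180), mirroring the standard WebSocket API; check the API reference to confirm the exact parameter name. The keep-alive message itself is shown later in this guide.

VOICE_ID = "your_voice_id"
MODEL_ID = "eleven_flash_v2_5"

# Assumption: an inactivity_timeout query parameter (seconds, max 180) is accepted,
# as in the standard WebSocket API.
WEBSOCKET_URI = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/multi-stream-input"
    f"?model_id={MODEL_ID}&inactivity_timeout=180"
)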

Handling interruptions

When a user interrupts your agent, you should close the current context and create a new one:

async def handle_interruption(websocket, old_context_id, new_context_id, new_response):
    # Close the existing context that was interrupted
    await websocket.send(json.dumps({
        "context_id": old_context_id,
        "close_context": True
    }))
    print(f"Closed interrupted context '{old_context_id}'")

    # Create a new context for the new response
    await send_text_in_context(websocket, new_response, new_context_id)

Keeping a context alive

Contexts automatically time out after 20 seconds of inactivity by default. If you need to keep a context alive without generating text (for example, during a processing delay), you can send an empty text message to reset the timeout clock.

async def keep_context_alive(websocket, context_id):
    await websocket.send(json.dumps({
        "context_id": context_id,
        "text": ""
    }))

Closing the WebSocket connection

When your conversation ends, you can clean up all contexts by closing the socket:

async def end_conversation(websocket):
    # This will close all contexts and close the connection
    await websocket.send(json.dumps({
        "close_socket": True
    }))
    print("Ending conversation and closing WebSocket")

Complete conversational agent example

Requirements

  • An ElevenLabs account with an API key (learn how to find your API key).
  • Python or Node.js (or another JavaScript runtime) installed on your machine.
  • Familiarity with WebSocket communication. We recommend reading our guide on standard WebSocket streaming for foundational concepts.

Setup

Install the necessary dependencies for your chosen language:

pip install python-dotenv websockets

Create a .env file in your project directory to store your API key:

.env
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here

Example conversational agent

This code is provided as an example and is not intended for production use.
import os
import json
import asyncio
import websockets
from dotenv import load_dotenv

load_dotenv()
ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY")
VOICE_ID = "your_voice_id"
MODEL_ID = "eleven_flash_v2_5"

WEBSOCKET_URI = f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/multi-stream-input?model_id={MODEL_ID}"

async def send_text_in_context(websocket, text, context_id, voice_settings=None):
    """Send text to be synthesized in the specified context."""
    message = {
        "text": text,
        "context_id": context_id,
    }

    # Only include voice_settings for the first message in a context
    if voice_settings:
        message["voice_settings"] = voice_settings

    await websocket.send(json.dumps(message))

async def continue_context(websocket, text, context_id):
    """Add more text to an existing context."""
    await websocket.send(json.dumps({
        "text": text,
        "context_id": context_id
    }))

async def flush_context(websocket, context_id):
    """Force generation of any buffered audio in the context."""
    await websocket.send(json.dumps({
        "context_id": context_id,
        "flush": True
    }))

async def handle_interruption(websocket, old_context_id, new_context_id, new_response):
    """Handle user interruption by closing current context and starting a new one."""
    # Close the existing context that was interrupted
    await websocket.send(json.dumps({
        "context_id": old_context_id,
        "close_context": True
    }))

    # Create a new context for the new response
    await send_text_in_context(websocket, new_response, new_context_id)

async def end_conversation(websocket):
    """End the conversation and close the WebSocket connection."""
    await websocket.send(json.dumps({
        "close_socket": True
    }))

async def receive_messages(websocket):
    """Process incoming WebSocket messages."""
    context_audio = {}
    try:
        async for message in websocket:
            data = json.loads(message)
            context_id = data.get("context_id", "default")

            if data.get("audio"):
                print(f"Received audio for context '{context_id}'")

            if data.get("is_final"):
                print(f"Context '{context_id}' completed")
    except (websockets.exceptions.ConnectionClosed, asyncio.CancelledError):
        print("Message receiving stopped")

async def conversation_agent_demo():
    """Run a complete conversational agent demo."""
    # Connect with API key in headers
    async with websockets.connect(
        WEBSOCKET_URI,
        max_size=16 * 1024 * 1024,
        extra_headers={"xi-api-key": ELEVENLABS_API_KEY}
    ) as websocket:
        # Start receiving messages in background
        receive_task = asyncio.create_task(receive_messages(websocket))

        # Initial agent response
        await send_text_in_context(
            websocket,
            "Hello! I'm your virtual assistant. I can help you with a wide range of topics. What would you like to know about today?",
            "greeting"
        )

        # Wait a bit (simulating user listening)
        await asyncio.sleep(2)

        # Simulate user interruption
        print("USER INTERRUPTS: 'Can you tell me about the weather?'")

        # Handle the interruption by closing current context and starting new one
        await handle_interruption(
            websocket,
            "greeting",
            "weather_response",
            "I'd be happy to tell you about the weather. Currently in your area, it's 72 degrees and sunny with a slight chance of rain later this afternoon."
        )

        # Add more to the weather context
        await continue_context(
            websocket,
            " If you're planning to go outside, you might want to bring a light jacket just in case.",
            "weather_response"
        )

        # Flush at the end of this turn to ensure all audio is generated
        await flush_context(websocket, "weather_response")

        # Wait a bit (simulating user listening)
        await asyncio.sleep(3)

        # Simulate user asking another question
        print("USER: 'What about tomorrow?'")

        # Create a new context for this response
        await send_text_in_context(
            websocket,
            "Tomorrow's forecast shows temperatures around 75 degrees with partly cloudy skies. It should be a beautiful day overall!",
            "tomorrow_weather"
        )

        # Flush and close this context
        await flush_context(websocket, "tomorrow_weather")
        await websocket.send(json.dumps({
            "context_id": "tomorrow_weather",
            "close_context": True
        }))

        # End the conversation
        await asyncio.sleep(2)
        await end_conversation(websocket)

        # Cancel the receive task
        receive_task.cancel()
        try:
            await receive_task
        except asyncio.CancelledError:
            pass

if __name__ == "__main__":
    asyncio.run(conversation_agent_demo())
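The receive_messages handler above only logs that audio arrived. In a real agent you would decode and play (or buffer) each chunk as it comes in. The sketch below assumes the audio field is a base64-encoded audio chunk, as in the standard WebSocket API, and simply appends it to a file:

import base64

def handle_audio_chunk(data, output_path="agent_audio.mp3"):
    """Decode a base64 audio chunk from a WebSocket message and append it to a file."""
    audio_b64 = data.get("audio")
    if not audio_b64:
        return
    with open(output_path, "ab") as f:
        f.write(base64.b64decode(audio_b64))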