WebSocket
Create real-time, interactive voice conversations with AI agents
This documentation is for developers integrating directly with the ElevenLabs WebSocket API. For convenience, consider using the official SDKs provided by ElevenLabs.
The ElevenLabs Conversational AI WebSocket API enables real-time, interactive voice conversations with AI agents. By establishing a WebSocket connection, you can send audio input and receive audio responses in real time, creating lifelike conversational experiences.
Endpoint: wss://api.elevenlabs.io/v1/convai/conversation?agent_id={agent_id}
Authentication
Using Agent ID
For public agents, pass the agent_id as a query parameter in the WebSocket URL. No additional authentication is required.
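As an illustration, the connection URL for a public agent could be assembled as in the sketch below. The base endpoint is the one given on this page; the helper name `build_conversation_url` is hypothetical and not part of any ElevenLabs SDK.

```python
from urllib.parse import urlencode

# Base endpoint as listed on this page.
CONVAI_WS_ENDPOINT = "wss://api.elevenlabs.io/v1/convai/conversation"


def build_conversation_url(agent_id: str) -> str:
    """Return the WebSocket URL for a public agent (hypothetical helper)."""
    if not agent_id:
        raise ValueError("agent_id must be a non-empty string")
    # urlencode escapes characters that are not safe in a query string.
    return f"{CONVAI_WS_ENDPOINT}?{urlencode({'agent_id': agent_id})}"
```

The resulting string can be passed to whichever WebSocket client library you use.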
Using a Signed URL
For private agents or conversations requiring authorization, obtain a signed URL from your server, which securely communicates with the ElevenLabs API using your API key.
Example using cURL
Request:
Response:
Communication
Client-to-Server Messages
User Audio Chunk
Send audio data from the user to the server.
Format:
Notes:
- Audio Format Requirements:
  - PCM 16-bit mono format
  - Base64 encoded
  - Sample rate of 16,000 Hz
- Recommended Chunk Duration:
  - Send audio chunks approximately every 250 milliseconds (0.25 seconds)
  - This equates to chunks of about 4,000 samples at a 16,000 Hz sample rate
- Optimizing Latency and Efficiency:
  - Balance Latency and Efficiency: Sending audio chunks every 250 milliseconds offers a good trade-off between responsiveness and network overhead.
  - Adjust Based on Needs:
    - Lower Latency Requirements: Decrease the chunk duration to send smaller chunks more frequently.
    - Higher Efficiency Requirements: Increase the chunk duration to send larger chunks less frequently.
  - Network Conditions: Adapt the chunk size if you experience network constraints or variability.
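The chunking arithmetic above can be sketched as follows: at 16,000 Hz, 16-bit mono, a 250 ms chunk is 4,000 samples, or 8,000 bytes before Base64 encoding. The message format is not reproduced on this page, so the JSON key `user_audio_chunk` is an assumption to verify against the API reference; the function names are hypothetical.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # required sample rate
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # mono


def chunk_size_bytes(chunk_ms: int = 250) -> int:
    """Number of raw PCM bytes in a chunk of the given duration."""
    samples = SAMPLE_RATE_HZ * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def iter_audio_messages(pcm: bytes, chunk_ms: int = 250) -> Iterator[str]:
    """Split raw PCM into chunks and yield one JSON message per chunk."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]  # the last chunk may be shorter
        encoded = base64.b64encode(chunk).decode("ascii")
        # ASSUMPTION: the key name below is not confirmed by this page.
        yield json.dumps({"user_audio_chunk": encoded})
```

Lowering `chunk_ms` trades more messages for lower latency, and raising it does the reverse, matching the guidance in the notes.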
Pong Message
Respond to server ping messages by sending a pong message, ensuring the event_id matches the one received in the ping message.
Format:
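The exact ping and pong schemas are not reproduced on this page. The sketch below shows only the echo logic, assuming that messages are JSON objects with a `type` field and that the ping carries its `event_id` either at the top level or nested under a `ping_event` object. Both field layouts and the helper name `make_pong` are assumptions to check against the API reference.

```python
import json
from typing import Optional


def make_pong(raw_message: str) -> Optional[str]:
    """Return a pong message for a ping, or None for any other message."""
    message = json.loads(raw_message)
    if not isinstance(message, dict) or message.get("type") != "ping":
        return None
    # ASSUMPTION: event_id may be nested under "ping_event"; otherwise
    # fall back to looking for it at the top level of the message.
    event = message.get("ping_event")
    if not isinstance(event, dict):
        event = message
    event_id = event.get("event_id")
    if event_id is None:
        return None
    # Echo the same event_id so the server can match the pong to its ping.
    return json.dumps({"type": "pong", "event_id": event_id})
```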
Server-to-Client Messages
conversation_initiation_metadata
Provides initial metadata about the conversation.
Format:
Other Server-to-Client Messages
Type | Purpose |
---|---|
user_transcript | Transcriptions of the user’s speech |
agent_response | Agent’s textual response |
audio | Chunks of the agent’s audio response |
interruption | Indicates that the agent’s response was interrupted |
ping | Server pings to measure latency |
Message Formats
user_transcript:
agent_response:
audio:
interruption:
internal_tentative_agent_response:
ping:
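One way to handle the message types listed above is a small router keyed on the message type. This is a minimal sketch, assuming each server message is a JSON object whose top-level `type` field carries one of the names in the table; the `MessageRouter` class is hypothetical and not part of any ElevenLabs SDK.

```python
import json
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class MessageRouter:
    """Dispatch incoming server messages to handlers by their type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.unhandled: List[Dict[str, Any]] = []

    def on(self, message_type: str, handler: Handler) -> None:
        """Register a handler for one message type."""
        self._handlers[message_type] = handler

    def dispatch(self, raw_message: str) -> bool:
        """Route one raw message; return True if a handler ran."""
        message = json.loads(raw_message)
        # ASSUMPTION: the message type lives in a top-level "type" field.
        handler = self._handlers.get(message.get("type"))
        if handler is None:
            # Keep unknown messages so they can be logged or inspected.
            self.unhandled.append(message)
            return False
        handler(message)
        return True
```

Collecting unrecognized messages instead of raising keeps the client tolerant of types it does not handle yet, such as internal_tentative_agent_response.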
Latency Management
To ensure smooth conversations, implement these strategies:
- Adaptive Buffering: Adjust audio buffering based on network conditions.
- Jitter Buffer: Implement a jitter buffer to smooth out variations in packet arrival times.
- Ping-Pong Monitoring: Use ping and pong events to measure round-trip time and adjust accordingly.
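The three strategies above can be combined in a single component. The sketch below is one possible design, not a prescribed implementation: it smooths measured round-trip times with an exponential moving average and uses a simple heuristic (buffer roughly one smoothed round trip of audio) to set how many chunks to hold before playback. The class name, default values, and heuristic are all assumptions.

```python
import math
from collections import deque
from typing import Deque, Optional


class AdaptiveJitterBuffer:
    """Queue audio chunks and release them once enough are buffered."""

    def __init__(self, min_chunks: int = 2, max_chunks: int = 10,
                 alpha: float = 0.2) -> None:
        self.min_chunks = min_chunks
        self.max_chunks = max_chunks
        self.alpha = alpha  # smoothing factor for the RTT average
        self.smoothed_rtt_ms: Optional[float] = None
        self.target_chunks = min_chunks
        self._queue: Deque[bytes] = deque()
        self._playing = False

    def record_rtt(self, rtt_ms: float, chunk_ms: float = 250.0) -> None:
        """Update the smoothed RTT and resize the buffering target."""
        if self.smoothed_rtt_ms is None:
            self.smoothed_rtt_ms = rtt_ms
        else:
            self.smoothed_rtt_ms += self.alpha * (rtt_ms - self.smoothed_rtt_ms)
        # Heuristic: hold about one smoothed round trip of audio.
        wanted = math.ceil(self.smoothed_rtt_ms / chunk_ms)
        self.target_chunks = max(self.min_chunks, min(self.max_chunks, wanted))

    def push(self, chunk: bytes) -> None:
        """Add an incoming chunk; start playback once the target is met."""
        self._queue.append(chunk)
        if len(self._queue) >= self.target_chunks:
            self._playing = True

    def pop(self) -> Optional[bytes]:
        """Return the next chunk to play, or None while (re)buffering."""
        if not self._playing:
            return None
        if not self._queue:
            self._playing = False  # underrun: wait for the buffer to refill
            return None
        return self._queue.popleft()
```

Call `record_rtt` with the round-trip time measured from each ping and pong exchange, `push` from the receive loop, and `pop` from the playback loop.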
Security Best Practices
- Rotate API keys regularly and use environment variables to store them.
- Implement rate limiting to prevent abuse.
- Clearly explain the intention when prompting users for microphone access.
- Optimized Chunking: Tweak the audio chunk duration to balance latency and efficiency.
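The first two practices could look like the following on the server that holds your API key. This is a sketch under stated assumptions: the environment variable name `ELEVENLABS_API_KEY` is a hypothetical choice, and the token bucket is one common rate-limiting technique, not something this API mandates.

```python
import os
import time
from typing import Callable


def load_api_key(var_name: str = "ELEVENLABS_API_KEY") -> str:
    """Read the API key from the environment instead of source code."""
    # ASSUMPTION: the variable name is illustrative; use your own.
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return key


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at a fixed rate."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when rate limited."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A server might keep one bucket per client and reject a request whenever `allow()` returns False.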