The Speech Engine upstream WebSocket protocol defines the interface your server must implement so that ElevenLabs can connect to it during a Speech Engine conversation. Unlike other ElevenLabs WebSocket channels, where your client connects to ElevenLabs, the Speech Engine reverses the relationship: ElevenLabs is the WebSocket client and your server is the WebSocket server.
This page documents the shape of the WebSocket API. However, we recommend using the provided server-side SDKs instead of implementing the protocol yourself. The SDKs include several helper methods and handle authentication automatically. You can find SDK installation instructions and guides in the Speech Engine quickstart.
Configure your server’s publicly reachable WebSocket URL in the wsUrl field when
creating or updating a Speech Engine via the REST API. When a user starts a conversation
with that agent, ElevenLabs will open a WebSocket connection to your server and begin
the message exchange described below.
- ElevenLabs connects to your wsUrl.
- ElevenLabs sends an init message containing the conversation ID.
- ElevenLabs sends user_transcript messages with the full conversation history.
- Your server replies with agent_response messages.
- ping messages keep the connection alive; reply with pong.
- The session ends with a close message.

Every connection from ElevenLabs includes an X-Elevenlabs-Speech-Engine-Authorization
header containing a short-lived JWT. Verify this token before accepting the WebSocket
upgrade to ensure the connection originates from ElevenLabs.
The JWT is signed with HS256 using the SHA-256 hash of your ElevenLabs API key as the HMAC secret, and has:
- Issuer (iss): https://api.elevenlabs.io/convai/speech-engine
- Subject (sub): convai_speech_engine_upstream
- Expiration (exp): short-lived; a 60-second clock-skew leeway is applied

Each user_transcript message carries an event_id. If the user speaks again before
your server finishes responding, a new user_transcript arrives with a higher event_id.
Cancel your in-flight LLM call and begin responding to the new transcript. Any
agent_response messages sent with an outdated event_id are silently discarded by
ElevenLabs.
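The cancel-and-restart behaviour described above can be sketched with asyncio. This is an illustration under stated assumptions, not the SDK's implementation: TurnManager and the respond coroutine are hypothetical names, event_id is assumed to be an integer that only increases, and the whole decoded message is passed through because this page does not name the field that carries the conversation history.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional


class TurnManager:
    """Keeps at most one in-flight response, always for the newest event_id."""

    def __init__(self, respond: Callable[[int, dict], Coroutine[Any, Any, None]]) -> None:
        # Hypothetical coroutine that calls your LLM and streams the reply.
        self._respond = respond
        self._latest_event_id: Optional[int] = None
        self._task: Optional[asyncio.Task] = None

    async def on_user_transcript(self, message: dict) -> None:
        event_id = message["event_id"]  # ASSUMPTION: an increasing integer
        if self._latest_event_id is not None and event_id <= self._latest_event_id:
            return  # stale or duplicate transcript: nothing to do
        self._latest_event_id = event_id

        # Cancel the response to the previous transcript if it is still running.
        if self._task is not None and not self._task.done():
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass  # expected: the old response was interrupted

        # Start responding to the newest transcript.
        self._task = asyncio.create_task(self._respond(event_id, message))
```

Cancelling the task, rather than letting it finish, avoids spending LLM tokens on output that would be discarded anyway because it carries an outdated event_id.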
Send LLM output as a sequence of agent_response messages with is_final: false for
each text chunk, followed by a final agent_response with is_final: true and an empty
content string. ElevenLabs begins synthesizing audio as chunks arrive, minimising
latency.
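As a rough illustration of that sequence, the helper below turns an iterable of LLM text chunks into JSON frames ready to send over the socket. The event_id, content, and is_final fields come from this page; the "type" discriminator key and the function name agent_response_frames are assumptions made for the example, so check the message schema before relying on them.

```python
import json
from typing import Iterable, Iterator


def agent_response_frames(event_id: int, chunks: Iterable[str]) -> Iterator[str]:
    """Yield one JSON frame per text chunk, then the closing is_final frame."""
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks so only the last frame has empty content
        yield json.dumps({
            "type": "agent_response",  # ASSUMPTION: discriminator key is "type"
            "event_id": event_id,
            "content": chunk,
            "is_final": False,
        })
    # Final frame: empty content with is_final set signals completion.
    yield json.dumps({
        "type": "agent_response",
        "event_id": event_id,
        "content": "",
        "is_final": True,
    })
```

Because the helper is a generator, each frame can be sent as soon as the corresponding chunk arrives from the LLM instead of waiting for the full reply.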
Short-lived JWT proving the connection originates from ElevenLabs.
Signed with HS256 using the SHA-256 hash of your ElevenLabs API key.
Validate the issuer (https://api.elevenlabs.io/convai/speech-engine)
and subject (convai_speech_engine_upstream) before accepting the upgrade.
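The checks above can be sketched with the Python standard library alone. Treat this as a minimal illustration rather than the SDK's verification code. Two points are assumptions: the HMAC secret is taken to be the raw SHA-256 digest bytes of the API key (this page does not say whether the raw digest or its hex string is used), and the function receives the bare token, so strip any scheme prefix from the header value first if one is present. The names verify_upstream_jwt and derive_secret are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

EXPECTED_ISS = "https://api.elevenlabs.io/convai/speech-engine"
EXPECTED_SUB = "convai_speech_engine_upstream"
LEEWAY_SECONDS = 60  # clock-skew leeway


def b64url_decode(segment: str) -> bytes:
    """Decode base64url text whose '=' padding has been stripped."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def derive_secret(api_key: str) -> bytes:
    # ASSUMPTION: the secret is the raw 32-byte SHA-256 digest of the API key.
    # If the hex string is used instead, swap in .hexdigest().encode().
    return hashlib.sha256(api_key.encode("utf-8")).digest()


def verify_upstream_jwt(token: str, api_key: str, now: Optional[float] = None) -> dict:
    """Return the claims if the token is valid; raise ValueError otherwise."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        signature = b64url_decode(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")

    # HS256: HMAC-SHA256 over "<header>.<payload>", compared in constant time.
    signing_input = f"{header_b64}.{payload_b64}".encode("utf-8")
    expected = hmac.new(derive_secret(api_key), signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")

    if claims.get("iss") != EXPECTED_ISS or claims.get("sub") != EXPECTED_SUB:
        raise ValueError("unexpected issuer or subject")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)):
        raise ValueError("missing exp claim")
    current = time.time() if now is None else now
    if current > exp + LEEWAY_SECONDS:
        raise ValueError("token expired")
    return claims
```

Call verify_upstream_jwt during the HTTP upgrade handshake and reject the request if it raises, so that unauthenticated peers never reach your message handlers.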
Reply to a ping message. Must be sent in response to every ping to keep the session alive.
Sent by ElevenLabs each time the user finishes a speech turn. Contains the full
conversation history and an event_id for correlating responses and tracking
interruptions.
Keep-alive ping sent periodically by ElevenLabs. Your server must reply with a pong message.
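A keep-alive handler can be as small as the sketch below. It assumes that each frame is a JSON object with a "type" key naming the message, which this page does not spell out, and reply_for is a hypothetical helper name. If ping frames carry an identifier that must be echoed back, copy it into the pong; that detail is not covered here.

```python
import json
from typing import Optional


def reply_for(raw: str) -> Optional[str]:
    """Return the frame to send back for an incoming frame, or None."""
    message = json.loads(raw)
    if message.get("type") == "ping":  # ASSUMPTION: discriminator key is "type"
        return json.dumps({"type": "pong"})
    return None  # other message types are handled elsewhere
```

Answering pings directly in the receive loop, rather than behind the LLM call, keeps the session alive even while a long response is being generated.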
Sent by ElevenLabs when a protocol-level error occurs. The connection will be closed after this message.
LLM-generated text sent from your server to ElevenLabs for speech synthesis. Send
partial chunks with is_final: false for low-latency streaming, then a final message
with is_final: true and empty content to signal completion. Include the event_id
from the triggering user_transcript so ElevenLabs can handle interruptions correctly.