WebSocket

WSS

/speech-engine/upstream

This page shows the WebSocket API shape. However, we recommend using the provided server-side SDKs rather than implementing the protocol yourself: they include several helper methods and handle authentication automatically. SDK installation instructions and guides are in the Speech Engine quickstart.

Configure your server’s publicly reachable WebSocket URL in the wsUrl field when creating or updating a Speech Engine via the REST API. When a user starts a conversation with that agent, ElevenLabs will open a WebSocket connection to your server and begin the message exchange described below.

Connection flow

1. A user starts a conversation with a Speech Engine agent (via the ElevenLabs client SDK or API).
2. ElevenLabs opens a WebSocket connection to your wsUrl.
3. ElevenLabs sends an init message containing the conversation ID.
4. As the user speaks, ElevenLabs transcribes the audio and sends user_transcript messages with the full conversation history.
5. Your server calls an LLM and streams the response back as one or more agent_response messages.
6. ElevenLabs synthesizes the text to speech and streams the audio back to the user.
7. Periodic ping messages keep the connection alive; reply with pong.
8. When the conversation ends, ElevenLabs sends a close message.
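
The flow above can be sketched as a small server-side dispatcher. This is an illustration under stated assumptions, not the SDK's implementation: it assumes each frame is a JSON object with a `type` field, that `init` carries a `conversation_id` field, and that a bare `pong` object is an acceptable reply. None of these wire details are specified on this page. `Session` and `handle_message` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Session:
    """Per-connection state (hypothetical helper, not part of any SDK)."""
    conversation_id: Optional[str] = None
    latest_event_id: Optional[int] = None
    closed: bool = False
    pending_transcripts: List[dict] = field(default_factory=list)


def handle_message(session: Session, raw: str) -> List[str]:
    """Process one incoming frame and return the frames to send back.

    ASSUMPTION: frames are JSON objects with a "type" field; field names
    other than event_id are illustrative, not taken from the reference.
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "init":
        session.conversation_id = msg.get("conversation_id")
        return []
    if kind == "ping":
        # Every ping must be answered with a pong to keep the session alive.
        return [json.dumps({"type": "pong"})]
    if kind == "user_transcript":
        session.latest_event_id = msg.get("event_id")
        session.pending_transcripts.append(msg)  # hand off to your LLM call
        return []
    if kind in ("close", "error"):
        # The connection is closed after either message.
        session.closed = True
        return []
    return []  # ignore unknown message types
```

Keeping the dispatcher free of socket I/O makes it easy to unit-test and to plug into whichever WebSocket server library you use.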

Authentication

Every connection from ElevenLabs includes an X-Elevenlabs-Speech-Engine-Authorization header containing a short-lived JWT. Verify this token before accepting the WebSocket upgrade to ensure the connection originates from ElevenLabs.

The JWT is signed with HS256 using the SHA-256 hash of your ElevenLabs API key as the HMAC secret, and has:

Issuer (iss): https://api.elevenlabs.io/convai/speech-engine
Subject (sub): convai_speech_engine_upstream
Expiry (exp): short-lived; a 60-second clock-skew leeway is applied
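
The checks described above could look roughly like the following, using only the Python standard library. Treat it as a sketch: this page does not say whether the HMAC secret is the raw SHA-256 digest of the API key or its hex encoding, so the raw 32-byte digest used here is an assumption. `verify_upstream_jwt` is a hypothetical helper name. In production, a maintained JWT library is usually the safer choice.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

EXPECTED_ISS = "https://api.elevenlabs.io/convai/speech-engine"
EXPECTED_SUB = "convai_speech_engine_upstream"
LEEWAY_SECONDS = 60  # clock-skew leeway stated in the reference


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verify_upstream_jwt(token: str, api_key: str, now: Optional[float] = None) -> dict:
    """Return the claims if the token is valid, otherwise raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
        if not isinstance(header, dict) or not isinstance(claims, dict):
            raise ValueError("header and payload must be JSON objects")
    except Exception as exc:
        raise ValueError("malformed token") from exc

    # Pin the algorithm so a token cannot choose a weaker one.
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")

    # ASSUMPTION: the raw SHA-256 digest of the API key is the HMAC secret.
    secret = hashlib.sha256(api_key.encode("utf-8")).digest()
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time compare
        raise ValueError("bad signature")

    if claims.get("iss") != EXPECTED_ISS or claims.get("sub") != EXPECTED_SUB:
        raise ValueError("unexpected issuer or subject")

    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or current > exp + LEEWAY_SECONDS:
        raise ValueError("token expired")
    return claims
```

Call it with the value of the authorization header before completing the WebSocket upgrade, and reject the handshake if it raises. This page does not say whether the header value carries a scheme prefix, so the sketch expects the bare token.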

Interruption handling

Each user_transcript message carries an event_id. If the user speaks again before your server finishes responding, a new user_transcript arrives with a higher event_id. Cancel your in-flight LLM call and begin responding to the new transcript. Any agent_response messages sent with an outdated event_id are silently discarded by ElevenLabs.
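
One way to implement this cancel-and-restart behaviour is to keep at most one in-flight response task per connection. The sketch below uses `asyncio`; `TurnManager` and the `respond` callback are hypothetical names, and ignoring transcripts whose `event_id` is not higher than the latest one seen is a design choice of this example rather than something the reference requires.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional

Responder = Callable[[int, dict], Coroutine[Any, Any, None]]


class TurnManager:
    """Keeps at most one in-flight response task (hypothetical helper)."""

    def __init__(self, respond: Responder) -> None:
        self._respond = respond
        self._task: Optional[asyncio.Task] = None
        self.latest_event_id = -1

    def on_user_transcript(self, event_id: int, payload: dict) -> None:
        """Must be called from inside a running event loop."""
        if event_id <= self.latest_event_id:
            return  # stale or duplicate transcript; nothing to do
        self.latest_event_id = event_id
        if self._task is not None and not self._task.done():
            # The user spoke again: abandon the outdated LLM call.
            self._task.cancel()
        self._task = asyncio.create_task(self._respond(event_id, payload))
```

Inside `respond`, pass the same `event_id` to every `agent_response` you send, so that anything that slips out after an interruption carries the outdated ID and is discarded.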

Streaming responses

Send LLM output as a sequence of agent_response messages with is_final: false for each text chunk, followed by a final agent_response with is_final: true and an empty content string. ElevenLabs begins synthesizing audio as chunks arrive, minimising latency.
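
The chunk-then-final pattern can be expressed as a small generator. The `event_id`, `content`, and `is_final` fields come from this page; the `type` field and the flat JSON layout are assumptions, and `agent_response_frames` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def agent_response_frames(event_id: int, chunks: Iterable[str]) -> Iterator[str]:
    """Yield one JSON frame per text chunk, then the terminating frame.

    ASSUMPTION: frames are flat JSON objects tagged with a "type" field.
    """
    for chunk in chunks:
        if not chunk:
            continue  # this sketch reserves empty content for the final frame
        yield json.dumps({
            "type": "agent_response",
            "event_id": event_id,
            "content": chunk,
            "is_final": False,
        })
    # An empty content string with is_final: true signals completion.
    yield json.dumps({
        "type": "agent_response",
        "event_id": event_id,
        "content": "",
        "is_final": True,
    })
```

Because it accepts any iterable, the helper can wrap a streaming LLM client's token iterator directly, sending each frame as soon as it is produced.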

The Speech Engine upstream WebSocket protocol defines the interface your server must implement so that ElevenLabs can connect to it during a Speech Engine conversation. Unlike other ElevenLabs WebSocket channels, where your client connects to ElevenLabs, the Speech Engine reverses this relationship: **ElevenLabs is the WebSocket client and your server is the WebSocket server**. SDK installation instructions and guides are in the [Speech Engine quickstart](/docs/eleven-api/guides/cookbooks/speech-engine).

Handshake

WSS

/speech-engine/upstream

Headers

xi-api-key (string, optional)

X-ElevenLabs-Speech-Engine-Authorization (string, required)

Short-lived JWT proving the connection originates from ElevenLabs. Signed with HS256 using the SHA-256 hash of your ElevenLabs API key. Validate the issuer (https://api.elevenlabs.io/convai/speech-engine) and subject (convai_speech_engine_upstream) before accepting the upgrade.

Send

agentResponse (object, required)

LLM-generated text sent from your server to ElevenLabs for speech synthesis. Send partial chunks with `is_final: false` for low-latency streaming, then a final message with `is_final: true` and empty `content` to signal completion. Include the `event_id` from the triggering `user_transcript` so ElevenLabs can handle interruptions correctly.

pong (object, required)

Reply to a ping message. Must be sent in response to every ping to keep the session alive.

Receive

init (object, required)

Sent by ElevenLabs when a new conversation starts. Contains the unique conversation ID for this session.

userTranscript (object, required)

Sent by ElevenLabs each time the user finishes a speech turn. Contains the full conversation history and an event_id for correlating responses and tracking interruptions.

ping (object, required)

Keep-alive ping sent periodically by ElevenLabs. Your server must reply with a pong message.

close (object, required)

Sent by ElevenLabs when the conversation ends cleanly. After this, the WebSocket connection will be closed.

error (object, required)

Sent by ElevenLabs when a protocol-level error occurs. The connection will be closed after this message.

URL: wss://api.elevenlabs.io/speech-engine/upstream
Method: GET
Status: 101 Switching Protocols
