Speech Engine quickstart | ElevenLabs Documentation

This guide walks you through building a voice-powered agent with Speech Engine. You set up a server that connects your LLM to ElevenLabs, then wire up a browser client so users can have voice conversations with your agent.

Use the ElevenLabs Speech Engine skill to add voice to your chat agent:

$ npx skills add elevenlabs/skills --skill speech-engine

How Speech Engine works

Speech Engine connects your LLM to ElevenLabs so that users can speak to your agent and hear it respond. ElevenLabs handles speech-to-text and text-to-speech; your server provides the LLM logic.

Each WebSocket connection represents one conversation. When the user speaks, ElevenLabs transcribes the audio and sends the transcript to your server. Your server passes it to your LLM, then streams the response back. ElevenLabs converts the text to speech and plays it in the browser. The SDK handles connection management, turn-taking, and interruption detection.

Prerequisites

This tutorial uses OpenAI’s API for the LLM. You need an OpenAI API key set in the OPENAI_API_KEY environment variable.

Server setup

Create an API key

Create an API key in the ElevenLabs dashboard. You’ll use it to authenticate your requests to the API.

Store the key as a managed secret and pass it to the SDKs either as an environment variable via a .env file, or directly in your app’s configuration, depending on your preference.

.env

ELEVENLABS_API_KEY=<your_api_key_here>

Install dependencies

pip install elevenlabs openai python-dotenv

Expose the server

Speech Engine needs a publicly reachable URL. Use ngrok to expose your local server. The server is not built yet, but ngrok needs to be running first so you have the URL for the next step.

$ ngrok http 3001

Copy the forwarding URL (e.g. https://abc123.ngrok.io).

Create a Speech Engine instance

Use the SDK to create a Speech Engine instance, passing your ngrok URL with the /ws path appended as the WebSocket URL.

import asyncio
import os

from dotenv import load_dotenv
from elevenlabs import AsyncElevenLabs

# Load ELEVENLABS_API_KEY from the .env file
load_dotenv()

elevenlabs = AsyncElevenLabs(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
)


async def main():
    engine = await elevenlabs.speech_engine.create(
        name="My Speech Engine",
        speech_engine={
            # Note we use the wss protocol instead of https
            "ws_url": "wss://abc123.ngrok.io/ws",
        },
    )

    print(f"Speech Engine ID: {engine.engine_id}")


if __name__ == "__main__":
    asyncio.run(main())

Run this script and copy the Speech Engine ID (e.g. seng_8k3m9xr4hjnfg983brhmhkd98n6) for the next step.
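
The WebSocket URL in the script above is the ngrok forwarding URL with the scheme switched to wss and /ws appended. A minimal sketch of that conversion, using only the standard library, might look like this. The helper name to_ws_url is hypothetical and is not part of the ElevenLabs SDK.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(forwarding_url: str, path: str = "/ws") -> str:
    """Turn an http(s) forwarding URL into a ws(s) URL ending in `path`."""
    parts = urlsplit(forwarding_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"Expected an http(s) URL, got: {forwarding_url!r}")
    # https maps to wss (TLS), plain http maps to ws
    scheme = "wss" if parts.scheme == "https" else "ws"
    # Discard any existing path, query, or fragment and use `path` instead
    return urlunsplit((scheme, parts.netloc, path, "", ""))


print(to_ws_url("https://abc123.ngrok.io"))
```

Deriving the value this way avoids pasting an https URL into ws_url by mistake.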

Create the server

Create a file called server.py or server.mts with the following contents. This sets up a server, attaches Speech Engine on the /ws path, and uses OpenAI to generate responses.

import asyncio
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI
from elevenlabs import AsyncElevenLabs

load_dotenv()

# Replace with the Speech Engine ID printed in the "Create a Speech Engine instance" step
SPEECH_ENGINE_ID = "seng_8k3m9xr4hjnfg983brhmhkd98n6"

openai = AsyncOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
)
elevenlabs = AsyncElevenLabs(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
)


def on_init(conversation_id, session):
    print(f"Session started: {conversation_id}")


async def on_transcript(transcript, session):
    # Forward the conversation so far to the LLM and stream its reply
    stream = await openai.responses.create(
        model="gpt-4o",
        instructions="You are a helpful voice assistant. Keep responses concise and conversational.",
        input=[
            # OpenAI expects the role "assistant" where the transcript uses "agent"
            {"role": "assistant" if m.role == "agent" else m.role, "content": m.content}
            for m in transcript
        ],
        stream=True,
    )

    await session.send_response(stream)


def on_close(session):
    print(f"Session ended: {session.conversation_id}")


def on_error(err, session):
    print(f"Error: {err}")


async def main():
    engine = await elevenlabs.speech_engine.get(SPEECH_ENGINE_ID)

    await engine.serve(
        port=3001,
        path="/ws",
        debug=True,
        on_init=on_init,
        on_transcript=on_transcript,
        on_close=on_close,
        on_error=on_error,
    )


if __name__ == "__main__":
    asyncio.run(main())

The onTranscript / on_transcript callback receives the full conversation history and the current session. The TypeScript SDK also provides an AbortSignal that fires if the user interrupts mid-response. Passing signal to the OpenAI call cancels the LLM request automatically on interruption.
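
Because the callback receives the whole history on every turn, the role mapping from the server example can be pulled into a small helper and tested without a live session. The sketch below assumes each transcript entry exposes role and content attributes, as the server example does. The Message dataclass is a hypothetical stand-in for the SDK's own object, and to_llm_input is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for one transcript entry (role + content)."""
    role: str
    content: str


def to_llm_input(transcript):
    """Map transcript entries to role/content dicts, renaming "agent" to "assistant"."""
    return [
        {"role": "assistant" if m.role == "agent" else m.role, "content": m.content}
        for m in transcript
    ]


history = [
    Message("user", "What's the weather like?"),
    Message("agent", "I can't check live weather, sorry."),
    Message("user", "No problem."),
]
print(to_llm_input(history))
```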

sendResponse() / send_response() accepts a string, an async iterable, or a stream from OpenAI, Anthropic, or Google Gemini. The SDK extracts the text content automatically.
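
As an illustration of the async iterable case, an async generator that yields text chunks is one such iterable. The sketch below is self-contained and does not call the SDK. The names canned_reply and collect are hypothetical, and the chunk size is arbitrary.

```python
import asyncio


async def canned_reply(text: str, chunk_size: int = 12):
    """Yield `text` in small pieces, the way a streaming LLM would."""
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]
        await asyncio.sleep(0)  # hand control back to the event loop between chunks


async def collect(chunks) -> str:
    """Join an async iterable of strings; handy for testing without a live session."""
    return "".join([piece async for piece in chunks])


print(asyncio.run(collect(canned_reply("Sorry, I didn't catch that. Could you repeat it?"))))
```

Inside on_transcript, a generator like this could be passed as await session.send_response(canned_reply("...")), for example as a fallback reply when the LLM call fails.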

In the example above, the full conversation transcript is passed to the LLM unfiltered. In a production environment, add guardrails against prompt injection and other manipulation attempts.
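
One possible first layer of input hygiene is sketched below: it drops unexpected roles, strips control characters, caps message length, and limits how much history reaches the LLM. This is a naive illustration, not a complete defense against prompt injection. The function name and both limits are hypothetical and should be tuned for your application.

```python
import re

MAX_CHARS = 2000   # hypothetical per-message cap
MAX_TURNS = 20     # hypothetical cap on how much history reaches the LLM

# ASCII control characters, excluding tab, newline, and carriage return
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_messages(messages):
    """Return a trimmed copy of role/content dicts that is safer to forward."""
    cleaned = []
    for message in messages:
        if message["role"] not in ("user", "assistant"):
            continue  # e.g. an entry claiming to be "system"
        text = _CONTROL_CHARS.sub("", message["content"]).strip()
        if text:
            cleaned.append({"role": message["role"], "content": text[:MAX_CHARS]})
    # Keep only the most recent turns
    return cleaned[-MAX_TURNS:]
```

The result can be passed as the input argument of the OpenAI call in place of the raw list comprehension.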

Start the server

python server.py

Client setup

Install the client SDK

React

$ npm install @elevenlabs/react

Create a token endpoint

Add a server-side endpoint that generates a conversation token. This keeps your API key out of the browser and uses WebRTC for the best audio quality.

import os

from dotenv import load_dotenv
from flask import Flask, jsonify
from elevenlabs import ElevenLabs

load_dotenv()

app = Flask(__name__)
elevenlabs = ElevenLabs(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
)


@app.route("/api/token")
def get_token():
    # Replace with the Speech Engine ID printed in the "Create a Speech Engine instance" step
    speech_engine_id = "seng_8k3m9xr4hjnfg983brhmhkd98n6"

    # The Speech Engine ID is passed as the agent_id
    response = elevenlabs.conversational_ai.conversations.get_webrtc_token(
        agent_id=speech_engine_id,
    )

    return jsonify(token=response.token)


if __name__ == "__main__":
    app.run(port=3002)
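
The Speech Engine ID is hard-coded in both the server and the token endpoint. One way to keep the two in sync is to read it from the environment instead. The sketch below assumes a hypothetical SPEECH_ENGINE_ID variable stored in .env next to the API key. The require_env helper is illustrative and not part of any SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Set {name} in your environment or .env file")
    return value
```

With this in place, both files could use speech_engine_id = require_env("SPEECH_ENGINE_ID") after load_dotenv(), so a missing ID fails at startup rather than on the first request.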

Build the conversation UI

Fetch the conversation token from your server and use it to start a session.

React

App.tsx

import { useConversation } from "@elevenlabs/react";
import { useCallback } from "react";

// Fetch a conversation token from the token endpoint
async function getToken(): Promise<string> {
  const response = await fetch("/api/token");
  if (!response.ok) {
    throw new Error("Failed to get conversation token");
  }
  const data = await response.json();
  return data.token;
}

export default function App() {
  const conversation = useConversation({
    onConnect: () => console.log("Connected"),
    onDisconnect: () => console.log("Disconnected"),
    onError: (error: Error) => console.error("Error:", error),
  });

  const startConversation = useCallback(async () => {
    // Ask for microphone access before starting the session
    await navigator.mediaDevices.getUserMedia({ audio: true });
    const token = await getToken();
    await conversation.startSession({ conversationToken: token });
  }, [conversation]);

  const stopConversation = useCallback(async () => {
    await conversation.endSession();
  }, [conversation]);

  return (
    <div>
      <p>Status: {conversation.status}</p>
      <button onClick={startConversation} disabled={conversation.status === "connected"}>
        Start conversation
      </button>
      <button onClick={stopConversation} disabled={conversation.status !== "connected"}>
        End conversation
      </button>
    </div>
  );
}

Try it out

Make sure three processes are running:

ngrok - forwarding to port 3001
Your Speech Engine server - python server.py or npx tsx server.mts
The token server - python token_server.py or npx tsx token-server.mts

Open your client application in the browser and click Start conversation. Grant microphone access when prompted, then speak. You should hear the agent respond through your speakers.

If debug is enabled on the server (debug=True in Python, debug: true in TypeScript), incoming transcripts and outgoing responses are logged to the console.
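
If the client fails to connect, a quick check that the two local servers are accepting connections can narrow things down. The sketch below uses only the standard library and the ports from this guide (3001 and 3002). It does not check ngrok itself, and the is_listening helper is a hypothetical name.

```python
import socket


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in (("Speech Engine server", 3001), ("token server", 3002)):
        state = "up" if is_listening(port) else "NOT reachable"
        print(f"{name} (port {port}): {state}")
```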

Session events

Event	TypeScript callback	Python callback	Description
`user_transcript`	`onTranscript`	`on_transcript`	User speech transcribed. Includes the full conversation history and, in the TypeScript SDK, an abort signal.
`init`	`onInit`	`on_init`	Session initialized with a conversation ID.
`close`	`onClose`	`on_close`	Clean disconnect from ElevenLabs.
`disconnected`	`onDisconnect`	`on_disconnect`	WebSocket dropped unexpectedly.
`error`	`onError`	`on_error`	Protocol or WebSocket error.
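
As a sketch of how these callbacks can work together, the handlers below track which conversations are currently open. The on_init and on_close signatures follow the server example in this guide. The on_disconnect signature is an assumption (mirroring on_close), so check it against the Python SDK reference. The session object here is a stand-in so that the snippet runs on its own.

```python
from types import SimpleNamespace

active_sessions = set()  # conversation IDs that are currently open


def on_init(conversation_id, session):
    active_sessions.add(conversation_id)
    print(f"Session started: {conversation_id} ({len(active_sessions)} active)")


def on_close(session):
    active_sessions.discard(session.conversation_id)
    print(f"Session ended: {session.conversation_id}")


def on_disconnect(session):
    # Assumed signature (mirrors on_close); confirm against the Python SDK reference.
    active_sessions.discard(session.conversation_id)
    print(f"Session dropped unexpectedly: {session.conversation_id}")


# Simulate one session with a stand-in object; real sessions come from the SDK.
fake_session = SimpleNamespace(conversation_id="conv_demo")
on_init("conv_demo", fake_session)
on_disconnect(fake_session)
```

Using discard rather than remove keeps the handlers safe if both close and disconnected fire for the same conversation.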

Configuring the first agent message

By default, the agent waits for the user to speak first. To have the agent greet the user when the conversation starts, set a first message in the overrides option when starting the session on the client.

Client overrides must be enabled first: update the Speech Engine resource so that the client is allowed to set the first message.

engine = await elevenlabs.speech_engine.update(
    speech_engine_id="seng_8k3m9xr4hjnfg983brhmhkd98n6",
    overrides={
        "first_message": True,
    },
)

Then we configure the first message in the client SDK.

React

conversation.startSession({
  conversationToken: token,
  overrides: {
    agent: {
      firstMessage: "Hello! How can I help you today?",
    },
  },
});

The first message is spoken by the agent as soon as the connection is established. It does not trigger the onTranscript callback on your server - it is handled entirely on the ElevenLabs side.

Next steps

JavaScript SDK reference

Classes, methods, and events for the JavaScript SDK.

Python SDK reference

Classes, methods, and events for the Python SDK.

API reference

Explore all Speech Engine parameters and response formats.