Speech Engine quickstart

Add voice to your chat agent using the ElevenLabs SDK.

This guide walks you through building a voice-powered agent with Speech Engine. You set up a server that connects your LLM to ElevenLabs, then wire up a browser client so users can have voice conversations with your agent.

Use the ElevenLabs Speech Engine skill to add voice to your chat agent:

$ npx skills add elevenlabs/skills --skill speech-engine

How Speech Engine works

Speech Engine connects your LLM to ElevenLabs so that users can speak to your agent and hear it respond. ElevenLabs handles speech-to-text and text-to-speech; your server provides the LLM logic.

Each WebSocket connection represents one conversation. When the user speaks, ElevenLabs transcribes the audio and sends the transcript to your server. Your server passes it to your LLM, then streams the response back. ElevenLabs converts the text to speech and plays it in the browser. The SDK handles connection management, turn-taking, and interruption detection.
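The turn flow described above can be illustrated with a small simulation. This is only a sketch of the data flow, not SDK code: `fake_llm`, `handle_turn`, and `demo` are hypothetical names, and the stand-in LLM simply echoes the user. One history list stands in for one WebSocket connection.

```python
import asyncio
from typing import AsyncIterator


# Hypothetical stand-in for an LLM: yields the reply in small text chunks.
async def fake_llm(history: list[dict]) -> AsyncIterator[str]:
    last_user_message = history[-1]["content"]
    for chunk in ("You said: ", last_user_message):
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield chunk


async def handle_turn(history: list[dict], user_text: str) -> str:
    """Run one turn: record the transcript, stream the reply, record the reply."""
    history.append({"role": "user", "content": user_text})
    spoken = []
    async for chunk in fake_llm(history):
        # In the real flow, each chunk would be forwarded for text-to-speech.
        spoken.append(chunk)
    reply = "".join(spoken)
    history.append({"role": "agent", "content": reply})
    return reply


async def demo() -> list[dict]:
    history: list[dict] = []  # one history per conversation
    await handle_turn(history, "hello")
    await handle_turn(history, "what time is it?")
    return history


conversation = asyncio.run(demo())
for message in conversation:
    print(f"{message['role']}: {message['content']}")
```

The point of the sketch is that the history grows by two entries per turn and that the reply is produced incrementally, which is what lets speech start before the LLM has finished.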

Prerequisites

This tutorial uses OpenAI’s API for the LLM. You need an OpenAI API key set in the OPENAI_API_KEY environment variable.
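A missing key usually surfaces later as an authentication error that is harder to trace. As a minimal sketch, a helper like the one below fails at startup instead. `require_env` is a hypothetical name, not part of either SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Calling `require_env("OPENAI_API_KEY")` near the top of a script, after any `.env` file has been loaded, would stop the program immediately if the variable is unset or empty.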

Server setup

1

Create an API key

Create an API key in the ElevenLabs dashboard. You’ll use it to securely access the API.

Store the key as a managed secret and pass it to the SDKs either as an environment variable via an .env file, or directly in your app’s configuration, depending on your preference.

.env
ELEVENLABS_API_KEY=<your_api_key_here>

2

Install dependencies

pip install elevenlabs openai python-dotenv

3

Expose the server

Speech Engine needs a publicly reachable URL. Use ngrok to expose your local server. The server is not built yet, but ngrok needs to be running first so you have the URL for the next step.

$ ngrok http 3001

Copy the forwarding URL (e.g. https://abc123.ngrok.io).
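The forwarding URL uses https, while the Speech Engine instance is configured with a WebSocket URL ending in /ws. A small helper can do the conversion so the scheme and path are not edited by hand. This is a sketch using only the standard library; `to_ws_url` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(forwarding_url: str, path: str = "/ws") -> str:
    """Convert an http(s) forwarding URL into a ws(s) URL with the given path."""
    parts = urlsplit(forwarding_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"Expected an http(s) URL, got: {forwarding_url!r}")
    # https maps to wss (TLS), plain http maps to ws.
    scheme = "wss" if parts.scheme == "https" else "ws"
    # Any existing path, query, or fragment is replaced by the WebSocket path.
    return urlunsplit((scheme, parts.netloc, path, "", ""))


print(to_ws_url("https://abc123.ngrok.io"))
```

For the example URL above, this prints wss://abc123.ngrok.io/ws.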

4

Create a Speech Engine instance

Use the SDK to create a Speech Engine instance, passing your ngrok URL with the /ws path appended as the WebSocket URL.

import asyncio
import os

from dotenv import load_dotenv
from elevenlabs import AsyncElevenLabs

load_dotenv()

elevenlabs = AsyncElevenLabs(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
)


async def main():
    engine = await elevenlabs.speech_engine.create(
        name="My Speech Engine",
        speech_engine={
            # Note we use the wss protocol instead of https
            "ws_url": "wss://abc123.ngrok.io/ws",
        },
    )

    print(f"Speech Engine ID: {engine.engine_id}")


if __name__ == "__main__":
    asyncio.run(main())

Run this script and copy the Speech Engine ID (e.g. seng_8k3m9xr4hjnfg983brhmhkd98n6) for the next step.

5

Create the server

Create a file called server.py or server.mts with the following contents. This sets up a server, attaches Speech Engine on the /ws path, and uses OpenAI to generate responses.

import asyncio
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI
from elevenlabs import AsyncElevenLabs

load_dotenv()

# Replace with your Speech Engine ID from step 4
SPEECH_ENGINE_ID = "seng_8k3m9xr4hjnfg983brhmhkd98n6"

openai = AsyncOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
)
elevenlabs = AsyncElevenLabs(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
)


def on_init(conversation_id, session):
    print(f"Session started: {conversation_id}")


async def on_transcript(transcript, session):
    # Forward the conversation history to the LLM and request a streamed reply
    stream = await openai.responses.create(
        model="gpt-4o",
        instructions="You are a helpful voice assistant. Keep responses concise and conversational.",
        # The transcript uses the role "agent"; OpenAI expects "assistant"
        input=[
            {"role": "assistant" if m.role == "agent" else m.role, "content": m.content}
            for m in transcript
        ],
        stream=True,
    )

    # Hand the stream to the SDK, which extracts the text and sends it on
    await session.send_response(stream)


def on_close(session):
    print(f"Session ended: {session.conversation_id}")


def on_error(err, session):
    print(f"Error: {err}")


async def main():
    engine = await elevenlabs.speech_engine.get(SPEECH_ENGINE_ID)

    await engine.serve(
        port=3001,
        path="/ws",
        debug=True,
        on_init=on_init,
        on_transcript=on_transcript,
        on_close=on_close,
        on_error=on_error,
    )


if __name__ == "__main__":
    asyncio.run(main())

The onTranscript / on_transcript callback receives the full conversation history and the current session. The TypeScript SDK also provides an AbortSignal that fires if the user interrupts mid-response. Passing signal to the OpenAI call cancels the LLM request automatically on interruption.

sendResponse() / send_response() accepts a string, an async iterable, or a stream from OpenAI, Anthropic, or Google Gemini. The SDK extracts the text content automatically.
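As an illustration of the async iterable option, the sketch below turns a finished reply into a stream of sentence-sized chunks. It assumes the iterable is expected to yield plain strings, which the description above implies but does not spell out. `sentence_chunks` and `collect` are hypothetical helpers.

```python
import asyncio
import re
from typing import AsyncIterator


async def sentence_chunks(text: str) -> AsyncIterator[str]:
    """Yield text one sentence at a time, each followed by a space."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            await asyncio.sleep(0)  # give other tasks a chance to run
            yield sentence + " "


async def collect(chunks: AsyncIterator[str]) -> list[str]:
    """Drain an async iterator into a list (used here only for the demo)."""
    return [chunk async for chunk in chunks]


chunks = asyncio.run(collect(sentence_chunks("Hi there! How can I help? Just ask.")))
print(chunks)
```

Inside an on_transcript callback, a generator like this could presumably be passed as `await session.send_response(sentence_chunks(reply))`, where `reply` is text your own logic produced.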

In the above example, the full transcript from the user is passed to the LLM. In a production environment you should add guardrails to prevent any prompt injection or manipulation attempts.
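One basic layer of such guardrails is input hygiene before the history reaches the LLM. The sketch below caps history length and message size and drops control characters. It is not a full defense against prompt injection. The limits, the `Message` stand-in class, and `prepare_llm_input` are all assumptions made for illustration; the only detail taken from the server example is that transcript entries expose role and content.

```python
from dataclasses import dataclass

MAX_MESSAGES = 20  # hypothetical cap on how many recent messages are forwarded
MAX_CHARS = 2000   # hypothetical cap on the length of a single message


@dataclass
class Message:
    """Stand-in for a transcript entry with role and content attributes."""
    role: str
    content: str


def prepare_llm_input(transcript: list[Message]) -> list[dict]:
    """Trim and clean the conversation history before sending it to the LLM."""
    cleaned = []
    for m in transcript[-MAX_MESSAGES:]:
        # Drop non-printable characters, but keep newlines and tabs.
        text = "".join(ch for ch in m.content if ch.isprintable() or ch in "\n\t")
        text = text[:MAX_CHARS].strip()
        if not text:
            continue  # skip messages that are empty after cleaning
        # Same role mapping as the server example: "agent" becomes "assistant".
        role = "assistant" if m.role == "agent" else m.role
        cleaned.append({"role": role, "content": text})
    return cleaned
```

The returned list has the same shape as the input argument built in on_transcript, so it could replace the inline list comprehension there.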

6

Start the server

python server.py

Client setup

1

Install the client SDK

$ npm install @elevenlabs/react

2

Create a token endpoint

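Add a server-side endpoint that generates a conversation token. This keeps your API key out of the browser and uses WebRTC for the best audio quality. The Python example uses Flask, which is not part of the earlier install step, so install it with pip install flask. Save the file as token_server.py, the name used when you run it later.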

import os

from dotenv import load_dotenv
from flask import Flask, jsonify
from elevenlabs import ElevenLabs

load_dotenv()

app = Flask(__name__)
elevenlabs = ElevenLabs(
    api_key=os.getenv("ELEVENLABS_API_KEY"),
)


@app.route("/api/token")
def get_token():
    # Replace with your Speech Engine ID from step 4 of the server setup
    speech_engine_id = "seng_8k3m9xr4hjnfg983brhmhkd98n6"

    response = elevenlabs.conversational_ai.conversations.get_webrtc_token(
        agent_id=speech_engine_id,
    )

    return jsonify(token=response.token)


if __name__ == "__main__":
    app.run(port=3002)

3

Build the conversation UI

Fetch the conversation token from your server and use it to start a session.

App.tsx
import { useConversation } from "@elevenlabs/react";
import { useCallback } from "react";

async function getToken(): Promise<string> {
  const response = await fetch("/api/token");
  if (!response.ok) {
    throw new Error("Failed to get conversation token");
  }
  const data = await response.json();
  return data.token;
}

export default function App() {
  const conversation = useConversation({
    onConnect: () => console.log("Connected"),
    onDisconnect: () => console.log("Disconnected"),
    onError: (error: Error) => console.error("Error:", error),
  });

  const startConversation = useCallback(async () => {
    await navigator.mediaDevices.getUserMedia({ audio: true });
    const token = await getToken();
    await conversation.startSession({ conversationToken: token });
  }, [conversation]);

  const stopConversation = useCallback(async () => {
    await conversation.endSession();
  }, [conversation]);

  return (
    <div>
      <p>Status: {conversation.status}</p>
      <button onClick={startConversation} disabled={conversation.status === "connected"}>
        Start conversation
      </button>
      <button onClick={stopConversation} disabled={conversation.status !== "connected"}>
        End conversation
      </button>
    </div>
  );
}

4

Try it out

Make sure three processes are running:

  1. ngrok - forwarding to port 3001
  2. Your Speech Engine server - python server.py or npx tsx server.mts
  3. The token server - npx tsx token-server.mts or python token_server.py

Open your client application in the browser and click Start conversation. Grant microphone access when prompted, then speak. You should hear the agent respond through your speakers.

If debug is enabled on the server (debug=True in Python, debug: true in TypeScript), incoming transcripts and outgoing responses are logged to the console.
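If the client fails to connect, a first check is whether the two local servers are accepting connections. The sketch below probes ports 3001 and 3002 from the steps above with a plain TCP connect. It cannot tell whether ngrok itself is forwarding correctly. `is_listening` and `check_local_servers` are hypothetical helper names.

```python
import socket


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts.
        return False


def check_local_servers(ports: dict[str, int]) -> dict[str, bool]:
    """Map each server name to whether its port is accepting connections."""
    return {name: is_listening(port) for name, port in ports.items()}


if __name__ == "__main__":
    status = check_local_servers({"Speech Engine server": 3001, "token server": 3002})
    for name, up in status.items():
        print(f"{name}: {'listening' if up else 'not reachable'}")
```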

Session events

| Event | TypeScript callback | Python callback | Description |
| --- | --- | --- | --- |
| user_transcript | onTranscript | on_transcript | User speech transcribed. Includes full conversation history and an abort signal. |
| init | onInit | on_init | Session initialized with a conversation ID. |
| close | onClose | on_close | Clean disconnect from ElevenLabs. |
| disconnected | onDisconnect | on_disconnect | WebSocket dropped unexpectedly. |
| error | onError | on_error | Protocol or WebSocket error. |
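These events can be combined to keep track of open conversations, for example to distinguish clean closes from dropped connections. The sketch below is self-contained and uses a fake session object. The on_init and on_close signatures follow the server example, while the on_disconnect signature is an assumption modeled on on_close. `SessionRegistry` is a hypothetical class.

```python
from types import SimpleNamespace


class SessionRegistry:
    """Tracks which conversations are open and which ones dropped unexpectedly."""

    def __init__(self) -> None:
        self.active: set[str] = set()
        self.dropped: list[str] = []

    def on_init(self, conversation_id, session) -> None:
        self.active.add(conversation_id)

    def on_close(self, session) -> None:
        # Clean disconnect: forget the conversation.
        self.active.discard(session.conversation_id)

    def on_disconnect(self, session) -> None:
        # Assumed signature: takes the session, like on_close.
        self.active.discard(session.conversation_id)
        self.dropped.append(session.conversation_id)


# Simulate two sessions: one closes cleanly, one drops.
registry = SessionRegistry()
first = SimpleNamespace(conversation_id="conv_1")
second = SimpleNamespace(conversation_id="conv_2")
registry.on_init("conv_1", first)
registry.on_init("conv_2", second)
registry.on_close(first)
registry.on_disconnect(second)
print(registry.active, registry.dropped)
```

If the serve call accepts an on_disconnect keyword in the same way as the callbacks shown earlier, the bound methods (registry.on_init, registry.on_close, registry.on_disconnect) could be passed in directly.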

Configuring the first agent message

By default, the agent waits for the user to speak first. To have the agent greet the user when the conversation starts, set a first message in the overrides option on the client when starting the session.

1

To let the agent speak first, update the Speech Engine resource so that the client is permitted to set the first message.

engine = await elevenlabs.speech_engine.update(
    speech_engine_id="seng_8k3m9xr4hjnfg983brhmhkd98n6",
    overrides={
        "first_message": True,
    },
)

2

Then configure the first message in the client SDK when starting the session.

conversation.startSession({
  conversationToken: token,
  overrides: {
    agent: {
      firstMessage: "Hello! How can I help you today?",
    },
  },
});

The first message is spoken by the agent as soon as the connection is established. It does not trigger the onTranscript callback on your server - it is handled entirely on the ElevenLabs side.

Next steps