Speech Engine quickstart
This guide walks you through building a voice-powered agent with Speech Engine. You set up a server that connects your LLM to ElevenLabs, then wire up a browser client so users can have voice conversations with your agent.
Use the ElevenLabs Speech Engine skill to add voice to your chat agent:
How Speech Engine works
Speech Engine connects your LLM to ElevenLabs so that users can speak to your agent and hear it respond. ElevenLabs handles speech-to-text and text-to-speech; your server provides the LLM logic.
Each WebSocket connection represents one conversation. When the user speaks, ElevenLabs transcribes the audio and sends the transcript to your server. Your server passes it to your LLM, then streams the response back. ElevenLabs converts the text to speech and plays it in the browser. The SDK handles connection management, turn-taking, and interruption detection.
Prerequisites
This tutorial uses OpenAI’s API for the LLM. You need an OpenAI API key set in the OPENAI_API_KEY environment variable.
Server setup
Create an API key
Create an API key in the dashboard here, which you’ll use to securely access the API.
Store the key as a managed secret and pass it to the SDKs either as a environment variable via an .env file, or directly in your app’s configuration depending on your preference.
Expose the server
Speech Engine needs a publicly reachable URL. Use ngrok to expose your local server. The server is not built yet, but ngrok needs to be running first so you have the URL for the next step.
Copy the forwarding URL (e.g. https://abc123.ngrok.io).
Create a Speech Engine instance
Use the SDK to create a Speech Engine instance, passing your ngrok URL with the /ws path appended as the WebSocket URL.
Run this script and copy the Speech Engine ID (e.g. seng_8k3m9xr4hjnfg983brhmhkd98n6) for the next step.
Create the server
Create a file called server.py or server.mts with the following contents. This sets up a server, attaches Speech Engine on the /ws path, and uses OpenAI to generate responses.
The onTranscript / on_transcript callback receives the full conversation history and the current session. The TypeScript SDK also provides an AbortSignal that fires if the user interrupts mid-response. Passing signal to the OpenAI call cancels the LLM request automatically on interruption.
sendResponse() / send_response() accepts a string, an async iterable, or a stream from OpenAI, Anthropic, or Google Gemini. The SDK extracts the text content automatically.
In the above example, the full transcript from the user is passed to the LLM. In a production environment you should add guardrails to prevent any prompt injection or manipulation attempts.
Client setup
Create a token endpoint
Add a server-side endpoint that generates a conversation token. This keeps your API key out of the browser and uses WebRTC for the best audio quality.
Build the conversation UI
Fetch the conversation token from your server and use it to start a session.
React
JavaScript
Try it out
Make sure three processes are running:
- ngrok - forwarding to port 3001
- Your Speech Engine server -
python server.pyornpx tsx server.mts - The token server -
npx tsx token-server.mtsorpython token_server.py
Open your client application in the browser and click Start conversation. Grant microphone access when prompted, then speak. You should hear the agent respond through your speakers.
If you have debug: true enabled on the server, you will see incoming transcripts and outgoing responses logged to the console.
Session events
Configuring the first agent message
By default, the agent waits for the user to speak first. To have the agent greet the user when the conversation starts, set a first message in the overrides option on the client when starting the session.
The first message is spoken by the agent as soon as the connection is established. It does not trigger the onTranscript callback on your server - it is handled entirely on the ElevenLabs side.