Speech Engine | ElevenLabs Documentation

Overview

ElevenLabs Speech Engine adds voice capabilities to any chat agent. ElevenLabs handles speech-to-text and text-to-speech while your server provides the LLM logic. The SDK manages connection lifecycle, turn-taking, and interruption detection so you can focus on your agent’s behavior.

Quickstart

Build a voice agent with the ElevenLabs SDK.

JavaScript SDK reference

Classes, methods, and events for the JavaScript SDK.

Python SDK reference

Classes, methods, and events for the Python SDK.

How it works

Speech Engine connects your server to ElevenLabs over WebSocket. Each connection represents one conversation.

A user speaks in the browser. ElevenLabs captures the audio and transcribes it.
The transcript is sent to your server along with the full conversation history.
Your server passes the transcript to your LLM and streams the response back.
ElevenLabs converts the text to speech and plays it in the browser.

When to use Speech Engine

Speech Engine is designed for developers who want to bring their own LLM and control the conversation logic on their own server. Use it when you need to:

Add voice to an existing text-based chat agent
Use a specific LLM, fine-tuned model, or custom inference pipeline
Keep full control over conversation routing, context management, and tool calling
Integrate voice into an existing server application (Express, FastAPI, etc.)

If you want a fully hosted solution where ElevenLabs provides the LLM, knowledge base, and tools, use ElevenAgents instead.

Key features

Any LLM - use OpenAI, Anthropic, Google Gemini, or any model that produces text. The SDK auto-extracts text from OpenAI, Anthropic, and Gemini stream formats.
Interruption handling - when the user speaks mid-response, the SDK cancels the in-flight LLM request automatically via an AbortSignal (TypeScript) or task cancellation (Python).
Streaming - responses are streamed to the browser as they are generated. Pass a string, an async iterable, or a native LLM stream object.
Turn-taking - the SDK manages conversation turns, so your server only needs to respond to transcripts.

FAQ

What LLMs are supported?

Any LLM that produces text. The SDK has built-in stream extraction for OpenAI (Responses API and Chat Completions API), Anthropic Messages API, and Google Gemini API. For other providers, pass a plain string or an async iterable of string chunks.

What is the difference between Speech Engine and ElevenAgents?

ElevenAgents is a fully hosted platform where ElevenLabs provides the LLM, knowledge base, and tools. Speech Engine is for developers who want to bring their own LLM and control the conversation logic on their own server.

What server frameworks are supported?

In TypeScript, you can attach Speech Engine to any Node.js HTTP server (Express, Fastify, or plain http.createServer()), or run a standalone WebSocket server. In Python, the SDK provides a standalone server via engine.serve(), or you can integrate with FastAPI, Starlette, or any ASGI framework using engine.create_session().