Realtime Speech to Text

Learn how to transcribe audio in realtime with ElevenLabs using WebSockets

Overview

The ElevenLabs Realtime Speech to Text API enables you to transcribe audio streams in real-time with ultra-low latency using the Scribe v2 Realtime model. Whether you’re building voice assistants, transcription services, or any application requiring live speech recognition, this WebSocket-based API delivers partial transcripts as you speak and committed transcripts when speech segments are complete.

Key features

  • Ultra-low latency: Get partial transcriptions in milliseconds
  • Streaming support: Send audio in chunks while receiving transcripts in real-time
  • Multiple audio formats: Support for PCM (8kHz to 48kHz) and μ-law encoding
  • Voice Activity Detection (VAD): Automatic speech segmentation based on silence detection
  • Manual commit control: Full control over when to commit transcript segments
  • Previous text context: Send previous text context for improved transcription

Quickstart

ElevenLabs Scribe v2 Realtime can be implemented on either the client or the server side. Choose the client-side implementation if you want to transcribe audio in realtime on the client, for instance from the microphone. If you want to transcribe audio from a URL, choose the server-side implementation.

1

Create an API key

Create an API key in the ElevenLabs dashboard. You’ll use it to securely access the API.

Store the key as a managed secret and pass it to the SDKs either as an environment variable via a .env file or directly in your app’s configuration, depending on your preference.

.env
ELEVENLABS_API_KEY=<your_api_key_here>

2

Install the SDK

$ npm install @elevenlabs/react
3

Create a token

To use the client-side SDK, you need to create a single-use token. This can be done via the ElevenLabs API on the server side.

Never expose your API key to the client.

// Node.js server
app.get("/scribe-token", yourAuthMiddleware, async (req, res) => {
  const response = await fetch(
    "https://api.elevenlabs.io/v1/single-use-token/realtime_scribe",
    {
      method: "POST",
      headers: {
        "xi-api-key": process.env.ELEVENLABS_API_KEY,
      },
    }
  );

  const data = await response.json();
  res.json({ token: data.token });
});

Once generated, the token automatically expires after 15 minutes.

4

Configure the SDK

The client SDK provides two ways to transcribe audio in realtime: streaming from the microphone or manually chunking the audio.

import { useScribe } from "@elevenlabs/react";

function MyComponent() {
  const scribe = useScribe({
    modelId: "scribe_v2_realtime",
    onPartialTranscript: (data) => {
      console.log("Partial:", data.text);
    },
    onCommittedTranscript: (data) => {
      console.log("Committed:", data.text);
    },
    onCommittedTranscriptWithTimestamps: (data) => {
      console.log("Committed with timestamps:", data.text);
      console.log("Timestamps:", data.words);
    },
  });

  const handleStart = async () => {
    // Fetch a single-use token from the server
    const token = await fetchTokenFromServer();

    await scribe.connect({
      token,
      microphone: {
        echoCancellation: true,
        noiseSuppression: true,
      },
    });
  };

  return (
    <div>
      <button onClick={handleStart} disabled={scribe.isConnected}>
        Start Recording
      </button>
      <button onClick={scribe.disconnect} disabled={!scribe.isConnected}>
        Stop
      </button>

      {scribe.partialTranscript && <p>Live: {scribe.partialTranscript}</p>}

      <div>
        {scribe.committedTranscripts.map((t) => (
          <p key={t.id}>{t.text}</p>
        ))}
      </div>
    </div>
  );
}
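
The example above assumes a fetchTokenFromServer helper that is not part of the SDK. A minimal sketch, assuming your backend exposes the /scribe-token route from step 3 and returns { token } as JSON:

// Sketch of the helper used above; the /scribe-token route and response
// shape are assumptions based on the server example in step 3.
async function fetchTokenFromServer(): Promise<string> {
  const response = await fetch("/scribe-token");
  if (!response.ok) {
    throw new Error(`Token request failed: ${response.status}`);
  }
  const data = await response.json();
  return data.token;
}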

Query parameters

When using the Realtime Speech to Text WebSocket endpoint, you can configure the transcription session with optional query parameters. These parameters are specified in the connect method.

Parameter | Type | Default | Description
model_id | string | n/a | Required model ID
language_code | string | n/a | An ISO-639-1 or ISO-639-3 language code corresponding to the language of the audio. Can sometimes improve transcription performance if known beforehand. Leave empty to have the model auto-detect the language.
audio_format | string | "pcm_16000" | Audio encoding format. See the "Supported audio formats" section.
commit_strategy | string | "manual" | How to segment speech: manual or vad.
include_timestamps | boolean | false | Whether to receive the committed_transcript_with_timestamps event, which includes word-level timestamps.
vad_silence_threshold_secs | float | 1.5 | Seconds of silence before VAD commits (0.3-3.0). Not applicable if commit_strategy is manual.
vad_threshold | float | 0.4 | VAD sensitivity (0.1-0.9; lower is more sensitive). Not applicable if commit_strategy is manual.
min_speech_duration_ms | int | 100 | Minimum speech duration for VAD (50-2000 ms). Not applicable if commit_strategy is manual.
min_silence_duration_ms | int | 100 | Minimum silence duration for VAD (50-2000 ms). Not applicable if commit_strategy is manual.
import { Scribe, AudioFormat, CommitStrategy } from "@elevenlabs/client";

const connection = Scribe.connect({
  token: "your-token",
  modelId: "scribe_v2_realtime",
  languageCode: "en",
  audioFormat: AudioFormat.PCM_16000,
  commitStrategy: CommitStrategy.VAD,
  vadSilenceThresholdSecs: 1.5,
  vadThreshold: 0.4,
  minSpeechDurationMs: 100,
  minSilenceDurationMs: 100,
  includeTimestamps: false,
});

Supported audio formats

Format | Sample Rate | Description
pcm_8000 | 8 kHz | 16-bit PCM, little-endian
pcm_16000 | 16 kHz | 16-bit PCM, little-endian (recommended)
pcm_22050 | 22.05 kHz | 16-bit PCM, little-endian
pcm_24000 | 24 kHz | 16-bit PCM, little-endian
pcm_44100 | 44.1 kHz | 16-bit PCM, little-endian
pcm_48000 | 48 kHz | 16-bit PCM, little-endian
ulaw_8000 | 8 kHz | 8-bit μ-law encoding
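
The pcm_* formats expect 16-bit, little-endian, mono samples. As a rough sketch (not part of the ElevenLabs SDKs), converting Float32 samples from the Web Audio API into the base64 payload expected for pcm_16000 could look like this, assuming the input is already mono and sampled at 16 kHz:

// Sketch: convert Float32 samples (-1..1) into 16-bit little-endian PCM and
// base64-encode them for the audio_base_64 field. Assumes mono 16 kHz input.
function floatTo16BitPcmBase64(samples: Float32Array): string {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  let binary = "";
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary); // browser base64; in Node, use Buffer.from(buffer).toString("base64")
}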

Commit strategies

When sending audio chunks via the WebSocket, transcript segments can be committed in two ways: Manual Commit or Voice Activity Detection (VAD).

Manual commit

With the manual commit strategy, you control when to commit transcript segments. This is the default strategy. Committing a segment clears the processed, accumulated transcript and starts a new segment without losing context. Committing every 20-30 seconds is good practice to improve latency. By default, the stream is automatically committed every 90 seconds.

For best results, commit during silence periods or at another logical point, such as a turn boundary from a turn-taking model.

Transcript processing starts after the first 2 seconds of audio are sent.

await connection.send({
  audio_base_64: audio_base_64,
  sample_rate: 16000,
});

// When ready to finalize the segment
await connection.commit();

Committing manually several times in a short sequence can degrade model performance.
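
As an illustration of the 20-30 second guidance above (a sketch, not an official pattern), you could commit on a fixed timer while audio is streaming, assuming connection is the open Scribe connection from the earlier examples:

// Sketch: commit roughly every 25 seconds while streaming. Ideally trigger
// this during a silence period rather than at an arbitrary moment.
const COMMIT_INTERVAL_MS = 25_000;

const commitTimer = setInterval(() => {
  void connection.commit();
}, COMMIT_INTERVAL_MS);

// When the session ends:
clearInterval(commitTimer);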

Sending previous text context

When sending audio for transcription, you can send previous text context alongside the first audio chunk to help the model understand the context of the speech. This is useful in a few scenarios:

  • Agent text for conversational AI use cases - Allows the model to more easily understand the context of the conversation and produce better transcriptions.
  • Reconnection after a network error - This allows the model to continue transcribing, using the previous text as guidance.
  • General contextual information - A short description of what the transcription will be about helps the model understand the context.

Sending previous_text context is only possible when sending the first audio chunk via connection.send(). Sending it in subsequent chunks will result in an error. Previous text works best when it’s under 50 characters long.

await connection.send({
  audio_base_64: audio_base_64,
  previous_text: "The previous text context",
});

Voice Activity Detection (VAD)

With the VAD strategy, the transcription engine automatically detects speech and silence segments. When a silence threshold is reached, the transcription engine will commit the transcript segment automatically.

See the Query parameters section for more information on the VAD parameters.

Error handling

If an error occurs, an error message will be returned before the WebSocket connection is closed.

Error Type | Description
auth_error | An error occurred while authenticating the request. Double-check your API key.
quota_exceeded | You have exceeded your usage quota.
transcriber_error | An error occurred while transcribing the audio.
input_error | An error occurred while processing the audio chunk, likely due to an invalid input format or parameters.
error | A generic server error.
commit_throttled | The commit was throttled because too many commit requests were made in a short period of time.
unaccepted_terms_error | You haven’t accepted the terms of service to use Scribe. Please review and accept the terms & conditions in the ElevenLabs dashboard.
rate_limited | You are rate limited. Please reduce the number of requests made in a short period of time.
queue_overflow | The processing queue is full. Please send fewer requests in a short period of time.
resource_exhausted | Server resources are at capacity. Please try again later.
session_time_limit_exceeded | The maximum session time has been reached. Please start a new session or upgrade your subscription.
chunk_size_exceeded | The audio chunk size is too large. Please reduce the chunk size.
insufficient_audio_activity | You haven’t sent enough audio activity to maintain the connection.
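
As an illustration only (this grouping is not defined by the API), you might triage these error types into fatal, back-off-and-retry, and fix-your-input categories before deciding whether to reconnect:

// Sketch: classify the error types from the table above. The grouping is a
// suggested interpretation, not an official SDK helper.
function isRetryable(errorType: string): boolean {
  const fatal = new Set([
    "auth_error",
    "quota_exceeded",
    "unaccepted_terms_error",
    "session_time_limit_exceeded",
  ]);
  const backOffAndRetry = new Set([
    "rate_limited",
    "queue_overflow",
    "resource_exhausted",
    "commit_throttled",
    "insufficient_audio_activity",
  ]);
  if (fatal.has(errorType)) return false;
  if (backOffAndRetry.has(errorType)) return true;
  // transcriber_error, input_error, chunk_size_exceeded, error:
  // correct the input or retry with backoff, depending on your use case.
  return true;
}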

Best practices

Audio quality

  • For best results, use a 16kHz sample rate for an optimum balance of quality and bandwidth.
  • Ensure clean audio input with minimal background noise.
  • Use an appropriate microphone gain to avoid clipping.
  • Only mono audio is supported at this time.

Chunk size

  • Send audio chunks of 0.1 - 1 second in length for smooth streaming (a slicing sketch follows this list).
  • Smaller chunks result in lower latency but more overhead.
  • Larger chunks are more efficient but can introduce latency.
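
As a back-of-the-envelope sketch (assuming the connection object and the floatTo16BitPcmBase64 helper sketched earlier, neither of which is an official utility), splitting a long 16 kHz buffer into roughly 250 ms chunks and sending each one could look like this:

// Sketch: stream a long Float32 buffer as ~250 ms chunks at 16 kHz.
// Assumes `connection` is the open Scribe connection from the earlier examples.
const SAMPLE_RATE = 16_000;
const CHUNK_SECONDS = 0.25; // within the recommended 0.1 - 1 second range
const samplesPerChunk = Math.floor(SAMPLE_RATE * CHUNK_SECONDS); // 4,000 samples

async function streamBuffer(samples: Float32Array): Promise<void> {
  for (let start = 0; start < samples.length; start += samplesPerChunk) {
    const chunk = samples.subarray(start, start + samplesPerChunk);
    await connection.send({
      audio_base_64: floatTo16BitPcmBase64(chunk),
      sample_rate: SAMPLE_RATE,
    });
  }
}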

Reconnection logic

Implement reconnection logic to handle connection failures gracefully using the SDK’s event-driven approach.

import asyncio
from elevenlabs import RealtimeEvents

# Track connection state for reconnection
should_reconnect = {"value": False}
reconnect_event = asyncio.Event()

def on_error(error):
    print(f"Connection error: {error}")
    should_reconnect["value"] = True
    reconnect_event.set()

def on_close():
    print("Connection closed")
    reconnect_event.set()

# Register error handlers
connection.on(RealtimeEvents.ERROR, on_error)
connection.on(RealtimeEvents.CLOSE, on_close)

# Wait for connection to close or error
await reconnect_event.wait()

# Check if we should attempt reconnection
if should_reconnect["value"]:
    print("Reconnecting with exponential backoff...")
    for attempt in range(3):
        try:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
            connection = await elevenlabs.speech_to_text.realtime.connect(config)
            break
        except Exception as e:
            print(f"Reconnection attempt {attempt + 1} failed: {e}")

Event reference

Events you send

Event | Description | When to use
input_audio_chunk | Send audio data for transcription | Continuously while streaming audio

Events you receive

Event | Description | When received
session_started | Confirms connection and returns session configuration | Immediately after the WebSocket connection is established
partial_transcript | Live transcript update | During audio processing, before a commit is made
committed_transcript | Transcript of the audio segment | After a commit (either manual or VAD-triggered)
committed_transcript_with_timestamps | Transcript of the audio segment with word-level timestamps | After the committed transcript; only received when include_timestamps=true is included in the query parameters
auth_error | Authentication error | Invalid or missing API key
quota_exceeded | Usage limit reached | Account quota exhausted
transcriber_error | Transcription engine error | Internal transcription failure
input_error | Invalid input format | Malformed messages or invalid audio
error | Generic server error | Unexpected server failure

Troubleshooting

  • Check audio format matches the configured format
  • Ensure audio data is properly base64-encoded
  • Verify chunks include the sample_rate field
  • Check for authentication errors
  • Verify usage limits
  • Ensure you are sending commit messages
  • With VAD, ensure sufficient silence between segments to trigger an automatic commit
  • Reduce audio chunk size
  • Check network connection
  • Consider using a lower sample rate