Realtime Speech to Text
Overview
The ElevenLabs Realtime Speech to Text API enables you to transcribe audio streams in real time with ultra-low latency using the Scribe v2 Realtime model. Whether you’re building voice assistants, transcription services, or any application requiring live speech recognition, this WebSocket-based API delivers partial transcripts as you speak and committed transcripts when speech segments are complete.
Key features
- Ultra-low latency: Get partial transcriptions in milliseconds
- Streaming support: Send audio in chunks while receiving transcripts in real-time
- Multiple audio formats: Support for PCM (8kHz to 48kHz) and μ-law encoding
- Voice Activity Detection (VAD): Automatic speech segmentation based on silence detection
- Manual commit control: Full control over when to commit transcript segments
- Previous text context: Send previous text context for improved transcription
Quickstart
ElevenLabs Scribe v2 Realtime can be implemented on either the client or the server side. Choose the client-side implementation if you want to transcribe audio in real time on the client, for instance from the microphone. Choose the server-side implementation if you want to transcribe audio from a URL.
Create an API key
Create an API key in the ElevenLabs dashboard; you’ll use it to securely access the API.
Store the key as a managed secret and pass it to the SDKs either as an environment variable via a .env file, or directly in your app’s configuration, depending on your preference.
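Below is a minimal sketch of what a session can look like using a raw WebSocket from Python, streaming a short mono WAV file. The endpoint URL, the xi-api-key header, and the JSON field names are assumptions used for illustration only; the official SDKs wrap these details, so consult the API reference for the exact names.

```python
# Minimal sketch of a realtime transcription session over a raw WebSocket,
# streaming a 16 kHz mono WAV file. The endpoint URL, the xi-api-key header,
# and the JSON field names are placeholders/assumptions -- the SDKs wrap these,
# so check the API reference for the exact names.
import asyncio
import base64
import json
import os
import wave

import websockets  # pip install websockets (>= 14 for additional_headers)

WS_URL = "wss://api.elevenlabs.io/v1/speech-to-text/realtime"  # placeholder URL


async def transcribe(path: str) -> None:
    headers = {"xi-api-key": os.environ["ELEVENLABS_API_KEY"]}  # assumed header name
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        with wave.open(path, "rb") as wav:
            sample_rate = wav.getframerate()
            chunk_frames = sample_rate // 2  # ~0.5 s of audio per chunk
            while True:
                frames = wav.readframes(chunk_frames)
                if not frames:
                    break
                # Assumed chunk shape: base64-encoded PCM plus the sample_rate field.
                await ws.send(json.dumps({
                    "audio": base64.b64encode(frames).decode("ascii"),
                    "sample_rate": sample_rate,
                }))
        # In production you would send and receive concurrently; here we simply
        # read transcript events until the server closes the connection.
        async for message in ws:
            print(json.loads(message))


asyncio.run(transcribe("speech_16khz_mono.wav"))
```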
Query parameters
When using the Realtime Speech to Text WebSocket endpoint, you can configure the transcription session with optional query parameters. These parameters are specified in the connect method.
Supported audio formats
Commit strategies
When sending audio chunks via the WebSocket, transcript segments can be committed in two ways: Manual Commit or Voice Activity Detection (VAD).
Manual commit
With the manual commit strategy, you control when to commit transcript segments. This is the default strategy. Committing a segment clears the processed, accumulated transcript and starts a new segment without losing context. Committing every 20-30 seconds is good practice for keeping latency low. By default, the stream is automatically committed every 90 seconds.
For best results, commit during silence periods or at another logical boundary, such as the end of a conversational turn detected by a turn-taking model.
Committing manually several times in quick succession can degrade model performance.
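As an illustrative sketch, a manual commit can be a small control message sent between audio chunks. The message shape below is an assumption; see the Sent events reference for the actual message name.

```python
# Illustrative sketch of a periodic manual commit. The {"type": "commit"} control
# message is an assumed shape -- check the Sent events reference for the real name.
import json
import time

COMMIT_INTERVAL_S = 25  # within the recommended 20-30 second window
_last_commit = time.monotonic()


async def maybe_commit(ws) -> None:
    """Send a commit if enough audio has been streamed since the last one."""
    global _last_commit
    if time.monotonic() - _last_commit >= COMMIT_INTERVAL_S:
        await ws.send(json.dumps({"type": "commit"}))  # assumed message shape
        _last_commit = time.monotonic()
```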
Sending previous text context
When sending audio for transcription, you can send previous text context alongside the first audio chunk to help the model understand the context of the speech. This is useful in a few scenarios:
- Agent text for conversational AI use cases - Allows the model to more easily understand the context of the conversation and produce better transcriptions.
- Reconnection after a network error - This allows the model to continue transcribing, using the previous text as guidance.
- General contextual information - A short description of what the transcription will be about helps the model understand the context.
Sending previous_text context is only possible when sending the first audio chunk via connection.send(). Sending it in subsequent chunks will result in an error. Previous text works best when it’s under 50 characters long.
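A sketch of attaching context to the first chunk is shown below; the previous_text and sample_rate field names come from this guide, while the rest of the payload shape is an assumption.

```python
# Sketch: attach previous text context to the FIRST audio chunk only. The
# previous_text and sample_rate field names are taken from this guide; the
# overall payload shape is an assumption.
import base64
import json


async def send_first_chunk(ws, pcm_bytes: bytes, sample_rate: int, context: str) -> None:
    await ws.send(json.dumps({
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        "sample_rate": sample_rate,
        "previous_text": context,  # keep it short; under ~50 characters works best
    }))
```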
Voice Activity Detection (VAD)
With the VAD strategy, the transcription engine automatically detects speech and silence segments. When a silence threshold is reached, the transcription engine will commit the transcript segment automatically.
See the Query parameters section for more information on the VAD parameters.
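Under VAD you typically only consume transcript events rather than sending commits. The sketch below assumes each server message is JSON with a type field; the event names used here are placeholders standing in for the actual ones listed in the Event reference.

```python
# Sketch of a receive loop under VAD. Event and field names here are placeholders;
# see the Event reference for the actual received events.
import json


async def consume(ws) -> None:
    async for raw in ws:
        event = json.loads(raw)
        if event.get("type") == "partial_transcript":      # placeholder name
            print("partial:", event.get("text"))
        elif event.get("type") == "committed_transcript":   # placeholder name
            print("committed:", event.get("text"))
```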
Error handling
If an error occurs, an error message will be returned before the WebSocket connection is closed.
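A minimal sketch of surfacing such errors is shown below, assuming the error arrives as a JSON event (the event name is a placeholder) before the socket closes.

```python
# Sketch: surface server-sent errors before the connection closes. The "error"
# event name/shape is an assumption; ConnectionClosedError comes from the
# websockets library.
import json

import websockets


async def read_events(ws) -> None:
    try:
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "error":  # placeholder event name
                print("server error:", event.get("message"))
    except websockets.ConnectionClosedError as exc:
        print("connection closed abnormally:", exc)
```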
Best practices
Audio quality
- For best results, use a 16kHz sample rate for an optimum balance of quality and bandwidth.
- Ensure clean audio input with minimal background noise.
- Use an appropriate microphone gain to avoid clipping.
- Only mono audio is supported at this time.
Chunk size
- Send audio chunks of 0.1 - 1 second in length for smooth streaming.
- Smaller chunks result in lower latency but more overhead.
- Larger chunks are more efficient but can introduce latency.
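As a quick reference, the arithmetic for 16-bit mono PCM chunk sizes is simply sample rate × 2 bytes × duration:

```python
# Chunk-size arithmetic for 16-bit mono PCM: bytes = sample_rate * 2 * seconds.
SAMPLE_RATE = 16_000   # Hz
BYTES_PER_SAMPLE = 2   # 16-bit mono PCM


def chunk_bytes(seconds: float) -> int:
    return int(SAMPLE_RATE * BYTES_PER_SAMPLE * seconds)


print(chunk_bytes(0.1))  # 3200 bytes  -> lower latency, more messages
print(chunk_bytes(1.0))  # 32000 bytes -> fewer messages, higher latency
```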
Reconnection logic
Implement reconnection logic to handle connection failures gracefully using the SDK’s event-driven approach.
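One possible sketch, using exponential backoff around your own connect-and-stream coroutine (stream_session below is an illustrative placeholder):

```python
# Sketch of reconnection with exponential backoff. stream_session is a placeholder
# for your own coroutine that connects, streams audio, and returns on a clean close.
import asyncio

import websockets


async def run_with_reconnect(stream_session, max_retries: int = 5) -> None:
    delay = 1.0
    for _ in range(max_retries):
        try:
            await stream_session()
            return  # clean finish, no reconnect needed
        except (websockets.ConnectionClosedError, OSError):
            # On reconnect, consider resending previous_text so the model keeps context.
            await asyncio.sleep(delay)
            delay = min(delay * 2, 30.0)
    raise RuntimeError("gave up reconnecting after repeated failures")
```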
Event reference
Sent events
Received events
Troubleshooting
No transcripts received
- Check that the audio format matches the configured format
- Ensure audio data is properly base64 encoded
- Verify chunks include the sample_rate field
- Check for authentication errors
- Verify usage limits
Partial transcripts but no committed transcript
- Ensure you are sending commit messages
- With VAD, ensure sufficient silence between segments to trigger a commit
High latency
- Reduce audio chunk size
- Check network connection
- Consider using a lower sample rate