Realtime Speech to Text
Overview
The ElevenLabs Realtime Speech to Text API enables you to transcribe audio streams in real time with ultra-low latency using the Scribe v2 Realtime model. Whether you’re building voice assistants, transcription services, or any application requiring live speech recognition, this WebSocket-based API delivers partial transcripts as you speak and committed transcripts when speech segments are complete.
Key features
- Ultra-low latency: Get partial transcriptions in milliseconds
- Streaming support: Send audio in chunks while receiving transcripts in real-time
- Multiple audio formats: Support for PCM (8kHz to 48kHz) and μ-law encoding
- Voice Activity Detection (VAD): Automatic speech segmentation based on silence detection
- Manual commit control: Full control over when to commit transcript segments
- Previous text context: Send previous text context for improved transcription
Quickstart
ElevenLabs Scribe v2 Realtime can be implemented on either the client or the server side. Choose the client-side implementation if you want to transcribe audio in real time on the client, for instance from the microphone. Choose the server-side implementation if you want to transcribe audio from a URL.
Create an API key
Create an API key in the ElevenLabs dashboard; you’ll use it to securely access the API.
Store the key as a managed secret and pass it to the SDKs either as an environment variable via a .env file, or directly in your app’s configuration, depending on your preference.
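Below is a minimal sketch of what a session can look like using a raw WebSocket from Python, streaming a short mono WAV file. The endpoint URL, the xi-api-key header, and the JSON field names are assumptions used for illustration only; the official SDKs wrap these details, so consult the API reference for the exact names.

```python
# Minimal sketch of a realtime transcription session over a raw WebSocket,
# streaming a 16 kHz mono WAV file. The endpoint URL, the xi-api-key header,
# and the JSON field names are placeholders/assumptions -- the SDKs wrap these,
# so check the API reference for the exact names.
import asyncio
import base64
import json
import os
import wave

import websockets  # pip install websockets (>= 14 for additional_headers)

WS_URL = "wss://api.elevenlabs.io/v1/speech-to-text/realtime"  # placeholder URL


async def transcribe(path: str) -> None:
    headers = {"xi-api-key": os.environ["ELEVENLABS_API_KEY"]}  # assumed header name
    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        with wave.open(path, "rb") as wav:
            sample_rate = wav.getframerate()
            chunk_frames = sample_rate // 2  # ~0.5 s of audio per chunk
            while True:
                frames = wav.readframes(chunk_frames)
                if not frames:
                    break
                # Assumed chunk shape: base64-encoded PCM plus the sample_rate field.
                await ws.send(json.dumps({
                    "audio": base64.b64encode(frames).decode("ascii"),
                    "sample_rate": sample_rate,
                }))
        # In production you would send and receive concurrently; here we simply
        # read transcript events until the server closes the connection.
        async for message in ws:
            print(json.loads(message))


asyncio.run(transcribe("speech_16khz_mono.wav"))
```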
Query parameters
When using the Realtime Speech to Text WebSocket endpoint, you can configure the transcription session with optional query parameters. These parameters are specified in the connect method.
Supported audio formats
Commit strategies
When sending audio chunks via the WebSocket, transcript segments can be committed in two ways: Manual Commit or Voice Activity Detection (VAD).
Manual commit
With the manual commit strategy, you control when to commit transcript segments. This is the default strategy. Committing a segment clears the processed, accumulated transcript and starts a new segment without losing context. Committing every 20-30 seconds is good practice for keeping latency low. By default, the stream is automatically committed every 90 seconds.
For best results, commit during silence periods or at another logical boundary, such as the end of a conversational turn detected by a turn-taking model.
Committing manually several times in quick succession can degrade model performance.
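As an illustrative sketch, a manual commit can be a small control message sent between audio chunks. The message shape below is an assumption; see the Sent events reference for the actual message name.

```python
# Illustrative sketch of a periodic manual commit. The {"type": "commit"} control
# message is an assumed shape -- check the Sent events reference for the real name.
import json
import time

COMMIT_INTERVAL_S = 25  # within the recommended 20-30 second window
_last_commit = time.monotonic()


async def maybe_commit(ws) -> None:
    """Send a commit if enough audio has been streamed since the last one."""
    global _last_commit
    if time.monotonic() - _last_commit >= COMMIT_INTERVAL_S:
        await ws.send(json.dumps({"type": "commit"}))  # assumed message shape
        _last_commit = time.monotonic()
```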
Sending previous text context
When sending audio for transcription, you can send previous text context alongside the first audio chunk to help the model understand the context of the speech. This is useful in a few scenarios:
- Agent text for conversational AI use cases - Allows the model to more easily understand the context of the conversation and produce better transcriptions.
- Reconnection after a network error - This allows the model to continue transcribing, using the previous text as guidance.
- General contextual information - A short description of what the transcription will be about helps the model understand the context.
Sending previous_text context is only possible when sending the first audio chunk via connection.send(). Sending it in subsequent chunks will result in an error. Previous text works best when it’s under 50 characters long.
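A sketch of attaching context to the first chunk is shown below; the previous_text and sample_rate field names come from this guide, while the rest of the payload shape is an assumption.

```python
# Sketch: attach previous text context to the FIRST audio chunk only. The
# previous_text and sample_rate field names are taken from this guide; the
# overall payload shape is an assumption.
import base64
import json


async def send_first_chunk(ws, pcm_bytes: bytes, sample_rate: int, context: str) -> None:
    await ws.send(json.dumps({
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        "sample_rate": sample_rate,
        "previous_text": context,  # keep it short; under ~50 characters works best
    }))
```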
Voice Activity Detection (VAD)
With the VAD strategy, the transcription engine automatically detects speech and silence segments. When a silence threshold is reached, the transcription engine will commit the transcript segment automatically.
See the Query parameters section for more information on the VAD parameters.
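Under VAD you typically only consume transcript events rather than sending commits. The sketch below assumes each server message is JSON with a type field; the event names used here are placeholders standing in for the actual ones listed in the Event reference.

```python
# Sketch of a receive loop under VAD. Event and field names here are placeholders;
# see the Event reference for the actual received events.
import json


async def consume(ws) -> None:
    async for raw in ws:
        event = json.loads(raw)
        if event.get("type") == "partial_transcript":      # placeholder name
            print("partial:", event.get("text"))
        elif event.get("type") == "committed_transcript":   # placeholder name
            print("committed:", event.get("text"))
```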
Error handling
If an error occurs, an error message will be returned before the WebSocket connection is closed.
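A minimal sketch of surfacing such errors is shown below, assuming the error arrives as a JSON event (the event name is a placeholder) before the socket closes.

```python
# Sketch: surface server-sent errors before the connection closes. The "error"
# event name/shape is an assumption; ConnectionClosedError comes from the
# websockets library.
import json

import websockets


async def read_events(ws) -> None:
    try:
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "error":  # placeholder event name
                print("server error:", event.get("message"))
    except websockets.ConnectionClosedError as exc:
        print("connection closed abnormally:", exc)
```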
Best practices
Audio quality
- For best results, use a 16kHz sample rate for an optimum balance of quality and bandwidth.
- Ensure clean audio input with minimal background noise.
- Use an appropriate microphone gain to avoid clipping.
- Only mono audio is supported at this time.
Chunk size
- Send audio chunks of 0.1 - 1 second in length for smooth streaming.
- Smaller chunks result in lower latency but more overhead.
- Larger chunks are more efficient but can introduce latency.
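As a quick reference, the arithmetic for 16-bit mono PCM chunk sizes is simply sample rate × 2 bytes × duration:

```python
# Chunk-size arithmetic for 16-bit mono PCM: bytes = sample_rate * 2 * seconds.
SAMPLE_RATE = 16_000   # Hz
BYTES_PER_SAMPLE = 2   # 16-bit mono PCM


def chunk_bytes(seconds: float) -> int:
    return int(SAMPLE_RATE * BYTES_PER_SAMPLE * seconds)


print(chunk_bytes(0.1))  # 3200 bytes  -> lower latency, more messages
print(chunk_bytes(1.0))  # 32000 bytes -> fewer messages, higher latency
```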
Reconnection logic
Implement reconnection logic to handle connection failures gracefully using the SDK’s event-driven approach.
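One possible sketch, using exponential backoff around your own connect-and-stream coroutine (stream_session below is an illustrative placeholder):

```python
# Sketch of reconnection with exponential backoff. stream_session is a placeholder
# for your own coroutine that connects, streams audio, and returns on a clean close.
import asyncio

import websockets


async def run_with_reconnect(stream_session, max_retries: int = 5) -> None:
    delay = 1.0
    for _ in range(max_retries):
        try:
            await stream_session()
            return  # clean finish, no reconnect needed
        except (websockets.ConnectionClosedError, OSError):
            # On reconnect, consider resending previous_text so the model keeps context.
            await asyncio.sleep(delay)
            delay = min(delay * 2, 30.0)
    raise RuntimeError("gave up reconnecting after repeated failures")
```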
Event reference
Sent events
Received events
Troubleshooting
No transcripts received
- Check that the audio format matches the configured format
- Ensure audio data is properly base64 encoded
- Verify chunks include the sample_rate field
- Check for authentication errors
- Verify usage limits
Partial transcripts but no committed transcript
- Ensure you are sending commit messages
- With VAD, ensure sufficient silence between segments to trigger a commit
High latency
- Reduce audio chunk size
- Check network connection
- Consider using a lower sample rate