Transcripts and commit strategies
Overview
When transcribing audio, you will receive partial and committed transcripts.
- Partial transcripts - the interim results of the transcription
- Committed transcripts - the final result for a transcript segment, sent when a “commit” message is received. A session can contain multiple committed transcripts.
Committed transcripts can optionally include word-level timestamps. These are only returned when the “include timestamps” option is set to true.
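For illustration, a minimal sketch of handling both message types on the client is shown below. The message shape (the type, text, and words fields) is an assumption, not a documented schema; check your provider's reference for the exact field names.

```typescript
// Hypothetical message shapes -- the field names below are assumptions.
interface Word {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface TranscriptMessage {
  type: "partial_transcript" | "committed_transcript";
  text: string;
  words?: Word[]; // only present on commits when timestamps are enabled
}

const connection = new WebSocket("wss://example.com/v1/transcribe"); // placeholder URL

connection.onmessage = (event: MessageEvent<string>) => {
  const msg: TranscriptMessage = JSON.parse(event.data);
  if (msg.type === "partial_transcript") {
    // Interim result: may still be revised, suitable for live display.
    console.log("partial:", msg.text);
  } else {
    // Final text for this segment; a session can produce many of these.
    console.log("committed:", msg.text, msg.words ?? []);
  }
};
```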
Commit strategies
When sending audio chunks via the WebSocket, transcript segments can be committed in two ways: Manual Commit or Voice Activity Detection (VAD).
Manual commit
With the manual commit strategy, you control when to commit transcript segments. This is the default strategy. Committing a segment clears the processed, accumulated transcript and starts a new segment without losing context. Committing every 20-30 seconds is good practice to improve latency. By default, the stream is committed automatically every 90 seconds.
For best results, commit during periods of silence or at another logical boundary, such as the end of a speaker turn.
Committing manually several times in quick succession can degrade model performance.
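As a sketch, a simple timer can drive manual commits inside the recommended window. The commit payload below is an assumed message shape; substitute your provider's documented commit message.

```typescript
// A minimal manual-commit loop. The commit payload shape is an assumption.
const connection = new WebSocket("wss://example.com/v1/transcribe"); // placeholder URL

const COMMIT_INTERVAL_MS = 25_000; // inside the recommended 20-30 second window

const commitTimer = setInterval(() => {
  if (connection.readyState === WebSocket.OPEN) {
    connection.send(JSON.stringify({ type: "commit" })); // hypothetical payload
  }
}, COMMIT_INTERVAL_MS);

// Stop committing once the session ends.
connection.addEventListener("close", () => clearInterval(commitTimer));
```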
Sending previous text context
When sending audio for transcription, you can send previous text context alongside the first audio chunk to help the model understand the context of the speech. This is useful in a few scenarios:
- Agent text for conversational AI use cases - Allows the model to more easily understand the context of the conversation and produce better transcriptions.
- Reconnection after a network error - This allows the model to continue transcribing, using the previous text as guidance.
- General contextual information - A short description of what the transcription will be about helps the model understand the context.
Sending previous_text context is only possible when sending the first audio chunk via connection.send(). Sending it in subsequent chunks will result in an error. Previous text works best when it is under 50 characters long.
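A sketch of the first-chunk rule is shown below. The audio and previous_text field names, and the base64 encoding of the chunk, are assumptions about the payload shape rather than a documented API.

```typescript
// Helper: base64-encode a PCM chunk. A loop avoids the call-stack limits
// that the spread form can hit on larger buffers.
function toBase64(buf: ArrayBuffer): string {
  const bytes = new Uint8Array(buf);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// First chunk: previous_text is allowed here, and only here.
// Keep previousText under ~50 characters for best results.
function sendFirstChunk(connection: WebSocket, chunk: ArrayBuffer, previousText: string): void {
  connection.send(JSON.stringify({ audio: toBase64(chunk), previous_text: previousText }));
}

// Subsequent chunks: sending previous_text again would be an error.
function sendNextChunk(connection: WebSocket, chunk: ArrayBuffer): void {
  connection.send(JSON.stringify({ audio: toBase64(chunk) }));
}
```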
Voice Activity Detection (VAD)
With the VAD strategy, the transcription engine automatically detects speech and silence segments. When a silence threshold is reached, the transcription engine will commit the transcript segment automatically.
The VAD strategy is recommended when transcribing microphone audio in client-side integrations.
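For illustration, opting into VAD-based commits at session start might look like the sketch below. The configuration message type, field names, and threshold units are all assumptions; consult your provider's session-configuration schema.

```typescript
// Hypothetical session configuration enabling VAD-based commits.
const connection = new WebSocket("wss://example.com/v1/transcribe"); // placeholder URL

connection.onopen = () => {
  connection.send(
    JSON.stringify({
      type: "session_config",        // assumed message type
      commit_strategy: "vad",        // auto-commit on detected silence
      vad_silence_threshold_ms: 500, // assumed field name and units
    })
  );
};
```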
Best practices
Audio quality
- For best results, use a 16kHz sample rate for an optimum balance of quality and bandwidth.
- Ensure clean audio input with minimal background noise.
- Use an appropriate microphone gain to avoid clipping.
- Only mono audio is supported at this time.
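In a browser client, these guidelines map onto standard getUserMedia constraints, as in the sketch below. Browsers treat constraints as hints and may not honor all of them.

```typescript
// Request microphone audio matching the guidelines above:
// 16 kHz, mono, with browser-side cleanup enabled.
async function getMicStream(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      sampleRate: 16000,      // 16 kHz: balance of quality and bandwidth
      channelCount: 1,        // mono only
      noiseSuppression: true, // reduce background noise
      echoCancellation: true,
      autoGainControl: true,  // helps avoid clipping from excessive gain
    },
  });
}
```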
Chunk size
- Send audio chunks of 0.1-1 second in length for smooth streaming.
- Smaller chunks result in lower latency but more overhead.
- Larger chunks are more efficient but can introduce latency.
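As a sketch, the chunk duration translates directly into a byte size for raw PCM. The example below assumes 16-bit mono PCM at 16 kHz and slices a buffer into 250 ms chunks.

```typescript
const SAMPLE_RATE = 16_000; // Hz
const BYTES_PER_SAMPLE = 2; // 16-bit PCM
const CHUNK_SECONDS = 0.25; // within the 0.1-1 second window
const CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS; // 8,000 bytes

// Slice a raw PCM buffer into fixed-size chunks for streaming.
function* chunkPcm(pcm: ArrayBuffer): Generator<ArrayBuffer> {
  for (let offset = 0; offset < pcm.byteLength; offset += CHUNK_BYTES) {
    yield pcm.slice(offset, offset + CHUNK_BYTES);
  }
}
```

Each yielded chunk can then be encoded and sent as in the earlier sketches.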