How-to guide · Assumes you have completed the client-side or server-side streaming guide.
When transcribing audio, you will receive partial and committed transcripts.
The committed transcript can optionally contain word-level timestamps. These are only included when the “include timestamps” option is set to true.
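The difference between partial and committed transcripts can be sketched as a small state holder: a partial replaces the previous partial, while a committed segment is appended and never changes again. This is a minimal illustration only. The message keys `type`, `text` and `words`, and the values `"partial"` and `"committed"`, are hypothetical placeholders and not the actual event names of the API.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptView:
    """Keeps finalized segments separate from the in-progress partial text."""
    committed: list = field(default_factory=list)
    partial: str = ""
    words: list = field(default_factory=list)

    def handle(self, message: dict) -> None:
        # NOTE: "type", "text" and "words" are hypothetical keys used for
        # illustration; check the API reference for the real message shape.
        kind = message.get("type")
        if kind == "partial":
            # A partial may still change, so it overwrites the previous one.
            self.partial = message.get("text", "")
        elif kind == "committed":
            # A committed segment is final; the pending partial is now stale.
            self.committed.append(message.get("text", ""))
            self.partial = ""
            # Word-level timestamps are only present when they were requested.
            self.words.extend(message.get("words", []))

    def display(self) -> str:
        # Show all committed segments followed by the current partial, if any.
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Calling `handle()` for every incoming message and rendering `display()` gives a live view in which only the trailing partial text is ever rewritten.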
When sending audio chunks via the WebSocket, transcript segments can be committed in two ways: Manual Commit or Voice Activity Detection (VAD).
With the manual commit strategy, you control when to commit transcript segments. This is the default strategy. Committing a segment clears the accumulated transcript that has been processed so far and starts a new segment without losing context. Committing every 20-30 seconds is good practice to improve latency. By default, the stream is committed automatically every 90 seconds.
For best results, commit during periods of silence or at another logical point, such as a turn boundary signaled by a turn-detection model.
Committing manually several times in a short sequence can degrade model performance.
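One way to follow this guidance is to keep the commit decision in a small policy object: wait until roughly 20 seconds have passed, then commit at the next pause, and commit regardless once 30 seconds are reached. The sketch below assumes your application already knows whether the speaker is currently silent. `CommitPolicy` and its field names are hypothetical and not part of any SDK.

```python
from dataclasses import dataclass


@dataclass
class CommitPolicy:
    """Decides when a manual commit is due (hypothetical helper, not an SDK class)."""
    target_interval: float = 20.0  # start looking for a pause after this many seconds
    max_interval: float = 30.0     # commit even without a pause once this is reached
    last_commit: float = 0.0       # timestamp of the previous commit, in seconds

    def should_commit(self, now: float, in_silence: bool) -> bool:
        elapsed = now - self.last_commit
        if elapsed >= self.max_interval:
            # Too long since the last commit: commit even mid-speech.
            return True
        # Otherwise only commit during silence, and not too soon after the
        # previous commit, which avoids several commits in a short sequence.
        return in_silence and elapsed >= self.target_interval

    def mark_committed(self, now: float) -> None:
        self.last_commit = now
```

In a streaming loop you would call `should_commit()` after sending each chunk, issue the commit through your connection when it returns true, and then call `mark_committed()` with the same timestamp.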
When sending audio for transcription, you can send previous text context alongside the first audio chunk to help the model understand the context of the speech. This is useful in a few scenarios:
previous_text context can only be sent with the first audio chunk via connection.send(). Sending it with subsequent chunks will result in an error. Previous text works best when it’s under 50 characters long.
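The first-chunk-only rule can be enforced when the payloads are built, so later chunks can never carry the context by accident. The sketch below is illustrative: `build_chunk_messages` is a hypothetical helper, and the payload key `audio_base_64` is an assumed name. Only `previous_text` is taken from the text above, and the exact argument shape expected by connection.send() should be checked against the SDK reference.

```python
import base64
from typing import Iterable, Iterator, Optional


def build_chunk_messages(
    chunks: Iterable[bytes], previous_text: Optional[str] = None
) -> Iterator[dict]:
    """Yield one payload per audio chunk; only the first may carry previous_text."""
    for index, chunk in enumerate(chunks):
        # "audio_base_64" is an assumed key name used for illustration only.
        message = {"audio_base_64": base64.b64encode(chunk).decode("ascii")}
        if index == 0 and previous_text:
            # Context is attached to the first chunk and never again.
            message["previous_text"] = previous_text
        yield message
```

Each yielded dictionary would then be passed to connection.send() in order.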
With the VAD strategy, the transcription engine automatically detects speech and silence segments. When a silence threshold is reached, the transcription engine will commit the transcript segment automatically.
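The idea of a silence threshold can be illustrated with a toy energy-based detector. This is not the transcription engine's actual algorithm, which is not described here. It only shows how a commit can be triggered once silence following speech has lasted long enough. The function name, parameters and default values are all hypothetical.

```python
from typing import Iterable, List


def commit_points(
    frame_energies: Iterable[float],
    frame_ms: int = 100,
    energy_floor: float = 0.01,
    silence_threshold_ms: int = 500,
) -> List[int]:
    """Return the frame indices at which a commit would be triggered."""
    commits: List[int] = []
    silence_ms = 0
    has_speech = False  # only commit when there is speech to commit
    for i, energy in enumerate(frame_energies):
        if energy >= energy_floor:
            # Speech frame: reset the silence counter.
            has_speech = True
            silence_ms = 0
        else:
            silence_ms += frame_ms
            if has_speech and silence_ms >= silence_threshold_ms:
                # Silence after speech reached the threshold: commit here.
                commits.append(i)
                has_speech = False
    return commits
```

With 100 ms frames and a 500 ms threshold, a commit fires on the fifth consecutive silent frame after speech, and leading silence with no speech never triggers one.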
When transcribing audio from the microphone in the client-side integration, the VAD strategy is recommended.