Realtime

Realtime speech-to-text transcription service. This WebSocket API enables streaming audio input and receiving transcription results.

## Event Flow

- Audio chunks are sent as `input_audio_chunk` messages
- Transcription results are streamed back in various formats (partial, committed, with timestamps)
- Supports manual commit or VAD-based automatic commit strategies

Authentication is done either by providing a valid API key in the `xi-api-key` header or by providing a valid token in the `token` query parameter. Tokens can be generated from the [single use token endpoint](/docs/api-reference/tokens/create). Use tokens if you want to transcribe audio from the client side.
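The two authentication options can be sketched as a small helper that returns the handshake headers and query parameters for whichever credential is available. The function name `auth_parts` is hypothetical, and preferring the token when both are supplied is an assumption based on the note that the API key is no longer required once a token is provided.

```python
from typing import Optional


def auth_parts(api_key: Optional[str] = None, token: Optional[str] = None):
    """Return (headers, query) carrying exactly one credential."""
    if token:
        # Client side: the single use token travels in the `token` query parameter.
        return {}, {"token": token}
    if api_key:
        # Server side: the API key travels in the `xi-api-key` header.
        return {"xi-api-key": api_key}, {}
    raise ValueError("provide either an API key or a single use token")
```

Keeping the API key out of the query string means it never ends up in a URL that a browser or proxy might log, which is the reason tokens are suggested for client-side use.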

Handshake

WSS /v1/speech-to-text/realtime

Headers

xi-api-key string Optional

Query parameters

model_id string Required
ID of the model to use for transcription.

token string Optional

Single use token for authentication. Only used when initiating a session from the client. If provided, xi-api-key is no longer required for authentication.

include_timestamps boolean Optional Defaults to false

Whether to receive the committed_transcript_with_timestamps event, which includes word-level timestamps.

include_language_detection boolean Optional Defaults to false

Whether to include the detected language code in the committed_transcript_with_timestamps event.

audio_format enum Optional Defaults to pcm_16000

Audio encoding format for speech-to-text.

language_code string Optional

Language code in ISO 639-1 or ISO 639-3 format.

commit_strategy enum Optional Defaults to manual
Strategy for committing transcriptions.
Allowed values:

keyterms list of strings Optional

List of keyterms to bias the model towards. Maximum 50 keyterms, each up to 20 characters. Adds a 20% premium to the base transcription cost.
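The keyterm limits can be checked on the client before opening a session. This is a minimal sketch: the helper names are hypothetical, and charging the premium only when at least one keyterm is sent is an assumption, since the reference states only that keyterms add a 20% premium to the base cost.

```python
MAX_KEYTERMS = 50        # documented maximum number of keyterms
MAX_KEYTERM_CHARS = 20   # documented maximum length of one keyterm
KEYTERM_PREMIUM = 0.20   # documented premium on the base transcription cost


def check_keyterms(keyterms):
    """Raise ValueError if the list breaks a documented keyterm limit."""
    if len(keyterms) > MAX_KEYTERMS:
        raise ValueError(f"{len(keyterms)} keyterms given, maximum is {MAX_KEYTERMS}")
    for term in keyterms:
        if len(term) > MAX_KEYTERM_CHARS:
            raise ValueError(f"keyterm {term!r} exceeds {MAX_KEYTERM_CHARS} characters")
    return keyterms


def cost_with_keyterms(base_cost, keyterms):
    """Estimate cost, adding the premium only when keyterms are sent (assumed)."""
    return base_cost * (1 + KEYTERM_PREMIUM) if keyterms else base_cost
```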

no_verbatim boolean Optional Defaults to false
If true, removes filler words, false starts and disfluencies from the transcript.
vad_silence_threshold_secs double Optional 0.3-3 Defaults to 1.5
Silence threshold in seconds for VAD.
vad_threshold double Optional 0.1-0.9 Defaults to 0.4
Threshold for voice activity detection.
min_speech_duration_ms integer Optional 50-2000 Defaults to 100
Minimum speech duration in milliseconds.
min_silence_duration_ms integer Optional 50-2000 Defaults to 100
Minimum silence duration in milliseconds.
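The four VAD-related parameters each have a documented range, so a client can reject bad values before the handshake. The sketch below assumes the ranges are inclusive at both ends, which the reference does not state explicitly; `out_of_range` is a hypothetical helper name.

```python
# Documented ranges for the VAD-related query parameters (assumed inclusive).
VAD_RANGES = {
    "vad_silence_threshold_secs": (0.3, 3.0),
    "vad_threshold": (0.1, 0.9),
    "min_speech_duration_ms": (50, 2000),
    "min_silence_duration_ms": (50, 2000),
}


def out_of_range(params):
    """Return the names of VAD parameters whose values fall outside their range."""
    bad = []
    for name, (low, high) in VAD_RANGES.items():
        # Parameters that are not set are skipped so the server default applies.
        if name in params and not (low <= params[name] <= high):
            bad.append(name)
    return bad
```

For example, `out_of_range({"vad_threshold": 0.95})` reports `vad_threshold`, because 0.95 is above the documented upper bound of 0.9.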
enable_logging boolean Optional Defaults to true

When enable_logging is set to false, zero retention mode is used for the request, which makes history features unavailable for it. Zero retention mode may only be used by enterprise customers.
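Putting the query parameters together, the handshake URL can be assembled with the standard library. This is a sketch under stated assumptions: the hostname is a placeholder because the reference gives only the path, the model ID in the usage note is made up, and spelling booleans as lowercase `true`/`false` is assumed rather than documented.

```python
from urllib.parse import urlencode

# Hypothetical host: the reference gives only the path, not the hostname.
EXAMPLE_HOST = "api.example.com"
PATH = "/v1/speech-to-text/realtime"


def build_url(model_id, host=EXAMPLE_HOST, **options):
    """Build the wss:// handshake URL from model_id plus optional query parameters."""
    query = {"model_id": model_id}  # the only required query parameter
    for name, value in options.items():
        if value is None:
            continue  # leave unset options out so the server defaults apply
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumed boolean spelling
        query[name] = value
    return f"wss://{host}{PATH}?{urlencode(query)}"
```

Calling `build_url("example-model", include_timestamps=True, language_code="en")` yields a URL whose query string carries `model_id`, `include_timestamps` and `language_code` in that order.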

Send

inputAudioChunk object Required
Audio data chunk sent from client to server for transcription.
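A chunk message might be serialized as shown below. Only the `input_audio_chunk` message name comes from this reference; the JSON field names `message_type`, `audio_base_64` and `commit`, and the use of base64 for the audio payload, are assumptions that should be checked against the full `inputAudioChunk` schema.

```python
import base64
import json


def audio_chunk_message(audio_bytes, commit=False):
    """Serialize raw audio bytes as one input_audio_chunk JSON text frame."""
    return json.dumps({
        "message_type": "input_audio_chunk",  # assumed discriminator field
        # Assumed field name; binary audio is base64-encoded to fit in JSON.
        "audio_base_64": base64.b64encode(audio_bytes).decode("ascii"),
        "commit": commit,  # assumed flag for the manual commit strategy
    })
```

With the manual commit strategy, the last chunk of an utterance would be sent with `commit=True`; with a VAD-based strategy the flag would stay `False` and the server would decide when to commit.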

Receive

sessionStarted object Required
Sent when the transcription session is successfully started.
OR
partialTranscript object Required
Interim transcription result that may change.
OR
committedTranscript object Required
Committed transcription result that will not change.
OR
committedTranscriptWithTimestamps object Required
Committed transcription result with word-level timestamps.
OR
scribeError object Required
Error event during transcription.
OR
scribeAuthError object Required
Authentication error during transcription session.
OR
scribeQuotaExceededError object Required
Quota exceeded error during transcription session.
OR
scribeThrottledError object Required
Throttled error during transcription session.
OR
scribeUnacceptedTermsError object Required
Unaccepted terms error during transcription session.
OR
scribeRateLimitedError object Required
Rate limited error during transcription session.
OR
scribeQueueOverflowError object Required
Queue overflow error during transcription session.
OR
scribeResourceExhaustedError object Required
Resource exhausted error during transcription session.
OR
scribeSessionTimeLimitExceededError object Required
Session time limit exceeded error during transcription session.
OR
scribeInputError object Required
Input error during transcription session.
OR
scribeChunkSizeExceededError object Required
Chunk size exceeded error during transcription session.
OR
scribeInsufficientAudioActivityError object Required
Insufficient audio activity error during transcription session.
OR
scribeTranscriberError object Required
Transcriber error during transcription session.
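Since every received event is one of the variants listed above, a client can sort incoming frames into three groups: session start, transcripts, and errors. The sketch below assumes each frame is a JSON object with a `message_type` field and that the wire names are the snake_case forms of the schema names; only `committed_transcript_with_timestamps` is confirmed by this reference. Anything that is not a session or transcript event is treated as an error, following the list above.

```python
import json

# Assumed wire names for the three transcript variants.
TRANSCRIPT_TYPES = {
    "partial_transcript",
    "committed_transcript",
    "committed_transcript_with_timestamps",
}


def classify(raw_frame):
    """Return (kind, event) where kind is 'session', 'transcript' or 'error'."""
    event = json.loads(raw_frame)
    message_type = event.get("message_type", "")  # assumed discriminator field
    if message_type == "session_started":  # assumed wire name
        return "session", event
    if message_type in TRANSCRIPT_TYPES:
        return "transcript", event
    # Every remaining variant in the Receive list is an error event.
    return "error", event
```

A caller would typically replace the displayed text on each partial transcript and append to the final text on each committed one, since only committed results are guaranteed not to change.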