Multichannel speech-to-text
How-to guide · Assumes you have completed the Speech to Text quickstart.
The multichannel Speech to Text feature enables you to transcribe audio files where each channel contains a distinct speaker. This is particularly useful for recordings where speakers are isolated on separate audio channels, providing cleaner transcriptions without the need for speaker diarization.
Each channel is processed independently and automatically assigned a speaker ID based on its channel number (channel 0 → speaker_0, channel 1 → speaker_1, etc.). The system extracts individual channels from your input audio file and transcribes them in parallel, combining the results sorted by timestamp.
Ensure your audio file has speakers isolated on separate channels. The multichannel feature supports up to 5 channels, with each channel mapped to a specific speaker:
- speaker_0
- speaker_1
- speaker_2
- speaker_3
- speaker_4
When making a speech-to-text request, you must set:
- use_multi_channel: true
- diarize: false (multichannel mode handles speaker separation via channels)
The num_speakers parameter cannot be used with multichannel mode because the speaker count is determined automatically by the number of channels. Multichannel mode assumes there will be exactly one speaker per channel. If a channel contains more than one speaker, all of them are assigned that channel's speaker ID.
The API returns a different response format for multichannel audio:
If you set use_multi_channel: true but provide a single-channel (mono) audio file, you’ll
receive a standard single-channel response, not the multichannel format. The multichannel response
format is only returned when the audio file actually contains multiple channels.
Here’s a complete example of transcribing a stereo audio file with two speakers:
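A minimal sketch of such a request is shown here. Only `use_multi_channel` and `diarize` come from this guide. The `client.speech_to_text.convert` method, its `file` and `model_id` keyword arguments, and the helper name `transcribe_multichannel` are assumptions about the SDK in use, so check them against your client library before relying on them.

```python
from typing import Any, BinaryIO


def transcribe_multichannel(client: Any, audio: BinaryIO, model_id: str) -> Any:
    """Send one multichannel request with the two settings this guide requires.

    `client` is assumed (not confirmed by this guide) to expose
    `speech_to_text.convert(...)`; adapt the call to your SDK.
    """
    return client.speech_to_text.convert(
        file=audio,              # assumed parameter name for the audio stream
        model_id=model_id,       # assumed parameter name; model chosen by the caller
        use_multi_channel=True,  # required for multichannel mode
        diarize=False,           # must be false in multichannel mode
        # num_speakers is deliberately omitted: it is rejected in this mode
    )
```

A typical call would open the stereo file in binary mode (`open("call.wav", "rb")`, a placeholder filename) and pass the handle as `audio`.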
Generate a time-ordered conversation transcript from multichannel audio:
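One way to build that transcript is sketched below. The input shape is hypothetical: a list with one entry per channel, in channel order, each holding a `words` list whose items carry `start` (seconds) and `text`. The real response fields may differ, so treat the key names as assumptions and adjust them to what the API returns.

```python
from typing import Dict, List


def build_conversation(transcripts: List[Dict]) -> List[str]:
    """Merge per-channel word lists into speaker turns ordered by start time.

    Assumes transcripts[i] belongs to channel i and that each word is a dict
    with hypothetical keys "start" and "text".
    """
    words = []
    for channel_index, transcript in enumerate(transcripts):
        for word in transcript["words"]:
            # Speaker IDs follow the channel number: channel 0 -> speaker_0.
            words.append((word["start"], f"speaker_{channel_index}", word["text"]))
    words.sort(key=lambda w: w[0])  # order all words by timestamp

    turns: List[str] = []
    current_speaker = None
    for _, speaker, text in words:
        if speaker != current_speaker:
            # A change of speaker starts a new turn.
            turns.append(f"{speaker}: {text}")
            current_speaker = speaker
        else:
            turns[-1] += f" {text}"
    return turns
```

Grouping consecutive words from the same speaker into one turn keeps the output readable; overlapping speech will show up as short alternating turns.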
Multichannel transcription fully supports webhook delivery for asynchronous processing:
Error: Multichannel mode does not support diarization and assigns speakers based on the channel they speak on.
Solution: Always set diarize=false when using multichannel mode.
Error: Cannot specify num_speakers when use_multi_channel is enabled. The number of speakers
is automatically determined by the number of channels. Solution: Remove the num_speakers
parameter from your request.
Error: Multichannel mode supports up to 5 channels, but the audio file contains X channels.
Solution: Process only the first 5 channels or pre-process your audio to reduce channel count.
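These three errors can be caught locally before a request is sent. The sketch below is a hypothetical pre-flight check, not part of the API: the function name is invented, and it reads the channel count with Python's standard `wave` module, so it only applies to PCM WAV input.

```python
import wave
from typing import Optional

MAX_CHANNELS = 5  # limit stated in this guide


def check_multichannel_request(
    wav_path: str, diarize: bool, num_speakers: Optional[int]
) -> int:
    """Raise ValueError for the three documented error cases.

    Returns the channel count of the WAV file when the request looks valid.
    """
    if diarize:
        raise ValueError("set diarize=false when use_multi_channel is enabled")
    if num_speakers is not None:
        raise ValueError("remove num_speakers when use_multi_channel is enabled")
    with wave.open(wav_path, "rb") as wav:
        channels = wav.getnchannels()
    if channels > MAX_CHANNELS:
        raise ValueError(
            f"audio has {channels} channels; multichannel mode supports up to {MAX_CHANNELS}"
        )
    return channels
```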
For optimal results:
- Use a 16 kHz sample rate for better performance
- Remove silent or unused channels before processing
- Ensure each channel contains only one speaker
- Use lossless formats (WAV) when possible for best quality
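To find silent or unused channels before uploading, something like the following sketch can help. It is an illustration under stated assumptions: the input is a 16-bit PCM WAV file small enough to load into memory, the function name is hypothetical, and "silent" means the channel's peak sample magnitude does not exceed a threshold you choose.

```python
import array
import sys
import wave
from typing import List


def silent_channels(wav_path: str, threshold: int = 0) -> List[int]:
    """Return indices of channels whose peak amplitude is <= threshold.

    Handles 16-bit PCM WAV only and reads the whole file into memory.
    """
    with wave.open(wav_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        channels = wav.getnchannels()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Samples are interleaved: frame 0 ch 0, frame 0 ch 1, ..., frame 1 ch 0, ...
    return [
        ch
        for ch in range(channels)
        if max((abs(s) for s in samples[ch::channels]), default=0) <= threshold
    ]
```

A small non-zero threshold is usually more practical than 0, since idle channels often carry low-level noise rather than digital silence.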
The concurrency cost increases linearly with the number of channels. A 60-second 3-channel file has 3x the concurrency cost of a single-channel file.
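The linear relationship can be written as a one-line helper. This is only an illustration of the arithmetic; the function name and the idea of expressing cost in units of a single-channel request are assumptions made for the example.

```python
def concurrency_cost(channels: int, single_channel_cost: float = 1.0) -> float:
    """Cost scales linearly: N channels cost N times a single-channel request."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return channels * single_channel_cost
```

For the 3-channel file mentioned above, `concurrency_cost(3)` returns `3.0`, that is, three times the cost of a single-channel file of the same length.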
You can estimate the processing time for multichannel audio using the following formula:
Where:
Example: For a 60-second stereo file (2 channels):
For large multichannel files, consider streaming or chunking:
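A simple chunking approach is sketched below using Python's standard `wave` module. It assumes PCM WAV input; the function name and output naming scheme are hypothetical. Every chunk keeps all channels, so the channel-to-speaker mapping is unchanged.

```python
import wave
from typing import List


def split_wav(wav_path: str, chunk_seconds: float, out_prefix: str) -> List[str]:
    """Split a WAV file into fixed-length chunks that keep every channel."""
    paths: List[str] = []
    with wave.open(wav_path, "rb") as src:
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small")
        index = 0
        while True:
            # One frame holds one sample for each channel.
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

Timestamps in each chunk's transcript start from zero, so add `index * chunk_seconds` to them when stitching results back together. Cutting at fixed intervals can split a word across two chunks; cutting at pauses would avoid that but is beyond this sketch.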
If the audio file contains more than 5 channels, the API returns an error. You'll need to either select which 5 channels to send or mix down some channels before sending the file.
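Selecting a subset of channels can be done locally before upload. The sketch below assumes a 16-bit PCM WAV file that fits in memory; the function name is hypothetical. It copies only the chosen channels, in the order given, so the first kept channel becomes channel 0 (and therefore speaker_0) in the new file.

```python
import array
import wave
from typing import Sequence


def keep_channels(src_path: str, dst_path: str, keep: Sequence[int]) -> None:
    """Write a new WAV containing only the channels listed in `keep`."""
    if not keep:
        raise ValueError("keep must list at least one channel")
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        total = src.getnchannels()
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if any(ch < 0 or ch >= total for ch in keep):
        raise ValueError("channel index out of range")

    out = array.array("h")
    # Samples are interleaved, so each frame spans `total` consecutive values.
    for frame_start in range(0, len(samples), total):
        out.extend(samples[frame_start + ch] for ch in keep)

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(len(keep))
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(out.tobytes())
```

For example, `keep_channels("in.wav", "out.wav", [0, 2, 5])` (placeholder filenames) would reduce a 6-channel recording to 3 channels.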
You can send mono audio with use_multi_channel=true, but it's unnecessary: you'll receive
a standard single-channel response, not the multichannel format.
Speaker IDs are deterministic based on channel number: channel 0 becomes speaker_0, channel 1 becomes speaker_1, and so on.
Channels can contain different languages. Each channel is processed independently, and language detection happens per channel.