Multichannel speech-to-text | ElevenLabs Documentation

How-to guide · Assumes you have completed the Speech to Text quickstart.

Overview

The multichannel Speech to Text feature enables you to transcribe audio files where each channel contains a distinct speaker. This is particularly useful for recordings where speakers are isolated on separate audio channels, providing cleaner transcriptions without the need for speaker diarization.

Each channel is processed independently and automatically assigned a speaker ID based on its channel number (channel 0 → speaker_0, channel 1 → speaker_1, etc.). The system extracts individual channels from your input audio file and transcribes them in parallel. By default the API returns one transcript per channel; set multichannel_output_style=combined to instead receive a single transcript with all channels merged into one list sorted by start time, with each word tagged by its channel_index.

Common use cases

Stereo interview recordings - Interviewer on left channel, interviewee on right channel
Multi-track podcast recordings - Each participant recorded on a separate track
Call center recordings - Agent and customer separated on different channels
Conference recordings - Individual participants isolated on separate channels
Court proceedings - Multiple parties recorded on distinct channels

Requirements

An ElevenLabs account with an API key
A multichannel audio file (WAV, MP3, or another supported format)
Maximum 5 channels per audio file
Each channel should contain only one speaker

How it works

Prepare your multichannel audio

Ensure your audio file has speakers isolated on separate channels. The multichannel feature supports up to 5 channels, with each channel mapped to a specific speaker:

Channel 0 → speaker_0
Channel 1 → speaker_1
Channel 2 → speaker_2
Channel 3 → speaker_3
Channel 4 → speaker_4
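Before uploading, it can help to confirm how many channels a file actually has. The sketch below uses Python's standard-library wave module, so it only covers uncompressed WAV input. The helper names check_channel_count and count_wav_channels and the MAX_CHANNELS constant are illustrative and are not part of the ElevenLabs SDK.

```python
import wave

MAX_CHANNELS = 5  # limit stated in this guide


def count_wav_channels(path):
    """Return the number of channels in a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels()


def check_channel_count(path):
    """Raise ValueError if the file exceeds the multichannel limit."""
    channels = count_wav_channels(path)
    if channels > MAX_CHANNELS:
        raise ValueError(
            f"{path} has {channels} channels; "
            f"multichannel mode supports up to {MAX_CHANNELS}"
        )
    return channels
```

For compressed formats such as MP3, a decoding library would be needed to read the channel count.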

Configure API parameters

When making a speech-to-text request, you must set:

use_multi_channel: true
diarize: false (multichannel mode handles speaker separation via channels)

Optionally, control the response shape with:

multichannel_output_style: separate (default) returns one transcript per channel. combined merges all channels into a single transcript whose words are sorted by start time, each carrying a channel_index — matching the standard single-channel response shape. combined requires timestamps (timestamps_granularity must not be none) and is not supported with webhook delivery or entity detection/redaction.

The num_speakers parameter cannot be used with multichannel mode, as the speaker count is automatically determined by the number of channels. Multichannel mode assumes there will be exactly one speaker per channel. If a channel contains more than one speaker, all of them are assigned the same speaker ID.
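The parameter rules above can be checked client-side before a request is sent. This is a minimal sketch, assuming the request parameters are held in a plain dict. The function validate_multichannel_params is hypothetical, and the API still performs its own validation. The restriction on entity detection/redaction is left out because this guide does not name the corresponding parameters.

```python
def validate_multichannel_params(params):
    """Return a list of problems with a multichannel request's parameters."""
    problems = []
    if not params.get("use_multi_channel"):
        return problems  # the rules below only apply to multichannel requests

    if params.get("diarize"):
        problems.append("diarize must be false in multichannel mode")
    if params.get("num_speakers") is not None:
        problems.append("num_speakers cannot be combined with use_multi_channel")

    if params.get("multichannel_output_style", "separate") == "combined":
        # This guide states that word-level timestamps are the default.
        if params.get("timestamps_granularity", "word") == "none":
            problems.append("combined output requires timestamps")
        if params.get("webhook"):
            problems.append("combined output is not supported with webhook delivery")
    return problems
```

An empty list means none of the documented conflicts were found.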

Process the response

By default (multichannel_output_style=separate), multichannel audio returns a different response format than single-channel audio: a transcripts array with one entry per channel, rather than a single top-level text and words.

If you set use_multi_channel: true but provide a single-channel (mono) audio file, you’ll receive a standard single-channel response, not the multichannel format. The multichannel response format is only returned when the audio file actually contains multiple channels. The standard single-channel response looks like this:

{
  "language_code": "en",
  "language_probability": 0.98,
  "text": "Hello world",
  "words": [...]
}

With multichannel_output_style=combined, the response uses the same flat shape as a single-channel transcription (top-level text and words, no transcripts array), with all channels merged into one list sorted by start time. Every word includes a channel_index (and speaker_id) identifying its channel.
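To make the three cases concrete, the sketch below classifies a response that has already been parsed into a dict. The sample payloads are assumptions trimmed to the fields this guide's examples read (transcripts, channel_index, text, words). Any other fields the API returns are omitted, and the helper names are illustrative.

```python
def response_shape(payload):
    """Classify a parsed speech-to-text response.

    Returns "separate" for the per-channel format (a `transcripts` list),
    or "flat" for the single-channel / combined format (top-level `words`).
    """
    if "transcripts" in payload:
        return "separate"
    return "flat"


def channels_in(payload):
    """List the channel indices present in a response, in ascending order."""
    if response_shape(payload) == "separate":
        return sorted(t["channel_index"] for t in payload["transcripts"])
    # Flat responses only carry channel_index on words in combined mode.
    return sorted({
        w["channel_index"]
        for w in payload.get("words", [])
        if "channel_index" in w
    })


# Illustrative payloads (assumed shapes, not captured API output)
separate = {"transcripts": [
    {"channel_index": 0, "text": "Hello", "words": []},
    {"channel_index": 1, "text": "Hi there", "words": []},
]}
combined = {"text": "Hello Hi there", "words": [
    {"text": "Hello", "start": 0.0, "channel_index": 0},
    {"text": "Hi", "start": 0.6, "channel_index": 1},
]}
mono = {"text": "Hello world", "words": [{"text": "Hello", "start": 0.0}]}
```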

Implementation

Basic multichannel transcription

Here’s a complete example of transcribing a stereo audio file with two speakers:

from elevenlabs import ElevenLabs

elevenlabs = ElevenLabs(api_key="YOUR_API_KEY")

def transcribe_multichannel(audio_file_path):
    """Transcribe a multichannel file, returning one transcript per channel."""
    with open(audio_file_path, 'rb') as audio_file:
        result = elevenlabs.speech_to_text.convert(
            file=audio_file,
            model_id='scribe_v2',
            use_multi_channel=True,
            diarize=False,
            timestamps_granularity='word'
        )
    return result

# Process the response
result = transcribe_multichannel('stereo_interview.wav')

if hasattr(result, 'transcripts'):  # Multichannel response
    for transcript in result.transcripts:
        channel = transcript.channel_index
        text = transcript.text
        print(f"Channel {channel} (speaker_{channel}): {text}")
else:  # Single-channel response (mono fallback)
    print(f"Text: {result.text}")

Creating conversation transcripts

The easiest way to get a time-ordered, conversation-style transcript is to request multichannel_output_style=combined — the API returns a single words list, already sorted by start time, with a channel_index and speaker_id on each word:

Combined output (recommended)

from elevenlabs import ElevenLabs

elevenlabs = ElevenLabs(api_key="YOUR_API_KEY")

with open("stereo_interview.wav", "rb") as audio_file:
    result = elevenlabs.speech_to_text.convert(
        file=audio_file,
        model_id="scribe_v2",
        use_multi_channel=True,
        multichannel_output_style="combined",
        diarize=False,
        timestamps_granularity="word",
    )

# Words arrive already sorted by start time across all channels
for word in result.words:
    if word.type == "word":  # skip non-word entries
        print(f"speaker_{word.channel_index}: {word.text}")
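Printing one word per line is rarely the final format. As a sketch, a time-sorted word list can be folded into speaker turns with itertools.groupby. It assumes the words have been converted to plain dicts with text, start, and channel_index keys and already filtered to type "word". The SDK returns objects with attribute access, so a real integration would adapt the key lookups. The function name words_to_turns is illustrative.

```python
from itertools import groupby


def words_to_turns(words):
    """Group a time-sorted word list into consecutive per-channel turns."""
    turns = []
    # groupby only merges adjacent items, which is what a "turn" means here.
    for channel, group in groupby(words, key=lambda w: w["channel_index"]):
        group = list(group)
        turns.append({
            "speaker": f"speaker_{channel}",
            "start": group[0]["start"],
            "text": " ".join(w["text"] for w in group),
        })
    return turns


# Sample data standing in for a combined response's words
words = [
    {"text": "Hello", "start": 0.0, "channel_index": 0},
    {"text": "there", "start": 0.4, "channel_index": 0},
    {"text": "Hi", "start": 1.1, "channel_index": 1},
    {"text": "Thanks", "start": 1.8, "channel_index": 0},
]
for turn in words_to_turns(words):
    print(f"[{turn['start']:.1f}s] {turn['speaker']}: {turn['text']}")
```

Because the input is sorted by start time, overlapping speech shows up as short alternating turns rather than being merged.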

If you’re using the default separate output, you can merge the per-channel transcripts client-side instead:

def create_conversation_transcript(multichannel_result):
    """Create a conversation-style transcript with speaker labels"""
    all_words = []

    if hasattr(multichannel_result, 'transcripts'):
        # Collect all words from all channels
        for transcript in multichannel_result.transcripts:
            for word in transcript.words or []:
                if word.type == 'word':
                    all_words.append({
                        'text': word.text,
                        'start': word.start,
                        'speaker_id': word.speaker_id,
                        'channel': transcript.channel_index
                    })

    # Sort by timestamp
    all_words.sort(key=lambda w: w['start'])

    # Group consecutive words by speaker
    conversation = []
    current_speaker = None
    current_text = []

    for word in all_words:
        if word['speaker_id'] != current_speaker:
            if current_text:
                conversation.append({
                    'speaker': current_speaker,
                    'text': ' '.join(current_text)
                })
            current_speaker = word['speaker_id']
            current_text = [word['text']]
        else:
            current_text.append(word['text'])

    # Add the last segment
    if current_text:
        conversation.append({
            'speaker': current_speaker,
            'text': ' '.join(current_text)
        })

    return conversation

# Format the output
conversation = create_conversation_transcript(result)
for turn in conversation:
    print(f"{turn['speaker']}: {turn['text']}")

Using webhooks with multichannel

Multichannel transcription supports webhook delivery for asynchronous processing. Webhooks return the separate (per-channel) format. multichannel_output_style=combined is not currently supported with webhook delivery; use a synchronous request, or merge the per-channel webhook payload client-side.

The following example requests webhook delivery:

from elevenlabs import ElevenLabs

elevenlabs = ElevenLabs(api_key="YOUR_API_KEY")

async def transcribe_multichannel_with_webhook(audio_file_path):
    with open(audio_file_path, 'rb') as audio_file:
        result = await elevenlabs.speech_to_text.convert_async(
            file=audio_file,
            model_id='scribe_v2',
            use_multi_channel=True,
            diarize=False,
            webhook=True  # Enable webhook delivery
        )

    print(f"Transcription started with task ID: {result.task_id}")
    return result.task_id

Error handling

Common validation errors

Setting diarize=true with multichannel mode

Error: Multichannel mode does not support diarization and assigns speakers based on the channel they speak on.

Solution: Always set diarize=false when using multichannel mode.

Providing num_speakers parameter

Error: Cannot specify num_speakers when use_multi_channel is enabled. The number of speakers is automatically determined by the number of channels.

Solution: Remove the num_speakers parameter from your request.

Audio file with more than 5 channels

Error: Multichannel mode supports up to 5 channels, but the audio file contains X channels.

Solution: Process only the first 5 channels or pre-process your audio to reduce channel count.
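One way to pre-process a file with too many channels is to keep only the first few. The sketch below does this for uncompressed PCM WAV files using only the standard library. The function name keep_first_channels is illustrative, and the whole file is read into memory, so it suits modest file sizes.

```python
import wave


def keep_first_channels(src_path, dst_path, keep=5):
    """Write a copy of a PCM WAV file containing only its first `keep` channels."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    keep = min(keep, n_channels)
    frame_size = n_channels * sample_width  # bytes per interleaved frame
    kept_size = keep * sample_width         # bytes to keep from each frame

    # Samples are interleaved per frame, so the first `keep` channels are
    # the first `kept_size` bytes of every frame.
    out = bytearray()
    for offset in range(0, len(frames), frame_size):
        out += frames[offset:offset + kept_size]

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(keep)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(bytes(out))
    return keep
```

Dropping channels discards whatever was recorded on them, so choose which channels to keep based on where the speakers actually are.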

Using combined output without timestamps

Error: multichannel_output_style='combined' requires timestamps; set timestamps_granularity to 'word' or 'character'.

Solution: Combined output sorts words by time, so set timestamps_granularity to word (the default) or character.

Using combined output with webhooks

Error: multichannel_output_style='combined' is not yet supported with webhook delivery.

Solution: Use a synchronous request with combined, or keep the default separate output when using webhooks and merge client-side.

Best practices

Audio preparation

For optimal results:

Use a 16 kHz sample rate for better performance
Remove silent or unused channels before processing
Ensure each channel contains only one speaker
Use lossless formats (WAV) when possible for best quality
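Silent channels can be spotted before upload by checking each channel's peak amplitude. The following is a rough sketch for 16-bit PCM WAV files using only the standard library. The helper names and the threshold default are assumptions to tune for your recordings, since real "silent" tracks often contain low-level noise rather than exact zeros.

```python
import array
import sys
import wave


def channel_peaks(path):
    """Return the peak absolute sample value for each channel of a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = wav.getnchannels()
        samples = array.array("h", wav.readframes(wav.getnframes()))

    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved, so channel c is every n_channels-th sample.
    return [
        max((abs(s) for s in samples[c::n_channels]), default=0)
        for c in range(n_channels)
    ]


def silent_channels(path, threshold=0):
    """List channel indices whose peak never exceeds `threshold`."""
    return [c for c, peak in enumerate(channel_peaks(path)) if peak <= threshold]
```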

Performance optimization

The concurrency cost increases linearly with the number of channels. A 60-second 3-channel file has 3x the concurrency cost of a single-channel file.

You can estimate the processing time for multichannel audio using the following formula:

$$\text{Processing Time} = (D \cdot 0.3) + 2 + (N \cdot 0.5)$$

Where:

$D$ = file duration in seconds
$N$ = number of channels
$0.3$ = processing speed factor (approximately 30% of real-time)
$2$ = fixed overhead in seconds
$0.5$ = per-channel overhead in seconds

Example: For a 60-second stereo file (2 channels):

$$\text{Processing Time} = (60 \cdot 0.3) + 2 + (2 \cdot 0.5) = 18 + 2 + 1 = 21\ \text{seconds}$$
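The same estimate is easy to compute in code, for example to set client timeouts. This is a direct transcription of the formula above. The function name is illustrative, and the result is an approximation rather than a guarantee.

```python
def estimate_processing_seconds(duration_s, n_channels):
    """Estimate processing time using the formula 0.3 * D + 2 + 0.5 * N."""
    return duration_s * 0.3 + 2 + n_channels * 0.5


# 60-second stereo file, matching the worked example (about 21 seconds)
print(round(estimate_processing_seconds(60, 2), 1))

# 10-minute file with the maximum of 5 channels
print(round(estimate_processing_seconds(600, 5), 1))
```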

Memory considerations

For large multichannel files, consider streaming or chunking:

import os

from elevenlabs import ElevenLabs
from pydub import AudioSegment

def process_large_multichannel_file(file_path, chunk_duration=300):
    """Process a large file in chunks of chunk_duration seconds (default: 5 minutes)"""
    elevenlabs = ElevenLabs(api_key="YOUR_API_KEY")
    audio = AudioSegment.from_file(file_path)
    duration_ms = len(audio)
    chunk_size_ms = chunk_duration * 1000

    all_transcripts = []

    for start_ms in range(0, duration_ms, chunk_size_ms):
        end_ms = min(start_ms + chunk_size_ms, duration_ms)

        # Extract the chunk and write it to a temporary WAV file
        chunk = audio[start_ms:end_ms]
        chunk_file = f"temp_chunk_{start_ms}.wav"
        chunk.export(chunk_file, format="wav")

        try:
            # Transcribe chunk using SDK
            with open(chunk_file, 'rb') as audio_file:
                result = elevenlabs.speech_to_text.convert(
                    file=audio_file,
                    model_id='scribe_v2',
                    use_multi_channel=True,
                    diarize=False,
                    timestamps_granularity='word'
                )
        finally:
            # Clean up the temporary file even if the request fails
            os.remove(chunk_file)

        # Shift timestamps from chunk time to file time
        offset_s = start_ms / 1000
        if hasattr(result, 'transcripts'):
            for transcript in result.transcripts:
                for word in transcript.words or []:
                    word.start += offset_s
                    word.end += offset_s
            all_transcripts.extend(result.transcripts)

    return all_transcripts
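The function above returns a flat list with one transcript per chunk per channel. A possible follow-up step is to regroup that list by channel, sketched below. It assumes each item exposes channel_index, text, and words as in the earlier examples. The function merge_chunks_by_channel is hypothetical, and SimpleNamespace objects stand in for SDK results.

```python
from collections import defaultdict
from types import SimpleNamespace


def merge_chunks_by_channel(chunk_transcripts):
    """Combine per-chunk transcripts into one text and word list per channel.

    Items must be in chunk order, with word timestamps already shifted
    to file time.
    """
    texts = defaultdict(list)
    words = defaultdict(list)
    for transcript in chunk_transcripts:
        texts[transcript.channel_index].append(transcript.text)
        words[transcript.channel_index].extend(transcript.words or [])

    return {
        channel: {"text": " ".join(texts[channel]), "words": words[channel]}
        for channel in sorted(texts)
    }


# Stand-in objects for two chunks of a stereo file
chunks = [
    SimpleNamespace(channel_index=0, text="Hello", words=[]),
    SimpleNamespace(channel_index=1, text="Hi", words=[]),
    SimpleNamespace(channel_index=0, text="how are you", words=[]),
    SimpleNamespace(channel_index=1, text="fine thanks", words=[]),
]
merged = merge_chunks_by_channel(chunks)
```

Fixed-length chunking can cut a word in half at a boundary, so text near chunk edges may need review.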

FAQ

What happens if my audio has more than 5 channels?

The API will return an error. Either choose which 5 channels to send, or mix some channels down before uploading.

Can I process mono audio with multichannel mode?

Yes, but it’s unnecessary. If you send mono audio with use_multi_channel=true, you’ll receive a standard single-channel response, not the multichannel format.

Can I get one combined transcript instead of separate per-channel transcripts?

Yes. Set multichannel_output_style=combined to receive a single transcript with all channels merged and sorted by start time, each word tagged with its channel_index. This matches the standard single-channel response shape. It requires timestamps and isn’t available with webhook delivery.

How are speaker IDs assigned?

Speaker IDs are deterministic based on channel number: channel 0 becomes speaker_0, channel 1 becomes speaker_1, and so on.

Can channels have different languages?

Yes, each channel is processed independently and can detect different languages. The language detection happens per channel. With multichannel_output_style=combined, the top-level language_code reflects the most confident channel, while each word still carries its channel_index.

Next steps

API reference

Full Speech to Text API reference and parameters.

Webhooks

Receive transcription results asynchronously via webhook.