Speech to Text | ElevenLabs Documentation

Overview

The ElevenLabs Speech to Text (STT) API turns spoken audio into text with state of the art accuracy. Our Scribe v1 model adapts to textual cues across 99 languages and multiple voice styles and can be used to:

Transcribe podcasts, interviews, and other audio or video content
Generate transcripts for meetings and other audio or video recordings

Developer tutorial

Learn how to integrate speech to text into your application.

Product guide

Step-by-step guide for using speech to text in ElevenLabs.

Companies requiring HIPAA compliance must contact ElevenLabs Sales to sign a Business Associate Agreement (BAA) agreement. Please ensure this step is completed before proceeding with any HIPAA-related integrations or deployments.

State of the art accuracy

The Scribe v1 model is capable of transcribing audio from up to 32 speakers with high accuracy. Optionally it can also transcribe audio events like laughter, applause, and other non-speech sounds.

The transcribed output supports exact timestamps for each word and audio event, plus diarization to identify the speaker for each word.

The Scribe v1 model is best used for when high-accuracy transcription is required rather than real-time transcription. A low-latency, real-time version will be released soon.

Pricing

Developer API

Product interface pricing

Tier	Price/month	Hours included	Price per included hour	Price per additional hour
Free	$0	Unavailable	Unavailable	Unavailable
Starter	$5	12 hours 30 minutes	$0.40	Unavailable
Creator	$22	62 hours 51 minutes	$0.35	$0.48
Pro	$99	300 hours	$0.33	$0.40
Scale	$330	1,100 hours	$0.30	$0.33
Business	$1,320	6,000 hours	$0.22	$0.22

For reduced pricing at higher scale than 6,000 hours/month in addition to custom MSAs and DPAs, please contact sales.

Note: The free tier requires attribution and does not have commercial licensing.

Scribe has higher concurrency limits than other services from ElevenLabs. Please see other concurrency limits here

Plan	STT Concurrency Limit
Free	10
Starter	15
Creator	25
Pro	50
Scale	75
Business	75
Enterprise	Elevated

Examples

The following example shows the output of the Scribe v1 model for a sample audio file.

1 {
2   "language_code": "en",
3   "language_probability": 1,
4   "text": "With a soft and whispery American accent, I'm the ideal choice for creating ASMR content, meditative guides, or adding an intimate feel to your narrative projects.",
5   "words": [
6     {
7       "text": "With",
8       "start": 0.119,
9       "end": 0.259,
10       "type": "word",
11       "speaker_id": "speaker_0"
12     },
13     {
14       "text": " ",
15       "start": 0.239,
16       "end": 0.299,
17       "type": "spacing",
18       "speaker_id": "speaker_0"
19     },
20     {
21       "text": "a",
22       "start": 0.279,
23       "end": 0.359,
24       "type": "word",
25       "speaker_id": "speaker_0"
26     },
27     {
28       "text": " ",
29       "start": 0.339,
30       "end": 0.499,
31       "type": "spacing",
32       "speaker_id": "speaker_0"
33     },
34     {
35       "text": "soft",
36       "start": 0.479,
37       "end": 1.039,
38       "type": "word",
39       "speaker_id": "speaker_0"
40     },
41     {
42       "text": " ",
43       "start": 1.019,
44       "end": 1.2,
45       "type": "spacing",
46       "speaker_id": "speaker_0"
47     },
48     {
49       "text": "and",
50       "start": 1.18,
51       "end": 1.359,
52       "type": "word",
53       "speaker_id": "speaker_0"
54     },
55     {
56       "text": " ",
57       "start": 1.339,
58       "end": 1.44,
59       "type": "spacing",
60       "speaker_id": "speaker_0"
61     },
62     {
63       "text": "whispery",
64       "start": 1.419,
65       "end": 1.979,
66       "type": "word",
67       "speaker_id": "speaker_0"
68     },
69     {
70       "text": " ",
71       "start": 1.959,
72       "end": 2.179,
73       "type": "spacing",
74       "speaker_id": "speaker_0"
75     },
76     {
77       "text": "American",
78       "start": 2.159,
79       "end": 2.719,
80       "type": "word",
81       "speaker_id": "speaker_0"
82     },
83     {
84       "text": " ",
85       "start": 2.699,
86       "end": 2.779,
87       "type": "spacing",
88       "speaker_id": "speaker_0"
89     },
90     {
91       "text": "accent,",
92       "start": 2.759,
93       "end": 3.389,
94       "type": "word",
95       "speaker_id": "speaker_0"
96     },
97     {
98       "text": " ",
99       "start": 4.119,
100       "end": 4.179,
101       "type": "spacing",
102       "speaker_id": "speaker_0"
103     },
104     {
105       "text": "I'm",
106       "start": 4.159,
107       "end": 4.459,
108       "type": "word",
109       "speaker_id": "speaker_0"
110     },
111     {
112       "text": " ",
113       "start": 4.44,
114       "end": 4.52,
115       "type": "spacing",
116       "speaker_id": "speaker_0"
117     },
118     {
119       "text": "the",
120       "start": 4.5,
121       "end": 4.599,
122       "type": "word",
123       "speaker_id": "speaker_0"
124     },
125     {
126       "text": " ",
127       "start": 4.579,
128       "end": 4.699,
129       "type": "spacing",
130       "speaker_id": "speaker_0"
131     },
132     {
133       "text": "ideal",
134       "start": 4.679,
135       "end": 5.099,
136       "type": "word",
137       "speaker_id": "speaker_0"
138     },
139     {
140       "text": " ",
141       "start": 5.079,
142       "end": 5.219,
143       "type": "spacing",
144       "speaker_id": "speaker_0"
145     },
146     {
147       "text": "choice",
148       "start": 5.199,
149       "end": 5.719,
150       "type": "word",
151       "speaker_id": "speaker_0"
152     },
153     {
154       "text": " ",
155       "start": 5.699,
156       "end": 6.099,
157       "type": "spacing",
158       "speaker_id": "speaker_0"
159     },
160     {
161       "text": "for",
162       "start": 6.099,
163       "end": 6.199,
164       "type": "word",
165       "speaker_id": "speaker_0"
166     },
167     {
168       "text": " ",
169       "start": 6.179,
170       "end": 6.279,
171       "type": "spacing",
172       "speaker_id": "speaker_0"
173     },
174     {
175       "text": "creating",
176       "start": 6.259,
177       "end": 6.799,
178       "type": "word",
179       "speaker_id": "speaker_0"
180     },
181     {
182       "text": " ",
183       "start": 6.779,
184       "end": 6.979,
185       "type": "spacing",
186       "speaker_id": "speaker_0"
187     },
188     {
189       "text": "ASMR",
190       "start": 6.959,
191       "end": 7.739,
192       "type": "word",
193       "speaker_id": "speaker_0"
194     },
195     {
196       "text": " ",
197       "start": 7.719,
198       "end": 7.859,
199       "type": "spacing",
200       "speaker_id": "speaker_0"
201     },
202     {
203       "text": "content,",
204       "start": 7.839,
205       "end": 8.45,
206       "type": "word",
207       "speaker_id": "speaker_0"
208     },
209     {
210       "text": " ",
211       "start": 9,
212       "end": 9.06,
213       "type": "spacing",
214       "speaker_id": "speaker_0"
215     },
216     {
217       "text": "meditative",
218       "start": 9.04,
219       "end": 9.64,
220       "type": "word",
221       "speaker_id": "speaker_0"
222     },
223     {
224       "text": " ",
225       "start": 9.619,
226       "end": 9.699,
227       "type": "spacing",
228       "speaker_id": "speaker_0"
229     },
230     {
231       "text": "guides,",
232       "start": 9.679,
233       "end": 10.359,
234       "type": "word",
235       "speaker_id": "speaker_0"
236     },
237     {
238       "text": " ",
239       "start": 10.359,
240       "end": 10.409,
241       "type": "spacing",
242       "speaker_id": "speaker_0"
243     },
244     {
245       "text": "or",
246       "start": 11.319,
247       "end": 11.439,
248       "type": "word",
249       "speaker_id": "speaker_0"
250     },
251     {
252       "text": " ",
253       "start": 11.42,
254       "end": 11.52,
255       "type": "spacing",
256       "speaker_id": "speaker_0"
257     },
258     {
259       "text": "adding",
260       "start": 11.5,
261       "end": 11.879,
262       "type": "word",
263       "speaker_id": "speaker_0"
264     },
265     {
266       "text": " ",
267       "start": 11.859,
268       "end": 12,
269       "type": "spacing",
270       "speaker_id": "speaker_0"
271     },
272     {
273       "text": "an",
274       "start": 11.979,
275       "end": 12.079,
276       "type": "word",
277       "speaker_id": "speaker_0"
278     },
279     {
280       "text": " ",
281       "start": 12.059,
282       "end": 12.179,
283       "type": "spacing",
284       "speaker_id": "speaker_0"
285     },
286     {
287       "text": "intimate",
288       "start": 12.179,
289       "end": 12.579,
290       "type": "word",
291       "speaker_id": "speaker_0"
292     },
293     {
294       "text": " ",
295       "start": 12.559,
296       "end": 12.699,
297       "type": "spacing",
298       "speaker_id": "speaker_0"
299     },
300     {
301       "text": "feel",
302       "start": 12.679,
303       "end": 13.159,
304       "type": "word",
305       "speaker_id": "speaker_0"
306     },
307     {
308       "text": " ",
309       "start": 13.139,
310       "end": 13.179,
311       "type": "spacing",
312       "speaker_id": "speaker_0"
313     },
314     {
315       "text": "to",
316       "start": 13.159,
317       "end": 13.26,
318       "type": "word",
319       "speaker_id": "speaker_0"
320     },
321     {
322       "text": " ",
323       "start": 13.239,
324       "end": 13.3,
325       "type": "spacing",
326       "speaker_id": "speaker_0"
327     },
328     {
329       "text": "your",
330       "start": 13.299,
331       "end": 13.399,
332       "type": "word",
333       "speaker_id": "speaker_0"
334     },
335     {
336       "text": " ",
337       "start": 13.379,
338       "end": 13.479,
339       "type": "spacing",
340       "speaker_id": "speaker_0"
341     },
342     {
343       "text": "narrative",
344       "start": 13.479,
345       "end": 13.889,
346       "type": "word",
347       "speaker_id": "speaker_0"
348     },
349     {
350       "text": " ",
351       "start": 13.919,
352       "end": 13.939,
353       "type": "spacing",
354       "speaker_id": "speaker_0"
355     },
356     {
357       "text": "projects.",
358       "start": 13.919,
359       "end": 14.779,
360       "type": "word",
361       "speaker_id": "speaker_0"
362     }
363   ]
364 }

The output is classified in three category types:

word - A word in the language of the audio
spacing - The space between words, not applicable for languages that don’t use spaces like Japanese, Mandarin, Thai, Lao, Burmese and Cantonese
audio_event - Non-speech sounds like laughter or applause

Models

Scribe v1

State-of-the-art speech recognition model

Accurate transcription in 99 languages

Precise word-level timestamps

Speaker diarization

Dynamic audio tagging

Explore all

Concurrency and priority

Concurrency is the concept of how many requests can be processed at the same time.

For Speech to Text, files that are over 8 minutes long are transcribed in parallel internally in order to speed up processing. The audio is chunked into four segments to be transcribed concurrently.

You can calculate the concurrency limit with the following calculation:

Concurrency = \min(4, \text{round\_up}(\frac{\text{audio\_duration\_secs}}{480}))

For example, a 15 minute audio file will be transcribed with a concurrency of 2, while a 120 minute audio file will be transcribed with a concurrency of 4.

Supported languages

The Scribe v1 model supports 99 languages, including:

Afrikaans (afr), Amharic (amh), Arabic (ara), Armenian (hye), Assamese (asm), Asturian (ast), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Burmese (mya), Cantonese (yue), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Fulah (ful), Galician (glg), Ganda (lug), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Igbo (ibo), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kabuverdianu (kea), Kannada (kan), Kazakh (kaz), Khmer (khm), Korean (kor), Kurdish (kur), Kyrgyz (kir), Lao (lao), Latvian (lav), Lingala (lin), Lithuanian (lit), Luo (luo), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Maltese (mlt), Mandarin Chinese (zho), Māori (mri), Marathi (mar), Mongolian (mon), Nepali (nep), Northern Sotho (nso), Norwegian (nor), Occitan (oci), Odia (ori), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Shona (sna), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Tajik (tgk), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Umbundu (umb), Urdu (urd), Uzbek (uzb), Vietnamese (vie), Welsh (cym), Wolof (wol), Xhosa (xho) and Zulu (zul).

Breakdown of language support

Word Error Rate (WER) is a key metric used to evaluate the accuracy of transcription systems. It measures how many errors are present in a transcript compared to a reference transcript. Below is a breakdown of the WER for each language that Scribe v1 supports.

Excellent (≤ 5% WER)

Bulgarian (bul), Catalan (cat), Czech (ces), Danish (dan), Dutch (nld), English (eng), Finnish (fin), French (fra), Galician (glg), German (deu), Greek (ell), Hindi (hin), Indonesian (ind), Italian (ita), Japanese (jpn), Kannada (kan), Malay (msa), Malayalam (mal), Macedonian (mkd), Norwegian (nor), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Serbian (srp), Slovak (slk), Spanish (spa), Swedish (swe), Turkish (tur), Ukrainian (ukr) and Vietnamese (vie).

High Accuracy (>5% to ≤10% WER)

Bengali (ben), Belarusian (bel), Bosnian (bos), Cantonese (yue), Estonian (est), Filipino (fil), Gujarati (guj), Hungarian (hun), Kazakh (kaz), Latvian (lav), Lithuanian (lit), Mandarin (cmn), Marathi (mar), Nepali (nep), Odia (ori), Persian (fas), Slovenian (slv), Tamil (tam) and Telugu (tel)

Good (>10% to ≤25% WER)

Afrikaans (afr), Arabic (ara), Armenian (hye), Assamese (asm), Asturian (ast), Azerbaijani (aze), Burmese (mya), Cebuano (ceb), Croatian (hrv), Georgian (kat), Hausa (hau), Hebrew (heb), Icelandic (isl), Javanese (jav), Kabuverdianu (kea), Korean (kor), Kyrgyz (kir), Lingala (lin), Maltese (mlt), Mongolian (mon), Māori (mri), Occitan (oci), Punjabi (pan), Sindhi (snd), Swahili (swa), Tajik (tgk), Thai (tha), Urdu (urd), Uzbek (uzb) and Welsh (cym).

Moderate (>25% to ≤50% WER)

Amharic (amh), Chichewa (nya), Fulah (ful), Ganda (lug), Igbo (ibo), Irish (gle), Khmer (khm), Kurdish (kur), Lao (lao), Luxembourgish (ltz), Luo (luo), Northern Sotho (nso), Pashto (pus), Shona (sna), Somali (som), Umbundu (umb), Wolof (wol), Xhosa (xho) and Zulu (zul).

FAQ

Can I use speech to text with video files?

Yes, the API supports uploading both audio and video files for transcription.

What are the file size and duration limits?

Files up to 3 GB in size and up to 10 hours in duration are supported.

Which audio and video formats are supported?

The audio supported audio formats include:

audio/aac
audio/x-aac
audio/x-aiff
audio/ogg
audio/mpeg
audio/mp3
audio/mpeg3
audio/x-mpeg-3
audio/opus
audio/wav
audio/x-wav
audio/webm
audio/flac
audio/x-flac
audio/mp4
audio/aiff
audio/x-m4a

Supported video formats include:

video/mp4
video/x-msvideo
video/x-matroska
video/quicktime
video/x-ms-wmv
video/x-flv
video/webm
video/mpeg
video/3gpp

When will you support more languages?

ElevenLabs is constantly expanding the number of languages supported by our models. Please check back frequently for updates.

Does speech to text API support webhooks?

Yes, asynchronous transcription results can be sent to webhooks configured in webhook settings in the UI. Learn more in the webhooks cookbook.

Is a multichannel transcription mode supported?

Yes, the multichannel STT feature allows you to transcribe audio where each channel is processed independently and assigned a speaker ID based on its channel number. This feature supports up to 5 channels. Learn more in the multichannel transcription cookbook.