Transcription

Learn how to turn spoken audio into text with ElevenLabs.

Overview

The ElevenLabs Speech to Text (STT) API turns spoken audio into text with state-of-the-art accuracy. Our Scribe v1 model adapts to textual cues across 99 languages and multiple voice styles. To try a live demo, visit our Speech to Text showcase page.

Companies requiring HIPAA compliance must contact ElevenLabs Sales to sign a Business Associate Agreement (BAA). Please ensure this step is completed before proceeding with any HIPAA-related integrations or deployments.
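As a quick illustration, the sketch below posts an audio file to the Speech to Text endpoint from Python using the requests library. It assumes the multipart field names (file, model_id) and the xi-api-key header from the public API reference; the file path and API key are placeholders.

import requests

# Minimal sketch: send an audio file to the Speech to Text endpoint.
# YOUR_API_KEY and sample_audio.mp3 are placeholders.
response = requests.post(
    "https://api.elevenlabs.io/v1/speech-to-text",
    headers={"xi-api-key": "YOUR_API_KEY"},
    data={"model_id": "scribe_v1"},
    files={"file": open("sample_audio.mp3", "rb")},
)
response.raise_for_status()
transcript = response.json()
print(transcript["text"])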

Example API response

The following example shows the output of the Speech to Text API using the Scribe v1 model for a sample audio file.

{
  "language_code": "en",
  "language_probability": 1,
  "text": "With a soft and whispery American accent, I'm the ideal choice for creating ASMR content, meditative guides, or adding an intimate feel to your narrative projects.",
  "words": [
    {
      "text": "With",
      "start": 0.119,
      "end": 0.259,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.239,
      "end": 0.299,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "a",
      "start": 0.279,
      "end": 0.359,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.339,
      "end": 0.499,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "soft",
      "start": 0.479,
      "end": 1.039,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 1.019,
      "end": 1.2,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "and",
      "start": 1.18,
      "end": 1.359,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 1.339,
      "end": 1.44,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "whispery",
      "start": 1.419,
      "end": 1.979,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 1.959,
      "end": 2.179,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "American",
      "start": 2.159,
      "end": 2.719,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 2.699,
      "end": 2.779,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "accent,",
      "start": 2.759,
      "end": 3.389,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 4.119,
      "end": 4.179,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "I'm",
      "start": 4.159,
      "end": 4.459,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 4.44,
      "end": 4.52,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "the",
      "start": 4.5,
      "end": 4.599,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 4.579,
      "end": 4.699,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "ideal",
      "start": 4.679,
      "end": 5.099,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 5.079,
      "end": 5.219,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "choice",
      "start": 5.199,
      "end": 5.719,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 5.699,
      "end": 6.099,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "for",
      "start": 6.099,
      "end": 6.199,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 6.179,
      "end": 6.279,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "creating",
      "start": 6.259,
      "end": 6.799,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 6.779,
      "end": 6.979,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "ASMR",
      "start": 6.959,
      "end": 7.739,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 7.719,
      "end": 7.859,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "content,",
      "start": 7.839,
      "end": 8.45,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 9,
      "end": 9.06,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "meditative",
      "start": 9.04,
      "end": 9.64,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 9.619,
      "end": 9.699,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "guides,",
      "start": 9.679,
      "end": 10.359,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 10.359,
      "end": 10.409,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "or",
      "start": 11.319,
      "end": 11.439,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 11.42,
      "end": 11.52,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "adding",
      "start": 11.5,
      "end": 11.879,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 11.859,
      "end": 12,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "an",
      "start": 11.979,
      "end": 12.079,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 12.059,
      "end": 12.179,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "intimate",
      "start": 12.179,
      "end": 12.579,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 12.559,
      "end": 12.699,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "feel",
      "start": 12.679,
      "end": 13.159,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 13.139,
      "end": 13.179,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "to",
      "start": 13.159,
      "end": 13.26,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 13.239,
      "end": 13.3,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "your",
      "start": 13.299,
      "end": 13.399,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 13.379,
      "end": 13.479,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "narrative",
      "start": 13.479,
      "end": 13.889,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 13.919,
      "end": 13.939,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "projects.",
      "start": 13.919,
      "end": 14.779,
      "type": "word",
      "speaker_id": "speaker_0"
    }
  ]
}

The output is classified into three category types:

  • word - A word in the language of the audio
  • spacing - The space between words; not applicable for languages that don't use spaces, such as Japanese, Mandarin, Thai, Lao, Burmese and Cantonese
  • audio_event - Non-speech sounds like laughter or applause
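For example, here is a short Python sketch (names ours) that walks the words array of a parsed response like the one above, skipping spacing entries and printing word-level timestamps per speaker:

# Walk the `words` array of a parsed Speech to Text response,
# keeping only entries of type "word".
def print_word_timings(response: dict) -> None:
    for item in response["words"]:
        if item["type"] == "word":
            print(f'{item["start"]:7.3f}s-{item["end"]:7.3f}s  '
                  f'{item["speaker_id"]}  {item["text"]}')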

Concurrency and priority

Concurrency refers to the number of requests that can be processed at the same time.

For Speech to Text, files longer than 8 minutes are transcribed in parallel internally to speed up processing: the audio is chunked into segments of up to 8 minutes each, and up to four segments are transcribed concurrently.

You can calculate the concurrency used for a given file as follows:

Concurrency = min(4, round_up(audio_duration_secs / 480))

For example, a 15-minute audio file will be transcribed with a concurrency of 2, while a 120-minute audio file will be transcribed with a concurrency of 4.

The above calculation is only applicable to Scribe v1. For Scribe v2 Realtime, see the concurrency limit chart.
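A minimal Python sketch of this calculation (the helper name is ours):

import math

# One chunk per started 8-minute (480 s) segment, capped at 4.
def scribe_v1_concurrency(audio_duration_secs: float) -> int:
    return min(4, math.ceil(audio_duration_secs / 480))

assert scribe_v1_concurrency(15 * 60) == 2   # 15-minute file
assert scribe_v1_concurrency(120 * 60) == 4  # 120-minute file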

Supported languages

The Scribe v1 model supports the following 99 languages:

Afrikaans (afr), Amharic (amh), Arabic (ara), Armenian (hye), Assamese (asm), Asturian (ast), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Burmese (mya), Cantonese (yue), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Fulah (ful), Galician (glg), Ganda (lug), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Igbo (ibo), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kabuverdianu (kea), Kannada (kan), Kazakh (kaz), Khmer (khm), Korean (kor), Kurdish (kur), Kyrgyz (kir), Lao (lao), Latvian (lav), Lingala (lin), Lithuanian (lit), Luo (luo), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Maltese (mlt), Mandarin Chinese (zho), Māori (mri), Marathi (mar), Mongolian (mon), Nepali (nep), Northern Sotho (nso), Norwegian (nor), Occitan (oci), Odia (ori), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Shona (sna), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Tajik (tgk), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Umbundu (umb), Urdu (urd), Uzbek (uzb), Vietnamese (vie), Welsh (cym), Wolof (wol), Xhosa (xho) and Zulu (zul).

Breakdown of language support

Word Error Rate (WER) is a key metric used to evaluate the accuracy of transcription systems. It measures how many errors are present in a transcript compared to a reference transcript. Below, the languages Scribe v1 supports are grouped into accuracy tiers, from lowest to highest WER.

Excellent accuracy: Bulgarian (bul), Catalan (cat), Czech (ces), Danish (dan), Dutch (nld), English (eng), Finnish (fin), French (fra), Galician (glg), German (deu), Greek (ell), Hindi (hin), Indonesian (ind), Italian (ita), Japanese (jpn), Kannada (kan), Malay (msa), Malayalam (mal), Macedonian (mkd), Norwegian (nor), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Serbian (srp), Slovak (slk), Spanish (spa), Swedish (swe), Turkish (tur), Ukrainian (ukr) and Vietnamese (vie).

High accuracy: Bengali (ben), Belarusian (bel), Bosnian (bos), Cantonese (yue), Estonian (est), Filipino (fil), Gujarati (guj), Hungarian (hun), Kazakh (kaz), Latvian (lav), Lithuanian (lit), Mandarin (cmn), Marathi (mar), Nepali (nep), Odia (ori), Persian (fas), Slovenian (slv), Tamil (tam) and Telugu (tel).

Good accuracy: Afrikaans (afr), Arabic (ara), Armenian (hye), Assamese (asm), Asturian (ast), Azerbaijani (aze), Burmese (mya), Cebuano (ceb), Croatian (hrv), Georgian (kat), Hausa (hau), Hebrew (heb), Icelandic (isl), Javanese (jav), Kabuverdianu (kea), Korean (kor), Kyrgyz (kir), Lingala (lin), Maltese (mlt), Mongolian (mon), Māori (mri), Occitan (oci), Punjabi (pan), Sindhi (snd), Swahili (swa), Tajik (tgk), Thai (tha), Urdu (urd), Uzbek (uzb) and Welsh (cym).

Moderate accuracy: Amharic (amh), Chichewa (nya), Fulah (ful), Ganda (lug), Igbo (ibo), Irish (gle), Khmer (khm), Kurdish (kur), Lao (lao), Luxembourgish (ltz), Luo (luo), Northern Sotho (nso), Pashto (pus), Shona (sna), Somali (som), Umbundu (umb), Wolof (wol), Xhosa (xho) and Zulu (zul).

FAQ

Can I transcribe both audio and video files?

Yes, the API supports uploading both audio and video files for transcription.

What are the maximum file size and duration?

Files up to 3 GB in size and up to 10 hours in duration are supported.

Which file formats are supported?

The API supports the following audio formats:

  • audio/aac
  • audio/x-aac
  • audio/x-aiff
  • audio/ogg
  • audio/mpeg
  • audio/mp3
  • audio/mpeg3
  • audio/x-mpeg-3
  • audio/opus
  • audio/wav
  • audio/x-wav
  • audio/webm
  • audio/flac
  • audio/x-flac
  • audio/mp4
  • audio/aiff
  • audio/x-m4a

Supported video formats include:

  • video/mp4
  • video/x-msvideo
  • video/x-matroska
  • video/quicktime
  • video/x-ms-wmv
  • video/x-flv
  • video/webm
  • video/mpeg
  • video/3gpp
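For illustration, here is a small client-side pre-upload check built from the two lists above and the 3 GB size limit mentioned earlier. Note that Python's mimetypes module only guesses the type from the file extension; the API performs the authoritative validation.

import mimetypes
import os

SUPPORTED_AUDIO = {
    "audio/aac", "audio/x-aac", "audio/x-aiff", "audio/ogg", "audio/mpeg",
    "audio/mp3", "audio/mpeg3", "audio/x-mpeg-3", "audio/opus", "audio/wav",
    "audio/x-wav", "audio/webm", "audio/flac", "audio/x-flac", "audio/mp4",
    "audio/aiff", "audio/x-m4a",
}
SUPPORTED_VIDEO = {
    "video/mp4", "video/x-msvideo", "video/x-matroska", "video/quicktime",
    "video/x-ms-wmv", "video/x-flv", "video/webm", "video/mpeg", "video/3gpp",
}
MAX_BYTES = 3 * 1024**3  # 3 GB file size limit

def check_upload(path: str) -> None:
    # Guess the MIME type from the extension and compare against the lists above.
    mime, _ = mimetypes.guess_type(path)
    if mime not in SUPPORTED_AUDIO | SUPPORTED_VIDEO:
        raise ValueError(f"unsupported or unrecognized format: {mime}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 3 GB size limit")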

Will more languages be supported?

ElevenLabs is constantly expanding the number of languages supported by our models. Please check back frequently for updates.

Can transcription results be delivered to a webhook?

Yes, asynchronous transcription results can be sent to webhooks configured in webhook settings in the UI. Learn more in the webhooks cookbook.

Can I transcribe multichannel audio?

Yes, the multichannel STT feature allows you to transcribe audio where each channel is processed independently and assigned a speaker ID based on its channel number. This feature supports up to 5 channels. Learn more in the multichannel transcription cookbook.

How is speech to text billed?

ElevenLabs charges for speech to text based on the duration of the audio sent for transcription. Billing is calculated per hour of audio, with rates varying by tier and model. See the API pricing page for detailed pricing information.