Speech to Text

Learn how to turn spoken audio into text with ElevenLabs.

Overview

The ElevenLabs Speech to Text (STT) API turns spoken audio into text with state-of-the-art accuracy. Our Scribe v1 model adapts to textual cues across 99 languages and multiple voice styles and can be used to:

  • Transcribe podcasts, interviews, and other audio or video content
  • Generate transcripts for meetings and other audio or video recordings

Companies requiring HIPAA compliance must contact ElevenLabs Sales to sign a Business Associate Agreement (BAA). Please ensure this step is completed before proceeding with any HIPAA-related integrations or deployments.

State-of-the-art accuracy

The Scribe v1 model can accurately transcribe audio containing up to 32 speakers. Optionally, it can also transcribe audio events such as laughter, applause, and other non-speech sounds.

The transcribed output supports exact timestamps for each word and audio event, plus diarization to identify the speaker for each word.

The Scribe v1 model is best suited to cases where high-accuracy transcription matters more than real-time output. A low-latency, real-time version will be released soon.

Pricing

Scribe is free in the web app until April 9th. Pricing via the API is as follows:

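Tier     | Price  | Hours Included | Cost per Hour
---------|--------|----------------|--------------
Free     | $0     | 2.5            | Unavailable
Starter  | $5     | 12.5           | $0.4
Creator  | $22    | 63             | $0.35
Pro      | $99    | 320            | $0.31
Scale    | $330   | 1220           | $0.27
Business | $1,320 | 6000           | $0.22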

If you need more than 6,000 hours per month, please contact sales.
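The cost-per-hour figures in the pricing table appear to be each tier's price divided by its included hours, rounded to two decimals. That relationship is inferred from the numbers, not stated by ElevenLabs. The short sketch below checks it; the `TIERS` dictionary and the `cost_per_hour` helper are illustrative names, not part of any ElevenLabs API.

```python
# Price (USD) and included hours per tier, copied from the pricing table.
# The Free tier is omitted because its cost per hour is listed as "Unavailable".
TIERS = {
    "Starter": (5, 12.5),
    "Creator": (22, 63),
    "Pro": (99, 320),
    "Scale": (330, 1220),
    "Business": (1320, 6000),
}


def cost_per_hour(price: float, hours_included: float) -> float:
    """Return price divided by included hours, rounded to cents."""
    return round(price / hours_included, 2)


if __name__ == "__main__":
    for tier, (price, hours) in TIERS.items():
        print(f"{tier}: ${cost_per_hour(price, hours)} per hour")
```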

Examples

The following example shows the output of the Scribe v1 model for a sample audio file.

{
  "language_code": "en",
  "language_probability": 1,
  "text": "With a soft and whispery American accent, I'm the ideal choice for creating ASMR content, meditative guides, or adding an intimate feel to your narrative projects.",
  "words": [
    {
      "text": "With",
      "start": 0.119,
      "end": 0.259,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.239,
      "end": 0.299,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "a",
      "start": 0.279,
      "end": 0.359,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 0.339,
      "end": 0.499,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "soft",
      "start": 0.479,
      "end": 1.039,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 1.019,
      "end": 1.2,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "and",
      "start": 1.18,
      "end": 1.359,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 1.339,
      "end": 1.44,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "whispery",
      "start": 1.419,
      "end": 1.979,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 1.959,
      "end": 2.179,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "American",
      "start": 2.159,
      "end": 2.719,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 2.699,
      "end": 2.779,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "accent,",
      "start": 2.759,
      "end": 3.389,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 4.119,
      "end": 4.179,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "I'm",
      "start": 4.159,
      "end": 4.459,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 4.44,
      "end": 4.52,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "the",
      "start": 4.5,
      "end": 4.599,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 4.579,
      "end": 4.699,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "ideal",
      "start": 4.679,
      "end": 5.099,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 5.079,
      "end": 5.219,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "choice",
      "start": 5.199,
      "end": 5.719,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 5.699,
      "end": 6.099,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "for",
      "start": 6.099,
      "end": 6.199,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 6.179,
      "end": 6.279,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "creating",
      "start": 6.259,
      "end": 6.799,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 6.779,
      "end": 6.979,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "ASMR",
      "start": 6.959,
      "end": 7.739,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 7.719,
      "end": 7.859,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "content,",
      "start": 7.839,
      "end": 8.45,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 9,
      "end": 9.06,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "meditative",
      "start": 9.04,
      "end": 9.64,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 9.619,
      "end": 9.699,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "guides,",
      "start": 9.679,
      "end": 10.359,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 10.359,
      "end": 10.409,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "or",
      "start": 11.319,
      "end": 11.439,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 11.42,
      "end": 11.52,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "adding",
      "start": 11.5,
      "end": 11.879,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 11.859,
      "end": 12,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "an",
      "start": 11.979,
      "end": 12.079,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 12.059,
      "end": 12.179,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "intimate",
      "start": 12.179,
      "end": 12.579,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 12.559,
      "end": 12.699,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "feel",
      "start": 12.679,
      "end": 13.159,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 13.139,
      "end": 13.179,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "to",
      "start": 13.159,
      "end": 13.26,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 13.239,
      "end": 13.3,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "your",
      "start": 13.299,
      "end": 13.399,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 13.379,
      "end": 13.479,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "narrative",
      "start": 13.479,
      "end": 13.889,
      "type": "word",
      "speaker_id": "speaker_0"
    },
    {
      "text": " ",
      "start": 13.919,
      "end": 13.939,
      "type": "spacing",
      "speaker_id": "speaker_0"
    },
    {
      "text": "projects.",
      "start": 13.919,
      "end": 14.779,
      "type": "word",
      "speaker_id": "speaker_0"
    }
  ]
}
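Word-level timestamps make it possible to split a transcript into phrases at pauses, which is a common first step for subtitles. The following is a minimal sketch that works on the `words` array of a response shaped like the example above. The `split_on_pauses` helper and the 0.5 second threshold are illustrative assumptions, and joining words with a single space only suits languages that separate words with spaces.

```python
from typing import Dict, List

# A few entries copied from the example response.
SAMPLE_WORDS: List[Dict] = [
    {"text": "American", "start": 2.159, "end": 2.719, "type": "word"},
    {"text": " ", "start": 2.699, "end": 2.779, "type": "spacing"},
    {"text": "accent,", "start": 2.759, "end": 3.389, "type": "word"},
    {"text": " ", "start": 4.119, "end": 4.179, "type": "spacing"},
    {"text": "I'm", "start": 4.159, "end": 4.459, "type": "word"},
    {"text": " ", "start": 4.44, "end": 4.52, "type": "spacing"},
    {"text": "the", "start": 4.5, "end": 4.599, "type": "word"},
]


def split_on_pauses(words: List[Dict], min_pause: float = 0.5) -> List[Dict]:
    """Group "word" entries into phrases.

    A new phrase starts whenever the gap between the previous word's end
    and the next word's start is longer than min_pause seconds.
    """
    phrases: List[List[Dict]] = []
    current: List[Dict] = []
    previous_end = None
    for entry in words:
        if entry["type"] != "word":
            continue  # ignore spacing and audio_event entries
        if previous_end is not None and entry["start"] - previous_end > min_pause:
            phrases.append(current)
            current = []
        current.append(entry)
        previous_end = entry["end"]
    if current:
        phrases.append(current)
    return [
        {
            "start": phrase[0]["start"],
            "end": phrase[-1]["end"],
            "text": " ".join(word["text"] for word in phrase),
        }
        for phrase in phrases
    ]


if __name__ == "__main__":
    for phrase in split_on_pauses(SAMPLE_WORDS):
        print(phrase)
```

In the sample, the gap between "accent," (ending at 3.389) and "I'm" (starting at 4.159) exceeds the threshold, so the words fall into two phrases.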

Each entry in the words array has one of three types:

  • word - A word in the language of the audio
  • spacing - The space between words. Not applicable to languages that don’t use spaces, such as Japanese, Mandarin, Thai, Lao, Burmese, and Cantonese.
  • audio_event - Non-speech sounds like laughter or applause. Also includes verbatim noises like “um” or “ah”.
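The `type` and `speaker_id` fields can be combined to build a per-speaker transcript. The sketch below merges consecutive entries from the same speaker into turns and drops `audio_event` entries. The sample data is made up for illustration (including the "(laughter)" text, whose exact format in real responses is an assumption), and `turns_by_speaker` is a hypothetical helper, not an ElevenLabs function.

```python
from typing import Dict, List

# Hypothetical two-speaker input in the same shape as the example response.
SAMPLE_WORDS: List[Dict] = [
    {"text": "Hi", "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "type": "spacing", "speaker_id": "speaker_0"},
    {"text": "(laughter)", "type": "audio_event", "speaker_id": "speaker_0"},
    {"text": "Hello", "type": "word", "speaker_id": "speaker_1"},
    {"text": " ", "type": "spacing", "speaker_id": "speaker_1"},
    {"text": "there", "type": "word", "speaker_id": "speaker_1"},
]


def turns_by_speaker(words: List[Dict], keep_events: bool = False) -> List[Dict]:
    """Merge consecutive entries that share a speaker_id into one turn."""
    turns: List[Dict] = []
    for entry in words:
        if entry["type"] == "audio_event" and not keep_events:
            continue  # skip non-speech sounds unless the caller wants them
        speaker = entry.get("speaker_id")
        if turns and turns[-1]["speaker_id"] == speaker:
            # Same speaker: append text, spacing entries included.
            turns[-1]["text"] += entry["text"]
        else:
            turns.append({"speaker_id": speaker, "text": entry["text"]})
    for turn in turns:
        turn["text"] = turn["text"].strip()
    return turns


if __name__ == "__main__":
    for turn in turns_by_speaker(SAMPLE_WORDS):
        print(f'{turn["speaker_id"]}: {turn["text"]}')
```

Concatenating the `text` of `word` and `spacing` entries, rather than joining words with a space, keeps the output correct for languages that do not emit `spacing` entries.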

Models

Supported languages

The Scribe v1 model supports 99 languages, including:

Afrikaans (afr), Amharic (amh), Arabic (ara), Armenian (hye), Assamese (asm), Asturian (ast), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Burmese (mya), Cantonese (yue), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Fulah (ful), Galician (glg), Ganda (lug), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Igbo (ibo), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kabuverdianu (kea), Kannada (kan), Kazakh (kaz), Khmer (khm), Korean (kor), Kurdish (kur), Kyrgyz (kir), Lao (lao), Latvian (lav), Lingala (lin), Lithuanian (lit), Luo (luo), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Maltese (mlt), Mandarin Chinese (cmn), Māori (mri), Marathi (mar), Mongolian (mon), Nepali (nep), Northern Sotho (nso), Norwegian (nor), Occitan (oci), Odia (ori), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Shona (sna), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Tajik (tgk), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Umbundu (umb), Urdu (urd), Uzbek (uzb), Vietnamese (vie), Welsh (cym), Wolof (wol), Xhosa (xho) and Zulu (zul).

Breakdown of language support

Word Error Rate (WER) is a key metric used to evaluate the accuracy of transcription systems. It measures how many errors are present in a transcript compared to a reference transcript. Below is a breakdown of the WER for each language that Scribe v1 supports.

Bulgarian (bul), Catalan (cat), Czech (ces), Danish (dan), Dutch (nld), English (eng), Finnish (fin), French (fra), Galician (glg), German (deu), Greek (ell), Hindi (hin), Indonesian (ind), Italian (ita), Japanese (jpn), Kannada (kan), Malay (msa), Malayalam (mal), Macedonian (mkd), Norwegian (nor), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Serbian (srp), Slovak (slk), Spanish (spa), Swedish (swe), Turkish (tur), Ukrainian (ukr) and Vietnamese (vie).

Bengali (ben), Belarusian (bel), Bosnian (bos), Cantonese (yue), Estonian (est), Filipino (fil), Gujarati (guj), Hungarian (hun), Kazakh (kaz), Latvian (lav), Lithuanian (lit), Mandarin (cmn), Marathi (mar), Nepali (nep), Odia (ori), Persian (fas), Slovenian (slv), Tamil (tam) and Telugu (tel).

Afrikaans (afr), Arabic (ara), Armenian (hye), Assamese (asm), Asturian (ast), Azerbaijani (aze), Burmese (mya), Cebuano (ceb), Croatian (hrv), Georgian (kat), Hausa (hau), Hebrew (heb), Icelandic (isl), Javanese (jav), Kabuverdianu (kea), Korean (kor), Kyrgyz (kir), Lingala (lin), Maltese (mlt), Mongolian (mon), Māori (mri), Occitan (oci), Punjabi (pan), Sindhi (snd), Swahili (swa), Tajik (tgk), Thai (tha), Urdu (urd), Uzbek (uzb) and Welsh (cym).

Amharic (amh), Chichewa (nya), Fulah (ful), Ganda (lug), Igbo (ibo), Irish (gle), Khmer (khm), Kurdish (kur), Lao (lao), Luxembourgish (ltz), Luo (luo), Northern Sotho (nso), Pashto (pus), Shona (sna), Somali (som), Umbundu (umb), Wolof (wol), Xhosa (xho) and Zulu (zul).
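For reference, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below implements that standard definition with a word-level edit distance. How ElevenLabs normalizes text (casing, punctuation) before scoring is not stated here, so this version simply splits on whitespace; `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # deletion
                    current_row[j - 1] + 1,  # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


if __name__ == "__main__":
    # One word dropped out of four gives a WER of 0.25.
    print(word_error_rate("the quick brown fox", "the quick fox"))
```

Because insertions count as errors, WER can exceed 1.0 when the transcript is much longer than the reference.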

FAQ

The API supports uploading both audio and video files for transcription.

Files up to 1 GB in size and up to 2 hours in duration are supported.

Supported audio formats include:

  • audio/aac
  • audio/x-aac
  • audio/x-aiff
  • audio/ogg
  • audio/mpeg
  • audio/mp3
  • audio/mpeg3
  • audio/x-mpeg-3
  • audio/opus
  • audio/wav
  • audio/x-wav
  • audio/webm
  • audio/flac
  • audio/x-flac
  • audio/mp4
  • audio/aiff
  • audio/x-m4a

Supported video formats include:

  • video/mp4
  • video/x-msvideo
  • video/x-matroska
  • video/quicktime
  • video/x-ms-wmv
  • video/x-flv
  • video/webm
  • video/mpeg
  • video/3gpp
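The size, duration, and format limits above can be checked locally before uploading. The following is a minimal sketch of such a check. `upload_problems` is a hypothetical helper, the caller is assumed to already know the file's MIME type and duration, and "1 GB" is interpreted as 10^9 bytes, which is an assumption since the exact byte limit is not stated.

```python
from typing import List

SUPPORTED_AUDIO = {
    "audio/aac", "audio/x-aac", "audio/x-aiff", "audio/ogg", "audio/mpeg",
    "audio/mp3", "audio/mpeg3", "audio/x-mpeg-3", "audio/opus", "audio/wav",
    "audio/x-wav", "audio/webm", "audio/flac", "audio/x-flac", "audio/mp4",
    "audio/aiff", "audio/x-m4a",
}
SUPPORTED_VIDEO = {
    "video/mp4", "video/x-msvideo", "video/x-matroska", "video/quicktime",
    "video/x-ms-wmv", "video/x-flv", "video/webm", "video/mpeg", "video/3gpp",
}

MAX_BYTES = 1_000_000_000  # assumption: 1 GB taken as 10**9 bytes
MAX_SECONDS = 2 * 60 * 60  # 2 hours


def upload_problems(mime_type: str, size_bytes: int, duration_seconds: float) -> List[str]:
    """Return a list of reasons the file would exceed the documented limits."""
    problems: List[str] = []
    if mime_type not in SUPPORTED_AUDIO | SUPPORTED_VIDEO:
        problems.append(f"unsupported format: {mime_type}")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 1 GB")
    if duration_seconds > MAX_SECONDS:
        problems.append("file is longer than 2 hours")
    return problems


if __name__ == "__main__":
    # A 5 MB, 10-minute MP3 passes all three checks.
    print(upload_problems("audio/mpeg", 5_000_000, 600.0))
```

An empty list means the file is within the limits listed on this page; the server remains the final authority on what it accepts.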

ElevenLabs is constantly expanding the number of languages supported by our models. Please check back frequently for updates.
