
Scribe v2 is the most accurate Speech to Text model. Scribe v2 Realtime sets the benchmark for live transcription, powering agents and real-time applications. Both are available via API.
Scribe v2 Realtime uses ElevenLabs’ streaming-first architecture to turn live speech into text instantly, across 90+ languages.

Scribe v2 Realtime captures live speech in under 150 ms with exceptional accuracy. It is built for meetings and AI agents that demand instant understanding.
Scribe v2 Realtime delivers industry-leading accuracy with sub-150 ms latency, setting a new benchmark for real-time speech recognition.
Automatically detect when speech starts and stops, segmenting speech with precision for smoother live processing.
Delivering exceptional accuracy across accents, dialects, and recording conditions.
Build Scribe v2 Realtime into your products with the API, with full streaming support and commit control.
Create captions, subtitles, and editable transcripts for podcasts, videos, interviews, and other recorded content – all with industry-leading accuracy in Studio or via API.



Upload audio or video in any format — MP4, MOV, MP3, WAV, and more. Scribe v2 automatically converts speech into precise text, ready for captions, subtitles, or editing.
Scribe v2 achieves industry-leading transcription accuracy, delivering clean, editable text even in challenging audio conditions or across diverse accents.
Select up to 1000 specific words or sentences for Scribe to accurately transcribe based on context.
From laughter to footsteps, Scribe v2 tags every sound event, enriching your transcripts with the full context.
Scribe v2 intuitively distinguishes and labels every speaker, calculates entity timestamps, and redacts sensitive information from transcripts.

Integrate Scribe v2 and Scribe v2 Realtime into your product with the API or SDKs.

Enable real-time voice interactions with instant, low-latency transcription.
Convert recordings into editable text, captions, and repurposable content.

Excellent Accuracy (≤5% Word Error Rate, WER): Belarusian (bel), Bosnian (bos), Bulgarian (bul), Catalan (cat), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Finnish (fin), French (fra), Galician (glg), German (deu), Greek (ell), Hungarian (hun), Icelandic (isl), Indonesian (ind), Italian (ita), Japanese (jpn), Kannada (kan), Latvian (lav), Macedonian (mkd), Malay (msa), Malayalam (mal), Norwegian (nor), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Slovak (slk), Spanish (spa), Swedish (swe), Turkish (tur), Ukrainian (ukr) and Vietnamese (vie).
High Accuracy (>5% to ≤10% WER): Armenian (hye), Azerbaijani (aze), Bengali (ben), Cantonese (yue), Filipino (fil), Georgian (kat), Gujarati (guj), Hindi (hin), Kazakh (kaz), Lithuanian (lit), Maltese (mlt), Mandarin (cmn), Marathi (mar), Nepali (nep), Odia (ori), Persian (fas), Serbian (srp), Slovenian (slv), Swahili (swa), Tamil (tam) and Telugu (tel).
Good (>10% to ≤20% WER): Afrikaans (afr), Arabic (ara), Assamese (asm), Asturian (ast), Burmese (mya), Hausa (hau), Hebrew (heb), Javanese (jav), Korean (kor), Kyrgyz (kir), Luxembourgish (ltz), Māori (mri), Occitan (oci), Punjabi (pan), Tajik (tgk), Thai (tha), Uzbek (uzb) and Welsh (cym).
Moderate (>20% to ≤50% WER): Amharic (amh), Ganda (lug), Igbo (ibo), Irish (gle), Khmer (khm), Kurdish (kur), Lao (lao), Mongolian (mon), Northern Sotho (nso), Pashto (pus), Shona (sna), Sindhi (snd), Somali (som), Urdu (urd), Wolof (wol), Xhosa (xho), Yoruba (yor) and Zulu (zul).
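The accuracy tiers above are defined by Word Error Rate. As a rough illustration, WER is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below assumes lowercase normalization and whitespace tokenization; the names `word_error_rate` and `accuracy_tier` are illustrative, and this is not necessarily how the published figures were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a WER given as a fraction (0.05 means 5%) to the tiers listed above."""
    if wer <= 0.05:
        return "Excellent"
    if wer <= 0.10:
        return "High"
    if wer <= 0.20:
        return "Good"
    if wer <= 0.50:
        return "Moderate"
    return "Unrated"  # above 50% WER falls outside the listed tiers
```

For example, a five-word reference transcribed with one word dropped has a WER of 1/5 = 20%, which sits at the upper edge of the Good tier.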
Speech-to-text (STT) is a technology that converts spoken language into written text using automatic speech recognition (ASR). It processes audio signals, identifies speech patterns, and transcribes them into text with high accuracy. ElevenLabs' AI-powered speech-to-text software is designed to transcribe audio and video content with human-like precision, making it ideal for speech-to-text conversion, audio transcription, and real-time speech recognition.
Speech-to-text technology is used in:
✔ Speech-to-text transcription for podcasts, meetings, and interviews.
✔ Captions and subtitles in video content.
✔ Speech-to-text software for hands-free typing and accessibility tools.
ElevenLabs ASR offers fast, reliable, and highly accurate speech-to-text conversion for multiple languages and accents.
ElevenLabs provides video transcription to convert spoken dialogue into text format, making it easy to create subtitles, captions, and searchable transcripts.
Steps to transcribe video to text:
1. Upload your video file to ElevenLabs ASR.
2. Speech recognition technology processes the audio.
3. A transcript is generated automatically, with timestamps.
4. Download the text file or export subtitles for editing.
This AI-powered video transcription model helps content creators, businesses, and educators quickly convert video speech into accurate text for accessibility and content repurposing.
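To make the subtitle export step concrete, here is a minimal sketch of turning a timestamped transcript into SRT subtitle text. It assumes the transcript is already available as a list of `(start_seconds, end_seconds, text)` tuples; that input shape and the helper names `srt_timestamp` and `to_srt` are hypothetical and do not describe the actual Scribe response format.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}"
        )
    # SRT cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n"
```

Calling `to_srt([(0.0, 1.25, "Hello"), (1.25, 2.0, "world")])` yields two numbered cues that most video editors and players can import.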
Pricing starts at $0.40 per hour of transcribed audio and falls well below that at scale with Enterprise plans.
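As a back-of-the-envelope illustration, the starting rate can be turned into a cost estimate for a given amount of audio. The sketch below uses only the $0.40 per hour figure quoted above; per-second proration, rounding to the nearest cent, and the function name `estimate_cost` are assumptions, and actual billing may differ by plan.

```python
from decimal import Decimal, ROUND_HALF_UP

# Starting rate quoted above, in US dollars per hour of transcribed audio.
STARTING_RATE_PER_HOUR = Decimal("0.40")


def estimate_cost(
    audio_seconds: int,
    rate_per_hour: Decimal = STARTING_RATE_PER_HOUR,
) -> Decimal:
    """Estimate transcription cost in dollars, rounded to the nearest cent."""
    hours = Decimal(audio_seconds) / Decimal(3600)
    return (hours * rate_per_hour).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Under these assumptions, a 90-minute recording comes to $0.60 and 100 hours of audio to $40.00 at the starting rate.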
Yes. Scribe can auto-generate captions and subtitles for YouTube, TikTok, Instagram, and more—supporting multiple languages for accessibility and reach.
The most accurate Speech to Text models use deep neural networks trained on large, multilingual datasets. Scribe achieves industry-leading accuracy across 90+ languages, outperforming models like Whisper, Deepgram, and Gemini in benchmark tests.
Yes. Real-time Speech to Text converts spoken words into text as they’re being spoken. With Scribe v2 Realtime, transcription occurs in under 150 milliseconds, making it ideal for live conversations, meetings, and AI agents.
Speech to Text can be used for meeting notes, podcasts, accessibility captions, customer service calls, and any task that requires converting spoken content into readable text. It also powers real-time AI assistants and automated workflows.
All Speech to Text data is processed with enterprise-grade security. Transcriptions can be handled through encrypted APIs, and sensitive information can be processed locally or with restricted access to meet compliance standards.
Speech to Text technology can work offline if models are deployed locally. Scribe supports cloud and on-premise configurations, allowing enterprises to control data handling while maintaining low latency and high accuracy.
Yes. Advanced Speech to Text systems use speaker diarization to distinguish and label multiple speakers automatically, even in overlapping conversations.
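Diarized output is often consumed as a sequence of words, each tagged with a speaker label, which then needs to be grouped into readable turns. The sketch below assumes the words arrive as `(speaker, text)` pairs in spoken order; that shape, the `speaker_0` style labels, and the function name `speaker_turns` are hypothetical, not the actual Scribe response format.

```python
from itertools import groupby


def speaker_turns(words: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge consecutive words from the same speaker into a single turn."""
    turns = []
    # groupby only groups adjacent items, so a speaker who talks again
    # later in the conversation starts a new turn.
    for speaker, group in groupby(words, key=lambda word: word[0]):
        turns.append((speaker, " ".join(text for _, text in group)))
    return turns
```

Each returned pair can then be rendered as one line of a transcript, for example `speaker_0: Hi there`.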
Speech to Text refers to the automatic process of converting spoken language into text using AI, while transcription software may include editing tools, formatting, and collaboration features built around that core technology.
Our AI speech to text transcription supports 90+ languages. Just select the language and upload your audio file.


