Meet Scribe

Transcribe Speech to Text with the world's most accurate ASR model

Introducing Scribe v1, the world's most accurate speech-to-text model.

Scribe, our first Speech to Text model, is the world’s most accurate transcription model. Built to handle the unpredictability of real-world audio, Scribe transcribes speech in 99 languages, featuring word-level timestamps, speaker diarization, and audio-event tagging—all delivered in a structured response for seamless integration.

Scribe is engineered for precision. In FLEURS and Common Voice benchmark tests across 99 languages, it consistently outperforms leading models such as Gemini 2.0 Flash, Whisper Large V3, and Deepgram Nova-3. Whether it’s meeting summaries, movie subtitles, or even song lyrics, Scribe delivers the highest automated transcription accuracy in Italian (98.7%), English (96.7%), and 97 other languages.
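For readers new to the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the model output (substitutions, deletions, and insertions), divided by the number of reference words. Accuracy is commonly reported as 100% minus WER. The following is a minimal sketch of that standard definition, not the evaluation pipeline behind these benchmarks; the function name and the plain whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace with no text normalization;
    real benchmark setups usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.1%}")
```

In this toy example there is one substitution ("sat" to "sit") and one deletion (the second "the") against six reference words.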

Scribe makes ASR universally accessible—dramatically reducing errors in traditionally underserved languages such as Serbian, Cantonese, and Malayalam, where competing models often exceed 40% word error rates.

The world's most accurate ASR model by ElevenLabs.

Developers can integrate Scribe today via our Speech to Text API to get structured JSON transcripts with speaker diarization, word-level timestamps, and non-speech event markers (e.g. laughter). A low-latency version for real-time applications will be released soon.
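To illustrate what working with such a structured transcript might look like, here is a short sketch that groups words into speaker turns and pulls out non-speech events. The response shape and every field name in it (`words`, `type`, `start`, `end`, `speaker_id`) are assumptions made for this example, as are the helper names; consult the API documentation for the actual schema. The sketch operates on a hand-written sample and makes no network calls.

```python
from itertools import groupby

# Hypothetical transcript payload; the field names are illustrative
# assumptions, not a confirmed description of the API response.
sample = {
    "language_code": "en",
    "text": "Hello there. (laughter) Hi!",
    "words": [
        {"text": "Hello", "type": "word", "start": 0.0, "end": 0.42, "speaker_id": "speaker_0"},
        {"text": "there.", "type": "word", "start": 0.48, "end": 0.9, "speaker_id": "speaker_0"},
        {"text": "(laughter)", "type": "audio_event", "start": 1.1, "end": 1.8, "speaker_id": "speaker_1"},
        {"text": "Hi!", "type": "word", "start": 1.95, "end": 2.3, "speaker_id": "speaker_1"},
    ],
}


def speaker_turns(transcript):
    """Merge consecutive spoken words from the same speaker into turns.

    Returns a list of (speaker_id, start_seconds, end_seconds, text) tuples.
    """
    spoken = [w for w in transcript["words"] if w["type"] == "word"]
    turns = []
    for speaker, group in groupby(spoken, key=lambda w: w["speaker_id"]):
        words = list(group)
        text = " ".join(w["text"] for w in words)
        turns.append((speaker, words[0]["start"], words[-1]["end"], text))
    return turns


def audio_events(transcript):
    """Return (label, start_seconds, end_seconds) for each non-speech event."""
    return [
        (w["text"], w["start"], w["end"])
        for w in transcript["words"]
        if w["type"] == "audio_event"
    ]


if __name__ == "__main__":
    for speaker, start, end, text in speaker_turns(sample):
        print(f"[{start:.2f}-{end:.2f}] {speaker}: {text}")
    for label, start, end in audio_events(sample):
        print(f"[{start:.2f}-{end:.2f}] event: {label}")
```

Turn-level grouping like this is a common first step for producing subtitles or meeting notes from word-level timestamps.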

Creators and businesses can use Scribe directly via the ElevenLabs dashboard to upload audio or video files and generate formatted transcripts.

Start building with Scribe:

API Documentation | Try in the ElevenLabs Dashboard

Benchmarks

FLEURS - Word Error Rate % - 102 Languages

Bar chart comparing word error rates for different languages and speech recognition models.

Common Voice - Word Error Rate % - 102 Languages

Bar chart comparing word error rates for different voice recognition models across various countries.

Contributions

Research lead, training, architecture

Flavio Schneider

Project lead, pre-training data, fine-tuning data

Tim von Känel

Inference, Optimizations

Maximiliano Levi

Research Contributors

Johan Nordberg, Piotr Dabkowski

Frontend

Austin Malerba

Backend

Hristo Stoychev

Data Acquisition

Alex George
