Speech to Text API

Transcribe speech with ElevenLabs Scribe v2

Highest accuracy speech to text for bulk applications. Detect emphasis & sound effects, and guide transcription with keyterm prompting.

Uh, hi! So, um, I was wondering if you wanted to meet up for coffee? Maybe tomorrow morning? [nervous laugh] Totally fine if not!

  • Lovable
  • Veed
  • Synthesia
  • Stripe
  • Perplexity
  • Twilio

Most accurate Speech to Text API for batch workloads

Create captions, subtitles, and editable transcripts for podcasts, videos, interviews, and other recorded content – all with industry-leading accuracy via API.

Unprecedented transcription accuracy

Scribe v2 achieves industry-leading transcription accuracy, delivering clean, editable text even in challenging audio conditions or across diverse accents.

Designed for every scenario

Transcription that works in noisy environments, with background music, strong accents, and low-quality audio.

Fine-grained control over timing, speakers, and non-speech events.

The ElevenLabs Transcription API can detect laughter, emotion, and sound effects. Use keyterm prompting to guide transcription with domain-specific terms.

Transcribe audio and video

Upload MP3, MP4, WAV, MOV, and other common formats. Scribe handles files up to 10 hours with async processing and webhook notifications for large batches.

Clean, editable transcripts

Get properly punctuated, paragraph-structured text ready for editing, publishing, or downstream processing. No cleanup required.
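For captions and subtitles, word-level timing can be turned into SRT cues on the client. The following is a minimal sketch, not the API's own export path. It assumes each transcript entry looks like `{ text, start, end, type }` with times in seconds and `type` equal to `"word"` for spoken words; those field names, the `wordsToSrt` helper, and the 4-second cue length are illustrative assumptions.

```javascript
// Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Group words into cues no longer than maxCueSeconds and render them as SRT.
// Assumed entry shape (hypothetical): { text, start, end, type }.
function wordsToSrt(words, maxCueSeconds = 4) {
  const cues = [];
  let current = null;
  for (const w of words) {
    if (w.type !== "word") continue; // skip spacing and non-speech entries
    if (current && w.end - current.start <= maxCueSeconds) {
      current.text += " " + w.text;
      current.end = w.end;
    } else {
      current = { start: w.start, end: w.end, text: w.text };
      cues.push(current);
    }
  }
  return cues
    .map((c, i) => `${i + 1}\n${toSrtTime(c.start)} --> ${toSrtTime(c.end)}\n${c.text}\n`)
    .join("\n");
}

// Made-up sample data for illustration only.
const srtSampleWords = [
  { text: "Hi", start: 0.0, end: 0.4, type: "word" },
  { text: " ", start: 0.4, end: 0.5, type: "spacing" },
  { text: "there", start: 0.5, end: 0.9, type: "word" },
  { text: "friend", start: 5.0, end: 5.5, type: "word" },
];
console.log(wordsToSrt(srtSampleWords));
```

Cue length is a design choice: shorter cues track speech more closely, longer cues are easier to read.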

Keyterm prompting

Boost recognition accuracy for up to 100 domain-specific terms. Product names, technical jargon, and specialized vocabulary transcribed correctly the first time.
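Because the limit is 100 terms, it can help to clean and check a term list before sending a request. The sketch below is one possible approach: `prepareKeyterms` is a hypothetical helper, and the name of the request option that accepts the resulting array is not given on this page, so it is left out.

```javascript
const MAX_KEYTERMS = 100; // limit stated on this page

// Trim, drop empty entries, and de-duplicate case-insensitively,
// then enforce the 100-term limit. Hypothetical helper, not part of the SDK.
function prepareKeyterms(terms) {
  const seen = new Set();
  const cleaned = [];
  for (const raw of terms) {
    const term = raw.trim();
    const key = term.toLowerCase();
    if (term === "" || seen.has(key)) continue;
    seen.add(key);
    cleaned.push(term);
  }
  if (cleaned.length > MAX_KEYTERMS) {
    throw new RangeError(
      `Got ${cleaned.length} keyterms; at most ${MAX_KEYTERMS} are supported.`
    );
  }
  return cleaned;
}

console.log(prepareKeyterms([" Scribe ", "scribe", "", "ElevenLabs"]));
```

Throwing on overflow, rather than silently truncating, keeps it obvious which terms were actually sent.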

Dynamic audio tagging

Capture non-speech events like laughter, applause, music, and background noise. Transcripts include the full context of your audio, not just the words.
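Tagged events can be separated from spoken words after transcription. This is a minimal sketch that assumes the response carries a list of entries shaped like `{ text, start, end, type }`, where non-speech events have `type` equal to `"audio_event"`; that shape and the `splitAudioEvents` helper are assumptions for illustration.

```javascript
// Split transcript entries into non-speech events and plain spoken text.
// Assumed entry shape (hypothetical): { text, start, end, type }.
function splitAudioEvents(words) {
  const events = words
    .filter((w) => w.type === "audio_event")
    .map((w) => ({ label: w.text, start: w.start, end: w.end }));
  const spokenText = words
    .filter((w) => w.type !== "audio_event")
    .map((w) => w.text)
    .join("")
    .replace(/\s+/g, " ") // collapse gaps left where events were removed
    .trim();
  return { events, spokenText };
}

// Made-up sample data, modelled on the transcript shown at the top of the page.
const eventSampleWords = [
  { text: "Maybe", start: 0.0, end: 0.4, type: "word" },
  { text: " ", start: 0.4, end: 0.5, type: "spacing" },
  { text: "tomorrow", start: 0.5, end: 1.0, type: "word" },
  { text: " ", start: 1.0, end: 1.1, type: "spacing" },
  { text: "[nervous laugh]", start: 1.1, end: 1.8, type: "audio_event" },
  { text: " ", start: 1.8, end: 1.9, type: "spacing" },
  { text: "Totally", start: 1.9, end: 2.3, type: "word" },
];
console.log(splitAudioEvents(eventSampleWords));
```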

Smart speaker diarization

Automatically identify and label up to 48 speakers. Clear attribution of who said what, organized into readable transcripts.
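A diarized result can be folded into readable speaker turns. The sketch below assumes each transcript entry carries a `speakerId` alongside `text`, `start`, and `end`; those field names and the `groupBySpeaker` helper are assumptions about the response, not something this page documents.

```javascript
// Merge consecutive entries from the same speaker into one turn.
// Assumed entry shape (hypothetical): { text, start, end, type, speakerId }.
function groupBySpeaker(words) {
  const turns = [];
  for (const word of words) {
    const last = turns[turns.length - 1];
    if (last && last.speakerId === word.speakerId) {
      last.text += word.text;
      last.end = word.end;
    } else {
      turns.push({
        speakerId: word.speakerId,
        start: word.start,
        end: word.end,
        text: word.text,
      });
    }
  }
  // Trim leading and trailing spacing inside each turn.
  return turns.map((t) => ({ ...t, text: t.text.trim() }));
}

// Made-up sample data for illustration only.
const speakerSampleWords = [
  { text: "Hi", start: 0.0, end: 0.3, type: "word", speakerId: "speaker_0" },
  { text: " ", start: 0.3, end: 0.4, type: "spacing", speakerId: "speaker_0" },
  { text: "there", start: 0.4, end: 0.8, type: "word", speakerId: "speaker_0" },
  { text: " ", start: 0.8, end: 1.0, type: "spacing", speakerId: "speaker_1" },
  { text: "Hello", start: 1.0, end: 1.4, type: "word", speakerId: "speaker_1" },
];
for (const turn of groupBySpeaker(speakerSampleWords)) {
  console.log(`${turn.speakerId}: ${turn.text}`);
}
```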

Entity detection

Automatically identify and tag 56 entity types including names, dates, locations, and organizations within your transcripts.

Scribe v2

Highest accuracy, designed for batch workloads.

  • >95% Accuracy
  • 90+ Languages
  • Non-Speech Event Detection
  • Entity Detection
  • Keyterm Prompting

Scribe v2 Realtime

Lowest latency, for realtime workloads.

  • Under 150ms Latency
  • 90+ Languages
  • Transcription Streaming
  • Voice Activity Detection
  • Automatic Language Recognition

Transcribe speech in 90+ languages and a wide range of accents

Delivering exceptional accuracy across accents, dialects, and recording conditions.

Set languageCode to choose the transcription language.

import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const elevenlabs = new ElevenLabsClient({
	apiKey: "<your_api_key>"
});
const response = await fetch(
  "https://storage.googleapis.com/eleven-public-cdn/audio/marketing/nicole.mp3"
);
const audioBlob = new Blob([await response.arrayBuffer()], { type: "audio/mp3" });

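const transcription = await elevenlabs.speechToText.convert({
  file: audioBlob,
  modelId: "scribe_v2",
  tagAudioEvents: true, // Tag non-speech events such as laughter
  languageCode: "en", // Set language (English is the page's default selection)
  diarize: true, // Label who is speaking
});
console.log(transcription);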
  • English (en)
  • Chinese (zh)
  • Spanish (es)
  • French (fr)
  • Portuguese (pt)
  • German (de)
  • Japanese (ja)
  • Italian (it)
  • Hindi (hi)

Powering the world’s leading companies and brands

  • From dubbing Reels in local languages, to generating music and character voices in Horizon, ElevenLabs platform enables global creators, businesses, and enterprises to build with voice, music, and sound at scale.
    Meta
  • Scribe’s unmatched accuracy across so many languages lets Fieldy understand every daily conversation and easily scale across continents. Fieldy has increased user retention by 50% after moving to ElevenLabs Scribe.
    Fieldy
  • ElevenLabs made it easy for us to quickly bring powerful text-to-speech capabilities to our SDK, allowing Agents to respond in real time with expressive voices to user questions or as feedback to what it’s seeing.
    Stream
  • Twilio has integrated ElevenLabs’ generative AI voice technology into its CPaaS, enhancing ConversationRelay. This integration allows businesses and developers to create conversational AI voice interactions that sound human, feel expressive, and respond in real time directly from the Twilio CPaaS platform. We at ElevenLabs are excited that Twilio has chosen ElevenLabs to enhance ConversationRelay with the most expressive, human sounding voices available.
    Twilio

APIs built for production
