What is the ElevenLabs Bulk Transcription API?

The Bulk Transcription API is part of Scribe, our Speech to Text system designed for large-scale audio and video transcription. It enables developers and enterprises to process hours of recorded content with industry-leading accuracy across 99 languages.

What types of audio and video files can I upload?

Scribe supports common audio and video formats, including MP3, WAV, MP4, and MOV.

How accurate is Scribe for bulk transcription?

Scribe v2 achieves best-in-class accuracy across 99 languages and is robust to challenging audio conditions, accents, and poor recording quality. It outperforms previous-generation models and other leading APIs on public benchmarks.

How long does transcription take for large files?

Processing time depends on file length and concurrency. Scribe is optimized for throughput and can handle large-scale pipelines with high parallelization, delivering transcripts in seconds to minutes.

Does Scribe support speaker separation and timestamps?

Yes. The API provides smart speaker diarization, word- and character-level timestamps, and dynamic audio tagging for non-speech events like laughter or music.
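As a rough illustration of what word-level timestamps make possible, the sketch below turns a list of timed words into SRT caption cues. The `TimedWord` shape, its field names, and the seven-words-per-cue default are assumptions made for this example, not the documented response format.

```typescript
// Hypothetical shape for one timed word; the field names are assumptions.
interface TimedWord {
  text: string;
  start: number; // seconds from the start of the audio
  end: number; // seconds from the start of the audio
}

// Left-pad a number with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Group words into numbered caption cues of at most `maxWords` words each.
function wordsToSrt(words: TimedWord[], maxWords: number = 7): string {
  const cues: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const chunk = words.slice(i, i + maxWords);
    const start = toSrtTime(chunk[0].start);
    const end = toSrtTime(chunk[chunk.length - 1].end);
    const text = chunk.map((w) => w.text).join(" ");
    cues.push(`${cues.length + 1}\n${start} --> ${end}\n${text}`);
  }
  return cues.length > 0 ? cues.join("\n\n") + "\n" : "";
}
```

Splitting on a fixed word count keeps the sketch short; a production captioner would more likely break cues on punctuation or pauses between words.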

Can I customize the model for domain-specific terms?

Yes. With keyterm prompting, you can supply a custom vocabulary so that product names, technical terminology, and unique brand phrases are transcribed correctly.

Is the Bulk Transcription API secure and compliant?

Scribe supports SOC 2, GDPR, and optional HIPAA compliance. Data is encrypted in transit and at rest, and teams can enable EU data residency or Zero Retention modes for stricter control.

How is the Bulk Transcription API priced?

Pricing is usage-based, calculated per minute of input audio. Volume discounts and enterprise plans are available for high-throughput workloads. Contact our sales team to discuss your requirements.

How can I get started?

You can start transcribing immediately by generating an API key and exploring the API docs.

Speech to Text API

Transcribe speech with ElevenLabs Scribe v2

Highest accuracy STT for bulk applications. Detect emphasis & sound effects, and guide transcription with keyterm prompting.

Uh, hi! So, um, I was wondering if you wanted to meet up for coffee? Maybe tomorrow morning? [nervous laugh] Totally fine if not!

Natural Speech

Low-quality Audio

Accents

Domain Terms

Lovable
Veed model
Synthesia
Stripe
Perplexity
Twilio

Most accurate Speech to Text API for batch workloads

Create captions, subtitles, and editable transcripts for podcasts, videos, interviews, and other recorded content – all with industry-leading accuracy via API.

Unprecedented transcription accuracy

Scribe v2 achieves industry-leading transcription accuracy, delivering clean, editable text even in challenging audio conditions or across diverse accents.

Designed for every scenario

Transcription that works in noisy environments, with background music, strong accents, and low-quality audio.

Fine-grained control over timing, speakers, and non-speech events.

The ElevenLabs Transcription API can detect laughter, emotion, and sound effects. Use keyterm prompting to guide transcription with domain-specific terms.

Transcribe audio and video

Upload MP3, MP4, WAV, MOV, and other common formats. Scribe handles files up to 10 hours with async processing and webhook notifications for large batches.

Clean, editable transcripts

Get properly punctuated, paragraph-structured text ready for editing, publishing, or downstream processing. No cleanup required.

Keyterm prompting

Boost recognition accuracy for up to 100 domain-specific terms. Product names, technical jargon, and specialized vocabulary are transcribed correctly the first time.
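Because the term list is capped at 100 entries, it can help to clean a vocabulary before sending it. The helper below is a minimal sketch of that step; `prepareKeyterms` is a hypothetical name, and how the cleaned list is attached to a transcription request is not shown on this page, so the sketch stops at producing the list.

```typescript
// Limit taken from the text above: up to 100 domain-specific terms.
const MAX_KEYTERMS = 100;

// Trim whitespace, drop empty entries, remove exact duplicates,
// and reject lists that are still over the limit.
function prepareKeyterms(terms: string[], limit: number = MAX_KEYTERMS): string[] {
  const result: string[] = [];
  for (const raw of terms) {
    const term = raw.trim();
    if (term === "" || result.indexOf(term) !== -1) continue;
    result.push(term);
  }
  if (result.length > limit) {
    throw new RangeError(`Got ${result.length} keyterms; the limit is ${limit}.`);
  }
  return result;
}
```

De-duplication is exact-match on purpose, since differently cased spellings of a brand name may be distinct terms. Throwing on overflow, rather than silently truncating, makes it obvious when terms would have been dropped.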

Dynamic audio tagging

Capture non-speech events like laughter, applause, music, and background noise. Transcripts include the full context of your audio, not just the words.
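The demo transcript on this page shows a non-speech event inline as a bracketed tag, `[nervous laugh]`. Assuming tags always appear in square brackets inside the transcript text (an assumption drawn from that one example), a short sketch for listing them, or stripping them for a words-only view, could look like this. Both function names are hypothetical.

```typescript
// Collect bracketed tags such as "[nervous laugh]" from transcript text.
function extractAudioTags(text: string): string[] {
  const tags: string[] = [];
  const pattern = /\[([^\[\]]+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    tags.push(match[1].trim());
  }
  return tags;
}

// Remove bracketed tags and tidy up the whitespace they leave behind.
function stripAudioTags(text: string): string {
  return text
    .replace(/\s*\[[^\[\]]+\]/g, "")
    .replace(/\s{2,}/g, " ")
    .trim();
}

const sample =
  "Uh, hi! So, um, I was wondering if you wanted to meet up for coffee? " +
  "Maybe tomorrow morning? [nervous laugh] Totally fine if not!";

console.log(extractAudioTags(sample)); // [ 'nervous laugh' ]
console.log(stripAudioTags(sample));
```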

Smart speaker diarization

Automatically identify and label up to 48 speakers. Clear attribution of who said what, organized into readable transcripts.
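To show how per-word speaker labels can become a readable transcript, the sketch below merges consecutive words from the same speaker into turns. The `DiarizedWord` shape and the `speakerId` field name are assumptions for illustration; check the API reference for the actual response fields.

```typescript
// Hypothetical shape for a diarized word; the field names are assumptions.
interface DiarizedWord {
  text: string;
  speakerId: string;
}

interface SpeakerTurn {
  speakerId: string;
  text: string;
}

// Merge consecutive words from the same speaker into a single turn.
function groupBySpeaker(words: DiarizedWord[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const word of words) {
    const last = turns.length > 0 ? turns[turns.length - 1] : undefined;
    if (last !== undefined && last.speakerId === word.speakerId) {
      last.text += " " + word.text;
    } else {
      turns.push({ speakerId: word.speakerId, text: word.text });
    }
  }
  return turns;
}

// Render turns as "speaker: text" lines.
function formatTurns(turns: SpeakerTurn[]): string {
  return turns.map((t) => `${t.speakerId}: ${t.text}`).join("\n");
}
```

A speaker who talks, is interrupted, and then resumes produces two separate turns, which preserves the order of the conversation.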

Entity detection

Automatically identify and tag 56 entity types including names, dates, locations, and organizations within your transcripts.

Scribe v2

Highest accuracy, designed for batch workloads.

>95% Accuracy
90+ Languages
Non-Speech Event Detection
Entity Detection
Keyterm Prompting

Scribe v2 Realtime

Lowest latency, for realtime workloads.

Under 150ms Latency
90+ Languages
Transcription Streaming
Voice Activity Detection
Automatic Language Recognition

Transcribe speech in 90+ languages and a wide range of accents

Delivering exceptional accuracy across accents, dialects, and recording conditions.

Set languageCode to the language of your audio.

import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const elevenlabs = new ElevenLabsClient({
  apiKey: "<your_api_key>",
});

// Download the sample audio and wrap it in a Blob for upload.
const response = await fetch(
  "https://storage.googleapis.com/eleven-public-cdn/audio/marketing/nicole.mp3"
);
const audioBlob = new Blob([await response.arrayBuffer()], { type: "audio/mp3" });

const transcription = await elevenlabs.speechToText.convert({
  file: audioBlob,
  modelId: "scribe_v2",
  tagAudioEvents: true, // Tag non-speech events such as laughter
  languageCode: "eng", // Set language ("eng" assumed here for English)
  diarize: true, // Label which speaker said each word
});

console.log(transcription);

English

Chinese

Spanish

French

Portuguese

German

Japanese

Italian

Hindi

Powering the world’s leading companies and brands

“From dubbing Reels in local languages, to generating music and character voices in Horizon, ElevenLabs platform enables global creators, businesses, and enterprises to build with voice, music, and sound at scale.”
“Scribe’s unmatched accuracy across so many languages lets Fieldy understand every daily conversation and easily scale across continents. Fieldy has increased user retention by 50% after moving to ElevenLabs Scribe.”
“ElevenLabs made it easy for us to quickly bring powerful text-to-speech capabilities to our SDK, allowing Agents to respond in real time with expressive voices to user questions or as feedback to what it’s seeing.”
“Twilio has integrated ElevenLabs’ generative AI voice technology into its CPaaS, enhancing ConversationRelay. This integration allows businesses and developers to create conversational AI voice interactions that sound human, feel expressive, and respond in real time directly from the Twilio CPaaS platform. We at ElevenLabs are excited that Twilio has chosen ElevenLabs to enhance ConversationRelay with the most expressive, human sounding voices available.”