
Text to Speech API
Ultra-realistic and low latency speech generation
Build with high-quality, controllable speech generation for real-time and bulk applications. Models optimized for latency, fidelity, and long-form consistency.
Demo
In the ancient land of Eldoria, where skies shimmered and forests whispered secrets to the wind, lived a dragon named Zephyros. [sarcastically] Not the “burn it all down” kind... [giggles] but he was gentle, wise, with eyes like old stars. [whispers] Even the birds fell silent when he passed.
- Lovable
- Synthesia
- Stripe
- Perplexity
- Twilio
Built on the most powerful Voice AI models
Choose the right model for your use case: from ultra-low latency agents to expressive, long-form narration.

Flash v2.5
Our lowest latency speech synthesis model
- Ultra-low latency (~75ms)
- 32 languages supported
- 40,000 character limit
- ~$0.06 per minute

Turbo v2.5
Balanced quality and latency
- Low latency (~250-300ms)
- High quality voice generation
- 32 languages supported
- 40,000 character limit
- ~$0.06 per minute

Multilingual v2
Lifelike, consistent quality speech synthesis model
- Natural-sounding output
- 29 languages supported
- 10,000 character limit
- Designed for long-form generations
- ~$0.12 per minute

Eleven v3
Our most emotionally rich, expressive model
- Dramatic delivery and performance
- 70+ languages supported
- 3,000 character limit
- Multi-speaker dialogue
- ~$0.12 per minute
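The model table above can be sketched as a small lookup for choosing a model and validating input length before sending a request. The model ID strings follow common ElevenLabs naming conventions but are assumptions; confirm them against the API reference.

```python
# Sketch: model character limits from the table above. The model_id
# strings are assumed identifiers -- verify against the docs.
MODELS = {
    "eleven_flash_v2_5": {"max_chars": 40_000},      # Flash v2.5
    "eleven_turbo_v2_5": {"max_chars": 40_000},      # Turbo v2.5
    "eleven_multilingual_v2": {"max_chars": 10_000}, # Multilingual v2
    "eleven_v3": {"max_chars": 3_000},               # Eleven v3
}

def check_length(model_id: str, text: str) -> bool:
    """Return True if `text` fits within the model's character limit."""
    return len(text) <= MODELS[model_id]["max_chars"]
```

Validating length client-side avoids a round trip that would fail server-side for over-limit requests.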
Everything you need to build production-ready speech
Generate expressive, controllable speech with models built for real-time, long-form, and production use.
Control emotion and delivery
Create controllable, expressive speech, layered with emotion, audio events, and immersive soundscapes.

Access 10,000+ voices
Explore an ever-growing collection of expressive, lifelike voices for any use case.

Voice design & cloning
Create in over 30 languages with natural voices, expressive accents, and localized audio tailored to your audience.

Multi-speaker dialogue
Create natural multi-speaker conversations across 30+ languages with expressive, controllable voices.

Audio events and direction
Control delivery with audio tags, timing cues, and narrative direction built into the speech.

Pronunciation dictionaries
Define custom pronunciations to ensure consistent, accurate speech for names and terminology.

Powering the world’s leading companies and brands
“From dubbing Reels in local languages, to generating music and character voices in Horizon, the ElevenLabs platform enables global creators, businesses, and enterprises to build with voice, music, and sound at scale.”
“Millions of people learn chess from creators like Hikaru, Levy, and Magnus every day on YouTube and Twitch. Now you can learn from them inside Chess.com in a way that feels immersive, personal, and full of character. Our mission is to build a chess coach that teaches at the right level, welcomes players of every skill level, and demystifies chess while keeping it fun and full of personality. With ElevenLabs and these amazing new voices, we’ve taken a big step toward making that vision a reality.”
“ElevenLabs made it easy for us to quickly bring powerful text-to-speech capabilities to our SDK, allowing Agents to respond in real time with expressive voices to user questions or as feedback to what it’s seeing.”

“Twilio has integrated ElevenLabs’ generative AI voice technology into its CPaaS, enhancing ConversationRelay. This integration allows businesses and developers to create conversational AI voice interactions that sound human, feel expressive, and respond in real time directly from the Twilio CPaaS platform. We at ElevenLabs are excited that Twilio has chosen ElevenLabs to enhance ConversationRelay with the most expressive, human-sounding voices available.”
APIs built for production

Frequently asked questions
Which model should I use?
- Flash v2.5 - Ultra-low latency (~75ms) for real-time applications like voice agents
- Turbo v2.5 - Balanced quality and speed (~250-300ms) for interactive use cases
- Multilingual v2 - Consistent quality for long-form content up to 10,000 characters
- Eleven v3 - Maximum expressiveness and emotional range for creative applications
How fast is speech generation?
Flash v2.5 delivers ~75ms latency.
Turbo v2.5 typically responds in 250-300ms.
Both support streaming output, allowing playback to begin before generation completes.
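The streaming behavior can be sketched against the HTTP streaming endpoint. The `/stream` path and `xi-api-key` header follow the public REST API; treat exact field names and the default model ID as assumptions to verify in the API reference.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"

def stream_url(voice_id: str) -> str:
    """Build the streaming text-to-speech endpoint URL for a voice."""
    return f"{API_BASE}/text-to-speech/{voice_id}/stream"

def stream_speech(api_key: str, voice_id: str, text: str,
                  model_id: str = "eleven_flash_v2_5",
                  chunk_size: int = 4096):
    """Yield audio chunks as they arrive, so playback can start
    before the full generation completes."""
    req = urllib.request.Request(
        stream_url(voice_id),
        data=json.dumps({"text": text, "model_id": model_id}).encode(),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(chunk_size):
            yield chunk
```

Feeding each yielded chunk straight into an audio player is what reduces perceived latency: the first chunk typically arrives long before the last one is generated.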
How many languages are supported?
Eleven v3 supports 70+ languages.
Flash v2.5 and Turbo v2.5 support 32 languages.
Multilingual v2 supports 29 languages.
What are the character limits?
Flash v2.5 and Turbo v2.5: 40,000 characters
Multilingual v2: 10,000 characters
Eleven v3: 3,000 characters
How do I control emotion and delivery?
Use audio tags ([laughs], [whispers], [sighs], [door slam]) to control delivery, emotion, emphasis, pauses, and sound effects. Eleven v3 provides the most expressive control.
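Audio tags are written inline in the input text itself; the model interprets the bracketed cues. A minimal sketch of composing tagged text into a request body (the `text` and `model_id` fields follow the REST API; the helpers are illustrative, not part of any SDK):

```python
# Illustrative helpers: audio tags are plain bracketed cues embedded
# in the text the model receives -- no separate API parameter needed.
def tagged(text: str, tag: str) -> str:
    """Prefix a sentence with an inline audio tag such as 'whispers'."""
    return f"[{tag}] {text}"

def build_payload(text: str, model_id: str = "eleven_v3") -> dict:
    """Compose a minimal TTS request body."""
    return {"text": text, "model_id": model_id}

line = tagged("Even the birds fell silent when he passed.", "whispers")
payload = build_payload(line)
```

Because tags live in the text, they can be mixed freely with narration, as in the demo passage near the top of this page.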
How many voices are available?
The voice library includes 10,000+ voices. You can also clone voices or design custom voices using text prompts.
Does the API support streaming?
Yes. Streaming allows you to start playback before the full audio is generated, reducing perceived latency in real-time applications.
Can I use cloned or designed voices?
Yes. Reference any voice in your library by voice ID, including professional voice clones, instant voice clones, and voices you've designed.
What audio formats are supported?
The API outputs MP3 by default. Additional formats include PCM and μ-law.
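Output format is selected per request via a query parameter. The `output_format` parameter name and the `encoding_samplerate[_bitrate]` style codes below match common ElevenLabs usage, but treat the exact set of supported values as an assumption to confirm in the API reference.

```python
from urllib.parse import urlencode

# Assumed output_format codes: encoding_samplerate[_bitrate].
FORMATS = {
    "mp3": "mp3_44100_128",  # default: MP3, 44.1 kHz, 128 kbps
    "pcm": "pcm_16000",      # raw 16 kHz PCM
    "ulaw": "ulaw_8000",     # 8 kHz mu-law (telephony)
}

def tts_url(voice_id: str, fmt: str = "mp3") -> str:
    """Build a TTS request URL with the output_format query parameter."""
    qs = urlencode({"output_format": FORMATS[fmt]})
    return f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?{qs}"
```

μ-law at 8 kHz is the format telephony platforms such as Twilio typically expect, which is why it appears alongside MP3 and PCM.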
How do I minimize latency for real-time applications?
Use Flash v2.5 with streaming enabled. Keep requests under 1,000 characters. Enable WebSocket connections for persistent real-time applications.
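The under-1,000-character guideline implies splitting long text client-side. A minimal sketch of packing whole sentences into request-sized chunks (an illustrative helper, not part of the ElevenLabs SDK):

```python
import re

def chunk_text(text: str, limit: int = 1000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `limit`
    characters, splitting at sentence-ending punctuation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip() if current else s
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries rather than at a hard character cutoff keeps prosody natural across chunk joins.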
Can I control pronunciation of specific words?
Yes. Use phonetic spelling or pronunciation dictionaries to control how specific words are spoken.
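Pronunciation dictionaries are commonly supplied as PLS (W3C Pronunciation Lexicon Specification) files. A minimal sketch mapping a name to an alias spelling follows the PLS structure; whether your plan ingests dictionaries exactly this way is worth confirming in the docs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal PLS lexicon: replaces the grapheme "Zephyros" with an
     alias spelling before synthesis. Structure per the W3C PLS spec. -->
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Zephyros</grapheme>
    <alias>ZEF-ee-ross</alias>
  </lexeme>
</lexicon>
```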
Which SDKs are available?
Official SDKs are available for Python and JavaScript/TypeScript. You can also use the HTTP API directly.
Where can I find documentation?
Complete API reference, code examples, and integration guides are available at elevenlabs.io/docs/api-reference.
Is the API suitable for enterprise use?
Yes. Enterprise plans include SOC 2 compliance, HIPAA support, GDPR compliance, EU data residency, zero retention mode, dedicated support, and custom SLAs.