Text to speech
Learn how to turn text into lifelike spoken audio with ElevenLabs.
Overview
The ElevenLabs text-to-speech (TTS) API turns text into lifelike audio with nuanced intonation, pacing, and emotional awareness. Our models adapt to textual cues across 32 languages and multiple voice styles, and can be used to:
- Narrate global media campaigns and ads
- Produce audiobooks in multiple languages with complex emotional delivery
- Stream real-time audio from text
Explore our voice library to find the perfect voice for your project.
Learn how to integrate text-to-speech into your application.
Step-by-step guide for using text-to-speech in ElevenLabs.
Voice quality
For real-time applications, Flash v2.5 provides ultra-low 75ms latency, while Multilingual v2 delivers the highest quality audio with more nuanced expression.
- Multilingual v2: our most lifelike, emotionally rich speech synthesis model
- Flash v2.5: our fast, affordable speech synthesis model
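As a minimal sketch of choosing between the two models, the snippet below builds (but does not send) an HTTP request and picks the model from a `realtime` flag. The endpoint URL, the `xi-api-key` header, and the model IDs `eleven_flash_v2_5` and `eleven_multilingual_v2` are assumptions not stated on this page, and `build_tts_request` is a hypothetical helper; check the API reference before relying on them.

```python
import json
import urllib.request

# Assumed values: the URL and model IDs below do not appear on this page.
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"
MODEL_LOW_LATENCY = "eleven_flash_v2_5"
MODEL_HIGH_QUALITY = "eleven_multilingual_v2"


def build_tts_request(text, voice_id, api_key, realtime=False):
    """Build (but do not send) a text-to-speech HTTP request."""
    # Prefer the low-latency model for real-time use, otherwise the
    # higher-quality model.
    model_id = MODEL_LOW_LATENCY if realtime else MODEL_HIGH_QUALITY
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_tts_request("Hello there.", "VOICE_ID", "API_KEY", realtime=True)
print(request.full_url)
print(json.loads(request.data)["model_id"])
```

Under these assumptions, passing the request to `urllib.request.urlopen` with a real voice ID and API key would return the audio bytes in the response body.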
Voice options
ElevenLabs offers thousands of voices across 32 languages through multiple creation methods:
- Voice library with 3,000+ community-shared voices
- Professional voice cloning for highest-fidelity replicas
- Instant voice cloning for quick voice replication
- Voice design to generate custom voices from text descriptions
Learn more about our voice options.
Supported formats
The default response format is MP3, but other formats such as PCM and μ-law are also available.
- MP3
  - Sample rates: 22.05 kHz to 44.1 kHz
  - Bitrates: 32 kbps to 192 kbps
- PCM (S16LE)
  - Sample rates: 16 kHz to 44.1 kHz
- μ-law
  - 8 kHz sample rate
  - Optimized for telephony applications
Higher-quality audio options are available only on paid tiers; see our pricing page for details.
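To compare the formats in practice, the sketch below estimates how many bytes one second of audio takes in each. It assumes mono audio, constant-bitrate MP3, 1 kbps = 1000 bits per second, and no container or header overhead; `bytes_per_second` is a hypothetical helper, not part of any SDK.

```python
def bytes_per_second(fmt, sample_rate_hz=None, bitrate_kbps=None):
    """Estimate the raw data rate of an audio format, in bytes per second."""
    if fmt == "mp3":
        # Compressed: the size depends on the bitrate, not the sample rate.
        return bitrate_kbps * 1000 // 8
    if fmt == "pcm":
        # S16LE: signed 16-bit little-endian, so 2 bytes per sample (mono assumed).
        return sample_rate_hz * 2
    if fmt == "ulaw":
        # mu-law stores 1 byte per sample at a fixed 8 kHz sample rate.
        return 8000
    raise ValueError(f"unknown format: {fmt}")


for label, rate in [
    ("MP3 at 128 kbps", bytes_per_second("mp3", bitrate_kbps=128)),
    ("PCM at 16 kHz", bytes_per_second("pcm", sample_rate_hz=16000)),
    ("PCM at 44.1 kHz", bytes_per_second("pcm", sample_rate_hz=44100)),
    ("mu-law at 8 kHz", bytes_per_second("ulaw")),
]:
    print(f"{label}: {rate} bytes/s, about {rate * 60 / 1e6:.2f} MB per minute")
```

The estimate shows why μ-law suits telephony: it needs the least bandwidth of the three, while uncompressed PCM at 44.1 kHz needs the most.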
Supported languages
Our v2 models support 29 languages:
English (USA, UK, Australia, Canada), Japanese, Chinese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (Saudi Arabia, UAE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian, and Russian.
Flash v2.5 supports 32 languages: all languages from the v2 models, plus Hungarian, Norwegian, and Vietnamese.
Simply input text in any of our supported languages and select a matching voice from our voice library. For the most natural results, choose a voice with an accent that matches your target language and region.
Prompting
The models interpret emotional context directly from the text input. For example, adding descriptive text like “she said excitedly” or using exclamation marks influences the emotion of the speech. Voice settings such as Stability and Similarity control how consistent the output is, while the underlying emotion comes from textual cues.
Read the prompting guide for more details.
Descriptive text is spoken aloud by the model, so trim or remove it from the generated audio if you do not want it in the final output.
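The sketch below shows one way to combine a textual cue with voice settings in a request payload. The field names `voice_settings`, `stability`, and `similarity_boost`, the 0 to 1 range, and the default values are assumptions about the API not stated on this page, and `make_payload` is a hypothetical helper.

```python
def make_payload(line, cue=None, stability=0.5, similarity=0.75):
    """Build a request payload with an optional emotional cue after the line."""
    # Both settings are assumed to be fractions between 0 and 1.
    for name, value in (("stability", stability), ("similarity", similarity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # The cue is part of the text, so the model will speak it aloud.
    text = f'"{line}" {cue}' if cue else line
    return {
        "text": text,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity,
        },
    }


payload = make_payload("We won the final!", cue="she said excitedly.")
print(payload["text"])
```

Because the cue is sent as ordinary text, it appears in the audio; the emotion comes from the text, and the settings only affect consistency.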
FAQ
Can I clone my own voice?
Yes, you can create instant voice clones of your own voice from short audio clips. For high-fidelity clones, check out our professional voice cloning feature.
Do I own the audio output?
Yes. You retain ownership of any audio you generate. However, commercial usage rights are only available with paid plans. With a paid subscription, you may use generated audio for commercial purposes and monetize the outputs if you own the IP rights to the input content.
How do I reduce latency for real-time use cases?
Use the low-latency Flash models (Flash v2 or v2.5), which are optimized for near real-time conversational and interactive scenarios. See our latency optimization guide for more details.
Why is my output sometimes inconsistent?
The models are nondeterministic. For consistency, use the optional seed parameter, though subtle differences may still occur.
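A minimal sketch of attaching a fixed seed to a request payload follows. The field name `seed` comes from the answer above; the integer range is an assumption, and `with_seed` is a hypothetical helper.

```python
MAX_SEED = 4294967295  # assumed upper bound (2**32 - 1); check the API reference


def with_seed(payload, seed):
    """Return a copy of a request payload with a fixed seed attached."""
    if not isinstance(seed, int) or not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be an integer from 0 to {MAX_SEED}")
    # Copy rather than mutate, so the same base payload can be reused.
    return {**payload, "seed": seed}


base = {"text": "The same sentence, generated twice."}
first = with_seed(base, 1234)
second = with_seed(base, 1234)
print(first == second)
```

Sending identical payloads with the same seed should make the outputs more consistent, although, as noted above, subtle differences may still occur.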
What's the best practice for large text conversions?
Split long text into segments and use streaming for real-time playback and efficient processing. To keep prosody natural across segment boundaries, pass the previous and next text, or the previous and next request IDs, with each request.
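The splitting approach can be sketched as follows: break the text at sentence boundaries, pack sentences into segments up to a size limit, and attach the neighbouring segments as context. The field names `previous_text` and `next_text` are assumptions about the API, the 200-character limit is arbitrary, and both functions are hypothetical helpers.

```python
import re


def split_into_segments(text, max_chars=200):
    """Pack whole sentences into segments of at most max_chars where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        # Start a new segment when adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments


def build_payloads(text, max_chars=200):
    """Build one payload per segment, with neighbouring text as context."""
    segments = split_into_segments(text, max_chars)
    payloads = []
    for i, segment in enumerate(segments):
        payload = {"text": segment}
        if i > 0:
            payload["previous_text"] = segments[i - 1]
        if i < len(segments) - 1:
            payload["next_text"] = segments[i + 1]
        payloads.append(payload)
    return payloads


for payload in build_payloads("One. Two. Three.", max_chars=9):
    print(payload)
```

A single sentence longer than `max_chars` is kept whole in this sketch rather than cut mid-sentence, since a mid-sentence cut would likely hurt prosody more than an oversized segment.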