Models
Flagship models
Our most lifelike, emotionally rich speech synthesis model
Our fast, affordable speech synthesis model
Models overview
The ElevenLabs API offers a range of audio models optimized for different use cases, quality levels, and performance requirements.
Older Models
These models are maintained for backward compatibility but are not recommended for new projects.
Multilingual v2
Eleven Multilingual v2 is our most advanced, emotionally aware speech synthesis model. It produces natural, lifelike speech with high emotional range and contextual understanding across multiple languages.
The model delivers consistent voice quality and personality across all supported languages while maintaining the speaker’s unique characteristics and accent.
This model excels in scenarios requiring high-quality, emotionally nuanced speech:
- Audiobook Production: Perfect for long-form narration with complex emotional delivery
- Character Voiceovers: Ideal for gaming and animation due to its emotional range
- Professional Content: Well-suited for corporate videos and e-learning materials
- Multilingual Projects: Maintains consistent voice quality across language switches
While it has a higher latency & cost per character than Flash models, it delivers superior quality for projects where lifelike speech is important.
Our v2 models support 29 languages:
English (USA, UK, Australia, Canada), Japanese, Chinese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (Saudi Arabia, UAE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian & Russian.
Flash v2.5
Eleven Flash v2.5 is our fastest speech synthesis model, designed for real-time applications and conversational AI. It delivers high-quality speech with ultra-low latency (~75ms†) across 32 languages.
The model balances speed and quality, making it ideal for interactive applications while maintaining natural-sounding output and consistent voice characteristics across languages.
This model is particularly well-suited for:
- Conversational AI: Perfect for real-time voice agents and chatbots
- Interactive Applications: Ideal for games and applications requiring immediate response
- Large-Scale Processing: Efficient for bulk text-to-speech conversion
With its lower price point and ~75ms latency, Flash v2.5 is the cost-effective option for anyone needing fast, reliable speech synthesis across multiple languages.
Flash v2.5 supports 32 languages - all languages from v2 models plus:
Hungarian, Norwegian & Vietnamese
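As a rough illustration of the language coverage described above, the following sketch checks which model supports a given language. The language names are copied from the lists on this page; the `SUPPORTED` dictionary and the `models_supporting` helper are hypothetical names used only for this example and are not part of the ElevenLabs API.

```python
# Languages listed for the v2 models on this page (regional variants collapsed).
V2_LANGUAGES = {
    "English", "Japanese", "Chinese", "German", "Hindi", "French", "Korean",
    "Portuguese", "Italian", "Spanish", "Indonesian", "Dutch", "Turkish",
    "Filipino", "Polish", "Swedish", "Bulgarian", "Romanian", "Arabic",
    "Czech", "Greek", "Finnish", "Croatian", "Malay", "Slovak", "Danish",
    "Tamil", "Ukrainian", "Russian",
}

# Flash v2.5 adds three languages on top of the v2 list.
FLASH_V2_5_LANGUAGES = V2_LANGUAGES | {"Hungarian", "Norwegian", "Vietnamese"}

# Hypothetical lookup table keyed by model ID (illustrative only).
SUPPORTED = {
    "eleven_multilingual_v2": V2_LANGUAGES,
    "eleven_flash_v2_5": FLASH_V2_5_LANGUAGES,
}


def models_supporting(language):
    """Return the model IDs (sorted) whose language list contains `language`."""
    return sorted(model for model, langs in SUPPORTED.items() if language in langs)


print(models_supporting("Vietnamese"))
print(models_supporting("Tamil"))
```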
Considerations
Text normalization with numbers
When using Flash v2.5, numbers aren’t normalized in a way you might expect. For example, phone numbers might be read out in a way that isn’t clear for the user. Dates and currencies are affected in a similar manner.
The Multilingual v2 model does a better job of normalizing numbers, so we recommend using it for phone numbers and other cases where number normalization is important.
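If you need to stay on Flash v2.5, one possible workaround is to spell out digits on the client before sending the text. The sketch below is an assumption about how that could be done, not an ElevenLabs feature: the `spell_out_phone_numbers` helper and its pattern are hypothetical, and the pattern is deliberately crude (it will also match other long digit runs, such as ISO dates).

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Crude phone-number pattern: an optional "+", a digit, at least five more
# digits or separators (space, parentheses, dot, dash), then a final digit.
PHONE_PATTERN = re.compile(r"\+?[0-9][0-9\s().-]{5,}[0-9]")


def spell_out_phone_numbers(text):
    """Replace phone-like digit runs with space-separated digit words."""
    def replace(match):
        # Keep only the digits; separators and a leading "+" are dropped.
        digits = [ch for ch in match.group() if ch in DIGIT_WORDS]
        return " ".join(DIGIT_WORDS[d] for d in digits)

    return PHONE_PATTERN.sub(replace, text)


print(spell_out_phone_numbers("Call 555-0134 now"))
```

Short numbers such as "42" are left untouched, so prices and counts still reach the model unchanged.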
Model selection guide
Requirements
Use eleven_multilingual_v2
Best for high-fidelity audio output with rich emotional expression
Use Flash models
Optimized for real-time applications (~75ms latency)
Use either eleven_multilingual_v2 or eleven_flash_v2_5
Both are multilingual: Multilingual v2 supports 29 languages and Flash v2.5 supports 32
Use case
Use eleven_multilingual_v2
Ideal for professional content, audiobooks & video narration.
Use eleven_flash_v2_5, eleven_flash_v2 or eleven_multilingual_v2
Perfect for real-time conversational applications
Use eleven_multilingual_sts_v2
Specialized for Speech-to-Speech conversion
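The guide above can be condensed into a small helper. This is only a sketch of one way to encode the recommendations: the model IDs come from this page, while the `choose_model` function and its flags are hypothetical, and the choice of Flash v2.5 as the real-time default is an assumption.

```python
def choose_model(*, realtime=False, speech_to_speech=False):
    """Pick a model ID following the selection guide on this page."""
    if speech_to_speech:
        # Specialized for Speech-to-Speech conversion.
        return "eleven_multilingual_sts_v2"
    if realtime:
        # Optimized for real-time applications (~75ms latency).
        return "eleven_flash_v2_5"
    # Default: high-fidelity output with rich emotional expression.
    return "eleven_multilingual_v2"


print(choose_model(realtime=True))
```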
Character limits
The maximum number of characters supported in a single text-to-speech request varies by model.
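When input text exceeds a model's limit, one common approach is to split it into several requests. The sketch below shows one way to do that at sentence boundaries. The `split_text` helper is hypothetical, and `max_chars` is supplied by the caller; no specific limit is assumed here, so look up the value for your model.

```python
import re


def split_text(text, max_chars):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    # Break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_text("One. Two. Three.", 10))
```

Splitting at sentence ends rather than at a fixed offset tends to keep intonation natural across chunk boundaries, though very long sentences are still cut at the limit.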
Scribe v1
Scribe v1 is our state-of-the-art speech recognition model designed for accurate transcription across 99 languages. It provides precise word-level timestamps and advanced features like speaker diarization and dynamic audio tagging.
This model excels in scenarios requiring accurate speech-to-text conversion:
- Transcription Services: Perfect for converting audio/video content to text
- Meeting Documentation: Ideal for capturing and documenting conversations
- Content Analysis: Well-suited for audio content processing and analysis
- Multilingual Recognition: Supports accurate transcription across 99 languages
Key features:
- Accurate transcription with word-level timestamps
- Speaker diarization for multi-speaker audio
- Dynamic audio tagging for enhanced context
- Support for 99 languages
Read more about Scribe v1 here.
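To illustrate how word-level timestamps and speaker diarization might be combined, the sketch below merges consecutive words from the same speaker into turns. The input shape (a list of dictionaries with `text`, `start`, `end` and `speaker_id` keys) is an assumption made for this example, so check the API reference for the actual response format. `group_by_speaker` is a hypothetical helper.

```python
def group_by_speaker(words):
    """Merge consecutive words spoken by the same speaker into turns."""
    turns = []
    for word in words:
        if turns and turns[-1]["speaker_id"] == word["speaker_id"]:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker_id": word["speaker_id"],
                "text": word["text"],
                "start": word["start"],
                "end": word["end"],
            })
    return turns


# Assumed, hand-written sample data (not real API output).
sample = [
    {"text": "Hi", "start": 0.0, "end": 0.3, "speaker_id": "speaker_0"},
    {"text": "there", "start": 0.3, "end": 0.6, "speaker_id": "speaker_0"},
    {"text": "Hello", "start": 0.9, "end": 1.2, "speaker_id": "speaker_1"},
]
for turn in group_by_speaker(sample):
    print(turn["speaker_id"], turn["start"], turn["end"], turn["text"])
```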
Concurrency and priority
Your subscription plan determines how many requests can be processed simultaneously and the priority level of your requests in the queue.
To increase your concurrency limit & queue priority, upgrade your subscription plan.
Enterprise customers can request a higher concurrency limit by contacting their account manager.
The response headers include current-concurrent-requests and maximum-concurrent-requests which you can use to monitor your concurrency.
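A minimal sketch of reading these two headers from a response is shown below. It takes any dictionary-like mapping of header names to values, so it does not depend on a particular HTTP client. The `concurrency_usage` helper is hypothetical, and it assumes both header values are integers.

```python
def concurrency_usage(headers):
    """Return (current, maximum, remaining) from the concurrency headers."""
    # Header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    current = int(normalized["current-concurrent-requests"])
    maximum = int(normalized["maximum-concurrent-requests"])
    return current, maximum, max(maximum - current, 0)


# Hand-written example headers (values are illustrative).
example_headers = {
    "Current-Concurrent-Requests": "3",
    "Maximum-Concurrent-Requests": "5",
}
print(concurrency_usage(example_headers))
```

A client could use the remaining count to decide whether to send another request immediately or wait for one in flight to finish.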