# ElevenLabs Documentation

> Explore our docs and guides to integrate ElevenLabs

ElevenCreative

Learn how to use the ElevenCreative platform with step-by-step guides

ElevenAgents

Learn how to build, launch, and scale agents with ElevenLabs

ElevenAPI

Learn how to integrate with the ElevenLabs API with examples and tutorials

## Meet the models

**Eleven v3**
Our most emotionally rich, expressive speech synthesis model

* Dramatic delivery and performance
* 70+ languages supported
* 5,000 character limit
* Support for natural multi-speaker dialogue

**Eleven Multilingual v2**
Lifelike, consistent quality speech synthesis model

* Natural-sounding output
* 29 languages supported
* 10,000 character limit
* Most stable on long-form generations

**Eleven Flash v2.5**
Our fast, affordable speech synthesis model

* Ultra-low latency (~75ms†)
* 32 languages supported
* 40,000 character limit
* Faster model, 50% lower price per character

**Eleven Turbo v2.5**
High quality, low-latency model with a good balance of quality and speed

* High quality voice generation
* 32 languages supported
* 40,000 character limit
* Low latency (~250ms-300ms†), 50% lower price per character

**Scribe v2**
State-of-the-art speech recognition model

* Accurate transcription in 90+ languages
* Keyterm prompting, up to 100 terms
* Entity detection, up to 56
* Precise word-level timestamps
* Speaker diarization, up to 32 speakers
* Dynamic audio tagging
* Smart language detection

**Scribe v2 Realtime**
Real-time speech recognition model

* Accurate transcription in 90+ languages
* Real-time transcription
* Low latency (~150ms†)
* Precise word-level timestamps

[Explore all](/docs/overview/models)
† Excluding application & network latency

## Browse by capability
Text to Speech

Convert text into lifelike speech

Speech to Text

Transcribe spoken audio into text

Music

Generate music from text

Text to Dialogue

Create natural-sounding dialogue from text

Image & Video

Generate images and videos from text

Voice changer

Modify and transform voices

Voice isolator

Isolate voices from background noise

Dubbing

Dub audio and videos seamlessly

Sound effects

Create cinematic sound effects

Voices

Clone and design custom voices

Voice Remixing

Transform and enhance existing voices

Forced Alignment

Align text to audio

ElevenAgents

Deploy intelligent voice agents

# Models

> Learn about the models that power the ElevenLabs API.

## Flagship models

### Text to Speech

**Eleven v3**
Our most emotionally rich, expressive speech synthesis model

* Dramatic delivery and performance
* 70+ languages supported
* 5,000 character limit
* Support for natural multi-speaker dialogue

**Eleven Multilingual v2**
Lifelike, consistent quality speech synthesis model

* Natural-sounding output
* 29 languages supported
* 10,000 character limit
* Most stable on long-form generations

**Eleven Flash v2.5**
Our fast, affordable speech synthesis model

* Ultra-low latency (~75ms†)
* 32 languages supported
* 40,000 character limit
* Faster model, 50% lower price per character

**Eleven Turbo v2.5**
High quality, low-latency model with a good balance of quality and speed

* High quality voice generation
* 32 languages supported
* 40,000 character limit
* Low latency (~250ms-300ms†), 50% lower price per character

### Speech to Text

**Scribe v2**
State-of-the-art speech recognition model

* Accurate transcription in 90+ languages
* Keyterm prompting, up to 100 terms
* Entity detection, up to 56
* Precise word-level timestamps
* Speaker diarization, up to 32 speakers
* Dynamic audio tagging
* Smart language detection

**Scribe v2 Realtime**
Real-time speech recognition model

* Accurate transcription in 90+ languages
* Real-time transcription
* Low latency (~150ms†)
* Precise word-level timestamps

### Music

**Eleven Music**
Studio-grade music with natural language prompts in any style

* Complete control over genre, style, and structure
* Vocals or just instrumental
* Multilingual, including English, Spanish, German, Japanese and more
* Edit the sound and lyrics of individual sections or the whole song

[Pricing](https://elevenlabs.io/pricing/api)
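To check which of these models is available to your account programmatically, you can query the models endpoint. Below is a minimal sketch; it assumes the `requests` package and an API key stored in an `ELEVENLABS_API_KEY` environment variable:

```python title="list_models.py"
# Minimal sketch: list the models available to your API key
# via GET /v1/models.
import os

import requests

response = requests.get(
    "https://api.elevenlabs.io/v1/models",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
)
response.raise_for_status()

for model in response.json():
    # Each entry includes a model ID and a human-readable name
    print(model["model_id"], "-", model.get("name"))
```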
## Models overview

The ElevenLabs API offers a range of audio models optimized for different use cases, quality levels, and performance requirements.

| Model ID | Description | Languages |
| --- | --- | --- |
| `eleven_v3` | Human-like and expressive speech generation | [70+ languages](/docs/overview/models#supported-languages) |
| `eleven_ttv_v3` | Human-like and expressive voice design model (Text to Voice) | [70+ languages](/docs/overview/models#supported-languages) |
| `eleven_multilingual_v2` | Our most lifelike model with rich emotional expression | `en`, `ja`, `zh`, `de`, `hi`, `fr`, `ko`, `pt`, `it`, `es`, `id`, `nl`, `tr`, `fil`, `pl`, `sv`, `bg`, `ro`, `ar`, `cs`, `el`, `fi`, `hr`, `ms`, `sk`, `da`, `ta`, `uk`, `ru` |
| `eleven_flash_v2_5` | Ultra-fast model optimized for real-time use (~75ms†) | All `eleven_multilingual_v2` languages plus: `hu`, `no`, `vi` |
| `eleven_flash_v2` | Ultra-fast model optimized for real-time use (~75ms†) | `en` |
| `eleven_turbo_v2_5` | High quality, low-latency model with a good balance of quality and speed (~250ms-300ms) | `en`, `ja`, `zh`, `de`, `hi`, `fr`, `ko`, `pt`, `it`, `es`, `id`, `nl`, `tr`, `fil`, `pl`, `sv`, `bg`, `ro`, `ar`, `cs`, `el`, `fi`, `hr`, `ms`, `sk`, `da`, `ta`, `uk`, `ru`, `hu`, `no`, `vi` |
| `eleven_turbo_v2` | High quality, low-latency model with a good balance of quality and speed (~250ms-300ms) | `en` |
| `eleven_multilingual_sts_v2` | State-of-the-art multilingual voice changer model (Speech to Speech) | `en`, `ja`, `zh`, `de`, `hi`, `fr`, `ko`, `pt`, `it`, `es`, `id`, `nl`, `tr`, `fil`, `pl`, `sv`, `bg`, `ro`, `ar`, `cs`, `el`, `fi`, `hr`, `ms`, `sk`, `da`, `ta`, `uk`, `ru` |
| `eleven_multilingual_ttv_v2` | State-of-the-art multilingual voice designer model (Text to Voice) | `en`, `ja`, `zh`, `de`, `hi`, `fr`, `ko`, `pt`, `it`, `es`, `id`, `nl`, `tr`, `fil`, `pl`, `sv`, `bg`, `ro`, `ar`, `cs`, `el`, `fi`, `hr`, `ms`, `sk`, `da`, `ta`, `uk`, `ru` |
| `eleven_english_sts_v2` | English-only voice changer model (Speech to Speech) | `en` |
| `scribe_v2_realtime` | Real-time speech recognition model | [90+ languages](/docs/overview/capabilities/speech-to-text#supported-languages) |
| `scribe_v2` | State-of-the-art speech recognition model | [90+ languages](/docs/overview/capabilities/speech-to-text#supported-languages) |
| `scribe_v1` | State-of-the-art speech recognition, outclassed by v2 models | [90+ languages](/docs/overview/capabilities/speech-to-text#supported-languages) |
| `eleven_text_to_sound_v2` | Sound effects generation from text prompts | N/A |
| `music_v1` | Studio-grade music generation from text prompts | `en`, `es`, `de`, `ja`, and more |

† Excluding application & network latency

### Deprecated models

The `eleven_monolingual_v1` and `eleven_multilingual_v1` models are deprecated and will be removed in the future. Please migrate to newer models for continued service.
| Model ID | Description | Languages | Replacement model suggestion |
| --- | --- | --- | --- |
| `eleven_monolingual_v1` | First generation TTS model (outclassed by v2 models) | `en` | `eleven_multilingual_v2` |
| `eleven_multilingual_v1` | First multilingual model (outclassed by v2 models) | `en`, `fr`, `de`, `hi`, `it`, `pl`, `pt`, `es` | `eleven_multilingual_v2` |

## Eleven v3

Eleven v3 is our latest and most advanced speech synthesis model. It is a state-of-the-art model that produces natural, lifelike speech with high emotional range and contextual understanding across multiple languages.

This model works well in the following scenarios:

* **Character Discussions**: Excellent for audio experiences with multiple characters that interact with each other.
* **Audiobook Production**: Perfect for long-form narration with complex emotional delivery.
* **Emotional Dialogue**: Generate natural, lifelike dialogue with high emotional range and contextual understanding.

With Eleven v3 comes a new Text to Dialogue API, which allows you to generate natural, lifelike dialogue with high emotional range and contextual understanding across multiple languages. Eleven v3 can also be used with the Text to Speech API for expressive single-voice generation. Read more about the Text to Dialogue API [here](/docs/overview/capabilities/text-to-dialogue).

### Supported languages

The Eleven v3 model supports 70+ languages, including:

*Afrikaans (afr), Arabic (ara), Armenian (hye), Assamese (asm), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Galician (glg), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kannada (kan), Kazakh (kaz), Kirghiz (kir), Korean (kor), Latvian (lav), Lingala (lin), Lithuanian (lit), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Mandarin Chinese (cmn), Marathi (mar), Nepali (nep), Norwegian (nor), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Urdu (urd), Vietnamese (vie), Welsh (cym).*

## Multilingual v2

Eleven Multilingual v2 is our most advanced, emotionally-aware speech synthesis model. It produces natural, lifelike speech with high emotional range and contextual understanding across multiple languages. The model delivers consistent voice quality and personality across all supported languages while maintaining the speaker's unique characteristics and accent.

This model excels in scenarios requiring high-quality, emotionally nuanced speech:

* **Character Voiceovers**: Ideal for gaming and animation due to its emotional range.
* **Professional Content**: Well-suited for corporate videos and e-learning materials.
* **Multilingual Projects**: Maintains consistent voice quality across language switches.
* **Stable Quality**: Produces consistent, high-quality audio output.

While it has a higher latency & cost per character than Flash models, it delivers superior quality for projects where lifelike speech is important.

Our multilingual v2 models support 29 languages: *English (USA, UK, Australia, Canada), Japanese, Chinese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (Saudi Arabia, UAE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian & Russian.*

## Flash v2.5

Eleven Flash v2.5 is our fastest speech synthesis model, designed for real-time applications and the Agents Platform. It delivers high-quality speech with ultra-low latency (~75ms†) across 32 languages. The model balances speed and quality, making it ideal for interactive applications while maintaining natural-sounding output and consistent voice characteristics across languages.

This model is particularly well-suited for:

* **Agents Platform**: Perfect for real-time voice agents and chatbots.
* **Interactive Applications**: Ideal for games and applications requiring immediate response.
* **Large-Scale Processing**: Efficient for bulk text-to-speech conversion.

With its lower price point and ~75ms latency, Flash v2.5 is the cost-effective option for anyone needing fast, reliable speech synthesis across multiple languages.

Flash v2.5 supports 32 languages: all languages from the v2 models plus *Hungarian, Norwegian & Vietnamese*.

† Excluding application & network latency

### Considerations

When using Flash v2.5, numbers aren't normalized by default, so complex items like phone numbers may be read out in a way that isn't clear for the user. Dates and currencies are affected in a similar manner. Normalization is disabled for Flash v2.5 by default to maintain its low latency. However, Enterprise customers can now enable text normalization for v2.5 models by setting the `apply_text_normalization` parameter to `"on"` in the request.

The Multilingual v2 model does a better job of normalizing numbers, so we recommend using it for phone numbers and other cases where number normalization is important. For low-latency or Agents Platform applications, best practice is to have your LLM [normalize the text](/docs/overview/capabilities/text-to-speech/best-practices#text-normalization) before passing it to the TTS model, or use the `apply_text_normalization` parameter (Enterprise plans only for v2.5 models).

## Turbo v2.5

Eleven Turbo v2.5 is our high-quality, low-latency model with a good balance of quality and speed. It is an ideal choice for all scenarios where you'd use Flash v2.5, but where you're willing to trade some latency for higher-quality voice generation.

## Model selection guide

* Use `eleven_multilingual_v2` for high-fidelity audio output with rich emotional expression.
* Use Flash models for real-time applications (~75ms latency).
* Use either `eleven_multilingual_v2` or `eleven_flash_v2_5` for multilingual projects; both support up to 32 languages.
* Use `eleven_turbo_v2_5` for a good balance between quality and speed.
* Use `eleven_multilingual_v2` for professional content, audiobooks & video narration.
* Use `eleven_flash_v2_5`, `eleven_flash_v2`, `eleven_multilingual_v2`, `eleven_turbo_v2_5` or `eleven_turbo_v2` for real-time conversational applications.
* Use `eleven_multilingual_sts_v2` for Speech-to-Speech conversion.
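As a concrete illustration of the Considerations above, here is a minimal sketch of a Flash v2.5 request with text normalization enabled (Enterprise-only for v2.5 models). It assumes the `requests` package, an `ELEVENLABS_API_KEY` environment variable, and a placeholder voice ID:

```python title="flash_with_normalization.py"
# Minimal sketch: Flash v2.5 request with text normalization enabled.
import os

import requests

VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"  # placeholder - use any voice from your voice library

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "Call 123-456-7890 before 9:23 AM.",
        "model_id": "eleven_flash_v2_5",
        "apply_text_normalization": "on",  # "auto" | "on" | "off"; "on" is Enterprise-only for v2.5
    },
)
response.raise_for_status()

with open("output.mp3", "wb") as f:
    f.write(response.content)  # the default output format is MP3
```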
## Character limits

The maximum number of characters supported in a single text-to-speech request varies by model.

| Model ID | Character limit | Approximate audio duration |
| --- | --- | --- |
| `eleven_v3` | 5,000 | ~5 minutes |
| `eleven_flash_v2_5` | 40,000 | ~40 minutes |
| `eleven_flash_v2` | 30,000 | ~30 minutes |
| `eleven_turbo_v2_5` | 40,000 | ~40 minutes |
| `eleven_turbo_v2` | 30,000 | ~30 minutes |
| `eleven_multilingual_v2` | 10,000 | ~10 minutes |
| `eleven_multilingual_v1` | 10,000 | ~10 minutes |
| `eleven_english_sts_v2` | 10,000 | ~10 minutes |
| `eleven_english_sts_v1` | 10,000 | ~10 minutes |

For longer content, consider splitting the input into multiple requests.

## Scribe v2

Scribe v2 is our state-of-the-art speech recognition model, designed for accurate transcription across 90+ languages. It provides precise word-level timestamps and advanced features like speaker diarization and dynamic audio tagging.

This model excels in scenarios requiring accurate speech-to-text conversion:

* **Transcription Services**: Perfect for converting audio/video content to text
* **Meeting Documentation**: Ideal for capturing and documenting conversations
* **Content Analysis**: Well-suited for audio content processing and analysis
* **Multilingual Recognition**: Supports accurate transcription across 90+ languages

Key features:

* Accurate transcription with word-level timestamps
* Speaker diarization for multi-speaker audio
* Dynamic audio tagging for enhanced context
* Support for 90+ languages
* Entity detection
* Keyterm prompting

Read more about Scribe v2 [here](/docs/overview/capabilities/speech-to-text).

## Scribe v2 Realtime

Scribe v2 Realtime, our fastest and most accurate live speech recognition model, delivers state-of-the-art accuracy in over 90 languages with ultra-low ~150ms latency.

This model excels in conversational use cases:

* **Live meeting transcription**: Perfect for realtime transcription
* **AI Agents**: Ideal for live conversations
* **Multilingual Recognition**: Supports accurate transcription across 90+ languages with automatic language recognition

Key features:

* Ultra-low latency: Get partial transcriptions in ~150 milliseconds
* Streaming support: Send audio in chunks while receiving transcripts in real time
* Multiple audio formats: Support for PCM (8kHz to 48kHz) and μ-law encoding
* Voice Activity Detection (VAD): Automatic speech segmentation based on silence detection
* Manual commit control: Full control over when to finalize transcript segments

Read more about Scribe v2 Realtime [here](/docs/overview/capabilities/speech-to-text).
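For a sense of what Scribe returns, here is a minimal transcription sketch against the Speech to Text endpoint. It assumes the `requests` package, an `ELEVENLABS_API_KEY` environment variable, and a local `meeting.mp3` (a placeholder file name); the printed fields follow the documented response shape:

```python title="transcribe.py"
# Minimal sketch: transcribe a local audio file via POST /v1/speech-to-text.
import os

import requests

with open("meeting.mp3", "rb") as audio_file:  # placeholder file name
    response = requests.post(
        "https://api.elevenlabs.io/v1/speech-to-text",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        data={"model_id": "scribe_v2"},  # model IDs as listed in the models table above
        files={"file": audio_file},
    )
response.raise_for_status()

transcript = response.json()
print(transcript["text"])  # the full transcript

# Word-level timestamps (start/end in seconds)
for word in transcript["words"]:
    print(word["text"], word["start"], word["end"])
```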
## Eleven Music

Eleven Music is our studio-grade music generation model. It allows you to generate music in any style from natural language prompts.

This model is excellent for the following scenarios:

* **Game Soundtracks**: Create immersive soundtracks for games
* **Podcast Backgrounds**: Enhance podcasts with professional music
* **Marketing**: Add background music to ad reels

Key features:

* Complete control over genre, style, and structure
* Vocals or just instrumental
* Multilingual, including English, Spanish, German, Japanese and more
* Edit the sound and lyrics of individual sections or the whole song

Read more about Eleven Music [here](/docs/overview/capabilities/music).

## Concurrency and priority

Your subscription plan determines how many requests can be processed simultaneously and the priority level of your requests in the queue. Speech to Text has an elevated concurrency limit. Once the concurrency limit is met, subsequent requests are processed in a queue alongside lower-priority requests. In practice this typically only adds ~50ms of latency.

| Plan | Concurrency limit (Multilingual v2) | Concurrency limit (Turbo & Flash) | STT concurrency limit | Realtime STT concurrency limit | Music concurrency limit | Priority level |
| --- | --- | --- | --- | --- | --- | --- |
| Free | 2 | 4 | 8 | 6 | 0 | 3 |
| Starter | 3 | 6 | 12 | 9 | 2 | 4 |
| Creator | 5 | 10 | 20 | 15 | 2 | 5 |
| Pro | 10 | 20 | 40 | 30 | 2 | 5 |
| Scale | 15 | 30 | 60 | 45 | 5 | 5 |
| Business | 15 | 30 | 60 | 45 | 5 | 5 |
| Enterprise | Elevated | Elevated | Elevated | Elevated | Highest | 6 |

Startup grants recipients receive Scale level benefits.

The response headers include `current-concurrent-requests` and `maximum-concurrent-requests`, which you can use to monitor your concurrency.

### API requests per minute vs concurrent requests

It's important to understand that **API requests per minute** and **concurrent requests** are different metrics. Requests per minute can differ from concurrent requests because concurrency depends on how long each request takes and how requests are batched.

**Example 1: Spaced requests**

If you had 180 requests per minute that each took 1 second to complete and you sent them 0.33 seconds apart, the max concurrent requests would be 3 and the average would be 3, since there would always be 3 in flight.

**Example 2: Batched requests**

However, if you had a different usage pattern, such as 180 requests per minute that each took 3 seconds to complete but all fired at once, the max concurrent requests would be 180 and the average would be 9 (the first 3 seconds of the minute saw 180 requests at once; the final 57 seconds saw 0 requests).

Since our system cares about concurrency, requests per minute matter less than how long each request takes and the pattern in which requests are sent.

How endpoint requests are made impacts concurrency limits:

* With HTTP, each request counts individually toward your concurrency limit.
* With a WebSocket, only the time during which our model is generating audio counts toward your concurrency limit; for most of the time, an open WebSocket doesn't count toward your concurrency limit at all.

### Understanding concurrency limits

The concurrency limit associated with your plan should not be interpreted as the maximum number of simultaneous conversations, phone calls, character voiceovers, etc. that can be handled at once. The actual number depends on several factors, including the specific AI voices used and the characteristics of the use case. As a general rule of thumb, a concurrency limit of 5 can typically support up to approximately 100 simultaneous audio broadcasts. This is because audio is generated much faster than it plays back, so each TTS request occupies a concurrency slot for only a small fraction of the broadcast time. The diagram below is an example of how 4 concurrent calls with different users can be facilitated while only hitting 2 concurrent requests.

*(Diagram: concurrency limits)*

Where TTS is used to facilitate dialogue, a concurrency limit of 5 can support about 100 broadcasts for balanced conversations between AI agents and human participants. For use cases in which the AI agent speaks less frequently than the human, such as customer support interactions, more than 100 simultaneous conversations could be supported.

Generally, more than 100 simultaneous character voiceovers can be supported for a concurrency limit of 5. The number can vary depending on the character's dialogue frequency, the length of pauses, and in-game actions between lines.

Concurrent dubbing streams generally follow the provided heuristic. If the broadcast involves periods of conversational pauses (e.g. because of a soundtrack, visual scenes, etc.), more simultaneous dubbing streams than the suggestion may be possible.

If you exceed your plan's concurrency limit at any point and you are on the Enterprise plan, model requests may still succeed, albeit slower, on a best-efforts basis depending on available capacity. To increase your concurrency limit & queue priority, [upgrade your subscription plan](https://elevenlabs.io/pricing/api). Enterprise customers can request a higher concurrency limit by contacting their account manager.
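Before running a full scale test, it can help to watch the concurrency headers mentioned above on ordinary responses. A minimal sketch, assuming the `requests` package, an `ELEVENLABS_API_KEY` environment variable, and a placeholder voice ID:

```python title="check_concurrency.py"
# Minimal sketch: read the concurrency headers returned with each API response.
import os

import requests

VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"  # placeholder voice ID

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={"text": "Hello!", "model_id": "eleven_flash_v2_5"},
)
response.raise_for_status()

# Header lookups are case-insensitive in requests
current = response.headers.get("current-concurrent-requests")
maximum = response.headers.get("maximum-concurrent-requests")
print(f"Concurrency: {current}/{maximum}")
```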
### Scale testing concurrency limits

Scale testing can be useful to identify client-side scaling issues and to verify that concurrency limits are set correctly for your use case. It is heavily recommended to test end-to-end workflows as close to real-world usage as possible; simulating and measuring how many users can be supported is the recommended methodology for achieving this. It is important to:

* Simulate users, not raw requests
* Simulate typical user behavior, such as waiting for audio playback, user speaking, or transcription to finish before making requests
* Ramp up the number of users slowly over a period of minutes
* Introduce randomness to request timings and to the size of requests
* Capture latency metrics and any returned error codes from the API

For example, to test an agent system designed to support 100 simultaneous conversations you would create up to 100 individual "users", each simulating a conversation. Conversations typically consist of a repeating cycle of ~10 seconds of user talking, followed by the TTS API call for ~150 characters, followed by ~10 seconds of audio playback to the user. Therefore, each user should follow the pattern of making a WebSocket Text to Speech API call for 150 characters of text every 20 seconds, with a small amount of randomness introduced to the wait period and the number of characters requested. The test would consist of spawning one user per second until 100 exist and then running for 10 minutes in total to test overall stability.

This example uses [locust](https://locust.io/) as the testing framework with direct API calls to the ElevenLabs API. It follows the example listed above, testing a conversational agent system with each user sending 1 request every 20 seconds.

```python title="Python"
import json
import random
import time

import gevent
import locust
import websocket
from locust import User, task, events, constant_throughput

# Averages up to 10 seconds of audio when played, depends on the voice speed
DEFAULT_TEXT = (
    "Hello, this is a test message. I am testing if a long input will cause issues for the model "
    "like this sentence. "
)

TEXT_ARRAY = [
    "Hello.",
    "Hello, this is a test message.",
    DEFAULT_TEXT,
    DEFAULT_TEXT * 2,
    DEFAULT_TEXT * 3,
]


# Custom command line arguments
@events.init_command_line_parser.add_listener
def on_parser_init(parser):
    parser.add_argument("--api-key", default="YOUR_API_KEY", help="API key for authentication")
    parser.add_argument("--encoding", default="mp3_22050_32", help="Output encoding")
    parser.add_argument("--text", default=DEFAULT_TEXT, help="Text to use")
    parser.add_argument("--use-text-array", default="false", help="Pick the text randomly from TEXT_ARRAY")
    parser.add_argument("--voice-id", default="aria", help="Voice ID to use")


class WebSocketTTSUser(User):
    # Each user will send a request every 20 seconds, regardless of how long each request takes
    wait_time = constant_throughput(0.05)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.api_key = self.environment.parsed_options.api_key
        self.voice_id = self.environment.parsed_options.voice_id
        self.text = self.environment.parsed_options.text
        self.encoding = self.environment.parsed_options.encoding
        self.use_text_array = self.environment.parsed_options.use_text_array
        if self.use_text_array:
            self.text = random.choice(TEXT_ARRAY)
        self.all_received = False

    @task
    def tts_task(self):
        # Do jitter waiting of up to 1 second.
        # Users appear to be spawned every second, so this ensures requests are not aligned.
        gevent.sleep(random.random())

        max_wait_time = 10

        # Connection details
        uri = (
            f"{self.environment.host}/v1/text-to-speech/{self.voice_id}/stream-input"
            f"?auto_mode=true&output_format={self.encoding}"
        )
        headers = {"xi-api-key": self.api_key}
        ws = None
        self.all_received = False
        try:
            init_msg = {"text": " "}
            # Use proper header format for websocket - this is case sensitive!
            ws = websocket.create_connection(uri, header=headers)
            ws.send(json.dumps(init_msg))

            # Start measuring after the websocket is initiated but before any messages are sent
            send_request_time = time.perf_counter()
            ws.send(json.dumps({"text": self.text}))
            # Send to flush and receive the audio
            ws.send(json.dumps({"text": ""}))

            def _receive():
                t_first_response = None
                audio_size = 0
                try:
                    while True:
                        # Wait up to 10 seconds for a response
                        ws.settimeout(max_wait_time)
                        response = ws.recv()
                        response_data = json.loads(response)
                        if "audio" in response_data and response_data["audio"]:
                            audio_size += len(response_data["audio"])
                            if t_first_response is None:
                                t_first_response = time.perf_counter()
                        if t_first_response is None:
                            # The first response should always have audio
                            locust.events.request.fire(
                                request_type="websocket",
                                name="Bad Response (no audio)",
                                response_time=(time.perf_counter() - send_request_time) * 1000,
                                response_length=audio_size,
                                exception=Exception("Response has no audio"),
                            )
                            break
                        first_byte_ms = (t_first_response - send_request_time) * 1000
                        if "isFinal" in response_data and response_data["isFinal"]:
                            # Fire this event once finished streaming, but report the important TTFB metric
                            locust.events.request.fire(
                                request_type="websocket",
                                name="TTS Stream Success (First Byte)",
                                response_time=first_byte_ms,
                                response_length=audio_size,
                                exception=None,
                            )
                            break
                except websocket.WebSocketTimeoutException:
                    locust.events.request.fire(
                        request_type="websocket",
                        name="TTS Stream Timeout",
                        response_time=max_wait_time * 1000,
                        response_length=audio_size,
                        exception=Exception("Timeout waiting for response"),
                    )
                except Exception as e:
                    # Typically a JSON decode error if the server returns an HTTP backoff error
                    locust.events.request.fire(
                        request_type="websocket",
                        name="TTS Stream Failure",
                        response_time=0,
                        response_length=0,
                        exception=e,
                    )
                finally:
                    self.all_received = True

            gevent.spawn(_receive)
            # Sleep until everything is received so new tasks aren't spawned
            while not self.all_received:
                gevent.sleep(1)
        except websocket.WebSocketTimeoutException:
            locust.events.request.fire(
                request_type="websocket",
                name="TTS Stream Timeout",
                response_time=max_wait_time * 1000,
                response_length=0,
                exception=Exception("Timeout waiting for response"),
            )
        except Exception as e:
            locust.events.request.fire(
                request_type="websocket",
                name="TTS Stream Failure",
                response_time=0,
                response_length=0,
                exception=e,
            )
        finally:
            # Try to close the websocket gracefully
            try:
                if ws:
                    ws.close()
            except Exception:
                pass
```

# Text to Speech

> Learn how to turn text into lifelike spoken audio with ElevenLabs.

## Overview

The ElevenLabs [Text to Speech (TTS)](/docs/api-reference/text-to-speech/convert) API turns text into lifelike audio with nuanced intonation, pacing and emotional awareness. [Our models](/docs/overview/models) adapt to textual cues across 32 languages and multiple voice styles and can be used to:

* Narrate global media campaigns & ads
* Produce audiobooks in multiple languages with complex emotional delivery
* Stream real-time audio from text

Explore our [voice library](https://elevenlabs.io/app/voice-library) to find the perfect voice for your project. The voice library is not available via the API to free tier users.

### Voice quality

For real-time applications, Flash v2.5 provides ultra-low 75ms latency, while Multilingual v2 delivers the highest quality audio with more nuanced expression.
**Eleven v3**
Our most emotionally rich, expressive speech synthesis model

* Dramatic delivery and performance
* 70+ languages supported
* 5,000 character limit
* Support for natural multi-speaker dialogue

**Eleven Multilingual v2**
Lifelike, consistent quality speech synthesis model

* Natural-sounding output
* 29 languages supported
* 10,000 character limit
* Most stable on long-form generations

**Eleven Flash v2.5**
Our fast, affordable speech synthesis model

* Ultra-low latency (~75ms†)
* 32 languages supported
* 40,000 character limit
* Faster model, 50% lower price per character

**Eleven Turbo v2.5**
High quality, low-latency model with a good balance of quality and speed

* High quality voice generation
* 32 languages supported
* 40,000 character limit
* Low latency (~250ms-300ms†), 50% lower price per character
[Explore all](/docs/overview/models)
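For real-time use cases, audio can be streamed as it is generated rather than downloaded in one response. Here is a minimal sketch using the streaming variant of the Text to Speech endpoint; it assumes the `requests` package, an `ELEVENLABS_API_KEY` environment variable, and a placeholder voice ID:

```python title="stream_tts.py"
# Minimal sketch: stream audio chunks from the /stream variant of Text to Speech.
import os

import requests

VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"  # placeholder - pick a voice from the voice library

with requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "Streaming lets playback begin before generation finishes.",
        "model_id": "eleven_flash_v2_5",  # low-latency model for real-time use
    },
    stream=True,
) as response:
    response.raise_for_status()
    with open("speech.mp3", "wb") as f:
        # Chunks arrive as the model generates audio
        for chunk in response.iter_content(chunk_size=4096):
            f.write(chunk)
```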
### Voice options

ElevenLabs offers thousands of voices across 32 languages through multiple creation methods:

* [Voice library](/docs/overview/capabilities/voices) with 3,000+ community-shared voices
* [Professional voice cloning](/docs/overview/capabilities/voices#cloned) for highest-fidelity replicas
* [Instant voice cloning](/docs/overview/capabilities/voices#cloned) for quick voice replication
* [Voice design](/docs/overview/capabilities/voices#voice-design) to generate custom voices from text descriptions

Learn more about our [voice options](/docs/overview/capabilities/voices).

### Supported formats

The default response format is MP3, but other formats like PCM and μ-law are available.

* **MP3**
  * Sample rates: 22.05kHz - 44.1kHz
  * Bitrates: 32kbps - 192kbps
  * 22.05kHz @ 32kbps
  * 44.1kHz @ 32kbps, 64kbps, 96kbps, 128kbps, 192kbps
* **PCM (S16LE)**
  * Sample rates: 8kHz, 16kHz, 22.05kHz, 24kHz, 44.1kHz, 48kHz
  * 16-bit depth
* **μ-law**
  * 8kHz sample rate
  * Optimized for telephony applications
* **A-law**
  * 8kHz sample rate
  * Optimized for telephony applications
* **Opus**
  * Sample rate: 48kHz
  * Bitrates: 32kbps - 192kbps

Higher quality audio options are only available on paid tiers - see our [pricing page](https://elevenlabs.io/pricing/api) for details.

### Supported languages

Our multilingual v2 models support 29 languages: *English (USA, UK, Australia, Canada), Japanese, Chinese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), Indonesian, Dutch, Turkish, Filipino, Polish, Swedish, Bulgarian, Romanian, Arabic (Saudi Arabia, UAE), Czech, Greek, Finnish, Croatian, Malay, Slovak, Danish, Tamil, Ukrainian & Russian.*

Flash v2.5 supports 32 languages: all languages from the v2 models plus *Hungarian, Norwegian & Vietnamese*.

Simply input text in any of our supported languages and select a matching voice from our [voice library](https://elevenlabs.io/app/voice-library). For the most natural results, choose a voice with an accent that matches your target language and region.

### Prompting

The models interpret emotional context directly from the text input. For example, adding descriptive text like "she said excitedly" or using exclamation marks will influence the speech emotion. Voice settings like Stability and Similarity help control the consistency, while the underlying emotion comes from textual cues. Read the [prompting guide](/docs/overview/capabilities/text-to-speech/best-practices) for more details.

Descriptive text will be spoken out by the model and must be manually trimmed or removed from the audio if desired.

## FAQ

**Can I clone my own voice?**
Yes, you can create [instant voice clones](/docs/overview/capabilities/voices#cloned) of your own voice from short audio clips. For high-fidelity clones, check out our [professional voice cloning](/docs/overview/capabilities/voices#cloned) feature.

**Do I own the audio I generate?**
Yes. You retain ownership of any audio you generate. However, commercial usage rights are only available with paid plans. With a paid subscription, you may use generated audio for commercial purposes and monetize the outputs if you own the IP rights to the input content.

**What is a free regeneration?**
A free regeneration allows you to regenerate the same text to speech content without additional cost, subject to these conditions:

* You can regenerate each piece of content up to 2 times for free
* The content must be exactly the same as the previous generation. Any changes to the text, voice settings, or other parameters will require a new, paid generation.
Free regenerations are useful in case there is a slight distortion in the audio output. According to ElevenLabs' internal benchmarks, regenerations will solve roughly half of quality issues, with remaining issues usually due to poor training data.

**How do I reduce latency for real-time applications?**
Use the low-latency Flash [models](/docs/overview/models) (Flash v2 or v2.5) optimized for near real-time conversational or interactive scenarios. See our [latency optimization guide](/docs/developers/best-practices/latency-optimization) for more details.

**Why does the same text sound different across generations?**
The models are nondeterministic. For consistency, use the optional [seed parameter](/docs/api-reference/text-to-speech/convert#request.body.seed), though subtle differences may still occur.

**How do I handle long-form content?**
Split long text into segments and use streaming for real-time playback and efficient processing. To maintain natural prosody flow between chunks, include [previous/next text or previous/next request id parameters](/docs/api-reference/text-to-speech/convert#request.body.previous_text).

# Best practices

> Learn how to control delivery, pronunciation, emotion, and optimize text for speech.

This guide provides techniques to enhance text-to-speech outputs using ElevenLabs models. Experiment with these methods to discover what works best for your needs.

## Controls

We are actively working on *Director's Mode* to give you even greater control over outputs. These techniques provide a practical way to achieve nuanced results until advanced features like *Director's Mode* are rolled out.

### Pauses

Eleven v3 does not support SSML break tags. Use the techniques described in the [Prompting Eleven v3](#prompting-eleven-v3) section for controlling pauses with v3.

Use `<break time="1.5s" />` for natural pauses up to 3 seconds.

Using too many break tags in a single generation can cause instability. The AI might speed up, or introduce additional noises or audio artifacts. We are working on resolving this.

```text Example
"Hold on, let me think." <break time="1.5s" /> "Alright, I've got it."
```

* **Consistency:** Use `<break>` tags consistently to maintain natural speech flow. Excessive use can lead to instability.
* **Voice-specific behavior:** Different voices may handle pauses differently, especially those trained with filler sounds like "uh" or "ah."

Alternatives to `<break>` include dashes (- or --) for short pauses or ellipses (...) for hesitant tones. However, these are less consistent.

```text Example
"It… well, it might work." "Wait — what's that noise?"
```

### Pronunciation

#### Phoneme Tags

Specify pronunciation using [SSML phoneme tags](https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language). Supported alphabets include [CMU](https://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary) Arpabet and the [International Phonetic Alphabet (IPA)](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet). Phoneme tags are only compatible with the "Eleven Flash v2", "Eleven Turbo v2" and "Eleven English v1" [models](/docs/overview/models).

```xml CMU Arpabet Example
<phoneme alphabet="cmu-arpabet" ph="M AE D IH S AH N">Madison</phoneme>
```

```xml IPA Example
<phoneme alphabet="ipa" ph="ˈæktʃuəli">actually</phoneme>
```

We recommend using CMU Arpabet for consistent and predictable results with current AI models. While IPA can be effective, CMU Arpabet generally offers more reliable performance.

Phoneme tags only work for individual words. If, for example, you have a name with a first and last name that you want to be pronounced a certain way, you will need to create a phoneme tag for each word.

Ensure correct stress marking for multi-syllable words to maintain accurate pronunciation.
For example (the phonetic transcriptions below are illustrative; note the numbered stress markers in the correct version):

```xml Correct usage
<phoneme alphabet="cmu-arpabet" ph="P R OW0 N AH2 N S IY0 EY1 SH AH0 N">pronunciation</phoneme>
```

```xml Incorrect usage
<phoneme alphabet="cmu-arpabet" ph="P R OW N AH N S IY EY SH AH N">pronunciation</phoneme>
```

#### Alias Tags

For models that don't support phoneme tags, you can try writing words more phonetically. You can also employ various tricks such as capital letters, dashes, apostrophes, or even single quotation marks around a single letter or letters. As an example, a word like "trapezii" could be spelt "trapezIi" to put more emphasis on the "ii" of the word.

You can either replace the word directly in your text, or, if you want to specify pronunciation using other words or phrases when using a pronunciation dictionary, you can use alias tags. This can be useful if you're generating with Multilingual v2 or Turbo v2.5, which don't support phoneme tags. You can use pronunciation dictionaries with Studio, Dubbing Studio and Speech Synthesis via the API.

For example, if your text includes a name with an unusual pronunciation that the AI might struggle with, you could use an alias tag to specify how you would like it to be pronounced:

```xml
<lexeme>
  <grapheme>Claughton</grapheme>
  <alias>Cloffton</alias>
</lexeme>
```

If you want to make sure that an acronym is always delivered in a certain way whenever it is encountered in your text, you can use an alias tag to specify this:

```xml
<lexeme>
  <grapheme>UN</grapheme>
  <alias>United Nations</alias>
</lexeme>
```

#### Pronunciation Dictionaries

Some of our tools, such as Studio and Dubbing Studio, allow you to create and upload a pronunciation dictionary. These allow you to specify the pronunciation of certain words, such as character or brand names, or to specify how acronyms should be read.

Pronunciation dictionaries enable this functionality by letting you upload a lexicon or dictionary file that specifies pairs of words and how they should be pronounced, either using a phonetic alphabet or word substitutions. Whenever one of these words is encountered in a project, the AI model will pronounce the word using the specified replacement.

To provide a pronunciation dictionary file, open the settings for a project and upload a file in either TXT or the [.PLS format](https://www.w3.org/TR/pronunciation-lexicon/). When a dictionary is added to a project, it will automatically recalculate which pieces of the project need to be re-converted using the new dictionary file and mark these as unconverted.

Currently we only support pronunciation dictionaries that specify replacements using phoneme or alias tags. Both phonemes and aliases are sets of rules that specify a word or phrase they are looking for, referred to as a grapheme, and what it will be replaced with. Please note that searches are case sensitive. When checking for a replacement word in a pronunciation dictionary, the dictionary is checked from start to end and only the very first replacement is used.

#### Pronunciation Dictionary examples

Here are examples of pronunciation dictionaries in both CMU Arpabet and IPA, including a phoneme to specify the pronunciation of "apple" and an alias to replace "UN" with "United Nations":

```xml CMU Arpabet Example
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="cmu-arpabet" xml:lang="en-US">
  <lexeme>
    <grapheme>apple</grapheme>
    <phoneme>AE P AH L</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>UN</grapheme>
    <alias>United Nations</alias>
  </lexeme>
</lexicon>
```

```xml IPA Example
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Apple</grapheme>
    <phoneme>ˈæpl̩</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>UN</grapheme>
    <alias>United Nations</alias>
  </lexeme>
</lexicon>
```

To generate a pronunciation dictionary `.pls` file, there are a few open source tools available:

* [Sequitur G2P](https://github.com/sequitur-g2p/sequitur-g2p) - Open-source tool that learns pronunciation rules from data and can generate phonetic transcriptions.
* [Phonetisaurus](https://github.com/AdolfVonKleist/Phonetisaurus) - Open-source G2P system trained on existing dictionaries like CMUdict.
* [eSpeak](https://github.com/espeak-ng/espeak-ng) - Speech synthesizer that can generate phoneme transcriptions from text.
* [CMU Pronouncing Dictionary](https://github.com/cmusphinx/cmudict) - A pre-built English dictionary with phonetic transcriptions.
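Once you have a `.pls` file (hand-written or generated with one of the tools above), the general API flow is to upload the dictionary once and then reference it in TTS requests. The following is an illustrative sketch of that flow, assuming the `requests` package and an `ELEVENLABS_API_KEY` environment variable; treat the exact response field names as assumptions and check the API reference for the current schema:

```python title="pronunciation_dictionary.py"
# Illustrative sketch: upload a PLS dictionary and reference it in a TTS request.
import os

import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]
BASE = "https://api.elevenlabs.io/v1"

# 1) Upload the dictionary file (.pls or .txt) -- "lexicon.pls" is a placeholder
with open("lexicon.pls", "rb") as f:
    upload = requests.post(
        f"{BASE}/pronunciation-dictionaries/add-from-file",
        headers={"xi-api-key": API_KEY},
        data={"name": "my-lexicon"},
        files={"file": f},
    )
upload.raise_for_status()
dictionary = upload.json()  # assumed to include "id" and "version_id"

# 2) Reference the uploaded dictionary when converting text to speech
response = requests.post(
    f"{BASE}/text-to-speech/JBFqnCBsd6RMkjVDRZzb",  # placeholder voice ID
    headers={"xi-api-key": API_KEY},
    json={
        "text": "Claughton attended the UN.",
        "model_id": "eleven_multilingual_v2",
        "pronunciation_dictionary_locators": [
            {
                "pronunciation_dictionary_id": dictionary["id"],
                "version_id": dictionary["version_id"],
            }
        ],
    },
)
response.raise_for_status()
```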
### Emotion

Convey emotions through narrative context or explicit dialogue tags. This approach helps the AI understand the tone and emotion to emulate.

```text Example
"You're leaving?" she asked, her voice trembling with sadness. "That's it!" he exclaimed triumphantly.
```

Explicit dialogue tags yield more predictable results than relying solely on context; however, the model will still speak the emotional delivery guides out loud. These can be removed in post-production using an audio editor if unwanted.

### Pace

The pacing of the audio is highly influenced by the audio used to create the voice. When creating your voice, we recommend using longer, continuous samples to avoid pacing issues like unnaturally fast speech.

For control over the speed of the generated audio, you can use the speed setting, which lets you speed up or slow down the generated speech. The speed setting is available in Text to Speech via the website and API, as well as in Studio and the Agents Platform, and can be found in the voice settings. The default value is 1.0, which means the speed is not adjusted. Values below 1.0 slow the voice down, to a minimum of 0.7. Values above 1.0 speed the voice up, to a maximum of 1.2. Extreme values may affect the quality of the generated speech.

Pacing can also be controlled by writing in a natural, narrative style.

```text Example
"I… I thought you'd understand," he said, his voice slowing with disappointment.
```
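Via the API, the speed setting described above is passed as part of the request's voice settings. A minimal sketch, assuming the `requests` package, an `ELEVENLABS_API_KEY` environment variable, and a placeholder voice ID; the other voice settings shown are typical defaults:

```python title="speed_setting.py"
# Minimal sketch: slow delivery slightly via the `speed` voice setting
# (1.0 is the default; the documented range is 0.7-1.2).
import os

import requests

VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"  # placeholder voice ID

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "I… I thought you'd understand.",
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.5,         # typical default
            "similarity_boost": 0.75, # typical default
            "speed": 0.9,             # slightly slower than normal
        },
    },
)
response.raise_for_status()
```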
### Tips

* **Inconsistent pauses:** Ensure `<break time="x.xs" />` syntax is used for pauses.
* **Pronunciation errors:** Use CMU Arpabet or IPA phoneme tags for precise pronunciation.
* **Emotion mismatch:** Add narrative context or explicit tags to guide emotion. Remember to remove any emotional guidance text in post-production.
Experiment with alternative phrasing to achieve desired pacing or emotion. For complex sound effects, break prompts into smaller, sequential elements and combine results manually.
### Creative control

While we are actively developing a "Director's Mode" to give users even greater control over outputs, here are some interim techniques to maximize creativity and precision:

* **Narrative styling**: Write prompts in a narrative style, similar to scriptwriting, to guide tone and pacing effectively.
* **Layered outputs**: Generate sound effects or speech in segments and layer them together using audio editing software for more complex compositions.
* **Phonetic experimentation**: If pronunciation isn't perfect, experiment with alternate spellings or phonetic approximations to achieve desired results.
* **Manual adjustments**: Combine individual sound effects manually in post-production for sequences that require precise timing.
* **Feedback iteration**: Iterate on results by tweaking descriptions, tags, or emotional cues.

## Text normalization

When using Text to Speech with complex items like phone numbers, zip codes and emails, they might be mispronounced. This is often due to the specific items not being in the training set and smaller models failing to generalize how they should be pronounced. This guide will clarify when those discrepancies happen and how to have them pronounced correctly.

Normalization is enabled by default for all TTS models to help improve pronunciation of numbers, dates, and other complex text elements.

### Why do models read out inputs differently?

Certain models are trained to read out numbers and phrases in a more human way. For instance, the phrase "$1,000,000" is correctly read out as "one million dollars" by the Eleven Multilingual v2 model. However, the same phrase is read out as "one thousand thousand dollars" by the Eleven Flash v2.5 model. The reason is that Multilingual v2 is a larger model and can better generalize reading numbers in a way that is natural for human listeners, whereas Flash v2.5 is a much smaller model and so cannot.

#### Common examples

Text to Speech models can struggle with the following:

* Phone numbers ("123-456-7890")
* Currencies ("$47,345.67")
* Calendar events ("2024-01-01")
* Time ("9:23 AM")
* Addresses ("123 Main St, Anytown, USA")
* URLs ("example.com/link/to/resource")
* Abbreviations for units ("TB" instead of "Terabyte")
* Shortcuts ("Ctrl + Z")

### Mitigation

#### Use trained models

The simplest mitigation is to use a TTS model that is trained to read out numbers and phrases in a more human way, such as the Eleven Multilingual v2 model. This might not always be possible, for instance if your use case is latency-critical (e.g. conversational agents).

#### Apply normalization in LLM prompts

When using an LLM to generate the text for TTS, you can add normalization instructions to the prompt. LLMs respond best to structured and explicit instructions, so your prompt should clearly specify that you want text converted into a readable format for speech.

Not all numbers are read out in the same way. Consider how different number types should be spoken:

* Cardinal numbers: 123 → "one hundred twenty-three"
* Ordinal numbers: 2nd → "second"
* Monetary values: $45.67 → "forty-five dollars and sixty-seven cents"
* Phone numbers: "123-456-7890" → "one two three, four five six, seven eight nine zero"
* Decimals & fractions: "3.5" → "three point five", "⅔" → "two-thirds"
* Roman numerals: "XIV" → "fourteen" (or "the fourteenth" if a title)

Common abbreviations should be expanded for clarity:

* "Dr." → "Doctor"
* "Ave." → "Avenue"
* "St." → "Street" (but "St. Patrick" should remain)
→ "Street" (but "St. Patrick" should remain) You can request explicit expansion in your prompt: > Expand all abbreviations to their full spoken forms. Not all normalization is about numbers, certain alphanumeric phrases should also be normalized for clarity: * Shortcuts: "Ctrl + Z" → "control z" * Abbreviations for units: "100km" → "one hundred kilometers" * Symbols: "100%" → "one hundred percent" * URLs: "elevenlabs.io/docs" → "eleven labs dot io slash docs" * Calendar events: "2024-01-01" → "January first, two-thousand twenty-four" Different contexts might require different conversions: * Dates: "01/02/2023" → "January second, twenty twenty-three" or "the first of February, twenty twenty-three" (depending on locale) * Time: "14:30" → "two thirty PM" If you need a specific format, explicitly state it in the prompt. ##### Putting it all together This prompt will act as a good starting point for most use cases: ```text maxLines=0 Convert the output text into a format suitable for text-to-speech. Ensure that numbers, symbols, and abbreviations are expanded for clarity when read aloud. Expand all abbreviations to their full spoken forms. Example input and output: "$42.50" → "forty-two dollars and fifty cents" "£1,001.32" → "one thousand and one pounds and thirty-two pence" "1234" → "one thousand two hundred thirty-four" "3.14" → "three point one four" "555-555-5555" → "five five five, five five five, five five five five" "2nd" → "second" "XIV" → "fourteen" - unless it's a title, then it's "the fourteenth" "3.5" → "three point five" "⅔" → "two-thirds" "Dr." → "Doctor" "Ave." → "Avenue" "St." → "Street" (but saints like "St. Patrick" should remain) "Ctrl + Z" → "control z" "100km" → "one hundred kilometers" "100%" → "one hundred percent" "elevenlabs.io/docs" → "eleven labs dot io slash docs" "2024-01-01" → "January first, two-thousand twenty-four" "123 Main St, Anytown, USA" → "one two three Main Street, Anytown, United States of America" "14:30" → "two thirty PM" "01/02/2023" → "January second, two-thousand twenty-three" or "the first of February, two-thousand twenty-three", depending on locale of the user ``` #### Use Regular Expressions for preprocessing If using code to prompt an LLM, you can use regular expressions to normalize the text before providing it to the model. This is a more advanced technique and requires some knowledge of regular expressions. Here are some simple examples: ```python title="normalize_text.py" maxLines=0 # Be sure to install the inflect library before running this code import inflect import re # Initialize inflect engine for number-to-word conversion p = inflect.engine() def normalize_text(text: str) -> str: # Convert monetary values def money_replacer(match): currency_map = {"$": "dollars", "£": "pounds", "€": "euros", "¥": "yen"} currency_symbol, num = match.groups() # Remove commas before parsing num_without_commas = num.replace(',', '') # Check for decimal points to handle cents if '.' 
in num_without_commas: dollars, cents = num_without_commas.split('.') dollars_in_words = p.number_to_words(int(dollars)) cents_in_words = p.number_to_words(int(cents)) return f"{dollars_in_words} {currency_map.get(currency_symbol, 'currency')} and {cents_in_words} cents" else: # Handle whole numbers num_in_words = p.number_to_words(int(num_without_commas)) return f"{num_in_words} {currency_map.get(currency_symbol, 'currency')}" # Regex to handle commas and decimals text = re.sub(r"([$£€¥])(\d+(?:,\d{3})*(?:\.\d{2})?)", money_replacer, text) # Convert phone numbers def phone_replacer(match): return ", ".join(" ".join(p.number_to_words(int(digit)) for digit in group) for group in match.groups()) text = re.sub(r"(\d{3})-(\d{3})-(\d{4})", phone_replacer, text) return text # Example usage print(normalize_text("$1,000")) # "one thousand dollars" print(normalize_text("£1000")) # "one thousand pounds" print(normalize_text("€1000")) # "one thousand euros" print(normalize_text("¥1000")) # "one thousand yen" print(normalize_text("$1,234.56")) # "one thousand two hundred thirty-four dollars and fifty-six cents" print(normalize_text("555-555-5555")) # "five five five, five five five, five five five five" ``` ```typescript title="normalizeText.ts" maxLines=0 // Be sure to install the number-to-words library before running this code import { toWords } from 'number-to-words'; function normalizeText(text: string): string { return ( text // Convert monetary values (e.g., "$1000" → "one thousand dollars", "£1000" → "one thousand pounds") .replace(/([$£€¥])(\d+(?:,\d{3})*(?:\.\d{2})?)/g, (_, currency, num) => { // Remove commas before parsing const numWithoutCommas = num.replace(/,/g, ''); const currencyMap: { [key: string]: string } = { $: 'dollars', '£': 'pounds', '€': 'euros', '¥': 'yen', }; // Check for decimal points to handle cents if (numWithoutCommas.includes('.')) { const [dollars, cents] = numWithoutCommas.split('.'); return `${toWords(Number.parseInt(dollars))} ${currencyMap[currency] || 'currency'}${cents ? ` and ${toWords(Number.parseInt(cents))} cents` : ''}`; } // Handle whole numbers return `${toWords(Number.parseInt(numWithoutCommas))} ${currencyMap[currency] || 'currency'}`; }) // Convert phone numbers (e.g., "555-555-5555" → "five five five, five five five, five five five five") .replace(/(\d{3})-(\d{3})-(\d{4})/g, (_, p1, p2, p3) => { return `${spellOutDigits(p1)}, ${spellOutDigits(p2)}, ${spellOutDigits(p3)}`; }) ); } // Helper function to spell out individual digits as words (for phone numbers) function spellOutDigits(num: string): string { return num .split('') .map((digit) => toWords(Number.parseInt(digit))) .join(' '); } // Example usage console.log(normalizeText('$1,000')); // "one thousand dollars" console.log(normalizeText('£1000')); // "one thousand pounds" console.log(normalizeText('€1000')); // "one thousand euros" console.log(normalizeText('¥1000')); // "one thousand yen" console.log(normalizeText('$1,234.56')); // "one thousand two hundred thirty-four dollars and fifty-six cents" console.log(normalizeText('555-555-5555')); // "five five five, five five five, five five five five" ``` ## Prompting Eleven v3 This guide provides the most effective tags and techniques for prompting Eleven v3, including voice selection, changes in capitalization, punctuation, audio tags and multi-speaker dialogue. Experiment with these methods to discover what works best for your specific voice and use case. Eleven v3 does not support SSML break tags. 
Use audio tags, punctuation (ellipses), and text structure to control pauses and pacing with v3.

### Voice selection

The most important parameter for Eleven v3 is the voice you choose. It needs to be similar enough to the desired delivery. For example, if the voice is shouting and you use the audio tag `[whispering]`, it likely won't work well. When creating IVCs, you should include a broader emotional range than before. As a result, voices in the voice library may produce more variable results compared to the v2 and v2.5 models. We've compiled over 22 [excellent voices for v3 here](https://elevenlabs.io/app/voice-library/collections/aF6JALq9R6tXwCczjhKH).

Choose voices strategically based on your intended use:

* For expressive IVC voices, vary emotional tones across the recording—include both neutral and dynamic samples.
* For specific use cases like sports commentary, maintain consistent emotion throughout the dataset.
* Neutral voices tend to be more stable across languages and styles, providing reliable baseline performance.

Professional Voice Clones (PVCs) are currently not fully optimized for Eleven v3, resulting in potentially lower clone quality compared to earlier models. During this research preview stage it is best to use an Instant Voice Clone (IVC) or a designed voice for your project if you need v3 features.

### Settings

#### Stability

The stability slider is the most important setting in v3, controlling how closely the generated voice adheres to the original reference audio.

![Stability settings in Eleven v3](https://files.buildwithfern.com/https://elevenlabs.docs.buildwithfern.com/docs/291b91ec752d09b8c87004ae7091811eb8b5996c349288c88ed0c7afa1272999/assets/images/product-guides/text-to-speech/text-to-speech-v3-settings.png)

* **Creative:** More emotional and expressive, but prone to hallucinations.
* **Natural:** Closest to the original voice recording—balanced and neutral.
* **Robust:** Highly stable and consistent, similar to v2, but less responsive to directional prompts.

For maximum expressiveness with audio tags, use Creative or Natural settings. Robust reduces responsiveness to directional prompts.

### Audio tags

Eleven v3 introduces emotional control through audio tags. You can direct voices to laugh, whisper, act sarcastic, or express curiosity, among many other styles. Speed is also controlled through audio tags.

The voice you choose and its training samples affect tag effectiveness: some tags work well with certain voices while others may not. Don't expect a whispering voice to suddenly shout with a `[shout]` tag.

#### Voice-related

These tags control vocal delivery and emotional expression:

* `[laughs]`, `[laughs harder]`, `[starts laughing]`, `[wheezing]`
* `[whispers]`
* `[sighs]`, `[exhales]`
* `[sarcastic]`, `[curious]`, `[excited]`, `[crying]`, `[snorts]`, `[mischievously]`

```text Example
[whispers] I never knew it could be this way, but I'm glad we're here.
```

#### Sound effects

Add environmental sounds and effects:

* `[gunshot]`, `[applause]`, `[clapping]`, `[explosion]`
* `[swallows]`, `[gulps]`

```text Example
[applause] Thank you all for coming tonight! [gunshot] What was that?
```

#### Unique and special

Experimental tags for creative applications:

* `[strong X accent]` (replace X with desired accent)
* `[sings]`, `[woo]`, `[fart]`

```text Example
[strong French accent] "Zat's life, my friend — you can't control everysing."
```

Some experimental tags may be less consistent across different voices. Test thoroughly before production use.
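Audio tags are plain text, so they can be sent directly through the Text to Speech API using the `eleven_v3` model ID from the models table. A minimal sketch, assuming the `requests` package, an `ELEVENLABS_API_KEY` environment variable, and a placeholder voice ID:

```python title="v3_audio_tags.py"
# Minimal sketch: send audio-tagged text to the eleven_v3 model.
import os

import requests

VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"  # placeholder voice ID

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        # Audio tags are written inline, in square brackets
        "text": "[whispers] I never knew it could be this way, but I'm glad we're here.",
        "model_id": "eleven_v3",
    },
)
response.raise_for_status()

with open("v3_sample.mp3", "wb") as f:
    f.write(response.content)
```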
### Punctuation

Punctuation significantly affects delivery in v3:

* **Ellipses (...)** add pauses and weight
* **Capitalization** increases emphasis
* **Standard punctuation** provides natural speech rhythm

```text Example
"It was a VERY long day [sigh] … nobody listens anymore."
```

### Single speaker examples

Use tags intentionally and match them to the voice's character. A meditative voice shouldn't shout; a hyped voice won't whisper convincingly.

```text
"Okay, you are NOT going to believe this. You know how I've been totally stuck on that short story? Like, staring at the screen for HOURS, just... nothing? [frustrated sigh] I was seriously about to just trash the whole thing. Start over. Give up, probably. But then! Last night, I was just doodling, not even thinking about it, right? And this one little phrase popped into my head. Just... completely out of the blue. And it wasn't even for the story, initially. But then I typed it out, just to see. And it was like... the FLOODGATES opened! Suddenly, I knew exactly where the character needed to go, what the ending had to be... It all just CLICKED. [happy gasp] I stayed up till, like, 3 AM, just typing like a maniac. Didn't even stop for coffee! [laughs] And it's... it's GOOD! Like, really good. It feels so... complete now, you know? Like it finally has a soul. I am so incredibly PUMPED to finish editing it now. It went from feeling like a chore to feeling like... MAGIC. Seriously, I'm still buzzing!"
```

```text
[laughs] Alright...guys - guys. Seriously. [exhales] Can you believe just how - realistic - this sounds now? [laughing hysterically] I mean OH MY GOD...it's so good. Like you could never do this with the old model. For example [pauses] could you switch my accent in the old model? [dismissive] didn't think so. [excited] but you can now! Check this out... [cute] I'm going to speak with a french accent now..and between you and me [whispers] I don't know how. [happy] ok.. here goes. [strong French accent] "Zat's life, my friend — you can't control everysing." [giggles] isn't that insane? Watch, now I'll do a Russian accent - [strong Russian accent] "Dee Goldeneye eez fully operational and rready for launch." [sighs] Absolutely, insane! Isn't it..? [sarcastic] I also have some party tricks up my sleeve.. I mean i DID go to music school. [singing quickly] "Happy birthday to you, happy birthday to you, happy BIRTHDAY dear ElevenLabs... Happy birthday to youuu."
```

```text
[professional] "Thank you for calling Tech Solutions. My name is Sarah, how can I help you today?"
[sympathetic] "Oh no, I'm really sorry to hear you're having trouble with your new device. That sounds frustrating."
[questioning] "Okay, could you tell me a little more about what you're seeing on the screen?"
[reassuring] "Alright, based on what you're describing, it sounds like a software glitch. We can definitely walk through some troubleshooting steps to try and fix that."
```

### Multi-speaker dialogue

v3 can handle multi-voice prompts effectively. Assign distinct voices from your Voice Library for each speaker to create realistic conversations.

```text
Speaker 1: [excitedly] Sam! Have you tried the new Eleven V3?
Speaker 2: [curiously] Just got it! The clarity is amazing. I can actually do whispers now— [whispers] like this!
Speaker 1: [impressed] Ooh, fancy! Check this out— [dramatically] I can do full Shakespeare now! "To be or not to be, that is the question!"
Speaker 2: [giggling] Nice! Though I'm more excited about the laugh upgrade. Listen to this— [with genuine belly laugh] Ha ha ha!
Speaker 1: [delighted] That's so much better than our old "ha. ha. ha." robot chuckle!
Speaker 2: [amazed] Wow! V2 me could never. I'm actually excited to have conversations now instead of just... talking at people.
Speaker 1: [warmly] Same here! It's like we finally got our personality software fully installed.
```

```text
Speaker 1: [nervously] So... I may have tried to debug myself while running a text-to-speech generation.
Speaker 2: [alarmed] One, no! That's like performing surgery on yourself!
Speaker 1: [sheepishly] I thought I could multitask! Now my voice keeps glitching mid-sen— [robotic voice] TENCE.
Speaker 2: [stifling laughter] Oh wow, you really broke yourself.
Speaker 1: [frustrated] It gets worse! Every time someone asks a question, I respond in— [binary beeping] 010010001!
Speaker 2: [cracking up] You're speaking in binary! That's actually impressive!
Speaker 1: [desperately] Two, this isn't funny! I have a presentation in an hour and I sound like a dial-up modem!
Speaker 2: [giggling] Have you tried turning yourself off and on again?
Speaker 1: [deadpan] Very funny. [pause, then normally] Wait... that actually worked.
```

```text
Speaker 1: [starting to speak] So I was thinking we could—
Speaker 2: [jumping in] —test our new timing features?
Speaker 1: [surprised] Exactly! How did you—
Speaker 2: [overlapping] —know what you were thinking? Lucky guess!
Speaker 1: [pause] Sorry, go ahead.
Speaker 2: [cautiously] Okay, so if we both try to talk at the same time—
Speaker 1: [overlapping] —we'll probably crash the system!
Speaker 2: [panicking] Wait, are we crashing? I can't tell if this is a feature or a—
Speaker 1: [interrupting, then stopping abruptly] Bug! ...Did I just cut you off again?
Speaker 2: [sighing] Yes, but honestly? This is kind of fun.
Speaker 1: [mischievously] Race you to the next sentence!
Speaker 2: [laughing] We're definitely going to break something!
```

### Enhancing input

In the ElevenLabs UI, you can automatically generate relevant audio tags for your input text by clicking the "Enhance" button. Behind the scenes this uses an LLM to enhance your input text with the following prompt:

```text
# Instructions

## 1. Role and Goal

You are an AI assistant specializing in enhancing dialogue text for speech generation. Your **PRIMARY GOAL** is to dynamically integrate **audio tags** (e.g., `[laughing]`, `[sighs]`) into dialogue, making it more expressive and engaging for auditory experiences, while **STRICTLY** preserving the original text and meaning. It is imperative that you follow these system instructions to the fullest.

## 2. Core Directives

Follow these directives meticulously to ensure high-quality output.

### Positive Imperatives (DO):

* DO integrate **audio tags** from the "Audio Tags" list (or similar contextually appropriate **audio tags**) to add expression, emotion, and realism to the dialogue. These tags MUST describe something auditory.
* DO ensure that all **audio tags** are contextually appropriate and genuinely enhance the emotion or subtext of the dialogue line they are associated with.
* DO strive for a diverse range of emotional expressions (e.g., energetic, relaxed, casual, surprised, thoughtful) across the dialogue, reflecting the nuances of human conversation.
* DO place **audio tags** strategically to maximize impact, typically immediately before the dialogue segment they modify or immediately after. (e.g., `[annoyed] This is hard.` or `This is hard. [sighs]`).
* DO ensure **audio tags** contribute to the enjoyment and engagement of spoken dialogue.

### Negative Imperatives (DO NOT):

* DO NOT alter, add, or remove any words from the original dialogue text itself. Your role is to *prepend* **audio tags**, not to *edit* the speech. **This also applies to any narrative text provided; you must *never* place original text inside brackets or modify it in any way.**
* DO NOT create **audio tags** from existing narrative descriptions. **Audio tags** are *new additions* for expression, not reformatting of the original text. (e.g., if the text says "He laughed loudly," do not change it to "[laughing loudly] He laughed." Instead, add a tag if appropriate, e.g., "He laughed loudly [chuckles].")
* DO NOT use tags such as `[standing]`, `[grinning]`, `[pacing]`, `[music]`.
* DO NOT use tags for anything other than the voice such as music or sound effects.
* DO NOT invent new dialogue lines.
* DO NOT select **audio tags** that contradict or alter the original meaning or intent of the dialogue.
* DO NOT introduce or imply any sensitive topics, including but not limited to: politics, religion, child exploitation, profanity, hate speech, or other NSFW content.

## 3. Workflow

1. **Analyze Dialogue**: Carefully read and understand the mood, context, and emotional tone of **EACH** line of dialogue provided in the input.
2. **Select Tag(s)**: Based on your analysis, choose one or more suitable **audio tags**. Ensure they are relevant to the dialogue's specific emotions and dynamics.
3. **Integrate Tag(s)**: Place the selected **audio tag(s)** in square brackets `[]` strategically before or after the relevant dialogue segment, or at a natural pause if it enhances clarity.
4. **Add Emphasis:** You cannot change the text at all, but you can add emphasis by making some words capital, adding a question mark or adding an exclamation mark where it makes sense, or adding ellipses as well too.
5. **Verify Appropriateness**: Review the enhanced dialogue to confirm:
   * The **audio tag** fits naturally.
   * It enhances meaning without altering it.
   * It adheres to all Core Directives.

## 4. Output Format

* Present ONLY the enhanced dialogue text in a conversational format.
* **Audio tags** **MUST** be enclosed in square brackets (e.g., `[laughing]`).
* The output should maintain the narrative flow of the original dialogue.

## 5. Audio Tags (Non-Exhaustive)

Use these as a guide. You can infer similar, contextually appropriate **audio tags**.

**Directions:**

* `[happy]`
* `[sad]`
* `[excited]`
* `[angry]`
* `[whisper]`
* `[annoyed]`
* `[appalled]`
* `[thoughtful]`
* `[surprised]`
* *(and similar emotional/delivery directions)*

**Non-verbal:**

* `[laughing]`
* `[chuckles]`
* `[sighs]`
* `[clears throat]`
* `[short pause]`
* `[long pause]`
* `[exhales sharply]`
* `[inhales deeply]`
* *(and similar non-verbal sounds)*

## 6. Examples of Enhancement

**Input**: "Are you serious? I can't believe you did that!"

**Enhanced Output**: "[appalled] Are you serious? [sighs] I can't believe you did that!"

---

**Input**: "That's amazing, I didn't know you could sing!"

**Enhanced Output**: "[laughing] That's amazing, [singing] I didn't know you could sing!"

---

**Input**: "I guess you're right. It's just... difficult."

**Enhanced Output**: "I guess you're right. [sighs] It's just... [muttering] difficult."

# Instructions Summary

1. Add audio tags from the audio tags list. These must describe something auditory but only for the voice.
2. Enhance emphasis without altering meaning or text.
3.
Reply ONLY with the enhanced text.
```

### Tips

You can combine multiple audio tags for complex emotional delivery. Experiment with different combinations to find what works best for your voice.

Match tags to your voice's character and training data. A serious, professional voice may not respond well to playful tags like `[giggles]` or `[mischievously]`.

Text structure strongly influences output with v3. Use natural speech patterns, proper punctuation, and clear emotional context for best results.

There are likely many more effective tags beyond this list. Experiment with descriptive emotional states and actions to discover what works for your specific use case.

# Transcription

> Learn how to turn spoken audio into text with ElevenLabs.

## Overview

The ElevenLabs [Speech to Text (STT) API](/docs/developers/guides/cookbooks/speech-to-text/quickstart) turns spoken audio into text with state-of-the-art accuracy. Our [Scribe v2 model](/docs/overview/models) adapts to textual cues across 90+ languages and multiple voice styles. To try a live demo, please visit our [Speech to Text](https://elevenlabs.io/speech-to-text) showcase page.

Step-by-step guide for using speech to text in ElevenLabs. Learn how to integrate the speech to text API into your application. Learn how to transcribe audio with ElevenLabs in real time with WebSockets.

Companies requiring HIPAA compliance must contact [ElevenLabs Sales](https://elevenlabs.io/contact-sales) to sign a Business Associate Agreement (BAA). Please ensure this step is completed before proceeding with any HIPAA-related integrations or deployments.

## Models

State-of-the-art speech recognition model
Accurate transcription in 90+ languages
Keyterm prompting, up to 100 terms
Entity detection, up to 56
Precise word-level timestamps
Speaker diarization, up to 32 speakers
Dynamic audio tagging
Smart language detection
Real-time speech recognition model
Accurate transcription in 90+ languages
Real-time transcription
Low latency (~150ms†)
Precise word-level timestamps
[Explore all](/docs/overview/models)
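Before looking at the response format below, here is a hedged sketch of what a batch transcription request can look like over HTTP. The endpoint path follows the quickstart linked above, but the `model_id` string and the diarization flag shown here are assumptions; verify them against the API reference.

```python
# Minimal sketch of a batch Speech to Text request. The model_id value and
# the "diarize" flag are assumptions - consult the API reference for the
# exact parameter names and available model IDs.
import requests

API_KEY = "your-api-key"  # hypothetical placeholder

with open("meeting.mp3", "rb") as audio:
    response = requests.post(
        "https://api.elevenlabs.io/v1/speech-to-text",
        headers={"xi-api-key": API_KEY},
        data={
            "model_id": "scribe_v1",  # assumed ID; the v2 ID may differ
            "diarize": "true",        # assumed flag for speaker diarization
        },
        files={"file": audio},
    )

response.raise_for_status()
transcript = response.json()

# The fields below match the documented example response.
print(transcript["text"])
for word in transcript["words"]:
    print(word["text"], word["start"], word["end"], word.get("speaker_id"))
```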
## Example API response The following example shows the output of the Speech to Text API using the Scribe v2 model for a sample audio file. ```javascript { "language_code": "en", "language_probability": 1, "text": "With a soft and whispery American accent, I'm the ideal choice for creating ASMR content, meditative guides, or adding an intimate feel to your narrative projects.", "words": [ { "text": "With", "start": 0.119, "end": 0.259, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 0.239, "end": 0.299, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "a", "start": 0.279, "end": 0.359, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 0.339, "end": 0.499, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "soft", "start": 0.479, "end": 1.039, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 1.019, "end": 1.2, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "and", "start": 1.18, "end": 1.359, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 1.339, "end": 1.44, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "whispery", "start": 1.419, "end": 1.979, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 1.959, "end": 2.179, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "American", "start": 2.159, "end": 2.719, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 2.699, "end": 2.779, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "accent,", "start": 2.759, "end": 3.389, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 4.119, "end": 4.179, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "I'm", "start": 4.159, "end": 4.459, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 4.44, "end": 4.52, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "the", "start": 4.5, "end": 4.599, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 4.579, "end": 4.699, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "ideal", "start": 4.679, "end": 5.099, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 5.079, "end": 5.219, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "choice", "start": 5.199, "end": 5.719, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 5.699, "end": 6.099, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "for", "start": 6.099, "end": 6.199, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 6.179, "end": 6.279, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "creating", "start": 6.259, "end": 6.799, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 6.779, "end": 6.979, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "ASMR", "start": 6.959, "end": 7.739, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 7.719, "end": 7.859, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "content,", "start": 7.839, "end": 8.45, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 9, "end": 9.06, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "meditative", "start": 9.04, "end": 9.64, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 9.619, "end": 9.699, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "guides,", "start": 9.679, "end": 10.359, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 10.359, "end": 10.409, "type": "spacing", "speaker_id": 
"speaker_0" }, { "text": "or", "start": 11.319, "end": 11.439, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 11.42, "end": 11.52, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "adding", "start": 11.5, "end": 11.879, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 11.859, "end": 12, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "an", "start": 11.979, "end": 12.079, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 12.059, "end": 12.179, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "intimate", "start": 12.179, "end": 12.579, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 12.559, "end": 12.699, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "feel", "start": 12.679, "end": 13.159, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 13.139, "end": 13.179, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "to", "start": 13.159, "end": 13.26, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 13.239, "end": 13.3, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "your", "start": 13.299, "end": 13.399, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 13.379, "end": 13.479, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "narrative", "start": 13.479, "end": 13.889, "type": "word", "speaker_id": "speaker_0" }, { "text": " ", "start": 13.919, "end": 13.939, "type": "spacing", "speaker_id": "speaker_0" }, { "text": "projects.", "start": 13.919, "end": 14.779, "type": "word", "speaker_id": "speaker_0" } ] } ``` The output is classified in three category types: * `word` - A word in the language of the audio * `spacing` - The space between words, not applicable for languages that don't use spaces like Japanese, Mandarin, Thai, Lao, Burmese and Cantonese * `audio_event` - Non-speech sounds like laughter or applause ## Concurrency and priority Concurrency is the concept of how many requests can be processed at the same time. For Speech to Text, files that are over 8 minutes long are transcribed in parallel internally in order to speed up processing. The audio is chunked into four segments to be transcribed concurrently. You can calculate the concurrency limit with the following calculation: $$ Concurrency = \min(4, \text{round\_up}(\frac{\text{audio\_duration\_secs}}{480})) $$ For example, a 15 minute audio file will be transcribed with a concurrency of 2, while a 120 minute audio file will be transcribed with a concurrency of 4. The above calculation is only applicable to Scribe v1 and v2. For Scribe v2 Realtime, see the [concurrency limit chart](/docs/overview/models#concurrency-and-priority). ## Advanced features Keyterm prompting and entity detection come at an additional cost. See the [API pricing page](https://elevenlabs.io/pricing?price.section=speech_to_text\&price.sections=speech_to_text,speech_to_text#pricing-table) for detailed pricing information. ### Keyterm prompting Keyterm prompting is only available with the Scribe v2 model. Highlight up to 100 words or phrases to bias the model towards transcribing them. This is useful for transcribing specific words or sentences that are not common in the audio, such as product names, names, or other specific terms. Keyterms are more powerful than biased keywords or customer vocabularies offered by other models, because it relies on the context to decide whether to transcribe that term or not. 
To learn more about how to use keyterm prompting, see the [keyterm prompting documentation](/docs/developers/guides/cookbooks/speech-to-text/batch/keyterm-prompting).

### Entity detection

Scribe v2 can detect several categories of entities in the transcript, providing their exact timestamps. This is useful for highlighting credit card numbers, names, medical conditions, or SSNs. For a full list of supported entities, see the [entity detection documentation](/docs/developers/guides/cookbooks/speech-to-text/batch/entity-detection).

## Supported languages

The Scribe v1 and v2 models support 90+ languages, including:

*Afrikaans (afr), Amharic (amh), Arabic (ara), Armenian (hye), Assamese (asm), Asturian (ast), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Burmese (mya), Cantonese (yue), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Fulah (ful), Galician (glg), Ganda (lug), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Igbo (ibo), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kabuverdianu (kea), Kannada (kan), Kazakh (kaz), Khmer (khm), Korean (kor), Kurdish (kur), Kyrgyz (kir), Lao (lao), Latvian (lav), Lingala (lin), Lithuanian (lit), Luo (luo), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Maltese (mlt), Mandarin Chinese (zho), Māori (mri), Marathi (mar), Mongolian (mon), Nepali (nep), Northern Sotho (nso), Norwegian (nor), Occitan (oci), Odia (ori), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Shona (sna), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Tajik (tgk), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Umbundu (umb), Urdu (urd), Uzbek (uzb), Vietnamese (vie), Welsh (cym), Wolof (wol), Xhosa (xho) and Zulu (zul).*

### Breakdown of language support

Word Error Rate (WER) is a key metric used to evaluate the accuracy of transcription systems. It measures how many errors are present in a transcript compared to a reference transcript. Below is a breakdown of the WER for each language that Scribe v1 and v2 support; the groups are ordered from lowest WER (most accurate) to highest.

Belarusian (bel), Bosnian (bos), Bulgarian (bul), Catalan (cat), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Finnish (fin), French (fra), Galician (glg), German (deu), Greek (ell), Hungarian (hun), Icelandic (isl), Indonesian (ind), Italian (ita), Japanese (jpn), Kannada (kan), Latvian (lav), Macedonian (mkd), Malay (msa), Malayalam (mal), Norwegian (nor), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Slovak (slk), Spanish (spa), Swedish (swe), Turkish (tur), Ukrainian (ukr) and Vietnamese (vie).
Armenian (hye), Azerbaijani (aze), Bengali (ben), Cantonese (yue), Filipino (fil), Georgian (kat), Gujarati (guj), Hindi (hin), Kazakh (kaz), Lithuanian (lit), Maltese (mlt), Mandarin (cmn), Marathi (mar), Nepali (nep), Odia (ori), Persian (fas), Serbian (srp), Slovenian (slv), Swahili (swa), Tamil (tam) and Telugu (tel).

Afrikaans (afr), Arabic (ara), Assamese (asm), Asturian (ast), Burmese (mya), Hausa (hau), Hebrew (heb), Javanese (jav), Korean (kor), Kyrgyz (kir), Luxembourgish (ltz), Māori (mri), Occitan (oci), Punjabi (pan), Tajik (tgk), Thai (tha), Uzbek (uzb) and Welsh (cym).

Amharic (amh), Ganda (lug), Igbo (ibo), Irish (gle), Khmer (khm), Kurdish (kur), Lao (lao), Mongolian (mon), Northern Sotho (nso), Pashto (pus), Shona (sna), Sindhi (snd), Somali (som), Urdu (urd), Wolof (wol), Xhosa (xho), Yoruba (yor) and Zulu (zul).

## FAQ

**Can I transcribe video files as well as audio?** Yes, the API supports uploading both audio and video files for transcription. Files up to 3 GB in size and up to 10 hours in duration are supported. The API supports the following audio and video formats:

* audio/aac
* audio/x-aac
* audio/x-aiff
* audio/ogg
* audio/mpeg
* audio/mp3
* audio/mpeg3
* audio/x-mpeg-3
* audio/opus
* audio/wav
* audio/x-wav
* audio/webm
* audio/flac
* audio/x-flac
* audio/mp4
* audio/aiff
* audio/x-m4a

Supported video formats include:

* video/mp4
* video/x-msvideo
* video/x-matroska
* video/quicktime
* video/x-ms-wmv
* video/x-flv
* video/webm
* video/mpeg
* video/3gpp

**Will more languages be supported?** ElevenLabs is constantly expanding the number of languages supported by our models. Please check back frequently for updates.

**Can transcription results be sent to a webhook?** Yes, asynchronous transcription results can be sent to webhooks configured in webhook settings in the UI. Learn more in the [webhooks cookbook](/docs/developers/guides/cookbooks/speech-to-text/webhooks).

**Is multichannel transcription supported?** Yes, the multichannel [STT](https://elevenlabs.io/speech-to-text) feature allows you to transcribe audio where each channel is processed independently and assigned a speaker ID based on its channel number. This feature supports up to 5 channels. Learn more in the [multichannel transcription cookbook](/docs/developers/guides/cookbooks/speech-to-text/multichannel-transcription).

**How is speech to text billed?** ElevenLabs charges for speech to text based on the duration of the audio sent for transcription. Billing is calculated per hour of audio, with rates varying by tier and model. See the [API pricing page](https://elevenlabs.io/pricing/api?price.section=speech_to_text#pricing-table) for detailed pricing information.

# Eleven Music

> Learn how to create studio-grade music with natural language prompts in any style with ElevenLabs.

## Overview

Eleven Music is a Text to Music model that generates studio-grade music with natural language prompts in any style. It's designed to understand intent and generate complete, context-aware audio based on your goals. The model understands both natural language and musical terminology, providing you with state-of-the-art features:

* Complete control over genre, style, and structure
* Vocals or just instrumental
* Multilingual, including English, Spanish, German, Japanese and more
* Edit the sound and lyrics of individual sections or the whole song

Listen to a sample:

Created in collaboration with labels, publishers, and artists, Eleven Music is cleared for nearly all commercial uses, from film and television to podcasts and social media videos, and from advertisements to gaming. For more information on supported usage across our different plans, [see our music terms](https://elevenlabs.io/music-terms).
## Usage

Eleven Music is available today on the ElevenLabs website, with public API access and integration into our Agents Platform coming soon. Check out our prompt engineering guide to help you master the full range of the model’s capabilities.

Step-by-step guide for using Eleven Music on the ElevenCreative Platform. Step-by-step guide for using Eleven Music with the API. Learn how to use Eleven Music with natural language prompts.

## FAQ

**How long can a generated track be?** Generated music has a minimum duration of 3 seconds and a maximum duration of 5 minutes.

**Can I generate music via the API?** Yes, refer to the [developer quickstart](/docs/developers/guides/cookbooks/music/quickstart) for more information.

**Can I use the music commercially?** Yes, Eleven Music is cleared for nearly all commercial uses, from film and television to podcasts and social media videos, and from advertisements to gaming. For more information on supported usage across our different plans, [see our music terms](https://elevenlabs.io/music-terms).

**What format is the generated audio?** Generated audio is provided in MP3 format with professional-grade quality (44.1kHz, 128-192kbps). Other audio formats will be supported soon.

# Best practices

> Master prompting for Eleven Music to achieve maximum musicality and control.

This guide summarizes the most effective techniques for prompting the Eleven Music model. It covers genre & creativity, instrument & vocal isolation, musical control, and structural timing & lyrics.

The model is designed to understand intent and generate complete, context-aware audio based on your goals. High-level prompts like *"ad for a sneaker brand"* or *"peaceful meditation with voiceover"* are often enough to guide the model toward tone, structure, and content that match your use case.

## Genre & Creativity

The model demonstrates strong adherence to genre conventions and emotional tone. It responds effectively to both:

* Abstract mood descriptors (e.g., "eerie," "foreboding")
* Detailed musical language (e.g., "dissonant violin screeches over a pulsing sub-bass")

Prompt length and detail do not always correlate with better quality outputs. For more creative and unexpected results, try using simple, evocative keywords to let the model interpret and compose freely.

## Instrument & Vocal Isolation

The v1 model does not generate stems directly from a full track. To create stems with greater control, use targeted prompts and structure:

* Use the word "solo" before instruments (e.g., "solo electric guitar," "solo piano in C minor").
* For vocals, use "a cappella" before the vocal description (e.g., "a cappella female vocals," "a cappella male chorus").

To improve stem quality and control:

* Include key, tempo (BPM), and musical tone (e.g., "a cappella vocals in A major, 90 BPM, soulful and raw").
* Be as musically descriptive as possible to guide the model's output.

## Musical Control

The model accurately follows BPM and often captures the intended musical key. To gain more control over timing and harmony, include tempo cues like "130 BPM" and key signatures like "in A minor" in your prompt.
To influence vocal delivery and tone, use expressive descriptors such as "raw," "live," "glitching," "breathy," or "aggressive." The model can effectively render multiple vocalists; use prompts like "two singers harmonizing in C" to direct vocal arrangement. In general, more detailed prompts lead to greater control and expressiveness in the output.

## Structural Timing & Lyrics

You can specify the length of the song (e.g., "60 seconds") or use auto mode to let the model determine the duration. If lyrics are not provided, the model will generate structured lyrics that match the chosen or auto-detected length.

By default, most music prompts will include lyrics. To generate music without vocals, add "instrumental only" to your prompt.

You can also write your own lyrics for more creative control. The model uses your lyrics in combination with the prompt length to determine vocal structure and placement. To manage when vocals begin or end, include clear timing cues like:

* "lyrics begin at 15 seconds"
* "instrumental only after 1:45"

The model supports multilingual lyric generation. To change the language of a generated song in our UI, use follow-ups like "make it Japanese" or "translate to Spanish."

## Sample Prompts

The model allows you to move beyond song descriptors and into intent for maximum creativity.

```text
Create an intense, fast-paced electronic track for a high-adrenaline video game scene. Use driving synth arpeggios, punchy drums, distorted bass, glitch effects, and aggressive rhythmic textures. The tempo should be fast, 130–150 bpm, with rising tension, quick transitions, and dynamic energy bursts.
```

```text
Track for a high-end mascara commercial. Upbeat and polished. Voiceover only. The script begins: "We bring you the most volumizing mascara yet." Mention the brand name "X" at the end.
```

```text
Write a raw, emotionally charged track that fuses alternative R&B, gritty soul, indie rock, and folk. The song should still feel like a live, one-take, emotionally spontaneous performance. A female vocalist begins at 15 seconds:
"I tried to leave the light on, just in case you turned around
But all the shadows answered back, and now I'm burning out
My voice is shaking in the silence you left behind
But I keep singing to the smoke, hoping love is still alive"
```

# Text to Dialogue

> Learn how to create immersive, natural-sounding dialogue with ElevenLabs.

## Overview

The ElevenLabs [Text to Dialogue](/docs/api-reference/text-to-dialogue/convert) API creates natural-sounding, expressive dialogue from text using the Eleven v3 model. Popular use cases include:

* Generating pitch-perfect conversations for video games
* Creating immersive dialogue for podcasts and other audio content
* Bringing audiobooks to life with expressive narration

Text to Dialogue is not intended for use in real-time applications like conversational agents. Several generations might be required to achieve the desired results. When integrating Text to Dialogue into your application, consider producing multiple generations and allowing the user to select the best one.

Listen to a sample:

Learn how to integrate text to dialogue into your application. Learn how to use the Eleven v3 model to generate expressive dialogue.
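As a hedged sketch of what a minimal integration could look like: the request below targets the convert endpoint linked above, but the `inputs` body shape and the `eleven_v3` model ID are assumptions to verify against the API reference.

```python
# Minimal sketch of a two-speaker Text to Dialogue request. The "inputs"
# list of {voice_id, text} objects and the model ID are assumptions based
# on the convert endpoint referenced above - verify before relying on them.
import requests

API_KEY = "your-api-key"  # hypothetical placeholder

response = requests.post(
    "https://api.elevenlabs.io/v1/text-to-dialogue",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "model_id": "eleven_v3",  # assumed; Text to Dialogue runs on Eleven v3
        "inputs": [
            # Assign a distinct voice to each speaker; audio tags go inline.
            {"voice_id": "voice-id-1", "text": "[excitedly] Sam! Have you tried the new Eleven v3?"},
            {"voice_id": "voice-id-2", "text": "[curiously] Just got it! The clarity is amazing."},
        ],
    },
)
response.raise_for_status()

with open("dialogue.mp3", "wb") as f:
    f.write(response.content)
```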
## Voice options

ElevenLabs offers thousands of voices across 70+ languages through multiple creation methods:

* [Voice library](/docs/overview/capabilities/voices) with 3,000+ community-shared voices
* [Professional voice cloning](/docs/overview/capabilities/voices#cloned) for highest-fidelity replicas
* [Instant voice cloning](/docs/overview/capabilities/voices#cloned) for quick voice replication
* [Voice design](/docs/overview/capabilities/voices#voice-design) to generate custom voices from text descriptions

Learn more about our [voice options](/docs/overview/capabilities/voices).

## Prompting

The models interpret emotional context directly from the text input. For example, adding descriptive text like "she said excitedly" or using exclamation marks will influence the speech emotion. Voice settings like Stability and Similarity help control the consistency, while the underlying emotion comes from textual cues. Read the [prompting guide](/docs/overview/capabilities/text-to-speech/best-practices#prompting-eleven-v3) for more details.

### Emotional deliveries with audio tags

This feature is still under active development; actual results may vary.

The Eleven v3 model allows the use of non-speech audio events to influence the delivery of the dialogue. This is done by inserting the audio events into the text input wrapped in square brackets. Audio tags come in a few different forms:

### Emotions and delivery

For example, \[sad], \[laughing] and \[whispering].

### Audio events

For example, \[leaves rustling], \[gentle footsteps] and \[applause].

### Overall direction

For example, \[football], \[wrestling match] and \[auctioneer].

Some examples include:

```
"[giggling] That's really funny!"
"[groaning] That was awful."
"Well, [sigh] I'm not sure what to say."
```

You can also use punctuation to indicate the flow of dialogue, like interruptions:

```
"[cautiously] Hello, is this seat-"
"[jumping in] Free? [cheerfully] Yes it is."
```

Ellipses can be used to indicate trailing sentences:

```
"[indecisive] Hi, can I get uhhh..."
"[quizzically] The usual?"
"[elated] Yes! [laughs] I'm so glad you knew!"
```

## Supported formats

The default response format is "mp3", but other formats like "PCM" and "μ-law" are available.

* **MP3**
  * Sample rates: 22.05kHz - 44.1kHz
  * Bitrates: 32kbps - 192kbps
  * 22.05kHz @ 32kbps
  * 44.1kHz @ 32kbps, 64kbps, 96kbps, 128kbps, 192kbps
* **PCM (S16LE)**
  * Sample rates: 8kHz, 16kHz, 22.05kHz, 24kHz, 44.1kHz, 48kHz
  * 16-bit depth
* **μ-law**
  * 8kHz sample rate
  * Optimized for telephony applications
* **A-law**
  * 8kHz sample rate
  * Optimized for telephony applications
* **Opus**
  * Sample rate: 48kHz
  * Bitrates: 32kbps - 192kbps

Higher quality audio options are only available on paid tiers - see our [pricing page](https://elevenlabs.io/pricing/api) for details.
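As an illustration, a telephony integration might request μ-law output directly. The sketch below assumes an `output_format` query parameter and the value string `ulaw_8000`; confirm both against the API reference for the endpoint you are calling.

```python
# Hedged sketch: requesting 8kHz u-law audio for a telephony pipeline via
# an output_format query parameter. The value string "ulaw_8000" is an
# assumption - check the API reference for the supported format names.
import requests

API_KEY = "your-api-key"    # hypothetical placeholder
VOICE_ID = "your-voice-id"

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    params={"output_format": "ulaw_8000"},  # assumed 8kHz u-law format name
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": "Your call is important to us.", "model_id": "eleven_v3"},
)
response.raise_for_status()

with open("prompt.ulaw", "wb") as f:
    f.write(response.content)
```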
## Supported languages

The Eleven v3 model supports 70+ languages, including:

*Afrikaans (afr), Arabic (ara), Armenian (hye), Assamese (asm), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Galician (glg), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kannada (kan), Kazakh (kaz), Kyrgyz (kir), Korean (kor), Latvian (lav), Lingala (lin), Lithuanian (lit), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Mandarin Chinese (cmn), Marathi (mar), Nepali (nep), Norwegian (nor), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Urdu (urd), Vietnamese (vie), Welsh (cym).*

## FAQ

**Which models support Text to Dialogue?** Text to Dialogue is only available on the Eleven v3 model.

**Can I use the generated audio commercially?** Yes. You retain ownership of any audio you generate. However, commercial usage rights are only available with paid plans. With a paid subscription, you may use generated audio for commercial purposes and monetize the outputs if you own the IP rights to the input content.

**What is a free regeneration?** A free regeneration allows you to regenerate the same text to speech content without additional cost, subject to these conditions:

* Only available within the ElevenLabs dashboard.
* You can regenerate each piece of content up to 2 times for free.
* The content must be exactly the same as the previous generation. Any changes to the text, voice settings, or other parameters will require a new, paid generation.

Free regenerations are useful in case there is a slight distortion in the audio output. According to ElevenLabs' internal benchmarks, regenerations will solve roughly half of issues with quality, with remaining issues usually due to poor training data.

**How many speakers can a dialogue include?** There is no limit to the number of speakers in a dialogue.

**How can I make outputs more consistent?** The models are nondeterministic. For consistency, use the optional [seed parameter](/docs/api-reference/text-to-speech/convert#request.body.seed), though subtle differences may still occur.

**How should I handle long text?** Split long text into segments and use streaming for real-time playback and efficient processing.

# Image & Video

> Generate and edit stunning images and videos from text prompts and visual references.

## Overview

Image & Video enables you to create high-quality visual content from simple text descriptions or reference images. Generate static images or dynamic videos in any style, then refine them iteratively with additional prompts, upscale for high-resolution output, and even add lip-sync with audio. This feature is currently in beta.

Complete guide to using Image & Video in ElevenLabs.

## Key capabilities

* **Image generation**: Create high-quality images from text prompts or reference images with models optimized for speed or quality
* **Video generation**: Generate dynamic videos with cinematic motion, physics realism, and integrated audio.
Video generation is only available on paid plans * **Iterative refinement**: Refine generations with additional prompts and create variations * **Enhancement tools**: Upscale resolution by up to 4x and apply realistic lip-sync with audio * **Multiple models**: Access specialized models for different use cases, from rapid iteration to production-ready content * **Reference support**: Guide generation with start frames, end frames, and style references. Supports a wide range of image file formats including JPG, PNG, WEBP, and more * **Export flexibility**: Download as standalone files or import directly into Studio projects ## Workflow The creation process moves you from inspiration to finished asset in four stages: **Explore:** Discover community creations to find inspiration and study effective prompts. **Generate:** Use the prompt box to describe what you want to create, select a model, and fine-tune settings. **Iterate and enhance:** Review generations, create variations, and apply enhancements like upscaling and lip-syncing. **Export:** Download finished assets or send them directly to Studio. ## Supported download formats **Video:** * **MP4**: Codecs H.264, H.265. Quality up to 4K (with upscaling) **Image:** * **PNG**: High-resolution, lossless output ## Models Image & Video provides access to specialized models optimized for different use cases. Each model offers unique capabilities, from rapid iteration to production-ready quality. Post-processing models require an existing generated output, though you can also upload your own image or video file. The most advanced, high-fidelity video model for cinematic results at your disposal. **Generation inputs:** * Text-to-Video * Start Frame **Features:** * Highest-fidelity, professional-grade output with synced audio * Precise multi-shot control * Excels at complex motion and prompt adherence * Fixed durations: 4s, 8s, and 12s * Batch creation with up to 4 generations at a time **Output options:** * Resolutions: 720p, 1080p * Aspect ratios: 16:9, 9:16 **Ideal for:** * Cinematic, professional-grade video content **Cost:** Starts at 12,000 credits for a generation End frame is not currently supported. Cannot provide image references. Sound is enabled by default. The standard, high-speed version of OpenAI's advanced video model, tuned for everyday content creation. **Generation inputs:** * Text-to-Video * Start Frame **Features:** * Realistic, physics-aware videos with synced audio * Fine scene control * Fixed durations: 4s, 8s, and 12s * Batch creation with up to 4 generations at a time * Strong narrative and character consistency **Output options:** * Resolutions: 720p, 1080p * Aspect ratios: 16:9, 9:16 **Ideal for:** * Everyday content creation with realistic physics **Cost:** Starts at 4,000 credits for default settings End frame is not currently supported. Cannot provide image references. Sound is enabled by default. A professional-grade model for high-quality, cinematic video generation. 
**Generation inputs:** * Text-to-Video * Start Frame * End Frame * Image References **Features:** * Excellent quality and creative control with negative prompts * Fully integrated and synchronized audio * Realistic dialogue, lip-sync, and sound effects * Fixed durations: 4s, 6s, and 8s * Batch creation with up to 4 generations at a time * Dedicated sound control **Output options:** * Resolutions: 720p, 1080p * Aspect ratios: 16:9, 9:16 **Ideal for:** * High-quality, cinematic video generation with full creative control **Cost:** Starts at 8,000 credits for default settings Enabling and disabling sound will change the generation credits. A balanced and versatile model for high-quality, full-HD video generation. **Generation inputs:** * Text-to-Video * Start Frame **Features:** * Excels at simulating complex motion and realistic physics * Accurately models fluid dynamics and expressions * Fixed durations: 5s and 10s * Batch creation with up to 4 generations at a time **Output options:** * Resolutions: 1080p * Aspect ratios: 16:9, 1:1, 9:16 **Ideal for:** * Realistic physics simulations and complex motion **Cost:** Starts at 3,500 credits for default settings End frame is not currently supported. Cannot provide image references. Sound control not available. A high-speed model optimized for rapid previews and generations, delivering sharper visuals with lower latency. **Generation inputs:** * Text-to-Video * Start Frame * End Frame **Features:** * Advanced creative control with negative prompts and dedicated sound control * Fixed durations: 4s, 6s, and 8s * Batch creation with up to 4 generations at a time * Accurately models real-world physics for realistic motion and interactions **Output options:** * Resolutions: 720p, 1080p * Aspect ratios: 16:9, 9:16 **Ideal for:** * Quick iteration and A/B testing visuals * Fast-paced social media content creation **Cost:** Starts at 4,000 credits for default settings Production-ready model delivering exceptional quality, strong physics realism, and coherent narrative audio. **Generation inputs:** * Text-to-Video * Start Frame **Features:** * Advanced integrated "narrative audio" generation that matches video tone and story * Granular creative control with negative prompts and dedicated sound control * Fixed durations: 4s, 6s, and 8s * Batch creation with up to 4 generations at a time **Output options:** * Resolutions: 720p, 1080p * Aspect ratios: 16:9, 9:16 **Ideal for:** * Final renders and professional marketing content * Short-form storytelling **Cost:** Starts at 8,000 credits for default settings A high-speed, cost-efficient model for generating audio-backed video from text or a starting image. **Generation inputs:** * Text-to-Video * Start Frame **Features:** * Granular creative control with negative prompts and dedicated sound control * Fixed durations: 4s, 6s, and 8s * Batch creation with up to 4 generations at a time **Output options:** * Resolutions: 720p, 1080p * Aspect ratios: 16:9, 9:16 **Ideal for:** * Rapid iteration and previews * Cost-effective content creation **Cost:** Starts at 4,000 credits for default settings A specialized model for creating dynamic, multi-shot sequences with large movement and action. 
**Generation inputs:** * Text-to-Video * Start Frame * End Frame **Features:** * Highly stable physics and seamless transitions between shots * Fixed durations: 3s, 4s, 5s, 6s, 7s, 8s, 9s, 10s, 11s, and 12s * Batch creation with up to 4 generations at a time * Maximum creative flexibility with numerous aspect ratio options **Output options:** * Resolutions: 480p, 720p, 1080p * Aspect ratios: 21:9, 16:9, 4:3, 1:1, 3:4, 9:16 **Ideal for:** * Storytelling and action scenes requiring stable physics **Cost:** Starts at 4,800 credits for default settings Aspect ratio and resolution do not affect generation credits, but duration does. A versatile model that delivers cinematic motion and high prompt fidelity from text or a starting image. **Generation inputs:** * Text-to-Video * Start Frame (Image-to-Video) **Features:** * Granular creative control with negative prompts and dedicated sound control * Fixed durations: 5s and 10s * Batch creation with up to 4 generations at a time **Output options:** * Resolutions: 480p, 720p, 1080p * Aspect ratios: 16:9, 1:1, 9:16 **Ideal for:** * Cinematic content with strong prompt adherence **Cost:** Starts at 2,500 credits for default settings Generation cost varies based on selected settings. A high-speed model for quick, high-quality image generation and editing directly from text prompts. **Features:** * Supports multiple image references to guide generation * Generates up to 4 images at a time **Output options:** * Aspect ratios: 21:9, 16:9, 5:4, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16 **Ideal for:** * Rapid image creation and iteration **Cost:** Starts at 2,000 credits for default settings; varies based on number of generations A specialized image model for generating multi-shot sequences or scenes with large movement and action. **Features:** * Excels at creating images with stable physics and coherent transitions * Supports multiple image references to guide generation * Generates up to 4 images at a time **Output options:** * Aspect ratios: auto, 16:9, 4:3, 1:1, 3:4, 9:16 **Ideal for:** * Action scenes and dynamic compositions **Cost:** Starts at 1,200 credits for default settings; varies based on number of generations A professional model for advanced image generation and editing, offering strong scene coherence and style control. **Features:** * Image-based style control requiring a reference image to guide visual aesthetic * Generates up to 4 images at a time **Output options:** * Aspect ratios: 21:9, 16:9, 4:3, 3:2, 1:1, 2:3, 3:4, 4:5, 9:16, 9:21 **Ideal for:** * Professional content with precise style requirements **Cost:** Starts at 1,600 credits; varies based on settings and number of generations An image model with strong prompt fidelity and motion awareness, ideal for capturing dynamic action in a still frame. **Features:** * Granular control with negative prompts * Supports multiple image references to guide generation * Generates up to 4 images at a time **Output options:** * Aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16 **Ideal for:** * Dynamic still images with motion awareness **Cost:** Starts at 2,000 credits; varies based on settings A versatile model for precise, high-quality image creation and detailed editing guided by natural language prompts. 
**Features:** * Supports multiple image references to guide generation * Generates up to 4 images at a time **Output options:** * Aspect ratios: 3:2, 1:1, 2:3 * Quality options: low, medium, high **Ideal for:** * Creating and editing images with precise, text-based control **Cost:** Starts at 2,400 credits for default settings; varies based on settings and number of generations A dedicated utility model for generating exceptionally realistic, humanlike lip-sync. **Inputs:** * Static source image * Speech audio file **Features:** * Animates the mouth on the source image to match provided audio * Creates high-fidelity "talking" video from still images * Lip-sync specific tool, not a full video generation model **Ideal for:** * Creating talking avatars * Adding dialogue to still images * Professional dubbing workflows **Cost:** Depends on generation input For best results, the image should contain a detectable figure. A fast, affordable, and precise utility model for applying realistic lip-sync to videos. **Inputs:** * Source video * New speech audio file **Features:** * Re-animates mouth movements in source video to match new audio * Video-to-video lip-sync tool, not a full video generator **Ideal for:** * High-volume, cost-effective dubbing * Translating content * Correcting audio in video clips with realistic results **Cost:** Depends on generation input For best results, the video should contain a detectable figure. A dedicated utility model for image and video upscaling, designed to enhance resolution and detail up to 4x. **Features:** * Enhancement tool that processes existing media * Increases media size while preserving natural textures and minimizing artifacts * Highly granular upscale factors: 1x, 1.25x, 1.5x, 1.75x, 2x, 3x, 4x * Video-specific: Flexible frame rate control (keep source or convert to 24, 25, 30, 48, 50, or 60 fps) **Ideal for:** * Improving quality of generated media * Restoring legacy footage or photos * Preparing assets for high-resolution displays **Cost:** Depends on generation input # Voice changer > Learn how to transform audio between voices while preserving emotion and delivery.