Text to Dialogue
Learn how to create immersive, natural-sounding dialogue with ElevenLabs.
Eleven v3 API access is currently not publicly available, but will be soon. To request access, please contact our sales team.
Overview
The ElevenLabs Text to Dialogue API creates natural-sounding, expressive dialogue from text using the Eleven v3 model. Popular use cases include:
- Generating pitch-perfect conversations for video games
- Creating immersive dialogue for podcasts and other audio content
- Bringing audiobooks to life with expressive narration
Text to Dialogue is not intended for use in real-time applications like Conversational AI. Several generations might be required to achieve the desired results. When integrating Text to Dialogue into your application, consider creating several generations and letting the user select the best one.
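Here is a minimal sketch of a request in Python against the raw HTTP endpoint. The endpoint path, payload shape, and voice IDs shown are illustrative assumptions; consult the API reference for the authoritative schema.

```python
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]

# Endpoint path and payload shape are assumptions for illustration;
# check the API reference for the exact schema.
URL = "https://api.elevenlabs.io/v1/text-to-dialogue"

payload = {
    "model_id": "eleven_v3",
    "inputs": [
        # Each entry pairs a line of dialogue with the voice that speaks it.
        # The voice IDs below are placeholders.
        {"voice_id": "VOICE_ID_1", "text": "[excited] We finally shipped it!"},
        {"voice_id": "VOICE_ID_2", "text": "[laughing] I can't believe it's live."},
    ],
}

response = requests.post(URL, json=payload, headers={"xi-api-key": API_KEY})
response.raise_for_status()

# The response body is the generated audio (MP3 by default).
with open("dialogue.mp3", "wb") as f:
    f.write(response.content)
```

Because results vary between generations, you could run this request a few times and let the user pick the best candidate.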
Voice options
ElevenLabs offers thousands of voices across 70+ languages through multiple creation methods:
- Voice library with 3,000+ community-shared voices
- Professional voice cloning for highest-fidelity replicas
- Instant voice cloning for quick voice replication
- Voice design to generate custom voices from text descriptions
Learn more about our voice options.
Prompting
The models interpret emotional context directly from the text input. For example, adding descriptive text like “she said excitedly” or using exclamation marks will influence the speech emotion. Voice settings like Stability and Similarity help control the consistency, while the underlying emotion comes from textual cues.
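For instance, an input like the following (an illustrative snippet) should produce a noticeably more excited delivery than the same sentence written flatly:

```
"We did it!" she said excitedly. "I honestly didn't think the launch would go this smoothly!"
```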
Read the prompting guide for more details.
Emotional deliveries with audio tags
The Eleven v3 model allows the use of audio tags to influence the delivery of the dialogue. This is done by inserting the tags into the text input wrapped in square brackets.
Audio tags come in a few different forms:
- Emotions and delivery, for example [sad], [laughing] and [whispering]
- Audio events, for example [leaves rustling], [gentle footsteps] and [applause]
- Overall direction, for example [football], [wrestling match] and [auctioneer]
Some examples include:
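For instance, a two-speaker exchange using tags might look like this (an illustrative snippet, not an official sample):

```
Speaker 1: [excited] Did you hear the news? [laughing] I still can't believe it!
Speaker 2: [whispering] Keep your voice down... [sighs] not everyone knows yet.
```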
You can also use punctuation to indicate the flow of dialogue, such as interruptions:
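For example (illustrative):

```
Speaker 1: [frustrated] But if we wait until Friday, the whole plan-
Speaker 2: The plan already changed. Keep up!
```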
Ellipses can be used to indicate trailing sentences:
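For example (illustrative):

```
Speaker 1: [hesitant] I was going to tell you, but then... well, you know how it went...
```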
Supported formats
The default response format is MP3, but other formats such as PCM and μ-law are available.
- MP3
  - Sample rates: 22.05kHz or 44.1kHz
  - Bitrates: 32kbps to 192kbps
  - Available combinations: 22.05kHz @ 32kbps; 44.1kHz @ 32kbps, 64kbps, 96kbps, 128kbps or 192kbps
- PCM (S16LE)
  - Sample rates: 8kHz, 16kHz, 22.05kHz, 24kHz, 44.1kHz and 48kHz
  - 16-bit depth
- μ-law
  - 8kHz sample rate
  - Optimized for telephony applications
- A-law
  - 8kHz sample rate
  - Optimized for telephony applications
- Opus
  - Sample rate: 48kHz
  - Bitrates: 32kbps to 192kbps
Higher quality audio options are only available on paid tiers - see our pricing page for details.
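When selecting a format via the API, it is typically passed as an output format parameter on the request. The sketch below assumes an `output_format` query parameter and identifiers following a codec/sample-rate/bitrate naming pattern; confirm the exact names against the API reference.

```python
import os
import requests

API_KEY = os.environ["ELEVENLABS_API_KEY"]
URL = "https://api.elevenlabs.io/v1/text-to-dialogue"

# "pcm_44100" requests 16-bit PCM at 44.1kHz; other plausible values include
# "mp3_44100_192" or "ulaw_8000" (assumed identifiers, verify before use).
params = {"output_format": "pcm_44100"}

payload = {
    "model_id": "eleven_v3",
    "inputs": [{"voice_id": "VOICE_ID_1", "text": "[calm] Testing output formats."}],
}

response = requests.post(URL, params=params, json=payload, headers={"xi-api-key": API_KEY})
response.raise_for_status()

with open("dialogue.pcm", "wb") as f:
    f.write(response.content)
```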
Supported languages
The Eleven v3 model supports 70+ languages, including:
Afrikaans (afr), Arabic (ara), Armenian (hye), Assamese (asm), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Galician (glg), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kannada (kan), Kazakh (kaz), Kirghiz (kir), Korean (kor), Latvian (lav), Lingala (lin), Lithuanian (lit), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Mandarin Chinese (cmn), Marathi (mar), Nepali (nep), Norwegian (nor), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Urdu (urd), Vietnamese (vie), Welsh (cym).
FAQ
Which models can I use?
Text to Dialogue is only available on the Eleven v3 model.
Do I own the audio output?
Yes. You retain ownership of any audio you generate. However, commercial usage rights are only available with paid plans. With a paid subscription, you may use generated audio for commercial purposes and monetize the outputs if you own the IP rights to the input content.
What qualifies as a free regeneration?
A free regeneration allows you to regenerate the same text to speech content without additional cost, subject to these conditions:
- Only available within the ElevenLabs dashboard.
- You can regenerate each piece of content up to 2 times for free.
- The content must be exactly the same as the previous generation. Any changes to the text, voice settings, or other parameters will require a new, paid generation.
Free regenerations are useful in case there is a slight distortion in the audio output. According to ElevenLabs’ internal benchmarks, regenerations resolve roughly half of quality issues; the remaining issues are usually due to poor training data.
How many speakers can my dialogue have?
There is no limit to the number of speakers in a dialogue.
Why is my output sometimes inconsistent?
The models are nondeterministic. For consistency, use the optional seed parameter, though subtle differences may still occur.
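As a sketch, assuming `seed` is accepted as a top-level field in the request body:

```python
payload = {
    "model_id": "eleven_v3",
    "seed": 4242,  # any fixed integer; reusing it with identical inputs improves repeatability
    "inputs": [
        {"voice_id": "VOICE_ID_1", "text": "[calm] Same seed, similar output."},
    ],
}
# Send this payload exactly like any other Text to Dialogue request.
```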
What's the best practice for large text conversions?
Split long text into segments and use streaming for real-time playback and efficient processing.
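A simple way to do this is to break the dialogue into fixed-size batches of lines, request each batch separately, and concatenate the audio in order. The helper below is a hypothetical illustration, not part of the SDK:

```python
def split_dialogue(lines, max_lines=20):
    """Yield successive segments of a long dialogue script.

    Each entry in `lines` is a dict like {"voice_id": ..., "text": ...}.
    The 20-line batch size is an arbitrary illustrative choice, not an API limit.
    """
    for start in range(0, len(lines), max_lines):
        yield lines[start:start + max_lines]

# Each segment can then be sent as its own request (or streamed for playback)
# and the resulting audio clips concatenated in order.
```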