Text to Dialogue
Learn how to create immersive, natural-sounding dialogue with ElevenLabs.
The ElevenLabs Text to Dialogue API creates natural-sounding, expressive dialogue from text using the Eleven v3 model. Popular use cases include:
Text to Dialogue is not intended for real-time applications such as conversational agents. Several generations might be required to achieve the desired result. When integrating Text to Dialogue into your application, consider producing several generations of the same dialogue and letting the user select the best one.
Learn how to integrate text to dialogue into your application.
Learn how to use the Eleven v3 model to generate expressive dialogue.
Full API reference for the Text to Dialogue endpoint.
ElevenLabs offers thousands of voices across 70+ languages through multiple creation methods:
Learn more about our voice options.
The models interpret emotional context directly from the text input. For example, adding descriptive text like “she said excitedly” or using exclamation marks will influence the emotion of the speech. Voice settings like Stability and Similarity help control consistency, while the underlying emotion comes from textual cues.
Read the prompting guide for more details.
The Eleven v3 model allows the use of non-speech audio events to influence the delivery of the dialogue. This is done by inserting the audio events into the text input wrapped in square brackets.
In Text to Dialogue, each dialogue turn has its own text and voice. Add audio tags inside the text for the turn they should affect. The voice_id still selects the speaker voice for that turn, while the tags guide delivery.
For example, a speaker can use one voice while the text starts with [giggling], and the next speaker can use a different voice while the text starts with [whispering]. For an API example that combines tags with voice_id, see the Text to Dialogue quickstart.
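The per-turn structure described above can be sketched as a request body. This is a minimal illustration, not the official client: the `inputs`, `text`, and `voice_id` field names come from this page, while the `model_id` field, its `"eleven_v3"` value, the placeholder voice IDs, and the helper names `dialogue_turn` and `build_dialogue_payload` are assumptions made for the example. The sketch only builds the payload and does not call the API.

```python
import json
from typing import Optional

# Assumption: the request selects the model through a "model_id" field.
MODEL_ID = "eleven_v3"


def dialogue_turn(voice_id: str, text: str, tag: Optional[str] = None) -> dict:
    """Build one dialogue turn (hypothetical helper).

    If a tag is given, it is wrapped in square brackets and placed at the
    start of the text, so it guides delivery for this turn only.
    """
    if tag:
        text = f"[{tag}] {text}"
    return {"text": text, "voice_id": voice_id}


def build_dialogue_payload(turns: list) -> dict:
    """Combine the turns into a request body (hypothetical helper)."""
    return {"model_id": MODEL_ID, "inputs": turns}


# Placeholder voice IDs; replace them with real IDs from your voice library.
payload = build_dialogue_payload([
    dialogue_turn("VOICE_ID_A", "Did you really think that would work?", tag="giggling"),
    dialogue_turn("VOICE_ID_B", "Keep your voice down, they might hear us.", tag="whispering"),
])

print(json.dumps(payload, indent=2))
```

Keeping the tag inside `text` rather than in a separate field mirrors the point above: the tag is part of the prompt for that turn, and `voice_id` alone decides who speaks.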
Audio tags are natural-language instructions, not an enum parameter. Wrap the instruction in square brackets and place it in the text where the delivery should change. The examples below are not exhaustive; use the Eleven v3 prompting guide for more guidance on effective tags.
Audio tags come in a few different forms:
For example, [sad], [laughing] and [whispering].
For example, [leaves rustling], [gentle footsteps] and [applause].
For example, [football], [wrestling match] and [auctioneer].
Some examples include:
You can also use punctuation to indicate the flow of dialogue, like interruptions:
Ellipses can be used to indicate trailing sentences:
The default response format is mp3, but other formats like pcm and ulaw are available.
Higher-quality audio options are available only on paid tiers; see our pricing page for details.
The Eleven v3 model supports 70+ languages, including:
Afrikaans (afr), Arabic (ara), Armenian (hye), Assamese (asm), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Galician (glg), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kannada (kan), Kazakh (kaz), Kirghiz (kir), Korean (kor), Latvian (lav), Lingala (lin), Lithuanian (lit), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Mandarin Chinese (cmn), Marathi (mar), Nepali (nep), Norwegian (nor), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Urdu (urd), Vietnamese (vie), Welsh (cym).
Text to Dialogue is only available on the Eleven v3 model.
Yes. You retain ownership of any audio you generate. However, commercial usage rights are only available with paid plans. With a paid subscription, you may use generated audio for commercial purposes and monetize the outputs if you own the IP rights to the input content.
A free regeneration allows you to regenerate the same text to speech content without additional cost, subject to these conditions:
Free regenerations are useful when there is a slight distortion in the audio output. According to ElevenLabs’ internal benchmarks, regenerations solve roughly half of quality issues, with the remaining issues usually due to poor training data.
There is no limit to the number of speakers in a dialogue.
The models are nondeterministic. For consistency, use the optional seed parameter, though subtle differences may still occur.
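As a rough sketch of how a fixed seed might be attached to a request, the helper below copies a payload and adds a `seed` value. The `seed` parameter is named on this page; placing it at the top level of the request body, the `with_seed` helper name, and the sample payload are assumptions for illustration. Reusing the same seed with the same inputs should give more consistent audio, although subtle differences may still occur.

```python
import copy


def with_seed(payload: dict, seed: int) -> dict:
    """Return a copy of the payload with a fixed seed (hypothetical helper).

    Assumption: "seed" is sent as a top-level field of the request body.
    The original payload is left unchanged.
    """
    seeded = copy.deepcopy(payload)
    seeded["seed"] = seed
    return seeded


# Illustrative payload with a placeholder voice ID.
base_payload = {
    "inputs": [{"text": "[whispering] Are you still awake?", "voice_id": "VOICE_ID_A"}],
}

# Two requests built from the same inputs and the same seed are identical,
# which is the precondition for getting more consistent output.
first = with_seed(base_payload, 1234)
second = with_seed(base_payload, 1234)
```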
Keep the total length of all inputs[].text values at or below 2,000 characters per request for reliable generation. Split longer text into chunks and concatenate the resulting audio in your application.
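One way to apply this limit is to group dialogue turns into batches before sending them. The sketch below is an assumption-laden illustration rather than an official utility: `chunk_turns` and `MAX_CHARS` are hypothetical names, and the code only counts characters in each turn's `text`; it does not call the API or concatenate audio.

```python
MAX_CHARS = 2000  # total characters across all inputs[].text values per request


def chunk_turns(turns: list, max_chars: int = MAX_CHARS) -> list:
    """Group turns into batches whose combined text length stays within max_chars.

    Turn order is preserved so the resulting audio can be concatenated in order.
    A single turn longer than max_chars is rejected, because splitting inside
    a turn needs a sentence-aware strategy that this sketch does not attempt.
    """
    chunks = []
    current = []
    used = 0
    for turn in turns:
        length = len(turn["text"])
        if length > max_chars:
            raise ValueError(f"turn of {length} characters exceeds {max_chars}")
        # Start a new batch when adding this turn would exceed the limit.
        if current and used + length > max_chars:
            chunks.append(current)
            current, used = [], 0
        current.append(turn)
        used += length
    if current:
        chunks.append(current)
    return chunks


# Five 900-character turns with placeholder voice IDs.
demo_turns = [{"text": "a" * 900, "voice_id": f"VOICE_{i}"} for i in range(5)]
chunks = chunk_turns(demo_turns)
print([len(chunk) for chunk in chunks])
```

Each batch would then become the `inputs` array of its own request, and the application joins the returned audio in the same order.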