Text to Dialogue

Learn how to create immersive, natural-sounding dialogue with ElevenLabs.

Eleven v3 API access is currently not publicly available, but will be soon. To request access, please contact our sales team.

Overview

The ElevenLabs Text to Dialogue API creates natural-sounding, expressive dialogue from text using the Eleven v3 model. Popular use cases include:

  • Generating pitch-perfect conversations for video games
  • Creating immersive dialogue for podcasts and other audio content
  • Bringing audiobooks to life with expressive narration

Text to Dialogue is not intended for real-time applications like Conversational AI. Several generations might be required to achieve the desired results. When integrating Text to Dialogue into your application, consider producing several generations and letting the user select the best one.
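The "generate several, let the user pick" pattern can be sketched as follows. This is a minimal illustration, not an official client: `collect_takes` and the `generate` callable are hypothetical names, and `generate` stands in for whatever function your application uses to request one dialogue generation and return its audio bytes.

```python
from typing import Callable, List


def collect_takes(generate: Callable[[], bytes], count: int = 3) -> List[bytes]:
    """Call `generate` `count` times and return every non-empty result.

    `generate` is a hypothetical callable supplied by the application; it is
    expected to perform one generation and return the audio as bytes.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    takes: List[bytes] = []
    for _ in range(count):
        audio = generate()
        if audio:  # ignore empty responses so the user only sees playable takes
            takes.append(audio)
    return takes
```

The returned list can then be presented to the user (for example as numbered previews) so they can choose the take they prefer.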

Voice options

ElevenLabs offers thousands of voices across 70+ languages through multiple creation methods.

Learn more about our voice options.

Prompting

The models interpret emotional context directly from the text input. For example, adding descriptive text like “she said excitedly” or using exclamation marks will influence the speech emotion. Voice settings like Stability and Similarity help control the consistency, while the underlying emotion comes from textual cues.

Read the prompting guide for more details.

Emotional deliveries with audio tags

This feature is still under active development; actual results may vary.

The Eleven v3 model allows the use of non-speech audio events to influence the delivery of the dialogue. This is done by inserting the audio events into the text input wrapped in square brackets.

Audio tags come in a few different forms:

Emotions and delivery

For example, [sad], [laughing] and [whispering].

Audio events

For example, [leaves rustling], [gentle footsteps] and [applause].

Overall direction

For example, [football], [wrestling match] and [auctioneer].

Some examples include:

"[giggling] That's really funny!"
"[groaning] That was awful."
"Well, [sigh] I'm not sure what to say."
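When preparing scripts, it can help to see which tags a line carries and what the spoken text is without them (for example, to produce subtitles). The sketch below is one way to do that locally with a regular expression; `split_tags` is a hypothetical helper, and it simply treats anything inside square brackets as a tag.

```python
import re
from typing import List, Tuple

# Anything between "[" and "]" (without nested brackets) is treated as a tag.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(line: str) -> Tuple[List[str], str]:
    """Return (tags, spoken_text) for one line of dialogue."""
    tags = [tag.strip() for tag in TAG_PATTERN.findall(line)]
    spoken = TAG_PATTERN.sub("", line)
    # Removing a mid-sentence tag leaves a double space; collapse it.
    spoken = re.sub(r"\s{2,}", " ", spoken).strip()
    return tags, spoken


print(split_tags("Well, [sigh] I'm not sure what to say."))
```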

You can also use punctuation to indicate the flow of dialogue, like interruptions:

"[cautiously] Hello, is this seat-"
"[jumping in] Free? [cheerfully] Yes it is."

Ellipses can be used to indicate trailing sentences:

"[indecisive] Hi, can I get uhhh..."
"[quizzically] The usual?"
"[elated] Yes! [laughs] I'm so glad you knew!"
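A multi-speaker exchange like the one above could be assembled into a request body along these lines. This is a sketch under stated assumptions: the field names `inputs`, `text`, `voice_id`, and `model_id`, the model identifier `eleven_v3`, and the voice IDs are not given in this section and should be checked against the API reference. `build_dialogue_payload` is a hypothetical helper.

```python
import json
from typing import Dict, List, Tuple


def build_dialogue_payload(
    turns: List[Tuple[str, str]],
    voices: Dict[str, str],
    model_id: str = "eleven_v3",  # assumed identifier for the Eleven v3 model
) -> dict:
    """Map (speaker, text) turns to a request body, one input per turn.

    The key names used here are assumptions, not confirmed by this page.
    """
    inputs = []
    for speaker, text in turns:
        if speaker not in voices:
            raise KeyError(f"no voice configured for speaker {speaker!r}")
        inputs.append({"text": text, "voice_id": voices[speaker]})
    return {"model_id": model_id, "inputs": inputs}


turns = [
    ("customer", "[indecisive] Hi, can I get uhhh..."),
    ("barista", "[quizzically] The usual?"),
    ("customer", "[elated] Yes! [laughs] I'm so glad you knew!"),
]
# Placeholder voice IDs; replace with real ones from your account.
voices = {"customer": "VOICE_ID_A", "barista": "VOICE_ID_B"}
print(json.dumps(build_dialogue_payload(turns, voices), indent=2))
```

Keeping the audio tags inside each turn's text means the delivery cues travel with the line they apply to.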

Supported formats

The default response format is “mp3”, but other formats such as PCM, μ-law, A-law, and Opus are available.

  • MP3
    • Sample rates: 22.05kHz - 44.1kHz
    • Bitrates: 32kbps - 192kbps
    • 22.05kHz @ 32kbps
    • 44.1kHz @ 32kbps, 64kbps, 96kbps, 128kbps, 192kbps
  • PCM (S16LE)
    • Sample rates: 8kHz, 16kHz, 22.05kHz, 24kHz, 44.1kHz, 48kHz
    • 16-bit depth
  • μ-law
    • 8kHz sample rate
    • Optimized for telephony applications
  • A-law
    • 8kHz sample rate
    • Optimized for telephony applications
  • Opus
    • Sample rate: 48kHz
    • Bitrates: 32kbps - 192kbps

Higher-quality audio options are only available on paid tiers; see our pricing page for details.
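The MP3 combinations listed above can be encoded as a small lookup table so that an application can reject unsupported settings before sending a request. The sketch below also shows the raw data rate implied by 16-bit PCM. Both helper names are hypothetical, and the mono default (`channels=1`) is an assumption, since this page does not state the channel count.

```python
# Supported MP3 bitrates (kbps) per sample rate (Hz), taken from the list above.
MP3_OPTIONS = {
    22050: {32},
    44100: {32, 64, 96, 128, 192},
}


def is_supported_mp3(sample_rate_hz: int, bitrate_kbps: int) -> bool:
    """Return True if the sample rate and bitrate pair appears in the list."""
    return bitrate_kbps in MP3_OPTIONS.get(sample_rate_hz, set())


def pcm_bytes_per_second(sample_rate_hz: int, channels: int = 1) -> int:
    """Raw data rate of S16LE PCM: 16-bit depth means 2 bytes per sample."""
    return sample_rate_hz * 2 * channels


print(is_supported_mp3(44100, 128))
print(is_supported_mp3(22050, 128))
print(pcm_bytes_per_second(16000))
```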

Supported languages

The Eleven v3 model supports 70+ languages, including:

Afrikaans (afr), Arabic (ara), Armenian (hye), Assamese (asm), Azerbaijani (aze), Belarusian (bel), Bengali (ben), Bosnian (bos), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Chichewa (nya), Croatian (hrv), Czech (ces), Danish (dan), Dutch (nld), English (eng), Estonian (est), Filipino (fil), Finnish (fin), French (fra), Galician (glg), Georgian (kat), German (deu), Greek (ell), Gujarati (guj), Hausa (hau), Hebrew (heb), Hindi (hin), Hungarian (hun), Icelandic (isl), Indonesian (ind), Irish (gle), Italian (ita), Japanese (jpn), Javanese (jav), Kannada (kan), Kazakh (kaz), Kirghiz (kir), Korean (kor), Latvian (lav), Lingala (lin), Lithuanian (lit), Luxembourgish (ltz), Macedonian (mkd), Malay (msa), Malayalam (mal), Mandarin Chinese (cmn), Marathi (mar), Nepali (nep), Norwegian (nor), Pashto (pus), Persian (fas), Polish (pol), Portuguese (por), Punjabi (pan), Romanian (ron), Russian (rus), Serbian (srp), Sindhi (snd), Slovak (slk), Slovenian (slv), Somali (som), Spanish (spa), Swahili (swa), Swedish (swe), Tamil (tam), Telugu (tel), Thai (tha), Turkish (tur), Ukrainian (ukr), Urdu (urd), Vietnamese (vie), Welsh (cym).

FAQ

Text to Dialogue is only available on the Eleven v3 model.

You retain ownership of any audio you generate. However, commercial usage rights are only available with paid plans. With a paid subscription, you may use generated audio for commercial purposes and monetize the outputs if you own the IP rights to the input content.

A free regeneration allows you to regenerate the same text-to-speech content without additional cost, subject to these conditions:

  • Only available within the ElevenLabs dashboard.
  • You can regenerate each piece of content up to 2 times for free.
  • The content must be exactly the same as the previous generation. Any changes to the text, voice settings, or other parameters will require a new, paid generation.

Free regenerations are useful when there is a slight distortion in the audio output. According to ElevenLabs’ internal benchmarks, regenerations resolve roughly half of quality issues; the remaining issues are usually due to poor training data.

There is no limit to the number of speakers in a dialogue.

The models are nondeterministic. For consistency, use the optional seed parameter, though subtle differences may still occur.
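One way to apply a fixed seed is to add it to the request body before sending. This is a sketch only: the `seed` key name follows the parameter name mentioned above, the accepted range (an unsigned 32-bit integer) is an assumption, and `with_seed` and the example payload are hypothetical.

```python
MAX_SEED = 4_294_967_295  # assumed upper bound (unsigned 32-bit integer)


def with_seed(payload: dict, seed: int) -> dict:
    """Return a copy of `payload` with a fixed seed; the input is not modified."""
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    seeded = dict(payload)
    seeded["seed"] = seed
    return seeded


# Hypothetical request body; field names are illustrative.
request_body = {"inputs": [{"text": "[cheerfully] Hello!", "voice_id": "VOICE_ID_A"}]}
print(with_seed(request_body, 12345))
```

Reusing the same seed with identical text and settings should make repeated generations more similar, though, as noted above, subtle differences may still occur.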

Split long text into segments and use streaming for real-time playback and efficient processing.
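Segmenting can be done before any request is made. The sketch below groups whole sentences into segments of at most `max_chars` characters. `split_into_segments` is a hypothetical helper, and the 500-character default is an arbitrary illustration, not a documented limit.

```python
import re
from typing import List


def split_into_segments(text: str, max_chars: int = 500) -> List[str]:
    """Group sentences into segments no longer than `max_chars` where possible.

    A single sentence longer than `max_chars` is kept whole rather than cut
    mid-sentence.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)  # close the current segment
            current = sentence        # start a new one with this sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_into_segments("One. Two. Three.", max_chars=9))
```

Splitting on sentence boundaries keeps each segment self-contained, so audio tags and punctuation cues stay with the sentence they belong to.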