Normalization
Learn how to normalize text for Text to Speech.
When using Text to Speech with complex items like phone numbers, zip codes, and email addresses, the model may mispronounce them. This is often because these specific items are absent from the training set and smaller models fail to generalize how they should be pronounced. This guide explains when those discrepancies happen and how to ensure correct pronunciation.
Why do models read out inputs differently?
Certain models are trained to read out numbers and phrases in a more human way. For instance, the phrase “$1,000,000” is correctly read out as “one million dollars” by the Eleven Multilingual v2 model. However, the same phrase is read out as “one thousand thousand dollars” by the Eleven Flash v2.5 model.
The reason for this is that Multilingual v2 is a larger model that can better generalize reading out numbers in a way that sounds natural to human listeners, whereas Flash v2.5 is a much smaller model and cannot.
Common examples
Text to Speech models can struggle with the following:
- Phone numbers (“123-456-7890”)
- Currencies (“$47,345.67”)
- Calendar events (“2024-01-01”)
- Time (“9:23 AM”)
- Addresses (“123 Main St, Anytown, USA”)
- URLs (“example.com/link/to/resource”)
- Abbreviations for units (“TB” instead of “Terabyte”)
- Shortcuts (“Ctrl + Z”)
Mitigation
Use trained models
The simplest way to mitigate this is to use a TTS model that is trained to read out numbers and phrases in a more human way, such as the Eleven Multilingual v2 model. However, this is not always possible, for instance when low latency is critical (e.g. Conversational AI).
Apply normalization in LLM prompts
In the case of using an LLM to generate the text for TTS, you can add normalization instructions to the prompt.
Use clear and explicit prompts
LLMs respond best to structured and explicit instructions. Your prompt should clearly specify that you want text converted into a readable format for speech.
Handle different number formats
Not all numbers are read out in the same way. Consider how different number types should be spoken:
- Cardinal numbers: 123 → “one hundred twenty-three”
- Ordinal numbers: 2nd → “second”
- Monetary values: $45.67 → “forty-five dollars and sixty-seven cents”
- Phone numbers: “123-456-7890” → “one two three, four five six, seven eight nine zero”
- Decimals & Fractions: “3.5” → “three point five”, “⅔” → “two-thirds”
- Roman numerals: “XIV” → “fourteen” (or “the fourteenth” if a title)
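The phone-number rule above can be sketched in Python. This is a minimal illustration, not a complete normalizer; `normalize_phone_number` and `DIGIT_WORDS` are hypothetical helper names:

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize_phone_number(number: str) -> str:
    """Spell a phone number digit by digit, pausing at each hyphen."""
    groups = number.split("-")
    spoken_groups = [
        " ".join(DIGIT_WORDS[d] for d in group if d.isdigit())
        for group in groups
    ]
    return ", ".join(spoken_groups)

print(normalize_phone_number("123-456-7890"))
# one two three, four five six, seven eight nine zero
```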
Remove or expand abbreviations
Common abbreviations should be expanded for clarity:
- “Dr.” → “Doctor”
- “Ave.” → “Avenue”
- “St.” → “Street” (but “St. Patrick” should remain)
You can request explicit expansion in your prompt:
Expand all abbreviations to their full spoken forms.
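If you preprocess text in code instead, the same expansion can be sketched with a lookup table. The `ABBREVIATIONS` map and the "St." heuristic below are illustrative assumptions, not a production rule:

```python
import re

# Hypothetical expansion table; extend for your domain.
ABBREVIATIONS = {"Dr.": "Doctor", "Ave.": "Avenue", "St.": "Street"}
PATTERN = re.compile(r"\b(Dr|Ave|St)\.")

def expand_abbreviations(text: str) -> str:
    def replace(match: re.Match) -> str:
        abbrev = match.group(0)
        # Heuristic: "St." preceded by a capitalized word ("Main St.")
        # reads as a street; otherwise ("St. Patrick") it is likely "Saint".
        if abbrev == "St.":
            words = text[:match.start()].split()
            prev_word = words[-1] if words else ""
            if not prev_word[:1].isupper():
                return abbrev  # keep "St. Patrick" as-is
        return ABBREVIATIONS[abbrev]
    return PATTERN.sub(replace, text)

print(expand_abbreviations("Dr. Smith lives on Main St."))
# Doctor Smith lives on Main Street
```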
Alphanumeric normalization
Not all normalization is about numbers; certain alphanumeric phrases should also be normalized for clarity:
- Shortcuts: “Ctrl + Z” → “control z”
- Abbreviations for units: “100km” → “one hundred kilometers”
- Symbols: “100%” → “one hundred percent”
- URLs: “elevenlabs.io/docs” → “eleven labs dot io slash docs”
- Calendar events: “2024-01-01” → “January first, two-thousand twenty-four”
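Unit and symbol suffixes can be handled with a simple substitution pass. In this sketch, the `UNIT_WORDS` map is a small illustrative subset, and the digits themselves are left for a separate number-normalization step:

```python
import re

# Hypothetical unit map; the digits would be spelled out by a later step.
UNIT_WORDS = {"km": "kilometers", "kg": "kilograms", "TB": "terabytes", "%": "percent"}

def expand_units(text: str) -> str:
    """Spell out unit suffixes: '100km' -> '100 kilometers'."""
    pattern = re.compile(r"(\d+)\s*(km|kg|TB|%)")
    return pattern.sub(lambda m: f"{m.group(1)} {UNIT_WORDS[m.group(2)]}", text)

print(expand_units("a 100km run at 100%"))
# a 100 kilometers run at 100 percent
```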
Consider edge cases
Different contexts might require different conversions:
- Dates: “01/02/2023” → “January second, twenty twenty-three” or “the first of February, twenty twenty-three” (depending on locale)
- Time: “14:30” → “two thirty PM”
If you need a specific format, explicitly state it in the prompt.
Putting it all together
The following prompt is a good starting point for most use cases:
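A sketch of such a prompt, combining the rules above (the exact wording is illustrative; adapt it to your content):

```text
Convert the following text into a format suitable for text-to-speech.
- Spell out all numbers in full words (cardinals, ordinals, decimals, fractions).
- Read phone numbers digit by digit, grouped by separator.
- Expand currencies ("$45.67" -> "forty-five dollars and sixty-seven cents").
- Expand dates and times into spoken form ("2024-01-01" -> "January first, twenty twenty-four").
- Expand abbreviations ("Dr." -> "Doctor") and units ("100km" -> "one hundred kilometers").
- Spell out URLs ("example.com" -> "example dot com") and symbols ("%" -> "percent").
Return only the normalized text.
```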
Use Regular Expressions for preprocessing
If using code to prompt an LLM, you can use regular expressions to normalize the text before providing it to the model. This is a more advanced technique and requires some knowledge of regular expressions. Here are some simple examples:
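For instance, a sketch that expands currency amounts before they reach the model. `number_to_words` and `normalize_currency` are illustrative helpers (covering values up to the hundreds of millions), not a library API:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out a cardinal number (sketch: supports 0 to 999,999,999)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    for scale, word in ((10**6, "million"), (10**3, "thousand")):
        if n >= scale:
            head, rest = divmod(n, scale)
            return number_to_words(head) + " " + word + (" " + number_to_words(rest) if rest else "")

def normalize_currency(text: str) -> str:
    """Rewrite '$1,000,000' as 'one million dollars' before sending to TTS."""
    def replace(match: re.Match) -> str:
        dollars = int(match.group(1).replace(",", ""))
        cents = match.group(2)
        spoken = number_to_words(dollars) + " dollars"
        if cents:
            spoken += " and " + number_to_words(int(cents)) + " cents"
        return spoken
    return re.sub(r"\$([\d,]+)(?:\.(\d{2}))?", replace, text)

print(normalize_currency("$1,000,000"))  # one million dollars
```

Running this pass before the LLM prompt means the model only has to preserve the already-spoken form rather than infer it.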