Normalization
Learn how to normalize text for Text to Speech.
When using Text to Speech with complex items like phone numbers, zip codes, and email addresses, the model may mispronounce them. This is often because these specific items are absent from the training set and smaller models fail to generalize how they should be pronounced. This guide explains when those discrepancies happen and how to ensure correct pronunciation.
Why do models read out inputs differently?
Certain models are trained to read out numbers and phrases in a more human way. For instance, the phrase “$1,000,000” is correctly read out as “one million dollars” by the Eleven Multilingual v2 model. However, the same phrase is read out as “one thousand thousand dollars” by the Eleven Flash v2.5 model.
The reason for this is that Multilingual v2 is a larger model that can better generalize reading out numbers in a way that sounds natural to human listeners, whereas Flash v2.5 is a much smaller model and cannot.
Common examples
Text to Speech models can struggle with the following:
- Phone numbers (“123-456-7890”)
- Currencies (“$47,345.67”)
- Calendar events (“2024-01-01”)
- Time (“9:23 AM”)
- Addresses (“123 Main St, Anytown, USA”)
- URLs (“example.com/link/to/resource”)
- Abbreviations for units (“TB” instead of “Terabyte”)
- Shortcuts (“Ctrl + Z”)
Mitigation
Use trained models
The simplest way to mitigate this is to use a TTS model that is trained to read out numbers and phrases in a more human way, such as the Eleven Multilingual v2 model. However, this is not always possible, for instance when low latency is critical (e.g. Conversational AI).
Apply normalization in LLM prompts
In the case of using an LLM to generate the text for TTS, you can add normalization instructions to the prompt.
Use clear and explicit prompts
LLMs respond best to structured and explicit instructions. Your prompt should clearly specify that you want text converted into a readable format for speech.
Handle different number formats
Not all numbers are read out in the same way. Consider how different number types should be spoken:
- Cardinal numbers: 123 → “one hundred twenty-three”
- Ordinal numbers: 2nd → “second”
- Monetary values: $45.67 → “forty-five dollars and sixty-seven cents”
- Phone numbers: “123-456-7890” → “one two three, four five six, seven eight nine zero”
- Decimals & Fractions: “3.5” → “three point five”, “⅔” → “two-thirds”
- Roman numerals: “XIV” → “fourteen” (or “the fourteenth” if a title)
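The phone-number rule above can be sketched in Python. This is a minimal illustration, not a complete normalizer; `normalize_phone_number` and `DIGIT_WORDS` are hypothetical helper names:

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize_phone_number(number: str) -> str:
    """Spell a phone number digit by digit, pausing at each hyphen."""
    groups = number.split("-")
    spoken_groups = [
        " ".join(DIGIT_WORDS[d] for d in group if d.isdigit())
        for group in groups
    ]
    return ", ".join(spoken_groups)

print(normalize_phone_number("123-456-7890"))
# one two three, four five six, seven eight nine zero
```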
Remove or expand abbreviations
Common abbreviations should be expanded for clarity:
- “Dr.” → “Doctor”
- “Ave.” → “Avenue”
- “St.” → “Street” (but “St. Patrick” should remain)
You can request explicit expansion in your prompt:
Expand all abbreviations to their full spoken forms.
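If you preprocess text in code instead, the same expansion can be sketched with a lookup table. The `ABBREVIATIONS` map and the "St." heuristic below are illustrative assumptions, not a production rule:

```python
import re

# Hypothetical expansion table; extend for your domain.
ABBREVIATIONS = {"Dr.": "Doctor", "Ave.": "Avenue", "St.": "Street"}
PATTERN = re.compile(r"\b(Dr|Ave|St)\.")

def expand_abbreviations(text: str) -> str:
    def replace(match: re.Match) -> str:
        abbrev = match.group(0)
        # Heuristic: "St." preceded by a capitalized word ("Main St.")
        # reads as a street; otherwise ("St. Patrick") it is likely "Saint".
        if abbrev == "St.":
            words = text[:match.start()].split()
            prev_word = words[-1] if words else ""
            if not prev_word[:1].isupper():
                return abbrev  # keep "St. Patrick" as-is
        return ABBREVIATIONS[abbrev]
    return PATTERN.sub(replace, text)

print(expand_abbreviations("Dr. Smith lives on Main St."))
# Doctor Smith lives on Main Street
```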
Alphanumeric normalization
Not all normalization is about numbers; certain alphanumeric phrases should also be normalized for clarity:
- Shortcuts: “Ctrl + Z” → “control z”
- Abbreviations for units: “100km” → “one hundred kilometers”
- Symbols: “100%” → “one hundred percent”
- URLs: “elevenlabs.io/docs” → “eleven labs dot io slash docs”
- Calendar events: “2024-01-01” → “January first, two-thousand twenty-four”
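Unit and symbol suffixes can be handled with a simple substitution pass. In this sketch, the `UNIT_WORDS` map is a small illustrative subset, and the digits themselves are left for a separate number-normalization step:

```python
import re

# Hypothetical unit map; the digits would be spelled out by a later step.
UNIT_WORDS = {"km": "kilometers", "kg": "kilograms", "TB": "terabytes", "%": "percent"}

def expand_units(text: str) -> str:
    """Spell out unit suffixes: '100km' -> '100 kilometers'."""
    pattern = re.compile(r"(\d+)\s*(km|kg|TB|%)")
    return pattern.sub(lambda m: f"{m.group(1)} {UNIT_WORDS[m.group(2)]}", text)

print(expand_units("a 100km run at 100%"))
# a 100 kilometers run at 100 percent
```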
Consider edge cases
Different contexts might require different conversions:
- Dates: “01/02/2023” → “January second, twenty twenty-three” or “the first of February, twenty twenty-three” (depending on locale)
- Time: “14:30” → “two thirty PM”
If you need a specific format, explicitly state it in the prompt.
Putting it all together
The following prompt is a good starting point for most use cases:
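A sketch of such a prompt, combining the rules above (the exact wording is illustrative; adapt it to your content):

```text
Convert the following text into a format suitable for text-to-speech.
- Spell out all numbers in full words (cardinals, ordinals, decimals, fractions).
- Read phone numbers digit by digit, grouped by separator.
- Expand currencies ("$45.67" -> "forty-five dollars and sixty-seven cents").
- Expand dates and times into spoken form ("2024-01-01" -> "January first, twenty twenty-four").
- Expand abbreviations ("Dr." -> "Doctor") and units ("100km" -> "one hundred kilometers").
- Spell out URLs ("example.com" -> "example dot com") and symbols ("%" -> "percent").
Return only the normalized text.
```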
Use Regular Expressions for preprocessing
If using code to prompt an LLM, you can use regular expressions to normalize the text before providing it to the model. This is a more advanced technique and requires some knowledge of regular expressions. Here are some simple examples:
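For instance, a sketch that expands currency amounts before they reach the model. `number_to_words` and `normalize_currency` are illustrative helpers (covering values up to the hundreds of millions), not a library API:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out a cardinal number (sketch: supports 0 to 999,999,999)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    for scale, word in ((10**6, "million"), (10**3, "thousand")):
        if n >= scale:
            head, rest = divmod(n, scale)
            return number_to_words(head) + " " + word + (" " + number_to_words(rest) if rest else "")

def normalize_currency(text: str) -> str:
    """Rewrite '$1,000,000' as 'one million dollars' before sending to TTS."""
    def replace(match: re.Match) -> str:
        dollars = int(match.group(1).replace(",", ""))
        cents = match.group(2)
        spoken = number_to_words(dollars) + " dollars"
        if cents:
            spoken += " and " + number_to_words(int(cents)) + " cents"
        return spoken
    return re.sub(r"\$([\d,]+)(?:\.(\d{2}))?", replace, text)

print(normalize_currency("$1,000,000"))  # one million dollars
```

Running this pass before the LLM prompt means the model only has to preserve the already-spoken form rather than infer it.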