AI is a highly advanced field of technology, and it can at times be unpredictable: the output depends on the input and on how the AI interprets it. We have tried to minimize this unpredictability as much as possible and keep adding features and improvements that make the models more predictable and controllable. However, there are still a few things you need to be mindful of, and this applies to all generative AI.

In this section, we go through some of the issues we’ve encountered and that users have reported. Just because they are mentioned here doesn’t mean you will experience them, and in many cases there are things you can do to prevent them from happening.

General Troubleshooting

Inconsistency in volume and quality

If your audio output is inconsistent, or changes in volume or tone throughout, the issue often stems from the training audio, for example, if the training audio is itself inconsistent or has a high dynamic range.

To fix this, use compression to reduce the dynamic range, ensuring your audio stays steady. Aim for an RMS (Root Mean Square) level between -23 dB and -18 dB and keep the true peak below -3 dB. RMS measures the average energy of your audio, while dB (decibels) indicates the loudness.

Dynamic range is the difference between the quietest and loudest parts of your audio, and compression helps to balance this range by reducing the volume of the louder parts. The RMS level represents the average power, or energy, of the audio signal; keeping it within the recommended range ensures that your audio is neither too quiet nor too loud, providing a consistent listening experience.
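
As a rough illustration, the following Python sketch (using the third-party numpy and soundfile packages) measures the RMS and sample-peak levels of a clip and applies a simple gain adjustment toward the recommended range. It is not a substitute for proper compression in an audio editor, and the file names are placeholders.

```python
import numpy as np
import soundfile as sf

# "voice_sample.wav" is a placeholder path; mix to mono for a single reading.
audio, sr = sf.read("voice_sample.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

rms_db = 20 * np.log10(np.sqrt(np.mean(audio ** 2)))
peak_db = 20 * np.log10(np.max(np.abs(audio)))
print(f"RMS: {rms_db:.1f} dBFS, peak: {peak_db:.1f} dBFS")

# Apply a simple gain toward a -20 dB RMS target, but never push the sample
# peak above -3 dB. Note: this is plain gain, not compression, and the sample
# peak is only an approximation of the true peak.
target_rms_db = -20.0
gain_db = min(target_rms_db - rms_db, -3.0 - peak_db)
sf.write("voice_sample_leveled.wav", audio * 10 ** (gain_db / 20), sr)
```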

However, there are other reasons why this might happen. The speaker might go from whispering to shouting, or from talking close to the mic to far away from it, resulting in widely varying volume or inconsistent tonality. These issues can also occur if the input audio contains music, noise, rumble, or pops/plosives. Sudden bursts of energy or constant low-frequency energy can make the AI less stable, so you should only include the actual voice you want to clone in the audio.
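
If you suspect low-frequency rumble is part of the problem, a basic high-pass filter can help you check, although a dedicated noise-reduction pass in an audio editor is usually the better option. Below is a minimal sketch using the scipy and soundfile packages; the 80 Hz cutoff and file names are assumptions you should adapt.

```python
import soundfile as sf
from scipy.signal import butter, sosfilt

# "voice_sample.wav" is a placeholder path; mix to mono for simplicity.
audio, sr = sf.read("voice_sample.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

# 4th-order high-pass around 80 Hz to attenuate rumble. This only removes
# low-frequency energy; it will not fix pops/plosives or background music.
sos = butter(4, 80, btype="highpass", fs=sr, output="sos")
sf.write("voice_sample_highpassed.wav", sosfilt(sos, audio), sr)
```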

In general, for consistency, you should use around 1 to 2 minutes of audio when creating an Instant Voice Clone. This audio should be very consistent across all aspects, such as tonality, performance, accent, and quality, if you want the output to be consistent. Using more audio than that can make the AI too variable, which can cause inconsistency between generations.

For Professional Voice Cloning, we recommend at the very least 30 minutes of audio and suggest between 2 and 3 hours of audio for the best results. Keep in mind that the audio needs to be consistent throughout for the best result. You can find more in-depth information in the guide specifically about cloning.

If you clone a voice properly, one that is consistent throughout, of high quality, and properly recorded without background noise or multiple speakers, and if you follow the guidelines, you should be able to get a very good and consistent clone. It might require a bit of experimenting if you don’t get it right the first time.

To minimize these issues if they do show up, consider breaking your text into smaller segments. This approach helps maintain a consistent volume and reduces the likelihood of degradation over longer audio generations. Utilizing our Projects feature can also be beneficial, as it allows you to generate several smaller audio segments simultaneously, ensuring better quality and consistency.

Quick Tips:

  • Apply Compression: Smooth out volume changes for a more uniform sound.
  • Normalize Levels: Ensure your audio stays within the recommended RMS and dB ranges.
  • Clean Up Noise: Remove background sounds like rumble or pops to enhance clarity.
  • Projects: Use our Projects feature to ensure more stable audio for long-form content.

We recommend that you read our guides on how to get the best possible Instant Voice Clone and Professional Voice Clone as they contain a lot of advice and best practices.

Mispronunciation

The multilingual models may occasionally mispronounce certain words, even in English. So far, the trigger seems somewhat arbitrary, but it appears to be both voice- and text-dependent: it happens more often with some voices and texts than with others, especially if you use words that also appear in other languages.

The best way to deal with this is to use the Projects feature, which seems to minimize the issue, as it is more prevalent across longer sections of text when using Speech Synthesis. It will not completely remove the issue, but it should help you avoid it and make it easier to regenerate just the affected section without redoing the whole text.

As with the above issue of inconsistency, this issue also seems to be minimized by using a properly cloned voice, cloned in the languages you want the AI to speak.

When using our Projects feature, you may want to specify the pronunciation of certain words, such as characters and brand names, or to specify how acronyms should be read. For more information on how to do this, please see the Pronunciation Dictionary section of our guide to Projects.

Language Switching and Accent Drift

The AI can sometimes switch languages or accents throughout a single generation, especially if that generation is longer, very similar to the mispronunciation issue above. This is also something we’re working on fixing, hopefully with the next iteration, as there’s not too much you can do right now. Using a proper clone, either an Instant Voice Clone or a Professional Voice Clone, trained on high-quality, consistent audio in the language you want the AI to speak, should again help mitigate most of this, especially when paired with Projects.

The most important thing to remember is that Default and generated voices are English and might have an English accent when used to generate other languages. This means that they may not have the proper pronunciation and might be more prone to switching languages and accents. The best approach is to clone a voice speaking the language you want the AI to speak, with the accent you want. This provides the most context for the AI to understand how to perform a passage and should minimize language switching.

There is currently no way to explicitly select the language you want the AI to speak. Instead, you “select” the language by writing your text in that language. If you are using a voice that is not native to the language - for example, one of the pre-made voices, which are in English - the AI might have a slight English accent when speaking other languages.

To get optimal results, we recommend cloning a voice that speaks the target language with the correct accent. This is especially important when dealing with languages that are very similar and share a lot of common words, as it ensures that the AI has the most information to understand which pronunciation and language it should choose.

Another important point to note is that the AI usually begins with one accent and can gradually shift over longer segments of text, which generally means text longer than a few hundred characters. We highly recommend using the Projects feature to avoid many of these issues. When using Text-to-Speech, we typically see the best results when generations are shorter than 800 to 900 characters.
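
If you are generating with Text-to-Speech via the API rather than Projects, a simple way to stay under that limit is to split your text at sentence boundaries before sending it. The following Python sketch is one possible approach; the 800-character limit follows the guidance above, and a single sentence longer than the limit would still need to be split by hand.

```python
import re

def split_text(text, max_chars=800):
    """Split text into chunks under max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Example usage with a placeholder file name.
chunks = split_text(open("chapter.txt", encoding="utf-8").read())
```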

Mispronounced numbers, symbols, or acronyms

The multilingual models may mispronounce certain numbers, symbols, and acronyms. For instance, “1, 2, 3” might be read in English as “one,” “two,” “three,” even when the surrounding text is in another language. If you need numbers pronounced in a particular language, write them out in words, exactly as you would like the AI to deliver them.
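
If your text contains many numbers, spelling them out by hand can be tedious, and a small preprocessing step can help. The sketch below uses the third-party num2words package, and spell_out_numbers is a hypothetical helper name; it only handles plain integers, so dates, ordinals, and decimals may still need manual attention.

```python
import re
from num2words import num2words  # third-party: pip install num2words

def spell_out_numbers(text, lang="en"):
    # Replace each run of digits with its spelled-out form in the target language.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

# Example: spell out the digits in German before sending the text for synthesis.
print(spell_out_numbers("Chapter 3 was published in 1984.", lang="de"))
```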

Corrupt Speech

This is a very rare issue, but some users have encountered it. It seems to be somewhat arbitrary when it happens, but sometimes the AI produces speech that is warped, sounding very muffled and strange, as if some sort of effect had been applied to it. Unfortunately, we do not have any suggestions for it, as we have not been able to replicate the issue or find a cause. If this happens, the best course of action is to regenerate the section, which should resolve it.

Audio degrading over longer generations

We are aware that some voices have a tendency to degrade during longer audio generations, and our team is working hard to develop the technology to improve upon this. This issue is more prominent in the experimental Multilingual v1 model, which we no longer recommend using, unless it is required for a specific voice.

If you are encountering issues with the audio degrading, we recommend breaking down the text into shorter sections, preferably below 800 characters, as this can help maintain better quality.

Some voices are more prone to this issue than others. If you’re using cloned voices, the quality of the samples used is very important to the final output. Noise and other artifacts tend to be amplified during long generations.

Both stability and similarity can change how the voice acts, as well as how prominent the artifacts are. Hovering over the ! icon next to each slider will reveal more information.

Style Exaggeration

For some voices, this voice setting can lead to instability, including inconsistent speed, mispronunciation and the addition of extra sounds. We recommend keeping this setting at 0, especially if you find you are experiencing these issues in your generated audio.
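
If you generate audio through the API rather than the website, these voice settings can also be set per request. The snippet below is a minimal Python sketch using the requests package; the voice ID and API key are placeholders, and you should check the current API reference for the authoritative field names and defaults.

```python
import requests

# Placeholders: substitute your own voice ID and API key.
VOICE_ID = "YOUR_VOICE_ID"
API_KEY = "YOUR_API_KEY"

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY},
    json={
        "text": "Hello there.",
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75,
            "style": 0.0,  # keep style exaggeration at 0 if you hear instability
        },
    },
)
response.raise_for_status()
with open("output.mp3", "wb") as f:
    f.write(response.content)
```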

Models

Flash v2.5

Flash v2.5 is optimized for low latency and should be used when that is the most important concern, for example, in real-time conversational AI. Where quality and accuracy are the primary concern, for example, for content creation, we recommend using our Multilingual v2 model.

Multilingual v1

The Multilingual v1 model is generally only recommended for older Professional Voice Clones (PVCs) that require it, as it exhibits a few issues that are not present in the newer models.

During generation, the audio may change in tone or quality, introduce noise and distortion, and the voice may transition from male to female or start whispering, among other things. The prominence of these issues largely depends on the model and voice used.

Projects

Projects is one of the world’s most advanced workflows for creating long-form content using AI. Despite its complexity, there are very few issues with Projects, and in general it works fantastically well if you use a proper voice paired with the appropriate model. For more information, see our Projects documentation.

Import Function

The import function will do its best to import the file you give it into the website. However, since there are so many variables in how websites and books can be formatted, including the presence of images, you should always double-check to ensure that everything has been imported correctly.

One such issue you might encounter is when importing a book where each chapter starts with an image in place of the first letter. This can be very confusing for the AI, as it cannot extract the letter from the image, so you will have to add that letter back to the start of each chapter.

If the text is imported as a single long paragraph instead of being split at the original line breaks, something is wrong, and it might not work properly; the import should follow the same structure as the original book. If it doesn’t, you can try copying and pasting the text instead. If that also doesn’t work, there might be something wrong with how the text is presented, and the book might not work without first converting it to another format or rewriting it fully. This is very unusual, but it’s essential to keep in mind.

EPUB is the best file format to use to create your project. If the EPUB is well-structured and correctly formatted, it will automatically split each chapter into its own chapter in Projects, making it very easy to navigate. To format your EPUB so that Projects can recognize your chapters, you need to make sure that each chapter heading is formatted as “Heading 1”.
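
If you want to check in advance how your chapters will be detected, you can list the “Heading 1” (h1) elements in the EPUB yourself. The sketch below uses the third-party ebooklib and BeautifulSoup packages; the file name is a placeholder, and the actual importer may apply additional heuristics.

```python
import ebooklib
from bs4 import BeautifulSoup
from ebooklib import epub

# "my_book.epub" is a placeholder path.
book = epub.read_epub("my_book.epub")
for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    soup = BeautifulSoup(item.get_content(), "html.parser")
    # Print every h1 heading per document so you can confirm chapter titles.
    for heading in soup.find_all("h1"):
        print(item.get_name(), "->", heading.get_text(strip=True))
```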

Glitches between paragraphs

On rare occasions, you might encounter glitches or sharp breaths between paragraphs, which you might not experience with Speech Synthesis, as the two operate differently. Generally, this issue is not very disruptive and is relatively rare, but we are actively working on resolving it. At the moment, there is no straightforward way to completely avoid it. If you do encounter an issue like this, we recommend regenerating the paragraph immediately before the point where the issue occurs. These glitches tend to occur at the end of paragraphs rather than at the beginning, so if you hear a problem between two paragraphs, it’s usually the preceding paragraph that is the cause.