Instant Voice Cloning (IVC) allows you to create voice clones from shorter samples near instantaneously. Creating an instant voice clone does not train or create a custom AI model. Instead, it relies on prior knowledge from training data to make an educated guess about the voice rather than training on it directly. This works extremely well for a lot of voices.

However, the biggest limitation of IVC arises when you try to clone a very unique voice with a very unique accent that the AI may not have encountered during training. In such cases, creating a custom model with explicit training using Professional Voice Cloning (PVC) might be the better option.

Voice Creation

When cloning a voice, it’s important to consider what the AI has been trained on: which languages and what type of dataset. You can find the supported languages for each model here, and the dataset is quite varied, especially for the multilingual v2 model. You can read more about each individual model and its strengths here.

As mentioned earlier, if the voice you are trying to clone falls outside of these parameters, or outside of what the AI has heard during training, it might have a hard time replicating the voice perfectly using Instant Voice Cloning.

How the audio was recorded is more important than the total runtime of the samples. The number of samples you use doesn’t matter; what matters is their combined runtime.

Approximately 1-2 minutes of clear audio without any reverb, artifacts, or background noise of any kind appears to be the sweet spot. When we speak of “audio or recording quality,” we do not mean the file format or codec, such as MP3 or WAV; we mean how the audio was captured. However, regarding formats, MP3 at 128 kbps and above seems to work just fine, and higher bitrates don’t markedly improve the quality of the clone.
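
If you want to sanity-check your samples before uploading, a small script can report each file’s runtime and bitrate. The sketch below is one way to do it, assuming the third-party mutagen library and a hypothetical clone_samples folder; any tool that reads audio duration and bitrate (ffprobe, pydub, and so on) works just as well.

```python
# Sketch: check that clone samples land in the ~1-2 minute total-runtime
# sweet spot and that MP3 files are encoded at 128 kbps or above.
# Assumes the third-party mutagen library: pip install mutagen
from pathlib import Path
import mutagen

SAMPLE_DIR = Path("clone_samples")  # hypothetical folder holding your recordings

total_seconds = 0.0
for path in sorted(SAMPLE_DIR.glob("*")):
    if not path.is_file():
        continue
    audio = mutagen.File(str(path))
    if audio is None:  # skip files mutagen does not recognise as audio
        continue
    total_seconds += audio.info.length
    bitrate_kbps = getattr(audio.info, "bitrate", 0) // 1000
    print(f"{path.name}: {audio.info.length:.1f} s, ~{bitrate_kbps} kbps")
    if 0 < bitrate_kbps < 128:
        print(f"  warning: {path.name} is encoded below 128 kbps")

print(f"Total runtime: {total_seconds / 60:.1f} min (roughly 1-2 minutes is the sweet spot)")
```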

The AI will attempt to mimic everything it hears in the audio: the speed of the person talking, the inflections, the accent and tonality, the breathing pattern and strength, as well as any background noise, mouth clicks, and other artifacts, which can confuse it.

Another important thing to keep in mind is that the AI will try to replicate the performance of the voice you provide. If you talk in a slow, monotone voice without much emotion, that is what the AI will mimic. On the other hand, if you talk quickly with much emotion, that is what the AI will try to replicate.

It is crucial that the voice remains consistent throughout all the samples, not only in tone but also in performance. If there is too much variance, it might confuse the AI, leading to more varied output between generations.

  • The most important factors for a proper clone are the voice itself, its language and accent, and the quality of the recording.
  • Audio length is less important than quality but still plays an important role up to a certain point. At a minimum, input audio should be 1 minute long. Avoid going beyond 3 minutes; this yields little improvement and can, in some cases, even be detrimental to the clone, making it more unstable.
  • Keep the audio consistent. Ensure that the voice maintains a consistent tone throughout, with a consistent performance. Also, make sure that the audio quality of the voice remains consistent across all the samples. Even if you only use a single sample, ensure that it remains consistent throughout the full sample. Feeding the AI audio that is very dynamic, meaning wide fluctuations in pitch and volume, will yield less predictable results.
  • Find a good balance for the volume so the audio is neither too quiet nor too loud. The ideal is between -23 dB and -18 dB RMS with a true peak of -3 dB (a quick way to check this is sketched below).
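
To verify those levels, you can measure RMS and peak directly from the raw samples. The sketch below is a minimal check assuming the soundfile and numpy packages and a hypothetical file name; note that it measures the plain sample peak rather than an oversampled “true peak,” which is adequate for a rough check.

```python
# Sketch: report RMS level (target roughly -23 to -18 dB) and peak
# (keep below -3 dB) for a single recording.
# Assumes the third-party packages: pip install soundfile numpy
import numpy as np
import soundfile as sf

def check_levels(path: str) -> None:
    samples, _ = sf.read(path)       # float samples in the range [-1.0, 1.0]
    if samples.ndim > 1:             # fold stereo down to mono for the check
        samples = samples.mean(axis=1)

    rms_db = 20 * np.log10(np.sqrt(np.mean(samples ** 2)) + 1e-12)
    peak_db = 20 * np.log10(np.max(np.abs(samples)) + 1e-12)

    print(f"RMS:  {rms_db:6.1f} dB (target -23 to -18)")
    print(f"Peak: {peak_db:6.1f} dB (keep below -3)")

check_levels("clone_samples/sample_01.wav")  # hypothetical file name
```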

If you are unsure about what is permissible from a legal standpoint, please consult the Terms of Service for more information.