The video is currently slightly outdated as we’ve released new features since it was made, and the training time is significantly quicker. However, a lot of the information in it is still relevant.

Unlike Instant Voice Cloning (IVC), which lets you clone voices from very short samples nearly instantaneously, Professional Voice Cloning (PVC) allows you to train a hyper-realistic model of a voice. This is achieved by training a dedicated model on a large set of voice data to produce a model that’s indistinguishable from the original voice.

Since the custom models require fine-tuning and training, it will take a bit longer to train these Professional Voice Clones compared to the Instant Voice Clones. Giving an estimate is challenging as it depends on the number of people in the queue before you and a few other factors.

Here are the current estimates for Professional Voice Cloning:

  • English: ~3 hours
  • Multilingual: ~6 hours

2024-02-28: We are currently experiencing longer wait times for Multilingual, as we just released the new technology for all voices. Wait times will remain longer until we get through the backlog of voices waiting to be trained. This should only be temporary, and training times should stabilize and return to the above estimate in a few days’ time.

Voice Creation

There are a few things to be mindful of before you start uploading your samples, and some steps that you need to take to ensure the best possible results.

Firstly, Professional Voice Cloning is highly accurate in cloning the samples used for its training. It will create a near-perfect clone of what it hears, including all the intricacies and characteristics of that voice, but also including any artifacts and unwanted audio present in the samples. This means that if you upload low-quality samples with background noise, room reverb/echo, or any other type of unwanted sound, like music or multiple people speaking, the AI will try to replicate all of these elements in the clone as well.

Secondly, make sure there’s only a single speaking voice throughout the audio, as more than one speaker, excessive noise, or any of the issues above can confuse the AI. This confusion can result in the AI being unable to discern which voice to clone, or misinterpreting what the voice actually sounds like because it is being masked by other sounds, leading to a less-than-optimal clone.

Thirdly, make sure you have enough material to clone the voice properly. The bare minimum we recommend is 30 minutes of audio, but for the optimal result and the most accurate clone, we recommend closer to 3 hours of audio. You might be able to get away with less, but at that point, we can’t vouch for the quality of the resulting clone.
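
As a rough illustration of the guidance above, the following sketch classifies a set of recordings by their total runtime. It is only a sketch: the function name, the constant names, and the verdict strings are hypothetical and not part of any product API. The only values taken from this guide are the 30-minute minimum and the 3-hour target.

```python
# Thresholds taken from the guidance above; the names are illustrative only.
MIN_RECOMMENDED_S = 30 * 60      # 30 minutes: the stated bare minimum
OPTIMAL_S = 3 * 60 * 60          # 3 hours: the stated target for best results


def assess_total_runtime(durations_s):
    """Classify a set of sample durations (in seconds) by total runtime."""
    total = sum(durations_s)
    if total < MIN_RECOMMENDED_S:
        verdict = "below recommended minimum"
    elif total < OPTIMAL_S:
        verdict = "meets minimum, short of optimal"
    else:
        verdict = "at or above optimal"
    return total, verdict


# Example: three recordings of 20, 25, and 10 minutes.
total, verdict = assess_total_runtime([20 * 60, 25 * 60, 10 * 60])
print(f"{total / 60:.0f} minutes: {verdict}")
```

The check deliberately works on the summed runtime rather than the number of files, since it is the total amount of audio that matters.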

Fourthly, the speaking style in the samples you provide will be replicated in the output, so depending on what delivery you are looking for, the training data should correspond to that style (e.g. if you are looking to voice an audiobook with a clone of your voice, the audio you submit for training should be a recording of you reading a book in the tone of voice you want to use). It is better to include just one style in the uploaded samples for consistency’s sake.

Lastly, it’s best to use samples in which you are speaking the language that the PVC will mainly be used for. Of course, the AI can speak any language that we currently support. However, it is worth noting that if the voice itself is not native to the language you want the AI to speak - meaning you cloned a voice speaking a different language - it might have an accent from the original language and might mispronounce words and inflections. For instance, if you clone a voice speaking English and then want it to speak Spanish, it will very likely have an English accent when speaking Spanish. We only support cloning samples recorded in one of our supported languages, and the application will reject your sample if it is recorded in an unsupported language.

For now, we only allow you to clone your own voice. You will be asked to go through a verification process before submitting your fine-tuning request.

  • Professional Recording Equipment: Use high-quality recording equipment for optimal results, as the AI will clone everything about the audio. High-quality input = high-quality output. Any microphone will work, but an XLR mic going into a dedicated audio interface would be our recommendation. A few general recommendations on the low end would be something like an Audio-Technica AT2020 or a Rode NT1 going into a Focusrite interface or similar.
  • Use a Pop-Filter: Use a pop filter when recording to minimize plosives.
  • Microphone Distance: Position yourself at the right distance from the microphone - approximately two fists away from the mic is recommended, but it also depends on what type of recording you want.
  • Noise-Free Recording: Ensure that the audio input doesn’t have any interference, like background music or noise. The AI cloning works best with clean, uncluttered audio.
  • Room Acoustics: Preferably, record in an acoustically-treated room. This reduces unwanted echoes and background noises, leading to clearer audio input for the AI. You can make something temporary using a thick duvet or quilt to dampen the recording space.
  • Audio Pre-processing: Consider editing your audio beforehand if you’re aiming for a specific sound you want the AI to output. For instance, if you want a polished podcast-like output, pre-process your audio to match that quality. Likewise, if you have long pauses or many “uhm”s and “ahm”s between words, consider editing them out, as the AI will mimic those as well.
  • Volume Control: Maintain a consistent volume that’s loud enough to be clear but not so loud that it causes distortion. The goal is to achieve a balanced and steady audio level. The ideal would be between -23 dB and -18 dB RMS with a true peak of -3 dB.
  • Sufficient Audio Length: Provide at least 30 minutes of high-quality audio that follows the above guidelines for best results - preferably closer to 3 hours of audio. The more quality data you can feed into the AI, the better the voice clone will be. The number of samples is irrelevant; the total runtime is what matters. However, if you plan to upload multiple hours of audio, it is better to split it into multiple ~30-minute samples. This makes it easier to upload.
  • Uploading: After pressing upload, you will not be able to make any changes to the clone and it will be locked in. Ensure that you have uploaded the correct samples that you want to use.
  • Verify Your Voice: Once everything is recorded and uploaded, you will be asked to verify your voice. To ensure a smooth experience, please try to verify your voice using the same or similar equipment used to record the samples and in a tone and delivery that is similar to what was present in the samples. If you do not have access to the same equipment, try verifying the best you can. If it fails, you will have to reach out to support.
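
The volume targets in the list above can be checked before uploading with a few lines of code. The sketch below makes several assumptions that are not stated in this guide: it treats the dB figures as dBFS (relative to digital full scale), it expects samples already decoded to floats in the range -1.0 to 1.0, and it measures the plain sample peak, which only approximates a true peak measurement (true peak metering requires oversampling). All function and parameter names are hypothetical.

```python
import math


def rms_dbfs(samples):
    """RMS level in dBFS for float samples in the range [-1.0, 1.0]."""
    if not samples:
        raise ValueError("no samples provided")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def peak_dbfs(samples):
    """Sample peak in dBFS (an approximation of true peak, see lead-in)."""
    if not samples:
        raise ValueError("no samples provided")
    peak = max(abs(s) for s in samples)
    return float("-inf") if peak == 0 else 20 * math.log10(peak)


def check_levels(samples, rms_range=(-23.0, -18.0), peak_ceiling=-3.0):
    """Compare measured levels against the targets given in this guide."""
    rms = rms_dbfs(samples)
    peak = peak_dbfs(samples)
    return {
        "rms_db": rms,
        "peak_db": peak,
        "rms_ok": rms_range[0] <= rms <= rms_range[1],
        "peak_ok": peak <= peak_ceiling,
    }


# Example: a synthetic sine tone with amplitude 0.15 (10 full cycles).
tone = [0.15 * math.sin(2 * math.pi * 10 * k / 1000) for k in range(1000)]
print(check_levels(tone))
```

A real recording would first need to be decoded into float samples, for example with an audio library of your choice; that step is left out here to keep the sketch self-contained.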

Keep in mind that all of this depends on the output you want. The AI will try to clone everything in the audio, but for the AI to work optimally and predictably, we suggest following the guidelines mentioned above.

Once you’ve uploaded your samples, there are four stages of the cloning process that you might see on your voice card.

  • Verify: This means that you have uploaded the voice samples, but you have not yet finished the verification step. You will need to finish this step before training can start.
  • Processing: This means that the voice has been verified and is being preprocessed so that it is ready to be trained. When you’ve reached this step, the rest is automatic, and you will not need to do anything.
  • Fine-tuning: This is when the voice is actually training. Along with this label, you will also see a loading bar to show you the progress.
  • Fine-tuned: This means the voice has finished training and is ready to be used!

Scripts

What you read is not very important; how you read it, however, is. The AI will try to mimic everything it hears in a voice: the tonal quality, the accent, the inflection, and many other intricate details. It will replicate how you pronounce certain words, vowels, and consonants, but not the actual words themselves. So, it is better to choose a text or script that conveys the emotion you want to capture, and to read it in the tone of voice you want to use.