Professional Voice Cloning

Learn how to clone your voice professionally using our best-in-class models.

Creating a Professional Voice Clone

When cloning a voice, it’s important to consider what the AI has been trained on: which languages and what type of dataset. You can find more information about which languages each model has been trained on in our help center.

Read more about each model and its strengths on the Models page.

Guide

If you are unsure about what is permissible from a legal standpoint, please consult the Terms of Service and our AI Safety information.

2

Upload your audio

Follow the on-screen instructions to label your voice clone and upload audio samples.

3

Verify your voice

Once everything is recorded and uploaded, you will be asked to verify your voice. For a smooth experience, try to verify using the same or similar equipment you used to record the samples, with a tone and delivery similar to the samples. If you do not have access to the same equipment, verify as best you can. If verification fails, you will need to contact support.

4

Use your voice clone

Under the “Voices” section in the dashboard, select the “Personal” tab, then click on your voice clone to begin using it.

There are a few things to be mindful of before you start uploading your samples, and some steps that you need to take to ensure the best possible results.

1

Record high quality audio

Professional Voice Cloning is highly accurate in cloning the samples used for its training. It will create a near-perfect clone of what it hears, including all the intricacies and characteristics of that voice, but also any artifacts and unwanted audio present in the samples. This means that if you upload low-quality samples with background noise, room reverb/echo, or any other unwanted sounds like music or multiple people speaking, the AI will try to replicate all of these elements in the clone as well.

2

Ensure there’s only a single speaking voice

Make sure there’s only a single speaking voice throughout the audio. More than one speaker, excessive noise, or any of the issues above can confuse the AI. It may be unable to discern which voice to clone, or it may misinterpret what the voice actually sounds like because other sounds are masking it, leading to a less-than-optimal clone.

3

Provide enough material

Make sure you have enough material to clone the voice properly. The bare minimum we recommend is 30 minutes of audio, but for the optimal result and the most accurate clone, we recommend closer to 2+ hours of audio. You might be able to get away with less, but at that point, we can’t vouch for the quality of the resulting clone.

4

Keep the style consistent

The speaking style in the samples you provide will be replicated in the output, so the training data should match the delivery you are looking for. For example, if you want to voice an audiobook with a clone of your voice, the audio you submit for training should be a recording of you reading a book in the tone of voice you want to use. For consistency's sake, it is better to include just one style in the uploaded samples.

5

Use samples in the language you want the PVC to speak

It is best to use samples in which you are speaking the language the PVC will mainly be used for. The AI can speak any language we currently support. However, if the voice is not native to the language you want the AI to speak (that is, you cloned a voice speaking a different language), it may carry an accent from the original language and may mispronounce words or use the wrong inflections. For instance, if you clone a voice speaking English and then want it to speak Spanish, it will very likely have an English accent when speaking Spanish. We only support cloning samples recorded in one of our supported languages, and the application will reject any sample recorded in an unsupported language.

See the examples below for what to expect from a good and bad recording.

For now, we only allow you to clone your own voice. You will be asked to go through a verification process before submitting your fine-tuning request.

Tips and suggestions

Professional Recording Equipment

Use high-quality recording equipment for optimal results, as the AI will clone everything about the audio: high-quality input means high-quality output. Any microphone will work, but we recommend an XLR mic going into a dedicated audio interface. For a budget setup, something like an Audio-Technica AT2020 or a Rode NT1 going into a Focusrite interface (or similar) is a good starting point.

Use a Pop-Filter

Use a pop filter when recording to minimize plosives.

Microphone Distance

Position yourself at the right distance from the microphone. Approximately two fists away is recommended, but the ideal distance also depends on the type of recording you want.

Noise-Free Recording

Ensure that the audio input doesn’t have any interference, like background music or noise. The AI cloning works best with clean, uncluttered audio.
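If you want a rough, objective check on background noise before uploading, you can estimate the recording's noise floor. The following is a minimal sketch under stated assumptions: it takes samples already decoded to floats in [-1.0, 1.0], splits them into 20 ms frames, and reports the median level (in dBFS) of the quietest 10% of frames. The function names and the 10% fraction are hypothetical choices, not part of the platform, and the estimate is only meaningful if the recording contains some pauses between phrases. This guide gives no noise-floor target, so treat the number as a way to compare rooms or setups rather than as a pass/fail test.

```python
import math
from typing import List, Sequence


def frame_levels_db(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS level in dBFS of each consecutive frame of frame_len samples."""
    levels = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        # Digital silence has no finite level in decibels.
        levels.append(20 * math.log10(rms) if rms > 0 else float("-inf"))
    return levels


def estimate_noise_floor_db(samples: Sequence[float], sample_rate: int,
                            quiet_fraction: float = 0.1) -> float:
    """Estimate the noise floor as the median level of the quietest frames.

    quiet_fraction is a hypothetical default: the share of frames assumed
    to contain only room tone (no speech).
    """
    if not samples:
        raise ValueError("no samples")
    frame_len = max(1, sample_rate // 50)  # 20 ms frames
    levels = sorted(frame_levels_db(samples, frame_len))
    quietest = levels[:max(1, int(len(levels) * quiet_fraction))]
    return quietest[len(quietest) // 2]
```

A floor that sits only a few decibels below your speech level suggests the room or equipment is adding noise that the clone may pick up.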

Room Acoustics

Preferably, record in an acoustically treated room. This reduces unwanted echo and background noise, giving the AI clearer audio to learn from. If you don't have a treated room, you can improvise by using a thick duvet or quilt to dampen the recording space.

Audio Pre-processing

Consider editing your audio beforehand if you’re aiming for a specific sound you want the AI to output. For instance, if you want a polished, podcast-like output, pre-process your audio to match that quality. Likewise, if your recordings have long pauses or many “uhm”s and “ahm”s between words, consider cutting them, as the AI will mimic those as well.
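Long pauses are easy to locate automatically before you edit. This is a minimal sketch, assuming your audio is already decoded to floats in [-1.0, 1.0]: it flags every stretch where consecutive 20 ms frames stay below a silence threshold for at least one second. The function name, the -50 dBFS threshold, and the one-second minimum are hypothetical defaults to tune for your recording, not values from this guide. It only finds pauses; “uhm”s and “ahm”s still need to be found by ear and cut in an audio editor.

```python
import math
from typing import List, Optional, Sequence, Tuple


def find_long_pauses(
    samples: Sequence[float],
    sample_rate: int,
    min_pause_seconds: float = 1.0,      # hypothetical: shortest pause worth flagging
    silence_threshold_db: float = -50.0,  # hypothetical: frames below this count as silent
    frame_seconds: float = 0.02,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of quiet stretches of at least min_pause_seconds."""
    frame_len = max(1, int(sample_rate * frame_seconds))
    threshold = 10 ** (silence_threshold_db / 20)  # dBFS -> linear amplitude
    pauses: List[Tuple[float, float]] = []
    pause_start: Optional[float] = None
    for frame_start in range(0, len(samples), frame_len):
        frame = samples[frame_start:frame_start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        t = frame_start / sample_rate
        if rms < threshold:
            if pause_start is None:
                pause_start = t  # a quiet stretch begins
        elif pause_start is not None:
            if t - pause_start >= min_pause_seconds:
                pauses.append((pause_start, t))
            pause_start = None  # speech resumed
    # A pause that runs to the end of the recording.
    end = len(samples) / sample_rate
    if pause_start is not None and end - pause_start >= min_pause_seconds:
        pauses.append((pause_start, end))
    return pauses
```

For example, `find_long_pauses(samples, 44100)` returns a list of (start, end) times you can jump to in your editor and shorten by hand.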

Volume Control

Maintain a consistent volume that is loud enough to be clear but not so loud that it causes distortion. The goal is a balanced, steady audio level: ideally between -23 dB and -18 dB RMS, with a true peak of -3 dB.
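To check a recording against these targets, you can measure its levels directly. The sketch below makes several assumptions: it reads 16-bit PCM WAV files only, interprets the targets as dBFS (decibels relative to full scale), and treats -3 dB as a ceiling for the peak. It also measures the sample peak rather than the true peak. True-peak meters as defined in ITU-R BS.1770 oversample the signal to catch peaks that fall between samples, so the true peak can be slightly higher than the value reported here; for a definitive reading, use a loudness meter in your audio editor. All function names are hypothetical.

```python
import array
import math
import sys
import wave
from typing import List, Sequence, Tuple

# Targets quoted in the guidance above; -3 dB is treated as a peak ceiling (an assumption).
RMS_MIN_DB, RMS_MAX_DB, PEAK_MAX_DB = -23.0, -18.0, -3.0


def read_pcm16(path: str) -> List[float]:
    """Read a 16-bit PCM WAV file as floats in [-1.0, 1.0], all channels pooled."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h", raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [s / 32768 for s in samples]


def to_dbfs(amplitude: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20 * math.log10(amplitude) if amplitude > 0 else float("-inf")


def measure_levels(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (RMS level, sample peak) in dBFS."""
    if not samples:
        raise ValueError("no samples")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return to_dbfs(rms), to_dbfs(peak)


def within_targets(rms_db: float, peak_db: float) -> bool:
    """True if the RMS level is in range and the peak stays at or below the ceiling."""
    return RMS_MIN_DB <= rms_db <= RMS_MAX_DB and peak_db <= PEAK_MAX_DB
```

Usage: `rms_db, peak_db = measure_levels(read_pcm16("take01.wav"))`, then `within_targets(rms_db, peak_db)`. For reference, a full-scale sine wave measures about -3 dB RMS, far louder than the target range.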

Sufficient Audio Length

Provide at least 30 minutes of high-quality audio that follows the above guidelines, and preferably closer to 2+ hours, for best results. The more quality data you can feed into the AI, the better the voice clone will be. The number of samples is irrelevant; what matters is the total runtime. However, if you plan to upload multiple hours of audio, it is better to split it into multiple ~30-minute samples, as this makes uploading easier.
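Because only the total runtime counts, a short script can tell you where you stand before uploading and cut long recordings into ~30-minute pieces. This is a minimal sketch using Python's standard-library `wave` module, so it assumes uncompressed WAV files; MP3 or other formats would need a different decoder. The function names, the folder layout, and the `_partNN` file naming are hypothetical.

```python
import wave
from pathlib import Path
from typing import List, Tuple

# Amounts of material from the guidance above, in seconds.
MINIMUM_SECONDS = 30 * 60          # at least 30 minutes
RECOMMENDED_SECONDS = 2 * 60 * 60  # preferably 2+ hours
CHUNK_SECONDS = 30 * 60            # split long recordings into ~30-minute samples


def assess_total_duration(total_seconds: float) -> str:
    """Classify the total runtime; the number of files does not matter."""
    if total_seconds < MINIMUM_SECONDS:
        return "below minimum"
    if total_seconds < RECOMMENDED_SECONDS:
        return "meets minimum"
    return "recommended"


def total_wav_seconds(folder: str) -> float:
    """Sum the runtime of every .wav file in a folder."""
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            total += wav.getnframes() / wav.getframerate()
    return total


def chunk_bounds(total_frames: int, frame_rate: int,
                 chunk_seconds: int = CHUNK_SECONDS) -> List[Tuple[int, int]]:
    """(start, end) frame ranges covering the file in chunks of at most chunk_seconds."""
    frames_per_chunk = frame_rate * chunk_seconds
    return [(start, min(start + frames_per_chunk, total_frames))
            for start in range(0, total_frames, frames_per_chunk)]


def split_wav(path: str, out_dir: str,
              chunk_seconds: int = CHUNK_SECONDS) -> List[Path]:
    """Write consecutive chunks of a WAV file to out_dir and return their paths."""
    src, out = Path(path), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        bounds = chunk_bounds(reader.getnframes(), reader.getframerate(), chunk_seconds)
        for index, (start, end) in enumerate(bounds, start=1):
            reader.setpos(start)
            frames = reader.readframes(end - start)
            target = out / f"{src.stem}_part{index:02d}.wav"
            with wave.open(str(target), "wb") as writer:
                # Same format as the source; the wave module rewrites the
                # header's frame count to match the frames actually written.
                writer.setparams(params)
                writer.writeframes(frames)
            written.append(target)
    return written
```

Usage: `assess_total_duration(total_wav_seconds("recordings"))` reports where your material stands, and `split_wav("session.wav", "chunks")` writes the pieces. The splitter cuts at exact 30-minute boundaries, which may land in the middle of a word; a more careful version would move each cut to the nearest pause.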
