Step-by-step Guide
Step-by-step guide to creating the highest quality voice clone available
What is a Professional Voice Clone (PVC)?
A Professional Voice Clone (PVC) is a special feature available on our Creator+ plans. A PVC is an ultra-realistic, custom AI model of your voice. It is created by training our specialized model on a longer set of voice recordings (at least 30 minutes, and up to 3 hours for optimum results) so that it sounds just like the original voice.
Essentially, a PVC is a more advanced version of our Instant Voice Cloning feature. For now, we only allow you to clone your own voice. You will be asked to go through a verification process with our voice Captcha before submitting your fine-tuning request.
Custom AI models require fine-tuning and training, so PVCs will take longer (about 4 to 8 hours) compared to Instant Voice Clones.
How to create a PVC?
A step-by-step guide to creating a high-quality PVC:
1. Go to your VoiceLab by first clicking on the “Voices” tab:
2. Click on “Add Generative or Cloned Voice” and choose “Professional Voice Cloning”:
3. Confirm that you have read our Guidelines and Rules and click “Start”:
4. Name your PVC, Choose the Language, and Upload your recordings:
5. Add Labels and a Description for your Professional Voice Clone (this can be changed later) and click “Create Professional Voice”:
6. Once the PVC model has accepted all of your Audio Samples, you will then be asked to verify your voice:
Before moving on, ensure that your browser has access to your Microphone and that you are not muted. The AI will compare your voice to the samples you just submitted, so it’s important to use the same equipment that you used to record your uploaded audio samples - try your best to match the tone and delivery!
Once you’re ready, click “Start Recording” and you will see a generated captcha to read such as the following:
If the AI detects a difference in the audio quality of your recordings, you may see an error message letting you know:
Ensure that your audio quality matches your uploaded recordings, speak clearly and concisely, and try again:
7. Once successful, your voice will be marked as “successfully verified”. Your PVC will then be queued for Fine-Tuning:
8. After your PVC has been fine-tuned, you will find it in your VoiceLab. Click “use” so that it appears on the Speech Synthesis page, where you can use it to generate the audio you need.
9. If you would like to share your PVC in our Voice Library and start earning passively with Payouts, follow these steps.
Recording audio for your PVC
Recording Key Considerations
Before you upload your audio samples for Professional Voice Cloning (PVC), there are key considerations to keep in mind to achieve the best results.
- Recording quality: Professional Voice Cloning is highly accurate in cloning the samples used for its training. It will create a near-perfect clone of what it hears, including all the nuances and characteristics of that voice, but also any artifacts and unwanted audio present in the samples. This means that if you upload low-quality samples with background noise, room reverb/echo, or any other type of unwanted sound, the AI will try to replicate all of these elements in the clone as well, leaving your model with background noise, sibilance, or reverb of its own. Please follow these guidelines for best results.
- Clear audio with a single speaker and no background music or sound effects: Ensure there’s only a single speaking voice throughout the audio, as more than one speaker, excessive noise, or any of the issues above can confuse the AI. This confusion can result in the AI being unable to discern which voice to clone, or misinterpreting what the voice actually sounds like because it is masked by other sounds, leading to a less-than-optimal clone.
- Use between 30 minutes and 3 hours of audio: The bare minimum we recommend is 30 minutes of audio, but for optimal results and the most accurate clone, we recommend closer to 3 hours. The more quality data you can feed into the AI, the better the voice clone will be. This can be either one long file or several different files. If you choose to upload multiple audio files, make sure they have the same audio quality and are recorded in the same space. If you plan to upload multiple hours of audio, it is better to split it into multiple ~30-minute samples, which makes it easier to upload.
- Use a consistent delivery style: The speaking style in your samples will be replicated in the output. For consistent results, use one style per upload. For instance, if you’re creating a voice model intended for audiobooks, submit recordings of yourself reading books in a consistent style, and avoid different character voices, as these will create errors in your voice model. This does not mean monotone or emotionless: feel free to vary your tone and emotion according to the context of the text.
- Use audio samples in the same language as your PVC model: For best results, use samples in the language you primarily intend the PVC for. While the AI can speak any supported language, cloning a voice from a different language may result in accents or mispronunciations. For example, if you clone an English voice for Spanish, it may retain an English accent. We only support cloning samples recorded in one of our supported languages, and the application will reject your sample if it is recorded in an unsupported language.
- Clone your own voice only: For now, we only allow you to clone your own voice. You will be asked to go through a verification process with our voice Captcha before submitting your fine-tuning request.
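If you would like to check the 30-minute to 3-hour recommendation and the ~30-minute split before uploading, the following is a minimal sketch using only Python's standard `wave` module. It assumes your recordings are uncompressed PCM `.wav` files; the function names and the output naming scheme (`prefix_00.wav`, `prefix_01.wav`, ...) are illustrative choices and not part of any official tooling.

```python
import wave

MIN_TOTAL_SECONDS = 30 * 60       # 30 minutes, the recommended minimum
MAX_TOTAL_SECONDS = 3 * 60 * 60   # 3 hours, the recommended upper end
CHUNK_SECONDS = 30 * 60           # ~30-minute samples for easier upload


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_ok(paths):
    """Return (in_range, total_seconds) for the combined duration of all files."""
    total = sum(wav_duration_seconds(p) for p in paths)
    return MIN_TOTAL_SECONDS <= total <= MAX_TOTAL_SECONDS, total


def split_wav(path, out_prefix, chunk_seconds=CHUNK_SECONDS):
    """Split one long WAV file into consecutive chunks and return the new paths.

    The audio data is copied as-is, so the format and quality are unchanged.
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:02d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)   # frame count in the header is corrected on close
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Note that this splits at fixed time offsets, so a cut can land in the middle of a word. For real recordings you may prefer to cut at natural pauses in your DAW instead.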
Recording Quality Guidelines
Whether you’re new to voice recording or a seasoned professional, here are some quality guidelines to consider. Please note that if you’re sharing your PVC in our Voice Library and it follows these guidelines and showcases consistent output, your PVC may earn a High-Quality Badge in our Voice Library, enhancing your ranking and potential earnings!
General recording guidelines:
- Use professional recording equipment: The AI will clone everything in your audio. High-quality input = high-quality output. Opt for a professional XLR mic going into a dedicated audio interface.
- Use a pop-filter: This will minimize plosives when recording.
- Microphone distance: Position yourself at the right distance from the microphone - approximately two fists away from the mic is recommended, but it also depends on what type of recording you want.
- Noise-free recording: Ensure that the audio input doesn’t have any interference, like background music or noise. The AI cloning works best with clean, uncluttered audio.
- Room acoustics: Always record in an acoustically-treated room. This reduces unwanted echoes and background noises, leading to clearer audio input for the AI.
- Audio pre-processing (optional): You might find that adding light compression or other tools can improve your audio files before creating your PVC. Please note that excessive processing can have diminishing returns, so it’s best to be conservative with these effects.
- Volume control: Maintain a consistent volume that’s loud enough to be clear but not so loud that it causes distortion. The goal is to achieve a balanced and steady audio level. The ideal would be between -23dB and -18dB RMS with a true peak of -3dB.
- Audio file format: Mono, .wav, Minimum 44.1 kHz sample rate, and Minimum 16-bit depth
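The file format requirement above (mono, .wav, at least 44.1 kHz, at least 16-bit) is easy to verify before uploading. Here is a small sketch using Python's standard `wave` module; the function name `check_wav_format` is a hypothetical helper, and the check only looks at the file header, not at how the audio actually sounds.

```python
import wave

MIN_SAMPLE_RATE = 44100   # Hz
MIN_BIT_DEPTH = 16        # bits per sample


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file passes.

    Note: the standard `wave` module reads PCM WAV files only, so other
    encodings (for example 32-bit float) are reported as unreadable here.
    """
    try:
        wav = wave.open(path, "rb")
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM .wav file: {exc}"]

    problems = []
    with wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8   # sample width is reported in bytes
        if channels != 1:
            problems.append(f"expected mono, found {channels} channels")
        if rate < MIN_SAMPLE_RATE:
            problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE} Hz")
        if bits < MIN_BIT_DEPTH:
            problems.append(f"bit depth {bits} is below {MIN_BIT_DEPTH}-bit")
    return problems
```

For example, `check_wav_format("take_01.wav")` (a placeholder file name) returns an empty list when the file meets all three requirements.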
Please avoid these technical recording issues:
- Room echo or “boxiness.”
- Background noise, including hiss, white noise, electrical hum, or external disturbances.
- Apparent editing issues (i.e. clicks, pops, audible cuts).
- Distortion, clipping, heavy compression, or excessive processing (i.e. noise gate, noise reduction plugin, normalization, EQ).
- Sibilance, loud breath noises, plosives, and mouth clicks.
- Repeats, mistakes, and long periods of silence (5 seconds or more).
- Voice level/input gain imbalance anywhere in the recording.
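Of the issues listed above, long periods of silence (5 seconds or more) are among the easiest to find automatically. The sketch below scans a recording in short windows and reports any quiet stretch that lasts at least 5 seconds. It assumes a 16-bit mono PCM `.wav` file, and the -50 dB threshold and 0.1-second window are assumptions you may need to tune for your own noise floor.

```python
import array
import sys
import wave


def read_mono_16bit(path):
    """Return (sample_rate, samples) for a 16-bit mono PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono WAV files")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()   # WAV data is little-endian
    return rate, samples


def find_long_silences(path, min_seconds=5.0, threshold_db=-50.0,
                       window_seconds=0.1):
    """Return (start, end) times in seconds of quiet stretches >= min_seconds."""
    rate, samples = read_mono_16bit(path)
    window = max(1, int(rate * window_seconds))
    limit = 32768 * 10 ** (threshold_db / 20)   # threshold as a sample amplitude
    silences = []
    run_start = None   # sample index where the current quiet run began

    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        quiet = max(abs(s) for s in block) < limit
        if quiet:
            if run_start is None:
                run_start = start
        else:
            if run_start is not None and (start - run_start) / rate >= min_seconds:
                silences.append((run_start / rate, start / rate))
            run_start = None

    # Handle a quiet run that continues to the end of the file.
    if run_start is not None and (len(samples) - run_start) / rate >= min_seconds:
        silences.append((run_start / rate, len(samples) / rate))
    return silences
```

Each reported pair tells you where to trim in your editor. The boundaries are only accurate to within one window (0.1 seconds by default).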
Performance guidelines:
- Emphasis, intonations and emotions should align appropriately with the context of the text to create a realistic PVC.
- In some cases (e.g. audiobooks), emotional range and variance is helpful in delivering an engaging performance and creating a great AI voice. Our models can capture this emotional range, but the voice itself should remain consistent.
- Please vary your tone and pace naturally when reading. ✅
- Please avoid changing voices for different characters in a single recording or else this will create errors in your voice model. ❌
- Ensure correct and articulate pronunciation.
- Avoid sounding nasal, muffled, or wet (excess saliva).
Beginner’s Guide to Audio Recording
New to audio recording? Follow our guidelines below!
1) Recording Location
When recording audio, choose a suitable location and set it up to minimize room echo/reverb. The goal is to “deaden” the room as much as possible. This is precisely what an acoustically treated vocal booth is made for. If you do not have a vocal booth readily available, you can experiment with a DIY vocal booth, a “blanket fort”, or a closet.
Here are a few YouTube examples of DIY acoustics ideas:
- I made a vocal booth for $0.00!
- How to Record GOOD Vocals in a BAD Room
- The 5 BEST Vocal Home Recording TIPS!
2) Equipment: Microphone, pop-filter, and audio interface
A good microphone is crucial. Microphone prices range up to $10,000, but a professional XLR microphone costing up to $300 is sufficient for most voiceover work.
For an affordable yet high-quality setup for voiceover work, consider a Focusrite interface paired with an Audio-Technica AT2020 or Rode NT1 microphone. This setup, costing up to about $500, offers high-quality recording suitable for professional use, with minimal self-noise for clean results.
Also, please ensure that you have a proper pop-filter in front of the microphone when recording to avoid plosives as well as breaths and air hitting the diaphragm/microphone directly, as it will sound poor and will also cause issues with the cloning process.
3) Digital Audio Workstation (DAW)
There are many different recording solutions out there that all accomplish the same thing: recording audio. However, they are not all created equal. As long as they can record WAV files at 44.1kHz or 48kHz with a bit depth of at least 24 bits, they should be fine. You don’t need any fancy post-processing, plugins, denoisers, or anything else, because we want to keep audio recording simple.
If you want a recommendation, we would suggest something like REAPER, which is a fantastic DAW with a tremendous amount of flexibility. It is the industry standard for a lot of audio work. For a personal license or a discounted license, it is only $60. Another good free option is Audacity.
Maintain optimal recording levels (not too loud or too quiet) to avoid digital distortion and excessive noise. Aim for peaks of -6 dB to -3 dB and an average loudness of -18 dB for voiceover work, ensuring clarity while minimizing the noise floor. Monitor closely and adjust levels as needed for the best results based on the project and recording environment.
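If your DAW does not show these figures directly, you can estimate them from an exported file. The sketch below measures the peak and the average (RMS) level of a 16-bit mono PCM `.wav` file in dB relative to full scale. A few assumptions to be aware of: it reports the sample peak, which can read slightly lower than a true-peak meter; the RMS is unweighted, so it will not match a LUFS loudness meter exactly; and the ±3 dB tolerance around the -18 dB target is an illustrative choice, not an official limit.

```python
import array
import math
import sys
import wave


def measure_levels(path):
    """Return (peak_db, rms_db) for a 16-bit mono PCM WAV file.

    Levels are in dB relative to full scale (0 dB = the loudest possible sample).
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono WAV files")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()   # WAV data is little-endian
    if not samples:
        raise ValueError("file contains no audio")

    full_scale = 32768.0
    peak = max(abs(s) for s in samples) / full_scale
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / full_scale

    def to_db(value):
        return 20 * math.log10(value) if value > 0 else float("-inf")

    return to_db(peak), to_db(rms)


def check_levels(peak_db, rms_db, peak_ceiling_db=-3.0,
                 rms_target_db=-18.0, rms_tolerance_db=3.0):
    """Return a list of level problems; an empty list means levels look fine."""
    problems = []
    if peak_db > peak_ceiling_db:
        problems.append(f"peak {peak_db:.1f} dB is above {peak_ceiling_db} dB")
    if abs(rms_db - rms_target_db) > rms_tolerance_db:
        problems.append(
            f"average level {rms_db:.1f} dB is more than "
            f"{rms_tolerance_db} dB away from {rms_target_db} dB"
        )
    return problems
```

A typical use is `check_levels(*measure_levels("take_01.wav"))`, where the file name is a placeholder for one of your own recordings. If the result is not empty, adjust your input gain and record again rather than fixing the level with heavy processing afterwards.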
4) Positioning
One helpful guideline to follow is to maintain a distance of about two fists away from the microphone, which is approximately 20cm (7-8 in), with a pop filter placed between you and the microphone. Some people prefer to position the pop filter at that full distance so that they can press right up against it. This helps them maintain a consistent distance from the microphone more easily.
Another common technique to avoid directly breathing into the microphone or causing plosive sounds is to speak at an angle. Speaking at an angle ensures that exhaled air is less likely to hit the microphone directly and, instead, passes by it.
5) Performance
The performance you give is one of the most crucial aspects of this entire recording session. The AI will try to clone everything about your voice to the best of its ability, which is very high. This means that it will attempt to replicate your cadence, tonality, performance style, the length of your pauses, whether you stutter, take deep breaths, sound breathy, or use a lot of “uhms” and “ahs” – it can even replicate those. Therefore, what we want in the audio file is precisely the performance and voice that we want to clone, nothing less and nothing more. That is also why it’s quite important to find a script that you can read that fits the tonality we are aiming for.
When recording for AI, it is very important to be consistent. If you are recording a voice, either keep it very animated throughout or keep it very subdued throughout. You can’t mix and match, or the AI can become unstable because it doesn’t know which part of the voice to clone. The same applies to accents: if you’re doing an accent, keep the same accent throughout the recording. Consistency is key to a proper clone!
Scripts
Here’s a variety of English scripts to help you create PVCs optimized for some of the most popular use cases.
Please remember that what you read is not very important, but how you read it is. The AI will try to mimic everything it hears in a voice: the tonal quality, the accent, the inflection, and many other intricate details. It will replicate how you pronounce certain words, vowels, and consonants, but not the actual words themselves. So, it is better to choose a text or script that conveys the emotion you want to capture, read it in the tone of voice you want to use, and make sure it suits the use case the voice is intended to serve.