Cloning Your Voice — A Comprehensive Guide (TTS)


The What, Why, Who, and When of TTS creation

Have you no friends? Does no one seem to “get you?” Do you find yourself talking to your reflection for hours on end hoping for some flicker of social interaction, much like a lost sailor drinking his own urine? Well, with the advancement of technology, you can now talk to yourself but with a technological twist!

A WIP Sample of my AI voice
  1. Dataset Creation
    A. Installing Libraries
    B. Preparing
    C. Recording
  2. Preprocessing
  3. Configuration
    A. Installing
    B. SNR
    C. Config Values
  4. Training
    A. Setting Up
    B. Reviewing
  5. Generation

Requirements

  • A good microphone and minimal outside noise
  • A Google Colab Pro account
  • Basic knowledge of Audacity, Adobe Audition, or a similar audio editor
  • Basic Python programming knowledge
  • Anaconda installed
  • A Google Drive account with at least 100GB of free space

1. Dataset Creation

Before we can even start thinking about training, we need data to train with. The AI we are using expects its data in a very particular format. Thankfully, a program exists that makes this process easier. It was created by a user formerly known as GalaticGum, and I modified it to fit our current needs for TTS creation. In my experience, it’s the best way to create a dataset without the hassle of manually transcribing everything you say. Unlike the training, which happens in the cloud, this step runs locally on your computer, but it is not computer-intensive.

/MyTTSDataset
|
| -> metadata.csv
| -> /wavs
|      -> audio1.wav
|      -> audio2.wav
|      ...

# metadata.csv
wavs/audio1.wav|This is my sentence.
wavs/audio2.wav|This is maybe my sentence.
wavs/audio3.wav|This is certainly my sentence.
wavs/audio4.wav|Let this be your sentence.

A. Installing Libraries

First, we need to download the Python tool for dataset creation and create the conda environment.

git clone https://github.com/rioharper/VoiceDatasetCreation
cd VoiceDatasetCreation
conda create -n dataset pyaudio ffmpeg
conda activate dataset
pip install -r requirements.txt

# run this as well if you are on Linux
sudo apt-get install '^libxcb.*-dev' libx11-xcb-dev libglu1-mesa-dev libxrender-dev libxi-dev libxkbcommon-dev libxkbcommon-x11-dev

B. Preparing

Before you run the python file, you need to prepare for the recording process. VITS is very particular about background noise.

  • Get in a place where you hear minimal cars, lawnmowers, the cries of the people in your basement, kitchen clanking, etc.
  • To confirm your microphone sounds optimal, test it in an audio recording program at a 22,050 Hz sample rate (the sketch below shows one way to double-check a test clip).
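As a quick sanity check, here is a minimal sketch that inspects a test recording. It assumes the soundfile package is installed (pip install soundfile), and test.wav is a placeholder name for a clip you just recorded.

# Check that a test clip matches what VITS-style training expects:
# a 22,050 Hz mono WAV file. "test.wav" is a placeholder name.
import soundfile as sf

info = sf.info("test.wav")
print(f"sample rate: {info.samplerate} Hz, channels: {info.channels}")
assert info.samplerate == 22050, "re-record or resample to 22,050 Hz"
assert info.channels == 1, "convert to mono before training"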

C. Recording

Now you are ready to start recording. So let’s open up that python file and give it a whirl.

An Example of What a Good Recording Snippet Sounds Like

2. Preprocessing

Congratulations, you’ve done it! You can frolic in some grass, the sunlight smothering you in kisses as unicorns prance around you and the birds chirp beautiful odes to your greatness. Then come back inside, because the raw recordings still need cleaning. Open your clips in Audacity (or a similar editor) and work through the following:

  • Remove silences from the beginning and end of each file
  • Cross-check the audio against the transcription to confirm that the text matches the audio’s punctuation and word choice (if not, edit the metadata file to match the audio)
  • Remove unnatural pauses in speech and gasps for air; re-record clips if necessary
  • Apply an FFT filter to remove plosives (the pop from saying “p” too hard)
  • Apply noise reduction
  • Normalize volume so every file sits at the same level (-7 to -10 dB is a good range); a scripted version of the trimming and level-matching steps is sketched after this list
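If you would rather script the mechanical parts, here is a minimal batch-processing sketch, assuming librosa, soundfile, and numpy are installed and the folder layout from earlier. It only covers trimming and level-matching; plosive removal and noise reduction are still easier to do by ear in an editor.

# A rough batch pass over the wavs/ folder: trim leading/trailing silence
# and peak-normalize each clip to about -7 dB. Paths are examples.
import glob
import numpy as np
import librosa
import soundfile as sf

TARGET_PEAK_DB = -7.0

for path in glob.glob("MyTTSDataset/wavs/*.wav"):
    audio, sr = librosa.load(path, sr=22050)             # resample to 22.05 kHz
    trimmed, _ = librosa.effects.trim(audio, top_db=30)  # cut edge silence
    peak = np.max(np.abs(trimmed))
    if peak > 0:
        trimmed *= 10 ** (TARGET_PEAK_DB / 20) / peak    # scale peak to -7 dB
    sf.write(path, trimmed, sr)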

3. Configuration

Good job, you got through the most tedious part of the process. Now it’s time to judge how good your dataset is and to squeeze as much performance out of it as possible. The following link opens a Google Colab notebook where you can run the diagnostics.

A. Installing

Copy this notebook onto your own google drive account, and then follow along:

  • First, run setup. Make sure to connect your notebook to the Google Drive account you want to train your TTS model with.
  • Then, install the libraries.
  • Upload your dataset to Google Drive under the VoiceCloning/datasets folder and unzip it from within Colab (a sketch of these cells follows this list).
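For reference, the mounting and unzipping cells usually look something like this; the zip name MyTTSDataset.zip is a placeholder, and the folder path follows the layout above.

# Typical Colab cells for this step. The dataset zip name is a placeholder.
from google.colab import drive
drive.mount('/content/drive')

!unzip -q "/content/drive/MyDrive/VoiceCloning/datasets/MyTTSDataset.zip" \
    -d "/content/drive/MyDrive/VoiceCloning/datasets/"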

B. SNR

SNR is an acronym for “Signal-to-Noise Ratio” (the ratio between your voice and the background noise) and will give you a rough estimate of which voice clips are good and which need to be scrapped. The code will point out the worst-quality clips in your dataset, which you can remove manually. If your average falls below 15, your dataset is of poor quality, and you should either spend more time processing the audio or re-record it entirely. I made probably six fully finished datasets while experimenting before I got an SNR I was happy with. Hopefully, you’re a better speaker than I am.
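If you are curious what such a check can look like, here is a crude stand-in that ranks clips by an energy-based SNR estimate. It is not the notebook’s exact algorithm, just an illustration, and it assumes librosa and numpy are installed.

# Rank clips by a rough SNR estimate: treat the quietest frames as the
# noise floor and the loudest as speech. A heuristic, not the notebook's
# exact method.
import glob
import numpy as np
import librosa

def estimate_snr_db(path):
    audio, _ = librosa.load(path, sr=22050)
    rms = librosa.feature.rms(y=audio)[0]       # per-frame RMS energy
    noise = max(np.percentile(rms, 10), 1e-10)  # quietest 10% ~ noise floor
    signal = np.percentile(rms, 90)             # loudest frames ~ speech
    return 20 * np.log10(signal / noise)

snrs = {p: estimate_snr_db(p) for p in glob.glob("MyTTSDataset/wavs/*.wav")}
for path, snr in sorted(snrs.items(), key=lambda kv: kv[1])[:10]:
    print(f"{snr:5.1f} dB  {path}")             # the ten worst clips
print("average:", np.mean(list(snrs.values())))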

C. Config Values

After sifting through your SNR data, you can move on to setting your training configuration. Don’t expect the generated voice samples to sound exactly like your ground truth; getting as close as reasonably possible is the goal. You can compare other values, but I find that the ones I included are the most influential during training. Once you’ve chosen the values that sound best, note them down and we can continue on to training!

4. Training

Now for the waiting game. For the sake of simplicity, I’ve made training its own notebook, and it covers the meat and potatoes of the whole process. Open ‘er up and we can get cooking.

A. Setting Up

First, connect to the same Google Drive account as the one you used for configuration, and install the necessary files. From there, we need to feed the values we noted down into the training process. For orientation, the sketch below shows roughly what the training code boils down to.
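This is a heavily condensed sketch in the style of the VITS recipes from the Coqui TTS repository; the notebook does all of this for you. The paths, batch size, and epoch count are placeholders, and the exact imports and argument names vary between Coqui TTS versions.

# A condensed VITS training sketch modeled on Coqui TTS's recipes.
# Paths and hyperparameters are placeholders; plug in your own values.
import os
from trainer import Trainer, TrainerArgs
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "/content/drive/MyDrive/VoiceCloning/output/"
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",            # metadata.csv uses the LJSpeech layout
    meta_file_train="metadata.csv",
    path="/content/drive/MyDrive/VoiceCloning/datasets/MyTTSDataset/",
)
config = VitsConfig(
    batch_size=16,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    datasets=[dataset_config],
    output_path=output_path,
)
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
model = Vits(config, ap, tokenizer, speaker_manager=None)
trainer = Trainer(TrainerArgs(), config, output_path, model=model,
                  train_samples=train_samples, eval_samples=eval_samples)
trainer.fit()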

B. Reviewing

Thankfully, a wonderfully useful tool called TensorBoard is enabled in Coqui TTS, so we can see real-time visualizations and hear samples of the audio as training progresses.

Left: Bad Alignment Right: Healthy Alignment
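In Colab, bringing up the dashboard is usually just two magic commands; the log directory below assumes the output path used in the training sketch above.

# Load TensorBoard in the notebook and point it at the trainer's logs.
%load_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/VoiceCloning/output/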

5. Generation

The moment of truth. What a terrifying thought. So many hours spent combing data, ruining your vocal cords, and analyzing graphs, all for this moment. I would recommend perhaps some celebratory sparkling cider. Through hell and back, you have prevailed. Here are a few classic Harvard sentences to run through your new voice:

The birch canoe slid on the smooth planks.
Glue the sheet to the dark blue background.
It’s easy to tell the depth of a well.
These days a chicken leg is a rare dish.
Rice is often served in round bowls.
The juice of lemons makes fine punch.
The box was thrown beside the parked truck.
The hogs were fed chopped corn and garbage.
Four hours of steady work faced us.
Large size in stockings is hard to sell.
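One way to synthesize a line with the trained checkpoint is Coqui’s tts command-line tool; the model and config paths below are placeholders, so point them at your best run.

# Synthesize a test line with the trained model. Paths are placeholders.
!tts --text "The birch canoe slid on the smooth planks." \
     --model_path /content/drive/MyDrive/VoiceCloning/output/best_model.pth \
     --config_path /content/drive/MyDrive/VoiceCloning/output/config.json \
     --out_path test.wav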

Conclusion

And that’s it. For now, at least. I have only covered the absolute basics of training; if you want to dive further into TTS technology, the Coqui TTS documentation goes into much more detail and is still very much active. They release new models every few months that improve quality and speed while reducing the data needed to train a good model. Many of these steps, including configuration and dataset creation, can be reused with those newer models to expedite the process. A few tweaks to the training file (detailed in the Coqui TTS documentation) will get you up and running in no time. Feel free to comment with any special tricks you’ve found, or with questions about the training process.
