Updated | Near-Automated Voice Cloning | Whisper STT + Coqui TTS | Fine Tune a VITS Model on Colab

This is about as close to automated as I can make things. I’ve put together a Colab notebook that uses a bunch of spaghetti code, rnnoise, OpenAI’s Whisper Speech to Text, and Coqui Text to Speech to train a VITS model.

Upload audio files, split and process them into clips, denoise the clips, transcribe them with Whisper, then use the resulting dataset to fine-tune a VITS model.
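To make the dataset step concrete, here's a minimal sketch of the transcription pass outside the notebook: run Whisper over a folder of denoised clips and write an LJSpeech-style metadata.csv, which is the layout Coqui's ljspeech formatter expects. The folder paths and the "medium.en" model size here are my assumptions, not necessarily what the notebook uses.

```python
# Minimal sketch: transcribe denoised clips with Whisper and build an
# LJSpeech-style metadata.csv for Coqui TTS. Paths and model size are
# assumptions; the notebook may do this differently.
from pathlib import Path

import whisper  # pip install openai-whisper

CLIPS_DIR = Path("dataset/wavs")         # hypothetical layout: one short clip per wav
METADATA = Path("dataset/metadata.csv")

model = whisper.load_model("medium.en")  # English-only model; pick a size your GPU can hold

with METADATA.open("w", encoding="utf-8") as out:
    for wav in sorted(CLIPS_DIR.glob("*.wav")):
        result = model.transcribe(str(wav), language="en")
        text = result["text"].strip()
        # LJSpeech format: file id (no extension) | raw text | normalized text
        out.write(f"{wav.stem}|{text}|{text}\n")
```

Coqui's dataset guide (linked below) covers what the clips themselves should look like.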

The Colab script has been revised to add toggles for freezing layers and some (possibly broken) audio processing toggles. This is for fine-tuning English voices; things are hardcoded for English. Adjusting that will take some work on your part, and fine-tuning across languages is hit and miss.
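For the curious, the freezing toggle boils down to flipping requires_grad on chunks of the model before training. Below is a hedged sketch against Coqui TTS's Vits class; the text_encoder prefix comes from Coqui's implementation, but confirm module names on your version with model.named_parameters(), and which parts to freeze is an illustrative choice, not what the notebook hardcodes.

```python
# Hedged sketch of layer freezing for VITS fine-tuning with Coqui TTS
# (pip install TTS). Module names come from Coqui's Vits implementation;
# verify them on your version via model.named_parameters().
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits

# Character-based tokenizer so this sketch doesn't need espeak installed.
config = VitsConfig(use_phonemes=False)
model = Vits.init_from_config(config)

# Illustrative choice: freeze the text encoder so fine-tuning mostly adapts
# the voice (flow / waveform decoder) instead of relearning text alignment.
FROZEN_PREFIXES = ("text_encoder.",)

for name, param in model.named_parameters():
    if name.startswith(FROZEN_PREFIXES):
        param.requires_grad = False

frozen = sum(1 for p in model.parameters() if not p.requires_grad)
print(f"Froze {frozen} parameter tensors")
```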

The first part of the video covers using Audacity and the VST3 port of rnnoise to clip samples more accurately on your PC. The second half is the Colab run-through.

Real time noise suppression plugin: https://github.com/werman/noise-suppression-for-voice

Colab script (r4): https://colab.research.google.com/drive/1Swo0GH_PjjAMqYYV6He9uFaq5TQsJ7ZH?usp=sharing

Audacity: https://www.audacityteam.org/

Coqui’s Dataset Guide: https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset

rnnoise: https://github.com/xiph/rnnoise
