I’ve been looking at multi-speaker VITS TTS models lately, so I thought I’d share my Google Colab notebook. It’s similar to the others posted, but it uses precomputed speaker vectors. The configuration is close to the YourTTS model, though this setup seems a little easier to fine-tune. As always, this stuff is experimental, but it should help you get started if you want to poke around at training a multi-speaker, English-language VITS model using the Coqui TTS framework.
Multi-Speaker English language VITS training Colab Notebook: https://colab.research.google.com/drive/1wAuG-TcZeAUYhff0f6ZiG-so9KT-sBIE?usp=sharing
YourTTS video discussing the same training options that can be used here as well: https://www.youtube.com/watch?v=1yt2W-uK8mk
Real time noise suppression plugin: https://github.com/werman/noise-suppression-for-voice
Audacity: https://www.audacityteam.org/
Coqui’s Dataset Guide: https://github.com/coqui-ai/TTS/wiki/What-makes-a-good-TTS-dataset
rnnoise: https://github.com/xiph/rnnoise
Download my multilingual, multispeaker YourTTS model on Huggingface: https://huggingface.co/AOLCDROM/YourTTS-Fr-En-De-Es
See allvoices.txt for information about each speaker:language training pair. The model was trained on character sets and uses ‘artificial’ language codes.
Generate text with the CLI:
tts --text "text" --out_path outfile.wav --model_path path/to/model_file.pth --config_path path/to/config.json --speakers_file_path speakers/index/path/speakers.pth --speaker_idx VCTK_speaker
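If you want to script batch generation instead of typing the CLI command by hand, here’s a minimal sketch that wraps the same `tts` invocation from Python via `subprocess`. The paths and the speaker name are placeholders from the example above; substitute your own fine-tuned model files.

```python
import shlex
import subprocess

# Placeholder paths -- point these at your own fine-tuned model files.
MODEL_PATH = "path/to/model_file.pth"
CONFIG_PATH = "path/to/config.json"
SPEAKERS_FILE = "speakers/index/path/speakers.pth"

def build_tts_command(text, out_path, speaker_idx):
    """Assemble the Coqui `tts` CLI invocation as an argument list."""
    return [
        "tts",
        "--text", text,
        "--out_path", out_path,
        "--model_path", MODEL_PATH,
        "--config_path", CONFIG_PATH,
        "--speakers_file_path", SPEAKERS_FILE,
        "--speaker_idx", speaker_idx,
    ]

# Generate one file per line of input text, one per speaker.
lines = ["Hello there.", "This is a test sentence."]
for i, line in enumerate(lines):
    cmd = build_tts_command(line, f"out_{i:03d}.wav", "VCTK_speaker")
    print(shlex.join(cmd))  # show the command that would run
    # subprocess.run(cmd, check=True)  # uncomment to actually synthesize
```

Leaving the `subprocess.run` call commented out lets you dry-run the loop first and eyeball the generated commands before burning GPU time.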