Fine Tuning XTTS V2 with (forked) Coqui

Fine-tuning XTTS v2 with the forked Coqui project. Coqui AI shut down earlier this year, so what does that mean for us? Here I go over adjusting the Coqui XTTS v2 training recipe, creating a dataset using Audacity and faster-whisper, and training single-speaker and multi-speaker XTTS v2 English models.

Convert wavs to single channel, 22050 Hz

for /f "tokens=*" %a in ('dir /b *.wav') do ffmpeg -i "%a" -vn -ar 22050 -ac 1 "converted_wavs\%~na.wav"
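The one-liner assumes the converted_wavs folder already exists. If you would rather do the conversion from Python (or you're on Linux), here is a rough equivalent that shells out to ffmpeg; src_dir and out_dir are placeholders for your own paths.

import os
import subprocess

src_dir = "."                # folder with the original wavs (placeholder)
out_dir = "converted_wavs"
os.makedirs(out_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if name.lower().endswith(".wav"):
        # -ar 22050 resamples, -ac 1 downmixes to mono, -vn drops any non-audio stream
        subprocess.run(
            ["ffmpeg", "-y", "-i", os.path.join(src_dir, name),
             "-vn", "-ar", "22050", "-ac", "1",
             os.path.join(out_dir, name)],
            check=True,
        )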

Transcribe clips (Updated 6/27/2024)

import os

from faster_whisper import WhisperModel

#model_size = "large-v3"
model_size = "distil-large-v3"

# Dataset root; the wav clips live in the wavs/ subfolder
audio_dir = "c:\\voicemodel\\mixed-more\\"
wav_dir = audio_dir + "wavs\\"

model = WhisperModel(model_size, device="cuda", compute_type="float16")

def transcribe_and_save(audio_file):
    base_filename, _ = os.path.splitext(os.path.basename(audio_file))
    try:
        segments, _ = model.transcribe(audio_file, beam_size=5, language="en", condition_on_previous_text=False)
        transcript = " ".join(segment.text.lstrip() for segment in segments)
        # Append an LJSpeech-style row: filename|transcript|normalized transcript
        with open(os.path.join(audio_dir, "metadata.csv"), "a", encoding="utf-8") as csvfile:
            csvfile.write(f"{base_filename}|{transcript}|{transcript}\n")
        print(f"Transcribed and saved: {audio_file}")
    except Exception as e:
        print(f"Error transcribing {audio_file}: {e}")

for filename in os.listdir(wav_dir):
    if filename.endswith(".wav"):
        transcribe_and_save(os.path.join(wav_dir, filename))
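XTTS is very sensitive to noise and outlying samples, so it is worth spot-checking metadata.csv before training. A small, hypothetical check for malformed rows, empty transcripts, and missing wav files, assuming the same audio_dir layout as the script above:

import os

audio_dir = "c:\\voicemodel\\mixed-more\\"
wav_dir = os.path.join(audio_dir, "wavs")

# Flag rows with missing fields, empty text, or a wav that does not exist.
with open(os.path.join(audio_dir, "metadata.csv"), encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        parts = line.rstrip("\n").split("|")
        if len(parts) != 3 or not parts[1].strip():
            print(f"line {lineno}: bad or empty row -> {line.strip()}")
        elif not os.path.isfile(os.path.join(wav_dir, parts[0] + ".wav")):
            print(f"line {lineno}: missing wav -> {parts[0]}.wav")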

Install Coqui

git clone https://github.com/idiap/coqui-ai-TTS tts
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
cd tts
pip install -e .[all,dev,docs]
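The recipe adjustment mentioned at the top mostly comes down to pointing the dataset config in the XTTS v2 GPT training recipe (recipes/ljspeech/xtts_v2/train_gpt_xtts.py) at your own folder. A minimal sketch, assuming the ljspeech formatter and the metadata.csv produced by the transcription script above; the dataset name and paths are placeholders:

from TTS.config.shared_configs import BaseDatasetConfig

# Dataset root contains metadata.csv plus a wavs/ subfolder,
# with rows in the filename|text|normalized_text layout.
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="mixed_more",            # placeholder name
    path="c:\\voicemodel\\mixed-more\\",
    meta_file_train="metadata.csv",
    language="en",
)

DATASETS_CONFIG_LIST = [config_dataset]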

Install faster-whisper

conda create -n faster-whisper python=3.10 git pip
conda activate faster-whisper
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
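A quick, optional smoke test for the new environment; it loads the tiny model on the CPU so it runs even before CUDA is sorted out. some_clip.wav is a placeholder for any short recording:

from faster_whisper import WhisperModel

# Tiny model on CPU, just to confirm the install works end to end
model = WhisperModel("tiny", device="cpu", compute_type="int8")
segments, info = model.transcribe("some_clip.wav", language="en")
print(info.language, " ".join(segment.text for segment in segments))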

Remove optimizer

import torch

model_dir = "./run/training/GPT_XTTS_v2.0_LJSpeech_FT-bw-June-25-2024_11+37AM-98c0f86c/"
model_path = model_dir + "best_model.pth"

checkpoint = torch.load(model_path, map_location=torch.device("cpu"))

# Drop the optimizer state; it is only needed to resume training.
del checkpoint["optimizer"]

# Drop the DVAE weights, which are not used at inference time.
# https://github.com/coqui-ai/TTS/discussions/3474#discussioncomment-7965683
for key in list(checkpoint["model"].keys()):
    if "dvae" in key:
        del checkpoint["model"][key]

torch.save(checkpoint, model_dir + "model_small.pth")
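A quick, hypothetical sanity check that the stripped file still loads and actually lost the optimizer state and DVAE weights (reusing the same model_dir):

import os
import torch

model_dir = "./run/training/GPT_XTTS_v2.0_LJSpeech_FT-bw-June-25-2024_11+37AM-98c0f86c/"

small = torch.load(model_dir + "model_small.pth", map_location="cpu")
print("optimizer" in small)                         # expect False
print(any("dvae" in k for k in small["model"]))     # expect False

# The slimmed checkpoint should be a fraction of the original size.
print(f'{os.path.getsize(model_dir + "model_small.pth") / 1e6:.0f} MB vs '
      f'{os.path.getsize(model_dir + "best_model.pth") / 1e6:.0f} MB')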

Example inference test

import os
import time
import logging
from datetime import datetime

import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

logger = logging.getLogger(__name__)

print("Loading model...")
config = XttsConfig()
config.load_json("./xtts_44800/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="./xtts_44800/", use_deepspeed=False)
model.cuda()

speakerpath = "./speakers/"
phrases = ["I like big butts and I cannot lie, You other brothers can't deny. That when a girl walks in with an itty bitty waist, And a round thing in your face, you get sprung. Wanna pull up tough 'cause you notice that butt was stuffed. Deep in the jeans she's wearin', I'm hooked and I can't stop starin'. Oh, baby, I wanna get with ya, And take your picture, My homeboys tried to warn me, But that butt you got makes Me-me so horny.",
           "X T T S is very sensitive to noise and outlying samples in the dataset."]
print(len(phrases))

# Generate every phrase with every reference wav in ./speakers/
for filename in os.listdir(speakerpath):
    if filename.endswith(".wav"):
        for phrase in phrases:
            start_time = time.time()

            # The latents depend only on the reference wav, so they could be hoisted out of the phrase loop.
            print("Computing speaker latents...")
            gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[speakerpath + filename])

            print("Inference...")
            out = model.inference(
                phrase,
                "en",
                gpt_cond_latent,
                speaker_embedding,
                temperature=0.7,  # Add custom parameters here
            )

            now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            # Compute stats: XTTS outputs 24 kHz audio
            wav = torch.tensor(out["wav"]).unsqueeze(0)
            process_time = time.time() - start_time
            audio_time = wav.shape[-1] / 24000
            logger.warning("Processing time: %.3f", process_time)
            logger.warning("Real-time factor: %.3f", process_time / audio_time)
            torchaudio.save(f"{now}-xtts.wav", wav, 24000)

9 thoughts on “Fine Tuning XTTS V2 with (forked) Coqui”

  1. Please provide the files for this video, including the files you uploaded to Google Drive:

    (Training or Fine Tuning a Hindi Language VITS TTS Voice Model with Coqui TTS on Google Colab)
    https://youtu.be/jE-lKBKfxJw?si=lj7j8WtgbRKGna9n
    Please help me with this. I want to run your code as-is with your files, and I need to do some experiments with it.

    1. I don’t think I have the model checkpoint anymore. This was a proof-of-concept video that I made to help a few people who were asking me if they could train a Hindi model, but it wasn’t a fully finished model/project for release. The Colab notebook is broken now, and the dependencies would need to be fixed. Training on Colab is not recommended anymore; the Colab experience is slow and crashes a lot. The video was to demonstrate that it can be done, should someone want to take up the effort. If you are just starting to learn how this stuff works, you’ll probably want to start out with a model that officially supports the language you’re training. XTTS v2 2.0.3 supports Hindi (https://huggingface.co/coqui/XTTS-v2), and there are some VITS and YourTTS checkpoints that have been trained on Hindi by other people, but they’re not in the official Coqui model list.

      1. Actually, that XTTS-v2 isn’t good at Hindi voice cloning and reading out the text. Can you suggest a way to train XTTS-v2 on Hindi to get a perfect voice clone and have it read out the text?

        1. Ah, that makes sense. I think they just introduced it in the last version, 2.0.3, and it is probably poorly supported. If I was to generate the samples, there wouldn’t be a good way for me to test the quality because I don’t speak the language and can’t pick out the subtle differences.

          I just looked at the xtts v2 tokenizer (vocab.json file) and it looks like the character set is supported. You can probably finetune on Hindi and get it to improve. It might take a larger dataset. One thing to check is to see how the text cleaners in the script work.

          Text cleaners take the input text and modify it before tokenizing. They do things like take ‘0’ and turn it into ‘zero’, or ‘Dr.’ into ‘doctor’. You’ll need to go look at the code of whatever trainer script you’re using and make sure the text cleaner is handling the data properly. Text cleaners for the wrong language will make a mess of the input data.

          If you can find me a high-quality, properly transcribed Hindi dataset, I can try a small training session when I have some free time.

          1. idk, I may have enough to fine tune it. A couple more hours of clear speech would probably help a lot. I’ve converted the Common Voice Hindi dataset, but the recordings aren’t very clear.

  2. Perfect, thanks a lot. Please post the steps and provide the code files so that we can use them, and if you can make a YouTube video on that, it will be even more useful; a lot of researchers are working on this.
