Fine Tuning XTTS V2 with (forked) Coqui

Fine-tuning XTTS v2 with the forked Coqui project. Coqui AI shut down earlier this year, so what does that mean for us? Here I go over adjusting the Coqui XTTS v2 training recipe, creating a dataset using Audacity and faster-whisper, and training single-speaker and multi-speaker XTTS v2 English models.

Convert wavs to single channel, 22050 Hz

for /f "tokens=*" %a in ('dir /b *.wav') do ffmpeg -i "%a" -vn -ar 22050 -ac 1 "converted_wavs\%~na.wav"
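The one-liner assumes the converted_wavs folder already exists. If you would rather do the conversion from Python (or you're on Linux), here is a rough equivalent that shells out to ffmpeg; src_dir and out_dir are placeholders for your own paths.

import os
import subprocess

src_dir = "."                # folder with the original wavs (placeholder)
out_dir = "converted_wavs"
os.makedirs(out_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if name.lower().endswith(".wav"):
        # -ar 22050 resamples, -ac 1 downmixes to mono, -vn drops any non-audio stream
        subprocess.run(
            ["ffmpeg", "-y", "-i", os.path.join(src_dir, name),
             "-vn", "-ar", "22050", "-ac", "1",
             os.path.join(out_dir, name)],
            check=True,
        )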

Transcribe clips (Updated 6/27/2024)

import os

from faster_whisper import WhisperModel

#model_size = "large-v3"
model_size = "distil-large-v3"

# Dataset root; the wav clips live in the wavs/ subfolder
audio_dir = "c:\\voicemodel\\mixed-more\\"
wav_dir = audio_dir + "wavs\\"

model = WhisperModel(model_size, device="cuda", compute_type="float16")

def transcribe_and_save(audio_file):
    base_filename, _ = os.path.splitext(os.path.basename(audio_file))
    try:
        segments, _ = model.transcribe(audio_file, beam_size=5, language="en", condition_on_previous_text=False)
        transcript = " ".join(segment.text.lstrip() for segment in segments)
        # Append an LJSpeech-style row: filename|transcript|normalized transcript
        with open(os.path.join(audio_dir, "metadata.csv"), "a", encoding="utf-8") as csvfile:
            csvfile.write(f"{base_filename}|{transcript}|{transcript}\n")
        print(f"Transcribed and saved: {audio_file}")
    except Exception as e:
        print(f"Error transcribing {audio_file}: {e}")

for filename in os.listdir(wav_dir):
    if filename.endswith(".wav"):
        transcribe_and_save(os.path.join(wav_dir, filename))
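XTTS is very sensitive to noise and outlying samples, so it is worth spot-checking metadata.csv before training. A small, hypothetical check for malformed rows, empty transcripts, and missing wav files, assuming the same audio_dir layout as the script above:

import os

audio_dir = "c:\\voicemodel\\mixed-more\\"
wav_dir = os.path.join(audio_dir, "wavs")

# Flag rows with missing fields, empty text, or a wav that does not exist.
with open(os.path.join(audio_dir, "metadata.csv"), encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        parts = line.rstrip("\n").split("|")
        if len(parts) != 3 or not parts[1].strip():
            print(f"line {lineno}: bad or empty row -> {line.strip()}")
        elif not os.path.isfile(os.path.join(wav_dir, parts[0] + ".wav")):
            print(f"line {lineno}: missing wav -> {parts[0]}.wav")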

Install Coqui

git clone https://github.com/idiap/coqui-ai-TTS tts
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
cd tts
pip install -e .[all,dev,docs]
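The recipe adjustment mentioned at the top mostly comes down to pointing the dataset config in the XTTS v2 GPT training recipe (recipes/ljspeech/xtts_v2/train_gpt_xtts.py) at your own folder. A minimal sketch, assuming the ljspeech formatter and the metadata.csv produced by the transcription script above; the dataset name and paths are placeholders:

from TTS.config.shared_configs import BaseDatasetConfig

# Dataset root contains metadata.csv plus a wavs/ subfolder,
# with rows in the filename|text|normalized_text layout.
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="mixed_more",            # placeholder name
    path="c:\\voicemodel\\mixed-more\\",
    meta_file_train="metadata.csv",
    language="en",
)

DATASETS_CONFIG_LIST = [config_dataset]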

Install faster-whisper

conda create -n faster-whisper python=3.10 git pip
conda activate faster-whisper
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
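A quick, optional smoke test for the new environment; it loads the tiny model on the CPU so it runs even before CUDA is sorted out. some_clip.wav is a placeholder for any short recording:

from faster_whisper import WhisperModel

# Tiny model on CPU, just to confirm the install works end to end
model = WhisperModel("tiny", device="cpu", compute_type="int8")
segments, info = model.transcribe("some_clip.wav", language="en")
print(info.language, " ".join(segment.text for segment in segments))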

Remove optimizer

import torch

model_dir = "./run/training/GPT_XTTS_v2.0_LJSpeech_FT-bw-June-25-2024_11+37AM-98c0f86c/"
model_path = model_dir + "best_model.pth"

checkpoint = torch.load(model_path, map_location=torch.device("cpu"))

# Drop the optimizer state; it is only needed to resume training.
del checkpoint["optimizer"]

# Drop the DVAE weights, which are not used at inference time.
# https://github.com/coqui-ai/TTS/discussions/3474#discussioncomment-7965683
for key in list(checkpoint["model"].keys()):
    if "dvae" in key:
        del checkpoint["model"][key]

torch.save(checkpoint, model_dir + "model_small.pth")
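A quick, hypothetical sanity check that the stripped file still loads and actually lost the optimizer state and DVAE weights (reusing the same model_dir):

import os
import torch

model_dir = "./run/training/GPT_XTTS_v2.0_LJSpeech_FT-bw-June-25-2024_11+37AM-98c0f86c/"

small = torch.load(model_dir + "model_small.pth", map_location="cpu")
print("optimizer" in small)                         # expect False
print(any("dvae" in k for k in small["model"]))     # expect False

# The slimmed checkpoint should be a fraction of the original size.
print(f'{os.path.getsize(model_dir + "model_small.pth") / 1e6:.0f} MB vs '
      f'{os.path.getsize(model_dir + "best_model.pth") / 1e6:.0f} MB')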

Example inference test

import os
import time
import logging
from datetime import datetime

import torch
import torchaudio

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

logger = logging.getLogger(__name__)

print("Loading model...")
config = XttsConfig()
config.load_json("./xtts_44800/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="./xtts_44800/", use_deepspeed=False)
model.cuda()

speakerpath = "./speakers/"
phrases = ["I like big butts and I cannot lie, You other brothers can't deny. That when a girl walks in with an itty bitty waist, And a round thing in your face, you get sprung. Wanna pull up tough 'cause you notice that butt was stuffed. Deep in the jeans she's wearin', I'm hooked and I can't stop starin'. Oh, baby, I wanna get with ya, And take your picture, My homeboys tried to warn me, But that butt you got makes Me-me so horny.",
           "X T T S is very sensitive to noise and outlying samples in the dataset."]
print(len(phrases))

# Generate every phrase with every reference wav in ./speakers/
for filename in os.listdir(speakerpath):
    if filename.endswith(".wav"):
        for phrase in phrases:
            start_time = time.time()

            # The latents depend only on the reference wav, so they could be hoisted out of the phrase loop.
            print("Computing speaker latents...")
            gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[speakerpath + filename])

            print("Inference...")
            out = model.inference(
                phrase,
                "en",
                gpt_cond_latent,
                speaker_embedding,
                temperature=0.7,  # Add custom parameters here
            )

            now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
            # Compute stats: XTTS outputs 24 kHz audio
            wav = torch.tensor(out["wav"]).unsqueeze(0)
            process_time = time.time() - start_time
            audio_time = wav.shape[-1] / 24000
            logger.warning("Processing time: %.3f", process_time)
            logger.warning("Real-time factor: %.3f", process_time / audio_time)
            torchaudio.save(f"{now}-xtts.wav", wav, 24000)

9 thoughts on “Fine Tuning XTTS V2 with (forked) Coqui”

  1. Please provide the files for this video, including the files you uploaded to Google Drive:

    (Training or Fine Tuning a Hindi Language VITS TTS Voice Model with Coqui TTS on Google Colab)
    https://youtu.be/jE-lKBKfxJw?si=lj7j8WtgbRKGna9n
    Please help me with this. I want to run your code as-is with your files, and I need to do some experiments with it.

    1. I don’t think I have the model checkpoint anymore. This was a proof-of-concept video that I made to help a few people who were asking me if they could train a Hindi model, but it wasn’t a fully finished model/project for release. The Colab notebook is broken now, and the dependencies would need to be fixed. Training on Colab is not recommended anymore; the Colab experience is slow and crashes a lot. The video was to demonstrate that it can be done, should someone want to take up the effort. If you are just starting to learn how this stuff works, you’ll probably want to start out with a model that officially supports the language you’re training. XTTS v2 2.0.3 supports Hindi (https://huggingface.co/coqui/XTTS-v2), and there are some VITS and YourTTS checkpoints that have been trained on Hindi by other people, but they’re not in the official Coqui model list.

      1. Actually, that XTTS-v2 isn’t good at Hindi voice cloning and reading out the text. Can you suggest a way to train XTTS-v2 on Hindi to get a perfect voice clone and have it read out the text?

        1. Ah, that makes sense. I think they just introduced it in the last version, 2.0.3, and it is probably poorly supported. If I was to generate the samples, there wouldn’t be a good way for me to test the quality because I don’t speak the language and can’t pick out the subtle differences.

          I just looked at the xtts v2 tokenizer (vocab.json file) and it looks like the character set is supported. You can probably finetune on Hindi and get it to improve. It might take a larger dataset. One thing to check is to see how the text cleaners in the script work.

          Text cleaners take the input text and modify it before tokenizing. They do things like take ‘0’ and turn it into ‘zero’, or ‘Dr.’ into ‘doctor’. You’ll need to go look at the code of whatever trainer script you’re using and make sure the text cleaner is handling the data properly. Text cleaners for the wrong language will make a mess of the input data.

          If you can find me a high-quality, properly transcribed Hindi dataset, I can try a small training session when I have some free time.

          1. idk, I may have enough to fine tune it. A couple more hours of clear speech would probably help a lot. I’ve converted the Common Voice Hindi dataset, but the recordings aren’t very clear.

  2. Perfect, thanks a lot. Please post the steps and provide the code files so that we can use them, and if you can make a YouTube video on that, it will be even more useful; a lot of researchers are working on this.
