LLaMA-Factory with Flash Attention 2 and Unsloth

It was tough to get this working, but I think I’ve figured it out enough to share. Here’s a quick guide on how to set up LLaMA-Factory with support for Flash Attention 2 and Unsloth training on Windows. This setup uses an RTX 3060 12GB GPU, Windows 10, and CUDA 12.1.

Unsloth is an optimization library that claims up to a 2x performance boost with no trade-off in accuracy. There’s also a quick and dirty script to convert bulk raw text to a dataset file, and a little overview of the dataset setup. In the video, I also touch on how to fix the error when loading the trained adapter in the Text Generation WebUI, caused by mismatched PEFT libraries.
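For the PEFT error specifically, the usual cause is the WebUI environment shipping a different peft version than the one the adapter was trained with. One fix that generally works (assuming the version skew is the only problem) is installing the training version, 0.11.1 in this guide, into the WebUI’s own environment:

pip install peft==0.11.1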

Install Microsoft Visual C++ Build Tools

https://visualstudio.microsoft.com/visual-cpp-build-tools

Select Desktop development with C++, .NET desktop build tools, and Windows application development build tools, and make sure the selected optional packages match the check boxes on the right side.

Create a Conda Environment and Install Packages

conda create -n unsloth_env python=3.10 pytorch-cuda=12.1 pytorch cudatoolkit -c pytorch -c nvidia -c xformers -y
conda activate unsloth_env

Install Clang

conda install -y -c conda-forge/label/llvm_rc clangdev

Set the CC Environment Variable

where clang.exe
set CC=[pathname]
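As a concrete example: with the conda environment active, clangdev normally drops clang.exe into the environment’s Library\bin folder, so the following should work without hardcoding a path. If where clang.exe printed a different location, use that instead.

set CC=%CONDA_PREFIX%\Library\bin\clang.exe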

Clone LLaMA-Factory Repo

https://github.com/hiyouga/LLaMA-Factory

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git

Install LLaMA-Factory Requirements

pip install transformers==4.41.2 datasets==2.19.2 accelerate==0.30.1 peft==0.11.1 trl==0.9.4 bitsandbytes xformers==0.0.27.dev792

Install PyTorch for CUDA 12.1

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
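The flash-attn wheel in the next step is built against a specific torch and CUDA pair (both are encoded in the wheel filename), so it’s worth confirming what actually got installed:

python -c "import torch; print(torch.__version__, torch.version.cuda)"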

Install Flash Attention 2

https://github.com/bdashore3/flash-attention/releases

pip install flash_attn-2.5.9.post1+cu122torch2.3.1cxx11abiFALSE-cp310-cp310-win_amd64.whl

Install LLaMA-Factory and Dependencies

cd LLaMA-Factory
pip install -e ".[gptq,awq,metrics]"
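If the editable install succeeded, the llamafactory-cli entry point should now resolve (the exact version printed depends on the commit you cloned):

llamafactory-cli version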

Install Unsloth

https://github.com/unslothai/unsloth

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Log in to Hugging Face and Set Access Token

pip install --upgrade huggingface_hub
huggingface-cli login

Downgrade NumPy

pip install -U numpy==1.26.4

Install Windows Triton 2.1.0

https://huggingface.co/madbuda/triton-windows-builds/tree/main

pip install triton-2.1.0-cp310-cp310-win_amd64.whl
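With everything installed, a short import check catches most setup problems before you commit to a long run. A minimal sketch to run inside unsloth_env; importing unsloth will fail loudly here if the triton or xformers pieces are broken:

import torch
import flash_attn
import unsloth  # raises at import time if triton/xformers are misconfigured

print(torch.cuda.get_device_name(0))        # should show the RTX 3060
print("flash-attn", flash_attn.__version__)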

Example Training Command with Flash Attention 2 and Unsloth

python ./src/train.py ^
  --stage pt --do_train --seed 1337 ^
  --flash_attn fa2 --upcast_layernorm True ^
  --model_name_or_path unsloth/mistral-7b-instruct-v0.3-bnb-4bit --quantization_bit 4 ^
  --dataset c4_demo --template mistral --dataset_dir ./data ^
  --finetuning_type lora ^
  --lora_target q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj,embed_tokens,lm_head ^
  --output_dir saves/c4_demo/mistral-7b-instruct-v0.3-bnb-4bit/lora/pretrain/4-bit ^
  --cutoff_len 2048 --preprocessing_num_workers 8 ^
  --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 ^
  --lr_scheduler_type cosine --logging_steps 1 --warmup_steps 10 ^
  --save_steps 100 --eval_steps 100 --evaluation_strategy steps --load_best_model_at_end ^
  --learning_rate 0.0001 --ddp_timeout 180000000 --num_train_epochs 3.0 --val_size 0.001 ^
  --plot_loss --fp16 --use_unsloth

Text Files to Alpaca Format JSON Script

import os
import json
from tqdm import tqdm
import nltk
import chardet
from nltk.tokenize import sent_tokenize

nltk.download('punkt', quiet=True)  # fetch the sentence tokenizer model if missing


def read_chunks(filepath, chunk_size=2000, overlap=48):
    """
    Reads a text file, chunks it by sentences, and returns a list of chunks.
    Prepends the first line and filename to each chunk.

    Args:
        filepath: Path to the text file.
        chunk_size: Number of words (approximate) in each chunk.
        overlap: Number of words to overlap between chunks.

    Returns:
        A list of strings, where each string represents a chunk of text
        prepended with the first line and filename.
    """
    # Detect the file's encoding first; default to UTF-8 if detection fails
    with open(filepath, 'rb') as rawdata:
        result = chardet.detect(rawdata.read())
        encoding = result['encoding'] if result['encoding'] else 'utf-8'

    # Read the whole file with the detected encoding, replacing undecodable bytes
    with open(filepath, 'r', encoding=encoding, errors='replace') as f:
        text = f.read()

    first_line = text.splitlines()[0].strip() if text else ''  # prepended to each chunk
    filename = os.path.basename(filepath)  # base filename, also prepended

    chunks = []
    sentences = sent_tokenize(text)
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk.split()) + len(sentence.split()) <= chunk_size:
            current_chunk += " " + sentence
        else:
            chunks.append(f"{filename}: {first_line}\n{current_chunk.strip()}")
            # Carry the last `overlap` words into the next chunk for continuity
            current_chunk = " ".join(current_chunk.split()[-overlap:]) + " " + sentence

    if current_chunk:
        chunks.append(f"{filename}: {first_line}\n{current_chunk.strip()}")
    return chunks


def process_directory(directory, output_file):
    """
    Processes all .txt files in a directory, chunks them, and saves as a single JSON.
    Prepends the first line and filename to each chunk.

    Args:
        directory: Path to the directory containing text files.
        output_file: Path to the output JSON file.
    """
    all_data = []
    num_files = len(
        [f for f in os.listdir(directory) if f.lower().endswith(".txt")]
    )
    with tqdm(total=num_files, desc="Processing Files") as pbar:
        for filename in os.listdir(directory):
            if filename.lower().endswith(".txt"):  # Check for lowercase extension
                filepath = os.path.join(directory, filename)
                chunks = read_chunks(filepath)
                for chunk in chunks:
                    all_data.append({"text": chunk})
            pbar.update(1)  # Update progress bar for each processed file

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(all_data, f, ensure_ascii=False)


# Example usage
directory = "C:\\INPUT\\DIRECTORY\\"
output_file = "OUTPUT_FILE.json"
process_directory(directory, output_file)
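To actually train on the generated file, register it in LLaMA-Factory’s data/dataset_info.json. A minimal sketch, assuming the output was copied into the data folder as my_corpus.json (both names are made up; the columns mapping mirrors the existing c4_demo entry):

import json

with open("data/dataset_info.json", "r", encoding="utf-8") as f:
    info = json.load(f)

info["my_corpus"] = {                  # hypothetical dataset name for --dataset
    "file_name": "my_corpus.json",     # hypothetical file produced by the script above
    "columns": {"prompt": "text"},     # map the "text" field the same way c4_demo does
}

with open("data/dataset_info.json", "w", encoding="utf-8") as f:
    json.dump(info, f, indent=2)

After that, swap --dataset c4_demo for --dataset my_corpus in the training command above.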
