I recently discovered a GitHub project, which I cover in this video, that has become my go-to for transcription work – Standalone Whisper XXL by Purfview. If you’ve tried setting up OpenAI’s Whisper speech-to-text model before, you know the dependencies can get messy, especially with enhanced forks like faster-whisper.
This project solves all those headaches by offering a standalone package called faster-whisper-XXL that includes:
- Support for all Whisper models
- MDX-23 vocal separation (isolating speech from background audio)
- Multiple speaker diarization models (Reverb, Pyannote)
- Voice activity detection
Generation settings I use for subtitles:
```
--beam_size 5
--best_of 5
--temperature 0.0
--compression_ratio_threshold 2.4
--logprob_threshold -1.0
--condition_on_previous_text True
--suppress_tokens -1
--word_timestamps True
--prepend_punctuations ".。,,!!??"
--append_punctuations ".。,,!!??"
```
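For scripting, the settings above can be collected into a reusable argument list. This is only a sketch: the executable name `faster-whisper-xxl` is assumed here, and the helper `build_command` is hypothetical, not part of the project.

```python
# Hypothetical helper: the subtitle generation settings from above,
# ready to append to a faster-whisper-xxl invocation.
SUBTITLE_ARGS = [
    "--beam_size", "5",
    "--best_of", "5",
    "--temperature", "0.0",
    "--compression_ratio_threshold", "2.4",
    "--logprob_threshold", "-1.0",
    "--condition_on_previous_text", "True",
    "--suppress_tokens", "-1",
    "--word_timestamps", "True",
    "--prepend_punctuations", ".。,,!!??",
    "--append_punctuations", ".。,,!!??",
]

def build_command(input_path: str) -> list[str]:
    """Return the full command line for one input file.

    The executable name is an assumption; adjust to your install.
    """
    return ["faster-whisper-xxl", input_path, *SUBTITLE_ARGS]
```

Keeping the settings in one list means every file in a batch is transcribed with identical parameters.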
I’m using it mainly for creating diarized podcast transcripts. Running on my RTX 3060, it processes a 4.5-hour file in under 20 minutes using the Reverb V2 diarization model.
The project seems to be actively developed, with the creator fundraising for a GUI and support for more voice separation models.
The included batch file works, but only reliably with files in the same directory and with no spaces in their filenames. For serious use, I recommend the command line, or writing a custom script that calls the executable directly.
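One way to sidestep the batch file's limitations is a small Python wrapper: passing arguments to `subprocess.run` as a list means each element becomes one argv entry, so spaces in filenames need no quoting at all. A sketch, with assumptions flagged: the executable name, and the `--model`/`--output_format` flags (which follow the Whisper CLI's conventions), are not confirmed from the project itself.

```python
# Hypothetical wrapper around the faster-whisper-xxl executable.
# List-style arguments avoid the quoting problems the batch file
# has with spaces in filenames.
import shutil
import subprocess
from pathlib import Path

EXECUTABLE = "faster-whisper-xxl"  # assumed name; adjust to your install

def transcribe(path: Path, model: str = "large-v2") -> list[str]:
    """Build the command for one audio file and run it if the
    executable is on PATH. Returns the argument list used."""
    # Flag names here follow the Whisper CLI and are assumptions.
    cmd = [EXECUTABLE, str(path), "--model", model, "--output_format", "srt"]
    if shutil.which(EXECUTABLE):  # only invoke when actually installed
        subprocess.run(cmd, check=True)
    return cmd

if __name__ == "__main__":
    # Spaces in filenames are safe: no shell parsing is involved.
    for audio in Path("episodes").glob("*.mp3"):
        transcribe(audio)
```

The same pattern extends naturally to batching a whole podcast archive or feeding the generation settings listed earlier.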