Download, transcribe, and translate audio & video in 12 Indian languages — entirely on your machine. No cloud. No API keys. No compromises.
Just tell Claude what you need in plain English. The pipeline handles downloading, model selection, transcription, and output formatting automatically.
Grab any video from YouTube, Twitter, TikTok, Instagram, or 1000+ other sites. Any format, any quality — handled by yt-dlp.
Fine-tuned Whisper models are auto-selected for your language. Telugu? Hindi? Bengali? The right model loads automatically.
Get English translations with precise timestamps and SRT subtitles. Three output formats: .txt, .srt, .json — ready for any workflow.
Every flag you need, at a glance. All flags are optional — sensible defaults are built in.
| Flag | What it does | Default | Example |
|---|---|---|---|
| `--language` | Set source language (skips detection, loads best model) | Auto-detect | `--language te` |
| `--model` | Whisper model size: tiny, base, small, medium, large | large | `--model base` |
| `--engine` | ASR engine: whisper or qwen | whisper | `--engine qwen` |
| `--diarize` | Enable speaker diarization (who spoke when) | Off | `--diarize` |
| `--num-speakers` | Exact speaker count (improves diarization accuracy) | Auto | `--num-speakers 2` |
| `--hf-token` | HuggingFace token for diarization | `$HF_TOKEN` env | `--hf-token hf_...` |
| `--output-dir` | Where to save output files | `~/Downloads` | `--output-dir ./out` |
| `--hf-model` | Override with any HuggingFace Whisper model | Auto-selected | `--hf-model vasista22/...` |
--diarize is off by default. When enabled, it requires a HuggingFace token. If no token is detected, the pipeline prints a helpful message and continues transcription without speaker labels — it never fails or crashes.
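The graceful-fallback behavior can be sketched as follows. `resolve_diarization` is a hypothetical helper name; the token lookup order (CLI flag first, then `$HF_TOKEN`) follows the flag table above.

```python
import os

def resolve_diarization(diarize_requested, cli_token=None):
    """Decide whether diarization can run; never raises.

    Returns (enabled, token, message) -- message is a warning string
    when diarization was requested but no token was found.
    (Hypothetical helper, illustrating the documented behavior.)"""
    if not diarize_requested:
        return False, None, None
    token = cli_token or os.environ.get("HF_TOKEN")
    if token:
        return True, token, None
    # No token: warn and continue transcription without speaker labels.
    return False, None, (
        "Diarization requested but no HuggingFace token found. "
        "Set $HF_TOKEN or pass --hf-token; continuing without speaker labels."
    )
```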
Speaker diarization identifies every voice in the recording. Qwen3-ASR gives you a second engine for comparison and broader language coverage.
Identify who spoke when using pyannote-audio. Label every segment with speaker identities automatically.
- Enable with the `--diarize` flag; optionally pass `--num-speakers N`
- `[Speaker 1]` labels in `.txt`, `.srt`, and `.json`
- Setup: `pip install pyannote.audio`, then set your HF token
An alternative ASR engine using Alibaba's Qwen3-ASR-1.7B model. Compare results across engines for maximum accuracy.
Enable with the `--engine qwen` flag.

Fine-tuned Whisper models from two world-class research labs at IIT Madras, automatically selected based on your language.
Every detail considered — from chunking algorithms that never lose a word, to hardware-optimized inference on your machine.
No data leaves your machine. No API keys required. No cloud dependency. Your audio stays private, always.
25-second windows with 5-second overlap. Three-tier word-level merge algorithm ensures zero words lost at boundaries.
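A minimal sketch of the windowing, assuming a fixed 25-second window advancing in 20-second steps so each boundary is covered twice (the three-tier word-level merge itself is more involved and not shown):

```python
def chunk_windows(duration, window=25.0, overlap=5.0):
    """Return (start, end) windows covering `duration` seconds, with each
    window overlapping the previous one by `overlap` seconds. Illustrative
    sketch of the chunking described above."""
    step = window - overlap  # 20 s advance per window
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + window, duration)))
        if start + window >= duration:
            break  # this window already reaches the end of the audio
        start += step
    return windows

print(chunk_windows(60.0))  # → [(0.0, 25.0), (20.0, 45.0), (40.0, 60.0)]
```

Every boundary between consecutive windows falls inside a 5-second overlap region, which is what gives the merge step two candidate transcriptions to reconcile.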
Pass --language te and the best fine-tuned model loads automatically. vasista22 first, IndicWhisper fallback.
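The selection order can be sketched as a simple lookup chain. The helper name and the toy model ids below are illustrative assumptions; only the vasista22-first, IndicWhisper-fallback ordering comes from the text.

```python
def select_model(language, primary, fallback, default="openai/whisper-large"):
    """Pick a HuggingFace model id for a language code: try the primary
    (vasista22) table, then the IndicWhisper fallback table, then the
    stock default. Hypothetical sketch; real repo ids may differ."""
    if language in primary:
        return primary[language]
    if language in fallback:
        return fallback[language]
    return default

# Toy tables for illustration -- not the pipeline's actual mappings.
PRIMARY = {"te": "vasista22/whisper-telugu-large"}
FALLBACK = {"or": "indicwhisper/odia-model"}

print(select_model("te", PRIMARY, FALLBACK))
```

A `--hf-model` override (see the flag table) would simply bypass this lookup entirely.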
Every transcription produces .txt plain text, .srt subtitles with timestamps, and .json structured data.
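For example, the `.srt` output hinges on the standard SubRip timestamp format `HH:MM:SS,mmm`. A minimal sketch (the segment schema below is an assumption, not the pipeline's actual `.json` layout):

```python
def srt_timestamp(seconds):
    """Format seconds as a SubRip timestamp, e.g. 2.5 -> '00:00:02,500'."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render [{'start', 'end', 'text'}, ...] as SubRip subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

segments = [{"start": 0.0, "end": 2.5, "text": "Namaste"}]
print(to_srt(segments))
```

The `.txt` output is just the concatenated `text` fields, and `.json` serializes the same segment list with its timestamps.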
Automatically detects Apple Silicon (MPS), NVIDIA GPU (CUDA), or falls back to CPU. Float16 where supported.
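The fallback order can be sketched as a pure function. In practice the two flags would come from `torch.backends.mps.is_available()` and `torch.cuda.is_available()`; it is written torch-free here so the logic is clear on its own.

```python
def pick_device(mps_available, cuda_available):
    """Sketch of the documented fallback order (MPS -> CUDA -> CPU),
    using float16 only on accelerators that support it."""
    if mps_available:
        return "mps", "float16"
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"

print(pick_device(True, False))   # → ('mps', 'float16')
```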
YouTube, Twitter/X, TikTok, Instagram, Vimeo, Reddit, Twitch — powered by yt-dlp's massive site coverage.
pyannote-audio identifies who spoke when. Add --diarize for speaker-labelled transcripts with [Speaker N] tags.
Whisper + Qwen3-ASR side by side. Switch with --engine qwen for comparison and accuracy across 30+ languages.
Auto-generate .srt subtitle files for YouTube videos in any Indian language. Repurpose regional content for wider audiences.
Transcribe university lectures in Tamil, Kannada, or Hindi. Make educational content searchable and accessible.
Download and transcribe interviews from any platform with precise timestamps. Evidence-grade documentation.
Digitize oral traditions in Sanskrit, Odia, and Punjabi. Preserve pravachans, kirtans, and regional storytelling.
Generate parallel audio-text corpora for ML training. Benchmark ASR accuracy across models and languages.
Transcribe interviews with speaker diarization. Each voice is labelled automatically — know who said what, with timestamps.
Turn podcasts and meetings into speaker-attributed notes. Diarization labels each participant for clean, actionable transcripts.
Everything you need to know about how the pipeline behaves out of the box.
Diarization is off by default. Pass `--diarize` to enable it; it requires pyannote.audio plus a free HuggingFace token. If you use `--diarize` without a token, the pipeline warns you and continues transcription without speaker labels; it never crashes.
A HuggingFace token is needed only for diarization (`--diarize`). Everything else (transcription, translation, language detection) works without any token or account. The token is free (read-only access) at huggingface.co/settings/tokens. You also need to accept both model licenses: speaker-diarization-3.1 and segmentation-3.0.
`echo 'export HF_TOKEN="hf_..."' >> ~/.zshrc && source ~/.zshrc`
Whisper models are cached at `~/.cache/whisper/`, HuggingFace models at `~/.cache/huggingface/`, and pyannote models at `~/.cache/torch/pyannote/`. Subsequent runs load instantly from cache.
Use `--model base` (74 MB) for faster processing on clear audio, or `--model medium` (769 MB) as a middle ground. When you pass `--language te` (Telugu) or another Indian language, a fine-tuned model is loaded automatically regardless of `--model`. The large model is downloaded once and cached at `~/.cache/whisper/`.
Pass `--engine qwen` to switch to Qwen3-ASR-1.7B. Qwen supports Hindi plus roughly 30 other languages. For Indian languages Qwen does not support (Telugu, Kannada, etc.), the pipeline automatically falls back to Whisper with the right fine-tuned model.
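That fallback can be sketched as follows; the language set below is an illustrative subset, not Qwen's actual coverage list.

```python
# Illustrative subset only -- the real engine supports ~30 languages.
QWEN_LANGS = {"hi", "en", "zh", "ar"}

def resolve_engine(requested, language):
    """If Qwen is requested but doesn't cover the language, fall back to
    Whisper, which then loads the fine-tuned model for that language.
    Hypothetical helper sketching the documented behavior."""
    if requested == "qwen" and language not in QWEN_LANGS:
        return "whisper"
    return requested

print(resolve_engine("qwen", "te"))  # → whisper
```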
Every transcription produces `.txt` (plain text), `.srt` (subtitles with timestamps), and `.json` (structured data with segments), saved to `~/Downloads` by default. Override the location with `--output-dir ./my-folder`.
From your input to final output — every decision the pipeline makes automatically.
Clone, run the installer, and start transcribing. One command does everything.
This project is possible because of extraordinary open-source work from researchers and engineers around the world.