By HumanCTO

Indic Voice Pipeline

Download, transcribe, and translate audio & video in 12 Indian languages — entirely on your machine. No cloud. No API keys. No compromises.

12 Indian Languages • 100% Local • No API Keys • Speaker Diarization • Whisper + Qwen3-ASR • MIT License

Three steps. One command.

Just tell Claude what you need in plain English. The pipeline handles downloading, model selection, transcription, and output formatting automatically.

01
📥

Download

Grab any video from YouTube, Twitter, TikTok, Instagram, or 1000+ other sites. Any format, any quality — handled by yt-dlp.

02
🔊

Transcribe

Fine-tuned Whisper models are auto-selected for your language. Telugu? Hindi? Bengali? The right model loads automatically.

03
🌐

Translate

Get English translations with precise timestamps and SRT subtitles. Three output formats: .txt, .srt, .json — ready for any workflow.

# You say:
$ "Download this Telugu video and transcribe it"

# Claude handles the entire pipeline:
Downloaded video (1080p, 34s)
Extracted audio (16kHz mono WAV)
Loaded vasista22/whisper-telugu-large-v2 on MPS
Transcribed in 12.3s
Saved: transcript.txt, transcript.srt, transcript.json
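
The "Extracted audio (16kHz mono WAV)" step in the log above could be sketched roughly as follows. This is an illustration, not the project's actual code: the function names and the `_16k.wav` suffix are assumptions, and only standard ffmpeg options are used.

```python
import subprocess
from pathlib import Path


def build_extract_command(source: str, wav_path: str) -> list[str]:
    """Return an ffmpeg argv that converts a media file to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", source,         # input video or audio file
        "-vn",                # drop any video stream
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # 16-bit PCM, the usual WAV encoding
        wav_path,
    ]


def extract_audio(source: str, out_dir: str = ".") -> Path:
    """Run ffmpeg and return the path of the extracted WAV file."""
    # Hypothetical naming scheme: the suffix avoids overwriting a .wav input.
    wav_path = Path(out_dir) / (Path(source).stem + "_16k.wav")
    subprocess.run(build_extract_command(source, str(wav_path)), check=True)
    return wav_path
```

Keeping the command builder separate from the `subprocess` call makes the arguments easy to inspect without ffmpeg installed.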

Cheat sheet.

Every flag you need, at a glance. All flags are optional — sensible defaults are built in.

Flags reference

Flag | What it does | Default | Example
--language | Set source language (skips detection, loads best model) | Auto-detect | --language te
--model | Whisper model size: tiny, base, small, medium, large | large | --model base
--engine | ASR engine: whisper or qwen | whisper | --engine qwen
--diarize | Enable speaker diarization (who spoke when) | Off | --diarize
--num-speakers | Exact speaker count (improves diarization accuracy) | Auto | --num-speakers 2
--hf-token | HuggingFace token for diarization | $HF_TOKEN env | --hf-token hf_...
--output-dir | Where to save output files | ~/Downloads | --output-dir ./out
--hf-model | Override with any HuggingFace Whisper model | Auto-selected | --hf-model vasista22/...

Diarization note: --diarize is off by default. When enabled, it requires a HuggingFace token. If no token is detected, the pipeline prints a helpful message and continues transcription without speaker labels — it never fails or crashes.
Basic transcription
$ transcribe video.mp4 --language te
With speaker labels
$ transcribe interview.mp4 --diarize --num-speakers 2
Max accuracy
$ transcribe speech.mp4 --model large --language hi --diarize
Qwen engine + diarize
$ transcribe podcast.mp3 --engine qwen --diarize
Translate to English
$ translate speech.mp4 --language te
Custom HF model
$ transcribe audio.wav --hf-model vasista22/whisper-telugu-large-v2
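
For readers who want to see how the flags and defaults above fit together, here is a minimal `argparse` sketch. It mirrors the flag names and defaults from the cheat sheet, but the parser itself (including the positional `input` argument and the `build_parser` name) is an assumption, not the project's real entry point.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the cheat-sheet flags with the defaults listed above."""
    parser = argparse.ArgumentParser(prog="transcribe")
    parser.add_argument("input", help="URL or local media file (assumed positional)")
    parser.add_argument("--language", default=None,
                        help="source language code, e.g. te; auto-detect if omitted")
    parser.add_argument("--model", default="large",
                        choices=["tiny", "base", "small", "medium", "large"])
    parser.add_argument("--engine", default="whisper", choices=["whisper", "qwen"])
    parser.add_argument("--diarize", action="store_true",
                        help="enable speaker diarization (off by default)")
    parser.add_argument("--num-speakers", type=int, default=None)
    parser.add_argument("--hf-token", default=os.environ.get("HF_TOKEN"))
    parser.add_argument("--output-dir", default=os.path.expanduser("~/Downloads"))
    parser.add_argument("--hf-model", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```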

Know who said what.
Choose your engine.

Speaker diarization identifies every voice in the recording. Qwen3-ASR gives you a second engine for comparison and broader language coverage.

🎙️ New Feature

Speaker Diarization

Identify who spoke when using pyannote-audio. Label every segment with speaker identities automatically.

  • Enable with --diarize flag
  • Set speaker count with --num-speakers N
  • Output includes [Speaker 1] labels in .txt, .srt, and .json
  • Requires free HuggingFace token (one-time setup)
Powered by pyannote/speaker-diarization-3.1
One-time pip install pyannote.audio — then set your HF token
🚀 New Engine

Qwen3-ASR Engine

An alternative ASR engine using Alibaba's Qwen3-ASR-1.7B model. Compare results across engines for maximum accuracy.

  • Use with --engine qwen flag
  • Supports 30+ languages natively, including Hindi
  • Auto-fallback to Whisper for unsupported languages
  • Run both engines to compare transcription quality
Powered by Qwen/Qwen3-ASR-1.7B (Alibaba Cloud / Qwen Team)
Default engine remains Whisper — switch with a single flag
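
The engine fallback described above might look something like this sketch. The `QWEN_LANGUAGES` set is a placeholder that contains only Hindi, the one Qwen-supported language this page names; the real list lives with the Qwen3-ASR model and is not reproduced here. `choose_engine` is a hypothetical name.

```python
from typing import Optional

# Placeholder: only "hi" is confirmed by this page. Extend from the model's docs.
QWEN_LANGUAGES = {"hi"}


def choose_engine(requested: str, language: Optional[str]) -> str:
    """Return the engine to run, falling back to Whisper when Qwen cannot help."""
    if requested != "qwen":
        return "whisper"  # Whisper is the default engine
    if language is not None and language not in QWEN_LANGUAGES:
        return "whisper"  # unsupported language: fall back
    return "qwen"
```

With this placeholder set, `choose_engine("qwen", "te")` returns `"whisper"`, which matches the Telugu fallback mentioned in the FAQ.
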
# Transcribe with speaker labels
$ "Transcribe this podcast --language hi --diarize --num-speakers 2"

Transcribed in 18.7s
Speaker diarization complete (2 speakers detected)
Saved: transcript.txt, transcript.srt, transcript.json

# Output in transcript.txt:
[Speaker 1] 0:00 - 0:14   नमस्ते, आज हम बात करेंगे...
[Speaker 2] 0:14 - 0:28   बिल्कुल, मैं तैयार हूँ...

# Try the Qwen3-ASR engine instead
$ "Transcribe this Hindi lecture --engine qwen"
Using Qwen3-ASR-1.7B engine
Transcribed in 9.4s

12 Indian languages.
State-of-the-art models.

Fine-tuned Whisper models from two world-class research labs at IIT Madras — automatically selected based on your language.

🤗 vasista22: HuggingFace (auto-downloaded)

Language | Code | Model | Base
Telugu | te | vasista22/whisper-telugu-large-v2 | Whisper Large-v2
Hindi | hi | vasista22/whisper-hindi-large-v2 | Whisper Large-v2
Kannada | kn | vasista22/whisper-kannada-medium | Whisper Medium
Gujarati | gu | vasista22/whisper-gujarati-medium | Whisper Medium
Tamil | ta | vasista22/whisper-tamil-medium | Whisper Medium

🏛️ AI4Bharat IndicWhisper: ZIP cached locally (~600 MB each)

Language | Code | Model | Base
Bengali | bn | AI4Bharat IndicWhisper | Whisper Medium
Malayalam | ml | AI4Bharat IndicWhisper | Whisper Medium
Marathi | mr | AI4Bharat IndicWhisper | Whisper Medium
Odia | or | AI4Bharat IndicWhisper | Whisper Medium
Punjabi | pa | AI4Bharat IndicWhisper | Whisper Medium
Sanskrit | sa | AI4Bharat IndicWhisper | Whisper Medium
Urdu | ur | AI4Bharat IndicWhisper | Whisper Medium

Engineered for reliability.

Every detail considered — from chunking algorithms that never lose a word, to hardware-optimized inference on your machine.

🔒

100% Local Processing

No data leaves your machine. No API keys required. No cloud dependency. Your audio stays private, always.

🧩

Smart Chunking

25-second windows with 5-second overlap. Three-tier word-level merge algorithm ensures zero words lost at boundaries.
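
To make the window arithmetic concrete, here is a small sketch that computes the chunk boundaries for a given audio duration. The 25-second window and 5-second overlap come from the description above; the function name and the handling of the final short chunk are assumptions.

```python
def chunk_windows(duration: float, window: float = 25.0,
                  overlap: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping inference windows."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window")
    if duration <= 0:
        return []
    step = window - overlap  # each window starts 20 s after the previous one
    windows = []
    start = 0.0
    while True:
        end = min(start + window, duration)
        windows.append((start, end))
        if end >= duration:  # the last window reaches the end of the audio
            break
        start += step
    return windows
```

For the 34-second clip in the earlier demo, this yields two windows, 0 to 25 s and 20 to 34 s, so the 20 to 25 s span is transcribed twice and has to be deduplicated when merging.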

🤖

Auto Model Selection

Pass --language te and the best fine-tuned model loads automatically. vasista22 first, IndicWhisper fallback.
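
The routing rule can be pictured as a lookup table. The model IDs and language codes below are the ones listed on this page; the `route_model` function, its return shape, and the choice to let `--hf-model` take priority are assumptions made for illustration.

```python
from typing import Optional

VASISTA22_MODELS = {
    "te": "vasista22/whisper-telugu-large-v2",
    "hi": "vasista22/whisper-hindi-large-v2",
    "kn": "vasista22/whisper-kannada-medium",
    "gu": "vasista22/whisper-gujarati-medium",
    "ta": "vasista22/whisper-tamil-medium",
}
INDICWHISPER_LANGUAGES = {"bn", "ml", "mr", "or", "pa", "sa", "ur"}


def route_model(language: Optional[str], hf_model: Optional[str] = None,
                model_size: str = "large") -> tuple[str, str]:
    """Return (source, identifier) for the model that should be loaded."""
    if hf_model:  # an explicit override wins (assumed priority)
        return ("huggingface", hf_model)
    if language in VASISTA22_MODELS:  # vasista22 first
        return ("huggingface", VASISTA22_MODELS[language])
    if language in INDICWHISPER_LANGUAGES:  # IndicWhisper fallback
        return ("indicwhisper", language)
    return ("whisper", model_size)  # standard Whisper for everything else
```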

📄

Multi-Format Output

Every transcription produces .txt plain text, .srt subtitles with timestamps, and .json structured data.
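
As an illustration of the .srt output, the sketch below turns a list of timed segments into SubRip text. The segment shape (`start`, `end`, `text` keys, times in seconds) is an assumption about the pipeline's internal data, not something this page specifies.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SubRip files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

The same segment list can be dumped with `json.dumps` for the .json file, or joined line by line for the .txt file.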

Hardware Optimized

Automatically detects Apple Silicon (MPS), NVIDIA GPU (CUDA), or falls back to CPU. Float16 where supported.
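
A device check along these lines is common in PyTorch projects. The sketch below is an assumption about how the detection could work: the CUDA-before-MPS order and the exact dtype choices are illustrative, and `detect_device` needs PyTorch installed while `pick_device` does not.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> tuple[str, str]:
    """Return (device, dtype name) given which accelerators are present."""
    if cuda_available:
        return ("cuda", "float16")  # NVIDIA GPU
    if mps_available:
        return ("mps", "float16")   # Apple Silicon
    return ("cpu", "float32")       # CPU fallback, full precision


def detect_device() -> tuple[str, str]:
    """Query PyTorch for available accelerators (imported lazily)."""
    import torch

    mps = getattr(torch.backends, "mps", None)  # absent on older builds
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)
```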

🌍

1000+ Download Sources

YouTube, Twitter/X, TikTok, Instagram, Vimeo, Reddit, Twitch — powered by yt-dlp's massive site coverage.
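
Before downloading, the pipeline has to decide whether the input is a URL or a local file. One plausible way to make that decision and hand URLs to the yt-dlp command line is sketched here; the helper names and the output template are assumptions.

```python
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Treat only http(s) inputs with a host as downloadable URLs."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_download_command(url: str, out_dir: str) -> list[str]:
    """Return a yt-dlp argv that saves the video under out_dir."""
    # -o sets yt-dlp's output template; %(title)s and %(ext)s are its fields.
    return ["yt-dlp", "-o", f"{out_dir}/%(title)s.%(ext)s", url]
```

Anything that is not an http(s) URL, such as `~/Downloads/speech.mp4`, is treated as a local file and goes straight to audio extraction.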

🎙️

Speaker Diarization

pyannote-audio identifies who spoke when. Add --diarize for speaker-labelled transcripts with [Speaker N] tags.
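
Diarization yields speaker turns, while transcription yields text segments, so the two have to be joined. A simple way to do that, shown here as an assumption rather than the project's actual method, is to give each segment the speaker whose turn overlaps it the most. The dictionary shapes are hypothetical.

```python
def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Label each transcript segment with the speaker of greatest time overlap."""
    labelled = []
    for seg in segments:
        best_label, best_overlap = None, 0.0
        for turn in turns:
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_label, best_overlap = turn["speaker"], overlap
        labelled.append({**seg, "speaker": best_label})
    return labelled


def format_line(seg: dict) -> str:
    """Render one segment in the '[Speaker N] m:ss - m:ss   text' style."""
    def clock(t: float) -> str:
        minutes, seconds = divmod(int(t), 60)
        return f"{minutes}:{seconds:02d}"

    prefix = f"[{seg['speaker']}] " if seg.get("speaker") else ""
    return f"{prefix}{clock(seg['start'])} - {clock(seg['end'])}   {seg['text']}"
```

Segments that overlap no turn keep `speaker=None` and are printed without a label.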

🔄

Multi-Engine ASR

Whisper + Qwen3-ASR side by side. Switch with --engine qwen for comparison and accuracy across 30+ languages.

Built for real workflows.

🎬

Content Creators

Auto-generate .srt subtitle files for YouTube videos in any Indian language. Repurpose regional content for wider audiences.

🎓

Education

Transcribe university lectures in Tamil, Kannada, or Hindi. Make educational content searchable and accessible.

📰

Journalism

Download and transcribe interviews from any platform with precise timestamps. Evidence-grade documentation.

🏛️

Cultural Preservation

Digitize oral traditions in Sanskrit, Odia, and Punjabi. Preserve pravachans, kirtans, and regional storytelling.

💻

Developers & Researchers

Generate parallel audio-text corpora for ML training. Benchmark ASR accuracy across models and languages.

🎙️

Interview Transcription

Transcribe interviews with speaker diarization. Each voice is labelled automatically — know who said what, with timestamps.

🎧

Podcast & Meeting Notes

Turn podcasts and meetings into speaker-attributed notes. Diarization labels each participant for clean, actionable transcripts.

Defaults & common questions.

Everything you need to know about how the pipeline behaves out of the box.

Is diarization (speaker labels) on by default?
No. Diarization is off by default. Add --diarize to enable it. It requires pyannote.audio + a free HuggingFace token. If you use --diarize without a token, the pipeline warns you and continues transcription without speaker labels — it never crashes.
Do I need a HuggingFace token?
Only for speaker diarization (--diarize). Everything else — transcription, translation, language detection — works without any token or account. The token is free (read-only access) at huggingface.co/settings/tokens. You also need to accept both model licenses: speaker-diarization-3.1 and segmentation-3.0.
Is the HF token needed every time, or just for the first download?
Every time. pyannote authenticates on every load, not just the initial model download. Set it permanently: echo 'export HF_TOKEN="hf_..."' >> ~/.zshrc && source ~/.zshrc
Do Whisper models re-download every time?
No. Models are downloaded once and cached. Standard Whisper models are cached at ~/.cache/whisper/, HuggingFace models at ~/.cache/huggingface/, and pyannote models at ~/.cache/torch/pyannote/. Subsequent runs load from the cache with no re-download.
What's the default Whisper model size?
large (about 1.55 billion parameters), for maximum accuracy out of the box. Use --model base (74 million parameters) for faster processing on clear audio, or --model medium (769 million parameters) as a middle ground. When you pass --language te (Telugu) or another supported Indian language, a fine-tuned model is loaded automatically regardless of --model. The large model is downloaded once and cached at ~/.cache/whisper/.
Which ASR engine is used by default?
Whisper is the default engine. Use --engine qwen to switch to Qwen3-ASR-1.7B. Qwen supports Hindi + ~30 languages. For Indian languages not supported by Qwen (Telugu, Kannada, etc.), it automatically falls back to Whisper with the right fine-tuned model.
Where do output files go?
~/Downloads by default. Every transcription produces three files: .txt (plain text), .srt (subtitles with timestamps), and .json (structured data with segments). Override with --output-dir ./my-folder.
Does this work offline?
Yes — after the initial model download. Transcription, translation, and diarization all run 100% locally. You only need internet to download videos (yt-dlp) and to download models the first time.
What happens if diarization fails?
The pipeline degrades gracefully. If pyannote isn't installed, or the token is missing, or diarization hits an error — you get a warning message and transcription continues normally without speaker labels. You never lose your transcript.
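
The graceful-degradation behaviour described in these answers could be structured roughly like this. It is a sketch under stated assumptions: `run_diarization` stands in for whatever function actually calls pyannote, and the warning wording is invented.

```python
import os
import sys
from typing import Callable, Mapping, Optional


def resolve_hf_token(cli_token: Optional[str] = None,
                     environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Prefer --hf-token, then the HF_TOKEN environment variable."""
    env = os.environ if environ is None else environ
    return cli_token or env.get("HF_TOKEN") or None


def maybe_diarize(segments: list, diarize: bool, token: Optional[str],
                  run_diarization: Callable[[list, str], list]) -> list:
    """Add speaker labels when possible; otherwise return segments unchanged."""
    if not diarize:
        return segments  # diarization is off by default
    if not token:
        print("warning: no HuggingFace token, skipping speaker labels",
              file=sys.stderr)
        return segments
    try:
        return run_diarization(segments, token)  # hypothetical pyannote wrapper
    except Exception as exc:  # any failure must not cost the transcript
        print(f"warning: diarization failed ({exc}), continuing without labels",
              file=sys.stderr)
        return segments
```

The key design point is that every failure path returns the original segments, so the transcript is always written.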

How the pipeline thinks.

From your input to final output — every decision the pipeline makes automatically.

💬 Your command: "Transcribe video.mp4 --language te --diarize"

  1. URL or local file?
     • URL: yt-dlp download (1000+ sites supported)
     • File: use the local file directly (mp4, mp3, wav, mkv, webm...)
  2. ffmpeg audio extraction: 16kHz mono WAV for inference
  3. --language flag provided?
     • No: auto-detect the language (first 30s of audio analyzed)
     • Yes: skip detection and use the provided language (faster, loads the best model directly)
  4. Which engine? (default: whisper)
     • Whisper: the Model Router picks vasista22 (te/hi/kn/gu/ta), AI4Bharat (bn/ml/mr/or/pa/sa/ur), or standard Whisper (others)
     • Qwen: Qwen3-ASR-1.7B, 30+ languages supported; falls back to Whisper if the language is unsupported
  5. Chunked inference (25s windows, 5s overlap), then smart merge with word-level dedup at boundaries
  6. --diarize flag set?
     • Yes: pyannote diarization (needs HF token + both licenses); with no token it warns and skips gracefully
     • No: skip diarization, output without speaker labels
  7. Output: .txt, .srt, .json, with or without [Speaker N] labels depending on diarization
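
The word-level dedup at chunk boundaries can be illustrated with a much simpler stand-in than the three-tier merge the project describes. The sketch below only looks for the longest exact run of words shared by the end of one chunk and the start of the next; the real algorithm is not documented on this page, so treat this purely as an illustration of the idea.

```python
def merge_words(left: list[str], right: list[str], max_overlap: int = 20) -> list[str]:
    """Join two word lists, dropping words duplicated across the boundary."""
    limit = min(len(left), len(right), max_overlap)
    for size in range(limit, 0, -1):  # try the longest overlap first
        if left[-size:] == right[:size]:
            return left + right[size:]
    return left + right  # no shared words: plain concatenation


def merge_chunks(chunks: list[str]) -> str:
    """Merge per-chunk transcripts, in order, into one transcript."""
    merged: list[str] = []
    for text in chunks:
        words = text.split()
        merged = merge_words(merged, words) if merged else words
    return " ".join(merged)
```

Exact matching breaks down when the two chunks transcribe the overlap slightly differently, which is presumably why the pipeline uses several tiers rather than a single pass like this.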

Up and running in 60 seconds.

Clone, run the installer, and start transcribing. One command does everything.

Terminal
# Clone and install
$ git clone https://github.com/humancto/indic-voice-pipeline.git
$ cd indic-voice-pipeline && bash install.sh

# That's it. Now use it with Claude Code:
$ claude
> "Transcribe ~/Downloads/speech.mp4 --language te"

# Optional: enable speaker diarization
$ pip install pyannote.audio

# Get a free token at huggingface.co/settings/tokens
# Then accept BOTH model licenses (required):
# 1. huggingface.co/pyannote/speaker-diarization-3.1
# 2. huggingface.co/pyannote/segmentation-3.0

# Set token permanently (needed every run, not just first time)
$ echo 'export HF_TOKEN="hf_..."' >> ~/.zshrc && source ~/.zshrc

# Now use it:
> "Transcribe podcast.mp4 --language hi --diarize --num-speakers 2"

Requirements: 🐍 Python 3.10+ • 🎬 ffmpeg • Claude Code

The pipeline, visualized.

🔗 URL Input → 📥 yt-dlp Download → 🎵 ffmpeg Audio Extract → ⚙️ Engine Select (Whisper / Qwen) → 🤗 Whisper (vasista22 / AI4Bharat) or 🚀 Qwen3-ASR (1.7B • 30+ langs) → 🧩 Chunked Inference (25s + 5s overlap) → 🔀 Smart Merge → 🎙️ Diarize (optional • pyannote) → 📂 .txt  .srt  .json

Standing on giants.

This project is possible because of extraordinary open-source work from researchers and engineers around the world.

  • OpenAI Whisper: Foundation speech model
  • vasista22 / IIT Madras: Telugu, Hindi, Kannada, Gujarati, Tamil models • Bhashini / MeitY
  • AI4Bharat / IIT Madras: IndicWhisper • Vistaar dataset • 10,700+ hours
  • yt-dlp: Video download engine • 1000+ sites
  • pyannote-audio: Speaker diarization • Hervé Bredin
  • Qwen3-ASR: Alternative ASR engine • Alibaba Cloud / Qwen Team
  • HuggingFace: Transformers • Model infrastructure
  • Claude Code: AI assistant • Anthropic