Indic Voice Pipeline

How it works

Three steps. One command.

Just tell Claude what you need in plain English. The pipeline handles downloading, model selection, transcription, and output formatting automatically.

📥

Download

Grab any video from YouTube, Twitter, TikTok, Instagram, or 1000+ other sites. Any format, any quality — handled by yt-dlp.

🔊

Transcribe

Fine-tuned Whisper models are auto-selected for your language. Telugu? Hindi? Bengali? The right model loads automatically.

🌐

Translate

Get English translations with precise timestamps and SRT subtitles. Three output formats: .txt, .srt, .json — ready for any workflow.

# You say:

$ "Download this Telugu video and transcribe it"

# Claude handles the entire pipeline:

✔ Downloaded video (1080p, 34s)

✔ Extracted audio (16kHz mono WAV)

✔ Loaded vasista22/whisper-telugu-large-v2 on MPS

✔ Transcribed in 12.3s

✔ Saved: transcript.txt, transcript.srt, transcript.json
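
The "Extracted audio (16kHz mono WAV)" step in the log above can be approximated with a plain ffmpeg call. This is a minimal sketch, not the pipeline's own code: the helper names `build_ffmpeg_cmd` and `extract_audio` are hypothetical, and only standard ffmpeg options are used.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts a media file to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", src,       # input video or audio file
        "-vn",           # drop the video stream
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        dst,
    ]


def extract_audio(src: str, out_dir: str = ".") -> Path:
    """Run ffmpeg (must be on PATH) and return the path of the extracted WAV."""
    dst = Path(out_dir) / (Path(src).stem + ".wav")
    subprocess.run(build_ffmpeg_cmd(src, str(dst)), check=True, capture_output=True)
    return dst
```

Building the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.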

Quick Reference

Cheat sheet.

Every flag you need, at a glance. All flags are optional — sensible defaults are built in.

flags reference

Flag	What it does	Default	Example
`--language`	Set source language (skips detection, loads best model)	Auto-detect	`--language te`
`--model`	Whisper model size: tiny, base, small, medium, large	large	`--model base`
`--engine`	ASR engine: whisper or qwen	whisper	`--engine qwen`
`--diarize`	Enable speaker diarization (who spoke when)	Off	`--diarize`
`--num-speakers`	Exact speaker count (improves diarization accuracy)	Auto	`--num-speakers 2`
`--hf-token`	HuggingFace token for diarization	$HF_TOKEN env	`--hf-token hf_...`
`--output-dir`	Where to save output files	~/Downloads	`--output-dir ./out`
`--hf-model`	Override with any HuggingFace Whisper model	Auto-selected	`--hf-model vasista22/...`

Diarization note: --diarize is off by default. When enabled, it requires a HuggingFace token. If no token is detected, the pipeline prints a helpful message and continues transcription without speaker labels — it never fails or crashes.
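
For readers who want to see the flags and defaults from the table in one place, here is a sketch of how such a command line could be declared with Python's argparse. The flag names and defaults come from the table; the parser itself, the positional `input` argument, and the function name `build_parser` are assumptions for illustration, and the real CLI may be wired differently.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with their documented defaults."""
    p = argparse.ArgumentParser(prog="transcribe")
    p.add_argument("input", help="URL or local media file (assumed positional)")
    p.add_argument("--language", default=None,
                   help="source language code, e.g. te; omit to auto-detect")
    p.add_argument("--model", default="large",
                   choices=["tiny", "base", "small", "medium", "large"])
    p.add_argument("--engine", default="whisper", choices=["whisper", "qwen"])
    p.add_argument("--diarize", action="store_true",
                   help="enable speaker diarization (off by default)")
    p.add_argument("--num-speakers", type=int, default=None)
    p.add_argument("--hf-token", default=os.environ.get("HF_TOKEN"))
    p.add_argument("--output-dir", default=os.path.expanduser("~/Downloads"))
    p.add_argument("--hf-model", default=None,
                   help="override with any HuggingFace Whisper model id")
    return p


args = build_parser().parse_args(
    ["video.mp4", "--language", "te", "--diarize", "--num-speakers", "2"]
)
print(args.language, args.model, args.engine, args.diarize, args.num_speakers)
# prints: te large whisper True 2
```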

Basic transcription

$ transcribe video.mp4 --language te

With speaker labels

$ transcribe interview.mp4 --diarize --num-speakers 2

Max accuracy

$ transcribe speech.mp4 --model large --language hi --diarize

Qwen engine + diarize

$ transcribe podcast.mp3 --engine qwen --diarize

Translate to English

$ translate speech.mp4 --language te

Custom HF model

$ transcribe audio.wav --hf-model vasista22/whisper-telugu-large-v2

New Capabilities

Know who said what.
Choose your engine.

Speaker diarization identifies every voice in the recording. Qwen3-ASR gives you a second engine for comparison and broader language coverage.

🎙️ New Feature

Speaker Diarization

Identify who spoke when using pyannote-audio. Label every segment with speaker identities automatically.

Enable with --diarize flag
Set speaker count with --num-speakers N
Output includes [Speaker 1] labels in .txt, .srt, and .json
Requires free HuggingFace token (one-time setup)

Powered by pyannote/speaker-diarization-3.1

One-time pip install pyannote.audio, then set your HF token

🚀 New Engine

Qwen3-ASR Engine

An alternative ASR engine using Alibaba's Qwen3-ASR-1.7B model. Compare results across engines for maximum accuracy.

Use with --engine qwen flag
Supports Hindi and about 30 other languages natively
Auto-fallback to Whisper for unsupported languages
Run both engines to compare transcription quality

Powered by Qwen/Qwen3-ASR-1.7B (Alibaba Cloud / Qwen Team)

Default engine remains Whisper; switch with a single flag

# Transcribe with speaker labels

$ "Transcribe this podcast --language hi --diarize --num-speakers 2"

✔ Transcribed in 18.7s

✔ Speaker diarization complete (2 speakers detected)

✔ Saved: transcript.txt, transcript.srt, transcript.json

# Output in transcript.txt:

[Speaker 1] 0:00 - 0:14 नमस्ते, आज हम बात करेंगे...

[Speaker 2] 0:14 - 0:28 बिल्कुल, मैं तैयार हूँ...

# Try the Qwen3-ASR engine instead

$ "Transcribe this Hindi lecture --engine qwen"

✔ Using Qwen3-ASR-1.7B engine

✔ Transcribed in 9.4s
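
The [Speaker N] lines in the transcript.txt sample above imply that each transcribed segment is matched to a diarization turn. One common way to do that is to pick the turn with the largest time overlap. The sketch below assumes that approach; it is not confirmed as the pipeline's own method, and the `Segment` and `Turn` types and the turn timings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str


@dataclass
class Turn:
    start: float  # seconds
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Pair each segment with the speaker whose turn overlaps it most (or None)."""
    labelled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg.start, seg.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(seg.start, seg.end, best.start, best.end) == 0.0:
            labelled.append((None, seg))
        else:
            labelled.append((best.speaker, seg))
    return labelled


def fmt_clock(seconds: float) -> str:
    """Format seconds as M:SS, as in the sample transcript."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


segments = [
    Segment(0.0, 14.0, "नमस्ते, आज हम बात करेंगे..."),
    Segment(14.0, 28.0, "बिल्कुल, मैं तैयार हूँ..."),
]
turns = [Turn(0.0, 13.6, "Speaker 1"), Turn(13.6, 28.0, "Speaker 2")]  # illustrative

for speaker, seg in label_segments(segments, turns):
    tag = f"[{speaker}] " if speaker else ""
    print(f"{tag}{fmt_clock(seg.start)} - {fmt_clock(seg.end)} {seg.text}")
```

Segments that overlap no turn at all are left unlabelled rather than guessed.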

Supported Languages

12 Indian languages.
State-of-the-art models.

Fine-tuned Whisper models from two world-class research labs at IIT Madras — automatically selected based on your language.

🤗 vasista22 — HuggingFace (auto-downloaded)

Language	Model	Base
Telugu	vasista22/whisper-telugu-large-v2	Whisper Large-v2
Hindi	vasista22/whisper-hindi-large-v2	Whisper Large-v2
Kannada	vasista22/whisper-kannada-medium	Whisper Medium
Gujarati	vasista22/whisper-gujarati-medium	Whisper Medium
Tamil	vasista22/whisper-tamil-medium	Whisper Medium

🏛️ AI4Bharat IndicWhisper — ZIP cached locally (~600 MB each)

Language	Model	Base
Bengali	AI4Bharat IndicWhisper	Whisper Medium
Malayalam	AI4Bharat IndicWhisper	Whisper Medium
Marathi	AI4Bharat IndicWhisper	Whisper Medium
Odia	AI4Bharat IndicWhisper	Whisper Medium
Punjabi	AI4Bharat IndicWhisper	Whisper Medium
Sanskrit	AI4Bharat IndicWhisper	Whisper Medium
Urdu	AI4Bharat IndicWhisper	Whisper Medium

Features

Engineered for reliability.

Every detail considered — from chunking algorithms that never lose a word, to hardware-optimized inference on your machine.

🔒

100% Local Processing

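No audio leaves your machine. No API keys are required for transcription; only the optional speaker diarization needs a free HuggingFace token. No cloud dependency once the models are downloaded. Your audio stays private, always.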

🧩

Smart Chunking

25-second windows with 5-second overlap. Three-tier word-level merge algorithm ensures zero words lost at boundaries.
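
To make the window arithmetic concrete, here is a small sketch of 25-second windows with 5-second overlap, plus a boundary merge. The merge shown is a deliberately simple single-pass exact match on repeated words; the page does not describe the three-tier algorithm, so this is a stand-in and not the pipeline's implementation. Function names are hypothetical.

```python
def chunk_bounds(duration: float, window: float = 25.0, overlap: float = 5.0):
    """Return (start, end) times in seconds that cover [0, duration]."""
    step = window - overlap  # each new window starts 20 s after the previous one
    bounds = []
    start = 0.0
    while True:
        end = min(start + window, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start += step
    return bounds


def merge_words(left: list[str], right: list[str], max_overlap: int = 30) -> list[str]:
    """Join two word lists, dropping the longest run that ends `left` and begins `right`."""
    limit = min(len(left), len(right), max_overlap)
    for n in range(limit, 0, -1):
        if left[-n:] == right[:n]:
            return left + right[n:]
    return left + right  # no shared words at the boundary


print(chunk_bounds(60.0))
# prints: [(0.0, 25.0), (20.0, 45.0), (40.0, 60.0)]
print(" ".join(merge_words("the quick brown fox".split(), "brown fox jumps over".split())))
# prints: the quick brown fox jumps over
```

Exact matching fails when the two chunks transcribe the overlap slightly differently, which is presumably why a real merge needs more than one tier.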

🤖

Auto Model Selection

Pass --language te and the best fine-tuned model loads automatically. vasista22 first, IndicWhisper fallback.
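
The routing rule can be pictured as a lookup table. The sketch below uses the model ids and language groupings listed on this page; the function name `route_model`, the return shape, and the precedence given to `--hf-model` are assumptions for illustration.

```python
# Language code -> fine-tuned HuggingFace model id (ids as listed on this page).
VASISTA22_MODELS = {
    "te": "vasista22/whisper-telugu-large-v2",
    "hi": "vasista22/whisper-hindi-large-v2",
    "kn": "vasista22/whisper-kannada-medium",
    "gu": "vasista22/whisper-gujarati-medium",
    "ta": "vasista22/whisper-tamil-medium",
}

# Languages served by AI4Bharat IndicWhisper checkpoints.
INDICWHISPER_LANGS = {"bn", "ml", "mr", "or", "pa", "sa", "ur"}


def route_model(language, model_size="large", hf_model=None):
    """Return (source, identifier) for the model that should be loaded."""
    if hf_model:  # explicit --hf-model override wins (assumed precedence)
        return ("huggingface", hf_model)
    if language in VASISTA22_MODELS:
        return ("huggingface", VASISTA22_MODELS[language])
    if language in INDICWHISPER_LANGS:
        return ("indicwhisper", language)
    return ("whisper", model_size)  # standard Whisper for everything else


print(route_model("te"))
# prints: ('huggingface', 'vasista22/whisper-telugu-large-v2')
```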

📄

Multi-Format Output

Every transcription produces .txt plain text, .srt subtitles with timestamps, and .json structured data.
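
As a rough illustration of the three formats, the sketch below renders the same list of segments as plain text, SRT, and JSON. The SRT layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the standard one; the segment dictionary keys and the JSON shape are assumptions, since the page does not show the actual schema.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_txt(segments) -> str:
    """One line of text per segment."""
    return "\n".join(seg["text"] for seg in segments) + "\n"


def to_srt(segments) -> str:
    """Numbered subtitle blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(blocks)


def to_json(segments) -> str:
    """Structured output; ensure_ascii=False keeps Indic scripts readable."""
    return json.dumps({"segments": segments}, ensure_ascii=False, indent=2)


segments = [
    {"start": 0.0, "end": 14.0, "text": "नमस्ते, आज हम बात करेंगे..."},
    {"start": 14.0, "end": 28.0, "text": "बिल्कुल, मैं तैयार हूँ..."},
]
print(to_srt(segments))
```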

⚡

Hardware Optimized

Automatically detects Apple Silicon (MPS), NVIDIA GPU (CUDA), or falls back to CPU. Float16 where supported.
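
Device detection of this kind is usually a few lines of PyTorch. The sketch below assumes PyTorch is the backend and that float16 is used on both GPU back ends; the page only says "Float16 where supported", so the dtype choice per device is an assumption. It returns plain strings so it also runs where PyTorch is not installed.

```python
def pick_device() -> tuple[str, str]:
    """Return (device, dtype) names, preferring a GPU back end when available."""
    try:
        import torch  # third-party; falls through to CPU if missing
    except ImportError:
        return ("cpu", "float32")

    if torch.cuda.is_available():  # NVIDIA GPU
        return ("cuda", "float16")

    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():  # Apple Silicon
        return ("mps", "float16")

    return ("cpu", "float32")


device, dtype = pick_device()
print(f"Using {device} with {dtype}")
```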

🌍

1000+ Download Sources

YouTube, Twitter/X, TikTok, Instagram, Vimeo, Reddit, Twitch — powered by yt-dlp's massive site coverage.
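
A download step built on yt-dlp's Python API might look like the sketch below. It first decides whether the input is a URL or a local file, then hands URLs to `yt_dlp.YoutubeDL`. The helper names are hypothetical, and the `bestaudio/best` format choice is an assumption made because only the audio track is needed for transcription; the pipeline itself may download full video.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """True for http(s) URLs, False for local paths."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def ydl_options(output_dir: str) -> dict:
    """Options for yt_dlp.YoutubeDL (all are standard yt-dlp option keys)."""
    return {
        "format": "bestaudio/best",  # assumed: audio is enough for ASR
        "outtmpl": str(Path(output_dir) / "%(title)s.%(ext)s"),
        "noplaylist": True,          # a single video, not a whole playlist
        "quiet": True,
    }


def fetch(source: str, output_dir: str = ".") -> str:
    """Return a local file path, downloading first when `source` is a URL."""
    if not is_url(source):
        return source  # already a local file
    import yt_dlp  # third-party: pip install yt-dlp

    with yt_dlp.YoutubeDL(ydl_options(output_dir)) as ydl:
        info = ydl.extract_info(source, download=True)
        return ydl.prepare_filename(info)
```

Importing yt_dlp inside `fetch` keeps local-file runs working even when yt-dlp is not installed.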

🎙️

Speaker Diarization

pyannote-audio identifies who spoke when. Add --diarize for speaker-labelled transcripts with [Speaker N] tags.

🔄

Multi-Engine ASR

Whisper + Qwen3-ASR side by side. Switch with --engine qwen for comparison and accuracy across 30+ languages.

Use Cases

Built for real workflows.

🎬

Content Creators

Auto-generate .srt subtitle files for YouTube videos in any of the 12 supported Indian languages. Repurpose regional content for wider audiences.

🎓

Education

Transcribe university lectures in Tamil, Kannada, or Hindi. Make educational content searchable and accessible.

📰

Journalism

Download and transcribe interviews from any platform with precise timestamps. Evidence-grade documentation.

🏛️

Cultural Preservation

Digitize oral traditions in Sanskrit, Odia, and Punjabi. Preserve pravachans, kirtans, and regional storytelling.

💻

Developers & Researchers

Generate parallel audio-text corpora for ML training. Benchmark ASR accuracy across models and languages.

🎙️

Interview Transcription

Transcribe interviews with speaker diarization. Each voice is labelled automatically — know who said what, with timestamps.

🎧

Podcast & Meeting Notes

Turn podcasts and meetings into speaker-attributed notes. Diarization labels each participant for clean, actionable transcripts.

FAQ

Defaults & common questions.

Everything you need to know about how the pipeline behaves out of the box.

Is diarization (speaker labels) on by default?

No. Diarization is off by default. Add --diarize to enable it. It requires pyannote.audio + a free HuggingFace token. If you use --diarize without a token, the pipeline warns you and continues transcription without speaker labels — it never crashes.

Do I need a HuggingFace token?

Only for speaker diarization (--diarize). Everything else — transcription, translation, language detection — works without any token or account. The token is free (read-only access) at huggingface.co/settings/tokens. You also need to accept both model licenses: speaker-diarization-3.1 and segmentation-3.0.

Is the HF token needed every time, or just for the first download?

Every time. pyannote authenticates on every load, not just the initial model download. Set it permanently: echo 'export HF_TOKEN="hf_..."' >> ~/.zshrc && source ~/.zshrc

Do Whisper models re-download every time?

No. Models are downloaded once and cached permanently. Standard Whisper models are cached at ~/.cache/whisper/, HuggingFace models at ~/.cache/huggingface/, and pyannote models at ~/.cache/torch/pyannote/. Subsequent runs load instantly from cache.

What's the default Whisper model size?

large (about 1.55 billion parameters), for maximum accuracy out of the box. Use --model base (74 million parameters) for faster processing on clear audio, or --model medium (769 million parameters) as a middle ground. When you pass --language te (Telugu) or another supported Indian language, a fine-tuned model is loaded automatically regardless of --model. The large model is downloaded once and cached at ~/.cache/whisper/.

Which ASR engine is used by default?

Whisper is the default engine. Use --engine qwen to switch to Qwen3-ASR-1.7B. Qwen supports Hindi + ~30 languages. For Indian languages not supported by Qwen (Telugu, Kannada, etc.), it automatically falls back to Whisper with the right fine-tuned model.
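
The fallback rule can be sketched as a small function. The set of Qwen-supported languages below is a placeholder: this page only confirms that Hindi is supported and that Telugu and Kannada are not, so consult the Qwen3-ASR model card for the real list. The names `QWEN_LANGS_ASSUMED` and `resolve_engine` are hypothetical.

```python
# Placeholder subset, NOT the real support list (see the model card).
QWEN_LANGS_ASSUMED = {"hi", "en"}


def resolve_engine(requested: str, language: str):
    """Return (engine, note); note explains a fallback or is None."""
    if requested != "qwen":
        return ("whisper", None)  # Whisper is the default engine
    if language in QWEN_LANGS_ASSUMED:
        return ("qwen", None)
    note = f"qwen does not support '{language}'; falling back to whisper"
    return ("whisper", note)


print(resolve_engine("qwen", "hi"))
# prints: ('qwen', None)
print(resolve_engine("qwen", "te"))
# prints: ('whisper', "qwen does not support 'te'; falling back to whisper")
```

After a fallback, the usual model routing applies, so Telugu would still get its fine-tuned Whisper model.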

Where do output files go?

~/Downloads by default. Every transcription produces three files: .txt (plain text), .srt (subtitles with timestamps), and .json (structured data with segments). Override with --output-dir ./my-folder.

Does this work offline?

Yes — after the initial model download. Transcription, translation, and diarization all run 100% locally. You only need internet to download videos (yt-dlp) and to download models the first time.

What happens if diarization fails?

The pipeline degrades gracefully. If pyannote isn't installed, or the token is missing, or diarization hits an error — you get a warning message and transcription continues normally without speaker labels. You never lose your transcript.
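
The graceful-degradation behaviour described above amounts to treating diarization as optional. In the sketch below, `diarize_fn` is a hypothetical callable standing in for whatever wraps pyannote.audio; the token lookup order (flag first, then the HF_TOKEN environment variable) follows the flags table, and everything else is an assumption.

```python
import os
import sys


def resolve_hf_token(cli_token=None, env=None):
    """Prefer --hf-token, then the HF_TOKEN environment variable."""
    env = os.environ if env is None else env
    return cli_token or env.get("HF_TOKEN") or None


def try_diarize(audio_path, token, diarize_fn, num_speakers=None):
    """Return speaker turns, or None when diarization cannot run."""
    if not token:
        print(
            "warning: --diarize needs a HuggingFace token; "
            "continuing without speaker labels",
            file=sys.stderr,
        )
        return None
    try:
        # diarize_fn may raise ImportError (pyannote missing) or any runtime error.
        return diarize_fn(audio_path, token, num_speakers)
    except Exception as exc:
        print(
            f"warning: diarization failed ({exc}); continuing without speaker labels",
            file=sys.stderr,
        )
        return None
```

A caller writes the transcript either way and only adds [Speaker N] tags when the return value is not None.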

Decision Flow

How the pipeline thinks.

From your input to final output — every decision the pipeline makes automatically.

💬 Your command

"Transcribe video.mp4 --language te --diarize"

▼

URL or local file?

URL: 📥 yt-dlp download (1000+ sites supported)
File: 📁 Use local file directly (mp4, mp3, wav, mkv, webm...)

▼

🎵 ffmpeg audio extraction (16kHz mono WAV for inference)

▼

--language flag provided?

No: 🔍 Auto-detect language (first 30s of audio analyzed)
Yes: ✅ Skip detection, use provided language (faster, loads best model directly)

▼

Which engine? (default: whisper)

Whisper: 🤗 Model Router
    vasista22 (te/hi/kn/gu/ta)
    AI4Bharat (bn/ml/mr/or/pa/sa/ur)
    Standard Whisper (others)
Qwen: 🚀 Qwen3-ASR-1.7B
    30+ languages supported
    Falls back to Whisper if unsupported

▼

🧩 Chunked inference (25s windows, 5s overlap)

Smart merge with word-level dedup at boundaries

▼

--diarize flag set?

Yes: 🎙️ pyannote diarization
    Needs HF token + both licenses
    No token? Warns & skips gracefully
No: ⏩ Skip diarization (output without speaker labels)

▼

📂 .txt .srt .json

With or without [Speaker N] labels depending on diarization

Installation

Up and running in 60 seconds.

Clone, run the installer, and start transcribing. One command does everything.

Terminal

# Clone and install

$ git clone https://github.com/humancto/indic-voice-pipeline.git

$ cd indic-voice-pipeline && bash install.sh

# That's it. Now use it with Claude Code:

$ claude

> "Transcribe ~/Downloads/speech.mp4 --language te"

# Optional: enable speaker diarization

$ pip install pyannote.audio

# Get a free token at huggingface.co/settings/tokens

# Then accept BOTH model licenses (required):

# 1. huggingface.co/pyannote/speaker-diarization-3.1

# 2. huggingface.co/pyannote/segmentation-3.0

# Set token permanently (needed every run, not just first time)

$ echo 'export HF_TOKEN="hf_..."' >> ~/.zshrc && source ~/.zshrc

# Now use it:

> "Transcribe podcast.mp4 --language hi --diarize --num-speakers 2"

🐍 Python 3.10+

🎬 ffmpeg

✨ Claude Code
