local-first / fish-speech S2 Pro / no cloud, no quotas, no accounts
studio-quality voice clones, on your own machine
clone any voice from ~60s of clean audio, then use it for audio notes, voiceovers, terminal reactions, or just to make Peter Griffin yell when your build fails. fish-speech S2 Pro runs locally. your reference clip never leaves your laptop.
~/ — voiceforge v0.4
$ curl -fsSL https://raw.githubusercontent.com/humancto/voice-forge/main/install.sh | bash
$ voiceforge install-cloning   # ~12 GB, ~30 min: fish-speech + whisper
voiceforge install-cloning · fish-speech S2 Pro
[████████████████████] 17/17 · done
✓ smoke passed in 41203 ms (196608 bytes, 98304 samples)
$ voiceforge clone tyson https://www.youtube.com/watch?v=9D05ej8u-gU
$ voiceforge say --voice tyson --text "the universe is under no obligation to make sense to you."
🔊 ~3s of Neil deGrasse Tyson, synthesized locally.
00 / what it does today / six real flows
what you can do with voiceforge today.
seven concrete flows. all live on main, all tested, all on voiceforge --version 0.4.0+. nothing aspirational below. cloning runs through fish-speech S2 Pro; the 5 shipped packs (pre-rendered with the same engine) work zero-install. v0.4 ships voiceforge note — full audiobook narration in a cloned voice — plus voiceforge voices migrate for users upgrading legacy v1 voices.
1 — celebrity voices reacting to your terminal, zero model install
$ brew tap humancto/voiceforge && brew install voiceforge   # or: curl install.sh | bash
$ voiceforge daemon & disown
$ voiceforge pack install trump
$ voiceforge say --voice trump --text "build_failed"   # ~50 ms, no synth
$ voiceforge run --voice trump -- npm test   # speaks on success/fail
→ 5 packs × 13 events = 65 reactions out of the box. ~3 MB binary, 7-10 MB packs. no python, no GPU.
2 — wire it into Claude Code so your AI agent talks back
# ~/.claude/settings.json
{
  "hooks": {
    "Notification": [
      { "hooks": [{ "type": "command", "command": "voiceforge hook --profile claude-code" }] }
    ],
    "Stop": [
      { "hooks": [{ "type": "command", "command": "voiceforge hook --profile claude-code" }] }
    ]
  }
}
→ walk away from a 20-min deploy, come back to Peter Griffin yelling that it succeeded, or that it burned down.
3 — auto-react to long terminal commands (zsh / bash)
$ voiceforge install git-hooks   # 4 hooks, idempotent, honors core.hooksPath
$ git commit -m "feat: x"
🔊 "Commit saved."
$ git push
🔊 "Pushed. Now everyone knows."
$ git rebase -i HEAD~3
🔊 "History has been rewritten."
→ husky/lefthook/pre-commit users get a stderr warning. worktrees + submodules work via git rev-parse --git-common-dir.
5 — clone any voice from any audio source — including YouTube URLs (v0.4: fish-speech S2 Pro)
$ voiceforge install-cloning   # one-time, ~30 min, ~12 GB; macOS arm64
pixel-3D BUILD-style wizard renders all 17 install phases live
✓ smoke passes: install proves itself before claiming "ready"
$ voiceforge clone tyson https://www.youtube.com/watch?v=...
$ voiceforge say --voice tyson --text "the universe is under no obligation to make sense to you"
→ studio-quality output. ~3-8 s synth on CPU per utterance (M-series).
→ yt-dlp hardened: --no-playlist --max-filesize 250M --socket-timeout 30 --retries 3.
→ EBU R128 loudnorm during ingest — quiet recordings get bumped to broadcast level. silent inputs rejected.
→ schema-2 marker + standalone SMOKE.toml; v1 GPT-SoVITS installs keep working (set VOICEFORGE_TTS_ENGINE=gpt-sovits-v2).
→ v0.4: upgrade an existing v1 voice in-place — voiceforge voices migrate peter — atomic, idempotent, leaves .v1.bak/ for recovery.
6 — pipe NDJSON events from any tool into the daemon
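a minimal sketch of what "piping NDJSON" means on the wire: one JSON object per line, written to the daemon's unix socket. the socket path and event ids such as build_failed come from this page; the "event" key and the helper names are assumptions for illustration, since the daemon's actual event schema is not shown here.

```python
import json
import socket


def encode_event(event: str, **fields) -> bytes:
    """Serialize one event as a single NDJSON line (JSON object + newline).

    ASSUMPTION: the "event" key is hypothetical; check the daemon's real schema.
    """
    payload = {"event": event, **fields}
    return (json.dumps(payload, separators=(",", ":")) + "\n").encode("utf-8")


def connect(path: str) -> socket.socket:
    """Open a stream connection to a unix socket such as ~/.voiceforge/voiceforge.sock."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    return sock


def send_events(sock: socket.socket, events) -> None:
    """Write each event id as its own line; NDJSON needs no other framing."""
    for event in events:
        sock.sendall(encode_event(event))
```

any tool that can write a line of JSON to a socket can use the same framing, which is why NDJSON suits a long-running local daemon.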
7 — render long-form text (chapters, essays, .md) into narration WAVs (v0.4)
$ voiceforge note --voice tyson --in chapter.md --out chapter.wav
chunking: paragraph-aware via pulldown-cmark; ≤500 chars per chunk
[████████████████████████████░░░░] 28/32 chunks · synth via fish-speech S2 Pro
^C (ctrl-c is safe; re-run the same command to resume)
$ voiceforge note --voice tyson --in chapter.md --out chapter.wav
→ resumes from <out>.progress.json, only re-synths the 4 unfinished chunks.
wrote chapter.wav (32 chunks, 412.6s audio).
→ chunk-write ordering is tmp + fsync + rename + fsync + write progress; power loss costs at most 1 chunk.
→ every chunk verified 44.1 kHz mono PCM_16 before ffmpeg -c copy concat (pinned ffmpeg from the v2 install).
→ v2-only; v1 voices bail with a clear "run voiceforge voices migrate" hint.
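the write ordering above (tmp + fsync + rename + fsync + write progress) can be sketched in a few lines. this is an illustrative python version under stated assumptions, not voiceforge's own code: the chunk file names, the {"done": [...]} progress layout, and the synth callback are all hypothetical, and the directory fsync is POSIX-only.

```python
import json
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """tmp + fsync + rename + fsync(dir): the file is either complete or absent."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # bytes are durable before the rename
        os.replace(tmp_path, path)   # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)             # make the rename itself durable
    finally:
        os.close(dir_fd)


def render(chunks, out_dir, progress_path, synth):
    """Synthesize only chunks not yet recorded in the progress file."""
    done = set()
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = set(json.load(f)["done"])  # hypothetical progress layout
    for index, text in enumerate(chunks):
        if index in done:
            continue                 # finished in an earlier run
        chunk_path = os.path.join(out_dir, f"chunk_{index:04d}.wav")
        write_atomically(chunk_path, synth(text))
        done.add(index)
        # progress is written last, so it never points at a missing chunk
        write_atomically(progress_path, json.dumps({"done": sorted(done)}).encode())
    return sorted(done)
```

because progress is recorded only after the chunk is durable, an interruption at any point loses at most the chunk in flight.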
two quality tiers, by use case:
pack if you want ~50 ms playback of a fixed event (build_failed, deploy_done, …), zero install beyond the binary, or a 30-second demo.
cloned if you want arbitrary text in any voice you have an audio sample of, or your team has custom phrases.
linux + intel-mac users use packs (cloning runtime is macOS-arm64 today; ROADMAP 2.1.1 widens it).
01 / install / ~5 sec via prebuilt binary or brew
two install paths. same binary.
brew if you want upgrade-tracking via brew upgrade. curl-pipe-bash if you want one command, no extra commitment. step 2 (cloning runtime, ~12 GB / ~30 min) is heavy and OPTIONAL — skip it if you just want OS-default voice or the 5 pre-rendered packs.
PATH A — Homebrew (macOS)
two commands. brew tracks new releases via brew upgrade voiceforge. arm64 + intel.
$ brew tap humancto/voiceforge
$ brew install voiceforge
PATH B — universal installer
one command. ~5 sec via prebuilt for darwin-arm64/x86_64 + linux-x86_64/aarch64 (glibc ≥ 2.35). falls back to from-source on unsupported platforms. inspect-first via curl ... -o install.sh; less install.sh; bash install.sh.
$ voiceforge --version
voiceforge 0.4.0
$ voiceforge doctor
[OK] binary /usr/local/bin/voiceforge
[OK] home ~/.voiceforge
[OK] audio backend rodio (CoreAudio)
[OK] embedded TTS macOS say
[OK] daemon not running (~/.voiceforge/voiceforge.sock absent)
[OK] yt-dlp /opt/homebrew/bin/yt-dlp
[OK] cloning fish-speech S2 Pro @ 3dd1f85 (whisper: medium), smoke: ✓ (41203ms, 196608 bytes)
ok: 11 warn: 0 error: 0
$ voiceforge say --text "voiceforge is ready"
🔊 uses macOS `say` — no setup needed for the OS-default voice.
30-second demo (no cloning required)
Peter Griffin reacting to your terminal — no model install
$ voiceforge daemon & disown   # long-running event recipient
$ voiceforge pack install peter   # 7.8 MB, ~3 sec
$ voiceforge play --pack peter --event build_failed   # instant, ~50 ms
$ voiceforge say --voice peter --text "tests_passed"   # pack-aware say, no synth
🔊 "Holy crap Lois, the build is on fire!"
optional: live cloning runtime (v0.4 — fish-speech S2 Pro)
skip if you only want the pre-rendered packs (peter / kimmel / neil_tyson / trump / musk). install only if you want arbitrary text in your own cloned voice. v0.4 pivots to fish-speech S2 Pro (the same engine that pre-rendered the shipped packs) — studio-quality output at the cost of a 30-min, ~12 GB one-time install. legacy GPT-SoVITS path stays alive: VOICEFORGE_INSTALL_CLONING_ENGINE=gpt-sovits-v2.
$ voiceforge install-cloning   # Python 3.11 + ffmpeg@6 + fish-speech S2 Pro + whisper-medium
voiceforge install-cloning · fish-speech S2 Pro
[██████████████████] 17/17
✓ smoke passed in 41203 ms (196608 bytes, 98304 samples)
$ voiceforge clone myname https://www.youtube.com/watch?v=...   # yt-dlp downloads + clones
$ voiceforge say --voice myname --text "anything you write"   # ~3-8 s synth on CPU, then cached
prereqs: nothing for the brew path. ffmpeg on PATH if you'll use voiceforge ingest or clone (brew installs it as a dep when you run install-cloning). yt-dlp on PATH if you'll pass URLs (optional — local files don't need it). Linux/Windows users get the binary + packs + agent integrations today; live cloning is macOS-arm64 today (ROADMAP 2.1.1 widens it).
02 / clone / fish-speech S2 Pro recipe (v0.4)
the recipe that pre-rendered all 5 shipped packs at studio quality.
we tried XTTS, F5-TTS, GPT-SoVITS v2/v2Pro/v4 — landed on GPT-SoVITS v2 in v0.3, then pivoted to fish-speech S2 Pro in v0.4 for cleaner timbre and tighter prosody. the same engine renders the 5 packs you can install zero-setup. legacy GPT-SoVITS path stays alive for users mid-migration.
a word on quality: stylized cartoon voices (Peter, Stewie, Quagmire) reach studio quality with fish-speech S2 Pro on ~60s of clean reference audio. real human voices (Tyson, Attenborough, your colleague) approach broadcast quality. the bundled smoke synth proves the runtime works end-to-end before the doctor row reports ready — no more "install succeeded but first clone errors out 30 seconds later."
03 / use / terminal-native
speak any line. react to any command.
say — one-shot
$ voiceforge say --voice peter --text "the build is on fire"
🔊 "the build is on fire."
run — wraps a command, reacts on success/fail
$ voiceforge run -- cargo test
Compiling voiceforge v0.1.0
Finished test [unoptimized + debuginfo] target(s) in 12.4s
Running unittests src/main.rs
test result: FAILED. 3 passed; 1 failed; 0 ignored; 0 measured
🔊 "holy crap, cargo just chose violence."
$ voiceforge run -- npm test
✓ 47 tests passing
🔊 "oh that's nice. the tests passed."
doctor — system check
$ voiceforge doctor
voiceforge 0.2.0 — system check
[OK] binary /usr/local/bin/voiceforge
[OK] home ~/.voiceforge
[OK] audio backend rodio (CoreAudio)
[OK] embedded TTS macOS say
[OK] cloning GPT-SoVITS @ 08d627c
[OK] presets 5 installed
[OK] config.toml active_voice = "peter"
[OK] daemon running at ~/.voiceforge/voiceforge.sock
ok: 10 warn: 0 error: 0
03.5 / capabilities / everything voiceforge can do
everything voiceforge can do today.
the full surface, by command. all green on `voiceforge --version` 0.2.0+. run any with --help.
command | what it does | when to use
voiceforge say --voice X --text "..." | speaks text in voice X. routes to pack (~50 ms) if X is an installed pack and text matches an event/phrase; else synthesizes via cloning (~2-3 s) or embedded TTS (<500 ms). | the universal entry point
voiceforge run -- <cmd> | runs <cmd>, speaks build_success / build_failed when it exits. | wrap a single long command
voiceforge daemon | long-running unix-socket NDJSON server at ~/.voiceforge/voiceforge.sock. recipient for everything below. |
voiceforge clone <name> <source> | clones a voice from a local file OR URL (yt-dlp resolves it). EBU R128 loudnorm during ingest. silent inputs rejected. multi-aux-ref recipe (1 main + 5 × 10 s aux), Whisper-transcribed. | make your own voice profiles
voiceforge ingest <input> <output> | transcodes any audio source to canonical 32 kHz mono 16-bit PCM WAV. accepts file paths and URLs. validates 10–60 s duration. applies loudnorm. rejects silent inputs. | prep audio for cloning manually
voiceforge voices | lists built-in presets, cloned voices, AND installed packs in one view. voiceforge voices remove <name> deletes a clone. | audit what voices you have
voiceforge use <name> | sets active default voice. when --voice is omitted on say/run, this is what plays. | pick a default per-machine
voiceforge voices migrate <name> | atomic v1→v2 in-place migration of a GPT-SoVITS voice to fish-speech S2 Pro. preserves cache-key invariants. leaves a .v1.bak/ child for recovery. idempotent on already-v2 voices. v0.4. | upgrade legacy v1 voices to studio quality
voiceforge note --voice X --in F --out F | renders long-form text (.md or .txt or stdin) to a finished narration WAV in voice X. paragraph-aware chunking via pulldown-cmark. resume cache survives Ctrl-C. ffmpeg -c copy concat (no re-encode). v2 voices only. v0.4 audiobook killer demo. |
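to make "paragraph-aware chunking" concrete, here is a rough python sketch. voiceforge does this with pulldown-cmark; the version below swaps in a simpler technique (split on blank lines, then pack sentences greedily up to the 500-char ceiling), and the hard-split fallback for an overlong sentence is an assumption, not documented behavior.

```python
import re

MAX_CHARS = 500  # per-chunk ceiling quoted on this page


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks that never cross a paragraph boundary."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            # ASSUMPTION: a sentence longer than the ceiling is cut hard
            while len(sentence) > max_chars:
                if current:
                    chunks.append(current)
                    current = ""
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
            if not sentence:
                continue
            candidate = f"{current} {sentence}".strip()
            if len(candidate) <= max_chars:
                current = candidate      # sentence still fits in this chunk
            else:
                chunks.append(current)   # close the chunk, start a new one
                current = sentence
        if current:
            chunks.append(current)       # flush at the paragraph boundary
    return chunks
```

keeping chunks inside paragraph boundaries means each synth call sees a complete thought, which tends to help prosody at the seams.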
catch-all — shell-init for any command the agent runs (>3s)
$ voiceforge shell-init --install zsh
voiceforge: appended hook to /Users/me/.zshrc
hint: open a new shell or `source /Users/me/.zshrc`
# Now any command >3s fires command_succeeded / command_failed.
# Threshold is env-tunable: VOICEFORGE_SHELL_THRESHOLD_MS=5000.
# Skip-list defaults skip cd/ls/pwd/clear/history/voiceforge.
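the decision the shell hook makes after each command can be written down as a small pure function. this is a sketch of the rule as described here (threshold, skip-list, success vs. failure), not the hook's actual source; the function name and the way the command line is tokenized are assumptions.

```python
import os

DEFAULT_THRESHOLD_MS = 3000  # ">3s" default from this page
SKIP_COMMANDS = {"cd", "ls", "pwd", "clear", "history", "voiceforge"}


def event_for(command_line: str, elapsed_ms: int, exit_code: int, env=None):
    """Return the event id to fire for a finished command, or None to stay silent."""
    env = os.environ if env is None else env
    threshold = int(env.get("VOICEFORGE_SHELL_THRESHOLD_MS", DEFAULT_THRESHOLD_MS))
    words = command_line.split()
    if not words or words[0] in SKIP_COMMANDS:
        return None                      # skip-listed or empty command
    if elapsed_ms <= threshold:
        return None                      # too quick to be worth a reaction
    return "command_succeeded" if exit_code == 0 else "command_failed"
```

for example, a 4.2 s `npm test` fires an event at the default threshold but stays silent once VOICEFORGE_SHELL_THRESHOLD_MS=5000 is set.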
05 / pipeline / under the hood
two layers. all local.
rust CLI for the dev-side surface (install, clone, say, run, doctor, daemon) and a python venv that hosts GPT-SoVITS v2. the synth is a long-lived NDJSON child — model loads once per process, warm for every call after that.
three engines, one façade. Engine::Embedded shells to say / espeak-ng for the default voice (no Python required). Engine::Server talks to a Flask server when VOICEFORGE_TTS_URL is set. Engine::Cloning spawns the long-lived NDJSON child. Engine::speak(text, voice) dispatches per-call based on whether the voice name matches a cloned profile.
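the per-call dispatch can be pictured like this. the real façade is Rust (Engine::speak); the python below only mirrors the selection rule described above. the precedence between a cloned-voice match and VOICEFORGE_TTS_URL is an assumption, and pick_engine is a hypothetical name.

```python
from enum import Enum


class Engine(Enum):
    EMBEDDED = "embedded"   # shells out to say / espeak-ng
    SERVER = "server"       # talks to the server at VOICEFORGE_TTS_URL
    CLONING = "cloning"     # long-lived NDJSON synth child


def pick_engine(voice: str, cloned_voices: set, env: dict) -> Engine:
    """Choose an engine for one speak() call.

    ASSUMPTION: a cloned-profile match wins over the server URL.
    """
    if voice in cloned_voices:
        return Engine.CLONING
    if env.get("VOICEFORGE_TTS_URL"):
        return Engine.SERVER
    return Engine.EMBEDDED
```

deciding per call rather than at startup is what lets one process serve both the OS-default voice and cloned voices without a restart.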
06 / voice packs / install in one command
character voices. install in one command.
live cloning is fast and works for arbitrary text — but on stylized characters (Peter Griffin, Jimmy Kimmel) GPT-SoVITS v2 hits a quality ceiling. for terminal feedback (a fixed set of events) we render packs of phrases once, on a beefy model — fish-speech S2 Pro — and ship the WAVs through a separate index. runtime playback is sub-100ms.
install + use a pack
$ voiceforge pack list
NAME    STATUS     VERSION  TIER           DISPLAY NAME
kimmel  available  0.1.0    public-figure  Jimmy Kimmel
peter   available  0.1.3    character      Peter Griffin
$ voiceforge pack install peter
installed peter v0.1.3 (13 phrases). Try: voiceforge play --pack peter --event tests_passed
$ voiceforge play --pack peter --event tests_passed
🔊 "Hey Lois, the tests just passed. Get me a sandwich to celebrate."
packs live in a separate index. install with voiceforge pack install <name>, play with voiceforge play --pack <name> --event <id>. ~50ms playback — no synth, no model load, just rodio reading a WAV. exit code 3 means "event not in pack" so callers (Claude Code hooks etc.) can fall back to voiceforge say.
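a caller can use that exit code roughly like this. the two voiceforge invocations use only flags shown on this page; the react helper, its injectable run parameter, and the choice to pass the pack name as the --voice are illustrative assumptions.

```python
import subprocess

EVENT_NOT_IN_PACK = 3  # exit code documented above


def react(pack: str, event: str, fallback_text: str, run=subprocess.run) -> int:
    """Play a pre-rendered pack event; fall back to `say` if the pack lacks it."""
    played = run(["voiceforge", "play", "--pack", pack, "--event", event])
    if played.returncode == EVENT_NOT_IN_PACK:
        spoken = run(["voiceforge", "say", "--voice", pack, "--text", fallback_text])
        return spoken.returncode
    return played.returncode   # 0 on success, other codes are passed through
```

only exit code 3 triggers the fallback, so genuine failures (missing binary, broken audio backend) still surface to the caller instead of being masked by a second attempt.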
shipping today — 5 voices, all reacting to build_failed
each of the 5 was rendered locally from a 60-second source clip via fish-speech S2 Pro: no fine-tune, no GPU. you can do this with any voice.
queued — phrase manifests authored, awaiting clean reference clips. render your own in ~1 hour, no GPU.
install one with voiceforge pack install peter. play with voiceforge play --pack peter --event build_failed. ~50ms warm path, no synth.
anyone can render their own packs. the full pipeline is documented in docs/PACK_RENDERING.md — clone any character voice, locally, in under an hour, no GPU required. CPU works; renting a GPU box for 10 minutes makes it a one-coffee operation. content guide for community contributors at PACK_CONTENT_GUIDE.md.
packs are educational / research / local-testing use only — see LICENSE-AUDIO.md. attribution required, 48-hour takedown policy. the engine repo stays clean of celebrity audio so per-pack DMCAs only affect voice-forge-packs.
07 / privacy / what stays on your machine
local-first. by design, not by promise.
voiceforge has zero accounts, zero cloud TTS, zero telemetry. your reference audio, your cloned voices, your reaction text, the synth output: none of it leaves ~/.voiceforge/.
the only outbound calls voiceforge makes: pack install (tarball from voice-forge-packs), install-cloning (one-time fish-speech S2 Pro + whisper-medium model download from HuggingFace, ~12 GB), and clone <URL> / ingest <URL> (uses yt-dlp to fetch the source — opt out by downloading the audio yourself and passing a local file path).
the codebase is open. the install script you can less first. the synth process you can strace. the model weights are on disk.
✓ local synth · ✓ local audio · ✓ local cache · ✗ no accounts · ✗ no telemetry · ✗ no cloud TTS