local-first  /  fish-speech S2 Pro  /  no cloud, no quotas, no accounts

studio-quality voice clones, on your own machine

clone any voice from ~60s of clean audio, then use it for audio notes, voiceovers, terminal reactions, or just to make Peter Griffin yell when your build fails. fish-speech S2 Pro runs locally. your reference clip never leaves your laptop.

~/ — voiceforge v0.4
$ curl -fsSL https://raw.githubusercontent.com/humancto/voice-forge/main/install.sh | bash
$ voiceforge install-cloning   # ~12 GB, ~30 min: fish-speech + whisper
voiceforge install-cloning · fish-speech S2 Pro
[████████████████████] 17/17 · done
✓ smoke passed in 41203 ms (196608 bytes, 98304 samples)
$ voiceforge clone tyson https://www.youtube.com/watch?v=9D05ej8u-gU
$ voiceforge say --voice tyson --text "the universe is under no obligation to make sense to you."
🔊 ~3s of Neil deGrasse Tyson, synthesized locally.

what you can do with voiceforge today.

seven concrete flows. all live on main, all tested, all on voiceforge --version 0.4.0+. nothing aspirational below. cloning runs through fish-speech S2 Pro; the 5 shipped packs (pre-rendered with the same engine) work zero-install. v0.4 ships voiceforge note — full audiobook narration in a cloned voice — plus voiceforge voices migrate for users upgrading legacy v1 voices.

1 — celebrity voices reacting to your terminal, zero model install
$ brew tap humancto/voiceforge && brew install voiceforge   # or: curl install.sh | bash
$ voiceforge daemon & disown
$ voiceforge pack install trump
$ voiceforge say --voice trump --text "build_failed"   # ~50 ms, no synth
$ voiceforge run --voice trump -- npm test   # speaks on success/fail
→ 5 packs × 13 events = 65 reactions out of the box. ~3 MB binary, 7-10 MB packs. no python, no GPU.
2 — wire it into Claude Code so your AI agent talks back
# ~/.claude/settings.json
{
  "hooks": {
    "Notification": [{ "hooks": [{ "type": "command", "command": "voiceforge hook --profile claude-code" }] }],
    "Stop": [{ "hooks": [{ "type": "command", "command": "voiceforge hook --profile claude-code" }] }]
  }
}
→ walk away from a 20-min deploy, come back to Peter Griffin yelling that it succeeded, or that it burned down.
3 — auto-react to long terminal commands (zsh / bash)
$ voiceforge shell-init --install zsh   # idempotent; one-time
$ npm test   # 12 s, exits 0
🔊 "Done."
$ cargo build --release   # 2 min, exits 1
🔊 "The build failed again."
→ catches everything > 3 s: npm/cargo/pytest/pulumi/terraform. skip-list silences cd/ls/pwd/clear/history.
4 — audible git workflow (post-commit, post-merge, post-rewrite, pre-push)
$ voiceforge install git-hooks   # 4 hooks, idempotent, honors core.hooksPath
$ git commit -m "feat: x"
🔊 "Commit saved."
$ git push
🔊 "Pushed. Now everyone knows."
$ git rebase -i HEAD~3
🔊 "History has been rewritten."
→ husky/lefthook/pre-commit users get a stderr warning. worktrees + submodules work via git rev-parse --git-common-dir.
5 — clone any voice from any audio source — including YouTube URLs (v0.4: fish-speech S2 Pro)
$ voiceforge install-cloning   # one-time, ~30 min, ~12 GB; macOS arm64
pixel-3D BUILD-style wizard renders all 17 install phases live
✓ smoke passes: install proves itself before claiming "ready"
$ voiceforge clone tyson https://www.youtube.com/watch?v=...
$ voiceforge say --voice tyson --text "the universe is under no obligation to make sense to you"
→ studio-quality output. ~3-8 s synth on CPU per utterance (M-series).
→ yt-dlp hardened: --no-playlist --max-filesize 250M --socket-timeout 30 --retries 3.
→ EBU R128 loudnorm during ingest: quiet recordings get bumped to broadcast level. silent inputs rejected.
→ schema-2 marker + standalone SMOKE.toml; v1 GPT-SoVITS installs keep working (set VOICEFORGE_TTS_ENGINE=gpt-sovits-v2).
→ v0.4: upgrade an existing v1 voice in-place with voiceforge voices migrate peter. atomic, idempotent, leaves .v1.bak/ for recovery.
6 — pipe NDJSON events from any tool into the daemon
$ tail -f ~/.my-agent/events.log | voiceforge hook --event-from event_type
$ echo '{"hook":{"event_name":"deploy_failed"}}' \
    | voiceforge hook --event-from hook.event_name
$ my-script --stream | voiceforge hook --voice peter --passthrough | jq
→ exit codes distinguish daemon-down (2) from rejected-frame (1) from post-connect-garbage (4).
→ backpressure-safe via per-frame fail threshold; passthrough preserves stdout pipelines.
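how a dotted --event-from path can pick an event name out of one NDJSON frame, as a rough python sketch. this is not voiceforge's implementation (the CLI is rust); the function name extract_event and the choice to return None on bad frames are assumptions made for illustration.

```python
import json


def extract_event(line, path):
    """Return the string found at a dotted path (e.g. "hook.event_name")
    inside one NDJSON line, or None if the frame does not carry it."""
    try:
        node = json.loads(line)
    except json.JSONDecodeError:
        return None  # garbage frame; the caller decides how to count it
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) else None


print(extract_event('{"hook":{"event_name":"deploy_failed"}}', "hook.event_name"))
print(extract_event('{"event_type":"build_failed"}', "event_type"))
```

the two calls print deploy_failed and build_failed. a forwarder built on this would read stdin line by line and skip (or count) frames where the result is None.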
7 — render long-form text (chapters, essays, .md) into narration WAVs (v0.4)
$ voiceforge note --voice tyson --in chapter.md --out chapter.wav
chunking: paragraph-aware via pulldown-cmark; ≤500 chars per chunk
[████████████████████████████░░░░] 28/32 chunks · synth via fish-speech S2 Pro
^C (ctrl-c is safe; re-run the same command to resume)
$ voiceforge note --voice tyson --in chapter.md --out chapter.wav
→ resumes from <out>.progress.json and only re-synths the 4 unfinished chunks.
wrote chapter.wav (32 chunks, 412.6s audio).
→ chunk-write ordering is tmp + fsync + rename + fsync + write progress, so power loss costs at most 1 chunk.
→ every chunk verified 44.1 kHz mono PCM_16 before ffmpeg -c copy concat (pinned ffmpeg from the v2 install).
→ v2-only; v1 voices bail with a clear "run voiceforge voices migrate" hint.
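the write ordering above (tmp + fsync + rename + fsync, then record progress) can be sketched in python as follows. this is an illustration of the technique, not voiceforge's code: the helper names, the ".tmp" suffix, and the {"done": [...]} progress layout are all assumptions, and the directory fsync is POSIX-only.

```python
import json
import os


def _fsync_dir(path):
    """fsync a directory so a rename inside it survives power loss (POSIX)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def write_chunk_atomically(final_path, data):
    """tmp + fsync + rename + fsync(dir): a reader sees the old file or the
    complete new one, never a half-written chunk."""
    tmp_path = final_path + ".tmp"  # hypothetical naming
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)
    _fsync_dir(os.path.dirname(os.path.abspath(final_path)))


def record_progress(progress_path, done):
    """Called only after the chunk itself is durable."""
    payload = json.dumps({"done": sorted(done)}).encode("utf-8")
    write_chunk_atomically(progress_path, payload)


def pending_chunks(progress_path, total):
    """Chunk indices a re-run still has to synthesize."""
    try:
        with open(progress_path) as f:
            done = set(json.load(f)["done"])
    except FileNotFoundError:
        done = set()
    return [i for i in range(total) if i not in done]
```

because progress is written last, a crash between the chunk rename and the progress write means one chunk is synthesized again on resume, which matches the "at most 1 chunk" bound.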

two quality tiers, by use case: pack if you want ~50 ms playback of a fixed event (build_failed, deploy_done, …), zero install beyond the binary, or a 30-second demo. cloned if you want arbitrary text in any voice you have an audio sample of, or your team has custom phrases. linux + intel-mac users use packs (cloning runtime is macOS-arm64 today; ROADMAP 2.1.1 widens it).

two install paths. same binary.

brew if you want upgrade-tracking via brew upgrade. curl-pipe-bash if you want one command, no extra commitment. step 2 (cloning runtime, ~12 GB / ~30 min) is heavy and OPTIONAL — skip it if you just want OS-default voice or the 5 pre-rendered packs.

PATH A — Homebrew (macOS)
two commands. brew tracks new releases via brew upgrade voiceforge. arm64 + intel.
$ brew tap humancto/voiceforge
$ brew install voiceforge
PATH B — universal installer
one command. ~5 sec via prebuilt for darwin-arm64/x86_64 + linux-x86_64/aarch64 (glibc ≥ 2.35). falls back to from-source on unsupported platforms. inspect-first via curl ... -o install.sh; less install.sh; bash install.sh.
$ curl -fsSL https://raw.githubusercontent.com/humancto/voice-forge/main/install.sh | bash

verify the install (5 seconds)

first-run smoke
$ voiceforge --version
voiceforge 0.4.0
$ voiceforge doctor
[OK] binary         /usr/local/bin/voiceforge
[OK] home           ~/.voiceforge
[OK] audio backend  rodio (CoreAudio)
[OK] embedded TTS   macOS say
[OK] daemon         not running (~/.voiceforge/voiceforge.sock absent)
[OK] yt-dlp         /opt/homebrew/bin/yt-dlp
[OK] cloning        fish-speech S2 Pro @ 3dd1f85 (whisper: medium), smoke: ✓ (41203ms, 196608 bytes)
ok: 11 warn: 0 error: 0
$ voiceforge say --text "voiceforge is ready"
🔊 uses macOS `say`: no setup needed for the OS-default voice.

30-second demo (no cloning required)

Peter Griffin reacting to your terminal — no model install
$ voiceforge daemon & disown   # long-running event recipient
$ voiceforge pack install peter   # 7.8 MB, ~3 sec
$ voiceforge play --pack peter --event build_failed   # instant, ~50 ms
$ voiceforge say --voice peter --text "tests_passed"   # pack-aware say, no synth
🔊 "Holy crap Lois, the build is on fire!"

optional: live cloning runtime (v0.4 — fish-speech S2 Pro)

skip if you only want the pre-rendered packs (peter / kimmel / neil_tyson / trump / musk). install only if you want arbitrary text in your own cloned voice. v0.4 pivots to fish-speech S2 Pro (the same engine that pre-rendered the shipped packs) — studio-quality output at the cost of a 30-min, ~12 GB one-time install. legacy GPT-SoVITS path stays alive: VOICEFORGE_INSTALL_CLONING_ENGINE=gpt-sovits-v2.

cloning runtime (~30 min, ~12 GB; macOS arm64 today)
$ voiceforge install-cloning   # Python 3.11 + ffmpeg@6 + fish-speech S2 Pro + whisper-medium
voiceforge install-cloning · fish-speech S2 Pro
[██████████████████] 17/17
✓ smoke passed in 41203 ms (196608 bytes, 98304 samples)
$ voiceforge clone myname https://www.youtube.com/watch?v=...   # yt-dlp downloads + clones
$ voiceforge say --voice myname --text "anything you write"   # ~3-8 s synth on CPU, then cached

prereqs: nothing for the brew path. ffmpeg on PATH if you'll use voiceforge ingest or clone (brew installs it as a dep when you run install-cloning). yt-dlp on PATH if you'll pass URLs (optional — local files don't need it). Linux/Windows users get the binary + packs + agent integrations today; live cloning is macOS-arm64 today (ROADMAP 2.1.1 widens it).

the recipe that pre-rendered all 5 shipped packs at studio quality.

we tried XTTS, F5-TTS, GPT-SoVITS v2/v2Pro/v4 — landed on GPT-SoVITS v2 in v0.3, then pivoted to fish-speech S2 Pro in v0.4 for cleaner timbre and tighter prosody. the same engine renders the 5 packs you can install zero-setup. legacy GPT-SoVITS path stays alive for users mid-migration.

source
~60s clean single-speaker audio (you bring it; YouTube URL OK)
backend
fish-speech S2 Pro · DAC codec + text2semantic AR
reference
single 8-30s clip + Whisper-medium transcript
latency
~30-90s cold model load, ~3-8s synth thereafter (CPU M-series)
cache
sha256(text + voice + created_at) — separate hash domain from explicit-ref smoke path
install gate
post-install smoke synth verifies the runtime end-to-end before reporting "ready"
shipped packs (pre-rendered)
peter · kimmel · neil_tyson · trump · musk
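the cache row above names the inputs to the key, sha256(text + voice + created_at), but not how they are serialized. a minimal python sketch under stated assumptions: the field order, the string form of created_at, and the length prefix on each field are all choices made here for illustration, not voiceforge's actual encoding.

```python
import hashlib


def cache_key(text, voice, created_at):
    """sha256 over text, voice and the voice's created_at stamp.

    Each field is length-prefixed (an assumption of this sketch) so that
    ("ab", "c") and ("a", "bc") cannot collide by plain concatenation.
    """
    h = hashlib.sha256()
    for field in (text, voice, created_at):
        encoded = field.encode("utf-8")
        h.update(len(encoded).to_bytes(8, "big"))
        h.update(encoded)
    return h.hexdigest()


key = cache_key("the build is on fire", "peter", "2025-01-01T00:00:00Z")
print(len(key))
```

the printed length is 64 (hex digest of 32 bytes). one consequence of including created_at: re-cloning a voice under the same name yields different keys, so audio cached for the old clone is not reused.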
voiceforge install-cloning · fish-speech S2 Pro
$ voiceforge install-cloning
┌─────────────────────────────────────────────────┐
│ ◯═══════ VOICEFORGE ═══════◯ ▶ v0.4 install │
│ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
└─────────────────────────────────────────────────┘
==> voiceforge install-cloning v2 (fish-speech S2 Pro)
==> disk-space precheck (need >=15 GB free)
==> downloading fishaudio/s2-pro (S2 Pro weights, ~10 GB)
==> sha256-verifying load-bearing model files
==> downloading whisper medium model (~1.5 GB)
==> smoke test: importing fish_speech.models.text2semantic + dac
==> writing schema-v2 marker
voiceforge install-cloning · fish-speech S2 Pro
[████████████████████] 17/17 · done
✓ smoke passed in 41203 ms (196608 bytes, 98304 samples)
next: voiceforge clone myname <yt-url-or-wav>

a word on quality: stylized cartoon voices (Peter, Stewie, Quagmire) reach studio quality with fish-speech S2 Pro on ~60s of clean reference audio. real human voices (Tyson, Attenborough, your colleague) approach broadcast quality. the bundled smoke synth proves the runtime works end-to-end before the doctor row reports ready — no more "install succeeded but first clone errors out 30 seconds later."
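the smoke line quoted throughout (196608 bytes, 98304 samples) can be sanity-checked with a little arithmetic. the byte and sample counts come from that line; the 44.1 kHz rate is an assumption borrowed from the format check that voiceforge note applies to its chunks.

```python
smoke_bytes = 196608      # from the smoke line
smoke_samples = 98304     # from the smoke line

# 2 bytes per sample is what 16-bit mono PCM looks like.
bytes_per_sample = smoke_bytes // smoke_samples

# Assumption: the smoke clip uses the same 44.1 kHz rate as note chunks.
sample_rate_hz = 44_100
duration_s = smoke_samples / sample_rate_hz

print(bytes_per_sample, round(duration_s, 2))
```

this prints 2 and 2.23: the smoke clip is about 2.2 s of 16-bit mono audio under that assumption, so most of the 41203 ms wall time is presumably the cold model load rather than synthesis.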

speak any line. react to any command.

say — one-shot
$ voiceforge say --voice peter --text "the build is on fire" 🔊 "the build is on fire."
run — wraps a command, reacts on success/fail
$ voiceforge run -- cargo test
   Compiling voiceforge v0.1.0
    Finished test [unoptimized + debuginfo] target(s) in 12.4s
     Running unittests src/main.rs
test result: FAILED. 3 passed; 1 failed; 0 ignored; 0 measured
🔊 "holy crap, cargo just chose violence."
$ voiceforge run -- npm test
✓ 47 tests passing
🔊 "oh that's nice. the tests passed."
doctor — system check
$ voiceforge doctor
voiceforge 0.2.0 — system check
[OK] binary         /usr/local/bin/voiceforge
[OK] home           ~/.voiceforge
[OK] audio backend  rodio (CoreAudio)
[OK] embedded TTS   macOS say
[OK] cloning        GPT-SoVITS @ 08d627c
[OK] presets        5 installed
[OK] config.toml    active_voice = "peter"
[OK] daemon         running at ~/.voiceforge/voiceforge.sock
ok: 10 warn: 0 error: 0

everything voiceforge can do today.

the full surface, by command. all green on `voiceforge --version` 0.4.0+ (rows marked v0.4 need it; the rest predate it). run any with --help.

| command | what it does | when to use |
| --- | --- | --- |
| voiceforge say --voice X --text "..." | speaks text in voice X. routes to pack (~50 ms) if X is an installed pack and text matches an event/phrase; else synthesizes via cloning (~3-8 s) or embedded TTS (<500 ms). | the universal entry point |
| voiceforge run -- <cmd> | runs <cmd>, speaks build_success / build_failed when it exits. | wrap a single long command |
| voiceforge daemon | long-running unix-socket NDJSON server at ~/.voiceforge/voiceforge.sock. recipient for everything below. | always-on prerequisite for hooks/agents |
| voiceforge send <event> | single-frame daemon client. exit codes 0/1/2/4 distinguish ok/rejected/unreachable/protocol-error. retry-loop for daemon-startup race. | cursor / scripts / one-shots |
| voiceforge hook [--profile claude-code] | streaming NDJSON forwarder. reads stdin line-by-line, fires daemon events. --profile claude-code maps Claude's hook payload schema automatically. --passthrough preserves stdout pipelines. | claude code / mcp servers / log tails |
| voiceforge shell-init <zsh\|bash> --install | writes a sentinel-bounded preexec/precmd hook block to your rc. fires command_succeeded / command_failed for any command > 3 s. idempotent. | terminal catch-all |
| voiceforge install git-hooks | installs post-commit / post-merge / post-rewrite / pre-push hooks. honors core.hooksPath. chases worktree .git-file. detects + warns on husky/lefthook collisions. | git workflow audio |
| voiceforge pack {list,install,remove,info} | manages pre-rendered packs from the static index at humancto/voice-forge-packs. sha256-verified install. ~3 sec, ~7-10 MB per pack. | add/remove celebrity voices |
| voiceforge play --pack X --event Y | ~50 ms WAV playback from an installed pack. no synth, no model load. | scripted reactions to fixed events |
| voiceforge install-cloning | installs the fish-speech S2 Pro cloning runtime (Python 3.11 + ffmpeg@6 + whisper-medium). idempotent. --check verifies, --force rebuilds, --uninstall removes. | unlock arbitrary-text cloning (mac arm64) |
| voiceforge clone <name> <source> | clones a voice from a local file OR URL (yt-dlp resolves it). EBU R128 loudnorm during ingest. silent inputs rejected. multi-aux-ref recipe (1 main + 5 × 10 s aux), Whisper-transcribed. | make your own voice profiles |
| voiceforge ingest <input> <output> | transcodes any audio source to canonical 32 kHz mono 16-bit pcm WAV. accepts file paths and URLs. validates 10–60 s duration. applies loudnorm. rejects silent. | prep audio for cloning manually |
| voiceforge voices | lists built-in presets, cloned voices, AND installed packs in one view. voiceforge voices remove <name> deletes a clone. | audit what voices you have |
| voiceforge use <name> | sets active default voice. when --voice is omitted on say/run, this is what plays. | pick a default per-machine |
| voiceforge voices migrate <name> | atomic v1→v2 in-place migration of a GPT-SoVITS voice to fish-speech S2 Pro. preserves cache-key invariants. leaves a .v1.bak/ child for recovery. idempotent on already-v2 voices. v0.4. | upgrade legacy v1 voices to studio quality |
| voiceforge note --voice X --in F --out F | renders long-form text (.md or .txt or stdin) to a finished narration WAV in voice X. paragraph-aware chunking via pulldown-cmark. resume cache survives Ctrl-C. ffmpeg -c copy concat (no re-encode). v2 voices only. v0.4 audiobook killer demo. | narrate a chapter, an essay, a blog post |
| voiceforge doctor [--json] | system health check: binary, home, audio backend, embedded TTS, python server, cache, presets, config.toml, ffmpeg, yt-dlp, daemon socket, cloning install. JSON for tooling. | diagnose anything weird |

not yet shipped (queued): voiceforge record (mic capture; 2.4), voiceforge watch (filesystem changes; 3.4), macOS notification bridge (3.5), LLM reaction provider (4.1), streaming TTS (4.2), voiceforge share asciinema-with-audio (5.3). full ROADMAP at ROADMAP.md.

plug into any AI coding agent.

four surfaces, depending on how your agent emits events. all four ship today (ROADMAP 1.8 / 1.9 / 3.3 / 3.1). full guide: docs/AGENTS.md.

prerequisite: the daemon must be running. one-time:

daemon — long-running unix-socket server
$ voiceforge daemon &
$ disown
voiceforge daemon: listening on /Users/me/.voiceforge/voiceforge.sock (max 8 inflight)
claude code — native hook system
# ~/.claude/settings.json
{
  "hooks": {
    "Notification": [{ "hooks": [{ "type": "command", "command": "voiceforge hook --profile claude-code" }] }],
    "Stop": [{ "hooks": [{ "type": "command", "command": "voiceforge hook --profile claude-code" }] }],
    "PreToolUse": [{ "hooks": [{ "type": "command", "command": "voiceforge hook --profile claude-code" }] }]
  }
}
# --profile claude-code maps hook_event_name -> daemon event automatically
cursor / continue / aider — anything that runs shell on events
$ voiceforge send agent_done --message "code review finished"
spoken: "Done." (voice: hype_narrator)
$ voiceforge send command_failed --message "tests broke after refactor"
spoken: "Your command failed." (voice: angry_duck)
# exit codes: 0 ok, 1 daemon rejected, 2 daemon not reachable, 4 protocol
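a caller script can branch on those exit codes. a small python sketch, assuming voiceforge is on PATH: the code-to-label table comes from the text above, while the function names, the retry count, and the delay are illustrative choices (send already retries the daemon-startup race on its own, so an outer retry is optional).

```python
import subprocess
import time

# Exit codes documented for `voiceforge send`.
SEND_EXIT_MEANING = {
    0: "ok",
    1: "rejected",
    2: "unreachable",
    4: "protocol-error",
}


def classify(code):
    """Map a `voiceforge send` exit code to a label."""
    return SEND_EXIT_MEANING.get(code, "unknown")


def send_event(event, message, run=subprocess.run, retries=3,
               delay_s=0.2, sleep=time.sleep):
    """Send one event; retry only while the daemon is unreachable.

    `run` and `sleep` are injectable so the logic can be exercised
    without a daemon. retries/delay_s are hypothetical defaults.
    """
    cmd = ["voiceforge", "send", event, "--message", message]
    for _ in range(retries):
        outcome = classify(run(cmd).returncode)
        if outcome != "unreachable":
            return outcome
        sleep(delay_s)
    return "unreachable"
```

a rejected frame (1) or protocol error (4) is returned immediately, since retrying the same payload is unlikely to help; only code 2 is treated as transient.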
generic streaming source (logfile, MCP server, custom)
$ tail -f ~/.my-agent/events.log | voiceforge hook --event-from event_type
$ echo '{"hook":{"event_name":"build_failed"}}' \
    | voiceforge hook --event-from hook.event_name
$ my-agent --stream | voiceforge hook --voice peter --passthrough | jq
catch-all — shell-init for any command the agent runs (>3s)
$ voiceforge shell-init --install zsh
voiceforge: appended hook to /Users/me/.zshrc
hint: open a new shell or `source /Users/me/.zshrc`
# Now any command >3s fires command_succeeded / command_failed.
# Threshold is env-tunable: VOICEFORGE_SHELL_THRESHOLD_MS=5000.
# Skip-list defaults skip cd/ls/pwd/clear/history/voiceforge.
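the decision the shell hook makes after each command (threshold, skip-list, success vs failure) is small enough to sketch in python. the real hook is a shell preexec/precmd block; the function name event_for and the first-word matching rule for the skip-list are assumptions, while the 3 s default, the env variable, and the skipped commands come from the comments above.

```python
import os

DEFAULT_THRESHOLD_MS = 3000
DEFAULT_SKIP = {"cd", "ls", "pwd", "clear", "history", "voiceforge"}


def event_for(command_line, duration_ms, exit_code, env=None):
    """Return the event to fire for a finished command, or None to stay quiet."""
    env = os.environ if env is None else env
    threshold_ms = int(env.get("VOICEFORGE_SHELL_THRESHOLD_MS", DEFAULT_THRESHOLD_MS))
    words = command_line.split()
    # Assumption: the skip-list matches on the first word of the command.
    if not words or words[0] in DEFAULT_SKIP:
        return None
    if duration_ms <= threshold_ms:
        return None
    return "command_succeeded" if exit_code == 0 else "command_failed"


print(event_for("npm test", 12_000, 0, env={}))
print(event_for("cargo build --release", 120_000, 1, env={}))
print(event_for("ls -la", 10_000, 0, env={}))
```

the three calls print command_succeeded, command_failed, and None. raising VOICEFORGE_SHELL_THRESHOLD_MS to 5000 would silence a 4-second command that the default lets through.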

two layers. all local.

rust CLI for the dev-side surface (install, clone, say, run, doctor, daemon) and a python venv that hosts the cloning engine (fish-speech S2 Pro in v0.4; GPT-SoVITS v2 on the legacy path). the synth is a long-lived NDJSON child: the model loads once per process and stays warm for every call after that.

    terminal event                  voiceforge process
(build, test, git, agent)
┌─────────────────────┐        ┌──────────────────────────┐
│ apps/voiceforge-cli │        │ scripts/cloning_synth.py │
│ (Rust, async tokio) │  ───►  │ (Python, GPT-SoVITS v2)  │
│                     │ NDJSON │                          │
│ rule engine         │ stdin  │ lazy load                │
│ engine facade       │ stdout │ 1 main + 5 aux refs      │
│ ~/.voiceforge/cache │        │ atomic write             │
└─────────────────────┘        └──────────────────────────┘
  │ embedded fallback            │ cached .wav under
  │ (macOS say / espeak-ng)      │ ~/.voiceforge/cache/
┌─────────────────────────────────────────────────────────┐
│                rodio (CoreAudio / ALSA)                 │
└─────────────────────────────────────────────────────────┘
                  🔊 speakers go brrrr

three engines, one façade. Engine::Embedded shells to say / espeak-ng for the default voice (no Python required). Engine::Server talks to a Flask server when VOICEFORGE_TTS_URL is set. Engine::Cloning spawns the long-lived NDJSON child. Engine::speak(text, voice) dispatches per-call based on whether the voice name matches a cloned profile.
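the per-call dispatch described above, sketched in python. the real façade is rust; this only illustrates the selection idea. the precedence (cloned profile first, then VOICEFORGE_TTS_URL, then embedded) is an assumption, since the text names the three engines and the two conditions but not their order.

```python
from enum import Enum


class Engine(Enum):
    EMBEDDED = "embedded"  # say / espeak-ng, no Python required
    SERVER = "server"      # Flask server at VOICEFORGE_TTS_URL
    CLONING = "cloning"    # long-lived NDJSON child


def pick_engine(voice, cloned_profiles, env):
    """Choose an engine for one speak(text, voice) call.

    Assumed precedence: a voice name that matches a cloned profile wins,
    then a configured server URL, then the embedded fallback.
    """
    if voice in cloned_profiles:
        return Engine.CLONING
    if env.get("VOICEFORGE_TTS_URL"):
        return Engine.SERVER
    return Engine.EMBEDDED


print(pick_engine("tyson", {"tyson"}, {}))
print(pick_engine("default", {"tyson"}, {}))
```

the two calls print Engine.CLONING and Engine.EMBEDDED. keeping the choice per-call means one process can mix a cloned voice and the OS-default voice without restarting.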

character voices. install in one command.

live cloning is fast and works for arbitrary text, but on stylized characters (Peter Griffin, Jimmy Kimmel) the legacy GPT-SoVITS v2 engine hit a quality ceiling. for terminal feedback (a fixed set of events) we render packs of phrases once on a beefier model, fish-speech S2 Pro, and ship the WAVs through a separate index. runtime playback is sub-100ms.

install + use a pack
$ voiceforge pack list
NAME    STATUS     VERSION  TIER           DISPLAY NAME
kimmel  available  0.1.0    public-figure  Jimmy Kimmel
peter   available  0.1.3    character      Peter Griffin
$ voiceforge pack install peter
installed peter v0.1.3 (13 phrases). Try: voiceforge play --pack peter --event tests_passed
$ voiceforge play --pack peter --event tests_passed
🔊 "Hey Lois, the tests just passed. Get me a sandwich to celebrate."

packs live in a separate index. install with voiceforge pack install <name>, play with voiceforge play --pack <name> --event <id>. ~50ms playback — no synth, no model load, just rodio reading a WAV. exit code 3 means "event not in pack" so callers (Claude Code hooks etc.) can fall back to voiceforge say.
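the exit-code-3 fallback mentioned above might look like this from a caller's side. a hedged python sketch: the play and say flags are the ones shown in this page, while the function name react, passing the pack name as the --voice for the fallback, and the injectable run parameter are assumptions for illustration.

```python
import subprocess

EVENT_NOT_IN_PACK = 3  # documented meaning of exit code 3 from `voiceforge play`


def react(pack, event, fallback_text, run=subprocess.run):
    """Play a pre-rendered reaction; fall back to `say` if the pack lacks the event.

    Returns the exit code of whichever command ran last.
    """
    played = run(["voiceforge", "play", "--pack", pack, "--event", event])
    if played.returncode != EVENT_NOT_IN_PACK:
        return played.returncode
    spoken = run(["voiceforge", "say", "--voice", pack, "--text", fallback_text])
    return spoken.returncode
```

only exit code 3 triggers the fallback; any other non-zero code is passed back unchanged so the caller can tell a missing event from a real failure.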

shipping today — 5 voices, all reacting to build_failed

each one was rendered locally from a 60-second source clip via fish-speech S2 Pro: no fine-tune, no GPU. you can do this with any voice.

peter Peter Griffin (Family Guy) v0.1.4
kimmel Jimmy Kimmel v0.1.0
neil_tyson Neil deGrasse Tyson v0.1.0
trump Donald Trump v0.1.0
musk Elon Musk v0.1.0
stewie · bob_ross · herzog · ramsay · obama · bender queued — phrase manifests authored, awaiting clean reference clips. render your own in ~1 hour, no GPU.

install one with voiceforge pack install peter. play with voiceforge play --pack peter --event build_failed. ~50ms warm path, no synth.

anyone can render their own packs. the full pipeline is documented in docs/PACK_RENDERING.md — clone any character voice, locally, in under an hour, no GPU required. CPU works; renting a GPU box for 10 minutes makes it a one-coffee operation. content guide for community contributors at PACK_CONTENT_GUIDE.md.

packs are educational / research / local-testing use only — see LICENSE-AUDIO.md. attribution required, 48-hour takedown policy. the engine repo stays clean of celebrity audio so per-pack DMCAs only affect voice-forge-packs.

local-first. by design, not by promise.

voiceforge has zero accounts, zero cloud TTS, zero telemetry. your reference audio, your cloned voices, your reaction text, and the synth output never leave ~/.voiceforge/.

the only outbound calls voiceforge makes: pack install (tarball from voice-forge-packs), install-cloning (one-time fish-speech S2 Pro + whisper-medium model download from HuggingFace, ~12 GB), and clone <URL> / ingest <URL> (uses yt-dlp to fetch the source — opt out by downloading the audio yourself and passing a local file path).

the codebase is open. the install script you can less first. the synth process you can strace. the model weights are on disk.

✓ local synth ✓ local audio ✓ local cache ✗ no accounts ✗ no telemetry ✗ no cloud TTS
backend (v0.4)
fish-speech S2 Pro (open source) · legacy GPT-SoVITS v2 stays opt-in
first-run download
~10 GB from huggingface.co/fishaudio/s2-pro + ~1.5 GB whisper-medium (~12 GB total)
runtime network
none
license
MIT