local-first / fish-speech S2 Pro / no cloud, no quotas, no accounts

studio-quality voice clones, on your own machine

clone any voice from ~60s of clean audio, then use it for audio notes, voiceovers, terminal reactions, or just to make Peter Griffin yell when your build fails. fish-speech S2 Pro runs locally. your reference clip never leaves your laptop.

~/ — voiceforge v0.4

$ curl -fsSL https://raw.githubusercontent.com/humancto/voice-forge/main/install.sh | bash
$ voiceforge install-cloning   # ~12 GB, ~30 min: fish-speech + whisper
voiceforge install-cloning · fish-speech S2 Pro
[████████████████████] 17/17 · done
✓ smoke passed in 41203 ms (196608 bytes, 98304 samples)
$ voiceforge clone tyson https://www.youtube.com/watch?v=9D05ej8u-gU
$ voiceforge say --voice tyson --text "the universe is under no obligation to make sense to you."
🔊 ~3s of Neil deGrasse Tyson, synthesized locally.

00 what it does today seven real flows

what you can do with voiceforge today.

seven concrete flows. all live on main, all tested, all on voiceforge --version 0.4.0+. nothing aspirational below. cloning runs through fish-speech S2 Pro; the 5 shipped packs (pre-rendered with the same engine) work zero-install. v0.4 ships voiceforge note — full audiobook narration in a cloned voice — plus voiceforge voices migrate for users upgrading legacy v1 voices.

1 — celebrity voices reacting to your terminal, zero model install

$ brew tap humancto/voiceforge && brew install voiceforge   # or: curl install.sh | bash
$ voiceforge daemon & disown
$ voiceforge pack install trump
$ voiceforge say --voice trump --text "build_failed"   # ~50 ms, no synth
$ voiceforge run --voice trump -- npm test   # speaks on success/fail
→ 5 packs × 13 events = 65 reactions out of the box. ~3 MB binary, 7-10 MB packs. no python, no GPU.

2 — wire it into Claude Code so your AI agent talks back

# ~/.claude/settings.json
{
  "hooks": {
    "Notification": [{ "command": "voiceforge hook --profile claude-code" }],
    "Stop": [{ "command": "voiceforge hook --profile claude-code" }]
  }
}
→ walk away from a 20-min deploy, come back to Peter Griffin yelling that it succeeded, or that it's burning down.

3 — auto-react to long terminal commands (zsh / bash)

$ voiceforge shell-init --install zsh   # idempotent; one-time
$ npm test   # 12 s, exits 0
🔊 "Done."
$ cargo build --release   # 2 min, exits 1
🔊 "The build failed again."
→ catches everything > 3 s: npm/cargo/pytest/pulumi/terraform. skip-list silences cd/ls/pwd/clear/history.

4 — audible git workflow (post-commit, post-merge, post-rewrite, pre-push)

$ voiceforge install git-hooks   # 4 hooks, idempotent, honors core.hooksPath
$ git commit -m "feat: x"
🔊 "Commit saved."
$ git push
🔊 "Pushed. Now everyone knows."
$ git rebase -i HEAD~3
🔊 "History has been rewritten."
→ husky/lefthook/pre-commit users get a stderr warning. worktrees + submodules work via git rev-parse --git-common-dir.

5 — clone any voice from any audio source — including YouTube URLs (v0.4: fish-speech S2 Pro)

$ voiceforge install-cloning   # one-time, ~30 min, ~12 GB; macOS arm64
pixel-3D BUILD-style wizard renders all 17 install phases live
✓ smoke passes: install proves itself before claiming "ready"
$ voiceforge clone tyson https://www.youtube.com/watch?v=...
$ voiceforge say --voice tyson --text "the universe is under no obligation to make sense to you"
→ studio-quality output. ~3-8 s synth on CPU per utterance (M-series).
→ yt-dlp hardened: --no-playlist --max-filesize 250M --socket-timeout 30 --retries 3.
→ EBU R128 loudnorm during ingest: quiet recordings get bumped to broadcast level. silent inputs rejected.
→ schema-2 marker + standalone SMOKE.toml; v1 GPT-SoVITS installs keep working (set VOICEFORGE_TTS_ENGINE=gpt-sovits-v2).
→ v0.4: upgrade an existing v1 voice in-place with voiceforge voices migrate peter (atomic, idempotent, leaves .v1.bak/ for recovery).
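a rough sketch of how the hardened fetch and the loudnorm pass above could be assembled as command lines. the four yt-dlp hardening flags are the ones quoted in the text; everything else is an assumption for illustration: the helper names, the -x / --audio-format wav extraction flags, the loudnorm targets (I=-16, TP=-1.5, LRA=11), and the 32 kHz mono PCM output (borrowed from the ingest description). this is not voiceforge's actual code.

```python
from typing import List

# Flags quoted in the text above; everything else in this sketch is assumed.
YT_DLP_HARDENING = [
    "--no-playlist",
    "--max-filesize", "250M",
    "--socket-timeout", "30",
    "--retries", "3",
]


def yt_dlp_argv(url: str, out_template: str) -> List[str]:
    """Build a yt-dlp command line (hypothetical helper)."""
    return ["yt-dlp", *YT_DLP_HARDENING,
            "-x", "--audio-format", "wav",   # assumed: extract audio as WAV
            "-o", out_template, url]


def loudnorm_argv(src: str, dst: str) -> List[str]:
    """Build an ffmpeg EBU R128 loudnorm pass; the target values are assumed."""
    return ["ffmpeg", "-y", "-i", src,
            "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",
            "-ac", "1", "-ar", "32000", "-c:a", "pcm_s16le", dst]


if __name__ == "__main__":
    print(" ".join(yt_dlp_argv("https://example.com/watch?v=ID", "ref.%(ext)s")))
    print(" ".join(loudnorm_argv("ref.wav", "ref.norm.wav")))
```

building argv lists (rather than shell strings) keeps a pasted URL from being interpreted by the shell.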

6 — pipe NDJSON events from any tool into the daemon

$ tail -f ~/.my-agent/events.log | voiceforge hook --event-from event_type
$ echo '{"hook":{"event_name":"deploy_failed"}}' \
    | voiceforge hook --event-from hook.event_name
$ my-script --stream | voiceforge hook --voice peter --passthrough | jq
→ exit codes distinguish daemon-down (2) from rejected-frame (1) from post-connect-garbage (4).
→ backpressure-safe via per-frame fail threshold; passthrough preserves stdout pipelines.
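to make --event-from concrete: a minimal sketch of pulling an event name out of each NDJSON line by walking a dotted path such as hook.event_name. the function names are hypothetical, and dropping malformed lines silently is an assumption; the real hook applies a per-frame fail threshold that this sketch does not model.

```python
import json
from typing import Any, Iterable, Iterator, Optional


def extract_event(frame: Any, dotted_path: str) -> Optional[str]:
    """Walk a dotted path like 'hook.event_name' through a decoded JSON object."""
    node = frame
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) else None


def events_from_ndjson(lines: Iterable[str], dotted_path: str) -> Iterator[str]:
    """Yield event names from NDJSON lines, skipping blank or malformed frames."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            frame = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: a bad frame is dropped, not fatal
        event = extract_event(frame, dotted_path)
        if event is not None:
            yield event


if __name__ == "__main__":
    sample = ['{"hook":{"event_name":"deploy_failed"}}', "not json", ""]
    print(list(events_from_ndjson(sample, "hook.event_name")))
```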

7 — render long-form text (chapters, essays, .md) into narration WAVs (v0.4)

$ voiceforge note --voice tyson --in chapter.md --out chapter.wav
chunking: paragraph-aware via pulldown-cmark; ≤500 chars per chunk
[████████████████████████████░░░░] 28/32 chunks · synth via fish-speech S2 Pro
^C (ctrl-c is safe — re-run the same command to resume)
$ voiceforge note --voice tyson --in chapter.md --out chapter.wav
→ resumes from <out>.progress.json; only re-synths the 4 unfinished chunks.
wrote chapter.wav (32 chunks, 412.6s audio).
→ chunk-write ordering is tmp + fsync + rename + fsync + write progress, so power loss costs at most 1 chunk.
→ every chunk verified 44.1 kHz mono PCM_16 before ffmpeg -c copy concat (pinned ffmpeg from the v2 install).
→ v2-only; v1 voices bail with a clear "run voiceforge voices migrate" hint.
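the write ordering above (tmp + fsync + rename + fsync + write progress) can be sketched in a few lines. this is an illustration of the general technique on POSIX, not voiceforge's Rust implementation; the helper names and the {"done": [...]} layout of the progress file are hypothetical.

```python
import json
import os
import tempfile
from typing import Iterable, List


def fsync_dir(path: str) -> None:
    """fsync a directory so a preceding rename is durable (POSIX only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def write_durably(final_path: str, data: bytes) -> None:
    """tmp + fsync + rename + fsync(dir): the file is either absent or complete."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, final_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    fsync_dir(directory)


def record_progress(progress_path: str, done: Iterable[int]) -> None:
    """Written last, so the progress file never names a chunk that is not on disk."""
    payload = json.dumps({"done": sorted(done)}).encode("utf-8")
    write_durably(progress_path, payload)


def pending_chunks(progress_path: str, total: int) -> List[int]:
    """Chunk indices a resumed run still has to synthesize."""
    try:
        with open(progress_path, "rb") as fh:
            done = set(json.load(fh)["done"])
    except FileNotFoundError:
        done = set()
    return [i for i in range(total) if i not in done]
```

because progress is recorded only after the chunk itself is durable, a crash between the two steps costs one re-synth of that chunk and nothing more.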

two quality tiers, by use case: pack if you want ~50 ms playback of a fixed event (build_failed, deploy_done, …), zero install beyond the binary, or a 30-second demo. cloned if you want arbitrary text in any voice you have an audio sample of, or your team has custom phrases. linux + intel-mac users use packs (cloning runtime is macOS-arm64 today; ROADMAP 2.1.1 widens it).

01 install ~5 sec via prebuilt binary or brew

two install paths. same binary.

brew if you want upgrade-tracking via brew upgrade. curl-pipe-bash if you want one command, no extra commitment. step 2 (cloning runtime, ~12 GB / ~30 min) is heavy and OPTIONAL — skip it if you just want OS-default voice or the 5 pre-rendered packs.

PATH A — Homebrew (macOS)

two commands. brew tracks new releases via brew upgrade voiceforge. arm64 + intel.

$ brew tap humancto/voiceforge
$ brew install voiceforge

PATH B — universal installer

one command. ~5 sec via prebuilt for darwin-arm64/x86_64 + linux-x86_64/aarch64 (glibc ≥ 2.35). falls back to from-source on unsupported platforms. inspect-first via curl ... -o install.sh; less install.sh; bash install.sh.

$ curl -fsSL https://raw.githubusercontent.com/humancto/voice-forge/main/install.sh | bash

verify the install (5 seconds)

first-run smoke

$ voiceforge --version
voiceforge 0.4.0
$ voiceforge doctor
[OK] binary         /usr/local/bin/voiceforge
[OK] home           ~/.voiceforge
[OK] audio backend  rodio (CoreAudio)
[OK] embedded TTS   macOS say
[OK] daemon         not running (~/.voiceforge/voiceforge.sock absent)
[OK] yt-dlp         /opt/homebrew/bin/yt-dlp
[OK] cloning        fish-speech S2 Pro @ 3dd1f85 (whisper: medium), smoke: ✓ (41203ms, 196608 bytes)
ok: 11 warn: 0 error: 0
$ voiceforge say --text "voiceforge is ready"
🔊 uses macOS `say`: no setup needed for the OS-default voice.

30-second demo (no cloning required)

Peter Griffin reacting to your terminal — no model install

$ voiceforge daemon & disown   # long-running event recipient
$ voiceforge pack install peter   # 7.8 MB, ~3 sec
$ voiceforge play --pack peter --event build_failed   # instant, ~50 ms
$ voiceforge say --voice peter --text "tests_passed"   # pack-aware say, no synth
🔊 "Holy crap Lois, the build is on fire!"

optional: live cloning runtime (v0.4 — fish-speech S2 Pro)

skip if you only want the pre-rendered packs (peter / kimmel / neil_tyson / trump / musk). install only if you want arbitrary text in your own cloned voice. v0.4 pivots to fish-speech S2 Pro (the same engine that pre-rendered the shipped packs) — studio-quality output at the cost of a 30-min, ~12 GB one-time install. legacy GPT-SoVITS path stays alive: VOICEFORGE_INSTALL_CLONING_ENGINE=gpt-sovits-v2.

cloning runtime (~30 min, ~12 GB; macOS arm64 today)

$ voiceforge install-cloning   # Python 3.11 + ffmpeg@6 + fish-speech S2 Pro + whisper-medium
voiceforge install-cloning · fish-speech S2 Pro
[██████████████████] 17/17
✓ smoke passed in 41203 ms (196608 bytes, 98304 samples)
$ voiceforge clone myname https://www.youtube.com/watch?v=...   # yt-dlp downloads + clones
$ voiceforge say --voice myname --text "anything you write"   # ~3-8 s synth on CPU, then cached

prereqs: nothing for the brew path. ffmpeg on PATH if you'll use voiceforge ingest or clone (brew installs it as a dep when you run install-cloning). yt-dlp on PATH if you'll pass URLs (optional — local files don't need it). Linux/Windows users get the binary + packs + agent integrations today; live cloning is macOS-arm64 today (ROADMAP 2.1.1 widens it).

02 clone fish-speech S2 Pro recipe (v0.4)

the recipe that pre-rendered all 5 shipped packs at studio quality.

we tried XTTS, F5-TTS, GPT-SoVITS v2/v2Pro/v4 — landed on GPT-SoVITS v2 in v0.3, then pivoted to fish-speech S2 Pro in v0.4 for cleaner timbre and tighter prosody. the same engine renders the 5 packs you can install zero-setup. legacy GPT-SoVITS path stays alive for users mid-migration.

source

~60s clean single-speaker audio (you bring it; YouTube URL OK)

backend

fish-speech S2 Pro · DAC codec + text2semantic AR

reference

single 8-30s clip + Whisper-medium transcript

latency

~30-90s cold model load, ~3-8s synth thereafter (CPU M-series)

cache

sha256(text + voice + created_at) — separate hash domain from explicit-ref smoke path

install gate

post-install smoke synth verifies the runtime end-to-end before reporting "ready"

shipped packs (pre-rendered)

peter · kimmel · neil_tyson · trump · musk
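the cache row above gives the key as sha256(text + voice + created_at), in a separate hash domain from the smoke path. one way to read that, as a sketch: the length-prefixed field encoding and the domain tag strings ("say", "smoke") are assumptions added here so that field boundaries and the two domains cannot collide; the source does not specify either. including created_at presumably means a voice re-cloned under the same name gets fresh cache entries.

```python
import hashlib


def cache_key(text: str, voice: str, created_at: str, domain: str = "say") -> str:
    """Hex sha256 over (domain, text, voice, created_at); hypothetical encoding."""
    digest = hashlib.sha256()
    for field in (domain, text, voice, created_at):
        raw = field.encode("utf-8")
        # Length prefix: ("ab", "c") and ("a", "bc") must not hash the same.
        digest.update(len(raw).to_bytes(8, "big"))
        digest.update(raw)
    return digest.hexdigest()


if __name__ == "__main__":
    stamp = "2025-01-01T00:00:00Z"  # made-up timestamp for the demo
    print(cache_key("the build is on fire", "peter", stamp))
    print(cache_key("the build is on fire", "peter", stamp, domain="smoke"))
```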

voiceforge install-cloning · fish-speech S2 Pro

$ voiceforge install-cloning
┌─────────────────────────────────────────────────┐
│ ◯═══════ VOICEFORGE ═══════◯ ▶ v0.4 install │
│ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
└─────────────────────────────────────────────────┘
==> voiceforge install-cloning v2 (fish-speech S2 Pro)
==> disk-space precheck (need >=15 GB free)
==> downloading fishaudio/s2-pro (S2 Pro weights, ~10 GB)
==> sha256-verifying load-bearing model files
==> downloading whisper medium model (~1.5 GB)
==> smoke test: importing fish_speech.models.text2semantic + dac
==> writing schema-v2 marker
voiceforge install-cloning · fish-speech S2 Pro
[████████████████████] 17/17 · done
✓ smoke passed in 41203 ms (196608 bytes, 98304 samples)
next: voiceforge clone myname <yt-url-or-wav>

a word on quality: stylized cartoon voices (Peter, Stewie, Quagmire) reach studio quality with fish-speech S2 Pro on ~60s of clean reference audio. real human voices (Tyson, Attenborough, your colleague) approach broadcast quality. the bundled smoke synth proves the runtime works end-to-end before the doctor row reports ready — no more "install succeeded but first clone errors out 30 seconds later."
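the smoke line's numbers can be sanity-checked by hand: 196608 bytes over 98304 samples is exactly 2 bytes per sample, which is what 16-bit mono PCM looks like. a tiny sketch of that arithmetic follows; the 44.1 kHz rate used for the duration is an assumption (it is the rate the note flow verifies; the smoke output itself does not state one).

```python
SMOKE_BYTES = 196_608
SMOKE_SAMPLES = 98_304
ASSUMED_SAMPLE_RATE_HZ = 44_100  # assumption: not printed in the smoke line


def bytes_per_sample(n_bytes: int, n_samples: int) -> int:
    """Whole bytes per sample, or ValueError if the payload does not divide evenly."""
    if n_samples <= 0 or n_bytes % n_samples:
        raise ValueError("payload is not a whole number of bytes per sample")
    return n_bytes // n_samples


def duration_s(n_samples: int, sample_rate_hz: int) -> float:
    """Clip length in seconds for a mono stream."""
    return n_samples / sample_rate_hz


if __name__ == "__main__":
    print(bytes_per_sample(SMOKE_BYTES, SMOKE_SAMPLES))
    print(round(duration_s(SMOKE_SAMPLES, ASSUMED_SAMPLE_RATE_HZ), 2))
```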

03 use terminal-native

speak any line. react to any command.

say — one-shot

$ voiceforge say --voice peter --text "the build is on fire"
🔊 "the build is on fire."

run — wraps a command, reacts on success/fail

$ voiceforge run -- cargo test
   Compiling voiceforge v0.1.0
    Finished test [unoptimized + debuginfo] target(s) in 12.4s
     Running unittests src/main.rs
test result: FAILED. 3 passed; 1 failed; 0 ignored; 0 measured
🔊 "holy crap, cargo just chose violence."
$ voiceforge run -- npm test
✓ 47 tests passing
🔊 "oh that's nice. the tests passed."

doctor — system check

$ voiceforge doctor
voiceforge 0.2.0 — system check
[OK] binary         /usr/local/bin/voiceforge
[OK] home           ~/.voiceforge
[OK] audio backend  rodio (CoreAudio)
[OK] embedded TTS   macOS say
[OK] cloning        GPT-SoVITS @ 08d627c
[OK] presets        5 installed
[OK] config.toml    active_voice = "peter"
[OK] daemon         running at ~/.voiceforge/voiceforge.sock
ok: 10 warn: 0 error: 0

03.5 capabilities everything voiceforge can do

everything voiceforge can do today.

the full surface, by command. all green on `voiceforge --version` 0.4.0+ (rows marked v0.4 need it). run any with --help.

command	what it does	when to use
`voiceforge say --voice X --text "..."`	speaks text in voice X. routes to pack (~50 ms) if X is an installed pack and text matches an event/phrase; else synthesizes via cloning (~3-8 s on M-series CPU, then cached) or embedded TTS (<500 ms).	the universal entry point
`voiceforge run -- <cmd>`	runs <cmd>, speaks build_success / build_failed when it exits.	wrap a single long command
`voiceforge daemon`	long-running unix-socket NDJSON server at `~/.voiceforge/voiceforge.sock`. recipient for everything below.	always-on prerequisite for hooks/agents
`voiceforge send <event>`	single-frame daemon client. exit codes 0/1/2/4 distinguish ok/rejected/unreachable/protocol-error. retry-loop for daemon-startup race.	cursor / scripts / one-shots
`voiceforge hook [--profile claude-code]`	streaming NDJSON forwarder. reads stdin line-by-line, fires daemon events. `--profile claude-code` maps Claude's hook payload schema automatically. `--passthrough` preserves stdout pipelines.	claude code / mcp servers / log tails
`voiceforge shell-init <zsh|bash> --install`	writes a sentinel-bounded preexec/precmd hook block to your rc. fires command_succeeded / command_failed for any command > 3 s. idempotent.	terminal catch-all
`voiceforge install git-hooks`	installs post-commit / post-merge / post-rewrite / pre-push hooks. honors `core.hooksPath`. chases worktree `.git`-file. detects + warns on husky/lefthook collisions.	git workflow audio
`voiceforge pack {list,install,remove,info}`	manages pre-rendered packs from the static index at humancto/voice-forge-packs. sha256-verified install. ~3 sec, ~7-10 MB per pack.	add/remove celebrity voices
`voiceforge play --pack X --event Y`	~50 ms WAV playback from an installed pack. no synth, no model load.	scripted reactions to fixed events
`voiceforge install-cloning`	installs the fish-speech S2 Pro cloning runtime (Python 3.11 + ffmpeg@6 + whisper-medium). legacy GPT-SoVITS v2 via VOICEFORGE_INSTALL_CLONING_ENGINE=gpt-sovits-v2. idempotent. `--check` verifies, `--force` rebuilds, `--uninstall` removes.	unlock arbitrary-text cloning (mac arm64)
`voiceforge clone <name> <source>`	clones a voice from a local file OR URL (yt-dlp resolves it). EBU R128 loudnorm during ingest. silent inputs rejected. v0.4 reference recipe: a single 8-30 s clip + Whisper-medium transcript (the legacy GPT-SoVITS path used 1 main + 5 × 10 s aux refs).	make your own voice profiles
`voiceforge ingest <input> <output>`	transcodes any audio source to canonical 32 kHz mono 16-bit pcm WAV. accepts file paths and URLs. validates 10–60 s duration. applies loudnorm. rejects silent.	prep audio for cloning manually
`voiceforge voices`	lists built-in presets, cloned voices, AND installed packs in one view. `voiceforge voices remove <name>` deletes a clone.	audit what voices you have
`voiceforge use <name>`	sets active default voice. when `--voice` is omitted on say/run, this is what plays.	pick a default per-machine
`voiceforge voices migrate <name>`	atomic v1→v2 in-place migration of a GPT-SoVITS voice to fish-speech S2 Pro. preserves cache-key invariants. leaves a `.v1.bak/` child for recovery. idempotent on already-v2 voices. v0.4.	upgrade legacy v1 voices to studio quality
`voiceforge note --voice X --in F --out F`	renders long-form text (.md or .txt or stdin) to a finished narration WAV in voice X. paragraph-aware chunking via pulldown-cmark. resume cache survives Ctrl-C. ffmpeg `-c copy` concat (no re-encode). v2 voices only. v0.4 audiobook killer demo.	narrate a chapter, an essay, a blog post
`voiceforge doctor [--json]`	system health checks: binary, home, audio backend, embedded TTS, python server, cache, presets, config.toml, ffmpeg, yt-dlp, daemon socket, cloning install. JSON for tooling.	diagnose anything weird
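the pack row above says installs are sha256-verified. a generic sketch of that kind of check, assuming the index publishes a hex digest next to each tarball; the function names and the demo file are hypothetical, and this is not voiceforge's installer code.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through sha256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """True only if the file's digest matches the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tarball = os.path.join(tmp, "demo.tar.gz")  # stand-in file, not a real pack
        with open(tarball, "wb") as fh:
            fh.write(b"not a real pack")
        good = hashlib.sha256(b"not a real pack").hexdigest()
        print(verify_download(tarball, good), verify_download(tarball, "0" * 64))
```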

not yet shipped (queued): voiceforge record (mic capture; 2.4), voiceforge watch (filesystem changes; 3.4), macOS notification bridge (3.5), LLM reaction provider (4.1), streaming TTS (4.2), voiceforge share asciinema-with-audio (5.3). full ROADMAP at ROADMAP.md.

04 agents claude code, cursor, anything that runs shell

plug into any AI coding agent.

four surfaces, depending on how your agent emits events. all four ship today (ROADMAP 1.8 / 1.9 / 3.3 / 3.1). full guide: docs/AGENTS.md.

prerequisite: the daemon must be running. one-time:

daemon — long-running unix-socket server

$ voiceforge daemon &
$ disown
voiceforge daemon: listening on /Users/me/.voiceforge/voiceforge.sock (max 8 inflight)

claude code — native hook system

# ~/.claude/settings.json
{
  "hooks": {
    "Notification": [{ "command": "voiceforge hook --profile claude-code" }],
    "Stop": [{ "command": "voiceforge hook --profile claude-code" }],
    "PreToolUse": [{ "command": "voiceforge hook --profile claude-code" }]
  }
}
# --profile claude-code maps hook_event_name -> daemon event automatically

cursor / continue / aider — anything that runs shell on events

$ voiceforge send agent_done --message "code review finished"
spoken: "Done." (voice: hype_narrator)
$ voiceforge send command_failed --message "tests broke after refactor"
spoken: "Your command failed." (voice: angry_duck)
# exit codes: 0 ok, 1 daemon rejected, 2 daemon not reachable, 4 protocol
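for agents that script around send, the exit codes above are the whole contract. a small sketch of a caller that branches on them: the wrapper, its names, and the outer retry on "not reachable" are assumptions for illustration (send is described as already retrying the daemon-startup race itself, so the outer loop is optional).

```python
import subprocess
import time
from enum import IntEnum
from typing import Callable, Sequence


class SendExit(IntEnum):
    OK = 0           # frame accepted
    REJECTED = 1     # daemon rejected the frame
    UNREACHABLE = 2  # daemon not reachable
    PROTOCOL = 4     # protocol error


def _run(argv: Sequence[str]) -> int:
    return subprocess.run(list(argv)).returncode


def send_event(event: str, message: str, attempts: int = 3, delay_s: float = 0.5,
               run: Callable[[Sequence[str]], int] = _run) -> SendExit:
    """Call `voiceforge send`, retrying only while the daemon is unreachable."""
    argv = ["voiceforge", "send", event, "--message", message]
    code = SendExit.UNREACHABLE
    for attempt in range(attempts):
        code = SendExit(run(argv))  # raises ValueError on an undocumented code
        if code is not SendExit.UNREACHABLE:
            break
        if attempt + 1 < attempts:
            time.sleep(delay_s)
    return code
```

the `run` parameter exists so the branching can be exercised without a daemon; a rejected frame (1) is returned immediately, since retrying the same frame is unlikely to help.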

generic streaming source (logfile, MCP server, custom)

$ tail -f ~/.my-agent/events.log | voiceforge hook --event-from event_type
$ echo '{"hook":{"event_name":"build_failed"}}' \
    | voiceforge hook --event-from hook.event_name
$ my-agent --stream | voiceforge hook --voice peter --passthrough | jq

catch-all — shell-init for any command the agent runs (>3s)

$ voiceforge shell-init --install zsh
voiceforge: appended hook to /Users/me/.zshrc
hint: open a new shell or `source /Users/me/.zshrc`
# Now any command >3s fires command_succeeded / command_failed.
# Threshold is env-tunable: VOICEFORGE_SHELL_THRESHOLD_MS=5000.
# Skip-list defaults skip cd/ls/pwd/clear/history/voiceforge.
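the decision the shell hook makes (threshold, skip-list, exit status) can be modeled in a few lines. this is a Python model of the rule described above, not the zsh/bash hook itself; the function names are hypothetical, and matching the skip-list against the first word's basename is an assumption.

```python
import os
import shlex
from typing import Mapping, Optional

DEFAULT_THRESHOLD_MS = 3000
DEFAULT_SKIP = {"cd", "ls", "pwd", "clear", "history", "voiceforge"}


def threshold_ms(env: Mapping[str, str] = os.environ) -> int:
    """Read VOICEFORGE_SHELL_THRESHOLD_MS, falling back to 3 s on bad input."""
    raw = env.get("VOICEFORGE_SHELL_THRESHOLD_MS", "")
    return int(raw) if raw.isdigit() else DEFAULT_THRESHOLD_MS


def event_for(command_line: str, duration_ms: int, exit_code: int,
              env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the event to fire, or None when the command should stay silent."""
    try:
        words = shlex.split(command_line)
    except ValueError:  # unbalanced quotes: fall back to a naive split
        words = command_line.split()
    if not words or os.path.basename(words[0]) in DEFAULT_SKIP:
        return None
    if duration_ms <= threshold_ms(env):
        return None
    return "command_succeeded" if exit_code == 0 else "command_failed"


if __name__ == "__main__":
    print(event_for("npm test", 12_000, 0, {}))
    print(event_for("cargo build --release", 120_000, 1, {}))
    print(event_for("ls -la", 10_000, 0, {}))
```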

05 pipeline under the hood

two layers. all local.

rust CLI for the dev-side surface (install, clone, say, run, doctor, daemon) and a python venv that hosts the cloning engine (fish-speech S2 Pro in v0.4; GPT-SoVITS v2 on the legacy path). the synth is a long-lived NDJSON child: the model loads once per process and stays warm for every call after that.

  terminal event                        voiceforge process
  (build, test, git, agent)
        │
        ▼
┌─────────────────────┐        ┌──────────────────────────┐
│ apps/voiceforge-cli │        │ scripts/cloning_synth.py │
│ (Rust, async tokio) │ ───►   │ (Python, GPT-SoVITS v2)  │
│                     │ NDJSON │                          │
│ rule engine         │ stdin  │ lazy load                │
│ engine facade       │ stdout │ 1 main + 5 aux refs      │
│ ~/.voiceforge/cache │        │ atomic write             │
└─────────────────────┘        └──────────────────────────┘
        │                                │
        │ embedded fallback              │ cached .wav under
        │ (macOS say / espeak-ng)        │ ~/.voiceforge/cache/
        ▼                                ▼
┌─────────────────────────────────────────────────────────┐
│                rodio (CoreAudio / ALSA)                 │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
                🔊 speakers go brrrr

three engines, one façade. Engine::Embedded shells to say / espeak-ng for the default voice (no Python required). Engine::Server talks to a Flask server when VOICEFORGE_TTS_URL is set. Engine::Cloning spawns the long-lived NDJSON child. Engine::speak(text, voice) dispatches per-call based on whether the voice name matches a cloned profile.
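the per-call dispatch described above, as a Python sketch rather than the Rust façade. the rule that a cloned-profile match wins over VOICEFORGE_TTS_URL is an assumption (the text only says dispatch depends on whether the voice name matches a cloned profile), and pick_engine is a hypothetical name.

```python
from enum import Enum
from typing import AbstractSet, Mapping


class Engine(Enum):
    EMBEDDED = "embedded"  # say / espeak-ng, no Python required
    SERVER = "server"      # Flask server at VOICEFORGE_TTS_URL
    CLONING = "cloning"    # long-lived NDJSON child


def pick_engine(voice: str, cloned_voices: AbstractSet[str],
                env: Mapping[str, str]) -> Engine:
    """Choose an engine for one speak(text, voice) call."""
    if voice in cloned_voices:
        return Engine.CLONING           # assumed precedence: clone match first
    if env.get("VOICEFORGE_TTS_URL"):
        return Engine.SERVER
    return Engine.EMBEDDED


if __name__ == "__main__":
    clones = {"tyson", "myname"}
    print(pick_engine("tyson", clones, {}))
    print(pick_engine("default", clones, {"VOICEFORGE_TTS_URL": "http://127.0.0.1:5000"}))
    print(pick_engine("default", clones, {}))
```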

06 voice packs install in one command

character voices. install in one command.

live cloning works for arbitrary text, but every uncached line costs seconds of synth, and on the legacy GPT-SoVITS v2 path stylized characters (Peter Griffin, Jimmy Kimmel) hit a quality ceiling. for terminal feedback (a fixed set of events) we render packs of phrases once with fish-speech S2 Pro and ship the WAVs through a separate index. runtime playback is sub-100ms.

install + use a pack

$ voiceforge pack list
NAME    STATUS     VERSION  TIER           DISPLAY NAME
kimmel  available  0.1.0    public-figure  Jimmy Kimmel
peter   available  0.1.3    character      Peter Griffin
$ voiceforge pack install peter
installed peter v0.1.3 (13 phrases). Try: voiceforge play --pack peter --event tests_passed
$ voiceforge play --pack peter --event tests_passed
🔊 "Hey Lois, the tests just passed. Get me a sandwich to celebrate."

packs live in a separate index. install with voiceforge pack install <name>, play with voiceforge play --pack <name> --event <id>. ~50ms playback — no synth, no model load, just rodio reading a WAV. exit code 3 means "event not in pack" so callers (Claude Code hooks etc.) can fall back to voiceforge say.
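the exit-code-3 fallback above, sketched from a caller's point of view: try the pre-rendered WAV, and only fall back to say when the pack reports the event is missing. the react helper and its return labels are hypothetical, and reusing the pack name as the --voice for the fallback is an assumption.

```python
import subprocess
from typing import Callable, Sequence

EVENT_NOT_IN_PACK = 3  # exit code named in the text above


def _run(argv: Sequence[str]) -> int:
    return subprocess.run(list(argv)).returncode


def react(pack: str, event: str, fallback_text: str,
          run: Callable[[Sequence[str]], int] = _run) -> str:
    """Return which path produced audio: 'pack', 'say', or 'failed'."""
    code = run(["voiceforge", "play", "--pack", pack, "--event", event])
    if code == 0:
        return "pack"
    if code == EVENT_NOT_IN_PACK:
        # Assumption: the pack name doubles as the voice for the fallback.
        say = run(["voiceforge", "say", "--voice", pack, "--text", fallback_text])
        return "say" if say == 0 else "failed"
    return "failed"  # any other code is a real error, not a missing event
```

only code 3 triggers the fallback, so a broken audio backend surfaces as a failure instead of being masked by a second attempt.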

shipping today — 5 voices, all reacting to `build_failed`

each one was rendered locally from a 60-second source clip via fish-speech S2 Pro, with no fine-tune and no GPU. you can do this with any voice.

peter	Peter Griffin (Family Guy)	v0.1.4
kimmel	Jimmy Kimmel	v0.1.0
neil_tyson	Neil deGrasse Tyson	v0.1.0
trump	Donald Trump	v0.1.0
musk	Elon Musk	v0.1.0
stewie · bob_ross · herzog · ramsay · obama · bender	queued — phrase manifests authored, awaiting clean reference clips. render your own in ~1 hour, no GPU.

install one with voiceforge pack install peter. play with voiceforge play --pack peter --event build_failed. ~50ms warm path, no synth.

anyone can render their own packs. the full pipeline is documented in docs/PACK_RENDERING.md — clone any character voice, locally, in under an hour, no GPU required. CPU works; renting a GPU box for 10 minutes makes it a one-coffee operation. content guide for community contributors at PACK_CONTENT_GUIDE.md.

packs are educational / research / local-testing use only — see LICENSE-AUDIO.md. attribution required, 48-hour takedown policy. the engine repo stays clean of celebrity audio so per-pack DMCAs only affect voice-forge-packs.

07 privacy what stays on your machine

local-first. by design, not by promise.

voiceforge has zero accounts, zero cloud TTS, zero telemetry. your reference audio, your cloned voices, your reaction text, the synth output: none of it leaves ~/.voiceforge/.

the only outbound calls voiceforge makes: pack install (tarball from voice-forge-packs), install-cloning (one-time fish-speech S2 Pro + whisper-medium model download from HuggingFace, ~12 GB), and clone <URL> / ingest <URL> (uses yt-dlp to fetch the source — opt out by downloading the audio yourself and passing a local file path).

the codebase is open. the install script you can less first. the synth process you can strace. the model weights are on disk.

✓ local synth ✓ local audio ✓ local cache ✗ no accounts ✗ no telemetry ✗ no cloud TTS

backend (v0.4)

fish-speech S2 Pro (open source) · legacy GPT-SoVITS v2 stays opt-in

first-run download

~10 GB from huggingface.co/fishaudio/s2-pro + ~1.5 GB whisper-medium (~12 GB total)

runtime network

none

license

MIT