The closest analogy I can give for what this feels like is having a quiet co-worker in the room who happens to sound exactly like you. You think out loud. They respond out loud. You both work on the same code. Neither of you is touching a keyboard.
It’s still a little uncanny. But it’s also the most natural way to work I’ve found in twenty-plus years of writing software.
The setup runs entirely on my MacBook. Apple’s on-device speech recognition listens to me. A local language model thinks. A cloned-voice text-to-speech speaks the response back. Nothing leaves the laptop. Nothing requires a network. The whole loop is on-device, and that turns out to matter for reasons I didn’t expect.
How it actually works
A compiled Swift binary wraps Apple’s SFSpeechRecognizer — the same engine that powers macOS dictation — in a continuous-listening daemon. It transcribes everything I say into the active terminal window where Claude Code is running. End-of-utterance is detected by a stability heuristic: if the recognized text stops changing for about 2.5 seconds, the recognizer treats that sentence as final and submits it.
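The stability heuristic is simple enough to sketch. The real listener is Swift; this is a minimal Python reconstruction of the logic only, with the class name, method signature, and explicit timestamps as my assumptions (in the daemon the timing would come from the recognizer's partial-result callbacks):

```python
STABLE_WINDOW = 2.5  # seconds of unchanged transcript before finalizing


class UtteranceDetector:
    """Sketch of the end-of-utterance heuristic: a partial transcript
    is treated as final once it stops changing for STABLE_WINDOW seconds."""

    def __init__(self, window=STABLE_WINDOW):
        self.window = window
        self.last_transcript = ""
        self.last_change = 0.0

    def update(self, partial, now):
        """Feed each partial recognition result with its arrival time.
        Returns the finalized sentence, or None if still in flight."""
        if partial != self.last_transcript:
            # Text is still changing: remember it and restart the clock.
            self.last_transcript = partial
            self.last_change = now
            return None
        if partial and now - self.last_change >= self.window:
            # Unchanged long enough: finalize and reset for the next utterance.
            self.last_transcript = ""
            return partial
        return None
```

Making the timestamp an explicit parameter keeps the heuristic a pure function of its inputs, so the finalize behavior can be tested without a microphone.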
That submission gets injected into Claude Code via AppleScript, addressed to a specific window by ID so it can’t leak into whatever else is open. Claude Code processes the request against a local language model running on MLX (Apple’s native ML framework). The response comes back as text in the terminal — and a separate launcher pipes that text into a TTS engine running a cloned version of my voice. The reply plays through the speakers. The listener auto-pauses while audio is playing, so the model’s spoken reply never gets picked up as a new prompt. Then it resumes listening for the next thing I say.
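The window-ID addressing is the part worth pinning down. The exact AppleScript the dispatcher uses isn't shown here, so this is one plausible shape, sketched in Python: build a script that targets a single Terminal window by ID, escaping the utterance so it survives being embedded in an AppleScript string literal.

```python
def injection_script(window_id, text):
    """Build AppleScript that submits a finalized utterance to one
    specific Terminal window, addressed by window ID so the text cannot
    land in whatever other window happens to be focused. One plausible
    shape; the real dispatcher's script may differ."""
    # Escape backslashes first, then quotes, so the utterance is a
    # valid AppleScript string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return (
        'tell application "Terminal"\n'
        f'    do script "{escaped}" in window id {window_id}\n'
        "end tell"
    )


# The daemon would hand the script to osascript, e.g.:
#   subprocess.run(["/usr/bin/osascript", "-e",
#                   injection_script(42, sentence)])
```

Addressing by window ID rather than "frontmost window" is what keeps a finalized sentence from leaking into an unrelated app if focus changes mid-utterance.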
End-to-end latency, on a current MacBook, is around two to three seconds. Fast enough to feel like a conversation. Slow enough that you notice it’s a different kind of pacing than typing.
What surprises you in the first hour
The first thing that surprised me is how much I already narrate while coding. The interior monologue — “okay let me look at the test, that’s failing because the path is wrong, let me grep for the constant, oh it’s in a different file, fix that here…” — turns out to be most of how I work anyway. Speaking it out loud changed nothing about my reasoning. It just routed it to a different output channel.
The second thing that surprised me is how much faster context-switching gets. When you type, you have to break to compose. When you speak, you can just keep going. “That’s done — now check the function signature in the parent class — yeah okay update the docstring to match — git status — looks good, commit it.” Five tasks, no pause, no posture change.
The third surprise is the physical difference. After half a day of voice-driven work I’m not stiff. My eyes aren’t tired. I haven’t held a clamshell wrist position for hours. There’s a real bodily cost to the way we normally use computers, and removing it feels like removing a weight you didn’t know was there.
Why on-device matters specifically for this
You can build a voice-driven coding setup with cloud APIs. Whisper for speech-in, ElevenLabs for speech-out, GPT or Claude for the brain. Many people do. The result is a tool that works great until your wifi gets weird, your API key hits a rate limit, your monthly bill arrives, or you realize you’ve just sent every word you said in front of your laptop today to three different vendors’ servers.
The on-device version doesn’t have any of those failure modes. It works on a plane. It works in a Faraday cage. It works when the rest of the internet is on fire. The bill is a one-time hardware purchase, not a perpetual subscription. And nothing — no audio, no text, no inference request — ever crosses the network. For me that’s the difference between an interesting demo and a tool I actually use day-to-day.
The cloned voice is a real thing, not a gimmick
The cloned voice is the part everyone reacts to first. When the AI reads its response in your own voice, your nervous system files it under “internal monologue” rather than “external announcement.” It’s a smoother experience than a stranger’s TTS voice and it doesn’t pull your attention the same way.
But it works because the cloning is also on-device. The voice clone trains and runs locally — Pocket TTS in my case, but other local TTS engines slot in if you have a preference. Cloud voice services would mean my own voice (and everything I make it say) is sitting on someone’s server. Not interested.
Where it falls short today
Three real limitations:
1. Domain-specific vocabulary. Apple’s recognizer is excellent at general English, less excellent at obscure software terms, library names, and acronyms. “Refactor the YOLOv8 inference loop” often comes through as “refactor the yellow vate inference loop.” The fix is a custom vocabulary file you can register with the recognizer; that closes most of the gap but takes setup time.
2. Background noise. A quiet office is fine. A coffee shop is workable. A hot tub with the jets running, surprisingly, also fine. A room with kids and a dog is harder. The continuous-listen mode is robust but not magic.
3. Long pauses. If you stop talking mid-sentence to think for 20 seconds, the recognizer will sometimes finalize and submit the partial sentence as if it were done, and you have to start the sentence over. Workable, but a real friction point I’m still iterating on.
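On point 1, the vocabulary hook is real API: `SFSpeechRecognitionRequest` exposes `contextualStrings`, a list of phrases the recognizer should bias toward. What follows is a sketch of loading a one-term-per-line vocabulary file; the file format and function name are my assumptions, and the logic is shown in Python even though the daemon itself is Swift.

```python
def load_vocabulary(contents):
    """Parse a custom vocabulary file: one term per line, blank lines
    and '#' comments ignored, duplicates dropped (first occurrence wins)."""
    seen = set()
    terms = []
    for line in contents.splitlines():
        term = line.strip()
        if not term or term.startswith("#") or term in seen:
            continue
        seen.add(term)
        terms.append(term)
    return terms


# In the Swift daemon, the list would be registered before recognition
# starts, biasing the engine toward domain phrases:
#   request.contextualStrings = loadVocabulary(fileText)
```

With "YOLOv8", library names, and project-specific acronyms in that list, the recognizer has a fighting chance against "yellow vate".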
None of these are fundamental. All of them get better as the recognition models get better, which they’re doing every macOS release.
What this unlocks for me personally
Coding while pacing the room. Coding while cooking. Coding while in the hot tub. (Yes, really. The Mac is on a desk; my voice carries; I check back in by walking up to the screen when something needs visual confirmation.) Holding voice work sessions that last for hours without my body breaking down.
It also lets me work in spaces that aren’t desks. Most of my best thinking happens away from a screen anyway — the keyboard part was always the bottleneck. Removing it doesn’t make me think differently. It makes the time I spend actually capturing the thinking more honest about how that thinking happens.
If you want to try it
The local-AI server is at github.com/nicedreamzapp/claude-code-local. The voice listener and dispatcher are at github.com/nicedreamzapp/NarrateClaude. Both are MIT-licensed, both run entirely on a MacBook with Apple Silicon, and both ship with double-click launchers so the install is closer to “set up an app” than “build a system.”
You will spend an evening getting it tuned to your voice and your vocabulary. After that, it just works. And the working life it produces is, in my experience, qualitatively different from screen-and-keyboard.
— Matt
Part of the Nice Dreamz lineup. If you’re a firm exploring private on-device AI, AirGap AI is the engagement I do for setting this up inside law / medical / accounting practices.