Three Generations of Running Claude Code Locally on a MacBook — What I Actually Learned

Timely update — April 22, 2026. XDA Developers reports Anthropic is A/B-testing a Pro plan that doesn’t include Claude Code — affecting ~2% of new signups, with the pricing page updated to show Claude Code unchecked in the test variant. Current Pro users keep access for now, but the signal is clear: if you want Claude Code in the cloud, the cheapest path is moving toward the $100+ Max tier. What’s below is the free local version that runs exactly the same Claude Code CLI against a model on your own Mac, no subscription needed.

I spent a weekend trying to get Claude Code running against a local model on my Mac. I ended up rewriting the whole setup three times before I had something that didn’t embarrass me on a real coding task. Here’s what went wrong and what finally worked.

The project is open source — it’s at github.com/nicedreamzapp/claude-code-local if you want to skip the story and just run it.

Why do this at all

Claude Code is great, but every call you make sends your code to Anthropic’s cloud. For a lot of what I do — NDA work, client code, things that legally shouldn’t leave the room — that’s a non-starter. I don’t want to turn the AI off, I want to run it somewhere the data doesn’t travel. On a MacBook with 128 GB of unified memory, that’s not hypothetical anymore. You can fit a 70B+ parameter model in RAM and get real work done without a single packet leaving the box.
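A quick back-of-envelope check on why 128 GB makes that claim plausible (my arithmetic, assuming 8-bit quantized weights, which is what the 70B option below ships with):

```python
# Rough memory budget for a 70B-parameter model at 8-bit quantization.
params = 70e9
weight_gib = params * 1 / 2**30   # 1 byte per parameter at 8-bit
print(round(weight_gib, 1))       # ~65.2 GiB of weights
# That leaves roughly 60 GiB of a 128 GB machine for the KV cache,
# macOS itself, and everything else you have open.
```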

Gen 1: Ollama + a translation proxy

The obvious first move. Ollama speaks the OpenAI API. Claude Code speaks the Anthropic API. So you write a little Python proxy in the middle that translates one to the other.
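To make the overhead concrete, here is a minimal sketch of the translation the proxy had to do on every single request. The function name is mine, and this covers only plain text — real requests also carry tool definitions, streaming flags, and typed content blocks, which is exactly where the byte-shuffling piled up:

```python
# Sketch: translate an Anthropic Messages API request into an
# OpenAI Chat Completions request (the Gen 1 proxy's core job).

def anthropic_to_openai(req: dict) -> dict:
    messages = []
    # Anthropic puts the system prompt in a top-level field;
    # OpenAI expects it as the first chat message.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    for m in req.get("messages", []):
        content = m["content"]
        # Anthropic content may be a list of typed blocks; flatten the text ones.
        if isinstance(content, list):
            content = "".join(
                b.get("text", "") for b in content if b.get("type") == "text"
            )
        messages.append({"role": m["role"], "content": content})
    return {
        "model": req.get("model", "local"),
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),
    }
```

And remember this runs in both directions — the response has to be translated back into Anthropic's shape before Claude Code will accept it.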

It worked. It was also painfully slow — 30 tokens/second, and a real coding task took 133 seconds. The proxy was doing two API translations per turn, and every tool call meant serializing and re-deserializing JSON twice. For anything more complex than “write a loop” the thing spent more time shuffling bytes than running inference.

Honestly, Gen 1 was mostly useful for proving the idea was sound. Nothing more.

Gen 2: llama.cpp + TurboQuant

I tried to fix the speed problem by swapping Ollama for llama.cpp and adding Google Research’s TurboQuant KV cache compression. That got me to 41 tok/s — a real improvement on the model side. But Claude Code tasks still took 133 seconds, because the proxy was the bottleneck, not the model.

This was the lesson I needed: you can’t fix a translation-overhead problem by making the translator’s client faster.

Gen 3: kill the proxy entirely

This is where things clicked. Instead of making Ollama or llama.cpp speak Anthropic’s API through a middleman, I wrote a native MLX server that speaks the Anthropic API directly.

No proxy. No translation layer. Claude Code connects to localhost:4000 thinking it’s talking to api.anthropic.com, and the server routes straight into MLX (Apple’s native Metal-based ML framework) running the model on the GPU side of unified memory.
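The shape of that server is simple. Here's a stdlib-only sketch of the idea — the MLX generation call is stubbed out, and the field names follow Anthropic's public Messages response format, not the repo's actual code:

```python
# Sketch: a server that speaks the Anthropic Messages API natively,
# so Claude Code can point at localhost:4000 with no translation layer.
import json
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate(prompt: str) -> str:
    # Stub -- the real server streams tokens from MLX on the GPU side here.
    return "stubbed completion"

def messages_response(text: str, model: str) -> dict:
    # Build a response in the Anthropic Messages shape Claude Code expects.
    return {
        "id": "msg_" + uuid.uuid4().hex[:12],
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
    }

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/messages":
            self.send_error(404)
            return
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = messages_response(
            generate(str(body.get("messages"))), body.get("model", "local")
        )
        payload = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 4000), Handler).serve_forever()
```

One request in, one response out, zero intermediate serialization — that's the entire Gen 3 trick.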

Same Claude Code task: 17.6 seconds. That’s 7.5× faster than anything I had before, and it came from deleting code, not adding it.

Throughput with Qwen 3.5 122B (a mixture-of-experts model where only 10B of the 122B params activate per token) hits 65 tok/s on an M5 Max. That’s faster than cloud Opus and, depending on how you measure, close to cloud Sonnet. At zero dollars a month and with nothing leaving the machine.

What I didn’t expect

Three things surprised me:

1. Tool-call recovery mattered more than raw speed. Small local models sometimes emit garbled tool calls — XML syntax mixed with JSON keys, that sort of thing. Claude Code just silently retries when it can’t parse a call, and without a recovery layer you get infinite loops of “let me try that for you” that never actually run anything. Writing a parser that catches the common garbles and re-infers tool names from parameter keys turned out to be the difference between a demo and something I’d actually use.

2. The 10K-token Claude Code harness prompt is too big for local models. Claude Code sends a giant system prompt with every request, tuned for cloud Claude. Local models see it and often respond with "I am not able to execute this task." I added auto-detection that recognizes a Claude Code coding session (by the presence of the Bash/Read/Edit/Write tools) and swaps in a 100-token version tuned for local models. The prompt token count drops by 99% and the refusals stop.

3. KV cache quantization bits matter. 4-bit KV cache saves memory, but small models lose coherence on long conversations. Bumping to 8-bit (starting at token 1024) fixed the “wait, what were we doing?” drift without a meaningful memory hit.
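The first two fixes can be sketched in a few lines each. These are illustrative — the function names, the parameter-key heuristics, and the priority order are mine, not the repo's actual code:

```python
# Sketches of tool-call recovery and harness detection.
import json
import re

# Distinctive parameter keys -> the tool they imply, checked in priority
# order (most specific first), so a call with a mangled or missing tool
# name can still be routed.
PARAM_HINTS = {
    "old_string": "Edit",
    "command": "Bash",
    "file_path": "Read",
}

def recover_tool_call(raw: str):
    """Try strict JSON first; fall back to inferring the tool from its keys."""
    try:
        call = json.loads(raw)
        if "name" in call:
            return call
    except json.JSONDecodeError:
        pass
    # Salvage "key": "value" pairs even from XML-flavored garble.
    params = dict(re.findall(r'"(\w+)"\s*:\s*"([^"]*)"', raw))
    for key, tool in PARAM_HINTS.items():
        if key in params:
            return {"name": tool, "input": params}
    return None  # unrecoverable -- better to fail loudly than retry forever

CODING_TOOLS = {"Bash", "Read", "Edit", "Write"}

def is_claude_code_session(request: dict) -> bool:
    """Detect the Claude Code harness by the tools it registers."""
    names = {t.get("name") for t in request.get("tools", [])}
    return CODING_TOOLS <= names
```

When `is_claude_code_session` fires, the server substitutes the short local-model system prompt before inference; when `recover_tool_call` salvages a garbled call, Claude Code never sees the garble at all.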

Where I ended up

The repo ships three model options now — Gemma 4 31B (fast, fits a 64 GB Mac), Llama 3.3 70B (slowest but smartest, full 8-bit precision), and Qwen 3.5 122B (fastest throughput, MoE sparsity). Same server, same API, one env var swaps the model.

It also has a browser agent that drives Brave via Chrome DevTools (so local AI can actually do research, not just write code), an iMessage pipeline for when I’m away from the Mac, and — the part I’m proudest of — a hands-free voice loop where Apple’s on-device SFSpeechRecognizer listens and a cloned-voice TTS replies. Both halves of that loop run locally. Your voice never leaves the laptop.

If you want to try it

git clone https://github.com/nicedreamzapp/claude-code-local
cd claude-code-local
bash setup.sh

setup.sh auto-detects your RAM, picks an appropriate model, downloads it (one-time, 18-75 GB), installs the MLX server, and drops a launcher on your Desktop. Double-click and you’re coding locally.
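The RAM-to-model mapping looks roughly like this. The thresholds below are my guesses from the model sizes mentioned above, not the script's actual logic:

```python
# Sketch: pick a model tier from available RAM, the way setup.sh does.
import subprocess

# (min_gb, model) pairs, largest first -- thresholds are illustrative.
MODELS = [
    (96, "Qwen 3.5 122B"),   # MoE, fastest throughput, needs the most headroom
    (80, "Llama 3.3 70B"),   # full 8-bit, ~65 GiB of weights alone
    (48, "Gemma 4 31B"),     # the option that fits a 64 GB Mac
]

def pick_model(ram_gb: int):
    for min_gb, name in MODELS:
        if ram_gb >= min_gb:
            return name
    return None  # below 48 GB there's no model in this lineup

def mac_ram_gb() -> int:
    # macOS-only: ask the kernel for physical memory in bytes.
    out = subprocess.check_output(["sysctl", "-n", "hw.memsize"])
    return int(out) // 2**30
```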

The full source is MIT-licensed. It's ~1,000 lines of Python plus a few shell launchers. No dependencies I couldn't audit from scratch, and zero outbound network calls from any of it (Claude Code's own binary makes one non-blocking startup handshake to Anthropic that you can firewall off with no loss of function — documented in the README).

From where I sit, the interesting piece isn’t the speed numbers. It’s that “your code never leaves the machine” stopped being aspirational. If you’re a lawyer, an accountant, a doctor, or anyone whose work comes with confidentiality obligations, this is the version of AI coding you can actually use.

On-device isn’t just a MacBook thing

This project is one of two I’ve shipped in the on-device-AI space. The other one is RealTime AI Camera — a free iPhone app that detects all 601 object classes from Open Images V7 fully offline, at an average of 10 FPS. Every other iPhone detection app I’ve seen caps out at the standard 80 COCO classes because the bigger model is much harder to get running on-device. I spent weeks on the PyTorch → CoreML conversion, hallucination tuning across the extra 521 classes, and the memory-bandwidth bottleneck in the camera pipeline — wrote up the whole build here if that’s your lane.

Different hardware, different model size, same philosophy: your data doesn’t have to leave the device for the AI to be useful. claude-code-local is the MacBook version of that idea. RealTime AI Camera is the iPhone version.

— Matt


More of my open-source lineup: nicedreamzwholesale.com/software. If you want this set up inside a firm or practice, that’s AirGap AI — book a 15-min call there.
