A Local LLM Coding Stack for macOS
A re-runnable setup for running local LLMs alongside Claude Code on a 16 GB Apple Silicon Mac. Ollama, Continue for VS Code, Qwen 2.5 Coder 7B for code, Gemma 4 e4b for chat.
It's 2026 and the hosted frontier models still write better code than anything you can run on your laptop. So why bother setting up Ollama on your Mac?
Three reasons that have held up for me over the last few months:
- Privacy-sensitive snippets. I work on client codebases I'd rather not send to a hosted API, even one I trust.
- Offline coding. Trains, bad hotel Wi-Fi, the occasional outage. A local model is always there.
- High-volume autocomplete. Paying per-token for "finish this line" gets old. Local is free after the model download.
What I haven't given up: Claude Code stays my daily driver for serious agentic work — multi-file refactors, debugging across modules, planning sprints. The local stack is a complement, not a replacement. The two run side by side, and the split is explicit.
This post walks through the setup I'm running on a 16 GB M4 Mac Mini (and the same setup on a 16 GB M5 MacBook Air): Ollama as the runtime, Continue for the VS Code integration, Qwen 2.5 Coder 7B for code, Gemma 4 e4b for chat / docs / multimodal. The whole thing is a single re-runnable shell script you can find at github.com/marchildmann/IDS-Scripts (mac_setup_llm.sh).
The stack
┌─────────────────────────────────────────────────────────┐
│ VS Code                                                 │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Continue extension                                  │ │
│ │   Cmd+L  chat panel                                 │ │
│ │   Cmd+I  inline edit                                │ │
│ └────────────┬────────────────────────────────────────┘ │
└──────────────┼──────────────────────────────────────────┘
               │ HTTP, http://localhost:11434
               ▼
┌─────────────────────────────────────────────────────────┐
│ Ollama daemon (brew services start ollama)              │
│                                                         │
│  qwen2.5-coder:7b  4.7 GB  code, autocomplete           │
│  gemma4:e4b        9.6 GB  chat, docs, multimodal       │
│  nomic-embed-text  274 MB  @codebase embeddings         │
└─────────────────────────────────────────────────────────┘
Three components, each well-documented and easy to swap.
Why Ollama, not LM Studio or MLX?
- Ollama is what every modern VS Code extension expects. The HTTP API on :11434 is the de-facto standard; the model library is huge; brew install + brew services start gives you a managed background daemon that comes up on login.
- LM Studio is nicer for browsing models in a GUI, but it's not as scriptable.
- MLX (Apple's framework) is faster on Apple Silicon for some models, especially the cutting-edge ones that publish MLX weights weeks before GGUF quants reach Ollama. The ecosystem of clients is still smaller, though. Worth revisiting in a year.
Why these models?
Qwen 2.5 Coder 7B is the best 7B-parameter code model I've used. Small enough to load and respond fast (sub-second to first token for inline completion), good enough to write real functions, and it speaks every language I care about. ~4.7 GB on disk, ~5 GB resident in RAM when loaded.
Gemma 4 e4b is Google's new multimodal model. The e4b naming carries over from Gemma 3n: 4 billion effective parameters via Per-Layer Embeddings selective activation. The model is bigger on disk (~9.6 GB) but only a subset is active per token. Three things make it interesting:
- Text and image input — paste a screenshot of a stack trace and ask "what's wrong here?"
- 128K context window (vs. 32K on Gemma 3n).
- Configurable thinking modes — explicit reasoning steps when you ask for them.
Real benchmarks running a ~650-token explain-this-code prompt:
| Hardware | Total time | Eval rate |
|---|---|---|
| M4 Mac Mini, 16 GB | 23.15 s | 29.15 tok/s |
| M5 MacBook Air, 16 GB | 17.52 s | 34.88 tok/s |
That's fast enough that the model doesn't feel like the bottleneck in a chat flow.
nomic-embed-text is tiny (~270 MB) but essential: it powers Continue's @codebase retrieval. Without an embedding model, a question like "where is X defined?" can't be answered from codebase context.
The script
mac_setup_llm.sh is ~220 lines of bash. Full source is in the repo; the design decisions worth mentioning:
Everything overridable at the top
OLLAMA_HOST="${OLLAMA_HOST:-http://localhost:11434}"
PRIMARY_MODEL="${PRIMARY_MODEL:-qwen2.5-coder:7b}"
LLM_MODELS=(
"qwen2.5-coder:7b"
"gemma4:e4b"
"nomic-embed-text"
)
Every value can be overridden per-run: PRIMARY_MODEL=qwen2.5-coder:32b ./mac_setup_llm.sh. Adding another model is a one-line edit to the array, then re-run.
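The override works through plain shell parameter expansion: ${VAR:-default} keeps a value that is already in the environment and falls back to the default only when the variable is unset or empty. A minimal sketch of that behavior; resolve_primary_model is a made-up name for illustration and is not part of the script:

```shell
# Illustrative only: resolve_primary_model is a hypothetical helper,
# not a function from mac_setup_llm.sh.
resolve_primary_model() {
  # Use $PRIMARY_MODEL if it is set and non-empty,
  # otherwise fall back to the default shown in the post.
  printf '%s\n' "${PRIMARY_MODEL:-qwen2.5-coder:7b}"
}
```

Called as PRIMARY_MODEL=qwen2.5-coder:32b resolve_primary_model it prints the override; with the variable unset or empty it prints qwen2.5-coder:7b.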
Idempotent end-to-end
The script is designed to be re-run any time — when a new model comes out, when you upgrade VS Code, when something feels off. It checks before each step:
- brew list ollama before brew install
- curl /api/tags before brew services start
- ollama list (with tag normalization for :latest) before ollama pull
- code --list-extensions before code --install-extension
Nothing is done twice unnecessarily.
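The check-before-act pattern can be sketched as one small wrapper plus a guard/action pair per step. This is not copied from the script: ensure_step and the six helper names are hypothetical, and the guards are only a guess at how the checks above might be wired up.

```shell
# Run an action only when its guard fails.
# $1 = label for the log, $2 = guard command, $3 = action command.
ensure_step() {
  local label="$1" guard="$2" action="$3"
  if "$guard" >/dev/null 2>&1; then
    echo "skip: $label"
  else
    echo "run: $label"
    "$action"
  fi
}

# Hypothetical guard/action pairs mirroring the checks listed above.
ollama_installed()   { brew list ollama; }
install_ollama()     { brew install ollama; }
ollama_serving()     { curl -fsS --max-time 5 "${OLLAMA_HOST:-http://localhost:11434}/api/tags"; }
start_ollama()       { brew services start ollama; }
continue_installed() { code --list-extensions | grep -qix 'continue.continue'; }
install_continue()   { code --install-extension Continue.continue; }

# Intended call order (commented out so sourcing this sketch changes nothing):
#   ensure_step "Ollama installed"   ollama_installed   install_ollama
#   ensure_step "Ollama serving"     ollama_serving     start_ollama
#   ensure_step "Continue extension" continue_installed install_continue
```

Because each guard is re-evaluated on every run, a second run prints only skip lines unless something has actually changed.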
Never overwrites your config
Every file the script writes under ~/.continue/ goes through a small helper:
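# Reads the file body from stdin; writes it only if the target does not exist.
function write_if_absent {
  local path="$1"
  if [[ -e "$path" ]]; then
    # Drain stdin so the caller's here-document is fully consumed.
    cat > /dev/null
    echo_warn "Preserving existing $path (delete it and re-run to regenerate)"
  else
    cat > "$path"
    echo_success "Wrote $path"
  fi
}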
If you've customized ~/.continue/config.json — added a model, changed the autocomplete settings, anything — the next script run leaves it alone. The yellow ! Preserving existing … line tells you exactly what was skipped. To get a fresh default: rm ~/.continue/config.json && ./mac_setup_llm.sh.
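The helper takes the file body on stdin, so a call site is the function name followed by a here-document. The sketch below restates the helper in portable form so it runs on its own. The echo_warn and echo_success stand-ins are simplified assumptions (the real ones print colored output), and it writes to a temp directory instead of ~/.continue.

```shell
# Simplified stand-ins for the script's logging helpers (assumed).
echo_warn()    { printf '! %s\n' "$*"; }
echo_success() { printf 'ok %s\n' "$*"; }

# Same logic as the helper above, restated with POSIX test syntax.
write_if_absent() {
  local path="$1"
  if [ -e "$path" ]; then
    cat > /dev/null
    echo_warn "Preserving existing $path (delete it and re-run to regenerate)"
  else
    cat > "$path"
    echo_success "Wrote $path"
  fi
}

demo_dir="$(mktemp -d)"

# First call: the file does not exist yet, so the here-document is written.
write_if_absent "$demo_dir/config.json" <<'EOF'
{ "allowAnonymousTelemetry": false }
EOF

# Second call: the file exists, so the new body is discarded.
write_if_absent "$demo_dir/config.json" <<'EOF'
{ "allowAnonymousTelemetry": true }
EOF

cat "$demo_dir/config.json"
```

The final cat still shows the first version, which is the behavior you want when a re-run meets a hand-edited config.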
End-to-end verification
The last step curls /api/generate with a "reply with the word ok" prompt:
RESPONSE=$(curl -sS --max-time 90 "$OLLAMA_HOST/api/generate" \
-H 'Content-Type: application/json' \
-d "{\"model\":\"$PRIMARY_MODEL\",\"prompt\":\"Reply with the single word ok.\",\"stream\":false}" \
| grep -o '"response":"[^"]*"' | head -1)
Catches the "Ollama daemon installed but not serving" failure mode before you find it in VS Code three minutes later.
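The captured value still includes the JSON key, so a pass/fail decision needs one more step. Here is a minimal sketch of that step, assuming the non-streaming /api/generate body carries the text in a top-level "response" string. extract_response and smoke_ok are hypothetical names, and the grep-based parsing keeps the original's limitation of stopping at the first quote inside the reply.

```shell
# Pull the text of the "response" field out of a JSON body passed as $1.
extract_response() {
  printf '%s' "$1" \
    | grep -o '"response":"[^"]*"' \
    | head -1 \
    | sed -e 's/^"response":"//' -e 's/"$//'
}

# Succeed if the reply contains "ok" in any letter case.
smoke_ok() {
  local text
  text="$(extract_response "$1" | tr '[:upper:]' '[:lower:]')"
  case "$text" in
    *ok*) return 0 ;;
    *)    return 1 ;;
  esac
}
```

An error body or an empty reply has no "response" field to extract, so smoke_ok fails and the script can stop with a clear message.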
What ~/.continue/config.json looks like
The script writes a Continue config that puts both chat models in the dropdown, points tab autocomplete at Qwen specifically, and wires nomic-embed-text in for embeddings:
{
"models": [
{
"title": "Qwen 2.5 Coder 7B (local) — code",
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"apiBase": "http://localhost:11434"
},
{
"title": "Gemma 4 e4b (local) — chat / docs / multimodal",
"provider": "ollama",
"model": "gemma4:e4b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen 2.5 Coder 7B (autocomplete)",
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"apiBase": "http://localhost:11434"
},
"embeddingsProvider": {
"provider": "ollama",
"model": "nomic-embed-text",
"apiBase": "http://localhost:11434"
},
"contextProviders": [
{ "name": "code" }, { "name": "docs" },
{ "name": "diff" }, { "name": "terminal" },
{ "name": "open" }, { "name": "codebase" },
{ "name": "folder" }
],
"allowAnonymousTelemetry": false
}
Tab completion stays on Qwen because it's noticeably better line-by-line. Chat lets you pick per-question from the dropdown — Qwen for "write a function", Gemma for "explain this error" or "what's in this screenshot".
Daily workflow
After install:
code . # open any project
# Cmd+L # open the Continue chat panel
# Pick a model from the dropdown above the input
# Type: @codebase what does this project do?
# (the embedding model retrieves relevant chunks first)
# Select code, Cmd+I, describe edit (inline rewrite, accept hunk-by-hunk)
Drag a screenshot into the chat input and ask Gemma 4 about it — works for stack traces, terminal output, architecture diagrams, anything visual. Qwen 2.5 Coder is text-only and can't do that.
CLI-only smoke tests:
$ ollama run gemma4:e4b "Explain this stack trace: ..."
$ ollama ps # what's resident in RAM right now
Adding more models:
# Option A — edit LLM_MODELS in the script, re-run
# Option B — direct pull, then add manually to ~/.continue/config.json
ollama pull qwen2.5-coder:32b # if you have 32 GB+ unified RAM
Restarting the runtime:
Continue automatically reconnects when Ollama comes back up.
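Assuming Ollama runs as the Homebrew-managed service shown in the stack diagram, a restart can be sketched as below. wait_for_ollama is a hypothetical helper that polls /api/tags until the daemon answers; neither function is taken from the script.

```shell
# Restart the Homebrew-managed Ollama service.
restart_ollama() {
  brew services restart ollama
}

# Poll the API until it responds: up to $1 attempts (default 30), one second apart.
wait_for_ollama() {
  local tries="${1:-30}" host="${OLLAMA_HOST:-http://localhost:11434}" i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS --max-time 2 "$host/api/tags" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$tries" ] && sleep 1
    i=$((i + 1))
  done
  return 1
}

# Usage: restart_ollama && wait_for_ollama
```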
Coexisting with Claude Code
I'm explicit about the split:
| | Claude Code | Local stack |
|---|---|---|
| Use for… | hard agentic tasks, multi-file refactors, planning, "fix this failing test suite" | inline edits, autocomplete, "explain this", privacy-sensitive snippets, offline work |
| Model | Anthropic Sonnet / Opus | Qwen 2.5 Coder 7B, Gemma 4 e4b |
| Network | required | local-only |
| Cost | per-token | electricity |
Claude Code stays exactly as it is — no proxy layer, no LiteLLM, no trying to swap in a local model. Cmd+L in VS Code gets me Continue with a local model; claude in a terminal gets me the hosted stack. They never step on each other.
Gotchas I hit
A few non-obvious things that bit me while building this out — worth knowing if you copy the script.
code CLI on PATH before running. The script tries code --install-extension Continue.continue. If you haven't run VS Code's Shell Command: Install 'code' command in PATH from Cmd+Shift+P yet, the script prints a warning and you have to re-run after fixing.
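A preflight check for this is short. The sketch below is an assumption about how such a check could look, not the script's actual code; have_cmd and check_code_cli are hypothetical names.

```shell
# True if the named command resolves on PATH (or as a builtin or function).
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Warn and return non-zero when the VS Code CLI is missing.
check_code_cli() {
  if have_cmd code; then
    echo "code CLI found"
  else
    echo "code CLI not on PATH: run \"Shell Command: Install 'code' command in PATH\" in VS Code, then re-run" >&2
    return 1
  fi
}
```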
Window reload after extension install. After code --install-extension puts Continue on disk, Cmd+L doesn't work in an already-open VS Code window — the extension is installed but not activated. Cmd+Shift+P → Developer: Reload Window fixes it. Or quit and reopen. The Activity Bar icon and Cmd+L both wake up after the reload.
Bash 3.2 + UTF-8 ellipsis = an unbound-variable nightmare. macOS still ships bash 3.2 as /bin/bash for license reasons. With set -u, this line:
echo_step "Ensuring Ollama API is reachable at $OLLAMA_HOST…"
aborts the script with OLLAMA_HOST?: unbound variable. The ? is bash's error formatter mangling the UTF-8 bytes of the … ellipsis character (U+2026, three bytes \xe2\x80\xa6) when it complains about the variable name. The parser slurped those bytes into the variable name and looked up OLLAMA_HOST\xe2\x80\xa6, which is unbound. The fix is to brace the variable so the parser sees a hard boundary:
echo_step "Ensuring Ollama API is reachable at ${OLLAMA_HOST}…"
I now grep my own scripts for \$[A-Za-z_][A-Za-z_0-9]*… after writing them. (Or upgrade to bash 5 via Homebrew. Either works.)
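That check fits in one function. It is a sketch with a made-up name (find_unbraced_before_ellipsis), and it looks only for the … character, not for other non-ASCII bytes that could trigger the same bug.

```shell
# Print file:line:text for every unbraced $VAR immediately followed by
# a UTF-8 ellipsis. Exits non-zero when nothing matches.
# -a treats files as text, -H always prints the file name, -n adds line numbers.
find_unbraced_before_ellipsis() {
  grep -aHnE '\$[A-Za-z_][A-Za-z_0-9]*…' "$@"
}

# Usage: find_unbraced_before_ellipsis mac_setup_llm.sh
```

Braced references such as ${OLLAMA_HOST}… are not reported, because the pattern requires a name character directly after the dollar sign.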
Tag normalization. ollama list always shows an explicit tag, for example nomic-embed-text:latest, while the script's array has "nomic-embed-text" with no tag. A grep -Fxq exact match therefore misses, and the script "re-pulls" the model (Ollama dedupes layers, so it's effectively instant, but the log message is misleading). The fix is to append :latest when the name has no tag:
expected="$model"
[[ "$model" == *:* ]] || expected="${model}:latest"
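Here is the same normalization in a fuller sketch that also performs the exact-match lookup. It assumes ollama list prints one header row followed by rows whose first column is the model name; normalize_tag and model_in_list are hypothetical names.

```shell
# Append :latest when the model name carries no explicit tag.
normalize_tag() {
  case "$1" in
    *:*) printf '%s\n' "$1" ;;
    *)   printf '%s\n' "$1:latest" ;;
  esac
}

# Read `ollama list` output on stdin; succeed if model $1 is present.
model_in_list() {
  local expected
  expected="$(normalize_tag "$1")"
  # Skip the header row, keep the NAME column, then match the whole line exactly.
  awk 'NR > 1 { print $1 }' | grep -Fxq -- "$expected"
}

# Usage: ollama list | model_in_list nomic-embed-text || ollama pull nomic-embed-text
```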
What I'd add next
Things I'm experimenting with but didn't bake into the default script:
- MLX runtime side-by-side with Ollama. Some frontier models publish MLX weights weeks before GGUF quantizations show up in Ollama. The mlx-lm CLI can be installed in parallel and used for one-offs.
- Qwen 2.5 Coder 32B for hard refactors. Needs 32 GB+ unified RAM. On an M4 Max it's slower per token than 7B but the quality jump is real on multi-file work.
- A dedicated reasoning model — DeepSeek-R1 distilled variants are interesting for "think step by step before answering" tasks where latency cost is acceptable.
Adding any of these is one line in LLM_MODELS and a re-run — the script is built for that.
Try it
$ git clone https://github.com/marchildmann/IDS-Scripts
$ cd IDS-Scripts
$ ./mac_setup_llm.sh
About 15 GB of model downloads on first run; subsequent re-runs take seconds. The script tells you exactly which step it's on (and what was preserved vs. regenerated), and the final summary prints the URLs, models, shortcuts, and a daily-workflow cheatsheet.
Three commands and your Mac is talking to itself about code. Not a bad way to spend a coffee.