A Local LLM Coding Stack for macOS

A re-runnable setup for running local LLMs alongside Claude Code on a 16 GB Apple Silicon Mac. Ollama, Continue for VS Code, Qwen 2.5 Coder 7B for code, Gemma 4 e4b for chat.

It's 2026 and the hosted frontier models still write better code than anything you can run on your laptop. So why bother setting up Ollama on your Mac?

Three reasons that have held up for me over the last few months:

  1. Privacy-sensitive snippets. I work on client codebases I'd rather not send to a hosted API, even one I trust.
  2. Offline coding. Trains, bad hotel Wi-Fi, the occasional outage. A local model is always there.
  3. High-volume autocomplete. Paying per-token for "finish this line" gets old. Local is free after the model download.

What I haven't given up: Claude Code stays my daily driver for serious agentic work — multi-file refactors, debugging across modules, planning sprints. The local stack is a complement, not a replacement. The two run side by side, and the split is explicit.

This post walks through the setup I'm running on a 16 GB M4 Mac Mini (and the same setup on a 16 GB M5 MacBook Air): Ollama as the runtime, Continue for the VS Code integration, Qwen 2.5 Coder 7B for code, Gemma 4 e4b for chat / docs / multimodal. The whole thing is a single re-runnable shell script you can find at github.com/marchildmann/IDS-Scripts (mac_setup_llm.sh).

The stack

┌─────────────────────────────────────────────────────────┐
│ VS Code                                                 │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Continue extension                                  │ │
│ │   Cmd+L  chat panel                                 │ │
│ │   Cmd+I  inline edit                                │ │
│ └────────────┬────────────────────────────────────────┘ │
└──────────────┼──────────────────────────────────────────┘
               │ HTTP, http://localhost:11434
               ▼
┌─────────────────────────────────────────────────────────┐
│ Ollama daemon (brew services start ollama)              │
│                                                         │
│   qwen2.5-coder:7b    4.7 GB   code, autocomplete       │
│   gemma4:e4b          9.6 GB   chat, docs, multimodal   │
│   nomic-embed-text    274 MB   @codebase embeddings     │
└─────────────────────────────────────────────────────────┘

Three components, each well-documented and easy to swap.

Why Ollama, not LM Studio or MLX?

Why these models?

Qwen 2.5 Coder 7B is the best 7B-parameter code model I've used. Small enough to load and respond fast (sub-second to first token for inline completion), good enough to write real functions, and it speaks every language I care about. ~4.7 GB on disk, ~5 GB resident in RAM when loaded.

Gemma 4 e4b is Google's new multimodal model. The e4b naming carries over from Gemma 3n: 4 billion effective parameters via Per-Layer Embeddings selective activation. The model is bigger on disk (~9.6 GB) but only a subset is active per token. Three things make it interesting:

Real benchmarks running a ~650-token explain-this-code prompt:

Hardware                 Total time   Eval rate
M4 Mac Mini, 16 GB       23.15 s      29.15 tok/s
M5 MacBook Air, 16 GB    17.52 s      34.88 tok/s

That's fast enough that the model doesn't feel like the bottleneck in a chat flow.
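For reference, an eval rate is just generated tokens divided by generation time. The sketch below shows that arithmetic, assuming the inputs come from the eval_count and eval_duration (nanoseconds) fields of an Ollama /api/generate response. The helper name tokens_per_second and the sample values are illustrative and are not taken from the benchmark runs above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert eval_count / eval_duration into tokens per second.
# eval_duration is assumed to be in nanoseconds, hence the 1e9 factor.
tokens_per_second() {
  local eval_count="$1" eval_duration_ns="$2"
  awk -v n="$eval_count" -v d="$eval_duration_ns" \
    'BEGIN { if (d <= 0) { print "0.00"; exit } printf "%.2f\n", n * 1e9 / d }'
}

# Illustrative values only: 100 tokens generated in 2 seconds.
tokens_per_second 100 2000000000
```

Running the same prompt on each machine and comparing this figure gives a like-for-like number that is independent of prompt length.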

nomic-embed-text is tiny (~270 MB) but essential: it powers Continue's @codebase retrieval. Without an embedding model, a question like "where is X defined?" can't be answered from codebase context.

The script

mac_setup_llm.sh is ~220 lines of bash. Full source is in the repo; the design decisions worth mentioning:

Everything overridable at the top

OLLAMA_HOST="${OLLAMA_HOST:-http://localhost:11434}"
PRIMARY_MODEL="${PRIMARY_MODEL:-qwen2.5-coder:7b}"

LLM_MODELS=(
  "qwen2.5-coder:7b"
  "gemma4:e4b"
  "nomic-embed-text"
)

Every value can be overridden per-run: PRIMARY_MODEL=qwen2.5-coder:32b ./mac_setup_llm.sh. Adding another model is a one-line edit to the array, then re-run.
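The override works because of the ${VAR:-default} expansion: a value supplied by the caller wins, otherwise the default applies. A tiny sketch of that behaviour follows; the function name resolve_primary_model is made up for the demo.

```shell
#!/usr/bin/env bash
# ${VAR:-default} keeps a caller-supplied value and falls back to the default otherwise.
resolve_primary_model() {
  echo "${PRIMARY_MODEL:-qwen2.5-coder:7b}"
}

unset PRIMARY_MODEL
resolve_primary_model                                      # falls back to the default
PRIMARY_MODEL="qwen2.5-coder:32b" resolve_primary_model    # caller's value wins
```

Because the expansion also covers the unset case, it stays safe under set -u.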

Idempotent end-to-end

The script is designed to be re-run any time — when a new model comes out, when you upgrade VS Code, when something feels off. It checks before each step:

Nothing is done twice unnecessarily.
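The underlying pattern is check-then-act. The sketch below is an illustration, not the script's actual code: the helper name ensure and the demo step (creating a throwaway directory) are assumptions chosen so the example runs anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run an action only when its check reports "not done yet".
# Usage: ensure "<description>" <check-function> <action-function>
ensure() {
  local description="$1" check="$2" action="$3"
  if "$check"; then
    echo "skip: $description"
  else
    echo "run:  $description"
    "$action"
  fi
}

# Demo step: create a marker directory in a temp location.
demo_dir="${TMPDIR:-/tmp}/idempotent-demo.$$"
demo_exists() { [ -d "$demo_dir" ]; }
demo_create() { mkdir -p "$demo_dir"; }

ensure "create demo dir" demo_exists demo_create   # first call performs the step
ensure "create demo dir" demo_exists demo_create   # second call is skipped
rm -rf "$demo_dir"
```

Each real step would pair a cheap check (is the binary on PATH, is the model already listed) with the action it guards.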

Never overwrites your config

Every file the script writes under ~/.continue/ goes through a small helper:

function write_if_absent {
  local path="$1"
  if [[ -e "$path" ]]; then
    cat > /dev/null
    echo_warn "Preserving existing $path (delete it and re-run to regenerate)"
  else
    cat > "$path"
    echo_success "Wrote $path"
  fi
}

If you've customized ~/.continue/config.json — added a model, changed the autocomplete settings, anything — the next script run leaves it alone. The yellow ! Preserving existing … line tells you exactly what was skipped. To get a fresh default: rm ~/.continue/config.json && ./mac_setup_llm.sh.
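To see the preserve-on-rerun behaviour in isolation, here is a self-contained sketch that feeds the helper from a heredoc twice. The logging functions are stand-ins (the script's real echo_warn and echo_success are not shown in this post), and the demo writes to a temporary directory rather than ~/.continue.

```shell
#!/usr/bin/env bash
# Stand-ins for the script's logging helpers; the real ones may format differently.
echo_warn()    { echo "! $*"; }
echo_success() { echo "✓ $*"; }

# Same helper as above, repeated so this sketch runs on its own.
write_if_absent() {
  local path="$1"
  if [ -e "$path" ]; then
    cat > /dev/null          # consume stdin so the unused heredoc is discarded
    echo_warn "Preserving existing $path (delete it and re-run to regenerate)"
  else
    cat > "$path"
    echo_success "Wrote $path"
  fi
}

# Demo against a throwaway directory.
demo_dir="$(mktemp -d)"

write_if_absent "$demo_dir/config.json" <<'EOF'
{ "models": [] }
EOF

# Second call with different content: the existing file is left untouched.
write_if_absent "$demo_dir/config.json" <<'EOF'
{ "models": ["this never reaches disk"] }
EOF

cat "$demo_dir/config.json"
rm -rf "$demo_dir"
```

Taking the content on stdin keeps call sites readable: the file body sits right next to the path it is written to.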

End-to-end verification

The last step curls /api/generate with a "reply with the word ok" prompt:

RESPONSE=$(curl -sS --max-time 90 "$OLLAMA_HOST/api/generate" \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"$PRIMARY_MODEL\",\"prompt\":\"Reply with the single word ok.\",\"stream\":false}" \
  | grep -o '"response":"[^"]*"' | head -1)

Catches the "Ollama daemon installed but not serving" failure mode before you find it in VS Code three minutes later.
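How the captured RESPONSE is then judged is not shown above. One plausible way, sketched under that assumption, is a case-insensitive match on the extracted field. The function name response_is_ok and the sample inputs are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if the extracted "response" field contains "ok".
# Input is the text captured in RESPONSE, for example: "response":"ok"
response_is_ok() {
  local response="$1"
  case "$(printf '%s' "$response" | tr '[:upper:]' '[:lower:]')" in
    *'"response":"'*ok*) return 0 ;;
    *) return 1 ;;
  esac
}

# Illustrative inputs, not captured from a real run.
response_is_ok '"response":"Ok"' && echo "model answered"
response_is_ok ''                || echo "no usable response"
```

An empty string covers both the daemon being down and curl timing out, since the grep in the pipeline then extracts nothing.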

What ~/.continue/config.json looks like

The script writes a Continue config that puts both chat models in the dropdown, points tab autocomplete at Qwen specifically, and wires nomic-embed-text in for embeddings:

{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7B (local) — code",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    },
    {
      "title": "Gemma 4 e4b (local) — chat / docs / multimodal",
      "provider": "ollama",
      "model": "gemma4:e4b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 7B (autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://localhost:11434"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "apiBase": "http://localhost:11434"
  },
  "contextProviders": [
    { "name": "code"     }, { "name": "docs"     },
    { "name": "diff"     }, { "name": "terminal" },
    { "name": "open"     }, { "name": "codebase" },
    { "name": "folder"   }
  ],
  "allowAnonymousTelemetry": false
}

Tab completion stays on Qwen because it's noticeably better line-by-line. Chat lets you pick per-question from the dropdown — Qwen for "write a function", Gemma for "explain this error" or "what's in this screenshot".
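A quick sanity check for a config like this is to list the model names it references and compare them with what is pulled locally. The sketch below assumes the key layout shown above. The helper name config_models is hypothetical, and grep and sed stand in for a real JSON parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unique "model" values found in a Continue config.
# Text matching, not JSON parsing: fine for the flat layout above, fragile otherwise.
config_models() {
  local config="$1"
  grep -o '"model"[[:space:]]*:[[:space:]]*"[^"]*"' "$config" \
    | sed 's/^"model"[[:space:]]*:[[:space:]]*"//; s/"$//' \
    | sort -u
}

# Demo on a trimmed-down copy of the config shown above.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
{
  "models": [
    { "provider": "ollama", "model": "qwen2.5-coder:7b" },
    { "provider": "ollama", "model": "gemma4:e4b" }
  ],
  "tabAutocompleteModel": { "provider": "ollama", "model": "qwen2.5-coder:7b" },
  "embeddingsProvider":   { "provider": "ollama", "model": "nomic-embed-text" }
}
EOF
config_models "$demo"
rm -f "$demo"
```

Pointing it at ~/.continue/config.json and comparing the result with ollama list shows at a glance whether the config names a model that was never pulled.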

Daily workflow

After install:

code .                              # open any project
# Cmd+L                             # open the Continue chat panel
# Pick a model from the dropdown above the input
# Type: @codebase what does this project do?
#       (the embedding model retrieves relevant chunks first)
# Select code, Cmd+I, describe edit (inline rewrite, accept hunk-by-hunk)

Drag a screenshot into the chat input and ask Gemma 4 about it — works for stack traces, terminal output, architecture diagrams, anything visual. Qwen 2.5 Coder is text-only and can't do that.

CLI-only smoke tests:

$ ollama run qwen2.5-coder:7b "Refactor this Go function to use generics: ..."
$ ollama run gemma4:e4b "Explain this stack trace: ..."
$ ollama ps                           # what's resident in RAM right now

Adding more models:

# Option A — edit LLM_MODELS in the script, re-run
# Option B — direct pull, then add manually to ~/.continue/config.json
ollama pull qwen2.5-coder:32b       # if you have 32 GB+ unified RAM

Restarting the runtime:

$ brew services restart ollama        # or: stop / start

Continue automatically reconnects when Ollama comes back up.
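If a script needs to wait for the API after a restart, a small polling loop is enough. This sketch is an assumption about how one might do it, not something taken from mac_setup_llm.sh. The helper names wait_until and ollama_up are hypothetical, and the probe uses Ollama's /api/tags model-listing endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a probe command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Probe: ask the Ollama API for its model list. -f makes curl fail on HTTP errors.
ollama_up() {
  curl -fsS --max-time 2 "${OLLAMA_HOST:-http://localhost:11434}/api/tags"
}

# Example (commented out so the sketch does not block when Ollama is absent):
# brew services restart ollama && wait_until 30 1 ollama_up && echo "Ollama is back"
```

Keeping the probe separate from the retry loop means the same wait_until works for any other readiness check.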

Coexisting with Claude Code

I'm explicit about the split:

              Claude Code                           Local stack
Use for…      hard agentic tasks, multi-file        inline edits, autocomplete,
              refactors, planning, "fix this        "explain this", privacy-sensitive
              failing test suite"                   snippets, offline work
Model         Anthropic Sonnet / Opus               Qwen 2.5 Coder 7B, Gemma 4 e4b
Network       required                              local-only
Cost          per-token                             electricity

Claude Code stays exactly as it is — no proxy layer, no LiteLLM, no trying to swap in a local model. Cmd+L in VS Code gets me Continue with a local model; claude in a terminal gets me the hosted stack. They never step on each other.

Gotchas I hit

A few non-obvious things that bit me while building this out — worth knowing if you copy the script.

code CLI on PATH before running. The script tries code --install-extension Continue.continue. If you haven't run VS Code's Shell Command: Install 'code' command in PATH from Cmd+Shift+P yet, the script prints a warning and you have to re-run after fixing.
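A pre-flight check for this takes a few lines. The sketch below is illustrative rather than the script's own code, and the function name have_cli is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: is a given CLI on PATH? Defaults to VS Code's `code`.
have_cli() {
  command -v "${1:-code}" >/dev/null 2>&1
}

if have_cli code; then
  echo "code CLI found: $(command -v code)"
else
  echo "code CLI missing: run \"Shell Command: Install 'code' command in PATH\" in VS Code, then re-run" >&2
fi
```

Running the check first turns a confusing mid-script failure into a clear instruction at the start.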

Window reload after extension install. After code --install-extension puts Continue on disk, Cmd+L doesn't work in an already-open VS Code window — the extension is installed but not activated. Cmd+Shift+P → Developer: Reload Window fixes it. Or quit and reopen. The Activity Bar icon and Cmd+L both wake up after the reload.

Bash 3.2 + UTF-8 ellipsis = an unbound-variable nightmare. macOS still ships bash 3.2 as /bin/bash for license reasons. With set -u, this line:

echo_step "Ensuring Ollama API is reachable at $OLLAMA_HOST…"

aborts the script with OLLAMA_HOST?: unbound variable. The parser slurped the three UTF-8 bytes of the ellipsis character (U+2026, \xe2\x80\xa6) into the variable name and looked up OLLAMA_HOST\xe2\x80\xa6, which is unbound. The ? in the message is bash's error formatter mangling those bytes when it prints the name. The fix is to brace the variable so the parser sees a hard boundary:

echo_step "Ensuring Ollama API is reachable at ${OLLAMA_HOST}…"

I now grep my own scripts for \$[A-Za-z_][A-Za-z_0-9]*… after writing them. (Or upgrade to bash 5 via Homebrew. Either works.)
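That grep can be wrapped into a small lint, sketched here with a hypothetical function name and a throwaway demo file:

```shell
#!/usr/bin/env bash
# Hypothetical lint: flag lines where an unbraced $VAR is immediately followed by "…".
# Prints matching lines with line numbers; exit status 0 means at least one hit.
find_unbraced_before_ellipsis() {
  grep -n '\$[A-Za-z_][A-Za-z_0-9]*…' "$@"
}

# Demo on a throwaway file containing one risky and one safe line.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
echo "Reaching $OLLAMA_HOST…"
echo "Reaching ${OLLAMA_HOST}…"
EOF
find_unbraced_before_ellipsis "$demo"
rm -f "$demo"
```

Only the unbraced line is reported, because in the braced form the character after $ is { rather than a letter.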

Tag normalization. ollama list always shows an explicit tag, so the embedding model appears as nomic-embed-text:latest. The script's array has "nomic-embed-text" with no tag. A grep -Fxq exact match therefore misses, and the script "re-pulls" (Ollama dedupes layers, so it's effectively instant, but the log message is misleading). The fix is to normalize the name before comparing:

expected="$model"
[[ "$model" == *:* ]] || expected="${model}:latest"
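In context, the normalization and the exact-match lookup might fit together as below. This is a sketch under assumptions: the function names are hypothetical, and a hard-coded list stands in for real ollama list output so the example runs without Ollama installed.

```shell
#!/usr/bin/env bash
# Append ":latest" when a model name carries no explicit tag.
normalize_tag() {
  case "$1" in
    *:*) echo "$1" ;;
    *)   echo "$1:latest" ;;
  esac
}

# Hypothetical helper: is the model present in a list of NAME:TAG lines on stdin?
# With the real CLI the list could come from: ollama list | awk 'NR > 1 { print $1 }'
model_in_list() {
  grep -Fxq "$(normalize_tag "$1")"
}

# Illustrative list standing in for `ollama list` output.
installed="qwen2.5-coder:7b
nomic-embed-text:latest"

printf '%s\n' "$installed" | model_in_list "nomic-embed-text" && echo "already pulled"
```

The -F and -x flags matter here: fixed-string, whole-line matching keeps qwen2.5-coder:7b from matching a longer tag that merely starts with the same characters.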

What I'd add next

Things I'm experimenting with but didn't bake into the default script:

Adding any of these is one line in LLM_MODELS and a re-run — the script is built for that.

Try it

One-shot install
$ git clone https://github.com/marchildmann/IDS-Scripts.git
$ cd IDS-Scripts
$ ./mac_setup_llm.sh

About 15 GB of model downloads on first run; subsequent re-runs take seconds. The script tells you exactly which step it's on (and what was preserved vs. regenerated), and the final summary prints the URLs, models, shortcuts, and a daily-workflow cheatsheet.

Three commands and your Mac is talking to itself about code. Not a bad way to spend a coffee.