How to Run Claude Code Offline with Local Models (LM Studio + Mac)
I was on a 10-hour flight last month, laptop open, working on a side project. No Wi-Fi. No API calls. And yet Claude Code was running on my machine, helping me refactor code, write tests, and scaffold new features — all powered by a local model through LM Studio.
Here's how I set it up, and why you might want to.
Two reasons to run Claude Code locally
1. You're on a plane (or anywhere without internet)
This is the obvious one. You're flying, you're on a train, you're at a café with terrible Wi-Fi. You've got 4 hours of uninterrupted focus time and a codebase that needs work. Normally Claude Code is useless without a connection. But with a local model, you get the same agentic workflow — file exploration, code edits, tool calls, patch application — running entirely on your machine.
2. You want to stop burning through your Claude plan
This one is less obvious but arguably more practical day-to-day.
If you're on Claude's Max plan, you're essentially limitless — but you're also paying for it. If you're on a lower tier (Pro), you'll eventually hit rate limits, and it always happens at the worst time, right in the middle of a flow state.
You can work around this by using the API directly and adding credits to your account, but if you're using Claude Code heavily, the costs add up fast and it becomes unsustainable.
A better approach: use a local model for the smaller tasks and save your Anthropic credits for when you actually need them. Writing boilerplate, scaffolding components, running simple refactors, adding tests for straightforward functions — a local model handles these fine. When you hit something that genuinely requires deep thinking or complex multi-file reasoning, switch back to Claude's cloud models.
And here's the thing that surprised me: if your project is well-structured with a solid CLAUDE.md and your skills/agents files properly set up, even the local model becomes remarkably capable. The model doesn't have to be a genius when your project context is doing half the heavy lifting.
What you'll need
- A computer with enough RAM. LM Studio runs on Mac, Windows, and Linux — this setup is not Mac-only. That said, Macs with Apple Silicon have a significant advantage for running local models. On a PC, your model has to fit in your GPU's VRAM (typically 8–24GB on consumer cards like the RTX 4090). If it doesn't fit, it either won't run or crawls at unusable speeds. On a Mac, the CPU and GPU share the same unified memory pool — your entire RAM is available for the model. That means a MacBook Pro with 128GB can load an 80B parameter model that would require multiple expensive GPUs on a PC. If you're on Windows/Linux with a high-VRAM GPU, everything in this guide still works — just use GGUF model formats instead of MLX.
- At least 32GB of memory (unified memory on Mac, or VRAM + RAM on PC). At 16GB it'll technically run with small models, but the experience will be rough.
- LM Studio — download it from lmstudio.ai.
- Claude Code — install via npm:
npm install -g @anthropic-ai/claude-code
- A good local model (see recommendations below).
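Unsure how much memory you're working with? A quick portable check (the helper name is mine, not a standard tool):

```shell
# Print total physical memory in GB (works on macOS and Linux).
total_mem_gb() {
  if sysctl -n hw.memsize >/dev/null 2>&1; then
    # macOS: hw.memsize reports bytes
    echo $(( $(sysctl -n hw.memsize) / 1073741824 ))
  else
    # Linux: MemTotal in /proc/meminfo is in kB
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }' /proc/meminfo
  fi
}
total_mem_gb
```

On a PC, remember this reports system RAM, not GPU VRAM — check your GPU's spec sheet for that.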
Which model should you use?
This matters more than you'd think. Claude Code is an agentic tool — it doesn't just generate code, it explores codebases, sequences tool calls, applies patches, and runs commands. You need a model that's strong at tool use and code repair, not just raw intelligence.
After researching the options, here's how the main contenders stack up for Claude Code specifically:
| Criteria | Qwen3-Coder-Next (80B) | GLM-4.7-Flash (30B) | Qwen3.5-35B-A3B | Qwen3-Coder-30B |
|---|---|---|---|---|
| Agentic tool use | ★★★ Best | ★★★ Great | ★ Weak | ★★ Mid |
| Code repair (SWE-bench) | ★★★ Sonnet 4.5-class | ★★★ 59.2% | ★★ Mid | ★★ Good |
| General reasoning | ★★ Strong | ★ Weakest | ★★★ Best | ★★ Mid |
| Inference speed | ★★ Fast (3B active) | ★★★ Fastest | ★★★ Fast | ★ Slower |
| RAM needed (Q4) | ~51GB | ~17GB | ~20GB | ~18GB |
| Claude Code fit | ★★★ Best overall | ★★★ Best for its size | ★ Worst | ★★ Mid |
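The "RAM needed (Q4)" row follows a rough rule of thumb: 4-bit weights plus runtime overhead come to roughly 0.64 bytes per parameter. A back-of-the-envelope check (my approximation, not an official figure; actual files vary by quant recipe):

```shell
# Rough Q4 footprint: ~0.64 GB per billion parameters (weights + overhead).
q4_ram_gb() { awk -v p="$1" 'BEGIN { printf "%.0f\n", p * 0.64 }'; }
q4_ram_gb 80   # → 51  (matches the ~51GB figure for the 80B model)
q4_ram_gb 30   # → 19  (the ~17GB in the table suggests a slightly leaner quant)
```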
Which model for your hardware?
Your available memory determines which tier you're in — and the difference is significant. On a Mac, this is your unified memory (e.g., 32GB, 64GB, 128GB). On a PC, it's primarily your GPU VRAM, though with partial offloading to system RAM you can stretch further.
64GB or more → Qwen3-Coder-Next (80B). This is the best local model for Claude Code, period. It was specifically designed for coding agents and tool use, with SWE-Bench Pro performance roughly on par with Claude Sonnet 4.5. At ~51GB (Q4) it fits comfortably with plenty of headroom for context window and your OS. If you have 128GB, you can run it at higher quantization (6-bit or 8-bit) for even better quality, or push the context length well beyond the default 25K tokens.
32–48GB → GLM-4.7-Flash (30B). At just ~17GB (Q4) it loads fast, runs fast, and punches way above its weight on agentic tasks — 59.2% on SWE-bench Verified and 79.5% on tau2-Bench for multi-step tool use. It also maintains consistent performance across all difficulty levels, unlike the Qwen models which tend to crater on complex multi-file tasks. This is the sweet spot for most people.
24GB → GLM-4.7-Flash at Q4 is your only real option. It'll work, but context length will be tight. Keep your tasks small and focused.
16GB → Not recommended. You can technically squeeze in a small model, but the experience with Claude Code will be frustrating. Better to use cloud mode.
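The tiers above boil down to a tiny decision table. As a sketch (the function name is mine; model labels follow this guide):

```shell
# Map available memory (GB) to the model tiers described above.
recommend_model() {
  local mem_gb=$1
  if   [ "$mem_gb" -ge 64 ]; then echo "Qwen3-Coder-Next (80B)"
  elif [ "$mem_gb" -ge 32 ]; then echo "GLM-4.7-Flash (30B)"
  elif [ "$mem_gb" -ge 24 ]; then echo "GLM-4.7-Flash (30B, tight context)"
  else                            echo "Not recommended -- use cloud mode"
  fi
}
recommend_model 48   # → GLM-4.7-Flash (30B)
```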
A note on model formats: MLX vs GGUF
When you download a model in LM Studio, you'll see two format options: GGUF and MLX.
GGUF is the universal format — it works on Mac, Windows, and Linux. It's the safe default if you're not on Apple Silicon.
MLX is Apple's machine learning framework, built specifically for the unified memory architecture of M-series chips. MLX models are typically 20–30% faster than GGUF on the same Mac hardware because they're optimized for how Apple Silicon accesses memory. The speed gap gets even wider with larger models.
If you're on a Mac with Apple Silicon, always pick the MLX version. If you're on Windows or Linux, go with GGUF. In LM Studio, both show up side by side when you search for a model — just pick the right one for your platform.
Setup guide
Step 1: Set up LM Studio as a server
- Open LM Studio and download your chosen model (e.g., Qwen3-Coder-Next if you have 64GB+, or GLM-4.7-Flash for 32–48GB).
- Go to the Developer tab (Cmd+2).
- Load your model (Cmd+L).
- Set the context length to at least 25,000 tokens — Claude Code is context-hungry. If you have RAM headroom (e.g., 128GB with Qwen3-Coder-Next), push this higher — 64K or more will give Claude Code much more room to work with.
- Start the server (Cmd+R). It will be available at http://localhost:1234.
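Before moving on, you can confirm the server is actually listening. LM Studio exposes an OpenAI-compatible API, so a quick probe of /v1/models does the trick (the helper name is mine):

```shell
# Ping LM Studio's OpenAI-compatible endpoint.
check_lm_studio() {
  if curl -sf --max-time 2 http://localhost:1234/v1/models >/dev/null; then
    echo "LM Studio server is up"
  else
    echo "LM Studio server not reachable"
  fi
}
check_lm_studio
```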
Step 2: Bypass Claude Code's login screen
This is the tricky part. Even if you set ANTHROPIC_BASE_URL to your local server, Claude Code still forces a login screen. Setting environment variables alone won't work — I learned this the hard way.
The fix is to use Claude Code's official apiKeyHelper feature:
# Create the helper script
mkdir -p ~/.claude
cat > ~/.claude/api-key-helper.sh << 'EOF'
#!/bin/bash
echo "lm-studio"
EOF
chmod +x ~/.claude/api-key-helper.sh
Then add it to ~/.claude/settings.json:
{
"apiKeyHelper": "/Users/YOUR_USERNAME/.claude/api-key-helper.sh",
"env": {
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
}
}
Replace YOUR_USERNAME with your macOS username (run whoami to check).
Important: Do NOT also set ANTHROPIC_API_KEY as an environment variable when using apiKeyHelper — they conflict and you'll get an auth warning.
Why CLAUDE_CODE_ATTRIBUTION_HEADER: 0? Claude Code prepends an attribution header to requests that invalidates the KV cache on local models, making inference up to 90% slower. This setting must be in settings.json, not as an env export.
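If you already have other keys in ~/.claude/settings.json, merging with jq avoids clobbering them. A sketch (assumes jq is installed):

```shell
# Merge apiKeyHelper and the attribution fix into settings.json non-destructively.
settings="$HOME/.claude/settings.json"
mkdir -p "$HOME/.claude"
[ -f "$settings" ] || echo '{}' > "$settings"
jq --arg helper "$HOME/.claude/api-key-helper.sh" \
   '.apiKeyHelper = $helper
    | .env.CLAUDE_CODE_ATTRIBUTION_HEADER = "0"' \
   "$settings" > "$settings.tmp" && mv "$settings.tmp" "$settings"
```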
Step 3: Set environment variables and launch
export ANTHROPIC_BASE_URL="http://localhost:1234"
export ANTHROPIC_MODEL="zai-org/glm-4.7-flash" # matches your loaded model
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude
You should see Claude Code start directly — no login screen.
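To avoid retyping the exports, you can wrap them in a small shell function (names are mine; drop it in your .zshrc or .bashrc):

```shell
# Set the local-mode environment; pass a model id to override the default.
claude_local_env() {
  export ANTHROPIC_BASE_URL="http://localhost:1234"
  export ANTHROPIC_MODEL="${1:-zai-org/glm-4.7-flash}"
  export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
}

# Usage:            claude_local_env && claude
# Override model:   claude_local_env "some-other/model-id" && claude
```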
Switching between local and cloud
This is where it gets practical. If you're using local models for smaller tasks and Anthropic for complex ones, you don't want to be editing config files every time you switch.
I built a launcher script that handles everything:
- claude-launcher local — Checks if LM Studio is running, auto-detects the loaded model, creates the apiKeyHelper, sets all the right env vars, applies the KV cache fix, and launches Claude Code.
- claude-launcher cloud — Removes apiKeyHelper, cleans up local overrides from settings.json, unsets env vars, and launches Claude Code with normal OAuth login.
- claude-launcher status — Shows whether LM Studio is online, which model is loaded, and which auth mode is active.
- claude-launcher (no args) — Interactive menu with all of the above.
Here's what it looks like in practice:
┌─────────────────────────────────────┐
│ Claude Code Launcher 🚀 │
└─────────────────────────────────────┘
LM Studio: ● online — zai-org/glm-4.7-flash
1) 🖥 Local mode (LM Studio)
2) ☁️ Cloud mode (Anthropic)
3) 📊 Status
q) Exit
The full script is published as a GitHub Gist — open it on GitHub to grab a copy.
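If you'd rather roll your own, the core of the local mode fits in a few lines. A sketch, assuming LM Studio's OpenAI-compatible /v1/models endpoint and jq (the full script adds the cloud/status modes and settings.json handling):

```shell
# Extract the first loaded model id from a /v1/models JSON response.
detect_model() { jq -r '.data[0].id // empty'; }

launch_local() {
  local json model
  json=$(curl -sf --max-time 2 http://localhost:1234/v1/models) || {
    echo "LM Studio is not running -- start the server first" >&2
    return 1
  }
  model=$(printf '%s' "$json" | detect_model)
  export ANTHROPIC_BASE_URL="http://localhost:1234"
  export ANTHROPIC_MODEL="$model"
  export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
  claude
}
```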
What to expect (honest take)
Local models are not Claude. Let's be clear about that.
What works well locally: Scaffolding, boilerplate generation, simple refactors, writing tests for well-defined functions, code review, renaming and restructuring. Basically any task where the intent is clear and the scope is contained.
What you should save for Anthropic's models: Complex multi-file reasoning, architectural decisions, debugging subtle issues, anything that requires deep thinking across a large context. When you need these, switch to cloud mode and use the real thing.
The multiplier: A well-structured project makes a huge difference. If you have a solid CLAUDE.md, well-defined skills files, and proper agent configuration, even a smaller local model can produce solid results. The model doesn't need to be brilliant when your project context is telling it exactly what conventions to follow, what patterns to use, and how your codebase is organized. Invest in your project scaffolding — it pays off whether you're running locally or in the cloud.
Speed and context: Local inference is noticeably slower than cloud. On a 32–48GB Mac with GLM-4.7-Flash, expect 25–32K context. On a 64GB+ Mac with Qwen3-Coder-Next, you can push to 64K+ which makes a real difference for larger codebases. Either way, keep your tasks focused and you'll be fine.
TL;DR
- Pick the right model for your RAM: Qwen3-Coder-Next (64GB+) for Sonnet-class performance, GLM-4.7-Flash (32–48GB) for the best bang-per-gigabyte.
- Use apiKeyHelper to bypass Claude Code's login screen (env vars alone won't work).
- Set CLAUDE_CODE_ATTRIBUTION_HEADER: 0 in settings.json to avoid a 90% speed penalty.
- Use a launcher script to toggle between local and cloud modes without editing files.
- Use local for small tasks, Anthropic for complex ones — this is the sustainable way to use Claude Code without burning through your plan or your wallet.
Next time you board a flight, open your laptop and code. No Wi-Fi required.