№ 13 · Learn With Darin · Field Guide

Local models: a practitioner's field guide.

Running open-weight LLMs on your own hardware: when it actually pays off, when it absolutely does not, and what the runtimes, quantization choices, and hardware tiers really look like in May 2026.

Updated May 2026 · ~25 min read · Covers Ollama, LM Studio, llama.cpp, MLX
Part 01

When local actually makes sense

Most practitioners are better served by a frontier API for the hard work and a local model for the privacy-sensitive slice. That's the honest version of this guide compressed into a sentence. The rest of it is the texture around that claim, because the cases where local genuinely earns its keep are real, and the cases where it doesn't are easy to talk yourself into anyway.

The cases where running locally is the right answer:

  • Data residency and privacy. The prompt and the response never leave the machine. For draft work involving anything regulated (health, legal, internal financial, unreleased product details), this is the strongest argument and often the only one that matters.
  • Offline capability. On a plane, on bad hotel wifi, in a basement with no signal, a local model keeps working. For travel and field work, this is genuinely valuable.
  • No per-token cost. Once the hardware exists, the marginal cost of a generation is electricity. For batch jobs that would burn through API credits (summarizing thousands of documents, classifying a corpus, embedding a backlog), this changes the economics.
  • No rate limits. You hit your hardware ceiling, not someone else's pricing tier.
  • Learning. Understanding what a model is, what tokens look like coming out, what "context window" means when the machine you can see is the one running out of memory: there is no faster education than running an LLM locally.

The cases where it isn't:

  • You need the best answer. Frontier closed models (Sonnet 4.5, GPT-5, Gemini 3 Pro) are still meaningfully better than the best open-weights model that fits on consumer hardware. The gap has narrowed, not closed.
  • You need speed. A 70B model on an M3 Max generates around 8 tokens per second. The same prompt against a hosted API streams at 60 to 120 tokens per second. For interactive use, the cloud feels like a different product.
  • You want breadth. Hosted services give you image, audio, video, tools, web search, and code execution behind one API. Local gets you text generation and (with effort) image input. Multimodal beyond that is possible but rough.
  • You don't want to operate it. Local models are software you run. They need updates, disk space, occasional troubleshooting, and a passing understanding of what's happening when something goes wrong.
Note: The mistake I see most often is people running local because it feels more serious or more private, then getting frustrated that the answers are worse than what they used to get from a hosted model. If the work is hard and the data isn't sensitive, use the better tool. Local is for the slice where the privacy or cost argument is real.
Part 02

The runtimes, ranked by approachability

"Running a model locally" is shorthand for "using a runtime that loads a model file and exposes a way to talk to it." There are five worth knowing about. They're built on overlapping foundations (most of the ergonomic ones wrap llama.cpp), so picking one is a question of how much you want to manage versus how much control you want.

Ollama

  • The easiest start. Install, then ollama run llama3.3 and you're talking to a model.
  • CLI-first, with a small GUI app on Mac and Windows that's mostly a launcher.
  • Models pulled from Ollama's library; a Modelfile lets you customize system prompts, parameters, and quantizations (a minimal example follows this list).
  • Exposes an OpenAI-compatible HTTP server on port 11434 by default. This is the killer feature for tool integration.
  • Best for: most people, most of the time.
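
A Modelfile is a few lines of config. A minimal sketch (the name and parameter values here are illustrative, not a recommendation):

# Modelfile
FROM llama3.3
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM "You are a terse assistant for code review."

# build and run the customized variant
ollama create terse-reviewer -f Modelfile
ollama run terse-reviewer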

LM Studio

  • Polished GUI app on Mac, Windows, Linux. The right answer for non-CLI users.
  • Built-in chat interface, model browser pulling from Hugging Face, and a one-click localhost server.
  • Better at surfacing what model variants exist (GGUF, MLX, different quantizations) and what your hardware can actually run.
  • Apple Silicon performance is excellent; Intel Mac is marginal and getting worse with each release.
  • Best for: GUI users, model exploration, anyone who wants the friendly version of llama.cpp.

llama.cpp

  • The C++ runtime everyone else is built on. Ollama and LM Studio both ship it under the hood.
  • Maximum control: every quantization, every backend (CUDA, ROCm, Metal, Vulkan, CPU), every flag.
  • You build it (or grab a release binary) and run llama-server or llama-cli directly.
  • Has its own OpenAI-compatible server.
  • Best for: when Ollama or LM Studio's defaults aren't right and you know exactly what you want to change.

Two more runtimes that matter for specific situations:

MLX (Apple Silicon)

  • Apple's native framework for ML on M-series chips. The mlx-lm package is the LLM-focused entry point.
  • Faster than llama.cpp on Apple Silicon for most models, sometimes by 30 to 50 percent.
  • CLI-only, smaller model selection (anything on Hugging Face tagged mlx), more rough edges.
  • Best for: Apple Silicon users who want the fastest inference and don't mind a CLI.

vLLM

  • Server-grade runtime for batched inference. GPU required (Nvidia primarily, AMD experimental).
  • Designed for many concurrent users from one box, with continuous batching and paged attention.
  • Significantly faster than llama.cpp on the same GPU when serving multiple requests at once.
  • OpenAI-compatible server is first-class.
  • Best for: hosting an open-weights model for a team or an internal app, not for running on your laptop.

Honorable mentions

  • Jan: open-source LM Studio alternative, Electron-based, getting better.
  • GPT4All: another GUI option, friendlier than LM Studio, smaller model catalog.
  • Text Generation WebUI (Oobabooga): power-user web UI with extensions and fine-tuning hooks.
  • Hugging Face Transformers: not a runtime so much as the Python library. Good for experimentation, not for daily use.
Tip: Start with Ollama. Once you've used it for a week and have a sense of what models work on your hardware, you'll know whether to graduate to llama.cpp (more control) or LM Studio (more discoverability) or stay where you are. Most practitioners stay with Ollama.

A short Ollama session, end to end:

# pull and run a model
ollama run llama3.3

# pull without running, then list
ollama pull qwen3:30b
ollama list

# the server is already running on 11434
curl http://localhost:11434/api/tags

# stop a running model to free memory
ollama stop llama3.3
Part 03

Quantization, in plain English

Every open-weights model is published as a pile of numbers (the weights). The original numbers are 16-bit or 32-bit floats. That's a lot of bytes per number and a lot of memory to load. Quantization is the trick of using fewer bits per weight, accepting a little quality loss for a lot of memory savings.

The naming convention you'll see most often comes from llama.cpp's GGUF format: Q4_K_M, Q5_K_S, Q8_0, and so on. The leading Q<n> is the bit width. The suffix describes the mix: which weights got more bits and which got fewer.

The practical translation:

| Quantization | Bits/weight | Memory vs FP16 | Quality vs original | When to use |
|---|---|---|---|---|
| FP16 (no quantization) | 16 | 100% | baseline | You have GPU memory to burn. |
| Q8_0 | 8 | ~50% | indistinguishable | Headroom available; safety margin. |
| Q6_K | ~6 | ~38% | basically indistinguishable | The "quality with savings" pick. |
| Q5_K_M | ~5 | ~32% | very small loss | Sweet spot if Q4 feels thin. |
| Q4_K_M | ~4 | ~25% | mild, usually unnoticeable | The sane default. |
| Q3_K_M | ~3 | ~19% | noticeable but workable | Last-resort fitting on small hardware. |
| Q2_K | ~2 | ~13% | noticeably degraded | Only if Q3 won't fit and you need it to run at all. |

For most practitioners, the rule is: start with Q4_K_M, move to Q5_K_M if quality feels thin, move to Q8_0 if you have memory to spare. Below Q4 the quality drop is real; above Q5 the memory cost rarely pays for itself.
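
The arithmetic behind that table is simple enough to sanity-check yourself. A back-of-envelope sketch in Python (Q4_K_M averages roughly 4.5 bits per weight in practice, and the 1.1 overhead factor is a rough allowance for embeddings and runtime buffers, not a measured constant):

# rough memory estimate for a quantized model, excluding KV cache
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 1.1) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

for bits, name in [(16, "FP16"), (8, "Q8_0"), (4.5, "Q4_K_M")]:
    print(f"70B at {name}: ~{model_memory_gb(70, bits):.0f} GB")
# 70B at FP16:   ~154 GB
# 70B at Q8_0:   ~77 GB
# 70B at Q4_K_M: ~43 GB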

One more piece of vocabulary. K-quants (the ones with _K_ in the name) are llama.cpp's smarter quantization that gives more bits to the weights that need them. Legacy quants like Q4_0 or Q4_1 still exist but are strictly worse than the K-quant of the same bit width. If you have a choice, take the K version.

Note: Apple Silicon's MLX format does its own quantization (4-bit, 8-bit) that's not directly comparable to GGUF Q4 or Q8. It's a different tradeoff curve, generally favorable on M-series. If you're on a Mac and the model is available in MLX format, it's usually the right pick.
Part 04

Hardware reality

The single most important number for running local models is how much fast memory you have. Not disk, not CPU cores: the memory the GPU (or Apple's unified memory) can read at speed. That number sets the ceiling on what you can run.

Apple Silicon: the dark-horse story

Apple's M-series chips share memory between CPU and GPU. The whole 64 or 128 gigabytes is GPU memory if it needs to be. On every other platform, GPU memory is a separate, much smaller pool, and getting more of it is expensive. This is the architectural detail that quietly turned MacBooks into the most cost-effective local-LLM machines on the market.

An M3 Max 128GB MacBook Pro at around $4,500 will run a 70B model at Q4 entirely in memory and generate at about 8 to 10 tokens per second. The Nvidia rig that runs the same model at the same quality starts at two RTX 4090s ($3,200 for the cards) plus a workstation to host them. Past that point, Nvidia's real advantage is batch throughput, which most single-user local setups don't need.

Nvidia: still the answer for the upper end

If you're going past 70B, or if you need the throughput of batched inference, Nvidia is still where the headroom is. CUDA is the most mature backend, every runtime supports it, and the 24GB and 48GB cards are the standard.

AMD and CPU-only

ROCm has gotten substantially better. As of mid-2025, llama.cpp on a Radeon RX 7900 XTX is competitive with a 4080 on most models, but the driver story is still bumpier than CUDA, and not every runtime supports it equally well. CPU-only inference works for models up to 8B if your RAM is fast (DDR5 helps a lot), but the tokens-per-second numbers are slow enough that it's better thought of as "batch tool" than "interactive."

Rough numbers as of May 2026, on Q4 quantizations, single-user, generating short replies:

| Hardware | 8B model | 30B model | 70B model | Notes |
|---|---|---|---|---|
| M2 Pro 32GB | ~40 tok/s | ~14 tok/s | won't fit | Capable laptop tier. |
| M3 Max 64GB | ~55 tok/s | ~20 tok/s | ~9 tok/s | Workstation tier; 70B fits cleanly. |
| M4 Max 128GB | ~70 tok/s | ~28 tok/s | ~12 tok/s | Top of consumer Apple Silicon. |
| RTX 4090 (24GB) | ~120 tok/s | ~45 tok/s | ~4 tok/s (split GPU+CPU) | Fast on what fits, painful on what doesn't. |
| 2× RTX 4090 (48GB) | ~120 tok/s | ~60 tok/s | ~22 tok/s | The "70B at speed" rig. |
| A6000 (48GB) | ~95 tok/s | ~50 tok/s | ~18 tok/s | Workstation card; quieter, lower TDP. |
| RX 7900 XTX (24GB) | ~95 tok/s | ~38 tok/s | doesn't fit | ROCm; check runtime support. |
| CPU only (DDR5, 64GB) | ~7 tok/s | ~2 tok/s | ~0.5 tok/s | Fine for 8B, painful above. |

These are rough. Actual numbers depend on the specific model, the exact quantization, your context length, and what else the machine is doing. Treat the table as orders of magnitude, not benchmarks.

Warning: Don't budget the model's "raw size" as your memory floor. You also need room for the KV cache, which grows with context length. A 70B Q4 model is about 40GB on disk; running it at 16K context can need another 8 to 12GB on top. If you're at the edge of fitting, drop the context window before you drop the quantization.
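
If you want to estimate the KV cache yourself, the back-of-envelope formula is layers × KV heads × head dim × 2 (keys and values) × bytes per element × context length. A sketch with assumed Llama-3-70B-class dimensions (80 layers, 8 KV heads via grouped-query attention, head dim 128; check your model's config for the real values):

def kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                ctx_len=16384, bytes_per_elem=2):
    # keys + values stored per layer, per token
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_len / 1e9

print(f"~{kv_cache_gb():.1f} GB of KV cache at 16K context")  # ~5.4 GB at FP16

The runtime adds scratch and compute buffers on top of the raw cache, which is how the practical figure lands in the 8 to 12GB range.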
Part 05

The model landscape

Open-weights releases come constantly, and the leaderboard reshuffles every few months. As of May 2026, these are the families that actually matter for local use:

  • Llama 3.3 (8B, 70B). Meta's open-weights workhorse. The 8B is the best small model for general chat; the 70B is the best general-purpose model that fits on serious consumer hardware. The 405B exists but is impractical to run locally for almost everyone. If you're not sure what to pull, pull Llama 3.3.
  • Qwen 3 (multiple sizes up to 235B). Alibaba's family. Particularly strong at code and multilingual tasks; the 30B and 72B variants are excellent and frequently outperform Llama 3.3 70B on benchmarks. The MoE variants give better tokens-per-second at the cost of more memory. My current daily-driver for code completion locally.
  • Mistral and Mixtral (Small 3, Large 2). The European answer. Mistral Small 3 (24B) is a quietly excellent generalist that punches above its weight class; Mixtral 8x22B and Large 2 are bigger but harder to fit. See the dedicated Mistral guide for more on the family.
  • DeepSeek V3 / R1. Mixture-of-experts at huge scale (670B+ total parameters, ~37B active). Quality is excellent but local hosting requires serious hardware: 4× A100 territory, or M3/M4 Ultra with 192GB+ for a quantized fit. For most local users, this is "interesting on paper."
  • Phi-4 (14B) from Microsoft. Small, focused, unusually capable for its size. The benchmark scores look implausible until you actually use it; it's not as good as Llama 3.3 70B but it's surprisingly close at a fraction of the resource cost. The dark-horse pick if you want something smarter than 8B but lighter than 30B.
  • Gemma 3 (1B, 4B, 12B, 27B) from Google. The open-weights side of Google's lineup. Multimodal at 4B and above (image input). The 27B variant is competitive with Llama 3.3 70B on many tasks at much lower memory cost. The smaller ones are the right pick for embedding-class hardware (single-board computers, Pi 5, edge devices).

Two patterns to internalize. First, model size is not the whole story. A well-distilled 30B model will beat a poorly-trained 70B on the tasks it was trained for. Second, "better on benchmarks" doesn't always mean "better in your hands". Pull two candidates, run them on the actual prompts you care about, pick the one whose outputs you prefer.
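
The comparison doesn't need tooling; a shell loop over your candidates is enough (the prompt is a placeholder, substitute one you actually care about):

# same prompt, two candidates, eyeball the outputs
for m in llama3.3 qwen3:30b; do
  echo "=== $m ==="
  ollama run "$m" "Rewrite this error message for end users: disk quota exceeded"
done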

In practice: For a single workstation in May 2026, my baseline rotation is: Qwen3 30B for code, Llama 3.3 70B for general chat and reasoning, Phi-4 14B as a fast everyday driver, and Gemma 3 4B as a tiny background model for autocomplete and quick classification. About 200GB of disk total. That's enough to cover the workflows the rest of the guide describes.
Part 06

Practical workflows

The workflows below are the ones that have justified the hardware for me, more than the abstract argument about privacy. If your work doesn't touch anything in this list, the case for local is weaker than you think.

i.

Privacy-sensitive drafting.

First-pass writing on anything regulated or pre-disclosure: medical notes, legal drafts, internal performance feedback, unreleased product design. Pull Llama 3.3 70B (or Qwen3 30B if memory is tight), point your editor at the local server, draft inside it. The output never touches a third-party log. Once the draft is anonymized or generalized, you can polish in a hosted model with cleaner prose, but the sensitive first pass stays on the box.

ii.

Offline travel companion.

Before a long flight or a stretch of unreliable connectivity, pull a 14B-class model (Phi-4 or Gemma 3 12B) onto your laptop. It handles writing, code questions, summarization, and translation without a network. The quality ceiling is lower than what you'd get from a frontier API, but "lower ceiling" beats "no answer at all" by a long way.
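
The pre-flight ritual is a couple of pull commands while you still have bandwidth (tags as they appear in Ollama's library at the time of writing; check ollama.com/library if a pull fails):

# pull while you still have a connection; these are several GB each
ollama pull phi4
ollama pull gemma3:12b

# confirm it answers with the network off
ollama run phi4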

iii.

Local RAG over your notes.

Embed your notes (Obsidian, Apple Notes export, a wiki dump) using a local embedding model like nomic-embed-text via Ollama, store them in a small vector DB (LanceDB, Chroma, or even a SQLite extension), and have a local model answer questions over the result. End-to-end privacy, infinite query budget. The quality depends heavily on chunking and on the generator model; use Llama 3.3 70B if you have the memory, Qwen3 30B otherwise.
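
A minimal sketch of that pipeline, using Ollama's embeddings endpoint and a plain in-memory cosine search in place of a vector DB (chunking, persistence, and error handling omitted; endpoint shapes follow Ollama's documented API, worth verifying against your version):

import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# your chunked corpus goes here
notes = ["chunk one of your notes...", "chunk two..."]
index = [(chunk, embed(chunk)) for chunk in notes]

def ask(question: str) -> str:
    qv = embed(question)
    # top 3 chunks by cosine similarity become the context
    top = sorted(index, key=lambda p: cosine(qv, p[1]), reverse=True)[:3]
    context = "\n\n".join(chunk for chunk, _ in top)
    r = requests.post(f"{OLLAMA}/v1/chat/completions", json={
        "model": "qwen3:30b",
        "messages": [
            {"role": "system", "content": "Answer from the provided notes."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ]})
    return r.json()["choices"][0]["message"]["content"]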

iv.

Code completion in the editor.

Continue.dev, Twinny, or Cursor's local-model option will all point at an Ollama server. Use a code-tuned small model (Qwen3-Coder 7B or 14B is the current pick) for inline completion: low latency matters more than absolute quality for completion. Reserve the 30B or 70B for chat-style code questions where you can wait a few seconds.

v.

Batch summarization.

A folder of PDFs, a year of meeting transcripts, an inbox export. Anything where the cost of running it through a hosted API would add up to real money. A small script that loops over the files and calls the local OpenAI-compatible endpoint will run overnight on a workstation and produce summaries for free. The quality is good enough for first-pass triage even if you'd want a frontier model for the summaries you'll read carefully.
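
The whole batch job is a loop. A sketch using the openai Python package pointed at the local server (the directory names and model tag are placeholders):

from pathlib import Path
from openai import OpenAI

# any string works as the key; Ollama doesn't check it
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

Path("summaries").mkdir(exist_ok=True)
for doc in Path("transcripts").glob("*.txt"):
    resp = client.chat.completions.create(
        model="llama3.3",
        messages=[{"role": "user",
                   "content": f"Summarize in five bullets:\n\n{doc.read_text()}"}],
    )
    Path("summaries", doc.name).write_text(resp.choices[0].message.content)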

vi.

"Swap base URL to local" for an existing app.

If you've built or use an internal tool that talks to OpenAI's API, point its OPENAI_API_BASE (or equivalent) at http://localhost:11434/v1 and supply any string for the API key. Most apps will work without other changes. This is the fastest way to test "does our pipeline still work on local" without rewriting anything.

Part 07

The OpenAI-compatible server pattern

Every serious local runtime exposes an HTTP server that speaks the OpenAI Chat Completions API. This is, quietly, the most important fact in the local-LLM ecosystem. It means any tool that knows how to talk to OpenAI also knows how to talk to your laptop.

Ollama is the friendliest version. It runs on port 11434 the moment you install it; no flags, no config:

# point any OpenAI client at it
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=ollama  # any string; auth is not enforced

# then call it like normal
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "say hello"}]
  }'

llama.cpp's llama-server binary does the same, with more flags:

llama-server -m ./models/llama-3.3-70b-q4.gguf \
  --port 8080 \
  --ctx-size 8192 \
  --n-gpu-layers 99

LM Studio exposes a server toggle in its UI that does the same on a port you choose. vLLM ships vllm serve with the same surface for GPU rigs.

Why this matters: it makes local and cloud swappable. The same client code, the same prompt format, the same tool integrations. You can develop against a cheap hosted model for fast feedback, deploy against a local model for a privacy-sensitive customer, fall back to local when the network is down, A/B between them. The pattern that took years for cloud APIs to standardize on (one HTTP shape, swappable backends) is the default for local.
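
In code, the whole pattern is two environment variables. A sketch (variable names follow the common convention the guide uses; some newer clients read OPENAI_BASE_URL instead, so check yours):

import os
from openai import OpenAI

# local: OPENAI_API_BASE=http://localhost:11434/v1  OPENAI_API_KEY=ollama
# cloud: set the real base URL and a real key; the code doesn't change
client = OpenAI(
    base_url=os.getenv("OPENAI_API_BASE", "https://api.openai.com/v1"),
    api_key=os.getenv("OPENAI_API_KEY", "ollama"),
)
reply = client.chat.completions.create(
    model=os.getenv("MODEL", "llama3.3"),
    messages=[{"role": "user", "content": "say hello"}],
)
print(reply.choices[0].message.content)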

Tip: If you're building anything that could plausibly run locally or in the cloud, write the code against the OpenAI shape from day one. Set the base URL from an environment variable. The marginal effort is zero, and it keeps the option open. Almost every serious LLM tool released since 2024 has converged on this convention; betting against it is betting against the consensus.
Part 08

Limits and pitfalls

The places local will disappoint you, in roughly the order you'll meet them.

"The answers are noticeably worse than what I get from Sonnet or GPT-5"

They probably are. The best open-weights model that fits on consumer hardware is roughly a generation behind the frontier closed models on hard reasoning, code, and math. The gap is smaller on simple tasks (summarization, drafting, classification) and bigger on novel problems. If the task is hard, send it to the better model. Local is for the slice where privacy or cost outweighs the quality gap.

"The hardware was expensive and I'm not sure I'm getting my money's worth"

This is real, and honest. A $4,500 MacBook amortizes against API spend slowly: at $0.01 per 1K input tokens, it would take roughly 450 million input tokens to break even, which is a lot. The hardware is worth it if (a) the privacy argument is real for your work, (b) you'd own a comparable laptop anyway and the marginal cost is the spec bump, or (c) you genuinely use it daily. If none of those apply, you bought a status object.

"I thought local was private but a tool I'm using sends prompts somewhere"

The model is private; the wrapper might not be. Some chat UIs, some IDE plugins, and most "AI features" inside third-party apps phone home with telemetry that may include prompt content, even when the model running underneath is local. If privacy is the reason you're running locally, audit the network traffic of the tool you've wrapped around the model. Little Snitch on Mac, a firewall rule on Linux, or simply blocking outbound DNS for the wrapper while letting Ollama keep working: any of these will tell you the truth. "Local model" and "local data flow" are not the same property.

"Updating models is more annoying than I expected"

It is. Every new release means downloading another 20 to 50 gigabytes, deciding which old version to delete, and noticing that the prompt template has changed in some subtle way. There's no auto-update for "always run the best small model." Expect to spend an hour every couple of months keeping your local stack current, or accept that you're freezing on whatever you have.

"My machine gets hot and loud when running a model"

Generation is sustained GPU load. Laptops will throttle, fans will spin up, batteries will drain in 90 minutes. This is normal; the chips are doing what you asked. If it's intolerable, run smaller models, run on AC power, or move the runtime to a desktop and call it from your laptop over the local network.

"Context windows feel small compared to what I'm used to"

Most local models advertise 32K, 128K, or 1M context. They can handle the smaller end well; the longer end usually degrades faster than equivalent hosted models, and the KV cache at long context eats a lot of memory. For long-document work, either chunk the input (RAG-style) or send the long-context job to a hosted model. Don't expect a 70B local model at 128K to behave the way Claude does at 200K.

"Tool support varies wildly between runtimes"

Function calling, JSON mode, vision input, prompt caching: each runtime supports a different subset, often imperfectly. If you depend on a specific feature, test it on the runtime you plan to use before committing. Ollama's tool-calling has improved a lot since 2024 but is still not as solid as the hosted equivalents. LM Studio is closer. Plan for the gaps.

"I tried to run a model and it crashed without a clear error"

Almost always out-of-memory. Either the model plus its KV cache didn't fit, or something else on the machine took the memory you expected to be free. Drop the context window, drop to a smaller quantization, close the browser, try again. If you're on Apple Silicon, check Activity Monitor's "Memory Pressure" graph; if it's red, that's the problem.

Part 09

When to choose local, when to reach for cloud

The cleanest decision frame I've found, in order:

Pick the tool by the constraint that matters most: privacy, capability, cost, or convenience. One of those is doing the work in any given decision. — TWD

Concretely:

  • Privacy is the constraint: local. Anything regulated, anything pre-disclosure, anything you'd rather not see in someone else's logs.
  • Capability is the constraint: cloud. If the task is at the edge of what current models can do, run it on the model that has the best chance of finishing.
  • Cost is the constraint: local for high-volume batch jobs (embeddings, classification, summarization at scale), cloud for low-volume hard problems where the per-call price doesn't matter.
  • Convenience is the constraint: cloud, almost always. The hosted experience is friction-free in a way local has not yet matched and may never match.

The combination most practitioners settle into looks like this. A frontier hosted model (Claude, ChatGPT, Gemini, pick your favorite) for hard problems, novel work, anything multimodal, anything where the answer needs to be as good as possible. A local Ollama install for the privacy-sensitive slice, for offline use, and for the batch jobs that would otherwise burn API credit. The two coexist, and the ergonomics of the OpenAI-compatible server pattern mean swapping between them is a config change.

One closing observation

The argument for running models locally has gotten stronger in the last two years, not weaker. Open-weights releases have closed most of the quality gap on most everyday tasks. Apple Silicon turned a niche hobby into something a single laptop can do well. Tooling has converged on the OpenAI HTTP shape, which makes everything portable. None of that means local is the right answer for every workflow, or even for most of yours. It means local is now a real option for the workflows where it fits, instead of a curiosity. Use it where it fits, ignore it where it doesn't, and don't romanticize either side.

If any of this is out of date by the time you read it: the Ollama blog, the LM Studio release notes, and the Hugging Face open-llm-leaderboard are the three places I check. They lag the actual releases by a few days, but the signal-to-noise on each of them is high.