When is a local LLM actually the right tool?

Most of the time, the local model loses. That’s the part the Hacker News threads tend to skip. You install Ollama, pull Qwen and Llama, run a few prompts that feel impressive in isolation, then go right back to Claude or Copilot for anything you actually ship. The daemon sits there warming a corner of your RAM for nothing.

The pull is real. I wrote a guide to LM Studio, set up Zed pointed at a local model, and wired Ollama into Void, because the setup is genuinely satisfying. And there are jobs where a local 7B model is the right call. There just aren’t as many as the marketing implies. Here’s the honest count.

Case one: code you can’t paste into a hosted box

The clearest win is the one nobody argues about. If you’re under an NDA that won’t let third-party code touch a vendor’s servers, or you’re working air-gapped, or the repo has secrets the legal team would rather you didn’t ship to Anthropic, a local model is the only model.

This isn’t a quality argument. A hosted frontier model would do the task better. But “better” doesn’t matter when the answer to “can I use it” is no. Qwen 2.5 Coder at 1.5B running through Zed via http://127.0.0.1:11434 will autocomplete a function signature, explain a regex, and rubber-duck a refactor without anything leaving the machine. That’s the bar. Clear it and you’re done.
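For a sense of what "nothing leaves the machine" looks like outside an editor, here is a minimal sketch that talks to the same local address directly. It assumes Ollama's `/api/generate` endpoint and the model tag `qwen2.5-coder:1.5b`; the function names `build_request` and `ask_local` are mine, made up for illustration, and your tag may differ depending on what you pulled.

```python
import json
import urllib.request

# Ollama's default local address; the request never goes past loopback.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_request(prompt: str, model: str = "qwen2.5-coder:1.5b") -> urllib.request.Request:
    """Build a non-streaming generate request aimed at the local daemon.

    The model tag is an assumption; swap in whatever `ollama list` shows.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to the local model and return its text reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the daemon running, `ask_local("Explain this regex: ^\\d{3}-\\d{4}$")` should come back with a plain-text explanation. The only host in the code is `127.0.0.1`, which is the whole point.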

If this is your situation, keep the model. If it isn’t, the next two cases are narrower than they sound.

Case two: latency on the inner loop

The second honest win is autocomplete. Not chat, not reasoning, not “write me this function”. The tab-key suggestion that fires every time you stop typing for 200ms.

A round trip to a hosted API is somewhere between 300ms and a second, depending on the wind. A local 1.5B model on a warm daemon is closer to 50ms. For a chat turn that doesn’t matter; for autocomplete it’s the difference between a tool that helps and a tool that gets in the way. The Qwen 2.5 Coder 1.5B model is specifically the right size for this: small enough to stay snappy, capable enough to be useful, and only about 1GB of memory while it’s resident.
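Those numbers are worth checking on your own hardware rather than taking from me. A rough timing harness, as a sketch: `time_calls` and `summarize` are hypothetical helper names, and `fn` is whatever zero-argument callable wraps your completion request, local or hosted.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], runs: int = 20) -> List[float]:
    """Call fn `runs` times and return each wall-clock duration in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Report the first call separately, since it is often the cold one."""
    return {
        "first_ms": samples[0],
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Run it once against the local endpoint and once against the hosted one with the same short prompt. The median is the number that matters for autocomplete; a large gap between `first_ms` and `median_ms` usually means the first call paid for loading the model.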

The catch: this only pays off if you keep the daemon warm. Cold-starting Ollama on every completion is worse than just calling the hosted API. If you’re not running it as a service that stays up, the latency win evaporates.
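One way to keep it warm, sketched under my reading of Ollama's API docs: a generate request with no prompt loads the model, and the `keep_alive` field controls how long it stays loaded, with `-1` meaning "until told otherwise". The function names and the model tag are assumptions for illustration.

```python
import json
import urllib.request

GENERATE_URL = "http://127.0.0.1:11434/api/generate"


def build_preload_request(model: str = "qwen2.5-coder:1.5b", keep_alive=-1) -> urllib.request.Request:
    """Build a prompt-less request that asks the daemon to load the model.

    keep_alive=-1 asks Ollama to keep the model resident indefinitely;
    a duration string such as "30m" is the gentler option.
    """
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(model: str = "qwen2.5-coder:1.5b", timeout: float = 120.0) -> None:
    """Fire the preload request and discard the reply; the side effect is the point."""
    with urllib.request.urlopen(build_preload_request(model), timeout=timeout) as resp:
        resp.read()
```

Calling `preload()` once at login pays the load cost up front so the first real completion doesn't. The trade is that the model's memory stays claimed all day, which is easier to justify for a 1.5B model than for a larger one.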

Case three: offline

I write on planes more than I’d like. A local model that works in airplane mode is worth something even if the quality is half of what’s on the other end of an internet connection, because half of something beats all of nothing.

This is the weakest of the three cases, because the honest answer is usually “wait until you land.” But if you’re doing field work, traveling somewhere with bad connectivity, or just prefer the discipline of a workflow that doesn’t depend on someone else’s uptime, the local model earns its disk space.

The longer list where local loses

Now the part the enthusiasm threads skip. A local model is the wrong tool when:

You need recent knowledge. A 7B model frozen on its training data isn’t going to know about the framework you upgraded to last month. Hosted models with search tools will. This is most questions.
The task is hard enough that the quality drop is visible. “Refactor this 400-line file into smaller modules” is a task where Claude or GPT-4-class models pull miles ahead of anything you can run on a laptop. You will notice. You will get frustrated. You will paste it into the hosted tool anyway.
You only fire it occasionally. The economics of a local LLM assume the daemon is warm. If you’re running it twice a day, the cold-start cost plus the quality gap means you’re paying for the privacy of a tool you barely use.
You’re chasing the demo, not a job. Pulling a new model every weekend because someone on HN said it was good is a hobby, not a workflow. Fine. Hobbies are fine. Just don’t confuse them with productivity wins.

What to keep, what to `ollama rm`

If you pulled Qwen 2.5 Coder 1.5B and Llama 3.1 after that Hacker News thread and aren’t sure which to keep: keep the Qwen. It’s small, fast, scoped for code, and the right size for the autocomplete case that’s actually a win. The 8B Llama is a worse version of what Claude does well, sitting in 5GB of RAM waiting for a use case that doesn’t usually come.

The exception is case one — if your code can’t go to a hosted model, keep both, because then the comparison isn’t local vs. hosted, it’s local vs. nothing.
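If you've lost track of what you pulled, the daemon can tell you. A small sketch, assuming Ollama's `/api/tags` endpoint returns a `models` list whose entries carry `name` and `size` in bytes; `largest_first` and `fetch_tags` are hypothetical names. Note that the size reported is the download on disk, not the memory the model takes once loaded.

```python
import json
import urllib.request
from typing import List, Tuple

TAGS_URL = "http://127.0.0.1:11434/api/tags"


def largest_first(tags: dict) -> List[Tuple[str, float]]:
    """Turn an /api/tags payload into (name, size in GB) pairs, biggest first."""
    rows = [(m["name"], m["size"] / 1e9) for m in tags.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_tags(timeout: float = 5.0) -> dict:
    """Ask the local daemon which models are currently pulled."""
    with urllib.request.urlopen(TAGS_URL, timeout=timeout) as resp:
        return json.loads(resp.read())
```

`largest_first(fetch_tags())` puts the biggest model at the top, which is usually the first candidate for `ollama rm`.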

The rest is taste. Local models are a real tool for a narrow set of jobs. They are not the future of how most developers work, and the people selling that story usually have a model to sell. Pick the job, pick the tool, move on.

Additional resources

Decided a local model is the right call for one of these cases? The setup guides:

Level up your local AI: getting started with LM Studio — a GUI for downloading and chatting with local models
Setting up Zed Editor — pointing a lightweight editor at a local model
Void your concerns: a guide to private AI in Void Editor — Ollama wired into a VS Code fork
Ollama — the daemon all three build on

Author

Chandler Thompson

I lead engineering teams and coach the people who run them. This is where I write down what actually worked.
