Ollama, Open-WebUI, and the complete stack for running large language models on your own hardware — with the honest tradeoffs, the hardware math that actually matters, and the five-layer framework that separates toys from working systems.
Your existing homelab box probably isn't enough. Local LLMs are a different workload — memory-bound in a way nothing else you run is. Before you install anything, we'll be honest about what clears the bar and what doesn't.
Pricing in this chapter is current as of April 22, 2026. Consumer GPU street prices have been volatile through the DRAM shortage of late 2025 and into 2026 — always verify on the day you buy. Apple Silicon and mini-PC pricing has been more stable.
If you already run Home Assistant, Plex, Immich, or twenty Docker containers on a ThinkCentre M720q with 16 GB of RAM, you've earned the right to be a little offended by the statement that your box isn't enough. You shouldn't be. Local LLMs are a different workload from anything else in your rack — not bigger, not smaller, different. They lean on exactly one specification in a way nothing else you run does, and the hardware that was "good enough" for your entire existing stack will hit a wall the first time you try to load a useful model.
This chapter is the gate. If the hardware you have, or can budget for, doesn't clear the bar, everything downstream of this in the guide is wasted effort. So we're going to be honest about it before we ask you to install anything.
Every other homelab service is CPU-bound or I/O-bound or network-bound. Plex transcodes. Immich indexes. Home Assistant polls. You solve their performance problems by adding cores, adding disks, adding a better NIC. RAM matters, but it's not usually the bottleneck that defines what you can and can't run.
Local LLMs are memory-bound in a way those services aren't. The entire model has to fit in fast memory to run at usable speed. A 13B-parameter model quantized to 4 bits is roughly 8 GB of weights. Loading it onto a box with 8 GB of system RAM doesn't mean "it runs slowly" — it means the operating system swaps to disk and you watch a progress bar instead of a conversation.
Capacity first, speed second. A slower card with enough memory always beats a faster card that can't hold your model. Write this on a sticky note before you start shopping.
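The weight figure above is simple arithmetic, and it helps to see it once. The sketch below is a rough estimate, not a measurement: the function name is ours, and the 4.9 bits-per-weight figure is an assumption chosen to reflect that real 4-bit formats store scaling data alongside the weights, so they land above a flat 4 bits.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    # params (in billions) * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits_per_weight / 8


# A 13B model at a flat 4 bits per weight:
print(round(weight_size_gb(13, 4.0), 1))  # 6.5

# The same model at an assumed ~4.9 effective bits per weight,
# which is how a "4-bit" file ends up at roughly 8 GB:
print(round(weight_size_gb(13, 4.9), 1))  # 8.0
```

Compare the result against the fast memory you actually have free, not the number printed on the box.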
Memory requirements aren't just "model file size." Three things stack on top of raw weights, and any one of them can blow your budget.
A 13B model at Q4 is ~8 GB of weights. Runtime footprint is higher because the inference engine allocates working memory for activations and KV cache. Budget 20–30% over the raw weight size as a starting point.
Every token in your conversation or prompt consumes memory. Ollama's defaults scale with VRAM: under 24 GiB gets 4k tokens, 24–48 GiB gets 32k, 48 GiB+ gets 256k. KV cache grows linearly with length.
If you serve more than yourself, required memory scales with simultaneous requests × context length. "One user, short messages" and "five users, long documents" are dramatically different hardware targets.
Required memory ≈ (model weights × 1.25) + (context tokens × KV cache factor × parallel requests)
You don't have to calculate it exactly. You do have to understand that picking a model is never just about parameter count.
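If you do want a number, the formula above can be sketched in a few lines. Treat this as a back-of-envelope estimator under stated assumptions: the `ModelShape` values are a hypothetical 13B-class architecture with grouped-query attention, the KV cache is assumed to be stored in 16-bit precision, and the 1.25 multiplier is the midpoint of the 20–30% overhead range. Read the real layer and head counts from the model card of whatever you plan to run.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical architecture numbers; take real ones from the model card."""
    layers: int
    kv_heads: int
    head_dim: int
    kv_bytes_per_value: int = 2  # assumes a 16-bit KV cache


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per layer, per KV head.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.kv_bytes_per_value


def required_gib(weights_gib: float, shape: ModelShape,
                 context_tokens: int, parallel: int = 1,
                 overhead: float = 1.25) -> float:
    """(weights x overhead) + (context tokens x KV factor x parallel requests)."""
    kv_gib = context_tokens * kv_bytes_per_token(shape) * parallel / GIB
    return weights_gib * overhead + kv_gib


# Illustrative 13B-class shape (assumed, not taken from a specific model).
shape = ModelShape(layers=40, kv_heads=8, head_dim=128)

# One user, 4k context:
print(required_gib(8, shape, 4096))               # 10.625

# Five users, 32k context each:
print(required_gib(8, shape, 32768, parallel=5))  # 35.0
```

Same model, same weights, more than three times the memory. That gap is the difference between the two hardware targets described above.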
A 13B model on a 12 GB GPU running at the default 4k context will leave an agent constantly losing context mid-task. Plan for context length explicitly — this is where most "why is this not working" problems start.
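To plan context explicitly, work backwards from the memory you have. The sketch below encodes the default context tiers as this chapter describes them (the exact token counts assume "k" means 1,024) and estimates how many tokens of KV cache fit after the weights are loaded. The 160 KiB-per-token figure is an assumption for a 13B-class model; the real value depends on the model's architecture and cache precision.

```python
def ollama_default_context(vram_gib: float) -> int:
    """Default context tiers as described in this chapter (k = 1024 assumed)."""
    if vram_gib < 24:
        return 4 * 1024
    if vram_gib < 48:
        return 32 * 1024
    return 256 * 1024


def max_context_tokens(vram_gib: float, weights_gib: float,
                       kv_kib_per_token: float,
                       overhead: float = 1.25, parallel: int = 1) -> int:
    """Rough ceiling on context tokens that fit beside the weights."""
    free_gib = vram_gib - weights_gib * overhead
    if free_gib <= 0:
        return 0  # the weights alone already exceed the budget
    free_kib = free_gib * 1024 * 1024
    return int(free_kib / (kv_kib_per_token * parallel))


# 12 GB card, ~8 GB of weights, assumed 160 KiB of KV cache per token.
print(ollama_default_context(12))      # 4096
print(max_context_tokens(12, 8, 160))  # 13107
```

Under these assumptions the card has room for roughly 13k tokens, so the 4k default is a setting rather than a hardware limit, but a 32k agent workload would not fit at all.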
What actually works at each budget, not what marketing says. Specific machines named for reference, but category matters more than brand — equivalent options exist from Minisforum, GMKtec, and others.
7B models comfortably, 13B at aggressive quantization. 5–15 tokens/second on 7B. Zero ambient noise, under 30W at load. Perfect for "try local AI without committing."
Not suited to 30B+ models, production serving, multi-user access, or anything that feels like "real" speed. A good starting tier, but you hit the ceiling fast.
This tier splits cleanly along a values axis. Pick one based on what matters to you.
A used RTX 3090 gives you 24 GB of VRAM at used-market prices, which makes it excellent for 30B-class Q4/Q5 models. Handles real workloads at respectable speed. Runs 70B Q4 only with meaningful compromise (partial offload to system memory). Budget $400–600 for the rest of the machine if you don't already have a workstation.
Unified memory is the quiet revolution. GPU doesn't fight CPU for memory — one 273 GB/s pool. Runs 30B at 10–15 tok/s comfortably, handles multiple loaded models. 30W under load, zero noise, desk-paperback size. Single Homebrew command to set up.
CUDA wins on raw speed, future fine-tuning ambition, and ecosystem depth. Apple wins on silence, power draw, ease of setup, reliability over time, and resale value. For inference-first buyers who care about silence and simplicity, Apple. For maximum ecosystem support and CUDA-centric tooling later, NVIDIA.
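A quick sanity check on speed claims: because generation is memory-bound, a common rule of thumb puts the ceiling on tokens per second at memory bandwidth divided by the size of the weights. The sketch below assumes a dense model, a single request, and that every generated token reads all the weights once. The 18 GB figure for a 30B Q4 model is an assumption, and real throughput lands below the ceiling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on generation speed for a memory-bound dense model."""
    # Assumption: each token requires one full pass over the weights.
    return bandwidth_gb_s / weights_gb


# 273 GB/s unified memory, assumed ~18 GB of weights for a 30B model at Q4:
print(round(decode_ceiling_tok_s(273, 18), 1))  # 15.2
```

A ceiling of about 15 tok/s is consistent with the 10–15 tok/s quoted above for 30B on a 273 GB/s pool. Run the same division for any card you are considering before trusting a benchmark screenshot.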
This is where the "repurposing" framing dies entirely. You're either buying a new hobby budget or you're not.
| Tier | Hardware | Current street price |
|---|---|---|
| Entry | Beelink SER8, 32 GB DDR5, 1 TB NVMe | ~$499 |
| Entry | Beelink SER9 Pro, 32 GB LPDDR5 | ~$899 |
| Entry | Minisforum UM870 series (varies) | ~$550–$750 |
| Capable | Used RTX 3090, 24 GB VRAM | $700–$900 |
| Capable | Mac mini M4 Pro, 48 GB unified | ~$1,799 |
| Capable | Mac mini M4 Pro, 64 GB unified | ~$2,199 |
| Serious | RTX 4090, 24 GB VRAM | $2,755+ (MSRP $1,599) |
| Serious | RTX 5090, 32 GB VRAM | $3,695–$4,800 (MSRP $1,999) |
| Serious | Mac Studio M4 Max, 128 GB unified | ~$3,950 |
GPU pricing in April 2026 is not rational. The 4090 is ~72% over MSRP; the 5090 is ~85% over at the low end of its range and ~140% over at the high end. Mini-PC and Apple pricing is stable. Check three retailers on the day you buy. The relative tier recommendations will hold longer than the absolute dollar figures.
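Since the dollar figures will drift, the premium is worth recomputing on the day you buy. A minimal sketch, using the street prices and MSRPs from the table above; the function name is ours.

```python
def premium_over_msrp(street: float, msrp: float) -> float:
    """Percentage by which the street price exceeds MSRP."""
    return (street / msrp - 1) * 100


# (street price, MSRP) pairs from the table above
quotes = {
    "RTX 4090": (2755, 1599),
    "RTX 5090 (low)": (3695, 1999),
    "RTX 5090 (high)": (4800, 1999),
}

for card, (street, msrp) in quotes.items():
    print(f"{card}: {premium_over_msrp(street, msrp):.1f}% over MSRP")
# RTX 4090: 72.3% over MSRP
# RTX 5090 (low): 84.8% over MSRP
# RTX 5090 (high): 140.1% over MSRP
```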
(And why people keep trying.) Every failure mode below is the direct result of misunderstanding §1.1 and §1.2.
A 7B/8B at Q4 technically fits (~4.9 GB listed), but once you add runtime overhead, KV cache, and useful context, you're on the edge of running out of memory on every request. Below the floor for the experience this guide targets.
GTX 1070, RTX 2060, anything in that class. A 3B at Q4 fits but you're well below what's useful for coding, agents, or extended chat. Not a sane starter path.
An M-series MacBook with enough unified memory can run the stack fine. But a laptop throttles under sustained load, drains its battery, and travels, which a server shouldn't. Close the lid, your service goes down. Experiment on laptops; host elsewhere.
"I want to run a 70B model" without thinking about context length or users hitting it is the most common buyer mistake. Parameter count is the headline; context × concurrency is the invoice.
A Dell R730 with 128 GB ECC seems "free" until you realize the GPU slot is 75W PCIe with no aux power. Some people make it work. It's a research project, not a starter path.
Soldered memory (most N100 boxes, many SER models) — you can't. Workstation with free slots — you can, but now Home Assistant goes down every time you restart for a model pull. Dedicated hardware.
This guide isn't going to tell you whether the specific old NUC in your closet is fit for this. But the decision framework is above: what's the RAM ceiling, what's the memory bandwidth, is there a GPU path, can you tolerate it being busy all the time? Most idle homelab boxes are under 32 GB and have no GPU slot. Entry-tier candidates at best — fine for experimenting, not for landing.
If you're genuinely unsure about a box you already own: install Ollama, pull a 7B model, start a chat with ollama run, and watch resource utilization. If RAM pressure hits 95% or tokens per second drops below 5, you've answered the question. Move on.
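If you'd rather have a number than a feeling, the test above can be scripted against Ollama's local HTTP API. This is a sketch under assumptions: it expects Ollama listening on its default port on the same machine, the model tag you pass is whatever you pulled, and the RAM percentage is something you read from your own system monitor. The `eval_count` and `eval_duration` fields come from the non-streaming `/api/generate` response, with the duration reported in nanoseconds.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an /api/generate response (duration is in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def clears_the_bar(tok_s: float, ram_percent: float,
                   min_tok_s: float = 5.0,
                   max_ram_percent: float = 95.0) -> bool:
    """Apply the two thresholds from the paragraph above."""
    return tok_s >= min_tok_s and ram_percent < max_ram_percent


def measure(model: str, host: str = "http://localhost:11434") -> float:
    """Send one short prompt and return the measured tokens per second."""
    body = json.dumps({
        "model": model,  # the tag you pulled; any 7B-class model works
        "prompt": "Explain RAID 5 in three sentences.",
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return tokens_per_second(json.load(resp))
```

Call `measure("your-model-tag")` a few times, since the first request includes model load time, then pass the result and your observed RAM usage to `clears_the_bar`.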
If you know which tier you're aiming at — or which tier the box in your closet falls into — the rest of the guide works. Chapter 2 covers why you'd do this at all and introduces the 5-layer model we'll build up through the remaining chapters. Chapter 3 gets Ollama running.
If the hardware math doesn't work for your budget right now, that's a real answer. Cloud APIs or a cheap Ollama-compatible VPS might be the better first step. Chapter 2 covers that path honestly too.
You've read the hardware reality: memory math, honest tiers, and what doesn't work. The full guide covers the 5-layer stack, Ollama in Docker, Open-WebUI configuration, picking models honestly, remote access without exposing yourself, scaling beyond the starter, workflow integration, and keeping it running. About 95 pages.
v1.1.0 · Last updated May 2026