Bench Notes · HomeLabGuides
Digital Guide
Local AI at Home

The Honest
Starter Stack for
Local AI

Ollama, Open-WebUI, and the complete stack for running large language models on your own hardware — with the honest tradeoffs, the hardware math that actually matters, and the five-layer framework that separates toys from working systems.

Ollama · Open-WebUI · Docker Qwen3 · Llama 3.3 · DeepSeek-R1 Cloudflare Tunnel · Access · Zero Trust
L5 Evaluation measure improvement · catch regressions · eval suites future guide L4 Tools & Runtime tool calling · structured output · actions future guide L3 Retrieval / Data embeddings · vector store · RAG · your data future guide L2 Inference Server Ollama · Open-WebUI · API endpoints — THIS GUIDE BUILDS LAYERS 1–2 — this guide L1 Model + Quantization weights · GGUF · Q4/Q5/Q8 · hardware fit this guide
10
Chapters
5
Layer Model
3
Hardware Tiers
LOCAL
First, always
Table of Contents
01The Hardware TruthSTART HERE
Why local LLMs are memory-bound · Bottleneck math · Three honest tiers · What doesn't work
When local wins · When cloud wins · Layers 1–5 introduced · How the guide maps to the layers
Compose file · Volume strategy · GPU acceleration · Apple Silicon native · Context length tuning
First-admin security gate · Adding users · API endpoints · Features to skip at first
The Default Five framework · Shortlist per tier · Quantization hierarchy · 10-prompt eval protocol
Threat model · CF Access first, then tunnel · Service Tokens · What not to expose
Layer 3 (retrieval) · Layer 4 (tools) · Layer 5 (evaluation) · Prompt injection · Fine-tuning
VS Code / Continue · Obsidian · Browser PWA · Shell wrapper function
Five failure modes · Disk · Uptime Kuma monitors · Model lifecycle · OOM · Backup · Monthly routine
When hardware hits the ceiling · What local still can't do · Community resources · The deep-dive guide
Chapter 01 · Start Here

The Hardware
Truth

Your existing homelab box probably isn't enough. Local LLMs are a different workload — memory-bound in a way nothing else you run is. Before you install anything, we'll be honest about what clears the bar and what doesn't.

§1.1 — Why This Workload Is Different

Why Local LLMs Break the Usual Homelab Rules

Pricing in this chapter is current as of April 22, 2026. Consumer GPU street prices have been volatile through the DRAM shortage of late 2025 and into 2026 — always verify on the day you buy. Apple Silicon and mini-PC pricing has been more stable.

Start here before you buy anything

If you already run Home Assistant, Plex, Immich, or twenty Docker containers on a ThinkCentre M720q with 16 GB of RAM, you've earned the right to be a little offended by the statement that your box isn't enough. You shouldn't be. Local LLMs are a different workload from anything else in your rack — not bigger, not smaller, different. They lean on exactly one specification in a way nothing else you run does, and the hardware that was "good enough" for your entire existing stack will hit a wall the first time you try to load a useful model.

This chapter is the gate. If the hardware you have, or can budget for, doesn't clear the bar, everything downstream of this in the guide is wasted effort. So we're going to be honest about it before we ask you to install anything.

Why this workload is different

Every other homelab service is CPU-bound or I/O-bound or network-bound. Plex transcodes. Immich indexes. Home Assistant polls. You solve their performance problems by adding cores, adding disks, adding a better NIC. RAM matters, but it's not usually the bottleneck that defines what you can and can't run.

Local LLMs are memory-bound in a way those services aren't. The entire model has to fit in fast memory to run at usable speed. A 13B-parameter model quantized to 4 bits is roughly 8 GB of weights. Loading it onto a box with 8 GB of system RAM doesn't mean "it runs slowly" — it means the operating system swaps to disk and you watch a progress bar instead of a conversation.

💡
The First Rule

Capacity first, speed second. A slower card with enough memory always beats a faster card that can't hold your model. Write this on a sticky note before you start shopping.

§1.2 — The Bottleneck Math

The Memory Math Nobody Shows You

Memory requirements aren't just "model file size." Three things stack on top of raw weights, and any one of them can blow your budget.

What actually consumes memory

📦Quantization overhead

A 13B model at Q4 is ~8 GB of weights. Runtime footprint is higher because the inference engine allocates working memory for activations and KV cache. Budget 20–30% over the raw weight size as a starting point.

📏Context length

Every token in your conversation or prompt consumes memory. Ollama's defaults scale with VRAM: under 24 GiB gets 4k tokens, 24–48 GiB gets 32k, 48 GiB+ gets 256k. KV cache grows linearly with length.

👥Concurrency

If you serve more than yourself, required memory scales with simultaneous requests × context length. "One user, short messages" and "five users, long documents" are dramatically different hardware targets.

🧮
The practical formula

Required memory ≈ (model weights × 1.25) + (context tokens × KV cache factor × parallel requests)

You don't have to calculate it exactly. You do have to understand that picking a model is never just about parameter count.

⚠️
Ollama Recommends 64k+ for Coding & Agents

A 13B model on a 12 GB GPU running at the default 4k context will leave an agent constantly losing context mid-task. Plan for context length explicitly — this is where most "why is this not working" problems start.

§1.3 — The Three Honest Tiers

The Three Honest Tiers

What actually works at each budget, not what marketing says. Specific machines named for reference, but category matters more than brand — equivalent options exist from Minisforum, GMKtec, and others.

Entry — CPU-only mini PC, 32 GB RAM

What it runs well

7B models comfortably, 13B at aggressive quantization. 5–15 tokens/second on 7B. Zero ambient noise, under 30W at load. Perfect for "try local AI without committing."

What it's not for

30B+ models, production serving, multi-user access, anything that feels like "real" speed. Good starting tier, ceiling hits fast.

Capable — Used workstation GPU or Mac mini M4 Pro

This tier splits cleanly along a values axis. Pick one based on what matters to you.

🎮CUDA path — Used RTX 3090

24 GB VRAM at used-market prices makes it excellent for 30B-class Q4/Q5 models. Handles real workloads at respectable speed. Runs 70B Q4 only with meaningful compromise (partial offload to system memory). Budget $400–600 for the rest of the machine if you don't already have a workstation.

🍎Apple path — Mac mini M4 Pro, 48 GB

Unified memory is the quiet revolution. GPU doesn't fight CPU for memory — one 273 GB/s pool. Runs 30B at 10–15 tok/s comfortably, handles multiple loaded models. 30W under load, zero noise, desk-paperback size. Single Homebrew command to set up.

⚖️
The Honest Tradeoff

CUDA wins on raw speed, future fine-tuning ambition, and ecosystem depth. Apple wins on silence, power draw, ease of setup, reliability over time, and resale value. For inference-first buyers who care about silence and simplicity, Apple. For maximum ecosystem support and CUDA-centric tooling later, NVIDIA.

Serious — Flagship GPU, Mac Studio, or dual-card build

This is where the "repurposing" framing dies entirely. You're either buying a new hobby budget or you're not.

Pricing table (as of April 22, 2026)

TierHardwareCurrent street price
EntryBeelink SER8, 32 GB DDR5, 1 TB NVMe~$499
EntryBeelink SER9 Pro, 32 GB LPDDR5~$899
EntryMinisforum UM870 series (varies)~$550–$750
CapableUsed RTX 3090, 24 GB VRAM$700–$900
CapableMac mini M4 Pro, 48 GB unified~$1,799
CapableMac mini M4 Pro, 64 GB unified~$2,199
SeriousRTX 4090, 24 GB VRAM$2,755+ (MSRP $1,599)
SeriousRTX 5090, 32 GB VRAM$3,695–$4,800 (MSRP $1,999)
SeriousMac Studio M4 Max, 128 GB unified~$3,950

GPU pricing in April 2026 is not rational. The 4090 is ~72% over MSRP; the 5090 is ~85%. Mini-PC and Apple pricing is stable. Check three retailers on the day you buy. The relative tier recommendations will hold longer than the absolute dollar figures.

§1.4 — What Doesn't Work

What Does Not Work

(And why people keep trying.) Every failure mode below is the direct result of misunderstanding §1.1 and §1.2.

💾8 GB RAM/VRAM as your ceiling

A 7B/8B at Q4 technically fits (~4.9 GB listed), but once you add runtime overhead, KV cache, and useful context, you're OOM-edge for every request. Below the floor for the experience this guide targets.

🎞️Older GPUs under 8 GB VRAM

GTX 1070, RTX 2060, anything in that class. A 3B at Q4 fits but you're well below what's useful for coding, agents, or extended chat. Not a sane starter path.

💻Laptops as primary host

An M-series MacBook with enough unified memory can run the stack fine. But: thermal throttling, battery drain, and the machine is portable and shouldn't be. Close the lid, your service goes down. Experiment on laptops; host elsewhere.

🔢Sizing by parameter count alone

"I want to run a 70B model" without thinking about context length or users hitting it is the most common buyer mistake. Parameter count is the headline; context × concurrency is the invoice.

🖥️Repurposed rack servers with ECC-only

A Dell R730 with 128 GB ECC seems "free" until you realize the GPU slot is 75W PCIe with no aux power. Some people make it work. It's a research project, not a starter path.

🏠"I'll just add RAM to my existing HA box"

Soldered memory (most N100 boxes, many SER models) — you can't. Workstation with free slots — you can, but now Home Assistant goes down every time you restart for a model pull. Dedicated hardware.

Thinking about repurposing an idle homelab box

This guide isn't going to tell you whether the specific old NUC in your closet is fit for this. But the decision framework is above: what's the RAM ceiling, what's the memory bandwidth, is there a GPU path, can you tolerate it being busy all the time? Most idle homelab boxes are under 32 GB and have no GPU slot. Entry-tier candidates at best — fine for experimenting, not for landing.

🔍
Fast triage

If you're genuinely unsure about a box you already own: install Ollama, pull a 7B model, run ollama run, and watch resource utilization. If RAM pressure hits 95% or tokens per second drops below 5, you've answered the question. Move on.

Where this leaves you

If you know which tier you're aiming at — or which tier the box in your closet falls into — the rest of the guide works. Chapter 2 covers why you'd do this at all and introduces the 5-layer model we'll build up through the remaining chapters. Chapter 3 gets Ollama running.

If the hardware math doesn't work for your budget right now, that's a real answer. Cloud APIs or a cheap Ollama-compatible VPS might be the better first step. Chapter 2 covers that path honestly too.

◆ End of the sample chapter ◆

The other 8 chapters are in the PDF.

You've read the hardware reality — memory math, honest tiers, and what doesn't work. The full guide covers the 5-layer stack, Ollama in Docker, Open-WebUI configuration, picking models honestly, remote access without exposing yourself, scaling beyond the starter, workflow integration, and keeping it running. About ~95 pages.

Get the PDF

Read offline. Print it. Support the work.

The polished, printable version of Local AI — formatted for letter size, ad-free, yours to keep.

Get the PDF on Etsy
Bench Notes shop · HomeLabGuides
  Etsy checkout · Instant PDF download
Inside the PDF
02Why Local AI & the 5-Layer Stack
03Ollama in Docker
04Open-WebUI: The Chat Interface
05Picking Your Models (Honestly)
06Remote Access Without Exposing Yourself
07Beyond the Starter Stack
08Integrating Into Your Workflow
09Keeping It Running
10What's Next

v1.1.0 · Last updated May 2026