Table of Contents

01 · The Hardware Truth · START HERE
Why local LLMs are memory-bound · Bottleneck math · Three honest tiers · What doesn't work

02 · Why Local AI & the 5-Layer Stack · FRAMEWORK
When local wins · When cloud wins · Layers 1–5 introduced · How the guide maps to the layers

03 · Ollama in Docker · BUILD
Compose file · Volume strategy · GPU acceleration · Apple Silicon native · Context length tuning

04 · Open-WebUI: The Chat Interface · BUILD
First-admin security gate · Adding users · API endpoints · Features to skip at first

05 · Picking Your Models (Honestly) · REFERENCE
The Default Five framework · Shortlist per tier · Quantization hierarchy · 10-prompt eval protocol

06 · Remote Access Without Exposing Yourself · SECURITY
Threat model · CF Access first, then tunnel · Service Tokens · What not to expose

07 · Beyond the Starter Stack · OVERVIEW
Layer 3 (retrieval) · Layer 4 (tools) · Layer 5 (evaluation) · Prompt injection · Fine-tuning

08 · Integrating Into Your Workflow · DAILY USE
VS Code / Continue · Obsidian · Browser PWA · Shell wrapper function

09 · Operations: Keeping It Running · REFERENCE
Five failure modes · Disk · Uptime Kuma monitors · Model lifecycle · OOM · Backup · Monthly routine

10 · What's Next
When hardware hits the ceiling · What local still can't do · Community resources · The deep-dive guide

Chapter 01 · Start Here

The Hardware Truth

Your existing homelab box probably isn't enough. Local LLMs are a different workload — memory-bound in a way nothing else you run is. Before you install anything, we'll be honest about what clears the bar and what doesn't.

1.1 · Why this workload is different
Capacity first, speed second. The single rule that defines everything downstream.

1.2 · The bottleneck math nobody shows you
Quantization overhead, context length, and concurrency. Memory scales in three dimensions.

1.3 · The three honest tiers
Entry, Capable, Serious. What each costs, what each runs, and who they're for.

1.4 · What does not work (and why people keep trying)
8GB ceilings, old GPUs, laptops, parameter-count-only sizing, and other dead ends.

§1.1 — Why This Workload Is Different

Why Local LLMs Break the Usual Homelab Rules

Pricing in this chapter is current as of April 22, 2026. Consumer GPU street prices have been volatile through the DRAM shortage of late 2025 and into 2026 — always verify on the day you buy. Apple Silicon and mini-PC pricing has been more stable.

Start here before you buy anything

If you already run Home Assistant, Plex, Immich, or twenty Docker containers on a ThinkCentre M720q with 16 GB of RAM, you've earned the right to be a little offended by the statement that your box isn't enough. You shouldn't be. Local LLMs are a different workload from anything else in your rack — not bigger, not smaller, different. They lean on exactly one specification in a way nothing else you run does, and the hardware that was "good enough" for your entire existing stack will hit a wall the first time you try to load a useful model.

This chapter is the gate. If the hardware you have, or can budget for, doesn't clear the bar, everything downstream of this in the guide is wasted effort. So we're going to be honest about it before we ask you to install anything.

Why this workload is different

Every other homelab service is CPU-bound or I/O-bound or network-bound. Plex transcodes. Immich indexes. Home Assistant polls. You solve their performance problems by adding cores, adding disks, adding a better NIC. RAM matters, but it's not usually the bottleneck that defines what you can and can't run.

Local LLMs are memory-bound in a way those services aren't. The entire model has to fit in fast memory to run at usable speed. A 13B-parameter model quantized to 4 bits is roughly 8 GB of weights. Loading it onto a box with 8 GB of system RAM doesn't mean "it runs slowly" — it means the operating system swaps to disk and you watch a progress bar instead of a conversation.
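The "roughly 8 GB" figure falls out of simple arithmetic. Here is a minimal sketch, assuming weight size is just parameter count times bits per weight. The 4.8 bits/weight value is an assumed effective width for a typical 4-bit quantization (quantized files usually keep some tensors at higher precision), not a number from any specific model file.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 4.0 is the nominal width, 4.8 an assumed effective width, 16.0 the unquantized baseline.
for bits in (4.0, 4.8, 16.0):
    print(f"13B at {bits} bits/weight: {weight_size_gb(13, bits):.1f} GB")
```

The same function shows why quantization matters at all: the unquantized 16-bit version of the same model is several times larger than the 4-bit one.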

💡

The First Rule

Capacity first, speed second. A slower card with enough memory always beats a faster card that can't hold your model. Write this on a sticky note before you start shopping.

§1.2 — The Bottleneck Math

The Memory Math Nobody Shows You

Memory requirements aren't just "model file size." Three things stack on top of raw weights, and any one of them can blow your budget.

What actually consumes memory

📦Quantization overhead

A 13B model at Q4 is ~8 GB of weights. Runtime footprint is higher because the inference engine allocates working memory for activations and KV cache. Budget 20–30% over the raw weight size as a starting point.

📏Context length

Every token in your conversation or prompt consumes memory. Ollama's defaults scale with VRAM: under 24 GiB gets 4k tokens, 24–48 GiB gets 32k, 48 GiB+ gets 256k. KV cache grows linearly with length.

👥Concurrency

If you serve more than yourself, required memory scales with simultaneous requests × context length. "One user, short messages" and "five users, long documents" are dramatically different hardware targets.

🧮

The practical formula

Required memory ≈ (model weights × 1.25) + (context tokens × KV cache factor × parallel requests)

You don't have to calculate it exactly. You do have to understand that picking a model is never just about parameter count.
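If you do want a ballpark, the formula is easy to sketch. The version below assumes the "KV cache factor" is the bytes of cache one token occupies, estimated the usual way for a transformer: a key and a value per layer per KV head. The model dimensions are hypothetical, chosen only to look like a 13B-class model; check your model's card for real values.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def required_memory_gb(weights_gb: float, context_tokens: int,
                       kv_factor_bytes: int, parallel: int = 1) -> float:
    """The chapter's formula: (weights x 1.25) + (context x KV factor x parallel)."""
    kv_gb = context_tokens * kv_factor_bytes * parallel / 1e9
    return weights_gb * 1.25 + kv_gb


# Hypothetical 13B-class model: 40 layers, 8 KV heads of dimension 128, 16-bit cache.
KV = kv_bytes_per_token(n_layers=40, n_kv_heads=8, head_dim=128)

scenarios = {
    "one user, 4k context": (4_096, 1),
    "one user, 64k context": (65_536, 1),
    "five users, 32k context": (32_768, 5),
}
for label, (ctx, users) in scenarios.items():
    print(f"{label}: {required_memory_gb(8.0, ctx, KV, users):.1f} GB")
```

Under these assumed dimensions, the same 8 GB of weights needs roughly 11 GB at 4k context, about 21 GB at 64k, and about 37 GB for five users at 32k. The weights never changed; only context and concurrency did.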

⚠️

Ollama Recommends 64k+ for Coding & Agents

A 13B model on a 12 GB GPU running at the default 4k context will leave an agent constantly losing context mid-task. Plan for context length explicitly — this is where most "why is this not working" problems start.

§1.3 — The Three Honest Tiers

The Three Honest Tiers

What actually works at each budget, not what marketing says. Specific machines named for reference, but category matters more than brand — equivalent options exist from Minisforum, GMKtec, and others.

Entry — CPU-only mini PC, 32 GB RAM

✅What it runs well

7B models comfortably, 13B at aggressive quantization. 5–15 tokens/second on 7B. Zero ambient noise, under 30W at load. Perfect for "try local AI without committing."

❌What it's not for

30B+ models, production serving, multi-user access, anything that feels like "real" speed. Good starting tier, ceiling hits fast.

Capable — Used workstation GPU or Mac mini M4 Pro

This tier splits cleanly along a values axis. Pick one based on what matters to you.

🎮CUDA path — Used RTX 3090

24 GB VRAM at used-market prices makes it excellent for 30B-class Q4/Q5 models. Handles real workloads at respectable speed. Runs 70B Q4 only with meaningful compromise (partial offload to system memory). Budget $400–600 for the rest of the machine if you don't already have a workstation.

🍎Apple path — Mac mini M4 Pro, 48 GB

Unified memory is the quiet revolution. GPU doesn't fight CPU for memory — one 273 GB/s pool. Runs 30B at 10–15 tok/s comfortably, handles multiple loaded models. 30W under load, zero noise, desk-paperback size. Single Homebrew command to set up.
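Memory bandwidth also explains the speed figures. A common rule of thumb, which is an assumption added here rather than something this guide measured, is that generating one token requires streaming all the weights through memory once, so bandwidth divided by weight size gives a rough ceiling on tokens per second. The 18 GB size for a 30B model at Q4 is likewise an assumed figure.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on tokens/second when every token streams all weights once."""
    return bandwidth_gb_s / weights_gb


# 273 GB/s is the M4 Pro figure quoted above; 18 GB is an assumed 30B Q4 weight size.
print(f"{decode_ceiling_tok_s(273, 18):.1f} tok/s ceiling")
```

Real throughput lands below the ceiling, which is consistent with the 10–15 tok/s range quoted for this machine.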

⚖️

The Honest Tradeoff

CUDA wins on raw speed, future fine-tuning ambition, and ecosystem depth. Apple wins on silence, power draw, ease of setup, reliability over time, and resale value. For inference-first buyers who care about silence and simplicity, Apple. For maximum ecosystem support and CUDA-centric tooling later, NVIDIA.

Serious — Flagship GPU, Mac Studio, or dual-card build

This is where the "repurposing" framing dies entirely. You're either buying a new hobby budget or you're not.

RTX 4090, 24 GB VRAM: fastest widely-supported single consumer card. 128 tok/s on 8B models. Scalped well over MSRP.
RTX 5090, 32 GB VRAM: only consumer card with 32 GB, which buys real context headroom on 30B-class models. 70B at Q4 is roughly 40 GB of weights, so it still needs partial offload or a lower-bit quant. 185+ tok/s on 8B. Supply constrained through at least mid-2026.
Mac Studio M4 Max, 128 GB unified: runs 70B Q4 with comfortable headroom. For an always-on shared AI box at home, arguably the sanest purchase at this tier.

Pricing table (as of April 22, 2026)

Tier	Hardware	Current street price
Entry	Beelink SER8, 32 GB DDR5, 1 TB NVMe	~$499
Entry	Beelink SER9 Pro, 32 GB LPDDR5	~$899
Entry	Minisforum UM870 series (varies)	~$550–$750
Capable	Used RTX 3090, 24 GB VRAM	$700–$900
Capable	Mac mini M4 Pro, 48 GB unified	~$1,799
Capable	Mac mini M4 Pro, 64 GB unified	~$2,199
Serious	RTX 4090, 24 GB VRAM	$2,755+ (MSRP $1,599)
Serious	RTX 5090, 32 GB VRAM	$3,695–$4,800 (MSRP $1,999)
Serious	Mac Studio M4 Max, 128 GB unified	~$3,950

GPU pricing in April 2026 is not rational. The 4090 is ~72% over MSRP; the 5090 is ~85% over at the low end of its range and ~140% over at the top. Mini-PC and Apple pricing is stable. Check three retailers on the day you buy. The relative tier recommendations will hold longer than the absolute dollar figures.
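Since the dollar figures will drift, it helps to recompute the markup yourself on the day you buy. A small sketch using the street prices and MSRPs from the table above; swap in whatever you see at checkout.

```python
def pct_over_msrp(street: float, msrp: float) -> float:
    """Percentage by which a street price exceeds MSRP."""
    return (street / msrp - 1) * 100


# Prices taken from the April 22, 2026 table in this chapter.
quotes = {
    "RTX 4090": (2755, 1599),
    "RTX 5090 (low)": (3695, 1999),
    "RTX 5090 (high)": (4800, 1999),
}
for card, (street, msrp) in quotes.items():
    print(f"{card}: {pct_over_msrp(street, msrp):.0f}% over MSRP")
```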

§1.4 — What Doesn't Work

What Does Not Work

(And why people keep trying.) Every failure mode below is the direct result of misunderstanding §1.1 and §1.2.

💾8 GB RAM/VRAM as your ceiling

A 7B/8B at Q4 technically fits (~4.9 GB listed), but once you add runtime overhead, KV cache, and useful context, you're OOM-edge for every request. Below the floor for the experience this guide targets.

🎞️Older GPUs under 8 GB VRAM

GTX 1070, RTX 2060, anything in that class. A 3B at Q4 fits but you're well below what's useful for coding, agents, or extended chat. Not a sane starter path.

💻Laptops as primary host

An M-series MacBook with enough unified memory can run the stack fine. But laptops throttle under sustained load, drain their batteries, and are built to move, which a server shouldn't. Close the lid and your service goes down. Experiment on laptops; host elsewhere.

🔢Sizing by parameter count alone

"I want to run a 70B model" without thinking about context length or users hitting it is the most common buyer mistake. Parameter count is the headline; context × concurrency is the invoice.

🖥️Repurposed rack servers with ECC memory

A Dell R730 with 128 GB ECC seems "free" until you realize the GPU slot is 75W PCIe with no aux power. Some people make it work. It's a research project, not a starter path.

🏠"I'll just add RAM to my existing HA box"

Soldered memory (most N100 boxes, many SER models) — you can't. Workstation with free slots — you can, but now Home Assistant goes down every time you restart for a model pull. Dedicated hardware.

Thinking about repurposing an idle homelab box

This guide isn't going to tell you whether the specific old NUC in your closet is fit for this. But the decision framework is above: what's the RAM ceiling, what's the memory bandwidth, is there a GPU path, can you tolerate it being busy all the time? Most idle homelab boxes are under 32 GB and have no GPU slot. Entry-tier candidates at best — fine for experimenting, not for landing.

🔍

Fast triage

If you're genuinely unsure about a box you already own: install Ollama, pull a 7B model, start a chat with ollama run, and watch resource utilization. If RAM pressure hits 95% or generation drops below 5 tokens per second, you've answered the question. Move on.
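If you'd rather have a number than a feeling, the triage can be scripted. This is a minimal sketch, assuming Ollama is listening on its default local port and that a non-streaming response from its /api/generate endpoint carries eval_count (tokens generated) and eval_duration (nanoseconds). The model tag in the usage comment is only an example; use whichever 7B model you pulled. The 5 tokens/second floor is the threshold from the paragraph above.

```python
import json
import urllib.request


def tokens_per_second(resp: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)


def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)


def verdict(tok_s: float, floor: float = 5.0) -> str:
    """Apply the chapter's triage threshold."""
    return "usable" if tok_s >= floor else "too slow"


# Usage, with Ollama running and a model already pulled (example tag only):
# resp = generate("mistral:7b", "Explain what a KV cache is in three sentences.")
# speed = tokens_per_second(resp)
# print(f"{speed:.1f} tok/s: {verdict(speed)}")
```

Run it a few times with a prompt long enough to generate a paragraph or two; a single short reply can give a misleading number. Watch RAM pressure separately in your system monitor while it runs.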

Where this leaves you

If you know which tier you're aiming at — or which tier the box in your closet falls into — the rest of the guide works. Chapter 2 covers why you'd do this at all and introduces the 5-layer model we'll build up through the remaining chapters. Chapter 3 gets Ollama running.

If the hardware math doesn't work for your budget right now, that's a real answer. Cloud APIs or a cheap Ollama-compatible VPS might be the better first step. Chapter 2 covers that path honestly too.

◆ End of the sample chapter ◆

The other 9 chapters are in the PDF.

You've read the hardware reality: memory math, honest tiers, and what doesn't work. The full guide covers the 5-layer stack, Ollama in Docker, Open-WebUI configuration, picking models honestly, remote access without exposing yourself, scaling beyond the starter stack, workflow integration, and keeping it running. About 95 pages.

