Is GPU VRAM or memory bandwidth more important for local LLMs?

VRAM determines whether the model FITS. Bandwidth determines how FAST it runs once it does. A 24GB RTX 3090 (936 GB/s) and a 24GB M-series MacBook Pro (~300-400 GB/s) both fit Qwen 3.6 27B at Q4, but the 3090 generates tokens about 2-3x faster, because every generated token requires reading the model weights from memory. For OpenClaw production where reliability matters more than tokens-per-second, both work; for interactive chat where waiting feels bad, prefer high bandwidth.
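As a rough illustration of why bandwidth sets the pace, single-user decoding can be approximated as "read all the weights once per token", which gives an upper bound of bandwidth divided by model size. The sketch below assumes a hypothetical 16 GB footprint for a 27B model at roughly 4 bits per weight and a 350 GB/s midpoint for the Mac; real throughput lands below this ceiling.

```python
def estimate_decode_tok_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                                efficiency: float = 1.0) -> float:
    """Rough ceiling on decode speed when inference is memory-bandwidth-bound.

    Assumes each generated token reads every weight once. `efficiency` is a
    hypothetical fudge factor (0-1) for real-world overhead.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s * efficiency / model_size_gb


MODEL_GB = 16.0  # assumed size of a 27B model at ~4-bit, not a measured value

rtx3090 = estimate_decode_tok_per_sec(936, MODEL_GB)  # 936 GB/s from the text
mac = estimate_decode_tok_per_sec(350, MODEL_GB)      # assumed midpoint of 300-400 GB/s

print(f"RTX 3090 ceiling: {rtx3090:.1f} tok/s")
print(f"Mac ceiling:      {mac:.1f} tok/s")
print(f"Speed ratio:      {rtx3090 / mac:.2f}x")
```

The ratio depends only on the two bandwidth figures, so it stays in the 2-3x range quoted above no matter which model size is assumed.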

Can I run a local LLM on an RTX 4060 Ti 16GB?

Yes. The RTX 4060 Ti 16GB version (NOT the 8GB version) is the budget sweet spot for local LLMs. It runs Qwen 3.5 9B at Q8, gpt-oss 20B at Q4, or Qwen 3.6 27B squeezed in at IQ3. Bandwidth is only 288 GB/s (vs 936 GB/s on the RTX 3090), so expect 15-25 tok/sec on a 20B model: usable for interactive work, slow for batch jobs.

Is the RTX 5090 worth it for local LLMs vs RTX 4090?

For local LLMs specifically: yes if you can find one at MSRP. The 5090 has 32GB VRAM (vs 24GB on 4090) and 1792 GB/s bandwidth (vs 1008 GB/s on 4090). The extra 8GB lets you run Qwen 3.6 27B at Q8 or Qwen 3.6 35B-A3B at Q5 with comfortable context — workloads that don't fit on a 4090. For pure tokens/sec on 24GB-or-less models, the 4090 is still excellent and often cheaper.
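To see why the extra 8GB matters, a back-of-the-envelope fit check helps: weight size is roughly parameters times bits per weight divided by 8. This is a minimal sketch; the bits-per-weight values are rounded approximations for common GGUF quant types, and the 2 GB allowance for KV cache and runtime buffers is an assumption, not a measured figure.

```python
# Approximate bits per weight for common GGUF quant types (assumed, rounded).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM."""
    return weights_gb(params_billion, quant) + overhead_gb <= vram_gb


for vram in (24, 32):
    for quant in ("Q4_K_M", "Q8_0"):
        size = weights_gb(27, quant)
        verdict = "fits" if fits(27, quant, vram) else "does not fit"
        print(f"27B {quant} (~{size:.1f} GB) on {vram} GB: {verdict}")
```

Under these assumptions a 27B model at Q8 fits in 32 GB but not in 24 GB, which matches the 5090-vs-4090 distinction drawn above. Long contexts need more than the 2 GB allowance, so treat a borderline "fits" as a warning.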

Hardware May 18, 2026

Best Local LLM by GPU (2026): RTX 3090, 4090, 5090, A6000, M-series Picks

Your GPU (or unified-memory chip) is the biggest determinant of which local LLM runs well. This hub maps every popular consumer, workstation, and Apple Silicon option to the best model that actually fits, with quants, tokens/sec, and the exact OpenClaw config. Click through to the dedicated GPU page for detailed picks.

🎮 THE LOCAL-LLM GPU LADDER

Pick by VRAM: a budget 12 GB RTX 3060 handles 8-14B models; 24 GB (RTX 3090 for value, RTX 4090 for speed, or AMD RX 7900 XTX) runs 27B-class models at Q4; the RTX 5090's 32 GB adds headroom; and the 96 GB RTX PRO 6000 Blackwell is workstation territory for 70B at long context.

Pick Your GPU (2026)

Consumer NVIDIA

Your GPU	VRAM	Best Pick	Speed	Detailed Guide
RTX 3090	24 GB	Qwen 3.6 27B (Q4_K_M)	~35 tok/s	3090 guide →
RTX 4090	24 GB	Qwen 3.6 27B (Q4_K_M)	~50 tok/s	4090 guide →
RTX 5090	32 GB	Qwen 3.6 35B-A3B (Q6)	~80 tok/s	5090 guide →
RTX 4070 Ti SUPER	16 GB	Qwen 3.5 9B (Q8)	~45 tok/s	4070 Ti SUPER guide →
RTX 4060 Ti 16GB	16 GB	gpt-oss 20B (Q4)	~22 tok/s	4060 Ti 16GB guide →
RTX 5080	16 GB	gpt-oss 20B (Q4)	~40 tok/s	5080 guide →
RTX 4080 / 4080 SUPER	16 GB	gpt-oss 20B (Q4)	~30 tok/s	4080 guide →
RTX 5070 / 5070 Ti	12 / 16 GB	Qwen 3.5 9B (Q6) / gpt-oss 20B (Q4)	~35 tok/s	5070 guide →
RTX 4070	12 GB	Qwen 3.5 9B (Q6)	~30 tok/s	4070 guide →
RTX 3060 12GB	12 GB	Qwen 3.5 9B (Q6)	~16 tok/s	3060 guide →

Decision guides:

RTX 3090 vs RTX 4090 for local LLMs — same 24GB VRAM, different speed and value profile.
RTX 4070 Ti Super 16GB local LLM guide — what fits before you pay for 24GB VRAM.
RTX 5090 vs RTX 4090 vs used RTX 3090 — 32GB ceiling vs fast 24GB vs used-card value.
Mac Studio vs RTX workstation for local LLMs — unified memory vs CUDA VRAM, quiet simplicity vs NVIDIA speed.
Mac mini vs Mac Studio for local LLMs — which Apple Silicon host to buy by memory ceiling and bandwidth.

Workstation NVIDIA

Your GPU	VRAM	Best Pick	Speed	Detailed Guide
RTX A6000	48 GB	GLM-5.1 32B or Qwen 3.6 27B (Q8)	~28 tok/s	A6000 guide →
RTX PRO 6000 Blackwell	96 GB	70B-class models at higher quants	workload-dependent	Mac vs RTX decision →

Consumer AMD

Your GPU	VRAM	Best Pick	Speed	Detailed Guide
RX 7900 XTX	24 GB	Qwen 3.6 27B (Q4) via ROCm	~40 tok/s	7900 XTX guide →
Radeon AI PRO R9700	32 GB	70B-class via ROCm	workload-dependent	R9700 vs 3090 →

Intel Arc

Your GPU	VRAM	Best Pick	Speed	Detailed Guide
Arc B580	12 GB	Qwen 3.5 9B (IPEX-LLM)	~18 tok/s	Arc B580 guide →

Apple Silicon

Your Mac	Unified RAM	Best Pick	Speed	Detailed Guide
Mac mini / MBP M4 Pro	24-64 GB	Qwen 3.6 27B (Q4)	~15-18 tok/s	M4 Pro guide →
MacBook Pro M4 Max	36-128 GB	Qwen 3.6 27B (Q6 or Q8)	~25 tok/s	M4 Max guide →
Mac Studio M4 (M4 Max)	36-128 GB	Llama 3.3 70B (Q4)	~20 tok/s	Mac Studio M4 guide →
Mac Studio M2 Ultra	64-192 GB	gpt-oss 120B or Mistral Small 4 (119B-A6B)	~25 tok/s	M2 Ultra guide →
Mac Studio M3 Ultra	96-512 GB	Llama 3.3 70B (Q8), 100B+ MoE	~25-30 tok/s	M3 Ultra guide →

How to Read the Speed Numbers

The tok/sec figures above are realistic estimates for the recommended model on each GPU, not theoretical maximums. Real-world results vary with:

Quantization — Q4 runs ~30% faster than Q8 on the same model
Context length — KV cache eats VRAM and slows inference as it fills
Batch size — single-user inference is bandwidth-bound; batched serving is compute-bound
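The context-length effect can be made concrete with the standard KV cache size formula for attention models: two tensors (K and V) per layer, each holding one vector per KV head per token. The architecture numbers below are hypothetical placeholders, not the published configuration of any model named in this guide, and models with hybrid or linear attention layers will use less.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB (1e9 bytes) for a standard attention model.

    The leading 2 accounts for storing both K and V in every layer.
    bytes_per_value=2 assumes an FP16 cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * bytes_per_value * context_tokens)
    return total_bytes / 1e9


# Hypothetical architecture, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (8_192, 32_768, 131_072):
    size = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>7} tokens: {size:.2f} GB of KV cache")
```

The cache grows linearly with context, so a model that fits comfortably at 8K tokens can run out of VRAM at 32K on the same card.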

For OpenClaw specifically, tool-call accuracy matters more than tokens/sec. A 22 tok/s response that nails the JSON is better than 60 tok/s that drifts.

VRAM Tier vs Model Pick

The pattern is consistent across GPUs:

Available VRAM	Best Pick	For OpenClaw
8-12 GB	Qwen 3.5 9B (Q4 or Q5)	Not recommended — use cloud
16 GB	Qwen 3.5 9B (Q8) or gpt-oss 20B (Q4)	gpt-oss 20B (Q4)
24 GB	Qwen 3.6 27B (Q4_K_M)	gpt-oss 20B (Q5)
32 GB	Qwen 3.6 27B (Q6) or 35B-A3B (Q5)	gpt-oss 20B (Q8)
48 GB	GLM-5.1 32B (Q5) or Llama 3.3 70B (Q3)	Dual: gpt-oss 20B + Qwen 3.6 27B
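For scripting a setup check, the table above can be encoded as a simple lookup. This is only a sketch: the function name is hypothetical, and treating in-between sizes (say, 20 GB) as the next tier down is an assumption rather than something the table states.

```python
from typing import Optional, Tuple

# (minimum VRAM in GB, general pick, OpenClaw pick), copied from the table above.
TIERS = [
    (48, "GLM-5.1 32B (Q5) or Llama 3.3 70B (Q3)", "Dual: gpt-oss 20B + Qwen 3.6 27B"),
    (32, "Qwen 3.6 27B (Q6) or 35B-A3B (Q5)", "gpt-oss 20B (Q8)"),
    (24, "Qwen 3.6 27B (Q4_K_M)", "gpt-oss 20B (Q5)"),
    (16, "Qwen 3.5 9B (Q8) or gpt-oss 20B (Q4)", "gpt-oss 20B (Q4)"),
    (8, "Qwen 3.5 9B (Q4 or Q5)", "Not recommended; use cloud"),
]


def pick_for_vram(vram_gb: float) -> Optional[Tuple[str, str]]:
    """Return (general pick, OpenClaw pick) for the largest tier that fits.

    Returns None below 8 GB, which the table does not cover.
    """
    for min_gb, general, openclaw in TIERS:  # ordered largest tier first
        if vram_gb >= min_gb:
            return general, openclaw
    return None


for vram in (12, 16, 24, 32, 48):
    general, openclaw = pick_for_vram(vram)
    print(f"{vram} GB -> general: {general} | OpenClaw: {openclaw}")
```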

OpenClaw Tool-Calling Reality Check

Most GPU guides talk about benchmark scores or raw tokens/sec. For OpenClaw, only one thing matters: does the model emit clean JSON for tool calls, hundreds of times in a row, without drift?
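One way to test that "hundreds of times in a row" claim on your own hardware is a small harness that counts how many replies parse as a well-formed tool call. The sketch below is illustrative only: the `{"name": ..., "arguments": {...}}` shape is an assumed schema rather than OpenClaw's actual format, and `fake_model` is a stand-in you would replace with a call to your local model.

```python
import json
from typing import Callable, Set


def is_clean_tool_call(raw: str, allowed_tools: Set[str]) -> bool:
    """True if `raw` is pure JSON with a known tool name and an arguments object."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # prose around the JSON, truncation, trailing commas, etc.
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    return (isinstance(name, str)
            and name in allowed_tools
            and isinstance(call.get("arguments"), dict))


def clean_rate(generate: Callable[[int], str], runs: int,
               allowed_tools: Set[str]) -> float:
    """Fraction of `runs` replies that pass is_clean_tool_call."""
    ok = sum(is_clean_tool_call(generate(i), allowed_tools) for i in range(runs))
    return ok / runs


def fake_model(i: int) -> str:
    """Hypothetical stand-in: every 50th reply wraps the JSON in chatty prose."""
    if i % 50 == 49:
        return 'Sure! {"name": "read_file", "arguments": {}}'
    return json.dumps({"name": "read_file", "arguments": {"path": "notes.txt"}})


rate = clean_rate(fake_model, 200, {"read_file"})
print(f"clean tool calls: {rate:.1%}")
```

Running the same harness at a long context and after many consecutive calls is a closer match to agent workloads than a handful of short prompts.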

Models that pass this filter regardless of GPU:

gpt-oss 20B — cleanest tool-call JSON; safe production default
gpt-oss 120B — same, scaled up (needs 64+ GB VRAM)
Qwen 3.6 27B — fixed the Qwen 3.5 tool-calling regressions
Qwen 3.6 35B-A3B (MoE) — fast inference, reliable tools

Models to avoid for OpenClaw right now (regardless of how fast your GPU runs them):

Qwen 3.5 27B — known broken tool-calling in Ollama (GitHub issue #14493)
Anything under 7B at any quant — drifts under load

Can Your GPU Run It? (exact-answer guides)

Can a 24GB GPU run a 70B local LLM? — why 70B needs more than 24GB
Can an RTX 3090 run a 70B model? — the 24GB value card at 70B
Can an RTX 4090 run a 70B model? — same 24GB ceiling, more speed
Can 16GB VRAM run Qwen 3.5 27B? — 16GB fit reality
Can you run a 160GB MoE on 8GB VRAM? — expert-streaming edge case
Qwen 3.5 27B on a single RTX 3090 benchmark — real tokens/sec on 24GB
Best Local LLM Reddit picks for the RTX 4090 — Reddit-intent 4090 model picks
