Best Local LLM by GPU (2026): RTX 3090, 4090, 5090, A6000, M-series Picks
Your GPU (or unified-memory chip) is the biggest determinant of which local LLM runs well. This hub maps each popular consumer, workstation, and Apple Silicon option to the best model that actually fits, with the recommended quantization and approximate tokens/sec. Click through to the dedicated GPU page for detailed picks and the exact OpenClaw config.
Need help picking the right GPU for your model?
Book a Call at calendly.com/cloudyeti/meet. We'll match your workload to the cheapest GPU that runs it.
Pick Your GPU (2026)
Consumer NVIDIA
| Your GPU | VRAM | Best Pick | Speed | Detailed Guide |
|---|---|---|---|---|
| RTX 3090 | 24 GB | Qwen 3.6 27B (Q4_K_M) | ~35 tok/s | 3090 guide → |
| RTX 4090 | 24 GB | Qwen 3.6 27B (Q4_K_M) | ~50 tok/s | 4090 guide → |
| RTX 5090 | 32 GB | Qwen 3.6 35B-A3B (Q6) | ~80 tok/s | 5090 guide → |
| RTX 4070 Ti SUPER | 16 GB | Qwen 3.5 9B (Q8) | ~45 tok/s | 4070 Ti SUPER guide → |
| RTX 4060 Ti 16GB | 16 GB | gpt-oss 20B (Q4) | ~22 tok/s | 4060 Ti 16GB guide → |
Workstation NVIDIA
| Your GPU | VRAM | Best Pick | Speed | Detailed Guide |
|---|---|---|---|---|
| RTX A6000 | 48 GB | GLM-5.1 32B or Qwen 3.6 27B (Q8) | ~28 tok/s | A6000 guide → |
Apple Silicon
| Your Mac | Unified RAM | Best Pick | Speed | Detailed Guide |
|---|---|---|---|---|
| MacBook Pro M4 Max | 36-128 GB | Qwen 3.6 27B (Q6 or Q8) | ~25 tok/s | M4 Max guide → |
| Mac Studio M2 Ultra | 64-192 GB | gpt-oss 120B or Mistral Small 4 (119B-A6B) | ~25 tok/s | M2 Ultra guide → |
How to Read the Speed Numbers
The tok/s figures above are approximate real-world speeds for the recommended model, not theoretical maxima. Actual throughput on your machine depends on:
- Quantization — Q4 runs ~30% faster than Q8 on the same model
- Context length — KV cache eats VRAM and slows inference as it fills
- Batch size — single-user inference is bandwidth-bound; batched serving is compute-bound
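The bandwidth-bound point can be made concrete with a rough ceiling: in single-user decoding, each generated token reads every active weight once, so tokens/sec cannot exceed memory bandwidth divided by model size. The sketch below is an illustration under stated assumptions, not a benchmark. The bits-per-weight values are approximate figures for llama.cpp quant formats, the 936 GB/s bandwidth is the published spec for the RTX 3090, and the function names are hypothetical.

```python
# Approximate bits per weight for common llama.cpp quant formats.
# These are assumed round figures; real files vary slightly by model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.56, "Q8_0": 8.5}


def model_size_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         quant: str) -> float:
    """Upper bound on single-user decode speed.

    Assumes every active weight is read from memory once per token.
    For MoE models, pass the active parameter count, not the total.
    """
    return bandwidth_gb_s / model_size_gb(active_params_billion, quant)


if __name__ == "__main__":
    # Assumed spec bandwidth for an RTX 3090, dense 27B model at Q4_K_M.
    ceiling = decode_ceiling_tok_s(936, 27, "Q4_K_M")
    print(f"27B Q4_K_M on 936 GB/s: ceiling ~{ceiling:.0f} tok/s")
```

Under these assumptions the ceiling for a dense 27B model at Q4_K_M on a 936 GB/s card comes out at about 57 tok/s. Measured speeds such as the ~35 tok/s in the table land below that ceiling because of KV-cache reads, dequantization, and kernel overhead.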
For OpenClaw specifically, tool-call accuracy matters more than tokens/sec. A 22 tok/s response that nails the JSON is better than a 60 tok/s response that drifts.
VRAM Tier vs Model Pick
The pattern is consistent across GPUs:
| Available VRAM | Best Pick | For OpenClaw |
|---|---|---|
| 8-12 GB | Qwen 3.5 9B (Q4 or Q5) | Not recommended — use cloud |
| 16 GB | Qwen 3.5 9B (Q8) or gpt-oss 20B (Q4) | gpt-oss 20B (Q4) |
| 24 GB | Qwen 3.6 27B (Q4_K_M) | gpt-oss 20B (Q5) |
| 32 GB | Qwen 3.6 27B (Q6) or 35B-A3B (Q5) | gpt-oss 20B (Q8) |
| 48 GB | GLM-5.1 32B (Q5) or Llama 3.3 70B (Q3) | Dual: gpt-oss 20B + Qwen 3.6 27B |
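Whether a pick from the table fits a given tier depends on more than the weights: the KV cache grows with context length and shares the same memory. The following is a minimal sketch of that budget, assuming an FP16 KV cache and a fixed runtime overhead. The layer count, KV-head count, and head dimension are placeholder values chosen for illustration, not the published configuration of any model named here, and the function names are hypothetical.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes).

    Keys and values each store n_layers * n_kv_heads * head_dim
    elements per token, hence the leading factor of 2.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elems * bytes_per_elem / 1e9


def fits(vram_gb: float, weights_gb: float, kv_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights + KV cache + assumed runtime overhead fit in VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


if __name__ == "__main__":
    weights = 16.4  # assumed size of a 27B model at Q4_K_M, in GB
    for ctx in (8192, 32768):
        # Placeholder architecture: 64 layers, 8 KV heads, head_dim 128.
        kv = kv_cache_gb(64, 8, 128, ctx)
        print(f"context={ctx}: KV ~{kv:.1f} GB, fits in 24 GB: {fits(24, weights, kv)}")
```

With these placeholder numbers, the same 24 GB card that holds the model comfortably at an 8K context runs out of room at 32K, which is why a smaller quant or a quantized KV cache is often needed for long contexts.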
OpenClaw Tool-Calling Reality Check
Most GPU guides talk about benchmark scores or raw tokens/sec. For OpenClaw, only one thing matters: does the model emit clean JSON for tool calls, hundreds of times in a row, without drift?
Models that pass this filter regardless of GPU:
- gpt-oss 20B — cleanest tool-call JSON; safe production default
- gpt-oss 120B — same, scaled up (needs 64+ GB VRAM)
- Qwen 3.6 27B — fixed the Qwen 3.5 tool-calling regressions
- Qwen 3.6 35B-A3B (MoE) — fast inference, reliable tools
Models to avoid for OpenClaw right now (regardless of how fast your GPU runs them):
- Qwen 3.5 27B — known broken tool-calling in Ollama (GitHub issue #14493)
- Anything under 7B at any quant — drifts under load
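The filter described in this section (clean tool-call JSON, hundreds of times in a row) can be checked with a simple soak test. The sketch below is one possible harness, not OpenClaw's own validation. It assumes a tool call is a JSON object with a string `name` and an object `arguments`, which is a common shape but not confirmed as OpenClaw's schema. The `generate` callable and the `fake_model` stub are hypothetical stand-ins for whatever client you use to query your local model.

```python
import json
from typing import Callable


def is_clean_tool_call(raw: str, allowed_tools: set[str]) -> bool:
    """True if raw parses as JSON and matches the assumed tool-call shape."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    return (
        isinstance(name, str)
        and name in allowed_tools
        and isinstance(call.get("arguments"), dict)
    )


def soak_test(generate: Callable[[int], str], allowed_tools: set[str],
              runs: int = 300) -> float:
    """Call generate(i) for each run and return the fraction of bad tool calls."""
    failures = sum(
        not is_clean_tool_call(generate(i), allowed_tools) for i in range(runs)
    )
    return failures / runs


def fake_model(i: int) -> str:
    """Hypothetical stub: emits truncated JSON on every 50th call."""
    if i % 50 == 49:
        return '{"name": "read_file", "arguments": {"path": '
    return json.dumps({"name": "read_file", "arguments": {"path": "notes.txt"}})


if __name__ == "__main__":
    rate = soak_test(fake_model, {"read_file"}, runs=300)
    print(f"failure rate: {rate:.2%}")
```

To use it against a real model, replace `fake_model` with a function that sends your prompt and returns the raw tool-call string. A model worth running unattended should hold a failure rate at or near zero across the full run.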
See Also
- Best Local LLM by RAM (8GB–128GB) — RAM-tier matrix for non-GPU rigs
- Best Local Models for OpenClaw — model-first comparison
- OpenClaw Costs Guide — when local hardware pays back
- OpenClaw Troubleshooting — Ollama, MCP, tool-call issues