Best Local LLM for RTX 5090 (2026): 32GB VRAM Picks + OpenClaw Setup
The RTX 5090 raises the consumer VRAM ceiling from the RTX 4090's 24 GB to 32 GB and lifts memory bandwidth by about 78% (1008 → 1792 GB/s). That's enough headroom to run Qwen 3.6 27B at Q8 (near-FP16 quality) with 32K context, or step up to MoE models with 35B+ parameters.
Just bought an RTX 5090?
Book a Call at calendly.com/cloudyeti/meet. We'll set up OpenClaw + Ollama to run all your AI locally on the 5090, free.
Bottom Line
- Best overall pick: Qwen 3.6 35B-A3B (MoE) at Q6_K — ~80 tok/sec, 35B-class quality
- Best for OpenClaw production: gpt-oss 20B at Q8_0 (cleanest tool calls)
- Best premium 27B: Qwen 3.6 27B at Q8_0 (near-FP16)
- Best squeeze for 70B: Llama 3.3 70B at Q3_K_S (fits, but quality compromised)
Top Picks for RTX 5090 (32 GB VRAM, 1792 GB/s bandwidth)
1. Qwen 3.6 35B-A3B (Q6_K) — best overall
Mixture-of-Experts variant of Qwen 3.6 (April 22, 2026). 35B total params, 3B active per token. At Q6_K it uses about 28 GB. Because only the 3B active parameters are read for each token, the 5090's 1792 GB/s bandwidth translates into very fast decoding.
ollama pull qwen3.6:35b-q6_K
openclaw config set agents.defaults.models.chat ollama/qwen3.6:35b-q6_K
Expected speed: 75-90 tokens/sec.
2. gpt-oss 20B (Q8_0) — best for OpenClaw production
OpenAI’s 20B at full Q8 uses about 22 GB. Cleanest tool-call JSON of any open-weight model.
ollama pull gpt-oss:20b-q8_0
openclaw config set agents.defaults.models.chat ollama/gpt-oss:20b-q8_0
openclaw run --agent --max-hours 8 "Implement the spec end-to-end"
3. Qwen 3.6 27B (Q8_0) — premium quality
Full Q8 of the April 22 release uses about 30 GB with 32K context. Near-FP16 quality. Speed: ~45 tok/sec.
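The ~45 tok/sec figure is close to what a simple memory-bandwidth bound predicts. The sketch below is a back-of-the-envelope estimate, not a benchmark: it assumes single-stream decoding reads every weight from VRAM once per token, and that Q8_0 stores 8.5 bits per weight (the llama.cpp block layout). The function name is illustrative.

```python
def decode_ceiling_tok_per_sec(params_billion: float, bits_per_weight: float,
                               bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumes every weight is read from VRAM once per generated token and
    ignores KV-cache reads, kernel launch overhead, and sampling cost.
    """
    weight_gb = params_billion * bits_per_weight / 8  # billions of bytes = GB
    return bandwidth_gb_s / weight_gb


Q8_0_BITS = 8.5  # 32 int8 weights plus one fp16 scale per block

# Qwen 3.6 27B at Q8_0 on the RTX 5090 (1792 GB/s).
ceiling_5090 = decode_ceiling_tok_per_sec(27, Q8_0_BITS, 1792)
print(f"Bandwidth-bound ceiling: {ceiling_5090:.1f} tok/sec")
```

Under these assumptions the ceiling lands a little above 62 tok/sec, so a measured ~45 tok/sec would mean the card is running at roughly 70% of its theoretical bound. This only applies to dense models; an MoE model reads just its active experts per token.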
4. Mistral Small 4 (119B-A6B MoE, IQ3_XS) — premium reasoning squeeze
Mistral’s March 16, 2026 release. 119B total params, 6B active. At IQ3_XS it uses about 30 GB. Quality is degraded at IQ3, but the underlying model is premium tier.
5. Qwen 3.5 122B-A10B (IQ2_XXS) — biggest squeeze
Pick this for breadth of knowledge rather than output quality. It uses ~30 GB at IQ2_XXS. Note: Qwen 3.5 has a tool-calling bug under Ollama, so pair it with gpt-oss for agent loops.
What Fits in 32 GB VRAM (RTX 5090)
| Model | Quant | VRAM | Tok/sec |
|---|---|---|---|
| Qwen 3.6 35B-A3B (MoE) | Q6_K | ~28 GB | 75-90 |
| Qwen 3.6 27B | Q8_0 | ~30 GB | 40-50 |
| gpt-oss 20B | Q8_0 | ~22 GB | 70-85 |
| Mistral Small 4 (119B-A6B) | IQ3_XS | ~30 GB | 50-65 (MoE) |
| Llama 3.3 70B | Q3_K_S | ~28 GB | 15-22 (degraded) |
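The VRAM column above can be sanity-checked with a rough weights-only estimate: parameter count times bits per weight, divided by 8. The sketch below assumes the nominal llama.cpp block sizes for Q8_0 and Q6_K; real GGUF files mix quant types across tensors, so actual sizes differ somewhat. The helper name and the 32 GB budget constant are illustrative.

```python
# Bits per weight implied by the llama.cpp block layouts (not measured file sizes).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,     # 34 bytes per 32-weight block
    "Q6_K": 6.5625,  # 210 bytes per 256-weight super-block
}

VRAM_GB = 32  # RTX 5090


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB (weights only)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for name, params, quant in [
    ("Qwen 3.6 35B-A3B", 35, "Q6_K"),  # MoE: all 35B params must still be resident
    ("Qwen 3.6 27B", 27, "Q8_0"),
    ("gpt-oss 20B", 20, "Q8_0"),
]:
    size = weight_gb(params, quant)
    print(f"{name:18s} {quant:5s} ~{size:4.1f} GB weights, "
          f"{VRAM_GB - size:4.1f} GB left for KV cache and runtime")
```

Whatever is left after the weights has to cover the KV cache, activations, and the runtime itself, which is why the table figures sit a little above the weights-only numbers for the models run with long context.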
OpenClaw Setup on RTX 5090
ollama pull qwen3.6:35b-q6_K
ollama pull gpt-oss:20b-q8_0
openclaw config set agents.defaults.models.chat ollama/qwen3.6:35b-q6_K
openclaw config set agents.defaults.models.agent ollama/gpt-oss:20b-q8_0
openclaw config set agents.defaults.keep_alive 30m
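Before pointing OpenClaw at these tags, it can help to confirm that both pulls actually completed. The sketch below queries Ollama's local `/api/tags` endpoint and reports any missing tag. It assumes Ollama is listening on its default address (`localhost:11434`) and that installed models are listed under the same tag strings used in the commands above; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # Ollama's default local address
WANTED = ["qwen3.6:35b-q6_K", "gpt-oss:20b-q8_0"]    # tags from the commands above


def missing_models(tags_response: dict, wanted: list[str]) -> list[str]:
    """Return the wanted tags that are absent from an /api/tags response."""
    installed = {m["name"] for m in tags_response.get("models", [])}
    return [tag for tag in wanted if tag not in installed]


def fetch_tags(url: str = OLLAMA_TAGS_URL) -> dict:
    """Fetch the list of locally installed models from Ollama."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        missing = missing_models(fetch_tags(), WANTED)
    except (OSError, ValueError) as exc:
        # Connection refused, timeout, or a non-JSON reply.
        print(f"Could not query Ollama: {exc}")
    else:
        print("All models present" if not missing else f"Missing: {', '.join(missing)}")
```

Keeping the comparison in `missing_models` separate from the HTTP call makes the check easy to reuse, for example in a startup script that refuses to launch a long agent run when a model is absent.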
Common Mistakes on RTX 5090
- Running Llama 3.3 70B at IQ2 because it fits. Quality at IQ2 is so degraded that Qwen 3.6 27B at Q8 beats it on every benchmark and runs 2-3x faster.
- Maxing context to 256K. KV cache at 256K eats 20+ GB. Cap at 64K-128K depending on the model.
- Buying the 5090 just for tokens/sec. The real value is the 32 GB VRAM ceiling. If you only run 24GB-and-under models, the 4090 is half the price and still fast.
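To see why long contexts are so expensive, the KV-cache point in the list above can be put into numbers. The cache stores one key and one value vector per layer for every token, so it grows linearly with context length. The sketch below uses a hypothetical architecture (48 layers, 4 KV heads, head dimension 128, fp16 cache) chosen only for illustration; it is not the real configuration of any model named here.

```python
def kv_cache_gib(context_tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB: one key and one value vector per layer per token."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


# Hypothetical architecture for illustration only (not a real model config).
ARCH = dict(layers=48, kv_heads=4, head_dim=128)

for ctx in (32 * 1024, 64 * 1024, 128 * 1024, 256 * 1024):
    print(f"{ctx // 1024:>3d}K context: {kv_cache_gib(ctx, **ARCH):5.1f} GiB")
```

With these assumed values the cache needs 6 GiB at 64K and 24 GiB at 256K, which alone would consume most of a 32 GB card before any weights are loaded. Plugging in a model's actual layer count and KV-head count gives a better estimate for that model.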
See Also
- Best Local LLM for RTX 4090 — same family, 24GB tier
- Best Local LLM for RTX A6000 — 48GB workstation
- Best Local LLM by GPU (hub)
- Best Local LLM by RAM (hub)