
Best Local LLM for RTX 5090 (2026): 32GB VRAM Picks + OpenClaw Setup

The RTX 5090 raised the consumer VRAM ceiling from 24 GB to 32 GB and increased memory bandwidth by about 78% over the RTX 4090 (1008 → 1792 GB/s). That's enough headroom to run Qwen 3.6 27B at Q8 (near-FP16 quality) with 32K context, or to step up to MoE models with 35B+ total parameters.


Bottom Line

  • Best overall pick: Qwen 3.6 35B-A3B (MoE) at Q6_K — ~80 tok/sec, 35B-class quality
  • Best for OpenClaw production: gpt-oss 20B at Q8_0 (cleanest tool calls)
  • Best premium 27B: Qwen 3.6 27B at Q8_0 (near-FP16)
  • Best squeeze for 70B: Llama 3.3 70B at Q3_K_S (fits, but quality compromised)

Top Picks for RTX 5090 (32 GB VRAM, 1792 GB/s bandwidth)

1. Qwen 3.6 35B-A3B (Q6_K) — best overall

Mixture-of-Experts variant of Qwen 3.6 (released April 22, 2026): 35B total params, 3B active per token. At Q6_K it uses about 28 GB. Because only the 3B active parameters are read for each generated token, the 5090's 1792 GB/s of bandwidth translates into very fast decoding.

ollama pull qwen3.6:35b-q6_K
openclaw config set agents.defaults.models.chat ollama/qwen3.6:35b-q6_K

Expected speed: 75-90 tokens/sec.
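
A rough way to see why bandwidth and active parameter count drive speed: during decoding, each new token has to stream the active weights from VRAM at least once, so bandwidth divided by bytes read per token gives a ceiling on tokens/sec. The sketch below is a back-of-envelope estimate, not a benchmark. The bits-per-weight averages (8.5 for Q8_0, 6.56 for Q6_K) are assumptions based on common GGUF quant layouts, and the function name is hypothetical.

```python
def decode_ceiling_tok_per_sec(active_params_billion: float,
                               bits_per_weight: float,
                               bandwidth_gb_per_sec: float) -> float:
    """Upper bound on decode speed if each token streams the active weights once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_sec / gb_read_per_token


RTX_5090_BANDWIDTH = 1792  # GB/s, figure quoted in this article
Q8_0_BITS = 8.5            # assumed average bits per weight for Q8_0
Q6_K_BITS = 6.56           # assumed average bits per weight for Q6_K

# Dense model: all 27B parameters are read for every token.
dense_27b = decode_ceiling_tok_per_sec(27, Q8_0_BITS, RTX_5090_BANDWIDTH)

# MoE model: only the ~3B active parameters are read for every token.
moe_3b_active = decode_ceiling_tok_per_sec(3, Q6_K_BITS, RTX_5090_BANDWIDTH)

print(f"Dense 27B @ Q8_0 ceiling:     {dense_27b:.0f} tok/sec")
print(f"MoE 3B-active @ Q6_K ceiling: {moe_3b_active:.0f} tok/sec")
```

The dense ceiling comes out at roughly 62 tok/sec, which sits just above the 40-50 tok/sec quoted for Qwen 3.6 27B. The MoE ceiling is far above the quoted 75-90 tok/sec. Expert routing, attention over the KV cache, and kernel overhead take up much of that gap, so treat the number as a bound rather than a prediction.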

2. gpt-oss 20B (Q8_0) — best for OpenClaw production

OpenAI's 20B model at Q8_0 uses about 22 GB. It produces the cleanest tool-call JSON of the open-weight models on this list, which matters most in long agent runs.

ollama pull gpt-oss:20b-q8_0
openclaw config set agents.defaults.models.chat ollama/gpt-oss:20b-q8_0
openclaw run --agent --max-hours 8 "Implement the spec end-to-end"

3. Qwen 3.6 27B (Q8_0) — premium quality

Full Q8 of the April 22 release uses about 30 GB with 32K context. Near-FP16 quality. Speed: ~45 tok/sec.

4. Mistral Small 4 (119B-A6B MoE, IQ3_XS) — premium reasoning squeeze

Mistral’s March 16, 2026 release. 119B total params, 6B active. At IQ3_XS uses about 30 GB. Quality is degraded at IQ3 but the underlying model is premium tier.

5. Qwen 3.5 122B-A10B (IQ2_XXS) — biggest squeeze

Pick this for breadth of knowledge rather than output quality. It uses ~30 GB at IQ2_XXS. Note: Qwen 3.5 has a tool-calling bug under Ollama, so pair it with gpt-oss for agent loops.

What Fits in 32 GB VRAM (RTX 5090)

Model                         Quant     VRAM      Tok/sec
Qwen 3.6 35B-A3B (MoE)        Q6_K      ~28 GB    75-90
Qwen 3.6 27B                  Q8_0      ~30 GB    40-50
gpt-oss 20B                   Q8_0      ~22 GB    70-85
Mistral Small 4 (119B-A6B)    IQ3_XS    ~30 GB    50-65 (MoE)
Llama 3.3 70B                 Q3_K_S    ~28 GB    15-22 (degraded)
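
To sanity-check a row like the ones above before downloading, weight size can be approximated as total parameters times average bits per weight, divided by 8. The sketch below is illustrative only. The bits-per-weight values are assumed averages for GGUF quants, the 1 GB runtime overhead is a guess, and `weights_gb` and `fits` are hypothetical helper names. For MoE models, use total parameters, since every expert has to be resident in VRAM even though only a few are active per token.

```python
# Assumed average bits per weight for two common GGUF quant types.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.56}


def weights_gb(total_params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in GB."""
    return total_params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(total_params_billion: float, quant: str, kv_cache_gb: float,
         vram_gb: float = 32.0, overhead_gb: float = 1.0) -> bool:
    """True if weights + KV cache + assumed runtime overhead fit in VRAM."""
    needed = weights_gb(total_params_billion, quant) + kv_cache_gb + overhead_gb
    return needed <= vram_gb


print(f"35B @ Q6_K weights: {weights_gb(35, 'Q6_K'):.1f} GB")
print(f"27B @ Q8_0 weights: {weights_gb(27, 'Q8_0'):.1f} GB")
print("35B @ Q6_K with 2 GB of KV cache fits:", fits(35, "Q6_K", kv_cache_gb=2.0))
print("27B @ Q8_0 with 4 GB of KV cache fits:", fits(27, "Q8_0", kv_cache_gb=4.0))
```

Both weight estimates land just under 29 GB, which is why the 27B at Q8_0 leaves little room for context on a 32 GB card.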

OpenClaw Setup on RTX 5090

ollama pull qwen3.6:35b-q6_K
ollama pull gpt-oss:20b-q8_0
openclaw config set agents.defaults.models.chat ollama/qwen3.6:35b-q6_K
openclaw config set agents.defaults.models.agent ollama/gpt-oss:20b-q8_0
openclaw config set agents.defaults.keep_alive 30m
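
After pulling both models it is worth confirming that whatever is loaded sits entirely in VRAM, since a partial CPU offload will cut speed well below the numbers above. The sketch below queries Ollama's local `/api/ps` endpoint and compares the `size` and `size_vram` fields it reports. It assumes Ollama is listening on its default address, and the helper names are hypothetical. It does not touch OpenClaw itself.

```python
import json
import urllib.request

OLLAMA_PS_URL = "http://localhost:11434/api/ps"  # assumed default Ollama address


def gpu_fraction(model: dict) -> float:
    """Share of a loaded model's memory that is resident in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def loaded_models(url: str = OLLAMA_PS_URL) -> list:
    """Fetch the list of models currently loaded by the local Ollama server."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp).get("models", [])


def report(models: list) -> list:
    """Format one summary line per loaded model."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} in VRAM" for m in models]
```

With Ollama running and a model loaded, `print("\n".join(report(loaded_models())))` lists each model with its VRAM share. Anything below 100% means part of the model spilled to system RAM, and a smaller quant or a shorter context is probably needed.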

Common Mistakes on RTX 5090

  1. Running Llama 3.3 70B at IQ2 because it fits. Quality at IQ2 is so degraded that Qwen 3.6 27B at Q8 beats it on every benchmark and runs 2-3x faster.
  2. Maxing context to 256K. KV cache at 256K eats 20+ GB. Cap at 64K-128K depending on the model.
  3. Buying the 5090 just for tokens/sec. The real value is the 32 GB VRAM ceiling. If you only run 24GB-and-under models, the 4090 is half the price and still fast.
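
The context-length mistake is easy to quantify. For a standard attention layout, the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below uses made-up dimensions (48 layers, 8 KV heads, head size 128) purely for illustration. They are not the specs of any model in this article, and architectures with sliding-window or hybrid attention will use less.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: int = 2) -> float:
    """KV cache size in GiB for a full-attention model at a given context length."""
    # Factor of 2 covers keys plus values.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


# Hypothetical model dimensions, chosen only to show the scaling.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for context in (32 * 1024, 64 * 1024, 128 * 1024, 256 * 1024):
    fp16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, context)
    q8 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, context, bytes_per_element=1)
    print(f"{context // 1024:>3}K context: {fp16:5.1f} GiB at FP16, {q8:5.1f} GiB at 8-bit")
```

Cache size grows linearly with context, so under these assumptions 256K costs four times what 64K does. Even with an 8-bit cache, the 256K case alone would take most of a 32 GB card before any weights are loaded.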
