What is the best local LLM for an RTX 5090?

Qwen 3.6 35B-A3B (MoE) at Q6_K is the best pick that takes advantage of the 5090's 32 GB VRAM. It uses about 28 GB with comfortable context and runs at ~80 tokens/sec thanks to the MoE design + 1792 GB/s bandwidth. For OpenClaw, gpt-oss 20B at Q8_0 (~22 GB) is the safer production pick.

RTX 5090 vs RTX 4090 — is the upgrade worth it for LLMs?

Yes if you want to step past 24GB. The 5090 unlocks Qwen 3.6 27B at Q8 quality, Qwen 3.6 35B-A3B at Q6, or DeepSeek V3 at heavy quants. On 24GB-and-under workloads, the 5090 is ~60% faster than the 4090 (1792 vs 1008 GB/s bandwidth) but the cost premium is steep. Buy the 5090 for the VRAM ceiling, not just the speed.

Can the RTX 5090 run a 70B model?

Yes at degraded quants. Llama 3.3 70B at Q3_K_S uses about 28 GB — fits the 5090 with KV cache for short context. Quality is degraded vs Q4 but workable. For full Q4 70B you still need 48 GB+, so either two 5090s or a workstation GPU.

← Back to Blog

Hardware May 18, 2026

Best Local LLM for RTX 5090 (2026): 32GB VRAM Picks + OpenClaw Setup

The RTX 5090 jumped the consumer LLM ceiling from 24 GB to 32 GB VRAM and nearly doubled memory bandwidth (1008 → 1792 GB/s) over the RTX 4090. That's enough headroom to run Qwen 3.6 27B at Q8 (near-FP16) with 64K context, or step up to MoE models with 35B+ parameters.

Just bought an RTX 5090?

See our AI training options. We'll set up OpenClaw + Ollama to run all your AI locally on the 5090, free.

🎮 THE RTX 5090 — 32 GB FOR LOCAL LLMs

The RTX 5090's 32 GB is the new-card pick for local LLMs — more headroom than a 24 GB 4090/3090 for bigger context and models. Need 70B at long context? Step up to the 96 GB Blackwell.

5090GIGABYTE RTX 5090 32 GB ↗ 4090GIGABYTE RTX 4090 24 GB ↗ 96GBRTX PRO 6000 Blackwell 96 GB ↗

Bottom Line

Best overall pick: Qwen 3.6 35B-A3B (MoE) at Q6_K — ~80 tok/sec, 35B-class quality
Best for OpenClaw production: gpt-oss 20B at Q8_0 (cleanest tool calls)
Best premium 27B: Qwen 3.6 27B at Q8_0 (near-FP16)
Best squeeze for 70B: Llama 3.3 70B at Q3_K_S (fits, but quality compromised)

Top Picks for RTX 5090 (32 GB VRAM, 1792 GB/s bandwidth)

1. Qwen 3.6 35B-A3B (Q6_K) — best overall

Mixture-of-Experts variant of Qwen 3.6 (April 22, 2026). 35B total params, 3B active per token. At Q6_K uses about 28 GB. The 5090’s bandwidth + MoE design = blistering inference.

ollama pull qwen3.6:35b-q6_K
openclaw config set agents.defaults.models.chat ollama/qwen3.6:35b-q6_K

Expected speed: 75-90 tokens/sec.

2. gpt-oss 20B (Q8_0) — best for OpenClaw production

OpenAI’s 20B at full Q8 uses about 22 GB. Cleanest tool-call JSON of any open-weight model.

ollama pull gpt-oss:20b-q8_0
openclaw config set agents.defaults.models.chat ollama/gpt-oss:20b-q8_0
openclaw run --agent --max-hours 8 "Implement the spec end-to-end"

3. Qwen 3.6 27B (Q8_0) — premium quality

Full Q8 of the April 22 release uses about 30 GB with 32K context. Near-FP16 quality. Speed: ~45 tok/sec.

4. Mistral Small 4 (119B-A6B MoE, IQ3_XS) — premium reasoning squeeze

Mistral’s March 16, 2026 release. 119B total params, 6B active. At IQ3_XS uses about 30 GB. Quality is degraded at IQ3 but the underlying model is premium tier.

5. Qwen 3.5 122B-A10B (IQ2_XXS) — biggest squeeze

For breadth of knowledge over inference quality. ~30 GB at IQ2_XXS. Note: Qwen 3.5 has the Ollama tool-calling bug — pair with gpt-oss for agent loops.

What Fits in 32 GB VRAM (RTX 5090)

Model	Quant	VRAM	Tok/sec
Qwen 3.6 35B-A3B (MoE)	Q6_K	~28 GB	75-90
Qwen 3.6 27B	Q8_0	~30 GB	40-50
gpt-oss 20B	Q8_0	~22 GB	70-85
Mistral Small 4 (119B-A6B)	IQ3_XS	~30 GB	50-65 (MoE)
Llama 3.3 70B	Q3_K_S	~28 GB	15-22 (degraded)

OpenClaw Setup on RTX 5090

ollama pull qwen3.6:35b-q6_K
ollama pull gpt-oss:20b-q8_0
openclaw config set agents.defaults.models.chat ollama/qwen3.6:35b-q6_K
openclaw config set agents.defaults.models.agent ollama/gpt-oss:20b-q8_0
openclaw config set agents.defaults.keep_alive 30m

Common Mistakes on RTX 5090

Running Llama 3.3 70B at IQ2 because it fits. Quality at IQ2 is so degraded that Qwen 3.6 27B at Q8 beats it on every benchmark and runs 2-3x faster.
Maxing context to 256K. KV cache at 256K eats 20+ GB. Cap at 64K-128K depending on the model.
Buying the 5090 just for tokens/sec. The real value is the 32 GB VRAM ceiling. If you only run 24GB-and-under models, the 4090 is half the price and still fast.