Best Local LLM for RTX 4090 (2026): 24GB VRAM Picks + OpenClaw Setup
The RTX 4090 is the bandwidth king for 24 GB workloads. 1008 GB/s memory bandwidth runs Qwen 3.6 27B at Q4 at ~50 tokens/sec — 40% faster than an RTX 3090 on the same model. If you bought a 4090 for gaming, OpenClaw + Ollama turn it into a serious local AI rig.
Bottom Line
- Best overall pick: Qwen 3.6 27B at Q4_K_M (~50 tok/sec, sweet spot)
- Best for OpenClaw production: gpt-oss 20B at Q5_K_M
- Best fast pick: Qwen 3.6 35B-A3B (MoE) at Q5_K_M (65-75 tok/sec)
- vs RTX 3090: ~40% faster on identical workloads, same 24 GB ceiling
Top Picks for RTX 4090 (24 GB VRAM, 1008 GB/s bandwidth)
1. Qwen 3.6 27B (Q4_K_M) — best overall
Released April 22, 2026. Outperforms the 397B Qwen 3.5 MoE on agentic coding (77.2 SWE-Bench Verified). About 17 GB VRAM at Q4_K_M with 32K context.
ollama pull qwen3.6:27b
openclaw config set agents.defaults.models.chat ollama/qwen3.6:27b
Expected speed on RTX 4090: 45-55 tokens/sec.
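That range follows roughly from memory bandwidth: while decoding, a dense model reads essentially all of its weights once per generated token, so bandwidth divided by weight size gives a ceiling. Here is a minimal back-of-envelope sketch, assuming decode is purely bandwidth-bound and ignoring KV-cache reads and kernel overhead (the function name is illustrative, not part of any tool):

```python
def decode_ceiling_tok_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token needs one full read of the weights."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Figures from this article: 1008 GB/s bandwidth, ~17 GB for Qwen 3.6 27B at Q4_K_M.
ceiling = decode_ceiling_tok_per_sec(1008, 17)
print(f"Theoretical ceiling: {ceiling:.0f} tok/sec")
```

The quoted 45-55 tokens/sec sits a little below this ceiling, which is what you would expect once real-world overhead is included.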
2. gpt-oss 20B (Q5_K_M) — best for OpenClaw production
OpenAI’s 20B at Q5 uses about 15 GB. Cleanest tool-call JSON of any open-weight model.
ollama pull gpt-oss:20b-q5_K_M
openclaw config set agents.defaults.models.chat ollama/gpt-oss:20b-q5_K_M
3. Qwen 3.6 35B-A3B (Q5_K_M) — fastest
MoE variant with 3B active parameters per token. At Q5_K_M it uses about 22 GB. Inference is blistering on the 4090: 65-75 tok/sec.
4. Qwen 3.6 27B (Q5_K_M) — premium quality squeeze
Q5_K_M of the same 27B model uses ~19 GB. Slight quality bump over Q4 at roughly 20% lower speed (35-45 tok/sec vs. 45-55). Worth it if your workload is reasoning-heavy.
What Fits in 24 GB VRAM (RTX 4090)
| Model | Quant | VRAM | Tok/sec |
|---|---|---|---|
| Qwen 3.6 27B | Q4_K_M | ~17 GB | 45-55 |
| Qwen 3.6 27B | Q5_K_M | ~19 GB | 35-45 |
| Qwen 3.6 35B-A3B (MoE) | Q5_K_M | ~22 GB | 65-75 |
| gpt-oss 20B | Q5_K_M | ~15 GB | 55-70 |
| Qwen 3.5 9B | Q8_0 | ~10 GB | 90-110 |
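The dense-model VRAM figures in the table can be sanity-checked with simple arithmetic: parameter count times average bits per weight, divided by eight. The sketch below assumes commonly quoted approximate bits-per-weight values for each GGUF quant type; real files vary by model, and `reserve_gb` is a hypothetical allowance for KV cache and runtime buffers rather than a measured number.

```python
# Approximate average bits per weight for common GGUF quant types.
# Assumption: rough, commonly quoted figures; actual files differ slightly.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}


def weights_gb(params_billions: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes) for a parameter count and quant."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits(params_billions: float, quant: str,
         vram_gb: float = 24.0, reserve_gb: float = 3.0) -> bool:
    """True if the weights plus a reserve for KV cache and buffers fit in VRAM."""
    return weights_gb(params_billions, quant) + reserve_gb <= vram_gb


for params, quant in [(27, "Q4_K_M"), (27, "Q5_K_M"), (9, "Q8_0"), (70, "Q4_K_M")]:
    size = weights_gb(params, quant)
    verdict = "fits" if fits(params, quant) else "does not fit"
    print(f"{params}B {quant}: ~{size:.1f} GB weights, {verdict} in 24 GB")
```

For the dense 27B and 9B rows the estimates land close to the table's figures; the table values run slightly higher because they include context.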
OpenClaw Setup on RTX 4090
ollama pull qwen3.6:27b
openclaw config set agents.defaults.models.chat ollama/qwen3.6:27b
openclaw config set agents.defaults.context_limit 65536
openclaw config set agents.defaults.models.agent ollama/gpt-oss:20b-q5_K_M
openclaw chat "Refactor the auth module"
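To check the tokens/sec figures on your own card, one option is to query Ollama's local HTTP API directly. With streaming disabled, its `/api/generate` response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch, assuming Ollama is listening on its default port 11434 and the model has already been pulled; `benchmark` is an illustrative helper, not part of Ollama or OpenClaw:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Decode speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str) -> float:
    """Send one non-streaming generate request and return the decode speed."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Calling `benchmark("qwen3.6:27b", "Explain KV caching in two sentences.")` with Ollama running returns the measured decode speed. Run it a few times, since the first call also includes model load time in the total request latency.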
Common Mistakes on RTX 4090
- Defaulting to Q8 because you can. Q5_K_M is near-FP16 quality. Q8 just halves your tokens/sec for imperceptible gain on 27B models.
- Running Llama 3.3 70B at IQ2. Qwen 3.6 27B at Q5 beats it on benchmarks for half the VRAM. The 70B obsession is mostly outdated for 2026.
- Setting context to 128K. KV cache eats 8-12 GB on top of the model. You’ll OOM. Cap at 64K.
- Forgetting the 4090 pulls 450W. Use a 1000W+ PSU with at least 100W headroom for sustained inference loads.
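The context-length warning comes down to KV cache arithmetic: the cache stores one key and one value vector per layer, per KV head, per token, so it grows linearly with context. The sketch below uses a hypothetical architecture (48 layers, 4 KV heads, head dimension 128) purely for illustration; these are not the published Qwen 3.6 27B settings, and it assumes an FP16 cache with standard attention in every layer.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys and values (the factor of 2) per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical architecture for illustration only, NOT a real model config.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

for context in (32_768, 65_536, 131_072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context) / 2**30
    print(f"{context:>7} tokens -> {gib:.1f} GiB")
```

With these placeholder values the cache is 3 GiB at 32K, 6 GiB at 64K, and 12 GiB at 128K. Doubling the context doubles the cache, which is why 128K on top of a 17-19 GB model overruns a 24 GB card.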
See Also
- Best Local LLM for RTX 3090 — same VRAM, slower bandwidth, half the price used
- Best Local LLM for RTX 5090 — 32GB step up
- Best Local LLM by GPU (hub)
- Best Local LLM by RAM (hub)