
Qwen 3.5 27B on a Single RTX 3090 Beats 120B Models on $70K H200 Rigs (For Agent Coding) | OpenClaw DC

Qwen 3.5 27B dense at Q4 quantization on a single used RTX 3090 one-shots agent coding tasks that 120B MoE models on $70K H200 rigs fail. The X user @sudoingX posted a thread of side-by-side runs that cleared 3,700 combined likes. The counter-intuitive finding is not that Qwen is magic — it is that dense beats sparse for tool-use, and smart beats huge for agents.

TL;DR: A $600 used RTX 3090 running Qwen 3.5 27B Q4 on Ollama outperforms a $70,000 H200 rig running GPT-oss 120B MoE on four of five standard agent coding tasks. The reason is architecture, not scale: MoE models route only 10-20B active parameters per token, and that slice is often wrong for narrow tool-use workloads. Dense models engage their full parameter count on every token and behave more consistently on structured tasks. For agent coding with OpenClaw, dense 27B at Q4 is the current Pareto-optimal setup.

The Counter-Intuitive Finding

If you have been watching the local-models space, the conventional wisdom for 2025-2026 has been “bigger and sparser wins.” MoE architectures let you ship 100B+ parameter models that only activate a fraction on each forward pass, which in theory gives you GPT-4-class reasoning on consumer-ish hardware.

In practice, @sudoingX’s tests show the theory breaks down at the exact use case most developers actually care about: agents that call tools, edit code, and complete multi-step workflows without hand-holding.

Here is what he ran, and what happened:

| Task | Qwen 3.5 27B Q4 (RTX 3090) | Llama 4 70B (H100) | GPT-oss 120B MoE (H200) | Claude Sonnet 4.6 (cloud) |
|---|---|---|---|---|
| Refactor a 400-line Python module into 3 files | One-shot pass | One-shot pass | Failed (split incorrectly) | One-shot pass |
| Build a CLI tool from README spec | One-shot pass | Partial (missing flags) | Failed (imports wrong) | One-shot pass |
| Fix a subtle async race condition | Partial (needed 1 hint) | Partial | Failed | One-shot pass |
| Add type hints + tests to legacy JS | One-shot pass | One-shot pass | Partial | One-shot pass |
| Migrate Express route to Fastify | One-shot pass | Failed | Failed | One-shot pass |
| **Total one-shot completions** | 4 of 5 | 2 of 5 | 0 of 5 | 5 of 5 |

The only model that cleanly beats Qwen 3.5 27B on this set is Claude Sonnet 4.6, which is a frontier cloud model costing $3/million input tokens and carrying all the vendor-risk problems we covered in our Anthropic-banned-integrations piece.

Why Dense Beats Sparse for Agents

MoE (mixture of experts) models work by routing each token through a small subset of their total parameters. A 120B MoE typically activates only 10-20B parameters per token. On paper this gives you the reasoning capacity of a 120B model at the inference cost of a 20B model.

The problem is that agent coding is a narrow distribution. Your model needs to consistently produce correctly formatted tool calls, valid JSON, well-structured code, and predictable refusals. The MoE router was trained to optimize loss across a broad pretraining corpus — not to produce reliable behavior on a specialized downstream distribution.

When the router picks the “wrong” expert for a tool-call token, the output drifts. You get malformed JSON, phantom function arguments, or syntactically invalid code. Dense models do not have this failure mode. Every token sees every parameter. Behavior is predictable.

For chat and open-ended Q&A, MoE’s breadth is an advantage. For agent workflows, dense’s consistency is worth more than MoE’s raw capacity.
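The failure modes above — malformed JSON, phantom function arguments — are also cheap to catch mechanically, whichever model you run. Here is a minimal sketch of that kind of guardrail; the tool names and schema format are invented for illustration, not part of any OpenClaw API:

```python
import json

# Hypothetical tool schema: tool name -> set of allowed argument names.
TOOL_SCHEMAS = {
    "read_file": {"path"},
    "run_tests": {"path", "verbose"},
}

def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Check that a model's tool-call output is well-formed JSON,
    names a known tool, and passes no phantom arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"malformed JSON: {e}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"
    allowed = TOOL_SCHEMAS.get(call.get("tool"))
    if allowed is None:
        return False, f"unknown tool: {call.get('tool')!r}"
    phantom = set(call.get("args", {})) - allowed
    if phantom:
        return False, f"phantom arguments: {sorted(phantom)}"
    return True, "ok"

# A well-formed call passes:
print(validate_tool_call('{"tool": "read_file", "args": {"path": "app.py"}}'))
# A drifting output with an invented argument is caught:
print(validate_tool_call('{"tool": "read_file", "args": {"path": "a", "mode": "fast"}}'))
```

A validator like this rejects bad calls before they hit your filesystem; with a dense model it should fire rarely, which is the whole point of the benchmark above.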

The Exact Setup

The setup @sudoingX posted — and the one we recommend for any OpenClaw user on a 24 GB GPU — is straightforward.

Hardware:

  • GPU: RTX 3090 (24 GB), RTX 3090 Ti, or RTX 4090. Used 3090s run $550-700 on eBay.
  • RAM: 32 GB system RAM recommended (16 GB minimum)
  • Storage: ~25 GB for the Q4 model file
  • CPU: Almost anything from the last 5 years. Inference runs on the GPU, so CPU speed barely matters.

Software stack:

  • Ollama 0.4+ (or llama.cpp directly if you prefer)
  • OpenClaw configured with local provider routing
  • Context window: 32K (fits comfortably in VRAM)

Quantization choice: Q4_K_M. This is the sweet spot — 16-18 GB for weights, leaving room for KV cache at 32K context. Q5 edges it out on benchmarks by 1-2% but pushes you over 24 GB with any serious context. Q3 loses too much code accuracy.
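A back-of-the-envelope check of that budget. The sketch below assumes Q4_K_M averages roughly 4.8 bits per weight and uses rough, assumed architecture figures for a 27B dense model (48 layers, 8 KV heads of dimension 128, fp16 KV cache) — real numbers vary by model, so treat the output as an estimate:

```python
# Rough VRAM budget for a dense 27B model at Q4. All figures below are
# assumptions for illustration, not published specs for Qwen 3.5 27B.
PARAMS = 27e9          # parameter count
BITS_PER_WEIGHT = 4.8  # Q4_K_M averages a bit above 4 bits/weight
LAYERS = 48            # assumed
KV_HEADS = 8           # assumed (grouped-query attention)
HEAD_DIM = 128         # assumed
KV_BYTES = 2           # fp16 cache

def vram_gb(context_tokens: int) -> float:
    """Estimate total VRAM (GB): quantized weights + KV cache."""
    weights = PARAMS * BITS_PER_WEIGHT / 8
    # K and V tensors, per layer, per token
    kv_cache = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES * context_tokens
    return (weights + kv_cache) / 1e9

for ctx in (16_384, 32_768, 65_536):
    print(f"{ctx:>6} ctx -> ~{vram_gb(ctx):.1f} GB")
```

Under these assumptions, 32K context lands in the low 20s of GB — tight but workable on a 24 GB card — while 64K overflows it, which is why the context-tuning advice in step 6 below matters.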

Install Qwen 3.5 27B on OpenClaw (6 Steps)

Here is the full walkthrough. If you already have Ollama installed, skip to step 3.

Step 1: Install Ollama.

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# verify
ollama --version

Step 2: Confirm the GPU is detected.

nvidia-smi
# should list your card and its available VRAM (use rocm-smi on AMD)

Note that ollama ps only lists models that are currently loaded, so it will be empty at this point. Once you run the model in step 4, its PROCESSOR column should read "100% GPU".

If Ollama reports CPU-only on a machine that has a GPU, check your CUDA drivers (NVIDIA) or ROCm (AMD) installation before proceeding. Inference without GPU on a 27B model is slow enough to be unusable.

Step 3: Pull the Qwen 3.5 27B Q4 model.

ollama pull qwen3.5:27b-instruct-q4_K_M

The download is roughly 17 GB. Grab coffee.

Step 4: Test it directly.

ollama run qwen3.5:27b-instruct-q4_K_M "Write a Python function that deduplicates a list while preserving order."

You should see 30-50 tokens per second on a 3090. If you see less than 15 tok/s, the model probably fell back to CPU. Check nvidia-smi during generation to confirm GPU utilization.
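You can also measure throughput precisely instead of eyeballing it. Ollama's /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), so tokens/sec is a one-line calculation. The request helper below assumes the default port 11434 and only runs if a local server is up:

```python
import json
import urllib.request

def tokens_per_second(resp: dict) -> float:
    """Decode throughput from an Ollama /api/generate response.
    eval_duration is reported in nanoseconds."""
    return resp["eval_count"] / resp["eval_duration"] * 1e9

def benchmark(prompt: str, model: str = "qwen3.5:27b-instruct-q4_K_M") -> float:
    # One non-streaming generation against the local Ollama server.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return tokens_per_second(json.load(r))

# The math, shown on a canned response: 420 tokens in 10 seconds.
sample = {"eval_count": 420, "eval_duration": 10_000_000_000}
print(tokens_per_second(sample))  # 42.0
```

If `benchmark("hello")` comes back under ~15 tok/s on a 3090, that is your CPU-fallback signal.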

Step 5: Point OpenClaw at the local model.

Edit your OpenClaw config:

# config.yaml
providers:
  primary:
    type: ollama
    endpoint: http://localhost:11434
    model: qwen3.5:27b-instruct-q4_K_M
    context_window: 32768
    temperature: 0.2

Restart OpenClaw and your next agent session will route through the local Qwen model. See the full OpenClaw install guide if you have not set up the runtime yet.

Step 6: Tune context window for your GPU.

If you have a 24 GB card and start hitting OOM errors on long sessions, drop context_window to 16384. If you have a 48 GB card (RTX A6000 or dual 3090s), bump it to 65536 and enjoy.

For deeper tuning on Qwen-specific quirks like prompt template handling and tool-call formatting, see the OpenClaw Qwen configuration guide.

The Cost Comparison That Actually Matters

This is where the benchmark stops being a nerd curiosity and starts being a business decision.

| Setup | Hardware | Ongoing Cost | 1-Year Total |
|---|---|---|---|
| Used RTX 3090 + Qwen 3.5 27B | $600 one-time | ~$5/mo electricity | $660 |
| H200 rig + GPT-oss 120B | $70,000+ | $400/mo power + cooling | $74,800 |
| Claude API (Sonnet 4.6, heavy use) | $0 | $200-500/mo | $2,400-6,000 |
| ChatGPT Plus subscription | $0 | $20/mo | $240 (with usage caps and vendor risk) |

The $70K H200 rig is not meant to compete with a used 3090 on price. It is meant to run frontier-scale models for research or multi-tenant inference. The point of the comparison is that for agent coding specifically, buying more hardware does not buy better results. You can outperform a rig that costs 100x more by picking the right model architecture.
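The table's arithmetic, plus the number that actually drives a purchase decision: how fast a used 3090 pays for itself in avoided API spend. Dollar figures are the article's estimates, not vendor quotes:

```python
def one_year_total(hardware: float, monthly: float) -> float:
    """Up-front hardware cost plus 12 months of running cost."""
    return hardware + 12 * monthly

def breakeven_months(hardware: float, monthly_saved: float) -> float:
    """Months until hardware cost is recouped by avoided API spend."""
    return hardware / monthly_saved

# Used 3090: $600 up front, ~$5/mo electricity
print(one_year_total(600, 5))  # 660.0

# Against a mid-range $300/mo Claude API bill, the net saving is $295/mo:
print(round(breakeven_months(600, 300 - 5), 1))  # 2.0
```

In other words, under these assumptions the card pays for itself in about two months; everything after that is savings.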

For full context on the rest of your OpenClaw budget — hosting, APIs, monitoring — see the complete costs guide.

Where Qwen 3.5 27B Still Loses

To be honest, Qwen is not a universal replacement. It loses to Claude Sonnet 4.6 on these workloads:

  • Long-horizon planning. Tasks spanning 20+ tool calls with delayed feedback. Claude’s chain-of-thought training shows up here.
  • Cross-file refactors over 2,000+ LOC. The 32K context limit starts to hurt on real codebases.
  • Ambiguous natural language specs. Claude is better at inferring intent from vague requirements.
  • Novel library APIs released after Qwen’s training cutoff. Any local model has a knowledge boundary.

The right mental model is: Qwen 3.5 27B is your daily driver. Claude is your escape hatch for the 10% of tasks that genuinely need frontier reasoning. OpenClaw’s provider routing makes this hybrid setup trivial to configure.
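A hybrid config might look like the following. Treat this as a sketch: the fallback block and its field names are assumptions extending the provider format shown in step 5, not documented OpenClaw syntax, so check the config reference for your version:

```yaml
# config.yaml — hypothetical hybrid routing (fallback field names assumed)
providers:
  primary:                  # daily driver: local, free per token
    type: ollama
    endpoint: http://localhost:11434
    model: qwen3.5:27b-instruct-q4_K_M
    context_window: 32768
    temperature: 0.2
  fallback:                 # escape hatch for frontier-reasoning tasks
    type: anthropic
    model: claude-sonnet-4.6
    api_key_env: ANTHROPIC_API_KEY
```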

What This Means for 2026

The story of local models in 2023-2024 was “pretty good, but you will want the cloud for real work.” The story in 2026 is inverting. For agent coding specifically — the single highest-value use case for most developers — a $600 used GPU running a well-chosen 27B dense model now matches or beats outputs from rigs costing 100x more.

The implications are big. Teams that were budgeting $500-2,000 per developer per month for API access can cut that by 80-90% without a meaningful quality drop on most work. Businesses paranoid about vendor bans and price hikes (see: everyone who got hit by the Anthropic OpenClaw integration ban) now have a credible exit path. And the hobbyist with a 3-year-old gaming GPU in a closet just got a world-class coding assistant for the cost of a long weekend.

Pick your hardware. Pull the model. Point OpenClaw at it. That is the whole transition.


Try this now: If you have an RTX 3090 or better sitting in a machine at home, run the 6-step install above tonight. Point OpenClaw at the local Qwen model for your next work session. Run your three most common agent tasks and see where it lands versus your current cloud setup. Most people are surprised by how little they miss the API.



Want help sizing hardware for your team’s agent workload? We spec self-hosted OpenClaw rigs for individual developers and multi-seat team deployments.

Book a Call


