
Open Research: 30-Day Local LLM Benchmark Across Every RAM Tier (2026)

Every RAM-tier guide says 'model X is best' based on public benchmarks. But MMLU scores don't tell you whether a model will successfully complete a 6-hour OpenClaw autonomous run without dropping a tool call. We ran 10 standardized agentic tasks on 14 models across 5 RAM tiers for 30 days. The results are messier and more interesting than the leaderboards suggest.

What this is: A 30-day public experiment, running from June 4 to July 4, 2026. We test 14 local models across five RAM tiers on 10 standardized agentic tasks. Results are updated weekly, and the methodology is fully reproducible. This page is a living document.

Why We’re Doing This

Public LLM leaderboards (MMLU, HumanEval, SWE-Bench) measure what a model knows in a controlled single-turn setting. They don’t measure whether a model will complete a 6-hour OpenClaw run that involves 200+ tool calls, branching context, and mid-task state management.

We wanted to know: what actually runs best for local agentic work? Not in theory. Not on benchmarks. On a Mac Studio in your home office, running overnight.

Andrej Karpathy has written about the value of running simple, reproducible experiments and publishing the raw results — not just the cleaned-up conclusion. This is us doing that for the local LLM space.

The experiment is public so you can replicate it. The methodology is fixed so we don’t cherry-pick results. The raw data is below.

Methodology

Hardware

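| Tier  | Machine                 | Chip     | Bandwidth |
|-------|-------------------------|----------|-----------|
| 32GB  | MacBook Pro M4 Pro 36GB | M4 Pro   | 273 GB/s  |
| 48GB  | Mac Studio M3 Max 48GB  | M3 Max   | 400 GB/s  |
| 64GB  | Mac Studio M2 Max 64GB  | M2 Max   | 400 GB/s  |
| 96GB  | Mac Studio M3 Ultra 96GB | M3 Ultra | 800 GB/s  |
| 128GB | Mac Studio M4 Max 128GB | M4 Max   | 546 GB/s  |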

Models Tested Per Tier

32GB: Qwen 3.6 27B Q6_K, Qwen 3.6 35B-A3B Q5_K_M, gpt-oss 20B Q8_0, Nemotron Cascade 2 30B Q5, Gemma 4 26B-A4B Q4

64GB: gpt-oss 120B Q4_K_M, Mistral Small 4 119B-A6B Q4, Llama 4 Scout Q4, DeepSeek V4 Flash Q4, Qwen 3.6 35B Q8

128GB: gpt-oss 120B Q6_K, Llama 4 Maverick Q4, Llama 4 Scout Q4, DeepSeek V4 Flash Q4

The 10 Benchmark Tasks

Every model runs the same 10 tasks, in the same order. Each run is timed, and its final output is scored for quality (see Metrics).

  1. File refactor — Refactor a 400-line TypeScript file to use a new interface. 15 tool calls expected.
  2. Test generation — Write unit tests for a 200-line Python module. 8 tool calls expected.
  3. Multi-file search — Find all usages of a deprecated function across a 50-file repo and report locations.
  4. Dependency update — Update all packages in a package.json, check for breaking changes, and patch 3 known issues.
  5. DB migration — Write a SQL migration for an e-commerce schema change, with rollback.
  6. API integration — Scaffold a REST API client for a given OpenAPI spec. 20+ files.
  7. Git log analysis — Summarize changes in the last 100 commits by author, module, and risk level.
  8. Documentation — Write API docs for a 10-function module with examples and edge cases.
  9. Bug hunt — Find and fix 5 deliberately introduced bugs in a 300-line Python codebase.
  10. Long context recall — Given a 50K-token codebase, answer 10 specific factual questions about architecture.

Metrics

  • Task success rate — Did the agent complete all steps without a tool-call failure or crash? (5 runs per task, averaged)
  • Tokens per second — Average generation speed during the run
  • Peak RAM — Actual memory pressure at peak (via vm_stat + ollama ps)
  • Quality score — GPT-4o judges the final output on a 1-10 scale for correctness and completeness
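As a rough illustration of how the first and last metrics could be aggregated, the sketch below averages five runs per task into a success rate and a mean quality score. The `RunResult` record, its fields, and the choice to average quality over all runs (including failed ones) are assumptions made for this example, not the schema or scoring rules used by the benchmark repo.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One run of one task (hypothetical record, not the repo's schema)."""
    task_id: int
    completed: bool  # True if every step finished with no tool-call failure or crash
    quality: float   # judge score on the 1-10 scale


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Average success within each task first, then average across tasks."""
    by_task: dict[int, list[RunResult]] = {}
    for run in runs:
        by_task.setdefault(run.task_id, []).append(run)

    per_task_success = [
        mean(1.0 if r.completed else 0.0 for r in task_runs)
        for task_runs in by_task.values()
    ]
    return {
        "success_rate": mean(per_task_success),
        # Assumption: quality is averaged over every run, failed or not.
        "quality": mean(r.quality for r in runs),
    }


# Made-up data: task 1 succeeds in 5 of 5 runs, task 2 in 3 of 5.
runs = [RunResult(1, True, 8.0) for _ in range(5)]
runs += [RunResult(2, i < 3, 7.0) for i in range(5)]
summary = summarize(runs)
print(f"{summary['success_rate']:.0%} success, quality {summary['quality']:.1f}/10")
```

Averaging per task before averaging across tasks keeps a task with extra runs from dominating the headline number.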

Evaluation Script

git clone https://github.com/openclawdc/llm-benchmark-2026
cd llm-benchmark-2026
pip install -r requirements.txt

# Run the 32GB tier benchmark
python run_benchmark.py --tier 32gb --model qwen3.6:27b-q6_K

# Results are saved to results/

Week 1 Results (June 4-11)

Week 1 established baselines across all tiers. Some surprises below.

32GB Tier — Week 1

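| Model                     | Task Success | Tok/s    | Peak RAM | Quality |
|---------------------------|--------------|----------|----------|---------|
| gpt-oss 20B Q8            | 92%          | 42 tok/s | 24.1 GB  | 7.8/10  |
| Qwen 3.6 27B Q6           | 88%          | 24 tok/s | 22.4 GB  | 8.4/10  |
| Qwen 3.6 35B-A3B Q5       | 85%          | 48 tok/s | 26.2 GB  | 8.1/10  |
| Gemma 4 26B-A4B Q4        | 79%          | 58 tok/s | 15.8 GB  | 7.2/10  |
| Nemotron Cascade 2 30B Q5 | 82%          | 35 tok/s | 23.8 GB  | 7.9/10  |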

Week 1 finding: gpt-oss 20B wins on task success rate (92%) despite a lower quality score (7.8 vs 8.4 for Qwen 3.6 27B). Because it fails fewer tool calls, it completes more tasks end-to-end. Qwen 3.6 27B produces better output when it succeeds. Based on Week 1, our suggested OpenClaw setup at 32GB is gpt-oss 20B for agent loops and Qwen 3.6 27B for chat.
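One way to see why a few points of success rate matter for agent loops: if we assume, as a simplification, that each task in a chain succeeds independently at the measured rate, the chance of finishing the whole chain is the rate raised to the chain length. The sketch below applies that to the Week 1 numbers. The independence assumption and the `chain_completion` helper are ours and are not part of the benchmark.

```python
def chain_completion(success_rate: float, chain_length: int) -> float:
    """Probability that every task in a chain succeeds, assuming independent tasks."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if chain_length < 0:
        raise ValueError("chain_length must be non-negative")
    return success_rate ** chain_length


# Week 1 task success rates at the 32GB tier, taken from the table above.
week1 = {"gpt-oss 20B Q8": 0.92, "Qwen 3.6 27B Q6": 0.88}

for model, rate in week1.items():
    chance = chain_completion(rate, 10)
    print(f"{model}: {chance:.1%} chance of 10 tasks in a row")
```

Under this simplified model, a four-point gap in per-task success grows to roughly 43% versus 28% over ten consecutive tasks. Real failures are unlikely to be fully independent, so treat this as an intuition aid rather than a prediction.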

64GB Tier — Week 1

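| Model              | Task Success | Tok/s    | Peak RAM | Quality |
|--------------------|--------------|----------|----------|---------|
| gpt-oss 120B Q4    | 94%          | 22 tok/s | 63.2 GB  | 8.7/10  |
| Llama 4 Scout Q4   | 87%          | 31 tok/s | 59.4 GB  | 8.5/10  |
| Mistral Small 4 Q4 | 83%          | 26 tok/s | 61.8 GB  | 8.2/10  |
| Qwen 3.6 35B Q8    | 89%          | 28 tok/s | 38.1 GB  | 8.3/10  |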

Week 1 finding: Llama 4 Scout surprised us. At 31 tok/s on 64GB hardware it is faster than gpt-oss 120B (22 tok/s) and still produces high-quality output (8.5/10). Task 10 (long context recall) is where Scout dominates: it is the only model that can hold 50K+ context comfortably at this tier.


Week 2 Results (June 11-18)

Week 2 focused on stress tests: 6-hour unattended runs and adversarial tool-call chains.

Key Week 2 Findings

Context drift at 32GB: After 3+ hours, gpt-oss 20B showed less context drift (0.8 incidents per hour) than Qwen 3.6 27B (1.4 per hour). At 6 hours, gpt-oss was still tracking the original task; Qwen 3.6 required a mid-session reset twice.

Llama 4 Scout long-context recall: Scout advertises a 10M-token context window, so we pushed Task 10 well past its usual 50K tokens with a 180K-token codebase dump on Scout at 64GB. It answered 9 of 10 questions correctly (vs 7/10 for gpt-oss 120B at the same tier). At this length, long-context recall is a real capability, not just a spec claim.

Gemma 4 at 32GB on short tasks: On Tasks 1-5 (shorter, more self-contained tasks), Gemma 4 26B-A4B outperformed Nemotron Cascade 2 30B on quality (8.1 vs 7.9) despite its much smaller footprint. For quick turnaround tasks, Gemma 4 is now our “fast second model” recommendation at 32GB.

Peak RAM surprises: Llama 4 Maverick at 128GB hit 107GB peak during Task 6 (API scaffolding, many files). We had to limit context to 8K to keep it stable. The 95GB “fits in 128GB” spec is tight: treat it as a bare fit, not a comfortable one.
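Peak figures like the 107GB above depend on how memory pressure is sampled. As a minimal sketch, assuming the usual `vm_stat` text layout (a page-size header followed by `Pages ...: N.` lines), the function below converts one sample into gigabytes in use. Counting active, wired, and compressor-occupied pages as "in use" is our assumption, as is the decimal-GB conversion; the benchmark's own accounting may differ.

```python
import re

PAGE_SIZE_RE = re.compile(r"page size of (\d+) bytes")
# Matches lines such as `Pages active:      123456.`
COUNTER_RE = re.compile(r'^"?([^":\n]+)"?:[ \t]+(\d+)\.?[ \t]*$', re.MULTILINE)

# Assumption: these three counters approximate memory "in use".
IN_USE_KEYS = ("Pages active", "Pages wired down", "Pages occupied by compressor")


def used_gb(vm_stat_output: str) -> float:
    """Convert one vm_stat sample into decimal gigabytes in use."""
    match = PAGE_SIZE_RE.search(vm_stat_output)
    if match is None:
        raise ValueError("no page size header found")
    page_size = int(match.group(1))

    counters = {
        name.strip(): int(value)
        for name, value in COUNTER_RE.findall(vm_stat_output)
    }
    pages = sum(counters.get(key, 0) for key in IN_USE_KEYS)
    return pages * page_size / 1e9


# Made-up sample for illustration; these are not measurements from the study.
SAMPLE = """\
Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                               50000.
Pages active:                           2000000.
Pages inactive:                          500000.
Pages wired down:                       1000000.
Pages occupied by compressor:            500000.
"""
print(f"{used_gb(SAMPLE):.1f} GB in use")
```

Calling a function like this on fresh `vm_stat` output at a fixed interval during a run, and keeping the maximum, gives one estimate of peak usage.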


Week 3 Results (June 18-23)

Week 3 tested the new models that arrived mid-experiment.

Live update — June 23: DeepSeek V4 Flash (via ds4 engine) completed its first full benchmark run at the 128GB tier. Results: 78% task success rate (lower than expected — tool-call format differs from Ollama's), 9.2/10 quality on coding tasks (highest in the study so far). Full results next update.

128GB Tier — Week 3

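| Model                | Task Success | Tok/s    | Peak RAM | Quality |
|----------------------|--------------|----------|----------|---------|
| gpt-oss 120B Q6      | 96%          | 18 tok/s | 94.1 GB  | 8.9/10  |
| Llama 4 Scout Q4     | 90%          | 33 tok/s | 59.1 GB  | 8.6/10  |
| Llama 4 Maverick Q4  | 81%          | 12 tok/s | 107 GB   | 9.1/10  |
| DeepSeek V4 Flash Q4 | 78%*         | 11 tok/s | 83.4 GB  | 9.2/10  |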

*DeepSeek V4 Flash used the ds4 engine; tool-call format incompatibilities with OpenClaw reduced success rate. Native Ollama support pending.

Week 3 finding: Llama 4 Maverick posts near-top quality scores on tasks where it succeeds (9.1/10, just behind DeepSeek V4 Flash at 9.2), but its 81% success rate and 12 tok/s speed make gpt-oss 120B (96%, 18 tok/s) the practical choice for autonomous loops. Maverick is best for high-value, one-shot tasks where quality matters more than speed.


What We’re Testing in Week 4 (June 23-30)

  • MLX vs Ollama head-to-head on 32GB and 64GB tiers
  • DeepSeek V4 Flash with OpenClaw-compatible tool wrapper (current 78% should improve)
  • 24-hour unattended runs on all 5 tiers to test stability
  • Community-submitted tasks (see below)

How to Replicate This Experiment

Everything you need is in the GitHub repo. The benchmark is designed to run on any Mac with 32GB+ unified memory.

git clone https://github.com/openclawdc/llm-benchmark-2026
cd llm-benchmark-2026
pip install -r requirements.txt

# Pull the models first
ollama pull qwen3.6:27b-q6_K
ollama pull gpt-oss:20b-q8_0

# Run the 32GB tier benchmark
python run_benchmark.py --tier 32gb --all-models

# Results written to results/32gb/
# Submit your results: open a PR against results/community/
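If you would rather benchmark one model at a time than use `--all-models`, a small driver can build the per-model commands from the `--tier` and `--model` flags shown earlier. This is a sketch, not part of the repo: the `build_commands` and `run_all` helpers and the dry-run default are our own additions, and the model tags are the two pulled above.

```python
import shlex
import subprocess
import sys


def build_commands(tier: str, models: list[str]) -> list[list[str]]:
    """Return one run_benchmark.py invocation per model tag."""
    return [
        [sys.executable, "run_benchmark.py", "--tier", tier, "--model", model]
        for model in models
    ]


def run_all(tier: str, models: list[str], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands(tier, models):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing model instead of continuing silently.
            subprocess.run(cmd, check=True)


# Dry run by default: prints the commands without executing them.
run_all("32gb", ["qwen3.6:27b-q6_K", "gpt-oss:20b-q8_0"])
```

Pass `dry_run=False` from inside the cloned repo directory to actually execute the runs.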

We accept community result submissions. If you run on different hardware (RTX 3090, M3 Pro, etc.), open a PR with your results/ directory and we’ll include it.


Why Open Research Matters for Local LLMs

The major AI benchmarks are run by organizations with reasons to present their models favorably. Independent, reproducible testing with consistent methodology is rare.

This experiment is not perfectly objective. We run OpenClaw at OpenClaw DC, so our task selection favors agentic workflows. We note this. But the methodology is fixed, the data is raw, and you can check it yourself.

If you find an error in our methodology, open an issue. If your results contradict ours on the same hardware, we want to know.

