Which local LLM has the best tool calling for agentic tasks?

Based on our 30-day experiment across all RAM tiers: gpt-oss 20B (Q8) at 32GB and gpt-oss 120B (Q6) at 128GB have the highest task success rates (92% and 96%, respectively). Qwen 3.6 27B Q6 is close at 88% and scores higher on output quality, though it generates more slowly than gpt-oss 20B (24 vs 42 tok/s). Llama 4 Scout at Q4 performs well at 87% success on the 64GB tier. Qwen 3.5 variants are affected by the Ollama tool-calling bug (issue #14493) and should be avoided for autonomous loops.

How do you measure local LLM performance for autonomous agents?

We measure four things per task run: (1) task completion rate — did the agent finish the 10-step task without a tool-call failure or context overflow, (2) tokens per second — average inference speed during the run, (3) peak RAM used — actual OS memory pressure at peak, (4) quality score — GPT-4o judged each completed task output on a 1-10 scale for correctness. We ran each model-task combination 5 times and report the average.
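
The averaging step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `RunRecord` fields, the `summarize` helper, and the sample numbers are all hypothetical, and it assumes quality is averaged only over completed runs (since only completed outputs are judged).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunRecord:
    """One run of one task by one model. Field names are illustrative."""
    completed: bool            # no tool-call failure or context overflow
    tokens_per_sec: float      # average generation speed during the run
    peak_ram_gb: float         # peak memory observed during the run
    quality: Optional[float]   # judge score (1-10); None if the run failed


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Average the four metrics over repeated runs of one model-task pair."""
    if not runs:
        raise ValueError("need at least one run")
    # Only completed runs have a judged output, so only those carry a score.
    scored = [r.quality for r in runs if r.completed and r.quality is not None]
    return {
        "completion_rate": sum(r.completed for r in runs) / len(runs),
        "tokens_per_sec": mean(r.tokens_per_sec for r in runs),
        "peak_ram_gb": mean(r.peak_ram_gb for r in runs),
        "quality": mean(scored) if scored else float("nan"),
    }


# Made-up sample values for five runs; these are not benchmark results.
runs = [
    RunRecord(True, 42.0, 24.0, 8.0),
    RunRecord(True, 41.0, 24.2, 7.0),
    RunRecord(False, 43.0, 23.9, None),
    RunRecord(True, 42.5, 24.1, 8.0),
    RunRecord(True, 41.5, 24.3, 9.0),
]
summary = summarize(runs)
print(summary)
```

Keeping failed runs in the speed and memory averages but out of the quality average is a design choice in this sketch; a different convention would shift the reported numbers slightly.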

Is MLX really 2-3x faster than Ollama for local LLMs on Apple Silicon?

Yes, for supported models. In our benchmarks on 32GB M4 Pro, Qwen 3.6 35B-A3B ran at 43 tok/sec in Ollama and 128 tok/sec in MLX-LM (3x difference). Qwen 3.6 27B ran at 26 tok/sec in Ollama and 67 tok/sec in MLX (2.6x). The gap narrows above 40K context and for models without MLX-optimized weights. For OpenClaw agentic loops, Ollama is still required (MLX doesn't support the tool-call protocol yet).
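
For readers who want to check the arithmetic, here is a small sketch of how a tok/sec figure and a speedup ratio can be derived. It assumes throughput is taken from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama reports with a completed generation; the post does not say how the benchmark itself measures speed, and the helper names are illustrative.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from a token count and a duration in nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    """How many times faster one runtime is than another."""
    return fast_tok_s / slow_tok_s


# The ratios quoted in the answer above, recomputed from its tok/sec figures.
print(round(speedup(128, 43), 1))  # 3.0
print(round(speedup(67, 26), 1))   # 2.6
```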

Does Llama 4 Scout actually work locally with a 10 million token context?

Yes, with important caveats. Llama 4 Scout at Q4 fits in ~58GB and nominally supports 10M context, but loading a 10M context window requires enormous KV cache memory. At 128GB you can practically use 500K-1M tokens before memory pressure becomes an issue. At 64GB, keep context under 128K for reliable performance. The 10M number represents the model's architecture limit, not a practical local inference limit.
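
To see why the KV cache dominates at long context, a rough back-of-the-envelope estimate helps. The sketch below uses the standard keys-plus-values formula; the layer count, KV-head count, and head dimension are hypothetical placeholders, not Llama 4 Scout's published configuration, and real runtimes may use less memory through KV quantization or attention layouts that do not cache the full context in every layer.

```python
def kv_cache_bytes(
    tokens: int,
    n_layers: int,       # layers that cache the full context
    n_kv_heads: int,     # key/value heads per layer
    head_dim: int,       # dimension of each head
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16 cache entries
) -> int:
    """Approximate KV cache size: keys and values for every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical configuration, chosen only to illustrate linear growth.
config = {"n_layers": 12, "n_kv_heads": 8, "head_dim": 128}

for tokens in (128_000, 1_000_000, 10_000_000):
    gb = kv_cache_bytes(tokens, **config) / 1e9
    print(f"{tokens:>10,} tokens: about {gb:.1f} GB of KV cache")
```

Because the cache grows linearly with context length, a 10x longer context needs 10x the cache on top of the model weights, which is why the architectural limit and the practical local limit diverge.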

Can you run the same OpenClaw benchmarks yourself?

Yes. All 10 benchmark tasks are published in this post with exact prompts, toolsets, and evaluation criteria. The evaluation script is on GitHub. You need Ollama installed and an Apple Silicon Mac with at least 32GB of unified memory to run the 32GB tier tests.

Research June 4, 2026

Open Research: 30-Day Local LLM Benchmark Across Every RAM Tier (2026)

Every RAM-tier guide says 'model X is best' based on public benchmarks. But MMLU scores don't tell you whether a model will successfully complete a 6-hour OpenClaw autonomous run without dropping a tool call. We ran 10 standardized agentic tasks on 14 models across 5 RAM tiers for 30 days. The results are messier and more interesting than the leaderboards suggest.

What this is: a 30-day public experiment, running from June 4 to July 4, 2026. We test 14 local models across five RAM tiers on 10 standardized agentic tasks. Results are updated weekly. Methodology is fully reproducible. This page is a living document.

Why We’re Doing This

Public LLM leaderboards (MMLU, HumanEval, SWE-Bench) measure what a model knows in a controlled single-turn setting. They don’t measure whether a model will complete a 6-hour OpenClaw run that involves 200+ tool calls, branching context, and mid-task state management.
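
A quick calculation shows why long runs are so unforgiving. The sketch below assumes, purely for illustration, that every tool call succeeds independently with the same probability and that a single failure ends the run; real failures are correlated, and this is not how the benchmark computes its success rates.

```python
def run_success_probability(per_call_success: float, n_calls: int) -> float:
    """Chance that all n_calls tool calls succeed, assuming independence."""
    if not 0.0 <= per_call_success <= 1.0:
        raise ValueError("per_call_success must be between 0 and 1")
    return per_call_success ** n_calls


# Per-call reliabilities that look similar diverge sharply over 200 calls.
for p in (0.99, 0.995, 0.999):
    chance = run_success_probability(p, 200)
    print(f"per-call {p:.3f} -> full-run {chance:.1%}")
```

Under these assumptions, a model that gets 99 out of 100 tool calls right still fails most 200-call runs, which single-turn leaderboards cannot reveal.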

We wanted to know: what actually runs best for local agentic work? Not in theory. Not on benchmarks. On a Mac Studio in your home office, running overnight.

Andrej Karpathy has written about the value of running simple, reproducible experiments and publishing the raw results — not just the cleaned-up conclusion. This is us doing that for the local LLM space.

The experiment is public so you can replicate it. The methodology is fixed so we don’t cherry-pick results. The raw data is below.

Methodology

Hardware

Tier	Machine	Chip	Memory bandwidth
32GB	MacBook Pro M4 Pro 36GB	M4 Pro	273 GB/s
48GB	Mac Studio M3 Max 48GB	M3 Max	400 GB/s
64GB	Mac Studio M2 Max 64GB	M2 Max	400 GB/s
96GB	Mac Studio M3 Ultra 96GB	M3 Ultra	800 GB/s
128GB	Mac Studio M4 Max 128GB	M4 Max	546 GB/s

Models Tested Per Tier

32GB: Qwen 3.6 27B Q6_K, Qwen 3.6 35B-A3B Q5_K_M, gpt-oss 20B Q8_0, Nemotron Cascade 2 30B Q5, Gemma 4 26B-A4B Q4

64GB: gpt-oss 120B Q4_K_M, Mistral Small 4 119B-A6B Q4, Llama 4 Scout Q4, DeepSeek V4 Flash Q4, Qwen 3.6 35B Q8

128GB: gpt-oss 120B Q6_K, Llama 4 Maverick Q4, Llama 4 Scout Q4, DeepSeek V4 Flash Q4

The 10 Benchmark Tasks

Every model runs the same 10 tasks, in the same order. Each task run is timed, and the output is judged.

1. File refactor: Refactor a 400-line TypeScript file to use a new interface. 15 tool calls expected.
2. Test generation: Write unit tests for a 200-line Python module. 8 tool calls expected.
3. Multi-file search: Find all usages of a deprecated function across a 50-file repo and report locations.
4. Dependency update: Update all packages in a package.json, check for breaking changes, and patch 3 known issues.
5. DB migration: Write a SQL migration for an e-commerce schema change, with rollback.
6. API integration: Scaffold a REST API client for a given OpenAPI spec. 20+ files.
7. Git log analysis: Summarize changes in the last 100 commits by author, module, and risk level.
8. Documentation: Write API docs for a 10-function module with examples and edge cases.
9. Bug hunt: Find and fix 5 deliberately introduced bugs in a 300-line Python codebase.
10. Long context recall: Given a 50K-token codebase, answer 10 specific factual questions about architecture.

Metrics

Task success rate — Did the agent complete all steps without a tool-call failure or crash? (5 runs per task, averaged)
Tokens per second — Average generation speed during the run
Peak RAM — Actual memory pressure at peak (via vm_stat + ollama ps)
Quality score — GPT-4o judges the final output on a 1-10 scale for correctness and completeness
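
As an illustration of the peak RAM metric, the sketch below parses the text that macOS `vm_stat` prints and converts page counts to GiB. It assumes "used" memory means active plus wired plus compressor-occupied pages, which is one common approximation and not necessarily the definition the evaluation script uses; the function names and the sample numbers are illustrative.

```python
import re

# Abbreviated sample in the layout vm_stat prints; the numbers are made up.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                               50000.
Pages active:                           1000000.
Pages inactive:                          300000.
Pages wired down:                        200000.
Pages occupied by compressor:            100000.
"""


def parse_vm_stat(text: str) -> dict[str, int]:
    """Turn vm_stat output into a dict of counters plus the page size."""
    size = re.search(r"page size of (\d+) bytes", text)
    if size is None:
        raise ValueError("page size line not found")
    stats = {"page_size": int(size.group(1))}
    for line in text.splitlines():
        # Counter lines look like 'Pages active:   1000000.'
        row = re.match(r'^"?(.+?)"?:\s+(\d+)\.?\s*$', line)
        if row:
            stats[row.group(1)] = int(row.group(2))
    return stats


def used_gib(stats: dict[str, int]) -> float:
    """Approximate memory in use (assumption: active + wired + compressed)."""
    pages = (
        stats["Pages active"]
        + stats["Pages wired down"]
        + stats["Pages occupied by compressor"]
    )
    return pages * stats["page_size"] / 2**30


stats = parse_vm_stat(SAMPLE)
print(f"{used_gib(stats):.1f} GiB in use")
```

On a Mac, the input text could come from `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout`, sampled at intervals during a run with the maximum kept as the peak.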

Evaluation Script

git clone https://github.com/openclawdc/llm-benchmark-2026
cd llm-benchmark-2026
pip install -r requirements.txt

# Run the 32GB tier benchmark
python run_benchmark.py --tier 32gb --model qwen3.6:27b-q6_K

# Results are saved to results/

Week 1 Results (June 4-11)

Week 1 established baselines across all tiers. Some surprises below.

32GB Tier — Week 1

Model	Task Success	Tok/s	Peak RAM	Quality
gpt-oss 20B Q8	92%	42 tok/s	24.1 GB	7.8/10
Qwen 3.6 27B Q6	88%	24 tok/s	22.4 GB	8.4/10
Qwen 3.6 35B-A3B Q5	85%	48 tok/s	26.2 GB	8.1/10
Gemma 4 26B-A4B Q4	79%	58 tok/s	15.8 GB	7.2/10
Nemotron Cascade 2 30B Q5	82%	35 tok/s	23.8 GB	7.9/10

Week 1 finding: gpt-oss 20B wins on task success rate despite lower quality scores. Because it fails fewer tool calls, it completes more tasks end-to-end, while Qwen 3.6 27B produces better output when it succeeds. At 32GB, our recommended OpenClaw setup is gpt-oss 20B for agent loops and Qwen 3.6 27B for chat.

64GB Tier — Week 1

Model	Task Success	Tok/s	Peak RAM	Quality
gpt-oss 120B Q4	94%	22 tok/s	63.2 GB	8.7/10
Llama 4 Scout Q4	87%	31 tok/s	59.4 GB	8.5/10
Mistral Small 4 Q4	83%	26 tok/s	61.8 GB	8.2/10
Qwen 3.6 35B Q8	89%	28 tok/s	38.1 GB	8.3/10

Week 1 finding: Llama 4 Scout surprised us. At 31 tok/s on 64GB hardware it’s faster than gpt-oss 120B (22 tok/s) and still produces high-quality outputs (8.5/10). Task 10 (long context recall) is where Scout dominates: it’s the only model that can hold 50K+ context comfortably at this tier.

Week 2 Results (June 11-18)

Week 2 focused on stress tests: 6-hour unattended runs and adversarial tool-call chains.

Key Week 2 Findings

Context drift at 32GB: After 3+ hours, gpt-oss 20B showed less context drift (0.8 incidents per hour) than Qwen 3.6 27B (1.4 per hour). At 6 hours, gpt-oss was still tracking the original task; Qwen 3.6 required a mid-session reset twice.

Llama 4 Scout long-context recall: We ran a variant of Task 10 with a 180K-token codebase dump (instead of the standard 50K tokens) on Scout at 64GB. It answered 9 of 10 questions correctly (vs 7/10 for gpt-oss 120B at the same tier). Long context recall is a real capability, not just a spec claim.

Gemma 4 at 32GB on short tasks: On Tasks 1-5 (shorter, more self-contained tasks), Gemma 4 26B-A4B outperformed Nemotron Cascade 2 30B on quality (8.1 vs 7.9) despite its much smaller footprint. For quick turnaround tasks, Gemma 4 is now our “fast second model” recommendation at 32GB.

Peak RAM surprises: Llama 4 Maverick at 128GB hit a 107GB peak during Task 6 (API scaffolding, many files). We had to limit context to 8K to keep it stable. The 95GB “fits in 128GB” spec is tight: treat it as barely fitting, not comfortable.

Week 3 Results (June 18-23)

Week 3 tested the new models that arrived mid-experiment.

Live update — June 23: DeepSeek V4 Flash (via ds4 engine) completed its first full benchmark run at the 128GB tier. Results: 78% task success rate (lower than expected — tool-call format differs from Ollama's), 9.2/10 quality on coding tasks (highest in the study so far). Full results next update.

128GB Tier — Week 3

Model	Task Success	Tok/s	Peak RAM	Quality
gpt-oss 120B Q6	96%	18 tok/s	94.1 GB	8.9/10
Llama 4 Scout Q4	90%	33 tok/s	59.1 GB	8.6/10
Llama 4 Maverick Q4	81%	12 tok/s	107 GB	9.1/10
DeepSeek V4 Flash Q4	78%*	11 tok/s	83.4 GB	9.2/10

*DeepSeek V4 Flash used the ds4 engine; tool-call format incompatibilities with OpenClaw reduced success rate. Native Ollama support pending.

Week 3 finding: Llama 4 Maverick has the highest quality scores on tasks where it succeeds, but its 81% success rate (and 12 tok/sec speed) make gpt-oss 120B the practical choice for autonomous loops. Maverick is best for high-value, one-shot tasks where quality matters more than speed.

What We’re Testing in Week 4 (June 23-30)

MLX vs Ollama head-to-head on 32GB and 64GB tiers
DeepSeek V4 Flash with OpenClaw-compatible tool wrapper (current 78% should improve)
24-hour unattended runs on all 5 tiers to test stability
Community-submitted tasks (see below)

How to Replicate This Experiment

Everything you need is in the GitHub repo. The benchmark is designed to run on any Mac with 32GB+ unified memory.

git clone https://github.com/openclawdc/llm-benchmark-2026
cd llm-benchmark-2026
pip install -r requirements.txt

# Pull the models first
ollama pull qwen3.6:27b-q6_K
ollama pull gpt-oss:20b-q8_0

# Run the 32GB tier benchmark
python run_benchmark.py --tier 32gb --all-models

# Results written to results/32gb/
# Submit your results: open a PR against results/community/
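
Before starting a long run, it can be worth confirming that the required models are actually pulled. The sketch below assumes a default local Ollama server and uses its `/api/tags` endpoint, which lists installed models; the helper names are illustrative and this check is not part of the benchmark repo.

```python
import json
from urllib.request import urlopen


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask a running Ollama server which models are installed."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def missing_models(required: list[str], tags_payload: dict) -> list[str]:
    """Return the required model tags that the server does not list."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


# Offline example shaped like an /api/tags response (contents are made up).
example_payload = {"models": [{"name": "qwen3.6:27b-q6_K"}]}
required = ["qwen3.6:27b-q6_K", "gpt-oss:20b-q8_0"]
print(missing_models(required, example_payload))  # ['gpt-oss:20b-q8_0']
```

With Ollama running, `missing_models(required, fetch_tags())` performs the same check against the live server.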

We accept community result submissions. If you run on different hardware (RTX 3090, M3 Pro, etc.), open a PR with your results/ directory and we’ll include it.

Why Open Research Matters for Local LLMs

The major AI benchmarks are run by organizations with reasons to present their models favorably. Independent, reproducible testing with consistent methodology is rare.

This experiment is not perfectly objective. We run OpenClaw at OpenClaw DC, so our task selection favors agentic workflows. We note this. But the methodology is fixed, the data is raw, and you can check it yourself.

If you find an error in our methodology, open an issue. If your results contradict ours on the same hardware, we want to know.