· OPENCLAW DC ·

VOL. 02 · ISS. 177 — JUN 2026

Hardware / June 26, 2026

Why Is My Local LLM So Slow? 9 Fixes for Ollama and OpenClaw

Short answer: local LLMs are usually slow because the model is too large for your VRAM/RAM, the context window is too long, the runtime has fallen back to CPU, or OpenClaw is spending time in tool calls rather than token generation. Start by checking whether the model fits, then shorten the context and try a lower-bit quantization before buying hardware.

Filed by OpenClaw DC Editorial


Direct answer

A local LLM is slow when the model is too large for the hardware path it is actually using. The model might technically load, but if it spills from VRAM to system RAM, from RAM to swap, or from GPU to CPU, every token gets more expensive.

For OpenClaw, there is one extra trap: you are not only measuring token speed. You are measuring model inference plus tool calls, file access, browser work, context management, and retries. A model can feel acceptable in a chat window and still feel painful as an agent.

Use this order:

Confirm the model fits your RAM or VRAM.
Confirm it is using the accelerator you think it is using.
Reduce context before changing hardware.
Use a smaller or more quantized model.
Separate model slowness from tool-loop slowness.

60-second diagnosis

Run these checks before changing anything:

# See what Ollama has loaded
ollama ps

# Watch CPU, memory, and swap pressure while the model runs
top

# On macOS, check memory pressure
vm_stat

# On NVIDIA, check whether VRAM is full
nvidia-smi

If CPU is pegged, memory pressure is high, swap is active, or VRAM is full, the problem is probably hardware fit. If resources look fine but OpenClaw still feels slow, the bottleneck is likely context length, tool calls, or model behavior.
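A small filter over `ollama ps` can make the CPU-versus-GPU split explicit. This is a sketch, not an official tool: it assumes the PROCESSOR column reads `100% GPU`, `100% CPU`, or a split such as `48%/52% CPU/GPU`, which is what recent Ollama releases print, and `classify_placement` is a hypothetical helper name.

```shell
# Classify each loaded model by where Ollama placed it.
# Reads `ollama ps` output on stdin and prints "<model> <placement>".
# Assumption: the PROCESSOR column uses the "100% GPU" / "CPU/GPU" wording.
classify_placement() {
  awk 'NR > 1 && NF {
    if ($0 ~ /100% GPU/)      where = "gpu"
    else if ($0 ~ /100% CPU/) where = "cpu"
    else if ($0 ~ /CPU\/GPU/) where = "split"
    else                      where = "unknown"
    print $1, where
  }'
}

# Usage (while a model is loaded):
#   ollama ps | classify_placement
```

A result of `split` or `cpu` on a machine with a GPU suggests the model plus its context does not fit in VRAM.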

The 9 most common causes

Cause	What it feels like	First fix
Model barely fits	Loads, then crawls	Use a smaller model or lower quant
CPU fallback	High CPU, low GPU use	Pick a model that fits VRAM
Swap pressure	System freezes or beachballs	Close apps, reduce context, use smaller model
Context too long	Gets slower later in the chat	Start a new session or compact
Quant too heavy	Quality is good but tokens/sec is poor	Try Q4_K_M or Q5_K_M
Slow disk	Long model load times	Move models to SSD
Thermal throttling	Fast at first, slow after minutes	Improve cooling or lower load
Tool-call latency	OpenClaw pauses between steps	Use a tool-reliable model and narrow the task
Wrong model for agent work	Retries, malformed tools, wandering	Use an OpenClaw-tested model

Fix 1: Use a model that actually fits

The fastest local LLM is often not the largest model you can barely load. It is the largest model that fits with room left for the operating system, context cache, browser, editor, Docker, and OpenClaw traces.

Good starting points:

16GB RAM: small local models only, or use cloud for agent work.
24GB memory or VRAM: useful entry point for serious local AI.
32GB RAM: good for 20B to 32B class models at practical quants.
64GB RAM: comfortable for daily OpenClaw use and larger context.
96GB to 128GB RAM: power-user tier for larger models and long local runs.
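To sanity-check a tier against a specific model, a rough weight estimate is parameters times bits per weight divided by 8. The sketch below uses hypothetical helper names (`weights_gb`, `fits_in`), and the bits-per-weight figure and headroom value are assumptions you should adjust for your own quant and workload.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Rough weight footprint in GB: params * bits / 8.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# fits_in MEMORY_GB WEIGHTS_GB HEADROOM_GB
# Succeeds when weights plus headroom (OS, context cache, apps) fit in memory.
fits_in() {
  awk -v m="$1" -v w="$2" -v h="$3" 'BEGIN { exit !(w + h <= m) }'
}

# Example: a 32B model at roughly 4.8 bits per weight (an approximate
# figure for Q4_K_M), with 8 GB reserved for everything else.
#   w=$(weights_gb 32 4.8)
#   fits_in 32 "$w" 8 && echo "fits" || echo "too tight"
```

The headroom argument is the important part: a model that only fits with zero headroom is the "barely loads, then crawls" case.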


Fix 2: Reduce context length

Long context is one of the easiest ways to make a local model feel slow. Every token in the context adds keys and values to the KV cache, so memory use grows roughly linearly with context length and each new token has more state to attend over. If you are right on the edge of your hardware limit, a large context window can push the run into swap or CPU fallback.

For OpenClaw, start smaller:

Use a fresh session for a new task.
Keep the task scope narrow.
Compact or summarize long conversations.
Avoid loading huge files unless the task needs them.
Ask the agent to inspect only the relevant directory first.

If the first few replies are fast but later replies are slow, context growth is probably involved.
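The memory cost of context can be estimated from the model's architecture. The sketch below uses the standard KV cache size formula (2 for keys and values, times layers, KV heads, head dimension, context tokens, and bytes per value). `kv_cache_gib` is a hypothetical helper, and the example numbers (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions that match a typical 8B-class model with grouped-query attention; check your model's own config.

```shell
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM CONTEXT_TOKENS [BYTES_PER_VALUE]
# KV cache size in GiB: 2 * layers * kv_heads * head_dim * tokens * bytes.
# BYTES_PER_VALUE defaults to 2 (16-bit cache), which is an assumption.
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="${5:-2}" \
    'BEGIN { printf "%.2f\n", 2 * l * h * d * c * b / 1073741824 }'
}

# Example: the same model at 8K and 32K context.
#   kv_cache_gib 32 8 128 8192
#   kv_cache_gib 32 8 128 32768
```

Quadrupling the context quadruples the cache, which is why a model that is comfortable at a short context can tip into swap at a long one. In Ollama the context size is controlled by the `num_ctx` parameter.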

Fix 3: Lower the quantization before changing machines

If you are running Q8 and the machine is struggling, try Q5 or Q4 before buying new hardware. In many local-agent workflows, Q4_K_M or Q5_K_M is a better practical choice than Q8 because it leaves enough room for context and tools.

Avoid going too low for OpenClaw. Very low-bit quants save memory but can damage tool-calling reliability. A fast model that emits malformed tool calls is not fast in practice because the agent has to retry.
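The trade-off above can be sketched as a simple selection rule: take the highest-precision quant whose weights fit a memory budget, and stop at Q4_K_M rather than going lower. `pick_quant` is a hypothetical helper, and the bits-per-weight values (8.5, 5.7, 4.8) are approximate assumptions, not exact file sizes.

```shell
# pick_quant PARAMS_BILLIONS WEIGHT_BUDGET_GB
# Prints the first quant (highest precision first) whose estimated weights
# fit the budget. Deliberately stops at Q4_K_M instead of trying lower quants.
pick_quant() {
  local entry name bpw
  for entry in Q8_0:8.5 Q5_K_M:5.7 Q4_K_M:4.8; do
    name=${entry%%:*}   # text before the colon
    bpw=${entry##*:}    # approximate bits per weight (assumption)
    if awk -v p="$1" -v b="$bpw" -v m="$2" 'BEGIN { exit !(p * b / 8 <= m) }'; then
      echo "$name"
      return 0
    fi
  done
  echo "none"
  return 1
}

# Example: a 32B model with 24 GB available for weights.
#   pick_quant 32 24
```

When the result is `none`, the practical answer is usually a smaller model rather than a lower quant.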

Fix 4: Make sure the GPU is being used

On NVIDIA systems, check nvidia-smi while the model is running. If VRAM is empty and CPU is pinned, the model is not using the GPU path you expected.
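For a more compact reading than the full `nvidia-smi` table, its CSV query mode can be piped through a short formatter. This is a minimal sketch: `vram_report` is a hypothetical helper, and it assumes input lines of the form `used, total` in MiB, which is what the query in the usage comment produces.

```shell
# vram_report: reads "used, total" MiB lines (one per GPU) on stdin
# and prints usage with a percentage.
vram_report() {
  awk -F', *' 'NF >= 2 && $2 > 0 {
    printf "GPU %d: %d/%d MiB (%.0f%%)\n", NR - 1, $1, $2, 100 * $1 / $2
  }'
}

# Usage (run while the model is generating):
#   nvidia-smi --query-gpu=memory.used,memory.total \
#              --format=csv,noheader,nounits | vram_report
```

Usage near 0% during generation points to CPU fallback; usage near 100% points to a model that is about to spill.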

On Apple Silicon, watch memory pressure. Unified memory is shared by the model, OS, browser, and apps. A 32GB Mac can be good for local models, but it can still slow down if you keep heavy apps open and run long OpenClaw sessions.


Fix 5: Separate model speed from OpenClaw speed

Test the same model in a plain chat prompt and in OpenClaw. If plain chat is fast but OpenClaw is slow, the issue may be:

Tool-call planning.
File reads.
Browser automation.
Large project context.
A model that is weak at structured tool output.
Repeated failed tool calls.

For OpenClaw, tool reliability matters more than benchmark score. A smaller model with clean tool behavior can beat a larger model that wanders or retries.
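To put numbers on the comparison, measure raw generation speed first and then see how much of an agent step it accounts for. The sketch below relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's generate API returns; the helper names are hypothetical, and the wall-clock time of an OpenClaw step is something you would measure yourself.

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Generation speed from Ollama's eval_count and eval_duration fields.
tokens_per_sec() {
  awk -v n="$1" -v d="$2" 'BEGIN { if (d <= 0) exit 1; printf "%.1f\n", n * 1e9 / d }'
}

# generation_share_pct GENERATION_SECONDS WALL_CLOCK_SECONDS
# Percentage of an agent step that was spent generating tokens.
generation_share_pct() {
  awk -v g="$1" -v w="$2" 'BEGIN { if (w <= 0) exit 1; printf "%.0f\n", 100 * g / w }'
}

# Where the inputs come from:
#   ollama run <model> --verbose      # prints eval count, duration, and rate
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"<model>","prompt":"Say hello","stream":false}'
```

If generation is only a small share of the step's wall-clock time, a faster model will not help much; the time is going to tools, file reads, or retries.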


Fix 6: Close memory-heavy apps

This sounds basic, but it matters. Browsers, Electron apps, Docker Desktop, IDEs, local databases, and screen recorders can steal the memory headroom that makes a model stable.

If you are on 32GB or 64GB RAM, try one clean benchmark:

Quit browser tabs you do not need.
Stop Docker containers you are not using.
Start a new terminal session.
Load the model.
Run the same prompt again.

If performance improves, your model was competing with the rest of the workstation.

Fix 7: Use the calculator before upgrading

Buying more RAM helps when memory capacity is the bottleneck. Buying a faster GPU helps when compute or memory bandwidth is the bottleneck. Buying either one blindly can waste money.
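A common rule of thumb helps tell the two cases apart: for a dense model, generating each token reads roughly all of the weights once, so memory bandwidth divided by weight size gives a rough ceiling on tokens per second. The sketch below is only that back-of-the-envelope estimate; `max_tokens_per_sec` is a hypothetical helper and the bandwidth figures are illustrative, not measurements of any specific hardware.

```shell
# max_tokens_per_sec BANDWIDTH_GB_PER_S WEIGHTS_GB
# Rough upper bound on decode speed for a dense model:
# each token streams the weights through memory once.
max_tokens_per_sec() {
  awk -v bw="$1" -v sz="$2" 'BEGIN { if (sz <= 0) exit 1; printf "%.1f\n", bw / sz }'
}

# Example: 19.2 GB of weights on a 100 GB/s memory system
# versus a 1000 GB/s one (illustrative bandwidths).
#   max_tokens_per_sec 100 19.2
#   max_tokens_per_sec 1000 19.2
```

If your measured speed is already close to this ceiling, more RAM will not make tokens arrive faster; only higher bandwidth or a smaller model will. If it is far below, look for spilling, swap, or CPU fallback first.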


Quick FAQ

Why is my local LLM slow?

The most common reasons are that the model does not fit cleanly in VRAM or RAM, the context window is too large, the runtime is using CPU fallback, the quantization is too heavy, or the agent is spending time in tool calls rather than token generation.

How do I make Ollama faster?

Use a model that fits your hardware, prefer GPU or Apple Silicon acceleration, lower the context size, choose a smaller or more quantized model, close memory-heavy apps, and verify that Ollama is not spilling work to CPU or swap.

Why is OpenClaw slow with a local model?

OpenClaw adds tool calls, browser automation, file reads, and long conversation state on top of normal model inference. A model that feels fine in chat can feel slow as an agent if it barely fits in memory or if tool-calling reliability is weak.
