Why Is My Local LLM So Slow? 9 Fixes for Ollama and OpenClaw
Short answer: local LLMs are usually slow because the model is too large for your VRAM/RAM, the context window is too long, the runtime has fallen back to CPU, or OpenClaw is spending time in tool calls rather than token generation. Start by checking whether the model fits, then reduce context and try a lower-bit quantization before buying hardware.
Check whether your hardware is the bottleneck
Open the local model calculator or start with Can my computer run a local LLM?
Direct answer
A local LLM is slow when the model is too large for the hardware path it is actually using. The model might technically load, but if it spills from VRAM to system RAM, from RAM to swap, or from GPU to CPU, every token gets more expensive.
For OpenClaw, there is one extra trap: you are not only measuring token speed. You are measuring model inference plus tool calls, file access, browser work, context management, and retries. A model can feel acceptable in a chat window and still feel painful as an agent.
Use this order:
- Confirm the model fits your RAM or VRAM.
- Confirm it is using the accelerator you think it is using.
- Reduce context before changing hardware.
- Use a smaller model or a lower-bit quantization.
- Separate model slowness from tool-loop slowness.
60-second diagnosis
Run these checks before changing anything:
```
# See what Ollama has loaded
ollama ps

# Watch CPU, memory, and swap pressure while the model runs
top

# On macOS, check memory pressure
vm_stat

# On NVIDIA, check whether VRAM is full
nvidia-smi
```
If CPU is pegged, memory pressure is high, swap is active, or VRAM is full, the problem is probably hardware fit. If resources look fine but OpenClaw still feels slow, the bottleneck is likely context length, tool calls, or model behavior.
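If you want to script the first check, the PROCESSOR column of ollama ps is the quickest signal for CPU fallback. The sketch below assumes that column uses values shaped like "100% GPU", "100% CPU", or "48%/52% CPU/GPU"; the function name is made up for illustration, and the format may differ between Ollama versions.

```python
import re


def gpu_share(processor: str) -> int:
    """Return the percentage of the model placed on the GPU (0 to 100).

    Assumes the PROCESSOR value looks like "100% GPU", "100% CPU",
    or "48%/52% CPU/GPU". Anything else raises ValueError.
    """
    text = processor.strip()

    # Split placement, e.g. "48%/52% CPU/GPU": the second number is the GPU share.
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(2))

    # Single device, e.g. "100% GPU" or "100% CPU".
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", text)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        return pct if device == "GPU" else 100 - pct

    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


for value in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    print(value, "->", gpu_share(value), "% on GPU")
```

Anything well below 100 on a machine with a discrete GPU suggests the model is spilling out of VRAM.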
The 9 most common causes
| Cause | What it feels like | First fix |
|---|---|---|
| Model barely fits | Loads, then crawls | Use a smaller model or lower quant |
| CPU fallback | High CPU, low GPU use | Pick a model that fits VRAM |
| Swap pressure | System freezes or beachballs | Close apps, reduce context, use smaller model |
| Context too long | Gets slower later in the chat | Start a new session or compact |
| Quant too high (e.g. Q8) | Quality is good but tokens/sec is poor | Try Q4_K_M or Q5_K_M |
| Slow disk | Long model load times | Move models to SSD |
| Thermal throttling | Fast at first, slow after minutes | Improve cooling or lower load |
| Tool-call latency | OpenClaw pauses between steps | Use a tool-reliable model and narrow task |
| Wrong model for agent work | Retries, malformed tools, wandering | Use an OpenClaw-tested model |
Fix 1: Use a model that actually fits
The fastest local LLM is often not the largest model you can barely load. It is the largest model that fits with room left for the operating system, context cache, browser, editor, Docker, and OpenClaw traces.
Good starting points:
- 16GB RAM: small local models only, or use cloud for agent work.
- 24GB memory or VRAM: useful entry point for serious local AI.
- 32GB RAM: good for 20B to 32B class models at practical quants.
- 64GB RAM: comfortable for daily OpenClaw use and larger context.
- 96GB to 128GB RAM: power-user tier for larger models and long local runs.
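To make "fits with room left" concrete, here is a rough headroom budget. It is a sketch, not a sizing tool: the 8 GB reserve for the OS and apps, the 2 GB safety margin, and the example model sizes are all assumptions you should replace with your own numbers.

```python
def memory_headroom_gb(total_gb: float, weights_gb: float, kv_cache_gb: float,
                       reserved_gb: float = 8.0) -> float:
    """Memory left after the model weights, KV cache, and everything else.

    reserved_gb is an assumed allowance for the OS, browser, editor,
    Docker, and agent traces. Adjust it for your workstation.
    """
    return total_gb - (weights_gb + kv_cache_gb + reserved_gb)


def fits(total_gb: float, weights_gb: float, kv_cache_gb: float,
         reserved_gb: float = 8.0, margin_gb: float = 2.0) -> bool:
    """True if the model leaves at least margin_gb of spare memory."""
    headroom = memory_headroom_gb(total_gb, weights_gb, kv_cache_gb, reserved_gb)
    return headroom >= margin_gb


# Hypothetical example: 19 GB of weights plus a 4 GB KV cache.
for total in (32, 64):
    left = memory_headroom_gb(total, 19, 4)
    print(f"{total} GB machine: {left:.0f} GB headroom, fits={fits(total, 19, 4)}")
```

With these assumed numbers the 32GB machine loads the model but has only about 1 GB to spare, which is the "loads, then crawls" case.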
Fix 2: Reduce context length
Long context is one of the easiest ways to make a local model feel slow. The model has to keep more state around, and the KV cache consumes memory. If you are right on the edge of your hardware limit, a large context window can push the run into swap or CPU fallback.
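A back-of-the-envelope estimate shows why. For a standard transformer, the KV cache stores one key and one value vector per layer, per KV head, per token. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an illustrative assumption for an 8B-class model with grouped-query attention; check your model's actual config, and note that runtimes may add overhead or quantize the cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


GIB = 2 ** 30

# Illustrative shape only; not read from any specific model file.
for ctx in (8192, 32768, 131072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=ctx)
    print(f"context {ctx:>6}: {size / GIB:.0f} GiB")
```

The cache grows linearly with context, so quadrupling the window quadruples this number on top of the weights.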
For OpenClaw, start smaller:
- Use a fresh session for a new task.
- Keep the task scope narrow.
- Compact or summarize long conversations.
- Avoid loading huge files unless the task needs them.
- Ask the agent to inspect only the relevant directory first.
If the first few replies are fast but later replies are slow, context growth is probably involved.
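If you call Ollama directly, you can cap the context window per request with the num_ctx option. The sketch below builds a request for the /api/generate endpoint on the default local port; the model name is a placeholder, the 4096 default is just an example value, and how OpenClaw itself passes context settings is not covered here.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context size."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 4096) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())


# Usage (requires a running Ollama server and a pulled model):
# reply = generate("your-model", "Summarize this repo layout.", num_ctx=4096)
```

Running the same prompt at two context sizes is a quick way to see whether the window, rather than the model, is what pushes you into swap.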
Fix 3: Lower the quantization before changing machines
If you are running Q8 and the machine is struggling, try Q5 or Q4 before buying new hardware. In many local-agent workflows, Q4_K_M or Q5_K_M is a better practical choice than Q8 because it leaves enough room for context and tools.
Avoid going too low for OpenClaw. Extremely small quants can save memory but damage tool-calling reliability. A fast model that emits malformed tool calls is not fast in practice because the agent has to retry.
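To see how much memory a quant change actually buys, multiply parameter count by bits per weight. The bits-per-weight figures below are rounded approximations for common GGUF quants, not exact values, and real files vary by model; treat the output as a rough guide.

```python
# Rounded, approximate bits per weight. These are assumptions for estimation only.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8


# Example: a 32B-parameter model at each quant level.
for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant}: ~{weights_gb(32, quant):.1f} GB")
```

Under these assumptions, moving a 32B model from Q8 to Q4_K_M frees roughly 14 GB, which is often the difference between swapping and having room for context.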
Fix 4: Make sure the GPU is being used
On NVIDIA systems, check nvidia-smi while the model is running. If VRAM is empty and CPU is pinned, the model is not using the GPU path you expected.
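The same check can be scripted with nvidia-smi's CSV query mode. This is a minimal sketch: the 10% threshold for "VRAM is basically empty" is an arbitrary assumption, and some GPUs report "[N/A]" for certain fields, which this parser does not handle.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line such as "20480, 24576, 95" (MiB, MiB, percent)."""
    used, total, util = (int(field.strip()) for field in line.split(","))
    return {"vram_used_mib": used, "vram_total_mib": total, "gpu_util_pct": util}


def looks_like_cpu_fallback(stats: dict, min_vram_fraction: float = 0.10) -> bool:
    """Heuristic: almost no VRAM in use while a model is supposedly loaded."""
    return stats["vram_used_mib"] / stats["vram_total_mib"] < min_vram_fraction


def read_gpus() -> list:
    """Run nvidia-smi and return one stats dict per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Usage (on a machine with NVIDIA drivers, while the model is generating):
# for gpu in read_gpus():
#     print(gpu, "possible CPU fallback:", looks_like_cpu_fallback(gpu))
```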
On Apple Silicon, watch memory pressure. Unified memory is shared by the model, OS, browser, and apps. A 32GB Mac can be good for local models, but it can still slow down if you keep heavy apps open and run long OpenClaw sessions.
Fix 5: Separate model speed from OpenClaw speed
Test the same model in a plain chat prompt and in OpenClaw. If plain chat is fast but OpenClaw is slow, the issue may be:
- Tool-call planning.
- File reads.
- Browser automation.
- Large project context.
- A model that is weak at structured tool output.
- Repeated failed tool calls.
For OpenClaw, tool reliability matters more than benchmark score. A smaller model with clean tool behavior can beat a larger model that wanders or retries.
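One way to put numbers on that split is to compare model time with wall-clock time for an agent step. Ollama's non-streaming responses include eval_count plus eval_duration and total_duration in nanoseconds. The sketch below assumes you have collected those responses for one step and timed the step yourself; the sample values are invented.

```python
NS_PER_SECOND = 1_000_000_000


def tokens_per_second(resp: dict) -> float:
    """Generation speed for one response: output tokens / generation time."""
    return resp["eval_count"] / (resp["eval_duration"] / NS_PER_SECOND)


def model_seconds(responses: list) -> float:
    """Total time spent inside the model across all calls in one agent step."""
    return sum(r["total_duration"] for r in responses) / NS_PER_SECOND


def overhead_seconds(step_wall_seconds: float, responses: list) -> float:
    """Wall-clock time not explained by model calls: tools, file reads, retries."""
    return step_wall_seconds - model_seconds(responses)


# Invented sample: two model calls inside one 45-second agent step.
calls = [
    {"eval_count": 240, "eval_duration": 12_000_000_000, "total_duration": 15_000_000_000},
    {"eval_count": 60, "eval_duration": 3_000_000_000, "total_duration": 5_000_000_000},
]
print(tokens_per_second(calls[0]))    # 20.0 tokens/sec
print(overhead_seconds(45.0, calls))  # 25.0 seconds outside the model
```

If most of the step is overhead, a faster model or more hardware will not help much; narrowing the task or fixing tool failures will.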
Fix 6: Close memory-heavy apps
This sounds basic, but it matters. Browsers, Electron apps, Docker Desktop, IDEs, local databases, and screen recorders can steal the memory headroom that makes a model stable.
If you are on 32GB or 64GB RAM, try one clean benchmark:
- Quit browser tabs you do not need.
- Stop Docker containers you are not using.
- Start a new terminal session.
- Load the model.
- Run the same prompt again.
If performance improves, your model was competing with the rest of the workstation.
Fix 7: Use the calculator before upgrading
Buying more RAM helps when memory is the bottleneck. Buying a faster GPU helps when compute and bandwidth are the bottleneck. Buying either one blindly can waste money.
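A common rule of thumb helps tell the two cases apart: during generation, a dense model has to read roughly all of its weights for every token, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below applies that rule; the bandwidth figures are placeholders rather than specs for any particular device, and real throughput will be lower.

```python
def max_decode_tokens_per_second(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Rough upper bound on generation speed for a dense model.

    Assumes every token requires reading all weights once from memory
    and ignores compute, KV cache reads, and other overheads.
    """
    return bandwidth_gb_per_s / weights_gb


# Placeholder bandwidths for illustration only.
for label, bandwidth in (("slow memory", 100), ("mid-range", 400), ("fast GPU", 1000)):
    ceiling = max_decode_tokens_per_second(20, bandwidth)
    print(f"{label} ({bandwidth} GB/s): at most ~{ceiling:.0f} tokens/sec for 20 GB of weights")
```

If your measured speed is already near this ceiling, more RAM will not make the model faster; only more bandwidth or a smaller model will. If it is far below, look for swap, CPU fallback, or context growth first.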
Use the calculator:
- OpenClaw local model calculator
- Can my computer run a local LLM?
- 32GB vs 64GB RAM for local LLMs
- 64GB vs 128GB RAM for local LLMs
Quick FAQ
Why is my local LLM slow?
The most common reasons are that the model does not fit cleanly in VRAM or RAM, the context window is too large, the runtime is using CPU fallback, the quantization is higher-precision than the hardware can comfortably hold, or the agent is spending time in tool calls rather than token generation.
How do I make Ollama faster?
Use a model that fits your hardware, prefer GPU or Apple Silicon acceleration, lower the context size, choose a smaller model or a lower-bit quantization, close memory-heavy apps, and verify that Ollama is not spilling work to CPU or swap.
Why is OpenClaw slow with a local model?
OpenClaw adds tool calls, browser automation, file reads, and long conversation state on top of normal model inference. A model that feels fine in chat can feel slow as an agent if it barely fits in memory or if tool-calling reliability is weak.