How Much Context Fits in 128GB RAM for a Local LLM?

On a 128GB local LLM machine, do not budget all 128GB for model weights. Reserve memory for the OS, OpenClaw, Ollama or llama.cpp, KV cache, browser/tools, and swap safety. For stable agent work, treat 90-105GB as the practical model-weight budget, start with a capped context, and raise it only after the workload proves it stays out of swap.

Direct Answer

A 128GB local LLM machine is a power-user tier, but it is not a blank 128GB bucket for model weights.

For OpenClaw, use this operating budget:

| Budget line | Reserve |
| --- | --- |
| macOS, Linux, drivers, browser, editor, shell tools | 8-16GB |
| OpenClaw, Ollama or llama.cpp, file tools, local services | 4-8GB |
| KV cache and context growth | 8-24GB for normal work, more for long context |
| Safety headroom to avoid swap | 8-12GB |
| Practical model-weight budget | about 90-105GB |

If a model’s weights use 95GB, it can fit, but it does not leave much room for a giant context window. If you want stable autonomous agent runs, start conservative and raise context after the workload proves it stays out of swap.

The Simple Formula

Use this before changing context:

usable_memory = installed_memory - os_reserve - process_reserve - safety_headroom
model_budget = usable_memory - kv_cache_budget

For a 128GB Mac or unified-memory workstation:

  • Installed memory: 128GB
  • Conservative reserve: 24-32GB
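  • Practical model + context budget: 96-104GB (installed memory minus the reserve)
  • Safe model-weight target for agent work: 90-105GB; the top of that range only works with minimal reserves and a small context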

That is why two setups can both “fit” but behave very differently:

  • 70B model at a practical quant with 32K or 64K context: comfortable.
  • 100B-120B model at a heavier quant with 64K or 128K context: possible to fit, but easy to push into swap.
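
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not an OpenClaw tool: the `MemoryBudget` class and its field names are made up for this example, and the reserve values are simply picked from inside the ranges in the budget table.

```python
from dataclasses import dataclass


@dataclass
class MemoryBudget:
    """Hypothetical helper that mirrors the two-line formula (all values in GB)."""
    installed_gb: float
    os_reserve_gb: float
    process_reserve_gb: float
    safety_headroom_gb: float
    kv_cache_budget_gb: float

    @property
    def usable_gb(self) -> float:
        # usable_memory = installed_memory - os_reserve - process_reserve - safety_headroom
        return (self.installed_gb - self.os_reserve_gb
                - self.process_reserve_gb - self.safety_headroom_gb)

    @property
    def model_budget_gb(self) -> float:
        # model_budget = usable_memory - kv_cache_budget
        return self.usable_gb - self.kv_cache_budget_gb


# Illustrative inputs only: a light reserve with a small KV cache budget,
# and a heavier reserve with a long-context KV cache budget.
tight_context = MemoryBudget(128, os_reserve_gb=12, process_reserve_gb=4,
                             safety_headroom_gb=8, kv_cache_budget_gb=8)
long_context = MemoryBudget(128, os_reserve_gb=16, process_reserve_gb=8,
                            safety_headroom_gb=8, kv_cache_budget_gb=24)

print(tight_context.usable_gb, tight_context.model_budget_gb)  # 104 96
print(long_context.usable_gb, long_context.model_budget_gb)    # 96 72
```

With these example inputs, moving from a small to a long-context KV cache budget on a busier machine shrinks the weight budget from 96GB to 72GB, which is why a 100B-class model and a large context window rarely fit together.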

Context Budget by Workload

| Workload | Starting context | Why |
| --- | --- | --- |
| Chat and short coding tasks | 16K-32K | Enough for prompt, tool output, and small diffs. Lowest memory risk. |
| OpenClaw agent loops | 32K | Good first setting because tool results and file reads grow fast. |
| Repository analysis | 32K-64K | Use retrieval or chunking before jumping to 128K. |
| Long PDFs or full transcripts | 64K+ | Works best with smaller models or lots of headroom. |
| 100B-class model experiments | 16K-32K | Keep context tight because weights already consume most of the budget. |

The operational rule: raise context only after confirming the model is not swapping. A slow local model is often a memory-pressure problem, not a model-quality problem.
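
To see why context is the knob that matters, it helps to estimate KV cache size directly. The sketch below uses the standard per-token estimate for a transformer with grouped-query attention (two vectors, one key and one value, per layer and per KV head). The model shape is an assumption chosen to resemble a Llama-3-70B-style model (80 layers, 8 KV heads, head dimension 128) with an unquantized 16-bit cache. Runtimes that quantize the cache, or models with a different shape, will land elsewhere.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes for a single sequence.

    The leading 2 counts one key and one value vector per layer, per KV head,
    per token. bytes_per_value=2 assumes a 16-bit cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


# Assumed shape: 80 layers, 8 KV heads, head_dim 128 (Llama-3-70B-style).
for ctx in (16_384, 32_768, 65_536, 131_072):
    gib = kv_cache_bytes(ctx, n_layers=80, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.0f} GiB")
#   16384 tokens -> 5 GiB
#   32768 tokens -> 10 GiB
#   65536 tokens -> 20 GiB
#  131072 tokens -> 40 GiB
```

Under these assumptions the cache grows linearly with context, so each doubling of the context window doubles the memory taken away from the weight budget.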

What 128GB Is Good For

128GB is strongest when you need one of these:

  1. A high-quality 70B-class model with comfortable context.
  2. A 100B-class model at a practical quantization level.
  3. Two smaller models kept warm for routing, such as a fast chat model plus a heavier agent model.
  4. Long-running OpenClaw jobs where swap would silently destroy throughput.
  5. A private local AI host for a team or home lab.

It is not automatically better if your actual workload is one 20B-30B model. In that case, 48GB or 64GB may be the better spend.
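
A rough weight-size estimate makes the tier choice easier to check. The sketch below uses the usual back-of-the-envelope rule (parameters times bits per weight, divided by 8). The function name and the bits-per-weight values are illustrative assumptions, and real quantized files are somewhat larger because they also store scales and metadata.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


# Illustrative model sizes and bit widths, not measurements of specific files.
examples = [("30B", 30, 4.5), ("70B", 70, 4.5), ("70B", 70, 8.0), ("120B", 120, 6.0)]
for name, params, bits in examples:
    print(f"{name} at {bits} bits/weight: ~{weight_size_gb(params, bits):.0f} GB")
# 30B at 4.5 bits/weight: ~17 GB
# 70B at 4.5 bits/weight: ~39 GB
# 70B at 8.0 bits/weight: ~70 GB
# 120B at 6.0 bits/weight: ~90 GB
```

By this estimate a 30B model at about 4.5 bits per weight needs roughly 17GB, which leaves plenty of room on a 48GB or 64GB machine, while a 120B model at 6 bits already sits at the bottom of the 90-105GB weight budget.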

Safe OpenClaw Starting Config

Start with a moderate context cap, then increase once the workload is stable:

openclaw config set agents.defaults.context_limit 32768
openclaw config set agents.defaults.keep_alive 30m
openclaw models status

If the machine shows no memory pressure during real runs, test 64K:

openclaw config set agents.defaults.context_limit 65536
openclaw run --agent "Analyze this repository and propose the smallest safe change"

If the machine swaps, reduce context before changing models. Context is the easiest knob to turn.

128GB vs 64GB

Use 128GB if you are doing serious local AI work: 70B+ models, multiple loaded models, large-document workflows, or private team hosting.

Use 64GB if you mostly run one efficient model, value lower hardware cost, and can keep context moderate.

The buying decision page is here: 64GB vs 128GB RAM for local LLMs.

Quick Troubleshooting

If a 128GB host is slow or unstable:

  • Lower context from 128K to 32K.
  • Unload secondary models.
  • Check whether inference has fallen back to the CPU.
  • Leave at least 8-12GB free during the run.
  • Prefer one stable model over a multi-model setup until the workload is repeatable.
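
The free-memory check in the list above can be scripted on Linux. This is a minimal sketch that assumes a Linux host, because it reads `/proc/meminfo` (macOS has no such file and needs a different tool). The function names and the 8GB default threshold are chosen for this example, taken from the low end of the headroom range.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_report(text: str, min_free_gb: float = 8.0) -> dict:
    """Report available memory, swap in use, and whether headroom meets the threshold."""
    info = parse_meminfo(text)
    kb_per_gb = 1024 ** 2  # /proc/meminfo values are in kB (1024-byte units)
    available_gb = info.get("MemAvailable", 0) / kb_per_gb
    swap_used_gb = (info.get("SwapTotal", 0) - info.get("SwapFree", 0)) / kb_per_gb
    return {
        "available_gb": available_gb,
        "swap_used_gb": swap_used_gb,
        "headroom_ok": available_gb >= min_free_gb,
    }


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # Linux only
        print(memory_report(meminfo.read_text()))
```

Run it while the model is loaded and an agent job is active. If `headroom_ok` is false or `swap_used_gb` keeps climbing during the run, lower context first.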

For a broader diagnostic, use Why Is My Local LLM So Slow?.
