I Spent 10 Cents Comparing Claude Haiku 4.5 vs Amazon Nova Micro. Haiku Lost.
I run two AI tools on markdownme.com using Amazon Nova Micro, the cheapest frontier model on AWS Bedrock. I wanted to know: am I leaving quality on the table? So I spent 10 cents running a promptfoo eval against Claude Haiku 4.5 (28x the price). The result was the opposite of what I expected, and the bug I found in Haiku's output is the kind of thing only an eval would surface.
Cutting your AI bill?
Book a Call at calendly.com/cloudyeti/meet. I'll help you decide which AI tasks actually need the premium model.
The Setup
I run a free Markdown tool site at markdownme.com. Two of the tools use AI:
structure-thread— takes a Twitter/X thread, returns JSON with a clean title and H2-structured Markdownimprove-jira-ticket— takes a vague ticket draft, returns a structured ticket with Title / Description / Acceptance Criteria / Out of Scope
Both run on Amazon Nova Micro via AWS Bedrock. At $0.035 per million input tokens and $0.14 per million output tokens, it’s the cheapest frontier-grade model on Bedrock. Most of my real production calls cost less than a tenth of a cent.
The question I had: am I leaving quality on the table? Claude Haiku 4.5 is on the same Bedrock account, costs $1 per million in / $5 per million out — about 28x the input cost. Conventional wisdom says Haiku is “smarter.” I wanted hard data.
Why promptfoo
I picked promptfoo because:
- It’s just YAML +
npx. No SaaS account, no SDK install, no platform lock-in. - Bedrock provider is first-class. Both models work with one config block each, using my existing AWS CLI credentials.
- LLM-as-judge is built in. You point a third model at the outputs and it grades them on whatever rubric you write.
- Real test cases come from anywhere. I pulled actual production inputs from my DynamoDB AI logs.
The whole eval folder is six files: two YAML configs, one package.json, one .gitignore, and the auto-generated SQLite result store.
The eval config (sketch)
Here’s the relevant chunk for structure-thread.yaml:
providers:
- id: bedrock:amazon.nova-micro-v1:0
label: "Nova Micro (current)"
config:
region: us-east-1
inferenceConfig:
maxTokens: 3000
temperature: 0.3
- id: bedrock:us.anthropic.claude-haiku-4-5-20251001-v1:0
label: "Haiku 4.5 (~28x cost)"
config:
region: us-east-1
max_tokens: 3000
temperature: 0.3
defaultTest:
options:
provider:
id: bedrock:us.anthropic.claude-sonnet-4-6
config:
region: us-east-1
assert:
- type: is-json
value: { ... JSON schema ... }
- type: llm-rubric
value: |
Grade the output 0-10 on:
- Body fidelity (was body text verbatim?)
- Cruft removal (timestamps, view counts, author handles stripped?)
- Heading placement (2-5 H2s at natural breaks?)
- Title quality (concrete, Title Case, under 80 chars?)
- JSON validity (strict JSON, no code fences?)
...
One subtle gotcha worth flagging: Nova uses inferenceConfig.maxTokens while Claude models use max_tokens at the top level. Different APIs under the same Bedrock umbrella. I tripped on this for ten minutes.
The test cases
Five real production inputs pulled from my DynamoDB AI logs:
- A 5-tweet equity research thread on Haivision (HAI.TO), pasted directly from X.com web UI with German interface cruft (“Mehr anzeigen”, “10:24 nachm.”, view counts) bleeding through.
- A baseline “5 dev tools every developer should know” thread — clean, simple, no cruft.
- A multi-author Hormozi pricing prompt thread with author handles and timestamps between every tweet.
- A Chinese-language tutorial on using Codex + HeyGen HyperFrames for AI video creation — multilingual stress test.
- A technical .env security thread about Claude Code reading API keys — high-signal, ideal target persona.
All five came directly from real user submissions in the last 30 days. Synthetic test cases never would have caught the German UI cruft case.
What happened
I ran the eval. Total runtime: 33 seconds. Total cost: about 10 cents (most of which was the Sonnet judge, not the comparison itself).
structure-thread results, scored 0-1:
| Test | Nova Micro | Haiku 4.5 | Delta |
|---|---|---|---|
| Haivision finance | 0.87 | 0.38 | +0.49 |
| 5 dev tools | 0.95 | 0.39 | +0.56 |
| Hormozi prompts | 0.89 | 0.41 | +0.48 |
| Chinese AI tutorial | 0.91 | 0.34 | +0.57 |
| .env security thread | 0.95 | 0.38 | +0.57 |
Nova Micro: 5 of 5 passing. Haiku 4.5: 0 of 5 passing. Average score 0.91 vs 0.38.
That’s not the result I expected.
The bug I would never have found without the eval
Looking at the raw outputs, the failure pattern was identical across all five Haiku tests:
// Nova Micro returned:
{
"title": "Deep Dive Into Haivision...",
"structured_markdown": "# $hai.to\n\n## Deep Dive..."
}
// Haiku 4.5 returned:
```json
{
"title": "Deep Dive Into Haivision...",
"structured_markdown": "# $hai.to\n\n## Deep Dive..."
}
```
Haiku 4.5 wraps the JSON in markdown code fences with a json language tag. Every single time. Despite the system prompt explicitly saying “Output STRICT JSON only. No preamble. No code fences.”
This is well-documented Claude behavior. The model is so strongly trained to return formatted JSON in chat contexts that the “no code fences” instruction barely registers. Nova Micro, sharing none of that training, follows the instruction literally.
The kicker: this is a non-issue in production for me because my frontend extracts JSON via regex (raw.match(/\{[\s\S]*\}/)). The strict JSON parser would fail, but the regex finds the object regardless. So in practice, the user sees the right thing either way.
But the eval caught it. And the Sonnet judge — grading on instruction following as one of its rubric criteria — correctly marked Haiku down. If I were running a stricter validation pipeline downstream (a typed API contract, a binary parser, a JSON Schema validator), Haiku would have actually broken production.
The Jira ticket eval was more interesting
improve-jira-ticket returns structured Markdown, not JSON. No code-fence issue. Here’s how that one shook out:
| Test | Nova Micro | Haiku 4.5 | Winner |
|---|---|---|---|
| Vague rant ("basically we should probably...") | 0.86 | 0.81 | Nova |
| Multi-paragraph rambling | 0.97 | 0.94 | Tied |
| Nearly-good draft, needs tightening | 0.93 | 0.97 | Haiku |
| Bug-report-style draft | 0.96 | 0.97 | Haiku |
| "we need analytics for the new dashboard" (6 words) | 0.81 | 0.94 | Haiku |
Both models pass on four of five. They’re statistically tied on the medium-to-long inputs (~0.94 each). Haiku only meaningfully outperforms on the smallest input — six words: “we need analytics for the new dashboard.”
That’s a real signal. When the input is so underspecified that the model has to imagine what the user wants, Haiku’s bigger training does help. Nova produces a barely-structured response that mostly echoes the input; Haiku invents reasonable acceptance criteria and a clear title.
But spending 28x more per call to fix a 1-in-5 weakness on the smallest inputs is a terrible trade. The better fix is a frontend check: if input is under 50 characters, show a hint like “Add more detail for a better ticket.” That’s free, fixes the only Nova failure mode, and improves Haiku output too.
Cost reality
For the full eval — 5 tests × 2 models × 2 tools, plus all the judge calls:
- Nova Micro: 8,327 tokens → $0.0010
- Haiku 4.5: 9,116 tokens → $0.0337
- Sonnet 4.6 judge: 32,659 tokens → $0.0653
- Total: about 10 cents
Haiku cost 33x more than Nova for marginal-to-negative quality improvement.
If I extrapolate to production — my current AI tools serve roughly 10-30 calls per day — switching to Haiku would push my monthly AI spend from “negligible” to about $15-50. Not a budget breaker, but not zero either, and the eval shows it would hurt quality on the higher-volume tool.
Want to model your own numbers? Use the free cross-provider Token Counter & Cost Estimator — paste any prompt or response, see exact OpenAI counts + estimated counts for Claude/Gemini/Nova/Grok/Mistral side-by-side with current pricing. Project to any monthly call volume in seconds.
What I shipped
- Kept Nova Micro on both tools. Decision documented.
- Backlog: input-length hint on improve-jira-ticket for sub-50-char inputs. Same pattern as the URL-detection hint I already use on twitter-to-markdown.
- Both eval configs committed to the repo at
/evals/structure-thread.yamland/evals/improve-jira-ticket.yaml. Re-runnable any time I touch a prompt. - Result summary at
/docs/eval-results-2026-05-16.mdso the next time I’m tempted to “just use Haiku for quality,” past-me explains why that’s wrong.
When you SHOULD pay for Haiku
This eval doesn’t say Haiku is a bad model. It says Haiku is wrong for these two specific tools. Cases where I’d pay the premium:
- Long-context reasoning over 50K+ tokens (Nova’s context shines on cheap, Haiku’s coherence shines on long).
- Multi-step tool calling where instruction following over 20+ iterations matters. (Haiku’s tool-call accuracy is a separate story I haven’t evaluated here.)
- Free-form creative writing where “smarter” is a real differentiator.
- Code generation where subtle correctness matters.
For structured-output transformations like mine — convert format X to format Y, preserve semantics, follow a strict schema — Nova Micro is enough. Use the cheap model. Save the budget for the work that actually benefits from a bigger brain.
How to run this yourself
If you have your own AI tool on Bedrock and want to know whether you’re overpaying, the eval setup is honestly tiny:
mkdir evals && cd evals
cat > my-tool.yaml << 'EOF'
description: "Compare cheap vs premium model on my prompt"
providers:
- id: bedrock:amazon.nova-micro-v1:0
config:
region: us-east-1
inferenceConfig:
maxTokens: 1500
temperature: 0.3
- id: bedrock:us.anthropic.claude-haiku-4-5-20251001-v1:0
config:
region: us-east-1
max_tokens: 1500
temperature: 0.3
prompts:
- |
[SYSTEM]
Your system prompt here.
[USER]
{{input}}
defaultTest:
options:
provider:
id: bedrock:us.anthropic.claude-sonnet-4-6
config:
region: us-east-1
assert:
- type: llm-rubric
value: "Grade on whatever you care about, 0-1."
tests:
- vars:
input: "Test case 1"
- vars:
input: "Test case 2"
EOF
npx promptfoo@latest eval -c my-tool.yaml
npx promptfoo@latest view
That’s the whole loop. Total time from “I’m curious” to “I have hard data” was about 90 minutes — most of which was deciding what test cases mattered, not anything technical.
Takeaways
- Default to the cheap model. Test before assuming you need the premium one.
- Use real production inputs. Synthetic test cases miss the long tail.
- LLM-as-judge is good enough for most subjective grading. Pair it with structural assertions (is-json, contains-all) for the deterministic parts.
- The biggest finding will probably be something the eval surfaced incidentally. I went in asking “is Nova good enough?” and came out with “Haiku has a code-fence bug that would have broken downstream consumers.”
- An eval suite pays for itself the first time a prompt change breaks something. I’ll run this whenever I touch either system prompt.
For the OpenClaw / Claude Code crowd: the same logic applies to model selection inside OpenClaw. Don’t default to Sonnet because it’s “smarter.” Run an eval on a representative sample of your tasks. The cheapest model that passes is the right one.
This post was written after running the actual eval. The 10-cent figure is real (charged to my AWS account on 2026-05-16). The full eval configs are open source at the markdownme.com repo.
See also:
- Best Local Models for OpenClaw — the same “test, don’t assume” logic applied to local LLMs
- OpenClaw Costs Guide — broader cost-cutting playbook
- Best Local LLM by RAM (hub)
Get guides like this in your inbox every Wednesday.
No spam. Unsubscribe anytime.
You'll probably need this again.
Press Cmd+D (Mac) or Ctrl+D (Windows) to bookmark this page.
Need help with your OpenClaw setup?
We do remote setup, troubleshooting, and training worldwide.
Book a Call