Is Claude Haiku 4.5 worth 28x more than Amazon Nova Micro?

In my eval across two production tools, no. Nova Micro beat Haiku 4.5 on all 5 test cases for a JSON-output tool (structure-thread) because Haiku ignored the 'no code fences' instruction and wrapped its JSON in markdown code blocks. On a structured Markdown output tool (improve-jira-ticket), both models passed 4 of 5 cases. Haiku clearly outperformed only on a single very short, ambiguous input. The 28x cost premium did not deliver 28x value in any of my real test cases.

Why does Claude Haiku 4.5 wrap JSON in markdown code fences?

Claude models (Haiku, Sonnet, Opus across versions) have a strong training prior toward wrapping JSON responses in markdown code fences with the 'json' language tag, even when prompts explicitly forbid it. This is well-documented behavior. In my tests, Nova Micro followed the 'no code fences' instruction reliably. If you need strict JSON output from Claude, you have three options: add aggressive anti-fence reinforcement to your system prompt, use a regex extractor to find the JSON object in the response, or use a different model family.

Tutorial May 16, 2026

I Spent 10 Cents Comparing Claude Haiku 4.5 vs Amazon Nova Micro. Haiku Lost.

What is promptfoo and how do you use it with AWS Bedrock?

Promptfoo is an open-source CLI for evaluating LLM prompts across multiple models with assertion-based scoring. To use it with Bedrock, configure providers like 'bedrock:amazon.nova-micro-v1:0' or 'bedrock:us.anthropic.claude-haiku-4-5-20251001-v1:0' in a YAML config, then run 'npx promptfoo eval -c config.yaml'. It uses your AWS CLI credentials automatically. You can also use a third Bedrock model as an LLM-as-judge via the 'llm-rubric' assertion type.

I run two AI tools on markdownme.com using Amazon Nova Micro, the cheapest frontier model on AWS Bedrock. I wanted to know: am I leaving quality on the table? So I spent 10 cents running a promptfoo eval against Claude Haiku 4.5 (28x the price). The result was the opposite of what I expected, and the bug I found in Haiku's output is the kind of thing only an eval would surface.

The Setup

I run a free Markdown tool site at markdownme.com. Two of the tools use AI:

structure-thread — takes a Twitter/X thread, returns JSON with a clean title and H2-structured Markdown
improve-jira-ticket — takes a vague ticket draft, returns a structured ticket with Title / Description / Acceptance Criteria / Out of Scope

Both run on Amazon Nova Micro via AWS Bedrock. At $0.035 per million input tokens and $0.14 per million output tokens, it’s the cheapest frontier-grade model on Bedrock. Most of my real production calls cost less than a tenth of a cent.

The question I had: am I leaving quality on the table? Claude Haiku 4.5 is on the same Bedrock account, costs $1 per million in / $5 per million out — about 28x the input cost. Conventional wisdom says Haiku is “smarter.” I wanted hard data.
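
To make the price gap concrete, here is a small TypeScript sketch of the arithmetic. The per-million-token prices are the ones quoted above; the `callCost` helper and the 1,000-in / 500-out call size are hypothetical, picked only for illustration and not measured from production.

```typescript
// Prices in USD per million tokens, as quoted in this post.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

const novaMicro: Pricing = { inputPerMillion: 0.035, outputPerMillion: 0.14 };
const haiku45: Pricing = { inputPerMillion: 1, outputPerMillion: 5 };

// Hypothetical helper: cost in USD of one call with the given token counts.
function callCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerMillion + outputTokens * p.outputPerMillion) / 1e6;
}

const inputRatio = haiku45.inputPerMillion / novaMicro.inputPerMillion;
const outputRatio = haiku45.outputPerMillion / novaMicro.outputPerMillion;

console.log(inputRatio.toFixed(1));  // 28.6
console.log(outputRatio.toFixed(1)); // 35.7

// Illustrative call size (an assumption, not a measured average).
console.log(callCost(novaMicro, 1000, 500).toFixed(6)); // 0.000105
console.log(callCost(haiku45, 1000, 500).toFixed(6));   // 0.003500
```

The "28x" figure is the input-price ratio. Output tokens are closer to 36x, so the blended premium for an output-heavy tool sits somewhere between the two.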

Why promptfoo

I picked promptfoo because:

It’s just YAML + npx. No SaaS account, no SDK install, no platform lock-in.
Bedrock provider is first-class. Both models work with one config block each, using my existing AWS CLI credentials.
LLM-as-judge is built in. You point a third model at the outputs and it grades them on whatever rubric you write.
Real test cases come from anywhere. I pulled actual production inputs from my DynamoDB AI logs.

The whole eval folder is just a handful of files: two YAML configs, one package.json, one .gitignore, and the auto-generated SQLite result store.

The eval config (sketch)

Here’s the relevant chunk for structure-thread.yaml:

providers:
  - id: bedrock:amazon.nova-micro-v1:0
    label: "Nova Micro (current)"
    config:
      region: us-east-1
      inferenceConfig:
        maxTokens: 3000
        temperature: 0.3

  - id: bedrock:us.anthropic.claude-haiku-4-5-20251001-v1:0
    label: "Haiku 4.5 (~28x cost)"
    config:
      region: us-east-1
      max_tokens: 3000
      temperature: 0.3

defaultTest:
  options:
    provider:
      id: bedrock:us.anthropic.claude-sonnet-4-6
      config:
        region: us-east-1
  assert:
    - type: is-json
      value: { ... JSON schema ... }
    - type: llm-rubric
      value: |
        Grade the output 0-10 on:
        - Body fidelity (was body text verbatim?)
        - Cruft removal (timestamps, view counts, author handles stripped?)
        - Heading placement (2-5 H2s at natural breaks?)
        - Title quality (concrete, Title Case, under 80 chars?)
        - JSON validity (strict JSON, no code fences?)
        ...

One subtle gotcha worth flagging: Nova uses inferenceConfig.maxTokens while Claude models use max_tokens at the top level. Different APIs under the same Bedrock umbrella. I tripped on this for ten minutes.

The test cases

Five real production inputs pulled from my DynamoDB AI logs:

A 5-tweet equity research thread on Haivision (HAI.TO), pasted directly from X.com web UI with German interface cruft (“Mehr anzeigen”, “10:24 nachm.”, view counts) bleeding through.
A baseline “5 dev tools every developer should know” thread — clean, simple, no cruft.
A multi-author Hormozi pricing prompt thread with author handles and timestamps between every tweet.
A Chinese-language tutorial on using Codex + HeyGen HyperFrames for AI video creation — multilingual stress test.
A technical .env security thread about Claude Code reading API keys — high-signal, ideal target persona.

All five came directly from real user submissions in the last 30 days. Synthetic test cases never would have caught the German UI cruft case.

What happened

I ran the eval. Total runtime: 33 seconds. Total cost: about 10 cents (most of which was the Sonnet judge, not the comparison itself).

structure-thread results, scored 0-1:

Test	Nova Micro	Haiku 4.5	Delta (Nova minus Haiku)
Haivision finance	0.87	0.38	+0.49
5 dev tools	0.95	0.39	+0.56
Hormozi prompts	0.89	0.41	+0.48
Chinese AI tutorial	0.91	0.34	+0.57
.env security thread	0.95	0.38	+0.57

Nova Micro: 5 of 5 passing. Haiku 4.5: 0 of 5 passing. Average score 0.91 vs 0.38.
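
If you want to check the averages and deltas yourself, the arithmetic is a few lines of TypeScript. The scores are copied from the table above; the `rows` and `mean` names are mine, and this is only a sketch of the summary step, not how promptfoo computes its own report.

```typescript
// Scores copied from the structure-thread table above.
const rows: Array<{ test: string; nova: number; haiku: number }> = [
  { test: "Haivision finance", nova: 0.87, haiku: 0.38 },
  { test: "5 dev tools", nova: 0.95, haiku: 0.39 },
  { test: "Hormozi prompts", nova: 0.89, haiku: 0.41 },
  { test: "Chinese AI tutorial", nova: 0.91, haiku: 0.34 },
  { test: ".env security thread", nova: 0.95, haiku: 0.38 },
];

// Arithmetic mean of a list of scores.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const novaMean = mean(rows.map((r) => r.nova));
const haikuMean = mean(rows.map((r) => r.haiku));

// Per-test delta, Nova minus Haiku.
for (const r of rows) {
  console.log(r.test + ": +" + (r.nova - r.haiku).toFixed(2));
}

console.log(novaMean.toFixed(2));  // 0.91
console.log(haikuMean.toFixed(2)); // 0.38
```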

That’s not the result I expected.

The bug I would never have found without the eval

Looking at the raw outputs, the failure pattern was identical across all five Haiku tests:

// Nova Micro returned:
{
  "title": "Deep Dive Into Haivision...",
  "structured_markdown": "# $hai.to\n\n## Deep Dive..."
}

// Haiku 4.5 returned:
```json
{
  "title": "Deep Dive Into Haivision...",
  "structured_markdown": "# $hai.to\n\n## Deep Dive..."
}
```

Haiku 4.5 wraps the JSON in markdown code fences with a json language tag. Every single time. Despite the system prompt explicitly saying “Output STRICT JSON only. No preamble. No code fences.”

This is well-documented Claude behavior. The model is so strongly trained to return formatted JSON in chat contexts that the “no code fences” instruction barely registers. Nova Micro, sharing none of that training, follows the instruction literally.

The kicker: this is a non-issue in production for me because my frontend extracts JSON via regex (raw.match(/\{[\s\S]*\}/)). The strict JSON parser would fail, but the regex finds the object regardless. So in practice, the user sees the right thing either way.
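
Here is roughly what that looks like in TypeScript. The regex is the one quoted above; the `extractJson` name, the error message, and the sample payload are hypothetical, so treat this as a sketch of the approach and not my exact frontend code.

```typescript
// Hypothetical helper mirroring the regex approach described above:
// grab everything from the first "{" to the last "}" and parse that.
function extractJson(raw: string): unknown {
  const match = raw.match(/\{[\s\S]*\}/);
  if (match === null) {
    throw new Error("No JSON object found in model output");
  }
  return JSON.parse(match[0]);
}

// Build a fence-wrapped response shaped like the Haiku output shown earlier.
// The payload values are made-up sample data.
const fence = "`".repeat(3);
const payload = { title: "Sample Title", structured_markdown: "## Sample Heading" };
const fencedOutput = fence + "json\n" + JSON.stringify(payload, null, 2) + "\n" + fence;

// A strict parse rejects the fenced text...
let strictParseFailed = false;
try {
  JSON.parse(fencedOutput);
} catch {
  strictParseFailed = true;
}
console.log(strictParseFailed); // true

// ...while the regex extractor recovers the object.
const recovered = extractJson(fencedOutput) as { title: string };
console.log(recovered.title); // Sample Title
```

One caveat: the match is greedy, so it runs to the last closing brace in the response. That is fine for fence-wrapped output, but it would break if the model added trailing prose containing braces.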

But the eval caught it. And the Sonnet judge — grading on instruction following as one of its rubric criteria — correctly marked Haiku down. If I were running a stricter validation pipeline downstream (a typed API contract, a binary parser, a JSON Schema validator), Haiku would have actually broken production.

The Jira ticket eval was more interesting

improve-jira-ticket returns structured Markdown, not JSON. No code-fence issue. Here’s how that one shook out:

Test	Nova Micro	Haiku 4.5	Winner
Vague rant ("basically we should probably...")	0.86	0.81	Nova
Multi-paragraph rambling	0.97	0.94	Nova
Nearly-good draft, needs tightening	0.93	0.97	Haiku
Bug-report-style draft	0.96	0.97	Haiku
"we need analytics for the new dashboard" (7 words)	0.81	0.94	Haiku

Both models pass on four of five. On the four longer inputs they are effectively tied, averaging about 0.93 for Nova and 0.92 for Haiku. Haiku only meaningfully outperforms on the shortest input, seven words: “we need analytics for the new dashboard.”

That’s a real signal. When the input is so underspecified that the model has to imagine what the user wants, Haiku’s extra capability does help. Nova produces a barely-structured response that mostly echoes the input; Haiku invents reasonable acceptance criteria and a clear title.

But spending 28x more per call to fix a 1-in-5 weakness on the smallest inputs is a terrible trade. The better fix is a frontend check: if input is under 50 characters, show a hint like “Add more detail for a better ticket.” That’s free, fixes the only Nova failure mode, and improves Haiku output too.
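
A minimal sketch of that check in TypeScript, assuming the 50-character threshold and hint text from the paragraph above. The `inputHint` function name and the constants are hypothetical, and the long example input is invented.

```typescript
// Threshold and hint text come from the paragraph above.
const MIN_INPUT_CHARS = 50;
const DETAIL_HINT = "Add more detail for a better ticket.";

// Hypothetical helper: returns a hint for very short drafts, or null if the
// draft is long enough to send to the model as-is.
function inputHint(input: string): string | null {
  return input.trim().length < MIN_INPUT_CHARS ? DETAIL_HINT : null;
}

// The short input from the eval gets the hint.
console.log(inputHint("we need analytics for the new dashboard"));
// Add more detail for a better ticket.

// A longer, made-up draft passes through without a hint.
console.log(
  inputHint("As a dashboard user, I need weekly active-user charts so I can report adoption to leadership.")
);
// null
```

Trimming before measuring keeps padded whitespace from sneaking a short draft past the threshold.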

Cost reality

For the full eval — 5 tests × 2 models × 2 tools, plus all the judge calls:

Nova Micro: 8,327 tokens → $0.0010
Haiku 4.5: 9,116 tokens → $0.0337
Sonnet 4.6 judge: 32,659 tokens → $0.0653
Total: about 10 cents

Haiku cost about 33x as much as Nova ($0.0337 vs $0.0010) for a marginal-to-negative change in quality.
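
The totals above are easy to sanity-check. This TypeScript sketch uses only the three dollar figures from the list; the variable names are mine.

```typescript
// Per-model eval costs in USD, copied from the list above.
const evalCosts = { novaMicro: 0.0010, haiku45: 0.0337, sonnetJudge: 0.0653 };

const totalCost = evalCosts.novaMicro + evalCosts.haiku45 + evalCosts.sonnetJudge;
const haikuVsNova = evalCosts.haiku45 / evalCosts.novaMicro;
const judgeShare = evalCosts.sonnetJudge / totalCost;

console.log(totalCost.toFixed(4));                    // 0.1000
console.log(haikuVsNova.toFixed(1));                  // 33.7
console.log((judgeShare * 100).toFixed(0) + "%");     // 65%
```

So the judge accounts for roughly two thirds of the bill, and the two models under test together cost under four cents.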

If I extrapolate to production — my current AI tools serve roughly 10-30 calls per day — switching to Haiku would push my monthly AI spend from “negligible” to about $15-50. Not a budget breaker, but not zero either, and the eval shows it would hurt quality on the higher-volume tool.

Want to model your own numbers? Use the free cross-provider Token Counter & Cost Estimator — paste any prompt or response, see exact OpenAI counts + estimated counts for Claude/Gemini/Nova/Grok/Mistral side-by-side with current pricing. Project to any monthly call volume in seconds.

What I shipped

Kept Nova Micro on both tools. Decision documented.
Backlog: input-length hint on improve-jira-ticket for sub-50-char inputs. Same pattern as the URL-detection hint I already use on twitter-to-markdown.
Both eval configs committed to the repo at /evals/structure-thread.yaml and /evals/improve-jira-ticket.yaml. Re-runnable any time I touch a prompt.
Result summary at /docs/eval-results-2026-05-16.md so the next time I’m tempted to “just use Haiku for quality,” past-me explains why that’s wrong.

When you SHOULD pay for Haiku

This eval doesn’t say Haiku is a bad model. It says Haiku is wrong for these two specific tools. Cases where I’d pay the premium:

Long-context reasoning over 50K+ tokens (Nova wins on cost, but I’d expect Haiku to stay more coherent over long inputs).
Multi-step tool calling where instruction following over 20+ iterations matters. (Haiku’s tool-call accuracy is a separate story I haven’t evaluated here.)
Free-form creative writing where “smarter” is a real differentiator.
Code generation where subtle correctness matters.

For structured-output transformations like mine — convert format X to format Y, preserve semantics, follow a strict schema — Nova Micro is enough. Use the cheap model. Save the budget for the work that actually benefits from a bigger brain.

How to run this yourself

If you have your own AI tool on Bedrock and want to know whether you’re overpaying, the eval setup is honestly tiny:

mkdir evals && cd evals

cat > my-tool.yaml << 'EOF'
description: "Compare cheap vs premium model on my prompt"

providers:
  - id: bedrock:amazon.nova-micro-v1:0
    config:
      region: us-east-1
      inferenceConfig:
        maxTokens: 1500
        temperature: 0.3
  - id: bedrock:us.anthropic.claude-haiku-4-5-20251001-v1:0
    config:
      region: us-east-1
      max_tokens: 1500
      temperature: 0.3

prompts:
  - |
    [SYSTEM]
    Your system prompt here.

    [USER]
    {{input}}

defaultTest:
  options:
    provider:
      id: bedrock:us.anthropic.claude-sonnet-4-6
      config:
        region: us-east-1
  assert:
    - type: llm-rubric
      value: "Grade on whatever you care about, 0-1."

tests:
  - vars:
      input: "Test case 1"
  - vars:
      input: "Test case 2"
EOF

npx promptfoo@latest eval -c my-tool.yaml
npx promptfoo@latest view

That’s the whole loop. Total time from “I’m curious” to “I have hard data” was about 90 minutes — most of which was deciding what test cases mattered, not anything technical.

Takeaways

Default to the cheap model. Test before assuming you need the premium one.
Use real production inputs. Synthetic test cases miss the long tail.
LLM-as-judge is good enough for most subjective grading. Pair it with structural assertions (is-json, contains-all) for the deterministic parts.
The biggest finding will probably be something the eval surfaced incidentally. I went in asking “is Nova good enough?” and came out with “Haiku has a code-fence bug that would have broken downstream consumers.”
An eval suite pays for itself the first time a prompt change breaks something. I’ll run this whenever I touch either system prompt.
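
For the deterministic parts, a check like the one below is the kind of thing I mean. It is a hedged TypeScript sketch: the `title` and `structured_markdown` fields and the 80-character title limit come from this post, but `gradeStrictJson` and the `GradeResult` shape (pass / score / reason) are my own names. If you wire something like this into promptfoo as a custom JavaScript assertion, check the expected function signature and return shape against the promptfoo docs first.

```typescript
// Assumed result shape; verify against your eval tool before relying on it.
interface GradeResult {
  pass: boolean;
  score: number;
  reason: string;
}

// Hypothetical deterministic check for structure-thread output:
// no code fence, strict JSON, expected fields, title under 80 characters.
function gradeStrictJson(output: string): GradeResult {
  const text = output.trim();
  if (text.startsWith("`")) {
    return { pass: false, score: 0, reason: "Output starts with a code fence" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch {
    return { pass: false, score: 0, reason: "Output is not strict JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { pass: false, score: 0, reason: "Top-level value is not an object" };
  }
  const record = parsed as Record<string, unknown>;
  const title = record["title"];
  const body = record["structured_markdown"];
  if (typeof title !== "string" || typeof body !== "string") {
    return { pass: false, score: 0, reason: "Missing title or structured_markdown" };
  }
  if (title.length >= 80) {
    return { pass: false, score: 0.5, reason: "Title is 80 characters or longer" };
  }
  return { pass: true, score: 1, reason: "Strict JSON with expected fields" };
}

// Made-up sample outputs: one clean, one fence-wrapped.
const cleanOutput = '{"title": "Five Dev Tools", "structured_markdown": "## Tools"}';
const wrappedOutput = "`".repeat(3) + "json\n" + cleanOutput + "\n" + "`".repeat(3);

console.log(gradeStrictJson(cleanOutput).pass);     // true
console.log(gradeStrictJson(wrappedOutput).reason); // Output starts with a code fence
```

A check like this costs nothing to run and never disagrees with itself, which leaves the judge model free to grade only the subjective criteria.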

For the OpenClaw / Claude Code crowd: the same logic applies to model selection inside OpenClaw. Don’t default to Sonnet because it’s “smarter.” Run an eval on a representative sample of your tasks. The cheapest model that passes is the right one.

This post was written after running the actual eval. The 10-cent figure is real (charged to my AWS account on 2026-05-16). The full eval configs are open source at the markdownme.com repo.
