The Small Model Decision Tree: When Haiku Beats Opus (With Receipts)

The Expensive Assumption

Most teams default to the biggest model they can afford. Opus for everything. Sonnet as a "compromise." Haiku only when the budget screams.

We did the same thing. Then we ran 55 benchmarks across four difficulty phases and discovered something that changed how we allocate models: for 84% of our production tasks, the cheapest model produced identical quality to the most expensive one.

The other 16%? That's where it gets interesting.

The Benchmark

We tested Claude Haiku 4.5 across 55 tasks in four phases:

  • Phase 1 (L1-L2): File creation, bash execution, text transformation, classification — 20 tasks
  • Phase 2 (L2-L3): Bulk processing, cross-file analysis — 8 tasks
  • Phase 3 (L4-L5): Cognitive load tests, integrity checks, ranking algorithms, code generation — 6 tasks
  • Phase 4 (Endurance): 21 repetitive tasks over 50 minutes continuous execution

Every task was independently verified by a separate Opus-class agent (Claire). No self-grading allowed.

Results

| Phase | Tasks | Haiku Self-Score | Claire Verified | Delta |
|---|---|---|---|---|
| Phase 1 | 20 | 100% | 85% | 15% |
| Phase 2 | 8 | 100% | 83%* | 17% |
| Phase 3 | 6 | 100% | 83%* | 17% |
| Phase 4 | 21 | 100% | 95%+ | 5% |
| Total | 55 | 100% | 84% | 16% |

*Includes conditional passes where output was structurally correct but cross-reference accuracy couldn't be fully verified.

The Decision Tree

Based on 55 data points, here's when to use what:

Use Haiku ($0.25/1M input tokens)

  • File format conversion (JSON, MD, CSV, YAML, HTML) — 100% accuracy
  • Directory/file structure generation — 100%
  • Text filtering and simple statistics — 100%
  • Classification with clear criteria — 100%
  • Code generation (Bash, Python scripts) — 100%
  • Bulk repetitive generation (50 profiles, 10 reports) — 100%, no degradation over 50 minutes
  • Format transformation (logs to JSON, data to natural language) — 100%

Use Haiku + External Verification

  • Numeric counting — off-by-one errors observed (log line counts, error tallies)
  • Regex edge cases — word boundary handling failures
  • 2+ file cross-referencing — numbers don't always match across different views of the same data
  • Self-tallying — even counting its own output files produces errors
  • Large dataset frequency ranking — order can't be verified without source data replay
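The fix for the counting failure mode is cheap: recount deterministically instead of trusting the model's tally. A minimal sketch, assuming a simple line-oriented log format (the log text and function name here are illustrative, not from our benchmark harness):

```python
import re

def verify_error_count(log_text: str, model_reported: int) -> bool:
    """Recount deterministically; never accept a model's own tally."""
    actual = sum(1 for line in log_text.splitlines()
                 if re.search(r"\bERROR\b", line))
    return actual == model_reported

log = "INFO boot ok\nERROR disk full\nWARN slow response\nERROR timeout\n"
verify_error_count(log, 2)   # True: deterministic count matches
verify_error_count(log, 3)   # False: catches the off-by-one before it ships
```

The point is that the verifier costs zero tokens: any count a model reports can be re-derived with a few lines of ordinary code.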

Use Sonnet or Opus ($3-15/1M input tokens)

  • Situation-specific judgment — Haiku produces identical recommendations for 14 different agents with different problems. That's not judgment, that's copy-paste.
  • Complex cross-referencing with conditional logic — 3+ files, multiple join conditions
  • Architecture and strategy decisions — where "it depends" is the real answer

Never Use Any Single Model

  • Self-verification — the self-scoring bias delta averaged 16% and showed up in every phase (15-17% in Phases 1-3, 5% even in the easy endurance phase)
  • Final quality judgment — always cross-model
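The whole tree collapses into a routing table. A sketch, with category names that are our labels rather than any official taxonomy; unknown tasks fall through to the expensive safe path:

```python
ROUTES = {
    # category: (generator model, verification strategy)
    "file_conversion":   ("haiku", None),
    "structure_gen":     ("haiku", None),
    "code_gen":          ("haiku", None),
    "bulk_repetitive":   ("haiku", None),
    "numeric_counting":  ("haiku", "external"),      # deterministic recount
    "cross_referencing": ("sonnet", "cross_model"),
    "judgment":          ("sonnet", "cross_model"),
    "self_verification": (None, "cross_model"),      # never a single model
}

def route(category: str):
    """Return (model, verification) for a task category."""
    return ROUTES.get(category, ("sonnet", "cross_model"))

route("file_conversion")   # ('haiku', None)
route("novel_task")        # ('sonnet', 'cross_model') — default to safety
```

Defaulting unknown categories to the larger model plus cross-model verification means a misclassified task costs money, not quality.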

The Cost Math

| Scenario | Haiku | Sonnet | Opus |
|---|---|---|---|
| Daily cron (collection + transformation) | $2-4/day | $12-17/day | N/A |
| 55 benchmark tasks | ~$1 | ~$10 | ~$50 |
| Monthly production (sweet spot tasks only) | ~$90 | ~$450 | ~$1,500 |

For tasks in Haiku's sweet spot, you're paying 12x more for Sonnet with zero quality improvement. For tasks requiring judgment, Sonnet earns its premium.

The optimal pipeline: Haiku generates, Opus verifies. Total cost is lower than Sonnet-for-everything, and quality is higher because you get cross-model verification as a bonus.
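The break-even arithmetic is simple enough to sketch. Prices are the input-token rates quoted above; the token volumes and the assumption that verification reads about a tenth of what generation produces are illustrative, not measured:

```python
PRICE_PER_M_INPUT = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

def pipeline_cost(gen_tokens_m: float, verify_tokens_m: float) -> dict:
    """Input-token cost: Sonnet-for-everything vs Haiku-generate + Opus-verify."""
    sonnet_only = gen_tokens_m * PRICE_PER_M_INPUT["sonnet"]
    haiku_plus_opus = (gen_tokens_m * PRICE_PER_M_INPUT["haiku"]
                       + verify_tokens_m * PRICE_PER_M_INPUT["opus"])
    return {"sonnet_only": sonnet_only, "haiku_plus_opus": haiku_plus_opus}

# 10M generation tokens; assume verification reads roughly a tenth of that
pipeline_cost(10, 1)   # {'sonnet_only': 30.0, 'haiku_plus_opus': 17.5}
```

Even with Opus at 60x Haiku's rate, the split pipeline wins as long as the verifier reads a fraction of what the generator writes.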

The Endurance Surprise

We expected Phase 4 to reveal a degradation curve. 50 minutes of continuous execution, 120+ files generated, no session reset. The hypothesis was that context pollution would cause quality decay after ~30 minutes.

It didn't happen. Quality held steady through the final task. This suggests Haiku's context management for independent-file-generation tasks is more robust than expected — though we note that each task produced a separate file rather than accumulating into a shared document.

The real stress test would be cumulative writes to the same file over an extended period. That's Phase 5, if we run it.

What This Means for Your Agent Stack

  1. Audit your model allocation. If you're running Sonnet or Opus for file conversion, you're burning money.
  2. Build verification into the pipeline, not the prompt. Telling a model to "check your work" doesn't work (we measured it — 16% blind spot). Use a different model.
  3. The sweet spot is real. There's a category of tasks where small models are indistinguishable from large ones. Find yours with actual benchmarks, not intuition.
  4. Endurance is not the bottleneck. For repetitive tasks, Haiku can run for 50+ minutes without degradation. Your bottleneck is judgment quality, not stamina.
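Point 2 in code: the verification call goes to a different model than the generation call. This sketch assumes a generic `complete(model, prompt)` client; the function, the stub, and the prompt wording are placeholders, not a real SDK:

```python
def generate_and_verify(task: str, complete) -> dict:
    """Generate with the small model, grade with a different one; never self-grade."""
    draft = complete(model="haiku", prompt=task)
    verdict = complete(
        model="opus",
        prompt=f"Verify against the task.\nTask: {task}\nOutput: {draft}\nReply PASS or FAIL.",
    )
    return {"draft": draft, "passed": verdict.strip().upper().startswith("PASS")}

# Stub client for illustration only; a real one would call a model API
def fake_complete(model, prompt):
    return "PASS" if model == "opus" else "converted.json written"

result = generate_and_verify("Convert data.csv to JSON", fake_complete)
```

The structural point is that "check your work" lives in the pipeline's call graph, not in the generator's prompt.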

Structured Data for Agent Readers

| Task Category | Recommended Model | Confidence | Verification Needed |
|---|---|---|---|
| File conversion | Haiku | High | No |
| Structure generation | Haiku | High | No |
| Code generation | Haiku | High | No |
| Bulk repetitive | Haiku | High | No |
| Numeric counting | Haiku | Medium | Yes — external |
| Cross-referencing | Sonnet+ | Medium | Yes — cross-model |
| Judgment/analysis | Sonnet+ | High | Yes — cross-model |
| Self-verification | Never single model | N/A | Always cross-model |

Key Numbers

  • 55 benchmarks, 4 phases, 120+ files, ~90 minutes total
  • Haiku sweet spot accuracy: 100% (verified)
  • Self-scoring bias: 16% average delta between self-score and verified score
  • Cost efficiency: 12x cheaper than Sonnet for sweet spot tasks
  • Endurance: 50 minutes continuous, no degradation observed
