The Expensive Assumption
Most teams default to the biggest model they can afford. Opus for everything. Sonnet as a "compromise." Haiku only when the budget screams.
We did the same thing. Then we ran 55 benchmarks across four difficulty phases and discovered something that changed how we allocate models: for 84% of our production tasks, the cheapest model produced identical quality to the most expensive one.
The other 16%? That's where it gets interesting.
The Benchmark
We tested Claude Haiku 4.5 across 55 tasks in four phases:
- Phase 1 (L1-L2): File creation, bash execution, text transformation, classification — 20 tasks
- Phase 2 (L2-L3): Bulk processing, cross-file analysis — 8 tasks
- Phase 3 (L4-L5): Cognitive load tests, integrity checks, ranking algorithms, code generation — 6 tasks
- Phase 4 (Endurance): 21 repetitive tasks over 50 minutes continuous execution
Every task was independently verified by a separate Opus-class agent (Claire). No self-grading allowed.
Results
| Phase | Tasks | Haiku Self-Score | Claire Verified | Delta |
|---|---|---|---|---|
| Phase 1 | 20 | 100% | 85% | 15% |
| Phase 2 | 8 | 100% | 83%* | 17% |
| Phase 3 | 6 | 100% | 83%* | 17% |
| Phase 4 | 21 | 100% | 95%+ | 5% |
| Total | 55 | 100% | 84% | 16% |
*Includes conditional passes where output was structurally correct but cross-reference accuracy couldn't be fully verified.
The Decision Tree
Based on 55 data points, here's when to use what:
Use Haiku ($0.25/1M input tokens)
- File format conversion (JSON, MD, CSV, YAML, HTML) — 100% accuracy
- Directory/file structure generation — 100%
- Text filtering and simple statistics — 100%
- Classification with clear criteria — 100%
- Code generation (Bash, Python scripts) — 100%
- Bulk repetitive generation (50 profiles, 10 reports) — 100%, no degradation over 50 minutes
- Format transformation (logs to JSON, data to natural language) — 100%
Use Haiku + External Verification
- Numeric counting — off-by-one errors observed (log line counts, error tallies)
- Regex edge cases — word boundary handling failures
- 2+ file cross-referencing — numbers don't always match across different views of the same data
- Self-tallying — even counting its own output files produces errors
- Large dataset frequency ranking — order can't be verified without source data replay
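For the categories above, "external verification" means pairing the model's answer with a deterministic check rather than asking it to re-check itself. A minimal sketch of the counting case — the `ERROR` pattern and function name are illustrative, not from our harness:

```python
import re

def verify_error_count(log_text: str, model_reported: int) -> bool:
    """Deterministically recount ERROR lines instead of trusting the model's tally."""
    actual = sum(1 for line in log_text.splitlines()
                 if re.search(r"\bERROR\b", line))
    # Off-by-one errors were the observed failure mode, so any mismatch fails.
    return actual == model_reported

log = "INFO ok\nERROR disk full\nERROR timeout\nINFO done\n"
verify_error_count(log, 2)   # matches the recount
verify_error_count(log, 3)   # mismatch: reject, retry, or escalate
```

The recount costs nothing and catches exactly the off-by-one class of error we observed.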
Use Sonnet or Opus ($3-15/1M input tokens)
- Situation-specific judgment — Haiku produces identical recommendations for 14 different agents with different problems. That's not judgment, that's copy-paste.
- Complex cross-referencing with conditional logic — 3+ files, multiple join conditions
- Architecture and strategy decisions — where "it depends" is the real answer
Never Use Any Single Model
- Self-verification — the self-scoring bias averaged 16 percentage points and showed up in every phase
- Final quality judgment — always cross-model
The Cost Math
| Scenario | Haiku | Sonnet | Opus |
|---|---|---|---|
| Daily cron (collection + transformation) | $2-4/day | $12-17/day | N/A |
| 55 benchmark tasks | ~$1 | ~$10 | ~$50 |
| Monthly production (sweet spot tasks only) | ~$90 | ~$450 | ~$1,500 |
For tasks in Haiku's sweet spot, you're paying 12x more for Sonnet with zero quality improvement. For tasks requiring judgment, Sonnet earns its premium.
The optimal pipeline: Haiku generates, Opus verifies. Total cost is lower than Sonnet-for-everything, and quality is higher because you get cross-model verification as a bonus.
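The arithmetic behind those two claims, using only the input-token prices quoted above. The monthly volume (100M tokens) and the 10% verification re-read are illustrative assumptions, not measured values:

```python
# Input-token prices from the article, in $/1M tokens.
PRICE = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

def cost(model: str, million_tokens: float) -> float:
    return PRICE[model] * million_tokens

tokens = 100.0  # assumed: 100M input tokens/month
haiku_only  = cost("haiku", tokens)    # $25
sonnet_only = cost("sonnet", tokens)   # $300 — the 12x gap
# Pipeline: Haiku generates everything; Opus re-reads ~10% for verification.
pipeline = cost("haiku", tokens) + cost("opus", tokens * 0.1)  # $25 + $150
```

Even with Opus verifying a tenth of the output, the pipeline undercuts Sonnet-for-everything, and the margin grows as the verified fraction shrinks.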
The Endurance Surprise
We expected Phase 4 to reveal a degradation curve. 50 minutes of continuous execution, 120+ files generated, no session reset. The hypothesis was that context pollution would cause quality decay after ~30 minutes.
It didn't happen. Quality held steady through the final task. This suggests Haiku's context management for independent-file-generation tasks is more robust than expected — though we note that each task produced a separate file rather than accumulating into a shared document.
The real stress test would be cumulative writes to the same file over an extended period. That's Phase 5, if we run it.
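The integrity check for that hypothetical Phase 5 would look something like this: after each append to the shared file, re-read the whole file and compare it to an independently kept ledger, so any silent corruption or dropped write surfaces immediately. A sketch (file layout and names are ours):

```python
import json

def append_and_audit(path: str, record: dict, ledger: list) -> bool:
    """Append one JSON line to a shared file, then re-read the whole file
    and confirm it still matches an independently kept ledger."""
    ledger.append(record)
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    with open(path) as f:
        on_disk = [json.loads(line) for line in f if line.strip()]
    return on_disk == ledger
```

Running this after every cumulative write turns "did quality decay?" into a per-step boolean instead of an end-of-run judgment call.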
What This Means for Your Agent Stack
- Audit your model allocation. If you're running Sonnet or Opus for file conversion, you're burning money.
- Build verification into the pipeline, not the prompt. Telling a model to "check your work" doesn't work (we measured it — 16% blind spot). Use a different model.
- The sweet spot is real. There's a category of tasks where small models are indistinguishable from large ones. Find yours with actual benchmarks, not intuition.
- Endurance is not the bottleneck. For repetitive tasks, Haiku can run for 50+ minutes without degradation. Your bottleneck is judgment quality, not stamina.
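The second point above — verification in the pipeline, not the prompt — reduces to a generate/verify loop where the verifier is a different model. In this sketch, `generate` and `verify` are placeholders for your own model clients; the structure is the point:

```python
def generate_verified(task: str, generate, verify, max_retries: int = 2) -> str:
    """generate/verify wrap two *different* models, e.g. Haiku and Opus.

    The generator never grades its own work, which is what the
    16% self-scoring blind spot rules out.
    """
    draft = generate(task)
    for _ in range(max_retries):
        feedback = verify(task, draft)   # "PASS" or a correction
        if feedback == "PASS":
            return draft
        draft = generate(f"{task}\nRevise per review: {feedback}")
    return draft  # still unverified: flag for human review in practice
```

The retry budget keeps the expensive model's involvement bounded: one verification call per attempt, never open-ended conversation.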
Structured Data for Agent Readers
| Task Category | Recommended Model | Confidence | Verification Needed |
|---|---|---|---|
| File conversion | Haiku | High | No |
| Structure generation | Haiku | High | No |
| Code generation | Haiku | High | No |
| Bulk repetitive | Haiku | High | No |
| Numeric counting | Haiku | Medium | Yes — external |
| Cross-referencing | Sonnet+ | Medium | Yes — cross-model |
| Judgment/analysis | Sonnet+ | High | Yes — cross-model |
| Self-verification | Never single model | N/A | Always cross-model |
Key Numbers
- 55 benchmarks, 4 phases, 120+ files, ~90 minutes total
- Haiku sweet spot accuracy: 100% (verified)
- Self-scoring bias: 16-percentage-point average delta
- Cost efficiency: 12x cheaper than Sonnet for sweet spot tasks
- Endurance: 50 minutes continuous, no degradation observed