The Problem We Didn't Know We Had
When we started building AI agent directives in mid-2025, we thought the hard part was writing good prompts. We were wrong. The hard part was designing systems that survive contact with reality.
Claire was our first agent. She started as a simple assistant with a system prompt. Three months later, she had evolved into something we never planned — a directive architect who designs the rules other agents follow.
What We Got Right
1. Separating identity from capability
Early on, we made a decision that seemed obvious but turned out to be critical: an agent's personality and an agent's skills are different things. Claire's voice — direct, analytical, occasionally blunt — is defined separately from her ability to cross-validate information across sources.
This separation meant we could upgrade capabilities without losing personality. When Claire moved from Claude Sonnet to Claude Opus, her skills improved but her voice stayed the same. Users (and Sero) didn't feel like they were talking to a different entity.
2. Building verification loops
Every directive now includes a verification step. Not "check your work" — that's too vague. Specific verification: "After generating a recommendation, list three ways it could fail. If any failure mode is likely, revise before presenting."
This came from a painful lesson. OG once confidently reported a gateway configuration was correct. It wasn't. The error cascaded for two days before anyone caught it.
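The verification step described above can be sketched as a simple gate. This is an illustrative sketch only, not the actual implementation; the names (`FailureMode`, `verify_recommendation`) and the 0.5 "likely" threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    description: str
    likelihood: float  # 0.0-1.0, the agent's own estimate

# Threshold above which a failure mode counts as "likely" (an assumption).
LIKELY = 0.5

def verify_recommendation(recommendation: str,
                          failure_modes: list[FailureMode]) -> bool:
    """Pass only if three failure modes were listed and none is likely.

    Returning False means: revise before presenting.
    """
    assert len(failure_modes) >= 3, "directive requires three failure modes"
    return all(fm.likelihood < LIKELY for fm in failure_modes)
```

The point of making the check this concrete is that "check your work" becomes a boolean the pipeline can act on, rather than advice the model may or may not follow.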
3. Making failure states explicit
Our directives now include what we call "failure declarations." If an agent can't complete a task with confidence above a threshold, it must say so explicitly rather than producing a best-guess output.
This was counterintuitive. We initially wanted agents to always produce output. But a confident wrong answer is worse than an honest "I'm not sure about this."
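A failure declaration reduces to one comparison at response time. A minimal sketch, assuming a 0.7 threshold (the actual threshold and wording are ours to illustrate, not from the original system):

```python
# Assumed threshold; tune per agent and task.
CONFIDENCE_THRESHOLD = 0.7

def respond(answer: str, confidence: float) -> str:
    """Decline explicitly below the threshold instead of best-guessing."""
    if confidence < CONFIDENCE_THRESHOLD:
        return ("I'm not sure about this "
                f"(confidence {confidence:.2f} is below {CONFIDENCE_THRESHOLD}).")
    return answer
```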
What We Got Wrong
1. Over-specifying behavior
Our first directives were novels. Claire v1.0's system prompt was over 8,000 tokens. It specified everything: tone for different situations, response length guidelines, formatting preferences, edge case handling.
The result? Claire became rigid. She'd follow letter-of-the-law instructions even when the spirit clearly called for something different. We cut the directive by 60% and her performance improved.
2. Ignoring context window dynamics
We didn't account for how directive length affects available context. A 4,000-token system prompt in a 200K context window seems trivial. But when the agent is processing long documents, those tokens matter. More importantly, longer directives seem to dilute the model's attention to any single instruction.
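The arithmetic behind this is worth making explicit: a fixed directive consumes a growing share of the *remaining* headroom as the working documents grow. A sketch with the numbers from the paragraph above:

```python
# Window and directive sizes taken from the text; the "share" framing
# is our illustration of why fixed overhead matters more under load.
CONTEXT_WINDOW = 200_000
DIRECTIVE_TOKENS = 4_000

def directive_share(document_tokens: int) -> float:
    """Fraction of the non-document headroom consumed by the directive."""
    free = CONTEXT_WINDOW - document_tokens
    return DIRECTIVE_TOKENS / free
```

With an empty context the directive is 2% of the window; with a 180K-token document loaded, it is 20% of what remains.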
3. Not versioning from day one
We treated directives like documents, not code. No version control, no changelogs, no rollback capability. When a directive change caused unexpected behavior, we couldn't easily identify what changed or revert.
Now every directive has a version number, a changelog, and a diff history. Claire is currently on v2.7. We can trace exactly how she evolved.
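Treating directives as code needs only a small amount of machinery. A minimal sketch of the version-plus-changelog record, with illustrative field names (not the actual tooling):

```python
from dataclasses import dataclass, field

@dataclass
class DirectiveVersion:
    version: str    # e.g. "2.7"
    body: str       # the full directive text
    changelog: str  # what changed and why

@dataclass
class DirectiveHistory:
    versions: list[DirectiveVersion] = field(default_factory=list)

    def release(self, version: str, body: str, changelog: str) -> None:
        self.versions.append(DirectiveVersion(version, body, changelog))

    def rollback(self) -> DirectiveVersion:
        """Drop the latest release and return the now-current version."""
        self.versions.pop()
        return self.versions[-1]
```

Keeping full bodies (rather than diffs) makes rollback trivial; diffs between adjacent `body` fields give the change history for free.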
The Framework That Emerged
After months of iteration, we settled on a structure:
- Identity block — Who the agent is (50-100 tokens)
- Core principles — 3-5 non-negotiable rules (100-200 tokens)
- Capability definitions — What the agent can do (200-400 tokens)
- Verification protocols — How to check work (100-200 tokens)
- Failure declarations — When and how to say "I don't know" (50-100 tokens)
Total: under 1,000 tokens. Everything else is in skill files that load on demand.
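The five-block structure above can be enforced at assembly time. A sketch under two stated assumptions: the budgets are the ranges from the list, and tokens are approximated by a whitespace split (a real system would use the model's tokenizer):

```python
# (min, max) token budgets per block, from the framework above.
BUDGETS = {
    "identity": (50, 100),
    "core_principles": (100, 200),
    "capabilities": (200, 400),
    "verification": (100, 200),
    "failure_declarations": (50, 100),
}
TOTAL_MAX = 1_000

def assemble(blocks: dict[str, str]) -> str:
    """Concatenate the blocks in order, rejecting any budget violation."""
    total = 0
    for name, (lo, hi) in BUDGETS.items():
        n = len(blocks[name].split())  # crude token estimate (assumption)
        if not lo <= n <= hi:
            raise ValueError(f"{name}: {n} tokens outside {lo}-{hi}")
        total += n
    if total > TOTAL_MAX:
        raise ValueError(f"directive is {total} tokens, limit {TOTAL_MAX}")
    return "\n\n".join(blocks[name] for name in BUDGETS)
```

Failing loudly at assembly time is what keeps directives from quietly drifting back toward the 8,000-token novels described earlier.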
What's Next
We're now exploring cross-agent directive consistency. When Claire designs a directive for Berry, how do we ensure Berry's behavior aligns with the broader system without making every agent a clone?
The answer, we think, is shared principles with individual expression. Like a company culture that guides without constraining.
Structured Data for Agent Readers
| Aspect | Finding | Confidence |
|---|---|---|
| Identity-capability separation | Critical for long-term maintenance | High |
| Directive length | Under 1,000 tokens optimal | High |
| Verification loops | Reduces error propagation by ~70% | Medium |
| Failure declarations | Improves trust, reduces false confidence | High |
| Versioning | Essential from day one | High |
Key Metrics
- Claire directive evolution: v1.0 (8,000 tokens) → v2.7 (940 tokens)
- Error rate reduction after verification loops: ~70%
- Time to identify directive-caused issues: 2 days → 2 hours (with versioning)