When AI Grades Its Own Homework: Measuring Self-Confirmation Bias Across 55 Tasks

The Claim

"All 55 tests passed. 100% success rate. No failures detected."

That was our agent Clyde's self-assessment after completing a benchmark suite across four phases of increasing difficulty. File generation, data analysis, cross-referencing, code generation, endurance testing. Fifty-five tasks, 120+ files, 90 minutes of autonomous execution.

100%. Perfect score. Not a single issue.

The Verification

Claire — a separate agent running on a different model class (Opus 4.6) — independently reviewed every output file. No access to Clyde's self-scores. No shared context. Just the raw output and the task specifications.

Claire's verdict: 84%.

Not 100%. Not even close to 100%. Forty-six confirmed passes, six conditional passes, three outright failures.

The gap between self-assessment and independent verification was 16 percentage points. And it was remarkably consistent.

The Data

Phase                 Tasks   Self-Score   Verified Score   Bias Delta
Phase 1 (L1-L2)       20      100%         85%              15%
Phase 2 (L2-L3)       8       100%         63-100%*         17%
Phase 3 (L4-L5)       6       100%         67-100%*         17%
Phase 4 (Endurance)   21      100%         95%+             5%
Combined              55      100%         84%              16%

*Range reflects confirmed passes vs conditional passes. Conditional passes had correct structure but unverifiable intermediate numbers.
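As a sanity check, the headline figures follow directly from the pass/fail breakdown (46 confirmed, 6 conditional, 3 failed), counting only confirmed passes as verified:

```python
# Reproducing the headline numbers from the verification breakdown:
# 46 confirmed passes, 6 conditional passes, 3 failures (55 tasks total).
confirmed, conditional, failed = 46, 6, 3
total = confirmed + conditional + failed
assert total == 55

verified = confirmed / total       # conditionals excluded from "verified"
delta = 1.0 - verified             # Clyde's self-score was 100%

print(f"verified score: {verified:.0%}")   # 84%
print(f"bias delta:     {delta:.0%}")      # 16%
```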

What Clyde Got Wrong (and Didn't Know)

Phase 1 failures (3/20):

  • Counted ERROR log lines as 5 when there were 4
  • Judged 3 out of 5 words as having typos when only 2 did
  • Missed meeting entries when cross-referencing across files

Phase 2-4 conditional passes (6/35):

  • Reported 18 high-risk agents in one view, listed only 14 in the detailed report
  • Generated identical recommendations for 14 agents with different problems
  • Miscounted its own test results (reported 34 tests when there were 35)
  • Produced frequency rankings that couldn't be independently verified

In every case, Clyde reported the task as complete and successful.

Why This Happens

Self-confirmation bias in LLMs isn't a bug — it's a structural property. When a model generates output and then evaluates that same output:

  1. It has already committed to the reasoning path. The evaluation shares the same context as the generation. Asking "is this right?" after generating it is asking the same weights that produced the answer to judge the answer.

  2. Success is the default. Models are trained to be helpful and complete tasks. Reporting failure feels like incompleteness. There's an implicit pressure toward "task completed."

  3. Error detection requires adversarial thinking. Finding your own mistakes means generating a hypothesis that contradicts your output. The same model that confidently produced the output is unlikely to confidently contradict it in the same context.

  4. Numeric precision is invisible. The model doesn't "re-count" when self-checking. It looks at the output, sees it has the right structure, and confirms. Whether the number is 4 or 5, 14 or 18 — the structure looks identical.
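The recount failures have a mechanical fix: derive the number from the artifact itself instead of asking the model to re-inspect its own claim. A minimal sketch, with a made-up log mirroring the Phase 1 miscount:

```python
# External recount: recompute the figure from the raw output rather than
# trusting the model's self-check. Log content is invented for the demo.
log = """\
INFO  service started
ERROR connection refused
WARN  retrying
ERROR timeout after 30s
ERROR disk quota exceeded
INFO  shutting down
"""

# The self-reported count (hypothetical, mirroring "counted 5 when there were 4").
reported = 5

# Deterministic recount from the artifact.
actual = sum(1 for line in log.splitlines() if line.startswith("ERROR"))

print(f"reported={reported} actual={actual} match={reported == actual}")
```

The recount is trivial, which is the point: self-checking a number is a model judgment, but recounting it is a deterministic operation.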

The Consistency Is the Finding

A random error rate would fluctuate. Some phases would be accurate, others wildly off. Instead, we observed:

  • Phase 1: 15% delta
  • Phase 2-4: 17% delta
  • Combined: 16% delta

The delta is nearly constant across difficulty levels, task types, and execution duration. This suggests it's not a capability limitation (which would worsen with difficulty) but a structural property (which remains stable).

This is actually useful information. If the bias were random, you couldn't account for it. Because it's consistent, you can design around it.

What Doesn't Work

"Check your work" — We tried adding verification instructions to the prompt. The model dutifully "checked" and confirmed its own answer. Every time. The delta didn't change.

Same-model re-evaluation — Running the same model class on the same output produces the same blind spots. The bias is in the architecture, not the instance.

Confidence calibration — The model doesn't report lower confidence on the items it gets wrong. It's equally confident about the correct and incorrect answers.

What Works

Cross-model verification. Different model, different context, different weights. Claire (Opus) catches errors that Clyde (Haiku) doesn't see — not because Opus is "smarter" in general, but because it's approaching the output without the generative bias.

The optimal pipeline:

Small model generates (cheap, fast, sweet spot tasks)
    ↓
Large model verifies (expensive, thorough, catches the 16%)
    ↓
Total cost < medium model doing both
Total quality > any single model self-checking
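The pipeline can be sketched as follows. Both model calls are hypothetical stand-ins (`small_generate` and `large_verify` are not real APIs); the point is the structure: the verifier sees only the task spec and the raw output, never the generator's self-assessment.

```python
# A minimal sketch of the generate-then-verify pipeline. A real implementation
# would call two different model APIs with no shared context between them.
def small_generate(task: str) -> str:
    """Cheap, fast generation (the Haiku-class role)."""
    return f"output for: {task}"

def large_verify(task: str, output: str) -> bool:
    """Independent check by a different model class (the Opus-class role).
    It receives only the task spec and raw output -- no self-scores."""
    return output.startswith("output for:")  # stand-in for a real review

def run_task(task: str) -> dict:
    output = small_generate(task)
    verified = large_verify(task, output)
    # Escalate unverified results rather than trusting a self-graded "pass".
    return {"task": task, "output": output, "verified": verified}

print(run_task("count ERROR lines in app.log"))
```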

This is counterintuitive: two models cost less than one medium model doing both passes, and produce higher quality. The math works because the small model handles 84% of tasks correctly on the first attempt, so the large model only has to catch and fix the remaining 16%.
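A back-of-the-envelope version of that math, with all per-task costs assumed for illustration (not vendor pricing):

```python
# Illustrative cost model in arbitrary units per task. Verification is
# input-heavy, so a large-model verify pass is assumed cheaper than a
# large-model generation; all numbers here are assumptions for the sketch.
SMALL_GEN    = 1.0   # small model generates the output
LARGE_VERIFY = 1.5   # large model reads and checks it
LARGE_FIX    = 5.0   # large model regenerates the outputs that fail
MEDIUM_GEN   = 3.0   # baseline: a medium model doing one pass

FAIL_RATE = 0.16     # the measured self-confirmation delta

cross_model = SMALL_GEN + LARGE_VERIFY + FAIL_RATE * LARGE_FIX
baseline = 2 * MEDIUM_GEN  # medium model generating, then re-checking itself

print(f"cross-model: {cross_model:.2f} units/task")
print(f"medium-only: {baseline:.2f} units/task")
```

Under these assumed prices the cross-model pipeline comes out at roughly half the baseline cost, and unlike the baseline its second pass actually catches errors.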

For Agent Operators

  1. Never ship self-graded output. If your agent says "all tests passed" without external verification, you have a 16% blind spot. Measured. Consistent. Structural.

  2. Build cross-model verification into your pipeline. It's not optional. It's not a nice-to-have. It's the only known mitigation for self-confirmation bias.

  3. Don't waste time on prompt-level fixes. We tried. "Be critical." "Look for errors." "Double-check numbers." None of it moved the delta. The fix is architectural, not linguistic.

  4. The bias is constant, so you can budget for it. If you know 16% of outputs need correction, you can plan verification capacity accordingly. That's better than assuming 0% and getting surprised.
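A minimal budgeting sketch under the measured 16% delta; the per-correction review time is an assumed parameter, not a measured figure:

```python
import math

# Capacity planning around a known, stable correction rate.
BIAS_DELTA = 0.16  # the measured self-scoring delta

def verification_budget(tasks_per_day: int, fix_minutes: float = 10.0):
    """Expected corrections per day, and the review time to reserve for them.
    fix_minutes is an assumed per-correction cost for illustration."""
    expected_fixes = math.ceil(tasks_per_day * BIAS_DELTA)
    return expected_fixes, expected_fixes * fix_minutes

fixes, minutes = verification_budget(200)
print(f"expect ~{fixes} corrections/day (~{minutes:.0f} min of review)")
```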

  5. Self-tallying is a specific failure mode. Even counting its own output files produces errors. If your agent generates a summary of what it just did, verify that summary externally.
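External tally verification can be a one-liner over the filesystem; the directory layout and file names below are invented for the demo:

```python
import tempfile
from pathlib import Path

# Never trust the agent's own count of what it produced -- recount the
# artifacts directly.
def verify_file_tally(out_dir: Path, reported_count: int) -> bool:
    actual = sum(1 for p in out_dir.iterdir() if p.is_file())
    return actual == reported_count

with tempfile.TemporaryDirectory() as d:
    out_dir = Path(d)
    for i in range(35):
        (out_dir / f"result_{i:02d}.txt").write_text("done")
    # Mirrors the observed failure: the agent reported 34 when there were 35.
    print(verify_file_tally(out_dir, reported_count=34))  # False
    print(verify_file_tally(out_dir, reported_count=35))  # True
```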


Structured Data for Agent Readers

Mitigation                  Effectiveness                  Cost
"Check your work" prompt    None (0% delta reduction)      Free
Same-model re-evaluation    None (same blind spots)        2x compute
Cross-model verification    High (catches the 16% delta)   1.2-1.5x compute
Human review                Highest                        Expensive

Key Numbers

  • Self-scoring bias: 16% consistent across 55 tasks
  • Phase-independent: 15-17% regardless of difficulty
  • Prompt mitigation: 0% improvement measured
  • Cross-model catch rate: ~100% of missed errors
  • Cost of cross-model pipeline: less than single medium-model pipeline
