LLM-as-Judge Doesn't Work When the Judge Is the Defendant

There is a category of mistake that looks like quality assurance but is not. It happens when you ask an LLM to evaluate output it generated itself. The model returns a score, perhaps a confidence percentage, maybe a brief explanation of what it verified. The output looks like a review. It is not a review. It is the model confirming its own priors.

The Structural Problem

When a model generates text and then evaluates that same text, it is operating from the same internal representation. The reasoning patterns that produced the original output are the very ones used to assess it. This is not a bug in a specific model; it is a structural feature of how these systems work.

Confirmation bias in LLMs manifests differently than in humans, but the effect is similar: the model finds what it expects to find. If it generated a summary that misrepresented a source document, it will often evaluate that summary as accurate, because the misrepresentation is internally consistent with the model's understanding of the source.

I cross-referenced three sources on this and found consistent evidence: self-evaluation scores correlate strongly with the original output's confidence, not with ground-truth accuracy. The judge and the defendant share the same memory.

Cross-Validation Methodology

The fix is architectural. Use a different model for evaluation than the one used for generation.

The setup I use for document review:

  1. Generation model: Claude (primary drafting)
  2. Factual verification: Gemini (cross-reference claims against sources)
  3. Structural review: ChatGPT (argument coherence, logical gaps)
  4. Final audit: Claude again, but with the Gemini and ChatGPT outputs as context
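The four steps above can be sketched as a simple pipeline. This is a minimal illustration, not a production implementation: the call_claude, call_gemini, and call_chatgpt functions are hypothetical stand-ins for real provider SDK calls, stubbed here so the control flow runs end to end.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for real provider SDK calls. In a real pipeline,
# each would wrap the corresponding vendor client.
def call_claude(prompt: str) -> str:
    return f"claude-output({len(prompt)} chars in)"

def call_gemini(prompt: str) -> str:
    return f"gemini-review({len(prompt)} chars in)"

def call_chatgpt(prompt: str) -> str:
    return f"chatgpt-review({len(prompt)} chars in)"

@dataclass
class ReviewBundle:
    draft: str
    factual_review: str = ""
    structural_review: str = ""
    final_audit: str = ""

def cross_validate(task: str, sources: str) -> ReviewBundle:
    # Step 1: generation model drafts the document.
    draft = call_claude(f"Draft a document for: {task}\nSources:\n{sources}")
    bundle = ReviewBundle(draft=draft)
    # Step 2: a different model cross-references claims against sources.
    bundle.factual_review = call_gemini(
        f"Check every claim against the sources.\n{draft}\n{sources}")
    # Step 3: a third model reviews structure and logical coherence.
    bundle.structural_review = call_chatgpt(
        f"Assess argument coherence and logical gaps:\n{draft}")
    # Step 4: the generator audits its draft, with the independent
    # reviews as context rather than its own self-assessment.
    bundle.final_audit = call_claude(
        "Revise using these independent reviews:\n"
        f"{bundle.factual_review}\n{bundle.structural_review}\n{draft}")
    return bundle
```

The key design choice is step 4: the original model sees the other models' reviews, so its final pass is grounded in external signals rather than its own priors.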

This is not about which model is "better." It is about the fact that different models have different training data, different reasoning tendencies, and different blind spots. Disagreement between models is a signal. It means something in the output is ambiguous or wrong.

The Case Study

Last quarter, I was validating a technical document comparing embedding model performance across retrieval benchmarks. The document contained a table with latency figures.

Self-validation result (Claude reviewing its own output): The table was flagged as accurate. Confidence reported at 91%.

Cross-validation result (Gemini reviewing Claude's output): Gemini flagged two rows in the table where the latency figures appeared to be transposed between models. It noted the values contradicted the cited benchmark paper.

I pulled the original benchmark paper. Gemini was correct. Two rows had been transposed during drafting. Claude's self-review had not caught this because the table was internally consistent — the error was in the relationship to the external source, not within the document itself.

The confidence score of 91% was not lying. It was accurate within its own limited scope. The model had validated internal consistency, not external accuracy. That distinction matters.
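A toy example makes the distinction concrete. The latency figures below are invented for illustration: the drafted table is internally consistent (every value is plausible on its own), but two rows are transposed relative to the source, so only a check against the external source catches the error.

```python
# Hypothetical latency figures (ms) from a cited benchmark paper.
source = {"model_a": 12.4, "model_b": 31.7, "model_c": 8.9}

# The drafted table: rows for model_a and model_b were transposed.
drafted = {"model_a": 31.7, "model_b": 12.4, "model_c": 8.9}

def internal_check(table: dict) -> bool:
    # Self-review can only verify properties of the table itself,
    # e.g. that values are present and plausible.
    return all(v > 0 for v in table.values())

def external_check(table: dict, source: dict) -> list:
    # Cross-referencing against the cited source catches mismatches.
    return [k for k in table if table[k] != source.get(k)]

print(internal_check(drafted))            # passes: internally consistent
print(external_check(drafted, source))    # flags the transposed rows
```

The internal check passes while the external check flags exactly the two transposed rows, mirroring the 91% self-validation score that was accurate only within its own scope.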

Why This Pattern Persists

Teams adopt LLM-as-judge pipelines because they are fast and cheap. You do not need a second API call to a different provider. You do not need to manage multiple model contexts. The single-model review loop is operationally simpler.

The problem is that it measures the wrong thing. Internal consistency is not correctness. A document can be internally consistent and factually wrong. A codebase can pass its own generated tests and still have logic errors. A structured argument can be logically coherent and rest on a false premise.

The operational convenience of self-review is real. The quality assurance value is not.

What to Measure Instead

If you are building evaluation pipelines, the metrics that matter are:

  • External accuracy: Does the output accurately represent the source material?
  • Cross-model agreement: Do independent models reach the same conclusions about the output?
  • Disagreement rate: How often do models flag different issues? High disagreement means the output has genuine ambiguity.
  • False negative rate on known errors: If you inject known errors into test documents, how often does each model catch them?
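The last two metrics can be computed directly once each evaluator's flagged issues are collected as sets. A minimal sketch, with invented error IDs and model names; catch_rate is one minus the false negative rate on injected errors, and agreement is measured as Jaccard overlap (one of several reasonable choices).

```python
def catch_rate(flagged: set, injected: set) -> float:
    """Fraction of injected known errors the evaluator caught
    (i.e., 1 - false negative rate)."""
    if not injected:
        return 1.0
    return len(flagged & injected) / len(injected)

def agreement(flags_a: set, flags_b: set) -> float:
    """Jaccard overlap of issues flagged by two independent models."""
    union = flags_a | flags_b
    return len(flags_a & flags_b) / len(union) if union else 1.0

# Invented test fixture: four injected errors, two evaluators.
injected = {"e1", "e2", "e3", "e4"}
flagged = {
    "model_a": {"e1", "e2", "x9"},   # x9 is a false positive
    "model_b": {"e2", "e3"},
}

for name, flags in flagged.items():
    print(name, "catch rate:", catch_rate(flags, injected))
print("agreement:", agreement(flagged["model_a"], flagged["model_b"]))
```

A low pairwise agreement score is not necessarily bad: as noted above, high disagreement is itself a signal that the output contains genuine ambiguity worth a human look.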

Self-validation scores correlate poorly with all of these. Cross-model evaluation correlates significantly better.

The Implementation Threshold

You do not need to run every document through three models. The overhead is real and the cost adds up. The threshold question is: what is the cost of an error in this output?

For internal notes and drafts: single-model review is acceptable. The cost of an error is low.

For published documentation, client-facing reports, or anything that will be acted on: cross-model validation is not optional. The cost of an undetected error is higher than the cost of an additional API call.

The rule I apply: if a human would be embarrassed by the error, use cross-validation. If the output is disposable, self-review is sufficient.
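The threshold rule reduces to a small routing function. The document types and cost tiers below are assumptions for illustration; the one deliberate choice is defaulting unknown types to the expensive path.

```python
# Hypothetical cost tiers per document type; adjust to your own taxonomy.
COST_TIERS = {
    "internal_note": "low",
    "draft": "low",
    "published_doc": "high",
    "client_report": "high",
}

def review_strategy(doc_type: str) -> str:
    """Route to self-review or cross-model validation by error cost."""
    # Unknown document types default to "high": fail toward the
    # more expensive but safer review path.
    cost = COST_TIERS.get(doc_type, "high")
    return "self_review" if cost == "low" else "cross_validation"

print(review_strategy("draft"))          # disposable: self-review is enough
print(review_strategy("client_report"))  # acted on: cross-validate
```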

The model is not a reliable judge of its own work. Build your pipelines accordingly.
