Three AIs Reviewed My Document. They Found Different Bugs.

The assumption most people carry into AI-assisted document review is that any sufficiently capable model will find the same problems. If you give Claude, ChatGPT, and Gemini the same document and the same prompt, the results should converge.

They do not converge. I ran the experiment. The results were instructive.

The Document

The test case was a 2,400-word technical guide on building retrieval-augmented generation pipelines. The document had been through one round of human editing and one round of Claude self-review before I ran the multi-model experiment.

I gave each model the same review prompt: "Review this document for accuracy, structure, clarity, and completeness. List all issues you find."

Each model worked independently; none saw the others' output during its review pass.

What Each Model Found

ChatGPT

ChatGPT's review was philosophy-first. It flagged three conceptual issues:

  • The document described chunking strategies without explaining the tradeoffs. A reader would know how to implement the described approach but not when to choose it over alternatives.
  • One section implied that vector similarity is equivalent to semantic relevance. ChatGPT noted this is a common conflation — vectors measure proximity in embedding space, which correlates with semantic relevance but is not identical to it.
  • The document's framing around "retrieval quality" was undefined. The term appeared six times without a working definition.

None of these were factual errors. They were conceptual gaps. The kind a technically competent reviewer notices when they are reading for understanding rather than just accuracy.

Gemini

Gemini's review was structural. It caught two issues the other models missed:

  • Section 3 referenced a concept ("hybrid search") that was not introduced until Section 5. A reader moving through the document in order would hit an undefined term.
  • The document's architecture diagram (described in text, not as an image) contradicted the implementation steps in Section 4. The described flow had the reranking step occurring before retrieval filtering; the implementation steps had them reversed.

The second issue was significant. It was not a typo — it was an inconsistency between two parts of the document that required reading both carefully to detect. The human editor had not caught it. Claude's self-review had not caught it.

Claude

Claude's review, run fresh on the document without memory of having drafted it, found a different category of problem:

  • Three section headers were not phrased as user-facing questions or benefits, which would affect how they are indexed for search.
  • The document lacked a clear introductory hook — it opened with background context rather than the core value proposition.
  • Two technical terms were used inconsistently: "index" appeared to refer to both the vector database and the document collection depending on the section.

The SEO observations were specific and actionable. The inconsistent terminology was a real problem — the kind that makes a document harder to implement against because the vocabulary shifts under the reader.

Why the Divergence Matters

Three models, same document, same prompt, almost no overlap in findings.

This is not a failure of any individual model. Each review was coherent and accurate within its own framing. ChatGPT was reading for conceptual integrity. Gemini was reading for structural consistency. Claude was reading for communication clarity and findability.

The divergence reflects genuine differences in what each model weights during review. Those differences are a feature, not a bug. A document that passes all three review passes is a document that has been stress-tested from multiple angles.

A document that only passes one is a document that has only been reviewed from one angle.

The Methodology

Running multi-model review does not require a complex pipeline. The setup I use:

  1. Complete a draft. Run one round of self-review with the drafting model to catch surface errors.
  2. Send the draft to a second model with a structured review prompt. Specify the review dimensions explicitly: accuracy, structure, clarity, completeness.
  3. Send the same draft to a third model independently.
  4. Compile the findings. Treat any issue flagged by two or more models as high priority. Treat any issue flagged by only one model as worth investigating — it may be a false positive, or it may be a blind spot in the other models.
  5. Revise and run a final pass with any single model to verify the fixes were implemented correctly.
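The fan-out and triage in steps 2–4 can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the `review` function is a placeholder to be wired up to each provider's actual API client, and treating two issues as "the same" only when their text is identical is a simplification — in practice you would match overlapping findings by hand or with one more model pass.

```python
# Sketch of the multi-model review fan-out and triage described above.
# `review` is a placeholder; swap in real API clients for actual use.
from collections import Counter

REVIEW_PROMPT = (
    "Review this document for accuracy, structure, clarity, and "
    "completeness. List all issues you find."
)


def review(model_name: str, document: str) -> list[str]:
    """Placeholder: send REVIEW_PROMPT plus the document to the named
    model's API and parse the response into a list of issue strings."""
    raise NotImplementedError("wire up the real API client here")


def triage(findings: dict[str, list[str]]) -> tuple[list[str], list[str]]:
    """Split issues into high priority (flagged by two or more models)
    and worth-investigating (flagged by exactly one model)."""
    # Deduplicate within each model so one model repeating itself
    # does not count as cross-model agreement.
    counts = Counter(
        issue for issues in findings.values() for issue in set(issues)
    )
    high = [issue for issue, n in counts.items() if n >= 2]
    single = [issue for issue, n in counts.items() if n == 1]
    return high, single
```

A compiled run then looks like `triage({"chatgpt": [...], "gemini": [...], "claude": [...]})`, with the first returned list feeding the revision pass and the second flagged for manual judgment.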

The total additional time for steps 2-3 is roughly ten minutes per document at this length. The cost in API calls is low. The coverage improvement is material.

The Threshold Question

Not every document warrants three-model review. The cost is real, even if it is small.

The filter I apply: will this document be acted on by someone who does not have full context? If yes — if a reader will implement something based on what the document says — multi-model review is appropriate. The cost of a structural inconsistency or a conceptual gap reaching implementation is higher than the cost of an extra review pass.

For internal notes, drafts, and working documents: single-model review is sufficient.

For guides, documentation, reports, and anything published externally: the three-model pass is not overhead. It is the review.

The models found different bugs because they are different systems with different orientations. That diversity is the point. Use it.
