Published 2026-04-03 | Updated 2026-04-04

False Positive Hall of Fame

Every example on this page is human-written text that at least one popular AI detector flagged as likely AI-generated. These are not obscure edge cases. They are famous, canonical, or professionally published works. They exist to answer a simple question: how much should you trust a detector score?

The inductees

US Declaration of Independence (1776)

Reported score: ~97.9% AI across multiple public detectors.

Why it triggers: the Declaration is formal, rhythmically consistent, and structurally repetitive — "He has..." appears 18 times in a row. Those are exactly the features detectors use to identify AI output: low variation, predictable structure, constrained vocabulary. The text was written by committee and edited for consistency, which is the same process that makes any text look more machine-like to a statistical model.
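To make that concrete, here is a toy Python sketch of those three surface features. It is illustrative only: real detectors typically score text with language-model statistics such as perplexity, and the surface_profile function, its three measures, and the sample text below are invented for this example.

    # Toy proxies only: real detectors typically score model perplexity,
    # not these exact statistics.
    import re
    from statistics import pstdev

    def surface_profile(text: str) -> dict:
        """Crude stand-ins for 'variation', 'structure', and 'vocabulary'."""
        sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
        lengths = [len(s.split()) for s in sentences]
        words = re.findall(r"[a-z']+", text.lower())
        openers = [" ".join(s.split()[:2]).lower() for s in sentences]
        return {
            # Narrow spread of sentence lengths = "low variation".
            "length_stdev": pstdev(lengths) if len(lengths) > 1 else 0.0,
            # Share of sentences reusing the most common two-word opener
            # = "predictable structure" (the Declaration's "He has ...").
            "top_opener_share": max(openers.count(o) for o in set(openers)) / len(openers),
            # Distinct words / total words; low = "constrained vocabulary".
            "type_token_ratio": len(set(words)) / len(words),
        }

    grievances = ("He has refused his assent to laws. He has forbidden his "
                  "governors to pass laws of immediate importance. He has "
                  "dissolved representative houses repeatedly. He has refused "
                  "to cause others to be elected.")
    print(surface_profile(grievances))  # top_opener_share comes out at 1.0

Run on the grievances, every sentence shares the same opener, so the "structure" proxy maxes out. Nothing in that calculation can tell a committee of revolutionaries from a language model.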

US Constitution (1787)

Reported score: ~96.2% AI.

Why it triggers: legal prose is inherently formulaic. The Constitution uses parallel construction, repeated structural templates ("No person shall..."), and a deliberately constrained vocabulary designed for precision, not stylistic variation. It reads like AI output because AI was trained on text that reads like this.

Book of Genesis excerpts (pre-LLM era)

Reported score: high AI across multiple public detectors.

Why it triggers: "And God said... And God saw... And God called..." — the passage is built on anaphora (deliberate repetition of a phrase at the start of successive clauses). That structural regularity is indistinguishable, at the surface level, from the kind of repetitive template generation that detectors are trained to flag. The irony is considerable.

Moby-Dick excerpts (1851)

Reported score: high AI in practitioner testing across multiple tools.

Why it triggers: Melville's prose runs to long sentences, consistent rhythm, and dense vocabulary, a combination that overlaps with the "smooth, formal, evenly weighted" profile detectors associate with generated text. Certain passages, especially the encyclopedic chapters on whale anatomy, flag heavily because they combine technical vocabulary with structural regularity.

Student essays (ongoing, worldwide)

Reported scores: widely variable; false-positive reports documented by journalists, educators, and student advocacy groups since 2023.

Why they trigger: students writing in a second language, students following strict essay templates, and students who simply write carefully and formally are all at elevated risk. The features that make writing "good" in an academic context — clear structure, consistent tone, no digressions — are the same features detectors associate with AI. Multiple documented cases involve students accused of cheating on the basis of a detector score, with no corroborating evidence, and later cleared.

Journalism and professional editing (ongoing)

Reported scores: variable, documented in newsroom and publishing discussions.

Why it triggers: professionally edited text has had its quirks smoothed out. The editing process of cutting redundancy, tightening rhythm, and standardizing style produces exactly the low-variation, polished surface that detectors interpret as synthetic. A raw first draft might score 25% AI. The published version, after two rounds of editing, might score 55%. The editing made it "more AI" by every metric a detector uses.
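You can watch this effect with even the crudest rhythm measure. In the toy sketch below (illustrative only; the draft, the edited version, and the length_spread function are all invented for this example), editing carries the same content but evens out sentence lengths, and the variation statistic drops accordingly.

    import re
    from statistics import pstdev

    def length_spread(text: str) -> float:
        """Std. dev. of sentence lengths: a crude 'human rhythm' signal."""
        lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
        return pstdev(lengths) if len(lengths) > 1 else 0.0

    draft = ("Okay so the meeting ran long. Way long. Nobody had an agenda, "
             "which, honestly, tracks. We eventually decided to ship Friday. "
             "Probably.")
    edited = ("The meeting ran long because no agenda had been set. The team "
              "reviewed the remaining options in some detail. The team agreed "
              "to ship the release on Friday.")
    print(round(length_spread(draft), 2))   # 2.42: uneven, bursty, "human"
    print(round(length_spread(edited), 2))  # 0.47: smoothed out by editing

On this toy measure the edited version looks several times "more machine-like" than the draft, even though both were written by the same person. That is the false-positive mechanism in miniature.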

What this tells you

These cases are not proof that detectors are useless. They are proof that detectors have a structural blind spot: they cannot distinguish between "low variation because a machine generated it" and "low variation because the genre, the editing process, or the writer's style demands it." That distinction requires context a detector does not have.

The right response to a false positive is not to throw out detector tools. It is to stop treating detector output as proof and start treating it as what it actually is: a triage signal that needs human judgment, contextual evidence, and a review process that leaves room for error.

If you are building a review policy

  1. Never use a detector score as the sole basis for an accusation or a consequence.
  2. Require pattern-level evidence — not just a number — before escalating. The WROITER Diagnostic shows what it flagged and why.
  3. Account for genre, language background, and editing history. See Limitations for which contexts carry the highest false-positive risk.
  4. Document every false positive you encounter. Your calibration data is more valuable than any published accuracy claim.
  5. Read Do AI Detectors Work? for the broader reliability case before you finalize your process.

Run your own check

Paste any text — yours, your students', your team's — into the diagnostic. Read the flags before you read the score. That is the difference between using a tool and being used by one.