Evidence / April 2026 / 4 min read

False Positive Hall of Fame

If you've ever seen "likely AI" on writing you know is human, this page is for you. These case files document false positives on human writing, from famous historical texts to modern student essays.

Why this page exists

A detector score can feel final, especially when it says 90%+ AI. But that number is not authorship proof. This page exists to make that visible in concrete cases.

The number to remember

415 of 694 human student essays (59.8%) were flagged at or above the medium threshold in our public Ghostbuster benchmark.

If false positives are that common in student writing, high-confidence mistakes are not edge cases. They are part of the landscape.
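The headline percentage is simple arithmetic on the two counts above. A minimal sketch for checking it yourself (the function name `flagged_rate` is our own illustration, not part of the benchmark):

```python
def flagged_rate(flagged: int, total: int) -> float:
    """Return the share of flagged essays as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * flagged / total


# Counts taken from the figure quoted above: 415 flagged out of 694 essays.
rate = flagged_rate(415, 694)
print(f"{rate:.1f}%")  # prints 59.8%
```

Every one of those 694 essays was written by a human, so every flag counted here is a false positive.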

External evidence: Turnitin's official guide says AI writing reports can misidentify text and should not be used as the sole basis for adverse action (Turnitin, updated March 6, 2026). A large cross-tool study in the International Journal for Educational Integrity found detectors were not reliable for misconduct decisions (Foltynek et al., 2024). PEN America documented high-profile false-positive examples involving historical texts.

The case files

US Declaration of Independence (1776)

Top observed score: High-90s
Evidence basis: Archived detector snapshots
Signal driver: Formal repetition

What happened: a foundational human text from 1776 still lands in AI-like score ranges in archived detector snapshots.

Why it matters: highly formal, repetitive prose can look "machine-like" to tools that rely on variation and predictability signals.
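To make "variation signal" concrete, here is a deliberately simplified sketch. It measures how much sentence length varies across a passage. This is an illustrative proxy of our own, not the method of any specific detector, and the sample texts are invented for the example.

```python
import re
import statistics


def sentence_length_variation(text: str) -> float:
    """Coefficient of variation of sentence lengths, counted in words.

    Illustrative proxy only: real detectors use their own, more complex signals.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0


# Invented sample texts: one formally repetitive, one uneven.
uniform = "We hold this truth. We state this claim. We make this point."
varied = "No. We waited for hours in the rain before anyone came to the door. Then silence."

print(sentence_length_variation(uniform))  # prints 0.0 (every sentence has four words)
print(sentence_length_variation(varied) > sentence_length_variation(uniform))  # prints True
```

Under a signal like this, a human author who writes in deliberately parallel sentences scores as "low variation", which is exactly the overlap this case file illustrates.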

US Constitution (1787)

Top observed score: High-90s
Evidence basis: Archived detector snapshots
Signal driver: Template legal prose

What happened: constitutional passages show very high AI-like scores in archived snapshots despite being canonical legal writing.

Why it matters: if your review process touches legal, compliance, or policy writing, detector scores need stronger context than the number alone.

Book of Genesis excerpts (pre-LLM era)

Top observed score: 100
Evidence basis: Archived detector snapshots
Signal driver: Anaphora patterns

What happened: repeated biblical phrasing can score in the same high-risk ranges detectors associate with generated text.

Why it matters: repetition is not automatically evidence of AI. In many genres, repetition is style, emphasis, or oral tradition.

Moby-Dick excerpts (1851)

Top observed score: Upper-80s
Evidence basis: Archived detector snapshots
Signal driver: Polished literary rhythm

What happened: excerpts from a canonical work of human literature score high enough to trigger AI suspicion in archived runs.

Why it matters: strong human writing can trigger the same surface patterns as generated text, especially in formal or rhythmically consistent passages.

Student essays (ongoing, worldwide)

Strong Evidence

Top observed score: 100
Counts: 415 of 694
Flagged rate (@ ≥ 8): 59.8%

What happened: in the public Ghostbuster benchmark, 415 of 694 human student essays scored at or above the medium threshold.

Why it matters: this is the highest-stakes environment. If policy decisions rely on detector output, student writing is where caution matters most.

Journalism and professional editing (ongoing)

Top observed score: High in case reports
Evidence basis: Industry case reports (2023–2026)
Signal driver: Heavy editorial polish

What happened: newsroom and publishing teams have repeatedly reported detector friction on human-written, professionally edited copy.

Why it matters: editing often smooths rhythm and structure, which can increase overlap with "AI-like" surface patterns even when humans wrote every line.

Evidence note: the 415/694 figure is reproducible from the linked public benchmark. Historical score bands shown on this page come from archived detector snapshots and illustrate pattern overlap risk rather than authorship proof.

What to do with this knowledge

  1. Treat detector output as triage, not verdict.
  2. Ask for pattern-level evidence before escalation. The WROITER Diagnostic shows what it flagged and why.
  3. Match confidence to stakes. For higher-stakes decisions, use the reliability brief, How It Works, and Limitations together.
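
The three steps above can be sketched as a small decision rule. Everything here is hypothetical: the `Review` fields, the `next_step` function, the action labels, and the default threshold of 8 (borrowed from the flagged-rate cutoff on this page) are assumptions for illustration, not the diagnostic's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    """Hypothetical record of one detector result under review."""
    score: float                      # detector score, assumed to be on a 0-100 scale
    flagged_patterns: list = field(default_factory=list)  # pattern-level evidence, if any
    high_stakes: bool = False         # e.g. a misconduct decision


def next_step(review: Review, threshold: float = 8.0) -> str:
    """Suggest a next step. No branch returns a verdict on authorship."""
    if review.score < threshold:
        return "no action"
    if not review.flagged_patterns:
        # Step 2: a bare score is not enough to escalate.
        return "request pattern-level evidence"
    if review.high_stakes:
        # Step 3: higher stakes call for more than detector output.
        return "human review with additional evidence"
    return "informal follow-up"


print(next_step(Review(score=95, high_stakes=True)))
# prints: request pattern-level evidence
```

Note that even a score of 95 on a high-stakes case does not produce a finding. The sketch only ever routes a case to further review, which is the point of treating output as triage.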

Run your own check

Use this page as a warning label, then run your own text through the diagnostic and read the flags before the score.