AI detector limitations and false positives
Detector outputs are probabilistic estimates, not proof. The limitations on this page matter most when the decision on the other side of the score carries real consequences for a real person.
Where false positives appear
A false positive means human-written text is flagged as AI-generated. It is not rare. It happens predictably in certain genres and contexts:
- Academic and institutional prose — constrained vocabulary, formal structure, and conservative phrasing overlap with exactly the features detectors are trained to catch. An abstract or policy memo can score high purely because the genre demands low variation.
- Second-language writing — writers working in a non-native language often choose safe, common phrasings. Those safe choices look like the same stock templates detectors flag in AI output.
- Heavily edited text — multiple rounds of editing tend to smooth out quirks, flatten rhythm, and compress stylistic variation. The result can read more like generated prose than the rough first draft did.
- Canonical and historical works — older texts, religious writing, and legal documents sometimes overlap with modern detector features by coincidence. The Declaration of Independence has been flagged by more than one tool.
- Short samples — below about 100 words, one or two triggered patterns can dominate the score. A single stock phrase carries far more weight in a 60-word paragraph than in a 500-word one.
If you want documented examples, the False Positive Hall of Fame collects real cases.
Where false negatives appear
A low score does not prove human authorship. It means the text did not trigger enough of the patterns the diagnostic tracks. Generated drafts that have been selectively edited, rewritten in sections, or produced by models fine-tuned to avoid common templates can pass through simple detectors with little resistance. Hybrid workflows — where a human outlines and an AI drafts, or vice versa — blur the boundary further.
This is one of the reasons WROITER frames its result as a diagnostic signal instead of a yes-or-no verdict. A clean score does not close the question. It means the surface-level evidence is not there.
Why sample size and genre matter
Genre changes what counts as suspicious. The same sentence patterns that look synthetic in a personal essay are normal in a legal brief. A few common examples:
- Product copy — short, formulaic by design. High false-positive risk on phrase templates and over-signposting.
- Academic abstracts — constrained vocabulary, passive voice, rigid structure. High false-positive risk on rhythm and vocabulary flags.
- Legal prose — repetitive by necessity, highly formal. Rhythm and vocabulary flags are unreliable here.
- Internal documentation — often written quickly, sometimes from templates. Pattern flags may be real but may also reflect house style rather than AI generation.
- Personal essays and creative writing — the genre where detectors are most reliable, because the expected variation is highest and AI's tendency to flatten it is most visible.
Sample length compounds the problem. The shorter the passage, the easier it is for one local quirk to dominate the output. The diagnostic requires 50 words minimum, but 150+ gives meaningfully more stable results.
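The dilution effect can be sketched numerically. The model below is a deliberately simplified toy, not WROITER's actual scoring method: it treats the score as the fraction of the sample covered by flagged phrases, and enforces the 50-word minimum mentioned above.

```python
def naive_flag_score(total_words: int, flagged_words: int) -> float:
    """Toy score: fraction of the sample covered by flagged phrases.

    Illustrative only; no real detector uses a formula this simple.
    """
    if total_words < 50:  # mirror the diagnostic's 50-word minimum
        raise ValueError("sample too short to score")
    return flagged_words / total_words

# The same 4-word stock phrase in samples of different lengths:
short = naive_flag_score(total_words=60, flagged_words=4)
long = naive_flag_score(total_words=500, flagged_words=4)

print(f"60-word sample:  {short:.3f}")   # the phrase dominates
print(f"500-word sample: {long:.3f}")    # the phrase is diluted
```

The short sample scores roughly eight times higher on identical evidence, which is why one local quirk can swing the result in a brief passage.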
What the score tells you — and what it does not
A high score means the sample overlaps strongly with AI-typical surface patterns. It tells you where to look. It does not tell you who wrote the text, why those patterns are there, or what you should do about it. A high score should never trigger automatic punishment, public accusation, or irreversible action on its own. For the full interpretation boundary and score ranges, see How It Works.
Safe review policy
If you are building a review process around detector output, these five rules reduce the chance of a bad decision:
- Triage, not judgment. Use the score to decide what deserves closer review — not to conclude the review.
- Read the flags. Inspect the pattern-level evidence before escalating. If you cannot find the flagged patterns in the text, the score is not actionable.
- Check context. Revision history, provenance, genre, and house style all affect what a score means. A 55 on a college essay and a 55 on a legal memo are different situations.
- Document false positives. When you find a clear false positive, record it. Over time, your internal calibration will be more useful than any default threshold.
- Conversation before accusation. Leave room for the writer to explain, provide drafts, or clarify context before you assign blame. The cost of a wrongful accusation is almost always higher than the cost of asking a question.
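The rules above can be sketched as a triage routine. Everything here is hypothetical: the score scale, field names, and thresholds are placeholders, not WROITER output. The shape is what matters: the function never returns a verdict or a sanction, and its most severe outcome is a closer review followed by a conversation.

```python
from dataclasses import dataclass

@dataclass
class DetectorResult:
    score: float                 # 0-100, hypothetical scale
    flagged_patterns: list[str]  # pattern-level evidence, e.g. stock phrases

def triage(result: DetectorResult, review_threshold: float = 60.0) -> str:
    """Decide whether a sample deserves closer human review.

    Deliberately never returns a punitive action: per rule 5,
    the ceiling is a conversation with the writer.
    """
    # Rule 2: a score with no inspectable evidence is not actionable.
    if not result.flagged_patterns:
        return "no action: score not supported by visible patterns"
    # Rule 1: the score selects what to review; it does not conclude the review.
    if result.score >= review_threshold:
        return "review: check flags in context, then talk to the writer"
    return "no action"

print(triage(DetectorResult(score=78.0, flagged_patterns=["delve into"])))
print(triage(DetectorResult(score=78.0, flagged_patterns=[])))
```

Note that a high score with an empty evidence list short-circuits to "no action": that ordering encodes rule 2 directly in the control flow.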
Why WROITER does not publish a single accuracy number
A single accuracy metric (e.g., "98% accurate") implies the tool works equally well across all genres, sample lengths, and writing styles. It does not. No detector does. Publishing a headline number would overstate confidence in exactly the situations where confidence is least warranted — short samples, formal prose, second-language writing. Instead, WROITER publishes the method, the flags, and this limitations page, and asks you to calibrate based on your own use case.
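A sketch of what the alternative looks like: accuracy broken out by genre and sample-length bucket, so a reader sees where a tool is weak instead of one averaged figure. All numbers below are invented placeholders, not measured WROITER results.

```python
# Invented placeholder figures -- not measured results.
stratified = {
    ("personal essay", "150+ words"): 0.93,
    ("personal essay", "50-149 words"): 0.81,
    ("academic abstract", "150+ words"): 0.74,
    ("legal prose", "150+ words"): 0.62,
}

# A single headline number hides the spread entirely:
headline = sum(stratified.values()) / len(stratified)
print(f"misleading average: {headline:.2f}")

# Stratified view, weakest buckets first:
for (genre, length), acc in sorted(stratified.items(), key=lambda kv: kv[1]):
    print(f"{genre:18} {length:13} {acc:.2f}")
```

In this made-up table, the headline average sits well above the legal-prose bucket, which is exactly the genre where a reviewer most needs to know the tool is unreliable.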
Evidence and cross-references
Study known failures in the False Positive Hall of Fame. For the broader case on detector reliability, read Do AI Detectors Work? For the reader-side version of pattern recognition, see How to Spot AI Writing.
Boundary
This project is anti-bypass. We publish transparent method guidance so writers can improve and reviewers can make better decisions — not so anyone can game a detector. If you are looking for evasion tactics, you are in the wrong place.