The executive answer
- Triage and prioritization: AI detectors are useful for deciding which text needs closer review.
- Standalone decisions: they are not reliable enough to serve as proof of authorship or misconduct.
What detectors are good at
Detection signal is strongest when the text is:
- Generated in one pass, without significant editing.
- Longer than 150 words (short samples are unstable).
- From a model the detector was trained on or exposed to.
- In higher-variation genres where pattern compression stands out.
Most real-world reviews do not meet these conditions.
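Of the conditions above, only the length rule can be checked mechanically before a text is scored. A minimal sketch, assuming whitespace-separated tokens are an acceptable stand-in for words; the function name and the `MIN_WORDS` constant are illustrative and do not come from any particular detector:

```python
MIN_WORDS = 150  # length floor taken from the list above


def sample_is_long_enough(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True when the text is longer than min_words.

    Words are counted by splitting on whitespace, which is an
    assumption; a real detector may tokenize differently.
    """
    return len(text.split()) > min_words


short_sample = "A two-sentence reply. It is far too short to score."
print(sample_is_long_enough(short_sample))  # False
```

A sample that fails this check is better treated as unscored than as a low score.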
Where detectors fail hardest
False positives are not edge cases. They are structural side effects of how detectors score style overlap.
- Academic writing: convention-driven and structurally constrained.
- Second-language writing: often uses safer lexical patterns.
- Heavily edited text: editing smooths variation into the flatness detectors associate with AI output.
- Institutional/legal prose: repetitive and formal by design.
In WROITER's public Ghostbuster run, 415 of 694 human student essays (59.8%) scored at or above the medium threshold. Source: Ghostbuster benchmark.
False negatives are also real: edited AI drafts and hybrid workflows can pass. A low score is not proof of human authorship.
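To see what a false positive rate of that size means for an individual flag, the arithmetic can be sketched as follows. Only the 415 and 694 figures come from the text above. The base rate (share of submissions that are actually AI-written) and the detector's true positive rate are hypothetical values chosen for illustration:

```python
def false_positive_rate(flagged_human: int, total_human: int) -> float:
    """Share of human-written texts that the detector flagged."""
    if total_human <= 0:
        raise ValueError("total_human must be positive")
    return flagged_human / total_human


def flag_precision(base_rate: float, true_positive_rate: float, fpr: float) -> float:
    """Probability that a flagged text is actually AI-written (Bayes' rule)."""
    flagged_ai = base_rate * true_positive_rate
    flagged_human = (1.0 - base_rate) * fpr
    total_flagged = flagged_ai + flagged_human
    if total_flagged == 0:
        raise ValueError("no text would be flagged with these inputs")
    return flagged_ai / total_flagged


# Figures quoted above: 415 of 694 human essays at or above the medium threshold.
fpr = false_positive_rate(415, 694)

# Hypothetical inputs, not measured values: 20% of submissions are
# AI-written and the detector flags 90% of them.
precision = flag_precision(base_rate=0.20, true_positive_rate=0.90, fpr=fpr)

print(f"false positive rate: {fpr:.1%}")
print(f"chance a flagged text is AI-written: {precision:.1%}")
```

Under these assumed inputs, roughly one flagged text in four would actually be AI-written. The exact figure moves with the base rate, which is usually unknown in practice.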
Why two detectors give different answers
Detectors differ in their models, features, thresholds, and update cycles, so the same text gets different bets. Disagreement usually means ambiguity, not certainty.
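One way to handle disagreement in practice is to record it as its own outcome instead of averaging it away. A minimal sketch, assuming each detector returns a score between 0 and 1 and has its own threshold; the detector names, scores, thresholds, and labels are all hypothetical:

```python
def combine_verdicts(scores: dict[str, float], thresholds: dict[str, float]) -> str:
    """Summarize several detector scores without treating any as proof.

    Each detector is compared against its own threshold, because
    scores from different tools are not on a shared scale.
    """
    if not scores:
        raise ValueError("at least one detector score is required")
    flags = [scores[name] >= thresholds[name] for name in scores]
    if all(flags):
        return "agree_flagged"   # still a signal for review, not a finding
    if not any(flags):
        return "agree_clear"     # a low score is not proof of human authorship
    return "ambiguous"           # detectors disagree


# Hypothetical detectors and values, for illustration only.
scores = {"detector_a": 0.82, "detector_b": 0.31}
thresholds = {"detector_a": 0.70, "detector_b": 0.50}
print(combine_verdicts(scores, thresholds))  # ambiguous
```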
Mechanics: How AI Detectors Work.
Policy you can actually run
The bottom line
AI detectors are useful when they start the process. They are dangerous when they end it.
If stakes are low, treat output as a quick signal. If stakes are high, treat output as one input among many.
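The low-stakes and high-stakes rule can be written down as a small routing function. This is a sketch of one possible encoding, not a prescribed workflow; the `Stakes` enum, the threshold value, and the action labels are hypothetical names introduced for illustration:

```python
from enum import Enum


class Stakes(Enum):
    LOW = "low"
    HIGH = "high"


def next_step(score: float, threshold: float, stakes: Stakes) -> str:
    """Map a detector score to a next action; never to a verdict."""
    flagged = score >= threshold
    if stakes is Stakes.LOW:
        # Low stakes: the score is a quick signal.
        return "closer_read" if flagged else "no_action"
    # High stakes: the score is one input among many and decides nothing alone.
    if flagged:
        return "human_review_with_other_evidence"
    return "no_conclusion_from_score"


print(next_step(0.85, threshold=0.70, stakes=Stakes.LOW))   # closer_read
print(next_step(0.85, threshold=0.70, stakes=Stakes.HIGH))  # human_review_with_other_evidence
```

None of the returned labels is an authorship finding: the function only starts a process.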
For the full boundary and failure map, read Limitations.