What makes AI writing detectable
Every AI model writes with a tell. Not one tell — dozens. This report maps vocabulary, structure, and rhetorical habits across eleven models. It’s the foundation everything else stands on.
How big is the gap between AI and human text
Everyone says AI writing “sounds different.” We measured it. Word-level frequency ratios, sentence-length distributions, lexical diversity, and rhetorical pattern rates — across almost three million words of paired human and AI text.
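The metric families named above can be sketched in a few lines. This is a minimal illustration, not the report's actual measurement code; the function names and the add-one smoothing choice are assumptions.

```python
import re
from collections import Counter

def stylometric_profile(text):
    """Compute simple versions of the metrics above for one text:
    lexical diversity, sentence-length distribution, word frequencies."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # lexical diversity as type-token ratio: unique words / total words
        "type_token_ratio": len(set(words)) / len(words),
        # sentence-length distribution, summarized here as the mean
        "mean_sentence_len": len(words) / len(sentences),
        "word_freq": Counter(words),
    }

def frequency_ratio(word, ai_profile, human_profile):
    """Word-level frequency ratio: how much more often the AI corpus
    uses a word, with add-one smoothing to avoid dividing by zero."""
    ai_rate = (ai_profile["word_freq"][word] + 1) / sum(ai_profile["word_freq"].values())
    human_rate = (human_profile["word_freq"][word] + 1) / sum(human_profile["word_freq"].values())
    return ai_rate / human_rate
```

A ratio well above 1.0 marks a word the AI corpus overuses relative to the paired human text.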
Turning observations into detection rules
Fifty named, implementable detection patterns organized into nine categories. Each with a surface description, a mechanical cause, and a programmatic detectability sketch. No neural network required.
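An entry in that shape might look like the sketch below. The pattern name, regex, and threshold are invented for illustration; they are not one of the fifty published rules.

```python
import re

# One illustrative rule in the library's shape: a name, a category,
# and a programmatic detectability sketch. All values are hypothetical.
RULES = [
    {
        "name": "not-x-but-y contrast",
        "category": "rhetorical habit",
        "regex": re.compile(r"\bnot (?:just |only )?\w+[,;]? but\b", re.I),
        "per_1000_words_threshold": 1.0,  # flag above this rate
    },
]

def score(text):
    """Return rules whose per-1000-word hit rate exceeds their threshold.
    Pure regex counting: no neural network required, as the section says."""
    n_words = max(len(text.split()), 1)
    hits = []
    for rule in RULES:
        rate = len(rule["regex"].findall(text)) * 1000 / n_words
        if rate > rule["per_1000_words_threshold"]:
            hits.append((rule["name"], round(rate, 1)))
    return hits
```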
What existing detection resources miss
Before building a pattern library, we looked at every public resource that catalogs AI writing patterns. Wikipedia, GPTZero, Pangram, academic papers, GitHub, Reddit. Here’s what they cover, where they stop, and what nobody has built yet.
Testing against an outside dataset
Most AI detectors publish accuracy numbers from their own test sets. We grabbed someone else’s AI writing library and scored every sample with the production detector. Category-level results, miss analysis, and a tuning decision validated against a second corpus.
The hardest corpus we could find
4,858 student essays — 694 human, 4,164 AI — scored with the production detector. The dataset that stopped us from cheating on easier benchmarks and forced us to publish the hardest number in the project: a 59.8% human false-positive rate on student writing.
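The headline number is plain confusion-matrix arithmetic. A minimal sketch, assuming the detector returns one boolean verdict per essay:

```python
def false_positive_rate(flags, labels):
    """FPR = human essays flagged as AI / total human essays.
    flags: detector verdicts (True = flagged as AI); labels: ground truth."""
    human_flags = [f for f, l in zip(flags, labels) if l == "human"]
    return sum(human_flags) / len(human_flags)

# At the corpus sizes above, a 59.8% FPR on 694 human essays
# means roughly 0.598 * 694, i.e. about 415 humans flagged as AI.
```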
Does the revision prompt actually work
A frozen 32-text Polygraf slice that measures the prompt, not just the detector. Fresh AI drafts go in, WROITER revision prompts come out, and the revised drafts are frozen and rescored. It shows both the win and the limitation: when WROITER speaks, the edits work; but many AI drafts score clean and never trigger a prompt at all.
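The freeze-and-rescore loop described here can be sketched as a plain pipeline. The callables `detect`, `make_prompt`, and `revise` are hypothetical stand-ins for the production detector, the WROITER prompt generator, and the revising model; none of these names come from the report.

```python
def revision_eval(drafts, detect, make_prompt, revise):
    """For each fresh AI draft: score it, generate a revision prompt
    (which may be None when the draft already scores clean), apply the
    revision if a prompt fired, then rescore the frozen result."""
    results = []
    for draft in drafts:
        before = detect(draft)
        prompt = make_prompt(before)          # None = detector saw nothing
        revised = revise(draft, prompt) if prompt else draft
        results.append((before, detect(revised), prompt is not None))
    return results
```

Comparing the before/after scores only over rows where a prompt fired separates the prompt's effectiveness from the detector's coverage, which is exactly the distinction this page draws.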
How research becomes a detector
Research identifies the patterns. The pattern library catalogs and explains them. The calibration log documents how each was tested against real human writing. The method ties it together. And the evidence pages show you where it fails.
If you came here to check whether the diagnostic can be trusted, that last link is the one that matters most.