Why this dataset exists here
After running WROITER against The Good AI library, the results looked strong: 89.3% capture across 168 AI-written examples. But 14 of the 18 misses sat at score 7 — one point below the warning threshold — with only a single flag: META_OUTRO, the formulaic conclusion pattern.
The temptation was obvious. Promote isolated META_OUTRO over the threshold, and most of those misses disappear. The benchmark number jumps to the mid-90s. The chart looks better. The press release writes itself.
The problem is that formulaic conclusions aren’t only an AI habit. Student essays end the same way — because that’s what students are taught to do. To find out whether the shortcut was safe, we needed a dataset that was full of student essays, both human and AI, written to the same prompts. The Ghostbuster dataset is exactly that.
The numbers below come from the polsci/ghostbuster-essay-cleaned version of the dataset.
What’s in the corpus
The dataset has seven source labels: one human and six AI. All seven share the same 694 prompts, so every human essay has a direct AI counterpart written to the same assignment.
| Source | Count | Avg score | ≥ 8 rate | ≥ 30 rate |
|---|---|---|---|---|
| Human | 694 | 32.2 | 59.8% | 33.1% |
| Claude | 694 | 49.9 | 89.6% | 62.0% |
| GPT (default) | 694 | 75.0 | 88.0% | 86.2% |
| GPT (prompt 1) | 694 | 77.2 | 93.2% | 89.3% |
| GPT (prompt 2) | 694 | 86.1 | 96.7% | 93.4% |
| GPT (semantic) | 694 | 68.6 | 83.3% | 79.5% |
| GPT (writing style) | 694 | 92.0 | 99.3% | 97.4% |
Two things stand out immediately. First, the AI models score differently depending on how they were prompted — GPT’s “writing style” prompt produces text that is almost universally detected (99.3%), while its “semantic” prompt produces the cleanest output (83.3%). Second, the human column is already uncomfortable: 59.8% of real student essays score 8 or above.
That second number is the reason this page exists.
The headline numbers
At the higher threshold of 30, the picture shifts: the human false-positive rate falls from 59.8% to 33.1%, but AI recall falls with it. Claude drops to 62.0%, and even the most detectable prompt styles lose ground.
This is the core tension. At the lower threshold, the detector catches more AI but also flags more humans. At the higher threshold, it’s more trustworthy per flag but misses more AI. There is no threshold where both numbers are comfortable at the same time.
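The trade-off in the table can be made concrete with a small sketch. This is not the detector's code; it assumes per-essay scores are available as plain lists, and the scores below are illustrative, not the real corpus.

```python
def rates(ai_scores, human_scores, threshold):
    """Return (AI recall, human false-positive rate) at a given threshold."""
    recall = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    fp_rate = sum(s >= threshold for s in human_scores) / len(human_scores)
    return recall, fp_rate

# Hypothetical scores: raising the threshold lowers both numbers at once.
ai_scores = [49, 75, 86, 7, 92, 31]
human_scores = [32, 8, 5, 100, 12, 3]

for t in (8, 30):
    r, fp = rates(ai_scores, human_scores, t)
    print(f"threshold {t}: recall {r:.1%}, human FP {fp:.1%}")
```

Whatever threshold you pick, the two rates move together, which is why no single cutoff makes both columns of the table comfortable.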
The seam we found
When we ran the Good AI benchmark, 14 of 18 misses sat at score 7 with one flag: META_OUTRO. A formulaic conclusion, and nothing else. The obvious fix: treat an isolated META_OUTRO as enough to cross the threshold.
Before shipping that change, we tested it on Ghostbuster. The results:
| Scenario | AI recall | Human FP rate | Change |
|---|---|---|---|
| What we shipped | 91.7% | 59.8% | — |
| Promote isolated META_OUTRO | 97.5% | 64.4% | +5.8 pts recall, +4.6 pts FP |
The numbers at the seam: 120 AI essays and 32 human essays had isolated META_OUTRO at score 7. Promoting them would catch all 120 AI essays. It would also wrongly flag all 32 human essays.
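The rejected rule is simple enough to state as code. A minimal sketch, assuming each essay is represented as a dict carrying its score and the set of flags that fired (field names here are illustrative, not the real schema):

```python
def would_flag(essay, promote_isolated_meta_outro=False):
    """Decide whether an essay crosses the warning threshold of 8."""
    if essay["score"] >= 8:
        return True
    # The rejected shortcut: let a lone META_OUTRO at score 7 cross the line.
    if (promote_isolated_meta_outro
            and essay["score"] == 7
            and essay["flags"] == {"META_OUTRO"}):
        return True
    return False

seam_essay = {"score": 7, "flags": {"META_OUTRO"}}
print(would_flag(seam_essay))        # shipped behavior: False
print(would_flag(seam_essay, True))  # with the promotion: True
```

The rule cannot distinguish the 120 AI essays at the seam from the 32 human ones, because by construction they present identical evidence.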
In a domain where the human false-positive rate was already 59.8%, adding another 4.6 points was not a trade-off. It was a betrayal of the people most likely to be harmed: students whose writing happens to follow the same formula the AI was trained on.
The decision we made
We rejected the shortcut.
Instead of promoting an isolated weak signal, we shipped three narrow detectors that fire on specific AI templates without catching the human writing that shares a vague family resemblance.
The three changes, each validated against both the Good AI library and the Ghostbuster corpus before shipping:
| Detector | What it catches | Ghostbuster human hits |
|---|---|---|
| ESSAY_THESIS_ANNOUNCEMENT | The rigid “Introduction: This essay will explore…” template | 11 of 694 |
| SEGMENTED_EXPOSITORY_BLOCKS | Neat stacks of paragraphs at near-identical word counts | 0 of 694 |
| SECTION_LABEL_SCAFFOLDING | Assignment-style headings: Introduction:, Body:, Conclusion: | 1 of 694 |
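To make the idea of a narrow detector concrete, here are illustrative sketches of two of the three. The real implementations and their exact rules are not shown in this report, so treat these as minimal approximations of the described behavior, with thresholds chosen for the example.

```python
import re
import statistics

SECTION_LABELS = re.compile(r"^(Introduction|Body|Conclusion)\s*:", re.MULTILINE)

def section_label_scaffolding(text):
    """Fires when assignment-style headings label the essay's sections."""
    return len(SECTION_LABELS.findall(text)) >= 2

def segmented_expository_blocks(text, max_spread=0.1, min_paragraphs=4):
    """Fires when several paragraphs stack at near-identical word counts."""
    counts = [len(p.split()) for p in text.split("\n\n") if p.strip()]
    if len(counts) < min_paragraphs:
        return False
    # Low relative spread in paragraph length = suspiciously uniform blocks.
    return statistics.pstdev(counts) / statistics.mean(counts) < max_spread
```

The point of writing them this narrowly is visible in the sketches themselves: a human essay has to match a rigid surface template to trip either one, which is why the human hit counts in the table stay near zero.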
Together, these moved the numbers without compromising honesty:
| Metric | Before (v1.3.0) | After (v1.3.3) |
|---|---|---|
| Good AI capture | 83.9% | 89.3% |
| Ghostbuster AI recall | 83.7% | 91.7% |
| Ghostbuster human FP rate | 59.5% | 59.8% |
| Internal holdout precision | 100.0% | 100.0% |
| Internal holdout human FP | 0.0% | 0.0% |
AI recall up 8 points. Human false-positive rate moved 0.3 points. That is the difference between tuning on one benchmark and cross-validating against a second.
What fires and what doesn’t
Detector frequency across the full Ghostbuster corpus: for each pattern, the bar shows the AI hit rate, and the red segment shows how often the same pattern fires on human essays.
The pattern is unmistakable. Some detectors — ESSAY_THESIS_ANNOUNCEMENT, SECTION_LABEL_SCAFFOLDING, SEGMENTED_EXPOSITORY_BLOCKS, BANNED_WORDS — are almost pure AI signals. Others — TRANSITION_OVERUSE, OVER_SIGNPOST, RULE_OF_THREE — fire on human essays nearly as often. The first group is what makes detection possible. The second group is what makes it dangerous to rely on any single flag.
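The split between the two groups falls straight out of a per-detector tabulation. A sketch of that computation, assuming each essay record carries its source label and the set of detectors that fired (the record structure is illustrative):

```python
def fire_rates(essays):
    """Return {detector: (ai_rate, human_rate)} across a corpus."""
    ai = [e for e in essays if e["source"] != "human"]
    human = [e for e in essays if e["source"] == "human"]
    detectors = {d for e in essays for d in e["flags"]}
    out = {}
    for d in detectors:
        ai_rate = sum(d in e["flags"] for e in ai) / len(ai)
        human_rate = sum(d in e["flags"] for e in human) / len(human)
        out[d] = (ai_rate, human_rate)
    return out
```

A detector with a high AI rate and a near-zero human rate belongs in the first group; one where the two rates track each other belongs in the second, and should never carry a verdict on its own.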
The human false-positive problem
This is the hardest number in the report: 59.8% of human student essays score 8 or above. More than half of real student writing triggers the medium warning.
That number is not a bug in the detector. It is the fundamental reason this problem is hard.
The 20 highest-scoring human essays all scored 100 — indistinguishable from AI by structural measures alone. They used transitions, signposting, opener repetition, and formulaic conclusions. Not because they copied AI, but because they followed the same academic writing template that AI learned to reproduce.
This is why WROITER’s score is explicitly framed as a measure of pattern overlap, not proof of authorship. A high score on a student essay does not mean the student cheated. It means the essay follows the same structural patterns that AI follows — and in the student-essay domain, that happens to be true of most competent academic writing.
For the full discussion of what this means for real-world use, see Limitations.
The honest boundary
Ghostbuster taught us three things we cannot unlearn:
1. You cannot benchmark a detector on AI writing alone. The Good AI library measures recall. Ghostbuster measures the cost of that recall. Without both numbers, any accuracy claim is incomplete.
2. The student-essay domain is where detectors are most dangerous. It is simultaneously the domain with the highest demand for detection and the domain where detection is least reliable. Anyone building a review process for student work needs to understand that tension before they start.
3. A shortcut that looks good on one benchmark can look reckless on another. The META_OUTRO promotion would have raised our headline number by 5.8 points. It would also have wrongly flagged 32 more student essays. We chose not to do it, and we would choose the same way again.
Publishing this number — 59.8% — is not comfortable. But the alternative is letting someone discover it after they’ve already built a policy on top of a score they assumed was reliable.