The Ghostbuster Stress Test — WROITER Research

Why this dataset exists here

After running WROITER against The Good AI library, the results looked strong: 89.3% capture across 168 AI-written examples. But 14 of the 18 misses sat at score 7 — one point below the warning threshold — with only a single flag: META_OUTRO, the formulaic conclusion pattern.

The temptation was obvious. Promote isolated META_OUTRO over the threshold, and most of those misses disappear. The benchmark number jumps to the mid-90s. The chart looks better. The press release writes itself.

The problem is that formulaic conclusions aren’t only an AI habit. Student essays end the same way — because that’s what students are taught to do. To find out whether the shortcut was safe, we needed a dataset that was full of student essays, both human and AI, written to the same prompts. The Ghostbuster dataset is exactly that.

What is Ghostbuster? A public academic dataset published by Verma et al. (2024), hosted on HuggingFace. It contains student-essay-length texts across 694 shared prompts. Each prompt was answered by a human student and by multiple AI models (Claude, GPT with various prompt styles), producing 4,858 texts total. We use the polsci/ghostbuster-essay-cleaned version.

What’s in the corpus

The dataset has seven source labels: one human and six AI. Each shares the same 694 prompts, so every human essay has a direct AI counterpart written to the same assignment.

Source	Count	Avg score	≥ 8 rate	≥ 30 rate
Human	694	32.2	59.8%	33.1%
Claude	694	49.9	89.6%	62.0%
GPT (default)	694	75.0	88.0%	86.2%
GPT (prompt 1)	694	77.2	93.2%	89.3%
GPT (prompt 2)	694	86.1	96.7%	93.4%
GPT (semantic)	694	68.6	83.3%	79.5%
GPT (writing style)	694	92.0	99.3%	97.4%

Two things stand out immediately. First, the AI models score differently depending on how they were prompted — GPT’s “writing style” prompt produces text that is almost universally detected (99.3%), while its “semantic” prompt produces the cleanest output (83.3%). Second, the human column is already uncomfortable: 59.8% of real student essays score 8 or above.

That second number is the reason this page exists.

The headline numbers

AI recall @ threshold 8
91.7%
3,818 of 4,164 AI essays caught.

Human false-positive rate

59.8%

415 of 694 human essays score 8 or above.

Precision @ threshold 8

90.2%

When the detector says “AI,” it’s right 90% of the time.

At the higher threshold of 30, the picture shifts:

AI recall @ threshold 30

84.6%

Fewer AI essays caught, but much cleaner signal.

Human false-positive rate

33.1%

Still a third of human essays, but half the rate.

Precision @ threshold 30

93.9%

Higher confidence when the score is high.

This is the core tension. At the lower threshold, the detector catches more AI but also flags more humans. At the higher threshold, it’s more trustworthy per flag but misses more AI. There is no threshold where both numbers are comfortable at the same time.

The seam we found

When we ran the Good AI benchmark, 14 of 18 misses sat at score 7 with one flag: META_OUTRO. A formulaic conclusion, and nothing else. The obvious fix: treat an isolated META_OUTRO as enough to cross the threshold.

Before shipping that change, we tested it on Ghostbuster. The results:

Scenario	AI recall	Human FP rate	Change
What we shipped	91.7%	59.8%	—
Promote isolated META_OUTRO	97.5%	64.4%	+5.8% recall, +4.6% FP

The numbers at the seam: 120 AI essays and 32 human essays had isolated META_OUTRO at score 7. Promoting them would catch all 120 AI essays. It would also wrongly flag all 32 human essays.

In a domain where the human false-positive rate was already 59.8%, adding another 4.6 points was not a trade-off. It was a betrayal of the people most likely to be harmed: students whose writing happens to follow the same formula the AI was trained on.

The decision we made

We rejected the shortcut.

Instead of promoting an isolated weak signal, we shipped three narrow detectors that fire on specific AI templates without catching the human writing that shares a vague family resemblance.

The three changes, each validated against both the Good AI library and the Ghostbuster corpus before shipping:

Detector	What it catches	Ghostbuster human hits
`ESSAY_THESIS_ANNOUNCEMENT`	The rigid “Introduction: This essay will explore…” template	11 of 694
`SEGMENTED_EXPOSITORY_BLOCKS`	Neat stacks of paragraphs at near-identical word counts	0 of 694
`SECTION_LABEL_SCAFFOLDING`	Assignment-style headings: `Introduction:`, `Body:`, `Conclusion:`	1 of 694

Together, these moved the numbers without compromising honesty:

Metric	Before (v1.3.0)	After (v1.3.3)
Good AI capture	83.9%	89.3%
Ghostbuster AI recall	83.7%	91.7%
Ghostbuster human FP rate	59.5%	59.8%
Internal holdout precision	100.0%	100.0%
Internal holdout human FP	0.0%	0.0%

AI recall up 8 points. Human false-positive rate moved 0.3 points. That is the difference between tuning on one benchmark and cross-validating against a second.

What fires and what doesn’t

The detector frequency table across the full Ghostbuster corpus. For each pattern, the bar shows AI hit rate; the red segment shows how often the same pattern fires on human essays.

Formulaic conclusions META_OUTRO 2,272 AI · 106 human
Lists of exactly three things RULE_OF_THREE 1,699 AI · 124 human
Thesis announcement template ESSAY_THESIS_ANNOUNCEMENT 1,615 AI · 11 human
Assignment-style headings SECTION_LABEL_SCAFFOLDING 1,607 AI · 1 human
Stock AI phrases BANNED_PHRASES 1,335 AI · 91 human
Outline-like list structure OUTLINE_LIST_FORMAT 1,081 AI · 33 human
Excessive signposting OVER_SIGNPOST 977 AI · 119 human
Transition overuse TRANSITION_OVERUSE 858 AI · 195 human
Banned vocabulary BANNED_WORDS 1,003 AI · 15 human
Segmented exposition SEGMENTED_EXPOSITORY_BLOCKS 679 AI · 0 human

The pattern is unmistakable. Some detectors — ESSAY_THESIS_ANNOUNCEMENT, SECTION_LABEL_SCAFFOLDING, SEGMENTED_EXPOSITORY_BLOCKS, BANNED_WORDS — are almost pure AI signals. Others — TRANSITION_OVERUSE, OVER_SIGNPOST, RULE_OF_THREE — fire on human essays nearly as often. The first group is what makes detection possible. The second group is what makes it dangerous to rely on any single flag.

The human false-positive problem

This is the hardest number in the report: 59.8% of human student essays score 8 or above. More than half of real student writing triggers the medium warning.

That number is not a bug in the detector. It is the fundamental reason this problem is hard.

Why does this happen? Student essays and AI essays share the same structural DNA. Clear thesis statement. Orderly body paragraphs. Transitions between sections. A conclusion that restates the thesis. Students are taught to write this way. AI models were trained on writing that was taught this way. The overlap is real, it is structural, and no amount of detector tuning can eliminate it without also eliminating the detector’s ability to catch AI.

The 20 highest-scoring human essays all scored 100 — indistinguishable from AI by structural measures alone. They used transitions, signposting, opener repetition, and formulaic conclusions. Not because they copied AI, but because they followed the same academic writing template that AI learned to reproduce.

This is why WROITER’s score is explicitly framed as a measure of pattern overlap, not proof of authorship. A high score on a student essay does not mean the student cheated. It means the essay follows the same structural patterns that AI follows — and in the student-essay domain, that happens to be true of most competent academic writing.

For the full discussion of what this means for real-world use, see Limitations.

The honest boundary

Ghostbuster taught us three things we cannot untaught:

1. You cannot benchmark a detector on AI writing alone. The Good AI library measures recall. Ghostbuster measures the cost of that recall. Without both numbers, any accuracy claim is incomplete.

2. The student-essay domain is where detectors are most dangerous. It is simultaneously the domain with the highest demand for detection and the domain where detection is least reliable. Anyone building a review process for student work needs to understand that tension before they start.

3. A shortcut that looks good on one benchmark can look reckless on another. The META_OUTRO promotion would have raised our headline number by 5.8 points. It would also have wrongly flagged 32 more student essays. We chose not to do it, and we would choose the same way again.

Publishing this number — 59.8% — is not comfortable. But the alternative is letting someone discover it after they’ve already built a policy on top of a score they assumed was reliable.

Keep reading: The Good AI Benchmark is the companion report — recall without the cost. Calibration Log shows how each detector was tuned. Limitations states what WROITER cannot do. False Positive Hall of Fame shows famous texts that fail.