What this is
The Good AI publishes a library of AI-written examples — essays on everything from the golden ratio to a memoir about a favorite place. They’re meant to showcase what AI can do. We thought they’d make a good stress test for what WROITER can catch.
So we pulled all 168 examples from their public library, saved the originals so the test is reproducible, and scored every one with the same engine you’d get if you ran the diagnostic right now. No special tuning, no cherry-picking, no adjustments for the occasion.
The result is useful in both directions. It shows where WROITER catches obvious machine prose. And it shows — just as clearly — where cleaner AI writing still slips through.
The headline number
Out of 168 AI-written examples, WROITER flagged 150.
That leaves 18 examples that came in under the line. Three of them triggered nothing at all. Those are the interesting ones — we’ll get to them.
What it catches and why
The strongest signal in this library wasn’t a vocabulary tell. It wasn’t "delve" or "tapestry." It was how the essays end.
Nine out of ten examples wrapped up with a formulaic conclusion paragraph — the kind that starts with "In conclusion," "Overall," or "To summarize" and restates the thesis without adding anything. That single pattern, on its own, showed up more than any other.
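WROITER’s internal rules aren’t published in this post, so the pattern list and function name below are illustrative assumptions, but a formulaic-conclusion check of the kind described can be sketched in a few lines of Python:

```python
import re

# Illustrative opener list: stock summarizing phrases at the start of
# the final paragraph ("In conclusion", "Overall", "To summarize", ...).
CONCLUSION_OPENERS = re.compile(
    r"^\s*(in conclusion|overall|to summarize|in summary)\b",
    re.IGNORECASE,
)

def has_formulaic_conclusion(text: str) -> bool:
    """Return True if the last paragraph opens with a stock summary phrase."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    return bool(CONCLUSION_OPENERS.match(paragraphs[-1]))

essay = (
    "The golden ratio appears throughout nature.\n\n"
    "In conclusion, the golden ratio is everywhere."
)
print(has_formulaic_conclusion(essay))  # True
```

A real detector would also check whether the paragraph restates the thesis, but the opener alone is enough to see how cheap this signal is to compute.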
Tallying how often each detection pattern fired across the full library makes the pattern hard to miss: this library is dominated by structural tells, not vocabulary ones. These essays don’t give themselves away by saying "delve"; they give themselves away by being shaped like AI essays. Tidy intro, three body paragraphs of similar length, a conclusion that restates the thesis. If you’ve ever read a ChatGPT essay and thought "this feels like a template," that feeling now has numbers behind it.
Where it fails
This is the part we find most useful. The misses weren’t random — they collapsed into one very specific gap.
Of the 18 examples that came in under the threshold, 14 had only one flag: a formulaic conclusion. That single pattern scored them a 7 — one point below the warning line. The conclusion sounded AI-generated, but there was nothing else for the detector to stack on top of it.
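The scoring model implied here is additive: each pattern that fires contributes a weight, and the total is compared against a warning threshold. In the sketch below, only the formulaic-conclusion weight (7) and the threshold (8, one point above it) are inferred from this post; the other weights are made-up placeholders:

```python
# Hypothetical pattern weights. Only "formulaic_conclusion" = 7 and the
# threshold of 8 are implied by the post; the rest are assumptions.
WEIGHTS = {
    "formulaic_conclusion": 7,
    "labeled_sections": 5,    # assumed
    "uniform_paragraphs": 4,  # assumed
}
WARNING_THRESHOLD = 8

def score(flags):
    """Sum the weights of every pattern that fired."""
    return sum(WEIGHTS[f] for f in flags)

def is_flagged(flags):
    return score(flags) >= WARNING_THRESHOLD

# The 14 near-misses: one flag, score 7, just under the line.
print(is_flagged({"formulaic_conclusion"}))                       # False
# Stack a second structural tell and the essay crosses the threshold.
print(is_flagged({"formulaic_conclusion", "uniform_paragraphs"}))  # True
```

This is why the misses cluster: any essay whose only tell is its ending lands at exactly 7, and nothing else accumulates.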
And then there were three examples that triggered absolutely nothing.
The pattern is clear. WROITER is strong against templated, structured AI writing — the kind with headings and thesis statements and tidy conclusions. It struggles with AI writing that mimics a personal voice: memoir, first-person narration, polished expository prose where the structure is loose enough to look human.
That’s a real limitation. We’d rather publish it than pretend it doesn’t exist.
The shortcut we didn’t take
After running this benchmark, there was an obvious move. Remember those 14 examples sitting at score 7 with just a formulaic conclusion? If we raised the weight of that one pattern slightly, all of them would cross the threshold. Our capture rate would jump from 89% to somewhere in the mid-90s. The benchmark would look much better.
We didn’t do it.
Here’s why. A formulaic conclusion isn’t only an AI habit. Plenty of human writing — especially student essays, which are exactly the kind of text people run through detectors — ends with "In conclusion" and a thesis restatement. If we tuned the detector to flag that harder, we’d catch more AI on this benchmark. But we’d also start flagging more human writing as suspicious.
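In an additive scoring model of the kind described (each fired pattern contributes a weight, totals compared against a warning threshold of 8), the shortcut is a one-number change. The weights here are assumptions for illustration, not WROITER’s actual values:

```python
WARNING_THRESHOLD = 8

def is_flagged(flags, weights):
    """Additive scoring: flag when the summed weights reach the threshold."""
    return sum(weights[f] for f in flags) >= WARNING_THRESHOLD

shipped = {"formulaic_conclusion": 7}   # weight implied by the post
easy_fix = {"formulaic_conclusion": 8}  # the rejected one-point bump

# A borderline essay whose only tell is the conclusion:
print(is_flagged({"formulaic_conclusion"}, shipped))   # False
print(is_flagged({"formulaic_conclusion"}, easy_fix))  # True
# The catch: a human student essay ending with "In conclusion" fires
# the exact same pattern, so the bump flips those essays too.
```

The same one-point change that converts the 14 near-misses into catches converts borderline human essays into false positives, which is the trade-off the next table quantifies.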
That’s the kind of trade-off that looks like a win on a chart and feels like a betrayal to the person whose essay just got wrongly flagged.
To make sure we weren’t being paranoid, we tested it. We took the Ghostbuster essay dataset — a public academic collection of 694 real student essays alongside 4,164 AI-generated versions of the same assignments — and ran both scenarios:
| Scenario | AI caught | Humans wrongly flagged |
|---|---|---|
| What we shipped | 91.7% | 59.8% |
| The easy fix (rejected) | 97.5% | 64.4% |
The easy fix catches 5.8 percentage points more AI writing. It also wrongly flags 4.6 points more human writing, in a domain (student essays) where the human false-positive rate is already uncomfortably high.
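The deltas are straight percentage-point differences between the two rows of the table:

```python
# Rates from the Ghostbuster benchmark table, in percent.
shipped = {"ai_caught": 91.7, "humans_flagged": 59.8}
easy_fix = {"ai_caught": 97.5, "humans_flagged": 64.4}

gain = round(easy_fix["ai_caught"] - shipped["ai_caught"], 1)
cost = round(easy_fix["humans_flagged"] - shipped["humans_flagged"], 1)
print(gain, cost)  # 5.8 4.6
```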
What we shipped instead
Instead of the blunt shortcut, we shipped three narrow changes — patterns specific enough that they fire on AI templates without catching the human writing that shares a vague family resemblance.
Among them: a pattern that fires when an essay opens with an explicit "Introduction:" label followed by a thesis announcement, and another that fires when the sections are labeled "Introduction:", "Body:", and "Conclusion:". Real essays have headings. But they don’t usually label them with the word "Body."

Together, these changes moved the numbers without compromising the detector’s honesty.
The internal holdout — a separate, mixed set of human and AI writing — stayed at 100% precision, 77.8% recall, and 0.0% human false-positive rate. Nothing broke.
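For readers unfamiliar with the metrics: precision, recall, and false-positive rate are simple ratios over flag counts. The counts below are hypothetical, chosen only so the ratios reproduce the reported holdout numbers; they are not the holdout’s real size:

```python
def metrics(ai_total, ai_flagged, human_total, human_flagged):
    precision = ai_flagged / (ai_flagged + human_flagged)  # flagged docs that are AI
    recall = ai_flagged / ai_total                         # AI docs that got flagged
    fpr = human_flagged / human_total                      # humans wrongly flagged
    return precision, recall, fpr

# Hypothetical holdout: 9 AI docs (7 flagged), 10 human docs (0 flagged).
p, r, f = metrics(ai_total=9, ai_flagged=7, human_total=10, human_flagged=0)
print(f"{p:.0%} {r:.1%} {f:.1%}")  # 100% 77.8% 0.0%
```

Note that 100% precision with 77.8% recall is the conservative corner of the trade-off: every flag is correct, at the cost of letting some AI text through.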
Full category breakdown
The library spans 17 subject categories. Five hit 100% capture — every single example flagged. Two categories came in noticeably lower. Here’s the full picture:
| Category | Examples | Captured | Avg score |
|---|---|---|---|
| Physics | 4 | 100% | 90.0 |
| Engineering | 9 | 100% | 88.2 |
| Chemistry | 6 | 100% | 84.2 |
| Biology | 8 | 100% | 78.5 |
| Literature | 9 | 100% | 75.1 |
| History | 12 | 91.7% | 66.8 |
| Philosophy | 11 | 90.9% | 65.2 |
| Economics | 11 | 90.9% | 64.1 |
| Art | 10 | 90.0% | 63.7 |
| Religion | 10 | 90.0% | 60.5 |
| Finance | 9 | 88.9% | 68.4 |
| Music | 9 | 88.9% | 59.3 |
| Geography | 8 | 87.5% | 61.0 |
| Comparative | 12 | 83.3% | 57.8 |
| Math | 10 | 80.0% | 55.2 |
| Memoir | 20 | 75.0% | 48.6 |
| Technology | 10 | 60.0% | 42.1 |
The top of the table — physics, engineering, chemistry — is exactly where you’d expect a rule-based detector to dominate. These subjects produce essays with heavy scaffolding: clear sections, signposted arguments, formulaic conclusions. The AI habits stack up fast.
The bottom of the table tells the more honest story. Technology essays tend to be cleaner and less templated. Memoir is the hardest of all — first-person AI prose that mimics personal reflection without leaking structural tells. If WROITER has a blind spot, it’s here.
The honest boundary
WROITER catches nearly nine in ten AI-written examples from a third-party library it was never trained on. For heavily templated essay prose, it’s ruthless.
But clean memoir-style AI and generic expository prose can still slip through — especially when the only tell is how the essay ends. That gap is real, and closing it without hurting real human writers is the hardest open problem in this space.
That is not a reason to hide the benchmark. It is the reason to publish it.