What this is

The Good AI publishes a library of AI-written examples: essays on everything from the golden ratio to a memoir about a favorite place. They’re meant to showcase what AI can do. We thought they’d make a good stress test for what WROITER can catch.

So we pulled all 168 examples from their public library, saved the originals so the test is reproducible, and scored every one with the same engine you’d get if you ran the diagnostic right now. No special tuning, no cherry-picking, no adjustments for the occasion.

The result is useful in both directions. It shows where WROITER catches obvious machine prose. And it shows — just as clearly — where cleaner AI writing still slips through.

How to read the numbers in this report: WROITER scores text on a scale where higher means more AI-like patterns detected. A score of 8 triggers a medium warning, enough patterns to be worth looking at. A score of 30 triggers a high warning, a serious pileup. When we say "captured," we mean the example scored 8 or above.
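As a minimal sketch, the thresholds work like this. Only the two cutoffs are taken from this report; the per-pattern weights that produce the score aren’t published here:

```python
# Warning thresholds from this report: 8 = medium, 30 = high.
MEDIUM, HIGH = 8, 30

def warning_level(score: int) -> str:
    """Map a WROITER-style score to a warning level."""
    if score >= HIGH:
        return "high"
    if score >= MEDIUM:
        return "medium"
    return "none"

def captured(score: int) -> bool:
    """'Captured' in this report means scoring at or above the medium threshold."""
    return score >= MEDIUM
```

So an example sitting at 7 (one point under the line) counts as a miss, which matters later in this report.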

The headline number

Out of 168 AI-written examples, WROITER flagged 150.

  • Captured: 89.3% (150 of 168 examples scored at or above the medium threshold)
  • Strong pileup: 79.8% (134 examples triggered so many signals they hit the high warning)
  • Average score: 67.9 (most of this library is heavily templated essay prose, exactly WROITER’s home turf)

That leaves 18 examples that came in under the line. Three of them triggered nothing at all. Those are the interesting ones — we’ll get to them.

What it catches and why

The strongest signal in this library wasn’t a vocabulary tell. It wasn’t "delve" or "tapestry." It was how the essays end.

Nine out of ten examples wrapped up with a formulaic conclusion paragraph — the kind that starts with "In conclusion," "Overall," or "To summarize" and restates the thesis without adding anything. That single pattern, on its own, showed up more than any other.

Here’s how often each detection pattern fired across the full library:

  • Formulaic conclusions (META_OUTRO): 152 of 168
  • Neat stacks of same-length paragraphs (SEGMENTED_EXPOSITORY_BLOCKS): 65 of 168
  • Stock AI phrases (BANNED_PHRASES): 50 of 168
  • Too many “Moreover” and “Furthermore” (TRANSITION_OVERUSE): 39 of 168
  • Lists of exactly three things (RULE_OF_THREE): 36 of 168
  • Excessive signposting (OVER_SIGNPOST): 28 of 168
  • Assignment-style headings (SECTION_LABEL_SCAFFOLDING): 20 of 168
  • Outline-like list structure (OUTLINE_LIST_FORMAT): 15 of 168

Notice the pattern. This library is dominated by structural tells, not vocabulary ones. These essays don’t give themselves away by saying "delve" — they give themselves away by being shaped like AI essays. Tidy intro, three body paragraphs of similar length, a conclusion that restates the thesis. If you’ve ever read a ChatGPT essay and thought "this feels like a template" — that feeling now has numbers behind it.
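To make the structural-tell idea concrete, here is a minimal sketch of a formulaic-conclusion check in the spirit of META_OUTRO. The opener list and the paragraph-splitting logic are our assumptions for illustration, not the shipped rule:

```python
import re

# Stock conclusion openers; an illustrative list, not WROITER's actual rule set.
OUTRO_OPENERS = re.compile(
    r"^(in conclusion|overall|to summarize|to sum up|in summary)\b",
    re.IGNORECASE,
)

def has_formulaic_conclusion(text: str) -> bool:
    """True if the final paragraph opens with a stock conclusion phrase."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return bool(paragraphs) and bool(OUTRO_OPENERS.match(paragraphs[-1]))
```

Even a check this crude fires on most of this library, because the tell sits in a predictable place: the first words of the last paragraph.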

Where it fails

This is the part we find most useful. The misses weren’t random — they collapsed into one very specific gap.

Of the 18 examples that came in under the threshold, 14 had only one flag: a formulaic conclusion. That single pattern scored them a 7 — one point below the warning line. The conclusion sounded AI-generated, but there was nothing else for the detector to stack on top of it.

And then there were three examples that triggered absolutely nothing:

  • A clean explanatory essay with references. No template leakage at all. It reads like a well-researched student paper — because that’s essentially what the AI was asked to produce.
  • A first-person memoir. Generic enough that a human reader might suspect AI, but the current rules have nothing structural to point at.
  • Another memoir piece. The sentimentality feels performed — but “feels performed” isn’t a rule you can write down. Not yet.

The pattern is clear. WROITER is strong against templated, structured AI writing — the kind with headings and thesis statements and tidy conclusions. It struggles with AI writing that mimics a personal voice: memoir, first-person narration, polished expository prose where the structure is loose enough to look human.

That’s a real limitation. We’d rather publish it than pretend it doesn’t exist.

The shortcut we didn’t take

After running this benchmark, there was an obvious move. Remember those 14 examples sitting at score 7 with just a formulaic conclusion? If we raised the weight of that one pattern slightly, all of them would cross the threshold. Our capture rate would jump from 89.3% to nearly 98%. The benchmark would look much better.

We didn’t do it.

Here’s why. A formulaic conclusion isn’t only an AI habit. Plenty of human writing — especially student essays, which are exactly the kind of text people run through detectors — ends with "In conclusion" and a thesis restatement. If we tuned the detector to flag that harder, we’d catch more AI on this benchmark. But we’d also start flagging more human writing as suspicious.

That’s the kind of trade-off that looks like a win on a chart and feels like a betrayal to the person whose essay just got wrongly flagged.

To make sure we weren’t being paranoid, we tested it. We took the Ghostbuster essay dataset — a public academic collection of 694 real student essays alongside 4,164 AI-generated versions of the same assignments — and ran both scenarios:

  Scenario                  AI caught   Humans wrongly flagged
  What we shipped           91.7%       59.8%
  The easy fix (rejected)   97.5%       64.4%

The easy fix catches 5.8 percentage points more AI writing. It also wrongly flags 4.6 points more human writing, in a domain (student essays) where the human false-positive rate is already uncomfortably high.
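In absolute terms, using the Ghostbuster counts above, the trade-off looks like this. A quick back-of-the-envelope computation, not an additional measurement:

```python
AI_TOTAL, HUMAN_TOTAL = 4164, 694  # Ghostbuster dataset sizes

shipped = {"ai_recall": 0.917, "human_fpr": 0.598}
easy_fix = {"ai_recall": 0.975, "human_fpr": 0.644}

extra_ai = round((easy_fix["ai_recall"] - shipped["ai_recall"]) * AI_TOTAL)
extra_humans = round((easy_fix["human_fpr"] - shipped["human_fpr"]) * HUMAN_TOTAL)

print(extra_ai, extra_humans)  # roughly 242 more AI essays caught, 32 more humans wrongly flagged
```

Roughly 242 extra catches bought with roughly 32 extra wrongly flagged people. That exchange rate is the whole argument against the shortcut.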

Why is the human false-positive rate so high on student essays? Because student essays and AI essays often share the same structural habits: clear thesis, orderly body paragraphs, formulaic conclusion. That’s not a bug in the detector — it’s the fundamental reason this problem is hard. The better AI gets at mimicking the writing students are taught to produce, the harder it becomes to tell them apart with structural rules alone. This is something we discuss more in Limitations.

What we shipped instead

Instead of the blunt shortcut, we shipped three narrow changes — patterns specific enough that they fire on AI templates without catching the human writing that shares a vague family resemblance.

  • Headed-thesis detector — the rigid "Introduction: This essay will explore…" opening. Not just any introduction — the specific template where the heading literally says Introduction: followed by a thesis announcement.
  • Segmented expository blocks — essays built from neat stacks of paragraphs that are all roughly the same length, each covering one subtopic. Humans do write expository prose, but they rarely produce five paragraphs of almost identical word count.
  • Section label scaffolding — assignment-style headings like Introduction:, Body:, and Conclusion:. Real essays have headings. But they don’t usually label them with the word "Body."
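Of those three, the segmented-blocks check is the easiest to illustrate. A minimal sketch, assuming a made-up uniformity cutoff (the shipped thresholds aren’t disclosed here), scores how similar the paragraph lengths are:

```python
import statistics

def looks_segmented(text: str, max_cv: float = 0.15, min_paragraphs: int = 4) -> bool:
    """Flag essays built from near-identical-length paragraphs.

    Uses the coefficient of variation (stdev / mean) of paragraph
    word counts; near-zero means suspiciously uniform blocks.
    """
    lengths = [len(p.split()) for p in text.split("\n\n") if p.strip()]
    if len(lengths) < min_paragraphs:
        return False
    cv = statistics.stdev(lengths) / statistics.mean(lengths)
    return cv <= max_cv
```

The narrowness is the point: human expository prose with naturally varied paragraph lengths sails past a check like this.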

Together, these three changes moved the numbers without compromising the detector’s honesty:

  • Good AI capture: 83.9% → 89.3% (from the three new patterns alone)
  • Ghostbuster AI recall: 87.2% → 91.7% (on the independent student-essay dataset)

The internal holdout — a separate, mixed set of human and AI writing — stayed at 100% precision, 77.8% recall, and 0.0% human false-positive rate. Nothing broke.
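For readers keeping score, those three holdout metrics are the standard definitions, spelled out below; the example counts are hypothetical, chosen only to show how the reported profile can arise:

```python
def precision(tp: int, fp: int) -> float:
    """Of everything flagged, the fraction that was actually AI."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all AI text, the fraction that was flagged."""
    return tp / (tp + fn)

def false_positive_rate(fp: int, tn: int) -> float:
    """Of all human text, the fraction wrongly flagged."""
    return fp / (fp + tn)
```

For example, catching 7 of 9 AI samples with zero false positives yields 100% precision, 77.8% recall, and 0.0% false positives: the same shape as the holdout result, where every flag was correct but some AI still slipped through.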

Full category breakdown

The library spans 17 subject categories. Five hit 100% capture — every single example flagged. Two categories came in noticeably lower. Here’s the full picture:

  Category       Examples   Captured   Avg score
  Physics        4          100%       90.0
  Engineering    9          100%       88.2
  Chemistry      6          100%       84.2
  Biology        8          100%       78.5
  Literature     9          100%       75.1
  History        12         91.7%      66.8
  Philosophy     11         90.9%      65.2
  Economics      11         90.9%      64.1
  Art            10         90.0%      63.7
  Religion       10         90.0%      60.5
  Finance        9          88.9%      68.4
  Music          9          88.9%      59.3
  Geography      8          87.5%      61.0
  Comparative    12         83.3%      57.8
  Math           10         80.0%      55.2
  Memoir         20         75.0%      48.6
  Technology     10         60.0%      42.1

The top of the table — physics, engineering, chemistry — is exactly where you’d expect a rule-based detector to dominate. These subjects produce essays with heavy scaffolding: clear sections, signposted arguments, formulaic conclusions. The AI habits stack up fast.

The bottom of the table tells the more honest story. Technology essays tend to be cleaner and less templated. Memoir is the hardest of all — first-person AI prose that mimics personal reflection without leaking structural tells. If WROITER has a blind spot, it’s here.

The honest boundary

WROITER catches nearly nine in ten AI-written examples from a third-party library it was never trained on. For heavily templated essay prose, it’s ruthless.

But clean memoir-style AI and generic expository prose can still slip through — especially when the only tell is how the essay ends. That gap is real, and closing it without hurting real human writers is the hardest open problem in this space.

That is not a reason to hide the benchmark. It is the reason to publish it.

Keep reading: Rule-Based Signals explains all 50 detection patterns. Calibration Log shows the human-side testing. Limitations states what WROITER cannot do.