What this is
The Good AI publishes a library of AI-written examples — essays on everything from the golden ratio to a memoir about a favorite place. They’re meant to showcase what AI can do. We thought they’d make a good stress test for what WROITER can catch.
So we pulled all 168 examples from their public library, saved the originals so the test is reproducible, and scored every one with the same engine you’d get if you ran the diagnostic right now. No special tuning, no cherry-picking, no adjustments for the occasion.
The result is useful in both directions. It shows where WROITER catches obvious machine prose. And it shows — just as clearly — where cleaner AI writing still slips through.
The headline number
Out of 168 AI-written examples, WROITER flagged 150.
That leaves 18 examples that came in under the line. Three of them triggered nothing at all. Those are the interesting ones — we’ll get to them.
What it catches and why
The strongest signal in this library wasn’t a vocabulary tell. It wasn’t "delve" or "tapestry." It was how the essays end.
Nine out of ten examples wrapped up with a formulaic conclusion paragraph — the kind that starts with "In conclusion," "Overall," or "To summarize" and restates the thesis without adding anything. That single pattern, on its own, showed up more than any other.
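WROITER’s internal rules aren’t published in this post, so the pattern list and function name below are illustrative assumptions, but a formulaic-conclusion check of the kind described can be sketched in a few lines of Python:

```python
import re

# Illustrative opener list: stock summarizing phrases at the start of
# the final paragraph ("In conclusion", "Overall", "To summarize", ...).
CONCLUSION_OPENERS = re.compile(
    r"^\s*(in conclusion|overall|to summarize|in summary)\b",
    re.IGNORECASE,
)

def has_formulaic_conclusion(text: str) -> bool:
    """Return True if the last paragraph opens with a stock summary phrase."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    return bool(CONCLUSION_OPENERS.match(paragraphs[-1]))

essay = (
    "The golden ratio appears throughout nature.\n\n"
    "In conclusion, the golden ratio is everywhere."
)
print(has_formulaic_conclusion(essay))  # True
```

A real detector would also check whether the paragraph restates the thesis, but the opener alone is enough to see how cheap this signal is to compute.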
Tallying how often each detection pattern fired across the full library makes the pattern hard to miss: this library is dominated by structural tells, not vocabulary ones. These essays don’t give themselves away by saying "delve"; they give themselves away by being shaped like AI essays. Tidy intro, three body paragraphs of similar length, a conclusion that restates the thesis. If you’ve ever read a ChatGPT essay and thought "this feels like a template," that feeling now has numbers behind it.
Where it fails
This is the part we find most useful. The misses weren’t random — they collapsed into one very specific gap.
Of the 18 examples that came in under the threshold, 14 had only one flag: a formulaic conclusion. That single pattern scored them a 7 — one point below the warning line. The conclusion sounded AI-generated, but there was nothing else for the detector to stack on top of it.
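The scoring model implied here is additive: each pattern that fires contributes a weight, and the total is compared against a warning threshold. In the sketch below, only the formulaic-conclusion weight (7) and the threshold (8, one point above it) are inferred from this post; the other weights are made-up placeholders:

```python
# Hypothetical pattern weights. Only "formulaic_conclusion" = 7 and the
# threshold of 8 are implied by the post; the rest are assumptions.
WEIGHTS = {
    "formulaic_conclusion": 7,
    "labeled_sections": 5,    # assumed
    "uniform_paragraphs": 4,  # assumed
}
WARNING_THRESHOLD = 8

def score(flags):
    """Sum the weights of every pattern that fired."""
    return sum(WEIGHTS[f] for f in flags)

def is_flagged(flags):
    return score(flags) >= WARNING_THRESHOLD

# The 14 near-misses: one flag, score 7, just under the line.
print(is_flagged({"formulaic_conclusion"}))                       # False
# Stack a second structural tell and the essay crosses the threshold.
print(is_flagged({"formulaic_conclusion", "uniform_paragraphs"}))  # True
```

This is why the misses cluster: any essay whose only tell is its ending lands at exactly 7, and nothing else accumulates.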
And then there were three examples that triggered absolutely nothing.
The pattern is clear. WROITER is strong against templated, structured AI writing — the kind with headings and thesis statements and tidy conclusions. It struggles with AI writing that mimics a personal voice: memoir, first-person narration, polished expository prose where the structure is loose enough to look human.
That’s a real limitation. We’d rather publish it than pretend it doesn’t exist.
The shortcut we didn’t take
After running this benchmark, there was an obvious move. Remember those 14 examples sitting at score 7 with just a formulaic conclusion? If we raised the weight of that one pattern slightly, all of them would cross the threshold. Our capture rate would jump from 89% to somewhere in the mid-90s. The benchmark would look much better.
We didn’t do it.
Here’s why. A formulaic conclusion isn’t only an AI habit. Plenty of human writing — especially student essays, which are exactly the kind of text people run through detectors — ends with "In conclusion" and a thesis restatement. If we tuned the detector to flag that harder, we’d catch more AI on this benchmark. But we’d also start flagging more human writing as suspicious.
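In an additive scoring model of the kind described (each fired pattern contributes a weight, totals compared against a warning threshold of 8), the shortcut is a one-number change. The weights here are assumptions for illustration, not WROITER’s actual values:

```python
WARNING_THRESHOLD = 8

def is_flagged(flags, weights):
    """Additive scoring: flag when the summed weights reach the threshold."""
    return sum(weights[f] for f in flags) >= WARNING_THRESHOLD

shipped = {"formulaic_conclusion": 7}   # weight implied by the post
easy_fix = {"formulaic_conclusion": 8}  # the rejected one-point bump

# A borderline essay whose only tell is the conclusion:
print(is_flagged({"formulaic_conclusion"}, shipped))   # False
print(is_flagged({"formulaic_conclusion"}, easy_fix))  # True
# The catch: a human student essay ending with "In conclusion" fires
# the exact same pattern, so the bump flips those essays too.
```

The same one-point change that converts the 14 near-misses into catches converts borderline human essays into false positives, which is the trade-off the next table quantifies.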
That’s the kind of trade-off that looks like a win on a chart and feels like a betrayal to the person whose essay just got wrongly flagged.
To make sure we weren’t being paranoid, we tested it. We took the Ghostbuster essay dataset — a public academic collection of 694 real student essays alongside 4,164 AI-generated versions of the same assignments — and ran both scenarios:
| Scenario | AI caught | Humans wrongly flagged |
|---|---|---|
| What we shipped | 91.7% | 59.8% |
| The easy fix (rejected) | 97.5% | 64.4% |
The easy fix catches 5.8 percentage points more AI writing. It also wrongly flags 4.6 points more human writing, in a domain (student essays) where the human false-positive rate is already uncomfortably high.
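The deltas are straight percentage-point differences between the two rows of the table:

```python
# Rates from the Ghostbuster benchmark table, in percent.
shipped = {"ai_caught": 91.7, "humans_flagged": 59.8}
easy_fix = {"ai_caught": 97.5, "humans_flagged": 64.4}

gain = round(easy_fix["ai_caught"] - shipped["ai_caught"], 1)
cost = round(easy_fix["humans_flagged"] - shipped["humans_flagged"], 1)
print(gain, cost)  # 5.8 4.6
```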
What we shipped instead
Instead of the blunt shortcut, we shipped three narrow changes — patterns specific enough that they fire on AI templates without catching the human writing that shares a vague family resemblance.
Among them: a pattern that fires when an essay opens with an explicit "Introduction:" label followed by a thesis announcement, and another that fires when the sections are labeled "Introduction:", "Body:", and "Conclusion:". Real essays have headings. But they don’t usually label them with the word "Body."

Together, these changes moved the numbers without compromising the detector’s honesty.
The internal holdout — a separate, mixed set of human and AI writing — stayed at 100% precision, 77.8% recall, and 0.0% human false-positive rate. Nothing broke.
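For readers unfamiliar with the metrics: precision, recall, and false-positive rate are simple ratios over flag counts. The counts below are hypothetical, chosen only so the ratios reproduce the reported holdout numbers; they are not the holdout’s real size:

```python
def metrics(ai_total, ai_flagged, human_total, human_flagged):
    precision = ai_flagged / (ai_flagged + human_flagged)  # flagged docs that are AI
    recall = ai_flagged / ai_total                         # AI docs that got flagged
    fpr = human_flagged / human_total                      # humans wrongly flagged
    return precision, recall, fpr

# Hypothetical holdout: 9 AI docs (7 flagged), 10 human docs (0 flagged).
p, r, f = metrics(ai_total=9, ai_flagged=7, human_total=10, human_flagged=0)
print(f"{p:.0%} {r:.1%} {f:.1%}")  # 100% 77.8% 0.0%
```

Note that 100% precision with 77.8% recall is the conservative corner of the trade-off: every flag is correct, at the cost of letting some AI text through.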
Full category breakdown
The library spans 17 subject categories. Five hit 100% capture — every single example flagged. Two categories came in noticeably lower. Here’s the full picture:
| Category | Examples | Captured | Avg score |
|---|---|---|---|
| Physics | 4 | 100% | 90.0 |
| Engineering | 9 | 100% | 88.2 |
| Chemistry | 6 | 100% | 84.2 |
| Biology | 8 | 100% | 78.5 |
| Literature | 9 | 100% | 75.1 |
| History | 12 | 91.7% | 66.8 |
| Philosophy | 11 | 90.9% | 65.2 |
| Economics | 11 | 90.9% | 64.1 |
| Art | 10 | 90.0% | 63.7 |
| Religion | 10 | 90.0% | 60.5 |
| Finance | 9 | 88.9% | 68.4 |
| Music | 9 | 88.9% | 59.3 |
| Geography | 8 | 87.5% | 61.0 |
| Comparative | 12 | 83.3% | 57.8 |
| Math | 10 | 80.0% | 55.2 |
| Memoir | 20 | 75.0% | 48.6 |
| Technology | 10 | 60.0% | 42.1 |
The top of the table — physics, engineering, chemistry — is exactly where you’d expect a rule-based detector to dominate. These subjects produce essays with heavy scaffolding: clear sections, signposted arguments, formulaic conclusions. The AI habits stack up fast.
The bottom of the table tells the more honest story. Technology essays tend to be cleaner and less templated. Memoir is the hardest of all — first-person AI prose that mimics personal reflection without leaking structural tells. If WROITER has a blind spot, it’s here.
The honest boundary
WROITER catches nearly nine in ten AI-written examples from a third-party library it was never trained on. For heavily templated essay prose, it’s ruthless.
But clean memoir-style AI and generic expository prose can still slip through — especially when the only tell is how the essay ends. That gap is real, and closing it without hurting real human writers is the hardest open problem in this space.
That is not a reason to hide the benchmark. It is the reason to publish it.