## Why this benchmark exists
WROITER’s homepage promise is not just that the diagnostic catches AI-like patterns. It is that the prompt points to the specific problem and helps produce a cleaner second draft. That means we need a benchmark for the prompt itself, not only a benchmark for the detector.
Revision Benchmark v1 does that in the simplest honest way we could ship quickly. We froze a fresh slice from Polygraf’s multi-model contextual dataset, scored the drafts with the production detector, revised only the drafts that actually generated a WROITER prompt, and rescored the results. Human controls stayed in the mix so any prompt-led gain had to happen without pushing real human text upward.
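That four-step loop (score, prompt, revise, rescore) can be sketched as follows. This is a minimal illustration, not the production harness: `detector_score`, `wroiter_prompt`, and `revise_with_prompt` are hypothetical stand-ins for the real APIs, and the record schema is an assumption.

```python
def run_benchmark(frozen_texts, detector_score, wroiter_prompt, revise_with_prompt):
    """Score, prompt, revise, rescore; drafts with no prompt pass through untouched."""
    results = []
    for rec in frozen_texts:
        before = detector_score(rec["draft"])
        prompt = wroiter_prompt(rec["draft"])  # None when the detector has no concrete hook
        if prompt is not None:
            # Only prompted drafts get revised, exactly as in the v1 protocol.
            after = detector_score(revise_with_prompt(rec["draft"], prompt))
        else:
            after = before  # no prompt, no revision: the draft stays frozen as-is
        results.append({"label": rec["label"], "before": before,
                        "after": after, "prompted": prompt is not None})
    return results
```

Because unprompted drafts keep their original score, human controls contribute to the published averages without ever being touched, which is what keeps the human baseline honest.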
## The frozen corpus
The first public run uses a deterministic 32-text slice: 24 AI drafts and 8 human controls. The AI drafts were grouped into the same four prompt buckets WROITER uses internally for revision testing: summary, rewrite, simplify/explain, and blog paragraph.
The Polygraf dataset’s API is gated in this local environment, so v1 freezes a public-viewer slice rather than pretending we had unrestricted live access. That limitation is part of the benchmark record, not something hidden after the fact.
| Slice | Count | Prompted | Revised | Notes |
|---|---|---|---|---|
| AI drafts | 24 | 5 | 5 | 19 stayed unchanged because WROITER produced no prompt. |
| Human controls | 8 | 0 | 0 | No human control crossed the prompt threshold in this run. |
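The table above is a straight tally over the frozen records. A minimal sketch of that tally, assuming an illustrative per-record schema (the field names `label`, `prompted`, and `revised` are my assumptions, not the real dataset's):

```python
from collections import Counter

def slice_summary(records):
    """Tally count/prompted/revised per slice, mirroring the table above.

    Field names are illustrative assumptions about the frozen-record schema.
    """
    summary = {}
    for rec in records:
        row = summary.setdefault(rec["label"], Counter())
        row["count"] += 1
        row["prompted"] += rec["prompted"]   # booleans sum to counts
        row["revised"] += rec["revised"]
    return summary
```

Using a `Counter` per slice means absent fields default to zero, so the human-control row reports 0 prompted and 0 revised without any special casing.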
## The headline numbers
Across the full AI set, the movement was real but modest: mean score 2.2 → 0.0. That full-corpus average is the honest headline number. It stays small because most drafts were already so clean that WROITER had nothing specific to say.
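The arithmetic behind that full-set mean is simple: 19 drafts sat at zero before revision, and the 5 actionable drafts carried all of the signal. The individual before-scores below are made up purely to reproduce the published 2.2 average; they are not real detector outputs.

```python
# Illustrative scores only: the 19 clean drafts score zero, and the 5
# actionable drafts are assigned a made-up value chosen so the full-set
# mean matches the published 2.2. The after-scores reflect that every
# revised draft cleared the detector.
before = [0.0] * 19 + [10.56] * 5   # full 24-draft AI set, pre-revision
after = [0.0] * 24                  # post-revision

mean_before = round(sum(before) / len(before), 1)
mean_after = round(sum(after) / len(after), 1)
print(mean_before, "->", mean_after)  # 2.2 -> 0.0
```

The point of the exercise: a handful of high scores diluted across 24 drafts produces a small-looking corpus average, which is why the full-set number understates what happened on the actionable subset.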
## Prompt coverage is the real story
If we cherry-picked only the actionable subset, the benchmark would look triumphant. If we only reported the full-set average, it would look underwhelming. The truth is both at once.
This Polygraf slice is clean. Nineteen of the 24 AI drafts scored zero and produced no revision prompt at all. That tells us something important about WROITER’s current boundary: the prompt is strongest when the detector already has a concrete hook, and much weaker as a universal editor for clean assistant prose.
## What the prompt fixed
The actionable drafts needed modest edits, not catastrophic rewrites. Four were classic stock-phrase cleanups and one stripped out assistant scaffolding. That is exactly the kind of narrow, surgical edit WROITER is supposed to guide.
## What it did not fix
The weak seam is not the revised subset. The weak seam is everything that never generated a prompt.
Summary and simplify/explain drafts were especially quiet in this run. That silence is useful feedback. It tells us the next round of detector work should focus less on “can the prompt remove a named phrase?” and more on “what signals are still missing from clean contextual assistant prose?”
That is also why the benchmark is frozen. When we improve those detectors later, we will rerun the same 32 texts and publish the before-and-after change instead of finding a new set that flatters the update.
## Benchmarking discipline
This page is only useful because the benchmark is constrained. The source texts are frozen. The revisions are frozen separately. The human controls are left visible. The misses stay in the published number.
That discipline is part of the broader calibration method. We do not quietly retune and then swap to an easier set. We do not report the actionability win without also reporting the 20.8% prompt coverage. And we do not collapse detector benchmarks and revision benchmarks into one mushy claim.
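For the record, that 20.8% coverage figure follows directly from the frozen-corpus table, 5 prompted drafts out of 24:

```python
# Counts taken from the frozen-corpus table above.
prompted, total_ai = 5, 24
coverage = round(100 * prompted / total_ai, 1)
print(f"{coverage}% prompt coverage")  # 20.8% prompt coverage
```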
If you want the policy version rather than the story version, the calibration page now has a dedicated section on benchmarking discipline.
## The honest boundary
Revision Benchmark v1 is a good result for the prompt and a sobering result for prompt coverage.
When WROITER produces a concrete revision prompt, the prompt is doing its job. The revised drafts in this run stayed faithful, stayed short, and cleared the current detector. But most clean Polygraf drafts still produced no prompt at all. That is not a reason to hide the benchmark. It is the reason to keep publishing it.