01

Why this benchmark exists

WROITER’s homepage promise is not just that the diagnostic catches AI-like patterns. It is that the prompt points to the specific problem and helps produce a cleaner second draft. That means we need a benchmark for the prompt itself, not only a benchmark for the detector.

Revision Benchmark v1 does that in the simplest honest way we could ship quickly. We froze a fresh slice from Polygraf’s multi-model contextual dataset, scored the drafts with the production detector, revised only the drafts that actually generated a WROITER prompt, and rescored the results. Human controls stayed in the mix so any prompt-led gain had to happen without pushing real human text upward.
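The score-prompt-revise-rescore loop can be sketched in a few lines. The function names here (`detector_score`, `wroiter_prompt`, `revise`) are hypothetical stand-ins for the production detector, prompt generator, and revision step, not the actual API:

```python
# Sketch of the Revision Benchmark v1 loop, under assumed interfaces.
# detector_score, wroiter_prompt, and revise are hypothetical stand-ins.

def run_benchmark(frozen_slice, detector_score, wroiter_prompt, revise):
    results = []
    for text in frozen_slice:
        before = detector_score(text["body"])
        prompt = wroiter_prompt(text["body"])  # None when nothing actionable fires
        if prompt is None:
            # No prompt, no revision: the draft counts as unchanged.
            results.append({**text, "before": before, "after": before,
                            "prompted": False})
            continue
        revised = revise(text["body"], prompt)
        results.append({**text, "before": before,
                        "after": detector_score(revised), "prompted": True})
    return results
```

The key property for the human controls is visible in the structure: a text that never generates a prompt is never touched, so its score cannot drift.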

Important nuance. This benchmark does not ask whether a language model can rewrite text beautifully in the abstract. It asks a narrower and more useful product question: when WROITER produces a concrete prompt, does that prompt lead to cleaner drafts without obvious drift?
02

The frozen corpus

The first public run uses a deterministic 32-text slice: 24 AI drafts and 8 human controls. The AI drafts were grouped into the same four prompt buckets WROITER uses internally for revision testing: summary, rewrite, simplify/explain, and blog paragraph.

The Polygraf dataset’s API is gated in this local environment, so v1 freezes a public-viewer slice rather than pretending we had unrestricted live access. That limitation is part of the benchmark record, not something hidden after the fact.
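"Deterministic" here just means the slice is reproducible from the same source records regardless of arrival order. One way to get that property, sketched under assumed field names (`id`, `label` are hypothetical), is to sort on a stable hash of each record's id before taking the cut:

```python
import hashlib

def freeze_slice(records, n_ai=24, n_human=8):
    """Pick a reproducible slice by sorting on a stable hash of each id.

    Sorting by a content-derived key makes the selection independent of
    the order the records arrive in, so reruns freeze the same texts.
    """
    def key(r):
        return hashlib.sha256(r["id"].encode()).hexdigest()
    ai = sorted((r for r in records if r["label"] == "ai"), key=key)[:n_ai]
    human = sorted((r for r in records if r["label"] == "human"), key=key)[:n_human]
    return ai + human
```

Shuffling the input records and re-freezing yields the identical slice, which is the whole point of publishing a frozen corpus.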

Slice            Count   Prompted   Revised   Notes
AI drafts        24      5          5         19 stayed unchanged because WROITER produced no prompt.
Human controls   8       0          0         No human control crossed the prompt threshold in this run.
03

The headline numbers

Prompt coverage
20.8%
Only 5 of 24 fresh Polygraf AI drafts produced an actionable WROITER revision prompt.
Actionable mean
10.4 → 0.0
On the drafts WROITER did speak to, one revision pass cleared every current flag.
Human drift
0
No human control was revised, and no new human threshold crossings were introduced.

The full AI set also moved, but much less dramatically: mean score 2.2 → 0.0. That smaller number is the honest one for the whole corpus. It stays small because most drafts were already so clean that WROITER had nothing specific to say.
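The two means are linked by simple arithmetic: the 19 unprompted drafts all scored zero, so the full-set mean is just the actionable mean diluted across all 24 drafts. A quick check, using only the numbers reported above:

```python
actionable = 5          # AI drafts that produced a WROITER prompt
total_ai = 24           # all AI drafts in the frozen slice
actionable_mean = 10.4  # mean detector score on the prompted subset

coverage = actionable / total_ai                         # share of drafts with a prompt
full_set_mean = actionable_mean * actionable / total_ai  # 19 zeros dilute the mean

print(round(coverage * 100, 1))  # 20.8
print(round(full_set_mean, 1))   # 2.2
```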

04

Prompt coverage is the real story

If we cherry-picked only the actionable subset, the benchmark would look triumphant. If we only reported the full-set average, it would look underwhelming. The truth is both at once.

This Polygraf slice is clean. Nineteen of the 24 AI drafts scored zero and produced no revision prompt at all. That tells us something important about WROITER’s current boundary: the prompt is strongest when the detector already has a concrete hook, and much weaker as a universal editor for clean assistant prose.

What this means for product claims. WROITER can now say, with evidence, that its prompt removes the specific patterns it names. It cannot yet say that every AI-assisted draft will trigger an equally rich prompt. This benchmark keeps those two claims separate.
05

What the prompt fixed

The actionable drafts were modest, not catastrophic. Four were classic stock-phrase cleanups and one was assistant scaffolding. That is exactly the kind of narrow, surgical edit WROITER is supposed to guide.

  • Stock phrasing removed (BANNED_PHRASES): 4 drafts
  • Assistant framing removed (ANSWER_SCAFFOLDING): 1 draft
Rewrite bucket: four one-line cleanups
The rewrite bucket carried most of the action. A phrase like “The reality is” or “not just any job, but” was enough to trigger WROITER, and the revised drafts dropped from a mean of 10.0 to 0.0 after one targeted pass.
Blog paragraph bucket: one clean save
The blog sample about gastrin in chickens fired on “plays a crucial role.” Replacing that canned phrase with a more direct claim was enough to clear the score without changing the substance of the paragraph.
06

What it did not fix

The weak seam is not the revised subset. The weak seam is everything that never generated a prompt.

Summary and simplify/explain drafts were especially quiet in this run. That silence is useful feedback. It tells us the next round of detector work should focus less on “can the prompt remove a named phrase?” and more on “what signals are still missing from clean contextual assistant prose?”

That is also why the benchmark is frozen. When we improve those detectors later, we will rerun the same 32 texts and publish the before-and-after change instead of finding a new set that flatters the update.
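With a frozen corpus, a rerun reduces to a per-text diff against the stored baseline. A minimal sketch, assuming each run is stored as a mapping from a stable text id to its detector score (the storage shape is an assumption, not the actual pipeline):

```python
def diff_runs(baseline, rerun):
    """Compare a new detector run against the frozen baseline, per text id.

    baseline and rerun map text id -> detector score. Returns only the
    ids whose score changed, so the published before-and-after delta
    covers exactly the texts the update actually moved.
    """
    assert baseline.keys() == rerun.keys(), "frozen corpus must not change"
    return {tid: (baseline[tid], rerun[tid])
            for tid in baseline
            if baseline[tid] != rerun[tid]}
```

For example, `diff_runs({"ai_01": 0, "hum_01": 0}, {"ai_01": 4, "hum_01": 0})` reports only `{"ai_01": (0, 4)}`, and the assertion refuses to compare runs over different text sets, which is the freezing guarantee in code form.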

07

Benchmarking discipline

This page is only useful because the benchmark is constrained. The source texts are frozen. The revisions are frozen separately. The human controls are left visible. The misses stay in the published number.

That discipline is part of the broader calibration method. We do not quietly retune and then swap to an easier set. We do not report the actionability win without also reporting the 20.8% prompt coverage. And we do not collapse detector benchmarks and revision benchmarks into one mushy claim.

If you want the policy version rather than the story version, the calibration page now has a dedicated section on benchmarking discipline.

08

The honest boundary

Revision Benchmark v1 is a good result for the prompt and a sobering result for prompt coverage.

When WROITER produces a concrete revision prompt, the prompt is doing its job. The revised drafts in this run stayed faithful, stayed short, and cleared the current detector. But most clean Polygraf drafts still produced no prompt at all. That is not a reason to hide the benchmark. It is the reason to keep publishing it.

Where to go next. If you want the mechanics, read How It Works. If you want the failure modes, read Limitations. If you want to see the current prompt on your own text, run the diagnostic.