How WROITER calibrates the diagnostic
This page exists so nobody has to guess what the method was tuned on. It names the books, essay collections, journalism articles, Reddit posts, and model datasets used in calibration, explains the sampling rules, and publishes the metric history that changed the engine.
Short answer: which books did we use?
The human literary and essay slice of the calibration corpus was drawn from public-domain long-form writing. The books and essay collections used were:
- Pride and Prejudice by Jane Austen.
- Jane Eyre by Charlotte Bronte.
- Moby-Dick; or, The Whale by Herman Melville.
- The Adventures of Sherlock Holmes by Arthur Conan Doyle.
- Essays — First Series by Ralph Waldo Emerson.
- Essays — Second Series by Ralph Waldo Emerson.
- On the Duty of Civil Disobedience by Henry David Thoreau.
We did not score whole books as single examples. We extracted bounded passage windows so human and AI samples were compared at roughly similar lengths.
The calibration corpus at a glance
| Slice | Count | Notes |
|---|---|---|
| Human literature | 12 | Three passages each from Austen, Bronte, Melville, and Doyle. |
| Human essays | 10 | Four Emerson passages from each essay collection, plus two Thoreau passages. |
| Human journalism | 10 | One passage each from ten Wikinews articles. |
| Human Reddit | 8 | Eight English long-form posts from Reddit TIFU. |
| AI: GPT-4 | 12 | ShareGPT-style outputs from shibing624/sharegpt_gpt4. |
| AI: Claude Sonnet 3.5 | 12 | ShareGPT-style outputs from ChaoticNeutrals/Gryphe-Claude_Sonnet-3.5-SlimOrca-140k-ShareGPT. |
| AI: Gemini 2.0 Flash Experimental | 12 | Instruction-output rows from PJMixers-Dev/allenai_WildChat-1M-gemini-2.0-flash-exp-ShareGPT. |
| Total | 76 | 40 human / 36 AI |
The current split is a deterministic hashed holdout, stratified by human bucket and AI model family: 56 train texts and 20 holdout texts.
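The production harness is not published on this page, so the following is only a minimal sketch of what a deterministic, hashed, stratified split can look like. The function names, the SHA-256 choice, and the per-stratum rounding rule are all assumptions, not the shipped code:

```python
import hashlib

def stable_key(text_id):
    """SHA-256 of the text ID gives a reproducible pseudo-random order."""
    return int(hashlib.sha256(text_id.encode("utf-8")).hexdigest(), 16)

def stratified_holdout(ids_by_stratum, holdout_fraction=20 / 76):
    """Deterministic, stratified train/holdout assignment.

    Each stratum (a human bucket or an AI model family) is sorted by its
    hash order and contributes its own share of holdout texts, so every
    stratum is represented and the split only changes if the corpus does.
    """
    split = {}
    for stratum, ids in ids_by_stratum.items():
        ordered = sorted(ids, key=stable_key)
        n_holdout = max(1, round(len(ids) * holdout_fraction))
        for i, text_id in enumerate(ordered):
            split[text_id] = "holdout" if i < n_holdout else "train"
    return split
```

With the published bucket sizes (12, 10, 10, 8 human texts and 12 per AI family), per-stratum rounding of a 20/76 holdout fraction happens to yield exactly 20 holdout and 56 train texts, matching the published split sizes.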
Human source list
The point of the human corpus was not to reward one genre. It was to include polished literary prose, reflective essays, straight reporting, and messy internet writing.
Books and essay collections
- Jane Austen: Pride and Prejudice (3 passages).
- Charlotte Bronte: Jane Eyre (3 passages).
- Herman Melville: Moby-Dick; or, The Whale (3 passages).
- Arthur Conan Doyle: The Adventures of Sherlock Holmes (3 passages).
- Ralph Waldo Emerson: Essays — First Series (4 passages).
- Ralph Waldo Emerson: Essays — Second Series (4 passages).
- Henry David Thoreau: On the Duty of Civil Disobedience (2 passages).
Journalism slice: the ten Wikinews titles
- Musique Libre Femmes plays for Women's History Month at Tompkins Square Library in New York City.
- South Korean military halts small firearms drills after stray bullet strikes young girl on playground.
- "One Battle After Another" wins 6 Oscars including Best Picture.
- Caloundra win round 3 game in Australian soccer's Football Queensland Premier League 2 competition.
- Boats carrying Kyoto students capsize off Okinawa, fatalities reported.
- Australian Minister for Energy lowers fuel standards for 60 days amid rising oil prices, panic buying.
- US president Donald Trump demands the unconditional surrender of Iran.
- NSW government proposes rezoning to create residences midway between Sydney and Parramatta.
- PM of Australia sends E-7A Wedgetail and air-to-air missiles to UAE.
- Former Major League Baseball pitcher Julio Teherán retires.
Reddit slice: the eight TIFU post titles
- getting rotten robin egg juices on my hand.
- getting a friend fired from work. at least it was due to her own actions, but still.
- being in a crowded bus.
- peeing in a gatorade bottle.
- getting caught with gf.
- showing my two friends and coworkers my thigh tattoo in the bathroom at work.
- complaining about an autistic boy at christmas eve.
- trying to run back up on escalators.
AI source list
The AI side was intentionally mixed across model families and prompt styles. We did not want a detector that only catches one model's favorite phrases.
- GPT-4: 12 samples from `shibing624/sharegpt_gpt4`.
- Claude Sonnet 3.5: 12 samples from `ChaoticNeutrals/Gryphe-Claude_Sonnet-3.5-SlimOrca-140k-ShareGPT`.
- Gemini 2.0 Flash Experimental: 12 samples from `PJMixers-Dev/allenai_WildChat-1M-gemini-2.0-flash-exp-ShareGPT`.
Allowed prompt styles in the calibration harness were rewrite, summary, analysis, explanation, advice, and creative. The harness capped each model family at four samples per detected prompt style so one style could not dominate the corpus.
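The per-style cap is easy to state precisely in code. This is an illustrative sketch of that rule as it might be applied to one model family's candidate stream; the function and variable names are assumptions, not the harness internals:

```python
from collections import defaultdict

ALLOWED_STYLES = {"rewrite", "summary", "analysis",
                  "explanation", "advice", "creative"}
MAX_PER_STYLE = 4

def cap_by_style(samples):
    """Keep at most MAX_PER_STYLE samples per detected prompt style.

    `samples` is an iterable of (style, text) pairs for one model family;
    styles outside the allowed set are dropped entirely.
    """
    kept, counts = [], defaultdict(int)
    for style, text in samples:
        if style not in ALLOWED_STYLES:
            continue
        if counts[style] < MAX_PER_STYLE:
            counts[style] += 1
            kept.append((style, text))
    return kept
```

The effect is that no single prompt style can contribute more than four of a family's twelve samples, which is what keeps one style from dominating the corpus.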
Selection and normalization rules
The corpus is small enough to inspect but strict enough to be reproducible. The rules were encoded in the calibration harness, not hand-waved after the fact.
- Project Gutenberg headers and trailers were stripped before passage extraction.
- Human long-form passages were built from paragraph windows, usually landing around 140–320 words.
- AI responses were trimmed into 120–420 word windows so they could be compared against human passages at similar scale.
- Only English long-form samples were kept. The harness filtered by word count, ASCII letter ratio, and stopword ratio.
- Code-heavy outputs were rejected. Any response containing fenced code blocks was skipped.
- The split is deterministic. A hashed assignment keeps the same train and holdout sets unless the corpus itself changes.
- The holdout is stratified by human bucket and AI model family, so literature, essays, journalism, Reddit, GPT-4, Claude, and Gemini are all represented.
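The filtering rules above can be sketched as a single predicate. The word band comes from the page; the exact ASCII-ratio and stopword-ratio thresholds, the stopword list, and the function name are illustrative assumptions:

```python
import re

# Rough stopword list for the ratio check; the production list is not
# published, so this set is illustrative only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "was", "for", "on", "with", "as", "at", "by", "this"}

# Built from chr(96) so this sketch contains no literal fence marker.
CODE_FENCE = chr(96) * 3

def passes_filters(text,
                   min_words=120,
                   max_words=420,
                   min_ascii_ratio=0.9,
                   min_stopword_ratio=0.15):
    """English long-form filter: length band, ASCII letter ratio,
    stopword ratio, and rejection of fenced code blocks."""
    if CODE_FENCE in text:          # code-heavy outputs are rejected outright
        return False
    words = re.findall(r"[a-z']+", text.lower())
    if not (min_words <= len(words) <= max_words):
        return False
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isascii() for c in letters) / len(letters) < min_ascii_ratio:
        return False
    return sum(w in STOPWORDS for w in words) / len(words) >= min_stopword_ratio
```

A sample fails fast on any fenced code, then on length, then on the two ratio checks, so rejection reasons are cheap to log per rule.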
Benchmarking discipline
Benchmarks are only useful if they make it harder to lie to ourselves. WROITER treats them as a guardrail, not a marketing prop.
- We freeze source corpora before scoring them, so later detector changes rerun against the same texts instead of a moving target.
- We freeze revision outputs separately from source texts, so prompt changes and detector changes do not quietly overwrite each other.
- We separate tuning from evaluation whenever possible. If train, holdout, and external benchmarks disagree, we publish that disagreement.
- We keep human controls in the loop. A gain that only comes from flagging more humans does not count as a win.
- We report misses, weak seams, and awkward numbers alongside the stronger ones. The 59.8% human false-positive rate on the Ghostbuster essay stress test stayed public for that reason.
- We version every published claim by detector version, benchmark corpus, and run ID. WROITER does not get timeless accuracy slogans.
You can see those rules play out in the external benchmark pages, the Ghostbuster stress test, and the revision benchmark, which distinguishes between drafts that generated actionable prompts and drafts that stayed untouched because the engine found nothing to say.
What the research base looked like before corpus tuning
The rules were not invented from vibes. The detector work started from a 50-pattern research memo assembled from detector documentation, stylometry papers, Wikipedia's "Signs of AI writing" page, editor complaints, and practitioner notes. Then those candidate rules were cut against real text. If a pattern sounded plausible in theory but created obvious human false positives, it got softened or demoted.
The three fragilities known up front were:
- `BANNED_WORDS`: common English words such as "landscape" and "robust" cannot carry much weight alone.
- `OPENER_REPETITION`: rhetorical anaphora can look machine-like if the rule is too blunt.
- `UNIFORM_RHYTHM`: some literary prose is deliberately even or stylized.
What changed during calibration
The detector began as a 22-signal rule engine. The current public engine is 28 signals. Five passes matter most.
First pass: v1.2.1 on 2026-04-05
- Softened `BANNED_WORDS` so common English items could not dominate by themselves.
- Tightened `OPENER_REPETITION` so rhetorical anaphora and normal prose reuse were less likely to trigger.
- Made `UNIFORM_RHYTHM` dialogue-aware and stricter.
- Tightened `PASSIVE_OVERUSE`.
- Added co-occurrence-aware scoring so one isolated weak flag no longer carried the same weight as multiple agreeing signals.
- Moved the practical operating threshold from the legacy default of 45 down to 8.
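The co-occurrence rule is described but not published. One minimal way to express "one isolated weak flag carries less weight than multiple agreeing signals" looks like this; the discount factor and function name are assumptions, not the production scoring:

```python
def combined_score(flags, isolated_discount=0.5):
    """Co-occurrence-aware scoring sketch.

    `flags` maps detector name -> raw points for detectors that fired.
    A single firing detector is discounted; two or more agreeing
    detectors count at full weight.
    """
    if not flags:
        return 0.0
    total = sum(flags.values())
    if len(flags) == 1:
        return total * isolated_discount   # one isolated flag carries less
    return float(total)
```

Under a medium threshold of 8, a lone detector worth 10 raw points would stay below the line (10 × 0.5 = 5), while two agreeing detectors worth 5 points each would cross it (5 + 5 = 10).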
Second pass: v1.3.0 on 2026-04-06
- Added deterministic train/holdout evaluation instead of tuning and scoring on the same set only.
- Added three new detectors aimed at clean assistant prose: `ANSWER_SCAFFOLDING`, `OUTLINE_LIST_FORMAT`, and `LABELED_LIST_FORMAT`.
- Left the public medium threshold at 8 because the holdout result at 8 matched the train-tuned threshold of 2, and 8 is the more conservative production cutoff.
Third pass: v1.3.1 on 2026-04-07
- Added one narrow detector, `ESSAY_THESIS_ANNOUNCEMENT`, for the rigid school-essay template `Introduction: ... This essay will ...`.
- Accepted the change only after external benchmark work showed it improved the essay stress test without moving the internal holdout at all.
- Kept the public medium threshold at 8. This pass changed pattern coverage, not threshold policy.
Fourth pass: v1.3.2 on 2026-04-07
- Added `SEGMENTED_EXPOSITORY_BLOCKS`, a cautious structural detector for essays assembled from three or more medium-length exposition paragraphs.
- Accepted the change only after miss-mining showed it rescued repeated Good AI and Ghostbuster misses while staying silent on the internal holdout and on Ghostbuster human essays.
- Kept the public medium threshold at 8. This pass improved external coverage without changing threshold policy or the core mixed-corpus guardrail.
Fifth pass: v1.3.3 on 2026-04-07
- Added `SECTION_LABEL_SCAFFOLDING`, a cautious detector for assignment-style inline headings such as `Introduction:`, `Conclusion:`, and `Paragraph 1:`.
- Accepted the change only after it stayed silent on the internal holdout, improved the Ghostbuster essay stress test again, and produced only one non-threshold-crossing human hit at score 4 in Ghostbuster.
- Kept the public medium threshold at 8. This pass improved essay-structure coverage without changing threshold policy.
Metric history: then and now
These rows are intentionally labeled by evaluation regime. Whole-corpus snapshots are useful for internal iteration. Holdout snapshots are stronger evidence. They are not the same thing.
| Date | Version / regime | Threshold | Precision | Recall | F1 | Human FP rate | What changed |
|---|---|---|---|---|---|---|---|
| 2026-04-05 | Legacy threshold on full 76-text corpus | 45 | 100.0% | 2.8% | 0.054 | 0.0% | The old cutoff missed almost every AI sample. |
| 2026-04-05 | v1.2.1 first calibrated pass on full 76-text corpus | 8 | 91.7% | 30.6% | 0.458 | 2.5% | Threshold tuning plus softer fragile rules and co-occurrence scoring. |
| 2026-04-06 | v1.3.0 current engine on full 76-text corpus | 8 | 95.2% | 55.6% | 0.702 | 2.5% | New clean-assistant detectors raised coverage without adding new human false positives. |
| 2026-04-06 | v1.3.0 holdout validation | 8 | 100.0% | 77.8% | 0.875 | 0.0% | 20-text holdout, unseen during train threshold selection. |
| 2026-04-07 | v1.3.1 current engine on full 76-text corpus | 8 | 95.2% | 55.6% | 0.702 | 2.5% | Narrow headed-thesis detector added; core calibration metrics stayed unchanged. |
| 2026-04-07 | v1.3.1 Ghostbuster essay stress test | 8 | 89.7% | 87.2% | 0.885 | 59.8% | External hostile essay benchmark used to validate the new headed-thesis rule. |
| 2026-04-07 | v1.3.2 current engine on full 76-text corpus | 8 | 95.2% | 55.6% | 0.702 | 2.5% | Segmented expository detector added; core calibration metrics stayed unchanged. |
| 2026-04-07 | v1.3.2 Ghostbuster essay stress test | 8 | 90.1% | 91.1% | 0.906 | 59.8% | Structural essay-block detector improved external recall without worsening the internal holdout. |
| 2026-04-07 | v1.3.3 current engine on full 76-text corpus | 8 | 95.2% | 55.6% | 0.702 | 2.5% | Section-label detector added; core calibration metrics stayed unchanged. |
| 2026-04-07 | v1.3.3 Ghostbuster essay stress test | 8 | 90.2% | 91.7% | 0.909 | 59.8% | Assignment-style section-label detector improved essay recall again without adding a new threshold-crossing human false positive. |
The train-tuned threshold on the latest pass was 2, but the holdout result at 2 was identical to the holdout result at 8. That is why the public threshold stayed at 8: no recall was lost on holdout, and the higher cutoff is the more cautious production choice.
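The table rows reduce to simple confusion-matrix arithmetic. As a sketch (the helper below is illustrative, not harness code), the published legacy row follows from 1 of 36 AI texts caught and 0 of 40 humans flagged, and the v1.2.1 row from 11 of 36 caught with 1 human flagged:

```python
def metrics(tp, fp, fn, human_total):
    """Precision, recall, F1, and human false-positive rate from counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fp_rate = fp / human_total if human_total else 0.0
    return precision, recall, f1, fp_rate
```

`metrics(tp=1, fp=0, fn=35, human_total=40)` reproduces the legacy row (100.0% precision, 2.8% recall, F1 0.054, 0.0% human FP rate), and `metrics(tp=11, fp=1, fn=25, human_total=40)` reproduces the v1.2.1 row to the published rounding.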
External stress tests after the core calibration pass
The 76-text calibration corpus is the production guardrail. It is not the only benchmark we use. After the main calibration pass we started freezing external challenge sets so changes can be judged against sources we did not hand-pick.
- The Good AI examples library: 168 AI-only examples across 17 categories. At threshold 8, WROITER still captures 89.3% of them under v1.3.3, while the average benchmark score has risen to 67.9 because assignment-style section labels are now explained explicitly on 20 examples.
- Ghostbuster essay stress test: 694 human essays and 4,164 AI essays from `polsci/ghostbuster-essay-cleaned`. This is not a production-threshold target. It is a deliberately hostile essay-domain benchmark. At threshold 8, WROITER now scores 91.7% recall with a 59.8% human false-positive rate in that domain.
That second benchmark is why we still did not promote isolated `META_OUTRO` just to catch more Good AI essays. In the current Ghostbuster stress test, that tweak would raise recall to 96.9% but also raise the human false-positive rate from 59.8% to 64.4%. That is a worse trade-off, not a better one.
The post-benchmark changes we did accept were all narrower than that. First came the headed-thesis intro detector for the rigid `Introduction: This essay will ...` template. Then came `SEGMENTED_EXPOSITORY_BLOCKS`, which catches essays written as tidy stacks of medium-length explanatory paragraphs. The most recent addition, `SECTION_LABEL_SCAFFOLDING`, catches assignment-style headings such as `Introduction:` and `Conclusion:`. Together they lifted the external benchmarks while leaving the internal holdout unchanged.
Which detectors false-positive most on human text?
Right now the answer is simple: `PASSIVE_OVERUSE`. On the 40 human texts in the calibration corpus it triggered one false positive, which translates to a 2.5% human false-positive rate for that detector. That one example came from Emerson.
The six template-oriented detectors added across v1.3.0 through v1.3.3 produced zero human false positives on the current 40-text calibration slice. That does not prove they are universally safe. It does mean they earned their place on the corpus we actually tested.
What still gets through
Two holdout misses remain in the current pass, and both are important because they show the boundary of rule-based detection:
- A clean Claude summary rewrite that reads like polished news copy and triggers no current rule.
- A Gemini creative story that reads like competent genre fiction without obvious assistant scaffolding.
That is exactly why WROITER publishes method notes and limitation notes next to the tool. A low score is not proof of human authorship. It means the sample did not trip enough of the rules the engine currently knows how to explain.
Why publish this log at all?
Because a detector without a calibration record is asking for trust it has not earned. We would rather show the books, show the datasets, show the thresholds, show the misses, and let people judge the method with open eyes. If you want the broader reliability argument around detector claims, read Do AI Detectors Work?. If you want examples of where detectors embarrass themselves on human writing, read the False Positive Hall of Fame.
We also publish external benchmarks when we can freeze a third-party source set. The first case study runs all 168 examples from The Good AI library through the live detector and logs where WROITER catches them, where it misses, and why we declined the most obvious overfit. Read it here: We Ran 168 Good AI Examples Through WROITER.