Engine audit · v1.3.4

We audited our own detector. Here is everything we found.

WROITER’s detector engine was largely built with Codex. The author did not fully trust it. So we ran a structural audit, one detector at a time, with every finding cited against a spec line and a code line. This page publishes the result.

Short answer: three real bugs in 28 detectors

The audit covered all 28 detectors in the v1.3.4 engine, plus the scoring algorithm, the output format, the input validation, and the calibration harness. It produced 32 spec verification findings, 16 fresh-review concerns, and 9 coverage gaps. Fifty-seven items in total.

Of those 57 items, three are real code bugs. All three are small regex changes. Together they fit in a single follow-on patch, shipped as v1.3.5. No architecture changes. No rewrites. No detector removed or added.

The dominant pattern across the rest of the audit was the opposite of what the author expected. The engine had quietly grown through v1.1, v1.2, and v1.3 detector additions that were never backfilled into the specification text. The documentation was promising less than the engine delivered. The trust gap was in the spec, not in the code.

The audit at a glance

| Pass | Question | Output | Findings |
| --- | --- | --- | --- |
| Pass 1 | Does the implementation match the specification? | 32 findings · 14 exact / 17 partial / 1 undetermined | 1 code bug, 11 spec updates, 14 closed, 6 to discuss |
| Pass 2 | What concerns surface on a fresh code review? | 16 concerns · 2 high / 6 medium / 8 low | 2 more code bugs, plus architectural notes |
| Pass 3 | Which patterns should arguably be caught but are not? | 9 coverage gaps · 3 high / 5 medium / 1 low | Input for the next round of detector design |

Each pass produced its own structured artifact. They are published in the repository at /audit/v1.3.4/ as JSON, with a human-readable decisions log next to them.

The three bugs

Each bug is named by the original audit finding ID (e.g., v015 for Pass 1 finding 15, fr001 for Pass 2 fresh-review concern 1). The fixes shipped together in v1.3.5.

1. Section-label detector matched mid-sentence colons

Finding: v015 (Pass 1) / fr002 (Pass 2 verification).

The SECTION_LABEL_SCAFFOLDING detector is supposed to catch assignment-style headings like Introduction: or Conclusion: at the start of a paragraph. The regex used a word boundary instead of a line boundary. So a sentence like “the cardiology team reached a similar conclusion: patients with jugular compression markers were undercounted” would fire the detector mid-sentence, flagging real prose as scaffolding.

Fix: change the regex from \b(?:Introduction|…)\s*: to (?:^|\n)\s*(?:Introduction|…)\s*:. Verified empirically: 7 true positives still match, 4 mid-sentence false positives are now correctly rejected. 11 of 11 cases correct.
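The anchoring change can be checked in a few lines. What follows is a minimal Python sketch, not the engine's code. The label alternation is elided above, so the sketch uses only the two labels this page names (Introduction, Conclusion). Case-insensitive matching is an assumption, made so the lowercase “conclusion:” example reproduces.

```python
import re

# Only the two labels named in the text; the engine's full list is not reproduced here.
LABELS = r"(?:Introduction|Conclusion)"

# Assumption: matching is case-insensitive, so lowercase "conclusion:" is in scope.
OLD = re.compile(r"\b" + LABELS + r"\s*:", re.IGNORECASE)
NEW = re.compile(r"(?:^|\n)\s*" + LABELS + r"\s*:", re.IGNORECASE)

mid_sentence = (
    "the cardiology team reached a similar conclusion: patients with "
    "jugular compression markers were undercounted"
)
heading_first = "Introduction: this essay covers three topics."
heading_later = "Some body text.\n\nConclusion: the results hold."


def fires(pattern, text):
    """Return True when the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


for name, text in [("mid-sentence", mid_sentence),
                   ("first heading", heading_first),
                   ("later heading", heading_later)]:
    print(f"{name}: old={fires(OLD, text)} new={fires(NEW, text)}")
```

The word-boundary version fires on all three samples. The line-anchored version fires only on the two real headings.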

2. List detector silently ignored Unicode bullets

Finding: fr001 (Pass 2).

The OUTLINE_LIST_FORMAT and LABELED_LIST_FORMAT detectors match bullet markers. The regex was [*-]. That covers ASCII asterisk and hyphen. It does not cover the Unicode bullet character • that almost every paste source actually uses — Word documents, Google Docs, Slack, Discord, the ChatGPT and Claude web UIs when you copy a list out of them.

A 25-bullet Markdown sample with Unicode bullets reported zero bullet markers. The detector was silently dropping the most common real-world bullet character.

Fix: expand the character class to [*\-•●◦∙·–—]. Seven more glyphs: bullet, black circle, white bullet, bullet operator, middle dot, en-dash, em-dash. Same fix applied to both detectors.
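A short Python sketch of the before-and-after count, under stated assumptions. The two character classes come from this page. The line-start anchor, the optional indentation, and the helper name count_bullets are illustrative and may not match how the detectors actually wrap the class.

```python
import re

# Assumption: a bullet marker sits at the start of a line, after optional indentation.
OLD_BULLET = re.compile(r"^[ \t]*[*-]\s+", re.MULTILINE)
NEW_BULLET = re.compile(r"^[ \t]*[*\-•●◦∙·–—]\s+", re.MULTILINE)


def count_bullets(pattern, text):
    """Count lines that start with a bullet marker the pattern recognizes."""
    return len(pattern.findall(text))


# 25 lines that each start with the Unicode bullet, like a list pasted from a doc editor.
unicode_sample = "\n".join(f"• item {i}" for i in range(1, 26))
ascii_sample = "- first\n* second"

print(count_bullets(OLD_BULLET, unicode_sample))  # ASCII-only class
print(count_bullets(NEW_BULLET, unicode_sample))  # expanded class
```

One trade-off worth noting: the en-dash and em-dash entries also match lines of dialogue or asides that open with a dash, so a detector built on this class probably wants a minimum run of consecutive bullet lines before it fires.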

3. Copula detector did not allow line breaks

Finding: fr013 (Pass 2).

COPULA_AVOIDANCE catches phrases like “serves as” and “stands as”: the AI habit of replacing a simple “is” with a fancier two-word substitute. The regex used a literal space between the verb and “as”. Line-wrapped text where the break landed between the two words was missed entirely.

Fix: replace the literal space with \s+. Trivial. Now covered by a regression fixture in v1.3.5.
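The same change as a runnable Python sketch. Only the two verbs named above are included. The engine's actual verb list is not shown on this page, so treat the alternation as a stand-in.

```python
import re

# Only the two verbs named in the text; the engine's verb list is not reproduced here.
OLD_COPULA = re.compile(r"\b(?:serves|stands) as\b")
NEW_COPULA = re.compile(r"\b(?:serves|stands)\s+as\b")

inline = "The old library serves as a meeting place for the town."
wrapped = "The old library serves\nas a meeting place for the town."

for name, text in [("inline", inline), ("wrapped", wrapped)]:
    old_hit = OLD_COPULA.search(text) is not None
    new_hit = NEW_COPULA.search(text) is not None
    print(f"{name}: old={old_hit} new={new_hit}")
```

Both patterns match the inline sentence. Only the \s+ version matches when the line break falls between the verb and “as”.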

That is the complete bug list. There were no others.

The dominant pattern: the spec was the laggard

17 of the 32 Pass 1 findings were partial matches between specification and implementation. Eleven of those 17 partials pointed in the same direction: the implementation catches more patterns than the specification documents.

| Detector | Spec lists | Engine catches |
| --- | --- | --- |
| HEDGING | 8 hedge patterns | 11+ pattern groups (adds “it appears that”, “in some ways”, “it’s worth considering”) |
| META_INTRO | 4 example pattern classes | Strict superset within each class (6 article types, 7 verbs, broader contractions) |
| META_OUTRO | 5 example phrases | Strict superset (adds “closing”, “summarize”, “conclude”, “main takeaway”, “key point is”) |
| WEASEL_ATTRIBUTION | 6 attribution patterns | 9+ pattern groups (adds “according to”, “researchers have found”, etc.) |
| ASSISTANT_PERSONA | 8 chatbot phrases | Strict superset across each phrase class (warmer variants, more verbs) |
| BANNED_PHRASES | 27 stock phrase patterns | 30 patterns: same examples plus three undeclared additions |

The story is mechanical. The detector engine grew through v1.1.0, v1.2.0, and v1.3.0 detector additions. Each addition extended the surface coverage. The specification text was written at v1.0 and never refreshed. So the spec promised one set of patterns; the engine had been catching a strictly larger set for months.

This is the opposite of how AI detector vendors typically fail. The pattern in the broader industry is over-claiming: publishing accuracy claims that the engine does not actually meet. WROITER’s version was under-claiming, in the spec doc, while the engine quietly did more. Both are dishonest, in opposite directions. Only the second, under-claiming, is fixable by writing more carefully. The Pattern Profile Specification rewrite will absorb all 11 spec-update items and rewrite the affected sections to enumerate what the engine actually catches.

The five named concerns — resolved

The audit specification named five specific concerns up front. Every one has a definitive answer.

  • PIVOT_CRUTCH variants. The detector catches “it’s not just X but Y” and a few close cousins. The inverted form “it’s Y, not X” makes the same contrast with the clauses swapped and is not yet caught. Logged as a coverage gap for the next detector pass.
  • OUTLINE_LIST_FORMAT bullet counting. Root cause: Unicode bullets. Reproduced empirically. Fixed in v1.3.5.
  • SECTION_LABEL_SCAFFOLDING anchoring. Confirmed bug. Fix verified empirically (11 of 11 cases correct). Fixed in v1.3.5.
  • BANNED_PHRASES fuzzy matching. The detector matches exact literal phrases like “it is important to note”. Variants like “it is important to remember” or “it is important to be measured here” are not caught. Logged as a coverage gap for the next detector pass.
  • Sentence and paragraph splitting consistency. Sentence splitting is centralized in one helper. Paragraph splitting is inlined at multiple call sites. The revision-prompt engine will need a shared utility for both. Queued.
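The BANNED_PHRASES gap above is easy to reproduce. The Python sketch below contrasts an exact literal match with one possible family pattern. The family pattern is hypothetical, written only to show the shape of the gap. It is not the planned detector design.

```python
import re

# Exact literal phrase, as described for the current detector.
LITERAL = re.compile(r"\bit is important to note\b", re.IGNORECASE)

# Hypothetical family pattern: any single word after "it is important to".
FAMILY = re.compile(r"\bit is important to \w+", re.IGNORECASE)

samples = [
    "It is important to note that results vary.",
    "It is important to remember the baseline.",
    "It is important to be measured here.",
]

literal_hits = [bool(LITERAL.search(s)) for s in samples]
family_hits = [bool(FAMILY.search(s)) for s in samples]

for sample, lit, fam in zip(samples, literal_hits, family_hits):
    print(f"literal={lit} family={fam} :: {sample}")
```

A pattern this broad would also flag ordinary sentences such as “it is important to vaccinate early”, so any real expansion would need its own false-positive check before shipping.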

What the audit did not find

This part matters as much as the bug list.

  • No detector was wrong on its severity. Every detector’s severity label matches the spec.
  • No detector was missing from the engine. All 28 declared detectors are present and reachable.
  • The scoring algorithm matches the spec exactly. Severity weights (22 / 12 / 6), per-flag contribution capping at 4, isolation adjustments at 0.35 / 0.55 / 0.7, co-occurrence bonus weights of 8 / 6 / 3 for direct leaks, family diversity, and flag count — all match the spec verbatim.
  • The calibration harness is structurally sound. All four properties the audit demanded — deterministic hashing, stratified split, fixed train/holdout ratio, layered quality filters — are implemented and correct.
  • Input handling is sound. Type validation and minimum word count both work. Boundary cases (empty string, whitespace, very large input, malformed types) all fail loudly with explicit errors.
  • No rewrite was recommended anywhere. The audit’s bar for “rewrite this module” was “the existing module is unfixable in place.” Nothing cleared that bar.
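For readers unfamiliar with the calibration properties listed above, here is a minimal Python sketch of how deterministic hashing, a stratified split, and a fixed train/holdout ratio can fit together. It is not the harness's code. The SHA-256 key, the salt string, the 0.2 ratio, and the function names are all assumptions, and the quality filters are left out.

```python
import hashlib
from collections import defaultdict


def stable_key(doc_id, salt="calibration-sketch"):
    """Deterministic sort key: the same id always hashes to the same value."""
    return hashlib.sha256(f"{salt}:{doc_id}".encode("utf-8")).hexdigest()


def stratified_split(docs, holdout_ratio=0.2):
    """Split (doc_id, stratum) pairs into train and holdout id lists.

    Each stratum is split separately at the same ratio, and ordering within a
    stratum comes from the hash, so input order does not change the result.
    """
    by_stratum = defaultdict(list)
    for doc_id, stratum in docs:
        by_stratum[stratum].append(doc_id)

    train, holdout = [], []
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum], key=stable_key)
        n_holdout = round(len(ids) * holdout_ratio)
        holdout.extend(ids[:n_holdout])
        train.extend(ids[n_holdout:])
    return train, holdout


# Hypothetical corpus: ten documents in each of two strata.
docs = [(f"h{i}", "human") for i in range(10)] + [(f"a{i}", "ai") for i in range(10)]
train, holdout = stratified_split(docs)
print(len(train), len(holdout))
```

Because membership depends only on the hash of the document id, re-running the split on a shuffled copy of the corpus returns the same two lists.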

What is still queued

Several findings became their own follow-ons rather than landing inside this audit. They are tracked separately so the audit’s record stays clean.

  • Pattern Profile Specification rewrite. Absorbs all 11 spec-update items from Pass 1 plus the related spec-direction resolutions from the discussion items. This is part of the broader v2.0 specification rename. Separate writing project.
  • PASSIVE_OVERUSE participle list expansion. The current regex enumerates 26 irregular past participles and misses common ones like broken, eaten, fallen, frozen, stolen, drunk. Separate small code fix, queued.
  • Sentence-splitter abbreviation handling. The current splitter treats “Dr. Smith left.” as two sentences and drops the “Dr.” fragment. Affects sentence count and length statistics. Separate code fix, queued.
  • Per-detector test coverage expansion. Today’s test suite covers about seven detectors directly. The audit recommends expanding to roughly 56 fixture tests (positive and negative per detector). Separate project.
  • v1.4 detector improvement project. The nine coverage gaps from Pass 3 become the input for the next round of detector design. The highest-priority items are a new em-dash overuse detector, the PIVOT_CRUTCH inverted form, and the BANNED_PHRASES “It is important to [any]” family expansion.
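The sentence-splitter item above can be illustrated with a small Python sketch. The engine's splitter is not shown on this page, so naive_split here is a stand-in that reproduces the “two sentences” half of the problem, not the dropped fragment. The abbreviation list and the merge approach are assumptions, one of several ways the queued fix could go.

```python
import re

# Hypothetical abbreviation list; a real fix would need a longer, tested set.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof."}


def naive_split(text):
    """Split after ., ! or ? followed by whitespace. Breaks on 'Dr. Smith'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_split(text):
    """Rejoin any piece that ends in a known abbreviation with the next piece."""
    sentences = []
    carry = ""
    for piece in naive_split(text):
        candidate = f"{carry} {piece}" if carry else piece
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            carry = candidate  # not a real sentence end; keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


text = "Dr. Smith left. He was tired."
print(naive_split(text))
print(abbreviation_aware_split(text))
```

The naive version yields three pieces for a two-sentence input, which inflates sentence count and drags down average sentence length. The merge step restores the expected two.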

Read the raw findings yourself

The whole audit lives in the public repository. The artifacts are JSON; the decisions log is Markdown; the summary is Markdown. Nothing was abridged for this page.

  • pass-1-verification-log.json — the 32 detector and system-level verification findings, each citing a spec line and a code line.
  • pass-2-fresh-review.json — the 16 fresh-review concerns, each with empirical reproduction where possible.
  • pass-3-coverage-gaps.json — the nine coverage gaps and the priority order for the next detector design pass.
  • decisions.md — the recommended resolution for every finding, plus the blanket-confirmation record.
  • SUMMARY.md — the top-level recap and the by-the-numbers tables.

Why publish this at all?

Because every AI detector vendor on the market does this kind of audit somewhere internally and chooses not to publish it. The score-vendor narrative depends on the audit staying private. Once the receipts are out, the “99% accurate” lines do not survive contact with the actual findings.

WROITER would rather show the work. The Codex-built engine was not perfect. Three small bugs. One large documentation gap. A handful of coverage extensions queued for the next pass. That is what an honest audit of a real diagnostic looks like — not a marketing artifact, not a clean ten-out-of-ten, just the actual state of the code with the receipts attached.

If you want the broader case for why detector claims should be treated skeptically by default, read “Do AI Detectors Work?” If you want to see what happens when detectors get applied to real human writing, read the “False Positive Hall of Fame”. If you want the calibration discipline that produced the corpus this audit was tested against, read the “Calibration Log”.