01

What this is

The AI Writing Fingerprints report names the patterns. This report puts numbers on them.

We used a public research dataset called HC3 — the Human ChatGPT Comparison Corpus — where real humans and ChatGPT answered the exact same questions across four domains: open Q&A, finance, medicine, and encyclopedia-style topics. That pairing is what makes the comparison fair: same prompts, different authors.

The dataset contains 7,210 human answers (about a million words) and 10,243 AI answers (about 1.8 million words). We ran every measurement on both halves and compared. Where GPTZero has published their own ratios from a much larger corpus, we include those too — so you can see whether the numbers agree.

They usually do.

02

The words AI overuses — and by how much

You’ve probably seen lists of "AI words" — delve, tapestry, vibrant, landscape. But how overused are they, really? GPTZero measured it across millions of documents. The top result is striking:

| Phrase | AI-to-Human Ratio | Source |
| --- | --- | --- |
| play a significant role in shaping | 182× | GPTZero Top 10 |
| notable works include | ≥120× | GPTZero Top 10 |
| today’s fast-paced world | 107× | GPTZero Top 10 |
| aims to explore | ≥50× | GPTZero Top 10 |
| showcasing | 20× | GPTZero Top 10 |
| remarked | 18× | GPTZero Top 10 |
| aligns | 16× | GPTZero Top 10 |
| surpassing | 12× | GPTZero Top 10 |
| tragically | 11× | GPTZero Top 10 |
| impacting | ≥11× | GPTZero Top 10 |

AI uses the phrase "play a significant role in shaping" 182 times more often than human writers do. That’s not a subtle difference. That’s a fingerprint.

We verified these patterns independently on the HC3 corpus. The ratios are lower — HC3 is a Q&A dataset with shorter answers, less room for elaborate phrasing — but the direction is the same:

| Word | Human /1M | AI /1M | Ratio | Signal strength |
| --- | --- | --- | --- | --- |
| vibrant | 0.00 | 13.56 | 27.5× | strong |
| delve | 0.00 | 7.91 | 16.7× | strong |
| showcasing | 0.00 | 4.52 | 9.6× | strong |
| multifaceted | 0.00 | 3.96 | 8.4× | strong |
| testament | 4.03 | 28.26 | 6.3× | medium |
| foster | 1.01 | 10.17 | 5.8× | medium |
| additionally | 35.26 | 193.85 | 5.4× | medium |
| enduring | 2.01 | 13.00 | 5.3× | medium |
| groundbreaking | 0.00 | 2.26 | 5.1× | medium |
| transformative | 0.00 | 2.26 | 5.1× | medium |
| essential | 5.04 | 17.52 | 3.4× | medium |
| navigate | 12.09 | 40.13 | 3.2× | medium |
| crucial | 4.03 | 10.74 | 2.7× | weak |
| intricate | 0.00 | 1.13 | 2.6× | weak |
| landscape | 8.05 | 20.91 | 2.5× | weak |
| leverage | 9.06 | 18.65 | 2.0× | weak |
| comprehensive | 26.18 | 53.12 | 2.0× | weak |
| realm | 6.04 | 13.00 | 2.0× | weak |
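Ratios like these reduce to per-million-word counts on each half of the corpus. Here is a minimal sketch in Python, assuming simple regex tokenization and an arbitrary floor of 0.5 occurrences per million for words absent from the human half (how the corpus's zero counts were actually smoothed isn't published, so the floor value here is an assumption):

```python
import re
from collections import Counter

def per_million(counts: Counter, total_words: int, word: str) -> float:
    """Occurrences of `word` per one million words."""
    return counts[word] / total_words * 1_000_000

def overuse_ratio(human_text: str, ai_text: str, word: str,
                  floor: float = 0.5) -> float:
    """AI-to-human frequency ratio for one word.

    `floor` is an assumed minimum per-million rate so that words absent
    from the human half (rate 0.00) still yield a finite ratio.
    """
    tokenize = lambda t: re.findall(r"[a-z']+", t.lower())
    h_tokens, a_tokens = tokenize(human_text), tokenize(ai_text)
    h_rate = per_million(Counter(h_tokens), len(h_tokens), word)
    a_rate = per_million(Counter(a_tokens), len(a_tokens), word)
    return a_rate / max(h_rate, floor)
```

On the real corpus you would tokenize each half once and reuse the counters, rather than re-tokenizing per word as this sketch does.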
What about "moreover" and "furthermore"?

Some famous AI words barely register here. Pivotal (0.65×), moreover (0.73×), nuanced (0.79×) — they actually appear less often in AI text in this corpus. That’s because HC3 is early ChatGPT answering short questions. These words became AI tells later, in longer-form writing and newer models. A word can be a strong AI signal in a blog post and a weak one in Q&A. Context matters.

03

Sentence rhythm — the metronomic beat

This is probably the most reliable non-vocabulary signal. When you read AI text and feel like something is off even though you can’t point to a specific word — this is often why. AI sentences are longer, and they’re more uniform. Human writing has a rhythm that speeds up and slows down. AI writing hums at a steady frequency.

| What we measured | Human | AI | What it means |
| --- | --- | --- | --- |
| Average sentence length (words) | 22.6 | 31.1 | AI sentences are 37% longer |
| Variation in sentence length (median SD) | 7.2 | 10.0 | AI varies more in raw numbers — but read on |
| % consecutive sentences within 5 words of each other | 28.5% | 24.0% | Surprisingly close. AI sentences can vary — they just vary around a higher average |
| Average sentences per document | 7.1 | 6.0 | AI uses fewer, longer sentences to say the same thing |
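These per-document statistics are straightforward to compute. A sketch, assuming a naive punctuation-based sentence splitter (real measurement needs proper segmentation for abbreviations, decimals, and so on):

```python
import re
import statistics

def sentence_stats(text: str) -> tuple[float, float]:
    """Average sentence length in words, and the SD of those lengths,
    for a single document. Sentences are split naively on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    mean = statistics.mean(lengths)
    sd = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    return mean, sd
```

The corpus-level numbers above are then the mean (or median, for the SD row) of these values across all documents in each half.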

But here’s the twist that matters. This signal doesn’t work the same way across all subjects:

| Domain | Human avg length | AI avg length | Human SD | AI SD | Signal | What happens |
| --- | --- | --- | --- | --- | --- | --- |
| Open Q&A | 12.0 | 24.4 | 5.5 | 11.8 | strong | AI doubles human length |
| Finance | 20.1 | 38.6 | 9.0 | 17.8 | strong | AI nearly doubles |
| Medicine | 23.1 | 14.6 | 7.9 | 7.0 | reversed | AI writes shorter than humans |
| Encyclopedia | 23.5 | 24.0 | 9.8 | 6.7 | weak | Almost identical length |
This is why single-rule detectors fail

In open Q&A, AI writes sentences twice as long as humans. In medical text, AI actually writes shorter. A sentence-length detector tuned on Q&A would completely misfire on medical writing. And in encyclopedia-style text, length is nearly useless — only the lower variation in AI (SD 6.7 vs 9.8) gives you anything. This is why WROITER doesn’t use a single global threshold. The signal changes shape depending on what kind of writing you’re looking at.
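One way to express that idea in code is a per-domain rule table. The thresholds below are hypothetical illustrations loosely motivated by the table above, not WROITER's actual rules:

```python
# Illustrative per-domain rules; every number here is hypothetical,
# chosen to sit between the human and AI averages in the table above.
DOMAIN_RULES = {
    "open_qa":      {"direction": "longer",  "min_ai_avg_len": 20.0},
    "finance":      {"direction": "longer",  "min_ai_avg_len": 30.0},
    "medicine":     {"direction": "shorter", "max_ai_avg_len": 18.0},
    "encyclopedia": {"direction": "uniform", "max_ai_len_sd": 8.0},
}

def length_signal(domain: str, avg_len: float, len_sd: float) -> bool:
    """True if a document's sentence-length statistics lean AI
    under the rule for its domain."""
    rule = DOMAIN_RULES[domain]
    if rule["direction"] == "longer":
        return avg_len >= rule["min_ai_avg_len"]
    if rule["direction"] == "shorter":
        return avg_len <= rule["max_ai_avg_len"]
    # "uniform": length itself is useless; low variation leans AI
    return len_sd <= rule["max_ai_len_sd"]
```

The point of the sketch is the shape, not the numbers: the same measurement feeds a different comparison, in a different direction, depending on genre.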

04

Paragraph shape — the wall of neat blocks

If you’ve ever looked at a page of AI writing and thought "this looks too organized" — it’s not your imagination. AI organizes text into paragraphs more aggressively than humans, and those paragraphs are eerily uniform in size.

| What we measured | Human | AI | What it means |
| --- | --- | --- | --- |
| % documents with 2+ paragraphs | 9.0% | 19.0% | AI structures more, even when humans don’t bother |
| Average paragraphs per document | 1.31 | 1.69 | AI adds ~30% more paragraph breaks |
| Average paragraph length (sentences, multi-para docs) | 2.89 | 2.24 | AI paragraphs are shorter and more regular |
| Variation in paragraph length within a document | 1.06 | 0.54 | AI paragraphs are twice as uniform in size |
| % paragraphs opening with a transition word | 0.5% | 1.7% | AI uses 3.4× more paragraph-opening transitions |
| % paragraphs with a clear topic sentence first | 23.7% | 9.9% | Humans use topic sentences 2.4× more often |
The uniformity signal

That paragraph-length variation number — 0.54 for AI vs 1.06 for human — is one of the cleanest structural signals in the entire corpus. Human writers produce paragraphs of all sizes: some long, some just one sentence, some sprawling. AI produces blocks of similar size, one after another. That’s the "wall of neat paragraphs" that experienced editors notice before they can articulate why.
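The uniformity number is just the standard deviation of paragraph lengths, measured in sentences, within one document. A minimal sketch, assuming blank-line paragraph breaks and the same naive sentence splitting used elsewhere in this report:

```python
import re
import statistics

def paragraph_uniformity(doc: str) -> float:
    """SD of paragraph lengths (in sentences) within one document.

    Low values (AI-like) mean paragraphs of near-identical size.
    Paragraphs split on blank lines; sentences naively on . ! ?
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", doc) if p.strip()]
    lengths = [len([s for s in re.split(r"[.!?]+", p) if s.strip()])
               for p in paragraphs]
    return statistics.stdev(lengths) if len(lengths) > 1 else 0.0
```

A document of 3-3-3-sentence paragraphs scores 0.0; a human-shaped 1-5-2 document scores above 2.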

And there’s an interesting surprise in the last row. Humans use topic sentences more than twice as often as AI. That goes against the intuition that AI writing is "more structured." AI looks structured because the paragraphs are uniform, but it actually does less of the work of orienting the reader at the start of each paragraph. It substitutes neatness for clarity.

05

Vocabulary range — how quickly AI starts repeating itself

AI doesn’t just overuse specific words. It uses fewer different words overall. Three standard measures all point the same direction — and the gap gets worse the longer the text runs.

| Metric | Human | AI | Gap |
| --- | --- | --- | --- |
| 500-word chunks |  |  |  |
| Unique words as % of total (TTR) | 0.501 | 0.402 | −20% |
| Words used exactly once (hapax rate) | 0.345 | 0.241 | −30% |
| Moving-window diversity (MATTR) | 0.703 | 0.631 | −10% |
| 1,000-word chunks |  |  |  |
| TTR | 0.428 | 0.333 | −22% |
| Hapax rate | 0.283 | 0.188 | −34% |
| MATTR | 0.703 | 0.631 | −10% |
| 2,000-word chunks |  |  |  |
| TTR | 0.360 | 0.275 | −24% |
| Hapax rate | 0.226 | 0.145 | −36% |
| MATTR | 0.703 | 0.631 | −10% |
What the Hapax rate really tells you

Hapax legomena are words that appear exactly once in a chunk of text. In 500 words of human writing, about 35% of the words are one-offs — slang, specific jargon, idiosyncratic word choices. In AI, it’s only 24%. AI picks "safe" words and recycles them. And the gap widens with length: at 2,000 words, AI’s hapax rate is 36% lower than human. The longer AI writes, the more it repeats itself.

Notice that the MATTR metric (a length-independent measure of vocabulary diversity) holds perfectly steady at 0.703 vs 0.631 across all chunk sizes. That stability is the point — it confirms the 10% diversity gap is real and consistent, not an artifact of how we sliced the data.
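All three metrics are a few lines each. A sketch over a pre-tokenized word list (the window size of 50 below is an illustrative choice, not necessarily the one used for the numbers above):

```python
from collections import Counter

def ttr(tokens: list[str]) -> float:
    """Type-token ratio: unique words / total words. Shrinks as
    text gets longer, which is why the table reports it per chunk size."""
    return len(set(tokens)) / len(tokens)

def hapax_rate(tokens: list[str]) -> float:
    """Share of tokens whose word appears exactly once in the chunk."""
    counts = Counter(tokens)
    return sum(1 for t in tokens if counts[t] == 1) / len(tokens)

def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average TTR: mean TTR over a sliding window, so the
    score stays comparable across chunk sizes."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)
```

Because every MATTR window is the same size, the metric is length-independent by construction, which is exactly why it holds flat across the 500/1,000/2,000-word rows while raw TTR falls.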

06

Rhetorical habits — the hedging and the endings

Beyond words and structure: how often does AI use specific rhetorical moves compared to humans? Two patterns stood out sharply. One was expected. The other was interesting for what it didn’t show.

| Pattern | Human | AI | Signal |
| --- | --- | --- | --- |
| Hedging phrases per 1,000 words | 13.8 | 22.0 | +59% (AI hedges significantly more) |
| Hedging — encyclopedia split only | 10.6 | 15.3 | +45% (effect persists even in formal writing) |
| % texts ending with a summary conclusion | 0.9% | 8.7% | 9.7× (a strong AI tell) |
| % texts with both intro + conclusion markers | 0.2% | 0.4% | Small absolute numbers |
| % texts with "both sides" framing | 0.0% | 0.0% | Not detectable at this scale |
| % texts with meta-introduction | 0.0% | 0.0% | Absent in Q&A format |

The hedging premium is a direct consequence of how AI models are trained. During reinforcement learning, models are rewarded for epistemic caution — "may," "might," "could," "appears to," "suggests that" — and penalized for overconfident claims. The result: AI hedges 59% more than humans, even in formal encyclopedia text, where both humans and AI hedge less overall.
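Measuring the hedging rate is a matter of counting lexicon hits per 1,000 words. A sketch with a deliberately tiny hedge list (the lexicon behind the numbers above is larger and phrase-aware; this one is illustrative):

```python
import re

# Small illustrative hedge lexicon; a real one is much longer.
HEDGES = ["may", "might", "could", "appears to", "suggests that",
          "seems", "possibly", "likely", "perhaps"]

def hedges_per_1000(text: str) -> float:
    """Count hedge words/phrases per 1,000 words of text."""
    lowered = text.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(h) + r"\b", lowered))
               for h in HEDGES)
    n_words = len(lowered.split())
    return hits / n_words * 1000
```

Word boundaries (`\b`) matter here: without them, "may" would fire inside "maybe" and inflate the count.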

The summary-conclusion signal is the one that matters most for detection. Only 0.9% of human answers in this corpus end with an "In conclusion" or "Overall" summary paragraph. AI does it 9.7 times more often. If you’ve read the Good AI benchmark, you already know this pattern: formulaic conclusions were the single most common detection signal across the entire library.
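Detecting that ending is simple: check whether the final paragraph opens with a summary marker. A sketch, assuming blank-line paragraph breaks and an illustrative opener list (the report's exact marker set isn't shown here):

```python
import re

# Illustrative opener list; the actual marker set is larger.
CONCLUSION_OPENERS = ("in conclusion", "overall", "in summary", "to sum up")

def ends_with_summary(text: str) -> bool:
    """True if the document's final paragraph opens with a
    summary-conclusion marker."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return False
    return paragraphs[-1].lower().startswith(CONCLUSION_OPENERS)
```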

The zeros are informative too

Meta-introductions ("In this article, we will explore...") and "both sides" framing ("On one hand... on the other...") — two patterns that feel obviously AI — showed 0.0% in both human and AI text here. They don’t appear in Q&A format. They appear in blog posts and essays. A detection system calibrated only on Q&A data would miss them entirely. Genre-specific baselines aren’t optional; they’re the whole game.

07

Predictability — and why we chose not to use it

There’s one more signal we should mention, even though WROITER doesn’t use it: perplexity.

Perplexity measures how "surprising" text is to a language model. If you feed a sentence to GPT-2 and it finds each word highly predictable, the perplexity is low. AI-generated text has consistently lower perplexity than human text — because it was generated by a process that maximizes token probability. It’s predictable because a prediction machine wrote it.

The HC3 research shows this clearly. Human text has a broad perplexity distribution, with significant mass at high values — humans write things that genuinely surprise language models. AI text clusters at the low end, concentrated in a narrow band of high predictability.
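The mechanics of perplexity are easy to show without a large model. The toy below scores text under an add-k-smoothed bigram model instead of GPT-2 (a deliberate simplification; the real measurement uses a neural language model), but the computation is the same: average negative log-probability per token, exponentiated.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens: list[str],
                      test_tokens: list[str], k: float = 1.0) -> float:
    """Perplexity of a test sequence under an add-k-smoothed bigram
    model fit on `train_tokens`. Lower = more predictable."""
    vocab = set(train_tokens) | set(test_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    unigrams = Counter(train_tokens)
    log_prob, n = 0.0, 0
    for prev, cur in zip(test_tokens, test_tokens[1:]):
        p = (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * len(vocab))
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)
```

Feed it a sequence that follows the training pattern and one that breaks it, and the predictable sequence scores lower, which is the whole detection idea in miniature.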

GPTZero was one of the first detectors to use perplexity as a primary signal, combined with "burstiness" (how much perplexity varies from sentence to sentence). It works. So why doesn’t WROITER use it?

Why WROITER skips perplexity

Three reasons. First, it requires running a reference language model on every piece of text — that’s infrastructure cost we’d rather not pass on. Second, the reference model itself changes over time, making results inconsistent. Third, paraphrased or lightly edited text can shift perplexity without changing the underlying patterns that make it sound like AI. WROITER’s rule-based approach detects patterns that survive paraphrasing, at zero infrastructure cost. Different philosophy, different trade-offs.

08

Honest gaps

Every study has limitations. Ours are worth being explicit about.

The corpus is from early ChatGPT (2023). HC3 captures a specific moment in AI writing — before GPT-4o, before Claude 3.5, before Gemini 2. The patterns measured here are real, but newer models may exhibit different ratios. We’re working with the best publicly available paired corpus, not the most current one.

The format is Q&A, not essays. These are short answers to specific questions, not blog posts, marketing copy, or academic papers. Some patterns that are strong AI signals in long-form writing — like meta-introductions and elaborate "both sides" framing — simply don’t appear in this format. The numbers here are a floor, not a ceiling.

GPTZero’s full data is gated. We used their public Top 10 ratios. Their full Top 50 list and monthly vocabulary updates appear to be account-gated or dynamically rendered. Adding those would strengthen the word-level analysis significantly.

Multi-corpus validation remains to be done. Adding numeric tables from Ma et al., StyloAI, and CHEAT analyses would provide robustness beyond a single dataset. We haven’t done that yet.

Keep reading

This report provides the numbers. Rule-Based Signals turns them into 50 named, implementable detection patterns. AI Writing Fingerprints maps which patterns belong to which model. And the Good AI Benchmark shows what happens when you point the detector at someone else’s library.