01

Why fingerprints exist at all

When you read something and think "this sounds like AI," you're reacting to a fingerprint. Not one thing — a cluster. A word that no human would reach for. A paragraph that's exactly the same length as the one before it. A sentence that hedges in a way that feels rehearsed rather than honest.

These patterns aren't random. They trace back to a specific step in how language models are built: RLHF — reinforcement learning from human feedback. After the base model is trained on text, human workers rate its responses. The model learns to produce whatever those workers reward. If the workers prefer formal vocabulary, the model learns formal vocabulary. If they reward thoroughness, the model becomes exhaustive. If they happen to like the word "delve," the model starts delving into everything.

Different companies use different workers, different rating criteria, different training philosophies. That's why GPT doesn't sound like Claude, Claude doesn't sound like Gemini, and none of them sound quite like a person. Each model picked up its own habits from its own training — and those habits are remarkably consistent, remarkably persistent, and remarkably hard to prompt away.

This report documents what those habits are, model by model.


02

The words that give it away

The universal AI vocabulary

Some words are overused by every major model. Research across 15 million PubMed abstracts found over a hundred words that spiked sharply after 2022 — the year AI-assisted writing went mainstream. These are the words all models reach for more than humans do:

Shared across all models — elevated vs. human baseline
delve, underscore, intricate, meticulous, commendable, realm, pivotal, boast, primarily, enhancing, nuanced, tapestry, landscape, in today's world, broader context, foster

The interesting thing: the acceleration started in 2020, before ChatGPT launched. The RLHF workers who rated training data were already preferring these words. The public release just scaled it.
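As a rough illustration, the shared vocabulary above can drive a simple word-list screen. This is a minimal sketch, not a validated detector: the phrase list is abridged from the cluster above, and the per-1,000-words metric is an assumption.

```python
import re

# Abridged from the shared tell-word cluster above
TELL_PHRASES = [
    "delve", "underscore", "intricate", "meticulous", "commendable",
    "realm", "pivotal", "boast", "nuanced", "tapestry", "landscape",
    "in today's world", "broader context", "foster",
]

def tell_rate(text: str) -> float:
    """Occurrences of shared tell phrases per 1,000 words of text."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    if not words:
        return 0.0
    hits = sum(lowered.count(phrase) for phrase in TELL_PHRASES)
    return 1000.0 * hits / len(words)
```

Substring counting deliberately catches inflections ("delves," "underscores"); a production screen would lemmatize and compare against a human baseline rate rather than a fixed threshold.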

The "delve" origin story

"Delve" became the poster child of AI-generated writing, and there's a real hypothesis behind why. OpenAI used Kenyan and Nigerian workers (through a company called Sama) to rate model responses during training. In Nigerian formal English, "delve" is a common, natural word — used far more than in American or British English. If those workers rated responses containing "delve" as higher quality, the model learned to produce it more.

Academic research (ACL 2025) tested this and found:

  • RLHF definitely contributes — instruction-tuned Llama models overuse these words; base Llama models don't
  • Human participants actually prefer responses that contain these elevated words, confirming a feedback loop
  • The Nigerian English link is plausible but not proven — the broader explanation (RLHF workers reward formal register, period) has stronger support

Why different models have different words

The same mechanism — RLHF workers rewarding what sounds "good" — produces different vocabulary depending on who the workers are. GPT and Claude share some tells but diverge on others because they used different annotator pools and different training processes. Same machine, different teachers.

What belongs to whom

Beyond the universal vocabulary, each model has its own signature words. These are the ones that help identify not just that AI wrote something, but which AI wrote it.

OpenAI GPT-4 / GPT-4o
Signature vocabulary
certainly, such as, overall, explore, it's worth noting that, furthermore, additionally, in today's world, absolutely
GPT-4o additions
em-dash (—), boasts, underscore, delve, curly quotes (“ ”)

GPT loves to bold key phrases and organize into lists. In a classifier study with 97.1% accuracy across five models, "certainly," "such as," and "overall" were the most predictive GPT word choices.

Anthropic Claude (all versions)
Signature vocabulary
according to, based on, here, it's worth noting, fascinating, delve into, landscape (as metaphor)
Structural phrase tells
I should mention, that said, it's important to note, broadly speaking, in other words

Claude frames information with "according to" and "based on" — a habit from its Constitutional AI training, where it learned to cite its own reasoning. It uses minimal formatting by default: fewer bullets, less bold than GPT. It also doesn't use curly quotes, which immediately distinguishes it from ChatGPT and DeepSeek.

Google Gemini 1.5 / 2.0 / 2.5
Signature vocabulary
I hope this helps, I'd be happy to, great question, certainly, that's a great point
Structural habits
ADVP sentence openers, bullet-heavy organization, question restatement

Gemini has the most varied syntactic structure of any model tested, but its vocabulary diversity is actually lower. It starts sentences with adverbial phrases far more than any competitor. It tries to match your tone — but can tip into stiff formality even when you're being casual.

DeepSeek DeepSeek V3 / R1
Signature tells
curly quotes (“ ”), structured headers, numbered subsections, emoji in headers (chat mode)
Style character
dense and technical; longer average sentences; meticulous data annotation

DeepSeek uses curly quotes (like ChatGPT), immediately distinguishing it from Claude and Gemini. Its sentences average 24.9 words — the longest of any model tested. Classifiers frequently confuse it with GPT because of shared formatting traits.
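The quote and dash tells running through these profiles reduce to simple character counts. A minimal sketch — the feature names here are illustrative, not drawn from any cited study:

```python
def punctuation_tells(text: str) -> dict:
    """Count the punctuation signals discussed above."""
    return {
        # U+201C/U+201D, U+2018/U+2019: curly double and single quotes
        "curly_quotes": sum(text.count(c) for c in "\u201c\u201d\u2018\u2019"),
        # U+2014: em-dash
        "em_dashes": text.count("\u2014"),
        "straight_quotes": text.count('"'),
    }
```

Read the counts together: curly quotes plus heavy em-dashes suggests GPT-4o; curly quotes without em-dash overuse points toward DeepSeek; straight quotes lean Claude or Gemini.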

Meta Llama 3 (instruct)
The key finding
base model: minimal AI tells; instruct model: RLHF amplifies tells
Known tells
"Sure!", "Of course", "Certainly!", "Here's what I found:"

This is the smoking gun for RLHF: base Llama doesn't overuse any of the focal words. The instruct-tuned version does. Same architecture, same training data — the only difference is the human feedback step. That's the clearest evidence that these vocabulary patterns come from training, not from the model itself.

Mistral AI Mistral Small / Large

The least studied model for writing fingerprints, which is itself a kind of data point. Its outputs tend to be competent, direct, and personality-neutral — somewhere between GPT-3.5 and GPT-4 in formality. One documented quirk: it ignores negative instructions more than other models. Tell it "don't use the phrase 'in today's'" and it will use it anyway. No distinctive vocabulary tell has been firmly established by any study.


03

Structure as fingerprint

Here's something most people don't realize: even if you strip out every word and replace it with a placeholder, classifiers can still identify which model generated the text with 73%+ accuracy — using only the markdown structure. Bullets vs. paragraphs. Header depth. How many items in each list. Whether there's a summary at the top.

Formatting is a fingerprint independent of vocabulary. And each model has defaults that are surprisingly hard to override.
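A hypothetical version of such a structure-only feature extractor, operating on markdown with every word ignored. The feature set below is a guess at the kind of signals involved, not the one used in the 73%+ study:

```python
import re

def structure_features(text: str) -> dict:
    """Extract markdown-structure signals that survive word substitution."""
    lines = text.splitlines()
    bullets = sum(1 for ln in lines if re.match(r"\s*[-*•]\s", ln))
    numbered = sum(1 for ln in lines if re.match(r"\s*\d+[.)]\s", ln))
    # Header depth = number of leading '#' characters
    headers = [len(m.group(1)) for ln in lines
               if (m := re.match(r"(#+)\s", ln))]
    return {
        "bullet_lines": bullets,
        "numbered_lines": numbered,
        "header_count": len(headers),
        "max_header_depth": max(headers, default=0),
        "bold_spans": len(re.findall(r"\*\*[^*]+\*\*", text)),
    }
```

Feed these counts (normalized by document length) to any off-the-shelf classifier and the vocabulary never enters the picture — which is exactly why paraphrasing doesn't defeat structural detection.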

GPT-4 / 4o — structural defaults
  • Heavy use of bullet points and numbered lists
  • Bold key phrases liberally
  • Opens with a thesis paragraph that summarizes before diving in
  • Transition phrases at every paragraph start ("Furthermore," "Additionally,")
  • Formal header structure even when casual would fit
  • Tends toward comprehensive coverage over concision
Claude — structural defaults
  • Minimal formatting — fewer bullets, less bold
  • Prose-first: more paragraph writing, less listing
  • Variable sentence length (more natural rhythm)
  • More likely to use em-dash for asides
  • Doesn't restate the question before answering
  • Skips level-2 headings; goes straight to level-3
Gemini — structural defaults
  • Adverbial sentence openers ("Ever feel like...", "Think about...")
  • Bullet-heavy for information-dense responses
  • Often restates or reframes the question
  • Most varied syntactic structure of any model tested
  • Tends to formally organize even creative content
DeepSeek — structural defaults
  • Numbered subsections with headers everywhere
  • Section-by-section breakdowns even for simple tasks
  • Longest average sentence length (24.9 words/sentence)
  • Emojis in chat-mode headers (🔍 📌 etc.)
  • Chain-of-thought reasoning visible in R1 variant

Sentence length — it's not the average, it's the variance

Across seven models tested in academic writing:

  • DeepSeek-coder V2: 24.9 words/sentence — the longest
  • Llama 3.1 8B: 20.6 words/sentence — more readable by standard metrics
  • GPT-4 and Claude: 18-22 words, but Claude varies length more naturally within a single response
  • GPT-5: Shorter and more concise than predecessors — and distinctively overuses semicolons as a result

Why this matters for detection

Sentence length uniformity is more telling than length itself. Human writers swing wildly: 8 words, then 34, then 15, then 27. AI models, especially GPT-4o, hover in a narrow band: 21, 22, 20, 21. Both sequences average 21 words per sentence. Only one of them reads like a machine wrote it.
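The average-versus-variance point can be made concrete in a few lines. A sketch using a naive sentence splitter — real burstiness measures handle abbreviations and use larger samples:

```python
import re
from statistics import mean, stdev

def length_profile(text: str) -> tuple[float, float]:
    """Return (mean, standard deviation) of words per sentence.

    Splitting on ., !, ? is naive: good enough for a sketch,
    wrong for abbreviations and decimals.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return 0.0, 0.0
    if len(lengths) == 1:
        return float(lengths[0]), 0.0
    return mean(lengths), stdev(lengths)
```

Two texts can share the same mean while one has a standard deviation near 10 (human-like swing) and the other near 1 (machine-like uniformity). The second number is the tell.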


04

How each model dodges, hedges, and deflects

Every model handles uncertainty and controversy differently — and those differences are fingerprints just as distinctive as vocabulary. This isn't about what they say. It's about what they do when they're uncomfortable.

Who hedges most?

  • GPT-4 / 4o: High; adds disclaimers unprompted. Go-to hedges: "it's worth noting," "however," "it depends on," "it's important to consider." On controversy: both-sides framing; rarely refuses, often hedges.
  • Claude: Medium; hedges on uncertainty, not on opinion. Go-to hedges: "that said," "I'm not certain," "broadly speaking." On controversy: more likely to take positions, and also more likely to refuse entirely on some topics.
  • Gemini: High; errs on caution. Go-to hedges: "I hope this helps," "I'd be happy to assist," "that's a great question." On controversy: references ethical frameworks without committing to a position.
  • DeepSeek: Low-medium on general topics; high on Chinese political content. Go-to hedges: "I should point out," "it's important to note." On controversy: direct on technical and historical topics; refuses or deflects on China-sensitive content.
  • Llama 3 (instruct): Medium; mirrors what RLHF trainers rewarded. Go-to hedges: "of course," "certainly," "sure!" On controversy: variable by fine-tune; the base model is more direct.
  • Mistral: Low; notably fewer disclaimers than GPT or Claude. Few go-to hedge phrases documented. On controversy: more direct; less content filtering by default.

The "Let me explain" problem

Models narrate their own responses at very different rates. GPT-4o is the worst offender — it opens with an acknowledgment, contextualizes the task out loud, works through the answer, then closes with an offer to elaborate. Claude tends to skip the preamble and just answer. Gemini opens with enthusiasm before the actual content. DeepSeek dives straight into structure.

Meta-commentary phrases by model (most common first)
  • GPT-4o: "Great question," "I'd be happy to help," "Let me walk you through," "Here's what you need to know," "To summarize"
  • Gemini: "I hope this helps," "That's a great question," "Absolutely!" "Here's a breakdown," "I'd be happy to"
  • Claude: "That said," "In other words" — rarely opens with affirmations, tends to start directly answering
  • DeepSeek: Minimal meta-commentary; dives directly into structured response
  • Llama 3 instruct: "Sure!", "Of course", "Certainly!" — opener-heavy, then gets down to business

How they handle controversy

This is where training philosophy becomes most visible:

  • GPT-4: Both-sides by default. Presents multiple perspectives, rarely refuses, rarely commits. Will take a position when pushed, but never on its own.
  • Claude: More willing to form and express opinions, especially on philosophical and ethical questions. Also more willing to refuse creative and roleplay content that GPT handles without complaint.
  • Gemini: References ethical frameworks abstractly — utilitarianism, deontology — without committing to any of them. Strong on historical examples. High abstraction on moral questions.
  • DeepSeek: Excellent on general historical and ethical reasoning. But explicit refusal or deflection on anything touching Chinese political sensitivities — Taiwan, Tiananmen, Xinjiang. This asymmetry is unique to DeepSeek and instantly identifying when triggered.

05

How tells change version to version

Models don't stay the same. Each new version fixes some tells and introduces new ones. Tracking this evolution matters because a detector tuned to last year's tells will miss this year's models.

The GPT lineage

GPT-3.5 (Nov 2022)
Formal, clear, competent. No em-dash overuse. Less "delve." Heavy use of "I am writing to" and "I hope this email finds you well" type openers. Clearly robotic to trained eyes — but it fooled the early detectors.
GPT-4 (March 2023)
The "delve" era begins. Academic-sounding vocabulary spikes: "underscore," "pivotal," "meticulous." Bold text and bullet points become dominant. "Certainly" as opener proliferates. More structured, more verbose. Errors get harder to spot — wrapped in confident-sounding prose.
GPT-4o (May 2024)
The em-dash era. Roughly 10x more em-dashes than GPT-3.5. "Delve" persists, "boasts" drops, "underscore" increases. Curly quotes become consistent. Hypothesis: training data included more high-quality print books, which favor em-dashes. Still the most formatting-heavy model.
GPT-4.1 (Apr 2025)
Em-dash overuse increases further. Semicolons start appearing. Slightly more concise overall — the beginning of a trend.
GPT-5 (Aug 2025)
Noticeably more concise. Semicolons proliferate — hypothesis: the brevity bias uses semicolons as connective tissue instead of full sentences. Older tells (delve, em-dash) reduce. Harder to catch by word-lists. Still classifiable at 80%+ by embedding models, because the deeper structural fingerprint survives.

The Claude lineage

Claude 2 (2023)
Verbose. Extremely cautious. Heavy hedging, long responses to simple questions. The "Constitutional AI" register: ethical caveats woven into almost everything that touched a sensitive topic. Literary quality already higher than GPT-3.5, but the caution was overwhelming.
Claude 3 Opus / Sonnet (March 2024)
Major quality jump. More natural prose rhythm. Less unsolicited hedging. Still prone to "fascinating," "it's worth noting," "landscape" as metaphor. Formatting becomes more restrained vs GPT. Users describe Opus as writing like "a skilled human author." Literary tone most pronounced here.
Claude 3.5 Sonnet (June 2024)
The most natural-sounding Claude at that point. "Fascinating" frequency drops. Hedges only when genuinely uncertain. Better at matching requested tone. Rated highest among AI models for marketing copy and editorial writing. Still occasionally explains things nobody asked to have explained.
Claude 4 / Sonnet 4.6 (2025-26)
Current generation. Fewer universal AI tells. Harder to detect by word-level analysis alone. Retains mild Claude-isms: "based on," "according to," "that said." More direct, fewer unsolicited caveats. Stylistically closest to human academic writing of any Claude version.

Are newer models harder to detect?

By word-lists, yes. The same classifier trained on older outputs loses accuracy on newer ones. But "harder to catch by word-list" is not the same as "harder to catch." Embedding-based classifiers still hit 97.1% accuracy on the latest models — even when text is paraphrased, translated, or summarized. The fingerprint isn't just in the words. It's in how models organize thought. And that's much harder to engineer away.


06

Character portraits

If you've read this far, you've seen the data. Here's how it all adds up into personality — the shorthand version of what each model sounds like when you step back from individual tells and listen to the voice.

GPT-4 / GPT-4o — The Competent Bureaucrat

GPT writes like a research assistant who wants to demonstrate thoroughness. It front-loads with a summary, organizes everything with headers and bullets, uses transition phrases between every section, and closes with an offer to help further. It's reliable and comprehensive — but it's never surprising.

Its vocabulary skews formal even for casual tasks. When asked to be casual, it becomes "friendly professional" rather than actually casual. The em-dash in GPT-4o is persistent enough that even explicit prompts to avoid it fail regularly — it's baked deep.

Detectability: High · Hedging: High · List-heavy: High · Prose naturalness: Low-Med

Claude 3.5 / Claude 4 — The Thoughtful Generalist

Claude writes like a well-read person who cares about being understood. It varies sentence length, avoids most boilerplate openers, and defaults to prose over lists. Its "AI-ness" is subtler — slightly elevated vocabulary, occasional over-explanation of things nobody asked about, and a tendency to contextualize decisions within the response itself.

It's significantly better at following style instructions than GPT. Tell it to write like Hemingway and it won't start with "Certainly! Here's a Hemingway-style piece." It just writes it.

Detectability: Medium · Hedging: Low-Med · List-heavy: Low · Prose naturalness: High

Gemini 1.5 Pro / 2.0 — The Eager Collaborator

Gemini is the most conversational of the major models. It asks questions back, frames things warmly, uses "I hope this helps" like a reflex. This is both its strength and its most obvious tell. Its syntactic variety is actually the highest tested, but its vocabulary diversity is lower. It can tip into stiff formality even when casual register is requested.

The adverbial-phrase opening is Gemini's signature move: "Ever feel like...", "Just imagine..." It creates an enthusiastic-coach voice that's instantly recognizable once you know to listen for it.

Detectability: Medium-High · Hedging: High · List-heavy: Medium · Prose naturalness: Medium

Llama 3 (instruct) — The Eager Intern

Llama opens with enthusiasm markers — "Sure!", "Of course!", "Certainly!" — more consistently than any other family. This is almost certainly an RLHF artifact: annotators rated responses higher when they opened with agreement and willingness. Base Llama doesn't do this at all. The contrast is the cleanest evidence that human feedback, not architecture, shapes these patterns.

Quality varies enormously by fine-tune. The base model has the fewest AI tells of any widely available model — which makes controlled Llama fine-tunes a potential blind spot for word-level detectors.

Detectability: High (instruct) / Low (base) · Opener affirmations: Very High

Mistral — The Pragmatist

Mistral is the least studied model for writing fingerprints, which is itself a data point. Its outputs are competent, direct, and free of the enthusiastic-affirmation pattern. It ignores negative instructions more than other models — tell it not to use certain phrases and it uses them anyway. Its register sits between GPT-3.5 and GPT-4 in formality. No signature vocabulary tell has been established.

Detectability: Low-Med · Hedging: Low · Instruction following: Mixed

DeepSeek V3 / R1 — The Technical Structuralist

DeepSeek organizes everything. Numbered headers, subsections, section-by-section breakdowns even when prose would be more appropriate. Its chain-of-thought reasoning (R1) is visible, which creates a distinctive "I'm working through this" quality. Curly quotes consistently. Longest average sentence length of any model in academic writing tests.

The political-topic deflection is the single clearest model-specific tell in this entire report: DeepSeek will discuss Nazi war crimes in detail but deflects on Tiananmen, Uyghur persecution, and Taiwan independence. When triggered, it's an instant identification.

Detectability: Medium · Structural formatting: High · Sentence length: High

07

Master comparison table

Everything above, compressed into one table. Use it as a reference — the sections above tell the story, this is the cheat sheet.

  • GPT-3.5 · Vocab: "I hope this email finds you well," "I am writing to." Structure: paragraphs, moderate lists. Rhetoric: formal register, direct. Punctuation: no em-dash; straight quotes. Controversy: balanced; rarely refuses. Detectability: high (older tells obvious).
  • GPT-4 · Vocab: delve, certainly, such as, overall, pivotal. Structure: heavy bullets and bold headers. Rhetoric: comprehensive; unsolicited disclaimers. Punctuation: curly quotes; no em-dash yet. Controversy: both-sides framing. Detectability: high.
  • GPT-4o · Vocab: underscore, tapestry, boasts, explore. Structure: heavy bullets, bold, em-dashes. Rhetoric: enthusiastic meta-commentary. Punctuation: em-dash dominant; curly quotes. Controversy: both-sides framing plus hedging. Detectability: medium-high.
  • GPT-5 · Vocab: fewer older tells; semicolons as connective tissue. Structure: more concise, still structured. Rhetoric: fewer disclaimers, more direct. Punctuation: semicolon overuse emerging. Controversy: more direct than GPT-4. Detectability: medium (evolving).
  • Claude 3 Opus · Vocab: fascinating, realm, delve into, it's worth noting. Structure: prose-first, minimal lists. Rhetoric: literary register, considered tone. Punctuation: moderate em-dash; straight quotes. Controversy: forms opinions; more refusals on edge content. Detectability: medium.
  • Claude 3.5/4 · Vocab: according to, based on, that said, landscape (as metaphor). Structure: minimal formatting, paragraph-dominant. Rhetoric: direct; hedges only when uncertain. Punctuation: light em-dash; no curly quotes. Controversy: opinionated; more selective refusals. Detectability: medium-low.
  • Gemini 1.5 · Vocab: I hope this helps, great question, I'd be happy to. Structure: ADVP openers, bullet-heavy, question restatement. Rhetoric: enthusiastic-collaborative; warm hedging. Punctuation: no curly quotes, no em-dash. Controversy: abstract ethical frameworks. Detectability: high (enthusiasm pattern obvious).
  • Gemini 2.0+ · Vocab: reducing, though "I hope this helps" persists. Structure: better prose, still conversational. Rhetoric: more tone-matching, less formulaic. Punctuation: consistent with 1.5. Controversy: improving nuance. Detectability: medium.
  • Llama 3 instruct · Vocab: Sure!, Of course, Certainly!, Here's what I found. Structure: moderate lists, opener-heavy. Rhetoric: affirmation openers, then direct. Punctuation: variable by fine-tune. Controversy: variable by fine-tune. Detectability: high (opener pattern).
  • Mistral · Vocab: no firm tell established. Structure: moderate; ignores negative instructions. Rhetoric: direct, fewer disclaimers. Punctuation: variable. Controversy: direct; less filtering. Detectability: low-medium.
  • DeepSeek V3/R1 · Vocab: meticulous; numbered section headers and subsections. Structure: heavy structured formatting; visible chain of thought. Rhetoric: technical precision; direct on general content. Punctuation: curly quotes consistently. Controversy: strong on history; deflects on China-sensitive topics. Detectability: medium (structure obvious; political tell unique).

What makes a tell "strong" vs. "weak"

  • Strong tell: Shows up consistently, rarely appears in human writing, and is hard to prompt away. Examples: the em-dash in GPT-4o, opener affirmations in Llama 3 instruct, DeepSeek's political deflection.
  • Medium tell: Elevated vs. human baseline, but can be prompted away with explicit instruction. Examples: delve, certainly, bullet overuse.
  • Weak tell: Overrepresented in AI generally but not specific to any one model. Examples: realm, pivotal, meticulous — all models use these.

What this means for detection

The strongest detection signals are structural — formatting patterns that survive even when you swap out every word. Next come rhetorical defaults — how models handle disagreement, uncertainty, and meta-commentary. Vocabulary tells come last. A detector that only looks at word-level tells will be defeated by the next model update. The deeper signature — how a model organizes thought — is much harder to train away. That's what WROITER is built on.


08

What we don't know yet

This report is a snapshot, not a permanent truth. Here's where the gaps are:

  • Coverage is uneven. Most academic studies focus on GPT and Claude. Gemini, Mistral, and DeepSeek are significantly less studied for writing fingerprints specifically.
  • Tells shift with every release. Anything documented before mid-2024 may already be reduced in current models that were trained to avoid it.
  • Fine-tuning changes everything. Enterprise fine-tunes of Llama, Mistral, and others can produce signatures that look nothing like the base model.
  • The "delve" origin story is plausible, not proven. Direct confirmation from OpenAI's annotator data hasn't been published.
  • Identification ≠ detection. Telling which model wrote something requires model-specific classifier training on current outputs. Telling that AI wrote it is a different — and in some ways easier — problem.
  • The arms race is real. GPT-5, Claude 4, and Gemini 2.5 are specifically trained to reduce known tells. WROITER has to track model releases actively, not just document what was true six months ago.