Motivation
As large language models (LLMs) like ChatGPT grew increasingly popular, most attention focused on their ability to do tasks, like summarizing text or doing math homework problems. But many noticed that the writing style of ChatGPT, Gemini, Claude, and similar models was distinct. Words like “delve” and “underscore” became known as hallmarks of LLM writing. This data comes from a project to analyze the writing style of LLMs in more detail by studying linguistic and rhetorical features of their writing, such as their use of present participles, contractions, or the passive voice. Starting from a corpus of thousands of human-written texts, the project prompted various LLMs to generate text in similar styles, allowing direct comparisons between human and LLM writing.
The corpus contains text written by two families of models: OpenAI’s GPT-4o, used to power ChatGPT starting in 2024, and Meta’s Llama 3, which is released publicly and widely used to power many different AI products. The models come in several sizes, referring to the size of the neural network used to generate the text output. GPT-4o has two sizes, GPT-4o Mini and the regular size; Llama 3 comes in 8B and 70B (8 billion and 70 billion parameter) sizes.
Some of the included LLMs are “base” or “pre-trained” models. These have been trained on massive corpora of text to predict the most likely next word in a sentence, allowing them to be used to generate realistic-sounding text. Prompted with a chunk of human text, they will generate a plausible continuation of it. The Llama models marked “base” are base models.
The rest of the included LLMs have been “instruction-tuned”, meaning that after the basic pre-training process, they have been tuned to answer questions, follow instructions, and complete tasks. The details of this process are proprietary, so AI companies do not explain the exact training steps and tasks used, but typically it involves giving the LLM tasks (such as summarizing a text or solving a math problem), rating its output, and using the ratings to tweak the model so it performs better on those tasks. Instruction-tuned LLMs are the ones you may have interacted with through ChatGPT, Claude, Gemini, and similar tools. The GPT-4o models in this data are all instruction-tuned, as are the Llama models marked “instruct”.
Data
The Human-AI Parallel English Corpus (HAP-E) is based on 12,000 original texts, scraped from various online sources. There are 2,000 texts from each of six different text-types:
- Academic writing from a corpus of open-access academic papers published by Elsevier
- News from a corpus of online news articles published by US-based news organizations
- Fiction from novels and short stories in the public domain and available on Project Gutenberg (which hosts work that is out-of-copyright, so typically from before the 1930s)
- Spoken word from a corpus of podcast transcriptions
- Blog posts from a corpus of posts from blogger.com
- TV/movie scripts drawn from two corpora of scripts
Two chunks were extracted from each text. The first chunk of approximately 500 words was used to prompt the LLMs by asking them to write the next 500 words in the same style. The second chunk of 500 words could then be compared to the versions written by the LLMs. Six LLMs were used:
| Model | Version / size | Instruction-tuned? |
|---|---|---|
| GPT-4o Mini | 2024-07-18 | Yes |
| GPT-4o | 2024-08-06 | Yes |
| Llama 3 | 8B | No |
| Llama 3 | 8B Instruct | Yes |
| Llama 3 | 70B | No |
| Llama 3 | 70B Instruct | Yes |
GPT-4o Mini and Llama 3 8B are smaller models with fewer parameters; GPT-4o and Llama 3 70B are larger and more capable.
Occasionally an LLM refused to complete the writing task (for instance, when the text was violent or sexually explicit) or generated very short output. These cases were filtered out, and only texts for which all LLMs provided output were kept, leaving \(n = 8290\) texts.
Each chunk—the human first and second chunk and the 500-word chunks generated by the LLMs—was then analyzed with the pseudobibeR package, which calculates the rate of 66 different rhetorical features in the texts. These are the variables provided in the data, listed below.
Each row of data represents one 500-word chunk of text. As there were 8,290 texts, each with two human chunks and six LLM chunks, there are \(8290 \times 8 = 66320\) rows.
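For orientation, here is a minimal sketch of loading the data and checking its shape in R. The file name `hape_biber.csv` and the column identifying the source of each chunk (assumed here to be called `model`) are placeholders; adjust them to match your copy of the data.

```r
# A minimal loading sketch. "hape_biber.csv" and the column name `model`
# are placeholders for whatever your copy of the data uses.
library(readr)
library(dplyr)

hape <- read_csv("hape_biber.csv")

nrow(hape)             # should be 8290 * 8 = 66320 chunks
hape |> count(model)   # 8290 chunks per source (column name assumed)
```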
Variable descriptions
Tense and aspect markers
| Variable | Description |
|---|---|
| f_01_past_tense | Verbs in the past tense |
| f_02_perfect_aspect | Verbs in the perfect aspect, indicated by “have” as an auxiliary verb (e.g. I [have] written this sentence.) |
| f_03_present_tense | Verbs in the present tense |
Place and time adverbials
| Variable | Description |
|---|---|
| f_04_place_adverbials | Place adverbials (e.g., above, beside, outdoors) |
| f_05_time_adverbials | Time adverbials (e.g., early, instantly, soon) |
Pronouns and pro-verbs
| Variable | Description |
|---|---|
| f_06_first_person_pronouns | First-person pronouns |
| f_07_second_person_pronouns | Second-person pronouns |
| f_08_third_person_pronouns | Third-person personal pronouns (excluding it) |
| f_09_pronoun_it | Pronoun it, its, or itself |
| f_10_demonstrative_pronoun | Pronouns used to replace a noun (e.g. [That] is an example sentence.) |
| f_11_indefinite_pronouns | Indefinite pronouns (e.g., anybody, nothing, someone) |
| f_12_proverb_do | Pro-verb do |
Questions
| Variable | Description |
|---|---|
| f_13_wh_question | Direct wh- questions (e.g., When are you leaving?) |
Passives
| Variable | Description |
|---|---|
| f_17_agentless_passives | Agentless passives (e.g., The task [was done].) |
| f_18_by_passives | by- passives (e.g., The task [was done by Steve].) |
Subordination features
| Variable | Description |
|---|---|
| f_21_that_verb_comp | that verb complements (e.g., I said [that he went].) |
| f_22_that_adj_comp | that adjective complements (e.g., I’m glad [that you like it].) |
| f_23_wh_clause | wh- clauses (e.g., I believed [what he told me].) |
| f_24_infinitives | Infinitives |
| f_25_present_participle | Present participial adverbial clauses (e.g., [Stuffing his mouth with cookies], Joe ran out the door.) |
| f_26_past_participle | Past participial adverbial clauses (e.g., [Built in a single week], the house would stand for fifty years.) |
| f_27_past_participle_whiz | Past participial postnominal (reduced relative) clauses (e.g., the solution [produced by this process]) |
| f_28_present_participle_whiz | Present participial postnominal (reduced relative) clauses (e.g., the event [causing this decline]) |
| f_29_that_subj | that relative clauses on subject position (e.g., the dog [that bit me]) |
| f_30_that_obj | that relative clauses on object position (e.g., the dog [that I saw]) |
| f_31_wh_subj | wh- relatives on subject position (e.g., the man [who likes popcorn]) |
| f_32_wh_obj | wh- relatives on object position (e.g., the man [who Sally likes]) |
| f_33_pied_piping | Pied-piping relative clauses (e.g., the manner [in which he was told]) |
| f_34_sentence_relatives | Sentence relatives (e.g., Bob likes fried mangoes, [which is the most disgusting thing I’ve ever heard of].) |
| f_35_because | Causative adverbial subordinator (because) |
| f_36_though | Concessive adverbial subordinators (although, though) |
| f_37_if | Conditional adverbial subordinators (if, unless) |
| f_38_other_adv_sub | Other adverbial subordinators (e.g., since, while, whereas) |
Prepositional phrases, adjectives, and adverbs
| Variable | Description |
|---|---|
| f_39_prepositions | Total prepositional phrases |
| f_40_adj_attr | Attributive adjectives (e.g., the [big] horse) |
| f_41_adj_pred | Predicative adjectives (e.g., The horse is [big].) |
| f_42_adverbs | Total adverbs |
Lexical specificity
| Variable | Description |
|---|---|
| f_44_mean_word_length | Average word length (across tokens, excluding punctuation) |
Lexical classes
| Variable | Description |
|---|---|
| f_45_conjuncts | Conjuncts (e.g., consequently, furthermore, however) |
| f_46_downtoners | Downtoners (e.g., barely, nearly, slightly) |
| f_47_hedges | Hedges (e.g., at about, something like, almost) |
| f_48_amplifiers | Amplifiers (e.g., absolutely, extremely, perfectly) |
| f_49_emphatics | Emphatics (e.g., a lot, for sure, really) |
| f_50_discourse_particles | Discourse particles (e.g., sentence-initial well, now, anyway) |
| f_51_demonstratives | Demonstratives (that, this, these, or those used as determiners, e.g. [that] feature) |
Modals
| Variable | Description |
|---|---|
| f_52_modal_possibility | Possibility modals (can, may, might, could) |
| f_53_modal_necessity | Necessity modals (ought, should, must) |
| f_54_modal_predictive | Predictive modals (will, would, shall) |
Specialized verb classes
| Variable | Description |
|---|---|
| f_55_verb_public | Public verbs (e.g., assert, declare, mention) |
| f_56_verb_private | Private verbs (e.g., assume, believe, doubt, know) |
| f_57_verb_suasive | Suasive verbs (e.g., command, insist, propose) |
| f_58_verb_seem | seem and appear |
Co-ordination
| Variable | Description |
|---|---|
| f_64_phrasal_coordination | Phrasal co-ordination (N and N; Adj and Adj; V and V; Adv and Adv) |
| f_65_clausal_coordination | Independent clause co-ordination (clause-initial and) |
Negation
| Variable | Description |
|---|---|
| f_66_neg_synthetic | Synthetic negation (e.g., No answer is good enough for Jones.) |
| f_67_neg_analytic | Analytic negation (e.g., That isn’t good enough.) |
Questions
EDA
- Humans tend to adapt their writing and speech to the genre: blog posts are written differently than formal academic writing, which is very different from ordinary unscripted speech. Select the human chunk 2 text. Use the first part of the `doc_id` column to get the text-type of each text. Construct tables and visualizations to identify the features that differ the most by text-type. What features are most associated with formal writing? What features appear the most in informal speech?
- Now extend the comparison to examine how the instruction-tuned models (the GPT-4o models and the Llama Instruct variants) vary the same features by text-type.
- Build a table to compare the mean use of each feature by model.
- Build a visualization to compare the distribution of each feature's use by model against human use. How can you show the distributions in a compact way that allows the many features to be compared? (One possible starting point, including deriving the text-type from `doc_id`, is sketched after this list.)
- Construct plots or tables that illustrate how the instruction-tuned Llama models compare to the base Llama models. What features seem to be changed the most by instruction tuning?
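The sketch below shows one possible starting point in R, continuing from the loading sketch above: deriving the text-type from `doc_id`, tabulating mean feature use by model, and using faceted boxplots to compare the many feature distributions compactly. The column names and the structure of `doc_id` are assumptions; check them against your copy of the data.

```r
# A sketch, not a definitive analysis: assumes `hape` from the loading
# sketch, a `model` column, and doc_ids like "acad_0001" whose prefix
# encodes the text-type.
library(dplyr)
library(tidyr)
library(ggplot2)

hape <- hape |>
  mutate(text_type = sub("_.*", "", doc_id))   # keep text before first "_"

# Mean use of each feature by model
feature_means <- hape |>
  group_by(model) |>
  summarise(across(starts_with("f_"), mean))

# Long format lets us facet every feature in one compact display
hape_long <- hape |>
  pivot_longer(starts_with("f_"), names_to = "feature", values_to = "rate")

ggplot(hape_long, aes(x = model, y = rate)) +
  geom_boxplot(outlier.size = 0.2) +
  facet_wrap(~ feature, scales = "free_y") +
  coord_flip()
```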
Classification
- Build a classifier to identify the source of text using the provided features. Split your data into training and test sets so you can evaluate its accuracy. What is the overall accuracy? How well can it distinguish between ChatGPT and Llama Instruct, and between the different sizes of each LLM? (A minimal classifier sketch follows this list.)
- Subset the data to form pairwise corpora: human chunk 2 versus GPT-4o, human chunk 2 versus Llama 3 70B Instruct, and so on. For each pair, construct a classifier. Which LLMs are easiest to distinguish from human writing? Which are hardest?
- Explore which text-types have the lowest accuracy and which have the highest. In which do LLMs write most like humans, and in which do they write least like humans?
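Here is a minimal sketch of one such classifier, using a random forest from the ranger package. The outcome column `model` is an assumption, and you may well prefer a different model or a proper resampling framework.

```r
# A sketch of a multi-class classifier on the feature columns.
# Assumes `hape` with a `model` column identifying each chunk's source.
library(dplyr)
library(ranger)

set.seed(1)
dat <- hape |>
  mutate(model = factor(model)) |>
  select(model, starts_with("f_"))

train_idx <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit   <- ranger(model ~ ., data = train)
preds <- predict(fit, data = test)$predictions

mean(preds == test$model)                        # overall test accuracy
table(predicted = preds, actual = test$model)    # confusion matrix
```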
Multivariate analysis
- Use a dimension-reduction technique like principal components analysis (PCA) to reduce the features down to a few dimensions. How much variance is explained by the top two dimensions? Try to interpret the dimensions: which features are associated with each dimension? (A PCA sketch follows this list.)
- Plot the texts in the space formed by the top two dimensions. Construct plots that allow you to compare the distributions of the models and the six text-types. Which text-types are most different? Which models are most different?
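A sketch of the PCA step, with column names assumed as in the earlier sketches:

```r
# PCA on the feature columns; scale each feature before rotation since
# the features are on very different scales.
library(dplyr)
library(ggplot2)

feat <- hape |> select(starts_with("f_"))
pca  <- prcomp(feat, center = TRUE, scale. = TRUE)

summary(pca)$importance[, 1:2]   # variance explained by PC1 and PC2
pca$rotation[, 1:2]              # loadings: which features drive each PC

scores <- data.frame(pca$x[, 1:2],
                     model = hape$model,
                     text_type = hape$text_type)

ggplot(scores, aes(PC1, PC2, color = model)) +
  geom_point(alpha = 0.2, size = 0.4) +
  facet_wrap(~ text_type)
```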
References
The full text of the corpus is available at: Brown et al. (2024). Human-AI Parallel Corpus. Hugging Face. doi:10.57967/hf/3770
This data was obtained from: Brown et al. (2024). Human-AI Parallel Corpus, Biber Tagged. Hugging Face. doi:10.57967/hf/3792
Discussion of the creation of the corpus and its analysis: Reinhart et al. (2025). Do LLMs write like humans? Variation in grammatical and rhetorical styles. Proceedings of the National Academy of Sciences, 122(8), e2422455122. doi:10.1073/pnas.2422455122. Preprint: https://arxiv.org/abs/2410.16107