Thursday, May 14, 2026

The Impact of AI-Generated Text on the Internet

Jonas Dolezal, Sawood Alam, Mark Graham, and Maty Bohacek, The Impact of AI-Generated Text on the Internet.

Abstract: The proliferation of AI-generated and AI-assisted text on the internet is feared to contribute to a degradation in semantic and stylistic diversity, factual accuracy, and other negative developments (sometimes subsumed under the “Dead Internet Theory”). What has hindered answering these questions is that it has not been understood just how much of the internet is actually AI-generated or AI-edited. To this end, we construct a representative sample of websites published on the internet between 2022 and 2025 using the Internet Archive, and apply a state-of-the-art AI text detector on them. We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT’s launch in late 2022. We also find statistically significant evidence for some of the identified hypotheses; for example, that increases in AI-generated text on the internet correlate negatively with semantic diversity and positively with the prevalence of positive sentiment. We do not, however, find statistically significant evidence supporting the hypothesis that an increased rate of AI-generated text on the internet decreases factual accuracy or stylistic diversity. Notably, this diverges from public perception, which we measure in a user study, where the majority of US adults turned out to believe in all four of the above-mentioned hypotheses. Individuals who do not use AI or use it infrequently tend to believe in these negative impacts more than those who use it frequently; similarly, individuals who hold negative views of AI tend to believe in these hypotheses more than those with favorable views of the technology.

From the introduction:

Ever since ChatGPT first made large language models (LLMs) available to the wider public in 2022, which was followed by mass adoption, there have been concerns about the impact of AI-generated text (as well as AI-generated content in other modalities) on the internet and online discourse (Ferrara, 2026; Muzumdar et al., 2025). Specifically, many known limitations and failure modes of LLMs, including factual hallucinations (Huang et al., 2025), sycophancy (Malmqvist, 2025), verbosity (Saito et al., 2023), and more, have raised concerns that unchecked proliferation of such content could reduce the overall quality of internet content (Shumailov et al., 2024; Xing et al., 2025). These hypotheses are sometimes subsumed under the “Dead Internet Theory,” which they loosely expand, but which, on its own, predates the widespread use of LLMs (Muzumdar et al., 2025). These hypotheses have been difficult to verify, primarily because there is limited understanding of how much internet content is actually AI-generated (Santy et al., 2025; Spennemann, 2025). In this paper, we attempt to address these questions. We concern ourselves only with LLM-generated text, leaving other modalities for future work, and use LLM-generated and AI-generated interchangeably.

The authors have also published a less technical version of their research online, which includes a shorter abstract of their findings:

The proliferation of AI-generated and AI-assisted text on the internet is feared to contribute to a degradation in semantic and stylistic diversity, factual accuracy, and other negative developments. We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT's launch in late 2022. We also find evidence suggesting that increases in AI-generated text on the internet bring about a decrease in semantic diversity and an increase in positive sentiment. We do not, however, find statistically significant evidence supporting the hypothesis that an increased rate of AI-generated text on the internet decreases factual accuracy or stylistic diversity. Notably, our findings diverge from public perception of AI's impact on the internet.

Here's a statement of their methodology:

Answering this question is harder than it might seem. Constructing a statistically representative sample of the internet is difficult, as there is no central index, popular domains are vastly over-represented in most crawls, and archival coverage has shifted considerably over time. To work around this, we draw on the Internet Archive's Wayback Machine and apply a multi-dimensional stratified sampling approach, approximating a uniform random draw from publicly accessible web pages published between 2022 and 2025 (see Section 3.1 in our paper).
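To make the idea of stratified sampling concrete, here is a minimal Python sketch. It is not the authors' code. The strata used here (month and domain), the record fields, and the function name `stratified_sample` are all assumptions for illustration; the paper's actual stratification dimensions are described in its Section 3.1. The sketch shows only the basic point: drawing a fixed number of pages from each stratum keeps one heavily crawled domain from dominating the sample.

```python
import random
from collections import defaultdict


def stratified_sample(records, strata_keys, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each stratum.

    records: list of dicts describing archived pages (hypothetical schema).
    strata_keys: the dict keys whose combined values define a stratum.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group records by their stratum, e.g. ("2024-01", "big.example").
    buckets = defaultdict(list)
    for rec in records:
        buckets[tuple(rec[k] for k in strata_keys)].append(rec)

    # Sample without replacement inside each stratum.
    sample = []
    for key in sorted(buckets):
        group = buckets[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Toy crawl (invented data): one popular domain has 9x the captures.
records = [
    {"url": f"https://big.example/{i}", "domain": "big.example", "month": "2024-01"}
    for i in range(90)
] + [
    {"url": f"https://small.example/{i}", "domain": "small.example", "month": "2024-01"}
    for i in range(10)
]

sample = stratified_sample(records, ("month", "domain"), per_stratum=5)
per_domain = defaultdict(int)
for rec in sample:
    per_domain[rec["domain"]] += 1
```

In this toy crawl a naive uniform draw over captures would pick the large domain about nine times out of ten, whereas the stratified draw returns the same number of pages from each domain.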

On top of this sample, we need a reliable way to tell AI-generated and AI-assisted text apart from human-written text. AI-generated text detection is itself an open problem, so rather than committing to a single detector, we experiment with four prominent methods selected based on their performance on the RAID benchmark: Binoculars, Desklib, DivEye, and Pangram v3. We then run our own robustness checks across text length, HTML versus plain text, model family, model version, and language, and choose the detector that comes out the strongest overall — Pangram v3 (see Appendix A in our paper).
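A comparison of this kind can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: the detector names, the condition names, and the scores below are invented placeholders, and the selection rule (highest mean score across conditions, with ties broken by the worst-case score) is an assumption, since the quoted summary does not say how "strongest overall" was decided. The real evaluation is in Appendix A of the paper.

```python
def pick_most_robust(scores):
    """Return the detector name with the highest mean score across conditions.

    scores: {detector_name: {condition_name: score}}.
    Ties on the mean are broken by the higher worst-case score.
    This selection rule is an assumption, not taken from the paper.
    """
    def rank(name):
        values = list(scores[name].values())
        return (sum(values) / len(values), min(values))

    return max(scores, key=rank)


# Placeholder numbers for two hypothetical detectors; not real results.
scores = {
    "detector_a": {"short_text": 0.70, "html": 0.90, "non_english": 0.60},
    "detector_b": {"short_text": 0.85, "html": 0.85, "non_english": 0.80},
}

best = pick_most_robust(scores)
```

With these placeholder values the second detector is selected, because it holds up across all three conditions even though the first one scores higher on HTML input alone.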

AI-Generated Text on the Internet from Mid-2022 to Mid-2025. The proportion of websites classified as fully AI-generated (red) and AI-generated or AI-assisted (purple) based on Pangram v3 detection applied to representative samples obtained from the Internet Archive. The dashed line marks ChatGPT's public launch in November 2022.

H/t Tyler Cowen.
