The Signpost

Recent research

Explaining the disappointing history of Flagged Revisions; and what's the impact of ChatGPT on Wikipedia so far?

By Tilman Bayer and Clayoquot


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


Flagged Revisions: Explaining a disappointing history

Reviewed by Clayoquot

A 2024 paper[1] explores the history of Flagged Revisions in several Wikipedia language versions. Flagged Revisions is a vandalism-mitigation feature that was first deployed in the German Wikipedia in 2008. There were calls for the feature to be used broadly across WMF wikis, but both community and WMF support dwindled over the years.

The English Wikipedia uses the term Pending Changes for its variant of the feature. After extensive discussions between 2009 and 2017, the English Wikipedia community settled on a very small role for Pending Changes – it is used in just ~0.06% of articles. In the German Wikipedia, whose community requested the initial development of the system, it is used in nearly all articles.

In the English Wikipedia's implementation of Flagged Revisions, certain edits are not shown to readers until they are accepted by an editor with the pending changes reviewer permission, or an administrator.
"Illustration of the challenges between different stakeholders. Most of the platform challenges arise between the Wikimedia Foundation and the different language editions. Most community challenges arise inside the different language editions, among the user base. Technological challenges arise both between the latter and former stakeholders." (Figure 2 from the paper)

The authors start with a premise that Flagged Revisions is fundamentally a good idea, citing their own prior research finding that "the system substantially reduced the amount of vandalism visible on the site with little evidence of negative trade-offs" (see our earlier review: "FlaggedRevs study finds that concerns about limiting Wikipedia's 'anyone can edit' principle 'may be overstated'"). They then ask,

"What led to the decline in FlaggedRevs' popularity over the years, despite its effectiveness?"

The paper attributes the loss of popularity to community challenges ("conflicts with existing social norms and values", "unclear moderation instructions"); platform and policy challenges ("lack of technical support from the governing body", "bureaucratic hurdles", "lack of cross-community empirical evidence on the effectiveness of the system"); and technical challenges.

As part of their methodology, the authors analyzed dozens of on-wiki discussions in the English, German, Indonesian, and Hungarian Wikipedias. They also conducted interviews with seven individuals, six of whom were WMF employees. It is unclear how much weight was given to the on-wiki discussions in the findings.

A major drawback of Pending Changes, according to the English Wikipedia's past RfCs, is that it significantly increases work for experienced editors and leads to backlogs. The paper discusses this issue under "Technical challenges" in a way that suggests it is solvable, but does not say how. Another part of the paper asserts that "the English Wikipedia community" hired at least one contractor to do technical work, implying that the authors mistakenly believe the community itself can hire people.

Where the paper breaks important ground is in exploring the dynamics between the Wikipedia communities and the WMF. It goes into depth about the sometimes-slow pace of technology development and the limitations of the Community Wishlist Survey process.

Applications are open for the 2025 Wikimedia Research Fund

By Tilman Bayer

The Wikimedia Foundation announced the call for proposals for its 2025 research fund, open until April 16, 2025.

Changes from previous years include:

The list of proposals funded in the 2022–23 round might give an impression of the kind of research work previously produced with support from the fund (while the 2023–24 round is still in progress), and might also shed some light on possible reasons for these changes – e.g. it appears that several projects struggled to complete work within 12 months:

Proposal title | Applicant(s) | Budget (USD) | Status as of March 22, 2025 (per project page)
Reliable sources and public policy issues: understanding multisector organisations as sources on Wikipedia and Wikidata | Amanda Lawrence | 45,000 | in progress
Codifying Digital Behavior Around the World: A Socio-Legal Study of the Wikimedia Universal Code of Conduct | Florian Grisel and Giovanni De Gregorio | 49,402.77 | completed
Dashboards to understand the organization of social memory about Chileans in Wikipedia. Politicians, scientists, artists, and sportspersons since the 19th century | Pablo Beytía and Carlos Cruz Infante | 36,000.00 | completed
Understanding how Editors Use Machine Translation in Wikipedia: A Case Study in African Languages | Eleftheria Briakou, Tajuddeen Gwadabe and Marine Carpuat | 50,000 | in progress (duration extended to June 2025)
A Whole New World – Integration of New Editors into the Serbian Wikipedia Community | Nevena Rudinac and Nebojša Ratković | 25,000–30,000 | completed
Disinformation, Wikimedia and Alternative Content Moderation Models: Possibilities and Challenges (2022–23) | Ramiro Álvarez Ugarte | 47,147 | in progress
Measuring the Gender Gap: Attribute-based Class Completeness Estimation | Gianluca Demartini | 50,000 | in progress (one of the planned studies has been published already)
Reducing the gender gap in AfD discussions: a semi-supervised learning approach | Giovanni Luca Ciampaglia and Khandaker Tasnim Huq | 50,000 | in progress (duration extended to July 2025)
Implications of ChatGPT for knowledge integrity on Wikipedia | Heather Ford, Michael Davis and Marian-Andrei Rizoiu | 32,449 | in progress
Network perspectives on collective memory processes across the Arabic and English Wikipedias | H. Laurie Jones and Brian Keegan | 40,000 | completed


(See also our 2022 coverage: "Wikimedia Research Fund invites proposals for grants up to $50k, announces results of previous year's round")

So again, what has the impact of ChatGPT really been?

By Tilman Bayer

More than two years after the release of ChatGPT precipitated what English Wikipedia calls the AI boom, its possible effects on Wikipedia continue to preoccupy researchers. Recently, ChatGPT surpassed Wikipedia in Similarweb's "Most Visited Websites In The World" list. While the information value of such rankings might be limited and the death of Wikipedia from AI clearly still isn't imminent, generative AI seems here to stay.

Previous efforts

Earlier attempts to investigate ChatGPT's impact on Wikipedia include a rather simplistic analysis by the Wikimedia Foundation, which concluded in February 2024 that there had been "[n]o major drop in readers during the meteoric rise in ChatGPT use", based on a juxtaposition of monthly pageview numbers for 2022 and 2023.

A May 2024 preprint by authors from King's College London (still not published in peer-reviewed form) reported more nuanced findings; see our review: "Actually, Wikipedia was not killed by ChatGPT – but it might be growing a little less because of it".

And a June 2024 abstract-only conference paper presented stronger conclusions; see our coverage: "'Impact of Generative AI': A 'significant decrease in Wikipedia page views' after the release of ChatGPT". However, this work likewise doesn't seem to have been published as a full paper yet.

More recently, several new quantitative research publications have examined such issues further from various angles (in addition to some qualitative research papers that we will cover in future issues):

"Wikipedia Contributions in the Wake of ChatGPT"

This paper,[2] to be presented at the upcoming WWW conference, focuses on ChatGPT's impact on the English Wikipedia. From the abstract:

How has Wikipedia activity changed for articles with content similar to ChatGPT following its introduction? [...] Our analysis reveals that newly created, popular articles whose content overlaps with ChatGPT 3.5 saw a greater decline in editing and viewership after the November 2022 launch of ChatGPT than dissimilar articles did. These findings indicate heterogeneous substitution effects, where users selectively engage less with existing platforms when AI provides comparable content.

The aforementioned King's College preprint had used what this reviewer called a fairly crude statistical method. The authors of the present paper directly criticize it as unsuitable for the question at hand:

Several factors about Wikipedia necessitate our differences-in-differences (DiD) strategy, in contrast to the interrupted time series analysis that is often used in similar work [...including the King's College preprint on Wikipedia]. In addition to having a broader scope of topics, Wikipedia allows for more diverse user incentives than analogous platforms: viewers exhibit both shallow and deep information needs, while contributors are driven by both intrinsic and extrinsic motivations. These factors may dampen the effects of ChatGPT on some users and articles. In fact, [the King's College researchers] analyze Wikipedia engagement in the aggregate, and do not identify significant drops in activity following the launch of ChatGPT. We hypothesize that their analysis do not fully capture the heterogeneity of Wikipedia, compared to similar platforms with more homogeneous contents and users.

To account for this (hypothesized) uneven impact of ChatGPT on the articles in their sample, the authors split it into those that are "similar" and "dissimilar" to ChatGPT's output. Concretely, they first prompted GPT 3.5 Turbo to "[...] write an encyclopedic article for [each Wikipedia article's topic], similar to those found on Wikipedia [...]". (GPT 3.5 Turbo, by now a rather dated LLM, corresponds to the free version of ChatGPT available to most users in 2023.) Embeddings of this output and of the original Wikipedia article were then derived using a different model by OpenAI. The "similar" vs. "dissimilar" split of the sample was based on the cosine similarity between these two embeddings. The authors interpret this metric, somewhat speculatively, as a proxy for substitutability of the two options from the user's point of view, and for GPT 3.5's mastery of the topic. They thus assume that for the "dissimilar" half, there is "less possibility that ChatGPT will replace Wikipedia as the main provider of information for these articles".
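For readers who want a concrete sense of this similarity-scoring step, here is a minimal sketch of how it could be reproduced. The prompt wording follows the paper's description, but the embedding model name ("text-embedding-3-small") is our assumption – the paper only says the embeddings came from "a different model by OpenAI" – and the helper names are hypothetical.

```python
# Minimal sketch of the similarity-scoring step described above.
# Requires the openai and numpy packages and an OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def generate_gpt_article(topic: str) -> str:
    """Ask GPT 3.5 Turbo for an encyclopedia-style article on the topic,
    roughly following the prompt quoted in the paper."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write an encyclopedic article for '{topic}', "
                       "similar to those found on Wikipedia.",
        }],
    )
    return response.choices[0].message.content

def embed(text: str) -> np.ndarray:
    # The paper only says embeddings came from "a different model by OpenAI";
    # text-embedding-3-small is used here purely for illustration.
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The sample is then split into "similar" / "dissimilar" halves by this score.
wiki_text = "..."  # full text of the Wikipedia article, e.g. fetched via the MediaWiki API
score = cosine_similarity(embed(generate_gpt_article("Example topic")), embed(wiki_text))
print(score)
```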

The sample consisted of all articles that had been among the English Wikipedia's 1000 most viewed for any month during the analyzed timespan (July 2015 to November 2023).
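The paper does not document its data-collection pipeline, but the public Wikimedia Pageviews API exposes exactly the monthly top-1000 lists needed to assemble such a sample. The following sketch illustrates one way this could be done; it is not the authors' code, and the contact address in the User-Agent is a placeholder.

```python
# Illustrative sketch (not the paper's documented pipeline): assembling the set
# of English Wikipedia articles that appeared in any monthly top-1000 pageview
# list between July 2015 and November 2023, via the public Wikimedia Pageviews API.
import requests

HEADERS = {"User-Agent": "signpost-example/0.1 (contact: example@example.org)"}  # placeholder contact
TOP_URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
           "en.wikipedia/all-access/{year}/{month:02d}/all-days")

def top_articles(year: int, month: int) -> set[str]:
    resp = requests.get(TOP_URL.format(year=year, month=month), headers=HEADERS, timeout=30)
    resp.raise_for_status()
    ranked = resp.json()["items"][0]["articles"]  # up to 1000 entries, ranked by views
    return {entry["article"] for entry in ranked}

sample: set[str] = set()
for year in range(2015, 2024):
    for month in range(1, 13):
        if (year, month) < (2015, 7) or (year, month) > (2023, 11):
            continue  # keep only the study window
        sample |= top_articles(year, month)

print(len(sample), "articles ever in a monthly top-1000 list")
```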

"Our Diff-in-Diff Methodology. Measuring Wikipedia activity before and after the introduction of ChatGPT for Wikipedia articles that rank similarly and dissimilarly to the same information generated by ChatGPT." (Figure 1 from the paper)

The rest of the paper proceeds to compare the monthly time series of views and edits from before and after the release of ChatGPT (i.e. the months until November 2022 vs. the months from December 2022 on).[supp 1]

To do this, the authors use the aforementioned standard diff-in-diff regression, while also controlling for article length and the trend in overall growth of all articles. As a kind of robustness check, this regression is calculated for varying values of article "recency" T, by including only observations for articles which are at most T months old.
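To illustrate, a schematic version of such a diff-in-diff regression could look as follows with statsmodels. The variable names, the panel file, the linear time trend, and the clustering choice are our assumptions for illustration, not the authors' actual specification or code.

```python
# Schematic difference-in-differences regression (illustrative, not the
# paper's actual specification): "similar" articles form the treated group,
# months from December 2022 onward form the post period, and article length
# plus a common linear time trend serve as controls.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical article-month panel with one row per (article, month):
#   article_id, log_views, similar (0/1), post (0/1), log_length, trend
df = pd.read_csv("article_month_panel.csv")

model = smf.ols("log_views ~ similar * post + log_length + trend", data=df)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["article_id"]})

# The coefficient on the interaction term is the DiD estimate: the additional
# post-ChatGPT change in (log) views for "similar" relative to "dissimilar" articles.
print(result.params["similar:post"], result.pvalues["similar:post"])

# The paper's robustness check over article "recency" T would repeat this
# regression on the subset of observations where the article is at most T months old.
```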

Plot showing evidence for a decrease in pageviews for LLM-similar Wikipedia articles, in the form of "the estimated diff-in-diff coefficients [with 95% confidence intervals] for each recency parameter T from 1 to 24 months (x-axis) [...]. Each point represents an individual regression: points to the left focus on newer articles and their views and edits in the first few months, while points to the right place greater weight on activities of older articles close to two years after creation"
The analogous plot of diff-in-diff coefficients for edit numbers, indicating a "much less pronounced" effect

Overall, the researchers interpret their results as implying

that Wikipedia articles where ChatGPT provides a similar output experience a larger drop in views after the launch of ChatGPT. This effect is much less pronounced for edit behavior.

Somewhat intriguingly, this finding might suggest a disparate impact on different parts of the Wikimedia movement: To simplify a bit, pageviews correspond fairly directly, via Wikipedia's well-known donation banners, to the Wikimedia Foundation's most important source of revenue by far. Edit numbers, on the other hand, are a proxy for the amount of work volunteers put into maintaining and improving Wikipedia. So, very speculatively, one might interpret the paper's results as indicating that ChatGPT endangers the Foundation's financial health more than the editing community's sustainability.

All that said, it must be kept in mind that generative AI has vastly improved since the timespan analyzed in the paper. Last month, a prominent OpenAI employee even proclaimed that the company's newly released ChatGPT Deep Research tool might be the beginning of the end for Wikipedia (in a since-deleted tweet that was followed by some more nuanced statements).

Of note, one of the paper's six authors is Daron Acemoglu, one of the winners of last year's Nobel Prize in Economics, and one of the most cited economists. (However, his work – including on the impact of AI on the labor market – has not always escaped criticism from other economists. Still, one scholar expressed his excitement that the present paper marks "[p]robably the first time Daron Acemoglu published in WWW!")

The rise of "additionally" and "crucial": "Wikipedia in the Era of LLMs: Evolution and Risks"

"Word frequency evolution for [the] word “additionally from 2020 to 2025", in several content areas on English Wikipedia (figure 10, from the paper's "LLM Direct Impact" appendix section
Word frequency of "crucial" in the first section of English Wikipedia articles in several content areas on English Wikipedia, from 2020 to 2025

Published earlier this month, this preprint[3] presents what the authors call

"[...] a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. [...] Our findings and simulation results reveal that [English] Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories."

As the researchers (from Huazhong University of Science and Technology and the International School for Advanced Studies) note, the question of how much content on Wikipedia may be LLM-generated has been explored before:

The detection of AI-generated content has been a hot research topic in recent years [...], including its application to Wikipedia articles (Brooks et al., 2024). But MGT [machine-generated text] detectors have notable limitations [...], and as a result, researchers are also exploring other methods for estimating the LLM impact, such as word frequency analysis [...].

See also our earlier review of that Brooks et al. paper (presented at the "NLP for Wikipedia Workshop" at EMNLP 2024) for more about its various limitations: "As many as 5%" of new English Wikipedia articles "contain significant AI-generated content", says paper.

For example, the authors observe an increasing frequency of the words "crucial" and "additionally", which are favored by ChatGPT [according to previous research], in the content of Wikipedia articles. To be sure, the mere presence of such words is of course subject to even more false positives and false negatives than the LLM detectors used in that previous research (such as GPTZero). However, the authors partially compensate for this by tracking the frequency increase over several years and several content areas.
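In principle, this kind of word-frequency tracking is straightforward to replicate. The sketch below assumes plain-text yearly snapshots of articles in a local directory – a hypothetical layout, as the paper's own data processing is not described at this level of detail – and computes the average relative frequency of the two marker words per year.

```python
# Minimal sketch of marker-word frequency tracking across yearly article
# snapshots. The snapshots/<year>/<article>.txt layout is a hypothetical
# stand-in for the authors' actual data processing.
import re
from collections import Counter
from pathlib import Path

MARKER_WORDS = ["additionally", "crucial"]  # words reported as favored by ChatGPT

def word_frequencies(text: str) -> dict[str, float]:
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {w: counts[w] / total for w in MARKER_WORDS}

for year_dir in sorted(Path("snapshots").iterdir()):
    per_article = [word_frequencies(p.read_text(encoding="utf-8"))
                   for p in year_dir.glob("*.txt")]
    for word in MARKER_WORDS:
        mean_freq = sum(f[word] for f in per_article) / max(len(per_article), 1)
        print(f"{year_dir.name}\t{word}\t{mean_freq:.6f}")
```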

Separately from relying on those words found to be "favored by ChatGPT" in previous research, they also use "LLM simulations" to estimate word frequency changes that would indicate LLM usage in Wikipedia:

We use GPT-4o-mini to revise the January 1, 2022, versions of Featured Articles to construct word frequency data reflecting the impact of large language models (LLMs). This choice is based on the assumption that Featured Articles are less likely to be affected by LLMs, given their rigorous review processes and ongoing manual maintenance.

An amusing sidenote here is that the researchers ran into a technical problem with this process because AI companies' content safety standards are apparently stricter than those of Wikipedia: "some responses are filtered due to the prompt triggering Azure OpenAI's content moderation policy, likely because certain Wikipedia pages contain violent content".
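For illustration, this revision step could be reproduced roughly as follows via Azure OpenAI (which is what the content-moderation error message implies the authors used). The endpoint, API version, deployment name and prompt wording are placeholders and assumptions, not the paper's.

```python
# Sketch of the paper's "LLM simulation" step: have GPT-4o-mini revise the
# January 1, 2022 text of a Featured Article, so that the revised text's word
# frequencies can serve as a reference for LLM influence. Endpoint, API
# version, deployment name and prompt are illustrative placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://example-resource.openai.azure.com",  # placeholder
    api_key="...",             # placeholder
    api_version="2024-06-01",  # assumption
)

def llm_revise(article_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Azure deployment name (assumed)
        messages=[{
            "role": "user",
            "content": "Revise the following Wikipedia article text while "
                       "preserving its meaning:\n\n" + article_text,
        }],
    )
    return response.choices[0].message.content

# Comparing word frequencies of llm_revise(old_text) against old_text then
# yields the "LLM-influenced" reference distribution described in the quote above.
```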

Overall, these methods do lend some support to the above-quoted "impact of approximately 1%-2% in certain categories" result (although it is not quite clear to this reviewer how representative e.g. the conspicuous results for "crucial" and "additionally" are among the much larger sets of words identified as "favored by ChatGPT" in the previous research cited by the paper). But the authors also caution that

  • While [the content of] some Wikipedia articles have been influenced by LLMs, the overall impact has so far been quite limited.

The paper also offers a cursory observation about pageviews, but (unlike the WWW paper) does not make a serious attempt at establishing causality:

There has been a slight decline in page views for certain scientific categories on Wikipedia, but the connection to LLMs remains uncertain.

Large parts of the article are concerned with the potential indirect impact on AI research itself:

Our findings that LLMs are impacting Wikipedia and the impact could extend indirectly to some NLP tasks through their dependence on Wikipedia content. For instance, the target language for machine translation may gradually shift towards the language style of LLMs, albeit in small steps.

This is not too dissimilar from the widely publicized (but sometimes overhyped) concerns about a possible "model collapse" for LLMs, but the impact remains speculative.

Interestingly, this paper is one of the few research publications that, apart from Wikipedia, also use content from Wikinews, albeit only in an auxiliary fashion (for the purpose of generating questions to test LLMs in specific scenarios).


"Death by AI: Will large language models diminish Wikipedia?"

This "brief communication" in JASIST[4] examines a very similar question (likewise only for English Wikipedia), arriving at some similar overall conclusions as the WWW paper reviewed above:

However, compared to the WWW paper reviewed above, the statistical methods underlying this assertion about pageviews are rather cavalier - they basically rely on eyeballing charts:

Human readership peaked around 2016, at 106 billion page views, and has since dropped to around 90 billion views per year. Meanwhile, the number of automated (non-human) page views has doubled since 2017, from 14 billion to over 28 billion. Human page views are thus likely to continue their decline in favor of AI bot accesses. With that, Wikipedia's visibility will also diminish.

This kind of reasoning obviously ignores any other possible causes for the (slight) decline in human pageviews over the past decade (e.g. improved detection of automated pageviews, or, hypothetically, a decrease in the global number of English speakers with Internet access). The diff-in-diff and interrupted time series methods used by the aforementioned papers are designed to avoid such fallacies.

Also, while the Wikimedia Foundation has indeed reported a rise in scraping activity last year, largely fueled by scraping bots collecting training data for AI-powered workflows and products (albeit perhaps more in the form of API requests than pageviews), it seems a bit adventurous to attribute all non-human pageviews to "AI bots", considering that there are many other reasons for scraping websites.[supp 2]

To be fair, the rest of the paper does offer a more thorough analysis. The authors construct a feedback model to postulate causal relationships between several different variables, and then check those hypotheses empirically. More concretely, they start with a "basic" flywheel-type model assuming that

The dynamics on Wikipedia start with contributors creating and editing articles, which attract readers to consume the content. As readership grows, readers are more likely to spot a need for edits (e.g., content correction), thereby becoming contributors themselves.

Introducing AI is hypothesized to disrupt this idyllic symbiosis between human editors and readers:

As AI answer bots automate readership and AI writer bots automate contributions, the original dynamics are expanded accordingly. While the reinforcement relationships between contributorship and readership remain, AI answer and writer bots exert negative impacts on human activity due to their crowding-out effects.

The activity of these different parties is then operationalized as pageview and edit numbers drawn from stats.wikimedia.org, relying on the existing classifications of users as bots (for edits) and of pageviews as human or non-human. As mentioned above, it seems quite a stretch to assume that the latter all come from "AI answer bots". Similarly, many a Wikipedian might raise an eyebrow at seeing edit bots (which have existed on Wikipedia since 2001[supp 3]) described as "AI writer bots".[supp 4]
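These series are publicly available; for instance, the sketch below pulls monthly human vs. automated pageviews and human vs. bot edits for the English Wikipedia from the Wikimedia analytics REST API that underlies stats.wikimedia.org. The exact series the authors downloaded are not documented, and the contact address in the User-Agent is a placeholder.

```python
# Illustrative sketch of retrieving the kinds of series the paper relies on:
# human vs. automated pageviews, and human vs. bot edits, for the English
# Wikipedia, from the public Wikimedia analytics REST API.
import requests

HEADERS = {"User-Agent": "signpost-example/0.1 (contact: example@example.org)"}  # placeholder contact
BASE = "https://wikimedia.org/api/rest_v1/metrics"

def monthly_pageviews(agent: str) -> list[dict]:
    # agent: "user" (human), "automated", or "spider"
    url = f"{BASE}/pageviews/aggregate/en.wikipedia/all-access/{agent}/monthly/2015070100/2024120100"
    return requests.get(url, headers=HEADERS, timeout=30).json()["items"]

def monthly_edits(editor_type: str) -> list[dict]:
    # editor_type: "user", "anonymous", "group-bot" (registered bots), or "name-bot"
    url = f"{BASE}/edits/aggregate/en.wikipedia/{editor_type}/all-page-types/monthly/20150701/20241201"
    return requests.get(url, headers=HEADERS, timeout=30).json()["items"][0]["results"]

human_views = monthly_pageviews("user")
automated_views = monthly_pageviews("automated")
human_edits = monthly_edits("user")
bot_edits = monthly_edits("group-bot")
print(human_views[-1], bot_edits[-1])
```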

And it becomes especially contrived when the authors try to shoehorn existing research literature into justifying their model's assumption that bot editing activity negatively affects human editing activity:

Writer bots are necessary to help human editors maintain an ever-growing body of knowledge and update routine content (Petroni et al., 2022), yet writer bots also negatively affect Wikipedia in that they become competitors or opponents, by creating false knowledge or by deleting legitimate user contributions. Thomas (2023) argues that especially LLMs with their ability to make “creative” (i.e., false but plausible) contributions pose a danger to Wikipedia, requiring human editors to correct such contributions. Elsewhere deletionist writer bots (Livingstone, 2016)[supp 3] became a source of frustration particularly for novice Wikipedians who considered their contributions invalidated and thus turned away from further contributorship.

For example, the abstract of the cited Thomas (2023) paper[5], published less than six months after the release of ChatGPT, explicitly positions it as an evaluation of "the potential benefits and challenges of using large language models (LLMs) like ChatGPT to edit Wikipedia" (emphasis added) – i.e. not a suitable reference for the factual claim that writer bots "negatively affect Wikipedia in that they become competitors or opponents". And the authors' reference for "deletionist writer bots" (Livingstone, 2016)[supp 3] basically describes the exact opposite: a (human) bot operator's frustrations with his bots' contributions getting deleted by deletionist human editors. The paper contains various further examples of such hallucinated citations.

The model is extended by some other variables, e.g. one modelling the recognition of Wikipedia as a public good, for which

we expect Wikipedia recognition to be a driver of contributorship [...]. To assess Wikipedia recognition, we use Google Trends as a proxy, as it monitors the popularity of Wikipedia as a search term.
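As an aside, a Google Trends series of this kind can be retrieved, for example, with the unofficial pytrends package. The paper does not say how its data were obtained, so the following is purely illustrative.

```python
# Sketch of obtaining the "Wikipedia recognition" proxy series from Google
# Trends via the unofficial pytrends package (the paper does not document how
# its Google Trends data were retrieved, so this is purely illustrative).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["Wikipedia"], timeframe="2004-01-01 2024-12-31")
interest = pytrends.interest_over_time()  # monthly relative search interest, scaled 0-100

# This series could then stand in for the "recognition" variable in the
# authors' feedback model, alongside pageview and edit counts from stats.wikimedia.org.
print(interest["Wikipedia"].tail())
```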

Based on the last 15–20 years of data, the researchers find statistical support for most of their postulated relationships between these variables, in the form of impressively large adjusted R² values and impressively small p-values.[supp 5] Still, the authors also caution that "[d]ue to the limited data, the proposed feed-back model has yet to be fully tested empirically".

They summarize their overall results as follows:

Starting from the premise of a producer–consumer relationship where readers access knowledge provided by other contributors, and then also create knowledge in return, we postulate a positive feedback cycle where readers attract contributors and vice versa. Using this logic and recognizing the advances in AI-enabled automation, we note that this positive feedback relationship has attracted AI bots both as “readers” (answer bots) and as “contributors” (writer bots), thereby weakening traditional human engagement in both readership and contributorship, thus shifting the originally virtuous cycle towards a vicious cycle that would diminish Wikipedia.

Briefly

References

  1. ^ Tran, Chau; Take, Kejsi; Champion, Kaylea; Hill, Benjamin Mako; Greenstadt, Rachel (2024-11-08). "Challenges in Restructuring Community-based Moderation". Proc. ACM Hum.-Comput. Interact. 8 (CSCW2): 415:1–415:24. doi:10.1145/3686954. Retrieved 2025-03-16. (closed access) / Open access preprint: Tran, Chau; Take, Kejsi; Champion, Kaylea; Hill, Benjamin Mako; Greenstadt, Rachel (2024-02-27), Challenges in Restructuring Community-based Moderation, arXiv:2402.17880
  2. ^ Lyu, Liang; Siderius, James; Li, Hannah; Acemoglu, Daron; Huttenlocher, Daniel; Ozdaglar, Asuman (2025-03-02), Wikipedia Contributions in the Wake of ChatGPT, arXiv:2503.00757
  3. ^ Huang, Siming; Xu, Yuliang; Geng, Mingmeng; Wan, Yao; Chen, Dongping (2025-03-04), Wikipedia in the Era of LLMs: Evolution and Risks, arXiv:2503.02879. Data and code available.
  4. ^ Wagner, Christian; Jiang, Ling (2025-01-03). "Death by AI: Will large language models diminish Wikipedia?". Journal of the Association for Information Science and Technology. n/a (n/a). doi:10.1002/asi.24975. ISSN 2330-1643.
  5. ^ Thomas, Paul A. (2023-05-09). "Wikipedia and large language models: perfect pairing or perfect storm?". Library Hi Tech News. 40 (10): 6–8. doi:10.1108/LHTN-03-2023-0056. hdl:1808/34102. ISSN 0741-9058. (closed access)
Supplementary references and notes:
  1. ^ Unfortunately the authors do not specify whether they excluded bot edits and non-human pageviews.
  2. ^ What's more, the cited numbers don't even match the stated source, i.e. the Wikimedia Foundation's stats.wikimedia.org site (which e.g. gives 17.3 billion non-human views for 2017, not 14 billion). Unfortunately the paper comes without replication data or code.
  3. ^ a b c Livingstone, Randall M. (2016-01-09). "Population automation: An interview with Wikipedia bot pioneer Ram-Man". First Monday. 21 (1). doi:10.5210/fm.v21i1.6027. ISSN 1396-0466.
  4. ^ The authors' problematic assumption that bot accounts on Wikipedia are powered by AI is especially evident in the "Discussion" section, which claims that a very recent growth in edits [...] may already be a result of generative AI support, pointing to a table that lists half-yearly edit numbers and ends with a much higher number in bot edits during the first half of 2024. However, that spike has largely subsided in more recent data.
  5. ^ However, statisticians with high blood pressure are advised to avoid looking at Table 1, which reports several of these p-values as negative (<0.000), a mathematical impossibility. While these might just be typos arising from cutting off trailing digits, they don't quite raise confidence in JASIST's peer review processes.

