A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A 2024 paper[1] explores the history of Flagged Revisions in several Wikipedia language versions. Flagged Revisions is a vandalism-mitigation feature that was first deployed in the German Wikipedia in 2008. There were calls for the feature to be used broadly across WMF wikis, but community and WMF support both dwindled over the years.
The English Wikipedia uses the term Pending Changes for its variant of the feature. After extensive discussions between 2009 and 2017, the English Wikipedia community settled on a very small role for Pending Changes – it is used in just ~0.06% of articles. In the German Wikipedia, whose community requested the initial development of the system, it is used in nearly all articles.
The authors start with a premise that Flagged Revisions is fundamentally a good idea, citing their own prior research finding that "the system substantially reduced the amount of vandalism visible on the site with little evidence of negative trade-offs" (see our earlier review: "FlaggedRevs study finds that concerns about limiting Wikipedia's 'anyone can edit' principle 'may be overstated'"). They then ask,
"What led to the decline in FlaggedRevs' popularity over the years, despite its effectiveness?"
The paper attributes the loss of popularity to community challenges ("conflicts with existing social norms and values", "unclear moderation instructions"); platform and policy challenges ("lack of technical support from the governing body", "bureaucratic hurdles", "lack of cross-community empirical evidence on the effectiveness of the system"); and technical challenges.
As part of their methodology, the authors analyzed dozens of on-wiki discussions in the English, German, Indonesian, and Hungarian Wikipedias. They also conducted interviews with seven individuals, six of whom were WMF employees. It is unclear how much weight on-wiki discussions were given in the findings.
A major drawback of Pending Changes, according to the English Wikipedia's past RfCs, is that it significantly increases work for experienced editors and leads to backlogs. The paper discusses this issue under "Technical challenges" in a way that suggests it is solvable, but does not say how. Another part of the paper asserts that "the English Wikipedia community" hired at least one contractor to do technical work, implying that the authors mistakenly believe the community can hire people.
Where the paper breaks important ground is in exploring the dynamics between the Wikipedia communities and the WMF. It goes into depth about the sometimes-slow pace of technology development and the limitations of the Community Wishlist Survey process.
The Wikimedia Foundation announced the call for proposals for its 2025 research fund, open until April 16, 2025.
Changes from previous years include:
"For the first time, we are going to accept multi-year extended research proposals (currently for two years with a possibility of applying for renewal for a third year)" – instead of the previous limit of 12 months.
"We have reduced the proposal review stages from two to one for this year."
The list of proposals funded in the 2022–23 round might give an impression of the kind of research work previously produced with support from the fund (while the 2023–24 round is still in progress), and might also shed some light on possible reasons for these changes – e.g. it appears that several projects struggled to complete work within 12 months.
(See also our 2022 coverage: "Wikimedia Research Fund invites proposals for grants up to $50k, announces results of previous year's round")
More than two years after the release of ChatGPT precipitated what English Wikipedia calls the AI boom, its possible effects on Wikipedia continue to preoccupy researchers. Recently, ChatGPT surpassed Wikipedia in Similarweb's "Most Visited Websites In The World" list. While the information value of such rankings might be limited and the death of Wikipedia from AI clearly still isn't imminent, generative AI seems here to stay.
Earlier attempts to investigate ChatGPT's impact on Wikipedia include a rather simplistic analysis by the Wikimedia Foundation, which concluded in February 2024 that there had been "No major drop in readers during the meteoric rise in ChatGPT use", based on a juxtaposition of monthly pageview numbers for 2022 and 2023.
A May 2024 preprint by authors from King's College London (still not published in peer-reviewed form) reported more nuanced findings; see our review: "Actually, Wikipedia was not killed by ChatGPT – but it might be growing a little less because of it".
And a June 2024 abstract-only conference paper presented stronger conclusions; see our coverage: "'Impact of Generative AI': A 'significant decrease in Wikipedia page views' after the release of ChatGPT". However, it likewise doesn't seem to have been published as a full paper yet.
More recently, several new quantitative research publications have examined such issues further from various angles (in addition to some qualitative research papers that we will cover in future issues):
This paper,[2] to be presented at the upcoming WWW conference, focuses on ChatGPT's impact on the English Wikipedia. From the abstract:
How has Wikipedia activity changed for articles with content similar to ChatGPT following its introduction? [...] Our analysis reveals that newly created, popular articles whose content overlaps with ChatGPT 3.5 saw a greater decline in editing and viewership after the November 2022 launch of ChatGPT than dissimilar articles did. These findings indicate heterogeneous substitution effects, where users selectively engage less with existing platforms when AI provides comparable content.
The aforementioned King's College preprint had used what this reviewer called "a fairly crude statistical method". The authors of the present paper directly criticize it as unsuitable for the question at hand:
Several factors about Wikipedia necessitate our differences-in-differences (DiD) strategy, in contrast to the interrupted time series analysis that is often used in similar work [...including the King's College preprint on Wikipedia]. In addition to having a broader scope of topics, Wikipedia allows for more diverse user incentives than analogous platforms: viewers exhibit both shallow and deep information needs, while contributors are driven by both intrinsic and extrinsic motivations. These factors may dampen the effects of ChatGPT on some users and articles. In fact, [the King's College researchers] analyze Wikipedia engagement in the aggregate, and do not identify significant drops in activity following the launch of ChatGPT. We hypothesize that their analysis do not fully capture the heterogeneity of Wikipedia, compared to similar platforms with more homogeneous contents and users.
To account for this (hypothesized) uneven impact of ChatGPT on the articles in their sample, the authors split it into those that are "similar" and "dissimilar" to ChatGPT's output. Concretely, they first prompted GPT 3.5 Turbo to "[...] write an encyclopedic article for [each Wikipedia article's topic], similar to those found on Wikipedia [...]". (GPT 3.5 Turbo, by now a rather dated LLM, corresponds to the free version of ChatGPT available to most users in 2023.) Embeddings of this output and the original Wikipedia article were then derived using a different model by OpenAI. The "similar" vs. "dissimilar" split of the sample was based on the cosine similarity between these two embeddings. The authors interpret this metric, somewhat speculatively, as a proxy for "substitutability of the two options from the user's point of view, and for GPT 3.5's mastery of the topic". They thus assume that for the "dissimilar" half, there is "less possibility that ChatGPT will replace Wikipedia as the main provider of information for these articles".
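To illustrate the general mechanics (this is not the authors' code), a minimal sketch of such a similarity split might look as follows; the specific embedding model and the median-split threshold are assumptions made here for illustration:

```python
# Sketch: split articles into "similar" / "dissimilar" to an LLM's output based
# on cosine similarity of text embeddings. The embedding model name and the
# median-split threshold are assumptions for illustration, not the paper's setup.
import numpy as np
from openai import OpenAI

client = OpenAI()  # requires an OPENAI_API_KEY environment variable

def embed(text: str) -> np.ndarray:
    """Return an embedding vector for the given text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_labels(wiki_texts: dict[str, str], llm_texts: dict[str, str]) -> dict[str, str]:
    """Label each article title 'similar' or 'dissimilar' relative to the median score."""
    scores = {
        title: cosine_similarity(embed(wiki_texts[title]), embed(llm_texts[title]))
        for title in wiki_texts
    }
    cutoff = np.median(list(scores.values()))
    return {title: ("similar" if s >= cutoff else "dissimilar") for title, s in scores.items()}
```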
The sample consisted of all articles that had been among the English Wikipedia's 1000 most viewed for any month during the analyzed timespan (July 2015 to November 2023).
The rest of the paper proceeds to compare the monthly time series of views and edits from before and after the release of ChatGPT (i.e. the months until November 2022 vs. the months from December 2022 on).[supp 1]
To do this, the authors use the aforementioned standard diff-in-diff regression, while also controlling for article length and the trend in overall growth of all articles. As a kind of robustness check, this regression is calculated for varying values of article "recency" T, by including only observations for articles which are at most T months old.
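As a rough illustration of what such a specification might look like (not the paper's exact model), here is a sketch using statsmodels; the variable names, log transformation, month dummies and clustered standard errors are all assumptions:

```python
# Minimal difference-in-differences sketch on a monthly article panel.
# Assumed columns (illustrative, not the paper's exact data): log_views,
# similar (1 = article content similar to ChatGPT output), post (1 = month is
# December 2022 or later), log_length, month (calendar month identifier),
# article_age_months, article_id. The coefficient of interest is similar:post.
import pandas as pd
import statsmodels.formula.api as smf

def did_estimate(panel: pd.DataFrame, max_age_months: int) -> float:
    """Estimate the DiD interaction effect for articles at most T months old."""
    sample = panel[panel["article_age_months"] <= max_age_months]
    model = smf.ols(
        # month fixed effects absorb the overall growth trend across all articles
        "log_views ~ similar * post + log_length + C(month)",
        data=sample,
    ).fit(cov_type="cluster", cov_kwds={"groups": sample["article_id"]})
    return float(model.params["similar:post"])

# Robustness check across varying article "recency" thresholds T:
# estimates = {T: did_estimate(panel, T) for T in (6, 12, 24, 48)}
```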
Overall, the researchers interpret their results as implying "that Wikipedia articles where ChatGPT provides a similar output experience a larger drop in views after the launch of ChatGPT. This effect is much less pronounced for edit behavior."
Somewhat intriguingly, this finding might suggest a disparate impact on different parts of the Wikimedia movement: To simplify a bit, pageviews correspond fairly directly, via Wikipedia's well-known donation banners, to the Wikimedia Foundation's most important source of revenue by far. Edit numbers, on the other hand, are a proxy for the amount of work volunteers put into maintaining and improving Wikipedia. So, very speculatively, one might interpret the paper's results as indicating that ChatGPT endangers the Foundation's financial health more than the editing community's sustainability.
All that said though, it must be kept in mind that generative AI has vastly improved since the timespan analyzed in the paper. Last month, a prominent OpenAI employee even proclaimed that the company's newly released ChatGPT Deep Research tool might be "the beginning of the end for Wikipedia" (in a since-deleted tweet that was followed by some more nuanced statements).
Of note, one of the paper's six authors is Daron Acemoglu, one of the winners of last year's Nobel Prize in Economics, and one of the most cited economists. (However, his work – including on the impact of AI on the labor market – has not always escaped criticism from other economists. Still, one scholar expressed his excitement that the present paper marks "Probably the first time Daron Acemoglu published in WWW!")
Published earlier this month, this preprint[3] presents what the authors call
"[...] a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. [...] Our findings and simulation results reveal that [English] Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories."
As the researchers (from Huazhong University of Science and Technology and International School for Advanced Studies) note, the question of how much content on Wikipedia may be LLM-generated has been explored before:
The detection of AI-generated content has been a hot research topic in recent years [...], including its application to Wikipedia articles (Brooks et al., 2024). But MGT [Machine-Generated Text] detectors have notable limitations [...], and as a result, researchers are also exploring other methods for estimating the LLM impact, such as word frequency analysis [...].
See also our earlier review of that Brooks et al. paper (presented at the "NLP for Wikipedia Workshop" at EMNLP 2024) for more about its various limitations: "'As many as 5%' of new English Wikipedia articles 'contain significant AI-generated content', says paper".
For example, the authors observe an increasing frequency of the words “crucial” and “additionally”, which are favored by ChatGPT [according to previous research], in the content of Wikipedia articles. To be sure, the mere presence of such words is of course affected by even more false positives and false negatives than the LLM detectors used in that previous research (such as GPTZero). However, the authors partially compensate for this by tracking the frequency increase over several years and several content areas.
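A minimal sketch of how such a word-frequency trend could be computed follows; the tracked word list and the corpus format are illustrative assumptions, not the paper's exact pipeline:

```python
# Sketch: track the relative frequency of a few words reported as "favored by
# ChatGPT" across yearly samples of article text. The tracked word list and the
# corpus format are illustrative assumptions, not the paper's exact pipeline.
import re
from collections import Counter

TRACKED_WORDS = {"crucial", "additionally"}

def per_million(texts: list[str]) -> dict[str, float]:
    """Occurrences of each tracked word per million tokens in the given texts."""
    counts, total = Counter(), 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        counts.update(t for t in tokens if t in TRACKED_WORDS)
    return {w: 1e6 * counts[w] / max(total, 1) for w in TRACKED_WORDS}

def frequency_by_year(corpus_by_year: dict[int, list[str]]) -> dict[int, dict[str, float]]:
    """corpus_by_year maps a year to the article texts sampled from that year."""
    return {year: per_million(texts) for year, texts in sorted(corpus_by_year.items())}
```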
Separately from relying on those words found to be "favored by ChatGPT" in previous research, they also use "LLM simulations" to estimate word frequency changes that would indicate LLM usage in Wikipedia:
We use GPT-4o-mini to revise the January 1, 2022, versions of Featured Articles to construct word frequency data reflecting the impact of large language models (LLMs). This choice is based on the assumption that Featured Articles are less likely to be affected by LLMs, given their rigorous review processes and ongoing manual maintenance.
An amusing sidenote here is that the researchers ran into a technical problem with this process because AI companies' content safety standards are apparently stricter than those of Wikipedia: "some responses are filtered due to the prompt triggering Azure OpenAI’s content moderation policy, likely because certain Wikipedia pages contain violent content."
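For illustration, a bare-bones version of such an LLM-revision step might be scripted roughly as follows, assuming the standard OpenAI Python client; the prompt wording and parameters are guesses, not the paper's:

```python
# Sketch: ask an LLM to revise an article's text, so that word frequencies in
# the revision can be compared with the original. The prompt wording and
# parameters are guesses for illustration, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # requires an OPENAI_API_KEY environment variable

def llm_revision(article_text: str) -> str:
    """Return a GPT-4o-mini revision of the given article text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an experienced Wikipedia copyeditor."},
            {"role": "user", "content": "Revise the following Wikipedia article text:\n\n" + article_text},
        ],
    )
    return response.choices[0].message.content
```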
Overall, these methods do lend some support to the above-quoted "impact of approximately 1%-2% in certain categories" result (although it is not quite clear to this reviewer how representative e.g. the conspicuous results for "crucial" and "additionally" are among the much larger sets of words identified as "favored by ChatGPT" in the previous research cited by the paper). But the authors also caution that
"While [the content of] some Wikipedia articles have been influenced by LLMs, the overall impact has so far been quite limited."
The paper also offers a cursory observation about pageviews, but (unlike the WWW paper) does not make a serious attempt at establishing causality:
There has been a slight decline in page views for certain scientific categories on Wikipedia, but the connection to LLMs remains uncertain.
Large parts of the article are concerned with the potential indirect impact on AI research itself:
Our findings that LLMs are impacting Wikipedia and the impact could extend indirectly to some NLP tasks through their dependence on Wikipedia content. For instance, the target language for machine translation may gradually shift towards the language style of LLMs, albeit in small steps.
This is not too dissimilar from the widely publicized (but sometimes overhyped) concerns about a possible "model collapse" for LLMs, but the impact remains speculative.
Interestingly, this paper is one of the few research publications that apart from Wikipedia also uses content from Wikinews, albeit only in an auxiliary fashion (for the purpose of generating questions to test LLMs in specific scenarios).
This "brief communication" in JASIST[4] examines a very similar question (likewise only for English Wikipedia), arriving at some similar overall conclusions as the WWW paper reviewed above:
suggests that Wikipedians are not yet materially affected by AI-related changes in the platform.
Reviewing the Wikipedia readership from recent years, disintermediation [of Wikipedia] by answer bots appears already prevalent and only to increase over time.
However, compared to the WWW paper reviewed above, the statistical methods underlying this assertion about pageviews are rather cavalier, basically relying on eyeballing charts:
Human readership peaked around 2016, at 106 billion page views, and has since dropped to around 90 billion views per year. Meanwhile, the number of automated (non-human) page views has doubled since 2017, from 14 billion to over 28 billion. Human page views are thus likely to continue their decline in favor of AI bot accesses. With that, Wikipedia's visibility will also diminish.
This kind of reasoning obviously ignores any other possible causes for the (slight) decline in human pageviews over the past decade (e.g. improved detection of automated pageviews, or, hypothetically, a decrease in the global number of English speakers with Internet access). The diff-in-diff and interrupted time series methods used by the aforementioned papers are designed to avoid such fallacies.
Also, while the Wikimedia Foundation has indeed reported a rise in scraping activity last year "largely fueled by scraping bots collecting training data for AI-powered workflows and products" (albeit perhaps more in the form of API requests than pageviews), it seems a bit adventurous to attribute all non-human pageviews to "AI bots", considering that there are many other reasons for scraping websites.[supp 2]
To be fair, the rest of the paper does offer a more thorough analysis. The authors construct a feedback model to postulate causal relationships between several different variables, and then check those hypotheses empirically. More concretely, they start with a "basic" flywheel-type model assuming that
The dynamics on Wikipedia start with contributors creating and editing articles, which attract readers to consume the content. As readership grows, readers are more likely to spot a need for edits (e.g., content correction), thereby becoming contributors themselves.
Introducing AI is hypothesized to disrupt this idyllic symbiosis between human editors and readers:
As AI answer bots automate readership and AI writer bots automate contributions, the original dynamics are expanded accordingly. While the reinforcement relationships between contributorship and readership remain, AI answer and writer bots exert negative impacts on human activity due to their crowding-out effects.
The activity of these different parties is then operationalized as pageview and edit numbers drawn from stats.wikimedia.org, relying on the existing classifications of users as bots (for edits) and of pageviews as human or non-human. As mentioned above, it seems quite a stretch to assume that the latter all come from "AI answer bots". Similarly, many a Wikipedian might raise an eyebrow at seeing edit bots (which have existed on Wikipedia since 2001[supp 3]) described as "AI writer bots".[supp 4]
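For readers who want to inspect these series themselves, the aggregate numbers can be retrieved from the public Wikimedia Pageviews API, roughly as sketched below; the date range is only an example, and the "automated" agent split may not cover the earliest years:

```python
# Sketch: retrieve monthly English Wikipedia pageview totals by agent type
# ("user" = human traffic, "automated" = detected non-human traffic) from the
# public Wikimedia Pageviews API. Dates are YYYYMMDDHH strings; the range below
# is only an example.
import requests

API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"
HEADERS = {"User-Agent": "research-newsletter-example/0.1 (example@example.org)"}

def monthly_views(agent: str, start: str = "2018010100", end: str = "2024120100") -> dict[str, int]:
    """Return a mapping of timestamp -> views for en.wikipedia and the given agent."""
    url = f"{API}/en.wikipedia/all-access/{agent}/monthly/{start}/{end}"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return {item["timestamp"]: item["views"] for item in response.json()["items"]}

human_views = monthly_views("user")
automated_views = monthly_views("automated")
```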
And it becomes especially contrived when the authors try to shoehorn existing research literature into justifying their model's assumption that bot editing activity negatively affects human editing activity:
Writer bots are necessary to help human editors maintain an ever-growing body of knowledge and update routine content (Petroni et al., 2022), yet writer bots also negatively affect Wikipedia in that they become competitors or opponents, by creating false knowledge or by deleting legitimate user contributions. Thomas (2023) argues that especially LLMs with their ability to make “creative” (i.e., false but plausible) contributions pose a danger to Wikipedia, requiring human editors to correct such contributions. Elsewhere deletionist writer bots (Livingstone, 2016)[supp 3] became a source of frustration particularly for novice Wikipedians who considered their contributions invalidated and thus turned away from further contributorship.
For example, the abstract of the cited Thomas (2023) paper[5], published less than six months after the release of ChatGPT, explicitly positions it as an evaluation of "the potential benefits and challenges of using large language models (LLMs) like ChatGPT to edit Wikipedia" (emphasis added) – i.e. not a suitable reference for the factual claim that writer bots "also negatively affect Wikipedia in that they become competitors or opponents". And the authors' reference for "deletionist writer bots" (Livingstone, 2016)[supp 3] basically describes the exact opposite: a (human) bot operator's frustrations with his bot's contributions getting deleted by deletionist human editors. The paper contains various further examples of such hallucinated citations.
The model is extended by some other variables, e.g. one modelling the "recognition of Wikipedia as a public good", for which
"we expect Wikipedia recognition to be a driver of contributorship [...]. To assess Wikipedia recognition, we use Google Trends as a proxy, as it monitors the popularity of Wikipedia as a search term."
Based on the last 15–20 years of data, the researchers find statistical support for most of their postulated relationships between these variables, in the form of impressively large adjusted R² values and impressively small p-values.[supp 5] Still, the authors also caution that "Due to the limited data, the proposed feed-back model has yet to be fully tested empirically."
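To give a concrete sense of what "statistical support" of this kind typically amounts to, here is a sketch of how one such relationship might be tested with ordinary least squares; the column names and linear form are assumptions for illustration, not the paper's specification:

```python
# Sketch: test one postulated relationship of the feedback model with OLS,
# e.g. human edit counts as a function of human pageviews and a Google Trends
# "recognition" proxy. Column names and the linear form are illustrative
# assumptions, not the paper's exact specification.
import pandas as pd
import statsmodels.formula.api as smf

def fit_contributorship_model(df: pd.DataFrame):
    """df: one row per period with human_edits, human_views, google_trends_index."""
    model = smf.ols("human_edits ~ human_views + google_trends_index", data=df).fit()
    # The usual "statistical support" metrics: adjusted R-squared and p-values.
    return model.rsquared_adj, model.pvalues
```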
They summarize their overall results as follows:
Starting from the premise of a producer–consumer relationship where readers access knowledge provided by other contributors, and then also create knowledge in return, we postulate a positive feedback cycle where readers attract contributors and vice versa. Using this logic and recognizing the advances in AI-enabled automation, we note that this positive feedback relationship has attracted AI bots both as “readers” (answer bots) and as “contributors” (writer bots), thereby weakening traditional human engagement in both readership and contributorship, thus shifting the originally virtuous cycle towards a vicious cycle that would diminish Wikipedia.
Supplementary notes:
[supp 2]: [...] 14 billion). Unfortunately, the paper comes without replication data or code.
[supp 4]: The authors suggest that the "very recent growth in edits [...] may already be a result of generative AI support", pointing to a table that lists half-yearly edit numbers and ends with a much higher number of bot edits during the first half of 2024. However, that spike has largely subsided in more recent data.
[supp 5]: Some of the paper's reported p-values are given as "<0.000", a mathematical impossibility. While these might just be typos arising from cutting off trailing digits, they don't quite raise confidence in JASIST's peer review processes.
Discuss this story