A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A 2024 paper[1] explores the history of Flagged Revisions in several Wikipedia language versions. Flagged Revisions is a vandalism-mitigation feature that was first deployed in the German Wikipedia in 2008. There were calls for the feature to be used broadly across WMF wikis, but community and WMF support both dwindled over the years.
The English Wikipedia uses the term Pending Changes for its variant of the feature. After extensive discussions between 2009 and 2017, the English Wikipedia community settled on a very small role for Pending Changes – it is used in just ~0.06% of articles. In the German Wikipedia, whose community requested the initial development of the system, it is used in nearly all articles.
The authors start with a premise that Flagged Revisions is fundamentally a good idea, citing their own prior research finding that "the system substantially reduced the amount of vandalism visible on the site with little evidence of negative trade-offs" (see our earlier review: "FlaggedRevs study finds that concerns about limiting Wikipedia's 'anyone can edit' principle 'may be overstated'"). They then ask,
"What led to the decline in FlaggedRevs' popularity over the years, despite its effectiveness?"
The paper attributes the loss of popularity to community challenges ("conflicts with existing social norms and values", "unclear moderation instructions"); platform and policy challenges ("lack of technical support from the governing body", "bureaucratic hurdles", "lack of cross-community empirical evidence on the effectiveness of the system"); and technical challenges.
As part of their methodology, the authors analyzed dozens of on-wiki discussions in the English, German, Indonesian, and Hungarian Wikipedias. They also conducted interviews with seven individuals, six of whom were WMF employees. It is unclear how much weight on-wiki discussions were given in the findings.
A major drawback of Pending Changes, according to the English Wikipedia's past RfCs, is that it significantly increases work for experienced editors and leads to backlogs. The paper discusses this issue under "Technical challenges" in a way that suggests it is solvable, but does not say how. Another part of the paper asserts that "the English Wikipedia community" hired at least one contractor to do technical work, implying that the authors mistakenly believe the community can hire people.
Where the paper breaks important ground is in exploring the dynamics between the Wikipedia communities and the WMF. It goes into depth about the sometimes-slow pace of technology development and the limitations of the Community Wishlist Survey process.
The Wikimedia Foundation announced the call for proposals for its 2025 research fund, open until April 16, 2025.
Changes from previous years include:
"For the first time, we are going to accept multi-year extended research proposals (currently for two years with a possibility of applying for renewal for a third year)" – instead of the previous limit of 12 months.
"We have reduced the proposal review stages from two to one for this year."
The list of proposals funded in the 2022–23 round might give an impression of the kind of research work previously produced with support from the fund (while the 2023–24 round is still in progress), and might also shed some light on possible reasons for these changes – e.g. it appears that several projects struggled to complete work within 12 months.
(See also our 2022 coverage: "Wikimedia Research Fund invites proposals for grants up to $50k, announces results of previous year's round")
More than two years after the release of ChatGPT precipitated what English Wikipedia calls the AI boom, its possible effects on Wikipedia continue to preoccupy researchers. Recently, ChatGPT surpassed Wikipedia in Similarweb's "Most Visited Websites In The World" list. While the information value of such rankings might be limited and the death of Wikipedia from AI clearly still isn't imminent, generative AI seems here to stay.
Earlier attempts to investigate ChatGPT's impact on Wikipedia include a rather simplistic analysis by the Wikimedia Foundation, which concluded in February 2024 that there had been "No major drop in readers during the meteoric rise in ChatGPT use", based on a juxtaposition of monthly pageview numbers for 2022 and 2023.
A May 2024 preprint by authors from King's College London (still not published in peer-reviewed form) reported more nuanced findings; see our review: "Actually, Wikipedia was not killed by ChatGPT – but it might be growing a little less because of it".
And a June 2024 abstract-only conference paper presented stronger conclusions; see our coverage: "'Impact of Generative AI': A 'significant decrease in Wikipedia page views' after the release of ChatGPT". However, it likewise doesn't seem to have been published as a full paper yet.
More recently, several new quantitative research publications have examined such issues further from various angles (in addition to some qualitative research papers that we will cover in future issues):
This paper,[2] to be presented at the upcoming WWW conference, focuses on ChatGPT's impact on the English Wikipedia. From the abstract:
How has Wikipedia activity changed for articles with content similar to ChatGPT following its introduction? [...] Our analysis reveals that newly created, popular articles whose content overlaps with ChatGPT 3.5 saw a greater decline in editing and viewership after the November 2022 launch of ChatGPT than dissimilar articles did. These findings indicate heterogeneous substitution effects, where users selectively engage less with existing platforms when AI provides comparable content.
The aforementioned King's College preprint had used what this reviewer called "a fairly crude statistical method". The authors of the present paper directly criticize it as unsuitable for the question at hand:
Several factors about Wikipedia necessitate our differences-in-differences (DiD) strategy, in contrast to the interrupted time series analysis that is often used in similar work [...including the King's College preprint on Wikipedia]. In addition to having a broader scope of topics, Wikipedia allows for more diverse user incentives than analogous platforms: viewers exhibit both shallow and deep information needs, while contributors are driven by both intrinsic and extrinsic motivations. These factors may dampen the effects of ChatGPT on some users and articles. In fact, [the King's College researchers] analyze Wikipedia engagement in the aggregate, and do not identify significant drops in activity following the launch of ChatGPT. We hypothesize that their analysis do not fully capture the heterogeneity of Wikipedia, compared to similar platforms with more homogeneous contents and users.
To account for this (hypothesized) uneven impact of ChatGPT on the articles in their sample, the authors split it into those that are "similar" and "dissimilar" to ChatGPT's output. Concretely, they first prompted GPT 3.5 Turbo to "[...] write an encyclopedic article for [each Wikipedia article's topic], similar to those found on Wikipedia [...]". (GPT 3.5 Turbo, by now a rather dated LLM, corresponds to the free version of ChatGPT available to most users in 2023.) Embeddings of this output and the original Wikipedia article were then derived using a different model by OpenAI. The "similar" vs. "dissimilar" split of the sample was based on the cosine similarity between these two embeddings. The authors interpret this metric, somewhat speculatively, as a proxy for "substitutability of the two options from the user's point of view, and for GPT 3.5's mastery of the topic". They thus assume that for the "dissimilar" half, there is "less possibility that ChatGPT will replace Wikipedia as the main provider of information for these articles".
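To illustrate the general mechanics (this is not the authors' code), a minimal sketch of such a similarity split might look as follows; the specific embedding model and the median-split threshold are assumptions made here for illustration:

```python
# Sketch: split articles into "similar" / "dissimilar" to an LLM's output based
# on cosine similarity of text embeddings. The embedding model name and the
# median-split threshold are assumptions for illustration, not the paper's setup.
import numpy as np
from openai import OpenAI

client = OpenAI()  # requires an OPENAI_API_KEY environment variable

def embed(text: str) -> np.ndarray:
    """Return an embedding vector for the given text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_labels(wiki_texts: dict[str, str], llm_texts: dict[str, str]) -> dict[str, str]:
    """Label each article title 'similar' or 'dissimilar' relative to the median score."""
    scores = {
        title: cosine_similarity(embed(wiki_texts[title]), embed(llm_texts[title]))
        for title in wiki_texts
    }
    cutoff = np.median(list(scores.values()))
    return {title: ("similar" if s >= cutoff else "dissimilar") for title, s in scores.items()}
```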
The sample consisted of all articles that had been among the English Wikipedia's 1000 most viewed for any month during the analyzed timespan (July 2015 to November 2023).
The rest of the paper proceeds to compare the monthly time series of views and edits from before and after the release of ChatGPT (i.e. the months until November 2022 vs. the months from December 2022 on).[supp 1]
To do this, the authors use the aforementioned standard diff-in-diff regression, while also controlling for article length and the trend in overall growth of all articles. As a kind of robustness check, this regression is calculated for varying values of article "recency" T, by including only observations for articles which are at most T months old.
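As a rough illustration of what such a specification might look like (not the paper's exact model), here is a sketch using statsmodels; the variable names, log transformation, month dummies and clustered standard errors are all assumptions:

```python
# Minimal difference-in-differences sketch on a monthly article panel.
# Assumed columns (illustrative, not the paper's exact data): log_views,
# similar (1 = article content similar to ChatGPT output), post (1 = month is
# December 2022 or later), log_length, month (calendar month identifier),
# article_age_months, article_id. The coefficient of interest is similar:post.
import pandas as pd
import statsmodels.formula.api as smf

def did_estimate(panel: pd.DataFrame, max_age_months: int) -> float:
    """Estimate the DiD interaction effect for articles at most T months old."""
    sample = panel[panel["article_age_months"] <= max_age_months]
    model = smf.ols(
        # month fixed effects absorb the overall growth trend across all articles
        "log_views ~ similar * post + log_length + C(month)",
        data=sample,
    ).fit(cov_type="cluster", cov_kwds={"groups": sample["article_id"]})
    return float(model.params["similar:post"])

# Robustness check across varying article "recency" thresholds T:
# estimates = {T: did_estimate(panel, T) for T in (6, 12, 24, 48)}
```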
Overall, the researchers interpret their results as implying "that Wikipedia articles where ChatGPT provides a similar output experience a larger drop in views after the launch of ChatGPT. This effect is much less pronounced for edit behavior."
Somewhat intriguingly, this finding might suggest a disparate impact on different parts of the Wikimedia movement: To simplify a bit, pageviews correspond fairly directly, via Wikipedia's well-known donation banners, to the Wikimedia Foundation's most important source of revenue by far. Edit numbers, on the other hand, are a proxy for the amount of work volunteers put into maintaining and improving Wikipedia. So, very speculatively, one might interpret the paper's results as indicating that ChatGPT endangers the Foundation's financial health more than the editing community's sustainability.
All that said though, it must be kept in mind that generative AI has vastly improved since the timespan analyzed in the paper. Last month, a prominent OpenAI employee even proclaimed that the company's newly released ChatGPT Deep Research tool might be "the beginning of the end for Wikipedia" (in a since-deleted tweet that was followed by some more nuanced statements).
Of note, one of the paper's six authors is Daron Acemoglu, one of the winners of last year's Nobel Prize in Economics, and one of the most cited economists. (However, his work – including on the impact of AI on the labor market – has not always escaped criticism from other economists. Still, one scholar expressed his excitement that the present paper marks "Probably the first time Daron Acemoglu published in WWW!")
Published earlier this month, this preprint[3] presents what the authors call
"[...] a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. [...] Our findings and simulation results reveal that [English] Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories."
As the researchers (from Huazhong University of Science and Technology and International School for Advanced Studies) note, the question of how much content on Wikipedia may be LLM-generated has been explored before:
The detection of AI-generated content has been a hot research topic in recent years [...], including its application to Wikipedia articles (Brooks et al., 2024). But MGT [Machine-Generated Text] detectors have notable limitations [...], and as a result, researchers are also exploring other methods for estimating the LLM impact, such as word frequency analysis [...].
See also our earlier review of that Brooks et al. paper (presented at the "NLP for Wikipedia Workshop" at EMNLP 2024) for more about its various limitations: "'As many as 5%' of new English Wikipedia articles 'contain significant AI-generated content', says paper".
For example, the authors observe an increasing frequency of the words “crucial” and “additionally”, which are favored by ChatGPT [according to previous research], in the content of Wikipedia articles. To be sure, the mere presence of such words is of course affected by even more false positives and false negatives than the LLM detectors used in that previous research (such as GPTZero). However, the authors partially compensate for this by tracking the frequency increase over several years and several content areas.
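A minimal sketch of how such a word-frequency trend could be computed follows; the tracked word list and the corpus format are illustrative assumptions, not the paper's exact pipeline:

```python
# Sketch: track the relative frequency of a few words reported as "favored by
# ChatGPT" across yearly samples of article text. The tracked word list and the
# corpus format are illustrative assumptions, not the paper's exact pipeline.
import re
from collections import Counter

TRACKED_WORDS = {"crucial", "additionally"}

def per_million(texts: list[str]) -> dict[str, float]:
    """Occurrences of each tracked word per million tokens in the given texts."""
    counts, total = Counter(), 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        counts.update(t for t in tokens if t in TRACKED_WORDS)
    return {w: 1e6 * counts[w] / max(total, 1) for w in TRACKED_WORDS}

def frequency_by_year(corpus_by_year: dict[int, list[str]]) -> dict[int, dict[str, float]]:
    """corpus_by_year maps a year to the article texts sampled from that year."""
    return {year: per_million(texts) for year, texts in sorted(corpus_by_year.items())}
```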
Separately from relying on those words found to be "favored by ChatGPT" in previous research, they also use "LLM simulations" to estimate word frequency changes that would indicate LLM usage in Wikipedia:
We use GPT-4o-mini to revise the January 1, 2022, versions of Featured Articles to construct word frequency data reflecting the impact of large language models (LLMs). This choice is based on the assumption that Featured Articles are less likely to be affected by LLMs, given their rigorous review processes and ongoing manual maintenance.
An amusing sidenote here is that the researchers ran into a technical problem with this process because AI companies' content safety standards are apparently stricter than those of Wikipedia: "some responses are filtered due to the prompt triggering Azure OpenAI’s content moderation policy, likely because certain Wikipedia pages contain violent content."
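For illustration, a bare-bones version of such an LLM-revision step might be scripted roughly as follows, assuming the standard OpenAI Python client; the prompt wording and parameters are guesses, not the paper's:

```python
# Sketch: ask an LLM to revise an article's text, so that word frequencies in
# the revision can be compared with the original. The prompt wording and
# parameters are guesses for illustration, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # requires an OPENAI_API_KEY environment variable

def llm_revision(article_text: str) -> str:
    """Return a GPT-4o-mini revision of the given article text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an experienced Wikipedia copyeditor."},
            {"role": "user", "content": "Revise the following Wikipedia article text:\n\n" + article_text},
        ],
    )
    return response.choices[0].message.content
```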
Overall, these methods do lend some support to the above-quoted "impact of approximately 1%-2% in certain categories" result (although it is not quite clear to this reviewer how representative e.g. the conspicuous results for "crucial" and "additionally" are among the much larger sets of words identified as "favored by ChatGPT" in the previous research cited by the paper). But the authors also caution that
"While [the content of] some Wikipedia articles have been influenced by LLMs, the overall impact has so far been quite limited."
The paper also offers a cursory observation about pageviews, but (unlike the WWW paper) does not make a serious attempt at establishing causality:
There has been a slight decline in page views for certain scientific categories on Wikipedia, but the connection to LLMs remains uncertain.
Large parts of the article are concerned with the potential indirect impact on AI research itself:
Our findings that LLMs are impacting Wikipedia and the impact could extend indirectly to some NLP tasks through their dependence on Wikipedia content. For instance, the target language for machine translation may gradually shift towards the language style of LLMs, albeit in small steps.
This is not too dissimilar from the widely publicized (but sometimes overhyped) concerns about a possible "model collapse" for LLMs, but the impact remains speculative.
Interestingly, this paper is one of the few research publications that apart from Wikipedia also uses content from Wikinews, albeit only in an auxiliary fashion (for the purpose of generating questions to test LLMs in specific scenarios).
This "brief communication" in JASIST[4] examines a very similar question (likewise only for English Wikipedia), arriving at some similar overall conclusions as the WWW paper reviewed above:
suggests that Wikipedians are not yet materially affected by AI-related changes in the platform.
Reviewing the Wikipedia readership from recent years, disintermediation [of Wikipedia] by answer bots appears already prevalent and only to increase over time.
However, compared to the WWW paper reviewed above, the statistical methods underlying this assertion about pageviews are rather cavalier, basically relying on eyeballing charts:
Human readership peaked around 2016, at 106 billion page views, and has since dropped to around 90 billion views per year. Meanwhile, the number of automated (non-human) page views has doubled since 2017, from 14 billion to over 28 billion. Human page views are thus likely to continue their decline in favor of AI bot accesses. With that, Wikipedia's visibility will also diminish.
This kind of reasoning obviously ignores any other possible causes for the (slight) decline in human pageviews over the past decade (e.g. improved detection of automated pageviews, or, hypothetically, a decrease in the global number of English speakers with Internet access). The diff-in-diff and interrupted time series methods used by the aforementioned papers are designed to avoid such fallacies.
Also, while the Wikimedia Foundation has indeed reported a rise in scraping activity last year "largely fueled by scraping bots collecting training data for AI-powered workflows and products" (albeit perhaps more in the form of API requests than pageviews), it seems a bit adventurous to attribute all non-human pageviews to "AI bots", considering that there are many other reasons for scraping websites.[supp 2]
To be fair, the rest of the paper does offer a more thorough analysis. The authors construct a feedback model to postulate causal relationships between several different variables, and then check those hypotheses empirically. More concretely, they start with a "basic" flywheel-type model assuming that
The dynamics on Wikipedia start with contributors creating and editing articles, which attract readers to consume the content. As readership grows, readers are more likely to spot a need for edits (e.g., content correction), thereby becoming contributors themselves.
Introducing AI is hypothesized to disrupt this idyllic symbiosis between human editors and readers:
As AI answer bots automate readership and AI writer bots automate contributions, the original dynamics are expanded accordingly. While the reinforcement relationships between contributorship and readership remain, AI answer and writer bots exert negative impacts on human activity due to their crowding-out effects.
The activity of these different parties is then operationalized as pageview and edit numbers drawn from stats.wikimedia.org, relying on the existing classifications of users as bots (for edits) and of pageviews as human or non-human. As mentioned above, it seems quite a stretch to assume that the latter all come from "AI answer bots". Similarly, many a Wikipedian might raise an eyebrow at seeing edit bots (which have existed on Wikipedia since 2001[supp 3]) described as "AI writer bots".[supp 4]
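For readers who want to inspect these series themselves, the aggregate numbers can be retrieved from the public Wikimedia Pageviews API, roughly as sketched below; the date range is only an example, and the "automated" agent split may not cover the earliest years:

```python
# Sketch: retrieve monthly English Wikipedia pageview totals by agent type
# ("user" = human traffic, "automated" = detected non-human traffic) from the
# public Wikimedia Pageviews API. Dates are YYYYMMDDHH strings; the range below
# is only an example.
import requests

API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"
HEADERS = {"User-Agent": "research-newsletter-example/0.1 (example@example.org)"}

def monthly_views(agent: str, start: str = "2018010100", end: str = "2024120100") -> dict[str, int]:
    """Return a mapping of timestamp -> views for en.wikipedia and the given agent."""
    url = f"{API}/en.wikipedia/all-access/{agent}/monthly/{start}/{end}"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return {item["timestamp"]: item["views"] for item in response.json()["items"]}

human_views = monthly_views("user")
automated_views = monthly_views("automated")
```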
And it becomes especially contrived when the authors try to shoehorn existing research literature into justifying their model's assumption that bot editing activity negatively affects human editing activity:
Writer bots are necessary to help human editors maintain an ever-growing body of knowledge and update routine content (Petroni et al., 2022), yet writer bots also negatively affect Wikipedia in that they become competitors or opponents, by creating false knowledge or by deleting legitimate user contributions. Thomas (2023) argues that especially LLMs with their ability to make “creative” (i.e., false but plausible) contributions pose a danger to Wikipedia, requiring human editors to correct such contributions. Elsewhere deletionist writer bots (Livingstone, 2016)[supp 3] became a source of frustration particularly for novice Wikipedians who considered their contributions invalidated and thus turned away from further contributorship.
For example, the abstract of the cited Thomas (2023) paper[5], published less than six months after the release of ChatGPT, explicitly positions it as an evaluation of "the potential benefits and challenges of using large language models (LLMs) like ChatGPT to edit Wikipedia" (emphasis added) – i.e. not a suitable reference for the factual claim that writer bots "also negatively affect Wikipedia in that they become competitors or opponents". And the authors' reference for "deletionist writer bots" (Livingstone, 2016)[supp 3] basically describes the exact opposite: a (human) bot operator's frustrations with his bot's contributions getting deleted by deletionist human editors. The paper contains various further examples of such hallucinated citations.
The model is extended by some other variables, e.g. one modelling the "recognition of Wikipedia as a public good", for which
"we expect Wikipedia recognition to be a driver of contributorship [...]. To assess Wikipedia recognition, we use Google Trends as a proxy, as it monitors the popularity of Wikipedia as a search term."
Based on the last 15–20 years of data, the researchers find statistical support for most of their postulated relationships between these variables, in the form of impressively large adjusted R² values and impressively small p-values.[supp 5] Still, the authors also caution that "Due to the limited data, the proposed feed-back model has yet to be fully tested empirically."
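To give a concrete sense of what "statistical support" of this kind typically amounts to, here is a sketch of how one such relationship might be tested with ordinary least squares; the column names and linear form are assumptions for illustration, not the paper's specification:

```python
# Sketch: test one postulated relationship of the feedback model with OLS,
# e.g. human edit counts as a function of human pageviews and a Google Trends
# "recognition" proxy. Column names and the linear form are illustrative
# assumptions, not the paper's exact specification.
import pandas as pd
import statsmodels.formula.api as smf

def fit_contributorship_model(df: pd.DataFrame):
    """df: one row per period with human_edits, human_views, google_trends_index."""
    model = smf.ols("human_edits ~ human_views + google_trends_index", data=df).fit()
    # The usual "statistical support" metrics: adjusted R-squared and p-values.
    return model.rsquared_adj, model.pvalues
```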
They summarize their overall results as follows:
Starting from the premise of a producer–consumer relationship where readers access knowledge provided by other contributors, and then also create knowledge in return, we postulate a positive feedback cycle where readers attract contributors and vice versa. Using this logic and recognizing the advances in AI-enabled automation, we note that this positive feedback relationship has attracted AI bots both as “readers” (answer bots) and as “contributors” (writer bots), thereby weakening traditional human engagement in both readership and contributorship, thus shifting the originally virtuous cycle towards a vicious cycle that would diminish Wikipedia.
Supplementary notes:
[supp 2]: [...] 14 billion). Unfortunately, the paper comes without replication data or code.
[supp 4]: The authors suggest that the "very recent growth in edits [...] may already be a result of generative AI support", pointing to a table that lists half-yearly edit numbers and ends with a much higher number of bot edits during the first half of 2024. However, that spike has largely subsided in more recent data.
[supp 5]: Some of the paper's reported p-values are given as "<0.000", a mathematical impossibility. While these might just be typos arising from cutting off trailing digits, they don't quite raise confidence in JASIST's peer review processes.
Discuss this story