The Signpost


Recent research

"As many as 5%" of new English Wikipedia articles "contain significant AI-generated content"

By Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"As many as 5%" of new English Wikipedia articles "contain significant AI-generated content"

Figure 1 from the paper: "Using two tools, GPTZero and Binoculars, we detect that as many as 5% of 2,909 English Wikipedia articles created in August 2024 contain significant AI-generated content. The classification thresholds of both tools were calibrated to maintain a FPR of no more than 1% on a pre GPT-3.5 Wikipedia baseline, as indicated by the red line."

A new paper titled "The Rise of AI-Generated Content in Wikipedia"[1] estimates

"that 4.36% of 2,909 English Wikipedia articles created in August 2024 contain significant AI-generated content"

In more detail, the authors used two existing AI detectors, which

"reveal a marked increase in AI-generated content in recent[ly created] pages compared to those from before the release of GPT-3.5 [in March 2022]. With thresholds calibrated to achieve a 1% false positive rate on pre-GPT-3.5 articles, detectors flag over 5% of newly created English Wikipedia articles as AI-generated, with lower percentages for German, French, and Italian articles. Flagged Wikipedia articles are typically of lower quality and are often self-promotional or partial towards a specific viewpoint on controversial topics."

These are among the first research results providing a quantitative answer to an important question that Wikipedia's editing community and the Wikimedia Foundation have been weighing since at least the release of ChatGPT almost two years ago. (Cf. previous Signpost coverage: Community rejects proposal to create policy about large language models, "AI is not playing games anymore. Is Wikipedia ready?", and in this issue: "Keeping AI at bay – with a little help from volunteers", summarizing media coverage of WikiProject AI Cleanup). The "Implications of ChatGPT for knowledge integrity on Wikipedia" were also the topic of a research project conducted in 2023-2024 by UT Sydney researchers (funded by a $32k Wikimedia Foundation grant), which just published preliminary results where "Concerns about AI-generated content bypassing human curation" are highlighted as one of the challenges voiced by Wikipedians.

The new study's numbers should be valuable as concrete evidence that generative AI has indeed started to affect Wikipedia in this manner (but might also be reassuring for those who had been fearing Wikipedia would be overrun entirely by ChatGPT-generated articles).

That said, there are several serious concerns about how to interpret the study's data, and unfortunately the authors (a postdoc, a graduate student and an undergraduate student from Princeton University) address them only partially.

Figure 2 from the paper: "The activity of this user, who was flagged for instigating an ‘Edit War,’ reveals that within a single day, they created three articles (red border), all identified as AI-generated. Notably, at 13:00 (green border), the user edited the outcome of ‘War in Dibra’ from ‘Mixed Results’ to ‘Victory’ and removed key text, just an hour before creating a new page titled ‘Uprising in Dibra.’ That page (see Figure 3) has since been deleted by moderators."

First, the researchers made no attempt to quantify how many of the articles from their headline result ("4.36% of 2,909 English Wikipedia articles created in August 2024 contain significant AI-generated content") had also been detected (and flagged or deleted) by Wikipedians. They did inspect a smaller subset, namely "the 45 English articles flagged as AI-generated by both GPTZero and Binoculars" (corresponding to 1.5% of those 2,909), finding that

"Most of the 45 pages are flagged by moderators and bots with some warning, e.g., 'This article does not cite any sources. Please help improve this article by adding citations to reliable sources' or even 'This article may incorporate text from a large language model.'"

Even for this smaller sample though, we are not told what percentage of AI-generated articles survived.
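To put the sample sizes in perspective, the percentages quoted above can be converted back into article counts. The following is only a back-of-the-envelope sketch; it assumes that the 4.36% headline figure applies to the full set of 2,909 articles, and the variable names are chosen here for illustration.

```python
# Figures quoted from the paper.
total_articles = 2909      # English Wikipedia articles created in August 2024
flagged_by_both = 45       # articles flagged by both GPTZero and Binoculars
headline_share = 0.0436    # "4.36% ... contain significant AI-generated content"

# Share of the sample that the manually inspected subset represents.
share_both_pct = 100 * flagged_by_both / total_articles

# Approximate number of articles behind the headline estimate
# (assumption: the percentage refers to all 2,909 articles).
headline_count = headline_share * total_articles

print(f"Flagged by both tools: {share_both_pct:.1f}% of the sample")
print(f"Headline estimate: about {round(headline_count)} articles")
```

In other words, the subset that the authors inspected by hand covers only about a third of the articles behind their headline number.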

Has the AI-Wikipedia ouroboros begun to devour itself? Or is it still being starved on a much more meager diet than the paper's results might make one believe?

In other words, the paper is a rather unsatisfactory read for those interested in the important question of whether generative AI threatens to overwhelm or at least degrade Wikipedia's quality control mechanisms - or whether these handle LLM-generated articles just fine alongside the existing never-ending stream of human-generated vandalism, hoaxes, or articles with missing or misleading references (see also our last issue, about an LLM-based system that generates gene articles with fewer such "hallucinated" references than human Wikipedia editors). Overall, while the paper's title boldly claims to show "The Rise of AI-Generated Content in Wikipedia", it leaves it entirely unclear whether the text that Wikipedia readers actually read has become substantially more likely to be AI-generated. (Or, for that matter, the text that AI systems themselves read, considering that Wikipedia is an important training source for LLMs - i.e. whether the paper is evidence for concerns that "The ouroboros has begun".)

Secondly and more importantly, the reliability of AI content detection software - such as the two tools that the study's numerical results are based on - has been repeatedly questioned. To their credit, the authors are aware of these problems and try to address them, for example by combining the results of two different detectors, and by using a comparison dataset of articles created before the release of GPT-3.5 in March 2022 (which can be reasonably assumed to be virtually free of LLM-generated text). However, their method still leaves several questions unanswered that may well threaten the validity of the study's results overall.

In more detail, the authors "use two prominent detection tools which were suitably scalable for our study". The first tool is

"GPTZero [...] a commercial AI detector that reports the probabilities that an input text is entirely written by AI, entirely written by humans, or written by a combination of AI and humans. In our experiments we use the probability that an input text is entirely written by AI. The black-box nature of the tool limits any insight into its methodology."

The second tool is more transparent:

"An open-source method, Binoculars [...] uses two separate LLMs [...] to score a text s for AI-likelihood by normalizing perplexity by a quantity termed cross-perplexity [...] The input text is classified as AI-generated if the score is lower than a determined threshold, calibrated according to a desired false positive rate (FPR). [...] For our experiments, we use Falcon-7b and Falcon-7b-instruct [as the two LLMs, following the recommendation of the authors of the Binoculars paper.] Compared to competing open-source detectors, Binoculars reports superior performance across various domains including Wikipedia"
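For readers unfamiliar with this kind of score, the ratio described in the quote can be illustrated with a toy example. This is a minimal sketch and not the Binoculars implementation: it replaces the two Falcon models with hand-written next-token distributions over a three-token vocabulary, and the function names as well as the assignment of roles to the two models are assumptions made here for illustration.

```python
import math

def log_perplexity(model_a_probs, token_ids):
    """Average negative log-probability that model A assigns to the actual tokens."""
    total = sum(math.log(dist[tok]) for dist, tok in zip(model_a_probs, token_ids))
    return -total / len(token_ids)

def log_cross_perplexity(model_a_probs, model_b_probs):
    """Average cross-entropy between the two models' next-token distributions."""
    total = 0.0
    for dist_a, dist_b in zip(model_a_probs, model_b_probs):
        total += -sum(pa * math.log(pb) for pa, pb in zip(dist_a, dist_b))
    return total / len(model_a_probs)

def binoculars_style_score(model_a_probs, model_b_probs, token_ids):
    """Perplexity normalized by cross-perplexity; lower means more 'AI-like'."""
    return log_perplexity(model_a_probs, token_ids) / log_cross_perplexity(
        model_a_probs, model_b_probs
    )

# Hypothetical next-token distributions at two positions (toy vocabulary of 3 tokens).
model_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
model_b = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]

predictable_text = [0, 1]  # tokens both models consider likely
surprising_text = [2, 2]   # tokens both models consider unlikely

low_score = binoculars_style_score(model_a, model_b, predictable_text)
high_score = binoculars_style_score(model_a, model_b, surprising_text)
print(low_score, high_score)
```

Text that the models find easy to predict receives the lower score, which is why, as the quote says, a text is classified as AI-generated when its score falls below the threshold.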

The "superior performance" of the Binoculars tool (online demo) for the Wikipedia "domain" sounds very reassuring, with both precision and recall at or near a perfect 100% according to figure 3 in the "Binoculars" paper[supp 1]. But it refers to the performance on a 2023 dataset called "M4",[supp 2] where the AI "articles" to be detected were generated in a rather simplistic manner. ("We prompted LLMs to generate Wikipedia articles given titles, with the requirement that the output articles contain at least 250 words", see also the results for e.g. ChatGPT). It seems unwise to assume that this is representative of all the ways in which actual editors try to use AI to generate new articles in August 2024. Indeed the authors explicitly acknowledge this in a different part of the "Rise" paper, pointing out they did not attempt to "simulat[e] the various ways Wikipedia authors might use LLMs to assist in writing—taking into account different models, prompts, and the extent of human integration, among other factors." As a small illustration of potential issues with this, the few concrete examples of articles detected as AI-generated that are included in the paper (figure 2, see above) all start with an infobox - something which ChatGPT can certainly generate if explicitly prompted to do so, but which seems to be absent from most or all of the examples in the M4 dataset. (The authors removed some formatting from the HTML of the Wikipedia articles before passing them to the detector, but apparently not templates - such as infoboxes - in general.)

What's more, as is evident from Figure 1 (above), as used in the paper, both tools disagreed frequently, with GPTZero being much more detection-happy than Binoculars in English, French, and German, but much less in Italian. The authors acknowledge that "the tools we use are primarily for detecting AI-generated content in English. While GPTZero supports Spanish and French, it is not designed for other languages."

As mentioned in the paper's abstract (see above), the authors try to control for false positives by calibrating both detectors to a 1% false positive rate on the control dataset (of presumably AI-free Wikipedia articles from March 2022). A technical issue that appears to have been overlooked here is that this 2022 dataset was generated from the source wikitext (using the well-known "mwparserfromhell" Python package), whereas the authors obtained their August 2024 articles by scraping the HTML pages on the Wikipedia website (and, as mentioned above, applying some of their own cleanup steps). LLM-based text classification tools can sometimes be quite sensitive to minor formatting aspects.
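The calibration step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' code: the scores are synthetic, higher scores are taken to mean "more AI-like" (for a Binoculars-style score, where lower means more AI-like, one would negate the values first), and `calibrate_threshold` and `flag_rate` are hypothetical helper names.

```python
import math

def calibrate_threshold(baseline_scores, target_fpr=0.01):
    """Return a threshold such that at most target_fpr of the baseline scores exceed it."""
    ranked = sorted(baseline_scores, reverse=True)
    # Number of baseline texts we are allowed to flag wrongly.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    # Everything strictly above this score gets flagged.
    return ranked[min(allowed, len(ranked) - 1)]

def flag_rate(scores, threshold):
    """Fraction of texts whose score lies strictly above the threshold."""
    return sum(score > threshold for score in scores) / len(scores)

# Synthetic stand-in for detector scores on the pre-GPT-3.5 baseline.
baseline = [i / 1000 for i in range(1000)]
threshold = calibrate_threshold(baseline, target_fpr=0.01)

# Synthetic stand-in for newly created articles, with some high-scoring texts.
new_articles = baseline[:950] + [0.995] * 50

print("baseline flag rate:", flag_rate(baseline, threshold))
print("new-article flag rate:", flag_rate(new_articles, threshold))
```

Note that the 1% guarantee holds only for the baseline pool itself. If the new articles are preprocessed differently, as with the wikitext versus HTML discrepancy described above, nothing in this procedure ensures that the false positive rate carries over.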

More importantly, it seems rather adventurous to assume that the articles from that March 2022 dataset are comparable in all relevant properties to the newly created articles from August 2024 (i.e. that the 1% false positive calibration on the former will mean a 1% false positive rate on the latter). The authors are aware of this problem, but only make a very perfunctory attempt to address it:

"One concern is that pre-March 2022 pages may be more polished due to years of editing. However, we observe that a higher number of edits weakly correlates with a higher AI-detection score for pre-March 2022 articles (Appendix D), suggesting that the FPRs for those articles may even be inflated. While the base assumption cannot be watertight, we observe a relatively consistent distribution of page categories between the two data pools, and we rely on the consistency of our chosen tools’ reported FPRs."

For many people, the fact that additional edits by human Wikipedia editors make the AI detection score go up in both detectors might increase skepticism about their overall validity. But the authors take it as an argument to strengthen their paper's overall "Rise of AI-Generated Content" claim, by alluding to the possibility that its estimate might be too low. At various other points of the paper the authors likewise express awareness that their measurement method is subject to substantial errors and uncertainties, but claim that these can only go in their favor (i.e. could only mean that the actual rate of AI-generated articles is higher than their estimate). And there are other issues that likewise make one wonder a bit about the stringency of the peer review process that the paper has undergone, for example its claim that "The Wikipedia data we collect is under a Creative Commons CC0 License."

The study has only been published as an arXiv preprint at the time of writing. But according to a remark in the accompanying code, it has been accepted at the "NLP for Wikipedia Workshop" at next month's EMNLP conference.

The authors have commendably published code used for the paper (although not under an open source license). Unfortunately though for readers who might want to replicate part of the paper's quantitative or qualitative analysis (or check whether some of the AI-generated articles it detected have slipped through Wikipedia's New pages patrol), none of the data underlying the paper's results is being published (even though it is based entirely on public information):

"Detecting AI may have unexpected negative consequences for people implicated as having generated that text. We have therefore been encouraged to omit any identifying information in the specific pages we discuss; however, we will provide more specific data to researchers upon request provided that it not be disseminated further."

But these concerns did not stop the authors from discussing the aforementioned concrete examples in a way that makes it very easy to identify involved users (as one reader of the paper already did, pointing out the specific longstanding sockpuppeting case that the editor featured in figure 2 was involved in).

Briefly


Other recent publications


Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Wikipedia’s socio-technical vision is over-determined by consensus" and "Wikipedia should strengthen its democratic commitment by engaging with dissensus"


From the abstract:[2]

"Wikipedia is composed from consensus. [...] While it is often positioned as a self-evident good, its usage on Wikipedia is not without concern. In this paper I mobilize Chantal Mouffe’s (2000) feminist critical political theory and Johanna Drucker’s (2014) methods of interface analysis to raise important questions about the relationship between consensus and peer production. [...] I identify the multitude of ways that Wikipedians perform consensus: not only through understanding and decision-making, but also through acts of composing, showing, processing, closing, and calculating. However, because Wikipedia’s socio-technical vision is over-determined by consensus, its political design is ill-equipped to address the political conditions of pluralist societies. As a result, I identify the reasons why Wikipedia should strengthen its democratic commitment by engaging with dissensus. By conducting this research, I demonstrate how consensus has transitioned from a democratic ideal into an interface and why it should be re-imagined within peer production projects."

2019 integration of Google Translate made Wikipedia editors more productive


From the abstract:[3]

"This study examines the impact of integrating Google Translate into Wikipedia's Content Translation system in January 2019. Employing a natural experiment design and difference-in-differences strategy, we analyze how this translation technology shock influenced the dynamics of content production and accessibility on Wikipedia across over a hundred languages. We find that this technology integration lead to a 149% increase in content production through translation, driven by existing editors become more productive as well as an expansion of the editor base. Moreover, we observe that machine translation enhances the propagation of biographical and geographical information, helping to close these knowledge gaps in the multilingual context."

See also mw:Wikimedia_Research/Showcase#July_2024


"Historical Narratives in Different Language Versions of Wikipedia"


From the abstract:[4]

"The article compares selected entries on Wikipedia concerning significant historical events in three language versions: Belarusian, Lithuanian, and Polish. [...] I apply the method of ideological critique to investigate whether national values influence the objectivity of Wikipedia articles written in local languages. A comparison of multilingual Wikipedia entries reveals the prevalence of “local” points of view on controversial historical events."


"Community Vital Signs: Measuring Wikipedia Communities’ Sustainable Growth and Renewal"


From the abstract:[5]

"After 2007, researchers started to observe that the number of active editors for the largest Wikipedias declined after rapid initial growth. Years after those announcements, researchers and community activists still need to understand how to measure community health. In this paper, we study patterns of growth, decline and stagnation, and we propose the creation of 6 sets of language-independent indicators that we call “Vital Signs” [formerly available at https://vitalsigns.wmcloud.org/ ]. Three focus on the general population of active editors creating content: retention, stability, and balance; the other three are related to specific community functions: specialists, administrators, and global community participation. [...] We present our analysis for eight Wikipedia language editions, and we show that communities are renewing their productive force even with stagnating absolute numbers; we observe a general lack of renewal in positions related to special functions or administratorship."

See also:


References

  1. ^ Brooks, Creston; Eggert, Samuel; Peskoff, Denis (2024-10-10), The Rise of AI-Generated Content in Wikipedia, arXiv, doi:10.48550/arXiv.2410.08044 (accepted at the "NLP for Wikipedia Workshop" at EMNLP 2024) / code
  2. ^ Jankowski, S. (February 2022). "Making Consensus Sensible: The Transition of a Democratic Ideal into Wikipedia's Interface". Journal of Peer Production. 15. Peer reviews
  3. ^ Zhu, Kai; Walker, Dylan (2024-01-28), The Promise and Pitfalls of AI Technology in Bridging Digital Language Divide: Insights from Machine Translation on Wikipedia, Rochester, NY, doi:10.2139/ssrn.4708614
  4. ^ Kubś, Jakub (2021). "Historical Narratives in Different Language Versions of Wikipedia". Academic Journal of Modern Philology (12): 83–94. doi:10.34616/ajmp.2021.12. ISSN 2299-7164.
  5. ^ Miquel-Ribé, Marc; Consonni, Cristian; Laniado, David (January 2022). "Community Vital Signs: Measuring Wikipedia Communities' Sustainable Growth and Renewal". Sustainability. 14 (8): 4705. doi:10.3390/su14084705. ISSN 2071-1050. / Code, data
Supplementary references and notes:
  1. ^ Hans, Abhimanyu; Schwarzschild, Avi; Cherepanova, Valeriia; Kazemi, Hamid; Saha, Aniruddha; Goldblum, Micah; Geiping, Jonas; Goldstein, Tom (2024-10-13), Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text, doi:10.48550/arXiv.2401.12070
  2. ^ Wang, Yuxia; Mansurov, Jonibek; Ivanov, Petar; Su, Jinyan; Shelmanov, Artem; Tsvigun, Akim; Whitehouse, Chenxi; Afzal, Osama Mohammed; Mahmoud, Tarek (2024-03-09), M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection, doi:10.48550/arXiv.2305.14902






       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0