The Signpost

Opinion

Crawlers, hogs and gorillas

By Smallbones
This article gives the opinions of the author which do not necessarily reflect those of The Signpost or its staff, any other Wikipedia editors, or of the Wikimedia Foundation. See related articles in this issue at Op-ed, News from Diff, and News and notes.

Hogs at the trough

Have you noticed that Wikipedia pages have been loading more slowly over the last year? The Wikimedia Foundation has. See the Op-ed in this issue. Wikipedia is one of the largest sources of training data for most, perhaps all, of the large language models behind AI products, and it appears that the slowdown has been caused by the AI firms' bots scraping Wikimedia content. The "amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs", according to the WMF.

Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge. That's what we're doing.
— Jimmy Wales, 2004

We need a place on the web where ordinary people, every single one of us, are given priority over mere corporations. That place is Wikipedia.

While limiting immediate access by for-profit corporations to Wikimedia data may seem harsh, it is needed to ensure that human readers and editors have access. It’s already being done: "Site Reliability Engineers have had to enforce on a case-by-case basis rate limiting or banning of crawlers repeatedly to protect our infrastructure."

This is consistent with the WMF's Terms of Use, which prohibit:

  • Engaging in automated uses of the Project Websites that are abusive or disruptive of the services, violate acceptable usage policies where available, or have not been approved by the Wikimedia community;
  • Disrupting the services by placing an undue burden on an API, Project Website or the networks or servers connected with a particular Project Website...


— From Terms of Use - under Refraining from Certain Activities

Given the drastic change in corporate demand for our compendium of knowledge, it might be worthwhile for the WMF to modify the Terms of Use to further stress the priority of people over corporations: requiring registration of corporate bots, cutting off bots that impose costs on Wikipedia operations, and imposing adequate fines on their operators. While they are at it, the WMF should impose fines when AI firms ignore the attribution requirement when using CC-BY-SA-licensed material scraped from WMF servers.

There is an alternative for the corporations to obtain quick access to Wikipedia’s data while paying their fair share of the cost, WMF's own for-profit corporation Wikimedia Enterprise. The total cost to the corporations may even be lower if they deal with a single source of Wikimedia’s data that is designed to serve multiple high-demand users, rather than each of them having their own bots jostling and elbowing everybody else out of the way for immediate access.

The WMF and the editing community have both done our parts by creating a compendium of knowledge the likes of which the world has never seen, by properly storing the data and giving it away free in an orderly manner. Now the corporations are all fighting for immediate access so that they can repackage the knowledge and then sell a dumbed-down version. The least they can do is pay the costs that they are imposing on others and not act like hogs at the trough.

The gorillas in the room

There is another problem that was discussed on Diff and is reprinted in this month's Signpost in News from Diff. But this issue needs to be handled more gently than the case of the AI bots.

The WMF wants to improve the consistency of the understanding and enforcement of our policy on Neutral point of view (NPOV) across all language versions. Traditionally, each language version has had great latitude in defining and enforcing its own rules in order to avoid the cultural imperialism that might result from applying concepts of bias from English-speaking countries to encyclopedias written, for example, in Croatian, Russian, Hebrew, Arabic or Chinese.

The very serious problems caused by inconsistent understandings of NPOV include a decade-long effort to stop local admins from imposing a strong nationalist point of view on the Croatian Wikipedia. See these previous Signpost articles.

Eastern European-Russian conflicts also have a long history (2007–2024) of failure to arrive at an NPOV on a range of articles, covered in multiple Arbcom cases with dozens of editor sanctions. Perhaps the most recent cause of the WMF's concern over consistently applying NPOV standards across Wiki projects is the Gaza war and the campaign in the press by the Anti-Defamation League, the Heritage Foundation and others questioning Wikipedia’s credibility. Or perhaps the cause is the split between the Republican and Democratic parties in the US in their recognition of simple facts.

Whatever the cause, we should not let our community governance model become completely distorted by the failure to resolve a half-dozen or dozen serious cases. Nevertheless, an attempt to adapt the current model to try to resolve some of these serious cases within a shorter time should be welcomed.

One adaptation might be to include academics, as was done in the Croatian case, to determine what sources can be considered reliable in each case.

It won't be easy, and it won't be cheap, to get experts to come in and help decide our most intractable cases, but it may be worthwhile.



Discuss this story


This op-ed (shoehorned in after the Signpost's publication deadline) shows why we need better fact-checking of opinion pieces. E.g. regarding

There is an alternative for the corporations to obtain quick access to Wikipedia’s data while paying their fair share of the cost, WMF's own for-profit corporation Wikimedia Enterprise.

You would think so, but it is actually not true for the AI scraping requests that the WMF called out as particularly problematic in its Diff post (which conspicuously failed to mention Enterprise). See my notes here. Regards, HaeB (talk) 19:27, 9 April 2025 (UTC)

And see my response there. Smallbones(smalltalk) 21:49, 9 April 2025 (UTC)



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0