The Signpost

[Image: A millipede insect.jpg – Treysam, CC-BY-SA 4.0]
Op-ed

How crawlers impact the operations of the Wikimedia projects

By Birgit Mueller, Chris Danis, and Giuseppe Lavagetto
This article was originally published at the Wikimedia Foundation's Diff blog on April 1, 2025. It is licensed CC-BY-SA 4.0. See related articles in this issue: In focus, News from Diff, and News and notes. Birgit Mueller, Chris Danis, and Giuseppe Lavagetto are with the Wikimedia Foundation.

Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we'll discuss the reasons for this trend and its impact.

The Wikimedia projects are the largest collection of open knowledge in the world. Our sites are an invaluable destination for humans searching for information, and for all kinds of businesses that access our content automatically as a core input to their products. Most notably, the content has been a critical component of search engine results, which in turn has brought users back to our sites. But with the rise of AI, the dynamic is changing: We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to driving new users to participate in the movement, and it is placing a significant load on the underlying infrastructure that keeps our sites available for everyone.

A view behind the scenes: The Jimmy Carter case

When Jimmy Carter died in December 2024, his page on English Wikipedia saw more than 2.8 million views over the course of a day. This was relatively high, but manageable. At the same time, quite a few users played a 1.5-hour-long video of Carter's 1980 presidential debate with Ronald Reagan. This caused a surge in network traffic, doubling its normal rate. As a consequence, for about an hour a small number of Wikimedia's connections to the Internet filled up entirely, causing slow page load times for some users. The sudden traffic surge alerted our Site Reliability team, who were able to swiftly address it by changing the paths our internet connections go through to reduce the congestion. But still, this should not have caused any issues, as the Foundation is well equipped to handle high traffic spikes during exceptional events. So what happened?
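To see why long videos stress network links in a way that ordinary article pageviews do not, here is a minimal back-of-the-envelope sketch. Every figure in it (video bitrate, concurrent viewers, link capacity) is an illustrative assumption, not a number from the incident:

```python
# Back-of-the-envelope sketch of how concurrent video playback adds up.
# Every figure below is an illustrative assumption, not a number from the incident.

video_bitrate_mbps = 3.0       # assumed average streaming bitrate of the debate video
concurrent_viewers = 20_000    # assumed number of users playing it at the same time
link_capacity_gbps = 10.0      # assumed capacity of a single transit link

extra_traffic_gbps = video_bitrate_mbps * concurrent_viewers / 1_000
print(f"Extra video traffic: ~{extra_traffic_gbps:.0f} Gbps")
print(f"{link_capacity_gbps:.0f} Gbps links filled by the video alone: "
      f"~{extra_traffic_gbps / link_capacity_gbps:.0f}")
```

Under assumptions like these, video playback alone can add tens of gigabits per second of sustained traffic, enough to fill individual transit links even while the rest of the site behaves normally.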

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.

The graph below shows that the base bandwidth demand for multimedia content has been growing steadily since early 2024 – and there's no sign of this slowing down. This increase in baseline usage means that we have less headroom to absorb traffic surges during exceptional events: a significant amount of our time and resources goes into responding to non-human traffic.

Multimedia bandwidth demand for the Wikimedia Projects. Credit Chris Danis.

65% of our most expensive traffic comes from bots

The Wikimedia Foundation serves content to its users through a global network of datacenters. This enables us to provide a faster, more seamless experience for readers around the world. When an article is requested multiple times, we memorize – or cache – its content in the datacenter closest to the user. If an article hasn't been requested in a while, its content needs to be served from the core datacenter. The request then "travels" all the way from the user's location to the core datacenter, where the requested page is looked up and served back to the user, while also being cached in the regional datacenter for any subsequent user.
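As a rough sketch of the pattern described here (the class and function names are illustrative, not Wikimedia's actual software), an edge datacenter answers from its cache when it can and only falls back to the core datacenter on a miss:

```python
# Minimal sketch of the caching pattern described above: an edge (regional)
# datacenter serves cached pages locally, and only on a cache miss fetches the
# page from the core datacenter, caching the result for subsequent readers.
# Names and structure are illustrative only.

class EdgeDatacenter:
    def __init__(self, fetch_from_core):
        self.cache = {}                       # url -> cached response
        self.fetch_from_core = fetch_from_core

    def get(self, url):
        if url in self.cache:                 # cache hit: cheap, served close to the reader
            return self.cache[url]
        response = self.fetch_from_core(url)  # cache miss: expensive round trip to the core DC
        self.cache[url] = response            # cache for the next reader in this region
        return response

# Usage sketch:
edge = EdgeDatacenter(lambda url: f"<html>content of {url}</html>")
edge.get("/wiki/Jimmy_Carter")                # miss: travels to the core datacenter
edge.get("/wiki/Jimmy_Carter")                # hit: served from the regional cache
```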

While human readers tend to focus on specific – often similar – topics, crawler bots tend to "bulk read" large numbers of pages and also visit the less popular ones. This means their requests are more likely to be forwarded to the core datacenter, which makes them much more expensive in terms of resource consumption.
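A toy simulation makes the difference concrete. Modelling human traffic as repeated requests concentrated on popular pages (a Zipf-like distribution, assumed here purely for illustration) and crawler traffic as a single pass over every page, the cache hit rate collapses for the crawler:

```python
# Toy simulation of why bulk crawling is disproportionately expensive. Human
# traffic is modelled as repeated requests for popular pages; crawler traffic
# is one request per page, including the long tail, so almost nothing it asks
# for is already cached.
import random

PAGES = 100_000
cache = set()

def hit_rate(requests):
    hits = 0
    for page in requests:
        if page in cache:
            hits += 1
        else:
            cache.add(page)   # first request for a page is an expensive core fetch
    return hits / len(requests)

random.seed(0)
human_requests = random.choices(
    range(PAGES), weights=[1 / (rank + 1) for rank in range(PAGES)], k=50_000)
crawler_requests = list(range(PAGES))   # one request per page, popular or not

cache.clear(); print(f"human-like hit rate:   {hit_rate(human_requests):.0%}")
cache.clear(); print(f"crawler-like hit rate: {hit_rate(crawler_requests):.0%}")
```

In a sketch like this the crawler's hit rate is essentially zero, because each page is fetched exactly once, so nearly every request has to travel all the way to the core datacenter.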

While undergoing a migration of our systems, we noticed that only a fraction of the expensive traffic hitting our core datacenters was behaving the way web browsers usually do, i.e. interpreting JavaScript code. When we took a closer look, we found that at least 65% of this resource-consuming traffic to the website comes from bots – a disproportionate amount, given that bots account for only about 35% of overall pageviews. This high usage is also causing constant disruption for our Site Reliability team, which has to block overwhelming traffic from such crawlers before it causes issues for our readers.

Wikimedia is not alone with this challenge. As noted in our 2025 global trends report, technology companies are racing to scrape websites for human-created and verified information. Content publishers, open source projects, and websites of all kinds report similar issues. Moreover, crawlers tend to access any URL. Within the Wikimedia infrastructure, we are observing scraping not only of the Wikimedia projects, but also of key systems in our developer infrastructure, such as our code review platform or our bug tracker. All of that consumes time and resources that we need to support the Wikimedia projects, contributors, and readers.

Our content is free, our infrastructure is not: Establishing responsible use of infrastructure

Delivering trustworthy content also means supporting a "knowledge as a service" model, where we acknowledge that the whole internet draws on Wikimedia content. But this has to happen in ways that are sustainable for us: How can we continue to enable our community, while also putting boundaries around automatic content consumption? How might we funnel developers and reusers into preferred, supported channels of access? What guidance do we need to incentivize responsible content reuse?

We have started to work towards addressing these questions systemically, and have set a major focus on establishing sustainable ways for developers and reusers to access knowledge content in the Foundation's upcoming fiscal year. You can read more in our draft annual plan: WE5: Responsible Use of Infrastructure. Our content is free, our infrastructure is not: We need to act now to re-establish a healthy balance, so we can dedicate our engineering resources to supporting and prioritizing the Wikimedia projects, our contributors and human access to knowledge.


Discuss this story

These comments are automatically transcluded from this article's talk page. To follow comments, add the page to your watchlist. If your comment has not appeared here, you can try purging the cache.

Misbehaving scraper bots are definitely a problem, and I have a lot of respect for the hard work that the post's authors and the rest of Wikimedia Foundation's SRE team do to keep the sites up. But reading between the lines, this Diff post points to some of the Wikimedia Foundation's own failings which may well also be a major cause of these current problems, alongside irresponsible scraping behavior. In more detail (recapping various points from a discussion about this post last week in the Wikipedia Weekly Facebook group, where various WMF staff already weighed in):

1. Is there a legitimate alternative?

The Foundation had already highlighted this excessive AI-related scraping traffic several months ago (see the draft annual plan section linked at the end of the post – like probably many Wikimedians, I had read it there before and didn't have second thoughts about it). However, in this post we now learn that it is especially driven by demand for the 144 million images, videos, and other files on Wikimedia Commons. What's interesting about this: For Wikipedia text (i.e. the content that has long been used to train non-multimodal LLMs like the original ChatGPT, Claude, Llama etc.), the "good citizen" advice has always been to download it in the form of the WMF's dumps, which puts much less strain on the infrastructure than sending millions of separate requests for individual web pages or hammering the APIs. But for those Commons media files which are apparently so in demand now, there has been no publicly available dump for over a decade. In other words, the "good citizen" method is not available for those who want to download a large dataset of current Commons media files.

This might also explain another aspect of this WMF blog post that is rather puzzling: It does not mention Wikimedia Enterprise (the paid API access offered by the Wikimedia Foundation's for-profit subsidiary Wikimedia LLC) at all. Normally, this would seem to be the perfect opportunity to advertise Enterprise to AI companies who inconsiderately overload our infrastructure without giving back, and tell them to switch to the paid service instead. (Indeed, that's exactly what WMF/WM LLC representatives have done on previous occasions when publicly discussing the use and overuse of Wikipedia by AI companies, see e.g. our earlier Signpost coverage here.) – Not, however, if Wikimedia Enterprise has so far failed to address this apparently huge demand for Commons media files and neglected to build a paid API product for it. (Indeed, I don't see such an offering on enterprise.wikimedia.com.)

2. If no approved alternative exists, and the Wikimedia Foundation works to disable existing methods of mass-downloading content, then it has effectively abandoned the right to fork.

The "Right To Fork" has been an important aspect of wikis since before Wikipedia was founded. It is part of the Wikimedia Foundation Guiding Principles:

we support the right of third parties to make and maintain licensing-compliant copies and forks of Wikimedia content and Wikimedia-developed code, regardless of motivation or purpose. While we are generally not able to individually assist such efforts, we enable them by making available copies of Wikimedia content in bulk, and avoiding critical dependencies on proprietary code or services for maintaining a largely functionally equivalent fork.

Note the regardless part - there is no exception like "unless it's for commercial gain" or "unless it's for AI training purposes" or "unless it might hurt the financial sustainability of the WMF or might reduce active editor numbers on wikipedia.org".

This "right to fork" is not exercised frequently, but it is very important as a governance safeguard. It enables anyone to launch a complete copy of (say) Wikipedia and/or Commons on a different website, if they think the WMF has turned evil or is being taken over by the US government via executive order. For example, the community's ability to fork Wikipedia is very likely a major reason why Wikipedia does not contain ads today, see Signpost coverage: "Concerns about ads, US bias and Larry Sanger caused the 2002 Spanish fork". The WMF itself also mounted an aggressive legal defense of the right to fork Wikitravel and bring its entire content from a commercial host to Wikivoyage, see Signpost coverage: "Wikimedia Foundation declares 'victory' in Wikivoyage lawsuit".

And the WMF's failure to mak[e] available copies of Wikimedia content in bulk in the case of Commons images has long been called out, see e.g. c:Commons:Requests_for_comment/Technical_needs_survey/Media_dumps and phab:T298394. (In fact, there didn't even exist an internal backup of Commons media until a 2016 Community Wishlist request was addressed years later.)

Of course (as was also pointed out by WMF staff in the aforementioned Facebook discussion last week), addressing this longstanding missing-dumps issue requires some work (and addressing some organizational dysfunction). But the proposed annual plan focus area already requests a substantial amount of resources for working on adversarial solutions (better tracking of downloaders for enforcement purposes, etc.). This may be understandable from the SRE team's narrow perspective. However, overall, such an all-stick-no-carrot approach is inconsistent with our mission, and by the way also fails to address this question asked in the draft annual plan section itself: How might we funnel users into preferred, supported channels?

3. Developments after the Diff post

As mentioned, in last week's Facebook discussion some WMF staff already responded to some of these concerns. (By the way, I'm not sure that the WMF would publish this Diff post in the exact form again today, and I'd like to note that although I'm a member of the Signpost team, I was not involved in the decision to republish it as part of this Signpost issue.) For example, WMF's Jonathan Tweed stated:

Hi everyone, I am a Product Manager working on WE5 at the Wikimedia Foundation, primarily on the attribution work that’s described in WE5.1. [...] We are looking into providing responsible ways for people to obtain images from Commons. Whether this is through creating new dumps, which is a non-trivial amount of work, or rate limited access is still under discussion, but at no point is our intention to block all downloading of images.

I look forward to diving deeper on these questions with the technical community over the next few months, including at the Hackathon in May.

Also, yesterday Giuseppe (one of the Diff post's authors) announced updates to the WMF's "Robot policy". Among other changes, the policy now explicitly recommends considering dumps (as the WMF has done elsewhere before):

Check if you could use our dumps or other forms of offline collection of our data instead of making live requests. If that’s a viable option for your use case, it will reduce the strain on our very limited resources and make your life easier.

Regarding media files (from upload.wikimedia.org), it asks downloaders to "Always keep a total concurrency of at most 2, and limit your total download speed to 25 Mbps (as measured over 10 second intervals)." It might be interesting to calculate what this means for exercising the right to fork in practical (duration) terms.
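For a rough sense of scale, here is a minimal calculation of how long a single downloader capped at 25 Mbps would need. The post does not state how large the Commons media corpus is in bytes, so the archive sizes below are hypothetical placeholders that only illustrate the arithmetic:

```python
# Rough arithmetic for the 25 Mbps cap from the updated robot policy. The policy
# supplies the rate limit; the total size of Commons media is not stated in the
# post, so the archive sizes below are hypothetical, not official figures.

RATE_MBPS = 25                  # capped total download speed from the policy
SECONDS_PER_DAY = 86_400

def days_to_download(terabytes):
    megabits = terabytes * 8_000_000        # 1 TB = 8,000,000 megabits (decimal units)
    return megabits / RATE_MBPS / SECONDS_PER_DAY

for tb in (100, 300, 500):                  # hypothetical total archive sizes
    print(f"{tb} TB at {RATE_MBPS} Mbps: ~{days_to_download(tb):.0f} days "
          f"(~{days_to_download(tb) / 365:.1f} years)")
```

Under these assumptions, every 100 TB takes roughly a year at the capped rate, so a full media fork measured in hundreds of terabytes would be a multi-year undertaking for a single downloader.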

Regards, HaeB (talk) 19:23, 9 April 2025 (UTC)[reply]

PS (to add another vignette illustrating the organizational dysfunction described under 2. above, where important work falls through the cracks between different WMF teams' turfs):
It turns out that someone from the WMF Research department foresaw this need back in 2023 already and was

working on releasing more datasets that can help AI practitioners to work on models that are relevant to Wikimedia's needs. Two of those datasets that are particularly exciting deal with image data, however, which is a major challenge in our public data sharing infrastructure.

In April 2024, it was decided to

not prioritize this task [...] Currently, hosting large dataset has been a challenge in the foundation, and this task has highlighted the needs of being able to do so. Given this functionality is the prerequisite of completing this task, and the resource/effort caused by this overhead, [a Principal Software Engineer from the WMF's Data Platform Engineering team] will be helping us with this regard to with the goal of helping researchers to access dumps, as well as potentially helping other parts of the organization/Enterprise. ETA 12-18 month.

But half a year later, this task was made dependent on the WMF first producing an AI strategy for 2026-28, a separate task that is currently marked as due on Feb 5, 2025 but still open (and it wouldn't be surprising to see it take another year or so to complete; the same task had previously already been due on Sep 29, 2023 and then on Apr 30, 2024).
Regards, HaeB (talk) 21:00, 9 April 2025 (UTC)[reply]
@HaeB: Thanks for providing this info. There is obviously a lot of it, most being from sources I'm not familiar with. It's the usual problem of Wikipedia being so large and conversations being so spread out. I think I do a good job keeping up with the press views about en.Wiki and much of the usual places for discussion on en.Wiki, but phabricator and facebook are not in my usual rounds. If I had that info I probably wouldn't have submitted this for republication. Your comments do bring out that a major part of the problem is that Wikipedia Enterprise and the WMF haven't offered an alternative to the current inefficient method being used. If that is the ultimate takeaway from this, so be it. I'll have some simpler questions that you may have info on that will get me (and others like me) a bit more up to speed, in maybe 30 minutes. Smallbones(smalltalk) 21:44, 9 April 2025 (UTC)[reply]
@HaeB: Sorry it took so long to get back. 1st basic question. The Commons data is so large, having been built up over 20 or so years, that percent turnover can't be very big at all. So are the same bots returning day after day? Why? Will they someday get nearly "filled up" and stop? Or is this rush going to just keep on going? A second question is about non-photo data and the low quality of some of the photos. If the bots are just looking for photos for AI are they scraping everything? Or can they choose what they want to scrape? Finally, if a major part of the problem is within the WMF, do you have any suggestions on how to fix that problem? Smallbones(smalltalk) 01:56, 10 April 2025 (UTC)[reply]

WE5.1 and WE5.2 there sound a lot like "the latest generation of clueless managers think API keys and replacing the Action API would work or be a good idea". 🙄 They never seem to listen when people who've been around longer point out that API keys can't work sensibly for web-based or open source applications, or that getting rid of an API that handles 10,000 requests per second (literally) and powers tons of existing tools isn't a very feasible plan. Anomie 23:09, 9 April 2025 (UTC)[reply]

Hi @Smallbones: I am Birgit, one of the authors of the blog post and responsible for WE5. I really appreciate you submitting this for republication and giving us the opportunity to discuss it here. I hope you still see some value in that too!

I wanted to give some context on why this is an industry-wide problem, not something specific to Wikimedia Commons. Images are indeed a valuable resource for crawlers, but so is human-generated text. Scraping happens across the web, and these users are likely to prefer tools they can use for any site, rather than custom, Wikimedia-specific solutions.

Within our infrastructure, we observe scraping across wiki projects, and even on sites like Phabricator, Gitlab, Gerrit, or tools on Cloud Services. It’s the sum of bot traffic across all projects that makes up 65% of our most expensive traffic (as described in the blog post). We’re also observing scraping for content that is indeed provided through Enterprise’s services, or otherwise accessible through dumps.

I think HaeB is right that one way to address our developer and researcher communities’ needs specifically for Commons content could be to offer dumps, but that is not necessarily the case for companies across a growing industry.

This is a key reason why we need to explore different approaches and will require mechanisms to both encourage and enforce sustainable access. @Anomie: – just wanted to also clarify that this is the intent in 5.1 and 5.2, not retiring the very important Action API :-)

As Kerry mentions, we’re concerned about the bots which are putting a high demand on the infrastructure and causing traffic-related incidents that we have to deal with. We acknowledge that some users may require higher limits and will allow exceptions or refer these users to Enterprise as appropriate.

The intent of the Diff post was to make the problem clear, not provide all the answers. We’re hoping for input and support from the technical community as we learn and share more about this work over the coming months (for example at the Hackathon in early May). -BMueller (WMF) (talk) 19:36, 10 April 2025 (UTC)[reply]

Thank you for that clarification. There have been too many people in the past who hadn't seemed to think beyond "it's not REST, so it's bad". 😀 Anomie 22:01, 10 April 2025 (UTC)[reply]




The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0