Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we'll discuss the reasons for this trend and its impact.
The Wikimedia projects are the largest collection of open knowledge in the world. Our sites are an invaluable destination for humans searching for information, and for all kinds of businesses that access our content automatically as a core input to their products. Most notably, the content has been a critical component of search engine results, which in turn has brought users back to our sites. But with the rise of AI, the dynamic is changing: We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to driving new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.
When Jimmy Carter died in December 2024, his page on English Wikipedia saw more than 2.8 million views over the course of a day. This was relatively high, but manageable. At the same time, quite a few users played a 1.5-hour-long video of Carter's 1980 presidential debate with Ronald Reagan. This caused a surge in network traffic, doubling its normal rate. As a consequence, for about an hour a small number of Wikimedia's connections to the Internet filled up entirely, causing slow page load times for some users. The sudden traffic surge alerted our Site Reliability team, who were swiftly able to address it by changing the paths our internet connections go through to reduce the congestion. Still, this should not have caused any issues, as the Foundation is well equipped to handle high traffic spikes during exceptional events. So what happened?
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape Wikimedia Commons' catalog of openly licensed images to feed AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
The graph below shows that the base bandwidth demand for multimedia content has been growing steadily since early 2024 – and there's no sign of this slowing down. This increase in baseline usage means we have less headroom to accommodate exceptional events when a traffic surge occurs: a significant amount of our time and resources goes into responding to non-human traffic.
The Wikimedia Foundation serves content to its users through a global network of datacenters. This enables us to provide a faster, more seamless experience for readers around the world. When an article is requested multiple times, we memorize – or cache – its content in the datacenter closest to the user. If an article hasn't been requested in a while, its content needs to be served from the core datacenter: the request then "travels" all the way from the user's location to the core datacenter, where the requested page is looked up and served back to the user, while also being cached in the regional datacenter for any subsequent user.
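To make that caching pattern concrete, here is a minimal, hypothetical sketch in Python. It is not the Foundation's actual CDN code; the class and function names are invented for illustration only.

```python
# Hypothetical sketch of the edge-cache pattern described above;
# not the Wikimedia Foundation's actual CDN code.

class EdgeCache:
    """A regional datacenter cache sitting in front of a core datacenter."""

    def __init__(self, fetch_from_core):
        self.store = {}                          # cached content, keyed by URL
        self.fetch_from_core = fetch_from_core   # expensive: full trip to the core DC

    def get(self, url):
        if url in self.store:                    # cache hit: served locally, cheap
            return self.store[url]
        content = self.fetch_from_core(url)      # cache miss: request "travels" to the core
        self.store[url] = content                # keep it for subsequent readers
        return content


def fetch_from_core(url):
    # Stand-in for the expensive lookup in the core datacenter.
    return f"<html>content of {url}</html>"


edge = EdgeCache(fetch_from_core)
edge.get("/wiki/Jimmy_Carter")   # miss: goes all the way to the core datacenter
edge.get("/wiki/Jimmy_Carter")   # hit: served from the regional cache
```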
While human readers tend to focus on specific – often similar – topics, crawler bots tend to "bulk read" large numbers of pages and also visit the less popular ones. This means their requests are more likely to be forwarded to the core datacenter, which makes them much more expensive in terms of resource consumption.
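For intuition, here is an illustrative simulation of why concentrated human reading produces far more cache hits than a crawler that reads uniformly across the whole catalog. The catalog size, cache size, and access distributions are made-up assumptions, not Wikimedia measurements.

```python
# Illustrative only: numbers and distributions are invented assumptions.
import random

PAGES = list(range(1_000_000))      # pretend catalog of pages
CACHE = set(PAGES[:50_000])         # assume the cache holds the 5% most popular pages,
                                    # because that is what humans keep requesting

def hit_rate(requests):
    hits = sum(1 for page in requests if page in CACHE)
    return hits / len(requests)

# Human readers concentrate on a small set of popular topics...
human_reads = [random.choice(PAGES[:50_000]) for _ in range(10_000)]
# ...while a crawler "bulk reads" uniformly across the whole catalog.
crawler_reads = [random.choice(PAGES) for _ in range(10_000)]

print(f"human-like hit rate:   {hit_rate(human_reads):.0%}")    # ~100%: mostly cache hits
print(f"crawler-like hit rate: {hit_rate(crawler_reads):.0%}")  # ~5%: mostly trips to the core DC
```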
While undergoing a migration of our systems, we noticed that only a fraction of the expensive traffic hitting our core datacenters was behaving the way web browsers usually do, interpreting JavaScript code. When we took a closer look, we found that at least 65% of this resource-consuming traffic to the website is coming from bots, a disproportionate share given that bots account for only about 35% of overall pageviews. This high usage is also causing constant disruption for our Site Reliability team, which has to block overwhelming traffic from such crawlers before it causes issues for our readers.
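As a rough illustration of the kind of signal described here (a client that never fetches resources only JavaScript would request is unlikely to be a real browser), here is a hypothetical sketch. The log fields and beacon path are invented for the example; this is not how Wikimedia's traffic analysis is actually implemented.

```python
# Hypothetical sketch: flag sessions that never requested a JavaScript-only
# "beacon" resource as likely non-browser traffic. Field names and the beacon
# path are invented; not Wikimedia's actual analysis pipeline.

BEACON_PATH = "/beacon/impression"   # assumed to be requested only by executed JS

def likely_bot_sessions(log_entries):
    """Group request logs by session and flag sessions that never hit the beacon."""
    sessions = {}
    for entry in log_entries:
        sessions.setdefault(entry["session_id"], []).append(entry["path"])
    return {sid for sid, paths in sessions.items() if BEACON_PATH not in paths}

logs = [
    {"session_id": "a", "path": "/wiki/Jimmy_Carter"},
    {"session_id": "a", "path": BEACON_PATH},              # browser executed JavaScript
    {"session_id": "b", "path": "/wiki/Obscure_page_1"},
    {"session_id": "b", "path": "/wiki/Obscure_page_2"},   # never fetched the beacon
]
print(likely_bot_sessions(logs))   # {'b'}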
Wikimedia is not alone with this challenge. As noted in our 2025 global trends report, technology companies are racing to scrape websites for human-created and verified information. Content publishers, open source projects, and websites of all kinds report similar issues. Moreover, crawlers tend to access any URL. Within the Wikimedia infrastructure, we are observing scraping not only of the Wikimedia projects, but also of key systems in our developer infrastructure, such as our code review platform or our bug tracker. All of that consumes time and resources that we need to support the Wikimedia projects, contributors, and readers.
Delivering trustworthy content also means supporting a "knowledge as a service" model, where we acknowledge that the whole internet draws on Wikimedia content. But this has to happen in ways that are sustainable for us: How can we continue to enable our community, while also putting boundaries around automatic content consumption? How might we funnel developers and reusers into preferred, supported channels of access? What guidance do we need to incentivize responsible content reuse?
We have started to work towards addressing these questions systemically, and have set a major focus on establishing sustainable ways for developers and reusers to access knowledge content in the Foundation's upcoming fiscal year. You can read more in our draft annual plan: WE5: Responsible Use of Infrastructure. Our content is free, our infrastructure is not: We need to act now to re-establish a healthy balance, so we can dedicate our engineering resources to supporting and prioritizing the Wikimedia projects, our contributors and human access to knowledge.
Discuss this story
Misbehaving scraper bots are definitely a problem, and I have a lot of respect for the hard work that the post's authors and the rest of Wikimedia Foundation's SRE team do to keep the sites up. But reading between the lines, this Diff post points to some of the Wikimedia Foundation's own failings which may well also be a major cause of these current problems, alongside irresponsible scraping behavior. In more detail (recapping various points from a discussion about this post last week in the Wikipedia Weekly Facebook group, where various WMF staff already weighed in):
The Foundation had already highlighted this excessive AI-related scraping traffic several months ago (see the draft annual plan section linked at the end of the post – like probably many Wikimedians, I had read it there before and didn't have second thoughts about it). However, in this post we now learn that it is especially driven by demand for multimedia files from Wikimedia Commons. What's interesting about this: For Wikipedia text (i.e. the content that has long been used to train non-multimodal LLMs like the original ChatGPT, Claude, Llama etc.), the "good citizen" advice has always been to download it in the form of the WMF's dumps, which puts much less strain on the infrastructure than sending millions of separate requests for individual web pages or hammering the APIs. But for those Commons media files which are apparently so in demand now, there has been no publicly available dump for over a decade. In other words, the "good citizen" method is not available for those who want to download a large dataset of current Commons media files.
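For comparison, the "good citizen" route for Wikipedia text is a single bulk download from dumps.wikimedia.org rather than millions of individual page requests. The sketch below uses the publicly documented dump layout for English Wikipedia; the User-Agent string is a placeholder, and current filenames should be checked on dumps.wikimedia.org before use.

```python
# Sketch of the "good citizen" approach for Wikipedia text: one bulk download
# of the dump file instead of millions of per-page requests. Verify the current
# filename on https://dumps.wikimedia.org/ (the file is tens of gigabytes).
import requests

DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
            "enwiki-latest-pages-articles.xml.bz2")
HEADERS = {"User-Agent": "ExampleResearchBot/0.1 (contact@example.org)"}  # placeholder contact

with requests.get(DUMP_URL, stream=True, headers=HEADERS) as resp:
    resp.raise_for_status()
    with open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):   # stream in 1 MiB chunks
            out.write(chunk)
```

No equivalent bulk file has existed for current Commons media, which is the gap being described here.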
This might also explain another aspect of this WMF blog post that is rather puzzling: It does not at all mention Wikimedia Enterprise (the paid API access offered by the Wikimedia Foundation's for-profit subsidiary Wikimedia LLC). Normally, this would seem to be the perfect opportunity to advertise Enterprise to AI companies who inconsiderately overload our infrastructure without giving back, and tell them to switch to the paid service instead. (Indeed, that's exactly what WMF/WM LLC representatives have done on previous occasions when publicly discussing the use and overuse of Wikipedia by AI companies, see e.g. our earlier Signpost coverage here.) Not, however, if Wikimedia Enterprise has so far failed to address this apparently huge demand for Commons media files and neglected to build a paid API product for it. (Indeed, I don't see such an offering on enterprise.wikimedia.com.)
The "Right To Fork" has been an important aspect of wikis since before Wikipedia was founded. It is part of the Wikimedia Foundation Guiding Principles:
Note the "regardless" part: there is no exception like "unless it's for commercial gain", "unless it's for AI training purposes", or "unless it might hurt the financial sustainability of the WMF or might reduce active editor numbers on wikipedia.org".
This "right to fork" is not exercised frequently, but it is very important as a governance safeguard. It enables anyone to launch a complete copy of (say) Wikipedia and/or Commons on a different website, if they think WMF has turned evil or is being taken over by the US government via executive order. For example, the community's ability to fork Wikipedia is very likely a major reason why Wikipedia does not contain ads today, see Signpost coverage: "Concerns about ads, US bias and Larry Sanger caused the 2002 Spanish fork". The WMF itself also mounted an aggressive legal defense of the right to fork Wikitravel and bring its entire content from a commercial host to Wikivoyage, see Signpost coverage: "Wikimedia Foundation declares 'victory" in Wikivoyage lawsuit".
And the WMF's failure to provide dumps in the case of Commons media files has long been called out, see e.g. c:Commons:Requests_for_comment/Technical_needs_survey/Media_dumps and phab:T298394. (In fact, there didn't even exist an internal backup of Commons media files until a 2016 Community wishlist request was addressed years later.)
Of course (as was also pointed out by WMF staff in the aforementioned Facebook discussion last week), addressing this longstanding missing dumps issue requires some work (and addressing some organizational dysfunction). But the proposed annual plan focus area already requests a substantial amount of resources for working on adversarial solutions (better tracking of downloaders for enforcement purposes etc.). This may be understandable from the SRE team's narrow perspective. However, overall, such an all-stick-no-carrot approach is inconsistent with our mission, and by the way also fails to address this question asked in the draft annual plan section itself:
As mentioned, in last week's Facebook discussion some WMF staff already responded to some of these concerns. (By the way, I'm not sure that the WMF would publish this Diff post in the exact form again today, and I'd like to note that although I'm a member of the Signpost team, I was not involved in the decision to republish it as part of this Signpost issue.) For example, WMF's Jonathan Tweed stated:
Also, yesterday Giuseppe (one of the Diff post's authors) announced updates to the WMF's "Robot policy". Among other changes, the policy now explicitly recommends considering dumps (as the WMF has done elsewhere before):
Regarding media files (from upload.wikimedia.org), it asks to
It might be interesting to calculate what this means for exercising the right to fork in practical (duration) terms. Regards, HaeB (talk) 19:23, 9 April 2025 (UTC)
WE5.1 and WE5.2 there sound a lot like "the latest generation of clueless managers think API keys and replacing the Action API would work or be a good idea". 🙄 They never seem to listen to people who've been around longer, who point out that API keys can't work sensibly for web-based or open source applications, and that getting rid of an API that handles 10,000 requests per second (literally) and powers tons of existing tools isn't a very feasible plan. Anomie⚔ 23:09, 9 April 2025 (UTC)
Hi @Smallbones: I am Birgit, one of the authors of the blog post and responsible for WE5. I really appreciate you submitting this for republication and giving us the opportunity to discuss it here. I hope you still see some value in that too!
I wanted to give some context on why this is an industry-wide problem, not something specific to Wikimedia Commons. Images are indeed a valuable resource for crawlers, but so is human-generated text. Scraping happens across the web, and these users are likely to prefer tools they can use for any site, rather than custom, Wikimedia-specific solutions.
Within our infrastructure, we observe scraping across wiki projects, and even on sites like Phabricator, Gitlab, Gerrit, or tools on Cloud Services. It’s the sum of bot traffic across all projects that makes up 65% of our most expensive traffic (as described in the blog post). We’re also observing scraping for content that is indeed provided through Enterprise’s services, or otherwise accessible through dumps.
I think HaeB is right that one way to address our developer and researcher communities’ needs specifically for Commons content could be to offer dumps, but that is not necessarily the case for companies across a growing industry.
This is a key reason why we need to explore different approaches and will require mechanisms to both encourage and enforce sustainable access. @Anomie: – just wanted to also clarify that this is the intent in 5.1 and 5.2, not retiring the very important Action API :-)
As Kerry mentions, we’re concerned about the bots which are putting a high demand on the infrastructure and causing traffic-related incidents that we have to deal with. We acknowledge that some users may require higher limits and will allow exceptions or refer these users to Enterprise as appropriate.
The intent of the Diff post was to make the problem clear, not provide all the answers. We’re hoping for input and support from the technical community as we learn and share more about this work over the coming months (for example at the Hackathon in early May). -BMueller (WMF) (talk) 19:36, 10 April 2025 (UTC)