It is broadly known that Wikipedia is the sixth most popular website on the Internet, and the English Wikipedia now has over 4 million articles and 29 million total pages. Much less attention, however, has been given to traffic patterns and trends in the content viewed. The Wikimedia Foundation makes aggregate raw article view data available for all of its projects.
This article attempts to convey some of the fascinating phenomena that underlie extremely popular articles and, perhaps more importantly for editors, discusses how this information can be used to improve the project moving forward. While some dismiss view spikes as the manifestation of shallow pop culture interests (e.g., Justin Bieber is the 6th most popular article over the past 3 years, see Tab. 2), these are valuable opportunities to study reader behavior and to shape the public perception of our projects.
We have begun producing two weekly charts of the most popular articles on Wikipedia: the automated WP:5000 list and the moderated WP:5000/Top25Report.
WP:5000, an automated list of the 5,000 most popular pages on Wikipedia, is now being compiled weekly. It also identifies how many featured articles, good articles, and lists are included. For the current list covering January 27 to February 2, we find 239 featured articles and 468 good articles in the top 5000 pages. However, this report is based on raw data and includes non-article pages and popularly requested redlinks, like "Com/fluendo/plugin/KateDec.class" at No. 15 on the current list (a script used to stream media content; see Cortado (software)), as well as 18k Gold Watch at position 166, a recurring entry likely fueled by spambots. More information on how the WP:5000 results are computed is found below.
The WP:5000/Top25Report, started in January 2013, is a manually moderated weekly Top 25 list of the most popular articles on English Wikipedia. Similar in format to best-selling book or music charts, it is a bit more user-friendly in that it excludes non-article pages, likely DoS attack entries, and the Main page. It also tracks how long an article has remained in the Top 25. Throughout January 2013, certain American football-related pages have been popular (a yearly trend seen during the playoff season of that sport), as have recently released movies such as Django Unchained and notable recent deaths such as Aaron Swartz.
Articles that are "extremely popular" on Wikipedia fall into one of two categories: (1) occasional or isolated popularity, or (2) consistent popularity.
The prime sources of occasional or isolated popularity include:
Rank | Article | Date (UTC) | Views/hr | Views/sec | Notes |
---|---|---|---|---|---|
1 | Whitney Houston | 12 Feb 2012 | 1532302 | 425.6 | Death of subject |
2 | Amy Winehouse | 23 Jul 2011 | 1359091 | 377.5 | Death of subject |
3 | Steve Jobs | 6 Oct 2011 | 1063665 | 295.5 | Death of subject |
4 | Madonna (entertainer) | 6 Feb 2012 | 993062 | 275.9 | Super Bowl halftime |
5 | Osama bin Laden | 2 May 2011 | 862169 | 239.5 | Death of subject |
6 | The Who | 7 Feb 2010 | 567905 | 157.8 | Super Bowl halftime |
7 | Ryan Dunn | 20 Jun 2011 | 522301 | 145.1 | Death of subject |
8 | Jodie Foster | 14 Jan 2013 | 451270 | 125.4 | Golden Globes speech |
Rank | Article |
---|---|
1 | Wiki |
2 | |
3 | United States |
4 | YouTube |
5 | |
6 | Justin Bieber |
7 | Glee (TV series) |
8 | Sex |
9 | Wikipedia |
10 | Lady Gaga |
11 | Eminem |
12 | How I Met Your Mother |
13 | United Kingdom |
14 | The Big Bang Theory |
15 | India |
16 | World War II |
Meanwhile, reasons for long-term popularity are somewhat more intuitive. Tab. 2 shows the most popular articles over the last ~3 years. In addition to the broad underlying cultural and academic interests of Wikipedia's audience, we encourage the reader to consider:
The impetus behind storing these statistics was to better understand damage response on Wikipedia (the dissertation topic of author User:West.andrew.g). By storing statistics for every article at the finest granularity possible (hourly), it becomes possible to accurately estimate the number of readers who saw any particular article version. While practical writings have often focused on the time to revert of damaging edits, we argue that the quantity of persons who view it is the more relevant metric. Vandalism that survives for days on an obscure article is effectively harmless if no one visits that article.
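The exposure estimate described above can be sketched as follows. This is a minimal illustration, not the author's actual implementation: the function name `estimate_exposure`, the dictionary layout of the hourly counts, and the assumption that views are spread evenly within each hour are all hypothetical choices made for the example.

```python
from datetime import datetime, timedelta


def estimate_exposure(hourly_views, start, end):
    """Estimate how many views fell inside the interval [start, end).

    hourly_views maps the datetime at the top of each hour to the number
    of views recorded during that hour. Views are ASSUMED to be spread
    uniformly within an hour, so partially covered hours are prorated.
    """
    if end <= start:
        return 0.0
    one_hour = timedelta(hours=1)
    # Begin at the top of the hour in which the revision went live.
    hour = start.replace(minute=0, second=0, microsecond=0)
    total = 0.0
    while hour < end:
        overlap_start = max(start, hour)
        overlap_end = min(end, hour + one_hour)
        fraction = (overlap_end - overlap_start) / one_hour
        total += fraction * hourly_views.get(hour, 0)
        hour += one_hour
    return total


# Hypothetical data: a damaging edit live from 12:30 to 13:10.
views = {
    datetime(2013, 1, 14, 12): 3600,
    datetime(2013, 1, 14, 13): 1800,
}
exposure = estimate_exposure(
    views, datetime(2013, 1, 14, 12, 30), datetime(2013, 1, 14, 13, 10)
)
print(round(exposure))  # half of the first hour plus one sixth of the second
```

Because the underlying data is hourly, any estimate for an edit that survives only a few minutes depends heavily on the uniform-traffic assumption, which is least reliable during sudden spikes.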
Fig. 1 plots the CDF of both the lifespan and view count of about 500,000 recent damaging edits. As the graph shows, at the median just one person is exposed to a damaging edit. Such an impressive figure is a testament to the automated (e.g., ClueBot NG) and semi-automated (e.g., Huggle and STiki) mechanisms that have recently been brought to bear on the task. While these tools produce probabilistic measures of damage, only STiki will soon integrate an article's popularity into its prioritization schema.
Fig. 1 also shows that ~10% of damaging edits are viewed by 100+ persons. Deeper analysis shows that many of the associated survival times are quite short, and these are often the result of damage to extremely popular articles. With the human latency already quite minimal (and a certain amount of latency being inherent), new solutions are needed. Consider that spammers could opportunistically target very popular pages to exploit these brief windows of opportunity. [3] Dynamically and autonomously moving articles in and out of "page protection" or "pending changes" based on their traffic patterns is another possible use-case for this data. As Fig. 2 demonstrates, the power-law distribution of views over articles would suggest relatively few articles need to be protected to have significant impact.
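One way to picture traffic-driven protection is a simple rule with two thresholds, so that an article enters protection during a spike and leaves only once traffic has clearly subsided. The sketch below is purely illustrative: the function `update_protection` and the threshold values are hypothetical and do not describe any existing Wikipedia mechanism.

```python
def update_protection(protected, hourly_views, enter_at=100_000, exit_at=20_000):
    """Return the new set of protected titles after one hourly update.

    protected    -- set of titles currently protected
    hourly_views -- dict mapping title -> views in the most recent hour
    enter_at     -- HYPOTHETICAL views/hr at or above which protection starts
    exit_at      -- HYPOTHETICAL views/hr at or below which protection ends

    Titles between the two thresholds keep their current state, which
    avoids flapping in and out of protection on noisy traffic.
    """
    updated = set(protected)
    for title, views in hourly_views.items():
        if views >= enter_at:
            updated.add(title)
        elif views <= exit_at:
            updated.discard(title)
    return updated


state = set()
state = update_protection(state, {"Whitney Houston": 1_532_302, "Obscure stub": 3})
state = update_protection(state, {"Whitney Houston": 60_000})  # still protected
state = update_protection(state, {"Whitney Houston": 5_000})   # released
```

Using a gap between the entry and exit thresholds is a design choice for stability; a single threshold would be simpler but could toggle protection every hour for an article hovering near the cutoff.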
Spam and vandalism are surface-level issues. Recent analysis of deleted revisions on English Wikipedia showed copyright violations, being much harder to detect in casual patrolling work, to have significant lifespans and end-user exposures. [4] This finding has motivated research into autonomous means of copyright violation discovery (see WP:Turnitin).
Article popularity can also be a measure for deciding which articles to improve, a concept already familiar to WikiProjects that keep tabs on the popularity of articles within their project (e.g., Wikipedia:WikiProject Songs has a watchlist for the 1,500 most popular song articles). At the aggregate level, the distribution of page views follows a "power law distribution". Fig. 2 represents one month's views on Wikipedia graphed against a Zipf distribution (a distribution where the most frequent item will occur approximately twice as often as the next item, three times as often as the third item, and so forth).
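The Zipf comparison can be made concrete with a short sketch. Under the idealized distribution described above, the article at rank r receives roughly 1/r of the views of the top article, so a plot of log(views) against log(rank) is a straight line with slope close to -1. The helper names `zipf_expected` and `loglog_slope` are hypothetical, and the least-squares fit is only one simple way to check how closely real data follows the ideal curve.

```python
import math


def zipf_expected(top_views, n):
    """Views predicted for ranks 1..n if rank r receives top_views / r."""
    return [top_views / rank for rank in range(1, n + 1)]


def loglog_slope(views_by_rank):
    """Least-squares slope of log(views) against log(rank).

    views_by_rank must list positive view counts in rank order and
    contain at least two entries. An ideal Zipf curve gives about -1.
    """
    xs = [math.log(rank) for rank in range(1, len(views_by_rank) + 1)]
    ys = [math.log(v) for v in views_by_rank]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


ideal = zipf_expected(1_000_000, 5000)
print(round(loglog_slope(ideal), 3))  # close to -1 for the ideal curve
```

Feeding observed monthly view counts (sorted in descending order) into `loglog_slope` would show how far the real distribution departs from the idealized exponent.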
The top 25 most viewed pages represent 4% of all total views, and the top 5000 represent 19% of all views. Though the distribution has an extremely long tail, the top 5000 data provides an opportunity to locate popular but poorly written articles that need attention, as opposed to randomly selecting one of the 4.15 million remaining articles on the project. That is not to say that articles deep in the long tail are less important, but for editors interested in prioritizing article improvement based on popularity and effect on public perception, the WP:5000 data is an important tool.
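Shares such as these can be computed directly from a list of per-article view counts. The following is a minimal sketch; the function name `top_k_share` and the sample numbers are hypothetical and are not the project's actual data.

```python
def top_k_share(view_counts, k):
    """Fraction of all views that go to the k most-viewed articles."""
    total = sum(view_counts)
    if total == 0:
        return 0.0
    top = sorted(view_counts, reverse=True)[:k]
    return sum(top) / total


# Hypothetical toy data: four articles with very uneven traffic.
sample = [50, 30, 10, 10]
print(top_k_share(sample, 1))  # share held by the single most-viewed article
print(top_k_share(sample, 2))  # share held by the top two
```

Applied to a full month of data, calling the function with k=25 and k=5000 would yield figures comparable to the percentages quoted above.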
These statistics also provide an opportunity to study what is popular in contemporary culture. Before the growth of the Internet, the primary quantitative measures of contemporary popularity included bestselling book and music charts, box office sales, and television and radio ratings. The digital age now gathers vast quantities of data on consumption not previously available, but some observations from the past still hold true. The fact that Justin Bieber was the sixth most popular article from 2010–12, far ahead of more critically appreciated talent, is consistent with what James D. Hart (author of The Oxford Companion to American Literature) observed in 1950 in writing about the most popular books of the mid-19th century:
“If a student of taste wants to know the thoughts and feelings of the majority who lived during Franklin Pierce's administration [1853–57], he will find more positive value in Maria Cummins' The Lamplighter or T.S. Arthur's Ten Nights in a Bar-Room than he will in Thoreau's Walden [the former being far more popular] – all books published in 1854.... Usually the book that is popular pleases the reader because it is shaped by the same forces that mold his non-reading hours, so that its dispositions and convictions, its language and subject, re-create the sense of the present, to die away as soon as that present becomes the past.”[5]
Thus, in the same way, page view statistics permit us to consider that Justin Bieber and One Direction (as maligned as they may be critically) are more popular, and likely more influential on culture, than, say, Kendrick Lamar, chosen by Pitchfork Media as releasing the best album of 2012.[6]
All the statistics in this article were produced by aggregating raw data made available by the WMF. This data contains hourly hit counts on a per-article basis for all WMF language/project combinations. Since Jan. 1, 2010, User:West.andrew.g has been parsing these files nightly and storing the English Wikipedia (article namespace) portions in a database hosted at the University of Pennsylvania. This is a non-trivial undertaking, consuming 1TB+ yearly. The database has been the basis for several academic results [3][4] (and was motivated by earlier third-party work[7]). More recently, he has begun publishing the aforementioned weekly reports of the top 5000 articles, made monthly reports for 2012 available, and released the source code behind these computations.
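The nightly parsing step might look roughly like the sketch below. It is not the released source code. It assumes each line of an hourly file has four space-separated fields (project code, page title, view count, bytes transferred), and it uses a short, hypothetical list of namespace prefixes as a crude stand-in for a proper article-namespace filter. The function names are likewise invented for illustration.

```python
from collections import Counter

# HYPOTHETICAL and incomplete list; a real filter would need every
# namespace and would have to handle article titles containing colons.
NON_ARTICLE_PREFIXES = (
    "Special:", "Talk:", "User:", "Wikipedia:",
    "File:", "Template:", "Category:",
)


def parse_hourly_lines(lines, project="en"):
    """Tally views per title for one hourly file.

    ASSUMES the line format 'project title views bytes'; malformed
    lines and lines for other projects are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        proj, title, views, _bytes = fields
        if proj != project or title.startswith(NON_ARTICLE_PREFIXES):
            continue
        if not views.isdigit():
            continue
        counts[title] += int(views)
    return counts


def top_n(hourly_counters, n):
    """Merge per-hour tallies and return the n most-viewed titles."""
    total = Counter()
    for counter in hourly_counters:
        total.update(counter)
    return total.most_common(n)


hour_1 = parse_hourly_lines([
    "en Justin_Bieber 900 12345",
    "en Special:Search 5000 999",
    "de Berlin 700 555",
])
hour_2 = parse_hourly_lines(["en Justin_Bieber 100 222", "en India 400 333"])
print(top_n([hour_1, hour_2], 2))
```

Keeping one tally per hour and merging afterwards mirrors the hourly granularity of the source data, so the same per-hour tallies can serve both weekly rankings and finer-grained analyses.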
Others have used the same data for alternative purposes: User:Henrik has developed a tool for looking up the traffic history of specific articles. The Wikitrends site concentrates on dramatic popularity increases/decreases. WMF analyst Erik Zachte produces WikiStats, which provides a higher-level perspective on all WMF projects in numerous statistical dimensions. Mr. Zachte also has a fascinating portfolio of his WMF statistical work. These Wikipedia/WMF-specific resources complement other Internet-scale observations regarding search and popularity; most famously the Google Zeitgeist.
There are some caveats in interpreting this data. First, this is a raw presentation of traffic and popularity. It is known that English Wikipedia traffic has generally been increasing over time (per [1]). This fact, and the growing Internet connectivity that likely underlies it, biases the data toward more recent events. Second, it should be mentioned that logs may have underreported page view data in early 2010.
Discuss this story
This is a fantastic article; thank you for sharing. Jujutacular (talk) 02:37, 6 February 2013 (UTC)
Not only is this a great article, it supplies important information -- not just for Wikipedia, but for the marketing world in general. Since most people don't know about the Signpost, I highly recommend posting about this article on marketing and social media sites - tweet it up. -- kosboot (talk) 14:26, 6 February 2013 (UTC)
Really great article about the power of WP. The effect that WP has had on our world is huge, but unfortunately largely unmeasurable on the individuals' side of things. I thought that the readers here, may likewise enjoy a piece of research that I recently read (that cites WP as an example), that I feel is very fascinating in how it describes the power behind phenomena like WP. It's called "The Theory of Crowd Capital" and you can download it here if you're interested: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2193115 Enjoy! — Preceding unsigned comment added by 24.85.85.220 (talk) 01:24, 7 February 2013 (UTC)
Nice work. I'm reminded of this page: Wikipedia:Short popular vital articles.
Azerbaijani places