The Signpost

News and notes

Wikipedia's traffic statistics understated by nearly one-third

By The ed17 and Tony1

A Wikipedia researcher has discovered that the encyclopedia's widely used article traffic statistics are missing out on approximately one-third of total views.

Computer scientist Andrew West has found that mobile readers are not counted by stats.grok.se (an unofficial website linked from the "history" tab on every Wikipedia page) or any other service/report that tabulates and visualizes the Wikimedia Foundation's official raw data. Thanks to a historical artifact, desktop and mobile counts have been segregated since the figures were first released in 2007. "The world has changed a lot since the original code was written," the WMF's director of analytics Toby Negrin told the Signpost. "We are working hard to catch up."

Impact

Of 9.5 billion total views to English Wikipedia in August 2014, about 3 billion—31.6%—are not reported in the raw per-article statistics. Other projects are assumed to have similar omissions based on their own mobile viewership ratios.
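The arithmetic behind these figures can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the inputs are the rounded August 2014 numbers quoted above, and scaling a reported count by one site-wide mobile share is a simplifying assumption, since the share will differ from article to article.

```python
def missing_share(total_views: float, unreported_views: float) -> float:
    """Fraction of all views absent from the per-article statistics."""
    return unreported_views / total_views


def estimate_total(reported_views: float, mobile_share: float) -> float:
    """Scale a reported (desktop-only) count up to an estimated total.

    Assumes the same mobile share applies to the page in question,
    which is a rough approximation at best.
    """
    if not 0 <= mobile_share < 1:
        raise ValueError("mobile_share must be in [0, 1)")
    return reported_views / (1 - mobile_share)


TOTAL = 9.5e9        # total English Wikipedia views, August 2014 (rounded)
UNREPORTED = 3.0e9   # views missing from the per-article data (rounded)

share = missing_share(TOTAL, UNREPORTED)
reported = TOTAL - UNREPORTED

print(f"missing share: {share:.1%}")
print(f"estimated total from reported views: {estimate_total(reported, share):.3g}")
```

Dividing by one minus the mobile share, rather than simply adding a third, matters here: the 31.6% is a share of the total, so the reported figure has to grow by roughly 46% to recover it.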

West told the Signpost he ran into the problem when collating view statistics for the English Wikipedia's Medicine WikiProject. The figures are being used in an upcoming academic paper comparing Wikipedia to WebMD, the World Health Organization, the National Institutes of Health, and other high-traffic medical websites. West caught the error early enough to add a disclaimer, but he's "curious and fearful as to how many other WikiProjects and researchers might have fallen into the same trap."

Unfortunately, that number is not zero. For one recent example, Variety's new "Digital Audience Ratings" use Wikipedia's traffic statistics as a key component. Jason Klein of ListenFirst, the company writing the posts, said in an interview with Lost Remote that "We have been monitoring Wikipedia page views daily for tv shows (as well as films and consumer brands) for over two years, and have found fascinating trends ..." (Editor's note: for additional information, please see this week's "In the media".)

Similarly affected are the English Wikipedia's top 25 viewed articles (ten of which are used in the Signpost's weekly "Traffic report"). All of these initiatives are missing out on what West calls the mobile "bump" that popular culture and breaking-news events kindle.

The largest ramification may be reserved for users in the global south, where a higher percentage of individuals use mobile phones to surf the web. High-priced traditional computers can be out of reach for large segments of the population, who have turned instead to smartphones; this was a chief inspiration for the Wikimedia Foundation's Wikipedia Zero project. Pgallert laid out the scope of the computer issue on the Wikimedia blog last year:


Future

Negrin told us that they are aware of the problem and are working to replace the current apparatus with a "modern, scalable system," which will come out in a preliminary form next quarter. The team is also working on redefining what a "page view" is, taking modern concepts like mobile apps and web, API requests, and automated bots into account. Negrin added that "fortunately, we'll be in a position very soon to provide more accurate data to the Foundation and the Community."

The work involved in this is not negligible. As research analyst Oliver Keyes wrote to us, "The overall page view trends are of increasing importance to how we understand how people consume our site. At the moment we ... have a lot of ideas and a lot of the nuts and bolts worked out and tested, but it's fairly inchoate and needs to be organised better before we do anything with it. Once we have done that, we'll move on to implementing it and running it in parallel to the existing infrastructure to detect irregularities."

In the meantime, the unofficial status of grok.se (it is still listed as a "beta service") and the varying reliability of the WMF's data dumps leave researchers like West in the lurch. For example, grok.se periodically misses full days of stats (such as 28 August), which invariably leads to frustration with the website's coder, Henrik, but the issue lies with the WMF-released data. In the 28 August case, the raw data lack the traffic statistics for five hours (16:00–21:00 UTC).
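A researcher who wants to spot such gaps before aggregating can check which hours are actually present. The sketch below is a hypothetical helper, not part of any WMF or grok.se tooling; it assumes the hourly timestamps have already been extracted from whatever files were downloaded, and it reconstructs the 28 August gap (year assumed to be 2014) purely for illustration.

```python
from datetime import datetime, timedelta


def missing_hours(present, start, end):
    """Return every whole hour in [start, end) with no entry in `present`."""
    have = set(present)
    gaps = []
    hour = start
    while hour < end:
        if hour not in have:
            gaps.append(hour)
        hour += timedelta(hours=1)
    return gaps


# Illustrative input: a full day of hourly stamps minus 16:00-21:00 UTC.
day = datetime(2014, 8, 28)  # year is an assumption
present = [day + timedelta(hours=h) for h in range(24) if not 16 <= h < 21]

gaps = missing_hours(present, day, day + timedelta(days=1))
print([g.strftime("%H:%M") for g in gaps])
```

Daily totals computed from a day with gaps like this will be too low, so it is safer to flag the day as incomplete than to report the sum as is.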

It appears that statisticians, researchers, and curious Wikimedia contributors will have to wait only a little longer for a more stable and reliable solution.

Editor's note: emails to Henrik, the owner of stats.grok.se, and Domas Mituzas, the former WMF database administrator who originally coded the raw data output, were not returned by publishing time.
Update: a new Pageview API was released by the Wikimedia Foundation in December 2015, and stats.grok.se has been replaced by a WMF Labs tool since January–February 2016.

In brief

Rachel diCerbo, new manager of the WMF's Community Engagement (Product) team.

Discuss this story


Possible typo in image caption: "new manage" --> "new manager" ? --Hispalois (talk) 05:45, 18 September 2014 (UTC)

@Hispalois: I've fixed it. Feel free to make this kind of correction yourself in future - as Wikipedia:Wikipedia Signpost/About says, "post-publication edits such as grammatical and spelling corrections to articles are welcome". -- John of Reading (talk) 06:39, 18 September 2014 (UTC)
  • We need to solve this pageview issue. When we at WPMED do outreach to media they want to know what impact we are having in the developing world. Currently we do not really know.
  • With respect to templates, the inconsistency across Wikipedia is a huge issue we are facing with the translation project. A basic set of templates is needed. This is something I would love to see the WMF work on. Doc James (talk · contribs · email) (if I write on your page reply on mine) 06:06, 18 September 2014 (UTC)
  • Article sounds like it blames Henrik. WMF has had years to build a real system to replace stats.grok.se. --LA2 (talk) 11:23, 18 September 2014 (UTC)
    Er. Replacing stats.grok.se would not be the solution here; the files SGS is relying on lack the records </pedantry> Ironholds (talk) 12:24, 18 September 2014 (UTC)
    Correct. I've done some minimal re-wording to fix that in the lede. Hope that this is okay per Signpost editing norms. West.andrew.g (talk) 13:20, 18 September 2014 (UTC)
  • When I read the title, I thought that page views had fallen by 1/3rd. "Off" implies something different than the context suggests, in this case. Maury Markowitz (talk) 10:48, 18 September 2014 (UTC)
  • I regularly present to health organizations about how I share the information in their fields of expertise on Wikipedia. All of them are surprised to hear about the number of pageviews that health articles on Wikipedia get, and become more interested in Wikipedia because of the audience it has. Having accurate information about the number of pageviews that Wikipedia articles get contributes to a persuasive argument that experts should contribute to Wikipedia if they want to reach the audience seeking information in their field. I appreciate all efforts to better describe Wikipedia's audience because I think good descriptions of Wikipedia's audience are necessary to attract more expert contributors to our community. Blue Rasberry (talk) 13:36, 18 September 2014 (UTC)
  • Great work, guys. Accurately counting your users, patrons, customers, etc. is pretty basic stuff. My non-tech guy impression of this, coupled with the Media Viewer debacle, is that the WMF is not up to the technological challenges facing Wikipedia. I have no idea if this is a fair assessment or not, but I'd love to see you follow up on this by getting the Foundation's perspective on these matters. Gamaliel (talk) 15:06, 18 September 2014 (UTC)
  • Alas no scoop here. Andrew West didn't 'discover' that mobile traffic per article is not counted. I told him on several occasions, at least as early as Aug 20 2014, and told project WikiMedicine at least a year earlier. Many WMF reports (on page views per title) contain a very clear notice in the introduction (in red, as shown here) that "Please be aware that pageviews per article are not yet captured for Wikipedia's mobile site. Average underreporting will be 15-20%, but may be much higher for languages mostly spoken in the Global South, where a larger share of web access happens via mobile phones." This category of reports has contained the notice for about 18 months [1]. The following report, dated July 2013, on health articles in wp:en, with the same unmissable notice, was prepared specifically for requests from project WikiMedicine and was sent to some key members (still) of the project and 'WikiMedicine discussion' list. As for the defect itself, the Analytics and Engineering Teams discussed how to fix this on several occasions and finally decided not to repair legacy software (which would have been either very time consuming or even undoable given our infrastructure, as one aborted attempt some 2 years ago revealed). It was not in any way swept under the carpet. Personally I take pride in being very transparent to the community about what we can deliver and what not (yet). And I'm sure the same holds for most colleagues. Erik Zachte (talk) 16:01, 18 September 2014 (UTC)
    • Could you please find someone willing to take pride in producing accurate statistics? 72.130.129.212 (talk) 18:19, 18 September 2014 (UTC)
      • Would you have the guts to sign your comment? Erik Zachte (talk) 18:34, 18 September 2014 (UTC)
      • This is a big deficiency. But it's incredibly unfair to portray Erik or anyone else working on this as not taking pride in accurate statistics. It's a very hard problem with a lot of moving parts. Transparency about my motivations and expertise - I'm one of the people actually working on the problem. I'd love to see that transparency reflected in everyone else's commentary, too (Gamaliel wins points for starting from the premise of 'this is just my opinion as an observer', even if I disagree with that opinion to some degree). Ironholds (talk) 19:42, 18 September 2014 (UTC)
      • I appreciate Erik's work and we've had a productive back-and-forth on this. I think the talk page at [2] makes my first-person story clear. On Aug. 9, I told a user that "yes! we are counting mobile views" (paraphrased) based on my flawed assumption that since mobile requests are hitting the same content as desktop views, they would presumably be counted in the totals. When shortly thereafter I learned that this was not the case, I again posted to that talk page. I never claimed to 'discover' anything, rather it "came to my attention", and tried to place no blame: "This very well could be an artifact of an earlier system that was not prepared to handle mobile views. I am in no position to comment ..." I posted this where my stakeholders and consumers of the data could find it. I accuse no-one of non-transparency. It is not a triumph of my own to make this revelation -- quite the opposite -- it means I have to tell those I collaborate with that I was distributing incomplete numbers and the flawed assumption that could cause them to be mis-interpreted. West.andrew.g (talk) 21:48, 18 September 2014 (UTC)
  • Question: Does this underreporting affect our estimates of total Wikipedia usage? --j⚛e deckertalk 19:45, 18 September 2014 (UTC)
    It shouldn't. So, we actually have two streams - one is per-page impression data (basically, URL aggregation with some amount of filtering and parameter-stripping) and global PV data. They're processed using different filters (for obvious reasons. Pageview != impression). The global pageviews count includes mobile data. This shouldn't be having an impact on the overall tracking. Erik, obviously correct me if I'm getting this wrong Ironholds (talk) 19:48, 18 September 2014 (UTC)
    • We were both answering at same time, here is my version of a similar answer:
    • No, or hardly. Webstatscollector writes two files per hour: one tiny file called projectcounts with total views per wiki, mobile and non-mobile as separate counts, and one huge file called pagecounts with views per page title. That is not to say that numbers in projectcounts are perfect; for one, we still need to filter bot requests. Erik Zachte (talk) 19:58, 18 September 2014 (UTC)
  • Ah, yes. The global south is mentioned. Killiondude (talk) 20:35, 18 September 2014 (UTC)[reply]
  • @Erik Zachte: Do you know if there is any project in place to start building a stats.grok.se replacement (perhaps in Wikimedia Labs)? There is a clear desire all round to see page view data and it's not suitable to have to rely on an external website to do this. We link to this website on every page yet it carries the foreboding warning: "This is very much a beta service and may disappear or change at any time". It's apparent that a lot of problems would be caused for Wikipedia should it go down. SFB 13:40, 20 September 2014 (UTC)
@Sillyfolkboy, no clear-cut plans. An earlier initiative to build an API on top of the current data stream was abended. The present team took that in as confirmation that perfect is the enemy of good, and also that we had better take things one step at a time. Current focus is on stabilizing, extending, and tuning the raw data feeds. Colleagues at Analytics Engineering are working as we speak on porting the current webstatscollector tool to the new hadoop environment. Webstatscollector is the basis of all external reporting on traffic, not only at stats.grok.se, and of much WMF reporting. Once migration has been done, hopefully both mobile stats and binary file requests can be added. I'm saying hopefully, as capacity is always an issue. The infrastructure has way more capacity (and flexibility) than the old, but as always it takes just a few overoptimistic choices to again fill any scaled up system up till strangulation point. So one step at a time. Implementing hadoop and all of the software stack that comes with it is a serious undertaking for a small team, making it robust and tuning it takes time, and the team has many tasks on their plate. But we are looking right now into the finer details of how to extend the data feed without losing downward compatibility. Once that data feed is more reliable, complete and unambiguous (e.g. bots counted separately on the wiki level, and omitted on the page level), more flexible querying would be a new challenge. Most likely that would be a post-processing step and a separate data warehouse outside hadoop (which as I understand is more batch oriented). When, how, and by whom are currently not our focus, one step at a time. Erik Zachte (WMF) (talk) 20:32, 21 September 2014 (UTC) Erik Zachte (WMF) (talk) 14:53, 21 September 2014 (UTC)
@Erik Zachte (WMF): Thanks for the reply. Sounds like there is a lot of ongoing change at the moment, so that's understandable that (re-)development based on the current interface is low priority. Good luck with the upcoming work. I hope the above article and comments confirm that potential usage of these stats isn't restricted to esoteric nerd research – they are integral to understanding our users! SFB 00:23, 22 September 2014 (UTC)

I am not Pranav Curumsey

In the section on the Indian chapter, you quote my email to the Wikimediaindia-l and you inadvertently (?) call me Pranav Curumsey. I am Pradeep Mohandas and I forwarded an email of Pranav Curumsey's resignation from the Wikimedia India Chapter members list to the public Wikimediaindia-l, which you quote in your story calling me Pranav. Please read this carefully before reporting. Thanks. Thiruvathira (talk) 15:24, 23 September 2014 (UTC)



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0