Searching for Wikipedia

News from the WMF

By Dan Andreescu, Kinneret Gordon, Isaac Johnson and Nicholas Perry

This article was originally published on the Wikimedia Techblog on June 7, 2021 (CC BY-SA 4.0).

How people use Search to access Wikipedia is a common question among researchers. Until now, however, there has been little data available about this relationship. To help address these questions, the Wikimedia Foundation is releasing a new, faceted dataset on search engine traffic to Wikipedia so you can ask questions like "What is the most common search engine in my country?" or "Which search engine is most used by Android users?"

It's no secret that search engines ferry a great deal of traffic to Wikipedia. With every major change in how a search engine presents its results,^[a] questions arise about how the change might affect Wikipedia traffic. Historically, there has been scant data about how search engine traffic varied by platform and region.

We are taking a small step towards shedding greater light on the relationship between Search and Wikipedia by releasing a new, daily dataset of Wikipedia pageviews referred directly from search engines split by Wikipedia language, search engine, operating system, and web browser.

A day in the life of search

What might you find combing through the data? Well, first, you'll discover there's a lot of data! In any given month, about eight billion pageviews to Wikipedia come directly from clicks on search engines. On any given day, this dataset showcases pageviews that come from about 220 different countries, 100 different languages of Wikipedia,^[b] 50 browser families, 14 operating systems, and 20 search engines.^[c]

The vast majority of those clicks (over 90%) come from Google Search (table; see Figure 1). The next closest competitor is Yahoo Search at 2% of views, followed by Microsoft Bing, DuckDuckGo, and Yandex Search. While Google's search traffic is globally quite dominant, many of the smaller search engines receive most of their Wikipedia referrals from a single country: for example, 70% of Yahoo's referrals come from Japan, 90% of Yandex's come from Russia, and almost 100% of Naver's come from South Korea (nested table).
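The two kinds of percentages quoted above (an engine's share of all search referrals, and a country's share within one engine) are easy to conflate. The following is a minimal sketch of how each could be computed from faceted rows. The row layout, the function names, and every count are invented for illustration and are not taken from the released dataset.

```python
from collections import defaultdict

# Hypothetical rows: (country, search_engine, pageviews).
# All counts are made up for illustration; they are not real data.
ROWS = [
    ("Japan", "Google", 600),
    ("Japan", "Yahoo", 140),
    ("United States", "Google", 900),
    ("United States", "Yahoo", 60),
    ("Russia", "Google", 200),
    ("Russia", "Yandex", 90),
    ("Germany", "Yandex", 10),
]


def engine_share(rows):
    """Return each search engine's fraction of all referred pageviews."""
    totals = defaultdict(int)
    for _country, engine, views in rows:
        totals[engine] += views
    grand_total = sum(totals.values())
    return {engine: views / grand_total for engine, views in totals.items()}


def country_share_within_engine(rows, engine):
    """Return each country's fraction of a single engine's referred pageviews."""
    totals = defaultdict(int)
    for country, row_engine, views in rows:
        if row_engine == engine:
            totals[country] += views
    engine_total = sum(totals.values())
    return {country: views / engine_total for country, views in totals.items()}


if __name__ == "__main__":
    print(engine_share(ROWS))
    print(country_share_within_engine(ROWS, "Yahoo"))
```

The first function answers "how dominant is this engine overall?", while the second answers "where does this engine's traffic come from?"; an engine can be small globally and still be concentrated in one country.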

The increasing dominance of mobile devices can be seen in this dataset as well, though with slightly more variation between countries than between search engines. Android and iOS typically trade the top two spots, with Windows generally in a strong third place (heatmap). Browsers have similar dynamics but replace Android with Chrome Mobile, iOS with Safari, and add a few more desktop versions into the mix (heatmap).

Visualizing the data

The multi-faceted nature of this new dataset also presented some new display challenges. Most datasets we release consist of a target metric (e.g., pageviews) and are composed of a single facet (e.g., language edition) or sometimes hierarchical facets (e.g., you can split by project family like Wikipedia or by individual languages of Wikipedia). This dataset has five non-hierarchical facets, all with many categories, as highlighted in the previous section.

Maybe you're interested in which search engine is dominant in a particular market? Or how Android users compare to iOS users? Or the distribution of language editions in a given country? Or, or, or…? This makes our standard public dashboards — Wikistats, Dashiki, Discovery — a poor fit for someone who might want to slice or aggregate the data as they primarily support a single dominant facet.
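To make the "slice or aggregate" idea concrete, here is a minimal sketch of filtering on some facets and grouping by others. The field names (`country`, `language`, `engine`, `os`, `browser`, `views`), the `aggregate` helper, and all counts are assumptions made for illustration; they do not reflect the dataset's actual schema or values.

```python
from collections import Counter

# Hypothetical records with five non-hierarchical facets plus a metric.
# Field names and counts are illustrative assumptions only.
RECORDS = [
    {"country": "Japan", "language": "ja", "engine": "Google",
     "os": "iOS", "browser": "Mobile Safari", "views": 50},
    {"country": "Japan", "language": "ja", "engine": "Yahoo",
     "os": "Android", "browser": "Chrome Mobile", "views": 30},
    {"country": "Japan", "language": "en", "engine": "Google",
     "os": "Windows", "browser": "Chrome", "views": 20},
    {"country": "Brazil", "language": "pt", "engine": "Google",
     "os": "Android", "browser": "Chrome Mobile", "views": 40},
    {"country": "Brazil", "language": "pt", "engine": "Bing",
     "os": "Windows", "browser": "Edge", "views": 10},
]


def aggregate(records, group_by, where=None):
    """Sum views per combination of the facets named in group_by.

    where is an optional mapping of facet name to required value;
    records that do not match every entry are skipped.
    """
    where = where or {}
    totals = Counter()
    for record in records:
        if all(record[facet] == value for facet, value in where.items()):
            key = tuple(record[facet] for facet in group_by)
            totals[key] += record["views"]
    return dict(totals)


if __name__ == "__main__":
    # Which search engine is dominant in a particular market?
    print(aggregate(RECORDS, ["engine"], where={"country": "Japan"}))
    # How do Android users compare to iOS users?
    print(aggregate(RECORDS, ["os"]))
    # What is the distribution of language editions per country?
    print(aggregate(RECORDS, ["country", "language"]))
```

Because no facet is privileged, the same function covers all three questions posed above; a dashboard built around one dominant facet would need a separate view for each.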

Luckily, Wikimedia has some experience with an open-source dashboarding platform called Turnilo that is a perfect fit. Turnilo allows us to create quick, dynamic filters and aggregations, supports a variety of displays (e.g., tables, line graphs, or heatmaps), and makes it easy to share specific views of the data via URLs. We currently use Turnilo to showcase a number of private datasets, so we had some experience working with it but had never provided a publicly viewable version. In just a few hours, we built a public Turnilo instance on our Cloud VPS infrastructure (code). We worked with the Turnilo team to improve support for flat files (as opposed to their more popular, but more complex, Druid back-end). And now we have a strong use-case for expanding our public dataset dashboarding options (Phab)!

Go check it out at: https://wiki-search-referrals.wmcloud.org/. And if all the options are a bit overwhelming, here's a good place to start: search referrals from the previous month split by country and search engine (link).

Footnotes

^[a] See, for example, Google Panda, Google Penguin, Google Pigeon – Signpost editors
^[b] Astute Wikipedians might notice that there are 300 language editions, not 100. The discrepancy arises from masking that we do for any pageview counts below 500 for privacy reasons; i.e., many other language editions (and countries, OSes, and browsers) receive search traffic, but they are represented as “other” in this dataset if they do not meet that threshold. See https://phabricator.wikimedia.org/T270140 for more details.
^[c] You can see more information on the search engines we track in this dataset here (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/referrer_daily#Search_Engines). If you notice any major search engines missing, let us know!
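The masking described in the footnote on language editions can be sketched as follows. This is one plausible reading only: it assumes the threshold is applied to the total for each value of a single facet and that masked counts are summed under an "other" label. The function name, the placeholder language codes, and the counts are hypothetical; the linked Phabricator task is the place to look for the actual procedure.

```python
THRESHOLD = 500  # counts below this value are masked, per the footnote


def mask_small_counts(counts, threshold=THRESHOLD, other_label="other"):
    """Fold every facet value whose count is below threshold into other_label.

    Assumption: masked counts are added together under the "other" label.
    """
    masked = {}
    for value, views in counts.items():
        key = value if views >= threshold else other_label
        masked[key] = masked.get(key, 0) + views
    return masked


if __name__ == "__main__":
    # "xx" and "yy" are placeholder language codes; all counts are made up.
    daily_views = {"en": 12000, "ja": 3000, "xx": 120, "yy": 40}
    print(mask_small_counts(daily_views))
```

Under this reading, a small language edition still contributes to the totals but is no longer individually identifiable, which is why fewer editions appear by name than actually receive search traffic.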

Signpost issue: 27 June 2021

Discuss this story

What, no comments? This is quite interesting, at least for me (I am researching global aspects of Wikipedia's popularity or lack thereof). Question: When I asked for Korean data, I got Google 55%, Naver 31%, Daum 9%, Bing 1%, other 2%. Am I understanding this correctly - that out of all search engine referrals from Korea, Google accounts for 55%, while Naver accounts for 31%? Since Naver accounts for 70-90% of the Korean search engine market, this would suggest that Naver is prioritizing Wikipedia much, much less than Google does. --Piotr Konieczny aka Prokonsul Piotrus, 12:00, 16 July 2021 (UTC)
