The Signpost

Recent research

Wikipedia's flood biases

By Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"Uneven Coverage of Natural Disasters in Wikipedia: The Case of Floods"

A 2017 flood in China (64.29% of whose floods are covered on English Wikipedia, according to the study)

A paper[1] with this title, presented earlier this year at the "International Conference on Information Systems for Crisis Response and Management" (ISCRAM 2020), adds to the growing literature on Wikipedia's content biases, finding that while the English Wikipedia "offers good coverage of disasters, particularly those having a large number of fatalities [...] the coverage of floods in Wikipedia is skewed towards rich, English-speaking countries, in particular the US and Canada."

Any bias analysis of this kind is faced with the problem of identifying an unbiased "ground truth" that Wikipedia's coverage can be compared to. The researchers approach this diligently, resorting to "three of the most comprehensive databases documenting floods that are commonly used by the hydrology science for reference": Floodlist, which is funded by the EU's Copernicus Programme, the "Emergency Events Database" (EM-DAT), and the University of Colorado's Dartmouth Flood Observatory (DFO). Focusing on a timespan extending from 2016 to 2019, and following an elaborate process involving e.g. defining search criteria for each source and deduplicating the results, they arrived at a consolidated dataset consisting of 1102 flood events, of which only 249 were present in all three databases. The authors asked experts to identify possible reasons for these discrepancies (or biases) between the sources, e.g. the fact that Floodlist includes landslides resulting from heavy rain events that do not meet the definitions of the other two sources. They concluded that these explanations justified relying on events that were covered in at least two of the three sources, resulting in a dataset consisting of 458 floods.
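The "at least two of the three sources" rule can be illustrated with a short Python sketch. The event identifiers and per-source sets below are invented for illustration, and the sketch assumes the events have already been matched and deduplicated across sources; the paper's actual consolidation process is more elaborate and is not reproduced here.

```python
from collections import Counter

# Hypothetical, already-deduplicated event identifiers per source (illustrative only).
sources = {
    "Floodlist": {"flood-a", "flood-b", "flood-c", "landslide-x"},
    "EM-DAT": {"flood-a", "flood-b", "flood-d"},
    "DFO": {"flood-a", "flood-c", "flood-e"},
}


def events_in_at_least(sources, min_sources):
    """Return the events recorded by at least `min_sources` of the databases."""
    counts = Counter(event for events in sources.values() for event in events)
    return {event for event, n in counts.items() if n >= min_sources}


in_all_three = events_in_at_least(sources, 3)    # strictest agreement
in_two_or_more = events_in_at_least(sources, 2)  # the threshold the authors settled on
```

In this toy data the landslide that appears only in Floodlist is dropped by the two-source threshold, mirroring the kind of discrepancy the experts pointed to.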

The comparison dataset representing Wikipedia's coverage was constructed using keyword searches to find individual sentences mentioning flood events (rather than entire articles, which one might identify more easily using e.g. Category:Floods).

The analysis of the data focuses on the "hit rate" per country, defined as the percentage of floods from the ground truth dataset that have at least one corresponding item in the Wikipedia dataset. The United States was both the country with the highest number of floods in the ground truth dataset (36, followed by Indonesia with 25 and the Philippines with 17) and, among the countries with the most floods, the one with by far the highest hit rate (86.11%). Aggregated by continent, North America likewise had the highest Wikipedia coverage (49.06%), and South America the lowest (10.53%). Interestingly, Europe did not fare very well, with a hit rate of 21.18%, slightly below that of Africa (21.88%) and well behind Asia (which had 37.63% of its floods covered on English Wikipedia).
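As a rough illustration of the hit rate definition, here is a minimal Python sketch. The event identifiers are hypothetical, and the figure of 31 covered floods is not stated in the review; it is inferred here as the only whole number out of 36 that rounds to the reported 86.11%.

```python
def hit_rate(ground_truth_events, wikipedia_events):
    """Percentage of ground-truth floods with at least one match in the Wikipedia dataset."""
    if not ground_truth_events:
        raise ValueError("no ground-truth events")
    covered = ground_truth_events & wikipedia_events
    return 100 * len(covered) / len(ground_truth_events)


# Hypothetical identifiers: 36 ground-truth floods, 31 of them matched on Wikipedia
# (31 is an inferred value, see the note above).
ground_truth = {f"us-flood-{i}" for i in range(36)}
on_wikipedia = {f"us-flood-{i}" for i in range(31)}

us_rate = round(hit_rate(ground_truth, on_wikipedia), 2)
print(us_rate)  # 86.11
```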

To identify possible causes of the differing hit rates by country, the authors "analyzed several socio-economic variables to see whether they correlate with floods coverage. These variables are GDP per capita, GNI per capita, country, continent, date, fatalities, number of English speakers and vulnerability index." This analysis consists of presenting various tables and graphs with the hit rate plotted over four to six buckets of the independent variable (e.g. Low income / lower middle income / upper middle income / high income), eschewing more sophisticated statistical methods. They find some evidence for a bias toward higher income countries, although the trend is not entirely consistent (e.g. in a different classification into six instead of four income levels, the second-lowest level "Lower middle income" had a higher hit rate than the three above it). They also find evidence that countries with a higher ratio of English speakers have better coverage, although "The language can be only a partial explanation because for floods in Australia the hit-rate is half and lower than other non-English-speaking countries" (similarly, the UK only ranked 16th in Wikipedia coverage among the top 20 countries with at least five floods in the ground truth data).

Still, the paper's overall conclusion is that "Wikipedia’s coverage is biased towards some countries, particularly those that are more industrialized and have large English-speaking populations, and against some countries, particularly low-income countries which also happen to be among the most vulnerable".

Unfortunately the researchers fail to acknowledge their own glaring bias in this research, namely the decision to exclusively focus on the English Wikipedia in a paper that is repeatedly hand-wringing about language disparities. To be sure, this bias has long been identified as an issue affecting a large part of Wikipedia research, and there are practical reasons for confining such an analysis to a language that researchers are fluent in. But since the authors clearly seem to frame such biases as a bad thing (at one point referring to them as "flaws" of Wikipedia), it is worth asking whether and why they think that the authors of reference works like Wikipedia should not focus their labor on those natural disasters that are more likely to affect their readers. While the study's confinement to only one of Wikipedia's hundreds of languages is mentioned in the "Limitations and future work" section, it is again framed just as an open question about Wikipedia's shortcomings ("understand how an editor’s language affects the coverage bias"), rather than as an acknowledgment of the paper's own.


Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Automated Adversarial Manipulation of Wikipedia Articles" using Markov chains

From the abstract and paper:[2]

"The WikipediaBot is a self-contained mechanism with modules for generating credentials for Wikipedia editors, bypassing login protections, and a production of contextually-relevant adversarial edits for target Wikipedia articles that evade conventional detection. The contextually-relevant adversarial edits are generated using an adversarial Markov chain that incorporates a linguistic manipulation attack known as MIM or malware-induced misperceptions. [...]

To show how the WikipediaBot could be used to harm discourse, we analyzed a scenario where a hypothetical adversary aims to reduce mentions of Uyghurs on the Uyghurs Wikipedia page [e.g. by replacing "the ongoing repression of the Uyghurs" with "the ongoing repression of the Manchus", and other edits suggested by the MIM engine]. ... we contacted the Wikipedia security team with the details and the inner workings of WikipediaBot prior to writing this publication as part of the responsible disclosure requirement. The exposure of the WikipediaBot system architecture allows for consideration of other types of detection, prevention, and defenses then the one proposed in this paper [which was "to add a more robust CAPTCHA system to prevent edits to individual pages"]. We only tested the WikipediaBot on a local, isolated testbed, and never used it to make any adversarial manipulation on the live Wikipedia platform."

"From web to SMS: A text summarization of Wikipedia pages with character limitation"

From the abstract:[3]

"Due to the limitation of the number of characters, a Wikipedia page cannot always be sent through SMS. This work raises the issue of text summarization with character limitation. To solve this issue, two extractive approaches have been combined: LSA and TextRank algorithms. [...] The evaluation showed the relevance of the approach for pages of at most 2000 characters. The system has been tested using the SMS simulator of RapidSMS without a GSM gateway to simulate the deployment in a real environment."

(Compare also previous efforts to make Wikipedia accessible via text messaging)

"RuBQ: A Russian Dataset for Question Answering over Wikidata"

From the abstract:[4]

"The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification."

"Topological Data Analysis on Simple English Wikipedia Articles"

From the abstract and paper:[5]

"We present three statistical approaches for comparing geometric data using two-parameter persistent homology [a tool from topological data analysis], and we demonstrate the applicability of these approaches on high-dimensional point-cloud data obtained from Simple English Wikipedia articles. [...] The data in this project was produced by applying a Word2Vec algorithm to the text of articles in Simple English Wikipedia, [converting] each of 120,526 articles into a 200-dimension vector, such that articles with similar content produce vectors that are close together. The data also gives a popularity score for each article, indicating how frequently the article is accessed in Simple English Wikipedia. Abstractly, our data is a point cloud of 120,526 points in ℝ^200, with a real-valued function on each point ..."

Dataset provides "interesting negative information" for Wikidata

From the abstract:[6]

"Rooted in a long tradition in knowledge representation, all popular KBs [knowledge bases] only store positive information, but abstain from taking any stance towards statements not contained in them. In this paper, we make the case for explicitly stating interesting statements which are not true. [...] We introduce two approaches towards automatically compiling negative statements. [...] Experimental results show that both approaches hold promising and complementary potential. Along with this paper, we publish the first datasets on interesting negative information, containing over 1.4M statements for 130K popular Wikidata entities."

See also Video and slides, OpenReview page, dataset

Amazon Alexa researchers measure "social bias" on Wikidata

From the abstract:[7]

"We present the first study on social bias in knowledge graph embeddings, and propose a new metric suitable for measuring such bias. We conduct experiments on Wikidata and Freebase, and show that, as with word embeddings, harmful social biases related to professions are encoded in the embeddings with respect to gender, religion, ethnicity and nationality. For example, graph embeddings encode the information that men are more likely to be bankers, and women more likely to be homekeepers."

The paper also contains lists of the top male and female professions in Wikidata (relative to female and male, respectively), evaluated by two different metrics. For the first metric (TransE embeddings), the male list is led by baritone, military commander, banker, racing driver and engineer. The top five entries on the corresponding female professions list are nun, feminist, soprano, suffragette, and mezzo-soprano.

"Wikipedia and Westminster: Quality and Dynamics of Wikipedia Pages about UK Politicians"

From the abstract:[8]

"First, we analyze spatio-temporal patterns of readers' and editors' engagement with MPs' Wikipedia pages, finding huge peaks of attention during election times, related to signs of engagement on other social media (e.g. Twitter). Second, we quantify editors' polarisation and find that most editors specialize in a specific party and choose specific news outlets as references. Finally we observe that the average citation quality is pretty high, with statements on 'Early life and career' missing citations most often (18%)."


  1. ^ Lorini, Valerio; Rando, Javier; Saez-Trumper, Diego; Castillo, Carlos (2020). "Uneven Coverage of Natural Disasters in Wikipedia: The Case of Floods" (PDF). In Amanda Hughes; Fiona McNeill; Christopher W. Zobel (eds.). ISCRAM 2020 Conference Proceedings – 17th International Conference on Information Systems for Crisis Response and Management. Blacksburg, VA (USA): Virginia Tech. pp. 688–703. Code on GitHub
  2. ^ Sharevski, Filipo; Jachim, Peter (2020-06-24). "WikipediaBot: Automated Adversarial Manipulation of Wikipedia Articles". arXiv:2006.13990 [cs].
  3. ^ Fendji, Jean Louis; Aminatou, Balkissou (2020-06-11). "From web to SMS: A text summarization of Wikipedia pages with character limitation". EAI Endorsed Transactions on Creative Technologies. 7. doi:10.4108/eai.11-6-2020.165277.
  4. ^ Korablinov, Vladislav; Braslavski, Pavel (2020-05-21). "RuBQ: A Russian Dataset for Question Answering over Wikidata". arXiv:2005.10659 [cs]. Dataset on GitHub
  5. ^ Wright, Matthew; Zheng, Xiaojun (2020-06-30). "Topological Data Analysis on Simple English Wikipedia Articles". arXiv:2007.00063 [math].
  6. ^ Arnaout, Hiba; Razniewski, Simon; Weikum, Gerhard (2020). "Enriching Knowledge Bases with Interesting Negative Statements". Automated Knowledge Base Construction (AKBC) 2020.
  7. ^ Fisher, Joseph; Palfrey, Dave; Christodoulopoulos, Christos; Mittal, Arpit (2020-05-07). "Measuring Social Bias in Knowledge Graph Embeddings". arXiv:1912.02761 [cs]. (also published on Amazon.Science)
  8. ^ Agarwal, Pushkal; Redi, Miriam; Sastry, Nishanth; Wood, Edward; Blick, Andrew (2020-06-23). "Wikipedia and Westminster: Quality and Dynamics of Wikipedia Pages about UK Politicians". arXiv:2006.13400 [cs].


Discuss this story

"social bias" on Wikidata

Why do you even bother with mentioning of this kind of extremely low-quality research? And undue promotion of sloppy researchers with weird results and no meaningful interpretation thereof. Staszek Lem (talk) 23:50, 27 September 2020 (UTC)

(For reference: Staszek Lem is talking about the paper by four researchers of Amazon Alexa, from Amazon's Cambridge Research lab.)
Out of curiosity, what is your opinion ("extremely low-quality", "sloppy") based on? Regards, HaeB (talk) 05:52, 4 October 2020 (UTC)
If decent researchers receive a weird result ("male list is lead by baritone"), they do something about this. As for the observation "women more likely to be homekeepers", I guess I have to add this example into my article "British scientists". Staszek Lem (talk) 16:32, 4 October 2020 (UTC)
Have you actually read the paper?
I'm not sure how this particular result is "weird" in the sense of likely to be wrong. Keep in mind that they are not saying that baritone was the most frequent profession among male Wikidata subjects; rather (as I indicated in the excerpt) that it was the profession ranked highest in a metric indicating that Wikidata subjects with this profession are more likely to be male than female. (The full description of this list in the paper is "Top 20 male professions in Wikidata relative to female using TransE embeddings".)
A more worthwhile question might be what Wikidata should and shouldn't do about these "harmful social biases related to professions" (to quote the paper's abstract). In the case of suffragettes (#4 on the top 20 female professions list), it's hard to imagine many benefits of adding male profession members until this disparity on Wikidata is resolved.
Regards, HaeB (talk) 06:12, 7 October 2020 (UTC)

Unless you are qualified to act as a peer reviewer and willing to place your reputation behind the article(s), it's a very bad idea to promote preprints that have not yet been peer reviewed. It's bad science and bad journalism. ElKevbo (talk) 00:28, 28 September 2020 (UTC)

This is really not correct. Preprints aren't as reliable as peer reviewed publications (but those aren't guaranteed either!), but they're distributed for people to read and review, and give feedback. Indeed, if there's a flaw in the work that might be obvious to editors here, it's good that they see it and contact the authors so it can be corrected before final publication; WilyD 09:19, 28 September 2020 (UTC)
This Wikipedia, not a peer review service. If they were decent researchers, they would have asked for feedback themselves. Signpost undeservedly puts them into the limelight just because they do some vivisection of our projects, and not for the merits. Staszek Lem (talk) 16:28, 28 September 2020 (UTC)
This is incorrect. Publishing preprints is an implicit invitation for feedback from interested parties. If we're reporting and discussing them, then we're interested parties. Not Peer reviewers (probably, perhaps someone is plausible referee, but I doubt I'd be asked to review a paper in any journal they're likely to submit to). Signpost is doing it's readers a service by letting us know what research is done on Wikipedia. Perhaps there's some room to improve how it's presented here to make it clearer it's a preprint, but the idea that the Signpost shouldn't be letting its readers what's going on with research on Wikipedia is totally indefensible. WilyD 09:24, 29 September 2020 (UTC)
The articles aren't being presented in the context of "these are unreviewed articles and readers should offer comments and suggestions to the authors at <mechanism to provide feedback and commentary>." They're being presented as authoritative research articles. Using this venue as a means to solicit feedback is a very interesting idea and one worth exploring but that is not what is currently being done. ElKevbo (talk) 16:55, 28 September 2020 (UTC)
No, nor should they be, nor did I suggest that. They're being presented as preprints describing research that's being done on Wikipedia. Which is of interest to Wikipedians. If you (or anyone) thinks they could use some feedback you (they) should offer it. That's one of the purposes of preprints. It's also of interest to people here to know what's being researched, by whom, and what they're finding. These are also purposes of preprints, and of plausible interest to people who read the Signpost. WilyD 09:27, 29 September 2020 (UTC)
It's irresponsible at best and dishonest at worst to present reviewed, published research alongside unreviewed, unpublished research as if they're equivalent. ElKevbo (talk) 14:22, 29 September 2020 (UTC)

I don't think that it was stupid to write a Signpost article about this stuff (and certainly haven't read the paper myself), since it's nice to know what kinds of things are getting published about Wikipedia (for good or for ill). It does seem like kind of a facile observation that an English-language resource has more information on things that happened in English-speaking countries if they haven't controlled with foreign language resources − and if they didn't, I might be forced to echo the above comment about ursine defecation. That paper by the Alexa researchers seems like pure academic clickbait, though (and what a group of people to be making statements about "social responsibility", LOL!) jp×g 13:13, 15 October 2020 (UTC)


"we contacted the Wikipedia security team with the details and the inner workings of WikipediaBot prior to writing this publication as part of the responsible disclosure requirement" - What (within the parameters of what can be safely revealed in public) was the security team's response? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 17:29, 20 October 2020 (UTC)

