The Signpost

Recent research

Vandalizing Wikipedia as rational behavior

By Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Vandalizing Wikipedia as rational behavior

A paper[1] presented last year at the International Conference on Social Media and Society studies possible rational motivations for Wikipedia vandalism:

"Competing theories in criminology seek to explain the motivations for and causes of crime, ascribing criminal behavior to such factors as lack of impulse control, lack of morals, or to societal failure. Alternatively, rational choice theory proposes that behaviors are the product of rational choices. In order to apply rational choice theory to vandalism, this project seeks to understand vandal decision-making in terms of preferences and constraint"

The author observes that "vandalism-related research has tended to focus on the detection and removal of vandalism, with relatively little attention paid to understanding vandals themselves". This can be readily confirmed by searching the archives of this newsletter for "vandalism"; one exception is a 2018 study that asked students to guess why their classmates vandalize Wikipedia: "Only 4% of students vandalize Wikipedia – motivated by boredom, amusement or ideology (according to their peers)". She notes that

"Although the harm is clear, the benefit to the vandal is less clear. In many cases, the thing being damaged may itself be something the vandal uses or enjoys. Vandalism holds communicative value: perhaps to the vandal themselves, to some audience at whom the vandalism is aimed, and to the general public."

The theoretical framework used to study such rational motivations is "rational choice theory (RCT) as applied in value expectancy theory (VET)". It conceptualizes the expected utility of a choice (such as engaging in an act of vandalism) as the sum, over all possible outcomes, of the product of "the probability of some outcome O [...] and the utility valuation U of that same outcome".
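As a rough illustration of this expected-utility calculation (a sketch that is not taken from the paper), the following Python snippet sums probability-weighted utilities for two choices. The function name, the outcome descriptions, and all probabilities and utility values are invented for illustration only.

```python
from typing import Iterable, Tuple


def expected_utility(outcomes: Iterable[Tuple[float, float]]) -> float:
    """Sum P(O) * U(O) over (probability, utility) pairs for one choice."""
    return sum(p * u for p, u in outcomes)


# Hypothetical numbers, NOT from the paper: each pair is
# (probability of the outcome, utility the person assigns to it).
vandalize = [
    (0.9, -1.0),  # edit is reverted quickly; the effort is wasted
    (0.1, 5.0),   # edit stays visible long enough to reach an audience
]
abstain = [
    (1.0, 0.0),   # nothing happens
]

eu_vandalize = expected_utility(vandalize)
eu_abstain = expected_utility(abstain)

# Under rational choice theory, the choice with the higher
# expected utility is the one a rational actor would pick.
preferred = "vandalize" if eu_vandalize > eu_abstain else "abstain"
print(eu_vandalize, eu_abstain, preferred)
```

With these made-up numbers, vandalizing has a negative expected utility, so abstaining is preferred. Raising the assumed probability or the assumed value of the "reaches an audience" outcome can flip that result, which is the kind of preference-and-constraint trade-off the framework is meant to capture.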

Based on a sample of 141 vandalism edits (from the English Wikipedia), the author proposes an ontology of Wikipedia vandalism, extending classifications used in previous vandalism detection studies (e.g. blanking, misinformation, "image attack", "link spam") with a few new ones: "Attack graffiti" (i.e. "attack an individual or group") and "Community-related Graffiti" (expressing "opposition to community, norms, or policies").

The quantitative part of this mixed methods paper "examine[s] vandalism from four groups: users of a privacy tool Tor Browser, those contributing without an account, those contributing with an account for the first time, and those contributing with an account but having some prior edit history". Tor Browser edits are generally blocked automatically on Wikipedia, so those in the dataset consist of edits that slipped through this mechanism. This raises the question of whether some or many of these editors had to try several times to get around the block, which would set them apart from less dedicated vandals in the other groups.

The observation that contributing under an account requires more effort (i.e. creating that account and logging into it) than contributing as an IP editor motivates the author's first hypothesis: "(H1) users who have created accounts will vandalize less frequently". She finds it confirmed by the examined edit data.

Secondly, the author hypothesizes that "the least identifiable individuals are more likely to produce vandalism that has high-risk repercussions" (H2) because value expectancy theory "suggests that identifiability acts as a constraint on deviant behavior." The author finds this hypothesis partially supported. Among other findings, "Tor-based users are substantially more likely than other groups to engage in large-scale vandalism and least likely to engage in the lowest risk type of vandalism, that which communicates friendly and sociable intent."

In motivating her third hypothesis, the author observes that "the groups under study differ by how they are treated by community policies. Newcomers are targeted for social interventions to welcome, train, and retain them. Wikipedia invites IP-based editors to create accounts as well as welcoming them. However, Tor-based editors generally experience rejection." The resulting hypothesis is "(H3) Members of excluded groups are more likely to strike against the community targeting them," operationalized as a higher rate of vandalism in the "community-related" category (e.g. directly attacking Wikipedia norms or policies).

The paper contains various other interesting observations that might make it worth reading for Wikipedia editors who spend time dealing with vandalism and related community policies. To pick just one example, the author highlights that vandalism can also have positive effects, referring to a 2014 paper.[2] That earlier study combined interviews with editors and a quantitative analysis of a dataset that included edit numbers by editor experience level, page watcher numbers, pageview numbers and other data from the English Wikipedia, finding that "novice contributors’ participation has a direct negative effect on the quality of goods produced [i.e. newbie edits decreased article quality on average], but a positive indirect effect because it acts as a cue for expert contributors to improve the quality of those goods that consumers [i.e. Wikipedia readers] are most interested in." It found "that the positive direct effect of article consumption [i.e. pageviews] on expert editing patterns is fully mediated by novice contributions. Results [...] support the theory that experts are unaware of demand [i.e. experienced editors do not usually check traffic levels of the articles they edit] but they are stimulated to respond to article consumption if consumers signal demand for that particular good through their contributions as novice producers."


Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Language biases in Wikipedia's "information landscapes"

From the abstract and conclusions:[3]

"We test the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted. Controlling the size factor, we investigate this hypothesis for a number of 25 subject areas. [...] at least in the context of the subject areas examined here [Wikipedia's] different language versions differ so much in their treatment of the same subject area that it is necessary to know which area in which language someone is consulting if one wants to know how much the part of the IL [information landscape] he or she is traversing is biased."

"Universal structure" of collective reactions to individual actions found in Twitter, Wikipedia and scientific citations

From the abstract:[4]

"In a social system individual actions have the potential to trigger spontaneous collective reactions. [...] We measure the relationship between activity and response with the distribution of efficiency [...]. Generalizing previous results, we show that the efficiency distribution presents a universal structure in three systems of different nature: Twitter, Wikipedia and the scientific citations network."

"Novel Version of PageRank, CheiRank and 2DRank for Wikipedia in Multilingual Network Using Social Impact"

From the abstract:[5]

"... we propose a new model for the PageRank, CheiRank and 2DRank algorithm based on the use of clickstream and pageviews data in the google matrix construction. We used data from Wikipedia and analysed links between over 20 million articles from 11 language editions. We extracted over 1.4 billion source-destination pairs of articles from SQL dumps and more than 700 million pairs from XML dumps. [...] Based on real data, we discussed the difference between standard PageRank, Cheirank, 2DRank and measures obtained based on our approach in separate languages and multilingual network of Wikipedia."

(see also earlier coverage of related research that applied such ranking metrics to graphs of Wikipedia articles)

"Modeling Popularity and Reliability of Sources in Multilingual Wikipedia"

From the accompanying blog post:[6]

"In this paper authors analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. In the research authors presented 10 models for the assessment of the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. [....] For example, among the most popular scientific journals in references of English Wikipedia are: Nature, Astronomy and Astrophysics, Science, The Astrophysical Journal, Lloyd’s List, Monthly Notices of the Royal Astronomical Society, The Astronomical Journal and others."


  1. ^ Champion, Kaylea (2020-07-22). "Characterizing Online Vandalism: A Rational Choice Perspective". International Conference on Social Media and Society. SMSociety'20. New York, NY, USA: Association for Computing Machinery. pp. 47–57. doi:10.1145/3400806.3400813. ISBN 9781450376884. (blog post)
  2. ^ Gorbatai, Andreea D. (2014). "The Paradox of Novice Contributions to Collective Production: Evidence from Wikipedia". SSRN 1949327.
  3. ^ Mehler, Alexander; Hemati, Wahed; Welke, Pascal; Konca, Maxim; Uslu, Tolga (2020). "Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks". Frontiers in Education. 5. doi:10.3389/feduc.2020.562670. ISSN 2504-284X.
  4. ^ Martin-Gutierrez, Samuel; Losada, Juan C.; Benito, Rosa M. (2020-07-22). "Impact of individual actions on the collective response of social systems". Scientific Reports. 10 (1): 12126. Bibcode:2020NatSR..1012126M. doi:10.1038/s41598-020-69005-y. ISSN 2045-2322. PMC 7376036. PMID 32699262. S2CID 220682026.
  5. ^ Coquidé, Célestin; Lewoniewski, Włodzimierz (2020). Abramowicz, Witold; Klein, Gary (eds.). "Novel Version of PageRank, CheiRank and 2DRank for Wikipedia in Multilingual Network Using Social Impact". Business Information Systems. Lecture Notes in Business Information Processing. 389. Cham: Springer International Publishing: 319–334. arXiv:2003.04258. doi:10.1007/978-3-030-53337-3_24. ISBN 978-3-030-53337-3. S2CID 212649841.
  6. ^ Lewoniewski, Włodzimierz; Węcel, Krzysztof; Abramowicz, Witold (May 2020). "Modeling Popularity and Reliability of Sources in Multilingual Wikipedia". Information. 11 (5): 263. doi:10.3390/info11050263. See also blog post


