The Signpost

Recent research

Most controversial Wikipedia topics, automatic detection of sockpuppets

By Giovanni Luca Ciampaglia, Taha Yasseri, Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"The most controversial topics in Wikipedia: a multilingual and geographical analysis"

Map of conflict in the Spanish Wikipedia. Each dot represents a geolocated article; the size and colour of each dot correspond to the controversy measure of Sumi et al. (2011)[1]. The map is taken from Yasseri et al. (2013)[2].

A comparative study by T. Yasseri, A. Spoerri, M. Graham and J. Kertész on controversial topics in different language versions of Wikipedia has recently been posted on the Social Science Research Network (SSRN) online scholarly archive.[2] The paper will appear as a chapter of the upcoming book "Global Wikipedia: International and cross-cultural issues in online collaboration", edited by P. Fichman and N. Hara and to be published by Scarecrow Press in 2014. It looks at the 100 most controversial topics in 10 language versions of Wikipedia (results including 3 additional languages are reported on the blog of one of the authors) and tries to make sense of the similarities and differences between these lists. Several visualization methods are proposed, based on CrystalView, a Flash-based tool developed by the authors. Controversiality is measured using a scalar metric which takes into account the total volume of pairwise mutual reverts among all contributors to a page. This metric was proposed by Sumi et al. (2011)[1] in a paper reviewed two years ago in this newsletter ("Edit wars and conflict metrics"). Topics related to politics, geographical locations, and religion are reported to be the most controversial across the board. Each language also seems to feature specific, local controversies, which the authors track down further by grouping together languages with similar spheres of influence. Furthermore, the presence of latitude/longitude information (geocoordinates) in several of the Wikipedia articles in the sample let the authors plot the most controversial topics on a world map, showing how each language features both local and global issues among its most heated topics of debate.

In summary, the study shows how valuable information about cross-cultural differences can be extracted from traces of Internet activity. One obvious question is how the demographics of Wikipedia editors affect the representativeness of the results. The authors seem to be aware of this issue, which is likely to become increasingly important as the field of cultural studies looks more and more at data generated by peer production communities.

The research has been widely covered in the media, e.g. by Huffington Post, Live Science, and Zeit Online.
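To make the idea of a revert-based controversy measure more concrete, here is a minimal Python sketch. It is not the exact formula of Sumi et al., which is not reproduced in this review. The sketch assumes a simplified score: find every pair of editors who have reverted each other, weight each pair by the smaller of the two editors' edit counts, and multiply the summed weights by the number of editors involved. The function names and the sample data are hypothetical.

```python
def mutual_revert_pairs(reverts):
    """Return the unordered editor pairs in which each editor reverted the other.

    `reverts` is an iterable of (reverter, reverted) tuples.
    """
    directed = set(reverts)
    return {
        frozenset((a, b))
        for a, b in directed
        if a != b and (b, a) in directed
    }


def controversy_score(reverts, edit_counts):
    """Toy controversy score (an assumption of this sketch, not the published metric).

    Each mutually reverting pair is weighted by the smaller edit count of
    its two editors; the sum is scaled by the number of editors involved.
    """
    pairs = mutual_revert_pairs(reverts)
    editors = set().union(*pairs) if pairs else set()
    weight = sum(min(edit_counts[e] for e in pair) for pair in pairs)
    return len(editors) * weight


# Hypothetical page history: A and B revert each other, as do C and D.
# A also reverts C once, but C never reverts A, so that pair is ignored.
sample_reverts = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("D", "C")]
sample_edit_counts = {"A": 10, "B": 4, "C": 7, "D": 2}
sample_score = controversy_score(sample_reverts, sample_edit_counts)
```

Weighting by the less active editor of each pair is one way to keep a single prolific vandal fighter from dominating the score; a one-sided revert contributes nothing.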

Non-virtual sockpuppets created by participants of RecentChangesCamp, as a humorous take on the sockpuppet phenomenon in online communities

Sockpuppet evidence from automated writing style analysis

"A Case Study of Sockpuppet Detection in Wikipedia"[3], presented at a "Workshop on Language in Social Media" this month, describes an automated method to analyze the writing style of users for the purpose of detecting or confirming sockpuppets. The abuse of multiple accounts (also known as "multi-aliasing" or sybil attacks in other contexts) is described as "a prevalent problem in Wikipedia, there were close to 2,700 unique suspected cases reported in 2012."

The authors' approach is based on existing authorship attribution research (cf. stylometry, writeprint). In a very brief overview of such research, the authors note that data from real-life cases is usually hard to come by, so most papers test attribution methods on text that was collected for different purposes and comes from authors who were not deliberately trying to evade detection. On Wikipedia, by contrast, "there is a real need to identify if the comments submitted by what appear to be different users belong to a sockpuppeteer".

Using the open-source machine learning tool Weka, the authors developed an algorithm that analyzes users' talk page comments by "239 features that capture stylistic, grammatical, and formatting preferences of the authors" - e.g. sentence lengths, or the frequency of happy emoticons (i.e. ":)" and ":-)"). Apart from features whose use is established in the literature, they add some of their own, e.g. counting errors in the usage of "a" and "an".
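As an illustration of the kind of features mentioned above, the following Python sketch computes three of them for a single comment: average sentence length in words, the number of happy emoticons, and a count of "a"/"an" errors. It is not the authors' implementation (they used Weka and 239 features). The tokenization, the vowel-letter heuristic for "a"/"an", and all names are assumptions made for this example.

```python
import re

HAPPY_EMOTICONS = (":)", ":-)")
VOWELS = "aeiou"


def style_features(comment):
    """Extract a few toy stylometric features from one talk page comment."""
    words = re.findall(r"[A-Za-z']+", comment)
    # Naive sentence split: whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", comment.strip()) if s]
    avg_sentence_words = len(words) / len(sentences) if sentences else 0.0

    happy = sum(comment.count(e) for e in HAPPY_EMOTICONS)

    # Crude a/an check based on the first letter of the next word.
    # It ignores pronunciation ("an hour", "a university"), so it is
    # only a rough stand-in for a real grammar check.
    article_errors = 0
    for article, following in zip(words, words[1:]):
        starts_with_vowel = following[0].lower() in VOWELS
        if article.lower() == "a" and starts_with_vowel:
            article_errors += 1
        elif article.lower() == "an" and not starts_with_vowel:
            article_errors += 1

    return {
        "avg_sentence_words": avg_sentence_words,
        "happy_emoticons": happy,
        "article_errors": article_errors,
    }


# Hypothetical comment with two emoticons and two a/an slips.
sample_comment = (
    "I fixed a error in the lead. "
    "Thanks for the review :) It was an good catch :-)"
)
sample_features = style_features(sample_comment)
```

A feature vector like this would be computed per comment and then passed to a classifier trained on the main account's comments.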

The paper examines 77 real-life sockpuppet cases from the English Wikipedia - 41 where the suspected use of sockpuppets was confirmed by "the administrator’s verdict" (presumably most of them based on Checkuser evidence), and 36 where it was rejected. For each case, the algorithm was first trained on talk page comments by the suspected sockpuppeteer (main account), and then tested on comments by the suspected sockpuppet (alternate account). On average, fewer than 100 talk page messages per case were used to train or test the algorithm.

The system achieved an accuracy of 68.83% in the tested cases (for comparison, simply always confirming the suspected sockpuppet abuse would have achieved 53.24% accuracy on the same test cases). After adding features based on the user's edit frequency by time of day and day of the week, it achieved 84.04% accuracy when tested on a smaller subset of the cases.
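The baseline figure can be checked from the case counts given above: always answering "sockpuppet" is correct for the 41 confirmed cases out of 77. The short Python sketch below does this arithmetic. Note that 41/77 rounds to 53.25%, so the quoted 53.24% appears to be truncated rather than rounded. The observation that 68.83% matches 53 correct cases out of 77 is an inference from the percentages, not a figure stated in the review.

```python
def majority_baseline(confirmed, rejected):
    """Accuracy of always predicting the more frequent verdict."""
    total = confirmed + rejected
    return max(confirmed, rejected) / total


confirmed_cases = 41
rejected_cases = 36

baseline = majority_baseline(confirmed_cases, rejected_cases)
baseline_percent = 100 * baseline

# Inferred, not reported: 53 correct decisions out of 77 cases
# reproduces the 68.83% accuracy quoted for the system.
inferred_correct = 53
system_percent = 100 * inferred_correct / (confirmed_cases + rejected_cases)
```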

The authors remark in the introduction that "relying on IP addresses is not robust, as simple counter measures can fool the check users". In this reviewer's opinion, this probably underestimates the effort needed (for example, DSL or cable users simply resetting their modem to obtain a different dynamic IP most likely will not "fool" Checkusers). Still, a later part of the paper treats rejections of sockpuppet cases as definite proof that the accounts were not sockpuppets. Thus, the authors are possibly ignoring cases where a sockpuppeteer managed to avoid generating Checkuser evidence - in other words, some of the results counted as false positives in this methodology might actually have been correct.

Looking forward, the authors write: "We are aiming to test our system on all the cases filed in the history of the English Wikipedia. Later on, it would be ideal to have a system like this running in the background and pro-actively scanning all active editors in Wikipedia, instead of running in a user triggered mode." If all the resulting similarity scores were made public, it is doubtful that this would remain uncontroversial - many editors (especially on the German Wikipedia) are uncomfortable with the publication of aggregated analysis data about their editing behavior, even if it is based purely on information that is already public; compare the current RfC on Meta about X!'s Edit Counter.

The authors state that "to the best of our knowledge, we are the first to tackle [the problem of real sockpuppet cases in Wikipedia]" with this kind of stylometric analysis. This may only be accurate in an academic context. For example, in a high-profile sockpuppet investigation on the English Wikipedia in 2008, User:Alanyst applied the tf-idf similarity measure to the aggregated edit summaries of all users who had made between 500 and 3500 edits in 2007. (This measure compares the relative word frequencies in two texts.) The analysis confirmed the sockpuppet suspicion against two accounts A and B: Account B came out closest to A, and account A 188th closest to B (among the 11,377 tested accounts). For an overview of this and other methods developed by Wikipedians to evaluate the validity of sockpuppet suspicions, see the slides of this reviewer's talks at Wikimania 2008 and the Chaos Congress 2009.
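The tf-idf comparison described above can be sketched in a few lines of Python. This is a generic illustration and not a reconstruction of User:Alanyst's analysis: the exact weighting and similarity function used there are not given here, so the sketch assumes relative term frequency, a logarithmic inverse document frequency, and cosine similarity. The account names and edit summaries are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each account name to a sparse tf-idf vector of its text.

    `docs` maps an account name to its aggregated edit summaries.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rank_closest(target, vectors):
    """Other accounts sorted from most to least similar to `target`."""
    others = [name for name in vectors if name != target]
    return sorted(
        others, key=lambda name: cosine(vectors[target], vectors[name]), reverse=True
    )


# Hypothetical aggregated edit summaries for four accounts.
summaries = {
    "A": "rv vandalism fix typo rv",
    "B": "rv vandalism typo fix",
    "C": "add reference expand section",
    "D": "add image caption",
}
vectors = tfidf_vectors(summaries)
closest_to_a = rank_closest("A", vectors)
```

Cosine similarity itself is symmetric, but ranks are not: B can be the nearest neighbour of A while many other accounts sit closer to B than A does, which is consistent with the asymmetric ranks (1st and 188th) reported in the 2008 case.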

Adjusting automatic quality flaw predictors by topic areas

Building on their earlier work on the feasibility of automatically assigning maintenance templates to articles (review: "Predicting quality flaws in Wikipedia articles"), three German researchers investigate[4] how an article's topic might inform the detection of text that needs to be tagged for quality problems. In this paper, they focus on maintenance tags for neutrality (e.g. {{Advert}}, {{Weasel}}) or style (e.g. {{Tone}}), cataloguing them into "94 template clusters representing 60 style flaws and 34 neutrality flaws". As an example of a maintenance tag that is restricted to certain topic areas, they cite "the template in-universe... which should only be applied to articles about fiction." Differing standards between different WikiProjects are named as another possible reason for "topic bias" in maintenance tags.

To make their classification algorithm more aware of articles' topics when assigning maintenance templates, the researchers modify their previous approach by populating their "positive" and "negative" training sets with revision pairs from the same articles: the revision in which a (human) Wikipedian first inserted a maintenance tag, and a later revision of the same article in which the tag has been removed (assuming that the corresponding flaw has indeed been eliminated by that time). To evaluate the success of this approach, the authors introduce the notion of a "category frequency vector" assigned to a set of articles (counting, for each category on Wikipedia, how many articles from this category are contained in the set). The cosine of the vectors of two article sets measures how similar their topics are. They find that "topics of articles in the positive training sets are highly similar to the topics of the corresponding reliable negative articles while they show little similarity to the articles in the random set. This implies that the systematic bias introduced by the topical restriction has largely been eradicated by our approach." Sadly, this evaluation method does not seem to have yielded direct information about which quality flaws are prevalent in which topic areas.

Apart from their own software, the researchers used the WikiHadoop software to analyze the entire revision history of the English Wikipedia, and the machine learning tool Weka to classify article text.
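The "category frequency vector" comparison described above lends itself to a small worked example. The Python sketch below builds such a vector for a set of articles and compares two sets by cosine similarity. It follows the description given in this review rather than the authors' code; the article titles, category assignments, and function names are invented for illustration.

```python
import math
from collections import Counter


def category_frequency_vector(articles, categories_of):
    """Count, for each category, how many articles of the set belong to it."""
    vector = Counter()
    for article in articles:
        vector.update(set(categories_of.get(article, ())))
    return vector


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * v.get(cat, 0) for cat, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical category assignments.
categories_of = {
    "Star Wars": ["Fiction", "Films"],
    "Dune": ["Fiction", "Novels"],
    "The Hobbit": ["Fiction", "Novels"],
    "Paris": ["Cities"],
    "Oslo": ["Cities"],
}

positive_set = ["Star Wars", "Dune"]           # tagged revisions
reliable_negative_set = ["Dune", "The Hobbit"]  # tag later removed
random_set = ["Paris", "Oslo"]                 # random untagged articles

positive_vec = category_frequency_vector(positive_set, categories_of)
similar = cosine(
    positive_vec, category_frequency_vector(reliable_negative_set, categories_of)
)
dissimilar = cosine(
    positive_vec, category_frequency_vector(random_set, categories_of)
)
```

In this toy data the positive and reliable negative sets share the fiction categories and score close to 1, while the random set shares none and scores 0, mirroring the qualitative pattern the authors report.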



  1. ^ Sumi, R., Yasseri, T., Rung, A., Kornai, A., and Kertész, J. (2011) Edit wars in Wikipedia. IEEE Third International Conference on Social Computing (SocialCom), 9-11 October 2011, Boston, MA. pp. 724-727. Online preprint: ArXiv [stat.ML].
  2. ^ Yasseri, T., Spoerri, A., Graham, M. and Kertész, J. (2014) The most controversial topics in Wikipedia: A multilingual and geographical analysis. In: P.Fichman and N.Hara (eds) Global Wikipedia: International and cross-cultural issues in online collaboration. Scarecrow Press
  3. ^ Thamar Solorio and Ragib Hasan and Mainul Mizan: A Case Study of Sockpuppet Detection in Wikipedia. Proceedings of the Workshop on Language in Social Media (LASM 2013), pages 59–68, Atlanta, Georgia, June 13 2013. Association for Computational Linguistics PDF
  4. ^ "The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia"
  5. ^ Luis Fernando Heredia: Collective Organization for Public Good Provision: The Case of Wikipedia. University of Victoria, 2012
  6. ^ Ramine Tinati, Leslie Carr, Susan Halford, Catherine Pope: The HTP Model: Understanding the Development of Social Machines. WWW ’13, May 13–17, 2013, Rio De Janeiro, Brazil. PDF
  7. ^ Imene Bensalem, Salim Chikhi, Paolo Rosso: Building Arabic Corpora from Wikisource
  8. ^ Krisztian Balog, Naimdjon Takhirov, Heri Ramampiaro, Kjetil Nørvåg: "Multi-step Classification Approaches to Cumulative Citation Recommendation". OAIR’13, May 22-24, 2013, Lisbon, Portugal. PDF
  9. ^ Benjamin Perez, Cristoforo Feo, Andrew G. West, and Insup Lee: WikiCat -- A graph-based algorithm for categorizing Wikipedia articles. PDF
  10. ^ Phillip Singer, Thomas Niebler, Markus Strohmaier, Andreas Hotho: Computing Semantic Relatedness from Human Navigational Paths on Wikipedia. PDF

Discuss this story

This goes to show the power of reminders: nobody told me the Research Newsletter is going to be in this week Signpost (I keep thinking it's in the first edition for a new month), so I didn't contribute this time. Sigh :( --Piotr Konieczny aka Prokonsul Piotrus| reply here 10:37, 28 June 2013 (UTC)

Folks, as an author of the sockpuppet detection article mentioned in this report, I want to thank you for discussing the article. Thanks also for the critique and the nice suggestions/pointers. If you have any further suggestions, please ping me on my talk page or email me. --Ragib (talk) 08:12, 7 July 2013 (UTC)


The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0