The Signpost

Recent research

Supporting interlanguage collaboration; detecting reverts; Wikipedia's discourse, semantic and leadership networks, and Google's Knowledge Graph

By Jodi.a.schneider, Piotr Konieczny, Tilman Bayer and Angelika Adam
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, edited jointly with the Wikimedia Research Committee and republished as the Wikimedia Research Newsletter.

Discourse on Wikipedia sometimes irrational and manipulative, but still emancipating, democratic and productive

An article[1] in the sociology journal The Information Society looks at interactions between Wikipedia editors and the project's governance, as visible in the articles on stem cells and transhumanism and in an analysis of Wikipedia's discussion of userboxes, all through the prism of Jürgen Habermas' universal pragmatics and Mikhail Bakhtin's dialogism theories.

The authors focus on the qualitative analysis of language used by editors, to argue that Wikipedia has elements of a democracy, and is an example of a Web 2.0–empowering discourse tool. They stress that some forms of discourse found online (including on Wikipedia) may be highly irrational, a point that earlier arguments for Web 2.0 as a democratic space have often ignored, but they argue that this is in fact less of a hindrance than previously expected. Cimini and Burr remark that discourse can develop between Wikipedians of widely differing points of view, and that some editors will engage in "repeated, strategic, and often highly manipulative attempts" to assert personal authority. Such discussions may be very lively, involving "personal, emotional, or humour-based arguments", yet the authors argue that such comments need not get in the way; instead, "on many occasions, there is thus a clearer exposition of views that is achieved, in spite of, or perhaps because of, these personal [and] sometimes vulgar methods of argumentation."

In the end, the authors are positive about the success of Wikipedia's deliberation in reaching consensus, although they say that it can be "fleeting and transitory" on occasion. Unfortunately, the paper does not touch on Wikipedia policies such as Wikipedia:Civility and Wikipedia:No personal attacks, which would certainly have added to their analysis.

Despite the paper's claim to have received approval for research through a university research ethics committee, the paper does critically discuss the postings of specifically named editors ("[Editor A's] claim to authority and ad hominem attacks were met with derision by [Editor B]"; names replaced by the Signpost), which may raise eyebrows. Not all editors are 100% anonymous, which raises the question of whether the researchers did enough to protect the identity and reputation of the editors they cite. At the very least, why weren't the editors' usernames changed in the quotes? Their direct identification adds nothing to the article, and may expose the users to attack. (Similar questions have been discussed in the past by members of the Wikimedia Foundation Research Committee.)

Different language Wikipedias: automatic detection of inconsistencies

In a paper presented at the 4th International Conference on Intercultural Collaboration (ICIC),[2] Kulkarni et al. offer a simple approach to support the work of Wikipedia editors who maintain articles concerning the same topic in multiple language versions. The long-term goal is to implement a bot that supports these specialized users by highlighting missing attributes and content inconsistencies.

The analysis focused on a pairwise comparison of infoboxes in different languages. First, the attribute-value pairs were extracted from the infoboxes and translated into English via Google Translate. Matching attribute names were identified through direct text comparison, supplemented by a set of synonyms obtained from WordNet (this step was included to handle mismatches caused by translation errors and variations). In a second step (the matching of attribute values), the authors again used direct text comparison, and checked whether the values could be identified as homophones, to exclude mismatches caused by spelling mistakes in the text.
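
The attribute-matching step can be illustrated with a minimal sketch. It assumes the infoboxes have already been extracted and translated into English; the `SYNONYMS` table is a hand-written stand-in for WordNet lookups, the sample values are invented, and all function names are hypothetical rather than taken from the paper. The homophone check is left out.

```python
# Hypothetical stand-in for WordNet synonym sets (the paper used WordNet itself).
SYNONYMS = {
    "population": {"inhabitants", "residents"},
    "founded": {"established", "formed"},
}


def normalize(text):
    """Lower-case a string and collapse underscores and whitespace to single spaces."""
    return " ".join(text.lower().replace("_", " ").split())


def names_match(a, b):
    """True if two attribute names are equal or listed as synonyms of each other."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return True
    return b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set())


def compare_infoboxes(box_a, box_b):
    """Pairwise comparison of two infoboxes given as {attribute: value} dicts.

    Returns (missing, inconsistent): attributes of box_a with no counterpart
    in box_b, and matched attributes whose values differ as plain text.
    """
    missing, inconsistent = [], []
    for name_a, value_a in box_a.items():
        partner = next((n for n in box_b if names_match(name_a, n)), None)
        if partner is None:
            missing.append(name_a)
        elif normalize(value_a) != normalize(box_b[partner]):
            inconsistent.append((name_a, value_a, box_b[partner]))
    return missing, inconsistent


# Invented sample data, for illustration only.
english = {"Population": "1,200,000", "Founded": "1590", "Mayor": "A. Example"}
translated = {"inhabitants": "1,150,000", "established": "1590"}
missing, inconsistent = compare_infoboxes(english, translated)
```

Here `missing` flags the attribute that has no counterpart in the other language version, and `inconsistent` flags the population figures that disagree; a bot of the kind the authors envisage could highlight both to editors.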

The data set used to evaluate these analyses and the whole pipeline included articles from the English, German, Chinese and Hindi Wikipedias concerning two restricted domains: Indian cities and US-based companies. The evaluation revealed "a significant increase in recall after the concepts of homophones and synonyms were applied in addition to the direct text comparison." But the overall result was very weak, mainly due to translation errors. The authors noticed syntactic and semantic differences between the infoboxes, such as paraphrasing or different fact representations. "Also, abbreviations, unit conversion and geographic location matching [was not handled by their system]." The researchers plan to improve the system by addressing all of these issues in turn.

Finding deeper meanings from the words used in Wikipedia articles

An undergraduate computer science honors thesis at Trinity University (Texas) constructs a semantic graph from 451 articles, linked to from the World War II article.[3] Ryan Tanner's goal is to produce a visualization "which allows one to quickly find and examine connections between the people, places and things described in Wikipedia". The process is as follows:

  1. Import SQL dump from the Wikimedia Foundation into a local database
  2. Strip wiki markup from the articles using Bliki
  3. Parse articles with the Stanford NLP, using dependency grammars to extract facts and simplify sentences
  4. Parse the output from the Stanford library using Scala
    1. Read a Stanford XML file into a collection of models.
    2. Produce abstractions for named entities and locations.
    3. Input models into the algorithm developed for this thesis (see Chapter 7)
  5. Store results in a database.
  6. Traverse the resulting graph and produce user-presentable output.
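
The last two steps, storing extracted facts and traversing the resulting graph, can be sketched as follows. This is an illustration in Python rather than the Scala used in the thesis, and it is not the thesis' algorithm: the (subject, relation, object) triple format, the function names and the sample facts are all assumptions made for the example.

```python
from collections import defaultdict, deque


def build_graph(facts):
    """Build an undirected labelled graph from (subject, relation, object) triples."""
    graph = defaultdict(list)
    for subj, rel, obj in facts:
        graph[subj].append((rel, obj))
        graph[obj].append((rel, subj))
    return graph


def find_connection(graph, start, goal):
    """Breadth-first search; returns the shortest chain of entities, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _rel, neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Illustrative triples only; not output from the thesis.
facts = [
    ("Winston Churchill", "led", "United Kingdom"),
    ("United Kingdom", "fought in", "Battle of Britain"),
    ("Luftwaffe", "fought in", "Battle of Britain"),
]
graph = build_graph(facts)
path = find_connection(graph, "Winston Churchill", "Luftwaffe")
```

The returned `path` lists the entities that link the two endpoints, which is the kind of connection the visualization is meant to surface.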

Originally the goal was to visualize the whole of Wikipedia; however, due to problems with the dump, only 250,000 articles out of about 1.5 million were imported. An even smaller subset was ultimately usable, since the Stanford NLP library crashed on many of the remaining articles due to markup issues and the need for manual cleanup. To ensure a dense graph, tests were focused on the network of the World War II article. Some brief examples of the resulting graph are given in Chapter 10, which notes false positives as one problem requiring further investigation. The author makes suggestions for future research, such as using the Simple English Wikipedia or more complex relations.

How leaders emerge in the Wikipedia community

A paper titled "Leading the Collective: Social Capital and the Development of Leaders in Core-Periphery Organizations"[4] looks at how leaders emerge in Wikipedia and similar crowd-based organizations. While often seen as egalitarian and with little hierarchy, such projects always have a group of leaders who have emerged from the community (the "crowd"), involved in planning, mediation, and policy development. The authors treat Wikipedia and similar organizations as instances of the core–periphery network model developed by Steve Borgatti: a system with a deeply interconnected center and a poorly connected periphery. In Wikipedia, the leaders ("core") comprise the most active contributors, and the authors assume they produce the most social capital. Using social network analysis, the paper looks at the interpersonal ties between the editors, focusing on the ties between leaders and periphery. The hypothesis is that specific types of ties will have a greater influence on advancement to leadership.

The authors collected data from RfA pages, and the ties were measured through user-talk-page interactions. Leaders were defined as admins, and periphery editors as non-administrators; this operationalization may raise some doubts about the validity of the results, since some very active and prominent members of the community are not admins, something the authors do not address. The authors find that the most important ties are the early ones to the periphery, and later, ties to the leaders. Overall, strong ties are not as important as weak ties, although Simmelian ties (between pairs of leader groups) are among the most important.
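
To make the operationalization concrete, the following sketch counts one editor's ties from user-talk-page posts, split by the partner's role (admin or not) and by tie strength. It is only an illustration: the `strong_threshold` cut-off, the function name and the usernames are hypothetical, and the paper's own definitions of weak, strong and Simmelian ties may well differ.

```python
from collections import Counter


def count_ties(messages, admins, editor, strong_threshold=3):
    """Classify an editor's user-talk ties by partner role and strength.

    messages: iterable of (sender, recipient) user-talk-page posts.
    admins: set of usernames treated as leaders (admins, as in the paper).
    strong_threshold: hypothetical cut-off for calling a tie "strong".
    """
    exchanges = Counter()
    for sender, recipient in messages:
        if editor in (sender, recipient) and sender != recipient:
            partner = recipient if sender == editor else sender
            exchanges[partner] += 1

    ties = Counter()
    for partner, n in exchanges.items():
        role = "leader" if partner in admins else "periphery"
        strength = "strong" if n >= strong_threshold else "weak"
        ties[(strength, role)] += 1
    return ties


# Invented usernames and posts, for illustration only.
messages = [
    ("Newbie", "AdminA"), ("AdminA", "Newbie"), ("Newbie", "AdminA"),
    ("Newbie", "EditorB"), ("EditorC", "Newbie"), ("EditorB", "EditorC"),
]
ties = count_ties(messages, admins={"AdminA"}, editor="Newbie")
```

Counts like these, computed for different periods before a request for adminship, are the kind of variable such an analysis could relate to advancement.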

Collier and Kraut conclude that leaders in projects such as Wikipedia do not suddenly appear; instead, they evolve over time through their immersion in the project's social network. Early in their experience, those leaders gain a deeper understanding of the community, developing a network of contacts through their weak ties to the periphery; later, their most important ties are to the leaders, particularly in the form of a strong connection to a leader group.

Identifying software needs from Wikipedia translation discussions

A paper[5] presented at an international conference on intercultural collaboration aims "to identify the type of community interaction needed for successfully creating or amending an article via Wikipedia translation activities", and proposes new software tools to facilitate these interactions. To this end, the researchers from Kyoto University analyzed 1694 talk-page comments from three Wikipedias, belonging to articles in categories marking (partial or complete) translations (e.g. fr:Catégorie:Projet:Traduction/Articles_liés): 228 articles from the Finnish, 93 from the French, and 94 from the Japanese Wikipedia. They attempted to categorize (code) each comment along three dimensions: the "activity" it referred to (either editing the article or translating it), the "context" it concerned (using the categories "content", "layout", "sources", "naming", "significance" and "wording"), and the action intended (requesting or providing help, requesting an edit, announcing an edit that the user had made, criticizing the article without a direct request for action, coordinating actions between users, or referring to an established Wikipedia policy).
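
A coding scheme of this kind maps naturally onto a small data structure. The sketch below is an assumption about how such coded comments might be represented and tallied, not the researchers' tooling; the class and function names are hypothetical, the sample comments are invented, and the third dimension (the intended action) is omitted for brevity.

```python
from collections import Counter
from dataclasses import dataclass

ACTIVITIES = {"editing", "translating"}
CONTEXTS = {"content", "layout", "sources", "naming", "significance", "wording"}


@dataclass(frozen=True)
class CodedComment:
    wiki: str      # language code, e.g. "fi", "fr" or "ja"
    activity: str  # one of ACTIVITIES
    context: str   # one of CONTEXTS

    def __post_init__(self):
        if self.activity not in ACTIVITIES or self.context not in CONTEXTS:
            raise ValueError("unknown code")


def context_shares(comments, wiki, activity):
    """Percentage of each context among one wiki's comments on one activity."""
    selected = [c.context for c in comments
                if c.wiki == wiki and c.activity == activity]
    counts = Counter(selected)
    return {ctx: 100.0 * n / len(selected) for ctx, n in counts.items()}


# Invented sample, for illustration only.
sample = [
    CodedComment("ja", "editing", "layout"),
    CodedComment("ja", "editing", "layout"),
    CodedComment("ja", "editing", "content"),
    CodedComment("ja", "editing", "sources"),
    CodedComment("fi", "translating", "naming"),
]
shares = context_shares(sample, "ja", "editing")
```

Per-wiki percentages of this form are what the paper reports when it compares how often each context comes up in the three language versions.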

Regarding comments focused on the activity of editing, the "results were consistent with previous research, with a high frequency of discussion contributions about content and layout". The authors found that "the Japanese Wikipedia was the only one with more discussion contributions about layout than content when the discussion was about editing activities (40.18%)" and speculate that this is because "in the older, or larger, Wikipedias, practices and policies are likely to be better established than in the younger, or smaller, Wikipedias leading to a lower frequency of discussions about layout." (However, they later point out that the Finnish Wikipedia, rather than the Japanese, is the smallest and youngest among the three examined ones, noting that it shows a much higher frequency of discussion about policy—15.0%, versus 6.0% on the French and 3.3% on the Japanese Wikipedia.) In this class of comments, "discussions about citing sources were relatively common in the Finnish and French Wikipedias (18.8% and 12.4%, respectively). In the Japanese Wikipedia, sources were less common with 7.1% of all discussion contributions regarding editing activities."

Most discussions about translation activities were about naming, that is, "resolving the proper form for the title of the article, section or sub-section, names or proper nouns, and transliteration in the corresponding article", contradicting the researchers' initial hypothesis that such discussion would "have a high frequency of contributions regarding translation of specific words and expressions" (their "naming" category "does not include phrasing or resolving proper translation of individual words or expressions"). As one reason, they identify "the diversity in naming practices of events between different language sources, such as mass media. Especially in the Finnish Wikipedia, discussion about sources was common (16.15%). These two topics are loosely related, as direct translations of the names of well-known events are often not acceptable in the target language Wikipedia."

Having identified naming issues and the search for suitable sources in the target language as "key problems" emerging in the translation discussions, the authors conclude that "the current approaches for supporting Wikipedia translation are not necessarily solving the main problems in Wikipedia translation" and proceed to suggest two "directions for designing supporting tools for Wikipedia translation, especially through open source development of MediaWiki extensions":

The paper makes references to previous work on Wikipedia translation (including the authors' own), but does not mention the EU-supported CoSyne project, which aims to integrate tools with MediaWiki that "automate the dynamic multilingual synchronization process of Wikis" and would seem to have a lot of overlap with the kind of tools discussed in the paper.

New algorithm provides better revert detection

A paper[6] by three researchers affiliated with the EU-supported RENDER project (to be presented at next month's "Hypertext 2012" conference) promises "accurate revert detection in Wikipedia". The article starts by describing the detection of reverts as "a foundational step for many (more elaborated) research ideas, [whose] purposeful handling leads to a superior understanding of wiki-like systems of collaboration in general", giving an overview of such research. (Revert detection has also been used in tools for the editing community, such as one that identifies articles on the German Wikipedia that are currently controversial.)

Surveying the "state-of-the-art in revert detection", the authors criticize the prevalent "identity revert detection method" (SIRD), which relies on finding identical revisions using MD5 hashes, arguing that it does not fully match the definition of a revert in the (English) Wikipedia's policies at Wikipedia:Reverting: The SIRD method "does not require the reverting edit to actually undo the actions of an edit identified as reverted ... [Furthermore, it] is not possible to indicate if the reverting edit fully, partly or not at all undid the actions of the reverted edit ... It also does not require the intention of the reverting edit to revert any other edit." (Still, mainly due to requests by researchers, MD5 hashes have recently been integrated directly into the revision table stored by MediaWiki, necessitating considerable technical efforts when updating the existing databases for Wikimedia projects.)
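
The identity-based approach that the authors criticize is simple enough to sketch in a few lines. The version below is a minimal illustration, assuming revisions are available as a chronological list of wikitext strings; the function name and the toy history are made up for the example, and real implementations may differ in details such as how consecutive identical revisions are handled.

```python
import hashlib


def identity_reverts(revisions):
    """Detect identity reverts in a chronological list of revision texts.

    Returns (reverting_index, restored_index, reverted_indices) tuples: a
    revision whose MD5 hash equals that of an earlier, non-adjacent revision
    is treated as a revert of everything in between.
    """
    last_seen = {}  # hash -> index of the most recent revision with that content
    reverts = []
    for i, text in enumerate(revisions):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in last_seen and last_seen[digest] < i - 1:
            j = last_seen[digest]
            reverts.append((i, j, list(range(j + 1, i))))
        last_seen[digest] = i
    return reverts


# Toy history: revision 3 restores the exact text of revision 1.
history = ["A", "A B", "A vandal", "A B", "A B C"]
print(identity_reverts(history))  # prints [(3, 1, [2])]
```

The sketch also shows the weakness described above: every revision between two identical ones is marked as reverted, whether or not the reverting editor meant to undo it.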

The paper then presents the authors' new method for revert detection, which still aims to detect full reverts and to avoid false positives, while coming closer to the Wikipedia community's definition. It is implemented as an algorithm based on splitting the revisions' wikitext into word tokens (and made available online as a Python script). Also, MD5 hashes are still used on a paragraph level to be able to detect unchanged paragraphs easily and speed up computation. The algorithm was then evaluated by a panel of Wikipedians recruited on the English Wikipedia in comparison with the existing SIRD method.
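
The combination of paragraph-level hashing and word tokens can be illustrated as follows. This is not the authors' algorithm, only a sketch of the preprocessing idea: paragraphs whose MD5 hashes occur in both revisions are skipped, and only the remaining text is split into word tokens. The function names, the blank-line paragraph split and the whitespace tokenizer are assumptions, and repeated identical paragraphs are collapsed for simplicity.

```python
import hashlib
from collections import Counter


def paragraph_hashes(text):
    """Map MD5 digest -> paragraph for each blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {hashlib.md5(p.encode("utf-8")).hexdigest(): p for p in paragraphs}


def word_changes(old_text, new_text):
    """Return (added, removed) word-token Counters between two revisions.

    Paragraphs whose hashes appear in both revisions are skipped, so only
    changed paragraphs are tokenized.
    """
    old, new = paragraph_hashes(old_text), paragraph_hashes(new_text)
    old_words = Counter(w for h, p in old.items() if h not in new for w in p.split())
    new_words = Counter(w for h, p in new.items() if h not in old for w in p.split())
    return new_words - old_words, old_words - new_words


old_rev = "Intro text here.\n\nThe cat sat."
new_rev = "Intro text here.\n\nThe dog sat."
added, removed = word_changes(old_rev, new_rev)
```

A later edit that removes exactly the tokens an earlier edit added (and restores those it removed) would then be a candidate full revert, which is closer to the community's notion of undoing an edit than hash equality of whole revisions.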

As summarized by the authors, this user study found the new method to be "more accurate in identifying full reverts as understood by Wikipedia editors. More importantly, our method detects significantly fewer false positives than the SIRD method [27% in the sample, which however was somewhat small]". As a drawback, the authors note "the increased computational cost. As [the new algorithm] is quadratic over the number of words in the DIFFs [the changed text between subsequent revisions], in its current implementation it might not be the tool of choice if larger amounts of articles are to be analyzed; especially in the case of complete history dumps of the large Wikipedias, e.g., English, German or Spanish."


The history of art mapped using Wikipedia (visualization of wikilinks between "art-historical actors" spanning at most 75 years, from Goldfarb et al.)


  1. Cimini, N., & Burr, J. (2012). An Aesthetic for Deliberating Online: Thinking Through “Universal Pragmatics” and “Dialogism” with Reference to Wikipedia. The Information Society, 28(3), 151–160. Routledge. doi:10.1080/01972243.2012.669448 (closed access)
  2. Gurunath Kulkarni, R., Trivedi, G., Suresh, T., Wen, M., Zheng, Z., & Rose, C. (2012). Supporting collaboration in Wikipedia between language communities. Proceedings of the 4th international conference on Intercultural Collaboration – ICIC ’12 (p. 47). New York, New York, USA: ACM Press. doi:10.1145/2160881.2160890 (closed access)
  3. Tanner, R. (2012). Creating a Semantic Graph from Wikipedia. Computer Science Honors Theses. Paper 29. (open access)
  4. Collier, B., & Kraut, R. (2012). Leading the Collective: Social Capital and the Development of Leaders in Core–Periphery Organizations. Physics and Society. (open access)
  5. Gurunath Kulkarni, R., Trivedi, G., Suresh, T., Wen, M., Zheng, Z., & Rose, C. (2012). Supporting collaboration in Wikipedia between language communities. Proceedings of the 4th international conference on Intercultural Collaboration – ICIC ’12 (p. 47). New York, New York, USA: ACM Press. doi:10.1145/2160881.2160890 (closed access)
  6. Fabian Flöck, Denny Vrandecic and Elena Simperl. Reverts Revisited – Accurate Revert Detection in Wikipedia. HT’12, June 25–28, 2012, Milwaukee, Wisconsin, USA. (open access)
  7. Doron Goldfarb, Max Arends, Josef Froschauer, Dieter Merkl. Art History on Wikipedia, a Macroscopic Observation. WebSci 2012, June 22–24, 2012, Evanston, Illinois, USA. (PDF, open access)
  8. Ford, Heather: Update on the Wikipedia sources project. May 17, 2012. (open access)
  9. Vrandečić, D. (2012). Distribution of title lengths in Wikipedias. 10 May 2012. (open access)
  10. Talukdar, P. P., & Cohen, W. W. (2012). Crowdsourced Comprehension: Predicting Prerequisite Structure in Wikipedia. 7th Workshop on Innovative Use of NLP for Building Educational Applications, NAACL 2012. (PDF, open access)
  11. Miller, N. (2012). Characterizing Conflict in Wikipedia. Honors Projects. Paper 25. (open access)
  12. Jurgens, D., & Lu, T.-ching. (2012). Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia. ICWSM '12. (PDF, open access)
  13. Valentin I. Spitkovsky, Angel X. Chang. "A Cross-Lingual Dictionary for English Wikipedia Concepts". Eighth International Conference on Language Resources and Evaluation (LREC 2012). (open access)

Discuss this story


@automatic detection of inconsistencies - active and useful projects

Bulwersator (talk) 16:15, 30 May 2012 (UTC)[reply]

Thanks! See also m:Death anomalies table (as background for the first link), Magnus Manske's Free Image Search Tool (FIST), and the Change Detector from the EU-funded RENDER project, where the reviewer of this paper is working. Regards, Tbayer (WMF) (talk) 20:12, 11 June 2012 (UTC)[reply]
  • Regarding a lot of content I've seen recently, but especially the text "Unfortunately, the paper does not touch on Wikipedia policies such as Wikipedia:Civility and Wikipedia:No personal attacks, which would certainly have added to their analysis. Despite the paper's claim to have received approval for research through a university research ethics committee [...]": You really need to decide whether you are trying to report news or trying to write opinion pieces. As everything in this publication is ostensibly straight news except for those articles labeled as opinion, it is jarring to see editors taking specific sides within articles. Most often it is to defend Wikipedia against whatever they see as an attack. Even if the writers and editors here lack a journalism background, even just experience writing for Wikipedia (and having to follow WP:NPOV and WP:OR) should already have Signpost writers in the habit of not trying to make their own opinions be the story. DreamGuy (talk) 00:22, 1 June 2012 (UTC)[reply]
Reporting of news requires analysis of the material; detecting significant features and identifying inconsistencies is part of reporting, and has long been characteristic of the Signpost, just as it ought to be. DGG ( talk ) 17:27, 1 June 2012 (UTC)[reply]
Signpost articles are not encyclopedia articles, in particular it does not make sense to invoke WP:OR - on the contrary, original reporting is explicitly welcomed, see Wikipedia:Wikipedia Signpost/About. Of course there are conventions for journalistic publications as well, but this section (the Wikimedia Research Newsletter) has since its inception invited not only mere summaries, but also reviews of recently published academic research papers, and conventions for that genre absolutely permit that kind of remark. Regards, Tbayer (WMF) (talk) 20:12, 11 June 2012 (UTC)[reply]
  • Awesome graphic looking like a rainbow on the history of art - I never thought of Houbraken as being on the other end of the spectrum from Picasso. Actually I stared at this for a while, since it puzzled me why Houbraken was in the middle of a starburst, and then I realized that of course the other artist biographers that "bridge the gap" from Vasari are still pretty poorly covered. There is still a lot of work to be done onwiki for Karel van Mander and Joachim von Sandrart, which is why the Germans seem so underrepresented in the early days. Jane (talk) 20:55, 1 June 2012 (UTC)[reply]
  • Love how the "History of Art" graphic looks like a work of abstract art itself, hehe. -- œ 21:06, 3 June 2012 (UTC)[reply]

