The Signpost

Recent research

CSCW '14 retrospective; the impact of SOPA on deletionism; like-minded editors clustered; Wikipedia stylistic norms as a model for academic writing

By David Ludwig, Morten Warncke-Wang, Maximilian Klein, Piotr Konieczny, Giovanni Luca Ciampaglia, Dario Taraborelli and Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

CSCW '14 retrospective

The 17th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW '14) took place this month in Baltimore, Maryland.[supp 1] The conference brought together more than 500 researchers and practitioners from industry and academia presenting research on "the design and use of technologies that affect groups, organizations, communities, and networks." Research on Wikipedia and wiki-based collaboration has been a major focus of CSCW in the past. This year, three papers on Wikipedia were presented:

Unique editors per quarter in conventional and alternative WikiProjects, 2002-2012
Edits per quarter in conventional and alternative WikiProjects, 2002-2012
Slides from Editing beyond articles[1]

Clustering Wikipedia editors by their biases

review by User:Maximilianklein

Building on two streams of earlier work, rating editors by content persistence and algorithmically finding cliques of editors, Nakamura, Suzuki and Ishikawa propose[4] a sophisticated tweak to find like- and disparate-minded editors, and test it against the Japanese Wikipedia. The method works by finding cliques in a weighted graph connecting all editors of an article, with each edge weighted by the agreement or disagreement between the two editors. To measure the agreement between two editors, the authors iterate through the full edit history and apply the content persistence axioms: an edit that leaves text unchanged is interpreted as agreement, and an edit that deletes text as disagreement. Because leaving text unchanged is not always a strong indication of agreement, they normalize each action by its frequency for both the source editor and the target editor. That is, the method accounts for the propensity of an editor to change text, and the propensity of editors to have their text changed.
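The weighting idea described above can be illustrated with a minimal sketch. This is not the authors' implementation. The input format, the function names `agreement_weights` and `like_minded_groups`, the square-root normalization, and the use of connected components over positive edges (instead of the clique search used in the paper) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def agreement_weights(interactions):
    """Build signed edge weights between editors.

    `interactions` is a hypothetical input format: an iterable of
    (source_editor, target_editor, action) tuples, where action is
    "keep" (source left target's text unchanged) or "delete".
    """
    pair = Counter()       # (source, target, action) -> count
    by_source = Counter()  # (source, action) -> how often source performs action
    by_target = Counter()  # (target, action) -> how often target receives action
    for source, target, action in interactions:
        if source == target:
            continue  # self-interactions carry no agreement signal
        pair[(source, target, action)] += 1
        by_source[(source, action)] += 1
        by_target[(target, action)] += 1

    sign = {"keep": 1.0, "delete": -1.0}
    weights = defaultdict(float)
    for (source, target, action), n in pair.items():
        # Assumed normalization: discount actions that are habitual for
        # the source editor or commonly received by the target editor.
        norm = sqrt(by_source[(source, action)] * by_target[(target, action)])
        weights[frozenset((source, target))] += sign[action] * n / norm
    return dict(weights)


def like_minded_groups(weights):
    """Group editors linked by positive weights (a simplification of
    clique finding: connected components via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge, weight in weights.items():
        a, b = tuple(edge)
        root_a, root_b = find(a), find(b)
        if weight > 0 and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for editor in list(parent):
        groups[find(editor)].add(editor)
    return sorted(sorted(group) for group in groups.values())


# Toy edit history: A and B keep each other's text, C keeps D's text,
# and the two pairs delete text across the divide.
history = [
    ("A", "B", "keep"), ("B", "A", "keep"), ("C", "D", "keep"),
    ("A", "C", "delete"), ("D", "B", "delete"),
]
groups = like_minded_groups(agreement_weights(history))
print(groups)
```

In this toy history the positive edges join A with B and C with D, while the deletions across the two pairs produce negative weights that keep the groups apart.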

To verify the method, the authors compare its results with a simplified weighting scheme, random clustering, and human-clustered results on seven articles in the Japanese Wikipedia. In six out of seven articles, the proposed technique outperforms the simplified weighting. One example they present is the detection of pro- and anti-nuclear editors on the Nuclear Power Plant article. A possible application of such detection would be a gadget that colours the text of an article according to which editor group wrote it.

Monthly research showcase launched

The lifetime of deleted articles by year of creation

The Wikimedia Foundation's Research & Data team announced its first public showcase, a monthly review of work conducted by researchers at the Foundation. Aaron Halfaker presented a study of trends in newcomer article creation across 10 languages with a focus on the English and German Wikipedias (slides). The study indicates that in wikis where anonymous users can create articles, their articles are less likely to be deleted than articles created by newly registered editors. Oliver Keyes presented an analysis of how readers access Wikipedia on mobile devices and reviewed methods to identify the typical duration of a mobile browsing session (slides). The showcase is hosted at the Wikimedia Foundation every third Wednesday of the month and live streamed on YouTube.

Study of AfD debates: Did the SOPA protests mellow deletionists?

Wikipedia's SOPA blackout

A paper titled "What influences online deliberation? A wikipedia [sic] study"[5] studies rationales used by participants in deletion discussions, in the larger context of democratic online deliberation. The authors reviewed in detail deletion discussions for a total of 229 articles, listed for deletion on three dates, one of them being January 15th, 2012, three days before the English Wikipedia's global blackout as part of the Wikipedia:SOPA initiative. The authors looked into whether this event would influence the rationales of the deletion discussions and their outcomes. They also reviewed, in less detail, a number of other deletions from around the time of the SOPA protest. The authors display a good knowledge of relevant literature, including that in the field of Wikipedia studies, presenting an informative literature review section.

The authors find that the overall quality of the discussions is high, as most of the participants display knowledge of Wikipedia's policies, particularly on the notability and credibility (or what we would more likely refer to as reliability) of the articles whose deletion is considered. Among the rationales, notability far outweighs the second most frequent one, credibility (reliability). They confirm that the deletion system works as intended, with decisions following the majority of voters.

Interestingly, the authors find that certain topics did tend to trigger more deletion outcomes, said topics being articles about people, for-profit organizations, and definitions. In turn, they observe that "locations or events are more likely to be kept than expected, and articles about nonprofit organizations and media are more likely to be suggested for other options (e.g., merge, redirect, etc.) than expected". Discussions about people and for-profit organizations were more likely to be unanimous than expected, whereas articles about nonprofit organizations, certain locations, or events were more likely to lead to a non-unanimous discussion. Regarding the SOPA protests' influence on deletion debates, the authors find a small and short-lived increase in keep decisions following the period of community mobilization and discussion about the issue, and tentatively attribute this to editors being impacted by the idea of Internet freedom and consequently allowing free(er) Internet publishing.

The authors sum up those observations, noting that "the community members of Wikipedia have clear standards for judging the acceptability of a biography or commercial organization article; and such standards are missing or less clear when it comes to the topics on location, event, or nonprofit organization ... Thus, one suggestion to the Wikipedia community is to make the criteria of judging these topics more clear or specific with examples, so it will alleviate the ambiguity of the situation". This reviewer, as a participant in a not insignificant number of deletion discussions as well as discussions about the associated policies, agrees with that statement. With regard to the wider scheme, the authors conclude that the AfD process is an example of "a democratic deliberation process interested in maintaining information quality in Wikipedia".

Word frequency analysis identifies "four conceptualisations of femininity on Wikipedia"

Girl with Cherries by Ambrogio de Predis (the current lead illustration of the article femininity)

In a linguistics student paper[6] at Lund University, the author reviews the linguistic conceptualisation of femininity on (English) Wikipedia, with regard to whether the language used to refer to women differs depending on the type of article it is used in. Specifically, the author analyzed the use of five lexemes (a term which in the context of this study means words): ladylike, girly, girlish, feminine and womanly. The findings confirm that the usage of those terms is non-accidental. The word feminine, the most commonly used of the five studied, correlates primarily with the topics of fashion, sexuality, and to a lesser extent, culture, society and female historical biographies. The second most popular is the word womanly, which in turn correlates with topics of female artists, religion and history. Girlish, the fourth most popular word, correlates most strongly with the biographies of males, as well as with the articles on movies and TV, female entertainers, literature and music. Finally, girly and ladylike, respectively third and fifth in terms of popularity, cluster together and correlate with topics such as movies and TV (animated), Japanese culture, art, tobacco and female athletes. Later, the author also suggests that there is a not insignificant overlap in usage between the cluster for girlish and the combined cluster for girly and ladylike. He concludes that there are three or four different conceptualisations of femininity on Wikipedia, which in simpler terms means, to quote the author, that "people do indeed represent women in different ways when talking about different things [on Wikipedia]", with "girly and girlish having a somewhat frivolous undertone and womanly, feminine and ladylike being of a more serious and reserved nature".

The study does suffer from a few issues: the literature review could be more comprehensive (the paper cites only six works, none of them from the field of Wikipedia studies), and this reviewer did not find sufficient justification for why the author limited himself to the analysis of only 500 occurrences (in total) of the five lexemes studied. A further discussion of how those 500 cases were selected would likely strengthen the paper.

Wikipedia and the development of academic language

Ursula Reutner’s article “Wikipedia und der Wandel der Wissenschaftssprache”[7] discusses Wikipedia's linguistic norms and style as a case study of the development of academic language.

The article is divided into three main sections. After providing some historical context about Wikipedia and the history of encyclopedias (section 1), the article focuses on linguistic norms in Wikipedia and their relation to linguistic norms in academic language (section 2). Reutner identifies five crucial linguistic norms in Wikipedia: (1) non-personal language such as the avoidance of first- and second-person pronouns, (2) neutral language as expressed in the policy of a “neutral point of view”, (3) avoidance of redundancies, (4) avoidance of unnecessarily complex wording, and (5) focus on simple syntax and the use of short independent clauses. Although Reutner mentions many well-known differences between Wikipedia and traditional forms of academic writing (e.g. the dynamic, collaborative, and partly non-academic character of Wikipedia), she stresses that the policies of Wikipedia largely follow traditional norms of academic writing.

The third section focuses on case studies of Wikipedia articles (mostly fr:Euro and it:Euro) and finds a large variety of norm violations that suggest a gap between linguistic norms and actual style in Wikipedia. Reutner's examples of biased, clumsy, and long-winded formulations hardly come as a surprise, as these quality issues are well-known topics in Wikipedia research[supp 2]. However, Reutner's analysis is not limited to quality problems but also addresses further interesting features of Wikipedia articles. For example, she points out that Wikipedia differs from many print encyclopedias in Romance languages such as the Grande Dizionario Enciclopedico (1964) or the Enciclopedia Treccani (2010) through a focus on accessibility, as illustrated by the use of copular sentences at the beginning of articles and the repetition of crucial ideas and terms. Furthermore, Reutner argues that Wikipedia differs from other forms of academic writing through narrative elements and a generous use of space.

Reutner's findings raise general questions regarding the relation between Wikipedia and the development of academic language, and her short conclusion makes three suggestions: First, Wikipedia's policies largely follow traditional norms of academic writing. Second, the digital, collaborative, and partly non-academic character of Wikipedia leads to “emotional and dialogic elements that are surprising in the tradition of encyclopedias” (p.17). Third, the focus on accessibility follows an Anglo-American tradition of academic writing (even in the Italian and French language versions). Although Reutner's conclusions seem well-justified, they leave open the question of whether Wikipedia reflects or even influences the general development of academic language. For example, one may argue that many of Reutner's findings are effects of the partly non-academic character of Wikipedia and therefore not representative of the development of academic language. Other linguistic features are arguably effects of collaborative text production, and it would be interesting to compare Reutner's findings with other collaborative and non-collaborative forms of academic writing. Finally, one may worry that some of Reutner's findings are artifacts of a small and biased sample. For example, Reutner only considers articles (de:Euro, en:Euro, es:Euro, fr:Euro, and it:Euro) that are created by large and diverse author groups but does not discuss more specialized articles that usually only have one or two main authors. However, it is well known that the style and quality of Wikipedia articles depend on variables such as group size and group composition[supp 3] and on diverse collaboration patterns[supp 4]. It would therefore be interesting to discuss Reutner's linguistic findings in the context of a more diverse sample of Wikipedia articles.



  1. ^ a b Morgan, J. T.; Gilbert, M.; McDonald, D. W.; Zachry, M. (2014). "Editing beyond articles: diversity & dynamics of teamwork in open collaborations". Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing - CSCW '14 (PDF). p. 550. doi:10.1145/2531602.2531654. ISBN 9781450325400.
  2. ^ André, P.; Kittur, A.; Dow, S. P. (2014). "Crowd synthesis: extracting categories and clusters from complex data". Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing - CSCW '14 (PDF). p. 989. doi:10.1145/2531602.2531653. ISBN 9781450325400.
  3. ^ Mejova, Y.; Garimella, V. R. K.; Weber, I.; Dougal, M. C. (2014). "Giving is caring: understanding donation behavior through email". Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing - CSCW '14 (PDF). p. 1297. doi:10.1145/2531602.2531611. ISBN 9781450325400.
  4. ^ Nakamura, Akira; Yu Suzuki; Yoshiharu Ishikawa (November 17, 2013). "Clustering Editors of Wikipedia by Editor's Biases". 2013 IEEE/WIC/ACM International Conferences on Web Intelligence (PDF).
  5. ^ Xiao, Lu; Nicole Askin (2014). "What influences online deliberation? A wikipedia study". Journal of the Association for Information Science and Technology. doi:10.1002/asi.23004. ISSN 2330-1643.
  6. ^ Max Bäckström: The conceptualisation of FEMININITY on English Wikipedia
  7. ^ Reutner, Ursula (2013-12-20). "Wikipedia und der Wandel der Wissenschaftssprache". Romanistik in Geschichte und Gegenwart. 19 (2): 231–249.
  8. ^ Forte, A., Andalibi, N., Park, T., and Willever-Farr, H. (2014) Designing Information Savvy Societies: An Introduction to Assessability. In: Proceedings of CHI 2014
  9. ^ Moreira, Carlos Eduardo M.; Viviane P. Moreira (2013-12-09). "Finding Missing Cross-Language Links in Wikipedia". Journal of Information and Data Management. 4 (3): 251. ISSN 2178-7107.
  10. ^ Kummer, Michael (2013). "Spillovers in Networks of User Generated Content – Evidence from 23 Natural Experiments on Wikipedia" (PDF). ZEW Discussion paper no. 13-098.
  11. ^ Konieczny, Piotr (2014-02-01). "Rethinking Wikipedia for the Classroom". Contexts. 13 (1): 80–83. doi:10.1177/1536504214522017. ISSN 1536-5042.
  12. ^ Koistinen, Olavi (2013-11-30). "HS selvitti: Näin luotettava Wikipedia on".
Supplementary references:
  1. ^ CSCW '14 website
  2. ^ e.g. Anderka, M., & Stein, B. (2012, April). A breakdown of quality flaws in Wikipedia. In Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality (pp. 11-18). ACM. (cf. review: "One in four of articles tagged as flawed, most often for verifiability issues")
  3. ^ e.g. Arazy, O., Nov, O., Patterson, R., & Yeo, L. (2011). Information quality in Wikipedia: The effects of group composition and task conflict. Journal of Management Information Systems, 27(4), 71-98.
  4. ^ Liu, J., & Ram, S. (2009, December). Who does what: Collaboration patterns in the wikipedia and their impact on data quality. In 19th Workshop on Information Technologies and Systems (pp. 175-180)
  5. ^ Hecht, Brent; Gergle, Darren (2010). The Tower of Babel Meets Web 2.0: User-Generated Content and Its Applications in a Multilingual Context (PDF). ACM CHI Conference on Human Factors in Computing Systems. pp. 291–300.


The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0