A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Ninety-eight registered participants attended the annual WikiSym+OpenSym conference from August 5-7 at Hong Kong's Cyberport facility. The event preceded the annual global Wikimania conference of the Wikimedia movement in the same city.
WikiSym was started in 2005 as the "International Symposium on Wikis", and its scope has since been broadened to include the study of other forms of "open collaboration" (such as free software development, or open data), reflected in the adoption of the separate "OpenSym" label. The proceedings, published online at the start of the conference, contain 22 full papers (out of 43 submissions), in addition to short papers, posters, abstracts for research-in-progress presentations, etc. The coverage below reflects the scope of this research report, and complements the pre-conference reviews of some papers in the previous issue.
Episode 96 of the "Wikipedia Weekly" podcast contains some coverage of WikiSym 2013 (from around 10:30-20:00), and some images and media from the event can be found on Wikimedia Commons.
Next year's WikiSym+OpenSym conference will be held in Berlin, on August 27-29, 2014, and the call for papers is already out. Conference chair Dirk Riehle announced that the proceedings will continue to be published with the ACM, now under its new open access policy.
Despite policy, only just over half of Wikipedia sources are secondary: "Getting to the Source: Where does Wikipedia Get Its Information From" presents overall statistics on the sources referred to in English Wikipedia articles. The initial seed of source tags was constructed by analysing 30 randomly selected articles; all Wikipedia articles as of May 2012 were then probed to find and classify the references. Some 67 million citations across 3.5 million articles were found. The classification was performed by two human coders on a random sample of 500 citations. More than 30% of the citations were classified as primary sources, around 53% as secondary, and around 13% as tertiary. After discussing the type, creator, and publisher of the references, as well as a large-scale domain analysis and persistence over time, the paper concludes: "Wikipedia’s content is ultimately driven by the sources from which that content comes. ... Although secondary sources are considered by policy to be the most desirable type, we demonstrate that nearly half of all citations are either primary or tertiary sources, with primary sources making up approximately one-third of all citations."
Conflict on Wikipedia as "generative friction": A paper titled "The role of conflict in determining consensus on quality in Wikipedia articles" analyses 147 conversations about quality from the archived history of the English Wikipedia article Australia. Based on this case study, and after observing that editors quite often refer to Wikipedia policies and regulations in their discussions, it claims that "conflict in Wikipedia is a generative friction, regulated by references to policy as part of a coordinated effort within the community to improve the quality of articles." Although the paper builds on a strong theoretical foundation and a fascinating literature review on constructive conflict, generalising the results of a case study of a single English Wikipedia article to some 29 million articles in more than 280 language editions is problematic. An interesting interview with the author about the paper was recently published on Oxford University's Policy and Internet Blog.
Wikipedia home alone without the anti-vandal bot: In "When the Levee Breaks: Without Bots, What Happens to Wikipedia’s Quality Control Processes?" Stuart Geiger and Aaron Halfaker analyzed the impact of the temporary downtime of one of the main automated vandal-fighting tools – ClueBot NG – on the quality control processes of the English Wikipedia. They took four historical incidents during which the bot went down for a sustained amount of time as a naturally occurring experiment. Analyzing the division of labor between automated, tool-assisted and manual revert activity, they find that robotic reverts are the fastest, the vast majority occurring within one minute of the target edit. During ClueBot NG’s downtime, the authors observe, no other tool was available to perform the same type of early revert work, and as a result the median time-to-revert nearly doubled. The paper concludes that Wikipedia's quality control processes are resilient insofar as the same proportion of reverted edits is eventually reached, but at a substantially slower pace than when the bot is available.
WikiProjects open to non-members: A paper titled "Project talk: Coordination work and group membership in WikiProjects" finds that depending on the activity level and size of a WikiProject, different methods and theoretical perspectives may be applicable. While most research so far has focused on the most active WikiProjects, those are usually much larger and more formally organized than the typical project, which has only a few active members and is very loosely (what this reviewer would describe as adhocratically) organized. Official membership lists are often misleading, as some significant contributors may not even be official "members". The authors dispute previous findings suggesting that members prefer to work with other members, finding little to no bias in members responding to requests by non-members. They also find that many WikiProjects are organized in a fashion similar to many small FLOSS projects.
WikiProjects are like free software projects: Another paper by the same authors analyzed "788 work-related discussions from the talk pages of 138 WikiProjects", with the results suggesting "that WikiProject collaboration is less structured and more open than that of many virtual teams and that WikiProjects may function more like FLOSS projects than traditional groups."
Automatic detection of deletion candidates: A paper titled "Automated Decision Support for Human Tasks in a Collaborative System: The Case of Deletion in Wikipedia" presents a model for identifying English Wikipedia articles deleted via the three main deletion processes: speedy deletion, proposed deletion and articles for deletion. The model uses a variety of features, including properties of the article creator, language-related features (such as the frequency of verbs, adverbs or adjectives) and the actual text of the article. The best model – which combines all sets of features – performs well overall, and in the "easy" case of speedy deletions reaches a precision of 98% and a recall of 97% – a level of performance that the authors submit is good enough for the model to be implemented as a decision-support tool for Wikipedia editors. The tool also detects non-encyclopedic articles even when they have remained in Wikipedia for a long time, and can therefore be used to identify older articles that should either be improved or removed from the encyclopedia.
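To give a flavor of the language-related features such a model might compute, here is a minimal sketch in Python. The feature names and the crude "-ly" adverb heuristic are invented for illustration; the paper's actual feature set is far richer and also covers properties of the article creator.

```python
# Toy sketch of simple lexical features for deletion prediction
# (illustrative only; not the paper's actual model or feature set).

def language_features(text):
    """Crude lexical features: word count, '-ly' adverb proxy, average word length."""
    words = text.lower().split()
    if not words:
        return {"n_words": 0, "adverb_ratio": 0.0, "avg_len": 0.0}
    adverbs = sum(1 for w in words if w.endswith("ly"))  # rough adverb proxy
    return {
        "n_words": len(words),
        "adverb_ratio": adverbs / len(words),
        "avg_len": sum(len(w) for w in words) / len(words),
    }

print(language_features("The clearly promotional article was quickly nominated"))
```

Feature vectors like this would then be combined with creator- and text-based features and fed into a trained classifier.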
"Design and Implementation of Wiki Content Transformations and Refactorings": In this paper, based on the authors' earlier work on implementing a parser for wiki syntax (an effort separate from the Wikimedia Foundation's new "Parsoid" software for MediaWiki wikis), the authors present a framework for wikis that makes it easy to automatically carry out transformations such as the renaming of a category (in all pages that belong to that category). In the talk, Dohrn observed that there are over 100 wiki engines, but none of them use a formally defined syntax, making the content hard to process for computers.
"Revision graph extraction in Wikipedia based on supergram decomposition": Similar to an earlier paper by the same authors covered previously in this space ("Unearthing the "actual" revision history of a Wikipedia article"), this research replaces the linear version history of a Wikipedia article with a graph where loops can account for reverts, etc., which is formed by analyzing differences between article revisions by means of "supergram decomposition" of the article text.
"The Illiterate Editor: Metadata-driven Revert Detection in Wikipedia" presents "a content-agnostic, metadata-driven classification approach to Wikipedia revert detection. Our primary contribution is in building a metadata-based feature set for detecting edit quality, which is then fed into a Support Vector Machine for edit classification. By analyzing edit histories, the IllEdit system builds a profile of user behavior, estimates expertise and spheres of knowledge, and determines whether or not a given edit is likely to be eventually reverted."
Keynote on applicable Wikipedia research
In reflection of the conference's broadened scope, only one of the three keynotes focused on research about wikis and open collaboration: In his presentation "Descending Mount Everest: Steps towards applied Wikipedia research," Dario Taraborelli, Senior Research Analyst at the Wikimedia Foundation (and co-editor of this research report), made the case for Wikipedia research that has the potential to have a positive impact on Wikipedia itself, citing e.g. the opportunities opened by the ongoing user interface development work at the Foundation, and pointing to the ample data resources it offers researchers. (The title alluded to the metaphor of Wikipedia as the Mount Everest of online collaboration research, as put forth in the title of a session at last year's CSCW conference: "Scaling our Everest".)
Surveying the existing body of research, he identified the study of Wikipedia's gender gap and of its breaking news collaborations as relatively new research areas, and "Wikipedia and higher education" as a fast-growing topic, while papers which use Wikipedia as a corpus (from the field of Natural language processing, in particular) continue to see steady growth. Areas which have seen successful existing examples of actionable research include:
Two research-in-progress presentations by Oxford-based Taiwanese researcher Han-Teng Liao compared the Chinese Wikipedia and Baidu Baike, providing interesting insights from one of the few languages where Wikipedia has serious competition as a user-generated encyclopedia:
"How do Baidu Baike and Chinese Wikipedia filter contribution? A case study of network gatekeeping" examined editorial policies and practices on both projects, finding that "In Chinese Wikipedia, filtering copyright-dubious materials and accommodating Chinese geo-linguistic variants are more salient, whereas censoring politically-sensitive content and enforcing a national cultural political framework of People’s Republic China are more salient in Baidu Baike." On Baidu Baike, employees of the hosting company (Baidu) define the basic rules, whereas the Chinese Wikipedia community sets its own editorial policies. Commenting on his statistical analysis of the most cited Chinese sources on both wikis, Liao observed that Baidu Baike is "overrun with spam" from e.g. book review sites, whereas external links appear to be more rigorously curated on the Chinese Wikipedia, resulting in a perhaps surprising prominence of official Chinese government sites. The Chinese Wikipedia community was found to be very politically diverse, with many users declaring an affiliation on their user page, and appeals to the principle that "Wikipedia not censored".
It's "search engines favor user-generated encyclopedias", not "Google favors Wikipedia": In "How does localization influence online visibility of user-generated encyclopedias? A study on Chinese-language Search Engine Result Pages (SERPs)," Han-Teng Liao reported on results (some of which previously published on his blog) comparing the ranking of three Chinese-language user-generated encyclopedias (Wikipedia, Baidu Baike and Hudong) on nine Chinese-language search engine variants (by the three companies Google, Baidu and Yahoo, in mainland China, Singapore, Hong Kong and Taiwan, the former two mostly using simplified Chinese and the latter two traditional), for a collection of search terms. He found that the three projects generally dominate Chinese-language search results, alongside other user-generated content. That Baidu Baike ranks highly on the search engine run by its mother company might come as no surprise (in fact, Hudong submitted a complaint to a government body last year about this), but it still ranked the wikipedia.org domain a (distant) second place in four of the seven search term categories studied. Liao also interpreted the results, tentatively, as evidence against the often-voiced (but never substantiated) suspicion that Google artificially favors Wikipedia - in fact, Google as seen in China (in simplified Chinese) tends to rank Baidu Baike above Chinese Wikipedia. Instead, the results appear to indicate a general preference of search engines for user-generated content. (Cf. related earlier coverage: "High search engine rankings of Wikipedia articles found to be justified by quality")
Many Swiss GLAM institutions unaware of CC-NC downsides: A survey among Swiss cultural heritage institutions (like museums), presented at WikiSym, found that 11% of responding institutions have staff who contribute to Wikipedia during office hours, and 14% have staff who do so in their free time. More than half of them were unaware that non-commercial (NC) licenses prevent reuse of their content by Wikipedia.
Collective memories in Wikipedia
Researchers Michela Ferron and Paolo Massa expand on their previous work analyzing collective memories on Wikipedia, finding statistical evidence that commemorative editing sets articles about traumatic events apart from the contribution patterns of other articles and talk pages. For major recent events such as the 9/11 attacks, as well as more historical ones such as the Pearl Harbor attack in World War II, editing activity on the corresponding English Wikipedia articles and talk pages increases significantly on the events' anniversaries compared to normal day-to-day editing patterns. Qualitative examination of the content of talk page discussions on these dates likewise reveals editors' attempts to make sense of and commemorate traumatic cultural events on their anniversaries. This research matters because Wikipedia is a commons on which different perspectives about traumatic and historic events are interpreted, co-constructed, and revisited by users. The data used in this analysis was also released by the authors and is available online.
"Impact of Wikipedia on citation trends": The authors tested an interesting hypothesis: that inclusion of scholarly references in Wikipedia affects the citation trends for those references. The authors do not reach conclusive findings. While the citations to Wikipedia references increase, they do not do so significantly more than for articles which are not cited on Wikipedia. The authors do note, however, that Wikipedia will often list highly cited articles in its references.
News portal automatically generated from Wikipedia edits: An ArXiv preprint presents "a news-reader that automatically identifies late-breaking news among the most recent Wikipedia articles and then displays it on a dedicated Web site", called "WikiPulse" (not to be confused with the Wikipulse visualization of recent changes on Wikipedia). Besides pageviews, it analyzes edits, among other things emphasizing edits by the top 50 most active editors, and editors that are classified as "domain experts".
Ethnography of bots: In a blog post in the "Ethnographies of Objects" series on group blog "Ethnography Matters," Wikipedian (bot-operator) and Wikipedia researcher Stuart Geiger offers an ethnographic analysis of how a bot can be socially perceived.
English Wikipedia not a huge threat for non-English Wikipedias: An ArXiv preprint titled "Comparing the usage of global and local Wikipedias with focus on Swedish Wikipedia" investigates the question "To what extent (and why) do people from non-English language communities use the English Wikipedia instead of the one in their local language?", finding that "Altogether, we can conclude that access volume for typical Swedish articles decreases by less than a few percent when an English version is created. This is a major result, since it shows that English articles do not draw away much attention from Swedish articles."
^Getting to the Source: Where does Wikipedia Get Its Information From?
^The role of conflict in determining consensus on quality in Wikipedia articles 
^Stuart Geiger, Aaron Halfaker: When the Levee Breaks: Without Bots, What Happens to Wikipedia’s Quality Control Processes?
^Jonathan T. Morgan, Michael Gilbert, David W. McDonald, Mark Zachry: Project talk: Coordination work and group membership in WikiProjects 
^Michael Gilbert, Jonathan T. Morgan, David W. McDonald, Mark Zachry: Managing Complexity: Strategies for group awareness and coordinated action in Wikipedia PDF
^Hannes Dohrn, Dirk Riehle: Design and Implementation of Wiki Content Transformations and Refactorings