A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Analyzing edits to the then 46 largest Wikipedias between July 9 and August 8, 2013, a study[1] identified a set of about 8,000 contributors (labeled multilingual) with a global user account who edited more than one of these language versions (excluding Simple English, which was treated separately) in that time frame. It tested five hypotheses about cross-language editing and editors and looked, for instance, at the proportion of contributions that each of these Wikipedias receives from multilingual editors versus contributions from those editing only one language version. The research found that Esperanto and Malay stand out with a high proportion of contributions from multilinguals and that, at the other end, Japanese has few contributions from multilinguals. Overall, in terms of edits per user, multilingual users made more than twice as many contributions to the study corpus as monolinguals did; they often work on the same topics across languages; and in any given language, they frequently edit articles not edited by monolinguals during the one-month period analyzed. They thus serve a bridging function between languages.
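To make the measures above concrete, here is a minimal Python sketch of how such a classification could be computed from a list of (user, language edition) edit records. It is an illustration only, not the study's actual code: the function names, the `EDITS` toy data and the user names are all hypothetical, and details such as the global-account requirement and the separate handling of Simple English are not modelled.

```python
from collections import defaultdict

# Hypothetical toy data: one (user, language edition) pair per edit.
EDITS = [
    ("alice", "eo"), ("alice", "en"), ("alice", "eo"),
    ("bob", "en"),
    ("carol", "ja"), ("carol", "ja"),
    ("dave", "eo"),
]

def classify_editors(edits):
    """Return the set of users who edited more than one language edition."""
    langs_by_user = defaultdict(set)
    for user, lang in edits:
        langs_by_user[user].add(lang)
    return {user for user, langs in langs_by_user.items() if len(langs) > 1}

def multilingual_share(edits, multilingual):
    """Per language edition, the fraction of edits made by multilingual users."""
    total = defaultdict(int)
    from_multilingual = defaultdict(int)
    for user, lang in edits:
        total[lang] += 1
        if user in multilingual:
            from_multilingual[lang] += 1
    return {lang: from_multilingual[lang] / total[lang] for lang in total}

def edits_per_user(edits, multilingual):
    """Mean number of edits per user for (multilingual, monolingual) users."""
    counts = defaultdict(int)
    for user, _lang in edits:
        counts[user] += 1
    multi = [n for user, n in counts.items() if user in multilingual]
    mono = [n for user, n in counts.items() if user not in multilingual]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(multi), mean(mono)

if __name__ == "__main__":
    multilingual = classify_editors(EDITS)
    print(sorted(multilingual))
    print(multilingual_share(EDITS, multilingual))
    print(edits_per_user(EDITS, multilingual))
```

In this toy data only "alice" counts as multilingual, since she is the only user appearing in more than one language edition; a real analysis would read the edit records from the Wikipedia revision histories rather than from a hard-coded list.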
Two existing write-ups are good starting points for putting the study in context.[supp 1][supp 2] In the long run, it would be interesting to extend the research to (a) cover a longer time span, (b) include contributions from non-registered users, despite technical difficulties, (c) include smaller Wikipedias, and (d) explore the effects of that bridging function in more detail, perhaps in search of ways to support its beneficial effects while minimizing the non-beneficial ones. It would also be interesting to focus on some aspects of those multilingual users (e.g. how the languages they edit in match the languages they display on their user pages) or their contributions (e.g. how their contributions to text, illustrations, references, links, templates, categories or talk page discussions differ across languages, or how contributions from multilinguals differ across topics or between pages with high and low traffic), or to entertain ideas for a multilingual version of editing tools like User:SuggestBot. The paper is one of the first to make use of Wikidata; comparing such cross-lingual Wikipedia contributions with contributions to multilingual projects like Wikidata and Commons may also be a fruitful avenue for further research. (See also earlier coverage of a CSCW paper about a similar topic: "Activity of content translators on Wikipedia examined")
A new paper on arXiv asks the question "Can electoral popularity be predicted using socially generated big data?"[2] Operating on the assumption that "sentiment data is implied in information seeking behaviour," the authors Yasseri and Bright compare Wikipedia page views and Google search trends to election outcomes in Iran, Germany and the UK. In Iran and the UK, where the researchers were able to use the articles of individual politicians, the page view and search trend data correctly pick the winners of the elections. In the UK, the data even correctly pick the order of the runners-up, but the same is not true for Iran. In the German case, no correlation is found between search data and election results. Yasseri and Bright defer to the argument from previous studies on Twitter prediction, which concluded that the sample data is too self-selecting. Overall, it is shown that "people do not simply search in the same proportions that they vote." Still, the researchers note that these techniques react "quickly to the emergence of new 'insurgent' candidates."
A book titled Confidentiality and Integrity in Crowdsourcing Systems contains a chapter on the integrity of the English Wikipedia as a case study of integrity management in crowdsourcing systems.[3] To test the integrity of Wikipedia, the authors first tried to start a new article with "invalid content" (it got deleted) and then turned to vandalizing pages systematically, both of which violate Wikipedia policies (cf. Wikipedia:Vandalism). They noted that simple cases were caught by automated counter-vandalism tools (ClueBot and XLinkBot, whose user pages – one of them with a typo – are the only references cited in the chapter), whereas more subtle cases ("incorrect information containing words related to the page’s topic" or adding external links present in related Wikipedia articles) were not. No indication was given as to whether these inappropriate edits had later been removed (by the authors themselves or by other users), what the affected pages were, or what IP address(es) the authors had used to make those edits.
As a next step, the authors went through dumps of the English Wikipedia from 2001 to 2011 and analyzed revision histories for "100 good and featured articles" (which refers to Wikipedia:Good articles and Wikipedia:Featured articles – later, they call this set "high-quality articles") and "100 non-featured articles" (by which they mean neither good nor featured – later, they refer to this set as "low-quality articles"). In this sample (of which no further details are given), they observed that the number of contributions to high-quality articles is about one order of magnitude higher than that to low-quality articles and "that there is a highly active group of contributors involved from the creation of high quality articles until present", while most editors of low-quality articles never contributed to those pages again. They then looked at revert rates, at the overlap between sets of top contributors to a given article across years, and at the range of topics edited by top contributors to an article, observing that "the top contributors have become the owners of high quality articles and their engagement has increased" (which runs contrary to WP:OWN), that "[T]his results in higher quality for a small portion of articles in Wikipedia" and that "[T]op contributors of high quality articles are more like-minded than the top contributors of low quality articles", and concluding "that the main difference between low quality and featured articles is the number of contributions."
From that, they venture into extrapolating to crowdsourcing systems more generally: "[w]e observe that to have higher integrity in crowdsourcing systems, we need to have a permanent set of contributors who are dedicated for maintaining the quality of the contributions to the articles. For systems with open access such as Wikipedia, this can be a huge burden for the permanent editors. Therefore, we need new mechanisms for coordinating the activities in a crowdsourcing information system." No discussion of these new mechanisms is offered.
The chapter has a few simple tables and plots but no link to the underlying data or the code used for the analysis, and no links to relevant literature or Wikipedia policies; moreover, it is paywalled behind a price tag of $29.95 / €24.95 / £19.95. Given that the experimental edits to Wikipedia actually damaged the project, it is hard to imagine that an ethical review panel involving Wikipedians might have approved the study in that form. In fact, such a panel does exist in the form of the Research Committee, which had not been contacted about the project. Considering further that the conclusions of the study are not new, that their possibly interesting implications for crowdsourcing more generally are not discussed, and that neither the paper nor its materials are available to those concerned about the integrity of Wikipedia, it is hard to see any benefit of this study that would outweigh the damage it caused (cf. earlier coverage: "Link spam research with controversial genesis but useful results", "Traffic analysis report and research ethics").
Discuss this story
The fact that Esperanto has a large number of multilingual contributors is not surprising. It's no one's native language (so every Esperanto speaker is fluent in something else), and it's usually the third, fourth, or fifth (or more) language. Just updating articles about Esperanto can take you to many wikis.
I'd like to know whether these contributions were text, or if adding the same set of images to many Wikipedias counted the same as being able to write a sentence. WhatamIdoing (talk) 22:53, 28 December 2013 (UTC)
LanguageTool
I'd like to hear anything people have to say about open-source grammar checkers. The one mentioned here, LanguageTool, isn't likely to be useful. (When I asked it to check World War II, it told me: "The noun 'all' seems to be countable, so consider using: alls.") - Dank (push to talk) 19:51, 28 December 2013 (UTC)
Integrity of Wikipedia and Wikipedia research
"... it is hard to see any benefit of this study ..." or of summarising it or otherwise reporting on it. -- Michael Bednarek (talk) 08:16, 31 December 2013 (UTC)
re: Evaluation of gastroenterology and hepatology articles on Wikipedia
And indeed all the reasons mentioned there are why Wikipedia articles, imperfect as they are, are still much, much better than most peer-reviewed literature: they are accessible. Unlike the said study, which, paywalled or not, will go unread not only by most students but even by most instructors and practitioners. Rather than wasting time trying to warn the students away from Wikipedia, the study should do the more constructive thing, which is to encourage the instructors and students to improve Wikipedia articles, as some other, commendable initiatives have done. --Piotr Konieczny aka Prokonsul Piotrus| reply here 12:13, 1 January 2014 (UTC)