A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Many of the more active Wikipedia contributors are multilingual. In the April 2011 Wikipedia Editors Survey,[supp 1] 72% of respondents said they read Wikipedia content in more than one language, and 51% said they contributed to multiple Wikipedias. Research has estimated that approximately 15% of active Wikipedians are multilingual.[supp 2] These contributors are important as they can enable knowledge transfer between different language editions of Wikipedia, yet little is known about who they are and what they do.
A recent paper published in PLOS ONE by researchers at KAIST and OII, titled "Understanding Editing Behaviors in Multilingual Wikipedia"[1], adds to our knowledge of multilingual contributors by investigating their engagement level, topic interests, and language proficiency. The paper uses a dataset spanning a month of Wikipedia contributions in July–August 2013 and defines a multilingual editor as one who makes contributions to multiple language editions. Overall, the dataset contains 12,577 multilingual editors, of whom 77.3% are bilingual, 11.4% trilingual, and 4.1% quadrilingual.
Out of Wikipedia's (now) 288 language editions, the paper focuses on three: English, German, and Spanish. These three languages were chosen because the paper uses natural language processing to estimate language proficiency, and the necessary tools are sufficiently developed for these languages. The multilingual editors are divided into two groups: primary editors, consisting of the contributors who make most of their edits to a given language edition, and non-primary editors. These two groups are then compared in terms of their engagement, topic interests, and language proficiency.
To measure editor engagement, consecutive edits by the same editor to the same article are collapsed into edit sessions.[supp 3] T-tests are used to compare primary and non-primary editors on several measures: number of edits per session, session length, amount of content added (number of characters or tokens such as words), and whether non-visible changes are made. The results show that primary editors are more engaged as they commit more edits, have longer sessions, add more content, and are more likely to make visible edits compared to non-primary editors.
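To make the sessionization step concrete, here is a minimal sketch of how such collapsing might be implemented (the field names and the one-hour inactivity cutoff are assumptions chosen for illustration, not details taken from the paper):

```python
from collections import defaultdict
from datetime import timedelta

# A minimal sketch of session collapsing: within each editor's time-ordered
# edit stream, consecutive edits to the same article are grouped into one
# session. The one-hour inactivity cutoff is an assumed illustration value,
# not a threshold taken from the paper.
SESSION_GAP = timedelta(hours=1)

def collapse_sessions(edits):
    """edits: iterable of dicts with 'editor', 'article' and 'timestamp' (datetime)."""
    by_editor = defaultdict(list)
    for edit in edits:
        by_editor[edit["editor"]].append(edit)

    sessions = []
    for editor_edits in by_editor.values():
        editor_edits.sort(key=lambda e: e["timestamp"])
        current = [editor_edits[0]]
        for edit in editor_edits[1:]:
            same_article = edit["article"] == current[-1]["article"]
            within_gap = edit["timestamp"] - current[-1]["timestamp"] <= SESSION_GAP
            if same_article and within_gap:
                current.append(edit)
            else:
                sessions.append(current)
                current = [edit]
        sessions.append(current)
    return sessions
```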
Editor interests are identified using a combination of LDA and DBSCAN to create a set of 20 topic clusters for each language. These topic clusters are then labelled by humans, resulting in cluster labels such as “Science” and “Global Sports”. Primary and non-primary editors are found to be generally interested in the same topics, but some significant differences show up: for instance, non-primary editors contribute more to articles about cities in English, soccer in German, and plants in Spanish, while primary editors are more interested in, for example, computers in English and German, and politicians and entertainment in Spanish.
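As a rough illustration of this kind of pipeline (not the authors' actual code; the vectorizer settings, clustering parameters, and other values below are placeholder assumptions), per-article topic distributions could be derived with LDA and then grouped with DBSCAN, for instance using scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import DBSCAN

# Illustrative sketch of an LDA + DBSCAN topic-clustering pipeline.
# All parameter values are placeholder assumptions, not the paper's settings.
def topic_clusters(article_texts, n_topics=20):
    vectorizer = CountVectorizer(max_features=10000, stop_words="english")
    counts = vectorizer.fit_transform(article_texts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)  # per-article topic distributions

    # Group articles whose topic distributions lie close together.
    clustering = DBSCAN(eps=0.1, min_samples=5).fit(doc_topics)
    return clustering.labels_  # -1 marks articles left unclustered (noise)
```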
Lastly, the paper studies the language complexity of contributions by primary and non-primary editors. Several measures of language complexity from the literature are used, for example the entropy of part-of-speech unigrams, bigrams, and trigrams, as well as whether articles (in English: the, a, an) are used correctly. Because different topics use language differently – for instance, fact-oriented topics such as sports show lower language complexity than more conceptual topics such as history – both intra-topic and inter-topic complexity are controlled for. Primary editors are found to use more diverse terms and to edit more complex parts of articles than non-primary editors across all three languages. However, English differs from German and Spanish when it comes to the linguistic proficiency of the edits made: in German and Spanish, primary editors display higher linguistic proficiency than non-primary editors, whereas in English there is no noticeable difference.
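For readers curious about the entropy measure, here is a rough sketch of how the Shannon entropy of part-of-speech n-grams can be computed (NLTK's default tagger is used purely for illustration; the paper's actual toolchain and preprocessing are not reproduced here):

```python
import math
from collections import Counter

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

# Illustrative sketch: Shannon entropy of part-of-speech n-grams in a text.
# The tagger and preprocessing are assumptions, not the paper's toolchain.
def pos_ngram_entropy(text, n=2):
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    ngrams = list(nltk.ngrams(tags, n))
    counts = Counter(ngrams)
    total = sum(counts.values())
    # H = -sum(p * log2 p) over the observed n-grams
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(pos_ngram_entropy("The quick brown fox jumps over the lazy dog.", n=2))
```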
Taken together, the results indicate that language continues to be a barrier to entry, as non-primary editors are less engaged and make less complex contributions. The findings also point to how English continues to be a hub language in Wikipedia: it has the lowest proportion of primary editors at 32.9%, compared to German’s 49.9%. (In this context, the authors mention a 2012 WikiSym paper[supp 4], co-authored by this reviewer, which found that English was by far the most-used language to translate from – as measured by translation template usage – and discussed how the English Wikipedia could thereby be used as a hub.) At the same time, multilingual Wikipedians are important in helping move content across languages, as exemplified by the Wikimedia Foundation’s development of a tool to recommend articles for translation.[supp 5] As mentioned in the paper’s conclusion, many questions about multilingual Wikipedians remain open, although this paper makes significant contributions by answering some of them.
This article[2] is a report on one component of a longitudinal study of how "rationales" are used by Wikipedians in articles for deletion (AfD) discussions to direct collaboration. In order to arrive at conclusions about the role of rationales in decision-making processes, the author has approached the research object from a number of angles. Previously the researcher had conducted an exploratory content analysis of rationales, subsequently followed by interviews with Wikipedians. The current research describes the process of developing an algorithmic tool able to analyze large datasets for "directive rationales". The author acknowledges that AfD discussions lend themselves to this kind of analysis because of the predictable order of comments, each describing an action and a rationale for that action; decision-making of this sort differs substantially from the style of discussion on the rest of Wikipedia's talk pages. Regardless of this limitation, the author concludes that further research into rationales will provide insights into how they function to connect policies with practices. Given the breadth of research methods in the project, it will be interesting to see what conclusions the author comes to when the project concludes.
This paper[3] addresses the area of scientific knowledge creation online, as well as the notion of controversy, by examining the editing history and discussion of the English Wikipedia article on schizophrenia and its sub-article, causes of schizophrenia. The specific controversy the authors focused on is that of the genetic basis of schizophrenia (a topic which the authors note is still debated by scholars and on which there is no consensus). The authors commend the neutrality of the lead of the Wikipedia article ("The causes of schizophrenia have been the subject of much debate, with various factors proposed and discounted or modified...") and ask "How are such statements constructed, or in other words, what is the work which goes into making these claims?" The authors used a dataset spanning August 2006 to October 2011 (20,000 words of talk text and 13,000 words of article text) to investigate how this topic is presented and contested on Wikipedia.
The authors make a number of interesting observations. They observe that editors are not equal, and in addition to the usual admin>user>anon>bot hierarchy, they note that "'who you are' is important when it comes to editing the schizophrenia article...". Many editors self-identified as living with schizophrenia or as medical experts. The talk pages are policed to keep the discussion focused on the article's contents, and anecdotes and personal experience stories are discouraged, or even removed from the pages. WP:V and WP:OR are certainly enforced as well, and Wikipedians will be pleased to note the authors' observation that "Priority is always given to the published scientific literature." However, there are also a number of problems: not all contributors have access to paywalled, quality content, and some seemingly rely only on article abstracts.
Some low-quality references slip through the net, and standards are not enforced consistently: "Attention to the reference list in the schizophrenia article at the time of our study revealed numerous citations that were not reviews" but original research papers about "breakthroughs", mentioned in the context of a talk page argument that "such papers should be avoided until their findings are confirmed". The authors also note that they found at least "one reference to another Wikipedia article and also to a schizophrenia forum discussion". The article's structure is the result of years of minor edits with little attention to the big picture, resulting in an occasionally illogical and incoherent layout with some contradictions and clearly obsolete but not updated sections, which leads the authors to summarize the state of the article as "a rather ad hoc assemblage of resources" and "a chronological patchwork of studies that nonetheless does have the effect of synthesising knowledge". Despite those problems, they conclude that the Wikipedia article, and the creation process behind it, is similar to an academic review article. And despite Wikipedia's claims that it simply describes the state of things rather than creating new arguments or points of view, the authors do think that the Wikipedia article is an active voice in ongoing discussions, noting that some editors on the talk page see the purpose of the article as educating the public as well as some experts.
There are some unfortunate omissions (though to some degree understandable given academic word limits). The authors do not discuss in detail whether some users, such as experts, seem to pull more weight in the discussions, or whether the removal of personal stories affects the friendliness of the discussion. Despite these omissions, the paper is an interesting analysis of knowledge creation on Wikipedia, as well as another contribution to the ongoing discussion about the reliability and quality of Wikipedia. On that note, it is worth noting that schizophrenia is a Featured Article, following a 2003 nomination that by today's FA standards reads more like a joke. Given the criticism of the article's 2011 version voiced by this paper, the community may want to consider a Featured Article Review here.
Co-citation graphs (networks of who cites whom) are frequently used to recommend books and articles, but how well do links between Wikipedia articles work for this purpose? A paper[4] to be published at the upcoming Joint Conference on Digital Libraries evaluates this by comparing the performance of co-citation, with and without proximity analysis, against the commonly used “More Like This” (MLT) text-based approach found in Apache Lucene. The paper’s main finding is that co-citation with proximity analysis (CPA) performs comparably to MLT, but that the two methods have different strengths: MLT is good at identifying closely related articles, while CPA is better at finding broader ones and tends to identify more popular articles that are typically of higher quality. These results suggest a hybrid approach might be best suited for finding related articles in Wikipedia, something the authors plan to study in future work.
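To illustrate the intuition behind CPA (the inverse-distance weighting below is one common formulation and an assumption on our part, not necessarily the exact function used in the paper), relatedness between linked pages can be scored by how close together their links appear within an article:

```python
from collections import defaultdict
from itertools import combinations

# Illustrative sketch of co-citation proximity analysis (CPA): links that
# appear close together in an article contribute more to the relatedness of
# the linked pages. The 1/distance weighting is an assumed formulation.
def cpa_scores(articles):
    """articles: dict mapping article title -> list of link targets,
    in the order they appear in the article text."""
    scores = defaultdict(float)
    for links in articles.values():
        for (i, a), (j, b) in combinations(enumerate(links), 2):
            if a != b:
                pair = tuple(sorted((a, b)))
                scores[pair] += 1.0 / abs(i - j)  # closer links -> higher weight
    return scores

example = {"Article X": ["Berlin", "Germany", "Physics"],
           "Article Y": ["Berlin", "Germany"]}
print(sorted(cpa_scores(example).items(), key=lambda kv: -kv[1]))
```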
This paper[5], published in JASIST in April this year, is a brief opinion piece summarizing perceptions of Wikipedia in academia. It provides a short literature review of works that discuss this subject, summarizes the research on Wikipedia's reliability (still a concern among many scholars), notes the spread of Wikipedia-based teaching assignments in colleges, acknowledges the general widespread use of Wikipedia by the public, and, in the paper's own words, calls "for a peaceful coexistence". A more detailed take on those very subjects appeared in the same journal in March[6] (disclaimer: the latter article is written by this reviewer).
Yet another small university class (fewer than 20 first-year students) has independently tried Wikipedia editing and tells its story.[7]
The students were told to edit an article and succeeded; while doing so, they improved their information literacy, digital literacy, and trust in the Wikipedia system. On the other hand, the exercise itself was not sufficient to make them understand the dynamics and principles of Wikipedia in depth, nor to integrate them into the community.
In the opinion of this reviewer, the article makes for a nice blog post to be shared with university professors in other Nordic countries and in similar disciplines. The experience also confirms that university professors can and should use Wikipedia as a teaching tool, but can improve results if they contact expert Wikimedians (usually via a local Wikimedia chapter) to introduce the students to the spirit and dynamics of the Wikimedia projects.
A short opinion piece from the University of Wisconsin-River Falls supporting the use of Wikipedia as a teaching tool to improve information literacy.[8] Under the guise of a literature review, the author mentions four past experiments with wikis in the classroom, published between 2006 and 2009.
According to this 2012 survey of 800 professors at the Universitat Oberta de Catalunya, professors mostly agree with the use of Wikipedia as an "open repository" to disseminate research, and a growing number of them approve of its use as a teaching tool. At the time of the survey, however, most professors were still waiting to be convinced by their colleagues.[9] See also the longer review of the paper's preprint version[supp 6] in our December 2014 issue: "Use of Wikipedia in higher education influenced by peer opinions and perception of Wikipedia's quality"
A paper[10] to be published at the forthcoming SIGIR 2016 conference as part of its demonstration track describes MultiWiki (demo available online), a tool that calculates similarities and differences between pairs of articles in different Wikipedia languages. The tool then visualises these using a timeline, a map, and a side-by-side display of the article texts. Visualising similarities and differences between Wikipedia languages is not a new idea[supp 7] [supp 8], but this tool is the first to show textual alignment.
A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.
[NPOV] was new to many of them. Some say they are surprised to find that there are so many rules and norms to consider before the text is up to standards. One respondent expressed astonishment that "there are even standards for how to write numbers in percentage!" Others are surprised to find any rules at all, having heard about the inaccuracies and biases of Wikipedia's content: "I used to think anything goes." ... The students were positive about their discovery of the Wikipedia community, which for many changed some of their attitudes to the site. ... For those who mention trust, they related it to one or both of the following factors: (a) to the discovery of the qualifications of many Wikipedians ("lots of educated people") or (b) to the control mechanism available and that there are people who "check the pages" and "remove unwanted content" ... The initial skepticism expressed in the questionnaire has thus changed, leaving Wikipedia "a place I can partly trust on par with other sources, as it is surveilled by a kind of administrators".
Discuss this story
Schizophrenia... does it even exist?
The first thing about an article about schizophrenia and genetics is that the two parts exist. There is a lot of literature, scientific at that, denying that schizophrenia exists, which makes the second part irrelevant. The next and obvious question is: what are we talking about? Thanks, GerardM (talk) 07:14, 29 May 2016 (UTC)
Universitat Oberta de Catalunya paper
This is a minor gripe of mine, but we (probably me) already reviewed this here back in Dec 2014 (this is the very same paper, published not in January THIS year as the newsletter states, but in February LAST year; compare [1] and [2]. We probably reviewed a preprint back then, but any changes, if they exist, are minor). In my relatively comprehensive (or at least I'd like to think so) lit review on the subject from March THIS year, which has yet to be reviewed by the Research Newsletter, I have a note saying "Meseguer Artola et al.'s (2015) study incorporates and builds on an earlier work of its contributors, Eduard (2014) and Lladós, Aibar, Lerga, Meseguer, and Minguillón (2013), using the same data set and arriving at more refined conclusions. For that reason, those works are not reviewed or cited separately." I was wondering if the said authors published yet another remix of their research, but no, it seems to be a mistake in our review. I suggest removing that section. We have plenty of unreviewed research (hint: dear readers, we have a backlog - help!), no need to discuss the same paper twice. --Piotr Konieczny aka Prokonsul Piotrus| reply here 09:08, 29 May 2016 (UTC)
Availability of broadband
I have been told if I pay more I too can have broadband at home. I decided not to spend the money. I have no problem contributing at home, but I have often copied the information from other sources at libraries. At one time this was because I couldn't access the information at home, but now the resource that I used the most is unavailable unless I travel about 30 miles. But the truth is I don't have the patience to wait and wait at home. Only those few sites I spend a lot of time on ever approach the speed that they do at libraries. That first time accessing a site (a long way from actual research) can take a very long time.— Vchimpanzee • talk • contributions • 20:23, 30 May 2016 (UTC)
The difference between information acquisition and learning knowledge