A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
One of the earliest research papers on Wikipedia is "Talk Before You Type: Coordination in Wikipedia"[supp 1] by Fernanda Viégas and Martin M. Wattenberg, then of IBM Research. In that paper, the researchers performed a series of analyses of Wikipedia, including visualizations of article growth using their HistoryFlow visualization platform, in order to understand how Wikipedia editors coordinate their activities around content production, curation, and quality control. Their analysis revealed that between 2003 and 2005, Wikipedia talk pages grew at a greater rate than article pages, suggesting that explicit coordination among editors had become increasingly important as the community and encyclopedia grew. To investigate this trend, they performed a content analysis of a purposeful sample[supp 2] of 25 article talk pages to understand how these discussion spaces support article development. Among their findings was that 58% of talk page posts included requests or proposals to edit the related article. Other types of talk page posts included requests for information about the article, references to vandalism, references to policy, and off-topic remarks.
The findings and methods from "Talk Before You Type" have informed many subsequent studies of Wikipedia. As of September 2018, the study has been cited over 400 times according to Google Scholar.[supp 3] However, the content analysis portion of the study covered a relatively small number of talk pages, all of them on the English Wikipedia, and it was conducted over a decade ago. So it is fair to ask: do other language editions of Wikipedia use talk pages the same way the English Wikipedia does? And how might collaboration practices on the English Wikipedia itself have changed in the intervening decade?
A recent paper published in the proceedings of the 2018 OpenSym conference addresses both of these questions. Titled "We All Talk Before We Type?: Understanding Collaboration in Wikipedia Language Editions"[1] and written by researchers at the University of Washington, it attempts to replicate Viégas and Wattenberg's content analysis using a larger and more recent sample of talk page posts from the English, Spanish, and French Wikipedias. The researchers find evidence that different Wikipedia communities use talk pages differently: for example, Spanish Wikipedia articles seem to feature an overall higher proportion of requests for information than either English or French, and a higher proportion of information boxes. However, the most striking result of their study is that while the proportions of different kinds of talk page posts are broadly similar across the three Wikipedias analyzed, they are all substantially different from the results of the 2007 study. In particular, while Viégas and Wattenberg found that 58% of talk page posts included requests for editing coordination, only 35–37% of posts in the newer sample involved coordination of editing activity.
This result suggests that the way editors use talk pages has changed dramatically since the 2005 sample was collected. It may be an indication of changes in the focus of editing work on Wikipedia. Perhaps the work of maintaining a much larger and fuller Wikipedia requires less direct coordination than was necessary earlier in the project's development, when more editors focused on writing and expanding new articles. It is also possible that the Wikipedia editor communities, which now contain a much higher proportion of experienced editors, are able to coordinate their activities more effectively in a stigmergic manner, making "talking before you type" less necessary. Future research can build on this work by examining other differences, such as those between the editing dynamics of older and younger Wikipedias, and by examining the cultural forces that mediate how editing communities talk to, and work with, one another.
On the matter of stigmergy, another paper presented at OpenSym 2018 titled "Stigmergic Coordination in Wikipedia" investigated evidence for just that.[2] From the abstract: "Using a novel approach to identifying edits to the same part of a Wikipedia article, we show that a majority of edits to two example articles are not associated with discussion on the article Talk page, suggesting the possibility of stigmergic coordination. However, discussion does seem to be related to article quality, suggesting the limits to this approach to coordination."
Although the researchers analyzed only two articles, Abraham Lincoln and Business on the English Wikipedia, they concluded that "the data presented in this paper suggest that a substantial fraction of the edits made on Wikipedia are coordinated without explicit discussion on the Talk pages", which they hypothesize represents stigmergic coordination. Indeed, the majority of the edits analyzed appeared to be stigmergic; while this may be unsurprising for minor edits and vandalism fixes, stigmergy was apparent even in substantial edits. Moreover, the authors caution that because of the "overly strict operationalization" used in gathering and analyzing the data, these analyses may underestimate the extent of stigmergic editing on Wikipedia.
From the abstract:[3] "[...] we look at the structure of the influences between Western art painters as revealed by their biographies on Wikipedia. We use a modified version of modularity maximisation with metadata to detect a partition of artists into communities based on their artistic genre and school in which they belong. We then use this community structure to discuss how influential artists reached beyond their own communities and had a lasting impact on others [...]"
See also earlier coverage of similar research: "The history of art mapped using Wikipedia"
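The community-detection step described in the abstract quoted above can be illustrated with a small sketch. The snippet below uses plain greedy modularity maximisation as implemented in NetworkX rather than the paper's metadata-aware variant of modularity, and the painter "influenced by" links are invented for the example rather than taken from the study's data.

```python
# Minimal sketch: community detection on a toy painter-influence network.
# Uses standard greedy modularity maximisation from NetworkX; the paper's
# metadata-aware modularity is not reproduced here.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical "influenced by" links, e.g. as they might be extracted
# from Wikipedia biographies (invented for illustration).
edges = [
    ("Giotto", "Masaccio"),
    ("Masaccio", "Michelangelo"),
    ("Michelangelo", "Caravaggio"),
    ("Monet", "Renoir"),
    ("Renoir", "Cezanne"),
    ("Cezanne", "Picasso"),
]

G = nx.Graph(edges)  # treat influence links as undirected for modularity
communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```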
From the abstract:[4] "This study concentrates on extracting painting art history knowledge from the network structure of Wikipedia. Therefore, we construct theoretical networks of webpages representing the hyper-linked structure of articles of seven Wikipedia language editions. These seven networks are analyzed to extract the most influential painters in each edition using Google matrix theory. Importance of webpages of over 3000 painters are measured using PageRank algorithm. The most influential painters are enlisted and their ties are studied with the reduced Google matrix analysis. [...] For instance, our analysis groups together painters that belong to the same painting movement and shows meaningful ties between painters of different movements. We also determine the influence of painters on world countries using link sensitivity between Wikipedia articles of painters and countries. [...] The world countries with the largest number of top painters of selected seven Wikipedia editions are found to be Italy, France, Russia."
For each of these seven Wikipedia languages, the paper contains a list of the top 50 painters by PageRank, led by Pablo Picasso in the case of the French Wikipedia, Leonardo da Vinci for the English, German, Italian, Spanish and Russian Wikipedias, and Rembrandt van Rijn for the Dutch Wikipedia.
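For readers curious about the ranking step, here is a minimal sketch of PageRank over a toy hyperlink graph of painter articles. It does not reproduce the paper's Google matrix or reduced Google matrix analysis, and the links below are invented for illustration rather than drawn from Wikipedia.

```python
# Minimal sketch: PageRank over a toy hyperlink graph of painter articles.
# The paper applies Google matrix / reduced Google matrix analysis to full
# Wikipedia link networks; this only shows the basic PageRank computation.
import networkx as nx

# Hypothetical article-to-article links between painter pages.
links = [
    ("Rembrandt", "Leonardo da Vinci"),
    ("Pablo Picasso", "Leonardo da Vinci"),
    ("Claude Monet", "Pablo Picasso"),
    ("Leonardo da Vinci", "Michelangelo"),
    ("Michelangelo", "Leonardo da Vinci"),
]

G = nx.DiGraph(links)
scores = nx.pagerank(G, alpha=0.85)  # alpha is the usual damping factor
for painter, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{painter}: {score:.3f}")
```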
From the abstract:[5] "... we look into Wikipedia articles on historical people for studying link-related temporal features of articles on past people. [...] We propose a novel style of analysis in which we use signals derived from the hyperlink structure of Wikipedia as well as from article view logs, and we overlay them over temporal dimension to understand relations between time periods, link structure and article popularity. In the latter part of the paper, we also demonstrate several ways for estimating person importance based on the temporal aspects of the link structure as well as a method for ranking cities using the computed importance scores of their related persons."
From the abstract: "... it is our goal to [analyze] the representation of world literature in Wikipedia with its millions of articles in hundreds of languages. As a preliminary, we introduce and compare three different approaches to identify writers on Wikipedia using data from DBpedia, a community project with the goal of extracting and providing structured information from Wikipedia. Equipped with our basic set of writers, we analyze how they are represented throughout the 15 biggest Wikipedia language versions. We combine intrinsic measures (mostly examining the connectedness of articles) with extrinsic ones (analyzing how often articles are frequented by readers) and develop methods to evaluate our results. The better part of our findings seems to convey a rather conservative, old-fashioned version of world literature, but a version derived from reproducible facts revealing an implicit literary canon based on the editing and reading behavior of millions of people."
The authors published their datasets at http://data.weltliteratur.net/, including lists of the top 25 writers for various language editions by various measures (e.g. Mircea Eliade leads by page length on the English Wikipedia).
From the abstract: "we compare the temporal pattern of information supply (article creations) and information demand (article views) on Wikipedia for two groups of scientists: scientists who received one of the most prestigious awards in their field and influential scientists from the same field who did not receive an award. Our research highlights that awards function as external shocks which increase supply and demand for information about scientists, but hardly affect information supply and demand for their research topics. Further, we find interesting differences in the temporal ordering of information supply between the two groups: (i) award-winners have a higher probability that interest in them precedes interest in their work; (ii) for award winners interest in articles about them and their work is temporally more clustered than for non-awarded scientists."
See the research events page on the Meta Wiki for upcoming conferences and events, including submission deadlines.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.
From the abstract: "The paper at hand analyzes vandalism and damage in Wikipedia with regard to the time it is conducted and the country it originates from. First, we identify vandalism and damaging edits via ex post facto evidence by mining Wikipedia's revert graph. Second, we geolocate the cohort of edits from anonymous Wikipedia editors using their associated IP addresses and edit times [...]. Third, we conduct the first spatio-temporal analysis of vandalism on Wikipedia. Our analysis reveals significant differences for vandalism activities during the day, and for different days of the week, seasons, countries of origin, as well as Wikipedia's languages. [...] the ratio is typically highest at non-summer workday mornings, with additional peaks after break times. We hence assume that Wikipedia vandalism is linked to labor, perhaps serving as relief from stress or boredom, whereas cultural differences have a large effect."
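As a rough illustration of the first two steps described in that abstract, the sketch below flags "identity reverts" (revisions whose content checksum matches an earlier revision of the same page) and then buckets the reverted anonymous edits by hour of day and by country. The revision fields (sha1, anonymous, ip, timestamp) and the geolocate() stub are assumptions for the example; the paper's actual pipeline mines the full revert graph and uses real IP geolocation data.

```python
# Minimal sketch (not the paper's exact method): detect identity reverts in a
# page history and profile the reverted ("damaging") anonymous edits.
from collections import Counter

def find_reverted_edits(revisions):
    """revisions: list of dicts with 'sha1', 'anonymous', 'ip', and
    'timestamp' (a datetime), oldest first. Returns the revisions undone
    by a later identity revert (same content checksum as an earlier one)."""
    first_seen = {}        # sha1 -> index of earliest revision with that content
    reverted_idx = set()
    for i, rev in enumerate(revisions):
        sha1 = rev["sha1"]
        if sha1 in first_seen:
            # Content identical to an earlier revision: everything in between
            # was reverted.
            reverted_idx.update(range(first_seen[sha1] + 1, i))
        else:
            first_seen[sha1] = i
    return [revisions[i] for i in sorted(reverted_idx)]

def geolocate(ip_address):
    # Placeholder: a real implementation would query a geo-IP database.
    return "unknown"

def damaging_anon_edit_profile(revisions):
    """Bucket reverted anonymous edits by hour of day and by country."""
    hours, countries = Counter(), Counter()
    for rev in find_reverted_edits(revisions):
        if rev.get("anonymous"):
            hours[rev["timestamp"].hour] += 1
            countries[geolocate(rev["ip"])] += 1
    return hours, countries
```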
From the abstract:[9] "In this study, we examine changes in collaborative behavior of editors of Chinese Wikipedia that arise due to the 2005 government censorship in mainland China. Using the exogenous variation in the fraction of editors blocked across different articles due to the censorship, we examine the impact of reduction in group size, which we denote as the shock level, on three collaborative behavior measures: volume of activity, centralization, and conflict. We find that activity and conflict drop on articles that face a shock, whereas centralization increases."
From the abstract:[10] "[...] our approach reaches 80.8% classification accuracy and 0.88 mean average precision. We compared against ORES, the most recent tool developed by Wikimedia which assigns a damaging score to each edit, and we show that our system outperforms ORES in spam users detection. Moreover, by combining our features with ORES, classification accuracy increases to 82.1%."
From the abstract: "Using a corpus of 725,000 revisions made to 2,012 pages about rules and rule discussions since 2001, we explore the dynamics of English Wikipedia's rule-making and maintenance over time. Our analysis reveals a policy environment marked by on-going rule-making and deliberation across multiple regulatory levels more than a decade after its creation. This dynamism is however balanced by strong biases in the attention and length towards older rules coupled with a diminishing flexibility to change these rules, declining revision activity over time, and a strong shift toward deliberation."
From the abstract:[12] "This study aims to establish benchmarks for the relative distribution and referral (click) rate of citations—as indicated by presence of a Digital Object Identifier (DOI)—from [English] Wikipedia, with a focus on medical citations. [...] all DOIs in Wikipedia were categorized as medical (WP:MED) or non-medical (non-WP:MED). Using this categorization, referred DOIs were classified as WP:MED, non-WP:MED, or BOTH, meaning the DOI may have been referred from either category. Data were analyzed using descriptive and inferential statistics. Out of 5.2 million Wikipedia pages, 4.42% (n = 229,857) included at least one DOI. 68,870 were identified as WP:MED, with 22.14% (n = 15,250) featuring one or more DOIs. WP:MED pages featured on average 8.88 DOI citations per page, whereas non-WP:MED pages had on average 4.28 DOI citations. For DOIs only on WP:MED pages, a DOI was referred every 2,283 pageviews and for non-WP:MED pages every 2,467 pageviews. DOIs from BOTH pages accounted for 12% (n = 58,475)."
(Compare also an ongoing research project by the Wikimedia Foundation about how readers use citations: m:Research:Characterizing Wikipedia Citation Usage)
From the abstract:[13] "[...] citations in Wikipedia and Scopus were compared for conference papers (and journal articles) published in 2011 in four engineering fields that value conferences. Wikipedia citations had correlations that were statistically significantly positive only in Computer Science Applications, whereas the correlations were not statistically significantly different from zero in Building & Construction Engineering, Industrial & Manufacturing Engineering and Software Engineering. Conference papers were less likely to be cited in Wikipedia than were journal articles in all fields, although the difference was minor in Software Engineering."
From the abstract:[14] "In the first model, peer-reviewed material is published in a journal and subsequently copied to Wikipedia under a compatible licence (typically Creative Commons). This produces new, high-quality articles and is easily consistent with current open access journal practices. A second, less common format is where material is first published in Wikipedia, then subjected to academic peer review before being published as a journal article. This model is also compatible with the recent practice of improving and peer-reviewing existing Wikipedia pages. A third model is where a journal requires authors to update Wikipedia as part of the journal's publication process. This allows content to be pitched at different levels for the journal and Wikipedia."
From the paper:[15]
Imagine, if you will, hosting a research party and inviting all of the major [research] databases. Everyone who's anyone would be there [e.g. JSTOR, ScienceDirect and LexisNexis. ...] Then Wikipedia shows up to this party and suddenly the room goes silent. Web of Science won't even make eye contact with him. "Who invited this imposter?" whispers one of the ProQuest databases. The agitation is almost tangible.
Even though he could easily mingle with any of the guests and has brought enough food and drinks for everyone, Wikipedia stands alone in the corner of the room. He's the most popular person in the world, yet no one is happy to see him at this research party. [...]
Wikipedia finally snaps and screams, "What did I do to deserve this? Why do you all hate me so much?" PsycINFO looks over and says, "You're a liar, Wikipedia! You're untrustworthy and lack integrity. You have 1,350 administrators, 6,000+ reviewers, and countless editors making you the poster child for dissociative identity disorder. Your presence soils our reputations in academia. [...]
This is the stigma of Wikipedia in the world of scholarly research [...]
The essay's author, a librarian at San Jose State University, concludes that this stigma "is strong and it will likely dominate the narrative for quite a while, but that stigma does not necessarily hold up against the findings regarding Wikipedia's accuracy and authority. More and more research is emerging that suggests otherwise. For content that is not politically charged or controversial, Wikipedia has proven to be as good as, if not better than, some of its peers."