A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
The impact of the re-use of Wikipedia content is a huge question facing Wikimedia,[supp 1] but a large obstacle to understanding the impact of Wikipedia content that appears outside of the Wikipedia ecosystem is our ability to detect where this occurs. In this ECIR paper, titled "Wikipedia Text Reuse: Within and Without",[1] Alshomary et al. apply text-reuse detection algorithms to Wikipedia and the Common Crawl in order to identify text re-use within Wikipedia (e.g., content copied between articles) and between Wikipedia and external sources. This has a number of analytical challenges: how to detect matches, how to define what is and is not re-use, and how to extract appropriate blocks of text that might be re-used given that the matches generally are not exact. They examine several approaches to efficiently solving these problems (e.g., locality-sensitive hashes, adapting approaches developed for plagiarism-detection).
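To make the idea of hash-based matching concrete, here is a minimal sketch of near-duplicate detection with MinHash signatures and locality-sensitive banding. It is not the authors' pipeline: the function names, the 3-word shingle size, the 64 hash functions and the 16 bands are all illustrative assumptions, chosen only to show why inexact matches can still be found without comparing every pair of passages word by word.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (k is an assumed choice)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """Keep, for each seeded hash function, the smallest hash value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signature, bands=16):
    """Split the signature into bands; two texts sharing any band become a candidate pair."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


def is_candidate_pair(text_a, text_b):
    """Hypothetical helper: True if the two texts collide in at least one LSH bucket."""
    buckets_a = lsh_buckets(minhash_signature(shingles(text_a)))
    buckets_b = lsh_buckets(minhash_signature(shingles(text_b)))
    return bool(buckets_a & buckets_b)
```

In a setup like this, only passages that land in a shared bucket would be passed on to a slower, more precise alignment step. Deciding where a re-used block starts and ends, and whether it counts as re-use at all, remains a separate problem, as the paper discusses.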
The analysis distinguishes between two types of re-use: structural and content. Structural re-use is akin to a template, where the same sentence structure is re-used but the words are changed. Content re-use is when the sentence may change, but the same content appears in two different articles. Table 2 in the paper offers an excellent comparison of the two with examples, and Section 4 discusses the implications of each type of re-use. Finally, the paper closes with an estimate of the revenue generated by the re-used content, based on the calculations of external content re-use and assumptions about advertising revenue: 5.5 million USD per month, as a conservative lower bound.
This analysis raises a number of interesting questions that I think would have to be answered before reaching conclusions about how to fully interpret the results. The first is the directionality of the re-use: are Wikipedia pages re-using content from external websites or vice versa? The second is the context in which re-use is happening: to what degree is it bots working with a pre-defined template vs. users copying over content that has been generated by others? To the authors' credit, they have committed to releasing the re-use datasets generated for the paper. These datasets will hopefully support further analysis and attention to the broader question of the impact of re-use.
What sort of questions can we ask about Wikipedia? David Moats[2] provides a timely and necessary reflection on this question and how it impacts research. Briefly stated, he argues that when it comes to the study of online platforms, researchers often pose questions that can be answered by analyzing data that is already structured by the platforms themselves. For example, scraping data from a website or following the trail of hashtags only makes sense for very particular kinds of research questions. This is a cause for concern, since not all research is inclined towards the structures and logics that these data assume.
To make his point, he reviews the common approach of Actor-Network Theory within Science and Technology Studies (STS). Under this theoretical and methodological frame, the researcher is encouraged to "follow the actors," or to take note only of those individuals that enact a significant change. The benefit of this approach has long been understood by its capacity to recognize how social and technical actors are entangled with one another. In other words, the distinction between a human or a non-human has little methodological purchase. What matters is which actor/actant is performing an action.
But, as the author argues, the actors that have the ability to articulate their socio-technical networks gain this ability through uneven power relations. Therefore, if the purpose is to study how platforms articulate socio-technical relationships, then there is a host of interactions that are missing from this platform-sanctioned data. In particular, if a researcher is interested in understanding the silences and resistances with platforms, then other methods of investigation will need to be deployed. As Moats aptly summarizes, "the dictum to follow the actors might be in conflict with analogous strategies to follow the medium" (p.3).
To demonstrate how research can be sensitive to these concerns, Moats provides his case study of the controversies that arose on Wikipedia's article about the Fukushima disaster. The first thing to note about his approach is his use of mixed methods: a content analysis of article page size over time; a content analysis of website domains used in the reference section; and a discourse analysis of the talk page discussions of the validity of sources. But even this combination of qualitative and quantitative research requires additional support. In this regard, he made the conscious decision to include data that was messy, inconvenient, unformatted and time-consuming to parse. In his words, this kind of data would normally be understood as an impediment to "our ability to follow the actors" (p.21). But in support of his argument, this kind of data contributes to, rather than takes away from, our ability to understand social phenomena. Both in the topical sense of providing a more nuanced understanding of how controversies occur on Wikipedia and in the theoretical sense of pushing for reflexive research, Moats's article should be a consistent citation for Wikipedia researchers.
See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.
From the abstract:[3] "we use the Focus Theory to examine interactions between several sources of normative influence in a Wikipedia sub-community: local descriptive norms, local injunctive norms, and norms imported from similar sub-communities. We find that exposure to injunctive norms has a stronger effect than descriptive norms, that the likelihood of performing a behavior is higher when both injunctive and descriptive norms are congruent, and that conflicting social norms may negatively impact pro-normative behavior." (See also research showcase presentation)
From the abstract:[4] "Drawing insights from attachment theories in social psychology, we examine two types of pre-joining connections: identity-based attachment (how much members' interests were aligned with the subgroup's topics) and bonds-based attachment (how much members had interacted with other members of the subgroup). Analyses of 79,704 editors in 1,341 WikiProjects show that 1) both identity-based and bonds-based attachment increased editors' post-joining productivity and reduced their likelihood of withdrawal; 2) identity-based attachment had a stronger effect on boosting direct contributions to articles while bonds-based attachment had a stronger effect on increasing article and project coordination, and reducing member withdrawal." (See also GroupLens blog post: "Your feelings of connecting to a group can predict your future behavior")
From the abstract:[5] "This article considers the extent to which non-legal factors (nationality, activity/experience, conflict avoidance, and time constraints) affect decision making within collegiate courts, through the study of the Wikipedia’s Arbitration Committee. [...] This study shows that the decision-making process of this body seems mostly unaffected by the demographic factors studied and the acclimatization bias. Some evidence of conflict avoidance is found. Despite the professed equality of members of the Committee, there is clear evidence that some are much more active (and thus, influential) than others."
From the abstract:[6] "... the question of when and how [internet-based approaches to disease monitoring] work remains open. We addressed this question using Wikipedia access logs and category links. Our experiments, replicable and extensible using our open source code and data, test the effect of semantic article filtering, amount of training data, forecast horizon, and model staleness by comparing across 6 diseases and 4 countries using thousands of individual models. We found that our minimal-configuration, language-agnostic article selection process based on semantic relatedness is effective for improving predictions [...]. We also found, in contrast to prior work, very little forecasting value ..." (See also earlier: "Two new papers on disease forecasting using Wikipedia")
From the abstract:[7] "We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. [...] We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles." (see also media coverage)
From the (preprint) abstract:[8] "... we propose the use of deep learning to detect vandals based on their edit history. In particular, we develop a multi-source long-short term memory network (M-LSTM) to model user behaviors by using a variety of user edit aspects as inputs, including the history of edit reversion information, edit page titles and categories. With M-LSTM, we can encode each user into a low dimensional real vector, called user embedding. [...] we can predict whether a user is benign or vandal dynamically based on the up-to-date user embedding. Furthermore, those user embeddings are crucial to discover collaborative vandals."
From the abstract:[9] "This study simulates a patient search for online educational material about gender affirming surgery and evaluates the accessibility, readability, and quality of the information [including the English Wikipedia's article sex reassignment surgery]. Readability was assessed using 10 established tests: Coleman–Liau, Flesch–Kincaid, FORCAST, Fry, Gunning Fog, New Dale-Chall, New Fog Count, Raygor Estimate, Simple Measure of Gobbledygook, and Flesch Reading Ease. Quality was assessed using Journal of the American Medical Association criteria and the DISCERN instrument. ... All articles and Web sites exceeded the recommended sixth grade level."
From the abstract:[10] "WikiTopReader, a reader of Wikipedia page rank, lets users explore connections among top-viewed pages by connecting page-rank behaviors with page-link relations. Such a combination enhances the unweighted Wikipedia page-link network and focuses attention on the page of interest."
From the abstract:[11] "... we first introduce a formal model of controversy as the basis of computational approaches to detecting controversial concepts. Then we propose a classification based method for automatic detection of controversial articles and categories in Wikipedia. Next, we demonstrate how to use the obtained results for the estimation of the controversy level of search queries. [...] The method is independent of the search engine’s retrieval and search results recommendation algorithms, and is therefore unaffected by a possible filter bubble. Our approach can be also applied in Wikipedia or other knowledge bases for supporting the detection of controversy and content maintenance." (includes rating data from the article feedback tool)
From the paper:[12] "... we also evaluated our method on Wikidata, where less is known about our assumptions"
From the abstract:[13] "A lightweight method distinguishes articles within Wikipedia that are classes ('Novel', 'Book') from other articles ('Three Men in a Boat', 'Diary of a Pilgrimage'). It exploits clues available within the article text and within categories associated with articles in Wikipedia, while not requiring any linguistic preprocessing tools."
From the abstract:[14] "We assume that the collective interest of a language-speaking community to document their events, people and any feature important for them, by the online encyclopedia Wikipedia, can act as a footprint of the whole group’s collective identity. ... We, then, report results about the number of edits, editors, and pages into categories, displayed by the several languages."
From the abstract:[15] "... we highlight some of the most salient aspects of human-bot collaboration in Wikidata. We argue that the combination of automated and semi-automated work produces new challenges with respect to other online collaboration platforms."
 