A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
How do readers use health information on Wikipedia? A recent paper[1] explores this question using semi-structured interviews with 21 adults from seven countries. All participants had used Wikipedia for health information at least once in the previous year.
The research was qualitative in intent and all participants happened to have at least some post-secondary education, so the results are not necessarily representative of Wikipedia readers as a whole. Nevertheless, it gives a fascinating breadth of results. The whole paper is well worth reading – it's brief, digestible, and probably quite gratifying for Wikipedia volunteers. Some highlights:
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[2]
"We found that WPM [WikiProject Medicine] articles are longer, possess a greater density of external links, and are visited more often than other articles on Wikipedia. Readers of WPM articles are more likely to hover over and view footnotes than other readers, but are less likely to visit the hyperlinked sources in these footnotes. Our findings suggest that WPM readers appear to use links to external sources to verify and authorize Wikipedia content, rather than to examine the sources themselves."
From the abstract:[3]
"From 11,325 Wikipedia medical articles, we identified citations to 137,889 journal articles from over 15,000 journals. There was a large spike in the number of journal articles published in or after 2002 that were cited by Wikipedia. The higher the importance of a Wikipedia article, the higher was the mean number of journal citations it contained (top article, 48.13 [SD 33.67]; lowest article, 6.44 [SD 9.33]). ...We found evidence of “recentism,” which refers to preferential citation of recently published journal articles in Wikipedia. Traditional high-impact medical and multidisciplinary journals were extensively cited by Wikipedia, suggesting that Wikipedia medical articles have robust underpinnings. In keeping with the Wikipedia policy of citing reviews/secondary sources in preference to primary sources, the Cochrane Database of Systematic Reviews was the most referenced journal."
From the abstract:[4]
"This cross-publisher study (Taylor & Francis and University of Michigan Press) aimed to investigate [scholarly] author sentiment towards Wikipedia as a source of trusted information. [...] A short survey was distributed to 40,402 authors of papers cited in Wikipedia (n=21,854 surveys sent, n=750 complete responses received). The survey gathered responses from published authors in relation to their views on Wikipedia’s trustworthiness in relation to the citations to their published works. [...] Overall, authors expressed positive sentiment towards research citation in Wikipedia and researcher engagement practices (mean scores >7/10). Sub-analyses revealed significant differences in sentiment based on publication type (articles vs. books) and discipline (Humanities and Social Sciences vs. Science, Technology, and Medicine), but not access status (open vs. closed access)."
From the "Discussion" section:
"Our results suggest there is general trust among researchers in Wikipedia both in terms of representativeness and accuracy. Most would also recommend the Wikipedia page where their work is cited to a colleague or the general public."
On May 8, two of the paper's authors (from Taylor & Francis and British technology company Digital Science) will present this research at two free webinars, while also giving "a sneak peek at an upcoming collaboration between Wikipedia, Digital Science and Taylor & Francis," together with a Wikimedia Foundation representative.
From the abstract:[5]
"[...] We quantify the cross-lingual patterns of the perennial sources list, a collection of reliability labels for web domains identified and collaboratively agreed upon by Wikipedia editors. We discover that some sources (or web domains) deemed untrustworthy in one language (i.e., English) continue to appear in articles in other languages. This trend is especially evident with sources tailored for smaller communities. Furthermore, non-authoritative sources found in the English version of a page tend to persist in other language versions of that page. We finally present a case study on the Chinese, Russian, and Swedish Wikipedias to demonstrate a discrepancy in reference reliability across cultures. Our finding highlights future challenges in coordinating global knowledge on source reliability."
From the paper:
"To investigate the spread of English Wikipedia's perennial sources across multiple language editions, we identify the proportion of articles in each edition that include at least one reference to these sources. Figure 1 shows the percentage of articles referencing reliable and non-authoritative sources in the 40 editions with the largest number of articles. [...] The plot shows outliers in the two directions of the confidence interval represented by the gray area. On the one hand, the English edition is located below the confidence interval, meaning the proportion of articles citing reliable domains is larger. This observation is consistent with recent research [...], as the community of English Wikipedia is more aware of the non-authoritative domains listed in the local perennial sources list. On the other hand, the outliers above the confidence interval appear to have a relatively larger proportion of articles citing deprecated or blacklisted domains. These are Russian (ru), Armenian (hy), Chinese (zh), French (fr), and Bulgarian (bg)."
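The per-edition measure described in this passage (the proportion of articles with at least one reference to a listed source) could be computed roughly as follows. This is only a minimal sketch, not the authors' code: the domain list, the sample data, and the helper names `domain` and `share_citing` are hypothetical, and it assumes each article's references are already available as a list of URLs.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the non-authoritative entries of a perennial sources list.
NON_AUTHORITATIVE = {"example-tabloid.com"}


def domain(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def share_citing(articles, flagged):
    """Share of articles with at least one reference to a flagged domain.

    `articles` maps an article title to its list of reference URLs.
    """
    if not articles:
        return 0.0
    hits = sum(
        1
        for refs in articles.values()
        if any(domain(u) in flagged for u in refs)
    )
    return hits / len(articles)


# Made-up sample data for two language editions.
sample = {
    "ru": {
        "A": ["https://www.example-tabloid.com/x"],
        "B": ["https://example.org/y"],
    },
    "sv": {
        "C": ["https://example.org/z"],
        "D": [],
    },
}

shares = {lang: share_citing(arts, NON_AUTHORITATIVE) for lang, arts in sample.items()}
print(shares)
```

An article counts once no matter how many flagged references it contains, which matches the "at least one reference" wording of the quoted passage.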
"Figure 3: Top 15 non-authoritative sources (from the perennial source list of the local Wikipedia edition or the one of English Wikipedia) by the number of citations in Russian, Swedish, and Chinese Wikipedia editions":
From this post on the blog of the University of Geneva's Confucius Institute:[6]
"For English Wikipedia, we accessed the 'Reliable sources/Perennial sources' page and extracted the list of reliable and controversial sources. Similarly, for Chinese Wikipedia, we accessed the equivalent page containing source reliability information list. [...] in our quantitative analysis, differences in the diversity and number of sources suggest that English Wikipedia may have access to a wider range of sources, whereas Chinese Wikipedia seems to be more selective or restricted in its choice of sources. Due to the existence of [the] “无共识” (no consensus) label, the rating of reliable sources in Chinese Wikipedia is more ambiguous than in the English version."
From the abstract:[7]
"We investigate gender- and country-based biases in Wikipedia citation practices using linked data from the Web of Science and a Wikipedia citation dataset. Using coarsened exact matching, we show that publications by women are cited less by Wikipedia than expected, and publications by women are less likely to be cited than those by men. Scholarly publications by authors affiliated with non-Anglosphere countries are also disadvantaged in getting cited by Wikipedia, compared with those by authors affiliated with Anglosphere countries. The level of gender- or country-based inequalities varies by research field, and the gender-country intersectional bias is prominent in math-intensive STEM fields."
See also a presentation at the Wikimedia Research Showcase.
From the abstract:[8]
"Citation Worthiness Detection (CWD) consists in determining which sentences, within an article or collection, should be backed up with a citation to validate the information it provides. This study, introduces ALPET, a framework combining Active Learning (AL) and Pattern-Exploiting Training (PET), to enhance CWD for languages with limited data resources. Applied to Catalan, Basque, and Albanian Wikipedia datasets, ALPET outperforms the existing CCW baseline while reducing the amount of labeled data in some cases above 80%. ALPET's performance plateaus after 300 labeled samples, showing it suitability for low-resource scenarios where large, labeled datasets are not common. [...] Overall, ALPET's ability to achieve high performance with fewer labeled samples makes it a promising tool for enhancing the verifiability of online content in low-resource language settings."
From the abstract:[9]
"To date, research on automating citation worthiness detection has largely focused on the most resourceful language, English Wikipedia, neglecting the applicability to smaller Wikipedias. In addition, previous research proposed models that analyze the content inherent to a sentence to determine its citation worthiness, overlooking the potential of additional context to improve the prediction. Addressing these gaps, our study proposes a transformer-based contextualized approach for smaller Wikipedias, presenting a novel method to compile high-quality datasets for the Albanian, Basque, and Catalan editions. We develop the Contextualized Citation Worthiness (CCW) model, employing sentence representations enriched with adjacent sentences and topic categories for enhanced contextual insight. Empirical experiments on three newly created datasets demonstrate significant performance improvements of our contextualized CCW model, with 6%, 3% and 6% absolute improvements over the baseline for Albanian, Basque and Catalan datasets, respectively. [...] This has implications for supporting Wikipedia projects across low-resource languages, promoting better article validation and fact-checking."
From the abstract:[10]
"This study examines Wikipedia’s role in promoting and preserving Setswana and Punjabi. The research is framed by the Ethnolinguistic Vitality Theory (EVT), which suggests that language survival lies in reclamation, revitalization, and reinvigoration. A quanti-qualitative approach is used to investigate the issue, integrating quantitative metrics from Wikipedia’s statistical pages with qualitative content analysis of the articles. Data were collected from May 2022 to May 2024, focusing on article counts, edits, active editors, new pages, top edited pages, and views. [...] The findings show that Punjabi Wikipedia has a much larger content volume and user base, but comparatively lower recent activity and collaborative depth compared to Setswana Wikipedia. (Setswana) Tswana Wikipedia, while smaller in content volume, demonstrates a more engaged and active editing community, reflected by a higher depth score and a larger number of active users."
From the abstract:[11]
"This study investigates the dynamics of public risk awareness in the aftermath of the Charlie Hebdo terrorist attack on January 7, 2015, through a dual-focus analysis of Wikipedia traffic and Google Trends data. Analyzing the temporal patterns of Wikipedia page views in both English and French, sheds light on how significant media events, anniversaries, and related incidents influence public engagement with terrorism-related content over time. [...] Francophone regions, particularly France and its former colonies, exhibit a more sustained and consistent interest in the Charlie Hebdo event compared to Anglophone regions. The heightened engagement in French-speaking areas suggests that cultural and historical ties influence public risk perception and awareness."
From the abstract:[12]
"[...] we present a comprehensive, multilingual dataset capturing all Wikipedia mentions and links shared in posts and comments on Reddit 2020-2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits."
See also:
From the "Conclusion" section:[13]
"Over the course of this dissertation, I have shown how the infrastructure that constitutes Wikipedia, made out of various connected digital artefacts, does more than embedding values. It cocreates them – on one hand, by welcoming or resisting intervention, and by being a site of ideological negotiation; on the other hand, by suggesting, implicitly, what values are important, what constitutes a moral good in the first place. Beyond affording intervention to humans – being a substrate or a tool for ethical and epistemic meaning making – Wikipedia's platform offers up technical values to be turned into social, epistemic, aesthetic values. This is true, for instance, of forkability: as I have shown, forkability in its original formulation informed design because of its practical advantages, concerning safety and distribution of code. Forkability then, through the community that created Wikipedia, became an epistemic value as well.
Programming practice is a key component of Wikipedia's culture, [...] the concrete circumstances in which coders have worked define the way Wikipedia produces knowledge. [...]
A side-effect of the partial overlap between programming and creating Wikipedia's content is that the flavour of Wikipedia’s community matches cultural traits of hacker culture. The effect of this phenomenon is two-fold. First, Wikipedia inherited assumptions found in hacker culture – downplaying the role of the body, faith in machinery, anti-aesthetic leanings, connecting intelligence and skill with the ability to code. Secondly, because of the connection between taste and belonging to specific communities, Wikipedia can be an uncomfortable space for those who don’t participate in hacker culture."