A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
The author of this 2021 study[1] looks at Wikipedia projects in several European minority, regional and endangered (MRE) languages. His main focus is on Low German (Plattdeutsch, a minority language spoken in northern Germany), but he also considers other European languages: Occitan, Piedmontese, Sardinian, Kashubian and Ladin.
The introduction provides an overview of the history of Low German, once a lingua franca in Northern Europe, and of its erosion and loss of popularity up to its addition to the EU list of Endangered Languages. The author mentions the recent generational divide and the role of new media, including the internet and, of course, Wikipedia. He reviews studies of the online presence of MRE languages, including a notable 2013 study that divides the digital presence of languages into four groups: thriving, vital, heritage and still.[supp 1]
The paper then compares the minority languages on Wikipedia by their rank in number of articles, number of administrators and number of active users. It notes that the presence of a Wikipedia project is a positive achievement in itself, considering that the rules for creating a new language project are described as fairly complex, and that many of the world’s languages are not represented on Wikipedia. The author also notes that, at least as of 2021, small, dedicated groups of speakers/users are mostly responsible for maintaining the languages' presence on Wikipedia.
The author then searches for fifty common terms in each language, doing a word count for each entry where one exists (a search for 400 Wikipedia entries in total). This part of the study was carried out between November 2016 and January 2017, with the data reexamined in March 2021. The author does not explain in much detail how he chose the 50 common words. This part is somewhat comparable to a recent study by Lewoniewski et al.,[2] although in that case the authors explained their choice of words more transparently and were looking at the quality of articles.
Interestingly, the author notes:
"The brevity of many of the articles and the paucity of information may cause more problems than benefits, such as perceptions that these languages are unrefined, unsophisticated, and second-rate. The voluntary work of the respective Wikipedia crews can, of course, not be faulted for this dilemma. Rather, these numbers underscore the fact that languages with fewer speakers do not have a good or even fair chance of succeeding on Wikipedia, which in turn negatively affects a meaningful online presence. These results, however, must be considered preliminary, and a study larger in scope will be needed to genuinely validate it."
The paper also provides an intriguing case study for changes in Low German Wikipedia:
“Something unexpected happened with the Low German Wikipedia edition in the course of this study. At the beginning of this study, in January 2018, Low German had 27,342 articles, 4 administrators, and fifty-two active users. In terms of total number of entries, it ranked at number one hundred. In March 2021, the Low German Wikipedia had 85,467 entries and had climbed to number seventy-five. It still had four administrators in March 2021 and seventy active users. This means that the number of articles in the Low German Wikipedia edition more than tripled in three years, while the volunteer staff only slightly increased. What are the reasons for this remarkable increase in articles? The answer is baffling, as many of the newly added articles may not have been authored by humans.”
As table 6 of the study shows, over half of the articles in the languages examined, with the exception of Ladin, are bot-generated. This is in contrast to large Wikipedias such as English or German, which use many bots, but less frequently for content creation. While bots can be useful for generating stub articles with basic information, these entries tend to be repetitive and have very few or no references.
The author offers an interesting critique of how far the digital world achieves, or fails to achieve, greater equality and presence for marginalised groups and minority languages. Apart from digital limitations, the paper reminds its readers that the barriers to the representation of MRE languages can extend past linguistics into the cultural sphere, as many of these languages are “embedded in a profound oral tradition, which often includes the lack of a common orthography. This means that not only Wikipedia but also the Internet in general is not really an adequate medium for communication in these languages.”
After writing this review, I was informed that Book Publisher International (the company which published the volume in which this study is a chapter) is considered a predatory publisher (thank you to the editor who pointed it out!). Considering that the author has also published on this subject in reputable venues,[3] I think there is still merit in this small yet interesting study. The author himself was careful to underline that the results must be understood as samples and snapshots rather than definitive conclusions. Still, the findings raise important questions about the future of minority languages online. It would be interesting to see a follow-up study, although understandably, as the author points out, that would require more resources and funding. It would be equally intriguing to see a comparison of how the languages have fared since the study was conducted and published.
This paper[4] makes a case for using Wikidata in botany, highlighting its benefits and multiple application opportunities in this field. The authors are researchers and Wikidata editors, and the publication comes after a workshop and poster presentation during the International Botanical Congress (IBC) in Spain in 2024.
The paper starts with a narrative about Oxalis psoraleoides (a species of flowering plant), accompanied by a knowledge graph, demonstrating how many elements mentioned in the botanical story (such as collections, species, places, explorers) are interlinked on Wikidata and can be visualised there.
To quote from the paper:
"(...) there is a huge amount of botany-related information that has been published over centuries that contains hidden connections between such entities. Much of this textual information is made available on the internet in a digital format. This information is usually unstructured, and hence is siloed, lacking in context, and not interoperable. In addition, information in different biodiversity databases as well as digital libraries is often not linked. (...)
Publishing information in Wikidata ensures it is findable, able to be accessed, interoperable (the structure follows a documented standard), and obstacle-free with regard to reuse (licensing), thus progressing towards achieving compliance with FAIR principles (Findability, Accessibility, Interoperability, and Reuse of digital assets, [Q29032644], Wilkinson et al., 2016[supp 2])"
This is followed by the basics of Wikidata, and a comprehensive list of various botany-related types of data in Wikidata, with a detailed explanation for each of them – from people, to taxa, publication, institutions, collections, expeditions, and more.
The authors describe several tools relevant to botanists that visualise Wikidata and use Wikidata QIDs in websites or catalogues. The paper concludes with practical examples of how the botanical community can use Wikidata to its advantage, and with a Wikidata call to action closely tied to the Madrid Declaration, "collectively published by the congress participants at the end of the IBC and is aimed at botanists, institutions and citizens to 'strengthen the connection between plants and people, nurture mutual benefits, and enhance planetary health and resilience' (XX International Botanical Congress, 2024)."
The article is accompanied by the original research poster and other interesting graphs. This story is also summarised in an accessible blog post: The power and potential of Wikidata for botany. In my view, it would be really interesting to see similar Wikidata overviews for other disciplines.
The Wikimedia Foundation has published a draft document titled "Guidance for NPOV Research on Wikipedia". Besides general explainers about Neutral Point of View as a core Wikipedia policy, the document also appears to attempt to address some fallacies in prior studies of biases on Wikipedia, e.g. by asking researchers to "Distinguish between bias in sources on Wikipedia vs. bias in sources outside of Wikipedia", and suggesting other ways to "make rigorous assessments of Wikipedia's adherence to NPOV". The Foundation also solicited feedback on its guidelines draft from researchers and community members until August 31.
The document appears to be related to the "Common global standards for NPOV policies" working group launched by the Foundation earlier this year (Signpost coverage: "WMF to explore 'common standards' for NPOV policies; implications for project autonomy remain unclear"), which itself recently published an "Analysis of Neutral Point of View Policies across Wikipedias".
In "The Conversation", researcher Heather Ford raised concerns about the Foundation's "new rules":
[...] instead of supporting open inquiry, the guidelines reveal just how unaware the Wikimedia Foundation is of its own influence.
These new rules tell researchers – some based in universities, some at non-profit organisations or elsewhere – not just how to study Wikipedia’s neutrality, but what they should study and how to interpret their results. That’s a worrying move.
As someone who has researched Wikipedia for more than 15 years – and served on the Wikimedia Foundation’s own Advisory Board before that – I’m concerned these guidelines could discourage truly independent research into one of the world’s most powerful repositories of knowledge.
[...] the Wikimedia Foundation has lots of control over research on Wikipedia. It decides who it will work with, who gets funding, whose work to promote, and who gets access to internal data. That means it can quietly influence which research gets done – and which doesn’t.
Now the foundation is setting the terms for how neutrality should be studied.
Ford is also part of a group of researchers who recently published a manifesto and commentary calling for "Uniting and reigniting critical Wikimedia research" (see our previous coverage), which proposes "Examine power relations" as one of several research focus areas.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
Translated from the abstract:[5]
Previous studies have mainly positioned Wikipedia as a "meta-media", "reference book" or "social media", which seem unable to explain the reasons behind the continuous expansion of the Wikimedia project. Our research notes that Wikipedia originated from an open-source culture, and its development process can be traced back to the concept of the rights revolution. By evolving from Wikipedia to projects such as Wikidata and Wikimedia Commons, as well as holding activities like edit-a-thons, Wikimedia has transformed from an internet encyclopedia to a global social movement. The Wikimedia movement can be regarded as an open-source knowledge movement without boundaries, of which the action process is fully transparent throughout the domain, the action subjects heterogeneous and integrated, and the forms of interaction competitive and cooperative. Viewing Wikipedia as a social movement and an open-source community helps to further understand the logic of global open-source knowledge dissemination.
FYI: One of the co-authors, Gan Lihao (a professor of communication at East China Normal University), recently led his students in completing a book titled Wikipedia Politics and presented a report on it at Wikimania 2025. Gan was mentioned in a 2019 BBC article. In this compiler’s opinion, the BBC misinterpreted the Chinese scholar’s words, which followed a manner of expression specific to the Chinese context.
From the English-language abstract of this Chinese-language paper:[6]
With the increasing globalization of Chinese martial arts, diverse forms and levels of documentation have emerged worldwide. This study examines the collaborative folk writing of Chinese martial arts in English-language contexts by analyzing the development trajectory, contributor demographics, writing practices, and negotiations/competitions observed in the Wikipedia entry "Chinese Martial Arts". The research reveals that the entry's evolution is characterized by continuous growth and refinement, yet remains an ongoing, "unfinished" process. While the writing inherently exhibits distinct international and collective traits, the emergence of core contributor groups and Wikipedia’s editorial protocols dominate the negotiation and competition among diverse perspectives, ultimately shaping the entry's narrative direction and textual representation. Notably, the study identifies a striking absence of Chinese voices in this collaborative writing process. It emphasizes the urgent need to integrate Chinese perspectives and knowledge into the global dissemination of martial arts discourse to bridge this representational gap.
From the abstract:[7]
"Citation Worthiness Detection (CWD) consists in determining which sentences, within an article or collection, should be backed up with a citation to validate the information it provides. This study, introduces ALPET, a framework combining Active Learning (AL) and Pattern-Exploiting Training (PET), to enhance CWD for languages with limited data resources. Applied to Catalan, Basque, and Albanian Wikipedia datasets, ALPET outperforms the existing CCW baseline while reducing the amount of labeled data in some cases above 80%."
From the abstract:[8]
"Despite [Wikipedia's] widespread use, significant disparities persist among language publications, including variations in the number of articles, the spectrum of topics covered, and even the number of contributing community editors. In this paper, we aim to alleviate this gap in the coverage of low-resource languages. Although previous work has focused on multilingual interoperability efforts, the potential of hyperlinks has not been fully realized. Therefore, this study introduces a novel approach focused on hyperlinks, specifically emphasizing hyperlink types derived from Wikidata. We extract and analyze patterns related to these hyperlink types across different languages, using them as recommended solutions to connect the topics of various languages, particularly low-resource languages"
By "hyperlink type", the authors refer to the Wikidata topic that a Wikipedia article is an "instance of", via Wikidata property P31. From the paper:
[...] our research is carried out in a case study involving the English (en), Japanese (ja), and Vietnamese (vi) [Wikipedia] languages [...]
[... W]e notice a significant preference for topics such as film, automobile models, music groups, singles, and video games in English editors. In Japanese articles, hyperlinks emphasize various aspects of Japan, including city, town, chōchō, municipality, railway station, and manga series. In contrast, the Vietnamese context focuses primarily on topics such as world war, sovereign state, chemical compound, and organization.
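To make the notion of "hyperlink type" more concrete, here is a minimal Python sketch of the kind of tally described above. It is not the authors' code: the article titles, the link list and the function name `hyperlink_type_counts` are hypothetical, the P31 labels are hard-coded toy data rather than values fetched from Wikidata, and it assumes that a link's type is the "instance of" value of the article it points to.

```python
from collections import Counter

# Hypothetical toy data: article title -> Wikidata "instance of" (P31) label.
# Real values would come from Wikidata; these are made up for illustration.
INSTANCE_OF = {
    "Spirited Away": "film",
    "Toyota Corolla": "automobile model",
    "Shibuya Station": "railway station",
    "One Piece": "manga series",
    "Water": "chemical compound",
}

# Hypothetical hyperlinks as (language edition, source article, target article).
LINKS = [
    ("en", "Studio Ghibli", "Spirited Away"),
    ("en", "Toyota", "Toyota Corolla"),
    ("ja", "Tokyo", "Shibuya Station"),
    ("ja", "Shonen Jump", "One Piece"),
    ("ja", "Studio Ghibli", "Spirited Away"),
    ("vi", "Hydrogen", "Water"),
    ("vi", "Oxygen", "Water"),
]


def hyperlink_type_counts(links, instance_of):
    """Count, per language edition, how often each P31 type is linked to."""
    counts = {}
    for lang, _source, target in links:
        link_type = instance_of.get(target)
        if link_type is None:
            continue  # target has no known P31 value in the toy data; skip it
        counts.setdefault(lang, Counter())[link_type] += 1
    return counts


counts = hyperlink_type_counts(LINKS, INSTANCE_OF)
for lang, counter in sorted(counts.items()):
    print(lang, counter.most_common())
```

Comparing the most frequent types per language edition is the sort of pattern the quoted passage summarises; a type that is common in one edition but rare in another could then be read as a hint at a coverage gap.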
From the abstract:[9]
"[Wikipedia] content in low-resource languages [is] frequently outdated or incomplete. Recent research has sought to improve cross-language synchronization of Wikipedia tables using rule-based methods. These approaches can be effective, but they struggle with complexity and generalization.This paper explores large language models (LLMs) for multilingual information synchronization, using zero-shot prompting as a scalable solution. We introduce the Information Updation dataset, simulating the real-world process of updating outdated Wikipedia tables, and evaluate LLM performance. [...] Our proposed method outperforms existing baselines, particularly in Information Updation (1.79%) and Information Addition (20.58%)"