The Signpost

Recent research

Language bias: Wikipedia captures at least the "silhouette of the elephant", unlike ChatGPT

By Tilman Bayer


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


"A Perspectival Mirror of the Elephant: Investigating Language Bias on Google, ChatGPT, Wikipedia, and YouTube"

[Image: The blind men and the elephant (wall relief in Northeast Thailand)]

This arXiv preprint[1] (which according to the authors grew out of a student project for a course titled "Critical Thinking in Data Science" at Harvard University) finds that

[...] Google and its most prominent returned results – Wikipedia and YouTube – simply reflect the narrow set of cultural stereotypes tied to the search language for complex topics like "Buddhism," "Liberalism," "colonization," "Iran" and "America." Simply stated, they present, to varying degrees, distinct information across the same search in different languages (we call it 'language bias'). Instead of presenting a global picture of a complex topic, our online searches turn us into the proverbial blind person touching a small portion of an elephant, ignorant of the existence of other cultural perspectives.

Regarding Wikipedia, the authors note it "is an encyclopedia that provides summaries of knowledge and is written from a neutral point of view", concluding that

[...] even though the tones of voice and views do not differ much in Wikipedia articles across languages, topic coverage in Wikipedia articles tends to be directed by the dominant intellectual traditions and camps across different language communities, i.e., a French Wikipedia article focuses on French thinkers, and a Chinese article stresses on Chinese intellectual movements. Wikipedia’s fundamental principles or objectives filter language bias, making it heavily rely on intellectual and academic traditions.

While the authors employ some quantitative methods to study the bias on the other three sites (particularly Google), the Wikipedia part of the paper is almost entirely qualitative in nature. It focuses on an in-depth comparison of a small set of (quite apparently non-randomly chosen) article topics across languages, not unlike various earlier studies of language bias on Wikipedia (e.g. on the coverage of the Holocaust in different languages, see our previous coverage here and here). Unfortunately, the paper fails to cite such earlier research (which has also included quantitative results, such as those represented in the "Wikipedia Diversity Observatory", which among other things includes data on topic coverage across 300+ Wikipedia languages) – despite asserting "there has been a lack of investigation into language bias on platforms such as Google, ChatGPT, Wikipedia, and YouTube".

The first and largest part of the paper's examination of Wikipedia's coverage concerns articles about Buddhism and various subtopics, in the English, French, German, Vietnamese, Chinese, Thai, and Nepali Wikipedias. The authors indicate that they chose this topic starting out from the observation that

To Westerners, Buddhism is generally associated with spirituality, meditation, and philosophy, but people who primarily come from a Vietnamese background might see Buddhism as closely tied to the lunar calendar, holidays, mother god worship, and capable of bringing good luck. One from a Thai culture might regard Buddhism as a canopy against demons, while a Nepali might see Buddhism as a protector to destroy bad karma and defilements.

Somewhat in confirmation of this hypothesis, they find that

Compared to Google’s language bias, we find that Wikipedia articles' content titles mainly differ in topic coverage but not much in tones of voice. The preferences of topics tend to correlate with the dominant intellectual traditions and camps in different language communities.

However, the authors also observe that "randomness is involved to some degree in terms of topic coverage on Wikipedia", defying overly simplistic predictions of biases based on intellectual traditions. For example:

Looking at the Chinese article on "Buddhism", it addresses topics like "dharma name", "cloth", and "hairstyle" that do not exist on other languages' pages. There are several potential causes for its special treatment on these issues. First, many Buddhist texts, such as the Lankavatara Sutra (楞伽经) and Vinaya Piṭaka (律藏), that address these issues were translated into Chinese during medieval China, and these texts are still widely circulated in China today. Second, according to the official central Chinese government statistics, there are over 33,000 monasteries in China, so people who are interested in writing Wikipedia articles might think it is helpful to address these issues on Wikipedia. However, like the pattern in the French article, Vietnam, Thailand, and Nepal all have millions of Buddhist practitioners, and the Lankavatara Sutra and Vinaya Piṭaka are also widely circulated among South Asian Buddhist traditions, but their Wikipedia pages do not address these issues like the Chinese article.

A second, shorter section focuses on comparing Wikipedia articles on liberalism and Marxism across languages. Among other things, it observes that the "English article has a long section on Keynesian economics", likely due to its prominent role in the New Deal reforms in the US in the 1930s. In contrast,

In the French article on liberalism, the focus is not solely on the modern interpretation of the term but rather on its historical roots and development. It traces its origins from antiquity to the Renaissance period, with a focus on French history. It also highlights the works of French theorists such as Montesquieu and Tocqueville [...]. The Italian article has a lengthy section on "Liberalism and Christianity" because liberalism can be seen as a threat to the catholic church. Hebrew has a section discussing the Zionist movement in Israel. The German article is much shorter than the French, Italian, and Hebrew ones. Due to Germany's loss in WWII, its post-WWII state was a liberal state and was occupied by the Allied forces consisting of troops from the U.S., U.K., France, and the Soviet Union. This might have influenced Germany's perception and approach to liberalism.

Among other proposals for reducing language bias on the four sites, the paper proposes that

[Wikipedia] could potentially invite scholars to contribute articles in other languages to improve the multilingual coverage of the site. Additionally, Wikipedia could merge non-overlapping sections of articles on the same term but written in different languages into a single article, like how multiple branches of code are merged on GitHub. Like Google, Wikipedia could translate the newly inserted paragraphs into the user’s target language and place a tag to indicate its source language.
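
The paper does not spell out how such a merge would work. As a rough illustration only, the following Python sketch lists the sections that a target-language article lacks and tags each with the language it would come from. The function name and the toy data are hypothetical, and the sketch assumes that section titles have already been aligned across languages, which in practice would be the hard part.

```python
def missing_sections(articles, target):
    """Return (section, source_language) pairs absent from the target article.

    `articles` maps a language code to a list of section titles that are
    assumed to be already aligned (translated) across languages.
    """
    have = set(articles[target])
    seen = set()
    merged = []
    for lang, sections in articles.items():
        if lang == target:
            continue
        for section in sections:
            # Keep only sections the target lacks, first source wins.
            if section not in have and section not in seen:
                seen.add(section)
                merged.append((section, lang))
    return merged


# Toy data loosely based on the liberalism example; "History" is made up.
articles = {
    "en": ["History", "Keynesian economics"],
    "it": ["History", "Liberalism and Christianity"],
    "he": ["History", "Zionism"],
}

to_insert = missing_sections(articles, "en")
for section, lang in to_insert:
    print(f"[from {lang}] {section}")
```

Each returned pair corresponds to the source-language tag the authors suggest placing on newly inserted, translated paragraphs.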

Returning to their title metaphor, the authors give Wikipedia credit for at least "show[ing] a rough silhouette of the elephant", whereas e.g. Google only "presents a piece of the elephant based on a user's query language". However, this "silhouette – topic coverage – differs by language. [Wikipedia] writes in a descriptive tone and contextualizes first-person narratives and subjective opinions as cultural, historical, or religious phenomena." YouTube, on the other hand, "displays the 'color' and 'texture' of the elephant as it incorporates images and sounds that are effective in invoking emotions." But its top-rated videos "tend to create a more profound ethnocentric experience as they zoom into a highly confined range of topics or views that conform to the majority's interests".

The paper singles out the new AI-based chatbots as particularly problematic regarding language bias:

The problem with language bias is compounded by ChatGPT. As it is primarily trained on English language data, it presents the Anglo-American perspective as truth [even when giving answers in other languages] – as if it were the only valid knowledge.

On the other hand, the paper's examination of the biases of "ChatGPT-Bing" [sic] highlights among other concerns its reliance on Wikipedia among the sources it cites in its output:

[...] all responses list Wikipedia articles as its #1 source, which means that language bias in Wikipedia articles is inevitably permeated in ChatGPT-Bing's answers.

Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"A systematic review of Wikidata in Digital Humanities projects"

From the abstract:[2]

"A systematic review was conducted to identify and evaluate how DH [Digital Humanities] projects perceive and utilize Wikidata, as well as its potential and challenges as demonstrated through use. This research concludes that: (1) Wikidata is understood in the DH projects as a content provider, a platform, and a technology stack; (2) it is commonly implemented for annotation and enrichment, metadata curation, knowledge modelling, and Named Entity Recognition (NER); (3) Most projects tend to consume data from Wikidata, whereas there is more potential to utilize it as a platform and a technology stack to publish data on Wikidata or to create an ecosystem of data exchange; and (4) Projects face two types of challenges: technical issues in the implementations and concerns with Wikidata’s data quality."


"Leveraging Wikipedia article evolution for promotional tone detection"

From the abstract:[3]

"In this work we introduce WikiEvolve, a dataset for document-level promotional tone detection. Unlike previously proposed datasets, WikiEvolve contains seven versions of the same article from Wikipedia, from different points in its revision history; one with promotional tone, and six without it. This allows for obtaining more precise training signal for learning models from promotional tone detection. [...] In our experiments, our proposed adaptation of gradient reversal improves the accuracy of four different architectures on both in-domain and out-of-domain evaluation."


"Detection of Puffery on the English Wikipedia"

From the abstract:[4]

"Wikipedia’s policy on maintaining a neutral point of view has inspired recent research on bias detection, including 'weasel words' and 'hedges'. Yet to date, little work has been done on identifying 'puffery,' phrases that are overly positive without a verifiable source. We [...] construct a dataset by combining Wikipedia editorial annotations and information retrieval techniques. We compare several approaches to predicting puffery, and achieve 0.963 f1 score by incorporating citation features into a RoBERTa model. Finally, we demonstrate how to integrate our model with Wikipedia’s public infrastructure [at User:PeacockPhraseFinderBot] to give back to the Wikipedia editor community."


"Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences"

From the abstract:[5]

"The shortage of volunteers brings to Wikipedia many issues, including developing content for over 300 languages at the present. Therefore, the benefit that machines can automatically generate content to reduce human efforts on Wikipedia language projects could be considerable. In this paper, we propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level. The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia. We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models."

Among other examples given in the paper, a Wikidata statement involving the items and properties Q217760, P54, Q221525 and P580 is mapped to the Wikipedia sentence "On 30 January 2010, Wiltord signed with Metz until the end of the season."
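
To make the shape of such a statement concrete, here is a minimal Python sketch of a triple or quadruple paired with its sentence. The class and field names are hypothetical and not taken from the paper. The English labels in the comments for the two items are our reading of the IDs based on the quoted sentence, and the date value for P580 is likewise inferred from that sentence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    subject: str                           # item ID
    prop: str                              # property ID
    value: str                             # item ID
    qualifier_prop: Optional[str] = None   # qualifier property ID, if any
    qualifier_value: Optional[str] = None  # qualifier value, if any

    def ids(self):
        """Return the IDs as a triple, or a quadruple if a qualifier is set."""
        base = (self.subject, self.prop, self.value)
        if self.qualifier_prop is None:
            return base
        return base + (self.qualifier_prop,)


stmt = Statement(
    subject="Q217760",             # presumably the footballer Wiltord
    prop="P54",                    # "member of sports team"
    value="Q221525",               # presumably the club Metz
    qualifier_prop="P580",         # "start time"
    qualifier_value="2010-01-30",  # assumed from the sentence
)

# One (statement, sentence) pair of the kind the mapping process produces.
pair = (stmt, "On 30 January 2010, Wiltord signed with Metz until the end of the season.")
print(stmt.ids())
```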

Judging from the paper's citations, the authors appear to have been unaware of the Abstract Wikipedia project, which is pursuing a closely related effort.


"The URW-KG: a Resource for Tackling the Underrepresentation of non-Western Writers" on Wikidata

From the paper:[6]

"[...] the UnderRepresented Writers Knowledge Graph (URW-KG), a dataset of writers and their works targeted at assessing and reducing their potential lack of representation [...] has been designed to support the following research objectives (ROs):
1. Exploring the underrepresentation of non-Western writers in Wikidata by aligning it with external sources of knowledge; [...]

A quantitative overview of the information retrieved from external sources [ Goodreads, Open Library, and Google Books] shows a significant increase of works (they are 16 times more than in Wikidata) as well as an increase of the information about them. External resources include 787,125 blurbs against the 40,532 present in Wikipedia, and both the number of subjects and publishers extentively grow.

[...] the impact of data from OpenLibrary and Goodreads is more significant for Transnational writers [...] than for Western [...]. This means that the number of Transnational works gathered from external resources is higher, reflecting the wider [compared to Wikidata] preferences of readers and publishers in these crowdsourcing platforms."


References

  1. ^ Luo, Queenie; Puett, Michael J.; Smith, Michael D. (2023-03-28). "A Perspectival Mirror of the Elephant: Investigating Language Bias on Google, ChatGPT, Wikipedia, and YouTube". arXiv:2303.16281 [cs.CY].
  2. ^ Zhao, Fudie (2022-12-28). "A systematic review of Wikidata in Digital Humanities projects". Digital Scholarship in the Humanities. 38 (2): –083. doi:10.1093/llc/fqac083. ISSN 2055-7671.
  3. ^ De Kock, Christine; Vlachos, Andreas (May 2022). "Leveraging Wikipedia article evolution for promotional tone detection". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL 2022. Dublin, Ireland: Association for Computational Linguistics. pp. 5601–5613. doi:10.18653/v1/2022.acl-long.384. Data
  4. ^ Bertsch, Amanda; Bethard, Steven (November 2021). "Detection of Puffery on the English Wikipedia". Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). EMNLP-WNUT 2021. Online: Association for Computational Linguistics. pp. 329–333. code
  5. ^ Ta, Hoang Thang; Gelbukha, Alexander; Sidorov, Grigori (2022-10-23). "Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences". arXiv:2210.12659 [cs.CL].
  6. ^ Stranisci, Marco Antonio; Spillo, Giuseppe; Musto, Cataldo; Patti, Viviana; Damiano, Rossana (2022-12-21). "The URW-KG: a Resource for Tackling the Underrepresentation of non-Western Writers". arXiv:2212.13104 [cs.CL].

Discuss this story

Draft or not

Is this a draft paper or a published one? PAC2 (talk) 21:16, 3 April 2023 (UTC)

The "elephant" paper? It's a preprint, as mentioned in the opening sentence of the review. Regards, HaeB (talk) 18:10, 9 April 2023 (UTC)





       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0