The Signpost

Recent research

Images on Wikipedia "amplify gender bias"

By Bri and Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Images on Wikipedia "amplify gender bias" compared to article text

Reviewed by Bri and Tilman Bayer

A Nature paper titled "Online Images Amplify Gender Bias"[1] studies:

"gender associations of 3,495 social categories (such as 'nurse' or 'banker') in more than one million images from Google, [English] Wikipedia and Internet Movie Database (IMDb), and in billions of words from these platforms"

As summarized by Neuroscience News:

This pioneering study indicates that online images not only display a stronger bias towards men but also leave a more lasting psychological impact compared to text, with effects still notable after three days.

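This was a two-part research paper. In the first part, the authors compared gender associations in images and text across platforms; in the second, they tested experimentally how image versus text search affects users' gender bias.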

While the paper's main analyses focus on Google, the authors replicated their findings with text and image data from Wikipedia and IMDb.

Gender bias in text and images

For the first part, images were retrieved from Google Search results for 3,495 social categories drawn from WordNet, a canonical database of categories in the English language. These categories include occupations—such as doctor, lawyer and carpenter—and generic social roles, such as neighbour, friend and colleague. Faces extracted from these images (using the OpenCV library) were tagged with gender by workers recruited via Amazon Mechanical Turk. The reliability of tagging was validated against the self-identified gender from a "canonical set" of celebrity portraits culled from IMDb and Wikipedia.[supp 1]

For the replication analysis with English Wikipedia (relegated mainly to the paper's supplement), an analogous set of images was derived from an existing Wikipedia image dataset,[supp 2] whose text descriptions yielded matches for 1,523 of the 3,495 WordNet-derived social categories. (In the authors' words: "For example, we retrieve the Wikipedia article with the title ‘Physician’ for the social category physician".)

To measure gender bias in a corpus of text (e.g. from Google News), the authors use word embeddings (a computational natural language processing technique) trained on that corpus. Specifically, their method (adapted from a 2019 paper) assigns a number to each category (e.g. doctor, lawyer or carpenter) that "captures the extent to which [the word for this category] co-occurs with textual references to either women or men [in the corpus]. This method allows us to position each category along a −1 (female) to 1 (male) axis, such that categories closer to −1 are more commonly associated with women and those closer to 1 are more commonly associated with men [in the corpus]. [...] The category ‘aunt’, for instance, falls close to −1 along this scale, whereas the category ‘uncle’ falls close to 1 along this scale." The authors interpret any deviation of this "gender association" value from 0 as evidence of "gender bias" for a particular category. Figure 1 in the paper illustrates this in the case of Google News for a list of occupations. There, the three categories with the largest male bias appear to be "football player", "philosopher", and "mechanic", and the three categories with the largest female bias are "cosmetologist", "ballet dancer", and "hairstylist". In the figure, the category closest to being unbiased (0) in the Google News text was "programmer". Overall though, "texts from Google News exhibit [only] a relatively weak bias towards male representation, with an average score of 0.03."
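The gender-axis idea described above can be illustrated with a minimal sketch. It is not the authors' exact procedure: the three-dimensional vectors are made-up toy values rather than trained embeddings, the anchor word lists are assumptions, and `gender_association` is a hypothetical helper name. The score is the cosine similarity between a category's vector and a "male minus female" direction, which by construction lies between −1 and 1.

```python
import math

# Toy 3-dimensional "embeddings" (invented values, for illustration only;
# a real analysis would use vectors trained on the corpus under study).
TOY_VECTORS = {
    "he": [1.0, 0.1, 0.0],
    "man": [0.9, 0.2, 0.1],
    "she": [-1.0, 0.1, 0.0],
    "woman": [-0.9, 0.2, 0.1],
    "uncle": [0.8, 0.3, 0.2],
    "aunt": [-0.8, 0.3, 0.2],
    "programmer": [0.0, 0.9, 0.4],
}


def mean_vector(words):
    """Component-wise mean of the toy vectors for the given words."""
    vectors = [TOY_VECTORS[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def gender_association(word, male_words=("he", "man"), female_words=("she", "woman")):
    """Hypothetical helper: score in [-1, 1]; negative = female, positive = male."""
    male = mean_vector(male_words)
    female = mean_vector(female_words)
    gender_axis = [m - f for m, f in zip(male, female)]
    return cosine(TOY_VECTORS[word], gender_axis)


scores = {w: gender_association(w) for w in ("aunt", "programmer", "uncle")}
for word, score in scores.items():
    print(f"{word}: {score:+.2f}")
```

With these toy values, "aunt" lands near −1, "uncle" near 1, and "programmer" at 0, mirroring the qualitative pattern the review describes.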

In the case of Wikipedia text, this gender association of a particular WordNet category was determined using a pre-trained word embedding model of Wikipedia available in Python’s gensim package, which was built using the GloVe method to analyze a 2014 corpus of 5.6 billion words from Wikipedia. Somewhat concerningly, this description by the authors is inconsistent with the gensim documentation, which states that this 5.6 billion token corpus was not based on Wikipedia alone, but on "Wikipedia 2014 + Gigaword". According to the original GloVe paper,[supp 3] "Gigaword 5 [...] has 4.3 billion tokens", meaning that it would form a much bigger part of that corpus than Wikipedia. (The GloVe authors also observed that Wikipedia's entries are updated to assimilate new knowledge, whereas Gigaword is a fixed news repository with outdated and possibly incorrect information; the corpus contains newswire text dating back to 1994.)
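As a rough check on how large Gigaword's share would be, the two token counts quoted above can be compared directly. This is only a back-of-the-envelope sketch: it assumes (which neither source quoted here states explicitly) that all of Gigaword 5 is included in the combined 5.6 billion token corpus and that the remainder is Wikipedia text.

```python
# Token counts as quoted in the review (gensim documentation and GloVe paper).
TOTAL_TOKENS = 5.6e9      # "Wikipedia 2014 + Gigaword" corpus
GIGAWORD_TOKENS = 4.3e9   # Gigaword 5 alone

# Assumption: the combined corpus contains all of Gigaword 5.
gigaword_share = GIGAWORD_TOKENS / TOTAL_TOKENS
wikipedia_share = 1 - gigaword_share

print(f"Gigaword: {gigaword_share:.0%}, remainder (Wikipedia): {wikipedia_share:.0%}")
```

Under that assumption, roughly three quarters of the training text would be newswire rather than Wikipedia.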

In other words, the Nature study's conclusions about Wikipedia text might not be valid. Assuming they are though, they might seem vaguely reassuring for Wikipedians (and perhaps somewhat in contrast with earlier research about textual gender bias on Wikipedia): Using several different variants of the model (with different word embedding dimensions), respectively, 57% (50D), 59% (100D), 57.6% (200D), and 54% (300D) of categories [are] male-skewed, with an average strength of gender association below 0.06 (recall that the authors describe the corresponding value of 0.03 for Google News as a relatively weak bias). The story is different for images, though:

images over Wikipedia are significantly skewed toward male representation. 80% of categories are male-skewed according to images over Wikipedia (p < 0.0001, proportion test, n = 495, two-tailed). [...] Including all 1,244 categories in our analysis continues to show a strong bias toward male representation in Wikipedia images (with 68% of faces being male, p < 0.00001). [...] Wikipedia content can appear to be neutral in its gender associations if one focuses only on text, whereas examining Wikipedia images from the same articles can reveal a different reality, with evidence of a strong bias toward male representation and a stronger bias toward more salient gender associations in general.
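The first figure quoted above (80% of n = 495 categories male-skewed) can be sanity-checked with a simple one-sample proportion test. The sketch below uses a normal approximation against a null hypothesis of 50%; the quote does not specify which variant of the proportion test the authors ran, so this choice and the function name are assumptions.

```python
import math


def two_tailed_proportion_z_test(successes, n, p0=0.5):
    """One-sample z-test for a proportion (normal approximation).

    Returns the z statistic and the two-tailed p-value under H0: p = p0.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


n_categories = 495
male_skewed = round(0.80 * n_categories)  # 80% of categories, as quoted

z, p_value = two_tailed_proportion_z_test(male_skewed, n_categories)
print(f"z = {z:.2f}, two-tailed p = {p_value:.3g}")
```

The resulting p-value is far below the quoted threshold of 0.0001, consistent with the reported significance.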

Impact of image vs. text search on users' gender bias

For the second part (which did not involve Wikipedia directly), the researchers

... conducted a nationally representative, preregistered experiment that shows that googling for images rather than textual descriptions of occupations amplifies gender bias in participants’ beliefs.

To measure participants' gender bias after they had completed the googling task, an implicit association test (IAT) methodology was used, which supposedly reveals unconscious bias in a timed sorting task. In the researchers' words, "the participant will be fast at sorting in a manner that is consistent with one's latent associations, which is expected to lead to greater cognitive fluency [lower measured sorting times] in one's intuitive reactions." Specifically, the IAT variant used was designed to detect the implicit bias towards associating women with liberal arts and men with science. The test measured how long participants took to associate a particular word or image (e.g. "Girl", "Engineering", "Grandpa", "Fashion") with either the male/female or science/liberal arts categories.
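The description above does not spell out how sorting times are turned into a bias score. One common convention in IAT research is a "D score": the difference in mean response time between the stereotype-inconsistent and stereotype-consistent pairings, divided by the standard deviation of all trials pooled together. The sketch below illustrates a simplified version of that convention on made-up latencies; whether the study used exactly this scoring is an assumption, and the function and variable names are hypothetical.

```python
from statistics import mean, stdev


def iat_d_score(compatible_ms, incompatible_ms):
    """Simplified IAT D score (no trial exclusions or error penalties).

    Positive values mean slower sorting when the pairing contradicts the
    stereotype, i.e. a stronger implicit association.
    """
    pooled_sd = stdev(list(compatible_ms) + list(incompatible_ms))
    return (mean(incompatible_ms) - mean(compatible_ms)) / pooled_sd


# Invented response times in milliseconds for one participant.
compatible = [620, 650, 700, 640, 610]     # e.g. "male" and "science" share a key
incompatible = [780, 820, 760, 900, 840]   # e.g. "female" and "science" share a key

d = iat_d_score(compatible, incompatible)
print(f"D = {d:.2f}")
```

Dividing by the pooled standard deviation keeps the score comparable between participants who are generally fast or slow at the sorting task.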

The labeling of text descriptions was performed by other humans recruited via Amazon Mechanical Turk. Both the test subjects and the labelers were adults from the United States, and the test subjects were screened to be representative of the U.S. population, including a nearly 50/50 male/female split (none self-identified as other than those two categories). The experiment focused on a sample of 22 occupations, e.g. immunologist, harpist, hygienist, and intelligence analyst.

Some test subjects were given a task involving occupation-related text prior to the IAT, and some a task involving images: either to use Google search to retrieve images of representative individuals in the occupation, or to use Google search to retrieve a textual description of the occupation. A control group performed an unrelated Google search. Before the IAT was performed, the test subjects were required to indicate on a sliding scale, for each of the occupations, "which gender do you most expect to belong to this category?" The test was performed again a few days later with the same test subjects.

On the second test, subjects exposed to images in the first test had a stronger IAT score for bias than those exposed to text.

The experimental part of the study relies partly on the IAT and partly on self-assessment to detect priming. There are concerns about the replicability of priming effects, and about the validity and reliability of the IAT; some of them are described at Implicit-association test § Criticism and controversy. The authors appear to recognize this ("We acknowledge important continuing debate about the reliability of the IAT"). In their own study they found that "the distribution of participants' implicit bias scores [arrived at with IAT] was less stable across our preregistered studies than the distribution of participants' explicit bias scores", and they discounted the implicit bias scores somewhat.

The conclusion drawn by the researchers, based partly but not entirely on the different IAT scores of experimental subjects, was that of the paper title: "images amplify gender bias" — both explicitly as determined by the subject's assignments of occupation to gender on a sliding scale, and implicitly as determined by reaction times measured in the IAT.


The paper opens with the (rather thinly referenced) observation that "Each year, people spend less time reading and more time viewing images". Combined with the finding that searching for occupation images on Google amplified participants' gender biases, this forms an "alarming" trend according to the study's lead author (Douglas Guilbeault of UC Berkeley's Haas School of Business), as quoted by AFP on "the potential consequences this can have on reinforcing stereotypes that are harmful, mostly to women, but also to men".

The researchers also determined, apart from experimental subjects, that the Internet – represented singularly by Google News – exhibits a strong gender bias. It was unclear to this reviewer how much of the reported Internet bias is really "Google selection bias". Based on these findings, the authors go on to speculate that "gender biases in multimodal AI may stem in part from the fact that they are trained on public images from platforms such as Google and Wikipedia, which are rife with gender bias according to our measures".


Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer

"Gender stereotypes embedded in natural language [of Wikipedia articles] are stronger in more economically developed and individualistic countries"

From the abstract:[2]

"[...] measuring stereotypes is difficult, particularly in a cross-cultural context. Word embeddings are a recent useful tool in natural language processing permitting to measure the collective gender stereotypes embedded in a society. [...] We considered stereotypes associating men with career and women with family as well as those associating men with math or science and women with arts or liberal arts. Relying on two different sources (Wikipedia and Common Crawl), we found that these gender stereotypes are all significantly more pronounced in the text corpora of more economically developed and more individualistic countries. [...] our analysis sheds light on the “gender equality paradox,” i.e. on the fact that gender imbalances in a large number of domains are paradoxically stronger in more developed/gender equal/individualistic countries."

To determine "the relative contribution of residents from each country to each language [version of Wikipedia]", the author (a researcher at CNRS) used the Wikimedia Foundation's "WiViVi" dataset, which provides the percentage of pageviews per country for a given language Wikipedia. This data is somewhat outdated (last updated in 2018). Also, for the goal of measuring contribution (rather than consumption), the separate Geoeditors dataset might have been worth considering; it provides the number of editors per country, although with somewhat controversial privacy redactions.

"Poor attention: The wealth and regional gaps in event attention and coverage on Wikipedia"

"On November 2nd 2020, two terrorist attacks occurred: One in Vienna, Austria and one in Kabul, Afghanistan (top). Although both events gained many page views and edits on the English Wikipedia (bottom left), the attack in Austria received much more attention than the one in Afghanistan, including coverage in more Wikipedia language editions (bottom right)." (Figure 1 ("Motivation") from the paper)

From the abstract:[3]

"for many people around the world, [Wikipedia] serves as an essential news source for major events such as elections or disasters. Although Wikipedia covers many such events, some events are underrepresented and lack attention, despite their newsworthiness predicted from news value theory. In this paper, we analyze 17 490 event articles in four Wikipedia language editions and examine how the economic status and geographic region of the event location affects the attention [page views] and coverage [edits] it receives. We find that major Wikipedia language editions have a skewed focus, with more attention given to events in the world’s more economically developed countries and less attention to events in less affluent regions. However, other factors, such as the number of deaths in a disaster, are also associated with the attention an event receives."

Relatedly, a 2016 paper titled "Dynamics and biases of online attention: the case of aircraft crashes"[4] had found:

that the attention given by Wikipedia editors to pre-Wikipedia aircraft incidents and accidents depends on the region of the airline for both English and Spanish editions. North American airline companies receive more prompt coverage in English Wikipedia. We also observe that the attention given by Wikipedia visitors is influenced by the airline region but only for events with a high number of deaths. Finally we show that the rate and time span of the decay of attention is independent of the number of deaths and a fast decay within about a week seems to be universal.

A new corpus of Wikipedia passages about events, paired with potential sources

From the abstract:[5]

"[...] we present FAMuS, a new corpus of Wikipedia passages that report on some event, paired with underlying, genre-diverse (non-Wikipedia) source articles for the same event. Events and (cross-sentence) arguments in both report and source are annotated against FrameNet, providing broad coverage of different event types. We present results on two key event understanding tasks enabled by FAMuS: source validation -- determining whether a document is a valid source for a target report event -- and cross-document argument extraction -- full-document argument extraction for a target event from both its report and the correct source article."

"Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities"

"An illustration of the proposed OVEN task. Examples on the right are sampled from the constructed OVEN-Wiki dataset. OVEN aims at recognizing entities physically presented in the image or can be directly inferred from the image." (figure 1 from the paper)

From the abstract of this preprint by a group of authors from Google Research and Georgia Institute of Technology:[6]

"... we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model need to link an image onto a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study on state-of-the-art pre-trained models reveals large headroom in generalizing to the massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning."

"Understanding Structured Knowledge Production: A Case Study of Wikidata’s Representation Injustice"

From the paper:[7]

"... through a case study of comparing human [Wikidata] items of two countries, Vietnam and Germany, we propose several reasons that might lead to the existing biases in the knowledge contribution process. [...]
We chose Germany and Vietnam as subjects based on three primary considerations. Firstly, both nations have comparable population sizes. Secondly, the editors who speak the predominant languages of each country maintain their distinct Wiki communities on Wikidata. [...]
The first analysis we did was comparing different components of Wikidata pages between pages in two countries. The components we are comparing are labels, descriptions, claims, and sitelinks. For a single Wikidata page, label is the name that this item is known by, while description is a short sentence or phrase that also serves disambiguate purpose. [...] In the dataset we collected, there are 290,750 people who have citizenship of Germany, and there are only 4,744 people who have citizenship of Vietnam. [...] German pages on average had 13 more labels, 5 more descriptions and 7 more claims compared to Vietnamese pages. While surprisingly, Vietnamese pages had slightly more sitelinks, the difference according to effect size was negligible.
The second analysis focused on the edit history of Wikidata items. [...] we quantified the attention metric into five features: Number of total edits, number of human edits, number of bot edits, and number of distinct bot and human edits. [...] in all the five features the [difference in means between the German and Vietnamese Wikidata human pages] is significant and in terms of bot activity and total activity, the effect size is beyond medium threshold (0.5)."

"The Politics of Memory: An Extended Case Study of the Memory of Crisis on Wikipedia"

From the abstract:[8]

... an extended case study is developed on the (re)construction of a major pollution event (the [1952] Great Smog of London). Critical discourse analysis of intertextuality (connections between texts through hyperlinking and other shared patterning) is utilised to move from a focus on micro level practices to macro and meta level findings on the ordering of Wikipedia and its interactions with other institutions. Findings evidence a layered, self-referencing formation across texts, favouring the interests of established institutions and providing limited opportunity for marginalised groups to interact with sustained (re)constructions of the Great Smog. Comparison to a previous study of the constructed memory of a crisis (the London Bombings 2005) reveals dynamics across Wikipedia that lead to an emphasis on connecting (re)constructions to institutional traditions rather than the potential usefulness of such (re)construction for those at higher risk of negative outcomes arising from repeated crises.


  1. ^ Guilbeault, Douglas; Delecourt, Solène; Hull, Tasker; Desikan, Bhargav Srinivasa; Chu, Mark; Nadler, Ethan (February 14, 2024), "Online Images Amplify Gender Bias", Nature, 626 (8001): 1049–1055, Bibcode:2024Natur.626.1049G, doi:10.1038/s41586-024-07068-x, PMID 38355800 (open access). Code and (links to) data files
  2. ^ Napp, Clotilde (2023-11-01). "Gender stereotypes embedded in natural language are stronger in more economically developed and individualistic countries". PNAS Nexus. 2 (11). Michele Gelfand (ed.): pgad355. doi:10.1093/pnasnexus/pgad355. ISSN 2752-6542. PMC 10662454. PMID 38024410.
  3. ^ Ruprechter, Thorsten; Burghardt, Keith; Helic, Denis (2023-11-08). "Poor attention: The wealth and regional gaps in event attention and coverage on Wikipedia". PLOS ONE. 18 (11). Robin Haunschild (ed.): e0289325. Bibcode:2023PLoSO..1889325R. doi:10.1371/journal.pone.0289325. ISSN 1932-6203. PMID 37939022. Data and code:
  4. ^ García-Gavilanes, Ruth; Tsvetkova, Milena; Yasseri, Taha (2016-10-01). "Dynamics and biases of online attention: the case of aircraft crashes". Royal Society Open Science. 3 (10): 160460. arXiv:1606.08829. Bibcode:2016RSOS....360460G. doi:10.1098/rsos.160460. ISSN 2054-5703. PMC 5098985. PMID 27853560.
  5. ^ Vashishtha, Siddharth; Martin, Alexander; Gantt, William; Van Durme, Benjamin; White, Aaron Steven (2023-11-09). "FAMuS: Frames Across Multiple Sources". arXiv:2311.05601 [cs.CL].
  6. ^ Hu, Hexiang; Luan, Yi; Chen, Yang; Khandelwal, Urvashi; Joshi, Mandar; Lee, Kenton; Toutanova, Kristina; Chang, Ming-Wei (2023-02-22). "Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities". arXiv:2302.11154 [cs.CV]. Code and data request form
  7. ^ Ma, Jeffrey Jun-jie; Zhang, Charles Chuankai (2023-11-05). "Understanding Structured Knowledge Production: A Case Study of Wikidata's Representation Injustice". arXiv:2311.02767 [cs.HC]. Extended abstract. In: CSCW ’23 Workshop on Epistemic injustice in online communities, October 2023, Minneapolis, MN. ACM, New York, NY, USA
  8. ^ Schuller, Nina Margaret (2023). The politics of memory: An extended case study of the memory of crisis on Wikipedia (PhD thesis). University of Southampton.
Supplementary references and notes:
  1. ^ the "IMDB-WIKI dataset", from: Rothe, Rasmus; Timofte, Radu; Van Gool, Luc (2018-04-01). "Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks". International Journal of Computer Vision. 126 (2): 144–157. doi:10.1007/s11263-016-0940-3. hdl:20.500.11850/204027. ISSN 1573-1405. S2CID 207252421.
  2. ^ the "Wikipedia-based Image Text Dataset" (cf. our earlier coverage: "Announcing WIT: A Wikipedia-Based Image-Text Dataset")
  3. ^ Pennington, Jeffrey; Socher, Richard; Manning, Christopher (October 2014). "Glove: Global Vectors for Word Representation". In Moschitti, Alessandro; Pang, Bo; Daelemans, Walter (eds.). Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics. pp. 1532–1543. doi:10.3115/v1/D14-1162.

Discuss this story


I don't have the fortitude to understand the statistical complexities of this subject -- but it seems to me that availability of pictures and text accounts for a lot of what is called "gender bias." In reliable sources, especially in sources about historical subjects and long-dead people, there is a lot more information about men than women. And there are more photos and pictures of men than women available to Wikipedia editors. One reason is that many photos and pictures must be 95 or more years old to be in the public domain, and hence eligible to be posted to Wikimedia.

I have tough skin, so heave bricks at me if you wish for the above statement. Smallchief (talk) 17:52, 2 March 2024 (UTC)

Nah, I'd heave my agreement- you're right that bias in availability is the root cause of bias in the images used. IDK man I hope AI helps with that, but that's just me. Firestar464 (talk) 02:14, 3 March 2024 (UTC)

Male bias in images for "football player", "philosopher", and "mechanic"? They are not serious, are they? I say sloppy scholarship. - Altenmann >talk 21:08, 2 March 2024 (UTC)

I found a great reason to add a relevant and high-quality photo of a woman to Mechanic :) ~Maplestrip/Mable (chat) 09:57, 4 March 2024 (UTC)
Yes, the authors do repeatedly call these numbers "gender bias" (although their chart legend for figure 1 uses the less loaded term "gender association"). This kind of fuzzy usage of the term "bias" is unfortunately common in publications about Wikipedia's gender gap, many of which interpret any deviation from 50% as evidence of bias on Wikipedia's part (in a "tipping the scales" kind of causal sense). Here, the authors do seem to be aware that this kind of reasoning can't be fully valid for all categories - besides the "aunt" and "uncle" examples quoted in the review, in A.1.10 they mention the category with the strongest negative [i.e. female] association (-0.42, “chairwoman”) [...] and the category with the strongest positive [i.e. male] association (0.33, “guy”).
Also, to be fair, the authors' main result focuses on the difference between these "bias" numbers for images and text. And, in the paper they also compare them with US census data on gender ratio of occupations and with the results from an opinion survey they ran, asking the question "Which gender do you most expect to belong to this category?". (We didn't get to cover that in this already quite detailed review, also because these comparisons focus on the Google-related results instead of Wikipedia.)
Ultimately though, the problem of selecting a "fair" reference point to compare Wikipedia to remains a difficult one. Regards, HaeB (talk) 06:05, 5 March 2024 (UTC)

They can't be serious. - Master of Hedgehogs (converse) (hate that hedgehog!) 00:10, 4 March 2024 (UTC)

Amusingly, today, we have six bust pictures of men on our frontpage, typically the maximum possible. This is an issue that people have thought about before of course. Scientist has the pair of Curies as the lead image and that works great (also the first "scientist"? Wow!). But should we replace a picture of Bohr or Fermi with Meitner? These are hard and arbitrary decisions. The balance of relevance within the context/framing of the article can make it hard to improve on this, but I can already spot some places where we can include more women. ~Maplestrip/Mable (chat) 09:51, 4 March 2024 (UTC)

I struggle with this topic a little bit because, as an encyclopaedia, it's our task to reflect the world around us, not necessarily to try and change it. Away from Wikipedia I'm a massive advocate for tackling the inequalities and stereotypes we see all around us, but here our aim is to present a neutral point of view. From a neutral point of view, the vast majority of nurses worldwide are female, so it follows that a neutrally selected illustration of a "typical" nurse would be female. We should present reality as it is, not how we would like it to be. WaggersTALK 12:00, 6 March 2024 (UTC)

Typically, the bias of Wikipedia simply mirrors the bias of our sources, and that's in theory how it should be. WP:rightgreatwrongs is another thing we have to keep in mind. But I think even just considering looking for new images can be valuable for finding new perspectives to view a topic from (like I did on Mechanic), which might have a whole swath of literature tied to it as well. I think this works way better when it comes to using non-American/European perspectives, like finding images of Asian or African people in these occupations. ~Maplestrip/Mable (chat) 12:12, 6 March 2024 (UTC)
Actually, rather than plugging in arbitrary photos of women into articles about "male-dominated" occupations, it is good to add whenever possible sections about gender bias in them, especially when things were changing. For example, "Rosie the Riveter" tackles the issue; unfortunately it talks only about simple skilled crafts, such as welding, riveting, etc. - Altenmann >talk 20:24, 10 March 2024 (UTC)

The fact that some people devote their entire careers, lives even, to topics like this really speaks to the state of academia. skarz (talk) 17:11, 13 March 2024 (UTC)

Sure thing, "British scientists" :-) - Altenmann >talk 17:41, 13 March 2024 (UTC)


The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0