The Signpost

Recent research

New survey of over 100,000 Wikipedia users

By Tilman Bayer


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


Survey dataset of over 100,000 Wikipedia readers and contributors

From the abstract:[1]

"The dataset focuses on Wikipedia users and contains information about demographic and socioeconomic characteristics of the respondents and their activity on Wikipedia. The data was collected using a questionnaire available online between June and July 2023. The link to the questionnaire was distributed via a banner published in 8 languages on the Wikipedia page. [...] The survey includes 200 questions about: what people were doing on Wikipedia before clicking the link to the questionnaire; how they use Wikipedia as readers ("professional" and "personal" uses); their opinion on the quality, the thematic coverage, the importance of the encyclopaedia; the making of Wikipedia (how they think it is made, if they have ever contributed and how); their social, sport, artistic and cultural activities, both online and offline; their socio-economic characteristics including political beliefs, and trust propensities. More than 200 000 people opened the questionnaire, 100 332 started to answer, and constitute our dataset, and 10 576 finished it."

This dataset paper doesn't contain any results from the survey itself. And from the communications around it (including the project's page on Meta-wiki at Research:Surveying readers and contributors to Wikipedia) it is not clear whether and when the authors or others are planning to publish any analyses themselves. Hence we are taking a quick look ourselves at some topline results below (note: these are taken directly from the "filtered" dataset published by the authors, without any weighting by language or other debiasing efforts). Hopefully more use will be made of this data soon, especially since various questions appear to have been designed for compatibility with certain previous surveys.
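The response funnel reported in the abstract above can be worked out with a few lines of arithmetic. This is only a sketch: the counts come from the quoted abstract, the variable names are ours, and because the abstract says only that "more than 200 000" people opened the questionnaire, the start rate is an upper bound rather than an exact figure.

```python
# Counts as quoted in the dataset paper's abstract.
opened_at_least = 200_000   # "More than 200 000 people opened the questionnaire"
started = 100_332           # respondents who started to answer (the dataset)
finished = 10_576           # respondents who finished the questionnaire

# Share of those who started that also finished.
completion_rate = finished / started

# Upper bound only: the true number of people who opened it is larger.
start_rate_upper_bound = started / opened_at_least

print(f"Finished among started: {completion_rate:.1%}")
print(f"Started among opened (at most): {start_rate_upper_bound:.1%}")
```

In other words, roughly one in ten respondents who began the questionnaire completed it, so later questions will have far fewer answers than early ones.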

These gender ratios are notably somewhat more balanced than e.g. the figures from the Wikimedia Foundation's "Community Insights" surveys of recent years; however, those targeted a different population consisting exclusively of contributors. Still, the gender gap in this new survey data is even somewhat smaller than that found for English-language Wikipedia readers in a past survey by the Wikimedia Foundation (cf. below).

Distribution of responses to the question "In political matters, people talk of 'the left' and 'the right.' How would you place your views on this scale, generally speaking?" (NB: 11.7% of those who responded chose the option "This distinction does not speak to you".)

Unless we are dealing with a data anomaly here, this chart shows a general preponderance of left-of-center political positions among Wikipedia users, partly balanced out by a substantial share of far-right users (10 on a scale from 1 = left to 10 = right).


Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Global Gender Differences in Wikipedia Readership"

"Wikipedia reader gender by language" (from 2019 survey data)

From the abstract and introduction:[2]

"From a global online survey of 65,031 readers of Wikipedia and their corresponding reading logs, we present first evidence of gender differences in Wikipedia readership and how they manifest in records of user behavior. More specifically we report that (1) women are underrepresented among readers of Wikipedia, (2) women view fewer pages per reading session than men do, (3) men and women visit Wikipedia for similar reasons, and (4) men and women exhibit specific topical preferences"
"Across 16 surveys, men represent approximately two-thirds of Wikipedia readers on any given day. Additionally, we observe that women view fewer pages per reading session than men do. However, we also find that on average, men and women visit Wikipedia for similar reasons. That is, the depth of knowledge that they seek, referred to as information need for the remainder of this paper, and their triggers for reading Wikipedia, referred to as motivations, are remarkably similar. Finally, men and women exhibit specific topical preferences. Readership of articles about sports, games, and mathematics is skewed to-wards men, while readership of articles about broadcasting, medicine, and entertainment is skewed towards women. We further observe evidence of self-focus bias[...], i.e. that women tend to read relatively more biographies of women than men do, whereas men tend to read relatively more biographies of men than women do."
"closing content gaps is not a panacea as evidenced by prior research on Welsh Wikipedia, where a majority of the biographies are about women [...], a majority of Welsh speakers are women,[...] but readership is still heavily skewed towards men"

See also project page on Meta-wiki: m:Research:Characterizing_Wikipedia_Reader_Behaviour/Demographics_and_Wikipedia_use_cases and a subsequent literature review which formulated various potential explanations for the observed gender gap in Wikipedia readers.


"Hunters, busybodies and the knowledge network building associated with deprivation curiosity"

From the abstract:[3]

"A recently developed historicophilosophical taxonomy of curious practice distinguishes between the collection of disparate, loosely connected pieces of information and the seeking of related, tightly connected pieces of information. With this taxonomy, we use a novel knowledge network building framework of curiosity to capture styles of curious information seeking in 149 participants as they explore Wikipedia for over 5 hours spanning 21 days. We create knowledge networks in which nodes consist of distinct concepts (unique Wikipedia pages) and edges represent the similarity between the content of Wikipedia pages. We quantify the tightness of each participants' knowledge networks using graph theoretical indices and use a generative model of network growth to explore mechanisms underlying the observed information seeking. We find that participants create knowledge networks with small-world and modular structure. Deprivation sensitivity, the tendency to seek information that eliminates knowledge gaps, is associated with the creation of relatively tight networks and a relatively greater tendency to return to previously-visited concepts. We further show that there is substantial within-person variability in knowledge network building over time and that building looser networks than usual is linked with higher than usual sensation seeking."

See also an explanatory Twitter thread by one of the authors


"Architectural styles of curiosity in global Wikipedia mobile app readership"

From the abstract:[4]

"[...] most curiosity research relies on small, Western convenience samples. Here, we expand an analysis of a laboratory study with 149 participants browsing Wikipedia to 482,760 readers using Wikipedia's mobile app in 14 languages from 50 countries or territories. By measuring the structure of knowledge networks constructed by readers weaving a thread through articles in Wikipedia, we provide the first replication of two distinctive architectural styles of curiosity: that of the busybody and of the hunter [in reference to the above paper involving some of the same authors ...] Finally, across languages and countries, we identify novel associations between the structure of knowledge networks and population-level indicators of spatial navigation, education, mood, well-being, and inequality."

See also research project page on Meta-wiki: m:Research:Understanding Curious and Critical Readers


"Quantifying knowledge synchronization [between Wikipedia language versions] with the network-driven approach"

From the paper:[5]

"[...] we explore the dominant path of knowledge diffusion in the 21st century using Wikipedia, the largest communal dataset. We evaluate the similarity of shared knowledge between population groups, distinguished based on their language usage. When population groups are more engaged with each other, their knowledge structure is more similar, where engagement is indicated by socio-economic connections, such as cultural, linguistic, and historical features. Moreover, geographical proximity is no longer a critical requirement for knowledge dissemination.
We used Wikipedia SQL dump of 59 different language editions on February 1, 2019. [...] Specifically, we used two collections of the Wikipedia dump: category membership link records (*-categorylinks.sql.gz) and interlanguage link records (*-langlinks.sql.gz). [...] From the linkage between Wikipedia pages and categories, we extracted a hierarchical knowledge network of each language edition. [...Based on these per-language structures] we constructed the similarity network from the pairwise knowledge structure similarity, where nodes represent the language of Wikipedia, and the link's weight indicates similarity between languages.
"English is in the center and serves as a hub node, while intermediate hub languages such as Spanish, German, French, Russian, Portuguese, Chinese, and Dutch also function as cluster centroids"


Despite teachers' skepticism, 86% of Estonian high school students use Wikipedia at least a couple of times per month (female students more often)

From the abstract:[6]

"The article is based on a quantitative study in which 381 Estonian school children [9th and 12th grade students] participated in filling out an online survey. The questionnaire included both multiple-choice and open-ended questions. Findings: Statistical analyses and responses to open-ended questions showed that students often use Wikipedia as a primary source of information, but that their use of the site for learning tasks is guided by teachers’ attitudes and perceptions towards Wikipedia. Students perceive Wikipedia as a quick and convenient source of information but are uncertain about its reliability."

From the "Results" section:

"[...] 5% of the students surveyed use Wikipedia every day, 51% at least a couple of times a week and 30% a couple of times a month. To compare the groups, we conducted a t-test, which concluded that statistically significant differences were present across gender and grades. For the purpose of the calculations, we treated responses as numerical (rarely/not at all = 1, a few times a year = 2, a few times a month = 3, a few times a week = 4, every day = 5). For gender, the mean is 3.73 for women and 3.46 for men (p < 0.05). Thus, there is a statistically significant difference in the frequency of Wikipedia use between the two groups, with female students using Wikipedia more often than male students. [...] 24% of the students surveyed said that teachers had no objection to using Wikipedia, 3% said that teachers did not allow to use Wikipedia, 47% said that some teachers did and some did not and 10% said that they did not know. Teachers do not explicitly forbid students from using Wikipedia for learning tasks, but they do recommend that students use more trustworthy sources [...]"


"With or without Wikipedia? Integrating Wikipedia into the Teaching Process in Estonian General Education Schools"

From the abstract:[7]

"The study is based on semi-structured interviews with 49 teachers from 11 general education schools in Estonia. The results of the qualitative content analysis of the interviews indicate that teachers consider the use of Wikipedia to be a suitable for teaching, alongside other information sources and environments. However, teachers acknowledge some uncertainty and caution towards Wikipedia, as they do not consider it a very reliable teaching tool: an attitude largely inherited from the early days of Wikipedia. While teachers themselves are active and frequent Wikipedia users, and allow students to search for information, they do not assign Wikipedia-based text-creation tasks to students."

References

  1. ^ Cruciani, Caterina; Joubert, Léo; Jullien, Nicolas; Mell, Laurent; Piccione, Sasha; Vermeirsche, Jeanne (2023-12-01). "Surveying Wikipedians: a dataset of users and contributors' practices on Wikipedia in 8 languages". arXiv:2311.07964. Dataset: Cruciani, Caterina; Joubert, Léo; Jullien, Nicolas; Mell, Laurent; Piccione, Sasha; Vermeirsche, Jeanne (2023-12-01). Surveying Wikipedians: a dataset of users and contributors' practices on Wikipedia in 8 languages. doi:10.34847/nkl.4ecf4u8m.
  2. ^ Johnson, Isaac; Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Strohmaier, Markus; Zia, Leila (2021-05-22). "Global Gender Differences in Wikipedia Readership". Proceedings of the International AAAI Conference on Web and Social Media. 15: 254–265. doi:10.1609/icwsm.v15i1.18058. ISSN 2334-0770.
  3. ^ Lydon-Staley, David M.; Zhou, Dale; Blevins, Ann Sizemore; Zurn, Perry; Bassett, Danielle S. (2020-11-30). "Hunters, busybodies and the knowledge network building associated with deprivation curiosity". Nature Human Behaviour. 5 (3): 327–336. doi:10.1038/s41562-020-00985-7. ISSN 2397-3374. PMC 8082236. PMID 33257879. Earlier preprint: Lydon-Staley, David Martin; Zhou, Dale; Blevins, Ann Sizemore; Zurn, Perry; Bassett, Danielle S. (2019-06-08). Hunters, busybodies, and the knowledge network building associated with curiosity. PsyArXiv.
  4. ^ Zhou, Dale; Patankar, Shubhankar; Lydon-Staley, David Martin; Zurn, Perry; Gerlach, Martin; Bassett, Danielle S. (2023-11-02). Architectural styles of curiosity in global Wikipedia mobile app readership. PsyArXiv. doi:10.31234/osf.io/szuyj.
  5. ^ Yoon, Jisung; Park, Jinseo; Yun, Jinhyuk; Jung, Woo-Sung (2023-11-01). "Quantifying knowledge synchronization with the network-driven approach". Journal of Informetrics. 17 (4): 101455. doi:10.1016/j.joi.2023.101455. ISSN 1751-1577.
  6. ^ Remmik, Marvi; Siiman, Ann; Reinsalu, Riina; Vija, Maigi; Org, Andrus (January 2024). "Using Wikipedia to Develop 21st Century Skills: Perspectives from General Education Students". Education Sciences. 14 (1): 101. doi:10.3390/educsci14010101. ISSN 2227-7102.
  7. ^ Reinsalu, Riina; Vija, Maigi; Org, Andrus; Siiman, Ann; Remmik, Marvi (June 2023). "With or without Wikipedia? Integrating Wikipedia into the Teaching Process in Estonian General Education Schools". Education Sciences. 13 (6): 583. doi:10.3390/educsci13060583. ISSN 2227-7102.



Discuss this story

Wikipedians are more careful than to believe in the results of convenience sampling. -SusanLesch (talk) 14:21, 25 April 2024 (UTC)[reply]

Huh, can you explain in more detail why you characterize the sampling method used by this survey as "convenience sampling"? That term is most often used for methods that rely on a grossly unrepresentative population (say surveying a class of US college students for making conclusions about all humans). But "people who access the Wikipedia website within a given timespan" is a pretty reasonable proxy for "Wikipedia users" (in the general sense).
For context: Recruitment of survey participants via banners or other kinds of messages on the Wikipedia website itself is kind of the state of the art in this area. (It has also been used in numerous editor and reader surveys conducted by the Wikimedia Foundation.) It e.g. forms the basis of many of the most-cited results on e.g. the gender gap among Wikipedia editors. Yes, it comes with various biases (which, as already indicated in the review, one can try to correct after the fact using various means, see e.g. our earlier coverage here of an important 2012 paper which did this regarding editors: "Survey participation bias analysis: More Wikipedia editors are female, married or parents than previously assumed", and the WMF's "Global Gender Differences in Wikipedia Readership" paper also listed in this issue). But so does any other method (door-knocking, cold-calling landline telephones, etc. - and regarding phone surveys, these biases have become much worse in the last decade or so, at least in the US, as political pollsters have found out).
In sum, it's fine to call out specific potential biases in such surveys (e.g. I have been reminding people for over a decade now that - per the aforementioned 2012 paper - one of the best available estimates for the share of women editors in the US is 22.7% as of 2008, considerably higher than various other numbers floating around). But dismissing their results entirely strikes me as a nirvana fallacy.
Regards, HaeB (talk) 19:25, 25 April 2024 (UTC) (Tilman)[reply]
Hi, Mr./Dr. Bayer, thank you for your enthusiastic defense. Your sample size is admirable. Maybe our difficulty is in defining terms. I use the term convenience to describe samples created at the convenience of the researcher, to include self-selected participants. The latter is the problem here. I have no knowledge of statistics to share, only the admonition from a former professor that convenience surveys are the weakest sort. It's pretty simple: I never do surveys. My sister always does. The same caveat applied when Elon Musk asked whether he should step down as head of Twitter. His answer looks legitimate and scientific all the way down to one decimal point. I promise to read your article and all of its sources in detail (which I have not had a chance to do) after my editing chores are done. -SusanLesch (talk) 13:55, 26 April 2024 (UTC)[reply]
I still sense a lot of confusion here.
Your sample size is admirable. - Not sure what you mean by the possessive pronoun here, I was not involved at all with this survey.
Maybe our difficulty is in defining terms. - If you were using the term "convenience sampling" in a different meaning than the established one, it would have been good to clarify that from the beginning.
to include self-selected participants - It sounds like you are referring to the mundane fact that participation in the survey was voluntary, which is the case for almost all large-scale social science surveys (and even legally compulsory surveys like the US census have great trouble achieving a 100% response rate and avoiding undercounting). Again, while this might cause participation biases, these can be examined and to some extent handled (see above). It's not a valid reason for dismissing such empirical results out of hand.
I am also very unclear about the relevance of your sister and Elon Musk to this conversation, except perhaps that the latter's social media use illustrates the dangers of shooting off snarky one-sentence remarks based on a very incomplete understanding of the topic being discussed. In any case, I appreciate your intention to now actually read the Signpost story that you have been commenting on.
Regards, HaeB (talk) 21:00, 26 April 2024 (UTC)[reply]
Mr./Dr. Bayer, I don't have your fancy vocabulary, nor am I being snarky (nor was Mr. Musk, who asked an honest question). This discussion has become so unpleasant that I no longer wish to read your sources' methodology. The sampling your article describes leads us away from high grade information. -SusanLesch (talk) 16:54, 27 April 2024 (UTC)[reply]
It is great that we have some new good survey data about the community. It is ridiculous they are not available under open licence as open data, and that such a big survey was done without WMF cooperating with this and/or ensuring the data will be available. This is something for the mentioned white paper on best research practices to consider, actually. --Piotr Konieczny aka Prokonsul Piotrus| reply here 00:57, 26 April 2024 (UTC)[reply]
I am a bit confused about what you are referring to.
It is ridiculous they are not available under open licence as open data - the dataset is available (it's how I was able to create the graphs for this review, after all), and licensed under CC-BY SA 4.0.
such a big survey was done without WMF cooperating with this - judging from the project's page on Meta-wiki, the team extensively cooperated with the Wikipedia communities where the survey was to be run (and also invited feedback from some WMF staff who had previously run related surveys). Plus they followed best practices by creating this public project page on Meta-wiki in the first place (actually on your own suggestion it seems?), something even some WMF researchers occasionally forget unfortunately. What's more, the team also notified the research community in advance on the Wiki-research-l mailing list.
Regards, HaeB (talk) 03:46, 26 April 2024 (UTC)[reply]
PS: Also keep in mind that the Wikimedia Foundation has so far not been releasing any datasets from its somewhat comparable "Community Insights" editor surveys. (At least that is my conclusion based on a cursory search and this FAQ item; CCing TAndic and KCVelaga to confirm.) So I am unsure why you are confident that a collaboration with WMF would have been ensuring the data will be available.
PPS: To clarify just in case, I entirely agree with you on the principle that (sanitized) replication data for such surveys should be made available as open data.
Regards, HaeB (talk) 04:08, 26 April 2024 (UTC)[reply]
@HaeB what you write in PPS is pretty much what I meant. Reading the Signpost article gave me the impression this is not the case here (This dataset paper doesn't contain any results from the survey itself. And from the communications around it (including the project's page on Meta-wiki at Research:Surveying readers and contributors to Wikipedia) it is not clear whether and when the authors or others are planning to publish any analyses themselves. Hence we are taking a quick look ourselves at some topline results below (note: these are taken directly from the "filtered" dataset published by the authors, without any weighing by language or other debiasing efforts).) I gather that something is available but not as much as it should be. As for PS, yes, WMF is hardly a paragon of virtue in this regard either, and it is worth complaining about it too. WMF should be a paragon here, and should be both showcasing and enforcing best practices. Piotr Konieczny aka Prokonsul Piotrus| reply here 01:48, 28 April 2024 (UTC)[reply]

It would be interesting (at least to me) to see the results/analyses of the following questions from the survey:

Anyway, thanks for creating those graphs and sharing some of the topline results! Some1 (talk) 00:08, 27 April 2024 (UTC)[reply]

HaeB: The issue with the survey is that the sample is non-random, so the results cannot be relied upon. It is not impossible that the self-selected participants represent a valid sample of the population, but there is no assurance that this is so. Very often, such a sample turns out to be skewed. Chiswick Chap (talk) 11:31, 28 April 2024 (UTC)[reply]

HaeB: I've found at least two attempts to randomize content (which might be easier than randomizing users). That they exist suggests that "state of the art" remains RCTs.

References

  1. ^ Halfaker, A.; Kittur, A.; Kraut, R.; Riedl, J. (October 2009). A jury of your peers: quality, experience and ownership in Wikipedia. 5th International Symposium on Wikis and Open Collaboration. Association for Computing Machinery (ACM). pp. 1–10. doi:10.1145/1641309.1641332 – via Penn State. For our analysis, we used a random sample of approximately 1.4 million revisions attributed to registered editors (with bots removed) as extracted from the January, 2008 database snapshot of the English version of Wikipedia made available by the Wikimedia Foundation.
  2. ^ Thompson, Neil; Hanley, Douglas (February 13, 2018). "Science Is Shaped by Wikipedia: Evidence From a Randomized Control Trial". MIT Sloan Research Paper No. 5238-17. Social Science Research Network. doi:10.2139/ssrn.3039505. From 2013-2016 we ran an experiment to ascertain the causal impact of Wikipedia on academic science. We did this by having new Wikipedia articles on scientific topics written by PhD students from top universities who were studying those fields. Then, half the articles were randomized to be uploaded to Wikipedia, while the other half were not uploaded.

-SusanLesch (talk) 13:41, 4 May 2024 (UTC)[reply]





       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0