The Signpost

Recent research

Conflict dynamics, collaboration and emotions; digitization vs. copyright; WikiProject field notes; quality of medical articles; role of readers; Best Wiki Paper Award

By Daniel Mietchen, Giovanni Luca Ciampaglia, Jodi.a.schneider, Taha Yasseri, OrenBochman, Dario Taraborelli, Benjamin Mako Hill and Tilman Bayer

Modeling social dynamics in a collaborative environment

A draft of a letter, submitted for publication, has been posted on ArXiv.[1] The letter reports research on modeling the process of collaborative editing in Wikipedia and similar open-collaboration writing projects. The work builds on previous research by some of its authors on conflict detection in Wikipedia. The authors explore a simple agent-based model of opinion dynamics, in which editors influence each other either by direct communication or by successively editing a shared medium, such as a Wikipedia page. According to the authors, the model, although highly idealized, exhibits a rich behavior that can reproduce, albeit only qualitatively, some key characteristics of conflicts over real-world Wikipedia pages. The authors show that, for a fixed editorial pool with one "mainstream" and two opposing "extremist" groups, consensus is always reached. However, depending on the values of the model's input parameters, achieving consensus may take an extremely long time, and the consensus does not always conform to the initial mainstream view. In the case of a dynamic group, where new editors replace existing ones, consensus may be achieved through a phase of conflict, depending on the rate of new editors joining the editorial pool and on the degree of controversy over the article's topic.
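To make the two influence channels concrete, here is a minimal sketch of such an agent-based model. It is not the authors' code: the bounded-confidence update rule, the parameter names (`eps_talk`, `eps_article`, `mu`) and their values are illustrative assumptions, and initial opinions are drawn uniformly rather than split into mainstream and extremist groups.

```python
import random


def simulate(n_agents=50, steps=20000, eps_talk=0.2, eps_article=0.1,
             mu=0.3, seed=0):
    """Toy opinion dynamics with a shared article (all parameters hypothetical)."""
    rng = random.Random(seed)
    opinions = [rng.random() for _ in range(n_agents)]  # editor views in [0, 1)
    article = rng.random()                              # view expressed by the page

    for _ in range(steps):
        # Channel 1: direct communication between two randomly chosen editors.
        i, j = rng.sample(range(n_agents), 2)
        if abs(opinions[i] - opinions[j]) < eps_talk:
            shift = mu * (opinions[j] - opinions[i])
            opinions[i] += shift
            opinions[j] -= shift

        # Channel 2: interaction through the shared medium.
        k = rng.randrange(n_agents)
        gap = opinions[k] - article
        if abs(gap) > eps_article:
            article += mu * gap        # dissatisfied editor rewrites the page
        else:
            opinions[k] -= mu * gap    # satisfied editor drifts toward the page

    return opinions, article


if __name__ == "__main__":
    final_opinions, final_article = simulate()
    spread = max(final_opinions) - min(final_opinions)
    print(f"article position: {final_article:.3f}, opinion spread: {spread:.3f}")
```

Varying `eps_article` and `mu` in a sketch like this is one way to explore how long the page takes to settle, although any quantitative conclusions would require the model as specified in the letter.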

How Wikipedia articles benefit from the availability of public domain resources

In a copyright panel at this month's Wikimania, Abhishek Nagaraj – a PhD student and economist from the MIT Sloan School of Management – presented early results from an econometric study of copyright law. The study used data from the English Wikipedia's WikiProject Baseball to examine how gains from digitization are moderated by the effects of copyright. Previous work on the economics of copyright has struggled to disentangle the effects of copyright from the effects of increased access that often coincides with content entering the public domain.

The paper takes advantage of the fact that in 2008, Google digitized and published a large number of magazines as part of the Google Books project. Among them were 70 years of back issues of Baseball Digest, a magazine that publishes baseball stories, statistics, and photographs. Measuring the effect of digitization, Nagaraj found that the articles on baseball All-Stars from between 1944 and 1984 saw large increases in size (5,200) around the period that the digital Google Books version of Baseball Digest became available. However, because of the law governing copyright expiration, all the issues of Baseball Digest published before 1964 were in the public domain, while issues published afterwards were not. Using the econometric difference-in-differences technique, Nagaraj compared the effects of digitization for (1) players who began their professional baseball career after 1964 and as a result had no new digitized public-domain material and (2) players who had played before 1964 and were thus more likely to have digitized material about them enter the public domain.

In terms of the effect of copyright, Nagaraj found no effect of public domain status on the length of Wikipedia articles, but a strong effect on images. Wikipedia writers could, presumably, simply rewrite copyrighted material, or may not have found the Baseball Digest form appropriate for the encyclopedia. However, the availability of public domain material in Baseball Digest led to a strong increase in the number of images. Before Google Books published the material, the pre-1964 group had an average of 0.183 pictures per article and the post-1964 group had about 0.158. In the period after digitization, both groups increased, but the older group increased more, to 1.15 pictures per article, as opposed to 0.667 for the more recent players whose Baseball Digest material was still under copyright. Nagaraj also found that players with public domain material have more traffic to their articles. The study controls for a large number of variables related to players, their performance and talent, and their potential popularity, as well as for trends in Wikipedia editing.
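The image counts quoted above are enough to work through the basic difference-in-differences arithmetic by hand. The sketch below computes only the raw, unadjusted estimate from those four averages; the study itself controls for additional variables, so its reported estimate may differ. The variable names are ours.

```python
# Average number of images per article, as quoted in the summary above.
pre_public_domain, post_public_domain = 0.183, 1.15   # players active before 1964
pre_copyrighted, post_copyrighted = 0.158, 0.667      # players who started after 1964

# Change within each group across the digitization date.
change_public_domain = post_public_domain - pre_public_domain
change_copyrighted = post_copyrighted - pre_copyrighted

# The difference between the two changes is the raw estimate of the extra
# gain associated with public-domain status.
did_estimate = change_public_domain - change_copyrighted

print(f"change, public-domain group: {change_public_domain:.3f}")  # 0.967
print(f"change, copyrighted group:   {change_copyrighted:.3f}")    # 0.509
print(f"difference in differences:   {did_estimate:.3f}")          # 0.458
```

Subtracting the change in the copyrighted group removes whatever growth both groups shared, such as digitization itself and general trends in Wikipedia editing, and leaves roughly 0.46 additional images per article for the public-domain group.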

The presentation slides are available on the Wikimania conference website[2] and a nice journalistic write-up was published by The Atlantic.

Annotating field notes via Wikisource

Extraction of location, date and taxon data from Field Notes of Junius Henderson on Wikisource
User:Aubrey's diagram of a future Wikisource, which combines text with additional layers of transcription, hypertext, annotations and comments.

Field notes can be a valuable source of information about meteorological, geological and ecological aspects of the past, and making them accessible by way of Wikisource-based semantic annotation was the focus of a recent study[3] published in ZooKeys as part of a special issue on the digitization of natural history collections. The paper described how the field notes of Junius Henderson from the years 1905–1910 have been transcribed on Wikisource and then semantically annotated, as illustrated in the screenshot. Henderson was an avid collector of molluscs and, while trained as a judge, served as the first curator of the University of Colorado Museum of Natural History. His notebooks are rich in species occurrence records, but also contain occasional gems like this one from September 3, 1905:

The article provides a detailed introduction to the workflows on the English Wikisource in general and to WikiProject Field Notes in particular, which is home to transcriptions of other field notes as well. The data resulting from annotation of the field notes are available in Darwin Core format under a Creative Commons Public Domain Dedication (CC0). This work ties in with discussions that took place at Wikimania about the future of Wikisource, the technical prerequisites and existing tools and initiatives.

Quality of medical information in Wikipedia

The quality of medical information in Wikipedia could be vastly improved, based on the results of a recent study of 24 articles in pediatric otolaryngology[4] (more commonly referred to as "ear, nose, and throat" or ENT). The study compared results on common ENT diagnoses from Wikipedia, eMedicine, and MedlinePlus (the three most popular websites, by the authors' determination). It found that Wikipedia's ENT articles were the least accurate of the three and contained the most errors, and that their readability fell between that of the other two sites.

Although it is one of the most referenced sources in this area, Wikipedia had poor content accuracy (46%) compared with the two other frequently referenced sources. MedlinePlus had comparable accuracy (49%) but was missing 7 topics. The clear leader in accuracy, eMedicine, suffers from a higher reading level. The study provides specific criteria, in section 2.3, which could be considered for evaluation of existing articles. One limitation of the study is that, while suggesting that Wikipedia "suffers from the lack of understanding that a physician-editor may offer", it does not point to information on how to get involved with Wikipedia. Engagement with the pediatric medicine community would be beneficial, especially since about 25% of parents made decisions about their children's care in part based on online information.

Emotions and dialogue

A forthcoming paper at this year's WikiSym conference investigates the emotions expressed in article and user talk pages.[5] It finds that "Administrators tend to be more positive than regular users", and suggests that "as women gain experience in Wikipedia they tend to adopt the emotional tone of administrators", for instance linking to policy at more than twice the rate of men. Because women are more likely to interact with other women, the authors suggest gender-aware recruiting to address the gender gap.

The authors point out the utility of positive emotion in keeping discussions on track, and suggest that experienced editors should be encouraged to maintain a positive climate. To determine users' gender, they used a crowd-sourced study through Crowdflower. Emotions are measured using the ANEW wordlist, which captures emotional variability along three dimensions: valence, arousal, and dominance. The paper notes that policy mentions tend to have "a remarkably positive and dominant tone, and with stronger emotional load than in the rest of the discussion".
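As an illustration of how a wordlist-based approach of this kind can work, the sketch below averages per-word ratings over the words of a comment. It is not the paper's pipeline: the lexicon entries and their numbers are made-up placeholders (the real ANEW list supplies its own ratings), and the simple tokenizer and averaging rule are assumptions.

```python
import re

# Hypothetical placeholder ratings as (valence, arousal, dominance).
# These numbers are invented for illustration and are NOT real ANEW values.
TOY_LEXICON = {
    "thanks": (8.0, 5.0, 6.5),
    "help": (7.0, 4.5, 6.0),
    "vandalism": (2.0, 6.0, 3.5),
    "wrong": (2.5, 5.0, 4.0),
}


def score_comment(text, lexicon=TOY_LEXICON):
    """Return mean (valence, arousal, dominance) over known words, or None."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return None  # no rated word in the comment
    n = len(hits)
    return tuple(sum(dimension) / n for dimension in zip(*hits))


if __name__ == "__main__":
    print(score_comment("Thanks for the help with this page!"))
    print(score_comment("This is wrong and looks like vandalism."))
```

Comments whose words are all missing from the lexicon get no score, which is why approaches like this usually report coverage alongside the averages.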

Editor collaboration patterns

A paper from the University of Alberta addresses the difficulty of analyzing edit histories, and of finding conflict in particular.[6] The authors use terms indicating content-based agreement (e.g. "add", "fix", "spellcheck", "copy", and "move") and disagreement ("uncited", "fact", "is not", "bias", "claim", "revert", and "see talk page"). They define conflicting interactions as those that revert or delete content, or that use more negative than positive terms. They find that this is a useful way to identify controversial articles.
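The rule described above can be sketched in a few lines. The term lists are the examples quoted in this summary; the substring matching, the function names, and the way reverts and deletions are passed in as flags are assumptions made for illustration, not the paper's implementation.

```python
AGREEMENT_TERMS = ("add", "fix", "spellcheck", "copy", "move")
DISAGREEMENT_TERMS = ("uncited", "fact", "is not", "bias", "claim",
                      "revert", "see talk page")


def count_terms(comment, terms):
    """Count case-insensitive substring occurrences of each term."""
    text = comment.lower()
    return sum(text.count(term) for term in terms)


def is_conflicting(comment, is_revert=False, deletes_content=False):
    """Flag an interaction as conflicting under the rule summarized above."""
    if is_revert or deletes_content:
        return True
    negative = count_terms(comment, DISAGREEMENT_TERMS)
    positive = count_terms(comment, AGREEMENT_TERMS)
    return negative > positive


if __name__ == "__main__":
    print(is_conflicting("fix typo and add reference"))     # False
    print(is_conflicting("uncited claim, see talk page"))   # True
    print(is_conflicting("copy edit", is_revert=True))      # True
```

Naive substring matching has obvious pitfalls (for example, "remove" contains "move"), so a real classifier would need tokenization and a fuller vocabulary.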

Why does the number of Wikipedia readers rise while the number of editors doesn't?

A student paper for a course on "Project in Mining Massive Data Sets" at Stanford University, titled "Wikipedia Mathematical Models and Reversion Prediction",[7] tries to use mathematical models "to explain why the amount of [editors on the English Wikipedia] stops increasing, whereas the amount of viewers keeps increase", and "to predict if an edit will be reverted." The researchers used Elastic MapReduce on Amazon's servers to carry out this research. The paper is somewhat unfocused, since the researchers are more interested in models and validation than in explaining the phenomena.

The first part of the paper includes two models for examining the relation of visitors to editors in Wikipedia's community. The first model makes the assumption that editors act as predators and articles have the role of prey. However, this model did not fit the data. The second model used a linear regression over a number of factors, which allowed the authors to model the community's statistics over time. The model was then tested using simulation and seems to produce accurate results.
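For readers unfamiliar with predator-prey models, the sketch below integrates the textbook Lotka–Volterra equations with a simple Euler step, treating articles as "prey" and editors as "predators". The exact equations and parameters used in the student paper are not given here, so the functional form, parameter names and values are all assumptions.

```python
def lotka_volterra(prey, predators, alpha, beta, delta, gamma,
                   dt=0.01, steps=1000):
    """Euler integration of the classic predator-prey equations.

    d(prey)/dt      = alpha * prey - beta * prey * predators
    d(predators)/dt = delta * prey * predators - gamma * predators
    """
    history = [(prey, predators)]
    for _ in range(steps):
        d_prey = alpha * prey - beta * prey * predators
        d_predators = delta * prey * predators - gamma * predators
        prey += dt * d_prey
        predators += dt * d_predators
        history.append((prey, predators))
    return history


if __name__ == "__main__":
    # Hypothetical starting values: "articles" as prey, "editors" as predators.
    trajectory = lotka_volterra(prey=3.0, predators=1.0,
                                alpha=1.0, beta=0.5, delta=0.5, gamma=1.0)
    for step in range(0, len(trajectory), 200):
        articles, editors = trajectory[step]
        print(f"step {step:4d}: articles={articles:.3f} editors={editors:.3f}")
```

The classic system produces cycles around an equilibrium rather than a plateau in one population alongside steady growth in another, which gives some intuition for why a model of this type might fit editor and article counts poorly.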

In the second part of the paper, three models were used to predict which edits will be reverted. The models were trained using 24 features, classified as edit-based, editor-based or article-based. Examples include an article's age, its edit count, the number of editors participating in editing it, the number of articles the editor has edited, and the change in information compared to the previous state. The predictions, made with three machine learning algorithms, achieved about 75% accuracy. Another interesting conclusion was that the ability to detect reversion has not changed much over time.



  1. Török, J.; Iñiguez, G.; Yasseri, T.; San Miguel, M.; Kaski, K.; Kertész, J. (2012). "Opinions, Conflicts and Consensus: Modeling Social Dynamics in a Collaborative Environment". ArXiv. (open access)
  2. Nagaraj, Abhishek (2012). "The effect of copyright law on the reuse of digital content". Wikimania 2012, July 12–15, 2012, George Washington University. (open access)
  3. Thomer, A.; Vaidya, G.; Guralnick, R.; Bloom, D.; Russell, L. (2012). "From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks". ZooKeys (209): 235–53. Bibcode:2012ZooK..209..235T. doi:10.3897/zookeys.209.3247. PMC 3406479. PMID 22859891. (open access)
  4. Volsky, P. G.; Baldassari, C. M.; Mushti, S.; Derkay, C. S. (2012). "Quality of Internet information in pediatric otolaryngology: A comparison of three most referenced websites". International Journal of Pediatric Otorhinolaryngology. 76 (9): 1312–6. doi:10.1016/j.ijporl.2012.05.026. PMID 22770592. (closed access)
  5. Laniado, David; Castillo, Carlos; Kaltenbrunner, Andreas; Fuster Morell, Mayo (submitted). "Emotions and dialogue in a peer-production community: the case of Wikipedia". WikiSym '12, August 27–29, 2012, Linz, Austria. (open access)
  6. Sepehri-Rad, Hoda; Makazhanov, Aibek; Rafiei, Davood; Barbosa, Denilson (2012). "Leveraging Editor Collaboration Patterns in Wikipedia" (PDF, open access). In Proceedings of the 23rd ACM Conference on Hypertext and Social Media, pp. 13–22. doi:10.1145/2309996.2310001. (closed access)
  7. Jia Ji; Bing Han; Dingyi Li (2012). "Wikipedia Mathematical Models and Reversion Prediction" (PDF). (open access)
  8. Eklou, D.; Asano, Y.; Yoshikawa, M. (2012). "How the web can help Wikipedia: a study on information complementation of Wikipedia by the web". Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication (ICUIMC '12), p. 1. New York, New York, USA: ACM Press. doi:10.1145/2184751.2184763. (closed access)
  9. Ng, P. C. (2012). "What Kobe Bryant and Britney Spears Have in Common: Mining Wikipedia for Characteristics of Notable Individuals". Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media. (open access)
  10. Keegan, Brian (July 21, 2012). "Aurora shootings".
  11. Yasseri, Taha (2012). "Number of covering WPs vs. time".
  12. Saengthongpattana, Kanchana; Soonthornphisaj, Nuanwan (2012). "Thai Wikipedia Quality Measurement using Fuzzy Logic" (PDF). 26th Annual Conference of the Japanese Society for Artificial Intelligence, June 12–15, 2012, Yamaguchi, Japan. (open access)

Discuss this story

Why does the number of Wikipedia readers rise while the number of editors doesn't?

If Stanford University wanted to know "Why does the number of Wikipedia readers rise while the number of editors doesn't?" all they had to do was look at the nuclear power industry. Our site is like the power station, with the editors as the fuel rods and the guidelines, policies, bureaucracy, etc, as the control rods. Our problem on site is that the editors are increasingly frustrated by the control rods, which seem to sink further into the reactor each year and as a result of the control rod insertion more and more editors are experiencing the difficulty of having to work harder to get the article material heated to acceptable levels. Those at the top of the reactor have already experienced a total retardation of the nuclear fission process, while those at the bottom are unable or unwilling to pick up the slack. Despite this disturbing trend it does not affect the readers, who are outside the reactor's water loop and thus interact with the articles only in the heat exchanger, and as long as there is sufficient energy to boil the water - or in this case, to be more precise, maintain the articles and add new ones (even at a reduced rate) - the readers in the power loop will continue to power the machine that keeps Wikipedia moving. TomStar81 (Talk) 14:28, 31 July 2012 (UTC)

TomStar81, I liked your analogy a lot but I wonder how many readers without training in nuclear engineering will be able to understand you. :-) Respectfully, Hispalois (talk) 00:53, 1 August 2012 (UTC)
To be fair to that analogy, how often do the average people understand scatter plots and technical diagrams and so forth? All I've done is recycle that 'keep people in the dark about the true interpretation of the results' mentality to the site by giving my own analogy. I will concede a point though that there are people out there who would be unable to interpret this analogy without a little help, so allow me to enlighten anyone who needs a little help with the interpretation of my analogy: open File:PressurizedWaterReactor.gif and observe the process. Wikipedia editors are playing for the red team, while the readers are playing for the blue team. The whole process can be researched by reading our articles on nuclear power and nuclear fission. At the same time, the readers can help our improvement of the articles in question by providing feedback as to the ease of understanding the articles and where the articles need to be improved. TomStar81 (Talk) 06:33, 1 August 2012 (UTC)

Why would anyone expect any correlation between the number of Wikipedia readers and editors, or their respective rates of change? Individuals read Wikipedia to obtain information. Individuals edit Wikipedia for a wide variety of reasons. There is no causal relationship between the numbers of readers and editors, and therefore no reason to expect numerical correlation. Finell 19:24, 1 August 2012 (UTC)

I agree with polite Hispalois: I like the TomStar81 analogy too. I just wanted to point out that even if a reader hasn't got much training in "nuclear engineering" they can always check the "nuclear fuel" article on Wikipedia in order to see what wiki-editors are being compared with. But what about a rod? He could find something about it but on Wiktionary, please check it out: rod. Reading between the lines, though, I smiled at the hidden phallic symbol that there is behind a "rod". Is Wikipedia still a male world? Well... if you take a deeper look at the hyperlinked article on what is a "phallic symbol" in psychoanalysis... you will discover that "Women, not having the phallus, are seen to "be" the phallus". At least according to Jacques Lacan (1901-1981). Have a nice weekend. Maurice Carbonaro (talk) 21:32, 4 August 2012 (UTC)
Without going so far as to read the paper, I'm still boggling at "The first model makes the assumption that editors act as predators and articles have the role of prey". Sometimes it feels the other way round. Still, most carnivores would love to be able to create their own prey, I'm sure. Johnbod (talk) 01:12, 7 August 2012 (UTC)


It looks like something's broken in the mediawiki handling of English->Thai interwiki links, because the wikicode

[[th:วิกิพีเดีย:บทความคัดสรร|featured articles on the Thai Wikipedia]]

disappears entirely, causing the fuzzy logic paragraph in this article to have this mangled phrase:

to discern the (88 at the time of the study) from non-featured articles

-R. S. Shaw (talk) 21:30, 31 July 2012 (UTC)

Thanks. Sonia has added the missing colon. Regards, Tbayer (WMF) (talk) 00:05, 1 August 2012 (UTC)

Thai Featured Article study and overfitting

There are only 91 featured articles in the Thai Wikipedia, 88 of which were used by the study. I'm not sure that's really a good enough sample size to get good results. (Okay, sure, there are 75,000 Thai WP articles total, which is a good sample set, but they picked only 100 "normal" articles.) The fact that their algorithm caught *all* the FAs makes me a bit suspicious too - it's easy to make a model catch everything with lots of specific hacks, but it's not clear if you get a good model going forward - overfitting. (Think of weird edge case FAs in English WP promoted in 2007 with cleanup tags in the middle of a FAR - it'd be weird for a non-overfit / non-super-generous model to mark it as featured, so some error rate is "good.") If they'd had, say, 400 featured articles to play with, and fed 300 of them into the corpus + 10K non-featured articles, and then had to guess on the remaining 100 FAs mixed in with a different 10K normal articles, then the results might have been interesting. As it stands, alas. I'd also want to see a very low rate of false positives ideally since so many "normal" articles are easy to rule out just on the basis of, say, footnote count; an algorithm that could tell the difference between articles with lots of footnotes because they're radioactively controversial recent events vs. ones with lots of footnotes because they're featured.

Obligatory Nate Silver link: . (Nice & simple overfitting explanation with examples, although presidential elections have an even tinier sample size.) SnowFire (talk) 17:58, 2 August 2012 (UTC)

Admin, emotion, women..what?

That has got to be one of the oddest paragraphs I have read about anything gender gap related in a long time. How does linking to policy make one emotional? I'm confused by that. I also don't understand - do more women link to policy as compared to what male editors? I find that hard to believe, but, I'm a staunch anti-link to policy supporter when working with new editors, at least. It's research like this that makes me often wonder what use it is to us (anymore?). I also understand if some women might have to find ways to defend "ourselves" by having policy as a back up, but...even then, I don't see that in the areas of Wikipedia where I hang out. I'm assuming "gender aware" means women recruiting women or...? And I don't know why that is news, it's old news :) SarahStierch (talk) 17:24, 6 August 2012 (UTC)


The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0