The Signpost

Recent research

99.25% of Wikipedia birthdates accurate; focused Wikipedians live longer; merging WordNet, Wikipedia and Wiktionary

By Scott Hale, Piotr Konieczny, Maximilian Klein, Andrew Krizhanovsky, Tilman Bayer and Pine

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"Reliability of user-generated data: the case of biographical data in Wikipedia"

"Third Volume of a 1727 edition of Plutarch's Lives of the Noble Greeks and Romans printed by Jacob Tonson"; caption quoted from the Wikipedia article Biography
Review by User:Maximilianklein

Robert Viseur reported at WikiSym 2014 that 0.75% of Wikipedia birthdates are inaccurate.[1] That error rate is "low, although higher than the 0.21% observed for the baseline reference sources". Given that biographies represent 15% of English Wikipedia,[supp 1] the third largest category after "arts" and "culture", their accuracy is important. The method was to find biographies that appeared both in Wikipedia and in 9 reference databases, which are sadly not named due to the wishes of an "anonymous sponsor" of the paper (Red flag or Belgian bureaucracy?). Of the 938 such articles found, those whose birthdates did not match across all 10 sources (14.4%) were manually investigated. Some errors were due to different people sharing a name, which underlines the case for authority control when collecting data. One telling anecdote is that most of the mistakes behind Wikipedia's 0.75% had already been corrected between data collection and manual investigation. However, one needs to account for sample bias: these were biographies that existed in 10 separate databases, that is, those of well-known personalities. The predictive power of the study therefore remains limited, but at least we know that some objective data on Wikipedia has an error rate of the same order of magnitude as other "reliable sources".
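To make the figures above concrete, here is a minimal Python sketch of the kind of cross-source check described in the review. It is not the paper's code: the function name, the source labels and the toy records are all invented for illustration, and the conversion of 0.75% into an article count assumes that the rate is computed over the same 938-article sample, which the review does not state explicitly.

```python
def flag_mismatches(records):
    """Return the names whose birthdate differs in at least one source."""
    return [
        name
        for name, dates in records.items()
        if len(set(dates.values())) > 1
    ]

# Hypothetical toy data: the people, source labels and dates are invented.
records = {
    "Person A": {"wikipedia": "1900-01-01", "db1": "1900-01-01", "db2": "1900-01-01"},
    "Person B": {"wikipedia": "1850-05-12", "db1": "1850-05-12", "db2": "1851-05-12"},
}
flagged = flag_mismatches(records)  # only "Person B" has disagreeing dates

# Figures quoted in the review.
sample_size = 938             # biographies found in Wikipedia and all 9 databases
flagged_share = 0.144         # share with at least one mismatching birthdate
wikipedia_error_rate = 0.0075

# Assumption: both percentages refer to the 938-article sample.
approx_flagged = round(sample_size * flagged_share)
approx_wikipedia_errors = round(sample_size * wikipedia_error_rate)

print(flagged, approx_flagged, approx_wikipedia_errors)
```

Under that assumption, roughly 135 biographies went to manual investigation and about 7 of them turned out to be Wikipedia errors, which shows how few cases the headline percentage rests on.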

Focused Wikipedians stay active longer

Group photo of Wikimedians at Wikimania 2012

A new preprint[2] by three Dublin-based computer scientists contributes to the debate around editor retention. The authors use techniques such as topic modeling and non-negative matrix factorization to categorize Wikipedians into several profiles ("e.g. content experts, social networkers"). Those profiles, or user roles, are based on the namespaces that editors are most active in. Analyzing the behavior of about half a million Wikipedia editors, the authors find that short-term editors seem to lack interest in any one particular aspect of Wikipedia, editing various namespaces briefly before leaving the project. Long-term editors are more likely to focus on one or two namespaces (usually mainspace, plus article talk or user talk pages) and to diversify to other namespaces only after some time; in other words, the namespace distribution of edits over time "predicts an editor's departure from the community". The authors note that "we show that understanding patterns of change in user behavior can be of practical importance for community management and maintenance".

Unfortunately, the paper is heavy on jargon and statistical models, and provides little practical data (or at least, that data is not presented well). For example, the categorization of editors into seven groups is very interesting, but no descriptive data is presented that would allow us to compare the number of editors in each group. Further, the paper promises to use those profiles to predict editor lifecycles, but such models don't seem to be present in the paper. In the end, this reviewer finds the paper to be an interesting idea that will hopefully develop into research with meaningful findings; for now, however, it seems more of a theoretical analysis with no practical applications.
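The idea of describing an editor by the namespaces they edit can be illustrated with a small Python sketch. This is a deliberately simplified stand-in, not the preprint's method: the authors use topic modeling and non-negative matrix factorization, whereas the hypothetical helpers below only compute each namespace's share of an editor's edits and the combined share of the top two namespaces. The namespace labels and edit lists are invented.

```python
from collections import Counter

def namespace_shares(edit_namespaces):
    """Map each namespace to its fraction of the editor's edits."""
    counts = Counter(edit_namespaces)
    total = sum(counts.values())
    return {ns: n / total for ns, n in counts.items()}

def top_k_focus(edit_namespaces, k=2):
    """Combined share of the k most-edited namespaces (1.0 = fully focused)."""
    shares = sorted(namespace_shares(edit_namespaces).values(), reverse=True)
    return sum(shares[:k])

# Hypothetical editors: one concentrates on mainspace and talk pages,
# the other touches ten namespaces once each.
focused_editor = ["main"] * 8 + ["talk"] * 2
scattered_editor = [
    "main", "talk", "user", "user talk", "project",
    "file", "template", "help", "category", "portal",
]

focused_score = top_k_focus(focused_editor)
scattered_score = top_k_focus(scattered_editor)
print(focused_score, scattered_score)
```

A descriptive table of such scores per editor group is the kind of data this reviewer would have liked to see alongside the statistical models.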

"WordNet-Wikipedia-Wiktionary: construction of a three-way alignment"

A Wiktionary logo
Reviewed by Andrew Krizhanovsky

The authors of this paper,[3] presented at the International Conference on Language Resources and Evaluation (LREC 2014), integrated two previously constructed alignments, WordNet-Wikipedia and WordNet-Wiktionary, into a three-way alignment, WordNet-Wikipedia-Wiktionary. This integration results in lower accuracy but greater coverage than the two-way alignments.

Wiktionary does not provide a convenient and consistent means of directly addressing individual lexical items or their associated senses. Third-party tools such as the JWKTL (Java-based Wiktionary Library) API can overcome this problem.

Since the WordNet-Wikipedia alignment is for nouns only, the resulting synonym sets in the conjoint three-way alignment consist entirely of nouns. However, the full three-way alignment contains all parts of speech (adjectives, nouns, adverbs, verbs, etc.).

Larger synonym sets in the source data (WordNet and Wiktionary) result in more incorrect mappings in the resulting alignment (this seems strange from a layperson's point of view and suggests that the alignment algorithm is not yet perfect).

Informal examination shows that the conjoint alignment is generally correct, but existing errors in the source alignments are magnified (a snowball effect).
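The difference between the conjoint and the full three-way alignment, and the reason errors compound, can be sketched in a few lines of Python. This is a simplification for illustration only, not the authors' algorithm: it assumes one-to-one mappings keyed by WordNet synset, and the identifiers, function names and accuracy values are all hypothetical.

```python
def conjoint_alignment(wn_to_wp, wn_to_wkt):
    """Keep only WordNet synsets aligned to BOTH Wikipedia and Wiktionary."""
    shared = wn_to_wp.keys() & wn_to_wkt.keys()
    return {s: (wn_to_wp[s], wn_to_wkt[s]) for s in shared}

def full_alignment(wn_to_wp, wn_to_wkt):
    """Keep every synset aligned to at least one resource (None marks a gap)."""
    either = wn_to_wp.keys() | wn_to_wkt.keys()
    return {s: (wn_to_wp.get(s), wn_to_wkt.get(s)) for s in either}

# Hypothetical identifiers, invented for illustration only.
wn_to_wp = {"wn:noun-1": "wp:Article_A", "wn:noun-2": "wp:Article_B"}
wn_to_wkt = {"wn:noun-1": "wkt:sense-a", "wn:verb-1": "wkt:sense-v"}

conjoint = conjoint_alignment(wn_to_wp, wn_to_wkt)
full = full_alignment(wn_to_wp, wn_to_wkt)

# Assumption: the two source alignments are correct independently with
# probabilities p_wp and p_wkt (made-up values). A joined triple is then
# fully correct only if both links are, i.e. with probability p_wp * p_wkt.
p_wp, p_wkt = 0.9, 0.8
p_triple = p_wp * p_wkt

print(conjoint, len(full), p_triple)
```

In this toy data the conjoint alignment contains only the noun shared by both sources, while the full alignment also keeps the verb that only the WordNet-Wiktionary alignment covers. The product of the two accuracies is lower than either one, which is one plausible reading of the snowball effect noted above.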


Other recent publications

A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.


  1. ^ Viseur, Robert (2014). "Reliability of User-Generated Data: the Case of Biographical Data in Wikipedia" (PDF). WikiSym 2014. Retrieved 24 September 2014.
  2. ^ Qin, Xiangju; Derek Greene; Pádraig Cunningham (29 July 2014). "A latent space analysis of editor lifecycles in Wikipedia". arXiv:1407.7736 [cs.SI].
  3. ^ Miller, Tristan; Iryna Gurevych (May 2014). "WordNet-Wikipedia-Wiktionary: construction of a three-way alignment" (PDF). Proceedings of the 9th International Conference on Language Resources and Evaluation.
  4. ^ Biancani, Susan (2014). "Measuring the Quality of Edits to Wikipedia" (PDF). WikiSym 2014. Retrieved 24 September 2014.
  5. ^ Liping Wang: A Wiki Framework for the Sweble Engine. Master thesis, Friedrich-Alexander University Erlangen-Nürnberg 2014 PDF
  6. ^ Hwang, Thomas J.; Florence T. Bourgeois; John D. Seeger (2014). "Drug Safety in the Digital Age". New England Journal of Medicine. 370 (26): 2460–2462. doi:10.1056/NEJMp1401767. ISSN 0028-4793. PMID 24963564.
  7. ^ Thomas Roessing: The Dispute over Filtering “indecent” Images in Wikipedia. Masaryk University Journal of Law and Technology Issue: 2/2013 PDF
  8. ^ Britt, Brian C. (January 2014). Evolution and revolution of organizational configurations on wikipedia: A longitudinal network analysis (Thesis). Purdue University.
  9. ^ Rijt, Arnout van de; Soong Moon Kang; Michael Restivo; Akshay Patil (28 April 2014). "Field experiments of success-breeds-success dynamics". Proceedings of the National Academy of Sciences. 111 (19): 6934–9. Bibcode:2014PNAS..111.6934V. doi:10.1073/pnas.1316836111. ISSN 0027-8424. PMC 4024896. PMID 24778230.
  10. ^ Lee, Kyungho (2014). "How collective intelligence emerges: knowledge creation process in Wikipedia from microscopic viewpoint". Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces. AVI '14. New York, NY, USA: ACM. pp. 373–374. doi:10.1145/2598153.2600040. ISBN 978-1-4503-2775-6.
  11. ^ Temple, Norman J.; Joy Fraser (2014). "How accurate are Wikipedia articles in health, nutrition, and medicine? / Les articles de Wikipédia dans les domaines de la santé, de la nutrition et de la médecine sont-ils exacts ?". Canadian Journal of Information and Library Science. 38 (1): 37–52. ISSN 1920-7239.
  12. ^ Joanne Robert: Community and the dynamics of spatially distributed knowledge production. The case of Wikipedia in: The social dynamics of innovation networks. edited by Roel Rutten, Paul Benneworth, Dessy Irawati, Frans Boekema p.179ff
  13. ^ DeDeo, Simon (8 July 2014). "Group minds and the case of Wikipedia". Human Computation. 1 (1): 5–29. arXiv:1407.2210. doi:10.15346/hc.v1i1.2.
  14. ^ Mesgari, Mostafa and Okoli, Chitu and Mehdi, Mohamad and Nielsen, Finn Årup and Lanamäki, Arto (2014) "The sum of all human knowledge": A systematic review of scholarly research on the content of Wikipedia. Journal of the Association for Information Science and Technology. ISSN 2330-1635 (In Press) PDF
Supplementary references and notes:
  1. ^ Kittur. "Whats in Wikipedia?" (PDF).
  2. ^ Halfaker, A.; Kittur, A.; Kraut, R.; Riedl, J. (2009). "A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia". WikiSym '09. Retrieved 24 September 2014.

Discuss this story


Hey Max, wouldn't a more appropriate statement be that 0.75% of Wikipedia biography dates "do not match" those in standard reference works? I've worked on at least two articles (one was John Latouche - see the talk page) where I'm sure I established the correct dates even though many reference sources provide different ones (I even got the Library of Congress to amend their dates for Latouche). Maybe that's also true for the rest of the .75%. So it's not that WP is wrong, it's just different and possibly more accurate. kosboot (talk) 06:18, 28 September 2014 (UTC)

I have also experienced email exchanges where the referenced party changed their data thanks to Wikipedia and my inquiry. So I would agree that mismatches do not always indicate that Wikipedia is inaccurate. Wikipedia's "blue-link" feature and all of the "checkwiki" bots make it possible for people to quickly realize that a grandson cannot be born before his grandfather and so forth. These basic checks lead to cleaner data in the long run. Jane (talk) 07:28, 28 September 2014 (UTC)

The method used was to find biographies that were both in Wikipedia and 9 reference databases, which are sadly not named due to the wishes of an "anonymous sponsor" of the paper.

This is pretty suspicious, particularly since the exact reason for not naming is not mentioned (a violation of journalistic ethics). Am I the only one who finds this suspicious? (As mentioned in the article, it could be a red flag) Narutolovehinata5 tccsdnew 12:34, 29 September 2014 (UTC)

