The Signpost

Recent research

Who wrote this? New dataset on the provenance of Wikipedia text

By Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Who wrote this? A new dataset tracks the provenance of English Wikipedia text over 15 years

Much of the existing Wikipedia research is based on the freely licensed datasets published by the Wikimedia Foundation: content dumps, pageview numbers, Clickstream samples, etc. But some individual researchers are giving back too. An example of this is the TokTrack dataset, described in an accompanying paper[1] as

"a dataset that contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists with all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history."

Tracking authorship and provenance of Wikipedia article text is by no means a new topic (see e.g. meta:Research:Content persistence). However, the paper's authors assert that their method provides much higher accuracy than earlier efforts such as WikiTrust. One of them, Fabian Flöck, has been studying this problem with other researchers for years (cf. our coverage from 2012 and 2014: "Precise and efficient attribution of authorship of revisioned content", "Better authorship detection, and measuring inequality", "New algorithm provides better revert detection"). The present dataset is generated by their "WikiWho" algorithm, which also underlies a browser extension called "WhoColor".

What's more, the paper points out that "this data would be exceedingly hard to create by an average potential user" for the entire English Wikipedia due to the computational effort involved: "around 25 days on a dedicated Ubuntu Server [...] with 122 GB RAM and 20 cores". For comparison, the community-created tool "WikiBlame", which is linked from every revision history page on English Wikipedia, can take several minutes to find the provenance of an individual token in a single article.
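The annotation scheme quoted above (an origin revision plus lists of the revisions in which a token was deleted and re-added) can be pictured with a minimal Python sketch. The class and field names used here (TokenRecord, origin_rev_id, deleted_in, readded_in) are illustrative assumptions, not the dataset's actual column names, and the sketch assumes that deletions and re-additions strictly alternate, as the description implies.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TokenRecord:
    """One token instance with its provenance (hypothetical field names)."""
    text: str
    origin_rev_id: int                                    # revision that first added the token
    deleted_in: List[int] = field(default_factory=list)   # revisions that removed it
    readded_in: List[int] = field(default_factory=list)   # revisions that restored it

    def is_live(self) -> bool:
        # The token is present in the latest revision if every deletion
        # has been followed by a re-addition.
        return len(self.deleted_in) == len(self.readded_in)

    def history(self) -> List[Tuple[str, int]]:
        # Interleave deletions and re-additions into one chronological list.
        events = [("added", self.origin_rev_id)]
        for i, rev in enumerate(self.deleted_in):
            events.append(("deleted", rev))
            if i < len(self.readded_in):
                events.append(("re-added", self.readded_in[i]))
        return events


# Made-up example: added in revision 10, deleted in 12, restored in 15, deleted again in 20.
token = TokenRecord("wiki", origin_rev_id=10, deleted_in=[12, 20], readded_in=[15])
print(token.is_live())
print(token.history())
```

With one record per token, questions such as "who originally wrote this word?" or "how often was it fought over?" reduce to simple lookups on these three fields.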

After describing the dataset and the underlying methodology, the paper also briefly presents some insights that can be derived from it about the history of English Wikipedia. First, it looks at the number of added and surviving tokens over time, observing that

"the rapid growth in added tokens leveled off around the beginning of 2007, and transformed into a slight decline before recovering towards the middle of 2014. [...] the ratio of newly added content that was good or uncontentious enough to survive 48 hours exhibits a (mostly) continuous decrease from 2001 until 2007, coinciding with the change in total added content, then stabilizes and even begins to slightly climb again until recently."

It highlights "a surprising spike in Oct. 2002 (also in absolute additions)". Although not mentioned in the paper, this is very likely the effect of bot contributions of US geographical content by User:Ram-Man. Figure 2(b) in the paper also seems to indicate that more than half of these October 2002 additions were still live 14 years later.
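The 48-hour survival ratio mentioned in the quote above can be illustrated with a short Python sketch. It assumes that each token comes with the timestamp of the revision that added it and the timestamp of its first deletion (or None if it was never deleted); the function names are hypothetical, and counting a token deleted after exactly 48 hours as "surviving" is an assumption rather than something the paper is known to specify.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

SURVIVAL_WINDOW = timedelta(hours=48)

# (time added, time of first deletion or None)
TokenTimes = Tuple[datetime, Optional[datetime]]


def survives(added_at: datetime, first_deleted_at: Optional[datetime],
             window: timedelta = SURVIVAL_WINDOW) -> bool:
    """True if the token was not removed within `window` of being added."""
    if first_deleted_at is None:
        return True
    return first_deleted_at - added_at >= window


def survival_ratio(tokens: Sequence[TokenTimes]) -> float:
    """Share of added tokens that survived the window (0.0 for no tokens)."""
    if not tokens:
        return 0.0
    surviving = sum(1 for added, deleted in tokens if survives(added, deleted))
    return surviving / len(tokens)


# Made-up sample: four tokens added at the same moment.
t0 = datetime(2016, 1, 1)
sample = [
    (t0, None),                      # never deleted
    (t0, t0 + timedelta(hours=1)),   # removed after one hour
    (t0, t0 + timedelta(hours=48)),  # removed after exactly 48 hours
    (t0, t0 + timedelta(days=30)),   # removed a month later
]
print(survival_ratio(sample))  # 0.75
```

Grouping the tokens by the month of their origin revision before calling survival_ratio would yield a time series of the kind the paper plots.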

Analyzing the "persisting" tokens (that had not been removed within 48 hours) by user group, the authors observe:

"While it seems that the addition of persisting tokens of unregistered editors has become comparably stable since 2006, it has not been keeping up by far with the enormous increase by registered editors, which make up for over 80% of all added surviving content for most months since 2007. In fact, a small group of registered users generates the vast majority of sustained content [...]. Bots showed an increased presence from mid-2007 until 2013, when, presumably by the migration of inter-language links to Wikidata, the demand for bot-created content dropped."

The remainder of the paper uses the dataset to study editing controversies. First, the authors define two measures of how controversial an article is. Both yield "Evolution", "Mustafa Kemal Atatürk" and "Bob Dylan" (in that order) as the three most controversial articles as of October 2016 (based only on the content surviving at that time). They also find that "barneys" was the most conflicted string token.

Lastly, they examine the frequency of edits that undo other edits partially or totally, where the token-based data enables a more sophisticated approach than simpler types of revert analysis. They find that

"in total, 61.51% of all edits included some kind of removal or reinsertion of content (i.e., 38.49% revisions purely added content), and in 14.62% of the revisions editors correct their own edits. 14.84% of all revisions fully undid another revision and 50.65% did so partially."

However, they caution that since "content added by one revision can over (a long) time be corroded by many small changes [...] 'revert' cannot per se be equated with antagonism here, as these numbers include the complete spectrum from minor corrections to full-on opinion clashes and vandal fighting."
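The distinction between full and partial undos can be illustrated with a simplified Python sketch. It covers only one case, a later revision removing tokens that an earlier revision added; the paper's analysis also counts reinsertions of previously deleted content, which this sketch leaves out. The function name and the integer token IDs are hypothetical.

```python
from typing import Set


def classify_undo(added_by_target: Set[int], removed_by_later: Set[int]) -> str:
    """Classify how far a later revision undoes an earlier one.

    added_by_target:  IDs of tokens the earlier revision added.
    removed_by_later: IDs of tokens the later revision removed.
    Returns "full", "partial" or "none".
    """
    undone = added_by_target & removed_by_later
    if not undone:
        return "none"
    if undone == added_by_target:
        return "full"
    return "partial"


print(classify_undo({1, 2, 3}, {1, 2, 3, 9}))  # full
print(classify_undo({1, 2, 3}, {2}))           # partial
print(classify_undo({1, 2}, {5}))              # none
```

As the authors' caveat suggests, a "partial" result here may be nothing more than a typo fix touching a single token, so the label by itself says little about conflict.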

Briefly

Conferences and events

See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.

References

  1. ^ Flöck, Fabian; Erdogan, Kenan; Acosta, Maribel (2017-05-03). TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia. Eleventh International AAAI Conference on Web and Social Media.
  2. ^ Rosnay, Mélanie Dulong de; Langlais, Pierre-Carl (2017-02-16). "Public artworks and the freedom of panorama controversy: a case of Wikimedia influence". Internet Policy Review. 6 (1). ISSN 2197-6775.
  3. ^ Gottschalk, Simon; Demidova, Elena (2016). "Analysing temporal evolution of interlingual Wikipedia article pairs". Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. SIGIR '16. New York, NY, USA: ACM. pp. 1089–1092. arXiv:1702.00716. doi:10.1145/2911451.2911472. ISBN 9781450340694.
  4. ^ Jirschitzka, Jens; Kimmerle, Joachim; Halatchliyski, Iassen; Hancke, Julia; Meurers, Detmar; Cress, Ulrike (2017-06-02). "A productive clash of perspectives? The interplay between articles' and authors' perspectives and their impact on Wikipedia edits in a controversial domain". PLOS ONE. 12 (6): e0178985. Bibcode:2017PLoSO..1278985J. doi:10.1371/journal.pone.0178985. ISSN 1932-6203. PMC 5456356. PMID 28575077.
  5. ^ Mandler, Michael D. (2017-01-26). "Glaring chemical errors persist for years on Wikipedia". Journal of Chemical Education. 94 (3): 271–272. Bibcode:2017JChEd..94..271M. doi:10.1021/acs.jchemed.6b00478. ISSN 0021-9584. (letter)
  6. ^ Greving, Hannah; Oeberst, Aileen; Kimmerle, Joachim; Cress, Ulrike (2017-06-29). "Emotional content in Wikipedia articles on negative man-made and nature-made events". Journal of Language and Social Psychology. 37 (3). doi:10.1177/0261927X17717568. ISSN 0261-927X. S2CID 149165526.
  7. ^ Thornton, Katherine (2017-02-14). Powerful Structure: Inspecting Infrastructures of Information Organization in Wikimedia Foundation Projects (Thesis). hdl:1773/38160. (dissertation)
  8. ^ Neef, Sebastian (2017-08-26). "Implementation and evaluation of a framework to calculate impact measures for Wikipedia authors". arXiv:1709.01142 [cs.DL].
  9. ^ Gupta, Amit; Lebret, Rémi; Harkous, Hamza; Aberer, Karl (2017-04-25). "280 Birds with one stone: inducing multilingual taxonomies from Wikipedia using character-level classification". arXiv:1704.07624 [cs.CL].
  10. ^ Khairova, Nina; Lewoniewski, Włodzimierz; Węcel, Krzysztof (2017-06-28). Estimating the quality of articles in Russian Wikipedia using the logical-linguistic model of fact extraction. International Conference on Business Information Systems. Lecture Notes in Business Information Processing. Springer, Cham. pp. 28–40. doi:10.1007/978-3-319-59336-4_3. ISBN 9783319593357.

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0