OpenSym 2015 report; PageRank and wiki quality; news suggestions; the impact of open access

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

OpenSym 2015

The Presidio and Golden Gate Bridge in San Francisco (the conference venue is in the left center)

OpenSym, the eleventh edition of the annual conference formerly known as WikiSym, took place on August 19 and 20 at the Golden Gate Club in the Presidio of San Francisco, USA, followed by a one-day doctoral symposium. While the name change (enacted last year) reflects the event's broadened scope towards open collaboration in general, a substantial part of the proceedings (23 papers and posters) still consisted of research featuring Wikipedia (eight) and other wikis (three, two of them other Wikimedia projects: Wikidata and Wikibooks), listed in more detail below. However, this research was not represented in the four keynotes, even though some of their topics did offer inspiration to those interested in Wikimedia research. For example, in the Q&A after the keynote by Peter Norvig (Director of Research at Google) about machine learning, Norvig was asked where such AI techniques could help human Wikipedia editors with high "force multiplication". He offered various ideas for applications of a "natural language processing pipeline" to Wikipedia content, such as automatically suggesting "see also" topics, potentially duplicate article topics, or "derivative" article updates (e.g. when an actor's article is updated with an award won, the list of winners of that award should be updated too). The open space part of the schedule saw very limited usage, although it did facilitate a discussion that might lead to a revival of a Wikitrust-like service in the not too distant future (similar to the existing Wikiwho project).

As in previous years, the Wikimedia Foundation was the largest sponsor of the conference, with the event organizers' open grant application supported by testimonials by several Wikimedians and academic researchers about the usefulness of the conference over the past decade. This time, the acceptance rate was 43%. The next edition of the conference will take place in Berlin in August 2016.

An overview of the Wikipedia/Wikimedia-related papers and posters follows, including one longer review.

"Wikipedia in the World of Global Gender Inequality Indices: What The Biography Gender Gap Is Measuring" (poster)^[1]
"Peer-production system or collaborative ontology engineering effort: What is Wikidata?"^[2] presented the results of an extensive classification of edits on Wikidata, touching on such topics as the division of labor (i.e. the differences in edit types) between bots and human editors. Answering the title question, the presentation concluded that Wikidata can be regarded as a peer production system now (i.e. an open collaboration, which is also more accessible for contributors than Semantic MediaWiki), but could veer into more systematic "ontology engineering" in the future.
"The Evolution Of Knowledge Creation Online: Wikipedia and Knowledge Processes"^[3]: This poster applied evolution theory to Wikipedia's knowledge processes, using the "Blind Variation and Selective Retention" model.
"Contribution, Social networking, and the Request for Adminship process in Wikipedia "^[4]: This poster examined a 2006/2007 dataset of admin elections on the English Wikipedia, finding that the optimal numbers of edits and talk page interactions with users to get elected as Wikipedia admin fall into "quite narrow windows".
"The Rise and Fall of an Online Project. Is Bureaucracy Killing Efficiency in Open Knowledge Production?"^[5] This paper compared 37 different language Wikipedias, asking which of them "are efficient in turning the input of participants and participant contributions into knowledge products, and whether this efficiency is due to a distribution of participants among the very involved (i.e., the administrators), and the occasional contributors, to the projects’ stage in its life cycle or to other external variables." They measured a project's degree of bureaucracy using the numerical ratio of the number of admins vs. the number of anonymous edits and vs. the number of low activity editors. Among the findings summarized in the presentation: Big Wikipedias are less efficient (partly due to negative economies of scale), and efficient Wikipedias are significantly more administered.
"#Wikipedia on Twitter: Analyzing Tweets about Wikipedia": See the review in our last issue
"Page Protection: Another Missing Dimension of Wikipedia Research":^[6] Following up on their paper from last year's WikiSym where they had urged researchers to "consider the redirect"^[7] when studying pageview data on Wikipedia, the authors argued that page protection deserves more attention when studying editing activity - it affects e.g. research on breaking news articles, as these are often protected. They went through the non-trivial task of reconstructing every article's protection status at a given moment in time from the protection log, resulting in a downloadable dataset, and encountered numerous inconsistencies and complications in the process (caused e.g. by the combination of deletion and protection). In general, they found that 14% of pageviews are to edit-protected articles.
"Collaborative OER Course Development – Remix and Reuse Approach"^[8] reported on the creation of four computer science textbooks on Wikibooks for undergraduate courses in Malaysia.
"Public Domain Rank: Identifying Notable Individuals with the Wisdom of the Crowd"^[9] "provides a novel and reproducible index of notability for all [authors of public domain works who have] Wikipedia pages, based on how often their works have been made available on sites such as Project Gutenberg (see also earlier coverage of a related paper co-authored by the author: "Ranking public domain authors using Wikipedia data")

"Tool-Mediated Coordination of Virtual Teams"

Review by Morten Warncke-Wang

"Tool-Mediated Coordination of Virtual Teams in Complex Systems"^[10] is the title of a paper at OpenSym 2015. The paper is a theory-driven examination of edits done by tools and tool-assisted contributors to WikiProjects in the English Wikipedia. In addition to studying the extent of these types of edits, the paper also discusses how they fit into larger ecosystems through the lens of commons-based peer production^{[supp 1]} and coordination theory.^{[supp 2]}

Identifying automated and tool-assisted edits in Wikipedia is not trivial, and the paper carefully describes the mixed-method approach required to successfully discover these types of edits. For instance, some automated edits are easy to detect because they're done by accounts that are members of the "bot" group, while tool-assisted edits might require manual inspection and labeling. The methodology used in the paper should be useful for future research that aims to look at similar topics.

Measuring Wiki Quality with PageRank

Review by Morten Warncke-Wang and Tilman Bayer

A paper from the WETICE 2015 conference titled "Analysing Wiki Quality using Probabilistic Model Checking"^[11] studies the quality of enterprise wikis running on the MediaWiki platform through a modified PageRank algorithm and probabilistic model checking. First, the paper defines a set of five properties describing quality through links between pages. Two examples are "temples", articles which are disconnected from other articles (akin to orphan pages in Wikipedia), and "God" pages, articles which can be immediately reached from other pages. A stratified sample of eight wikis was selected from the WikiTeam dump, and measures were extracted using the PRISM model checker. Across these eight wikis, quality varied greatly. For instance, some wikis have a low proportion of unreachable pages, which is interpreted as a sign of quality.
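
To make the link-structure view of quality more concrete, here is a minimal sketch of standard power-iteration PageRank on a toy link graph, together with a check for pages that no other page links to. It does not reproduce the paper's modified PageRank or its PRISM models. The graph `TOY_WIKI`, the damping factor of 0.85, and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Set


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Standard power-iteration PageRank over a page -> outgoing links map.

    Every link target is assumed to also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def pages_without_inlinks(links: Dict[str, List[str]]) -> Set[str]:
    """Pages that no page links to (comparable to orphan pages)."""
    linked = {t for targets in links.values() for t in targets}
    return set(links) - linked


# Toy wiki (hypothetical): "Orphan" links out but nothing links to it.
TOY_WIKI = {
    "Main": ["A", "B"],
    "A": ["Main"],
    "B": ["A"],
    "Orphan": ["Main"],
}

ranks = pagerank(TOY_WIKI)
orphans = pages_without_inlinks(TOY_WIKI)
```

In this toy graph only "Orphan" has no incoming links, and it ends up with a lower rank than "Main", which receives links from two pages.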

The methodology used to measure wiki quality is interesting as it is an automated method that describes the link structure of a wiki, which can be turned into a support tool. However, the paper could have been greatly improved by discussing information quality concepts and connecting its approach more thoroughly to the literature, research on content quality in Wikipedia in particular. Using authority to measure information quality is not novel. In the Wikipedia-related literature we find it in Stvilia's 2005 work on predicting Wikipedia article quality^{[supp 3]}, where authority is reflected in the "proportion of admin edits" feature, and in a 2009 paper by Dalip et al.^{[supp 4]} PageRank is part of their set of network features, a set that is found to have little impact on predicting quality. While these two examples aim to predict content quality, whereas the reviewed paper more directly measures the quality of the link structure, it is a missed opportunity for a discussion on what constitutes information quality. This discussion of information quality and how high quality can be achieved in wiki systems is further hindered by the paper not properly defining "enterprise wiki", leaving the reader wondering whether there is much of an information quality difference at all between these and Wikimedia wikis.

The paper builds on an earlier one that the authors presented at last year's instance of the WETICE conference, where they outlined "A Novel Methodology Based on Formal Methods for Analysis and Verification of Wikis"^[12] based on the Calculus of Communicating Systems (CCS). In that paper, they also applied their method to Wikipedia, examining the three categories "Fungi found in fairy rings", "Computer science conferences" and "Naval battles involving Great Britain" as an experiment. Even though these only form small subsets of Wikipedia, computing time reached up to 30 minutes.

"Automated News Suggestions for Populating Wikipedia entity Pages"

A paper accepted for publication at the 2015 Conference on Information and Knowledge Management (CIKM 2015), by scientists from the L3S Research Center in Hannover, Germany, suggests news articles for Wikipedia articles to incorporate.^[13] The paper builds on prior work that examines approaches for automatically generating new Wikipedia articles from other knowledge bases, accelerating contributions to existing articles, and determining the salience of new entities for a given text corpus. The paper overlooks some other relevant work about breaking news on Wikipedia,^{[supp 5]} news citation practices,^{[supp 6]} and detecting news events with plausibility checks against social media streams.^{[supp 7]}

Methodologically, this work identifies and recommends news articles based on four features (salience, authority, novelty, and placement) while also recognizing that the relevance of news items to Wikipedia articles changes over time. The authors evaluate their approach using a corpus of 350,000 news articles linked from 73,000 entity pages. The model uses the existing news, article, and section information as ground truth and evaluates its performance by comparing its recommendations against the relations observed in Wikipedia. This research demonstrates that there is still a substantial amount of potential for using historical news archives to recommend revisions to existing Wikipedia content to make it more up-to-date. However, the authors did not release a tool to make these recommendations in practice, so there's nothing for the community to use yet. While Wikipedia covers many high-profile events, it nevertheless has a self-focus bias towards events and entities that are culturally proximate.^{[supp 8]} This paper shows there is substantial promise in making sure all of Wikipedia's articles are updated to reflect the most recent knowledge.

"Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science"

Review by Andrew Gray

This paper, developed from one presented at the 9th International Conference on Web and Social Media, examined the citations used in Wikipedia and concluded that articles from open access journals were 47% more likely to be cited than articles from comparable closed-access journals.^[14] In addition, it confirmed that a journal's impact factor correlates with the likelihood of citation. The methodology is interesting and extensive, calculating the most probable 'neighbors' for a journal in terms of subject, and seeing if it was more or less likely to be cited than these topical neighbors. The expansion of the study to look at fifty different Wikipedias, and covering a wide range of source topics, is welcome, and opens up a number of very promising avenues for future research - why, for example, is so little scholarly research on dentistry cited on Wikipedia, compared to that for medicine? Why do some otherwise substantially-developed Wikipedias like Polish, Italian, or French cite relatively few scholarly papers?

Unfortunately, the main conclusion of the paper is quite limited. While the authors do convincingly demonstrate that articles in their set of open access journals are cited more frequently, this does not necessarily generalise to say whether open access articles in general are - which would be a substantially more interesting result. It has previously been shown that as of 2014, around half of all scientific literature published in recent years is open access in some form - that is, a reader can find a copy freely available somewhere on the internet.^{[supp 9]} Of these, only around 15% of papers were published in the "fully" open access journals covered by the study. This means that almost half of the "closed access" citations will have been functionally open access - and as Wikipedia editors generally identify articles to cite at the article level, rather than the journal level, it makes it very difficult to draw any conclusions on the basis of access status. The authors do acknowledge this limitation - "Furthermore, free copies of high impact articles from closed access journals may often be easily found online" - but perhaps had not quite realised the scale of 'alternative' open access methods.
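
The "almost half" figure can be checked with a short calculation. The sketch below treats "around half" as exactly 0.5 and "around 15%" as exactly 0.15, and assumes that citations are drawn evenly from the recent literature. These are simplifications for illustration, and the function name is hypothetical.

```python
def functionally_open_share(open_share: float, gold_share_of_open: float) -> float:
    """Fraction of papers outside fully open access journals that are still free to read."""
    # Papers published in fully open access journals, as a share of all papers.
    gold = open_share * gold_share_of_open
    # Papers that are freely available but not published in fully open access journals.
    open_elsewhere = open_share - gold
    # Everything that is not in a fully open access journal.
    outside_gold = 1.0 - gold
    return open_elsewhere / outside_gold


share = functionally_open_share(open_share=0.5, gold_share_of_open=0.15)
print(f"{share:.1%}")  # prints 45.9%
```

Under these assumptions, 7.5% of all papers appear in fully open access journals, and 42.5 of the remaining 92.5 percentage points are freely available elsewhere, which is about 46%.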

In addition, a plausible alternative explanation is not considered in the study: fully open access journals tend to be younger. Two-thirds of those listed in Scopus have begun publication since 2005, against only around a third of closed-access titles, which are more likely to have a substantial corpus of old papers. It is reasonable to assume that Wikipedia would tend towards discussing and citing more recent research (the extensively-discussed issue of "recentism"). If so, we would expect to see a significant bias in favour of these journals for reasons other than their access status.

Early warning system identifies likely vandals based on their editing behavior

Summary by Srijan Kumar, Francesca Spezzano and V.S. Subrahmanian

“VEWS: A Wikipedia Vandal Early Warning System” is a system developed by researchers at the University of Maryland that predicts users on Wikipedia who are likely to be vandals before they are flagged for acts of vandalism.^[15] In a paper presented at KDD 2015 this August, we analyze differences in the editing behavior of vandals and benign users. Features that distinguish between vandals and benign users are derived from metadata about consecutive edits by a user and capture time between consecutive edits (very fast vs. fast vs. slow), commonalities amongst categories of consecutively edited pages, hyperlink distance between pages, etc. These features are extended to also use the entire edit history of the user. Since the features only depend on the metadata from an editor’s edits, VEWS can be applied to any language Wikipedia.
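
As a rough illustration of what features over consecutive edits might look like, the sketch below derives three values for each pair of successive edits by one user: a speed bucket, whether the two pages share a category, and whether the same page was edited twice. It is not the VEWS implementation. The `Edit` record, the function names, and the two time thresholds are assumptions chosen for the example, and hyperlink distance is left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit metadata record."""
    timestamp: float              # seconds since epoch
    page: str
    categories: FrozenSet[str]


# Assumed cut-offs for the "very fast / fast / slow" buckets.
VERY_FAST_SECONDS = 180
FAST_SECONDS = 900


def speed_bucket(gap_seconds: float) -> str:
    """Map the time between two consecutive edits to a coarse label."""
    if gap_seconds < VERY_FAST_SECONDS:
        return "very_fast"
    if gap_seconds < FAST_SECONDS:
        return "fast"
    return "slow"


def pair_features(edits: List[Edit]) -> List[Tuple[str, bool, bool]]:
    """One (speed bucket, shares a category, same page) tuple per consecutive pair."""
    ordered = sorted(edits, key=lambda e: e.timestamp)
    features = []
    for prev, cur in zip(ordered, ordered[1:]):
        features.append((
            speed_bucket(cur.timestamp - prev.timestamp),
            bool(prev.categories & cur.categories),
            prev.page == cur.page,
        ))
    return features


EXAMPLE_EDITS = [
    Edit(0, "Paris", frozenset({"Capitals", "France"})),
    Edit(60, "Lyon", frozenset({"France"})),
    Edit(2000, "Banana", frozenset({"Fruit"})),
]
example_features = pair_features(EXAMPLE_EDITS)
```

Because the sketch only reads timestamps, page titles, and categories, it does not depend on the language of the edited text, which mirrors the language independence described above.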

For our experiments, we used a dataset of about 31,000 users (representing a 50-50 split of vandals and benign users), since released on our website. All experiments were done on the English Wikipedia. The paper reports an accuracy of 87.82% with a 10-fold cross validation, as compared to a 50% baseline. Even with the user’s first edit, the accuracy of identifying the vandal is 77.4%. As seen in the figure, predictive accuracy increases with the number of edits used for classification.

Current systems such as ClueBot NG and STiki are very efficient at detecting vandalism edits in English (but not in other languages), but detecting vandals is not their primary task. Straightforward adaptations of ClueBot NG and STiki to identify vandals yield modest performance. For instance, VEWS detects a vandal on average 2.39 edits before ClueBot NG. Interestingly, incorporating the features from ClueBot NG and STiki into VEWS slightly improves the overall accuracy, as depicted in the figure. Overall, the combination of VEWS and ClueBot NG is a fully automated vandal early warning system for English language Wikipedia, while VEWS by itself provides strong performance for identifying vandals in any language.

"DBpedia Commons: Structured Multimedia Metadata from the Wikimedia Commons"

Review by Guillaume Paumier

"DBpedia Commons: Structured Multimedia Metadata from the Wikimedia Commons" is the title of a paper accepted to be presented at the upcoming 14th International Semantic Web Conference (ISWC 2015) to be held in Bethlehem, Pennsylvania on October 11-15, 2015.^[16] In the paper, the authors describe their use of DBpedia tools to extract file and content metadata from Wikimedia Commons, and make it available in RDF format.

The authors used a dump of Wikimedia Commons's textual content from January 2015 as the basis of their work. They took into account "Page metadata" (title, contributors) and "Content metadata" (page content including information, license and other templates, as well as categories). They chose not to include content from the Image table ("File metadata", e.g. file dimensions, EXIF metadata, MIME type) to limit their software development efforts.

The authors expanded the existing DBpedia Information Extraction Framework (DIEF) to support special aspects of Wikimedia Commons. Four new extractors were implemented, to identify a file's MIME type, images in a gallery, image annotations, and geolocation. The properties they extracted, using existing infobox extractors and the new ones, were mapped to properties from the DBpedia ontology.

The authors boast a total of 1.4 billion triples inferred as a result of their efforts, nearly 100,000 of which come from infobox mappings. The resulting datasets are now included in the DBpedia collection, and available through a dedicated interface for individual files (example) and SPARQL queries.

It seems like a missed opportunity to have ignored properties from the Image table. This choice caused the authors to re-implement MIME type identification by parsing file extensions themselves. Other information, like the date of creation of the file, or shutter speed for digital photographs, is also missing as a consequence of this choice. The resulting dataset is therefore not as rich as it could have been; since File metadata is stored in structured format in the MediaWiki database, it would arguably have been easier to extract than the free-form Content metadata the authors included.

It is also slightly disappointing that the authors didn't mention the CommonsMetadata API, an existing MediaWiki interface that extracts Content metadata like licenses, authors and descriptions. It would have been valuable to compare the results they extracted with the DBpedia framework with those returned by the API.

Nonetheless, the work described in the paper is interesting in that it focuses on a lesser-known wiki than Wikipedia, and explores the structuring of metadata from a wiki whose content is already heavily soft-structured with templates. The resulting datasets and interfaces may provide valuable insights to inform the planning, modeling and development of native structured data on Commons using Wikibase, the technology that powers Wikidata.

Briefly

Wikipedia in education as an acculturation process: This paper^[17] looks at the benefits of using Wikipedia in the classroom, stressing, in addition to the improvement in writing skills, the importance of acquiring digital literacy skills. In other words, by learning how to edit Wikipedia, students acquire skills that are useful, and perhaps essential, in today's world, such as the ability to learn about an online project's norms and values, how to deal with trolls, how to work with others in collaborative online projects, etc. The authors discuss those concepts through acculturation theory and develop their views further through the grounded theory methodology. They portray learning as an acculturation process that occurs when two independent cultural systems (Wikipedia and academia) come into contact.

Other recent publications

A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.

"Depiction of cultural points of view on homosexuality using Wikipedia as a proxy"^[18]
"The Sum of All Human Knowledge in Your Pocket: Full-Text Searchable Wikipedia on a Raspberry Pi"^[19]
"Wikipedia Chemical Structure Explorer: substructure and similarity searching of molecules from Wikipedia" ^[20]

References

^ Maximilian Klein: Wikipedia in the World of Global Gender Inequality Indices: What The Biography Gender Gap Is Measuring. OpenSym ’15, August 19 - 21, 2015, San Francisco, CA, USA http://www.opensym.org/os2015/proceedings-files/p404-klein.pdf / http://notconfusing.com/opensym15/
^ Claudia Müller-Birn , Benjamin Karran, Markus Luczak-Roesch, Janette Lehmann: Peer-production system or collaborative ontology engineering effort: What is Wikidata? OpenSym ’15, August 19 - 21, 2015, San Francisco, CA, USA http://www.opensym.org/os2015/proceedings-files/p501-mueller-birn.pdf
^ Ruqin Ren: The Evolution Of Knowledge Creation Online: Wikipedia and Knowledge Processes. OpenSym ’15, August 19 - 21, 2015, San Francisco, CA, USA. http://www.opensym.org/os2015/proceedings-files/p406-ren.pdf
^ Romain Picot Clemente, Cecile Bothorel, Nicolas Jullien: Contribution, Social networking, and the Request for Adminship process in Wikipedia. OpenSym ’15, August 19 - 21, 2015, San Francisco, CA, USA http://www.opensym.org/os2015/proceedings-files/p405-picot-clemente.pdf
^ Nicolas Jullien, Kevin Crowston, Felipe Ortega: The Rise and Fall of an Online Project. Is Bureaucracy Killing Efficiency in Open Knowledge Production? OpenSym ’15, August 19 - 21, 2015, San Francisco, CA, USA http://www.opensym.org/os2015/proceedings-files/p401-jullien.pdf slides
^ Benjamin Mako Hill, Aaron Shaw: Page Protection: Another Missing Dimension of Wikipedia Research. OpenSym ’15, August 19 - 21, 2015, San Francisco, CA, USA. http://www.opensym.org/os2015/proceedings-files/p403-hill.pdf / downloadable dataset
^ Benjamin Mako Hill, Aaron Shaw: Consider the Redirect: A Missing Dimension of Wikipedia Research. OpenSym ’14 , Aug 27-29 2014, Berlin, Germany. http://www.opensym.org/os2014/proceedings-files/p604.pdf
^ Sheng Hung Chung, Khor Ean Teng: Collaborative OER Course Development – Remix and Reuse Approach. OpenSym ’15, August 19 - 21, 2015, San Francisco, CA, USA. http://www.opensym.org/os2015/proceedings-files/c200-chung.pdf
^ Allen B. Riddell: Public Domain Rank: Identifying Notable Individuals with the Wisdom of the Crowd. OpenSym ’15, August 19 - 21, 2015, San Francisco, CA, USA. http://www.opensym.org/os2015/proceedings-files/p300-riddell.pdf
^ Gilbert, Michael; Zachry, Mark (August 2015). "Tool-mediated coordination of virtual teams in complex systems" (PDF). Proceedings of the 11th International Symposium on Open Collaboration. pp. 1–8. doi:10.1145/2788993.2789843. ISBN 9781450336666. S2CID 11963811.
^ De Ruvo, Giuseppe; Santone, Antonella (June 2015). "Analysing Wiki Quality Using Probabilistic Model Checking" (PDF). 2015 IEEE 24th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises. pp. 224–229. doi:10.1109/WETICE.2015.18. ISBN 978-1-4673-7692-1. S2CID 10868389.
^ Giuseppe De Ruvo, Antonella Santone: A Novel Methodology Based on Formal Methods for Analysis and Verification of Wikis doi:10.1109/WETICE.2014.25 http://www.deruvo.eu/preprints/W2T2014.pdf
^ Fetahu, Besnik; Markert, Katja; Anand, Avishek (October 2015). "Automated News Suggestions for Populating Wikipedia Entity Pages" (PDF). Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pp. 323–332. arXiv:1703.10344. doi:10.1145/2806416.2806531. ISBN 9781450337946. S2CID 11264899.
^ Teplitskiy, M.; Lu, G.; Duede, E. (2015). "Amplifying the impact of open access: Wikipedia and the diffusion of science". Journal of the Association for Information Science and Technology. 68 (9): 2116–2127. arXiv:1506.07608. doi:10.1002/asi.23687. S2CID 10220883.
^ Kumar, Srijan; Spezzano, Francesca; Subrahmanian, V.S. (August 2015). "VEWS: A Wikipedia Vandal Early Warning System" (PDF). Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 607–616. arXiv:1507.01272. doi:10.1145/2783258.2783367. ISBN 9781450336642. S2CID 2198041.
^ Vaidya, Gaurav; Kontokostas, Dimitris; Knuth, Magnus; Lehmann, Jens; Hellmann, Sebastian (October 2015). "DBpedia Commons: Structured Multimedia Metadata from the Wikimedia Commons" (PDF). Proceedings of the 14th International Semantic Web Conference.
^ Brailas, Alexios; Koskinas, Konstantinos; Dafermos, Manolis; Alexias, Giorgos (July 2015). "Wikipedia in Education: Acculturation and learning in virtual communities". Learning, Culture and Social Interaction. 7: 59–70. doi:10.1016/j.lcsi.2015.07.002.
^ Croce, Marta (2015-04-30). "Depiction of cultural points of view on homosexuality using Wikipedia as a proxy". Density Design.
^ Jimmy Lin: The Sum of All Human Knowledge in Your Pocket: Full-Text Searchable Wikipedia on a Raspberry Pi https://www.umiacs.umd.edu/~jimmylin/publications/Lin_JCDL2015.pdf Short paper, JCDL’15, June 21–25, 2015, Knoxville, Tennessee, USA.
^ Ertl, Peter; Patiny, Luc; Sander, Thomas; Rufener, Christian; Zasso, Michaël (2015-03-22). "Wikipedia Chemical Structure Explorer: substructure and similarity searching of molecules from Wikipedia". Journal of Cheminformatics. 7 (1): 10. doi:10.1186/s13321-015-0061-y. ISSN 1758-2946. PMID 25815062.

Supplementary references and notes: