Bot writes about theatre plays; "Renaissance editors" create better content

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

US playwright Alice Gerstenberg. A bot-generated article about her 1920 comedy Fourteen was accepted with minimal changes.

Bot detects theatre play scripts on the web and writes Wikipedia articles about them

A paper^[1] presented at the International Conference on Pattern Recognition last year (earlier poster) describes an automated method to improve Wikipedia's coverage of theatre plays ("only about 10% of the plays in our dataset have corresponding Wikipedia pages"). It searches for playscripts and related documents on the web and extracts key information from them, including the play's main characters, relevant sentences from online synopses of the play, and mentions in Google Books and the Google News archive, which are gathered in an attempt to ensure that the play satisfies Wikipedia's notability criteria. It then compiles this information into an automatically generated Wikipedia article. Two of the 15 articles submitted as a result of this method were accepted by Wikipedia editors. For the first, Chitra by Rabindranath Tagore, the initial bot-created submission underwent significant changes by other editors ("the final page reflects some of the improvements we can incorporate in our bot"). The second one, Fourteen by Alice Gerstenberg, "was moved into Wikipedia mainspace with minimal changes. All the references, quotes and paragraphs were retained".

"Renaissance Editors" create better Wikipedia content

A study of the German Wikipedia^[2], about the diversity of editor contributions among the 8 "main categories", shows a relationship between editor diversity and quality. The authors start by defining an "interest profile" of an editor – the proportion of bytes contributed across all categories. Then an entropy measure is proposed which rewards an interest profile for being more distributed across more categories – having a polymath style.
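The idea of an interest profile and its entropy can be illustrated with a minimal sketch. It assumes plain Shannon entropy (base 2) over the share of bytes per category; the paper's exact formula may differ, and the function names and category labels below are hypothetical, chosen only for illustration.

```python
import math


def interest_profile(bytes_by_category):
    """Return the share of an editor's bytes contributed to each category."""
    total = sum(bytes_by_category.values())
    if total <= 0:
        raise ValueError("editor has no contributions")
    return {cat: b / total for cat, b in bytes_by_category.items()}


def diversity(profile):
    """Shannon entropy (in bits) of an interest profile.

    Assumption: the study's measure behaves like Shannon entropy, i.e. it is
    0 for a pure specialist and maximal for an evenly spread profile.
    """
    return sum(p * math.log2(1 / p) for p in profile.values() if p > 0)


# Hypothetical editors over 8 placeholder categories "A".."H".
specialist = interest_profile({"A": 12000, "B": 0, "C": 0, "D": 0,
                               "E": 0, "F": 0, "G": 0, "H": 0})
polymath = interest_profile({cat: 1500 for cat in "ABCDEFGH"})

print(diversity(specialist))  # lowest possible score: all bytes in one category
print(diversity(polymath))    # highest possible score for 8 categories: log2(8)
```

Under this assumption, an editor who writes in only one category scores 0, while one who spreads contributions evenly over all 8 categories reaches the maximum of log2(8) = 3 bits.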

Leonardo Da Vinci is a famous example of a "Renaissance man" or "polymath"

The study shows a correlation between the average diversity of an article's contributors and the article's quality. Article quality is determined based on whether the article is a "Good Article", a "Featured Article", or neither. Total productivity, measured by bytes contributed, is also shown to be linked to diversity, although this result falls just short of statistical significance. Finally, a logistic regression shows that diversity, more than productivity, is a significant predictor of article quality.

Despite too many simplifications (e.g. single language, naive article quality ratings, too broad categories), the methods used by the researchers are well-defined, clear, and convincing in a limited scope, and put a finger on the notion that our most lauded editors tend to run all over Wikipedia.

Briefly

In-depth examination of the history of three featured articles on the Swedish Wikipedia, and their main editors: This paper^[3] looks at collaboration on the Swedish Wikipedia via a qualitative analysis of three Featured Articles. Information is pulled into the articles from a variety of sources, including other language Wikipedias, and curated by editors. The study found that the articles' growth followed a similar trajectory and that both content-oriented and process-oriented editors contributed to them, in what the author calls a process of 'intercreation'.

"Contropedia" tool identifies controversial issues within articles: This paper^[4] describes a new method for identifying and examining controversial issues within Wikipedia articles. It outlines the development of an algorithm that identifies the most contested topics by analysing the edits surrounding wikilinks. The resulting Contropedia tool (already presented at WikiSym 2014^[5]) provides an excellent visual presentation of hot-button issues in a given article. The authors note that the tool has the potential to be of use to researchers interested in studying how controversial issues in an article evolve over time, as well as affording Wikipedians insight into potential sites of controversy.

"Building Multilingual Language Resources in Web Localisation: A Crowdsourcing Approach": Several chapters of this semi-recently published volume on natural-language processing are about wikis, confirming their value for NLP research. Some results are still of some use.

"Micro-crowdsourcing" and shared translation memory are proposed as solutions to localising the web. In fact, both had already been implemented by the Translate extension, on translatewiki.net in 2009 and on Wikimedia wikis in 2012 respectively, years ahead of the researchers' idea.^[6]
A Basque Wikipedia machine translation experiment we didn't yet know of is reported (see Wikimania 2010). Translating 100 articles was enough to improve a machine translation system by 10%, which is encouraging for Wikimedia's Content translation project.^[7]
A survey paper lists some tools to use on wiki talk pages, including an active freely licensed spellchecker. The rest was either (en|simple).wiki-specific or superseded by official client lists and Wikimedia Labs at the time of publication.^[8]
Italian linguists developed a CC-BY-SA dictionary based on Tullio De Mauro's, which they describe as "very close to Wiktionary" but with two differences in their platform: "senses (and their relationships) are first-class citizens [...] a rich interactive and WYSIWYG Web interface that is tailored to linguistic content."^[9]

Other recent publications

A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.

"The dynamic nature of conflict in Wikipedia"^[10] From the abstract: "With a small number of simple ingredients, our model mimics several interesting features of real human behaviour, namely in the context of edit wars. We show that the level of conflict is determined by a tolerance parameter, which measures the editors' capability to accept different opinions and to change their own opinion."

"Comprehensive Wikipedia Monitoring for Global and Realtime Natural Disaster Detection"^[11] (slides)

"Digital doorway: Gaining library users through Wikipedia"^[12] (about Template:Library resources box)

"Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History"^[13] From the abstract: "Hedera exploits Map-Reduce paradigm to achieve rapid extraction, it is able to handle one entire Wikipedia articles' revision history within a day in a medium-scale cluster, and supports flexible data structures for various kinds of semantic web study."

"Learning to Identify Historical Figures for Timeline Creation from Wikipedia Articles"^[14]

"WiiCluster: A Platform for Wikipedia Infobox Generation"^[15]

"Proceed With Extreme Caution: Citation to Wikipedia in Light of Contributor Demographics and Content Policies"^[16]

"Wikipedia: helping to promote the art and science of civil engineering"^[17]

References

^ Banerjee, Siddhartha; Cornelia Caragea; Prasenjit Mitra (2014). Playscript Classification and Automatic Wikipedia Play Articles Generation. 2014 22nd International Conference on Pattern Recognition (ICPR). pp. 3630–3635. doi:10.1109/ICPR.2014.624. (preprint, dataset)
^ Szejda, J.; Sydow M.; Czerniawska D. (2014). "Does a "Renaissance Man" Create Good Wikipedia Articles?" (PDF). Proceedings of the International Conference on Knowledge Discovery and Information Retrieval. Vol. (KDIR-2014). p. 425-430. doi:10.5220/0005155804250430. ISBN 978-989-758-048-2. Retrieved 28 January 2015.
^ Mattus, Maria (26 November 2014). "The Anyone-Can-Edit Syndrome – Intercreation Stories of Three Featured Articles on Wikipedia". Nordicom Review (35) 2014: 189–203. Retrieved 28 January 2015.
^ Borra, Erik; et al. (2015). "Societal Controversies in Wikipedia Articles" (PDF). Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. pp. 193–196. arXiv:1904.08721. doi:10.1145/2702123.2702436. ISBN 978-1-4503-3145-6. Retrieved 28 January 2015.
^ Erik Borra, Esther Weltevrede, Paolo Ciuccarelli, Andreas Kaltenbrunner, David Laniado, Giovanni Magni, Michele Mauri, Richard Rogers, Tommaso Venturini: Contropedia – the analysis and visualization of controversies in Wikipedia articles PDF
^ Wasala, Asanka; Schäler, Reinhard; Buckley, Jim; Weerasinghe, Ruvan (21 Feb 2013). Building Multilingual Language Resources in Web Localisation: A Crowdsourcing Approach. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 69–99. doi:10.1007/978-3-642-35085-6_3. ISBN 978-3-642-35085-6. Retrieved 26 January 2015.
^ Alegria, Iñaki; Cabezon, Unai; Betoño, Unai Fernandez de; Labaka, Gorka (21 Feb 2013). Reciprocal Enrichment Between Basque Wikipedia and Machine Translation. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 101–118. doi:10.1007/978-3-642-35085-6_4. ISBN 978-3-642-35085-6. Retrieved 26 January 2015.
^ Ferschke, Oliver; Daxenberger, Johannes; Gurevych, Iryna (21 Feb 2013). A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 121–160. doi:10.1007/978-3-642-35085-6_5. ISBN 978-3-642-35085-6. Retrieved 26 January 2015.
^ Oltramari, Alessandro; Vetere, Guido; Chiari, Isabella; Jezek, Elisabetta (2013). Senso Comune: A Collaborative Knowledge Resource for Italian. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 45–67. ISBN 978-3-642-35085-6. Retrieved 26 January 2015.
^ Gandica, Y.; F. Sampaio dos Aidos; J. Carvalho (2014-08-19). "The dynamic nature of conflict in Wikipedia". Epl (Europhysics Letters). 108 (1) 18003. arXiv:1408.4362. Bibcode:2014EL....10818003G. doi:10.1209/0295-5075/108/18003.
^ Thomas Steiner: Comprehensive Wikipedia Monitoring for Global and Realtime Natural Disaster Detection. ISWC 2014 Developers Workshop PDF
^ A Spencer, B Krige, S Nair: Digital doorway: Gaining library users through Wikipedia PDF
^ Tuan Tran and Tu Ngoc Nguyen: Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History PDF
^ Sandro Bauer, Stephen Clark, Thore Graepel: Learning to Identify Historical Figures for Timeline Creation from Wikipedia Articles. PDF
^ Zhang, Kezun; Yanghua Xiao; Hanghang Tong; Haixun Wang; Wei Wang (2014). WiiCluster: A Platform for Wikipedia Infobox Generation. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. CIKM '14. New York, NY, USA: ACM. pp. 2033–2035. doi:10.1145/2661829.2661840. ISBN 978-1-4503-2598-1.
^ Wilson, Jodi L. (2014). "Proceed With Extreme Caution: Citation to Wikipedia in Light of Contributor Demographics and Content Policies". JETLaw: Vanderbilt Journal of Entertainment & Technology Law. 16 (4): 857.
^ Armstrong, Richard (2014-08-01). "Wikipedia: helping to promote the art and science of civil engineering". Proceedings of the ICE – Civil Engineering. 167 (3): 101. doi:10.1680/cien.2014.167.3.101. ISSN 0965-089X.