The Signpost
Single-page Edition
WP:POST/1
26 November 2012

News and notes
Toolserver finance remains uncertain
Recent research
Movie success predictions, readability, credentials and authority, geographical comparisons
Featured content
Panoramic views, history, and a celestial constellation
Technology report
Wikidata reaches 100,000 entries
WikiProject report
Directing Discussion: WikiProject Deletion Sorting
 

2012-11-26

Toolserver finance remains uncertain

By Jan eissfeldt
Wikimedia Labs, the project that is supposed to replace the Toolserver in 2013.

On November 24, a general assembly of Wikimedia Germany (WMDE) voted on the fate of the Wikimedia Toolserver, a central external piece of technical infrastructure that supports the editing communities with volunteer-developed scripts and webpages of various kinds, most of which assist in performing menial tasks.

The chapter set up the Toolserver in the Netherlands in 2005, and has funded its general budget, which has grown to €100k (US$130k), with some financial and technical assistance from the WMF and some financial assistance from European chapters ever since. However, in 2011 the foundation decided to create WikiLabs (also known as Wikimedia Labs) to perform various tasks, including an approximation of the Toolserver's functionality by mid-2013; as part of the plan, the foundation will wind down financial support for what would become at least partially redundant infrastructure.

After WMDE published its annual plan for the upcoming financial year, saying it will not continue to fund the Toolserver after a transitional period, a debate on the potential of WikiLabs to replace the older structure gained traction. DaB, the long-serving "root" volunteer of the Toolserver, said he would resign by the end of the year unless sufficient funding is provided to handle the growing demands on the system. The chapter's management delivered what he saw as insufficient assurances, and he responded by publishing a proposal to the WMDE general assembly to guarantee future funding.

While the German Wikipedia community set up a survey to make its reliance on the Toolserver transparent to voters at the general assembly, the WMDE board, led by DerHexer, responded by outlining a significant amendment to DaB's proposal.

On November 24, the assembly voted and decided to go along with the changes to DaB's proposal. By this decision, it replaced the assurance to fund the Toolserver until a later general assembly can make a final decision, based on the facts concerning what will by then be the established WikiLabs project, with a six-month transitional period. During this time window, WMDE seeks a binding statement from the WMF on when and how the foundation's project is going to replace Toolserver functions. If the demand is not met, the chapter will work out a big-picture governance model to run its infrastructure beyond 2013. The text sponsored by DerHexer also replaced a concrete commitment (to both invest in five new servers and guarantee one full-time staffer) with relatively vague wording, saying that the chapter aims to ensure a "(nearly) trouble-free functionality for the Toolserver", but without specific financial or personnel commitments. Out of the chapter's 2400 members, who are largely not active on WMF projects, 24 supported the amended proposal and six voted against the changes (informal protocol).

Merlissimo, who administers several bots on the Toolserver, told the Signpost that his list of significant reasons why WikiLabs cannot replace the functionality of WMDE's infrastructure remains unaffected by the vote. Summing up his views the day after, DaB stated on the mailing list that he was "disappointed", emphasizing that DerHexer's changes to his proposal leave open significant risks of ambiguity. He said he will announce next Sunday whether he will step down by year's end.

Brief notes

2012-11-26

Movie success predictions, readability, credentials and authority, geographical comparisons

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Early prediction of movie box-office revenues with Wikipedia data

An open-access preprint[1] has announced the results from a study attempting to predict early box-office revenues from Wikipedia traffic and activity data. The authors – a team of computational social scientists from Budapest University of Technology and Economics, Aalto University and the Central European University – submit that behavioral patterns on Wikipedia can be used for accurate forecasting, matching and in some cases outperforming the use of social media data for predictive modeling. The results, based on a corpus of 312 English Wikipedia articles on movies released in 2010, indicate that the joint editing activity and traffic measures on Wikipedia are strong predictors of box-office revenue for highly successful movies.

The authors contrast their early prediction approach with more popular real-time prediction/monitoring methods, and suggest that movie popularity can be accurately predicted well in advance, up to a month before the release. The study received broad press coverage and was featured in The Guardian, the MIT Technology Review and the Hollywood Reporter, among others. The authors observe that their approach, being "free of any language based analysis, e.g., sentiment analysis, could be easily generalized to non-English speaking movie markets or even other kinds of products". The dataset used for this study, including the financial and Wikipedia activity data, is available among the supplementary materials of the paper.

Readability of the English Wikipedia, Simple Wikipedia, and Britannica compared


The automated readability index, one of the readability metrics used in the study[2]

A study[2] by researchers at Kyoto University presents a detailed assessment of the readability of the English Wikipedia against Encyclopedia Britannica and the Simple English Wikipedia using a series of readability metrics and finds that Wikipedia "seems to lag behind the other encyclopedias in terms of readability and comprehensibility of its content". The paper, presented at CIKM’12, uses a variety of metrics spanning syntactical readability indices (such as Flesch reading ease, the automated readability index and the Coleman–Liau index) as well as metrics based on word popularity (including the Dale–Chall readability formula and word frequency indices derived from Google News or the American National Corpus).
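For readers unfamiliar with these indices, the automated readability index is computed from character, word and sentence counts alone: 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The sketch below is illustrative only. The tokenization rules (a regular expression for words, splitting on ".", "!" and "?" for sentences) and the function name are assumptions of this example, not the procedure used in the study.

```python
import re


def automated_readability_index(text: str) -> float:
    """Return the automated readability index (ARI) of `text`.

    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

    The tokenization here is a simplification chosen for this sketch.
    """
    # Words: runs of letters or digits (assumption of this example).
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Sentences: non-empty chunks between '.', '!' or '?' (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(w) for w in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


simple = "The cat sat. The dog ran."
dense = "Encyclopedic articles frequently contain considerably complicated terminology."
print(automated_readability_index(simple) < automated_readability_index(dense))
```

Higher scores indicate text that is harder to read, so short words in short sentences score lowest.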

The authors prepared a corpus of matching articles for the purpose of comparison between the English and Simple English Wikipedia. The study did not perform a random selection of articles, but selected a sample based on the existence of a corresponding article in Simple Wikipedia. The findings of the first analysis indicate that Simple Wikipedia consistently outperforms the English Wikipedia on all readability metrics. Wikipedia also appears to contain on average more proper nouns than Britannica – which, the authors speculate, may be due to specific editorial policies. The second section of the paper measures readability for 500 articles for each one of eight topic categories selected from DBpedia (biology, chemistry, computing, economics, history, literature, mathematics, and philosophy).

The comparison indicates that articles in the computing category are the most readable by syntactical and familiarity measures. Biology and chemistry, on the other hand, seem to include the most difficult articles. The final section reviews the readability of Britannica articles, in particular comparing the readability of articles in the "introductory" class with that of Simple Wikipedia articles and the readability of "encyclopedia" class articles with that of Wikipedia articles. The findings indicate that Britannica outperforms Wikipedia in readability overall, while introductory articles outperform Simple Wikipedia articles. It should be noted that the comparisons were not performed on matched pairs and that the criteria used to sample articles from Britannica were not specified.

A paper whose preprint was previously covered in this research report, and now published as a full research article in PLOS One,[3] found that the Simple English Wikipedia has a higher degree of complexity than the corpus of Charles Dickens' books when measured via the Gunning fog index, but is less complex than the British National Corpus, "which is a reasonable approximation to what we would want to think of as ‘English in general’". See also the September issue of this research report for a summary of a third readability study which had applied the standard Flesch Reading Ease test to the English and Simple English Wikipedias.

Wikipedia favors established views and scientifically backed knowledge

An article appearing in Information, Communication & Society[4] studies the discussion pages of English and German September 11 attacks articles, contributing to the ongoing debates on collaborative knowledge creation in the wiki Web 2.0 context, participation of experts and amateurs on Wikipedia, and, indirectly, reliability of Wikipedia. The article's research question, coming from the sociology of knowledge and social constructivism perspectives, asks to what degree Wikipedia's "anyone can edit" policy democratizes the production of knowledge, removing it from traditional hierarchies "between experts and lay participants". The term democratization here is used in the context of such theoretical concepts as wisdom of crowds, participatory culture, produsage and (more critically) the notions of cult of the amateur or digital Maoism. All of these refer to the fact that Wikipedia's editors are more often amateurs ("lay participants") than professionally recognized experts.

Using the grounded theory approach, the study focuses not on editors, but on their arguments. It finds that due to community-upheld Wikipedia policies such as Wikipedia:Reliable sources, dissenting opinions ("traditionally marginalized types of knowledge") such as various conspiracy theories are still marginalized or excluded outright; according to the author, this "did not lead to a ‘democratization’ of knowledge production, but rather re-enacted established hierarchies". The finding should be taken in context: as the author notes, the article was written by amateurs ("lay participants"), who nevertheless decided to reproduce traditional knowledge hierarchies, relegating various conspiracy theories and similar points not backed up by reliable sources to obscurity on Wikipedia. The paper concludes that Wikipedia, like other encyclopedias, is prone to a "scientism bias", i.e. treating scientifically backed knowledge as "better" than knowledge coming from alternative outlets. Despite the "anyone can edit" motto of Wikipedia, the paper finds support for the argument that Wikipedia puts more stress on article quality than on democratic participation, or in the words of the article: "Although laypeople apparently play a significant part in the text production, this does not mean that they favor lay knowledge. On the contrary, it is clearly elite knowledge of well-established authorities which is finally included in the article, whereas alternative interpretations are harshly excluded or at least marginalized."

Side-note: this reviewer finds the study's use of a Firefox add-on, Wired-Maker, for content analysis rather ingenious, and applauds the mentioning of such a practical methodological tip in the paper.

Trust, authority and credentials on Wikipedia: The case of the Essjay controversy

At the Academy of Management conference in Boston, Dariusz Jemielniak presented a paper on Trust, Control, and Formalization in Open-Collaboration Communities: A Qualitative Study of Wikipedia.[5] It is built around a detailed description and interpretation of the Essjay controversy on the English Wikipedia in 2007 about the use of inaccurate credentials by active Wikipedian and administrator Essjay. The paper is framed in terms of the literature from organization theory on trust and control. Jemielniak argues that organization theory suggests that organizations must either be able and willing to trust participants or must rely on control systems which essentially obviate the need for trust. Using ethnographic data from Wikipedia, Jemielniak suggests that Wikipedia (and, perhaps, a series of similar computer-mediated "open-collaboration communities") instead relies on a series of procedures and "legalistic remedies" which provide a previously untheorized alternative to traditional control systems used in organizations.

The working paper is the first in what Jemielniak suggests will be a series of papers based on a long-term participatory ethnographic study: over the past five years, Jemielniak has edited Wikipedia almost daily and is a steward on Wikimedia projects (he is also the chair of the Wikimedia movement's newly established Funds Dissemination Committee, and recently announced the committee's recommendations on funding requests by various Wikimedia organizations totaling US$10.4M). Jemielniak uses his own experience as well as detailed on-wiki records from conversations surrounding the Essjay affair to walk through the controversy and its implications in depth. He discusses how Wikipedians construct authority and initially reacted with indifference to the revelation that Essjay had used fake credentials, how this changed when new information about Essjay's use of his credentials came to light, how a series of proposals to prevent or respond to such issues in the future were raised, and how the community essentially decided to keep the status quo.

The paper paints a detailed, nuanced, and deeply informed portrait of Wikipedians' responses to the controversy and the ways in which trust and its relationships to authority and credentials are navigated in the project. The author suggests that the creation of rules and legalistic procedures allowed Wikipedians to walk the line between rejecting descriptions of authority per se and minimizing the effects of inaccurate descriptions of authority, by suggesting that editors on Wikipedia should rely much more heavily on users' experience and on the degree to which particular contributions conform to Wikipedia's content guidelines.

A working paper by the same writer, presented at the annual meeting of the Society for Applied Anthropology,[6] gives an overview of Wikipedia's culture by reviewing the role of its norms, guidelines and policies.

Network of users communicating on Wikipedia article talk pages (Neff et al., p.22).[7] Edges connecting two Democrats are colored blue, edges connecting two Republicans in red, and edges representing inter-party dialogue are shown in green.

Briefly

Notes

  1. Mestyán, M., Yasseri, T., & Kertész, J. (2012). Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. arXiv (open access).
  2. Jatowt, A., & Tanaka, K. (2012). Is Wikipedia Too Difficult? Comparative Analysis of Readability of Wikipedia, Simple Wikipedia and Britannica. CIKM’12, pp. 2607–2610 (open access).
  3. Yasseri, T., Kornai, A., & Kertész, J. (2012). A Practical Approach to Language Complexity: A Wikipedia Case Study. PLoS ONE, 7(11), e48386 (open access).
  4. König, R. (2012). Wikipedia. Between lay participation and elite knowledge representation. Information, Communication & Society. Advance online publication (closed access).
  5. Jemielniak, D. (2012). Trust, Control, and Formalization in Open-Collaboration Communities: A Qualitative Study of Wikipedia. Academy of Management 2012 Annual Meeting (open access).
  6. Jemielniak, D. (2012). Wikipedia: An effective anarchy. Society for Applied Anthropology 2012 Annual Meeting (SfAA 2012) (open access).
  7. Neff, J. G., Laniado, D., Kappler, K., Volkovich, Y., Aragón, P., & Kaltenbrunner, A. (2012). Jointly they edit: examining the impact of community identification on political interaction in Wikipedia. arXiv (open access).
  8. Clark, M., Ruthven, I., O’Brian Holt, P., & Song, D. (2012). Looking for genre: the use of structural features during search tasks with Wikipedia. Fourth Information Interaction in Context Conference (IIiX 2012) (open access).
  9. Daxenberger, J., & Gurevych, I. (2012). A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia Articles. Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012) (open access).
  10. Rycak, M. (17 November 2012). Wikipedia-Zugriffszahlen bestätigen Second-Screen-Trend. martinrycak.de (open access).
  11. Liu, Y. (2012). WT-verifier. Truthfulness verification of fact statements on Wikipedia (unpublished master's thesis). State University of New York at Binghamton (closed access).
  12. Reinoso, A. J., Muñoz-Mansilla, R., Herraiz, I., & Ortega, F. (2012). Characterization of the Wikipedia Traffic. Seventh International Conference on Internet and Web Applications and Services (ICIW 2012), pp. 156–162 (open access).
  13. Taraborelli, D. (2012). Wikipedia article ratings. The Data Hub (open access).
  14. Graham, M. (5 November 2012). Virtuous Visible Circles: mapping views to place-based Wikipedia articles. Zero Geography (open access).
  15. Graham, M. (11 November 2012). The most visible country in Europe (on Wikipedia) is... Zero Geography (open access).
  16. Zachte, E. (15 November 2012). Wikipedia page reads, breakdown by region. Infodisiac (open access).



2012-11-26

Panoramic views, history, and a celestial constellation

This edition covers content promoted between 18 and 24 November 2012.
View over Colorada Lake in Bolivia. Panorama stitched from nine portrait format images. A new featured picture.
African River Martin (adult in foreground) and juvenile. Its population size is unknown.
Siege of Constantinople, as depicted in the 14th-century Bulgarian translation of the Manasses Chronicle.
Color composite image of Centaurus A, revealing the lobes and jets emanating from the active galaxy’s central black hole. A new featured picture.

Six featured articles were promoted this week:

  • African River Martin (nom) by Jimfbleak. The African River Martin is a large swallow, mainly black with a blue-green gloss to the head and a greener tint to the back and wings. The main breeding areas are in the Democratic Republic of the Congo along the Congo River and its tributary, the Ubangi, in habitats characterised by mixed tropical forest types including swampy or seasonally flooded woodland in a part of Africa that is poorly known.
  • Siege of Constantinople (717–718) (nom) by Cplakidas. The Second Arab Siege of Constantinople was the Arab offensive against the Byzantine Empire's capital city after twenty years of attacks and progressive Arab occupation of Byzantine borderlands. The Arabs invaded Byzantine Asia Minor, crossing into Thrace in 717, and built siege lines to blockade Constantinople. Attacked by the Byzantines and Bulgars, the Arabs were forced to lift the siege in 718 and the city was rescued. The siege's failure had wide repercussions; the Arab attempt to conquer Byzantine territories was abandoned, and historians credit the siege with halting the Muslim advance into Europe.
  • Muhammad Ali Jinnah (nom) by TopGun, Inlandmamba, and Wehwalt. Muhammad Ali Jinnah (1876–1948) was a lawyer, politician and statesman, known as the founder of Pakistan. He served as leader of the All-India Muslim League from 1913 until Pakistan's independence in 1947, and as Pakistan's first Governor-General from independence until his death. He is revered in Pakistan as the Father of the Nation; his birthday is observed as a national holiday.
  • Harry S. Truman (nom) by PumpkinSky and Wehwalt. Harry S. Truman (1884–1972) was the 33rd President of the United States (1945–1953). The running mate of President Franklin D. Roosevelt in 1944, Truman succeeded to the presidency on April 12, 1945, when Roosevelt died. Under Truman, the US successfully concluded World War II; in its aftermath, tensions with the Soviet Union increased, marking the start of the Cold War.
  • Leo Minor (nom) by Casliber. Leo Minor is a small, faint constellation in the northern celestial hemisphere. Its name is Latin for "the smaller lion", in contrast to Leo, the larger lion. It lies between the larger Ursa Major to the north and Leo to the south. There are 37 stars brighter than apparent magnitude 6.5 in the constellation; three are brighter than magnitude 4.5. It also includes two stars with planetary systems, two pairs of interacting galaxies, and the unique deep-sky object Hanny's Voorwerp.
  • John Adair (nom) by Acdixon. John Adair (1757–1840) was an American pioneer, soldier, statesman and the eighth Governor of Kentucky. He served in the Revolutionary War and was twice held prisoner by the British. He served eight terms in Kentucky's House of Representatives but failed to win a full term in the Senate after his implication in the Burr conspiracy. His role in the War of 1812 and his defense of Kentucky's soldiers against charges of cowardice at the Battle of New Orleans restored his reputation.

One featured list was promoted this week:

  • Latin Grammy Award for Best Singer-Songwriter Album (nom) by Hahc21 and Status. The Latin Grammy Award for Best Singer-Songwriter Album is an honor presented annually at the Latin Grammy Awards to recognize excellence and increase awareness of the diverse contributions of Latin recording artists in the United States and internationally.

Six featured pictures were promoted this week:

One featured topic was promoted this week:

  • River martin (nom) by Jimfbleak. A small swallow subfamily with just two species: one from Thailand is probably extinct, and the other in west Africa is little-studied.

Panoramic view of Naqsh-e Rustam, the tombs of Achaemenid kings and a current archaeological site in Iran: a new featured picture.



2012-11-26

Wikidata reaches 100,000 entries

The team behind Wikidata

Wikidata, the new "Wikimedia Commons for data" and the first new Wikimedia project since 2006, reached 100,000 entries this week. The project aims to be a single, human- and machine-readable database for common data, spanning across all Wikipedia projects, which will "lead to a higher consistency and quality within Wikipedia articles, as well as increased availability of information in the smaller language editions" while lowering the burden on Wikipedia's volunteer editors—whose numbers have stalled overall, and continue to dwindle on the English Wikipedia.

Wikidata is currently in the first of three phases. The site is currently only accepting interwiki links to different-language versions of a page. For example, the 100,000th entry, Cadier en Keer, has only a short description and four links to Wikipedia articles in English, French, Dutch, and Limburgish. The second phase will start the actual collection and storage of data, so that Cadier en Keer will contain basic statistics such as country, province, size, and population. It aims to supplement the infoboxes which many Wikipedias use to display this common data. The third phase will allow anyone to make lists and charts based on the statistics.
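To make the phases concrete, the following is a minimal sketch of how a phase-one entry could be modelled as a data structure. It is not Wikidata's actual schema or API: the class name, field names and language codes are assumptions made for illustration, and the article titles and description text are placeholders rather than values copied from the site.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    """Hypothetical model of a Wikidata entry (illustration only)."""
    label: str
    description: str
    # Phase 1: links to the article in each language edition,
    # keyed by language code.
    sitelinks: Dict[str, str] = field(default_factory=dict)
    # Phase 2: structured facts such as country or population.
    # Left empty here because that data is not yet being collected.
    statements: Dict[str, str] = field(default_factory=dict)

    def languages(self) -> List[str]:
        """Language codes that currently have a linked article."""
        return sorted(self.sitelinks)


# Placeholder values: the titles below are assumed, not verified.
entry = Entry(
    label="Cadier en Keer",
    description="(short description)",
    sitelinks={
        "en": "Cadier en Keer",
        "fr": "Cadier en Keer",
        "nl": "Cadier en Keer",
        "li": "Cadier en Keer",
    },
)
print(entry.languages())
```

Keeping the interwiki links in one shared record is what lets every language edition read the same list instead of maintaining its own copy.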

The project raised €1.3M (US$1.87M) for development from three major funders: half from the Allen Institute for Artificial Intelligence, founded by Microsoft co-founder Paul Allen; a quarter from the Gordon and Betty Moore Foundation, established by Intel co-founder Gordon Moore; and a final quarter from Google, who said that "[our] mission is to make the world's information universally accessible and useful ... we hope [Wikidata] will make significant amounts of structured data available to all." It has eight developers actively working on its infrastructure.

The fast growth of what Linux User & Developer calls "Wikipedia's Game-changer"—over 100,000 entries in one month, with over 800 active users—bodes well for the site so far. In time, Wikidata's overarching goals may seem lofty: one of the original funders stated that "Wikidata ... will transform the way that encyclopedia data is published, made available, and used by a global audience. [It] will build on semantic technology that we have long supported, will accelerate the pace of scientific discovery, and will create an extraordinary new data resource for the world."

Yet even detractors believe that Wikidata has a high potential for expanding human knowledge in the world: "a primary goal ... [is] to make information in Wikipedia much more understandable to artificial intelligence systems. In other words, Wikidata—if successful—is going to form the 'brains' of many future technologies and online platforms."

In brief

  • RENDER milestone: RENDER, a seven-partner (including Wikimedia Germany) program attempting to "develop methods, techniques, software and data sets which enable both scholars and users of internet applications such as Wikipedia to understand, to describe, to process and to make use the diversity of knowledge and information", has reached its second of three years of operation (link is in German).
  • WMF calls for more OSM involvement: the Wikimedia Foundation has announced its intention to develop a Tile Map Service from OpenStreetMap for Wikimedia sites. They also called for a "face-to-face meetup/hackfest [for] geodata/mapping related development work [to be held] sometime around Feb/March 2013," where the WMF would offer sponsorships for "key developers" to attend.


2012-11-26

Directing Discussion: WikiProject Deletion Sorting


This week, we uncovered WikiProject Deletion Sorting, Wikipedia's most active project by number of edits to all the project's pages. This special project seeks to increase participation in Articles for Deletion nominations by categorizing the AfD discussions by various topic areas that may draw the attention of editors. The project was started in August 2005 with manual processes that are continued today by a bevy of bots, categories, and transclusions. The project took inspiration from WikiProject Stub Sorting and some historical discussions on deletion reform. As the sheer number of AfDs continues to grow, the project is seeking better tools to manage the deletion sorting process and attract editors to comment on these deletion discussions. We interviewed Frankie.


What motivated you to join WikiProject Deletion Sorting? Why do AfD discussions need sorting?

For the first question, I started sorting debates because it is a way to assist the deletion process that I can easily fit into my everyday schedule, which tends to be a bit erratic. For participating in a discussion I need to set apart a minimum amount of time to research the subject, whereas for sorting I can make bursts of small, quick edits, and if for some reason I need to go do something else I'm not leaving anything half-way done. On the second question, sorting AfDs is helpful to the deletion process because it increases awareness of one crucial point of an article's lifecycle, as deletion marks a point where the content becomes unavailable, and thus no longer workable. I think AfD regulars may find the daily logs sufficient, but there are many editors who do not review AfD regularly who would be interested in being aware if an article within certain subject areas is nominated, or who would prefer not to have to review the whole daily logs just to find those discussions that they want to take part in.


WikiProject Deletion Sorting is Wikipedia's most active WikiProject when ranked by changes made to articles (second when bots are excluded). Where do these edits come from? How does the project coordinate such enormous activity?

The amount of work required to keep all discussions sorted is by no means trivial, but I think those numbers might be a bit misleading. Note that there will be two corresponding edits under the WikiProject space for each AfD for each list it's included in: one to sort the nomination, and one for the bot to remove it after closure. Given that most nominations are sorted in more than one list, that's 4 or 6 edits per nomination, and there are around 40-50 fresh nominations every day.
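The arithmetic in this answer can be spelled out in a few lines. This is only a back-of-the-envelope sketch using the figures quoted above; the function and constant names are hypothetical, and reading "4 or 6 edits" as two or three lists per nomination is an assumption of the example.

```python
# Two project-space edits per listing: one to sort the nomination,
# one for the bot to remove it after closure.
EDITS_PER_LISTING = 2


def daily_project_edits(nominations_per_day: int, lists_per_nomination: int) -> int:
    """Estimate project-space edits per day generated by deletion sorting."""
    return nominations_per_day * lists_per_nomination * EDITS_PER_LISTING


low = daily_project_edits(40, 2)   # 40 nominations, 2 lists each
high = daily_project_edits(50, 3)  # 50 nominations, 3 lists each
print(low, high)  # 160 300
```

On these assumptions the project accumulates somewhere between 160 and 300 edits a day from routine sorting alone.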


How is deletion sorting actually conducted? What templates, scripts, lists, and other tools are available to help sort AfD discussions?

The sorting lists can be found at WP:WikiProject Deletion sorting/Flat or WP:WikiProject Deletion sorting/Compact (they're the same list, only the presentation varies). The process consists of transcluding the nomination in the appropriate list(s), and then including a notice on the nomination using the template {{subst:delsort|ListName}}. In addition to letting involved editors know that the debate has been sorted, this notice helps the closing administrator by telling them how long the debate has been advertised, which may factor into deciding whether to relist.
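The two edits described here can be sketched as the wikitext an editor would add in each place. The helper below is hypothetical and only builds strings; the nomination page title and list name in the usage lines are invented examples, and the notice uses the template exactly as quoted above, without any further parameters the real template may accept.

```python
from typing import Tuple


def sorting_edits(afd_page: str, list_name: str) -> Tuple[str, str]:
    """Return the two pieces of wikitext involved in sorting one debate.

    The first string goes on the sorting list and transcludes the
    nomination page; the second goes on the nomination page itself.
    """
    # Wrapping a full page title in double braces transcludes that page.
    transclusion = "{{" + afd_page + "}}"
    # The notice template as given in the interview.
    notice = "{{subst:delsort|" + list_name + "}}"
    return transclusion, notice


# Invented example values, for illustration only.
list_line, afd_notice = sorting_edits(
    "Wikipedia:Articles for deletion/Example article", "ListName"
)
print(list_line)
print(afd_notice)
```

Repeating the call with a different list name covers the common case where one nomination is sorted into several lists.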


What kinds of editors tend to use the project's resources? Can new Wikipedians take part or is deletion sorting more appropriate for experienced users?

New Wikipedians can simply take part in the process. Unlike mainspace categorization, which has a number of considerations and caveats when it comes to how to categorize an article, deletion sorting is a meta-process that simply aims to increase awareness, and it has minimal pitfalls. Sorting a discussion in a list that is completely off from the article's subject wouldn't be optimal, but all it would mean is that one just let people know about a discussion that they might not care about.


Anything else you'd like to add?

If anything, just to reiterate that editors should feel free to jump in, and to post any doubts at WT:WikiProject Deletion sorting.


Next week, we'll take a stroll down the vast, unspoiled Yorkshire countryside. Until then, you can locate our previous reports in the archive.


The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0