The Signpost

Recent research

Wikipedia more useful than academic journals, but is it stealing the news?

By Tilman Bayer and Smallbones

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"Is Wikipedia stealing the news?"

A paper in the current issue of First Monday[1] "analyzes Wikipedia’s breaking news practices and the ways the Internet is changing perceptions of news", based on a case study of the article 2014 Sydney hostage crisis.

The author is a lecturer in journalism at the University of Sydney, and co-organiser of an upcoming academic conference co-sponsored by Wikimedia Australia ("The Worlds of Wikimedia™: communicating and collaborating across languages and cultures"). In a press release by the university, somewhat provocatively titled "Is Wikipedia stealing the news?" (see also podcast, starting at 21:55), she describes Wikipedia as "a competitor to media organisations" and states:

Wikipedia contributors don't undertake the core role of journalists, which is to produce new work. Contributors' news gathering practices are solely "aggregation and assemblage", and it is important to recognise that the journalistic labour that underpins a Wikipedia page has been funded by media organisations and appropriated without economic consideration.

The case study in the paper itself includes:

The author also interviewed a senior Wikipedian involved in the article.

The paper criticizes the "reasoning [behind some of Wikipedia's policies and protocols around news as] contradictory. The claim [in WP:NOTNEWS] that breaking news should not be emphasized or treated differently doesn’t fit with the specific parameters set by their ‘current event’ template. The entry also claims that Wikipedia is not written in ‘news style’ which also doesn’t hold up to scrutiny ... The 2014 Sydney hostage crisis page clearly conforms: the lead sentence contains who, what, when, and where [Five Ws], is written in past tense and the information is presented according to an inverted pyramid structure."

Alongside the presence of other Wikipedia features such as the "In the News" section on the main page and the use of infoboxes to summarize essential information, the author interprets this as a vindication of traditional news-writing practices: "Over the decades since [Wikipedia's founding], through trial and error and negotiation, the community has adopted a form for presenting information that is readily recognisable as employing news conventions ... . This demonstrates the ongoing versatility of news writing style as an efficient form of communication that extends beyond legacy newspapers, where it originated, and into new forms as they emerge on the Internet." She acknowledges the quality work done by the Wikipedia volunteers, with talk pages "show[ing] just how closely the behaviour of non-journalists resembles that of a professional newsroom."

While these conclusions are backed by detailed observations about Wikipedia, the paper offers few arguments to substantiate the appropriation and competition claims highlighted in the press release. In a Facebook discussion with Wikipedians, the author distanced herself from the "stealing" headline, but otherwise seemed to stand by these concerns. Her use of terms like "appropriated", "in the economic sense", "payment" etc. suggests an underlying assumption of property rights about facts that is at odds with the legal and economic system that has long underpinned the news business in Western countries. In copyright law, this system relies on the idea–expression divide, or specifically in Australia on the seminal court decision Victoria Park Racing & Recreation Grounds Co Ltd v Taylor, which asserted: "The law of copyright does not operate to give any person an exclusive right to state or to describe particular facts. A person cannot by first announcing that a man fell off a bus or that a particular horse won a race prevent other people from stating those facts". It seems that Avieson disagrees with this, at least when the first person is a journalist and those "other people" are Wikipedia editors. Given that journalists themselves routinely rely on the "labour" of other journalists without compensating them (most newspaper articles don't exclusively consist of original reporting) and on that of their sources (paying them is a highly controversial practice even when those sources undertake substantial effort or risk to provide information to the journalist), it's hard to escape the impression that this paper falls into a common trap of Wikipedia criticism: berating the open, volunteer community project for practices that are in fact common in traditional, commercial media as well.

Conferences and events

See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines, and the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer

"Wikipedia can be more useful than academic journal articles" for learning about certain technologies

From the abstract:[2]

This article analyses five technology-enhanced learning-related terms on Wikipedia, assessing their usefulness in relation to academic journal articles concerning the same terms. Data were obtained about the word lengths of the Wikipedia articles, the numbers of Wikipedia edits and numbers of academic journal publications over the first 5 years after the creation of the first Wikipedia entry. ... The article argues that Wikipedia can be more useful than academic journal articles in the new and emerging phases of a technology, because of the volume of information made available, together with the speed of its publication and the updating of its contents.

"The Network Structure of Successful Collaboration in Wikipedia"

From the abstract:[3]

... we compare the network mechanisms underlying the production of the complete set of featured articles, with the network mechanisms of a contrasting sample of comparable non-featured articles in the English-language edition of Wikipedia. Estimates of relational event models suggest that contributors to featured articles display greater deference toward the reputation of their team members. Contributors to featured articles also display a weaker tendency to follow the behavioral norms predicted by the theory of structural balance, and hence a weaker tendency toward polarization.

(See also our earlier review of a paper by the same authors: "Articles receiving the most attention (by editors) overall lack the depth of quality found in featured articles")

"Negotiation processes on Wikipedia talk pages in the case of the White Rose"

Paper/book chapter in German[4]; the title translates as "How does communicative memory become cultural memory? Negotiation processes on Wikipedia talk pages in the case of the White Rose"

"Application of SEO Metrics to Determine the Quality of Wikipedia Articles and Their Sources"

From the abstract:[5]

Based on the fact that most of [Wikipedia's] references are web pages, it is possible to get more information about their quality by using citation analysis tools. ... This paper presents general results of Wikipedia analysis using metrics from the Toolbox SISTRIX, which is one of the leading providers of indicators for Search Engine Optimization (SEO). In addition to the preliminary analysis of the Wikipedia articles as separate web pages, we extracted data from more than 30 million references in different language versions of Wikipedia and analyzed over 180 thousand most popular hosts.

(See also related earlier coverage)

Wikipedia biographies show how the invention of printing shaped the history of science and art

From the abstract:[6]

Here we combine a common causal inference technique (instrumental variable estimation) with a dataset on nearly forty thousand biographies from Wikipedia (Pantheon 2.0), to study the effect of the introduction of printing in European cities on Wikipedia’s digital biographical records. By using a city’s distance to Mainz as an instrument for the adoption of the movable type press, we show that European cities that adopted printing earlier were more likely to become the birthplace of a famous scientist or artist during the years following the invention of printing.

"What is the central bank of Wikipedia?"

From the abstract:[7]

We analyze the influence and interactions of 60 largest world banks for 195 world countries using the reduced Google matrix algorithm for the English Wikipedia network with 5 416 537 articles. While the top asset rank positions are taken by the banks of China, with China Industrial and Commercial Bank of China at the first place, we show that the network influence is dominated by USA banks with Goldman Sachs being the central bank.

"Generating Wikipedia by Summarizing Long Sequences"

From the abstract:[8]

We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. ... We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles.

See also media coverage

"Computing controversy: Formal model and algorithms for detecting controversy on Wikipedia and in search queries"

From the abstract:[9]

... we propose a classification based method for automatic detection of controversial articles and categories in Wikipedia. Next, we demonstrate how to use the obtained results for the estimation of the controversy level of search queries. The proposed method can be incorporated into search engines as a component responsible for detection of queries related to controversial topics. The method is independent of the search engine’s retrieval and search results recommendation algorithms, and is therefore unaffected by a possible filter bubble. Our approach can be also applied in Wikipedia or other knowledge bases for supporting the detection of controversy and content maintenance.


  1. ^ Avieson, Bunty (2019-04-30). "Breaking news on Wikipedia: collaborating, collating and competing". First Monday. 24 (5). doi:10.5210/fm.v24i5.9530. ISSN 1396-0466.
  2. ^ Flavin, Michael; Hulova, Katerina (2018-11-23). "An inferior source? Quantitatively analysing the production and revision of five technology-enhanced learning-related terms on Wikipedia". Research in Learning Technology. 26. doi:10.25304/rlt.v26.2103. ISSN 2156-7077. S2CID 70095525. CC BY 4.0
  3. ^ Lerner, Juergen; Lomi, Alessandro (2019-01-08). The Network Structure of Successful Collaboration in Wikipedia. 52nd Annual Hawaii International Conference on System Sciences. pp. 2622–2631. hdl:10125/59700. ISBN 9780998133126.
  4. ^ Heinrich, Horst-Alfred; Gilowsky, Julia (2018). "Wie wird kommunikatives zu kulturellem Gedächtnis? Aushandlungsprozesse auf den Wikipedia-Diskussionsseiten am Beispiel der Weißen Rose". (Digitale) Medien und soziale Gedächtnisse. Soziales Gedächtnis, Erinnern und Vergessen – Memory Studies. Springer VS, Wiesbaden. pp. 143–167. doi:10.1007/978-3-658-19513-7_7. ISBN 9783658195120. Google Books
  5. ^ Lewoniewski, Włodzimierz; Härting, Ralf-Christian; Węcel, Krzysztof; Reichstein, Christopher; Abramowicz, Witold (2018). "Application of SEO Metrics to Determine the Quality of Wikipedia Articles and Their Sources". In Robertas Damaševičius; Giedrė Vasiljevienė (eds.). Information and Software Technologies. Communications in Computer and Information Science. Springer International Publishing. pp. 139–152. doi:10.1007/978-3-319-99972-2_11. ISBN 9783319999722.
  6. ^ Jara-Figueroa, C.; Yu, Amy Z.; Hidalgo, César A. (2019-02-20). "How the medium shapes the message: Printing and the rise of the arts and sciences". PLOS ONE. 14 (2): e0205771. Bibcode:2019PLoSO..1405771J. doi:10.1371/journal.pone.0205771. ISSN 1932-6203.
  7. ^ Demidov, Denis; Frahm, Klaus M.; Shepelyansky, Dima L. (2019-02-21). "What is the central bank of Wikipedia?". Physica A: Statistical Mechanics and Its Applications. 542. arXiv:1902.07920. doi:10.1016/j.physa.2019.123199. S2CID 67787806.
  8. ^ Liu, Peter J.; Saleh, Mohammad; Pot, Etienne; Goodrich, Ben; Sepassi, Ryan; Kaiser, Lukasz; Shazeer, Noam (2018-01-30). "Generating Wikipedia by Summarizing Long Sequences". arXiv:1801.10198 [cs.CL].
  9. ^ Zielinski, Kazimierz; Nielek, Radoslaw; Wierzbicki, Adam; Jatowt, Adam (2018-01-01). "Computing controversy: Formal model and algorithms for detecting controversy on Wikipedia and in search queries". Information Processing & Management. 54 (1): 14–36. doi:10.1016/j.ipm.2017.08.005. ISSN 0306-4573.

Discuss this story

This nonsense needs to be taken in the context of very successful pushes to try to come up with whole new levels of copyright or related craziness in Europe - see Julia Reda's leadership for information, or Directive on Copyright in the Digital Single Market for our version.

The idea that ordinary proles could speak directly to one another about events that affect their lives has become highly, highly controversial. Nonetheless, even in that awful legislation, Wikipedia was granted theoretical exemptions on account of its influence. Thus the desire of some for some kind of new prong of the assault. Every eyeball belongs to that single Master who owns the Company that owns All, and every deviation of those eyeballs from unending observance and meditation on his perfection is an inexcusable theft that can never be forgiven. Wnt (talk) 04:46, 31 May 2019 (UTC)

On the other hand, we do take expensive journalism work product which volunteers would ordinarily have no way to produce on their own, and strip it of the advertising intended to pay for it (in ways that Google generally does not but Microsoft and Apple certainly do -- but with compensation.) Journalists, their editors, and publishers aren't upset about that because they think you're a prole or that you and your peers shouldn't be talking among yourselves. They're upset about it because it cuts into their pay. Readers will prefer a Wikipedia article for breaking news not just because they know that dozens of sufficiently competent editors have already read through the commercial news and picked out all the most important parts for them, but also because they know they won't have to look at the ads that are supposed to pay for the work in the first place. I'm not saying it's legally or technically wrong in any way, but there is an unavoidable moral argument that we've been biting the hand that not only feeds us, but that was one of our strongest defenses against corruption and abuse of power. EllenCT (talk) 06:41, 31 May 2019 (UTC)
EllenCT, I think that it is an interesting point that is raised, but then we should ask ourselves: What should we do about it? I think that as it stands there is no clear solution to the problem because either we force ourselves to be out of date and inaccurate or we don't cover recent events. One solution would be to somehow pay for the use of the articles. I like this idea because it would benefit the journalists and encourage them to cooperate with Wikipedia. However, it would require a very large grant. StudiesWorld (talk) 10:46, 31 May 2019 (UTC)
Every so often I propose doing something that would work to keep journalists and other content creators gainfully employed, but nobody ever goes for it, because it's not about one guy controlling all the news and all the readers in the world. It's not my fault that copyright is an unfair system that can't work, which relied only on the physical difficulty of copying things to impose taxes on the act of readership via a tax farming system. But for the record, there is no reason why we could not enact some legislation whereby a certain percent on top of a person's annual income tax had to be paid via a funding mechanism in lieu of any copyright or patent royalties. The individual taxpayer would get to choose what organizations would disburse this funding in the form of grants to content creators those organizations desire. (In theory, individuals could choose their own recipients, but to prevent back-scratching arrangements the sum going to any one recipient would have to be kept very small, making this difficult) Because copyright is almost infinitely inefficient, a mechanism where authors get paid the same as they are now would be one where people have access to vastly more information, at no added cost; indeed, at a reduced cost because a lot of bean counting and copy protection and encryption and obfuscation would no longer be "needed". My suggestion is by no means the only way to do funding, but it is workable, whereas the idea of copyrighting facts and individual words is not workable and will not preserve the jobs of journalists: to the contrary, it will ensure that each and every remaining journalist is replaced by a stenographer as the "marketplace of ideas" is replaced by a marketplace of licenses to write about them. Wnt (talk) 22:16, 31 May 2019 (UTC)
@Wnt: I'd support that and it would be great to see the Foundation pushing it. Who do we even ask? EllenCT (talk) 06:55, 1 June 2019 (UTC)
The WMF would have a hard time "pushing" an idea like this (or political ideas of any kind) in a general sense. However, when people talk about what can be done to save journalism, or how Wikipedians displace paying jobs in the encyclopedia industry, it is applicable incidentally.
I should add that -- though the temptation to use them could readily become overwhelming -- it is possible to hybridize the idea with more traditional politics, i.e. by imposing funding directives by sector. For example, if a congress were worried that too many individual taxpayers would choose to put all their creativity funding toward hip-hop, they could order that the allocations for popular music (or even hip-hop per se, with all the unpleasant racial politics connotations such a specific restriction might carry) would be diminished by some factor or to a specific relative or absolute level, in order that the funds be redirected to comparably expand the allocations to cancer research. The extreme case of this is of course the current situation where the government simply funds NIH and NSF certain amounts and you don't get a choice of which sector or which agency you want your money to be applied toward. This system lacks the flexibility of a taxpayer-directed system, but it can be argued that in technical sectors the voter simply wouldn't know what to fund and would fall prey to organizations that market themselves (yet despite this impression and the overall disrepute of charities in the U.S., major independent funders like American Cancer Society are strikingly honest and effective in their funding choices). In music, obviously the taxpayer knows what he likes. News might be argued to be between these extremes since people get duped every day, yet are often reasonably canny and can find good sources despite the obvious conflicts of interests imposed from above by advertiser funding in the present system. Wnt (talk) 09:29, 1 June 2019 (UTC)
Wnt says "we force ourselves to be out of date and inaccurate or we don't cover recent events" as if not covering recent events was a Bad Thing. If we simply deleted everything regarding any event of any kind until three days had passed and announced this new policy on the main page and any articles affected along with a suggestion to go to Wikinews for anything late-breaking it would reduce the amount of conflict and rapidly mutating pages by 80%,[Citation Needed] would greatly reduce the number of errors in Wikipedia,[Citation Needed] and would force the "this just in!" editing addicts to quit their habit cold turkey.[Citation NOT Needed] We would simply train our readers to expect Wikipedia to be silent for three days and then to suddenly give them better information than available from any of the "late breaking news!" sources. What's so bad about an encyclopedia that is an encyclopedia instead of a twitter feed? --Guy Macon (talk) 19:05, 1 June 2019 (UTC)
Hmmmm, I don't know. Why do you suppose it might be better to have most of the people editing the encyclopedia write about something while they are reading it and following it and interested rather than after they have forgotten about it and moved on? More to the point though, with many stories the best source is the first source, followed by hundreds or thousands of echoes, newspapers running the same thing with less facts and less attribution. Often the longer you wait to research, the harder it is to find detailed and accurate information. Wnt (talk) 17:33, 2 June 2019 (UTC)
I disagree with EllenCT's assertion that we strip journalism of the advertising intended to pay for it (although I imagine this was stated as devil's advocate). How journalism (or any other media we cite) is paid for is not our concern. We have a fundamental right to be able to discuss current events without paying anyone a fee. Just as you have a right to discuss current events with the person sitting next to you without paying a fee. No one should be able to "own" the facts surrounding current events (or any events for that matter). This drum-beating about how journalism will die unless we give media companies special copyrights is both disingenuous and dangerous. Journalism isn't going anywhere (despite claims to the contrary for the past dozen years) and owning facts only gives corporations a tighter stranglehold on our ability to actually communicate with each other, which should be held as a fundamental human right more important than the survival of a particular economic model. Kaldari (talk) 15:18, 9 June 2019 (UTC)

"Generating Wikipedia by Summarizing Long Sequences"

Page 15 of [1] tells you everything you need to know about that paper. EllenCT (talk) 07:03, 31 May 2019 (UTC)

