The Signpost

Recent research

Military history, cricket, and Australia targeted in Wikipedia articles' popularity vs. quality; how copyright damages economy

By Niklas Laxström, Federico Leva, Masssly, Gamaliel and Piotr Konieczny

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Popularity does not breed quality (and vice versa)

This paper[1] provides evidence that the quality of an article is not a simple function of its popularity, or, in the words of the authors, that there is "extensive misalignment between production and consumption" in peer communities such as Wikipedia. As the authors note, reader demand for some topics (e.g. LGBT topics or pages about countries) is poorly satisfied, whereas there is an over-abundance of quality on topics of comparatively little interest, such as military history.

Rank  Popular and underdeveloped topics  High-quality, not popular topics
1     Countries                          Cricket
2     Pop music                          Tropical cyclones
3     Internet                           Middle Ages
4     Comedy                             Politics
5     Technology                         Fungi
6     Religion                           Birds
7     Science fiction                    Military history
8     Rock music                         Ships
9     Psychology                         England
10    LGBT studies                       Australia

Illustration from Wedding, cited as an example for start-class articles which ought to be featured articles if quality ratings were perfectly aligned with popularity

The authors arrived at this conclusion by comparing page-view data for articles on the English, French, Russian, and Portuguese Wikipedias with their respective Wikipedia:Assessment (and similar) quality ratings. The authors note that at most 10% of Wikipedia articles have a quality rating that matches their popularity; in turn, over 50% of high-quality articles concern topics of relatively little demand (as measured by their page views). The authors estimate that about half of the page views on Wikipedia – billions each month – are directed towards articles that would be of better quality if popularity translated directly into quality. The authors identify 4,135 articles that are of high interest but poor quality, and suggest that the Wikipedia community may want to focus on improving such topics. Specific examples of the extremes include articles of poor quality (start class) with a high number of views, such as wedding (1k views each day) or cisgender (2.5k views each day). For examples of topics of high quality and little impact, one just needs to glance at a random topic in Wikipedia:Featured articles – the authors use the example of 10 Featured Articles about the members of the Australian cricket team in England in 1948 (itself a Good Article; 30 views per day). Interestingly, based on their study of WikiProjects, popularity and quality, the authors find that, contrary to some popular claims, pop culture topics are also among those that are underdeveloped. The authors also note that even within WikiProjects, labor is not efficiently organized: for example, within the topic of military history, there are numerous featured articles about individual naval ships, but topics of broader and more popular interest, such as NATO, are less well attended to.
In conclusion, the authors encourage the Wikipedia community to focus on such topics, and to recruit participants for improvement drives using tools such as User:SuggestBot.
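
The idea of quality being "perfectly aligned with popularity" can be illustrated with a small sketch. It is not the paper's actual procedure: it assumes a simplified six-step assessment scale, keeps the number of articles in each quality class fixed, and hands the highest classes to the most-viewed articles. The function names are hypothetical, and the "Example ship" row is invented; the other view counts are the approximate figures quoted above.

```python
from collections import Counter

# Simplified assessment scale, lowest to highest (an assumption for this sketch).
CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]


def expected_classes(articles):
    """Return {title: class under perfect alignment}.

    articles: list of (title, daily_views, actual_class) tuples.
    The class distribution stays as observed; the most-viewed articles
    receive the highest classes.
    """
    counts = Counter(cls for _, _, cls in articles)
    # Pool of classes, highest first, with the same counts as the real data.
    pool = [cls for cls in reversed(CLASSES) for _ in range(counts[cls])]
    ranked = sorted(articles, key=lambda a: a[1], reverse=True)
    return {title: cls for (title, _, _), cls in zip(ranked, pool)}


def misalignment(articles):
    """Return {title: expected step - actual step} on the scale.

    Positive values mark articles whose popularity outruns their quality;
    negative values mark high-quality articles with little demand.
    """
    expected = expected_classes(articles)
    step = {cls: i for i, cls in enumerate(CLASSES)}
    return {t: step[expected[t]] - step[actual] for t, _, actual in articles}


# Hypothetical sample; "Example ship" is invented for illustration.
sample = [
    ("Cisgender", 2500, "Start"),
    ("Wedding", 1000, "Start"),
    ("Australian cricket team in England in 1948", 30, "GA"),
    ("Example ship", 20, "FA"),
]

scores = misalignment(sample)
```

In this toy sample the two start-class articles come out with positive scores (candidates for improvement) and the two low-traffic articles with negative ones, which mirrors the two columns of the table above.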

Within a sample of US bestseller authors, what effect may the addition of this image to the article Michael Gold have had on its traffic?

Paul J. Heald and his coauthors at the University of Glasgow continued their extremely valuable studies of the public domain, publishing "The Valuation of Unprotected Works".[2] The study finds that "massive social harm was done by the most recent copyright term extension that has prevented millions of works from falling into the public domain since 1998" which "provides strong justification for the enactment of orphan works legislation."

Context

In recent years, authorities have started acknowledging possible errors in copyright legislation of the past, which would have been prevented by an evidence-based approach. Heald mentions the Hargreaves Report (2011), endorsed by the UK's IP office, but other examples can be found in World Intellectual Property Organization reports. This awakening corresponds to the work by researchers and think tanks to prove the importance of the public domain and certain harms of copyright.[supp 1]

The importance of evidence-based legislation can't be overstated, especially in the current process of EU copyright revision.

As Heald notes, past copyright policy has relied on a number of incorrect assumptions, in short:

Recent studies, some of which are mentioned in this paper (Pollock, Waldfogel, Heald), have instead found strong indicators that:

In short, it seems that "the public is better off when a work becomes freely available", insofar as copyright has been "robust enough to stimulate the creation of the work in the first place" and that a work "must remain available to the public after it falls into the public domain".

Findings

However, it is impossible to measure the value of knowledge acquired by society and, even considering the mere monetary value, it is impossible to measure transactions which did not happen. The English Wikipedia is used by the authors as a dataset because its history is open to inspection and its content is unencumbered by copyright payments, so every "transaction" is public.

In particular, the study measures what would be the cost of gratis images not being available for use on English Wikipedia articles, as a proxy of (i) the consumer surplus generated by those images, (ii) their private value, and (iii) their contribution to social welfare. If a positive value is found, it is proved that a more restrictive copyright would be harmful, and we can reasonably infer that reducing copyright restrictions would make society richer.

The calculation is done in three steps.

  1. 362 authors of New York Times bestsellers of 1895–1969 are considered. Their English Wikipedia articles are checked for the inclusion of portraits and the copyright status thereof; the increase in page views caused by the presence of the image is calculated. To control for other factors, authors are compared in "matched pairs" of similar popularity, as suggested by Amazon reviews or pageviews in mid-2009. Only the lowest-scoring months are considered, the general increase in pageviews is discounted, etc.
    • The first proxy considered is how much it would cost to buy the images from traditional image sellers, in the hypothetical (and absurd) case that article authors were allowed to. Such an image typically costs around US$100 even if it is in the public domain or identical to the one used by our articles.
    • The second proxy is how much the added pageviews are worth in terms of potential advertising revenue (0.53 cents/view, according to [1]).
  2. The values are then validated on a different dataset of several hundred composers and lyricists.
  3. The amounts are then expanded proportionally to all English Wikipedia articles by considering images and pageviews of a sample of 300 articles.
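
The arithmetic behind the two proxies can be sketched as follows. This is a simplification and not the paper's procedure: it ignores the additional controls mentioned above (lowest-scoring months, discounting the general traffic increase), reads the advertising figure as 0.53 cents, i.e. US$0.0053 per view (an assumption), and uses hypothetical function names and invented traffic numbers.

```python
# Figures quoted in the text; the per-view value is read as 0.53 cents (assumption).
AD_VALUE_PER_VIEW = 0.0053  # US$ per page view
IMAGE_PRICE = 100.0         # US$ per image from a commercial seller


def pageview_lift(views_with_image, views_without_image):
    """Relative pageview difference within one matched pair of articles."""
    return (views_with_image - views_without_image) / views_without_image


def ad_revenue_proxy(annual_views, lift):
    """Yearly advertising value of the views attributed to the image.

    If traffic with the image is (1 + lift) times the baseline, the extra
    views are annual_views * lift / (1 + lift).
    """
    extra_views = annual_views * lift / (1 + lift)
    return extra_views * AD_VALUE_PER_VIEW


def purchase_cost_proxy(n_images):
    """Hypothetical cost of licensing the same images commercially."""
    return n_images * IMAGE_PRICE


# Invented example: an illustrated article with 117,000 views a year,
# matched against an unillustrated one with 100,000 views.
lift = pageview_lift(117_000, 100_000)
value = ad_revenue_proxy(117_000, lift)
cost = purchase_cost_proxy(3)
```

With these invented numbers the lift equals 17%, one of the increases reported in the study; scaling such per-article values to all of English Wikipedia is the third step of the list above.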

Clearly, the number of inferences is great, but the authors believe the findings to be robust. The pageview increase, depending on the method, was 6%, 17% or 19%, and at any rate positive. The authors with the most images were those who died before 1880, an outcome which has no possible technological reason nor any welfare justification: it's clearly a distortion produced by copyright.

For those fond of price tags, the English Wikipedia images were estimated to be worth about $30,000/year for those 362 writers, or about $30m in hypothetical advertising revenue for English Wikipedia, or $200m–230m in hypothetical costs of image purchase.

At any rate, this reviewer thinks that the positive impact of the lack of copyright royalties is proven and confirms the authors' thesis. It is quite challenging to extend the finding to the whole English Wikipedia, all Wikimedia projects, the entire free knowledge landscape and finally the overall cultural works market; and even more fragile to put a price tag on it. However, this kind of one-number communication device is widely used to explain the impact of legislation, and the numbers traditionally used by legislators are far more fragile than this. Moreover, the study makes it possible to prove a positive impact on important literary authors and their life, i.e. their reputation, which is supposed to be the aim of copyright laws, while financial transactions are only a means.

Methodological nitpicks

There are several possible observations to be made about details of the study.

Briefly

Other recent publications

A list of other recent publications that could not be covered in time for this issue – contributions are always welcome for reviewing or summarizing newly published research.


References

  1. ^ Morten Warncke-Wang; Vivek Ranjan; Loren Terveen & Brent Hecht (2015). "Misalignment Between Supply and Demand of Quality Content in Peer Production Communities" (PDF). Pre-print PDF, to appear in the Proceedings of the 9th International AAAI Conference on Web and Social Media (ICWSM).
  2. ^ Heald, Paul J. and Erickson, Kris and Kretschmer, Martin, "The Valuation of Unprotected Works: A Case Study of Public Domain Photographs on Wikipedia" (February 4, 2015). Available at SSRN: http://ssrn.com/abstract=2560572 or http://dx.doi.org/10.2139/ssrn.2560572
  3. ^ Hingu, Dharmendra; Shah, Deep; Udmale, Sandeep S. (January 2015). "Automatic text summarization of Wikipedia articles". 2015 International Conference on Communication, Information Computing Technology (ICCICT). doi:10.1109/ICCICT.2015.7045732.
  4. ^ Claire, Charron (2014). "Analysing Trends Between US Google Searches and English Wikipedia Page Edits" (PDF). Retrieved 26 April 2015.
  5. ^ Harwood, George; Walker, Evangeline (2015). "How Much of the Amazon Would it Take to Print the Internet?". Journal of Interdisciplinary Science Topics. 4. Centre for Interdisciplinary Science, University of Leicester.
  6. ^ Clément, Maxime; Guitton, Matthieu J. (September 2015). "Interacting with bots online: Users' reactions to actions of automated programs in Wikipedia". Computers in Human Behavior. 50: 66–75. doi:10.1016/j.chb.2015.03.078. ISSN 0747-5632.
  7. ^ Davoust, Alan; Alexander Craig; Babak Esfandiari; Vincent Kazmierski (2014-10-01). "P2Pedia: a peer-to-peer wiki for decentralized collaboration". Concurrency and Computation: Practice and Experience. 27 (11): 2778–2795. doi:10.1002/cpe.3420. ISSN 1532-0634. S2CID 35114840.
  8. ^ Davoust, Alan; Hala Skaf-Molli; Pascal Molli; Babak Esfandiari; Khaled Aslan (2014-11-01). "Distributed wikis: a survey" (PDF). Concurrency and Computation: Practice and Experience. 27 (11). doi:10.1002/cpe.3439. ISSN 1532-0634. S2CID 45142475.
  9. ^ Benjamín Machín Serna: "Deteccion de Especulaciones utilizando Active Learning". Student thesis, Universidad de la República – Uruguay, 2013 (PDF)
Supplementary references and notes:
  1. ^ The most important of these initiatives is probably the 2009 Public Domain Manifesto. Some examples in the context of orphan works: m:Italian cultural heritage on the Wikimedia projects#Advocating for the public domain bibliography commented in an Italian paper by this reviewer. http://arxiv.org/abs/1411.6675

Discuss this story

I haven't read the first paper yet, but I think two factors might explain some of it. Perhaps editors feel that it is easier, more manageable, and less intimidating to tackle a smaller-scale subject, such as the SS Minnow, as opposed to an article covering an extremely broad topic like the entire US Navy. Also, perhaps editors mistakenly assume that these broad topics are already well covered by the encyclopedia. Gamaliel (talk) 02:47, 1 May 2015 (UTC)
  • Your first factor is by far the most important. This problem has been discussed many times on WP, perhaps most extensively in this huge thread at the FAC talk page in 2011, and this same answer comes up again and again. Covering a very broad topic comprehensively is orders of magnitude more difficult than covering a small one. In writing articles on ancient Egypt, I've covered a couple of small topics and a couple of fairly broad ones. There is a major difference in difficulty between the two, even though I'm drawing my information from a fairly small and insular field of scholarship dealing with a single ancient culture. The amount of work that would be involved in thoroughly researching a universal topic like mythology, house, or brain boggles the mind. A. Parrot (talk) 04:40, 1 May 2015 (UTC)
  • I'd agree; even in well-developed but niche fields the same result plays out- video games has 1100 GAs, FAs, and FLs, but can't get basic articles about the history of the subject to an acceptable quality despite extensive appeals. It's 1/100th of the work to write a good article about your local lake than about the sea. --PresN 04:59, 1 May 2015 (UTC)
It's one of those things we contributors have passively known for a long time, quite honestly I'm always a little amused when research rediscovers the fact. And TCO's report indeed deserved better. ResMar 06:53, 1 May 2015 (UTC)
TCO's study was spot-on, and I think people recognized that to an extent at the time, but a lot of people took issue with his... tone, and word choice. FAC has some people with egos, and several people were already annoyed at him at the time for long-winded rambles without him labeling editors with words they found pejorative. --PresN 17:21, 1 May 2015 (UTC)
  • @The ed17: I've only skimmed User:TCO's study so far, but it seems spot on. I'd love to hear more about the objections, specifically any of substance. I assume there was a lot of knee-jerk objections to the idea that people's preferred topics were "unimportant", relatively speaking? Gamaliel (talk) 17:10, 1 May 2015 (UTC)
  • @Gamaliel: The initial discussion was here, though references to it pop up for months afterwards in the archives. Basically, TCO had already angered quite a few people with diatribes against the way the FAC process was run that people interpreted as attacks against them, and they really, really didn't like the bit in the middle where he split the FAC nominators into 4 groups (via a graph) labelled "Champions", "Battleships", "Dabblers", and "Star Collectors", based on how many FAs they had crossed with the average monthly page views of those FACs. Given how many FAC regulars write FAs on "niche" topics... And yeah, TCO was pretty clear that he felt that those people were wasting their time and should be focusing more on high-pageview topics. --PresN 17:31, 1 May 2015 (UTC)
I must say I had a lot of sympathy for those complaints, even though I got labeled a "champion" in that study. Slapping those labels on people based on the FA nominations from a small stretch of time was pretty unfair. User:Ealdgyth got labelled a "star collector", but what's she done since then? William the Conqueror, Norman conquest of England, and Middle Ages! The push for more coverage of core topics may have motivated her – you'd have to ask her – but maybe she just felt finally prepared to tackle those colossal topics. I know that my desire and preparation to write on the topic are pretty much the only things that govern when I write what I write. And there is something to be said for the coverage of obscure topics; they may have only obscure sources that are hidden away in inaccessible libraries, and if one Wikipedian lays hand on those sources and writes an article based on them, the information becomes accessible worldwide at a stroke. Finally, I don't remember TCO offering very specific ideas for how to address the problem, and he got embroiled in an even bigger wikipolitical blowup at the FA project in early 2012, so it was easy to write him off as a disruptive noisemaker rather than somebody offering constructive solutions. I'm not saying he was wrong, just that he could have gone about making his points better than he did. A. Parrot (talk) 18:51, 1 May 2015 (UTC)
The labels were pretty unfairly applied- I got Battleship, because I had a video game FA with a high pageview count dragging me up, but I've also written one that gets ~1500/month, and GAs that get single-digits per day. The takeaway that we incentivize small, niche articles over large, difficult articles was and is true, but the editor analysis wasn't very helpful. --PresN 18:59, 1 May 2015 (UTC)
I can't say that TCO's "study" (I say that in quotes as the data was not a problem, and actually interesting, but the labeling of people with pejorative labels made it pretty clear that the goal of the study was to get people's goat) had much to do with my spate of editing "large scale" topics - it was much more a combination of me having time for wikipedia and being involved in the wikicup where there was a significant multiplier for articles that had lots of interwiki links. I've not managed it again because my time for wikipedia has been much more limited. And, to be honest, the aftermath of sockpuppetry and hell I took for bringing Middle Ages up to a higher standard isn't exactly a motivator for doing such a topic again. (It's there in the archives of the talk page... ). I've always thought, along with Iridescent, that obscure topics are where Wikipedia shines - as topics such as Urse d'Abetot, William of Wrotham, or Roger Norreis are often the best coverage of the topic available outside of some obscure specialist publications. It's easy to find information on United States in lots of places - but Monroe Edwards? That's a different story. I suspect most of the people reading our "high level" articles are not actually reading them but just looking for a quick fact or two. Ealdgyth - Talk 13:30, 6 May 2015 (UTC)
That's pretty much what I thought, but I didn't want to talk about your motivations when I wasn't sure. You're probably right about how most high-traffic articles are used, but I'd argue there are some articles on subjects that are broad and attract people – but it's not easy to find good information about them because the sources that analyze them in depth are specialized and scholarly. Nearly everything in my chosen topic area of ancient Egyptian religion, with a few possible exceptions like ancient Egyptian burial customs, fits that description. Take ancient Egyptian deities#Characteristics. People may know several fragments of Egyptian myths, but they don't know that the gods are more like symbols than like actual characters, that Egyptian myths are more like metaphors than like legends, or that Egyptian gods are immortal because they die over and over. My huge public library only contains one or two books that discuss those kinds of things in reasonable depth. Lots of ancient religions are like that. There must be lots of other subjects out there that I'm less familiar with that are both popular and difficult to get good information on. A. Parrot (talk) 00:08, 7 May 2015 (UTC)

So, Comedy and Science Fiction topics are underdeveloped, while Politics and Birds are High-quality ... and this is a problem? Curly Turkey ¡gobble! 04:36, 1 May 2015 (UTC)

Aside from the self-interest (gobble) angle, I agree. There's lots of great articles on the US Navy already, like the US Navy's. I see the Wiki being far, far more important for less covered topics, where there are few places this information can be collected and presented, for free. Is it a problem that the US Navy article could be better, or super-fantastic that the article on the Ferranti Sirius exists at all? I think the latter, and that people are missing the forest for the trees. Maury Markowitz (talk) 11:24, 1 May 2015 (UTC)
Exactly – it's the long tail nature of Wikipedia that brought me here in the first place, and that kept me coming back for years before I started contributing seriously. And it's not like the "popular" articles don't get any love – they get knocked out at a slow pace, but people do get to them. Curly Turkey ¡gobble! 11:36, 1 May 2015 (UTC)
And pop culture will generally always be easier to write good articles for, given generally easy to find sources and a higher number of potential contributors interested in them, so I'm never that worried (downside: pop culture attracts more trolls, vandals, and bad edits.) I've always disliked the "eww we have articles on X show but not subject Y" because it's really trying to make information into "highbrow" or "lowbrow" and devalue the work of contributors. We should always be encouraging quality edits, whether you only care about Sponge Bob or whether you love obscure dinosaur taxonomy. The readers who are looking for that information are going to appreciate the effort either way. Der Wohltemperierte Fuchs(talk) 13:50, 1 May 2015 (UTC)
I think there could be a lot of value in finding ways to help editors tackle the larger topics. It took me 5 years to write Texas Revolution - much easier to write articles on the individual battles and people involved. I agree with Maury Markowitz, though, that part of the appeal of Wikipedia is that it contains information on topics that is not found anywhere else online. We're bringing light to topics that might be underrepresented, and that is powerful. We need to find a good balance between those competing principles - it's great that the studies are pointing out the imbalance, but we need suggestions on ways to improve the first part, rather than decrease the second. Karanacs (talk) 14:10, 1 May 2015 (UTC)

One of those cricket FAs in the topic mentioned is Donald Bradman. That isn't so very unpopular - it typically gets 500-1000 views a day, which ain't bad for an article about a sportsman who retired 50 years ago. --Dweller (talk) 09:23, 5 May 2015 (UTC)

That is right and, given that cricket is the world's second most popular sport after football, it hardly belongs in a column headed "unpopular". I think the authors of the survey need to apply some real world thinking about the typical WP editor. This person is not a professional and so cannot be regimented into developing articles which they believe to be important. Instead, he uses his gifted amateurism to develop those articles in which he is interested. His involvement in a project is not based on any desire to promote the project, or to work in some organised fashion within it, but rather on the project as a forum in which to share views with and, sometimes, assist other editors with a mutual interest. The articles about the 1948 Australian cricket team, mentioned above, were a collaborative effort which I would not expect to see repeated often if at all. Obviously, we would all like to see quality and, yes, there are far too many stubs and starts but, given the volume of articles already started and the volume of potential articles in what is after all an encyclopaedia, what can you realistically expect? Jack | talk page 10:50, 5 May 2015 (UTC)

Hi everyone, and apologies for being late to the party! In case you don't know, I'm the first author on the paper about popularity/quality. Thank you all for a very interesting discussion, I have jotted down notes from it once already, and will re-read it and write down more notes. The links to previous discussions along these lines are also very helpful, although I haven't yet had the time to read all of them (some of them are quite large). Let me comment on a few specific things, before I go dish out actual thanks to everyone. I'll be adding this talk page to my watchlist in case there are follow-up questions, and I welcome questions or comments on my talk page as well, of course, and I can be emailed if you want to reach me off-wiki.

Gamaliel brings up an important point with regards to why these general subjects don't have FAs (size of the topic), and Karanacs' work on Texas Revolution is a good example (massive kudos for that effort!) We think along the same lines in the paper, although perhaps not as clearly. Figuring out why something occurs was outside the scope of this paper (it's analytical, we try to describe what the world looks like, so to speak), but as I continue my research I am interested in building tools to support contributors who are interested in working on these types of articles, and then those types of issues are of course very important.

Maury Markowitz and Curly Turkey mentioned the long tail, and Jack mentioned contributors choosing from self-interest. The latter is part of our motivation for studying this and something we point to several times in the paper, we wanted to know more about how that type of work selection affects systems like Wikipedia. When it comes to the long tail, it's typically not a "problem" in the popularity context. In all four languages we studied the majority of articles are stub/start quality and they do not get a lot of views, so there is no issue there. It's also clear that because Wikipedia's contributors are volunteers, they're free to leave, and therefore a central decision process on what to work on is unlikely to happen (we discuss this in the paper). Yet, I'm thinking that it would be great if we could figure out a way to serve high-quality content to a larger portion of Wikipedia's audience, which as Karanacs pointed out doesn't mean I'd want to decrease other parts.

Lastly, a technical detail: cricket is, as Dweller and Jack point out, not an unpopular topic. In our paper we were interested in understanding what topics are in the two extremes: highly-popular non-FAs, and FAs that aren't particularly popular. In the latter group, the relative risk of encountering an article from WikiProject Cricket is very high, which is why that project made our list. In other words, we didn't try to define the entirety of topics as popular/not-popular, we instead looked at specific subsets of articles to understand more about them.

Thanks again for the comments, everyone, and please do ask if you have questions! Regards, Nettrom (talk) 22:33, 5 May 2015 (UTC)

I think one of the single best ways to get more people interested in the topics with bigger scope is to provide a better way for editors to find willing collaborators. Texas Revolution would still be a miserable shell if I hadn't been specifically invited by another editor to help with the article - after the History Channel approached the WMF to see if there was any chance the article could be featured. (The deadline and external motivation was also helpful.) Collaborations make the larger articles soooo much easier to write. As it stands, the only way to find them is to either know that someone is interested in a topic and approach them directly, post on a Wikiproject (largely defunct) or article talk page, or randomly run across kindred spirits (as has happened to me too). Karanacs (talk) 00:27, 6 May 2015 (UTC)
I've worked on some FAs that were extremely popular, and some that were quite obscure. For me, working on the obscure ones has been a far more enjoyable process. It's not because the popular articles are intimidating (although the idea of bringing United States to FA, for example, does seem overwhelming) but because they attract more editors' attention, which forces me to collaborate. Sometimes, the collaboration is productive and enjoyable -- such as when Wehwalt and I worked on James A. Garfield recently. But when the other editors are POV-pushers, edit-warriors, and talk-page-filibusterers, the process quickly becomes unpleasant. Take a look at Talk:Thomas Jefferson, for example. A nightmare. That article should have been promoted to FA years ago. There are plenty of reliable sources and plenty of interested editors. But in a volunteer project, there's only so much aggravation an editor is willing to put up with before walking away. That, I think, is one of the biggest impediments to getting vital articles to FA or GA. I'm not sure what the solution is. --Coemgenus (talk) 12:47, 6 May 2015 (UTC)
We have no real, binding means of resolving content disputes. An RfC may hold for a while, but a new one can be opened up, so the arguments never stop. And even when a few editors make a concerted effort to improve an article, their improvements may not stick because of continued interference (see Ealdgyth's comments above about the disputes at Talk:Middle Ages). Wikipedians generally adhere to the principles that no one owns anything and that articles can always be improved. As a result, Wikipedia provides no way to stop the barrage of complaints. Of course no article is perfect, but what too many Wikipedians won't admit is that once an article's been improved to a certain level, efforts to change the article are more likely to be detrimental (by pushing ill-informed or biased ideas, or simply by aggravating the editors who actually understand the subject) than they are to be helpful. Most of the time I've been lucky enough to escape this phenomenon, but I've seen it happen over and over to other people. A. Parrot (talk) 00:08, 7 May 2015 (UTC)
Thanks again for great comments! Based on the previous threads, I figured that collaboration would be a key part of working on some of these broader topic articles, and learning about your experiences (particularly with some links to discussions so I can go check them out for myself) is very helpful. I'll definitely be keeping this in mind as I continue my research in this area. Cheers, Nettrom (talk) 16:26, 10 May 2015 (UTC)
See also this discussion on Twitter with another one of the paper's authors. Regards, Tbayer (WMF) (talk) 05:16, 17 July 2015 (UTC)



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0