The Signpost


Special report

Taking stock of the Good Article backlog

The GA Trophy awarded at the end of a Good Article Cup
Wugapodes is a two-time GA Cup participant and WikiCup finalist. Their academic work focuses on the linguistic impacts of group behavior.

Before an English Wikipedia article can achieve good article status (the entry grade among the higher-quality article rankings), it must undergo review by an uninvolved editor. However, the number of articles nominated for review at any given time has outstripped the number of available reviewers for almost as long as the good article nominations process has existed, and the resulting backlog of unreviewed articles has been a perennial concern. Nevertheless, the backlog at Good Article Nominations (GAN) reached its lowest point in two years on 2 July 2016. The reason was the third annual Good Article Cup, which ended on 30 June 2016; the 2016–17 GA Cup, its fourth iteration, began on 1 November and is ongoing. The GA Cup is the GA WikiProject's most successful backlog reduction initiative to date, but there is a problem that plagues this and all other backlog elimination drives: editor fatigue.

The backlog at GAN has been growing ever since the process was created, with fluctuations and trends along the way. If the GA Cup, or any elimination drive, is to succeed, it must at some point begin to treat the cause, not simply the symptom. While the GA Cup has done a remarkable job of reducing the backlog, long-term success requires understanding why the backlog exists. The cause appears to be editor fatigue: boom-and-bust reviewing periods in which the core group of reviewers tries to reduce the backlog and then tires out, causing the backlog to rebound. This is the chief benefit of the GA Cup: its format helps counteract the cycle of fatigue with a long-term motivational structure.

The GA Cup is a multi-round competition modeled on the older and broader-purpose WikiCup (which has run annually since 2007 and concluded this year on 31 October). Members of the GA WikiProject created the GA Cup as a way to encourage editors to review nominations and reduce the backlog through good-natured competition. Participants are awarded points for reviewing good article nominations, with more points being awarded the longer a nomination has languished in the queue. Each GA Cup sees a significant reduction in the number of nominations awaiting review. On this metric alone the GA Cup is a success; but counting raw articles awaiting review only gives insight into what happens while the GA Cup is running, ignoring the origin of the backlog and masking ways in which the GA Cup can be further improved.

The GA Cup's predecessors, backlog elimination drives, lasted only a month, while the GA Cup lasts four. While the time commitment alone can be a source of fatigue, the mismatch between the time taken to review and the ease of nomination can lead to an unmanageable workload. A good article review nominally takes seven days, so if reviews are closed more slowly than nominations are added, the backlog will not only increase, but the number of reviews handled by any given reviewer will balloon, causing them to burn out by the end of the competition. Well-known post-cup backlog spikes demonstrate the often temporary nature of GA Cup efforts.
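To make the arithmetic concrete, here is a minimal sketch of how a modest gap between the nomination rate and the review rate compounds over a four-month competition. Every figure in it is an illustrative assumption, not a measured GAN statistic.

```python
# Minimal sketch: how a small gap between nominations and reviews compounds.
# All figures below are illustrative assumptions, not real GAN data.

DAYS = 120                # roughly the length of a GA Cup
NOMINATIONS_PER_DAY = 10  # assumed "demand"
REVIEWS_PER_DAY = 8       # assumed collective reviewing capacity
ACTIVE_REVIEWERS = 40     # assumed pool of regular reviewers
backlog = 400             # assumed starting backlog

for _ in range(DAYS):
    backlog += NOMINATIONS_PER_DAY - REVIEWS_PER_DAY

print(f"Backlog after {DAYS} days: {backlog}")
print(f"Reviews per active reviewer: {REVIEWS_PER_DAY * DAYS / ACTIVE_REVIEWERS:.0f}")
```

Even with the review rate only two per day short of the nomination rate, the backlog still grows by over two hundred nominations while each regular reviewer shoulders a couple of dozen reviews, which is where the fatigue comes from.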

With proper information and planning, the GA Cup can begin to treat the cause of the backlog rather than the symptom and succeed in sustaining backlog reductions after its conclusion.


A history of the Good Article project

Good articles can be identified by a green plus symbol. The plus-minus motif was not the first suggested; other ideas included a thumbs up, check mark, or ribbon.

The Good Article project was created on 11 October 2005 "to identify good content that is not likely to become featured". The criteria were similar to those we have now:



At first, the project was largely a list of articles individual editors believed to be good: any editor could add an article to the list, and any other editor could remove it. This received significant pushback, with core templates {{GA}} and {{DelistedGA}} receiving nominations for deletion on 2 December 2005 as "label creep" and a suggestion that the then-guideline should be deleted as well. They were kept, but, after discussions, the GA process received a slight tweak: while editors could still freely add articles they did not write as GAs, those wishing to self-nominate their work were referred to a newly created good article nomination page.

While the first version of the Good Article page told editors to nominate all potential Good Articles at Wikipedia:Good article candidates (now Good Article Nominations), that requirement was removed 10 hours later. The current process was not adopted until a few months later. In March 2006 another suggestion was made:


The next day the GA page was updated to reflect this new assessment process, and the nominations procedure was extended to all nominations, not just self-nominations.

From there on the nomination page continued to grow. The first concerns over the backlog were raised in late 2006 and early 2007, when the nomination queue hovered around 140 unreviewed nominations. In May, the first backlog elimination drive was held, lasting three weeks. The drive saw a reduction in the backlog from 168 to just 77 articles. This did not last, however, with the backlog jumping back up to 124 a week later. The next backlog drive ran from 10 July to 14 August, with 406 reviews completed—but a net backlog reduction of just 50, leaving 73 articles still awaiting review. Another drive planned for September was canceled due to perceived editor fatigue. Backlog elimination drives have been held at irregular intervals ever since, the most recent in August 2016. These drives were "moderately successful", to quote a 2015 Signpost op-ed by Figureskatingfan:



With a looming backlog of more than 450 unreviewed articles by August 2014, a new solution was sought: the GA Cup. Figureskatingfan, who co-founded the cup with Dom497, writes of its creation:

I was in Washington, D.C., at the Wikipedia Workshop Facilitator Training in late August 2014. While I was there, I was communicating through Messenger with another editor, Dom497. We were discussing a long-standing challenge for WikiProject Good Articles—the traditionally long queue at GAN. Dom was a long-time member of the GA WikiProject. This impressive young man created several projects to encourage the reviewing of GAs, most of which I supported and participated in, but they all failed. I shared this dilemma with some of my fellow participants at the training, and in the course of the discussion, it occurred to me: Why not follow the example of the wildly successful and popular WikiCup, and create a tournament-based competition encouraging the review of GAs, but on a smaller scale, at least to start?

I was literally on the way to the airport on my way home, discussing the logistics of setting up such a competition with Dom. By the time I got home, we had set up a preliminary scoring system and Dom had created the pages necessary. We brought up our idea at the WikiProject, and most expressed their enthusiastic support. We recruited two more judges, and conducted our first competition beginning in October 2014.


— Figureskatingfan


A history of the backlog

The GAN backlog, 10 May 2007 to 25 June 2016.

Over the last nine years, the GAN backlog has grown by about three nominations per month on average—the solid blue line above. Backlog levels are almost never stable. Large trends cause the backlog to fluctuate above and below that regression line, and these trends in turn have their own fluctuations, with local peaks and valleys along an otherwise upward or downward trend. What causes these fluctuations? For the three declines after 2014, the answer is relatively simple: the GA Cup. But what about the earlier declines?
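For readers who want to reproduce this kind of trend line, the sketch below fits a simple linear regression to daily backlog counts. It assumes the history is available as a hypothetical CSV file named gan_backlog.csv with date and backlog columns; the file and column names are placeholders rather than a dataset published with this report.

```python
# Sketch: fit a linear trend to daily backlog counts.
# "gan_backlog.csv" and its column names are hypothetical placeholders.
import numpy as np
import pandas as pd

df = pd.read_csv("gan_backlog.csv", parse_dates=["date"]).sort_values("date")
days = (df["date"] - df["date"].iloc[0]).dt.days.to_numpy()

# Degree-1 polynomial fit gives the slope of the long-run trend in nominations per day.
slope_per_day, intercept = np.polyfit(days, df["backlog"].to_numpy(), deg=1)
print(f"Average growth: {slope_per_day * 30:.1f} nominations per month")
```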

The most obvious hypothesis is that the drops coincide with the backlog elimination drives, but this is not sufficient. While most backlog drives coincide with steep drops in the backlog, the ones that do are clustered towards the early years of GAN, before it was as popular as it is now. It is easier to make significant dents in the backlog when only a couple of nominations are coming in per day than when ten or more are. Indeed, the last three backlog drives had a marginal impact, if any. More obviously, not all drops in the backlog stem from backlog elimination drives. Take, for instance, the reduction in the backlog in mid-2008—a reduction of 100 nominations without any backlog drive taking place. Similar reductions occurred thrice in 2013. In fact, the opposite effect has also been seen: the two most recent backlog drives seemingly occurred during natural backlog reductions, and didn't accelerate things by much. If elimination drives are not, taken together, the sole cause at play, there must be some more fundamental cause that accounts for all the reductions seen.

A better explanation comes from the field of finance: the idea of support and resistance in stock prices. For a stock, there is a price that is hard to rise above—a line of resistance—and a price that it is hard to fall below—a line of support. These phenomena are caused by the behavior of investors. When a stock price rises above a certain point, investors sell, causing the price to fall; conversely, when the price falls to a certain point, investors buy, causing the price to rise.

Does this apply to good article reviews as well? By analogy, imagine GA reviewers as investors and the backlog as a stock price. When the backlog rises to a certain point, GA reviewers collectively think the backlog is too large and so begin reviewing at a higher pace to lower it—a line of resistance. When the backlog falls to a certain point, reviewers slow down their pace or get burned out, causing the backlog to grow—a line of support. This makes intuitive sense. The impetus behind most backlog elimination drives is a group of reviewers thinking the backlog has grown too large; the drives, then, are just a more organized example of reviewers picking up the pace.

If this hypothesis is correct, then backlog reduction initiatives should be held during the low tide, encouraging weary reviewers, rather than during the high, when they are more likely to review nominations anyway, initiatives notwithstanding. But how can we tell where these lines of support exist and when the backlog is likely to bounce back? Economists and investors have found the moving average to be a useful tool for describing the lines of support and resistance in stock prices, so perhaps it can be useful here. In the graph above, the dashed red line represents a 90-day simple moving average. It seems to capture the lines of support and resistance for the backlog well, as most local peaks tend to bounce off of it, while major trend changes pass through it.
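The moving average itself is straightforward to compute. The sketch below, which reuses the hypothetical gan_backlog.csv placeholder from the previous sketch, calculates a 90-day simple moving average and counts the days on which the backlog crosses it, which are candidate support or resistance events under this analogy.

```python
# Sketch: 90-day simple moving average of the backlog and its crossings.
# Reuses the hypothetical "gan_backlog.csv" placeholder from the sketch above.
import pandas as pd

df = pd.read_csv("gan_backlog.csv", parse_dates=["date"]).set_index("date")
df["sma_90"] = df["backlog"].rolling(window=90, min_periods=90).mean()

# A "crossing" is any day on which the backlog moves from one side of the
# moving average to the other (skipping the first 90 days, where no SMA exists).
above = df["backlog"] > df["sma_90"]
crossings = (above != above.shift()).iloc[90:]
print(f"Crossings of the 90-day moving average: {int(crossings.sum())}")
```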

An example of the utility of this theory can be seen in early 2009. The backlog began to fall naturally in January, but was about to hit a line of support that might have caused the upward trend to resume. However, a backlog drive took place in February, causing an even steeper decline in the backlog and pushing it past that line of support. Unfortunately, the full impact of this cannot be assessed, as the data for April to November 2009 were never recorded by the GA Bot.


The impact of the GA Cup

The backlog over the last three years.

After almost a year with no backlog drives in 2013, followed by two rather unsuccessful ones, the GA Cup was started. Over the past two years, three GA Cups have been run, all with robust participation and significant reductions in outstanding nominations. But is the Cup truly succeeding? To answer that question I looked at the daily rates of new nominations, closed nominations, nominations passed, and nominations failed during each of the GA Cups and compared them to the rates before and after the first GA Cup.

The presence of a reduction in the backlog is obvious: each cup correlates with a steep drop in the number of nominations, the most effective being the third GA Cup, which concluded on 30 June this year. The most recent GA Cup reduced the backlog by about two nominations per day, completing 92 more nominations than the first GA Cup—despite being significantly shorter. The third GA Cup was lauded as a success.

Yet in late April, the backlog reduction began to stagnate. The number of nominations added remained relatively stable, but the period coincided with a drop in the number of nominations being completed. In early May the backlog began to rise, crossing over the line of resistance in the process, before shrinking again towards the end of May and settling into a distinct downward trend by June.

Backlog during the third GA Cup with a 15-day simple moving average

Ultimately, the best way to conceptualize the GA review backlog is to borrow another concept from finance: a mismatch between the "supply" of reviewers and the "demand" for reviews. The number of nominations—the demand—is relatively consistent, at about 10 per day. There is a mild decrease in the rate of nominations—the daily rate falls by about one nomination every two years—but it is, all in all, relatively stable.

Measuring supply is more difficult. The change in the backlog equals the number of nominations added minus the number of reviews opened: if the average demand is 10 nominations per day and the average supply of reviews is 0, the backlog grows by 10 nominations each day; if the supply were 5, it would grow by 5. In other words, the average number of nominations minus the average number of reviews equals the average change in the backlog. Since the average change in the backlog (the slope of the linear regression) and the average number of nominations are both known, the average supply can easily be calculated. It turns out to be about six reviews per day. Taken together with the aforementioned demand, this gives a net increase in the backlog of about four nominations per day. And since this analysis includes the GA Cup periods, the backlog is actually increasing at an even higher rate whenever a Cup isn't active!
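The calculation itself is a one-liner; the sketch below simply restates the identity using the figures quoted in this paragraph, which are the estimates given above rather than freshly measured values.

```python
# Back-of-the-envelope supply estimate, using the figures quoted above.
avg_nominations_per_day = 10    # average "demand" quoted in the paragraph
avg_backlog_growth_per_day = 4  # average daily change in the backlog quoted above

# change in backlog = nominations added - reviews opened, therefore:
avg_reviews_per_day = avg_nominations_per_day - avg_backlog_growth_per_day
print(f"Estimated supply: about {avg_reviews_per_day} reviews per day")
```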

Backlog from the end of the Second GA Cup to the end of the Third GA Cup. The blue line indicates when the Third GA Cup was announced and the green line when it began.

The number of open reviews does not inspire much confidence either. Open reviews drop dramatically after each GA Cup, likely due to participant burnout. Interestingly, the number of open reviews also drops before each GA Cup, causing a counterproductive uptick in the backlog. In fact, the drop just before this year's cup coincided with the announcement of the competition's start date a month in advance, at a time when the number of reviews was increasing and the backlog was naturally starting to decline.

All told, these are not fatal flaws, as the GA Cup is succeeding despite them in other ways. Most obviously, the backlog has been decreasing during cups, and review quality does not seem to decline either. Comparing the five months before the first GA Cup with the four months during it, there is no significant difference between the pass rates before and during the Cup (t(504.97) = -1.788, p = 0.07). In fact, the pass rate may have decreased slightly, from 85% beforehand to 82% during the cup, and because the p-value is close to significance, the idea that GA Cup reviewers are more stringent may be worth examining further.
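For the statistically inclined, the comparison above is a two-sample (Welch's) t-test on per-review pass/fail outcomes. The sketch below shows how such a test can be run; the outcome lists are placeholders standing in for the real review data, so it will not reproduce the exact statistics quoted above.

```python
# Sketch: compare pass rates before and during a GA Cup with Welch's t-test.
# `before` and `during` are placeholder outcome lists (1 = pass, 0 = fail),
# not the actual GAN review data, so the numbers will differ from the article.
from scipy import stats

before = [1] * 85 + [0] * 15   # illustrative ~85% pass rate beforehand
during = [1] * 82 + [0] * 18   # illustrative ~82% pass rate during the Cup

t_stat, p_value = stats.ttest_ind(before, during, equal_var=False)  # Welch's test
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```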

This is not to say that there is no other way to examine review quality. Reasonable minds can disagree on how well this metric describes the quality of reviews, and concerns about review quality have been raised a number of times, but it is a sensible starting point for this analysis. We now know that the GA Cup does not lead to "drive-by" passes, and that any problems with unfit articles passing or fit articles failing occur at about the same rate as usual. Any solutions, then, can hopefully be general ones that improve the quality of all reviews, rather than measures specific to the GA Cup.

Conclusions

The GA Cup has been effective at encouraging editors to complete GA reviews. Its effect on the cause of the backlog, on the other hand, is less clear. Long-lasting backlog reductions require a nuanced approach: recruiting more reviewers, finding the correct timing, and giving proper encouragement. The GA Cup is arguably already successful at encouragement, but that does not mean the first two aspects cannot be improved as well.

The GA Cup has so far been run at times when reviewers were already increasing their efforts to reduce the backlog, and the announcement of the third GA Cup, for instance, caused those efforts to stagnate. By allowing these natural reductions to take place, and then holding the GA Cup when editors get burnt out, we can leverage the Cup's morale boost to push the backlog down even further.

Furthermore, while there was no good way to analyze how well the GA Cup recruits new reviewers, anecdotally it seems to do so. Bringing in new reviewers when the regulars are getting burnt out would reduce the backlog rebound in the short term, and may lead to an increase in the number of regular reviewers in the long term.


The organizers of the GA Cup understand that what is most needed is more reviews and more reviewers, and the GA Cup has done an admirable job of producing the former and recruiting the latter. The third GA Cup has been the most successful so far, and hopefully the next cup will surpass it on every metric.


Discuss this story


The results of gamification

The assertion that "We now know that the GA Cup does not lead to "drive-by" passes" has no basis in fact. It may be true that insufficient reviews occur at the same rate so the Cup doesn't encourage the practice but let's remember that the number of bad reviews is increasing at the same time reviews, generally, are increasing. Doing GA reviews sucks because it's actual work; I have more fun doing GOCE drives. I think the Good Article WikiProject is key to objectively improving content whereas GOCE is by-and-large just fixing word salad, which almost anyone can do. Efforts like the GA Cup are our collective means of putting these articles to stringent standards. GA status is often, though not always, a precursor to pursuit of A-class or FA. I remain concerned that these contests (of which I am a part currently) attract editors who are still unfamiliar with proper reviewing. It's demoralizing to see bad reviews done, especially when you're competing for points. WikiProject Articles for Creation had 8 drives since 2012. The last drive saw a lot of poorly done draft reviews and the results were so skewed that the WikiProject hasn't held another drive since 2014. I would hate to have these drives ruined by bad editing and we can only rely on the judges of the competition to stay alert to malfeasance. Chris Troutman (talk) 21:17, 26 November 2016 (UTC)[reply]

I'm confused by your point. If you're saying judges should stay vigilant of quick passes, I agree. That is not exclusive of the fact that there's no evidence to support a claim that quick passes are increasing, because indeed my argument is based in fact. To quote the paragraph prior to your quotation: Comparing five months before with the four months during the first GA Cup, there is no significant difference between the pass rates during or before the GA Cup ( t(504.97)=-1.788, p=0.07 ). In fact, may have actually decreased slightly, from 85% beforehand to 82% during the cup and because the p-value is close to significance. If more drive-by passes were occurring and more reviews were happening then we would see a higher rate of passage during the GA Cup. There is no significant difference. So either there was no change in the number of reviews or there was no change in the rate of drive-by passes. There clearly is an increase in the number of reviews, that's why the backlog decreased, so the only remaining reason for the result is that there was no change in the rate of passes.
This, admittedly, is an operational definition that doesn't get fully at the answer. It assumes that there was not a substantial amount of quick-passes outside of the Cup and that whatever the ratio of quick-pass to non-quick-pass was outside of the cup is the same as during the cup. You can dispute these assumptions, but based on all the data I have available to me there is no evidence that the GA Cup causes drive by passes. That claim is far from having no basis in fact, and is far more factual than anecdotal gripes. Wugapodes [thɔk] [ˈkan.ˌʧɻɪbz] 04:00, 27 November 2016 (UTC)[reply]

I don't really buy into the "gamification" of this (and various similar "challenges", "drives", etc.). Maybe it really does motivate a few people, but not everyone feels competitive about this stuff. The very nature of GA, FA, DYK, ITN, etc., as "merit badges" for editors to "earn", and the drama surrounding that, led to a rancorous ArbCom case recently, and cliquish behavior at FAC has generated further pointless psychodramatics. We really need to focus on the content and improving it for readers, not on the internal wikipolitics of labels, badges, and acceptance into politicized editorial camps.

It might be more practical and productive to have a 100-point (or whatever) scale and grade articles on it to a fixed and extensive set of criteria, with FA, GA, A-class, B, C, Start, and Stub all assigned as objectively as possible based on level of compliance with these criteria (and resolving the tension of exactly what A-Class is in this scheme, which seems to vary from "below GA" to "between GA and FA" to "FA+" to "totally unrelated to GA or FA"). There are a quite a number of GA, A and probably even FA quality articles that have no such assessments, because their principal editors just don't care about (or actively don't care for) the politics and entrenched personality conflicts of our article assessment processes as they presently stand. I, for one, will probably never attempt to promote an article to FA myself directly, because of the poisonous atmosphere at FAC (which is now an order of magnitude worse than it was when I first came to that conclusion several years ago). I guess the good news is I'll have more time for GA work. :-) The more that FA, and some of the more rigid and too-few-participants A-class processes, start to work like GA historically has, the better. If, as Kaldari suggests below, the opposite is happening, with GA sliding toward FA-style "our way or the highway" insularity, then you can expect negative results and declining participation.  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  09:48, 2 December 2016 (UTC)[reply]

@SMcCandlish: I like the theory of a 100-point scale, but I'm not sure how it would cope any better than the current system with the inherent uncertainty of what a "complete" article would look like, you need to know what the end result is before you can start having a percentage of it. For instance, the Tower of London is an FA, reflecting the huge amount of material that's been published about it, whereas its "sisters" Baynard's Castle and Montfichet's Castle no longer exist and in the latter case even the exact location is not entirely certain. I took both of them up to a pretty decent standard back in 2010, bringing together more information on them than was available in any one place on the web at the time and probably anywhere bar the Museum of London library, but they're still tiny compared to the Tower of London article. As it happens the former was GAN'd by someone else in 2013, and passed with minor copyedits; the latter just needs some minor work on the lede and formatting - that reminds, me, I need to dig out some photos I took ages ago... I wasn't that bothered about taking them through the GA process, and certainly have no interest in taking them to FA. Actually I think the view of GA as "a precursor to pursuit of A-class or FA" is part of the problem, to my mind we need a lot more emphasis on the good as opposed to perfection. After all, under the guidelines of many projects even a GA article is "Useful to nearly all readers, with no obvious problems; approaching (but not equalling) the quality of a professional encyclopedia", but one FA takes as much time as ??3-5?? GAs? In fact I'd argue the real focus should be more on avoiding bad articles than polishing the already pretty good ones. I've a little mini-project on the go where I've started on getting all the Category:Towns in Kent to a decentish minimum standard, that is still nowhere near GA but at least avoids the real horrors - my working definition is restructuring them with all the sections of WP:UKTOWNS and some text and a reference in each section, plus linking in any nearby articles. So going from say this to this. Not perfect, but it's gone from an incoherent mess to somewhere in the right direction. Perhaps we could encourage people to work on the weakest articles in a set by extending the idea of GA/FA topics to C-class and B-class topics?

Age of nominations

This is very interesting read! One thing that I was looking for here that didn't get discussed was the effect of the Cup on the age of nominations - are reviews now sitting in the queue for less time than they were before these competitions started? (For those who don't know, I'm a judge in the Cup, after having competed in it the first year.)--3family6 (Talk to me | See what I have done) 05:13, 27 November 2016 (UTC)[reply]

Reviewer burnout

I took part in the first GA cup. It was a new idea with a good purpose and I felt I wasn't pulling my weight by putting more GA nominations on the pile than reducing the backlog by reviewing. Towards the end of the cup, I got burned out and reduced my activity; I still do the odd review but not as many as I used to. I know some other GA stalwarts have also stopped reviews. How can we reach out to these people and get them to participate in reviews again? Ritchie333 (talk) (cont) 14:59, 27 November 2016 (UTC)[reply]

I quit doing GA reviews when it became a tedious regimented process. I'm glad we have high standards for quality, but I miss the relatively informal process that we used in the old days. Kaldari (talk) 21:42, 29 November 2016 (UTC)[reply]

Great article

I just wanted to say that this was a really interesting read. As someone who wasn't around for the early days of the project I'd love to learn more about how some of the other now well-established processes came to be. Sam Walton (talk) 16:13, 28 November 2016 (UTC)[reply]

I second that. Although I was an active editor back in 2006 I wasn't at all involved in the GA process so it's interesting to read a succinct history. WaggersTALK 08:47, 29 November 2016 (UTC)[reply]


