Article Feedback tool faces community resistance

Article feedback, at least through talk pages, has been a part of Wikipedia since its inception in 2001. The use of these pages by new editors, though, has typically been limited at best.

As part of the Wikimedia Foundation's (WMF) Public Policy Initiative, a specialized form of article feedback was developed and added to a selected set of English Wikipedia articles in September 2010 (see Signpost coverage). Over the next several months, the tool went through several iterations until version four was deployed to every English Wikipedia article in July 2011. This iteration allowed readers to rate articles from one to five stars in four categories: trustworthiness, objectivity, completeness, and how well they were written.

An early image of version five, showing the possibility of entering feedback via a text box.

In December 2011, the WMF began rolling out version five (also known as AFTv5) to 10% of articles, aiming for full deployment in the first quarter of this year. Version five added an area where readers could give written feedback on articles. The feedback is directed to a centralized page with various options, including viewing feedback for specific articles or only for articles on your watchlist. Editors can feature a post, mark the concern as resolved, hide the post or request oversight, up- or down-vote a post, or flag the comment as abuse.

A request for comment (RfC) on the project was opened by MZMcBride in mid-January 2013. His impetus, as described in a Signpost op-ed in August 2012, was that the extension was "deployed without anti-abuse mechanisms", leading to the feedback area becoming a "safe haven for spam and other useless noise." In addition, he told the Signpost:

“I've been seeing a dichotomy between tools that help editors reduce backlogs and tools that create new backlogs. For tools that create new backlogs, I think there must be a clear demonstration from the community (or whoever is expected to work on these backlogs) that they're interested in creating these piles of work. In the case of AFTv5, there hasn't been such a demonstration, as far as I'm aware. Compare [AFTv5] to tools such as Page Curation that help editors reduce the backlog of unpatrolled new pages. In that case, the tool is helping, not hurting, and there's sufficient community consensus that we want new articles.”

In his view, the feedback tool should be used only on an opt-in basis, so that editors with an interest in the article (for example, someone who wrote the piece and wishes to solicit feedback on how to improve it further) will actually respond to the feedback. He believes the new backlogs are a burden, and that deploying the tool to the entire site would make the issue worse. This has a strong basis in fact: according to the WMF, readers submitted an average of 4100 posts per day, of which fewer than 10% were moderated, although these figures have fallen since the blog post that reported them. When scaled to the full site, the WMF expected that over 900,000 posts per month, or over 30,000 a day, would come in via the feedback tool: more in a single month than all of the feedback received to date put together.

A more radical viewpoint was put forward by GregJackP, who simply stated that "the tool is useless" and that the community "should eliminate the feature." He expanded on his views to the Signpost, saying that he believes most feedback is blank, "a general statement of dissatisfaction or satisfaction, or just garbage/spam." In GregJackP's assessment, the positive feedback is far outweighed by the "garbage" that editors must spend time sifting through to find it, and even the positive feedback is typically "just a question that you have to research and determine if it [is] something that can even be found."

Contrary to this view, editors like Mike Cline believe that the tool is becoming a major source of data for how the public views Wikipedia. Tom Morris eloquently stated:

“Shutting down the article feedback tool rather than improving it is a bad strategy. We do need better tools for churning through AFT5 responses and patrolling them. We need something like Huggle or STiki to do basic triage on the feedback we get, to remove libel and the "OMG I LOVE JUSTIN BIEBER" type things. The rest, though, those are telling us about potentially fixable issues with Wikipedia. If a reader, in good faith, wishes to give us feedback about an article, we should listen. We might set the feedback to one side because we aren't the sort of editors who can necessarily do anything about it. But if we stop listening to readers who have information that can improve the article, what's the damn point?”

Oliver Keyes, the community liaison for the article feedback program, acknowledged the low level of moderation to the Signpost: "Are there sufficient resources to moderate and respond to all of the feedback? The honest answer is 'probably not'." However, he then related the issue to Wikipedia as a whole: "I don't see this as a problem: we're a wiki. Always have been, always will be. Edits will need oversighting or deleting, bad edits will slip through the cracks, and we accept that because it's necessary to produce the good things that an open system gives us. I see no reason not to take the same attitude with feedback."

Keyes told the Signpost that between 30 and 60 percent of all feedback was rated by editors as 'useful'. That finding is backed up by the fourth quarter report from the article feedback team, which found that 40% of a random sample taken from February through April was judged helpful by at least two editors. In addition, he says that the WMF communicated its goals for the program through 17 different office hours on IRC (held at different times to target different regions of the world), mailing lists, and the village pump, in addition to the project talk page and a regular newsletter. The latter two alone reached at least 220 people, and probably more, far more than any typical Wikipedia discussion.

Still, the current request for comment has a large majority in favor of GregJackP's comment, more than double the second-most supported view (MZMcBride's) at the time of writing. The RfC will remain open until February 21.

In brief

Wikimania scholarships: Applications for scholarships to Wikimania 2013 in Hong Kong are now being accepted. Both full and partial scholarships are available—covering airfare, lodging, and registration; and up to half of the estimated airfare, respectively. Applicants will be rated on their Wikimedia activity (both on- and off-wiki), their open-source activity more broadly, their interest in both Wikimania and the Wikimedia movement, and their grasp of English. Applications will be accepted until 23:59 UTC on 22 February.

Chapters association: The Signpost reported last week on the problems with the proposed name of the planned association ("Wikimedia Chapters Association"), since the use of the name Wikimedia was inconsistent with the Wikimedia Foundation's trademark policy. On February 5, the WMF's Board of Trustees published a letter setting out its position towards the organization. It states, in part, that "Our reservations about the Chapters Association are serious, and we have difficulty envisioning circumstances in which the Wikimedia Foundation would be able to recognize it."

Ann Arbor edit-a-thon: The newest Wikipedian-in-Residence, Michael Barera (see the Signpost's coverage last week), along with the Gerald R. Ford Presidential Library and the Michigan Wikipedians, will be hosting an edit-a-thon at the presidential library, with the goals of assisting new editors and creating or improving Wikipedia's coverage of Gerald Ford, the 38th President of the United States.

Guided tours: As announced on the Wikimedia Blog, the Editor Engagement Experiments team has built and launched a new guided tour system for new users.

Individual Engagement Grants: Applications for IEGs, the new WMF grant scheme, are due by February 15 and can be reviewed on Meta.

Ombudsman Commission: The appointments to the Ombudsman Commission, the body dealing with WMF privacy policy complaints, have been announced. Three editors (FloNight, Sir48, and Thogo) will return to the commission, while four editors (Deskana, Erzbischof, Huji, and Levg) will be joining the commission for the first time.

Steward election: The annual election of stewards, who have complete access on all WMF wikis to deal with transproject vandalism, among other matters, will open for voting on February 8.

English Wikipedia

Administrator proposals: The Signpost welcomes the newest administrator, Jason Quinn, who passed with 138 in support to 29 opposed. Three requests for adminship remain open, all with over 90% support as of publishing time.
Adminship reform: The second round of the 2013 request for comments on the request for adminship process has started.
Star Trek: The rather contentious debate over the capitalization of Star Trek (I/i)nto Darkness has ended in favor of a capital "I". See this week's "In the media."

4 February 2013

Discuss this story

Article Feedback Tool: I've said this before many, many times, and I'll say it again here. The Foundation's passion for stats and feedback does not always contribute to the improvement of Wikipedia and its management by the volunteer community. WMF projects thrust upon the Wiki have required massive community incentive to carry out the cleanups when they misfire, and reasonable solutions for improvement in quality of new articles required by community consensus have been summarily rejected by the Foundation. While a truly excellent tool in the hands of the right users, NewPagesFeed/CurationTool does not address these issues and has not improved the quantity and quality of new-page patrolling. AfT creates more work for this community than the net useful information that it is designed to produce. Someone recently stated words to the effect that the Foundation's answer to the community's claim that a car (project) is broken, is 'Keep pushing'. The only real solution is to deploy Foundation funds and resources to re-launch development of the Article Creation Workflow as a proper landing page for new users/page creators. Instead of simply wanting quantity instead of quality, the Foundation would probably rejoice at the result which would greatly reduce the burdens and backlogs in such areas as Articles for Creation, Deletions and AfD, largely resolve the issues surrounding the work of admins, and their appointment at WP:RfA, and reduce the endemic hat-collecting of minor rights. Meta areas, including WP:NPP, WP:AfC, WP:AfD, and possibly also the AfT, are a magnet to inexperienced users who cannot, or prefer not to expand or create content. Kudpung กุดผึ้ง (talk) 02:15, 6 February 2013 (UTC)
I'm sorry that you don't feel statistics and hard data have a role in helping the volunteer community with its workload, but I must confess to being bemused by how ACTRIAL or the problem(s) with patrolling incoming content has anything to do with AFT5 (or how AFT5 can be creating work for the community when we've said 'if you guys want to turn it off, we'll turn it off' and people seem to be heading in that direction). The Foundation is not looking at quantity instead of quality; it's looking to raise the number of people who can help with maintenance tasks. And yes, sometimes this involves not only training but also making the software easier, as we did with Page Curation, or pointing people towards those tasks that need to be done. The vast majority of users do not engage in meta areas, which is why it would surprise me to find that a majority or substantial chunk of inexperienced users did; to resort to statistics for a moment, I ran a quick database query against the patrolling tables. In the last 30 days, there have been 51 patrollers with fewer than 500 edits - that's 14 percent of the patrollers overall. They are responsible for 484 patrols, which is...6 percent of patrols. If they were doing that terrible a job, presumably people would be un-reviewing their pages - and yet in the time period specified, experienced users (>= 500 edits) unreviewed...8 pages. In total. Not sure if the initial reviews were by new people or not. I'm happy to accept that quantitative and qualitative information go hand in hand, but your argument doesn't seem to be backed up by either as you've presented it. Okeyes (WMF) (talk) 18:45, 6 February 2013 (UTC)
'In the last 30 days, there have been 51 patrollers with fewer than 500 edits - that's 14 percent of the patrollers overall' - you've just backed it up for me, and it's far too many. The reason their patrols have not been reverted is probably because not many patrollers are patrolling the patrollers - and that's not what we're supposed to be doing. --Kudpung กุดผึ้ง (talk) 12:34, 7 February 2013 (UTC)
Note also "They are responsible for 484 patrols, which is...6 percent of patrols" - really, the number of patrollers in [tranche] is not useful for looking at 'are they doing it well/badly/causing more work'; the thing that counts is "how many patrols are they doing?". If we have one patroller doing 400 patrols, that makes a much bigger impact on the value of patrolling-as-a-way-of-triaging-junk than 10 patrollers doing 5 each. So, yes, they are 14 percent of patrollers: they are responsible for a much smaller chunk of the work. I certainly agree that patrollers do not exist to answer the quis custodiet problem - but either patrollers aren't seeing bad work, in which case your argument that there is a substantial problem involving poor-quality patrolling is...confusing, or patrollers are seeing bad work, and at no point deciding it's worth undoing. Okeyes (WMF) (talk) 13:03, 7 February 2013 (UTC)
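The distinction drawn in this exchange, between a group's share of patrollers and its share of patrols, can be checked with a few lines of arithmetic. The following is only a rough sketch: the four input figures are the ones quoted in the comments above, but the totals are backed out from percentages that were already rounded, so they are estimates rather than database counts, and the function name `implied_total` is made up for this illustration.

```python
# Figures quoted in the comments above (patrol data for the last 30 days).
NEW_PATROLLERS = 51          # patrollers with fewer than 500 edits
NEW_PATROLLER_SHARE = 0.14   # their share of all patrollers (rounded in the source)
NEW_PATROLS = 484            # patrols performed by that group
NEW_PATROL_SHARE = 0.06      # their share of all patrols (rounded in the source)


def implied_total(part: float, share: float) -> float:
    """Back out an approximate total from a part and its (rounded) share."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return part / share


# ASSUMPTION: these totals are estimates derived from rounded percentages.
total_patrollers = implied_total(NEW_PATROLLERS, NEW_PATROLLER_SHARE)
total_patrols = implied_total(NEW_PATROLS, NEW_PATROL_SHARE)

# Average workload per patroller in each group.
new_rate = NEW_PATROLS / NEW_PATROLLERS
experienced_rate = (total_patrols - NEW_PATROLS) / (total_patrollers - NEW_PATROLLERS)

print(f"Estimated patrollers overall: {total_patrollers:.0f}")
print(f"Estimated patrols overall:    {total_patrols:.0f}")
print(f"Patrols per newer patroller:        {new_rate:.1f}")
print(f"Patrols per experienced patroller:  {experienced_rate:.1f}")
```

Under these assumptions the newer group averages fewer patrols per person than the experienced group, which is the sense in which a 14 percent share of patrollers can correspond to only a 6 percent share of the work.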
What percentage is useful? Regarding the claim that "Between 30 and 60 percent of all feedback was rated by editors as 'useful'", at Wikipedia:Article Feedback Tool/Version 5/Feedback evaluation#Is this useful? the instructions say "It is only the most entirely useless feedback that should be categorized as 'no' (not useful)." Several editors have worked together to post a random sample of 1000 feedbacks (after the anti-abuse filters and excluding anything that an editor has marked as hidden) at User:Guy Macon/Workpage. I welcome the interested reader to look at it and make their own estimate of what percentage is useful. --Guy Macon (talk) 02:43, 6 February 2013 (UTC)
Yeah; that's actually an outdated description :). Would you like me to pull the categories/descriptions for the most recent tests? Okeyes (WMF) (talk) 17:58, 6 February 2013 (UTC)
My personal preference is that when WMF publishes the results of a study, it should have two prominent links to "methodology" and "raw data" on the main page of the study. In this particular case the methodology link should tell me, among other things, how the test subjects were selected, what instructions they were given, etc. The raw data should be such that if I want to I can replicate your work. This would bring a welcome level of scientific rigor to these studies. While I am waiting for that to happen, I would like to see a hatnote on anything that is outdated. --Guy Macon (talk) 18:44, 6 February 2013 (UTC)
Obviously publishing our raw data is not necessarily possible (some of it might be oversighted) but I'll see what I can do. Okeyes (WMF) (talk) 23:04, 6 February 2013 (UTC)
It might be best to start with the next one. If you know that you are eventually going to publish some raw data, it is pretty easy to make a version with [Name redacted] and [Email redacted] or [Redacted for privacy reasons] as you go along. If you try to go back and do that after the fact, you always have a doubt about whether you missed one. I care far less about this particular result than I do about instilling a mentality in the WMF where they wouldn't dream about not publishing full details about methodology or not publishing raw data. And we haven't even started talking about single-blind vs. double-blind...

If you really want to focus on this particular study, rather than gathering raw data, somebody should start asking why WMF got "Between 30 and 60 percent useful" and my preliminary results are about 10% useful. That's a huge red flag. Is it because only one person cared enough to look at my data and post an estimate? Was 200 a big enough sample? Is it because your study used 3 people? If you personally looked at the data would you come back and say that your estimate is 30%, not 10%? Is it because in both cases the person doing the evaluation was self-selected? If I saw results like that I would try to rip my own methodology to shreds and then I would try to rip the methodology of the other study to shreds. Somebody is doing something wrong. My attitude toward science: http://xkcd.com/242/ --Guy Macon (talk) 03:00, 7 February 2013 (UTC)

Frankly, I can't answer those questions; I'm not the researcher here ;p. I'll poke Aaron and see if he can comment. Okeyes (WMF) (talk) 11:29, 7 February 2013 (UTC)

Poke received. First of all, I want to direct you to the official report I wrote, which includes the strategy for drawing both a random and a stratified sample and the details of my methodology: meta:Research:Article_feedback/Final_quality_assessment. I'm sad to find that this report wasn't clearly referenced; you're not the first to have missed it. We had 18 Wikipedians each evaluate at least 50 feedback items (though some evaluated more than 200). All feedback submissions were evaluated by two different people. The 30-60% number is a non-statistically founded, conservative minimization of these two evaluations/item. In the study, we found that 66% of feedback was marked *useful* by at least one evaluator ("best" in the report) and 39% of feedback was marked useful by both evaluators ("worst" in the report). Here's the breakdown of the four category classes we asked the evaluators to apply:

Useful - This comment is useful and suggests something to be done to the article.
Unusable - This comment does not suggest something useful to be done to the article, but it is not inappropriate enough to be hidden.
Inappropriate - This comment should be hidden: examples would be obscenities or vandalism.
Oversight - Oversight should be requested. The comment contains one of the following: phone numbers, email addresses, pornographic links, or defamatory/libelous comments about a person.

Note that these exact descriptions appear as tooltips in multiple places in the feedback evaluation tool. If you'd like to personally replicate the study, I'd be happy to pull another random sample for you and load it up in the evaluation tool. --EpochFail (talk • work) 15:42, 7 February 2013 (UTC)
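To make the "best" and "worst" figures described above concrete, here is a minimal sketch of how two such bounds could be computed when every feedback item is rated by two evaluators. It is not the study's actual code: the function name `usefulness_bounds`, the lowercase category labels, and the ten sample ratings are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

USEFUL = "useful"  # hypothetical label; the other categories are unusable, inappropriate, oversight


def usefulness_bounds(ratings: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (worst, best) fractions of items rated useful.

    worst: both evaluators marked the item useful.
    best:  at least one evaluator marked the item useful.
    """
    pairs: List[Tuple[str, str]] = list(ratings)
    if not pairs:
        raise ValueError("need at least one rated item")
    both = sum(1 for a, b in pairs if a == USEFUL and b == USEFUL)
    either = sum(1 for a, b in pairs if a == USEFUL or b == USEFUL)
    return both / len(pairs), either / len(pairs)


# Hypothetical ratings for ten feedback items (two evaluators per item).
sample = [
    ("useful", "useful"),
    ("useful", "unusable"),
    ("unusable", "unusable"),
    ("inappropriate", "unusable"),
    ("useful", "useful"),
    ("unusable", "useful"),
    ("oversight", "oversight"),
    ("unusable", "unusable"),
    ("useful", "useful"),
    ("unusable", "inappropriate"),
]

worst, best = usefulness_bounds(sample)
print(f"worst case (both evaluators): {worst:.0%}")
print(f"best case (at least one):     {best:.0%}")
```

Because every item counted in the "worst" figure is also counted in the "best" figure, the two numbers bracket the range in which a single usefulness estimate would fall, which is how a pair such as 39% and 66% can be summarized loosely as "between 30 and 60 percent".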

Before I respond, let me reiterate that I think everyone at the WMF is doing a good job and has the right goals. This is a discussion about possible improvements, starting with some future study. Those who are looking for a club to beat WMF with should look elsewhere.

meta:Research:Article_feedback/Final_quality_assessment is a very useful overview of the methodology used, but in my opinion an additional detailed methodology would be a Good Thing. (I am about to write some questions, but please don't post the answers. They are examples of what should be in a detailed methodology -- I cannot explain what I am talking about without giving examples of questions that the overview does not answer.) For an example, the overview says "We assigned each sampled feedback submissions to at least two volunteer Wikipedians." A detailed methodology would have said something like this:

"Between 3AM and 4AM on December 24th, we posted a request for volunteers (in French) on Talk:Mojave phone booth and on the main page of xh.wikipedia.org. 43 people volunteered, and we rejected 20 of them for being confirmed sockpuppets of User:Messenger2010 (See Wikipedia:Long-term abuse/Messenger2010) and rejected 11 of them because Guy drank too much and decided he doesn't like editors with "e" in their username. That left us with Jimbo and a six-year-old girl (username redacted for privacy reasons). We then..."

Unlike "We assigned each sampled feedback submissions to at least two volunteer Wikipedians", the above details exactly how those volunteers were chosen. Again, I don't care how they were chosen. I just want future studies to contain a detailed methodology page that answers questions like this or questions about the RNG used. To pick another example, the post above this one says "We had 18 Wikipedians evaluate at least 50 feedback items individually (though some evaluated more than 200)." That detail is not found in the methodology overview. --Guy Macon (talk) 16:51, 7 February 2013 (UTC)

The specific 'how they were chosen' list, I can provide, actually. The purpose of the study was to compare the rating of feedback that did get rated to feedback that got missed out on, suspecting that people overwhelmingly checked feedback for high-profile articles. In order to get some consistency between the two sets of numbers, I pulled from the database a list of all users who had, in the 30 days before we started the recruitment process, monitored more than 10 pieces of feedback in some fashion. The users in question were then sent a talkpage invitation going 'would you like to participate in this?'. I appreciate that's more a specific example to highlight a general point than anything else - and I'm going to bear your general point in mind when writing up something I've been working on recently, actually - but I thought I'd address it :). Okeyes (WMF) (talk) 18:50, 7 February 2013 (UTC)

- I took a look at the list. In detail for the first 200 or so, and then just a few samples. My estimate of useful feedback would be closer to 10%. • • • Peter (Southwood) (talk): 06:57, 6 February 2013 (UTC)
From the lead, "The use of these [talk] pages, though, has typically been limited to experienced editors who know how to use them." Excuse me? This claim not only biases the introduction to the article but is demonstrably not true - I find comments from new users and unregistered users on talk pages fairly frequently. Their use is hardly limited to "experienced editors". – Philosopher (Let us reason together.) 05:27, 6 February 2013 (UTC)
- We've had different experiences, then... I've tweaked the introduction slightly based on your comments, though. Ed [talk] [majestic titan] 05:35, 6 February 2013 (UTC)
If we judge feedback solely by signal-to-noise ratio we do ourselves no favours at all. Charles Matthews (talk) 10:08, 6 February 2013 (UTC)
- Why do you think that? If article feedback allows the junk to sit there undisturbed -- and it does -- then article feedback will also allow prohibited material such as libel, personal details, copyright violations, spam, and violations of our living persons, sockpuppet, and banning policies to sit there undisturbed. --Guy Macon (talk) 11:18, 6 February 2013 (UTC)
  - I said "solely". I made a more detailed comment in the RfC itself. By the way, WP:BEANS to your catalogue of ways the feature can be misused. Charles Matthews (talk) 12:46, 6 February 2013 (UTC)
    - I have no idea why you referenced WP:BEANS. As for my listing ways the feature can be misused, anticipating potential problems and designing solutions for them before they bite you is a Good Thing. --Guy Macon (talk) 18:44, 6 February 2013 (UTC)

Thanks for writing about this RFC; I wouldn't have noticed it otherwise, and it's an important subject. -- phoebe / (talk to me) 22:42, 6 February 2013 (UTC)
The highest quality article feedback I've seen is most always on the article talk pages. To be useful, feedback generally needs to be longer than the short tweets I generally see from the Feedback Tool. I often find useful comments on talk pages that sit unanswered for months, or even years, before I address the issues raised. So we already have a backlog on talk pages, without increasing it with more chatter from this tool. I don't feel that even 10% of the AFT comments are useful, but I've only looked at these comments on a very limited number of articles. Wbm1058 (talk) 23:37, 7 February 2013 (UTC)
- Well, take my comments with a grain of salt, but I'm wondering why there is such a difference in quality of comments between these links. The second seems to find much higher quality comments than the first does. Maybe I just haven't been finding the right articles to view feedback on. Wbm1058 (talk) 15:30, 8 February 2013 (UTC)
  - User:Guy Macon/Workpage
  - Special:ArticleFeedbackv5