The Signpost

News and notes

Article Feedback tool faces community resistance

By The ed17

Article feedback, at least through talk pages, has been a part of Wikipedia since its inception in 2001. The use of these pages by new editors, though, has typically been limited at best.

As part of the Wikimedia Foundation's (WMF) Public Policy Initiative, a specialized article feedback tool was developed and added to a selected set of English Wikipedia articles in September 2010 (see Signpost coverage). Over the following months the tool was refined through several iterations, until version four was deployed to every English Wikipedia article in July 2011. This iteration allowed readers to rate articles from one to five stars in four categories: trustworthiness, objectivity, completeness, and how well the article was written.

An early image of version five, showing the possibility of entering feedback via a text box.

In December 2011, the WMF began the transition to version five (also known as AFTv5), adding it to 10% of articles and aiming for full deployment in the first quarter of this year. Version five added a text box where readers could give written feedback on articles. The feedback is directed to a centralized page with various filtering options, including viewing feedback from a specific article or only from articles on your watchlist. Editors can feature a post, mark the concern as resolved, hide the post or request oversight, up- or down-vote it, or flag it as abuse.

A request for comment (RfC) on the project was opened by MZMcBride in mid-January 2013. His impetus, as described in a Signpost op-ed in August 2012, was that the extension was "deployed without anti-abuse mechanisms", leading to the feedback area becoming a "safe haven for spam and other useless noise." In addition, he told the Signpost:


In his view, the feedback tool should be used only on an opt-in basis, so that editors who are interested in an article (for example, someone who wrote the piece and wants to solicit feedback on how to improve it further) will actually respond to the feedback. He believes the new backlogs are a burden, and that deploying the tool to the entire site would make the issue worse. This has a strong basis in fact: according to the WMF, readers submitted an average of 4,100 posts per day, of which fewer than 10% were moderated; these figures have fallen since that blog post was published. When scaled to the full site, the WMF expected over 900,000 posts per month, or over 30,000 a day, to come in via the feedback tool: more in a single month than all of the feedback collected to date.
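
As a rough sanity check on those projections, the short sketch below recomputes the monthly figure from the daily one and estimates the resulting unmoderated volume; the 30-day month and the carried-over sub-10% moderation rate are assumptions for illustration, not WMF figures.

    # Rough consistency check of the projections quoted above.
    # Assumptions: a 30-day month, and that the sub-10% moderation rate
    # observed so far would persist at full scale.
    projected_per_day = 30_000                      # WMF projection for full deployment
    days_per_month = 30                             # assumed month length
    projected_per_month = projected_per_day * days_per_month
    print(projected_per_month)                      # 900000 -- matches the monthly figure cited

    moderated_fraction = 0.10                       # "fewer than 10%" of posts moderated
    unmoderated_per_month = projected_per_month * (1 - moderated_fraction)
    print(round(unmoderated_per_month))             # 810000 posts per month left unmoderated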

A more radical viewpoint was put forward by GregJackP, who simply stated that "the tool is useless" and that the community "should eliminate the feature." He expanded on his views to the Signpost, saying that he believes most feedback is blank, "a general statement of dissatisfaction or satisfaction, or just garbage/spam." In GregJackP's assessment, the time it takes editors to sift through the "garbage" far outweighs the value of the positive feedback, which is typically "just a question that you have to research and determine if it [is] something that can even be found."

Contrary to this view, editors like Mike Cline believe that the tool is becoming a major source of data on how the public views Wikipedia. Tom Morris eloquently stated:


Oliver Keyes, the community liaison for the article feedback program, acknowledged the low level of moderation to the Signpost: "Are there sufficient resources to moderate and respond to all of the feedback? The honest answer is 'probably not'." However, he then related the issue to Wikipedia as a whole: "I don't see this as a problem: we're a wiki. Always have been, always will be. Edits will need oversighting or deleting, bad edits will slip through the cracks, and we accept that because it's necessary to produce the good things that an open system gives us. I see no reason not to take the same attitude with feedback."

Keyes told the Signpost that between 30 and 60 percent of all feedback was rated by editors as 'useful', a finding backed up by the article feedback team's fourth-quarter report, which found that 40% of a random sample taken between February and April was judged helpful by at least two editors. In addition, he says the WMF communicated its goals for the program through 17 office-hours sessions on IRC (held at different times to reach different regions of the world), mailing lists, and the village pump, in addition to the project talk page and a regular newsletter. The latter two alone reached at least 220 people, and probably more, which is far more than a typical Wikipedia discussion.
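
The spread in those figures reflects how evaluator agreement is counted: a comment can be credited as useful if at least one evaluator marked it so (a best case), or only if every evaluator did (a worst case). The sketch below illustrates that calculation on invented paired ratings; it is not the WMF's code or data.

    # Illustration of 'best case' vs 'worst case' usefulness rates when each
    # feedback item is rated by two evaluators. The ratings are invented for
    # illustration; they are not the study's data.
    ratings = [
        ("useful", "useful"),
        ("useful", "unusable"),
        ("unusable", "unusable"),
        ("useful", "useful"),
        ("inappropriate", "unusable"),
    ]
    n = len(ratings)
    best = sum("useful" in pair for pair in ratings) / n                 # at least one evaluator said useful
    worst = sum(pair == ("useful", "useful") for pair in ratings) / n    # both evaluators said useful
    print(f"best case: {best:.0%}, worst case: {worst:.0%}")             # best case: 60%, worst case: 40%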

Still, at the time of writing the request for comment shows a large majority in favor of GregJackP's view, with more than double the support of the second-most endorsed view (MZMcBride's). The RfC will remain open until February 21.

In brief


Discuss this story

  • Article Feedback Tool: I've said this before many, many times, and I'll say it again here. The Foundation's passion for stats and feedback does not always contribute to the improvement of Wikipedia and its management by the volunteer community. WMF projects thrust upon the Wiki have required massive community incentive to carry out the cleanups when they misfire, and reasonable solutions for improvement in quality of new articles required by community consensus have been summarily rejected by the Foundation. While a truly excellent tool in the hands of the right users, NewPagesFeed/CurationTool does not address these issues and has not improved the quantity and quality of new-page patrolling. AfT creates more work for this community than the net useful information that it is designed to produce. Someone recently stated words to the effect that the Foundation's answer to the community's claim that a car (project) is broken, is 'Keep pushing'. The only real solution is to deploy Foundation funds and resources to re-launch development of the Article Creation Workflow as a proper landing page for new users/page creators. Instead of simply wanting quantity instead of quality, the Foundation would probably rejoice at the result which would greatly reduce the burdens and backlogs in such areas as Articles for Creation, Deletions and AfD, largely resolve the issues surrounding the work of admins, and their appointment at WP:RfA, and reduce the endemic hat-collecting of minor rights. Meta areas, including WP:NPP, WP:AfC, WP:AfD, and possibly also the AfT, are a magnet to inexperienced users who cannot, or prefer not to, expand or create content. Kudpung กุดผึ้ง (talk) 02:15, 6 February 2013 (UTC)[reply]
    I'm sorry that you don't feel statistics and hard data has a role in helping the volunteer community with its workload, but I must confess to being bemused by how ACTRIAL or the problem(s) with patrolling incoming content has anything to do with AFT5 (or how AFT5 can be creating work for the community when we've said 'if you guys want to turn it off, we'll turn it off' and people seem to be heading in that direction). The Foundation is not looking at quantity instead of quality; it's looking to raise the number of people who can help with maintenance tasks. And yes, sometimes this involves not only training but also making the software easier, as we did with Page Curation, or pointing people towards those tasks that need to be done. The vast majority of users do not engage in meta areas, which is why it would surprise me to find that a majority or substantial chunk of inexperienced users did; to resort to statistics for a moment, I ran a quick database query against the patrolling tables. In the last 30 days, there have been 51 patrollers with fewer than 500 edits - that's 14 percent of the patrollers overall. They are responsible for 484 patrols, which is...6 percent of patrols. If they were doing that terrible a job, presumably people would be un-reviewing their pages - and yet in the time period specified, experienced users (>= 500 edits) unreviewed...8 pages. In total. Not sure if the initial reviews were by new people or not. I'm happy to accept that quantitative and qualitative information go hand in hand, but your argument doesn't seem to be backed up by either as you've presented it. Okeyes (WMF) (talk) 18:45, 6 February 2013 (UTC)[reply]
    'In the last 30 days, there have been 51 patrollers with fewer than 500 edits - that's 14 percent of the patrollers overall' - you've just backed it up for me, and it's far too many. The reason their patrols have not been reverted is probably because not many patrollers are patrolling the patrollers - and that's not what we're supposed to be doing. --Kudpung กุดผึ้ง (talk) 12:34, 7 February 2013 (UTC)[reply]
    Note also "They are responsible for 484 patrols, which is...6 percent of patrols" - really, the number of patrollers in [tranche] is not useful for looking at 'are they doing it well/badly/causing more work'; the thing that counts is "how many patrols are they doing?". If we have one patroller doing 400 patrols, that makes a much bigger impact on the value of patrolling-as-a-way-of-triaging-junk than 10 patrollers doing 5 each. So, yes, they are 14 percent of patrollers: they are responsible for a much smaller chunk of the work. I certainly agree that patrollers do not exist to answer the quis custodiet problem - but either patrollers aren't seeing bad work, in which case your argument that there is a substantial problem involving poor-quality patrolling is...confusing, or patrollers are seeing bad work, and at no point deciding it's worth undoing. Okeyes (WMF) (talk) 13:03, 7 February 2013 (UTC)[reply]
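
For readers following the arithmetic in the exchange above, the totals implied by those percentages can be back-calculated; the derived figures below are rounded approximations, not output from the database query.

    # Back-calculating totals from the shares quoted above: 51 newer patrollers
    # (<500 edits) are said to be 14% of all patrollers, and their 484 patrols
    # are said to be 6% of all patrols in the 30-day window.
    newer_patrollers = 51
    newer_patrols = 484
    total_patrollers = newer_patrollers / 0.14          # ~364 patrollers in the period
    total_patrols = newer_patrols / 0.06                # ~8,067 patrols in the period
    per_newer = newer_patrols / newer_patrollers        # ~9.5 patrols per newer patroller
    per_experienced = (total_patrols - newer_patrols) / (total_patrollers - newer_patrollers)
    print(round(total_patrollers), round(total_patrols), round(per_newer, 1), round(per_experienced, 1))
    # 364 8067 9.5 24.2  -- newer patrollers do a much smaller share of the work per head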
  • What percentage is useful? Regarding the claim that "Between 30 and 60 percent of all feedback was rated by editors as 'useful'", at Wikipedia:Article Feedback Tool/Version 5/Feedback evaluation#Is this useful? the instructions say "It is only the most entirely useless feedback that should be categorized as 'no' (not useful)." Several editors have worked together to post a random sample of 1000 feedbacks (after the anti-abuse filters and excluding anything that an editor has marked as hidden) at User:Guy Macon/Workpage. I welcome the interested reader to look at it and make their own estimate of what percentage is useful. --Guy Macon (talk) 02:43, 6 February 2013 (UTC)[reply]
    Yeah; that's actually an outdated description :). Would you like me to pull the categories/descriptions for the most recent tests? Okeyes (WMF) (talk) 17:58, 6 February 2013 (UTC)[reply]
    My personal preference is that when WMF publishes the results of a study, it should have two prominent links to "methodology" and "raw data" on the main page of the study. In this particular case the methodology link should tell me, among other things, how the test subjects were selected, what instructions they were given, etc. The raw data should be such that if I want to I can replicate your work. This would bring a welcome level of scientific rigor to these studies. While I am waiting for that to happen, I would like to see a hatnote on anything that is outdated. --Guy Macon (talk) 18:44, 6 February 2013 (UTC)[reply]
    Obviously publishing our raw data is not necessarily possible (some of it might be oversighted), but I'll see what I can do. Okeyes (WMF) (talk) 23:04, 6 February 2013 (UTC)[reply]
    It might be best to start with the next one. If you know that you are eventually going to publish some raw data, it is pretty easy to make a version with [Name redacted] and [Email redacted] or [Redacted for privacy reasons] as you go along. If you try to go back and do that after the fact, you always have a doubt about whether you missed one. I care far less about this particular result than I do about instilling a mentality in the WMF where they wouldn't dream about not publishing full details about methodology or not publishing raw data. And we haven't even started talking about single-blind vs. double-blind...
If you really want to focus on this particular study, rather than gathering raw data, somebody should start asking why WMF got "Between 30 and 60 percent useful" and my preliminary results are about 10% useful. That's a huge red flag. Is it because only one person cared enough to look at my data and post an estimate? was 200 a big enough sample? Is it because your study used 3 people? If you personally looked at the data would you come back and say that your estimate is 30%, not 10%? Is it because in both cases the person doing the evaluation was self-selected? If I saw results like that I would try to rip my own methodology to shreds and then I would try to rip the methodology of the other study to shreds. Somebody is doing something wrong. My attitude toward science: http://xkcd.com/242/ --Guy Macon (talk) 03:00, 7 February 2013 (UTC)[reply]
Frankly, I can't answer those questions; I'm not the researcher here ;p. I'll poke Aaron and see if he can comment. Okeyes (WMF) (talk) 11:29, 7 February 2013 (UTC)[reply]
poke received. First of all, I want to direct you to the official report I wrote (meta:Research:Article_feedback/Final_quality_assessment), which includes the strategy for drawing both a random and a stratified sample and the details of my methodology. I'm sad to find that this report was not clearly referenced; you're not the first to have missed it. We had 18 Wikipedians evaluate at least 50 feedback items individually (though some evaluated more than 200). All feedback submissions were evaluated by two different people. The 30-60% number is a non-statistically founded, conservative minimization of these two evaluations per item. In the study, we found that 66% of feedback was marked *useful* by at least one evaluator ("best" in the report) and 39% of feedback was marked useful by both evaluators ("worst" in the report). Here's the breakdown of the four category classes we asked the evaluators to apply:
  • Useful - This comment is useful and suggests something to be done to the article.
  • Unusable - This comment does not suggest something useful to be done to the article, but it is not inappropriate enough to be hidden.
  • Inappropriate - This comment should be hidden: examples would be obscenities or vandalism.
  • Oversight - Oversight should be requested. The comment contains one of the following: phone numbers, email addresses, pornographic links, or defamatory/libelous comments about a person.
Note that these exact descriptions appear as tooltips in multiple places in the feedback evaluation tool. If you'd like to personally replicate the study, I'd be happy to pull another random sample for you and load it up in the evaluation tool. --EpochFail(talkwork) 15:42, 7 February 2013 (UTC)[reply]
Before I respond, let me reiterate that I think everyone at the WMF is doing a good job and has the right goals. This is a discussion about possible improvements, starting with some future study. Those who are looking for a club to beat WMF with should look elsewhere.
meta:Research:Article_feedback/Final_quality_assessment is a very useful overview of the methodology used, but in my opinion an additional detailed methodology would be a Good Thing. (I am about to write some questions, but please don't post the answers. They are examples of what should be in a detailed methodology -- I cannot explain what I am talking about without giving examples of questions that the overview does not answer.) For an example, the overview says "We assigned each sampled feedback submissions to at least two volunteer Wikipedians." A detailed methodology would have said something like this:
"Between 3AM and 4AM on December 24th, we posted a request for volunteers (in French) on Talk:Mojave phone booth and on the main page of xh.wikipedia.org. 43 people volunteered, and we rejected 20 of them for being confirmed sockpuppets of User:Messenger2010 (See Wikipedia:Long-term abuse/Messenger2010) and rejected 11 of them because Guy drank too much and decided he doesn't like editors with "e" in their username. That left us with Jimbo and a six-year-old girl (username redacted for privacy reasons). We then..."
Unlike "We assigned each sampled feedback submissions to at least two volunteer Wikipedians", the above details exactly how those volunteers were chosen. Again, I don't care how they were chosen. I just want future studies to contain a detailed methodology page that answers questions like this or questions about the RNG used. To pick another example, the post above this one says "We had 18 Wikipedians evaluate at least 50 feedback items individually (though some evaluated more than 200)." That detail is not found in the methodology overview. --Guy Macon (talk) 16:51, 7 February 2013 (UTC)[reply]
The specific 'how they were chosen' list, I can provide, actually. The purpose of the study was to compare the rating of feedback that did get rated to feedback that got missed out on, suspecting that people overwhelmingly checked feedback for high-profile articles. In order to get some consistency between the two sets of numbers, I pulled from the database a list of all users who had, in the 30 days before we started the recruitment process, monitored more than 10 pieces of feedback in some fashion. The users in question were then sent a talkpage invitation going 'would you like to participate in this?'. I appreciate that's more a specific example to highlight a general point than anything else - and I'm going to bear your general point in mind when writing up something I've been working on recently, actually - but I thought I'd address it :). Okeyes (WMF) (talk) 18:50, 7 February 2013 (UTC)[reply]
  • Thanks for writing about this RFC; I wouldn't have noticed it otherwise, and it's an important subject. -- phoebe / (talk to me) 22:42, 6 February 2013 (UTC)[reply]
  • The highest quality article feedback I've seen is most always on the article talk pages. To be useful, feedback generally needs to be longer than the short tweets I generally see from the Feedback Tool. I often find useful comments on talk pages that sit unanswered for months, or even years, before I address the issues raised. So we already have a backlog on talk pages, without increasing it with more chatter from this tool. I don't feel that even 10% of the AFT comments are useful, but I've only looked at these comments on a very limited number of articles. Wbm1058 (talk) 23:37, 7 February 2013 (UTC)[reply]



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0