Experiments with article assessment

User:Sross (Public Policy) is Sage Ross, the Online Facilitator for the Wikimedia Foundation's Public Policy Initiative. As a volunteer, he edits as User:Ragesoss.

I've been working on Wikimedia's Public Policy Initiative team for a little over three months. The level of interest and enthusiasm we've seen from university professors and volunteers interested in the Wikipedia Ambassador Program has been gratifying, but we still have a long way to go before coming anywhere close to realizing the full potential of all the good will and interest among experts who don't (yet) contribute.

One of the great challenges of this project is assessment: how can we measure the degree to which the project is improving Wikipedia? We're working on three assessment projects within WikiProject United States Public Policy, each of which is relevant to the broader issue of content assessment in general on Wikipedia.

An optional new assessment system

This screencast explains the basics of the new assessment system and walks through an example assessment.

First, our quality assessment system (WP:USPP/ASSESS). Like many other WikiProjects, the U.S. Public Policy project has implemented its own variation on the standard Wikipedia 1.0 assessment system (in which articles are rated as Stub, Start, C, B, GA, A, or FA-class). The basic idea of the new system is to use weighted numerical ratings for six different aspects of article quality: comprehensiveness, sourcing, neutrality, readability, formatting, and illustrations. The system's rubric defines the different scores and how they translate into the standard Wikipedia 1.0 classes. There are several advantages: (1) it contains a specific weighted rubric, (2) it offers more detail on the areas that need work, (3) it provides numerical data for quantitative analysis, and (4) it is backward-compatible with the standard system. We hope it will also prove easier to learn and produce more consistent ratings. The downside is that it's more complicated, and we have yet to reach a critical mass of active reviewers trialing it.
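As a rough illustration of how a weighted rubric like this can be turned into a single class, here is a minimal Python sketch. The maximum points per aspect and the class cutoffs are hypothetical placeholders chosen for the example, not the values in the WP:USPP/ASSESS rubric, and the sketch leaves out GA and FA.

```python
# Hypothetical maximum points per aspect (NOT the real rubric values).
# Giving aspects different maxima is what makes the rating "weighted".
MAX_POINTS = {
    "comprehensiveness": 10,
    "sourcing": 6,
    "neutrality": 3,
    "readability": 3,
    "formatting": 2,
    "illustrations": 2,
}

# Hypothetical cutoffs: minimum total for each class, highest first.
CLASS_CUTOFFS = [(23, "A"), (19, "B"), (14, "C"), (9, "Start"), (0, "Stub")]


def total_score(scores):
    """Sum the per-aspect scores after checking each one is in range."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must cover exactly the six aspects")
    for aspect, value in scores.items():
        if not 0 <= value <= MAX_POINTS[aspect]:
            raise ValueError(f"{aspect} score {value} is out of range")
    return sum(scores.values())


def to_class(total):
    """Map a numerical total onto a 1.0-style class using the cutoffs."""
    for cutoff, label in CLASS_CUTOFFS:
        if total >= cutoff:
            return label
    raise ValueError("total must be non-negative")


example = {
    "comprehensiveness": 6,
    "sourcing": 4,
    "neutrality": 2,
    "readability": 2,
    "formatting": 1,
    "illustrations": 1,
}
print(total_score(example), to_class(total_score(example)))
```

The per-aspect scores stay available after the conversion, which is where the extra detail and the numerical data mentioned above would come from.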

The Wikipedia 1.0 scheme, which was originally pioneered by WikiProject Chemistry, succeeds to a large degree because of its simplicity. Experienced Wikipedians develop a good feel for the stages of improvement articles typically go through, and the 1.0 scale codifies those stages. It provides a quick way to mark the quality of individual articles and a blunt measurement of how quality is changing over large groups of articles, and even across the whole of Wikipedia. However, the system is not easy or intuitive for newcomers to pick up. Although simple from an experienced editor's perspective, the system has nuanced definitions of what, for example, makes a B-class article different from a C-class article or a Good Article; these definitions can be bewildering for those who haven't absorbed Wikipedia's norms. Like our core policies and guidelines, the 1.0 assessment system squeezes a lot of Wikipedia culture into a small package. The goal of the public policy system is to unpack that culture, making more explicit what Wikipedians expect from high-quality articles. We believe this explicitness may reduce some of the inconsistency in the 1.0 system, as well.

Rating the ratings

A second and closely related effort is the plan by our research analyst, Amy Roth, to test how consistent Wikipedia's article ratings are. We are assembling a small team—a mixture of Wikipedians and non-Wikipedian public policy experts—to periodically rate and re-rate a random sample of public policy articles. Amy will measure how closely results from our system match the standard ratings, how much ratings vary from person to person, how well the ratings can account for changes in article quality, and whether outside experts' assessments differ significantly from those of Wikipedians. Amy's test may shed light on the inconsistency of assessments in the middle ranges of the standard scale, particularly Start, C, and B-class.
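One simple way to look at how much ratings vary from person to person is the spread of scores each article receives. The sketch below is only an illustration: the text does not say which statistics the analysis will use, and the function name and the scores are made up for the example.

```python
from statistics import mean, stdev


def rater_spread(ratings_by_article):
    """Return {article: (mean score, sample standard deviation)}.

    A larger standard deviation means the raters disagreed more.
    """
    summary = {}
    for article, scores in ratings_by_article.items():
        if len(scores) < 2:
            raise ValueError(f"{article} needs at least two ratings")
        summary[article] = (mean(scores), stdev(scores))
    return summary


# Made-up numerical totals from three hypothetical raters per article.
sample = {
    "Article A": [10, 12, 14],
    "Article B": [18, 18, 18],
}

for article, (avg, spread) in rater_spread(sample).items():
    print(f"{article}: mean={avg}, spread={spread:.2f}")
```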

Recruiting for the assessment team has gone poorly so far, but we have plans to run a watchlist notice to attract more attention to assessment efforts (as well as potentially enlarging the group of Online Ambassadors to keep pace with the expanding number of students who will be participating in Wikipedia assignments).

Input from readers

The Public Policy Initiative will test a new Article Feedback Tool. Beginning 22 September, the feature will be enabled for most of the articles within WikiProject United States Public Policy (it will not be enabled on the most trafficked articles to avoid overtaxing the servers). Editors interested in seeing the extension in action on particular U.S. public-policy-related articles should ensure the articles are tagged with the project banner, {{WikiProject United States Public Policy}}, and assessed with the WikiProject's numerical system.

The current iteration of the Article Feedback Tool, which will appear at the bottom of WikiProject United States Public Policy articles beginning 22 September

This pilot is also part of the Wikimedia Foundation's longer-term strategy to explore different mechanisms of quality assessment. The potential upside of reader ratings is straightforward: we may be able to collect a large number of ratings from a largely external audience (as opposed to Wikipedians judging their own work). The potential downside is also clear: non-experts may submit low-quality ratings, or there may be attempts to game the system. The rating tool includes a small survey that will complement the collected data.

Together with the technology team, we will test the technology, analyze the data, and continue discussions about how a reader-focused rating and comment system might be used in the next academic term in the Public Policy Initiative, as well as on Wikipedia more broadly. I'm personally very excited about the possibility of creating a robust system for reader feedback, and I hope this test sparks serious discussion about what such a system should look like. A set of Questions and Answers regarding the feedback tool, as well as a general discussion page about it, will be available soon.

13 September 2010

Discuss this story

An interesting project. One immediate concern is that articles might be rated on how individual readers feel about the subject of the article, rather than the quality of its content. Jezhotwells (talk) 19:54, 13 September 2010 (UTC)

While certainly better than the current rating system of... peer review, I personally don't think it can stand the test of time in the sense that articles can change quickly as new information becomes available or, say, if a single Wikipedian decides to take matters into his own hands. Most of all, I'm not sure how well an automated system would work for this task. But this will be a Wikipedia-wide (yet optional, although this wasn't made very clear) switchover, and we will all stumble upon it at some point. My biggest question is: Is this update for the US only? --Γιάννης Α. ^✆|☑ 20:03, 13 September 2010 (UTC)

If you're referring to the Reader Assessment Tool, it's not automated; it presents aggregate scores based on all the ratings that individual users have done. As for the test of time, that's definitely a big challenge: figuring out how to deal with stale ratings. For the time being, I think the developers are considering things like a time-based half-life for ratings or a half-life based on the number of intervening edits. It might be possible down the line to do better with a tool that compares how much of the current text is present in earlier rated versions to determine how much weight old ratings get. This will not be Wikipedia-wide at this point; it will only be available for articles in WikiProject United States Public Policy. This is basically a technology test and a conversation starter, at this point. But it will definitely be available to whichever wikis want it once the technology stabilizes.--Sage Ross - Online Facilitator, Wikimedia Foundation (talk) 21:08, 13 September 2010 (UTC)
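To make the half-life idea concrete, here is a minimal Python sketch of an aggregate score in which each rating loses half its weight for every half-life that has passed. It is only an illustration of the general technique, not the Article Feedback Tool's implementation; the function name and the sample numbers are made up. "Age" can be measured in days or in intervening edits, as long as the half-life uses the same unit.

```python
def decayed_average(ratings, half_life, now):
    """Weighted mean of ratings, halving a rating's weight every half_life.

    ratings: list of (score, submitted_at) pairs
    half_life: age at which a rating counts half as much (same unit as now)
    now: current position on the same scale (day number or edit count)
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    weighted_sum = 0.0
    weight_total = 0.0
    for score, submitted_at in ratings:
        age = now - submitted_at
        weight = 0.5 ** (age / half_life)
        weighted_sum += weight * score
        weight_total += weight
    if weight_total == 0:
        return None  # no ratings yet
    return weighted_sum / weight_total


# Made-up example: a 2-star rating from day 0 and a 5-star rating from day 30,
# evaluated on day 30 with a 30-day half-life.
print(decayed_average([(2, 0), (5, 30)], half_life=30, now=30))
```

In the example the older rating has a weight of 0.5 and the newer one a weight of 1, so the aggregate sits closer to the recent rating than a plain average would.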

On controversial articles, someone will try to game this system, but better they waste their time trying to game the article assessment than trying to game the article content. When 4Chan gets Goatse rated as the best article on Wikipedia we can all have a chuckle, but it won't have any other effect. This tool looks like it might be useful for other sorts of automated polling of readers. We could add questions on age, income, education level, and language fluency to the quiz and get some correlation going. Different readers could be asked different questions so the questionnaire doesn't get too long. When you start asking that level of detailed question, however, the question of anonymity comes up. Does this tool record the IP of each respondent? filceolaire (talk) 21:25, 13 September 2010 (UTC)

I think the rating scale is indeed better in the aspects mentioned in the text, but I would suggest simplifying it to a 5-point scale for each measure. That would make it more intuitive (worst-bad-average-good-best), it would relate directly to the star rating from the Article Feedback Tool, and the template could still calculate the overall score by giving different weights to each measure. As it is right now, it involves the same kind of learning curve as the 1.0 system (note how both can be partially solved with the use of HTML comments, but fail in their absence). --Waldir ^talk 11:37, 14 September 2010 (UTC)

Yeah, it's a tough needle to thread. A 5-point scale for each factor has two downsides: there aren't enough points to differentiate between every class for comprehensiveness (Stub, Start, C, B, GA, and A/FA all have different requirements for that), and it risks implying that every factor is equally important. But I agree with the advantages you point out. Whether they outweigh the downsides, I'm not sure. Personally, I hope the Article Feedback Tool will evolve in a more reader-oriented direction, because I don't think readers think about article quality in the same terms as editors. For the Article Feedback Tool, I'd like to see something like a single 5-star rating for the whole article, then a question like "Did you find the information you were looking for? [yes, some, no]" and an input box to leave comments.--Sage Ross - Online Facilitator, Wikimedia Foundation (talk) 12:46, 14 September 2010 (UTC)

I thought the 1.0 correspondence was given by a weighted average of all the components, not only from the comprehensiveness scale. In that sense, I don't see why a 5-point scale invalidates the inference of the 1.0 rating, but I probably am interpreting the conversion the wrong way. As for the implication that all factors are equally important, that imo doesn't sound as much of a problem. And even if it did that, I don't see how it would affect the assessment. --Waldir ^talk 19:38, 14 September 2010 (UTC)

I don't think it makes sense to have a complicated system of assessment. An editor who is experienced in an area can eyeball an article and tell you whether it is Start, or C or B. As someone said above, every article is a moving target anyway, so why worry so much about assessments? So what if a C-class article gets grade-inflated to B? It still needs careful work to be ready for GA. IMO, editors spending all this time on assessment ought to be researching and writing instead. -- Ssilvers (talk) 21:15, 14 September 2010 (UTC)

I agree; there's no point in devising an assessment system so complicated that it takes significant time away from editing. Especially since we don't really know how much article assessment helps towards article improvement. Lampman (talk) 04:43, 15 September 2010 (UTC)

Of course, writing is a more important task than assessing. But effort and activity on Wikipedia are not very fungible; different people put their energy into different things, and we can't just transfer that energy from one area to another (for the most part). More detailed assessments (especially optional ones like this, where the simpler version is always an option) provide an opportunity to a) give a more accurate indication of an article's quality, which is important for things like creating offline versions, and b) give editors a more specific indication of how an article can be improved.

In this case, we also need to do measurements of article quality as part of the requirements of the Public Policy Initiative grant, which Amy Roth determined would be impractical without a more quantifiable assessment system.--Sage Ross - Online Facilitator, Wikimedia Foundation (talk) 14:48, 15 September 2010 (UTC)

What I like to do with assessments is to leave a list of suggestions for improvement on the talk page, like I did today at Talk:Kerry Ellis. -- Ssilvers (talk) 05:22, 16 September 2010 (UTC)

I agree with the comment above, that it won't prevent Wikipedia:Gaming the system. But it might ease tension a bit in cases where determined POV pushers repeatedly delete the "NPOV dispute" tag from articles which they are censoring or otherwise distorting.

It will work best on articles which aren't the target of edit wars. --Uncle Ed (talk) 20:03, 18 September 2010 (UTC)

I really like the idea of "group-sourcing" the quality of an article with this mechanism. If I may, there does need to be some sort of counter n = X which shows how many votes have been received to generate the ratings... 20,000 responses means more than 2 responses that way. There also needs to be some protection against "revoting," which will inevitably happen in contentious articles as a sort of plebiscite on whether the reader approves of the content... Still, this is a really good step and I hope there comes a day in the not too distant future when all Wikipedia articles have a sort of "group-sourced" feedback section. Carrite (talk) 16:27, 20 September 2010 (UTC)

I just thought of a problem. Wikipedia articles generally start small and weak and get bigger and better over time. Yet an article can accumulate ratings for years in its small, weak state, then be improved, and still be saddled with obsolete "old" ratings. There needs to be some sort of a reset mechanism for massively expanded articles or some sort of automatic elimination of ratings more than, let's say, a year old to keep the ratings more or less as fresh as the article. Carrite (talk) 16:31, 20 September 2010 (UTC)

Yep. That's one of the big challenges that the developers are thinking about, how to deal with stale ratings. Hopefully, once people get some experience with how the ratings work during this pilot, we can come up with some ideas for dealing with that problem effectively.--Sage Ross - Online Facilitator, Wikimedia Foundation (talk) 16:36, 20 September 2010 (UTC)

Without considering technical matters, it's pretty easy to know when an article has most likely moved out of stub or start status, simply by looking at length and number of footnotes. A 1000+ word article with 5+ footnotes can't possibly be a stub; a 1500+ word article with 10+ footnotes is almost certainly "C" class (or better) rather than "start class". I'm not arguing here for machine-grading. Rather, it seems clear that it's easy for a computer to determine that a specific, older rating should be discarded because an article has changed sufficiently since that particular rating was done. -- John Broughton (♫♫) 19:31, 20 September 2010 (UTC)
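The rule of thumb in the comment above can be sketched in a few lines of Python. The word and footnote thresholds are the ones quoted in the comment; the function names and the decision to flag (rather than regrade) are assumptions made for the example, and this is not an existing tool.

```python
# Classes the heuristic can reason about, lowest to highest.
CLASS_ORDER = ["Stub", "Start", "C", "B"]


def minimum_plausible_class(word_count, footnote_count):
    """Lowest class the article could plausibly be, per the quoted thresholds."""
    if word_count >= 1500 and footnote_count >= 10:
        return "C"      # "almost certainly C class (or better)"
    if word_count >= 1000 and footnote_count >= 5:
        return "Start"  # "can't possibly be a stub"
    return "Stub"


def rating_is_stale(rated_class, word_count, footnote_count):
    """True if the stored rating is below what the current size implies.

    This only flags an old rating for discarding; it does not assign a new one.
    """
    floor = minimum_plausible_class(word_count, footnote_count)
    return CLASS_ORDER.index(rated_class) < CLASS_ORDER.index(floor)


# An article rated Stub that has since grown to 1,200 words and 6 footnotes.
print(rating_is_stale("Stub", word_count=1200, footnote_count=6))
```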