The Signpost

Special report

Revision scoring as a service

By Aaron Halfaker, とある白い猫 and He7d3r


Wikipedia relies heavily on artificial intelligence (AI)-based tools to operate at the scale it does today. The use of AI is most apparent in counter-vandalism tools like those used to revert nearly all of the vandalism on the English Wikipedia: ClueBot NG, Huggle and STiki. These advanced wiki tools use intelligent algorithms to automatically revert vandalism or triage likely damaging edits for human review. It's arguable that these tools saved the Wikipedia community from being overwhelmed during the massive growth period of 2006–2007.

Regrettably, developing and implementing such powerful AI is hard. A tool developer needs expertise in statistical classification, natural language processing, and advanced programming techniques, as well as access to hardware that can store and process large amounts of data. It's also relatively labor-intensive to maintain these AIs so that they stay up to date with the quality concerns of present-day Wikipedia. Likely due to these difficulties, AI-based quality control tools are only available for the English Wikipedia and a few other large wikis.

Our goal in the Revision Scoring project is to do the hard work of constructing and maintaining powerful AI so that tool developers don't have to. This cross-lingual, machine learning classifier service for edits will support new wiki tools that require edit quality measures.

We'll be making quality scores available via two different strategies:

via our Web interface (for bots and gadgets)

http://ores.wmflabs.org/scores/enwiki?models=reverted&revids=644899628|644897053

{"644899628": 
  {"damaging": 
    {"prediction": true, 
     "probability": {'true': 0.834253, 'false': 0.165747}
    }
  },
 "644897053":
  {"damaging": 
    {"prediction": false, 
     "probability": {'false': 0.95073, 'true': 0.04927}
    }
  }
}
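
For instance, a bot or gadget could fetch these scores over HTTP with a few lines of Python. This is only a minimal sketch based on the example URL and response shown above; the field names and parameters are assumptions drawn from that example, and the third-party requests library is used for the HTTP call.

import requests  # third-party HTTP library

# Request "reverted" scores for two revisions from the example endpoint above.
rev_ids = [644899628, 644897053]
response = requests.get(
    "http://ores.wmflabs.org/scores/enwiki",
    params={"models": "reverted",
            "revids": "|".join(str(r) for r in rev_ids)})
scores = response.json()

# Each revision ID maps to a per-model score with a prediction and probabilities.
for rev_id, models in scores.items():
    score = models["reverted"]
    print(rev_id, score["prediction"], score["probability"]["true"])
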
via our library (batch processing)
from mw import api
from revscoring.extractors import APIExtractor
from revscoring.scorers import MLScorerModel

# Load a pre-built scoring model; model files are binary, so open with "rb".
model = MLScorerModel.load(open("enwiki.damaging.20150201.model", "rb"))

# Set up a MediaWiki API session and a feature extractor for the model's language.
api_session = api.Session("https://en.wikipedia.org/w/api.php")
extractor = APIExtractor(api_session, model.language)

# Extract the model's features for each revision and print its score.
for rev_id in [644899628, 644897053]:
    feature_values = extractor.extract(rev_id, model.features)
    score = model.score(feature_values)
    print(score)
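
Continuing from the library example above (and reusing its model and extractor objects), a tool could use the returned probabilities to triage edits for human review. The 0.8 threshold and the exact shape of the score dictionary are illustrative assumptions based on the example response shown earlier, not values prescribed by the revscoring project.

REVIEW_THRESHOLD = 0.8  # arbitrary example cutoff, not a recommended value

needs_review = []
for rev_id in [644899628, 644897053]:
    feature_values = extractor.extract(rev_id, model.features)
    score = model.score(feature_values)
    probability = score["probability"]
    # The probability key may be a boolean or a string depending on how
    # the score is serialized; handle both here.
    p_damaging = probability.get(True, probability.get("true", 0.0))
    if p_damaging >= REVIEW_THRESHOLD:
        needs_review.append(rev_id)

print("Edits queued for human review:", needs_review)
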

We'll also provide raw labelled data for training new models.

Project status and getting involved

Mockup of the hand-coding interface

We've already completed our first milestone: replicating the state of the art in damage detection for the English, Turkish and Portuguese Wikipedias. In the next two months, we will build a hand-coding system and ask volunteers to help us categorize random samples of edits as "damaging" and/or "good-faith". These new datasets will help us train better classifiers. If you'd like to help us gather data or extend the scoring system to more languages, please let us know on our talk page.


Discuss this story


Please exercise extreme caution to avoid encoding racism or other biases into an AI scheme. For example, there are some editors who have a major bias against having articles on every village in Pakistan, even though we have articles on every village in the U.S. Any trace of the local writing style, like saying "beautiful village", or naming prominent local families, becomes an object of ridicule for these people. Others can object to that, however. But AI (especially neural networks, but really any little-studied code) offers the last bastion of privacy: a place for making decisions without anyone asking how the decision was reached. My feeling is that editors should keep a healthy skepticism; this was a project meant to be written, and reviewed, by people. Wnt (talk) 12:58, 20 February 2015 (UTC)

  • I agree. Many of these articles (e.g. articles on villages in Pakistan) are started in good faith by new editors trying to add information about where they live, usually an underrepresented area here. Do we really want to discourage contributions from these parts of the world? What harm is being done, considering (for example) the amount of allowable cruft added by fan- or ideology-based editors on topics primarily of interest to US editors? EChastain (talk) 14:08, 21 February 2015 (UTC)
    • Hi Wnt and EChastain. I agree. In fact, it's concerns about this sort of potentially damaging behavior that led me to start this project in the first place. A substantial portion of my scholarly work has been studies of the effect that quality control algorithms have had on the experience of being a new editor (see my pubs and WP:Snuggle). My hope is that, by making AI easy in this way, we'll be able to develop *better* ways to perform quality control work with AI -- e.g. we could develop a user gadget that Wikipedia:WikiProject_Pakistan members could use to review recent newcomers who work on project-related articles. One way that you can help us out is by helping us build a dataset of damaging/not-damaging edits that does not flag good but uncommon edits as damaging. Let us know on the talk page if you're interested in helping out. :) --Halfak (WMF) (talk) 17:59, 7 March 2015 (UTC)
    • Hello, Wnt and EChastain. Version control on Wikipedia serves a purpose well beyond simply maintaining a standard of quality for content; it also helps detect new users and guide them to become better editors. The purpose of our policies on neutrality, verifiability, notability, etc. isn't intuitive to most new editors at first. One of the goals of this project is to have a system that, among other things, can distinguish good-faith edits that inadvertently end up being damaging from malicious bad-faith edits that are intended to be damaging in the first place. With this distinction, human editors would have more time to focus on guiding new good-faith editors rather than wasting time reverting obviously malicious edits. For instance, with such a distinction, we could have the option of simply letting humans process good-faith edits that are inadvertently damaging. This is a community decision, however. We are merely facilitating such community discussions and decisions by reducing the overall workload (letting AI eliminate part of the problem and leaving the more complicated aspects to humans); that reduction tends to dissuade people from measures that may end up biting newer editors who mean well but are not fully versed in policy. By having a centralized AI system that shares resources across various AI tools, we make any systematic bias (and really any other accumulated problem) that may creep into AI algorithms, as unlikely as it may be, far more noticeable. -- A Certain White Cat chi? 07:46, 14 March 2015 (UTC)
