Vandalism study

Study examines Wikipedia authorship, vandalism repair

An academic study combining editing data with page view logs has shed new light on the quality and authorship of Wikipedia content. It concluded that frequent editors have the greatest impact on what Wikipedia readers see, and that the effect of vandalism, while small, is a matter of growing concern.

The results of the study are reported in a paper titled "Creating, Destroying, and Restoring Value in Wikipedia" (available in PDF), to be published in the GROUP 2007 conference proceedings. It was conducted by a research group in the University of Minnesota's department of computer science and engineering. Based on sampled data provided by the Wikimedia Foundation, covering every tenth HTTP request over a one-month period, the researchers created a tool for estimating the page views of a Wikipedia article during a given timeframe.
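The scaling step behind such a tool is straightforward. The sketch below, which assumes hypothetical log entries of (timestamp, article title) pairs rather than the study's actual log format, illustrates how a 1-in-10 sample can be scaled up to a view estimate:

```python
# Sketch: estimating article views from a 1-in-10 sample of request logs.
# The (timestamp, title) log format here is hypothetical, for illustration.

SAMPLING_FACTOR = 10  # the provided logs contain every tenth HTTP request

def estimate_views(log_entries, title, start, end):
    """Estimate total views of `title` between `start` and `end` (inclusive)
    from sampled request logs given as (timestamp, article_title) pairs."""
    sampled_hits = sum(
        1 for ts, t in log_entries
        if t == title and start <= ts <= end
    )
    # Scale the sampled count back up to an estimate of the true total.
    return sampled_hits * SAMPLING_FACTOR

# Hypothetical usage with ISO-date timestamps:
# estimate_views(entries, "George Washington", "2006-09-01", "2006-09-30")
```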

In the absence of this type of data, previous studies have largely relied on an article's edit history for analysis. Interestingly, the study concluded that there is "essentially no correlation between views and edits in the request logs."

The study estimated a probability of less than half a percent (0.0037) that a typical view of a Wikipedia article would find it in a damaged state. However, the chance of encountering vandalism on a typical page view appears to be increasing over time, although the authors identified a break in the trend around June 2006, late in the study period, which they attributed to the increased use of vandalism-repair bots.
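Conceptually, that figure is the share of estimated views that landed on a damaged revision. A minimal sketch of the calculation, assuming we already know the intervals during which each article was damaged and can estimate views per interval (both hypothetical inputs here), might look like this:

```python
# Sketch: probability that a random page view lands on damaged content.
# Both damage intervals and the view estimator are hypothetical stand-ins.

def damaged_view_probability(damage_intervals, estimate_views, total_views):
    """`damage_intervals` -- iterable of (title, start, end) spans during
                             which an article was in a damaged state
       `estimate_views`   -- function (title, start, end) -> estimated views,
                             e.g. built from the sampled request logs
       `total_views`      -- estimated total views of all articles in the period
    """
    damaged_views = sum(estimate_views(t, s, e) for t, s, e in damage_intervals)
    return damaged_views / total_views

# With the study's data, this ratio works out to roughly 0.0037,
# i.e. about 0.37% of page views.
```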

Authorship and value

Addressing the debate over "who writes Wikipedia", whether most of the work is done by a core group or by occasional passersby, the study introduced a new metric it called the "persistent word view" (PWV). This credits the contributor who added a sequence of words to an article, weighted by how many times article revisions containing those words were viewed. The study came down largely in favor of the core group theory, concluding, "The top 10% of editors by number of edits contributed 86% of the PWVs". However, it does not necessarily refute Aaron Swartz's contention that the bulk of contributions often comes from users who have not registered an account; the Minnesota researchers excluded such edits from parts of their analysis, citing the instability of IP addresses.
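To make the metric concrete, here is a toy accumulation of PWVs. It assumes word-level authorship has already been worked out for each revision and that per-revision view estimates are available; both are the hard parts the study's tooling handles, and the data shapes below are hypothetical:

```python
# Toy accumulation of persistent word views (PWVs).
# Word-level authorship and per-revision view counts are assumed as inputs.

from collections import Counter

def persistent_word_views(revisions, views_per_revision):
    """`revisions` -- ordered list of dicts such as
                      {"words_by_author": {"Alice": 120, "Bob": 30}},
                      giving how many of each revision's words are
                      attributed to each editor
       `views_per_revision` -- estimated views while each revision was current
       Returns a Counter mapping editor -> PWVs."""
    pwv = Counter()
    for rev, views in zip(revisions, views_per_revision):
        for author, word_count in rev["words_by_author"].items():
            # Each view of a revision credits one PWV per surviving word.
            pwv[author] += word_count * views
    return pwv
```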

The study built on previous designs for analyzing the quality of Wikipedia articles, notably the "history flow" method developed by a team from the MIT Media Lab and IBM Research Center and the color-coded "trust" system created by two professors from the University of California, Santa Cruz. In their own way, both earlier approaches focused on the survival of text in an article over the course of its edit history. Refining these with its page view data, the Minnesota study argued that "our metric matches the notion of the value of content in Wikipedia better than previous metrics."

Damage control

Looking at the issue of vandalism, the study focused primarily on edits that would subsequently be reverted. Although the authors conceded this might include content disputes as well as vandalism, their qualitative analysis suggested that reverts served as a reasonable indicator of the presence of damaged content.
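A common mechanical test for reverts is to check whether a revision's text is identical to an earlier revision, for example by hashing. The sketch below illustrates that idea; it is not the authors' exact implementation, and, as they acknowledge, it flags content disputes as well as vandalism:

```python
# Sketch: flagging likely damage by detecting full-text reverts.

import hashlib

def find_reverted_spans(revision_texts):
    """Given an article's ordered revision texts, treat a revision as a
    revert if its text is byte-identical to some earlier revision, and
    flag everything in between as potentially damaged."""
    seen = {}           # digest of revision text -> earliest revision index
    damaged_spans = []
    for i, text in enumerate(revision_texts):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen and i - seen[digest] > 1:
            # Revisions strictly between the original and the revert.
            damaged_spans.append((seen[digest] + 1, i - 1))
        seen.setdefault(digest, i)
    return damaged_spans
```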

Statistically, they estimated that about half of all damage incidents were repaired on either the first or second page view. This fits in with the notion that obvious vandalism gets addressed as soon as someone sees it; even in the high-profile Seigenthaler incident it's unlikely that many readers saw the infamous version of the article at the time, as a previous Signpost analysis indicated. However, the study also found that for 11% of incidents, the damage persisted beyond an estimated 100 page views. A few went past 100,000 views, although the authors concluded after examining individual cases that the outliers were mostly false positives.




    Discuss this story

    These comments are automatically transcluded from this article's talk page.
    Michael, great job.

    I think the study does effectively provide an answer to Aaron Swartz's question of who writes Wikipedia vis-a-vis anons: "The total number of persistent word views was 34 trillion; or, excluding anonymous editors, 25 trillion" (p. 5). So immediately we know that registered users contribute about 74% of PWVs.

    It's not clear (at least to me) whether the further analysis of editors by decile and PWV contributions is based on 34 or 25 trillion, but it looks like the deciles themselves are calculated by excluding anonymous editors. So it's either that the top 10% of registered editors (by edit count) contributed 86% of PWVs, or (if anon PWVs are excluded) 63% (i.e., 86% of 74%).

    In either case, it strongly suggests that, at least when weighted by how popular content is, Swartz was largely wrong. (It might be the case, however, that registered editors are more likely to edit popular topics, while anons contribute large word counts to obscure topics.) Note further that the exclusion of anons from the editor decile rankings means that the 10% decile is actually a smaller proportion of total editors, when anons as well as registered users are considered as editors.--ragesoss 23:02, 8 October 2007 (UTC)[reply]

    After a closer look at the graphs, particularly Figure 3, it looks like the percentiles are based on 25 trillion PWVs (i.e., excluding the anon PWVs), since I count 9 lines. This implies that the final, 100% decile line is coincident with the 100% PWV line. So the top 10% of registered editors (ca. 420,000 people) account for ca. 63% of all PWVs, the top 1% (ca. 42,000) account for around 51%, and the top .1% (ca. 4,200) account for 32%.
    It's interesting that the distribution is flattening for all segments except the very top (the top 4,200, and probably somewhat above that), which is on the same order of magnitude as the persistent core community: according to Erik Zachte's stats (which use the same main dataset, the October 2006 dump), there were 4330 editors with >100 edits in October 2006, and about 10 times that many with >5 edits. Contra Swartz, the visible community is becoming more, not less, significant (as measured by PWVs). However, anons could also be gaining in PWV share; the study doesn't give any indication there. It seems like the only segment that is losing PWV share is the 10%-1% segment, many of whom are no doubt formerly active Wikipedians who have left the project.--ragesoss 23:32, 8 October 2007 (UTC)[reply]
    Thanks for the feedback and additional analysis. You may be right about how this stands up against Swartz's analysis, but in typical "we report, you decide" fashion I didn't want to advocate a conclusion about it, and without having examined their data closely I wasn't sure what considerations might remain unaccounted for. One issue that comes to mind is that in the past two years, with the restriction of article creation to registered users along with the increased use of semi-protection, the balance has tilted increasingly in favor of editing with an account. The world this study looks at has changed, and while the changes were already underway, Swartz may not have recognized their full impact. His argument is more anecdotal than systematic anyway. With those trends in mind, it stands to reason that known personalities would be pulling more of the weight now, as you say. --Michael Snow 05:03, 9 October 2007 (UTC)[reply]
    I would also add that the designations of "top x%" are sort of meaningless, since there are literally millions of accounts that have never made an edit and anons aren't counted in those calculations. If you look at it by numbers rather than percentile, the distribution does seem modestly level, at least compared to the idea of a small core contributing the significant majority of content, as Jimbo used to argue. 50% of the PWVs come from the top 42,000 editors. According to the abstract, "we show that an overwhelming majority of the viewed words were written by frequent editors and that this majority is increasing." I think this is incorrect, based on the rest of the paper. The PWV share of the 10% and 1% groups is decreasing, and the .1% share is increasing but does not account for a majority of PWVs (only about 32%). And the 10% group, the top 420,000 editors responsible for 63% of total PWVs, can hardly be considered "frequent contributors", since only a tenth of that number have more than 5 edits per month.--ragesoss 05:53, 9 October 2007 (UTC)[reply]
    After looking into it further, I think that I was mistaken in assuming that the graphs are based on only registered accounts (and only the 25 trillion PWVs associated with them). They say they analyze 4.2 million accounts, which is too high for just registered accounts ca. October 2006. This makes interpretation of the graphs and assessment of the "who writes Wikipedia" question much more complicated. 27% of PWVs come from anons, yet the top 10% of all editors is responsible for 86% of PWVs. So anons must be well-represented in the upper deciles. I've emailed the lead author and hopefully I can get some more clarification.--ragesoss 07:15, 9 October 2007 (UTC)[reply]
    Hi folks,
    I'm the lead author of the work, and I'll reply to ragesoss's email here. It's rare indeed that research papers generate such immediate interest among practitioners, so it's very exciting to receive his email and see this talk page. The five questions that seem to have arisen, and our thoughts, are:
    1. The abstract makes a claim unsupported by data. Specifically, we claim that "we show that an overwhelming majority of the viewed words were written by frequent editors and that this majority is increasing". Indeed, this sentence is wrong, for the reasons you've noted. While the 10% and 1% cohorts contributed about 85% or 75% of the value, respectively, these cohorts' shares are not increasing. Only the 0.1% cohort's share is increasing. You could also reasonably argue that "overwhelming majority" was inappropriate, and perhaps "strong majority" would be better.
    2. The term "frequent" is not well defined. We've used the term frequent informally to refer to editors with the higher edit counts, but we never defined the term carefully, which we should have. So under our terminology, someone who edits only a handful of times per month but still edits more than e.g. 90% of other editors would be labeled frequent.
    3. Do figures 3 and 4 include anonymous editors? The short answer is yes, they do. They include all editors appearing at least once in the history dump that we analyzed. There was definitely an opportunity to be more clear in the paper about this. :) We chose to do this because we didn't see fundamental differences between the graphs as presented in the paper and considering only the non-anonymous universe (25T PWVs). We've posted the figures with anonymous editors (and their PWVs) excluded: [Figure 3] [Figure 4].
    4. Will you publish a list of the top editors by PWV? We're happy to, provided there aren't any privacy issues. What are the privacy issues from the Wikipedia community perspective? Do you publish lists of editors ordered by different metrics?
    5. It would be great to see these analyses run on more current dumps. We agree. :) We worked with what we had at the time: when this paper was being written in April and May, the Nov. 4, 2006 dump was the most current available.
    It's too late to make changes to the paper, but this feedback will inform our presentation and discussions at GROUP. It's appreciated. We'll continue to watch this talk page, so feel free to direct additional questions to us here. --R27182818 20:03, 12 October 2007 (UTC)[reply]
    In response to Question 4., if I understand correctly the PWV metric only uses publicly available data, in the sense that I could work out my own PWV value by trawling the (public) history of each article I've edited (also public). So there are no genuine privacy issues here; there are only issues of courtesy. We have had Wikipedia:List of Wikipedians by number of edits since June 2004, and there have never been any issues with that until a few months ago, when a handful of editors decided they would prefer not to appear on it, and removed their names. We also previously had Wikipedia:List of Wikipedians by number of most recent edits, not maintained since 2004; and Wikipedia:List of Wikipedians by number of recent edits, not maintained since 2005. Both lists were dropped only because no-one could be bothered maintaining them, not because of privacy issues - indeed, the data is still up, albeit grossly out of date. So I think there is sufficient precedent for these lists to be published, without any need to fret about privacy. Hesperian 05:38, 13 October 2007 (UTC)[reply]
    Hi folks, sorry to keep you waiting so long. Please find a list of the top editors by PWV at http://www.cs.umn.edu/~reid/pwv-list-4200.txt. PWV scores are percentages. Enjoy! --R27182818 18:55, 14 November 2007 (UTC)[reply]

    Why, oh why, don't we have a more current dump? Something about the technical problems with complete database dumps of en-wiki might be useful in the article.--ragesoss 00:20, 9 October 2007 (UTC)[reply]

    Or we could just wait for more people to apply for funding to study Wikipedia, or for a large amount of funding for studies of Wikipedia. How much research is being done on Wikipedia, out of interest? Is there a way to, um, study the study of Wikipedia? Carcharoth 16:22, 9 October 2007 (UTC)[reply]
    There's a WikiProject Wikidemia that would essentially aim to be the forum for your last question, though its activity level is not that high. There's also the Wikimedia Research Network. To the extent that the study of Wikipedia is being done through data extraction and analysis, it's often going to be conducted outside of normal wiki activity, so it's not always easy to know what's currently happening. Greg Maxwell is the Wikimedia Foundation's Chief Research Officer, and he along with some of the people who work on the toolserver are probably who you'd want to talk to for more of a sense of this activity. --Michael Snow 16:37, 9 October 2007 (UTC)[reply]
    Thanks! Carcharoth 17:35, 9 October 2007 (UTC)[reply]

    Further questions

    It might be worthwhile to submit some further questions to the authors, since there is probably a lot of interesting information that they could easily provide that isn't in the paper.--ragesoss 00:20, 9 October 2007 (UTC)[reply]

    • Who are the top 4,200 editors by PWV?
    • What would Figure 4 look like if IP edits were included? Have the relative PWV contributions of anonymous editors been increasing or decreasing over the period analyzed?
    • [add your questions here]

    What is a false positive?

    This story is excellent except for the last sentence. What is a false positive in the context of persistent vandalism? 1of3 21:41, 10 October 2007 (UTC)[reply]

    It's not talking about "persistent" vandalism, it's talking about any vandalism. With this much data, the study has to apply mechanical tests to identify what it thinks is vandalism, which in this case is largely based on edits that got reverted. Since the test has no human input, it can pull in cases that upon further review do not actually involve vandalism, which are the false positives. --Michael Snow 21:58, 10 October 2007 (UTC)[reply]
    But it is talking about persistent vandalism, saying that most of what the program flagged as vandalism which persisted for more than 100,000 page views was actually not vandalism.--ragesoss 22:01, 10 October 2007 (UTC)[reply]
    Right, of course, sorry I was thinking the question involved "persistent" vandalism in the sense of repeated action (even though the story itself talks about persisting cases). Anyway, the explanation of what is a false positive is still applicable. --Michael Snow 22:13, 10 October 2007 (UTC)[reply]
    Thanks kindly, that makes sense now. 1of3 23:15, 10 October 2007 (UTC)[reply]



           
