The Signpost

By the numbers

How many actions by administrators does it take to clean up spam?

By MER-C and Smallbones

Administrators clean up the messes left by other editors. The time and effort admins spend is key to building and maintaining the quality of the encyclopedia. Among the actions they take are blocking other editors, deleting articles, and protecting articles from vandalism. MER-C has collected the data for these and other admin actions taken in 2019, mostly on the English-language Wikipedia (enWiki). See GitHub for his methodology and this page for the raw data.

While the descriptive statistics themselves may be of interest, especially to administrators, our main purpose in examining them is to explore the burden that spam places on admins. Spam is not identical to paid editing – for example, an unpaid fan of an entertainer might wish to post the website of the entertainer's fanclub on dozens of pages. We believe, however, that most spam is inserted by editors, including paid editors, with a more serious conflict of interest.

As a rough indication of the importance of spam to admins, we summed the number of blocks, deletions, and protections related to various wiki-offenses on enWiki. Use of an open proxy had the highest total for 2019 (387,984 actions), spam had the second highest (81,699), followed by vandalism (68,039) and sockpuppeting/long-term abuse (46,029). Not all admin actions require the same amount of time or dedication: discovering and blocking open proxies may be fairly simple or automatic, and it is difficult to compare the time required for the three other major wiki-offenses. But as a first approximation, this simple measure tells us that spam is one of the most frequent problems for admins.
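The summed totals can be reproduced from the per-action counts in the tables further down. The following is a minimal sketch, not MER-C's actual scripts: it assumes that all-namespace counts are used for deletions and protections, and that sockpuppetry deletions correspond to the "Created by block/ban evading sockpuppet" row. The dictionary layout and the names `ACTIONS` and `total_actions` are illustrative only.

```python
# Counts copied from the 2019 tables in this article (all namespaces).
# The mapping of table rows to offenses is an assumption for illustration.
ACTIONS = {
    "open proxy": {"blocks": 387_984, "deletions": 0, "protections": 0},
    "spam": {"blocks": 38_112, "deletions": 43_342, "protections": 245},
    "vandalism": {"blocks": 53_451, "deletions": 6_116, "protections": 8_472},
    "sockpuppetry/LTA": {"blocks": 30_541, "deletions": 12_523, "protections": 2_965},
}


def total_actions(counts):
    """Sum blocks, deletions, and protections for one offense."""
    return sum(counts.values())


totals = {offense: total_actions(counts) for offense, counts in ACTIONS.items()}

# Rank offenses from most to fewest admin actions.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for offense, n in ranked:
    print(f"{offense}: {n:,}")
```

Under these assumptions the sums match the figures quoted above, which suggests this is how the totals were built, though the source does not spell out the exact row mapping.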

Blocks

Other than a global lock, which prevents an editor from editing on all WMF sites, a block on English Wikipedia is the most serious action that an editor faces. The table below records all blocks on enWiki for 2019. The use of an open proxy or web host accounts for almost 70% of the more than half a million blocks. The large number of open proxy blocks may reflect the effectiveness of a bot, ProcseeBot, in uncovering proxy users.

Vandalism, spamming, and sockpuppeting and long-term abuse are responsible for the large majority (72.4%) of the 168,649 blocks that remain after the open proxy blocks are subtracted. Among these remaining blocks, spamming is second only to vandalism as a reason and is ahead of sockpuppeting. Dividing the data into registered accounts and anonymous (IP) editors, we see that blocks for spamming are predominantly for registered accounts, while blocks for vandalism are more evenly divided. Thus, for registered accounts, spamming is the most frequent reason for blocking. Many vandals may feel that registering an account is too time-consuming for editing that will almost surely get them blocked, whereas spammers may feel that their editing is more difficult for admins to discover if they have a registered account. Alternatively, vandals mostly target existing articles, whilst spammers often want to create an article on a non-notable business, and for that they need an account.

All enWiki blocks for 2019
| Reason | All blocks | IP blocks | Account blocks |
|---|---|---|---|
| Total | 556,633 | 448,515 | 108,118 |
| Open proxy/web host | 387,984 | 387,979 | 5 |
| Vandalism | 53,451 | 30,700 | 22,751 |
| Spamming | 38,112 | 970 | 37,142 |
| Sockpuppetry and long term abuse | 30,541 | 8,003 | 22,538 |
| Disruptive editing | 7,928 | 4,839 | 3,089 |
| Anonymous blocks | 6,029 | 5,814 | 215 |
| Not here to build the encyclopedia | 5,902 | 81 | 5,821 |
| Other inappropriate username | 5,782 | - | 5,782 |
| Unclassified | 5,385 | 2,384 | 3,001 |
| Triggering the edit filter | 4,630 | 1,894 | 2,736 |
| Range blocks | 3,313 | 3,260 | 53 |
| Promotional username soft blocks | 2,360 | - | 2,360 |
| BLP violations | 1,255 | 697 | 558 |
| Harassment | 1,047 | 560 | 487 |
| Edit warring | 872 | 285 | 587 |
| Unauthorized, malfunctioning bot or bot username | 339 | - | 339 |
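The percentages quoted in the Blocks section follow directly from the table. The sketch below recomputes them; the variable names are illustrative, and only the four largest rows are included.

```python
# (all blocks, IP blocks, account blocks) for the four largest reasons,
# copied from the enWiki 2019 blocks table.
BLOCKS = {
    "Open proxy/web host": (387_984, 387_979, 5),
    "Vandalism": (53_451, 30_700, 22_751),
    "Spamming": (38_112, 970, 37_142),
    "Sockpuppetry and long term abuse": (30_541, 8_003, 22_538),
}
TOTAL_BLOCKS = 556_633

proxy_blocks = BLOCKS["Open proxy/web host"][0]
proxy_share = proxy_blocks / TOTAL_BLOCKS

# Blocks left once open proxy blocks are set aside.
remaining = TOTAL_BLOCKS - proxy_blocks

big_three = sum(
    BLOCKS[reason][0]
    for reason in ("Vandalism", "Spamming", "Sockpuppetry and long term abuse")
)
big_three_share = big_three / remaining

# Fraction of each reason's blocks that hit registered accounts.
account_share = {
    reason: account / total for reason, (total, _ip, account) in BLOCKS.items()
}

print(f"open proxy share of all blocks: {proxy_share:.1%}")
print(f"remaining blocks: {remaining:,}")
print(f"vandalism + spam + socks share of remainder: {big_three_share:.1%}")
print(f"spam blocks on accounts: {account_share['Spamming']:.1%}")
print(f"vandalism blocks on accounts: {account_share['Vandalism']:.1%}")
```

The account shares make the contrast in the text concrete: nearly all spam blocks fall on registered accounts, while vandalism blocks are split much more evenly between accounts and IPs.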

Looking at global locks rather than just enWiki blocks shows an even stronger connection to spamming. Just over 200 locks per day, or 73,474 for the year, are performed because of spamming, accounting for 72.7% of all global locks. Many of these locks are likely due to the use of spambots, which apparently find it easy to avoid Wikipedia's CAPTCHA screening at registration. These locks are normally performed by stewards.

All Global locks (and unlocks) for 2019
| Reason | Count | Percent |
|---|---|---|
| Total | 101,108 | 100% |
| Spamming | 73,474 | 72.7% |
| Long term abuse | 22,795 | 22.5% |
| Cross wiki abuse | 2,720 | 2.7% |
| Unclassified | 1,063 | 1.1% |
| Vandalism | 820 | 0.8% |
| Inappropriate username | 183 | 0.2% |
| Compromised | 53 | 0.1% |
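As a quick check on the global lock figures, the percent column and the daily rate can be recomputed from the raw counts. This is a small sketch with illustrative names; it assumes a 365-day year for 2019.

```python
DAYS_IN_2019 = 365  # 2019 was not a leap year

# Counts copied from the global locks table.
LOCKS = {
    "Spamming": 73_474,
    "Long term abuse": 22_795,
    "Cross wiki abuse": 2_720,
    "Unclassified": 1_063,
    "Vandalism": 820,
    "Inappropriate username": 183,
    "Compromised": 53,
}

total_locks = sum(LOCKS.values())
percent = {reason: 100 * n / total_locks for reason, n in LOCKS.items()}
spam_locks_per_day = LOCKS["Spamming"] / DAYS_IN_2019

for reason, n in LOCKS.items():
    print(f"{reason}: {n:,} ({percent[reason]:.1f}%)")
print(f"spam locks per day: {spam_locks_per_day:.1f}")
```

The per-reason counts sum to the stated total, and the spam rate works out to slightly more than 200 locks per day, in line with the text.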

Deletions

Deletions are the most serious action that can be taken for articles, user pages including drafts, and files. Spamming itself is named as the cause in only 4.7% of article deletions on enWiki. However, other named causes may also be related to spamming or paid editing. For example, articles for deletion (AfD) discussions and expired proposed deletions (PRODs) together account for 27.8% of article deletions, and a major proportion of these may be due to spam.

Examining deletions in all namespaces presents a clearer picture. Spam is the fourth most frequent reason for deletion across all namespaces: 43,342 items for the year, or over 118 per day, were deleted as spam. Many of these deletions are likely draft articles, e.g. those being reviewed at WP:Articles for creation or being prepared in user space. Abandoned drafts, which are also likely to be related to spam or paid editing, were responsible for 67,253 deletions for the year.

The overall picture appears to be one of multi-level screening for spam on enWiki. At the first level, large numbers of drafts are submitted and later abandoned as the authors discover that we consider the draft to be spam. This accounts for up to 67,253 deletions. At subsequent screening levels, drafts are deleted outright at AfC or the draft stage, amounting to 43,342 deletions. Many of the 38,287 miscellany for deletion (MfD) discussions may also be related to spam or paid editing, as are some of the 25,297 expired PRODs. These four categories (which may include some double counting) add up to a total of 174,179 possible deletions (477 per day) at the draft stage. After an article is accepted, it may later be deleted as spam (4,825 per year), as an expired PROD (9,271), or at an AfD debate (19,225). The admins' battle to clean up spam by deletion is spread across many stages and is clearly time-consuming.
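The staged totals above are simple sums over rows of the deletions table. A minimal sketch of that arithmetic follows; the grouping into `draft_stage` and `article_stage` mirrors the paragraph above and is an upper bound, since the text itself notes possible double counting and not every deletion in these rows is spam-related.

```python
DAYS_IN_2019 = 365

# All-namespace counts for the four categories the text groups as draft-stage.
draft_stage = {
    "Abandoned draft": 67_253,
    "Spam": 43_342,
    "Deletion debate (MFD)": 38_287,
    "Expired PROD": 25_297,
}

# Article-namespace counts for deletions after an article is accepted.
article_stage = {
    "Spam": 4_825,
    "Expired PROD": 9_271,
    "Deletion debate (AFD)": 19_225,
}

draft_total = sum(draft_stage.values())
article_total = sum(article_stage.values())

print(f"draft stage: {draft_total:,} ({draft_total // DAYS_IN_2019} per day)")
print(f"article stage: {article_total:,} ({article_total // DAYS_IN_2019} per day)")

# Shares of article deletions quoted in the Deletions section.
ARTICLE_DELETIONS = 102_344
spam_share = article_stage["Spam"] / ARTICLE_DELETIONS
afd_prod_share = (
    article_stage["Deletion debate (AFD)"] + article_stage["Expired PROD"]
) / ARTICLE_DELETIONS
print(f"spam: {spam_share:.1%}, AfD + PROD: {afd_prod_share:.1%}")
```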

2019 deletions
| Reason | Articles | All namespaces |
|---|---|---|
| Total | 102,344 | 623,202 |
| Dependent on deleted page | 21,134 | 181,164 |
| Deletion debate (AFD) | 19,225 | 20,772 |
| Expired PROD | 9,271 | 25,297 |
| Maintenance | 8,789 | 46,909 |
| Deletion debate (RFD) | 5,792 | 7,316 |
| Created by block/ban evading sockpuppet | 5,727 | 12,523 |
| Fails to give reason for inclusion | 5,410 | 5,473 |
| Cross-namespace redirect | 5,125 | 5,180 |
| Spam | 4,825 | 43,342 |
| Author/user request | 3,384 | 21,409 |
| Unclassified | 2,686 | 18,128 |
| Copyright violations | 2,386 | 10,761 |
| Implausible redirect | 1,357 | 1,720 |
| Unclassified nukes | 1,161 | 4,941 |
| No content or context | 1,053 | 1,083 |
| Repost of deleted content | 1,000 | 1,397 |
| Unnecessary disambiguation | 692 | 705 |
| Vandalism | 663 | 6,116 |
| Copyright problems | 603 | 656 |
| Redundant | 455 | 2,943 |
| Test page | 366 | 3,563 |
| Deletion debate (MFD) | 335 | 38,287 |
| No reason given | 216 | 1,416 |
| Expired BLP PROD | 177 | 179 |
| Made up one day | 176 | 180 |
| Attack page | 125 | 1,627 |
| Patent nonsense | 118 | 1,062 |
| Foreign language | 54 | 55 |
| Abandoned draft | 22 | 67,253 |
| Misuse of Wikipedia as a webhost | 6 | 28,813 |
| Deletion debate (TFD) | 5 | 24,634 |
| File redirect to Commons | 5 | 178 |
| User page where user does not exist | 1 | 837 |
| Problems with non-free files | - | 23,248 |
| File moved to Commons | - | 13,796 |
| Deletion debate (CFD) | - | 9,947 |
| Empty category | - | 9,682 |
| Lack of copyright information (files) | - | 4,805 |
| Category renaming or merger | - | 3,196 |
| Deletion debate (FFD) | - | 2,201 |
| Corrupt file | - | 323 |

Protections

At first glance, article protection is not used extensively by admins to stop spamming. Spamming is the tenth most common reason for protection, both for articles and across all namespaces. Some protections caused by spamming might be included under other headings, for example, unclassified disruptive editing, addition of unsourced material, unclassified salting, or unclassified.

Protections for 2019
| Reason | Articles | All namespaces |
|---|---|---|
| Total | 22,543 | 27,275 |
| Vandalism | 7,634 | 8,472 |
| Unclassified disruptive editing | 5,170 | 5,416 |
| Sock puppetry | 2,395 | 2,965 |
| Addition of unsourced material | 1,916 | 1,955 |
| BLP violations | 1,791 | 1,816 |
| Unclassified salting | 1,265 | 1,890 |
| Unclassified | 726 | 1,439 |
| Edit warring/content dispute | 837 | 897 |
| High risk page | 351 | 1,805 |
| Spamming | 197 | 245 |
| Arbitration enforcement | 197 | 202 |
| Copyright violations | 64 | 75 |
| User request | - | 98 |
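The ranking of spamming among protection reasons can be checked from the table. In this sketch, rank is defined as one plus the number of reasons with a strictly larger count, so the tie with arbitration enforcement in the articles column does not push spamming lower. That tie-handling rule and the function name `rank` are assumptions made for illustration; the "-" entry for user requests is treated as zero.

```python
# (articles, all namespaces), copied from the 2019 protections table.
PROTECTIONS = {
    "Vandalism": (7_634, 8_472),
    "Unclassified disruptive editing": (5_170, 5_416),
    "Sock puppetry": (2_395, 2_965),
    "Addition of unsourced material": (1_916, 1_955),
    "BLP violations": (1_791, 1_816),
    "Unclassified salting": (1_265, 1_890),
    "Unclassified": (726, 1_439),
    "Edit warring/content dispute": (837, 897),
    "High risk page": (351, 1_805),
    "Spamming": (197, 245),
    "Arbitration enforcement": (197, 202),
    "Copyright violations": (64, 75),
    "User request": (0, 98),  # "-" in the articles column treated as 0
}

ARTICLES, ALL_NAMESPACES = 0, 1


def rank(reason, column):
    """Return 1 + the number of reasons with a strictly larger count."""
    target = PROTECTIONS[reason][column]
    return 1 + sum(1 for counts in PROTECTIONS.values() if counts[column] > target)


print("articles:", rank("Spamming", ARTICLES))
print("all namespaces:", rank("Spamming", ALL_NAMESPACES))
```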

Conclusions

Spamming is the most common reason for actions by administrators other than for the use of an open proxy. It is the most common reason for blocking registered accounts and is cited for 72.7% of all global locks.

Spam is only the ninth most common reason for article deletions, but the fourth most common reason for deletion in all namespaces. It appears that the effort to delete spam crosses many of the classifications used for deletions, e.g. abandoned drafts, MfD, AfD, and expired PRODs.

The least important aspect of the use of admin tools against spam appears to be article protection.



Discuss this story

== Slight inaccuracy? ==

"Other than a global lock, which prevents an editor from editing on all WMF sites, a block is the most serious action that an editor faces on Wikipedia."

I'd consider a site ban more serious than a block. --kingboyk (talk) 18:54, 1 March 2020 (UTC)

You are right, I missed that during copyediting prior to publication. I hope that the new sentence makes more sense. ☆ Bri (talk) 18:57, 1 March 2020 (UTC)
@Kingboyk: Now I think I see your point better, ban versus block. Maybe the text should have said the block is the strongest technically enforced action that an editor faces. Perhaps I'll leave this to the original author to consider. ☆ Bri (talk) 19:12, 1 March 2020 (UTC)
The problem is, none of the variations on blocks are meaningful deterrents to spam. Spam accounts are free, disposable, and easily replaced. The only thing a spammer cares about is the content, since that's what they get paid for. -- RoySmith (talk) 19:06, 1 March 2020 (UTC)
We think they actually get paid for getting spam to stick for a finite period of time. Which is why detecting and removing it early is so important. There's another UPE model of long-term article "monitoring" for a set fee, but that's a different story. ☆ Bri (talk) 19:08, 1 March 2020 (UTC)
I've read similar info about spammers' terms of payment and the resulting need to remove offending material quickly written by the spam-fighting Charcoal team over at Stack Overflow. I would wager you are correct. --kingboyk (talk) 19:19, 1 March 2020 (UTC)

From a look at the numbers above it appears that the workload of all editors in fighting spam would be made substantially lighter if edits by only registered users were permitted. The spammers would then be much easier to trace as it would not be so easy for them to hop to another IP address. In lieu of that I would like to see administrators take a much more proactive approach to semi-protection. Xxanthippe (talk) 05:02, 2 March 2020 (UTC).

There is nothing that will actually solve the problem unless we ask for identification of editors working on certain types of articles, and I think most of us would regard it as a last-ditch measure, an unacceptable compromise of "anyone can edit". What might help is explicitly asking editors to declare whether or not they have a COI, and if so a paid COI. Some will blatantly deny it, but I think about half the people with COI would in fact declare. This will at least provide a solution to editors with a nonfinancial COI to come clean about it. (I did propose this a year or two ago, and it was soundly rejected. Maybe by now there will be a better realization of the problems.) DGG ( talk ) 06:21, 2 March 2020 (UTC)
On my watch page of a few hundred articles I identify at least a score with likely abusive editing: suspected paid editors and professional reputation managers, POV editors, ego editors, attack editors etc. I do not have the resources of time to follow them all up and these editors usually have unlimited energy to pursue their individual obsessions. The suspected rogue contributors are registered editors, often redlinks, and IPs in roughly equal proportion. The result is that the articles that they attack degrade as time goes by. There will never be a complete solution to the problem, but I think it could be made a bit more manageable by banning IP edits as this would push those into becoming the more traceable registered editors. I know that there has long been a prejudice to allow anybody to edit as IPs, but I have never seen the force of that as anybody can register anyway. Because of its growing maturity, Wikipedia is different to what it was fifteen years ago, and because of its size, it is becoming more difficult to maintain its quality in many areas. A change in the policy of IP editing is needed now to make curation easier. Xxanthippe (talk) 03:36, 3 March 2020 (UTC).
WMF has a proposal I think you would like, Xxanthippe, if I understand your point and the proposal IP Editing: Privacy Enhancement and Abuse Mitigation. [E]dits will be recorded using an automatically-generated, unique, human-readable identifier instead of the IP address when an edit is made by an unregistered user. This identifier will stay consistent over a session and possibly longer... What do you think of it? ☆ Bri (talk) 05:03, 3 March 2020 (UTC)
Thank you for bringing my attention to this proposal, which I had not seen before. Unfortunately I was unimpressed by what I saw. I found the proposal obscure, incomplete, and likely to make vandal detection more difficult. There was even a suggestion to put cookies on people's computers. Most security conscious users delete their cookies on a regular basis, so this would not work. The proposal seemed like a Heath Robinson contraption (any unnecessarily complex and implausible contrivance). The direct solution is to ban IP edits. Xxanthippe (talk) 05:14, 4 March 2020 (UTC).
Keeping people's contributions associated with a single account, if possible, I believe will help. Of course there are always workarounds. But many will not bother to figure them out, especially initially.
We could disallow IP editing of certain types of articles (such as small companies and BLPs) if we so chose. Anything with a specific project page on the talk page could be semi-protected automatically, for example. We would need to figure out how we would measure if this is effective or not before we do it though.
Well, Wikipedia accounts involve zero investment; those at Upwork/Fiverr etc. require significant investment before they become useful. We really need to push these entities to work with us. Doc James (talk · contribs · email) 16:52, 4 March 2020 (UTC)
Your suggestion of not allowing IP edits on some categories of topic is excellent provided that it is allowed by the system and that community-wide consensus is obtained. BLPs would be a good place to start. I have seen BLPs, for example B. Wongar: Revision history or Alain de Botton: Revision history (I don't know why these are redlinks), that have been blighted for years by tendentious IP edits of all sorts. Semi-protection works for a while, but when it ends the trouble resumes. Assessing if the suggested scheme works can be done by seeing how many complaints arise. If an IP wants to make a change to one of these protected BLPs they can ask on the talk page, as at present. If they don't wish to be geolocated by their talk page edit they can register and make their complaint that way (and make their edit anyway as a registered editor!). Xxanthippe (talk) 03:41, 6 March 2020 (UTC).
@Xxanthippe: These links work: Special:History/B. Wongar and Special:History/Alain de Botton. ☆ Bri (talk) 04:57, 6 March 2020 (UTC)
Thx! Xxanthippe (talk) 05:02, 6 March 2020 (UTC).



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0