Automated copy-and-paste detection under trial

One of the problems Wikipedia faces is users who add content copied and pasted verbatim from sources. When we follow up on a person's work, we often don't check for this, and a few editors have managed to make thousands of edits over many years before concerns are detected. In the past year, I've picked up three or four editors who have made many thousands of edits to medical topics in which their additions contain text copied word for word from elsewhere. Those who make only a few edits of this nature are usually never detected.

After a user detects this kind of editing, clean-up involves going through all their edits and occasionally reverting dozens of articles. Unfortunately, sometimes it means going back to how an article was years back, resulting in the loss of the efforts of the many editors who came after them. Contingency reverts can end up harming overall article quality and frustrate the core editing community. What is the point of contributing to Wikipedia if it's simply a collection of copyright-infringed text cobbled together, and even your own original contributions disappear in the cleanup? Worse, the fallout can cause editors to retire. If we could have caught them early and explained the issues to them, we'd not only save a huge amount of work later on, but might retain editors who are willing to put in a great deal of time.

So what is the solution? In my opinion, we need near real-time automated analysis and detection of copyright concerns. I'd been trying to find someone to develop such a tool for more than two years; then, at Wikimania in London, I managed to corner a pywikibot programmer, ValHallASW, and convinced him to do a little work. This was followed by meeting a wonderful Israeli instructor from the Sackler School of Medicine, Shani Evenstein, who knew two incredibly able programmers, User:Eran and User:Ravid ziv. By the end of Wikimania our impromptu team had produced a basic bot – User:EranBot – that does what I'd envisioned. It works by taking all edits over a certain size and running them through Turnitin / iThenticate. Edits that come back positive are listed for human follow-up. Work on this idea was begun by User:Ocaasi back in March 2012.
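
The filtering step described above (take edits over a certain size, run them through a similarity checker, list the positives for human follow-up) can be sketched roughly as follows. This is only an illustration and not EranBot's actual code: the `Edit` class, the `flag_edits` function, both threshold values and the `check_similarity` callable are hypothetical stand-ins, and `check_similarity` does not represent the real Turnitin / iThenticate API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical values: the article only says "over a certain size".
MIN_ADDED_CHARS = 500
# Hypothetical similarity score (0.0 to 1.0) at which an edit is flagged.
MATCH_THRESHOLD = 0.5


@dataclass
class Edit:
    """Hypothetical record of one edit: its revision id and the text it added."""
    revision_id: int
    added_text: str
    is_revert: bool = False


def flag_edits(
    edits: Iterable[Edit],
    check_similarity: Callable[[str], float],
) -> List[Edit]:
    """Return the edits that should be listed for human follow-up.

    check_similarity is a placeholder for whatever external service is used;
    it is assumed to return a score between 0.0 and 1.0 for a piece of text.
    """
    flagged = []
    for edit in edits:
        # Skip reverts and small edits before calling the external checker.
        if edit.is_revert or len(edit.added_text) < MIN_ADDED_CHARS:
            continue
        if check_similarity(edit.added_text) >= MATCH_THRESHOLD:
            flagged.append(edit)
    return flagged
```

Passing the checker in as a function keeps the sketch independent of any one provider, and the flagged list is only a queue for human review, not a verdict.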

Why near real time?

Determining copy-and-paste issues becomes more difficult the longer one waits between the initial edit and the checking, as one then has to deal with mirroring of Wikipedia content across the Internet. As well, many reliable sources – including peer-reviewed journals and textbooks – have begun borrowing liberally from Wikipedia without attribution. So if we're looking at copyright issues six months or a year down the road, we need to look at publication dates and go back in the article history to determine who is copying from whom.

In short, it's far more difficult for both humans and machines.

Why Turnitin?

Turnitin is an Internet-based plagiarism-prevention service created by iParadigms, LLC, first launched in 1997; it is one of the strategies used by some universities and schools to minimise plagiarism in student writing. The company that developed and owns the product has agreed to give us free access to their tools and API. Even though it's a for-profit company, there won't be obtrusive links from Wikipedia to their site, and no advertising for them will ever appear on Wikipedia.

Why would they want to be involved with us? Letting us use their tools doesn't cost them anything and is no disadvantage to shareholders. Some companies are willing to help us just because they like what we do. We've had a number of publishers donate large numbers of accounts to Wikipedians for similar reasons. They have extra capacity just sitting there, so why not give it away? They also know we're volunteers and are not going to buy their capacity anyway. Other options could include Google, but they don't allow their services to be used in this way, and it appears that Yahoo is currently charging for use by User:CorenSearchBot, which checks new articles for issues.

Benefits

How many edits are we looking at? Currently the bot is running only on the English Wikipedia's medical articles. In 2013, there were 400,000 edits to medical content – around 1,100 edits per day. Of these only about 10% are of significant size and not a revert, so we're looking at an average of around 100 edits per day. If we assume that 10% of these raise genuine copyright concerns and that there are three times as many false positives as true positives, we're looking at 40 flagged edits per day at most. Who would follow up? With the number of concerning edits in the range of 40 per day, members of WikiProject Medicine will be able to handle the load. This is much easier than catching 30,000 edits of copyright infringement after the fact, with clean-up taking many of us away from writing content for many days.
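
The estimate above can be reproduced with a few lines of arithmetic. The figures are the ones quoted in the paragraph; the function name `daily_review_load` and its parameters are made up for this illustration.

```python
def daily_review_load(significant_per_day: float,
                      true_positive_rate: float,
                      false_per_true: float) -> float:
    """Estimate how many flagged edits per day need human follow-up."""
    true_positives = significant_per_day * true_positive_rate
    false_positives = true_positives * false_per_true
    return true_positives + false_positives


edits_per_day = 400_000 / 365   # roughly 1,100 edits per day in 2013
significant_per_day = 100       # the article's rounded figure (about 10% of daily edits)

# 10% genuine concerns, three false positives for every true positive.
load = daily_review_load(significant_per_day, 0.10, 3)
print(round(edits_per_day), round(load))
```

With 100 significant edits per day this gives about 10 true positives plus 30 false positives, which is the 40 edits per day quoted above.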

The Wiki Education Foundation has expressed interest in the development of this tool, since edits by students have previously contained significant amounts of plagiarism, kindling much discontent with Wiki Education's predecessor. The Hebrew Wikipedia is also currently working with this bot, and we'd be happy to see other topic areas and WMF language sites use it.

There are still a few rough aspects to iron out. The parsing out of the new text added by an edit is not as good as it could be. Reverts should be ignored. These issues are fairly minor to address, and a number have already been dealt with. While there were initially about three false positives for every true positive, we should have this down to a more even 50–50 split by the end of the week. Already in its early stages, this has turned out to be an exceedingly useful tool.

The views expressed in this opinion piece are those of the author only; responses and critical commentary are invited in the comments section. Editors wishing to propose their own Signpost contribution should email the Signpost's editor in chief.

3 September 2014

Discuss this story

This seems to be a great initiative, and if the preliminary results prove to be accurate, it should be extended from changes to medical articles to all substantive changes to all articles. A friendly, welcoming, informative message about copyright issues should be posted on the talk page of any editor whose edits are flagged by this bot. Cullen³²⁸ Let's discuss it 08:56, 6 September 2014 (UTC)
- Cullen³²⁸, thanks for the suggestion. Currently it isn't yet accurate enough, so I don't think it should post to talk pages, but maybe in the near future it can notify users using "someone mentioned you on...". Eran (talk) 21:06, 7 September 2014 (UTC)
  - Concerns are only brought to a person's attention if a human editor verifies them. As there are so many mirrors of Wikipedia, it may be some time before we reach the point where messages could be left automatically. Doc James (talk · contribs · email) (if I write on your page reply on mine) 11:01, 8 September 2014 (UTC)
I agree with Cullen328, great initiative @Jmh649: Doc James, and let's hope it proves a success. Just one slight oddity, and not really the subject of this article, but you have mentioned "reliable sources" which are prone to lift Wikipedia content without attribution. It seems to me that the fact that a journal publishes such material makes it ipso facto not a reliable source. Thanks! — Amakuru (talk) 09:45, 6 September 2014 (UTC)
- Some peer-reviewed journal articles are beginning to have Wikipedia material in them. But yes, I generally agree. Doc James (talk · contribs · email) (if I write on your page reply on mine) 11:15, 7 September 2014 (UTC)
Yes, this seems like a great idea; thank you for bringing it to wider attention. I'm not 100% on board with the decision to ignore reverts; surely the reverted material could easily contain copyrighted material from before the bot started running? Still, this is a great step in the right direction. I hope it works well and can be adopted by the rest of the site. Matt Deres (talk) 11:50, 6 September 2014 (UTC)
This is a fantastic idea. Great work on it! Hope to see it expand. Jason Quinn (talk) 12:23, 6 September 2014 (UTC)
Sounds like a great tool. Whenever I have taken text from an article I suspected of being cut and paste and searched for selected passages at Google and Google Books, it always felt like something a bot could have done. I note the part "After a user detects this kind of editing, clean-up involves going through all their edits and occasionally reverting dozens of articles. Unfortunately, sometimes it means going back to how an article was years back, resulting in the loss of the efforts of the many editors who came after them." This suggests that a dickish editor on copyvio patrol could take a fine article, detect a copyvio 1000 edits back and blindly revert it back several years to remove the old copyvio, thereby destroying hundreds of hours of work by other good-faith editors who followed the copyvio edit. Instead of that act of what is legalistic vandalism, why not edit the copyvio portion to render it acceptable? That would preserve the contributions of other editors. But let's see a bot do that. Edison (talk) 12:40, 6 September 2014 (UTC)

As someone who has devoted a lot of time towards such copy-paste violations, this is a marvellous idea. I did already think CorenSearchBot was doing something similar, but now I see that just involves new pages (not edits to existing pages). I strongly suggest that this type of tool be helped and funded by the community with a long-term goal of running on all Wikipedias. The benefits for editors, readers, the site's reputation, and licensing terms are very clear. SFB 13:04, 6 September 2014 (UTC)
Wonderful! BTW, the redlink should be ithenticate. --Randykitty (talk) 13:21, 6 September 2014 (UTC)
According to her userpage, Shani is "now working towards an M.A. in East Asian Studies", so "professor" is maybe not the right word. Great initiative though! Johnbod (talk) 20:51, 6 September 2014 (UTC)
- Thank you, you are correct. An instructor, not a professor. Doc James (talk · contribs · email) (if I write on your page reply on mine) 11:13, 7 September 2014 (UTC)
  - John, thanks for the reminder to update my user page on En-Wiki. But you're right -- not a professor. Just teaching a wiki-Med course at Sackler. :) Shani. (talk) 13:34, 7 September 2014 (UTC)
This is a great idea. Two more aspects: 1) it would be great to find duplicated content from other parts of Wikipedia, too, as these are also problematic (redundant information is hard to maintain); 2) there's an open-source project, WikiDuper, that searches for duplicated sentences. It might be used for this, so we don't have to rely on only one provider (Turnitin). --Dnaber (talk) 11:53, 7 September 2014 (UTC)
- Dnaber, thanks for the suggestion; WikiDuper seems to be a really cool project. However, I think copyright violation is a different problem that calls for different treatment: deleting the text, versus an editorial choice of where to place a longer explanation and where to place only a link to an extended article. Another difference is that such a tool can run offline. BTW, maybe you can contact the authors to give you this data and place it on the wiki (in Wikipedia:Similar articles?). Once you get such a page you can suggest a collaboration of the week for editing such articles :) Eran (talk) 21:06, 7 September 2014 (UTC)
Hi Doc James, we spoke at Wikimania about this problem too. Do be aware that Turnitin/iThenticate suffer from both false positives AND false negatives. It's a tool, but you have to look at the results, not just trust the score reported. I've been testing the software since 2004. People want to believe that it detects any and all plagiarism, but it doesn't: no system does. I do feel that it is only proper for Turnitin to give Wikipedia access to their API, as they display Wikipedia content in their reports in a manner that does not conform to the license. I have suggested for quite some time that they should provide API access in return. It would be useful if other wikis (also Wikia wiki admins) could have access to this tool as well. --WiseWoman (talk) 20:06, 7 September 2014 (UTC)

Yes, agreed. This is not a stand-alone solution. Each concern requires human follow-up. As for not picking up all cases: yes, I agree this may occur. We are trying to prevent those who make dozens or thousands of copyright violations from slipping through the cracks. Even if we miss a couple here and there, the long-term copy-and-paster will be fairly quickly detected. Doc James (talk · contribs · email) (if I write on your page reply on mine) 10:59, 8 September 2014 (UTC)

In my opinion, this tool is extremely useful. While we have CorenSearchBot to catch new pages that are blatant copyright violations, such violations are easily missed in existing articles. Thanks for giving the bot some attention; it looks potentially useful in the long run. --k6ka (talk | contribs) 02:07, 8 September 2014 (UTC)
Editors who would like to help evaluate the bot's findings should participate in the discussion at Wikipedia:Bots/Requests for approval/EranBot, to report the results of their evaluation of the bot's work. – Wbm1058 (talk) 14:26, 11 September 2014 (UTC)
Sounds like a very useful tool, especially since new editors - who are often unaware of copyright issues - are more likely to augment an existing article than to author a new one from scratch. It will certainly protect the project before a copy/paste problem balloons into a CCI. However, I don't feel that it will help with editor retention. Rightly or wrongly, copyvio is a permanent black mark on an editor's contribution history, and they know it. Any doubters have only to look at any RFA proposal. Blue Riband► 03:57, 13 September 2014 (UTC)
As someone who's been helping to tidy up the sad mess that is all that is left of one of those medical/biological articles, this is enormously welcome. ClueBot has vastly reduced the hassle from vandalism; I hope the new bot will do the same for the largely hidden problem of copyright violation. Chiswick Chap (talk) 07:58, 13 September 2014 (UTC)