Update on EranBot, our new copyright violation detection bot

Special report

Copying within Wikipedia requires attribution

One thing I've noticed while reviewing the bot reports is that many established editors don't realise that attribution is required within Wikipedia under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License. Some of you may have already received a notice from me letting you know about this requirement!

When copying within Wikipedia, at a minimum, what you need to do is mention in your edit summary that the content has been copied from another article. Here’s a sample edit summary: "Attribution: this material was copied from Example on April 1, 2016. Please see the history of that page for attribution." That's the ideal version, but even a simple edit summary such as "copied from Example" is better than nothing! If you're copying material you wrote yourself, attribution is technically not required, but it would help simplify checking the bot reports—and it would save you from receiving irrelevant notices from me or potential future helpers.
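As a rough illustration of the sample edit summary above, here is a minimal Python sketch that assembles the same wording from a source article title and a date. The function name `attribution_summary` and its parameters are hypothetical, invented for this example; it is not part of EranBot or any Wikipedia tool.

```python
from datetime import date

# English month names are spelled out so the output does not depend on locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def attribution_summary(source_title: str, copied_on: date) -> str:
    """Return an edit summary crediting the article the text was copied from.

    Hypothetical helper: the wording follows the sample summary quoted above.
    """
    title = source_title.strip()
    if not title:
        raise ValueError("source_title must not be empty")
    # Format the date as "April 1, 2016" (no zero padding on the day).
    when = f"{MONTHS[copied_on.month - 1]} {copied_on.day}, {copied_on.year}"
    return (
        f"Attribution: this material was copied from {title} on {when}. "
        "Please see the history of that page for attribution."
    )


if __name__ == "__main__":
    print(attribution_summary("Example", date(2016, 4, 1)))
```

The point of generating the text is only consistency: a summary that always names the source page makes a copied passage easy to recognise when someone later reviews a bot report.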

It's good practice, especially if copying is extensive, to also place a {{copied}} template on the talk pages of the source and destination. In cases where you add public domain material from US government websites or other PD sources, you should place the template {{PD-notice}} immediately after your citation, and for material under a compatible Creative Commons license, you should use {{CC-notice}}. There's also a whole host of other useful attribution templates available at Category:Attribution templates.

Helpers needed

As the bot gets more reliable and begins to run 24 hours a day, there’s going to be more work to do than can be managed by one person alone. It would be great if interested people had a look at User:EranBot/Copyright/rc or the archived batches, with a view to helping out by assessing diffs, removing violations from article space, and issuing warnings to the editors involved. If you have any questions about how to perform this task, please let me know on my talk page, or post a message at User talk:EranBot/Copyright/rc.

The WMF Community Tech team has been working with Eran to further improve the bot, based on user feedback. The hope is that usability and reliability will improve over time. Further details are here.

24 April 2016 (all comments)

Discuss this story

Please consider the inclusion of User:EranBot/Copyright/rc at Wikipedia:Backlog.

—Wavelength (talk) 23:41, 24 April 2016 (UTC)

User:Wavelength how best to add it? Doc James (talk · contribs · email) 11:06, 25 April 2016 (UTC)

User:Doc James, I posted a request at User talk:Alvin Seville, because that editor has been updating Wikipedia:Backlog.

—Wavelength (talk) 16:22, 25 April 2016 (UTC)

I've manually added Category:Wikipedia backlog to User:EranBot/Copyright/rc and the subpage I am currently working on (Batch 30). — Diannaa (talk) 20:16, 25 April 2016 (UTC)

I think there is something distressing about the "Copied" template. It gives the feeling that Wikipedia text is not really portable and reusable after all - that the deletion of the originating page or the eventual collapse of the site throws all its material into doubt. I think our best practices should involve extracting whatever-the-hell-it-is-we-have-to-keep from the history of the originating page into a single text file and posting that for attribution somewhere standard and close by the page that needs it, so that the page is relatively free to continue wandering about the Web. Wnt (talk) 12:21, 25 April 2016 (UTC)

I appreciate we have a problem with deletionism, but how often have we seen articles deleted that have been partially copied elsewhere? Would a better solution be to exempt such source articles from the prod process and have a bot that notified AFDs so that this could be taken into account? As for the eventual collapse of the site, I think with the WMF starting an endowment fund the more likely scenario is that eventually the site, or at least archives of its early twenty-first century versions, will fall out of copyright. Ϣere SpielChequers 06:20, 1 May 2016 (UTC)

Couple of questions - why is the handling of the bot report built as a completely separate process from what happens at established copyright checking places, like WP:SCV? While adding new automated detection is invaluable, is the multiplication of processes to check similar reports the best way to deal with the endemic problem of copy / pasting? And for that matter, does EranBot use the same whitelist as CorenSearchBot? MLauba ^(Talk) 17:59, 26 April 2016 (UTC)

Does not use the same whitelist. We definitely should consider this though. Where is CorenSearchBot's whitelist?

I assume it is these ones [1] User:MLauba?

For what it is worth, the whitelist for EranBot is longer, per here [2]. Doc James (talk · contribs · email) 18:31, 26 April 2016 (UTC)

I think this is all the more reason to have both bots use the same list. Pinging Coren to get his perspective. MLauba ^(Talk) 14:52, 29 April 2016 (UTC)

Perhaps this has been mentioned elsewhere, but the use of Turnitin.com is a bit problematic. It has been banned in the past by at least one university as it is in itself a copyright violation machine: it makes copies of student essays and archives them. HappyValleyEditor (talk) 23:26, 26 April 2016 (UTC)
- From what I understand, students sign a form giving Turnitin the right to this use. Doc James (talk · contribs · email) 21:02, 27 April 2016 (UTC)
- It was informally banned at a few places where I have taught, with the reasoning that requiring students to sign a copyright release form was considered to be a contract of adhesion, i.e. a forced contract. HappyValleyEditor (talk) 05:38, 28 April 2016 (UTC)

Which is different than saying they infringe on copyright. There is also fair use. Doc James (talk · contribs · email) 18:14, 28 April 2016 (UTC)

What we really need to do as a community is have a chat about Contributor Copyright Investigations (CCI). I first came into contact with it in conjunction with a copyright-related Arbcom case several years ago and I figured out right away that the CCI methodology does not scale, that backlogs would continue to grow, and that most cases would never be resolved. This has proven correct. Now the backlog is five years. Next year the backlog will be six years. The year after that the backlog will be seven years. At what point do we recognize that the system has failed and scrap it for another? Carrite (talk) 15:50, 27 April 2016 (UTC)
- There's a simple solution - offer the users the chance to clean up their own mess, put them under a temporary topic ban on creating new content until it's done, and if they're unwilling, ban them and nuke their contributions. Except then we'll get people trying to posit that it is unreasonable to require cleaning up 5 years' worth of copyvios. MLauba ^(Talk) 14:52, 29 April 2016 (UTC)
  - Have you seen this work, requiring people to clean up their copyvios? Doc James (talk · contribs · email) 15:36, 29 April 2016 (UTC)
    - No, but simply ceasing to care about the massive mess some have created, without doing anything, is not an option either. MLauba ^(Talk) 01:07, 30 April 2016 (UTC)
      - Agree. Which is why we have pushed to build this tool. Efforts began after we found a few editors who had made tens of thousands of copyright violations before being noticed and dealt with. Doc James (talk · contribs · email) 07:24, 30 April 2016 (UTC)
      - It's not a question of not caring, it's a question of using our limited resource (editor time) in a more efficient way. I have already used the new tool to locate and block several repeat violators and discovered some folks who created sockpuppets who continue to insert copyvio. The difference is that they are getting discovered and stopped within a matter of a few days or weeks, instead of continuing to do damage long-term like Epeefleche or continuing to insert copyvio with sockpuppets like Mushroom9. I continue to work on the CCI case for Epeefleche every day. — Diannaa (talk) 13:10, 30 April 2016 (UTC) Just want to add, normally at WP:CCI the violating contributor is not expected to help with the clean-up. Historically it has been rare that this is even permitted. The main thing is to get them to stop adding new violations. — Diannaa (talk) 13:16, 30 April 2016 (UTC)
        - I'd quibble with the historically - as one of the minor co-drafters of the process, I was firmly convinced that cooperation on clean-up would be the best way both for them as a path forward, and for the project as a means to tackle the masses. Except, as you know from more direct experience than I over the past couple of years, that we rapidly confronted users who didn't care in the first place, users who vanished rather than clean up (including a sitting arb, at least for a time), and the group that I suspect Carrite is hinting at above first and foremost, the "too big to fix". I do think we're all in agreement that new tools to help with the cleanup are a godsend. Nonetheless, both the endless stream of new users generating copyvios out of ignorance and the "too big to fix" increasingly call for rethinking the way we deal with them. For the former, having a tool that catches everything that CSB doesn't check will certainly help. For the latter, if they're unwilling to help with cleaning up, the cost / benefit evaluation (particularly if their article creation is subject to the added burden of reviews by others) may need to be rethought. MLauba ^(Talk) 21:31, 30 April 2016 (UTC)
        - Just want to mention that the sitting arb was Rlevse, and he later helped clean up his CCI (under the user name PumpkinSky). — Diannaa (talk) 01:22, 1 May 2016 (UTC)

I echo what HappyValleyEditor said. Using tools with questionable copyright license to catch copyright infringement sounds a lot like doublespeak. Personally I have done research on classical music (and maybe other topics as well but that is the one which comes to the top of my mind). After writing it in my essay submitted for marks, I also added what I researched onto the music article. Luckily that particular course does not use Turnitin but if it did, the Wikipedia music article would flag the phrases I added as identical to those in my assignment. And EranBot would unintentionally "out" the editor's public identity because obviously the student doesn't identify themselves with their wiki username on their class assignment (and it is doubtful that Turnitin will generate the full report of that piece so that the student can prove it's their own work "double-dipping" on assignment and wiki). My question is, why is EranBot configured to search student essays when it's extremely unlikely that a third-party person would gain access to the digital copy of the student's assignment submitted to Turnitin and the same third-party person would submit it to Wikipedia? OhanaUnited^{Talk page} 16:03, 27 April 2016 (UTC)
- It is mainly used because it also includes webpages, textbooks, and journal articles. Doc James (talk · contribs · email) 21:02, 27 April 2016 (UTC)
