The Signpost


Special report

Update on EranBot, our new copyright violation detection bot

Discuss this story

Wavelength (talk) 23:41, 24 April 2016 (UTC)[reply]

User:Wavelength how best to add it? Doc James (talk · contribs · email) 11:06, 25 April 2016 (UTC)[reply]
User:Doc James, I posted a request at User talk:Alvin Seville, because that editor has been updating Wikipedia:Backlog.
Wavelength (talk) 16:22, 25 April 2016 (UTC)[reply]
I've manually added Category:Wikipedia backlog to User:EranBot/Copyright/rc and the subpage I am currently working on (Batch 30). — Diannaa (talk) 20:16, 25 April 2016 (UTC)[reply]
I think there is something distressing about the "Copied" template. It gives the feeling that Wikipedia text is not really portable and reusable after all - that the deletion of the originating page or the eventual collapse of the site throws all its material into doubt. I think our best practices should involve extracting whatever-the-hell-it-is-we-have-to-keep from the history of the originating page into a single text file and posting that for attribution somewhere standard and close by the page that needs it, so that the page is relatively free to continue wandering about the Web. Wnt (talk) 12:21, 25 April 2016 (UTC)[reply]
I appreciate we have a problem with deletionism, but how often have we seen articles deleted that have been partially copied elsewhere? Would a better solution be to exempt such source articles from the prod process and have a bot that notified AFDs so that this could be taken into account? As for the eventual collapse of the site, I think with the WMF starting an endowment fund the more likely scenario is that eventually the site, or at least archives of its early twenty-first-century versions, will fall out of copyright. ϢereSpielChequers 06:20, 1 May 2016 (UTC)[reply]
  • Couple of questions - why is the handling of the bot report built as a completely separate process from what happens at established copyright checking places, like WP:SCV? While adding new automated detection is invaluable, is the multiplication of processes to check similar reports the best way to deal with the endemic problem of copy / pasting? And for that matter, does EranBot use the same whitelist as CorenSearchBot? MLauba (Talk) 17:59, 26 April 2016 (UTC)[reply]
It does not use the same whitelist. We definitely should consider this though. Where is CorenSearchBot's whitelist?
I assume it is these ones [1] User:MLauba?
For what it is worth, you can see here [2] that the whitelist for EranBot is longer. Doc James (talk · contribs · email) 18:31, 26 April 2016 (UTC)[reply]
I think this is all the more reason to have both bots use the same list. Pinging Coren to get his perspective. MLauba (Talk) 14:52, 29 April 2016 (UTC)[reply]
Which is different than saying they infringe on copyright. There is also fair use. Doc James (talk · contribs · email) 18:14, 28 April 2016 (UTC)[reply]
  • What we really need to do as a community is have a chat about Contributor Copyright Investigations (CCI). I first came into contact with it in conjunction with a copyright-related Arbcom case several years ago and I figured out right away that the CCI methodology does not scale, that backlogs would continue to grow, and that most cases would never be resolved. This has proven correct. Now the backlog is five years. Next year the backlog will be six years. The year after that the backlog will be seven years. At what point do we recognize that the system has failed and scrap it for another? Carrite (talk) 15:50, 27 April 2016 (UTC)[reply]
    • There's a simple solution - offer the users the chance to clean up their own mess, putting them under a temporary topic ban on creating new content until it's done, and if they're unwilling, ban them and nuke their contributions. Except then we'll get people trying to posit that it is unreasonable to require cleaning up 5 years' worth of copyvios. MLauba (Talk) 14:52, 29 April 2016 (UTC)[reply]
      • Have you seen this work, requiring people to clean up their copyvios? Doc James (talk · contribs · email) 15:36, 29 April 2016 (UTC)[reply]
        • No, but simply ceasing to care about the massive mess some have created, without doing anything, is not an option either. MLauba (Talk) 01:07, 30 April 2016 (UTC)[reply]
          • Agree. Which is why we have pushed to build this tool. Efforts began after we found a few editors who had made tens of thousands of copyright violations before being noticed and dealt with. Doc James (talk · contribs · email) 07:24, 30 April 2016 (UTC)[reply]
          • It's not a question of not caring, it's a question of using our limited resource (editor time) in a more efficient way. I have already used the new tool to locate and block several repeat violators and discovered some folks who created sockpuppets who continue to insert copyvio. The difference is that they are getting discovered and stopped within a matter of a few days or weeks, instead of continuing to do damage long-term like Epeefleche or continuing to insert copyvio with sockpuppets like Mushroom9. I continue to work on the CCI case for Epeefleche every day. — Diannaa (talk) 13:10, 30 April 2016 (UTC) Just want to add, normally at WP:CCI the violating contributor is not expected to help with the clean-up. Historically it has been rare that this is even permitted. The main thing is to get them to stop adding new violations. — Diannaa (talk) 13:16, 30 April 2016 (UTC)[reply]
            • I'd quibble with the "historically" - as one of the minor co-drafters of the process, I was firmly convinced that cooperation on clean-up would be the best way both for them as a path forward, and for the project as a means to tackle the masses. Except that, as you know from more direct experience than I over the past couple of years, we rapidly confronted users who didn't care in the first place, users who vanished rather than clean up (including a sitting arb, at least for a time), and the group that I suspect Carrite is hinting at above first and foremost, the "too big to fix". I do think we're all in agreement that new tools to help with the cleanup are a godsend. Nonetheless, both the endless stream of new users generating copyvios out of ignorance and the "too big to fix" increasingly call for rethinking the way we deal with them. For the former, having a tool that catches everything that CSB doesn't check will certainly help. For the latter, if they're unwilling to help with cleaning up, the cost / benefit evaluation (particularly if their article creation is subject to the added burden of reviews by others) may need to be rethought. MLauba (Talk) 21:31, 30 April 2016 (UTC)[reply]
  • I echo what HappyValleyEditor said. Using tools with a questionable copyright license to catch copyright infringement sounds a lot like doublespeak. Personally I have done research on classical music (and maybe other topics as well, but that is the one which comes to mind first). After writing it up in my essay submitted for marks, I also added what I researched to the music article. Luckily that particular course does not use Turnitin, but if it did, the Wikipedia music article would flag the phrases I added as identical to those in my assignment. And EranBot would unintentionally "out" the editor's public identity, because obviously the student doesn't identify themselves with their wiki username on their class assignment (and it is doubtful that Turnitin will generate the full report of that piece so that the student can prove it's their own work "double-dipping" on assignment and wiki). My question is, why is EranBot configured to search student essays when it's extremely unlikely that a third-party person would gain access to the digital copy of the student's assignment submitted to Turnitin and that the same third-party person would submit it to Wikipedia? OhanaUnitedTalk page 16:03, 27 April 2016 (UTC)[reply]



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0