The Signpost

Opinion

Google isn't responsible for Wikipedia's mistakes

By Zarasophos
Zarasophos is currently working on everything related to Jadidism. The views expressed in this article are his alone and do not reflect any official opinions of this publication.
My work, Google's traffic

If you type "Rizaeddin bin Fakhreddin" into Google, Google will give you a list of links and a small box to the right. The first link will probably be to the English Wikipedia article on bin Fakhreddin, created and written by me; this can easily be checked by going into the page history of the article. But most likely you'll never bother to actually click on the article because of that small box to the right. "Rizaeddin bin Fakhreddin was a Tatar scholar and publicist that lived in the Russian Empire and the Soviet Union", it reads.

I typed that sentence. I also put the birth and death dates onto Wikipedia. I uploaded the picture to Wikimedia Commons and put it into the article – or articles, actually, because I also created the article on the German Wikipedia. But now I find this information directly on Google. There is a link to the Wikipedia article, but that may as well be a result of Father Google's omniscient mercy. Nowhere does the box state that it presents the work of an unpaid volunteer next to Google advertisements. The effect is obvious: In a 2017 study, half of the participants attributed what they found in the Knowledge Graph, which is the name of that small box, not to Wikipedia, but to Google.

Only good enough to blame

The Knowledge Graph has recently been in the news for saying that California Republicans are Nazis. The scandal was reported, discussed, closed, opened again and finally forgotten. Conservatives still think Google is biased against them; Google says the whole thing wasn't its fault.

We regret that vandalism on Wikipedia briefly appeared on our search results. This was not the result of a manual change by Google.
— Google press release

No, obviously it wasn't. None of the content you presented there was. That was all Wikipedia's.

But the interesting thing is that in the public eye, this was still Google's fault. Read through the Twitter thread; none of the enraged commenters there seem to believe that this wasn't an action by a Google employee. "Google: Republicans are Nazis", read the headline on the Drudge Report article exposing the issue, and Wired magazine made a whole story out of making clear that the vandalism itself happened on Wikipedia. And all of that while more Wikipedia editors quickly did the dirty work; they hunted down the specific edit that caused the problem, corrected the vandalism and placed the page under semi-protection to prevent copycats. Meanwhile, the Knowledge Graph is still humming along, the ideology section removed, the rest still filled with Wikipedia data, and Google can be happy until the next scandal.

And we are left with a question: Why do we let this happen? Why do we let a multi-billion dollar company exploit us as uncredited mules, at least until it needs someone to shift the blame to? Where is the organization that should be responsible for protecting the rights of its volunteer editors – where is the WMF? Traditionally, Google is one of the biggest sponsors of the Foundation; for example, they chucked Jimmy Wales a $2m grant in 2010, more than they donated in the whole of last year. A few months later, they acquired the knowledge base Freebase, which was to form the basis for the Knowledge Graph, for an undisclosed sum.

Exploiters of free content should give back

After the recent scandal surfaced, the Foundation took an apologetic stance. "We're sorry", its statement seems to say, "and no, online encyclopedias still aren't a bad thing." But on 15 June, WMF executive director Katherine Maher, writing an opinion piece in Wired, saw the other side: "If Wikipedia is being asked to help hold back the ugliest parts of the internet, from conspiracy theories to propaganda, then the commons needs sustained, long-term support", she says. "The companies which rely on the standards we develop, the libraries we maintain, and the knowledge we curate should invest back. And they should do so with significant, long-term commitments that are commensurate with the value we create."

This is a step in the right direction. At the very least, the platform economies of the world should give something back to the largest source of the information they feed their algorithms with. As Maher concludes, "we shouldn’t be afraid to stand up for our value", but maybe it is time we see Google – and Facebook, and Amazon – not only as partners, but also as the ones making huge profits sustained by our unpaid labor.


Discuss this story

My comments which are three months late

I haven't had time to read The Signpost. I just found this. I was expecting to read about the problem of the Knowledge Graph having inaccurate information, which is a frequent complaint on the Help Desk and the Teahouse. This information is not on Wikipedia. I'm not sure where they got it. The person who complained is advised to give feedback to Google. I have done that about one particular mistake many times and gotten no results. Maybe it works for some people.— Vchimpanzee • talk • contributions • 20:36, 17 September 2018 (UTC)[reply]

Discussion that was already here

  • It's certainly minimal, but I don't think it counts as a formal attribution. When it's in the form of some basic information followed by a Wikipedia link, the section reads more like "Here's some basic information, and a place you can read more about it," rather than "here's some information from this place." A better way for Google to frame the information would be "Rizaeddin bin Fakhreddin was a Tatar scholar and publicist who lived in the Russian Empire and the Soviet Union. (from Wikipedia)," perhaps also including the Creative Commons license display that is conspicuously absent. Aside, though this may be Fair Use, whatever happened to the ShareAlike part of the Wikipedia text's license? lethargilistic (talk) 07:12, 3 July 2018 (UTC)[reply]
  • @Nick-D: As a contrast to your experience, I have been spending quite a lot of time working on improving the coverage, quality, and accuracy of content in power station article infoboxes for the past year or so, and a significant part of my motivation for doing so was because of my desire to make the content more useful to third-party users such as Google (and although Google was certainly not the main type of third-party user I had in mind when I first started, I've realized since then that the benefits from my effort are quite clearly practically realized by Google far more than by any other type of third-party users).
I personally agree that the attribution to Wikipedia on Google search result pages containing content from Wikipedia in sidebar boxes on the search results page is unacceptably poorly done in its current form (an issue that has been bothering me for quite a while), but unfortunately since all Wikipedia content is dual-licensed under CC BY-SA 3.0 and the GFDL with the minimal explicitly specified attribution requirements being nothing more than a simple hyperlink, Google is technically already meeting the minimum obligations for attribution (although the fashion in which they do so is incredibly poorly done and doesn't even make it clear that they're attributing Wikipedia for content, let alone conveying the full scope of what content the attribution applies to — which honestly feels like an extremely insulting move on Google's part), and so sadly there is no real incentive for Google to bother with giving Wikipedia a more appropriate level of clearly defined and scoped attribution. While technically speaking Google does seem to actually currently be in violation of both GFDL and CC BY-SA 3.0 licensing terms due to their complete failure to comply with the requirements regarding copyright/licensing notices and potentially also those regarding redistribution licensing (as well as a few other related issues), I kinda doubt that they would take a complaint about these issues very seriously, and I'm not 100% sure that there isn't a loophole somewhere they could exploit to avoid these requirements for their particular use cases (I also personally don't really care too much about non-major violations of the relicensing terms as long as the rest of the requirements were complied with, although in this case, the rest of the requirements were seemingly not complied with, and so I am still annoyed about this because of that).
With regards to the issue of commercial use in general though, I have no problems with that as long as the attribution is clear, copyright/licensing notices were correctly included, and the redistribution of content does not grossly violate the relicensing terms. So if a company wants to benefit off of reusing content that I created or modified, they are more than welcome to go right ahead — if you aren't accepting of the fact that this type of reuse is allowed, then you shouldn't be editing Wikipedia at all. You don't have to be comfortable with it, but honestly, if you're volunteering your time to edit in order to improve the knowledge on here, shouldn't you be happy when said knowledge gets more exposure & usage? Or are you truly only happy as long as the exposure and usage exclusively happens on Wikimedia Foundation sites? Because that seems rather absurd to me. Actively avoiding making any edits that could potentially result in Google gaining more scrapable data is an utterly terrible idea if for no other reason than the fact that this adversely affects the quality of Wikipedia as a whole. Garzfoth (talk) 17:38, 30 June 2018 (UTC)[reply]
I would just add that if we want to be very technical about what CC BY-SA allows, a licensing notice which explicitly states that the material being reused is available under CC BY-SA is also required in addition to the attribution with a hyperlink. The relevant policy is Wikipedia:Reusing Wikipedia content. With that being said, the Google Knowledge Graph data is generally only a short blurb, and the Wikipedia link makes it fairly clear where it comes from, so I don't think this is a big deal. Mz7 (talk) 20:13, 30 June 2018 (UTC)[reply]
We also shouldn't take responsibility for how others choose to use our content. Wikipedia is remarkably accurate for an encyclopedia that anyone can edit, but it is not a reliable source and nobody should be republishing anything (from Wikipedia or elsewhere) without doing some basic fact-checking. Blaming Wikipedia for providing bad information wouldn't fly in a high school writing class and it sure as hell shouldn't fly at Google. Using an algorithm to do your heavy lifting does not change this.
We should continue to produce quality content while addressing vandalism, to meet Wikipedia's goals and nobody else's. Republishing with proper attribution doesn't create extra work for us, but it should be understood that what we write is provided "as is" with no guarantee of accuracy. –dlthewave 19:22, 1 July 2018 (UTC)[reply]
If you consider the photographs I take, where I am sole author, Wikipedia re-uses them and there is no indication on the page whatsoever that I am the author or what licence it is used by (it is CC BY-SA 4.0, which is different to Wikipedia text). You have to know that clicking on the image will deliver the file-description page, and it is there that you will read the relevant attribution and licence details. If you Google for "Ravens of the Tower of London" you'll get a snippet from Wikipedia. The format is a bit different to the above example. It is more clear the text comes from Wikipedia. However the image is curious. If you click on that you get a Google Image page with text "Ravens of the Tower of London - Wikipedia". If you follow the "Visit" button it takes you to the Ravens of the Tower of London page. This is wrong. Firstly they are displaying a full-size image that did not actually come from that Wikipedia page (which only shows a thumbnail). But more importantly their page is the place where they should have the attribution and licence details. So to get to that information, you need to click on the thumb in the Google results, click on the "visit page" to get to Wikipedia, find my photo and know already that you can click on the photo, and then you get the attribution and licence terms. Google should fix that and properly link to the file-description page, which is where they got their image from. The problem is that there is minimal and there is best-practice, and Wikipedia already does minimal internally, so how can it persuade others that they should follow best practice? -- Colin°Talk 07:42, 5 July 2018 (UTC)[reply]
  • @Colin: When reusing that kind of content within Wikimedia Foundation projects such as the English Wikipedia, attribution requirements are minimal because all Wikimedia Foundation project contributors have already released their content under compatible licenses within the project (you agree to this with every edit you make), all project pages already include the full & correct copyright/licensing notices necessary for this type of reuse, and for the specific case of images, the author attribution info is always available by simply clicking on the image (both Mediaviewer and the local Wikipedia copy of/proxy to Wikimedia Commons images show the file's author & copyright information).
Google, in contrast, does not have the appropriate full copyright/licensing notices required under either of the licenses for Wikimedia Foundation project content. For example, if they were to choose to comply with the CC BY-SA 3.0 terms (as complying with GFDL terms would likely be impractical for their use cases), they would need to add a link to http://creativecommons.org/licenses/by-sa/3.0/, and I think they also need to specify the content's license (CC BY-SA 3.0) as well. Garzfoth (talk) 09:23, 5 July 2018 (UTC)[reply]
  • Garzfoth. The difference is not because Wikipedia has a special contract with contributors. The agreement with text contributors is CC BY-SA 3.0 & GFDL. This does not allow Wikipedia to reuse content any differently to how Google or any other reuser does (such as someone who clones Wikipedia). For the issue of images, the image creator often has no special agreement with Wikimedia at all: many of the images on Commons came from third parties (Flickr, etc) and were uploaded without the creator's knowledge.
So both Google and Wikipedia have identical requirements to attribute and display the licence details. Neither choose to do so on the page where the material is displayed. But at least Wikipedia does so on a page it hosts itself (e.g., a copy of the Commons file description page). Google instead relies on several jumps of third-party hyperlinks to satisfy the terms. I think that is dangerous practice because if someone removes my ravens photo from the article, and google continue to display it in their snippet, then their use of my image is unlicensed and so a copyright violation. If they linked directly to the Commons file description page, then that would be a bit safer. However even then, my image could be deleted from Commons (unlikely, but technically valid), or renamed. This is one reason Commons is reluctant to rename files, but it just comes from supporting bad practice. -- Colin°Talk 11:42, 5 July 2018 (UTC)[reply]
  • @Colin: The agreement between the contributor and the Wikimedia Foundation specifies that attribution via article hyperlink under the terms of the CC BY-SA 3.0 license is acceptable to the contributor (as well as use under GFDL, and alternatively attribution via two other alternative methods). Wikipedia pages already contain the required CC BY-SA 3.0 copyright/licensing notices that allow any content that is compatible with that license to be used within them (the "Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply." bit in the page footer). For the issue of rehosted images, those are rehosted under different terms from normal user-licensed images as you do not hold their copyright and there are already specific exceptions for these cases laid out in the Terms of Use as well as on each individual project.
The requirements differ depending on what is being reused. For Wikipedia text, this can only be reused under either GFDL or CC BY-SA 3.0, and Google is irrefutably not in compliance with either of those licensing terms. For images specifically, in the US and any other areas with equivalent copyright laws in this area, I believe that Google is technically protected by the fact that their use of said images can be considered "fair use" (which is an exemption that Creative Commons licenses respect). Also for images on their image search they are MUCH clearer about the fact that the image is potentially copyrighted content, and they don't seem to be rehosting full resolution content from Wikipedia/Commons both in image search results and in website search results. I am not quite sure if their reuse of text beyond the minima required for any basic generic short website summary in search results could possibly be considered "fair use" though...probably not, especially for use in knowledge graph... Garzfoth (talk) 13:24, 5 July 2018 (UTC)[reply]
Garzfoth, my argument is mostly about the images, where Wikipedia is not following best practice, but is a whole lot better than Google. For images on Wikipedia there is no contributor agreement that a hidden hyperlink is acceptable, but at least their hyperlink provides the goods. I think a fair use claim could be used by Google when the image appears as a snippet in a search result that clearly links to Wikipedia. Their fair use argument does not hold when they format the search results as an information box like the example in this article, where Google is effectively acting as an Encyclopaedia rather than web search engine. They don't mention the image is "potentially copyright content" in the search results at all, only when you click on the image and get the dark Google Images format page, and then that text is generic for all images they display. Their CC BY-SA requirement is for them to display attribution and licence details, which they don't. Expecting the user to hunt through Wikipedia to find such attribution and licence details is not acceptable imo, and liable to break when the article changes. For the Google Images page, they are hosting an enlarged image that does not come from the Wikipedia article thumb, so the CC BY-SA licence best practice is to state where they got the image from, which isn't the Wikipedia article, but the file description page which includes attribution and licence details. -- Colin°Talk 11:23, 7 July 2018 (UTC)[reply]
So, who wants to contact the WMF legal department and see if they wish to send a lawyer letter to Google about violating the CC BY-SA license? --Guy Macon (talk) 13:59, 7 July 2018 (UTC)[reply]
WMF do not own the Wikipedia content or images. WMF legal represent WMF. The CC BY-SA violation, should it exist, is a legal issue between photographers, writers and Google, not WMF. So I don't think they would be involved. The most I've seen WMF legal do is give hints about certain interpretations of law wrt copyright. Perhaps you could get WMF legal to advise Jimbo about what he might want to say or write. But if Google are a big donor to Wikimedia, then don't hold your breath. -- Colin°Talk 22:07, 7 July 2018 (UTC)[reply]

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0