The Signpost


Commons Picture of the Year; Wikidata licensing


Discuss this story


Requiring attribution for Wikidata

Requiring attribution and the same license for derivatives for Wikidata seems like common sense. Is there a good reason we are not doing this? Doc James (talk · contribs · email) 00:22, 16 June 2016 (UTC)[reply]

I'm not sure Silicon Valley would've poured money into Wikidata if it had a copyleft license. Google could've simply continued work on Freebase. It's a very common trend in tech to see proprietary companies spend money on permissive-licensed projects at the expense of copylefted ones (cf. Apple's funding of Clang versus GCC), so that they can profit from freedoms without having to release their modifications freely as well. NMaia (talk) 02:13, 16 June 2016 (UTC)[reply]
We have to require attribution eventually, because copyright grants can be rescinded after decades. It's better to get it over with sooner rather than later. EllenCT (talk) 14:39, 16 June 2016 (UTC)[reply]
@Doc James: You'd have to ask Denny and Eloquence; it was their decision, according to Denny's post at Wikipedia_talk:Wikipedia_Signpost/2015-12-02/Op-ed. An interesting question is whether the mass imports of Wikipedia data to Wikidata that have been happening are a violation of Wikipedia's licensing terms; meta:Wikilegal/Database Rights, compiled by WMF Legal, is the best summary of the legal issues I've seen.
Speaking of Wikidata, there are two other discussions that may be of interest: Wikipedia:Village_pump_(policy)#RfC:_Wikidata_in_infoboxes,_opt-in_or_opt-out? and User_talk:Iridescent#Infobox_&_Wikidata. --Andreas JN466 19:06, 16 June 2016 (UTC)[reply]
Is it too late to switch now? I would assume CC0 is irrevocable. OhanaUnitedTalk page 21:21, 16 June 2016 (UTC)[reply]
CC0 grants users *all rights*, including the right to relicense it. NMaia (talk) 21:36, 16 June 2016 (UTC)[reply]
Current legislation does not support the licensing of individual facts, only of databases as a whole, and only in some countries. What you are asking for is Wikidata to lobby for the introduction of new notions of "copyright" which do not exist today. Yes, you could use these laws to enforce attribution and share-alike, but companies will also use the same laws to enforce conditions on using "their" facts. This is not desirable. Plain data is free from such legal control, and this is the position of the EFF (see this recent article) and also of many people in our community. Concepts like the infamous illegal prime express the fundamental opposition that free culture proponents have against putting terms and conditions on data items. By suggesting that laws should be more restrictive, the article is arguing against some of the basic freedoms we are supporting with our movement. --Markus Krötzsch 22:43, 16 June 2016 (UTC)[reply]
Wikidata, as a Freebase successor, has commercial re-users who import – or aim to import – the whole database, doesn't it? Andreas JN466 02:30, 17 June 2016 (UTC)[reply]
The arguments for an attribution licence – which Google's own Freebase had!!!!! (so much for the argument that facts can't be copyrighted ...) – are the same as ever:
  1. Clear traceability of data provenance for the end user.
  2. Visibility for Wikidata when its data appear in commercial products – assuring that volunteer labour is credited, and aiding editor recruitment.
  3. Assurance that derivatives are also published under a copyleft licence, rather than appearing as a proprietary black box. Andreas JN466 18:45, 17 June 2016 (UTC)[reply]
Wikipedia has a long history of insisting on commercially reusable resources. I think this actually makes a lot of sense because otherwise, you're saying that what you're building is just a hobby, but people who want to do something real have to go and pay somebody for a real version. If you want to strike a blow for free culture, find one of those lawyers who sue illegal downloaders and give him an ISIS style flying lesson if you want, but don't put extra copyright restrictions on our data. Wnt (talk) 00:52, 17 June 2016 (UTC)[reply]
"Wikipedia has a long history of insisting on commercially reusable resources": Then why not use a licence equivalent to Wikipedia's? --Andreas JN466 02:30, 17 June 2016 (UTC)[reply]

Of various points Andreas made about Wikidata, I thought the area raised about commercial reuse was the weakest, really. And of the points made above, I think the non-copyright nature of "facts" is the strongest.

Argument by analogy with Wikipedia text is certainly not convincing, nor should it be. Attribution is actually different from referencing, even though both are of interest in the general matter of understanding "provenance", which does indeed matter. I would say the way forward is with WikiCite, i.e. trying to standardise and solidify sourcing from external references. Which is what everything rests on. If you think about the potential of data-mining, e.g. the ContentMine project, the key aspect would seem to be machine-readable referencing styles everywhere.

Legal status is going to be less important than "audit trails" for purported facts. Charles Matthews (talk) 08:19, 17 June 2016 (UTC)[reply]

Charles Matthews: External referencing in Wikidata won't help much if re-users are under no obligation to indicate those references, or indeed that they got their data from Wikidata. We owe it to consumers, for all sorts of reasons, to enable them to verify the provenance of the data they are given.
One of NMaia's strongest arguments to my mind was one I did not make in my December op-ed: that derivatives from Wikidata should also be open and free and traceable, rather than forming the content of a black box marked "proprietary". The latter is, unfortunately, exactly what is happening, isn't it? Andreas JN466 17:28, 17 June 2016 (UTC)[reply]

So re-users of data who don't indicate their sources will lose credibility, no? If their "business model" is simply to claim they have "authoritative" data, without giving adequate referencing, they become like, well what? Tabloid newspapers, that is one thing that comes to mind. Plagiarists, is another.

I think those comparisons show something about the idea of imposing obligations or constraints on users. Frankly, there are shameless people out there anyway, and it is better not to get too involved with them, when one can avoid that. Facts really can be treated differently from authored material. Well, I suppose at some point this is an issue on which people may have to agree to disagree.

How Wikipedia Works conformed to the GFDL by adding many pages of attribution, just to quote WP; neither Phoebe nor I (mostly Phoebe) would want to go through that again. If you go seriously into the data reuse question in education, though, you can see why CC0 might be a good idea (allows lightweight reuse in cases of AGF). I wouldn't like to think that fairly generic tables of the world's longest rivers, or time-series of rainfall data in Australia, would have to carry compliance overheads.

In any case, we here have plenty of experience of the hazards of using unreferenced material, and very little in the sort of direction you suggest. Looks like Stallmanitis to me. Charles Matthews (talk) 18:24, 17 June 2016 (UTC)[reply]

Don't be naive, Charles. If Google says in a Knowledge Graph panel or answer box that X = Y, as it does today, the empirically observed effect is not that people think Google is a tabloid, but that the statement X = Y is widely believed to be accurate, and propagated. Andreas JN466 18:45, 17 June 2016 (UTC)[reply]

I don't think you need to call me naive. You are talking about an effect created by a lack of critical thinking (of some people). I'm talking about a chilling effect on reuse, in schools short of resources for example. If what schools taught about the Internet was more up-to-date, critical thinking everywhere would be in a better state. The naivety doesn't lie with those like me who have written on information literacy.

Here's what I mean, in an example, anyway. With a few facts from Wikidata, I can make the multiple-choice question "Was Albert Einstein's birthplace (a) Bremen, (b) Munich, or (c) Ulm?" This sort of thing can and should be done on a large scale, starting from Wikidata. If I created a database of such questions (and this is a project of mine) it would be helpful to record both the authoring of questions (say user-generated and bot-generated), and the source of Wikidata facts (dated, for maintenance purposes).
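To make the record-keeping described above concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the names `SourcedFact`, `QuizQuestion` and `make_question` are hypothetical, the retrieval date is a placeholder, and nothing here actually queries Wikidata.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SourcedFact:
    """One fact plus where and when it was obtained (hypothetical structure)."""
    subject: str
    prop: str
    value: str
    source: str       # e.g. "Wikidata"
    retrieved: date   # dated, for maintenance purposes


@dataclass(frozen=True)
class QuizQuestion:
    """A question that records both its author and the facts it rests on."""
    text: str
    options: Tuple[str, ...]
    answer: str       # label of the correct option, e.g. "c"
    author: str       # e.g. "user-generated" or "bot-generated"
    facts: Tuple[SourcedFact, ...]


def make_question(fact: SourcedFact, distractors: Sequence[str], author: str) -> QuizQuestion:
    """Build a multiple-choice question from one fact and some wrong answers."""
    if not distractors:
        raise ValueError("need at least one wrong answer")
    # Sorting alphabetically keeps the position of the right answer
    # independent of the order in which the options were supplied.
    options = tuple(sorted({fact.value, *distractors}))
    labels = [chr(ord("a") + i) for i in range(len(options))]
    labelled = [f"({label}) {opt}" for label, opt in zip(labels, options)]
    listing = ", ".join(labelled[:-1]) + ", or " + labelled[-1]
    text = f"Was {fact.subject}'s {fact.prop} {listing}?"
    answer = labels[options.index(fact.value)]
    return QuizQuestion(text, options, answer, author, (fact,))


# Illustrative data only; the retrieval date is a placeholder.
fact = SourcedFact("Albert Einstein", "birthplace", "Ulm", "Wikidata", date(2016, 6, 18))
question = make_question(fact, ["Bremen", "Munich"], author="user-generated")
print(question.text)
```

Keeping the provenance in the database record, rather than in the printed question, means a front end can show or drop it as the situation requires.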

But if someone just wants to generate a printed quiz with 20 questions, using a front end of such a database, or just by hand from Wikidata, I don't think a legal framework to compel them to carry along such provenance metadata is what we want. In practice it would, I believe, have a "chilling effect": any mention of intellectual property does. We should in this case be thinking that such quizzes could, more slowly and "by hand", be taken from Wikipedia pages.

In any case I don't intend to lose sleep over Google's Knowledge Graph. I think prioritising Wikimedia's brand in terms of easy reuse is worth more attention. Charles Matthews (talk) 09:02, 18 June 2016 (UTC)[reply]

You are wrong. I blogged and explained why. Thanks, GerardM (talk) 13:57, 18 June 2016 (UTC)[reply]

Copying my ML post: Added to that, even if it were possible to copyright facts, I think using a restrictive license (and make no mistake, any license that requires people to do specific things in exchange for data access is restrictive) makes a lot of trouble for anyone using the data. This is especially true for data that is meant for automatic processing - you will have to add code to track licenses for each data unit, figure out how exactly to comply with the license (which would probably require professional help, always expensive), track license-contaminated data throughout the mixed databases, verify all outputs to ensure only properly-licensed data goes out... It presents so much trouble that many people would just not bother with it. It would hinder exactly the thing open source excels at - creating a community of people building on each other's work by means of incremental contribution and wide participation. Want to create a cool visualization based on Wikidata? Talk to a lawyer first. Want to kickstart your research exploration using Wikidata facts? To the lawyer you go. Want to write an article on, say, gender balance in science over the ages and places, and feature Wikidata facts as an example? Where's that lawyer's email again? You get the picture, I hope. How many people would decide "well, it would be cool but I have no time or resources to figure out all the license issues" and not do the next cool thing they could do? Is it something we really want to happen?

And all that trouble to no benefit to anyone - there's absolutely no threat of the Wikidata database being taken over and somehow subverted by "enterprises", whatever that nebulous term means. In fact, if the Google example shows us anything, it's that "enterprises" are not very good at it and don't really want it. Would they benefit from the free and open data? Of course they would, as would everybody. The world - including everybody, including "enterprises" - benefited enormously from free and open participatory culture, be it open source software or free data. It is a good thing, not something to be afraid of!

Wikidata data is meant for free use and reuse. Let's not erect artificial barriers to it out of a misguided fear of somehow benefiting somebody "wrong". Smalyshev (WMF) (talk) 02:46, 23 June 2016 (UTC)[reply]

Bof. Freebase had (or has) a Creative Commons Attribution Licence, so it was clearly "possible" for Google to "copyright facts" when it was their own project. As far as attribution is concerned, you could simply ask people to put "Powered by Wikidata" somewhere users can see it. In fact, when Bing, Google et al. copy chunks of Wikipedia text in their SERP knowledge panels and timelines, a hyperlink to Wikipedia is all that's given. That's obviously enough to satisfy Wikipedia's CC-BY-SA licence requirements in that use case, given that WMF Legal has shown no sign of complaining. To me, all of this reads like smoke and mirrors designed to protect the interests of commercial reusers, rather than those of volunteer contributors or end users, or indeed the project itself. --Andreas JN466 12:40, 23 June 2016 (UTC)[reply]
It is a bit of a non-sequitur. Google can put any license on anything, that doesn't mean they have the rights that you think they claim. And I am not going to seriously discuss the claim that my argument is "smoke and mirrors designed to protect the interests of commercial reusers" - if you are not interested in discussion, so be it, fencing with baseless accusations of nebulous conspiracies is not how I would like to spend my time. When you are interested in discussing my actual position and not in dismissing it by means of baseless conspiracy claims, please come back with a real argument addressing my points. Or don't.
The interests of the contributors and the project itself are for the data to be widely available and used. Erecting legal barriers in the way of the users is the worst possible way to ensure it. Especially barriers that most users would require professional help to understand and comply with. Google can handle any license, they probably have many more lawyers than Wikidata has developers. Hobbyist researchers, hackers and data users cannot. That's who you will be hurting with restrictive licenses. Smalyshev (WMF) (talk) 18:09, 23 June 2016 (UTC)[reply]
Don't give me that phony bullshit about "hurting" people, Smalyshev (WMF). Do you think people will break a finger typing "Powered by Wikidata"? It's just emotionally manipulative poppycock. On the other hand, there is a very real potential for harm in inundating end users the world over with "information" of unknown and untraceable provenance. --Andreas JN466 03:02, 29 June 2016 (UTC)[reply]
@Jayen466: I'd ask you to tone it down a bit. You use a lot of swear words and far fewer arguments. Throwing around words like "bullshit" and "poppycock" does not make your argument stronger. In fact, it doesn't make an argument at all, just running your mouth (or in this case, fingers). Your tone is abusive and insulting, and does not contribute to the discussion of the actual point. I believe it is completely possible to express your argument without swearing - as I and other discussion participants amply demonstrated. If you find your argument can only stand when you use words like "bullshit" - that means it cannot stand at all. Now to the point.
Provenance and reliability are important, but a restrictive license does not improve the reliability or provenance of information in Wikidata. If anything, it may only make it worse by dissuading contributors who will be unable to reuse the results and thus will see no point in participating.
The problem with a restrictive license, as I pointed out, is not typing three words - the problem is automatic processing and reuse of mixed data. Not by people - by automatic agents, that have no fingers and cannot type anything. Each time a restrictive license is used, data under this license must be special-cased in all automatic tools and code has to be validated and vetted for license compliance. Each particular instance of this may be easy to do, but doing this over many pieces of code, many sources and many licenses becomes unmanageable - and some people with limited resources would rather opt not to use restricted data at all than risk getting into legal trouble over a misunderstanding of the license. This does not help anybody. --Smalyshev (WMF) (talk) 04:41, 29 June 2016 (UTC)[reply]
Smalyshev (WMF), when you're talking about "restrictive licences" that "hurt" people, then we have well and truly entered the hyperbolic domain of political spin. Let's be real: People can at most be inconvenienced if Wikimedia asks Wikidata re-users for attribution (just as Wikipedia's "restrictive licence" has always asked re-users for attribution), but they cannot be "hurt" by your asking for this.
And I don't see it as progress for humanity if machines mix and match high-quality and low-quality data from an array of sources so "unmanageable" that in the end no one can figure out any more which source any item of information comes from. Wikipedia is deeply flawed in many ways, but at least it has traceability of provenance as a core ideal. In theory, at least, it places very great value on informing the end user of the origin of the information contained in it, and expects contributors to go to a great amount of trouble and effort to satisfy this requirement. That is part of its beauty and a very large part of why it works to the extent it does. Wikidata appears to have jettisoned that ideal (along with the commitment to copyleft for derivatives), and everything you have said here to date only serves to deepen that perception in me. Andreas JN466 05:30, 29 June 2016 (UTC)[reply]
I am not sure I can follow your argument here. I agree that Wikipedia and Wikidata both have idea of providing provable and reliable information, though currently they/we are not completely in agreement with this ideal yet. I also agree that it might be beneficial, in some cases, to inform users about provenance of each piece of information. In other places it may be impossible or irrelevant, as soon as we know the information source is reliable enough for our purposes. You seem to be mixing efforts of Wikidata and Wikipedia editors to source information and ensure its reliability (which are admirable and laudable) with restrictive licensing of Wikidata facts, which have nothing to do with it. Certainly Wikidata never changed the commitment to the former, but it in no way requires the latter. If for your dataset it is important to know provenance, "comes from Wikidata" does not work anyway - Wikidata is only a secondary source by design, so you need deeper provenance. And if you already implement deeper provenance, there's no need to force you to comply with any restrictive license, it does not help data reliability in any way. Did I miss some important part of the argument that made it work?
Mixing high-quality and low-quality data may indeed be undesirable, but again, I do not see how restrictive license is going to help. If, as you claim, it does not prevent usage (and thus mixing) of data sets, nothing changes at all. If it does prevent, how would this prevention only apply to mixing with low-quality data and not with high-quality data? I do not see any way that the license may be of any help here. It is certainly an issue, but not one solvable in any way by licensing.
I am not sure what a "commitment to copyleft" is, but I certainly do not see it as a worthy ideal. Copyleft is a tool, the purpose is to create and distribute free information. If a certain license does not serve it, it should be jettisoned mercilessly and without any hesitation; a license itself has no value, only what it can be used to achieve. As I argued, Wikidata's goals are better achieved without the use of restrictive licenses. --Smalyshev (WMF) (talk) 06:21, 29 June 2016 (UTC)[reply]
If for your dataset it is important to know provenance, "comes from Wikidata" does not work anyway - Wikidata is only a secondary source by design, so you need deeper provenance. Of course you need that. But Wikidata fails on both counts. "Comes from Wikidata" would be the first link in the provenance chain enabling the end user to trace the information source. Even that first link is broken when re-users don't have to say they got the info from Wikidata. And even if that first link were there, the deeper provenance is lacking whenever Wikidata lacks a source citation, or merely cites "Italian Wikipedia" or something like that. Not citing your sources because of "lack of time" and "too much trouble" is tremendously harmful to information integrity, even with Wikipedia. See e.g. the comments on Dickens in the recent Times Literary Supplement piece. When you say that all those people wanting to build "the next cool thing" should not be put to too much trouble, to me that sounds exactly like saying, "Gosh, citing sources in Wikipedia is so much trouble ... people wanting to do cool stuff go elsewhere. Let's dispense with sourcing requirements so that contributing is easy and everyone can have fun participating in Wikipedia." Yes, you'd get more free information that way, but even more of it would be dross. We're trying to map reality, not simply to provide an answer so people can avoid the uncomfortable feeling of not knowing.
This is just one aspect. Another is that volunteers work for nothing on the project, and now their project is not even credited in a minimal way, while others are earning billions from it. That's as exploitative as working arrangements in the early days of the industrial revolution. Another related aspect is that Wikidata abandons the ShareAlike principle that volunteers have signed up to. The whole point of ShareAlike is to ensure that those who use licensed work make public contributions as well; that's the idea that underpins the whole "information wants to be free" credo. Abandoning it is essentially selling out, opening the back door to the precise opposite. What volunteers are ending up with is proprietary black boxes they've fed, but now can't look into.
The purpose of the entire effort to me is not the creation and distribution of free information for its own sake: the purpose is to serve people—end users first and foremost, but volunteer contributors are people too. End users' and contributors' interests are served by transparency, and that's why attribution and ShareAlike are important concepts to me, Smalyshev (WMF). Cheers, --Andreas JN466 12:43, 29 June 2016 (UTC)[reply]
Echoing Jayen466, I'd just like to contribute to this discussion (regarding so-called "restrictive" licensing) with this enlightening comment by John Sullivan: "Licensing compatibility problems exist primarily because people insist on making proprietary software, but still want to benefit from free software". This is no different from Wikidata's case. Want to reuse Wikidata's sweat of the brow? Great! Just make sure it is publicly available. You don't need a lawyer for that, as most free software projects don't have lawyers and are able to comply with copyleft provisions without a problem. What we may be seeing here is a classic instance of FUD: no one will be "hurt" by "restrictions". But people and companies that want free labor will need to keep that labor free. ~nmaia d 15:51, 29 June 2016 (UTC)[reply]
Switching Wikidata to ODbL would be a terrible idea. ODbL isn't compatible with most Creative Commons licenses[1] so we would be hurting our own ability to reuse the data at least as much as we would be hurting Google's. Kaldari (talk) 09:53, 26 June 2016 (UTC)[reply]


As the uploader of the winning image... Eh, screw it. I'm disappointed that won too. It's a fantastic image, and I was excited to find it, but I believe that's the only image I've ever had the slightest connection to to even make it into the final round. I work in image restoration, and, no matter how carefully one restores an image, it's never going to get that much visibility in any Commons promotions or contests. For example, I'd argue File:Billy Strayhorn, New York, N.Y., between 1946 and 1948 (William P. Gottlieb 08211).jpg is better than a different restoration being sold, and File:Frances Benjamin Johnston, Self-Portrait (as "New Woman"), 1896.jpg is a massive improvement over both the original source and the best copy we formerly had - but the work done is completely invisible at POTY; for all the POTY voting pages indicate, they may as well be images just grabbed from elsewhere because they're free-licensed.

It's disheartening. Commons offers monthly contests - but they're only open to photographers. POTY tends to value prettiness at thumbnail over any other consideration, meaning we get situations where, for example, an attempt at making the image more artistic means it's misleading and can't be used in an encyclopedia (the image is a composite: it shows an event that can only happen while electricity is flowing, but removed the source of electricity in Photoshop to make the picture more interesting).

POTY could handle this; indeed, even if it simply emphasised the winners of the various categories (and accurately categorised them - this year, all sorts of non-paintings were put into a category named "Paintings") - then it would at least make a start on recognising the variety of content.

I think Commons is a wonderful project, but what it most heavily promotes and what it seems to get used for most outside of itself and Wikipedia seem to be very different things. Adam Cuerden (talk) 00:50, 16 June 2016 (UTC)[reply]

You may indeed have a good idea there, because it's not really a fair contest. As for me, I don't vote on these things because literally all the candidates are too good for me to rate. I mean, there's nothing for me to compare them to; each is unique and gorgeous in its own way, some way I've never seen before. Wnt (talk) 00:56, 17 June 2016 (UTC)[reply]
What's wrong is it's a photo contest set within an illustration system. Commons has a few different uses but the biggest one is to illustrate articles. Yes an art contest brings out beautiful art. However its relevance is incidental rather than essential. Jim.henderson (talk) 17:42, 17 June 2016 (UTC)[reply]

I share the concerns of Tony1 about imported images winning the contest. The contest should promote Wikimedia contributors. If Creative Commons hosted such a worldwide contest, that's fine, but we should focus on our community of collaborators. --NaBUru38 (talk) 17:30, 18 June 2016 (UTC)[reply]

I really love the POTY competition but also find it slightly frustrating. I vote in both rounds but am totally unqualified. I'd love for the second round to be judged by an expert panel. Perhaps one instruction could be given (both rounds) that might favor community created/restored images: entries should be judged in part on how educational (I almost wrote 'encyclopedic', but want to include value to all Wikimedia projects) they are. In part to make up for my inability to pick between two stunning images (or even notice obvious flaws) I try to bias my votes in this way. But I 99% just love the POTY competition and think it does an OK job of picking educational-looking winners already. Still I'd love to see tweaks which make it even better, perhaps by partially disenfranchising me. On the other topic, I think CC0 is fine for Wikidata, but I'm slightly vexed to read folks who want a conditional license suggesting the FDL of data rather than CC-BY[-SA]. Mike Linksvayer (talk) 21:39, 19 June 2016 (UTC)[reply]

Tony1, I caution against thinking that "a panel of experts" would do better. I remember reading an aphorism about photo competitions (which concerned those who enter their own images, but I guess the same is true of those who have their own favourites among the entries): "If your image does well, the judges were wise and had a good eye for what makes an outstanding image. If your image does badly, the judges were blind fools." Any popularity contest, fully open to anyone regardless of experience and training (never mind recognised expertise), is going to choose "popular" images. Experts in most creative fields tend to have a different agenda. Think of popular music, popular fashion, art that people actually buy to put on their walls, books that people read in the millions, vs the kind of bands that only music critics love, clothes that only supermodels could wear, art that common people don't understand, books that are worthy but dull, etc, etc.

We had experts judging the final stages of WLM UK in the two years it ran, and I have to say I was very disappointed with their choices. See here and here. There are a few good ones, but compared to what normally passes at Commons FP on a daily basis, I suspect Tony, even with his unexpert eye, would also be disappointed at their choices. Some were very poor technically, and in 2013 many were very low resolution. And generally the winning photographers weren't regular at WP or Commons and didn't stay. Unlike the regulars, they submitted small, heavily-processed and arty images rather than the accurate documentary and high-resolution images that our community values. In other words, the experts didn't share our values.

I don't think Adam's restoration images will tend to do well in any popularity contest. Appreciating the work that went into the restoration (vs the talents of the artist who drew or photographed the original) is too complex a task and not suited to pressing "Like" buttons when faced with over a thousand excellent alternatives. I agree with him that the two lightbulb winning images in previous years, though great works of art, aren't the finest example of educational images, being contrived and manipulated.

I recommend you consider POTY as just the bit of fun that it is, and accept the attributes of popularity contests, good and bad. Most of the Featured Pictures on Commons are excellent. That's the point. Don't consider the selection of a handful of images out of over a thousand as a contest designed to "recognise" the skills of our community. We have other forums that do that. And don't make the mistake of thinking the result reflects Commons' community values -- the voting is open to anyone with a Wikimedia account and it certainly attracts those from all projects. As I browse the images in the round one of the contest, I can celebrate the fine free works that Commons offers as a repository of educational works. As a creative contributor to Commons (rather than an uploader of others' works) of course I would like to be appreciated, but Commons is more than just an image bank for amateur photography, so POTY should not ignore those who do the uploading or who negotiate free licensing. -- Colin°Talk 12:19, 21 June 2016 (UTC)[reply]

Colin, thanks. Good reason to have a people's choice and a panel-judged set of prizes to a set of technical criteria. Tony (talk) 04:44, 22 June 2016 (UTC)[reply]
Well nobody reads the instructions, experts especially so. If you hire people for their supposed expertise and judgement, then you are at the mercy of their opinion and judgement. [I'm not being anti-expert, btw, but this isn't science or engineering -- there's a huge amount of personal taste and fashion and bias at play]. It's really not easy to get an external person to understand Wikipedia/Commons values -- as we all know from the various dubious research papers and studies performed on us, which fail to grasp what we're about. All competitions of this scale need a "round one" that uses cheaper/informal/crowd labour, and the experts only see a selection of the best. Oh, and the other thing we learned from WLMUK was that the experts couldn't agree on their top 10 either. There is virtually no overlap. So my point is that after "round one", the selected images of POTY should all be generally so excellent that it isn't really interesting whether this expert prefers that image or that expert prefers this image. What we can say, is that the winners of POTY are very popular, and we have a pretty large sample size on which to draw that conclusion. And perhaps that is a more interesting and useful result than what some panel thinks. -- Colin°Talk
An expert panel? All photos in the contest are featured, and these are selected by consensus on strict criteria. -- (talk) 16:49, 23 June 2016 (UTC)[reply]

Articles aren't only created by Wikimedians

I just wanted to point out that there is a fundamental error in the second section of this piece. Articles are not at all only the work of Wikimedians — often we use or adapt other CC or PD content, which is the very same thing as what you highlight. I've been involved with the Heart article which to a large degree builds upon the CC-BY textbook CNX: Anatomy & Physiology, which is currently undergoing GA review. I would be devastated if it failed that review only because it uses content produced externally. That goes against the very nature of Wikipedia's mission to spread knowledge. Even when we don't take and adapt text directly we include and adapt free images, and sound-files, and videos, placing them in articles in a way where external content is part of our creations. Wikimedia should be a platform for all free content, and we should simply promote what is best, not what we happened to know is produced by a friend from Wikipedia. Carl Fredrik 💌 📧 15:48, 16 June 2016 (UTC)[reply]

Thank you, Carl, it's a good point to raise. I presume you're referring to the images in the article Heart. I was referring to the overwhelming majority of article content, which is text. Text that is quoted has to be marked clearly as such, or it's plagiarism; external text that is (non-closely) paraphrased must also be attributed, which still takes significant skill by Wikimedians.

It's a question of the ratio of the external sourcing and internal input of skill and effort. Yes, the balance goes both ways: you'll notice that a little time went into writing the description page for the winner (and significantly more for No. 6, which along with the noise reduction does at least exonerate it from my point about outsourcing, though not enough to win a top place, in my view, and I suspect that only a tiny proportion of votes were cast by people who had taken this into account). Let's also consider that the task of choosing and integrating images into an article, and writing appropriate captions, is normally greater than the energy put into writing description pages for externally sourced images.

You write: "Wikimedia should be a platform for all free content, and we should simply promote what is best, not what we happened to know is produced by a friend from Wikipedia." My responses are first that it's nothing to do with friends on Wikipedia, or the whole featured-content system would be discredited by accusations of nepotism. Second, featured picture forums already provide a significant way of judging and rewarding the best free content, internal and external. Third, I didn't propose that POTY be restricted to internally produced material—one improvement might be to retain the current blindness to the internal–external divide in the round 1 category competitions and give those results more publicity, but to restrict the more prominent and symbolic round 2 to internals; and it's probably not the only solution. Tony (talk) 05:03, 17 June 2016 (UTC)[reply]

No, I mean the text. The text is taken in large part from the CC-BY textbook. In fact roughly 80% of it is an adaptation of that text, changed for correct tone and tense. This is also very apparent in the reference section, there is a tag on the page that it uses this content. Promoting internal images is negative to our cause — which is to spread knowledge in any way shape or form it presents itself to us. Your argument is the same as saying that if an expert group has helped in the formation of an article, then we should not feature it on our main page. But, this has already happened, in fact yesterday when the article Pancreatic cancer was featured. It incorporated: text, reviews and discussions that were held entirely off-wiki as well as those that were held on-wiki. We have to stop pretending there is a value in promoting sub-standard work onto our main pages. The same is true for some awful anatomy images that took ages to delist, even though there were better images that could replace them, just because the replacements weren't created by Wikimedians.
I'm all for holding internal competitions, but then we can't call them picture of the year, or featured images or featured articles, which are for the best stuff we have, regardless of its source. Carl Fredrik 💌 📧 08:00, 17 June 2016 (UTC)[reply]

"Copyleft matters"? Facts should also matter.

I have commented on some important factual issues in this post on a mailing list and maybe it is best to keep replies there. But for the benefit of readers here, let me quote the main points I replied to the author:

You say that Microsoft donated to Wikidata. Is it possible that you have just made this up since it fits the picture you want to paint? No concerns about misinforming your readers here?

You claim that Google is using Wikidata content. I have not seen any proof of this. I have challenged Mr. Kolbe about this before, and indeed it seems that he is now avoiding this claim in the text you cite. The fact that Google stopped working on the Freebase imports does not seem to suggest that they are very interested in the data right now. Maybe you have new information you would like to share with us? It would surely be of interest to many people here.

You mention "vain threats made by those who wish to use us as mere free labor for their enterprises". Which threats? Who made them? What are they threatening with? Are you just trying to stir the emotions of the reader, making them wish to rebel against some imagined enemy?

Please check the linked thread to see if the author has replied. Maybe one or the other point I make here can still be clarified by the author, who may have sources that I am not aware of. (It would be greatly appreciated if replies posted here could be sent to the mailing list as well, so as to keep the thread complete there.) --Markus Krötzsch 22:54, 16 June 2016 (UTC)[reply]

Microsoft qua Microsoft did not fund Wikidata development. Microsoft co-founder Paul Allen's AI institute (which includes in its board of directors both Allen and the Bill & Melinda Gates Chair in Computer Science & Engineering, and whose Advisory Board includes the Director of Microsoft Research as well as the VP Engineering of Siri, another company in the proprietary question-answering business) provided half the original external funding, with further contributions from search engines Google and Yandex. So while the statement indeed lacks precision, it's not exactly "made up" to say that Microsoft money went into the development of Wikidata.
In SEO circles, it's taken as read that Wikidata informs Google's Knowledge Graph. People have published experimental evidence of that, e.g. this article (with screenshots of the Knowledge Graph panel before and after updating a company's Wikidata profile). Google staff have made statements to that effect; Denny indicated last year, for example, that Wikidata would be "one source among many" for the Knowledge Graph. As someone working on the Knowledge Graph at Google, he can be assumed to know what he was talking about.
As I recall, you yourself wrote that Wikidata would have "a prominent role as an input for Google's Knowledge Graph", and then deleted that sentence from the Wikidata description on your institute's website when I pointed it out to you. That did happen, didn't it?
Perhaps the thinking at Google has changed over the past year or so; you and your mate Denny are in a much better position to know than me. So, if you know anything, tell us, rather than posting riddles. But I note that this year, Wikidata's Lydia Pintscher co-published with four Google employees a Google research paper about "The Great Migration" from Freebase to Wikidata. This makes it sound extremely unlikely that Google has abandoned its interest in Wikidata, no? --Andreas JN466 18:25, 17 June 2016 (UTC)[reply]
You reply to two of the three points I have raised. So let me reply to both:
(1) "Microsoft". So you believe that AI2 is pursuing business goals of Microsoft because it is run by people who have connections to Microsoft? That's simply not true. If Microsoft wanted to donate to Wikidata, they would just do so. The history for the AI2 funding is very different, and I can tell you about it if you care to hear it. The whole thing came out of a previous project called "Halo" that was run by Vulcan Inc. (another company run by Mr. Allen). The goal of project Halo was to create a smart system dubbed "digital Aristotle", which should be able to answer common knowledge questions on basic topics.
As part of this project, Semantic MediaWiki has been funded as an approach of overcoming the knowledge-acquisition bottleneck in such applications (entering all relevant knowledge for such a smart system is time consuming and technology support was considered to be necessary to allow more people to contribute). Some of the programming work I did on Semantic MediaWiki as a PhD student was funded by this project (not directly by Vulcan, but by the German company Ontoprise which was their subcontractor at the time).
Project Halo has ended, but the wiki-related activities have been pursued further by the newly created AI2, though not as a long-term effort. The institute now has a range of its own projects, some in the spirit of the original "digital Aristotle", but it no longer supports Wikidata directly. I still hope that our data can contribute to their ambitious goals at some point, but as far as I know there is no ongoing collaboration. Microsoft has nothing to do with any of this, and I think you should be able to trace much of the goals and doings of Vulcan/AI2 in this area for many years (possibly predating even the time when such data became interesting for search engines). You can also check out talks by former Vulcan project manager Mark Greaves, who was close to project Halo at the time, and who has had significant influence on the decisions regarding the course of the project (to the best of my knowledge, the shift of focus towards semantic wikis was based on his initiative). Again, you can see that there is no relevant connection to Microsoft's business ops on this level.
(2) "Google". The question is not if Google will make use of Wikidata at some point (it's free, they can do this if they feel like it; and they probably won't notify me when they do). The question is whether this makes Wikidata anywhere near as relevant for their business as the post suggests here. The picture that is painted here is that Google is exploiting Wikidata without giving anything back. At the same time, the post complains about the fact that they donated money to support Wikidata. Isn't this a wee bit contradictory?
I don't know why people distrust donations so much. Maybe they think that the donors could influence the project in some way? I don't think this has ever happened in Wikidata, especially not related to licensing, which is a very important topic to the WMF and to us personally. I remember us having discussions about this topic, and looking at the legal situation in this field. The interests of Google have never played a role there. In fact, throughout the design and development discussions I have had with the team at Berlin and with WMF staff in the US, there has never been any mention of Google or related interests. It should also be fairly clear that it does not make a big difference for Google what kind of license we pick in the end. They integrate (copyleft) Wikipedia information as well as data coming from many different websites and databases into their knowledge graph displays without any problems.
You know well why I have removed the claim that Google is using Wikidata. I explained it to you in detail. Not sure what is the purpose of hiding this fact from your readers here. Doesn't fit your argument?
I don't know about the SEO people. Obviously, if a new platform is coming up, this opens new possibilities for their business. They now could manipulate Wikidata for money, just like they manipulate Wikipedia, Facebook, and whatnot. This business only works if their customers believe that this type of spam has impact, i.e., if it ends up on Google. They really would like this to happen, and they will try to prove it (to their customers). They might really be the first to find some real evidence, but they also will surely be the first to seed rumors. When you read their publications, you should not forget that it is their very business to manipulate opinion.
You cite a research paper written by Denny and Lydia as evidence of the interest of Google. This is extremely naive. Of course neither Denny nor Lydia has lost interest in Wikidata, but that does not tell you anything about the business strategy of the multi-billion dollar company that is Google. The paper in fact is mainly about the Primary Sources project that Google has now abandoned.
I think there is a recurring type of irrational reasoning we need to get beyond in these discussions. You look at small "signs" that you find on the internet: a quick sentence I write on some minor project page, a suggestive sounding remark on IRC, a joint research publication about a small (and completely public) collaboration, a claim by some SEO blog. And then you deduce from this collected evidence some kind of hidden agenda of large-scale organisations (Google, Microsoft, Wikimedia). You could easily find as much evidence for the opposite. Maybe it helps to think practically: Do you think I would bother to answer here on a Saturday morning if I could somehow be making secret plots for world domination with my "buddies" at Google instead? --Markus Krötzsch 10:03, 18 June 2016 (UTC)[reply]
Just on a tangent, I have strong reason to believe Facebook uses Wikidata without attribution. Or I came across an example that proves they do, but missed out on documenting it. If anyone is interested I can probably reproduce it. Carl Fredrik 💌 📧 14:46, 18 June 2016 (UTC)[reply]
Found an example of absolute proof, mail me if interested. Carl Fredrik 💌 📧 15:00, 18 June 2016 (UTC)[reply]
Methinks Markus Krötzsch doth protest too much. I find it grating to see corporations use the work of volunteers for their own profit, although we all realize that not only is this enterprise not set up to keep eyeballs and credit here, the WMF shows no real indication of caring. WMF likely sold our efforts to buy furniture and pay people who don't work there anymore. But yes, Andreas has been doing the legwork on this beat for years but don't let that dissuade you from calling him a liar. Chris Troutman (talk) 19:30, 18 June 2016 (UTC)[reply]

Why public domain makes sense for Wikidata

As a potential Wikidata contributor, I am driven by the following consideration: I want to contribute to Wikidata, so no one will ever need to repeat my efforts. Were share-alike or attribution stipulations placed on Wikidata, I would not contribute. Share-alike creates incompatibilities. For example, share-alike would prevent integrating Wikidata with CC BY-NC content. Integration is especially important with respect to data (the most valuable applications occur only once data is integrated). Additionally, data licensing is a relatively new legal issue, with much uncertainty. I support public domain dedication (such as CC0), because it reduces the burden of content reuse. There is a growing consensus in the scientific data fields that any stipulations regarding data reuse are damaging. I've personally experienced how licenses that do not waive all copyright protections make data integration a nightmare. I strongly urge the Wikidata community to consider what option will be the best for the long-term reuse and preservation of Wikidata content. I firmly believe that the future will be built on public domain data rather than data encumbered with incompatibility- and legalese-ridden licensing.

Daniel.himmelstein (talk) 13:25, 20 June 2016 (UTC) Daniel Himmelstein[reply]

Big monopolists can use Wikidata for free but so can small companies. Big monopolists could pay money to develop their own data sets. Small startups can't.
As AI becomes more important, with Sundar Pichai speaking about an AI-first world, that AI needs access to data to work. That data should be freely accessible to both small companies and large companies. Having a permissive license allows everyone to use the data and build products without first paying huge sums of money to license data. Google, Microsoft or Facebook can afford to pay large sums of money for data but small startups can't.
A huge advantage of Wikidata is its translations. That means that non-Western countries are more likely to profit from AI developed on the basis of it, because translation into other languages gets easier.
I'm more motivated to contribute to Wikidata when the impact is bigger and the impact is bigger with a permissive license. ChristianKl (talk) 20:41, 20 June 2016 (UTC)[reply]

Wikidata is a connector

As a contributor to and user of Wikidata, I feel strongly about keeping Wikidata under a CCZero waiver. The original op-ed article ends with: "Among all Wikimedia projects, Wikidata is conspicuously alone in not being copylefted." Copylefting (or not) has been heavily and religiously debated for many, many years in the open source community. I have never seen strong examples of why either would be better for open source. Second, data is not text and is not source code. It's different, and "conspicuously alone" is a false argument that suggests that for data the same arguments apply as to other content types. "Perhaps we should start asking why that is the case": I have just discussed two possible reasons why this is and should be the case. Add to that that in many jurisdictions, facts are not copyrightable in the first place, though in many jurisdictions too, a collection of facts can be (like in The Netherlands). About: "and whose interests benefit from weak licensing choices," that's the wrong way around. CCZero is a stronger license (actually, it's not a license, but a waiver): it gives people more freedom and removes many more hurdles. And exactly these strong freedoms are for me the reason to contribute my effort (time, and with that, money) to Wikidata. Wikidata, with a strong mechanism for sourcing data, and identifiers, can play a critical role in connecting scientific knowledge. That would be greatly inhibited by changing Wikidata to a copyleft license. It would be a significant step back. Finally, I disagree with this point: "and start to organize ourselves to fix this". There is nothing to fix. CCZero without copylefting gives more freedom, and for me that is the main reason to invest my time. Before you start talking about "fixing", realize you will also lose. Egon Willighagen (talk) 12:09, 22 June 2016 (UTC)[reply]

What would it take to do it?


The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0