The Signpost


Op-ed

Whither Wikidata?

Wikidata, a Wikimedia project spearheaded by Wikimedia Deutschland, recently celebrated its third anniversary. The project has a dual purpose: 1. Streamline data housekeeping within Wikipedia. 2. Serve as a data source for re-users on the web; in particular, Wikidata is the designated successor to Google's Freebase, designed to deliver data for the Google Knowledge Graph.

We need to talk about Wikidata.

Wikidata, covered in last week's Signpost issue in a celebratory op-ed that highlighted the project's potential (see Wikidata: the new Rosetta Stone), has some remarkable properties for a Wikimedia wiki:

This op-ed examines the situation and its implications, and suggests corrective action.

But first ...

A little bit of history

Wikidata is one of the younger Wikimedia projects. Launched in 2012, the project's development has not been led by the Wikimedia Foundation itself, but by Wikimedia Deutschland, the German Wikimedia chapter.

The initial development work was funded by a donation of 1.3 million Euros, made up of three components:

The original team of developers was led by Denny Vrandečić (User:Denny), who came to Wikimedia Deutschland from the Karlsruhe Institute of Technology (KIT). Vrandečić was, together with Markus Krötzsch (formerly KIT, University of Oxford, presently Dresden University of Technology), the founder of the Semantic MediaWiki project. Since 2013, Vrandečić has been a Google employee; in addition, since summer 2015 he has been one of the three community-elected Wikimedia Foundation board members.

Microsoft co-founder Paul Allen's Institute for Artificial Intelligence provided half the funding for the initial development of Wikidata

Wikimedia Deutschland's original press release, dated 30 March 2012, said,

Wikidata thus has a dual purpose: it is designed to make housekeeping across the various Wikipedia language versions easier, and to serve as a one-stop data shop for sundry third parties.

To maximise third-party re-use, Wikidata—unlike Wikipedia—is published under the CC0 1.0 Universal licence, a complete public domain dedication that waives all author's rights, to the extent allowed by law. This means that re-users of Wikidata content are not obliged to indicate the source of the data to their readers.

In this respect Wikidata differs sharply from Wikipedia, which is published under the Creative Commons Attribution-ShareAlike 3.0 Unported Licence, requiring re-users of Wikipedia content to credit Wikipedia (attribution) and to distribute copies and adaptations of Wikipedia content only under the same licence (Share-alike).

Search engines take on a new role as knowledge providers

Google contributed a quarter of the initial funding for the development of Wikidata, which is now replacing Freebase as one of the sources for the Google Knowledge Graph

The March 30, 2012 announcement of the development of Wikidata was followed six weeks later, on May 16, 2012, by the arrival of a new Google feature destined to have far-reaching implications: the Google Knowledge Graph. Similar developments also happened at Microsoft's Bing. These two major search engines, no longer content to simply provide users with a list of links to information providers, declared that they wanted to become information providers in their own right.

The Google Knowledge Graph, Google said, would enable Internet users

The move makes sense from a business perspective: by trying to guess the information in which people are interested and making that information available on their own pages, search engines can entice users to stay on their sites for longer, increasing the likelihood that they will click on an ad—a click that will add to the search engine's revenue (in Google's case running at around $200 million a day).

Moreover, search engine results pages that do not include a Knowledge Graph infobox often feature ads in the same place where the Knowledge Graph is usually displayed: the right-hand side of the page. The Knowledge Graph thus trains users to direct their gaze to the precise part of a search engine results page that generates the operator's revenue. Alternatively, ads may also be (and have been) inserted directly into the Knowledge Graph itself.

Microsoft's Bing search engine has followed much the same path as Google with its "Snapshot" feature drawing on Wikimedia content

Microsoft's Bing followed a very similar development from 2012 onwards, with Bing's Satori-powered "Snapshot" feature closely mimicking the appearance and content of Google's Knowledge Graph. Bing has used some of the same sources as Google, in particular Wikipedia and Freebase, a crowdsourced database published under a Creative Commons Attribution Licence that was acquired by Google in 2010.

Neither Freebase nor Wikipedia really profited from this development. Wikipedia noted a significant downturn in pageviews that was widely attributed to the introduction of the Google Knowledge Graph, causing worries among Wikimedia fundraisers and those keen to increase editor numbers. After all, Internet users not clicking through to Wikipedia would miss both the Wikimedia Foundation's fundraising banners and a chance to become involved in Wikipedia themselves.

As for Freebase, Google announced in December 2014, a little over four years after acquiring the project, that it would shut it down in favour of the more permissively licensed Wikidata and migrate its content to Wikidata—Freebase's different Creative Commons licence, which required attribution, notwithstanding.

"The PR Pros and SEOs are Coming"

Freebase was widely considered a weak link in the information supply chain ending at the Knowledge Graph. Observers noted that search engine optimization (SEO) specialists were able to manipulate the Knowledge Graph by manipulating Freebase.

In a Wikidata Office Chat conducted on March 31, 2015, future Wikimedia Foundation board member Denny Vrandečić—juggling his two hats as a Google employee and the key thought leader of Wikimedia's Wikidata project—spoke about Google's transition from Freebase to Wikidata, explaining that Wikidata's role would be slightly different from the role played by Freebase:

Denny Vrandečić, the co-founder of the Semantic MediaWiki project, has to juggle three hats: he is a Google employee as well as a community-elected Wikimedia Foundation board member and the primary Wikidata thought leader

Noam Shapiro, writing in Search Engine Journal, drew the following conclusions from his review of this chat, focusing on the statements highlighted in yellow above:

Shapiro's point concerning spam and bias mentioned "the need for recognized references". This is a topic that we will shortly return to, because Wikidata seems to have adopted a very lax approach to this requirement.

The relationship between Wikidata and Wikipedia: Sources? What sources?

Citations to Wikipedia (blue) outnumber all other sources (red) together (yellow = unreferenced)

The fact that Wikidata and Wikipedia have what seems on the face of it incompatible licences has been a significant topic of discussion within the Wikimedia community. It is worth noting that in 2012, Denny Vrandečić wrote on Meta,

More recently, the approach seems to have been that because facts cannot be copyrighted, mass imports from Wikipedia are justified. The legal situation concerning database rights in the US and EU is admittedly fairly complex. At any rate, whatever licensing qualms Denny may have had about this issue at the time seem to have evaporated. If the original plan was indeed "not [...] to extract content out of Wikipedia at all", then the plan changed.

Bot imports from Wikipedia have long been the order of the day. In fact, in recent months contributors on Wikidata have repeatedly raised alarms about mass imports of content from various Wikipedias, believing that these imports compromise quality (the following quote, written by a non-native speaker, has been lightly edited for spelling and grammar):

The circular reference loop connecting Wikidata and Wikipedia

The result of these automated imports is that Wikipedia is today by far the most commonly cited source in Wikidata.

According to current Wikimedia statistics:

References to a Wikipedia do not identify a specific article version; they simply name the language version of Wikipedia. This includes many minor language versions whose referencing standards are far less mature than those of the English Wikipedia. Moreover, some Wikipedia language versions, like the Croatian and Kazakh Wikipedias, are not just less mature, but are known to have very significant problems with political manipulation of content.

Recall Shapiro's expectation above that spam and bias would be held at bay by the "need for recognized references". Wikidata's current referencing record seems unlikely to live up to that expectation.

Of course, allowances probably have to be made for the fact that some statements in Wikidata may genuinely not be in need of a reference. For example, in a Wikidata entry like George Bernard Shaw, one might expect to receive some sympathy for the argument that the statement "Given name: George" is self-evident and does not need a reference. Wikidata, some may argue, will never need to have 100 per cent of its statements referenced.

However, it does not seem healthy for Wikipedia to be cited more often in Wikidata than all other types of sources together. This is all the more important as Wikidata may not just propagate errors to Wikipedia, but may also spread them to the Google Knowledge Graph, Bing's Snapshot, myriad other re-users of Wikidata content, and thence to "reliable sources" cited in Wikipedia, completing the "citogenesis" loop.

Data are not truth: sometimes they are phantoms

Citogenesis

As the popularity of Wikipedia has soared, citogenesis has been a real problem in the interaction between "reliable sources" and Wikipedia. A case covered in May 2014 in The New Yorker provides an illustration:

It seems inevitable that falsehoods of this kind will be imported into Wikidata, eventually infecting both other Wikipedias and third-party sources. That this not only can, but does happen is quickly demonstrated. Among the top fifteen longest-lived hoaxes currently listed at Wikipedia:List of hoaxes, six (nos. 1, 2, 6, 7, 11 and 13) still have active Wikidata entries at the time of writing. The following table reproduces the corresponding entries in Wikipedia:List of hoaxes, with a column identifying the relevant Wikidata item and supplementary notes added:

Hoax: Jack Robichaux
Description: Fictional 19th‑century serial rapist in New Orleans
Length: 10 years, 1 month
Start date: July 31, 2005
End date: September 3, 2015
Links: Wikipedia:Articles for deletion/Jack Robichaux
Wikidata item: https://archive.is/Z6Gne
Note: The English Wikipedia link has been updated.

Hoax: Guillermo Garcia
Description: "Highly influential" but imaginary oil and forestry magnate in 18th-century South America
Length: 9 years, 10 months
Start date: November 17, 2005
End date: September 19, 2015
Links: Wikipedia:Articles for deletion/Guillermo Garcia (businessman)
Wikidata item: https://archive.is/0pprA

Hoax: Gregory Namoff
Description: An "internationally known" but nonexistent investment banker, minor Watergate figure, and U.S. Senate candidate.
Length: 9 years, 6½ months
Start date: June 17, 2005
End date: January 13, 2015
Links: Wikipedia:Articles for deletion/Gregory Namoff; Archive
Wikidata item: https://archive.is/urElB
Note: 10 months after the hoax article was deleted on Wikipedia, a user added "natural causes" as the manner of death on Wikidata.

Hoax: Double Hour
Description: Supposed German and American television show, covering historic events over a two-hour span.
Length: 9 years, 6 months
Start date: September 23, 2005
End date: April 4, 2015
Links: Double Hour (TV series) deletion log
Wikidata item: https://archive.is/rjFjw
Note: This item has only ever been edited by bots.

Hoax: Nicholas Burkhart
Description: Fictitious 17th-century legislator in the House of Keys on the Isle of Man.
Length: 9 years, 2 months
Start date: July 19, 2006
End date: September 26, 2015
Links: Wikipedia:Articles for deletion/Nicholas Burkhart
Wikidata item: https://archive.is/A0lt7

Hoax: Emilia Dering
Description: Long-lived article about a non-existent 19th century German poet started with the rather basic text "Emilia Dering is a famous poet who was Berlin,Germany on April 16, 1885" by a single-purpose account
Length: 8 years, 10 months
Start date: December 6, 2006
End date: October 6, 2015
Links: Emilia Dering deletion log; deleted via A7. On the day of the article's creation, a person claiming to be the granddaughter of Emilia Dering published a blog post with a poem supposedly written by her.
Wikidata item: https://archive.is/eNJbc

Using the last entry from the above list as an example, a Google search quickly demonstrates that there are dozens of other sites listing Emilia Dering as a German writer born in 1885. The linkage between Wikidata and the Knowledge Graph as well as Bing's Snapshot can only make this effect more powerful: if falsehoods in Wikidata enter the infoboxes displayed by the world's major search engines, as well as the pages of countless re-users, the result could rightly be described as citogenesis on steroids.

The only way for Wikidata to avoid this is to establish stringent quality controls, much like those called for by Kmhkmh above. Such controls would appear absent at Wikidata today, given that the site managed to tell the world, for five months in 2014, that Franklin D. Roosevelt was also known as "Adolf Hitler". If even the grossest vandalism can survive for almost half a year on Wikidata, what chance is there that more subtle falsehoods and manipulations will be detected before they spread to other sites?

Yet this is the project that Wikimedians like Max Klein, who has been at Wikidata from the beginning, imagine could become the "one authority control system to rule them all". The following questions and answers are from a 2014 interview with Klein:

Given present quality levels, this seems like a nightmare scenario: the Internet's equivalent of the Tower of Babel.

What is a reliable source?

An aardvark

A crowdsourced project like Wikidata becoming "the one authority control system to rule them all" is a very different vision from the philosophy guiding Wikipedia. Wikipedians, keenly aware of their project's vulnerabilities and limitations, have never viewed Wikipedia as a "reliable source" in its own right. For example, community-written policies expressly forbid citing one Wikipedia article as a source in another (WP:CIRCULAR):

Wikidata abandons this principle—doubly so. First, it imports data referenced only to Wikipedia, treating Wikipedia as a reliable source in a way Wikipedia itself would never allow. Secondly, it aspires to become itself the ultimate reliable source—reliable enough to inform all other authorities.

For example, Wikidata is now used as a source by the Virtual International Authority File (VIAF), while VIAF in turn is used as a source by Wikidata. In the opinion of one Wikimedia veteran and librarian I spoke to at the recent Wikiconference USA 2015, the inherent circularity in this arrangement is destined to lead to muddles which, unlike the Brazilian aardvark hoax, will become impossible to disentangle later on.

The implications of a non-attribution licence

Not an aardvark

The lack of references within Wikidata makes verification of content difficult. This flaw is only compounded by the fact that its CC0 licence encourages third parties to use Wikidata content without attribution.

Max Klein provided an insightful thought on this in the interview he gave last year, following Wikimania 2014:

Klein seems torn between his lucid rational assessment and his appeal to himself to "really believe in the Open Source, Open Data credo". Faith may have its rightful place in love and the depths of the human soul, but our forebears learned centuries ago that when you are dealing with the world of facts, belief is not the way to knowledge: knowledge comes through doubt and verification.

What this lack of attribution means in practice is that the reader will have no indication that the data presented to them comes from a project with strong and explicit disclaimers. Here are some key passages from Wikidata's own disclaimer:

Internet users are likely to take whatever Google and Bing tell them on faith. As a form of enlightenment, it looks curiously like a return to the dark ages.

When a single answer is wrong

Jerusalem—one of the most contested places on earth

This obscuring of data provenance has other undesirable consequences. An article published in Slate this week (Nov. 30, see this week's In the Media) introduces a paper by Mark Graham of the Oxford Internet Institute and Heather Ford of the School of Media and Communication at the University of Leeds. The paper examines the problems that can result when Wikidata and/or the Knowledge Graph provide the Internet public with a single, unattributed answer.

Ford and Graham say they found numerous instances of Google Knowledge Graph content taking sides in the presentation of politically disputed facts. Jerusalem, for example, is described in the Knowledge Graph as the "capital of Israel". Most Israelis would agree, but even Israel's allies (not to mention the Palestinians, who claim Jerusalem as their own capital) take a different view – a controversy well explained in the lead of the English Wikipedia article on Jerusalem, which tells its readers, "The international community does not recognize Jerusalem as Israel's capital, and the city hosts no foreign embassies." Graham provides further examples in Slate:

Ford and Graham reviewed Wikidata talk page discussions to understand the consensus forming process there, and found users warring and accusing each other of POV pushing—context that almost none of the Knowledge Graph readers will ever be aware of.

In Ford's and Graham's opinion, the envisaged movement of facts from Wikipedia to Wikidata and thence to the Google Knowledge Graph has "four core effects":

This is a remarkable reversal, given that Wikimedia projects have traditionally been hailed as bringing about the democratisation of knowledge.

Conclusions

Errors can always be fixed

From my observation, many Wikimedians feel problems such as those described here are not all that serious. They feel safe in the knowledge that they can fix anything instantly if it's wrong, which provides a subjective sense of control. It's a wiki! And they take comfort in the certainty that someone surely will come along one day, eventually, to fix any other error that might be present today.

This is a fallacy. Wikimedians are privileged by their understanding of the wiki way; the vast majority of end users would not know how to change or even find an entry in Wikidata. As soon as one stops thinking selfishly, and starts thinking about others, the fact that any error in Wikidata or Wikipedia can potentially be fixed becomes secondary to the question, "How much content in our projects is false at any given point in time, and how many people are misled by spurious or manipulated content every day?" Falsehoods have consequences.

Faced with quality issues like those in Wikidata, some Wikimedians will argue that cleverer bots will, eventually, help to correct the errors introduced by dumber bots. They view dirty data as a welcome programming challenge, rather than a case of letting the end user down. But it seems to me there needs to be more emphasis on controlling incoming quality, on problem prevention rather than problem correction. Statements in Wikidata should be referenced to reliable sources published outside the Wikimedia universe, just like they are in Wikipedia, in line with the WP:Verifiability policy.

Wikidata development was funded by money from Google and Microsoft, who have their own business interests in the project. These ties mean that Wikidata content may reach an audience of billions. It may make Wikidata an even greater honey pot to SEO specialists and PR people than Wikipedia itself. Wikis' vulnerabilities in this area are well documented. Depending on the extent to which search engines will come to rely on Wikidata, and given the observed loss of nuance in Knowledge Graph displays, an edit war won in an obscure corner of Wikidata might literally re-define truth for the English-speaking Internet.

If information is power, this is the sort of power many will desire. They will surely flock to Wikidata, swelling the ranks of its volunteers. It's a propagandist's ideal scenario for action. Anonymous accounts. Guaranteed identity protection. Plausible deniability. No legal liability. Automated import and dissemination without human oversight. Authoritative presentation without the reader being any the wiser as to who placed the information and which sources it is based on. Massive impact on public opinion.

... to rule them all

As a volunteer project, Wikidata should be done well. Improvements are necessary. But, looking beyond the Wikimedia horizon, we should pause to consider whether it is really desirable for the world to have one authority—be it Google or Wikidata—"to rule them all". Such aspirations, even when flying the beautiful banner of "free content", may have unforeseen downsides when they are realised, much like the ring of romance that was made "to rule them all" in the end proved remarkably destructive. The right to enjoy a pluralist media landscape, populated by players who are accountable to the public, was hard won in centuries past. Some countries still do not enjoy that luxury today. We should not give it away carelessly, in the name of progress, for the greater glory of technocrats.

One last point has to be raised: Denny Vrandečić combines in one person the roles of Google employee, community-elected Wikimedia Foundation board member and Wikidata thought leader. Given the Knowledge Graph's importance to Google's bottom line, there is an obvious potential for conflicts of interest in decisions affecting the Wikidata project's licensing and growth rate. While Google and Wikimedia are both key parts of the world's information infrastructure today, the motivations and priorities of a multi-billion-dollar company that depends on ad revenue for its profits and a volunteer community working for free, for the love of knowledge, will always be very different.



Andreas Kolbe has been a Wikipedia contributor since 2006. He is a member of the Signpost's editorial board. The views expressed in this editorial are his alone and do not reflect any official opinions of this publication. Responses and critical commentary are invited in the comments section.


Discuss this story


@Andreas Kolbe, thanks for an excellent article highlighting the pitfalls in Wikidata policies. Hope to see the foundation act on the issues. --Arjunaraoc (talk) 03:05, 7 December 2015 (UTC)

"Strong and explicit disclaimers" is, in practice, a joke, since I'm pretty sure if you took a random sample of visitors to Wikimedia projects 99% would be unaware that the disclaimers exist. From what I've read anecdotally, a sizable number of people think there's a paid staff that writes Wikipedia, or that the Foundation has editorial control over the projects. --71.119.131.184 (talk) 03:46, 7 December 2015 (UTC)

You're right: there are indeed many, many people still operating under these mistaken assumptions. On the other hand, Wikipedia's Wikipedia:General_disclaimer does average well over 3,000 views a day (currently ranking #1664 in traffic on en.wikipedia.org), and there has been substantial public discussion of the fact that an openly editable crowdsourced encyclopedia cannot be relied upon to present correct information at any given point in time. As long as the Knowledge Graph or Snapshot still contains the word "Wikipedia", at least some people will bear that in mind. The moment the attribution disappears, however, the chances of people doing that diminish. Andreas JN466 06:51, 7 December 2015 (UTC)

@Andreas Kolbe Thank you for an enlightening article. Especially significant: the loss of provenance (verifiability) due to clear violations of Wikipedia's generous but restrictive licensing terms, e.g., importing Wikipedia's CC BY-SA 3.0 licensed content (not facts, but claims of fact) without required attribution directly into Wikidata under the permissive CC0 public domain dedication. A great investment for Google and Microsoft, which have the financial means and technical infrastructure to continually analyze, refine, and commercially exploit Wikidata's now totally free crowd-sourced claims of fact without any community responsibilities whatsoever, other than those due to their shareholders. -- Paulscrawl (talk) 05:15, 7 December 2015 (UTC)

  • There are two different issues. One is data quality; for instance, unreferenced data. The other one is how people admit data as valid.
    As for quality, I'm worried about lack of references as mentioned above. But I'm also worried about corporate bias -Google and Microsoft are mentioned- as I do not bite the hand that feeds me (so I never ever edit about the company I work for, my personal policy).
    But as for data validation by users, that's a different case. I do not trust any statement based on a single unknown source. An extreme case: my mother (a female) was studying Medicine in 1960. The local census for her home town shows that the number of female university students in that place in 1960 was zero. The census is obviously wrong. Does it mean that censuses are always wrong? No, in fact they are mostly right. But they can be wrong. And a Spanish census is a quite well done official source of information. So if I-don't-know-who says that an avocado is a kind of Nepalese oceangoing vessel... well, I should double check. In wiki and out of wiki, pre-wiki, post-wiki, inter-wiki.
    Is information neutral? Are data? Not really, based on our own experience in life. We are just used to living in this kind of context. I know that saying Myanmar or Burma, Alboraya or Alboraia, football or soccer, are non-neutral decisions; we know what to expect from texts making such word use and we evaluate them accordingly. It is experience and prudence, the same things that keep us from being run over by a car when we cross the street. B25es (talk) 07:03, 7 December 2015 (UTC)
  • It is an opinion. It is severely flawed and, you know what happened to the ring that ruled them all. For want of a better world it was destroyed. This opinion demonstrates a total lack of understanding of what a wiki is and the quality that Wikidata brings. It deserves a rebuttal and I would love to write one. Thanks, GerardM (talk) 06:44, 7 December 2015 (UTC)
DarTar, Jayen466: It doesn't look like this discussion is taken seriously by wikidata devs like User:Markus Krötzsch, who prefers to tell so in the wikidata mailinglist-echo chamber. On the other hand, nobody cares to reject ludicrous arguments by GerardM that poisoning wikidata with bad data is no problem, just like carelessly poisoning the Rhine is apparently no problem downstream in the Netherlands because "shit happens and we can deal with that." Surreal. As a Wikipedian, i can tell you that i don't want to be forced to deal with the shit that happened at wikidata, thank you very much. --Atlasowa (talk) 12:40, 7 December 2015 (UTC)
Thanks for the pointer, Atlasowa. I hadn't actually seen that response from Markus; it didn't come through to the Wikimedia-l mailing list. Andreas JN466 13:47, 7 December 2015 (UTC)
I have now replied to Markus. --Andreas JN466 23:21, 7 December 2015 (UTC)
  • The actual and major point of Wikidata, as far as I can see (I have been working on it for a year and a few months) is that it is versatile. It is not for just one thing; it began as an interwiki index, but has moved quite a distance from that position. So I think we can pretty much forget about considerations based on the business interests of the original sponsors, for example. The scope is broad rather than narrow, and many people and institutions are going to find it useful. (I was in a GLAM meeting on Wednesday and the institution in question seemed to find it an eye-opener how much has already happened.) Another point is that Wikidata after three years is much like Wikipedia after three years, i.e. 2004 here. Which I remember quite well: it has the same feeling of a huge amount to do wherever you look. So, naturally, if you are picky you can find things to be picky about. Put another way, guidelines are not yet well developed, systems not in place. The Wikidata community seems to function quite reasonably, and that is a reason to be hopeful that issues will find solutions. The third point I'd like to make is that areas like "authority control" seem to be crying out for something like Wikidata - I have become familiar with VIAF through Wikidata work, and what Wikidata adds to that major system is already substantial, though in need of some checking because the early bot work was a bit careless about disambiguation. In fact I came up just recently with a thought (Wikidata is a database that "can do outreach") which made me conclude that the "linked structured data" model in use is a big advance. I have come in from the merging encyclopedias direction, and (via Magnus Manske) I have come to see that the old way of thinking in the "missing article" area is obsolescent, with Wikidata able to provide a much better environment for what can only be called digital scholarship. And also, for example, able to support editathons by supplying "redlink lists" of missing articles to work on. WP:WPDNB and its talk page archives show the emergence of some of the new thinking. It would be silly to ignore the real problems with data integrity on Wikidata; but the standards of referencing are going up, and one shouldn't use metrics that are somewhat naive to argue about that issue. Charles Matthews (talk) 07:22, 7 December 2015 (UTC)
    • Wikidata is the designated successor to Freebase, used as a source for SERP infoboxes by both Google and Microsoft. So I wouldn't say that there are no business interests involved: the impact of infobox features on users' interaction with search engine results pages is profound. 2004: One point people raised in the Wikimedia-l discussion was that Wikidata should take the lessons learned by Wikipedia in its early years on board, rather than replicating these errors. I find that argument fairly compelling. Referencing: Standards of referencing do seem to be going up – in June of this year, only 17% of Wikidata statements referenced what in Wikipedia would be considered a reliable source, and now it is 21% – but there is still a long way to go. Andreas JN466 08:26, 7 December 2015 (UTC)
      • What I meant was not that Wikidata is "decoupled" from business, which it isn't, but that I don't see the argument that it should be decoupled as particularly interesting (to me). Yes, I agree that the lessons of history are important, and my positive verdict on the Wikidata community factors in the way discussions are actually conducted, which seems much more helpful in practice (people generally less stubborn, for example). On referencing, looking at biographies which are about 20% of items, referencing vital dates is much more important than referencing occupations (say). It is interesting to see the efforts of the Library of Congress and Union List of Artist Names to reference dates, for example: these are major authoritative database sources, but they don't have as transparent a system as Wikidata now proposes. With 50% references on statements, we are in classic "glass half full/empty" territory anyway. What Wikidata has going for it is the ability, for example, to search for unreferenced death dates. The status quo, before Wikidata, was that such major databases could disagree, and no one pointed a finger at anybody. Charles Matthews (talk) 08:57, 7 December 2015 (UTC)
  • What I would like to see is any practical discussion of how Wikidata is useful, now. Because I am pretty sure that the promised "population of birth/death" dates feature, for example, has not yet happened. As a Wikipedia article writer, the only use I see we get out of Wikidata is a centralized repository for interlanguage links. Not the best investment for the mentioned 1+ million euros. --Piotr Konieczny aka Prokonsul Piotrus| reply here 07:39, 7 December 2015 (UTC)
    • Just a quick point regarding development costs: My understanding is that the mentioned 1.3 million Euros from the three sponsors funded initial development work begun in 2012, and that a substantial part of the movement's funds granted to Wikimedia Deutschland annually since then has supported further development of Wikidata (see related comments by the Funds Dissemination Committee quoted in last week's News and notes). Andreas JN466 08:05, 7 December 2015 (UTC)
@Piotrus: "Not the best investment for the mentioned 1+ million euros." Making it much, much less work to maintain a small Wikipedia sounds to me like a very good investment, for which millions of euros is small compared to the long-run benefits. There are other ways Wikidata benefits Wikipedia, including automatic generation of lists with Listeria (yes, some of these will be incomplete or incorrect, just like manually-generated Wikipedia lists). I find it useful to look up on one page how a term is represented in lots of different languages. Then stepping away from the "As a Wikipedia article writer..." to Wikisource, it's great that I can add metadata to a Wikisource author profile just by linking from Wikidata, without having to paste in and maintain an image or authority file links. Stepping away from the other Wikimedia projects entirely, Wikidata is already an awesome free knowledge project in its own right: the people I'm training at the University of Oxford are really impressed with Histropedia timelines, the Reasonator, the map interfaces, the ongoing integration of scholarly authority files. And that's just what's happening at what we all agree is a very early stage of Wikidata's evolution. Yes, millions of euros is a lot of money, but it needs to be seen in perspective of the value created, and in this context it's frankly not much. A case can be made that Wikidata will ultimately be more important than Wikipedia to the web as a whole. MartinPoulter (talk) 17:39, 8 December 2015 (UTC)[reply]

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

— Charles Babbage, Passages from the Life of a Philosopher (1864)
  • WikiProject Medicine participants and others are discussing an application of Wikidata in Wikipedia infoboxes at "Another reform proposal - split infobox into "human readable" and "non human readable" and call from Wikidata", or see this discussion archived. The article above is great. The "Poof it works" article on gene information from Wikidata in Wikipedia by Benjamin Good and team is still the best profile I have seen of this application. Blue Rasberry (talk) 13:09, 7 December 2015 (UTC)[reply]
  • A pitfall of that reasoning: references come from users, just as references on Wikipedia do. If you don't have any (serious) users, then you don't have people to correctly reference the facts. Then you don't have trustworthy data. And ... if you don't have data, then you don't have users. The most important thing to understand about Wikidata is that its quality will improve the more it is used. It will be used if there is data to use. Once Wikidata has been kickstarted, more and more Wikipedias will use the data, so more and more users will require sources for the data they have. But data won't come by itself, and we have to start somewhere to get that kickstart. TomT0m (talk) 14:40, 7 December 2015 (UTC)[reply]
    A corollary: there is an opposite, vicious circle: Wikidata doesn't have any other data. The big Wikipedias will continue to ignore the project because ... they have more and better data? Why would they bother? Then, if we stay like this ... small Wikipedias won't benefit from the data for their projects, and it's as if there were no Wikidata at all, and no community. TomT0m (talk) 14:44, 7 December 2015 (UTC)[reply]
  • @ Andreas Kolbe: thanks for the article, I do not agree with many of your opinions, but it's gold. So thank you. I think, like many people, that there are many topics involved here: CC0, the role of "over-the-top" companies like Google and Bing, data quality. I want to address just one bit, though: "the authority control to rule them all". I still think that Wikidata can be a "super authority control", because it is perfect as an aggregator of identifiers, and for things that are unique (like persons) it has already proven its worth. VIAF can check whether one of its authors is the same as one in other authority controls via Wikidata, using bots and a bit of AI. This is already useful and helpful, just because Wikidata is a place where you can import many authority controls and reconcile them with each other. I don't really see a problem here. Aubrey (talk) 15:07, 7 December 2015 (UTC)[reply]
  • Agree, of course: it is already happening. And enWP will benefit when those about to write a biographical article here routinely check for a Wikidata item, not only for existing language versions, but database links. And images, naturally. Could take a few years. Charles Matthews (talk) 15:31, 7 December 2015 (UTC)[reply]
  • One point to extend what you wrote about verification in Wikidata. Having dabbled over there earlier this year, unless its interface has radically changed in the last few months, I found adding references there to Wikipedias difficult & references to sources beyond Wikipedias practically impossible. And I say this as someone who is computer savvy. Documentation would help, but I suspect the ability to verify statements was added more as an afterthought than as part of the original design. (For one thing, it's easy to add links to other Wikipedia nodes, which represent notable items; however, most sources, either primary or secondary, are not & will not be notable per Wikipedia consensus.) -- llywrch (talk) 16:33, 7 December 2015 (UTC)[reply]
    On the technical point, adding references to a Wikidata statement is straightforward once you know the drill. There are "reference URL" (diff) and "stated in" options, and I use these all the time. Also "imported from" for a Wikipedia import. Charles Matthews (talk) 06:37, 8 December 2015 (UTC)[reply]
    No, they are notable on Wikidata, as they "fulfil a structural need" per d:WD:N. Something does not necessarily have to be notable on a Wikipedia to be notable on Wikidata, although anything that has an article on any Wikipedia is notable on Wikidata. Sourcing was taken into account from the beginning, although it takes time to be well implemented, like anything else on Wikidata.
    To ease sourcing on Wikidata there are also projects; for example, I just received a mail through the Wikidata mailing list about Strep Hit, which just got accepted as an IEG Grant. TomT0m (talk) 18:08, 7 December 2015 (UTC)[reply]
  • Well written paper, but useless. We could write the same paper about Wikipedia. But the community will reply : it's a wiki, so fix it, it's a project of encyclopedia, a work in progress, etc. Pyb (talk) 20:16, 7 December 2015 (UTC)[reply]
    An important point here is that Wikidata (in particular from the central storage perspective) is a different project and has different requirements. In that sense it is not just a wiki and should not be treated as such; it is a central store whose errors multiply throughout the system, and hence it needs even stricter requirements for its data than Wikipedia.--Kmhkmh (talk) 00:14, 8 December 2015 (UTC)[reply]
    That's just focusing on the wrong side of the coin. An error on enwiki is broadcast to all enwiki readers and to DBpedia, so ... it's not really much different. We can also count on the fact that a more visible error will be corrected faster than one buried in a rarely read article, so an error in Wikidata propagated to several wikis will be corrected faster overall than an isolated error in one Wikipedia. Lastly, Wikidata as a central repository actually has its own error detection mechanisms, helped by the effort put into structuring the data, which add to the sum of all error detection mechanisms in the local wikis. So overall, this fast propagation of errors might be more than compensated for by the advantages of centralizing data, which pools all the efforts to improve quality in the short and long run. TomT0m (talk) 08:46, 8 December 2015 (UTC)[reply]
  • Regarding the graph at the beginning, many statements on Wikidata are self-evident and don't really need sources (i.e. Authority control, instance of, sex or gender, etc.) Of course, lots of statements on Wikidata do need sources and it's a lot harder to get people to provide them than it is on Wikipedia, but I feel this graph misrepresents the project. FallingGravity (talk) 04:54, 8 December 2015 (UTC)[reply]
    • It's official WMF data. References are clearly more important in some cases than in others. I'm not losing sleep over the fact that there is no reference in Wikidata supporting the assertion that the mother of Jesus was "Mary" (or that Mary was female). But in fact Wikidata has four references for Barack Obama being male, for example. Andreas JN466 14:05, 8 December 2015 (UTC)[reply]
  • Very interesting article. Thank you! -- œ 12:20, 8 December 2015 (UTC)[reply]
  • If one were to count the sentences contained in Wikipedia, add to that all the infobox elements, and divide the sum by the references on Wikipedia, the result would be far worse than for Wikidata in my opinion. - And just to offer a different view of the data: The amount of references within the Wikipedia-Wikidata ecosystem is increasing in absolute numbers. - The numbers in the graph are a fact, but it is odd that the author assumes that his reading of the graph is the intrinsic truth of the numbers, which science knows does not exist. --Tobias1984 (talk) 16:51, 8 December 2015 (UTC)[reply]
  • As a Wikidatian, I agree with some criticisms in the article. The problem for me is that the initial objective of WD has been forgotten: to be the reference database of Wikipedia. People in WD are playing their own game without any consideration of the final data users and their requirements. I have the impression that people are playing with data import because they can do it and not because they have an objective. They only want to fill memory without any thought about the use of that data. The license is a problem too, and I think we missed an important step when the choice of the license was made. CC0 just means you can't access most of the reference data, because the minimal license is CC BY-SA in most of the official databases. Snipre (talk) 01:25, 9 December 2015 (UTC)[reply]
    • Snipre, I find the following interesting: Google said, "When we publicly launched Freebase back in 2007, we thought of it as a "Wikipedia for structured data." So it shouldn't be surprising that we've been closely watching the Wikimedia Foundation's project Wikidata[1] since it launched about two years ago. We believe strongly in a robust community-driven effort to collect and curate structured knowledge about the world, but we now think we can serve that goal best by supporting Wikidata -- they’re growing fast, have an active community, and are better-suited to lead an open collaborative knowledge base. So we've decided to help transfer the data in Freebase to Wikidata, and in mid-2015 we’ll wind down the Freebase service as a standalone project. Freebase has also supported developer access to the data, so before we retire it, we’ll launch a new API for entity search powered by Google's Knowledge Graph. Loading Freebase into Wikidata as-is wouldn't meet the Wikidata community's guidelines for citation and sourcing of facts -- while a significant portion of the facts in Freebase came from Wikipedia itself, those facts were attributed to Wikipedia and not the actual original non-Wikipedia sources. So we’ll be launching a tool for Wikidata community members to match Freebase assertions to potential citations from either Google Search or our Knowledge Vault[2], so these individual facts can then be properly loaded to Wikidata. We believe this is the best first step we can take toward becoming a constructive participant in the Wikidata community, but we’ll look to continually evolve our role to support the goal of a comprehensive open database of common knowledge that anyone can use."
    • Wikidata would seem to me to be doing exactly what Freebase did, i.e. cite Wikipedia and not the external sources, and it is interesting that Google thought this disqualified Freebase from being imported directly. Andreas JN466 04:01, 9 December 2015 (UTC)[reply]
  • The CC0 license is a non-issue. Data is not copyrightable in the US (or most of the world for that matter), so there is no way to require attribution regardless of what license you want to stick on the site. Kaldari (talk) 03:48, 9 December 2015 (UTC)[reply]

A letter to Andreas

Andreas,

I understand that this is an opinion piece, and not an article written based on actual research, but I am still disappointed by the fact that you, although we were in an active discussion last week, did not spend the time on actually counterchecking your conjectures with me or anyone else. Independently of whether this article raises some important questions or not - and I think it does, but they are well buried in long and meandering prose - it contains plenty of falsehoods, which could have easily been dispelled by simply asking. Since you, Andreas, are on the Editorial Board of the Signpost, I don't assume that there is anything that can be done in order to ensure any basic fact-checking or vetting for critical pieces like this one, although I think it would be a display of respect and decency towards our volunteer-led projects.

To name just a few of the obvious falsehoods:

  1. Wikidata was not, as you write, "designed to deliver data for the Google Knowledge Graph". Wikidata was, first and foremost, designed to support the Wikimedia projects. If it were designed to deliver data to the Google Knowledge Graph, it would look very different. The data models of both are rather different - why go through that pain? Because the requirements for Wikidata that I wrote down are very different. I had a few very intense months discussing the foundations of the data model with Markus Krötzsch and Daniel Kinzler - you are free to ask them both how often compatibility with Google was mentioned.
  2. In a discussion of this article on the German Wikipedia, you explicitly say what you also insinuate here: that you expect that Google and Microsoft "made their preference clear" regarding the license of Wikidata. This was not the case. Neither of them at any point in time had any influence on the question of the license. Erik Möller, back then Deputy Director of the Foundation, and I, back then Wikidata director, came independently to the conclusion that CC0 was the best choice for a license. My opinion about data licensing is recorded, and I had some furious discussions with researchers in the Semantic Web area on that topic - if you want, I can point you to them, they will surely remember. This predates my employment with Google and also my employment with Wikimedia Deutschland.
  3. You state that Microsoft donated towards the development of Wikidata. To the best of my knowledge, this never happened.

There are many more issues, but I'll leave it at that for now.

I welcome a critical piece - in particular when it touches upon important problems. I think there has to be a proper conversation about those, and some of these issues need to be made more explicit, in order to find solutions for them in a wider societal context. But the way you present them here - buried and mixed with a number of conspiracy theories and a dismissive, disrespectful tone towards a volunteer-driven project - I simply don't think that this is a good or even effective way to start this conversation. This is similar to the way Mark Graham keeps writing about these issues: I think it is extremely unfortunate that his latest piece in Slate was buried in comments about his unfortunate choice of example, and that this entirely overshadowed his message - a message that I indeed consider important, as I have told Mark repeatedly.

So, to make it very explicit: I welcome critical articles on myself and my work. I was available and reachable to answer questions beforehand, in order to ensure that basic, and often merely tangential, errors are avoided, which might distract from the substantial points. I don't expect you or anyone to simply believe what I say, but I would have at least expected, and hoped for, a chance to explain myself, offer my memories of events, talk about these issues, and maybe point to a few things that you have missed. I would have expected this basic respect from someone who is collaborating with me on Wikimedia and has the same goal. I am saddened by the fact that instead you choose to call one of our projects nonsense, to insinuate that I have participated in a conspiracy in order to weaken our projects, and that, instead of helping to fix the issues on Wikidata, you write about them outside of the project. If you think that bashing Wikimedia projects on The Register, on Wikipediocracy, or on sister projects is the most effective way to correct them, then I have to admit that I disagree.

In your header you promise to "suggest corrective action". Unfortunately, you seem to have forgotten about this by the end of the article. But I guess it was too much to hope that you would keep your own promise of a constructive contribution. --denny vrandečić (talk) 05:50, 9 December 2015 (UTC)[reply]

Denny, are you, personally, or in your capacity as a Google employee, concerned about the unreliability of wikidata? I'd also appreciate your thoughts on the appropriateness of you sitting on Wikimedia's board and being a thought leader at Wikidata, while being paid by Google.
Regarding your third point, in this article Andreas says, "Half the money came from Microsoft co-founder Paul Allen's Institute for Artificial Intelligence (AI2)."
Regarding Andreas's suggestion for corrective action, I thought he was pretty clear, and I agree with him: "...if falsehoods in Wikidata enter the infoboxes displayed by the world's major search engines, as well as the pages of countless re-users, the result could rightly be described as citogenesis on steroids. The only way for Wikidata to avoid this is to establish stringent quality controls, much like those called for by Kmhkmh above." --Anthonyhcole (talk · contribs · email) 14:28, 9 December 2015 (UTC)[reply]
denny vrandečić, I suggest you write an op-ed in one of the next Signposts, succinctly addressing all the issues you have with Andreas' article, point by point. Right now I am confused as to who is right and what really happened. The matter is too important for WP to be buried on this discussion page IMO. Would that be OK, Andreas? --Wuerzele (talk) 02:30, 10 December 2015 (UTC)[reply]
Of course, Wuerzele. It's important to have a debate about this, and there should be a rebuttal from Lydia in the upcoming issue. Andreas JN466 11:55, 10 December 2015 (UTC)[reply]
Denny, the very press release announcing Wikidata, quoted in the op-ed, said it was "expected to be beneficial for numerous external applications, especially for annotating and connecting data in the sciences, in e-Government, and for applications using data in very different ways". Given that two search engines (Google and Yandex) have sponsored the project's development, along with the institute of the co-founder of Microsoft, which also operates a major search engine, do you really expect us to believe that use by these players' search engines wasn't on anyone's mind? Those are the people that paid for the project at its beginning! And I wonder why you, as a Google employee "working at the Knowledge Graph", are even working on this project. I don't believe you are doing it all in your spare time.
Yesterday, Markus Krötzsch altered the description of this page on his university's website to remove the subclause "which gives Wikidata a prominent role as an inut for Google Knowledge Graph." On the mailing list, he now apologises that "this quickly-written intro on the web has misled" me. (Here is what the page looked like before.) Yet you yourself stated on IRC that Wikidata would be a source for the Knowledge Graph, even if the linkage may end up being less direct than it was with Freebase.
I'll grant you that the wording "designed to deliver data for the Google Knowledge Graph" in the first image caption is grammatically ambiguous, in that it can be read to refer either to Freebase or to Wikidata. If you'd like me to rephrase it, that's something we can look at. But note that there are innumerable references online to Google's having designated Wikidata as the successor to Freebase—which was indeed used by Google to deliver data for the Knowledge Graph—and to the intended and ongoing transfer of its Freebase data to Wikidata. This includes public communications from Google's own public sources:
So we've decided to help transfer the data in Freebase to Wikidata, and in mid-2015 we’ll wind down the Freebase service as a standalone project. Freebase has also supported developer access to the data, so before we retire it, we’ll launch a new API for entity search powered by Google's Knowledge Graph. Loading Freebase into Wikidata as-is wouldn't meet the Wikidata community's guidelines for citation and sourcing of facts -- while a significant portion of the facts in Freebase came from Wikipedia itself, those facts were attributed to Wikipedia and not the actual original non-Wikipedia sources. So we’ll be launching a tool for Wikidata community members to match Freebase assertions to potential citations from either Google Search or our Knowledge Vault[2], so these individual facts can then be properly loaded to Wikidata.
The last point is particularly interesting: Loading Freebase into Wikidata as-is wouldn't meet the Wikidata community's guidelines for citation and sourcing of facts -- while a significant portion of the facts in Freebase came from Wikipedia itself, those facts were attributed to Wikipedia and not the actual original non-Wikipedia sources. I see no sign of these community guidelines for citation and sourcing of facts preventing as-is import of Freebase data, because what Wikidata is now doing is exactly what Freebase did: importing data that are "attributed to Wikipedia, and not the actual original non-Wikipedia sources." By the million.
Moreover, you said yesterday on the German Wikipedia that you stand by your statement that Wikidata under its CC0 licence should not import content from Share-Alike sources. Yet Wikipedia is a Share-Alike source, and you're importing data from it. How and why is Wikipedia different?
Yes, we were in discussion last week. Your argument consisted in pointing me to the lead of a featured article and stating that this did not include references either, and therefore Wikipedia was not so much better than Wikidata. On the basis of the lack of citations in that article lead, you suggested that "Much more than half of all claims in Wikipedia are without reference, probably much more than 90%", which is a quite bizarre statement to make. I can only interpret it as indicative of the fact that you have only made very limited content contributions to this project. I am the first to admit that Wikipedia has its own problems, but to believe that 90% of all claims in Wikipedia are unreferenced shows that you are out of touch with sourcing practices in Wikipedia. Actually, after clicking random article a couple of dozen times, I'd like to concede that you are probably correct on this point. I made you aware that per WP:LEADCITE, facts are sourced in the main body of an article rather than the lead, told you that I thought it was vital for Wikidata content to be referenced to external sources, and expressed my concern that unreliable content in Wikidata would be spread far and wide on the Internet, including by Google. You didn't respond.
As for the Register headline, this was written by the publication's editors, not by me. But frankly, a project that contains data items on fictitious personalities months after these hoaxes have been discovered on Wikipedia and that tells the world for half a year that Roosevelt was also called Adolf Hitler does indeed contain embarrassing "nonsense".
The specific corrective action I suggested in the op-ed is this: there needs to be more emphasis on controlling incoming quality, on problem prevention rather than problem correction. Statements in Wikidata should be referenced to reliable sources published outside the Wikimedia universe, just like they are in Wikipedia, in line with the WP:Verifiability policy. As it stands, you're not even indicating which article and article version a particular statement was taken from. "Latvian Wikipedia" is not a functional reference.
There is now an RfC on Wikidata on the question whether at least the particular article version should be indicated and linked in Wikidata. While this is not as good as a (verified!) external reference, it would be a minimal improvement over how things are currently done. Frankly, I'm flabbergasted at how the present import practice could even develop and be countenanced by the project's leaders.
I believe I also made sufficiently clear that I have concerns about the CC0 licence and would prefer to see something requiring re-users to attribute the material to Wikidata, just as Bing today attributes Snapshot content to Freebase. I understand we are unlikely to agree on this issue, which is fine.
I look forward to reading Lydia's rebuttal. Regards, --Andreas JN466 11:55, 10 December 2015 (UTC)[reply]

References to scholarly sources

Saw this only now and would like to point out that WikiProject Source MetaData is working on a system that would facilitate adding scholarly references to statements in Wikidata. In parallel, the Open Access Signalling project works towards importing the full text of openly licensed articles into Wikisource, which would eventually allow statements in Wikipedia, Wikidata or elsewhere to be supported by deeplinking to the precise statement in the Wikisource copy of the scholarly article, thereby helping WP:V. Both projects would welcome additional contributors. -- Daniel Mietchen (talk) 21:50, 13 December 2015 (UTC)[reply]

Circular sourcing with images

This is a file from commons, with a source description.

As, ahem, an illustration of the problem, there are now several new files on Commons that give as a source: wikiwand [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. But Wikiwand is just a mirror of the Wikipedias! (aka "software interface developed for viewing Wikipedia articles" ^^) And now this is given as the source?!? These are uploads from November and December 2015, and if you look at their traffic [15][16], it is driven by search, and the problem is getting bigger over time. Crap. --Atlasowa (talk) 20:42, 20 December 2015 (UTC)[reply]

there was another one

thanks andreas, there was another one: Nam Nguyễn Thành. fascinating that there are bots to create, but not to remove :) --ThurnerRupert (talk) 06:11, 14 February 2016 (UTC)[reply]

Thanks, ThurnerRupert. At the time, I confess I only looked through the top-15 or so longest-lasting hoaxes on the list. There may be more. ;) Of course, deleting stuff off Wikidata doesn't do anything to fix other sites: http://www.footballdatabase.eu/football.joueurs.nguyen-thanh.nam.228406.en.html http://howold.co/nam-nguy-n-thanh etc. If those are all based on the same Wikipedia hoax, then we're creating phantoms. Speaking of which, this is an excellent read, detailing a similar story of Wikipedia corrupting knowledge. Andreas JN466 13:58, 16 February 2016 (UTC)[reply]


The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0