This is the third in a series of recent Signpost op-eds about Wikidata, including "Wikidata: the new Rosetta Stone" and "Whither Wikidata?".
Building blocks of Wikidata's quality
Wikidata recently celebrated its third birthday. In these three short years it has managed to become one of the most active Wikimedia projects, won prizes, and is starting to show its true potential for improving Wikipedia. It is being used more and more, both inside and outside Wikimedia, every day. At the core of Wikidata is the desire to give more people more access to more knowledge. That is the standard we should be held accountable to. And I am the first to admit that we still have a long way to go.
What beliefs are at the core of Wikidata? Is it a database like any other?
We built Wikidata with a few core beliefs in mind, and they shine through everywhere. The most fundamental one is that the world is complicated and there is no single truth, especially in a knowledge base that is supposed to serve many cultures. This belief is expressed in many decisions, big and small:
- Wikidata allows you to express many different points of view about the same data point, and they can live side by side. It allows you to express much more nuance than any other database I know.
- Wikidata is not about the truth but about what other sources say. When different sources claim different things, we can record them and expose them to the reader to interpret and decide.
- Wikidata doesn’t restrict you. You can say that a city has a cat as a mayor. (And yes, this really happened.)
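To make this concrete, here is a much-simplified sketch, in plain Python, of the kind of statement model that lets conflicting claims coexist: each claim carries its own references and a rank, so nothing has to "win" silently. The property ID, population figures, and source names are invented for illustration; this is not Wikidata's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of a Wikidata-style statement:
# each claim carries its own sources and a rank, so conflicting
# values can live side by side instead of overwriting each other.

@dataclass
class Statement:
    property_id: str          # e.g. "P1082" (population)
    value: object
    rank: str = "normal"      # "preferred", "normal", or "deprecated"
    references: list = field(default_factory=list)

def best_statements(statements):
    """Return the claims a reader would see first: preferred ones if
    any exist, otherwise all normal-ranked ones. Deprecated claims
    stay in the data but are not surfaced."""
    preferred = [s for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.rank == "normal"]

# Two sources disagree about a city's population; both are recorded.
claims = [
    Statement("P1082", 1_200_000, rank="preferred",
              references=["2020 national census"]),
    Statement("P1082", 1_150_000, rank="normal",
              references=["2015 municipal estimate"]),
]

print([s.value for s in best_statements(claims)])  # -> [1200000]
```

The point of the rank mechanism is that the second figure is not deleted: a reader or re-user can still see what the other source says and decide for themselves.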
All this comes at a cost. My life would be a lot easier if we had decided to just build a simple yet stupid database ;-) But we went this way to allow for a more pluralistic worldview, because we believe that is crucial in a knowledge base that supports all the Wikimedia projects and more.
The goal here is to describe the world in a useful way. Even with the possibilities we have built into Wikidata, it will not be possible to truly represent the whole complexity of the world. Natural language, and thus Wikipedia, is much better suited to that and will continue to be. But there is value in a knowledge base for the many pieces of information we encounter every day that do not require that level of nuance. A lot of great things are already being built using data from Wikidata.
Structured data is changing the world around us right now. And I am working towards having a free and open project at the center of it that is more than a dumb database.
Is Wikidata’s data bad? Is Wikipedia’s data better? Does it matter?
For Wikidata to truly give more people more access to more knowledge, the data in Wikidata needs to be of high quality. Right now, no one denies that the quality of the data in Wikidata is not yet as good as we would like it to be and that there is still a lot of work to do. Where opinions differ is on how to get there. Some say adding more data is the way to go, as that will lead to more use and thereby more contributions. Others say removing data and re-adding it with more scrutiny is the only way. Still others say we should improve what we have and make usage more attractive. All of these positions have merit, depending on where you are coming from. At the end of the day, what will decide the question is action based on community consensus. Data quality is a topic close to my heart, so I have been thinking a lot about this. We are tackling it from many different angles:
More eyes on the data: The belief behind this is that the more people are exposed to data from Wikidata, the better its quality will become. To achieve this, we have already done quite a bit of work, including improving the integration of Wikidata’s changes into the watchlists and recent changes on Wikipedia and the other Wikimedia projects. Next, we are building the ArticlePlaceholder extension and automated list articles for Wikipedia based on the data in Wikidata. We will additionally make it easier for third parties to re-use the data in Wikidata. We will also look into more streamlined processes that let data re-users report issues easily, to create good feedback loops.
Automatically find and expose issues: The belief behind this is that to handle the large amount of data in Wikidata, we need tools that support the editors in their work. These automatic tools detect potential issues and make editors aware of them, so they can look into them and fix them as appropriate. To achieve this, we already have internal consistency checks (to easily spot issues like people who are older than 150 years, or an identifier for an external database that has the wrong format). We have also worked on checking Wikidata’s data against other databases and flagging inconsistencies for editors to investigate. Furthermore, more and more visualizations are emerging that make it easier to get an overview of a larger part of the data and spot outliers and gaps. Probably the most important part is machine-learning tools like ORES that help us find bad edits and other issues. We made great progress in this area in 2015 and will realize more of this potential in 2016. Overall, the fact that Wikidata consists of structured data makes it much easier to automatically find and fix issues than on Wikipedia.
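A minimal sketch of the kind of internal consistency check described above might look like this. The age limit matches the example in the text; the identifier format, field names, and `check_item` helper are invented for illustration and are not Wikidata's actual checking code:

```python
import re
from datetime import date

# Hypothetical, simplified checks in the spirit of Wikidata's
# constraint reports: flag implausible ages and malformed external
# identifiers so editors can review and fix them.

MAX_PLAUSIBLE_AGE = 150
# Invented format: pretend the external database uses IDs like "AB-12345".
EXTERNAL_ID_FORMAT = re.compile(r"^[A-Z]{2}-\d{5}$")

def check_item(item):
    """Return a list of human-readable issues for one item dict."""
    issues = []
    born, died = item.get("birth_year"), item.get("death_year")
    if born is not None:
        end = died if died is not None else date.today().year
        if end - born > MAX_PLAUSIBLE_AGE:
            issues.append(f"implausible age: {end - born} years")
    ext_id = item.get("external_id")
    if ext_id is not None and not EXTERNAL_ID_FORMAT.match(ext_id):
        issues.append(f"malformed external identifier: {ext_id!r}")
    return issues

# A person born in 1820 with no death date and a badly formed ID
# triggers both checks.
print(check_item({"birth_year": 1820, "external_id": "xy99"}))
```

Crucially, such tools only surface candidates for review; the decision of what is actually wrong stays with the editors.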
Raise the number of references: The belief behind this is that we should have references for many of the statements in Wikidata, so people can verify them as needed. This is also important to stay true to our initial goal of stating what other sources say. We have recently made it easier to add references, which will hopefully lead to more people adding them. More will be done in this area. The primary sources tool helps by suggesting references for existing statements, and the recently accepted IEG grant for StrepHit will boost this even further. Last but not least, there is a rather active group of editors working on WikiProject Source MetaData. All of this will help us raise the number of referenced statements in Wikidata. We have already seen the share of referenced statements increase massively, from 12.7% to 20.9%, over the past year because of these measures as well as a change in attitude.
Encourage great content: Wikidata as a project needs processes that lead to great content. It starts with valuing high-quality contributions more and highlighting our best content. We have had showcase items for a while now, which are supposed to put a spotlight on our best items. That process is currently being changed to make it run more smoothly and encourage more participation.
Make quality measurable: We are working on various metrics to meaningfully track the quality of Wikidata’s data. So far the easiest and most-used metric is the number of references Wikidata has, and how many of those refer to a source outside Wikimedia. We should, however, take into account that Wikidata also has a very significant number of trivial, self-evident, or editorial statements that do not need a reference. One example is the link to an item’s image on Wikimedia Commons; another is that more than three million statements are simply "instance of: human"! The percentage of references to other Wikimedia projects is especially high for these trivial statements, while the percentage of references to better sources is much higher for non-trivial statements like population data. The existing metric is too simplistic to truly capture what quality on Wikidata means. We need to dive deeper and look at quality from many more angles. This will include things like regular checks of a small random subset of the data.
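The more nuanced metric this paragraph argues for could be sketched as follows. The trivial-property list, the domain heuristic, and the sample data are illustrative assumptions, not an official Wikidata metric:

```python
# Hypothetical sketch of a reference metric that sets aside trivial
# statements (which need no source) and counts only references
# pointing outside Wikimedia as "external".

TRIVIAL_PROPERTIES = {"P18", "P31"}  # e.g. Commons image, "instance of"

def reference_quality(statements):
    """statements: list of (property_id, reference_domains) pairs.
    Returns the share of non-trivial statements citing at least one
    non-Wikimedia source, or None if there are no non-trivial ones."""
    non_trivial = [refs for prop, refs in statements
                   if prop not in TRIVIAL_PROPERTIES]
    if not non_trivial:
        return None
    external = sum(
        1 for refs in non_trivial
        if any("wikimedia" not in d and "wikipedia" not in d for d in refs)
    )
    return external / len(non_trivial)

sample = [
    ("P31", []),                        # instance of: human, trivial
    ("P1082", ["stats.example.gov"]),   # population, external source
    ("P569", ["en.wikipedia.org"]),     # birth date, internal reference only
]
print(reference_quality(sample))  # -> 0.5
```

A naive count over the same sample would report one referenced-with-external-source statement out of three; excluding the trivial claim gives the more honest one out of two.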
All of these building blocks are being worked on or are already in place. Already today, in its arguably imperfect state, Wikidata is helping Wikipedia raise its quality by surfacing longstanding issues that only became apparent because of Wikidata: a Wikipedia having two articles about the same topic without being aware of it, or two Wikipedias having different data about a person without any useful reference. Wikidata gives us a good way to finally expose and correct these mistakes. Once we have a data point and a good reference for it on Wikidata, it can be scrutinised more thoroughly and then used much more widely than before.
Trust and believing in ourselves
Do we trust our own model and way of working? Wikipedia started in much the same way as Wikidata: it didn’t have high-quality data, and it certainly didn’t have a lot of references for its articles. But with a lot of dedicated work this changed, and today the Wikipedias (at least the biggest ones!) are of fairly high quality. I see no reason why we can’t do this once again for Wikidata, with an amazing community, better tools at hand, and the lessons we have learned on Wikipedia. But let’s also not fall into the trap of demanding perfection.
What do we do now?
- Encourage more re-users of Wikidata’s data to give their users a way back to Wikidata. Histropedia and Inventaire are two re-users already doing that, and it is a mutually beneficial partnership.
- Make it easier to use Wikidata’s data inside and outside of Wikimedia.
- Improve existing quality tools around Wikidata and make more use of them.
- Make existing knowledge-diversity tools easier to use, promote them more, and make more use of them.
- Make the outside world more aware of knowledge diversity and plurality.
- Increase the diversity in our contributor base to cover more cultures and worldviews.
At the end of the day, Wikidata is a chance to raise the quality bar across all our projects together. Let’s make it a reality. That’s how we give more people more access to more knowledge every day.
- Lydia Pintscher is the Product Manager for Wikidata at Wikimedia Deutschland.