This is the third in a series of recent Signpost op-eds about Wikidata, including "Wikidata: the new Rosetta Stone" and "Whither Wikidata?".
Building blocks of Wikidata's quality
Wikidata recently celebrated its third birthday. In these three short years it has managed to become one of the most active Wikimedia projects, won prizes, and is starting to show its true potential for improving Wikipedia. It is being used more and more, both inside and outside Wikimedia, every day. At the core of Wikidata is the desire to give more people more access to more knowledge. That is the standard we should be held accountable to. And I am the first to admit that we still have a long way to go.
What beliefs are at the core of Wikidata? Is it a database like any other?
We built Wikidata with a few core beliefs in mind, and they shine through everywhere. The most fundamental one is that the world is complicated and there is no single truth, especially in a knowledge base that is supposed to serve many cultures. This belief is expressed in many decisions, big and small:
- Wikidata allows you to express many different points of view about the same data point, and they can live side by side. It allows you to express much more nuance than any other database I know.
- Wikidata is not about the truth but about what other sources say. When different sources claim different things, we can record them and expose them to the reader to interpret and decide.
- Wikidata doesn’t restrict you. You can say that a city has a cat as a mayor. (And yes, this really happened.)
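To make this concrete, here is a much-simplified sketch, in plain Python, of the kind of statement model that lets conflicting claims coexist: each claim carries its own references and a rank, so nothing has to "win" silently. The property ID, population figures, and source names are invented for illustration; this is not Wikidata's actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of a Wikidata-style statement:
# each claim carries its own sources and a rank, so conflicting
# values can live side by side instead of overwriting each other.

@dataclass
class Statement:
    property_id: str          # e.g. "P1082" (population)
    value: object
    rank: str = "normal"      # "preferred", "normal", or "deprecated"
    references: list = field(default_factory=list)

def best_statements(statements):
    """Return the claims a reader would see first: preferred ones if
    any exist, otherwise all normal-ranked ones. Deprecated claims
    stay in the data but are not surfaced."""
    preferred = [s for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.rank == "normal"]

# Two sources disagree about a city's population; both are recorded.
claims = [
    Statement("P1082", 1_200_000, rank="preferred",
              references=["2020 national census"]),
    Statement("P1082", 1_150_000, rank="normal",
              references=["2015 municipal estimate"]),
]

print([s.value for s in best_statements(claims)])  # -> [1200000]
```

The point of the rank mechanism is that the second figure is not deleted: a reader or re-user can still see what the other source says and decide for themselves.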
All this comes at a cost. My life would be a lot easier if we had decided to just build a simple yet stupid database ;-) But we went this way to allow for a more pluralistic worldview, because we believe that is crucial in a knowledge base that supports all the Wikimedia projects and more.
The goal here is to describe the world in a useful way. Even with the possibilities we have built into Wikidata, it will not be possible to truly represent the whole complexity of the world. Natural language, and thus Wikipedia, is much better suited to that and will continue to be. But there is value in a knowledge base for the many pieces of information we encounter every day that do not require that level of nuance. A lot of great things are already being built using data from Wikidata.
Structured data is changing the world around us right now. And I am working towards having a free and open project at the center of it that is more than a dumb database.
Is Wikidata’s data bad? Is Wikipedia’s data better? Does it matter?
For Wikidata to truly give more people more access to more knowledge, the data in Wikidata needs to be of high quality. Right now, no one denies that the quality of the data in Wikidata is not yet as good as we would like it to be and that there is still a lot of work to do. Where opinions differ is on how to get there. Some say adding more data is the way to go, as that will lead to more use and thereby more contributions. Others say removing data and re-adding it with more scrutiny is the only way. Still others say we should improve what we have and make usage more attractive. All of these positions have merit, depending on where you are coming from. At the end of the day, what will decide the question is action based on community consensus. Data quality is a topic close to my heart, so I have been thinking a lot about this. We are tackling it from many different angles:
More eyes on the data: The belief behind this is that the more people are exposed to data from Wikidata, the better its quality will become. To achieve this, we have already done quite a bit of work, including improving the integration of Wikidata’s changes into the watchlists and recent changes on Wikipedia and the other Wikimedia projects. Next, we are building the ArticlePlaceholder extension and automated list articles for Wikipedia based on the data in Wikidata. We will additionally make it easier for third parties to re-use the data in Wikidata. We will also look into more streamlined processes that let data re-users report issues easily, to create good feedback loops.
Automatically find and expose issues: The belief behind this is that to handle the large amount of data in Wikidata, we need tools that support the editors in their work. These automatic tools detect potential issues and make editors aware of them, so they can look into them and fix them as appropriate. To achieve this, we already have internal consistency checks (to easily spot issues like people who are older than 150 years, or an identifier for an external database that has the wrong format). We have also worked on checking Wikidata’s data against other databases and flagging inconsistencies for editors to investigate. Furthermore, more and more visualizations are emerging that make it easier to get an overview of a larger part of the data and spot outliers and gaps. Probably the most important part is machine-learning tools like ORES that help us find bad edits and other issues. We made great progress in this area in 2015 and will realize more of this potential in 2016. Overall, the fact that Wikidata consists of structured data makes it much easier to automatically find and fix issues than on Wikipedia.
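A minimal sketch of the kind of internal consistency check described above might look like this. The age limit matches the example in the text; the identifier format, field names, and `check_item` helper are invented for illustration and are not Wikidata's actual checking code:

```python
import re
from datetime import date

# Hypothetical, simplified checks in the spirit of Wikidata's
# constraint reports: flag implausible ages and malformed external
# identifiers so editors can review and fix them.

MAX_PLAUSIBLE_AGE = 150
# Invented format: pretend the external database uses IDs like "AB-12345".
EXTERNAL_ID_FORMAT = re.compile(r"^[A-Z]{2}-\d{5}$")

def check_item(item):
    """Return a list of human-readable issues for one item dict."""
    issues = []
    born, died = item.get("birth_year"), item.get("death_year")
    if born is not None:
        end = died if died is not None else date.today().year
        if end - born > MAX_PLAUSIBLE_AGE:
            issues.append(f"implausible age: {end - born} years")
    ext_id = item.get("external_id")
    if ext_id is not None and not EXTERNAL_ID_FORMAT.match(ext_id):
        issues.append(f"malformed external identifier: {ext_id!r}")
    return issues

# A person born in 1820 with no death date and a badly formed ID
# triggers both checks.
print(check_item({"birth_year": 1820, "external_id": "xy99"}))
```

Crucially, such tools only surface candidates for review; the decision of what is actually wrong stays with the editors.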
Raise the number of references: The belief behind this is that we should have references for many of the statements in Wikidata, so people can verify them as needed. This is also important to stay true to our initial goal of stating what other sources say. We have recently made it easier to add references, which will hopefully lead to more people adding them. More will be done in this area. The primary sources tool helps by suggesting references for existing statements, and the recently accepted IEG grant for StrepHit will boost this even further. Last but not least, there is a rather active group of editors working on WikiProject Source MetaData. All of this will help us raise the number of referenced statements in Wikidata. We have already seen the share of referenced statements increase massively, from 12.7% to 20.9%, over the past year because of these measures as well as a change in attitude.
Encourage great content: Wikidata as a project needs processes that lead to great content. It starts with valuing high-quality contributions more and highlighting our best content. We have had showcase items for a while now, which are supposed to put a spotlight on our best items. That process is currently being changed to make it run more smoothly and encourage more participation.
Make quality measurable: We are working on various metrics to meaningfully track the quality of Wikidata’s data. So far the easiest and most-used metric is the number of references Wikidata has, and how many of those refer to a source outside Wikimedia. We should, however, take into account that Wikidata also has a very significant number of trivial, self-evident, or editorial statements that do not need a reference. One example is the link to an item’s image on Wikimedia Commons; another is that more than three million statements are simply "instance of: human"! The percentage of references to other Wikimedia projects is especially high for these trivial statements, while the percentage of references to better sources is much higher for non-trivial statements like population data. The existing metric is too simplistic to truly capture what quality on Wikidata means. We need to dive deeper and look at quality from many more angles. This will include things like regular checks of a small random subset of the data.
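The more nuanced metric this paragraph argues for could be sketched as follows. The trivial-property list, the domain heuristic, and the sample data are illustrative assumptions, not an official Wikidata metric:

```python
# Hypothetical sketch of a reference metric that sets aside trivial
# statements (which need no source) and counts only references
# pointing outside Wikimedia as "external".

TRIVIAL_PROPERTIES = {"P18", "P31"}  # e.g. Commons image, "instance of"

def reference_quality(statements):
    """statements: list of (property_id, reference_domains) pairs.
    Returns the share of non-trivial statements citing at least one
    non-Wikimedia source, or None if there are no non-trivial ones."""
    non_trivial = [refs for prop, refs in statements
                   if prop not in TRIVIAL_PROPERTIES]
    if not non_trivial:
        return None
    external = sum(
        1 for refs in non_trivial
        if any("wikimedia" not in d and "wikipedia" not in d for d in refs)
    )
    return external / len(non_trivial)

sample = [
    ("P31", []),                        # instance of: human, trivial
    ("P1082", ["stats.example.gov"]),   # population, external source
    ("P569", ["en.wikipedia.org"]),     # birth date, internal reference only
]
print(reference_quality(sample))  # -> 0.5
```

A naive count over the same sample would report one referenced-with-external-source statement out of three; excluding the trivial claim gives the more honest one out of two.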
All of these building blocks are being worked on or are already in place. Already today, in its arguably imperfect state, Wikidata is helping Wikipedia raise its quality by surfacing longstanding issues that only became apparent because of Wikidata: a Wikipedia having two articles about the same topic without being aware of it, or two Wikipedias having different data about a person without any useful reference. Wikidata gives us a good way to finally expose and correct these mistakes. Once we have a data point and a good reference for it on Wikidata, it can be scrutinised more thoroughly and then used much more widely than before.
Trust and believing in ourselves
Do we trust our own model and way of working? Wikipedia started in much the same way as Wikidata: it didn’t have high-quality data, and it certainly didn’t have a lot of references for its articles. But with a lot of dedicated work this changed, and today the Wikipedias (at least the biggest ones!) are of fairly high quality. I see no reason why we can’t do this once again for Wikidata, with an amazing community, better tools at hand, and the lessons we have learned on Wikipedia. But let’s also not fall into the trap of demanding perfection.
What do we do now?
- Encourage more re-users of Wikidata’s data to give their users a way back to Wikidata. Histropedia and Inventaire are two re-users already doing that, and it is a mutually beneficial partnership.
- Make it easier to use Wikidata’s data inside and outside of Wikimedia.
- Improve existing quality tools around Wikidata and make more use of them.
- Make existing knowledge-diversity tools easier to use, promote them more, and make more use of them.
- Make the outside world more aware of knowledge diversity and plurality.
- Increase the diversity in our contributor base to cover more cultures and worldviews.
At the end of the day, Wikidata is a chance to raise the quality bar across all our projects together. Let’s make it a reality. That’s how we give more people more access to more knowledge every day.
- Lydia Pintscher is the Product Manager for Wikidata at Wikimedia Deutschland.