The Signpost

Special report

Wikipedia's not-so-little sister is finding its own way

By Lydia Pintscher
Wikidata is arguably one of Wikipedia's most successful sister projects. It has had a profound impact on Wikipedia in just a few years. Lydia Pintscher is the Product Manager for Wikidata at Wikimedia Germany. This essay was first published at Wikipedia @20 and has been licensed by the author under CC BY-SA 3.0.

In 2012, Wikipedia had grown and achieved so much in over a decade of creating an encyclopedia. But it was also at a point where fundamental change was needed: the world around Wikipedia was changing, and Wikimedia had to find ways to make its content more accessible and to support its editors in maintaining an ever-increasing body of content in over 250 languages. The vision of a world in which every single human being can freely share in the sum of all knowledge was not achievable in this scattered way.

Ever since the very first Wikimania in 2005, Wikimedia’s annual conference, one idea kept coming up: to make Wikipedia semantic and thus make its content accessible to machines. Machine-readability would enable intelligent machines to answer questions based on the content and make the content easier to reuse and remix. For example, it was not possible to easily answer the question of which are the biggest cities with a female mayor, because the necessary data was distributed over many articles and not machine-readable. Denny Vrandečić and Markus Krötzsch kept working on this idea and created Semantic MediaWiki, learning a lot about how to represent knowledge in a wiki along the way. Others had also started extracting content from Wikipedia, with varying degrees of success, and making the information available in machine-readable form.

So when the first line of code for the software that came to power Wikidata was written in 2012, it was an idea whose time had come. Wikidata was to be a free and open knowledge base for Wikipedia, its sister projects and the world, helping give more people more access to more knowledge. Today, it provides the underlying data for a lot of the technology you use and the Wikipedia articles you read every day.

Being able to influence the world around you is an important and empowering thing, and yet we are losing this ability a bit more everywhere every day. More and more of our daily lives depends on data, so let's make sure it stays open, free and editable for everyone, in a world where we put people before data. Wikipedia showed how it can be done, and now its sister Wikidata joins in to contribute a new set of strengths.

Growing up

Wikidata always had bigger ambitions, but it started out by focusing on supporting Wikipedia. There were nearly 300 different language versions of Wikipedia, all covering overlapping (but not identical) topics without being able to share even basic data about those topics. Considering that most of these language versions had only a handful of editors, this was a problem. Small language versions were not able to keep up with the ever-changing world, and, depending on which languages you could read, a vast amount of Wikipedia content was inaccessible to you. Perhaps someone famous had died? That information was usually available quickly on the largest Wikipedias but took a long time to be added to the smaller ones — if they even had an article about the person. Wikidata helps fix this problem by offering a central place to store general-purpose data (like that found in Wikipedia infoboxes, such as the number of inhabitants of a city or the names of the actors in a movie) related to the millions of concepts covered in Wikipedia articles.

To start this knowledge base, Wikidata began by solving a simple but long-standing problem for Wikipedians: the headache of links between different language versions of an article. Each article contained links to all other language versions covering the same topic, but this was highly redundant and caused synchronisation issues. Wikidata’s first contribution was to store these links centrally and thereby eliminate needless duplication. With this first simple step, Wikidata helped remove over 240 million lines of unnecessary wikitext from Wikipedia and at the same time created pages for millions of concepts on Wikidata, providing the basis for the next stage. Once the initial set of concepts was created and connected to Wikipedia articles, it was time for the actual data to be added, introducing the ability to make statements about the concepts (e.g. Berlin is the capital of Germany). After that, last but not least, came the capability to use this data in Wikipedia articles. Now Wikipedia editors could enrich their infoboxes automatically with data coming from Wikidata.

Along the way, a fantastic community maintaining that data developed, much faster than the development team could have dreamed. This new community included new people who had never contributed to a Wikimedia project before and were now becoming interested because Wikidata was a good fit for them. It also included contributors from adjacent Wikimedia projects who were more interested in structuring information than writing encyclopedic articles and found their calling in Wikidata.

The number of concepts represented in Wikidata items
The number of editors on Wikidata since its start (the circles indicate the beginning and end of the mass-import of interwiki links)

Later, Wikidata's scope expanded to support other Wikimedia projects, such as Wikivoyage, Wikisource, and Wikimedia Commons, allowing them to benefit from a centralized knowledge base as Wikipedia did.

As it evolved, Wikidata became an attractive source for Wikimedia projects and those who used to data-scrape Wikipedia infoboxes. External websites, apps, and visualisations used this information as a basic ingredient: from a website for browsing artwork, to book inventory managers, to history teaching tools, to digital personal assistants. Now, Wikidata is used in countless places without most users even being aware of it.

Most recently, it became clear that we need to think beyond Wikidata to a large network of knowledge bases running the same software (Wikibase) to publish data in an open and collaborative way, called the Wikibase ecosystem. In this ecosystem, many different institutions, activists and companies are opening up their data and making it accessible to the world by connecting it with Wikidata and among each other. Wikidata doesn't need to be and shouldn't be the only place where people collaborate to produce open data.

At the time of writing this chapter, Wikidata provides data about more than 55 million concepts. It includes data about such things as movies, people, scientific papers and genes. Additionally, it provides links to over 4,000 external databases, projects and catalogs, making even more data accessible. This data is added and maintained by more than 20,000 people every month and used in over half of all articles in Wikimedia projects.

Helping people (and machines) come together

Just as Wikipedia is not like any other encyclopedia, Wikidata is not like any other knowledge base. A number of things set Wikidata apart, all of them the result of striving to be a global knowledge base covering a multitude of topics in a machine-readable way.

The most important differentiator is probably the acknowledgement that the world is complex and can’t easily be pressed into simple data. Did you know that there is a woman who married the Eiffel Tower? That the Earth is not a perfect sphere? A lot of technology today tries to simplify the world by hiding necessary complexity and nuance. But conflicting worldviews need to be surfaced; otherwise we take away people’s ability to talk about, understand, and ultimately resolve their differences. Wikidata strives to change that by not trying to force one truth but by collecting different points of view with their sources and context intact. This additional context can, for example, include which official body disputes or supports which view on a territorial dispute. Without this focus on verifiability instead of truth, and without trying to force agreement, it would be impossible to bring together a community from different languages and cultures. For the same reason, Wikidata doesn’t have an enforced schema that restricts the data but, rather, a system of editor-defined constraints that highlight potential problems.
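This plays out directly in the data model: a single property on an item can hold several statements side by side, each with its own rank, qualifiers (context such as a point in time) and references (sources). The following is a minimal sketch, assuming Q64 is the item for Berlin and P1082 the population property, of reading those statements through the public wbgetclaims API; the IDs and the exact response handling are assumptions to verify, not a definitive recipe.

```python
import requests

# Sketch: read the statements one item holds for one property, including their
# ranks, qualifiers and references. Q64 (assumed: Berlin) and P1082 (assumed:
# population) are IDs to double-check on wikidata.org.
API_URL = "https://www.wikidata.org/w/api.php"

params = {
    "action": "wbgetclaims",
    "entity": "Q64",
    "property": "P1082",
    "format": "json",
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()
claims = response.json()["claims"].get("P1082", [])

for claim in claims:
    # Statements whose snak type is "somevalue"/"novalue" carry no datavalue.
    value = claim["mainsnak"].get("datavalue", {}).get("value", {})
    amount = value.get("amount", "unknown") if isinstance(value, dict) else value
    rank = claim["rank"]                      # "preferred", "normal" or "deprecated"
    references = claim.get("references", [])  # each entry points to a source
    qualifiers = claim.get("qualifiers", {})  # context, e.g. the point in time
    print(f"population {amount}: rank={rank}, "
          f"{len(references)} reference(s), {len(qualifiers)} qualifier group(s)")
```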

Being able to cover different points of view and nuance is not enough, however, for a truly global project. The data also needs to be accessible to everyone in their language, without privileging any particular language by design. Because of this, every concept in Wikidata is identified by a unique ID instead of an English name. Q5, for instance, is the identifier for the concept of a human. It is then given labels in the different languages: “human” in English, “người” in Vietnamese and “ihminen” in Finnish. This way the underlying data is language-independent and everyone can see the data in their language when viewing or editing it. This of course does not eliminate the language issue, but it goes a long way towards more equity in contributing to Wikimedia’s content.
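To see this model from the outside, here is a small sketch in Python that fetches Q5 through Wikidata's public wbgetentities API and prints a few of its labels; the response handling is illustrative, and the labels themselves may of course change over time, since anyone can edit them.

```python
import requests

# Sketch: fetch the item Q5 ("human") and print its labels in a few languages.
# The language-independent ID, not any single label, identifies the concept.
API_URL = "https://www.wikidata.org/w/api.php"

params = {
    "action": "wbgetentities",
    "ids": "Q5",
    "props": "labels",
    "languages": "en|vi|fi",
    "format": "json",
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()
labels = response.json()["entities"]["Q5"]["labels"]

for language_code, label in labels.items():
    print(f"{language_code}: {label['value']}")

# Expected output (order and wording may vary, since anyone can edit the labels):
# en: human
# vi: người
# fi: ihminen
```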

Besides fabulous people, Wikidata’s ultimate secret sauce is its connections. All concepts in Wikidata are connected to each other through statements. The statement “Iron Man -> member of -> Avengers”, for example, tells us that Iron Man is a member of the Avengers. That one connection alone does not tell us much yet. But if you take a number of other similar connections, you can easily get a list of all Avengers, and then a list of the movies they first appeared in and the actors who portrayed them. A lot of simple individual connections taken together are powerful. Add to that the wide range of topics Wikidata covers and it becomes even more powerful, because you can make connections that have not been made before. How about a list of species named after politicians? Now possible, thanks to these simple connections! And those are just the connections inside Wikidata itself; Wikidata also links to a large number of external databases, catalogs and projects that make even more data available. Because Wikidata has so many links to external resources, it can act as a hub: you, and even more importantly any machine, can find a vast amount of additional information based on a single piece of data. If the ISBN of a book is known, its entry in the relevant national library is just a hop away. There might not be a direct link from an artist’s entry in the Louvre’s catalog to their entry in the Rijksmuseum’s catalog, but with Wikidata this connection is easily made, opening up yet more options for discovering knowledge.

Wikidata links to more than 4,000 external databases, projects and catalogs, creating a vast network.
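To make the idea of combining simple connections concrete, here is a rough sketch of a query against Wikidata's public SPARQL endpoint that asks the question posed earlier in this essay: which are the biggest cities with a female mayor? The item and property IDs used (Q515 for city, Q6581072 for female, P31 for instance of, P6 for head of government, P21 for sex or gender, P1082 for population) are assumptions worth double-checking on wikidata.org before relying on the results.

```python
import requests

# Sketch: combine simple connections into one question via the public SPARQL
# endpoint. All IDs below are assumptions to double-check on wikidata.org:
# Q515 = city, Q6581072 = female, P31 = instance of, P6 = head of government,
# P21 = sex or gender, P1082 = population.
QUERY = """
SELECT ?city ?cityLabel ?population WHERE {
  ?city wdt:P31 wd:Q515 ;        # an item that is a city ...
        wdt:P6 ?mayor ;          # ... whose head of government ...
        wdt:P1082 ?population .
  ?mayor wdt:P21 wd:Q6581072 .   # ... is female
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 10
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikidata-connections-sketch/0.1 (example)"},
    timeout=60,
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["cityLabel"]["value"], row["population"]["value"])
```

The same query text can also be pasted into the query service's web interface at query.wikidata.org, which is often the easiest way to experiment with it.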

Impacting Wikipedia

Its close connection to Wikipedia made all the difference for Wikidata, especially at the start. Without the community, experience, mindshare and tools that Wikipedia provided, Wikidata would not be where it is today. Wikidata gained a lot from its close association with Wikipedia. It is also giving back of course, not just by significantly lowering maintenance burdens through centralisation of data but also in a number of more subtle and indirect ways.

Before Wikidata, the different Wikimedia projects and the language versions of each project worked in silos to a large degree. There was little collaboration on content across project and language boundaries. Wikimedia Commons had been around for a while as a central repository for media files shared between all Wikimedia projects, but by its nature it did not force a lot of collaboration. Because of this, a large proportion of editors identified first and foremost with their language version of Wikipedia and only a distant second, if at all, with the Wikimedia movement as a whole. Statements like "The Wikipedia in this or that language is terrible" were not uncommon when Wikidata started. The thought of using content shared with these other Wikipedias, perceived as inferior, was frightening to many. Equally, the idea that the large Wikipedias could gain anything from contributions by smaller projects was unthinkable. By helping people connect across language and project boundaries, Wikidata has helped to steer Wikipedia away from a silo mentality towards a truly global movement in which every project is recognized and valued for its contribution to the sum of all knowledge.

Improved search box using structured data from Wikidata

Wikidata also helps Wikipedia by being a fundamental building block for technical innovation, big and small. Simple changes like the improved search box when linking to another article in VisualEditor become possible thanks to structured data in Wikidata: the selector now shows you the short description from Wikidata, and you can select the right article to link to without having to look it up. Wikidata also makes possible more fundamental changes, like overhauling Wikimedia Commons to make images more discoverable for Wikipedia editors and others. Wikidata provides the data necessary to build better experiences for Wikipedia’s editors and readers.

Through the data in Wikidata we can also understand Wikipedia better. We can analyse much more easily what content is covered and what is missing. Take the gender gap. It has long been known that Wikipedia’s content is skewed towards covering men. The simple fact that there are more Wikipedia articles about men than about women is not very helpful for a big community, though, as it is too broad a problem to motivate people or to make meaningful progress on. Wikidata allows us to see a more detailed picture and analyse the content by time period, country, profession of the person and other relevant characteristics. We can also compare the language versions of Wikipedia to see whether any of them has a particularly narrow gender gap that the others can learn from. We can also see the geographic distribution of Wikipedia’s content and find blind spots on Wikipedia’s map of the world. The same can be done for any other content bias or gap that needs to be understood better. This way, Wikidata helps Wikipedia learn more about itself.

The gender gap on Wikipedia visualized per country of citizenship of the article’s subject. (tool by Envel Le Hir at denelezh.org)

Better understanding the knowledge that Wikipedia covers is a necessary first step towards countering biases and filling gaps. Wikidata can also help there by making it possible to generate automated worklists for a topic you care about. Interested in video games? You can make a list of all video games released in the last 10 years which are missing a publisher and start adding that data. How about party affiliations of politicians in your recent local election? Monuments in the city you last visited that are missing street addresses? All that is just a few clicks away, making it easier to contribute to collecting the sum of all human knowledge and making Wikipedia more complete.
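A worklist like the video-game example could, for instance, be generated with a query along the following lines; the IDs used (Q7889 for video game, P31 for instance of, P577 for publication date, P123 for publisher) are again assumptions to verify, and the query is a sketch rather than a polished tool.

```python
import requests

# Sketch of an automated worklist: video games released in the last ten years
# with no publisher recorded yet. Assumed IDs to verify on wikidata.org:
# Q7889 = video game, P31 = instance of, P577 = publication date, P123 = publisher.
QUERY = """
SELECT ?game ?gameLabel WHERE {
  ?game wdt:P31 wd:Q7889 ;
        wdt:P577 ?date .
  FILTER(YEAR(?date) >= YEAR(NOW()) - 10)
  FILTER NOT EXISTS { ?game wdt:P123 ?publisher }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikidata-worklist-sketch/0.1 (example)"},
    timeout=60,
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["gameLabel"]["value"])
```

Swapping out the item and property IDs turns the same pattern into a worklist for monuments missing street addresses, politicians missing party affiliations, or any other gap you care about.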

And last but not least, Wikidata helps bring new contributors to Wikipedia. It opens up Wikimedia to new types of people, ones more interested in structuring information and connecting data points than in writing long prose. The small contributions that can be made on Wikidata lend themselves well to beginners who would be overwhelmed by writing full articles. It is also a gateway for institutional contributors like galleries, libraries, archives and museums who want to make their content accessible.

Wikidata’s influence on Wikipedia far exceeds simply providing a few data points for infoboxes. It is a driver and supporter of change. Growing up with a big sister is not always easy. There’s the occasional disagreement and even fight, but in the end you make up and stick together because you are the best team there could be. It is amazing to have someone to look up to. Wikidata is a project in its own right now, with its own reason for existence… but it will always be there to support Wikipedia.

Thank you, big sister! Wikidata owes you.



Discuss this story

Given the way the system is set up at the moment, when importing claims it's possible to set the infobox to only import claims with citations that don't come from Wikimedia projects. If there were more precise definitions of what makes a valid source that aren't dependent on human value judgements, we could have a more complex whitelisting or blacklisting system for what claims can be imported via the infobox.
As far as qualitative value judgements go, for biographies of living people Wikidata now has https://www.wikidata.org/wiki/Wikidata:Living_people which allows deletion of claims qualitatively judged to be badly sourced. ChristianKl❫ 13:47, 31 August 2020 (UTC)[reply]
@NMaia: The Wikidata bridge doesn't help with any issue discussed above. It won't change labels of items. It doesn't deal with sourcing standards, and even the ability to enter a source isn't built into the prototype that exists currently (but hopefully will be before the bridge gets into contact with the bigger Wikis). ChristianKl❫ 13:47, 31 August 2020 (UTC)[reply]
Not in its first iteration, but technically there wouldn't be any hurdles for that to happen in the future. As for sourcing, neither I nor the person I was replying to were talking about that. But if you must, I think the sourcing point is moot since you can set up infoboxes to only fetch statements with references, and many language editions already do that. ~nmaia d 04:42, 1 September 2020 (UTC)[reply]
To have an integrated system to add sources, we require phab:T199197.--GZWDer (talk) 08:07, 3 September 2020 (UTC)[reply]
@Nmaia:, but presumably that would only tell us that something had been given as a source. That source might meet lower WD requirements but not en-wiki's (say, not a secondary source where one was needed) Nosebagbear (talk) 19:04, 3 September 2020 (UTC)[reply]
I think this is a shame and a big failure on the WMF developers' side, because short descriptions would be the number two thing I would think Wikidata useful for on WMF projects, after inter-language links (which the article mentions and is quite correct about the utility of). It should not be the case, this long after Wikidata was started, that we have neither an option to view Wikidata item changes on our en.wiki watchlists (if we opt to) nor the option to watchlist one particular element of a Wikidata item (in this case, the English description). — Bilorv (talk) 08:31, 1 September 2020 (UTC)[reply]
I wonder if part of the problem is a lack of information flow between the projects. For the short descriptions example, it'd be good to have the option to also be alerted in a WP (or cross-wiki) watchlist when just the description section of the associated WD item is edited, and to be able to edit the description section of a WD item from the VisualEditor interface of a WP page. Similarly, a slight interface improvement for WD-powered infoboxes would be that when editing the WP page in VE, the relevant WD statements could be edited through the same template parameter editing box, allowing WD statements to be edited via the same interface. Possibly the same for the WD cite_Q template. The way we handle Commons on WP is similarly odd sometimes, where the local page for File:xyz.png seems to usually be a redundant copy of its Commons version. On Commons it's then not possible to get a list of all the captions used for it from different pages. T.Shafee(Evo&Evo)talk 00:53, 3 September 2020 (UTC)[reply]
A short description is not (just) data, it is content. Consider that any given short description refers to the specific article in the specific Wikipedia in the language of that Wikipedia. The closest equivalent in another Wikipedia should describe the article in that Wikipedia in the language of that Wikipedia, which in the general case may be different. It makes plenty of sense to record the short descriptions from all Wikipedias in Wikidata, but not to source them from Wikidata, as they should be written by the people who write the articles, and who know what the articles are actually about, and preferably are sufficiently competent in the language used and sufficiently knowledgeable to write an adequate short description (not always easy, as some articles have a very poorly written lead section). The appropriate short description may change as the article changes, or as an editor of the article sees a better way to express it. It is also inherently sourced to the article itself as an editorial judgment by a Wikipedian. The labels on Wikidata are unsourced and their provenance is obscure. They may be fit for Wikidata purposes, but have been found unfit for English Wikipedia purposes. I do not presume to speak for other Wikipedias. Cheers, · · · Peter Southwood (talk): 06:36, 4 September 2020 (UTC)[reply]
I think we can tease out something very interesting from this comparison to Commons. When I take a Commons picture of an oboe and put it in the article Oboe with the caption "A 20th century oboe", where is the reliable source for that? When our internal search engine takes the short description from Wikidata and displays it underneath the article title, where is the reliable source for that? Are these two scenarios different in a key way, which could explain the different community response to them, or similar in a key way, such that the community response to them is based more in the context of Wikipedia's history (Commons is old and uncontroversial; Wikidata is new and unfamiliar) than it is the actual content? — Bilorv (talk) 19:21, 4 September 2020 (UTC)[reply]
I still think this is mostly an enwp issue, not a Wikidata one. If enwp had just decided to display short descriptions on desktop view, this whole issue could have been avoided. You can already watch changes to descriptions on the enwp watchlist, just turn on the display of Wikidata edits. Changing wikis to edit content shouldn't be a big deal nowadays (I'm constantly jumping between enwp/wikidata/commons + other language versions, it's not a big deal). Also, reminder that Commons uses around 3 million English short descriptions from Wikidata that are now out of sync with enwp's. Thanks. Mike Peel (talk) 19:30, 4 September 2020 (UTC)[reply]
(ec) I tried to bring up this argument a couple of years ago, when Wikidata discussions were at their peak, but nobody was interested in listening. At best they would say that the Commons policies are stricter than those at the English Wikipedia and we do not have to worry, which is completely irrelevant to the argument.--Ymblanter (talk) 19:32, 4 September 2020 (UTC)[reply]
Bilorv, they're different inasmuch as you can remove or change a Commons picture in a Wikipedia article without having to leave Wikipedia, whereas to change the short description you have to go to a different project (which has a relatively steep learning curve).
A Commons picture appears in a Wikipedia article by dint of Wikipedia mark-up editors here control. The short description, on the other hand, appears in Wikipedia because of content in another project that the editors of that project control.
Moreover, if someone changes the image you put in a Wikipedia article, you can see that in the edit history of the Wikipedia article, and if you have the Wikipedia article watchlisted, it will be flagged to you. This is not so with the short descriptions.
A Commons equivalent would be if someone were to replace image file "oboe.jpg" in Commons, showing an oboe, with an identically named image file showing a different oboe – or a trombone, for that matter. Now, that sort of thing rarely happens in Commons, but is an everyday occurrence with verbal descriptions in Wikidata. Those are some of the differences. --Andreas JN466 14:12, 15 September 2020 (UTC)[reply]





       
