The Signpost

Op-Ed

Wikidata to split as sheer volume of information overloads infrastructure

By Bluerasberry

The Wikimedia Foundation will soon split parts of the WikiCite dataset off from the main Wikidata dataset. Both data collections will remain available through the Wikidata Query Service, but by default queries will return content from the main graph only, and users will have to take extra steps to request WikiCite content. This is the start of query federation for Wikidata content, and is a consequence of Wikidata having so much content that the servers hosting the Wikidata Query Service are under strain.

I support this as a WikiCite editor, because WikiCite is consuming considerable resources, and the split preserves the content at the cost of reducing its accessibility. This split could also be the start of dedicated support for Wikimedia citation data products.

I am wary of the split, because it only gives about three more years to look for another solution, and we have already been seeking one since 2018. The complete scholarly citation corpus of ~300 million citations is not a large dataset by contemporary standards, but our Blazegraph backend strains to include 40 million right now. Even after a split, Wikidata will fill with content again. Fear of the split has been slowing and deterring Wikidata content creation for years, and we do not have long-term plans for splitting and federating Wikibase instances repeatedly.

The split will create a WikiCite graph separate from the main Wikidata graph. The main Wikidata graph will retain content of broader interest, including items for authors, journals, publishers, and anything with a page in a Wikimedia project.
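
In practice, federation means that a query sent to the main graph has to name the second graph explicitly whenever it needs WikiCite content. The Python sketch below only assembles such a query as text and does not contact any server. The endpoint URL, the helper name build_federated_query, and the choice of the author (P50) and title (P1476) properties are illustrative assumptions rather than details taken from the Foundation's documentation, which remains the place to check the real endpoints.

```python
# Assumption: address of the split-off scholarly graph. Verify against the
# Wikimedia Foundation's graph split documentation before relying on it.
SCHOLARLY_ENDPOINT = "https://query-scholarly.wikidata.org/sparql"


def build_federated_query(author_qid: str, limit: int = 10) -> str:
    """Build SPARQL text listing articles by one author across both graphs.

    Hypothetical helper: the author item stays in the main graph, while the
    article items are requested from the scholarly graph via SERVICE.
    """
    if not (author_qid.startswith("Q") and author_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    lines = [
        "SELECT ?article ?title WHERE {",
        # Evaluated on the graph the query is sent to (the main graph).
        f"  BIND(wd:{author_qid} AS ?author)",
        # The extra step: hand this part of the pattern to the other graph.
        f"  SERVICE <{SCHOLARLY_ENDPOINT}> {{",
        "    ?article wdt:P50 ?author ;",
        "             wdt:P1476 ?title .",
        "  }",
        "}",
        f"LIMIT {limit}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Q42 is used only as a well-known example item ID.
    print(build_federated_query("Q42", limit=5))
```

A query written before the split would have contained only the inner pattern. The SERVICE wrapper is the part that existing queries and tools would need to gain, which is why the change takes effort for anyone who maintains a large set of saved queries.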

This challenge does not have an obvious solution. I have tried to identify experts who could describe the barriers at d:wikidata:WikiProject Limits of Wikidata, but have not been able to do so. I asked whether Wikidata could usefully expand its capacity with US$10 million of development funding, and got uncertainty in return. I have no request of the Wikimedia community members who read this, except to remain aware of how technical development decisions determine the content we can host, the partnerships we can make, and the editors we can attract. The Wikimedia Foundation team managing the split has documentation which invites comment. Visit it, and ponder the extent to which it is possible to discuss the scope of Wikidata.

That is the summary, and all that casual readers may wish to know. For more details, read on!

I am writing this article as an opinion or personal statement. I am unable to present this as fact-checked investigative journalism, and am presenting this from my own perspective as a long-term WikiCite contributor who has incorporated this project into many sponsored projects in my role as Wikimedian in residence at the School of Data Science at the University of Virginia. I have a professional stake in this content, and wish to be able to anticipate its future level of stability.

Why split WikiCite from Wikidata?

About 1/3 of Wikidata items are WikiCite content

Wikipedia is a prose encyclopedia established in 2001. Over the years, the Wikipedia community deconstructed Wikipedia's parts into Wikimedia sister projects, one of which was Wikidata, established in 2012. Wikidata is designed such that, to the casual observer, it seems to accomplish magic to solve multiple billion-dollar challenges facing humanity. Soon after its establishment, though, its infrastructure hit technical limitations. Those limits prevent current (and especially prospective) projects from importing more content into Wikidata.

The only large Wikidata project to continue limited development was WikiCite, which is an effort to index all scholarly publications. WikiCite grew, and is currently a third of the content on Wikidata. Users access WikiCite content through multiple tools; the tool I watch is Scholia, a scholarly profiling service serving this content 100,000 times a day. The point of Wikimedia projects is to be popular and present content that people want, and there is agreement that WikiCite is a worthwhile project. While Wikidata is overstuffed with content, use of the Wikidata Query Service strains Wikidata's computational resources and causes downtime. Reducing the costs and strain with a WikiCite split is one solution to manage them.

Scholia gets 100,000 views a day. Many of the outages since 2022 are outages of the toolforge:Toolviews tool, which is a separate, unrelated issue.

The problem is that Wikidata is facing an existential crisis, having run into many of the limits reported in WikiProject Limits of Wikidata. Users must be able to access Wikidata content through database queries, and the amount of content in Wikidata is now large enough that queries are failing more often. The short-term solution, which is happening right now, is the first Wikidata graph split, which will separate the WikiCite dataset from the main Wikidata graph. This is not a long-term solution, because Wikidata will fill up with data again. If users had their way, Wikidata would expand in resource use to index all public information on people, maps, the Sum of All Paintings, video games, climate, sports, civics, species, and every other concept which is familiar to Wikipedia editors and which could be further described with small – meaning not big data – general reference datasets.

In 1873 optical engineer Ernst Abbe discovered a formula describing the optical limit of microscopes. Limits can be surpassed when understood. Describe Wikidata's limits at d:Wikidata:WikiProject Limits of Wikidata

Here is a timeline of the discussions:

Here are some questions to explore. Ideally, answers to these questions could be comprehensible to Wikimedia community members, technology journalists, and computer scientists. I do not believe that published attempts at answering these questions for those audiences exist.

  1. If we could predict Wikidata's future capacity, then editors could strategically plan to acquire data at the rate of growth. Will Wikidata's capacity in 3 years be greater than, or the same as, its current capacity?
  2. WikiCite hit many upload limits in 2018. In the 6 years since, we have not identified a solution. What could we have done differently to develop appropriate discussion at the time the problem was identified?
  3. Suppose that the Wikimedia community developed a successful product – like WikiCite and Scholia – which also came with expenses. How can the Wikimedia community assess the value of such things and determine what support is appropriate?

Scholarly profiling

Google Scholar is a popular scholarly profiling service, but it is also non-free, which is why Wikipedia cannot post screenshots of its outputs.

Scholarly profiling is the process of summarizing scholarly metadata from publications, researchers, institutions, research resources including software and datasets, and grants to give the user enough information to gain useful insights and to tell accurate stories about a topic. For example, a scholarly profile of a researcher would identify the topics they research, their social network of co-authors, history of institutional relationships, and the tools they use to do their research. Such data could be rearranged to make any of these elements the subject of a profile, so for example, a profile of a university would identify its researchers and what they study; a profile of software would identify who uses it and for what work; and a profile of a funder would tell what impact their investments make.

The easiest way to understand scholarly profiling is to use and experience popular scholarly profiling services.

Google Scholar is the most popular service and is a free Google product. It presents a search engine results page based on topics and authors. Scopus is the Elsevier product and Web of Science is the Clarivate product. Many universities in Western countries pay for subscriptions to these, with typical subscription costs being US$100,000-200,000 per year.

A search for "influenza" in Internet Archive Scholar suggests papers from 1971, 2007, and 1890. Semantic Scholar claims copyright of the website, and OpenAlex is ambiguous with licensing.

Free and nonprofit comparable products include Semantic Scholar developed by the Allen Institute for AI, OpenAlex developed by OurResearch, and the scrappy Internet Archive Scholar developed by Wikimedia friend Internet Archive.

Other tools with scholarly profiling features include ResearchGate, which is a commercial scientific social networking platform, and ORCID, which compiles bibliographies of researchers.

OpenAlex, Semantic Scholar and Internet Archive Scholar designate the data as openly licensed and allow export, but all of these have ambiguous open licensing terms for elements of their platforms. Google Scholar, Scopus, and Web of Science slurp data that they find and encourage crowdsourced upload of data, but their terms of use do not allow others to export it as open data. It has been a recurring thought that WikiCite and Scholia could meet institutional needs at a fraction of the Scopus and Web of Science subscription costs. ORCID also encourages data upload and entire universities do this, but only for living people, and the data is only public with the consent of the individual profiled.

Statements such as the Barcelona Declaration on Open Research Information seek to gather a collaboration which could manifest an ideal profiling platform, which would be open data, exportable, allow crowdsourced curation, encourage public community discussion of the many social and ethical issues which arise from presenting a platform like this, and of course be sustainable as a tool which uses computing resources. Scholia is these things, except for hitting technical limits.

WikiCite and Scholia

WikiCite is a collection of scholarly metadata in Wikidata, the WikiProject to curate that data, and the name of the Wikimedia community who engage in that curation. Scholia is a tool on Toolforge which generates scholarly profiles by combining WikiCite and general Wikidata content into a reader-friendly format. Scholia is preloaded with about 400 Wikidata queries, so instead of any new user needing to learn queries, they can use the Scholia interface to run queries to answer common questions in academic literature research.

Anyone can use WikiCite content for applications on or off wiki. Histropedia emphasizes timelines, and uses WikiCite content to visualize research development over time.

WikiCite is the single most popular project in Wikidata in terms of amount of content, number of participants, depth of engagement of participants, count of institutional collaborations, and donation of in-kind labor from paid staff subject matter experts contributing to the project. In terms of content, WikiCite is about 40 million of Wikidata's 110 million items. Because it is openly licensed, many other applications ingest this content, including the other scholarly profiling services but also free and open services such as Histropedia. Four WikiCite conferences have each convened 100 participants. WikiCite presentations have been a part of many other Wikimedia conferences for some years. The largest WikiCite project in terms of participants was the WikiProject Program for Cooperative Cataloging, which recruited staff at about 50 schools to make substantial WikiCite contributions about their own research output. In the context of the Wikimedia Foundation investing in outreach, there are projects like this which are outside of that investment, but which attract investors, new editors, and institutional partnerships.

The promise of WikiCite is to collect research metadata, confirm its openness, then enrich it with further metadata including topic tagging and deconstruction of the source material to note use of research resources, such as software, datasets, protocols, or anything else which could be reusable. Scholia presents all this content. Example Scholia applications are shown here, with links to the queries and pages which present such results.

What next?

The Wikidata Query Service is failing more often. 99% of the time it works, but 1% failure of a tool central to accessing Wikidata is an emergency to address immediately. To ensure continued access to Wikidata content, the Foundation has responded with a recently refined plan announced here incorporating everyone's best ideas for what to do.

It is challenging to coordinate the Wikipedia and Wikimedia community. The above mentioned Barcelona Declaration asks organizations to commit to "make openness the default", "enable open research information", "support the sustainability of infrastructures for open research information", and "support collective action to accelerate the transition to openness of research information", which are all aims of WikiCite, Scholia, and Wikimedia more broadly, but in my view Wikimedia projects have been too independent to join such public consortia. If we could reach community consensus to join such programs, then I think experts in that group could advise us on technical needs, and funders would consider sponsoring our proposals to develop our technical infrastructure. If the Wikimedia Movement had money, then based on my incomplete understanding of the limit problems, I recommend investing in Wikidata now so that we can better recruit expert partnerships and contributors. Since we lack money, the best idea that I have is to find the world's best experts considering comparable problems, and explore options for collaboration with them. I wish that "Wikipedia" could sign the Barcelona Declaration or a similar effort, and get more outside support.


Discuss this story


I am amazed that a major Wikimedia project is about to fail, and this is the first we have heard of it. Scaling databases is not a new science; the principles have been known for decades. I am certain there is a technical solution for this, and I haven't even looked at the "Limits" page yet. All the best: Rich Farmbrough 11:11, 16 May 2024 (UTC).

P.S. the link https://www.wikidata.org/wiki/WikiProject:Limits_of_Wikidata doesn't work. I'll check back later. All the best: Rich Farmbrough 11:14, 16 May 2024 (UTC).[reply]
@Rich Farmbrough: Good catch -- fixed. Thanks! jp×g🗯️ 11:28, 16 May 2024 (UTC)[reply]
Well, it's not something that most people have a good grasp on, so only once you really start hitting the limits does it become more interesting for people who don't really care about the details. It's also not uncommon; it's just that usually people are able to find better solutions in the nick of time and most outsiders don't even realize something changed. Reminder that all of WMF was once one database server, and one rack in one datacenter, and that we ran without multimedia backups for over 15 years. In reality, most WMF projects have existential (technical) crises on an almost continual basis and hardly anyone notices (see also the major pagelinks db table normalizations, going on right now). --TheDJ (talk • contribs) 14:34, 16 May 2024 (UTC)
One problem is that Blazegraph, in terms of development, is somewhat dead in the water, as Amazon silently hired Blazegraph developers to develop the closed-source Amazon Neptune service. However, Qlever could be a valid alternative to Blazegraph if it scales high enough. --Zache (talk) 16:50, 16 May 2024 (UTC)
WMF does have money. They could hire a team to revive Blazegraph development. We can't always rely on the open source world to provide us with software; sometimes we have to make it ourselves. To be clear, this is an expensive option. DB developers are much more expensive than PHP web developers. Trade-offs would probably have to be made. Like all things in life, it comes down to deciding what is important. Bawolff (talk) 17:35, 16 May 2024 (UTC)
Scaling RDF databases is a pretty different ballgame than scaling relational databases. I would consider scaling SPARQL-based databases to very much be an open research problem. That doesn't mean there are no solutions, but they are full of trade-offs, and often look like doing the exact same thing (partitioning the graph) with some front end to keep the queries the same. Bawolff (talk) 17:33, 16 May 2024 (UTC)

I operate my own copy of the Wikidata Query Service that continues to provide a unified graph. If you do not have time to rewrite your queries to use the split graph, you should be able to run queries on my query service. I will continue operating this unified graph until it is literally no longer possible. Please reach out by email if you are interested in using this. Harej (talk) 21:41, 16 May 2024 (UTC)[reply]

@Harej: Technical question. How do you keep it in sync with Wikidata changes? (i.e. is it near real time, are there periodic updates, etc.) I think that keeping it in sync is the biggest unsolved issue if one has one's own replica. --Zache (talk) 23:52, 16 May 2024 (UTC)
Zache, syncing is not an unsolved issue if you use Blazegraph. The Wikimedia Foundation has developed tools to synchronize Wikibase instances, including Wikidata, via Recent Changes. So my query service keeps up more or less with real time. Harej (talk) 05:03, 20 May 2024 (UTC)[reply]

Didn't realize Wikipedia has now (or will soon have) had Wikidata for longer than it didn't. Nardog (talk) 23:44, 16 May 2024 (UTC)[reply]

Bad idea when I first heard about this option at Wikimania Capetown, still a bad idea now. I support this as a WikiCite editor - well, I'm a WikiCite editor, and I don't support this. The truths about Wikidata's quite genuine "big data" issues need to be addressed, for sure. Scholia, as a front end, surely also is a different matter from these back end questions. (That is not to discuss funding, which obviously is a point.) I was struck at WikidataCon 2019, where much was said about federation, how little of it made sense from a technical point of view. As someone who puts time into "main subject" and "author disambiguation" questions for article items, I see the direct link into Wikipedias and other sites from the subject and author items as fundamental. The Capetown advocacy was solely in terms of convenience to bot operators, which I would say was a blinkered outlook. Charles Matthews (talk) 04:21, 17 May 2024 (UTC)[reply]

Oh, and I note that the feedback period for comment ended the day before the publication of this article. Charles Matthews (talk) 07:17, 19 May 2024 (UTC)[reply]

Scholia usage volume

Scholia, a scholarly profiling service serving this content 100,000 times a day - what is the source for this number, and does it distinguish between human users and automated traffic? (It matched the 2018 stats here, but that publication explicitly warned that it "cannot be used to draw any conclusions other than that usage grows [as of 2018]. Multiple contributing factors seem likely, including web crawlers, generic growth of Wikidata and WikiCite content, increased interlinking both within Wikidata and between Wikidata and other websites, especially Wikimedia projects, as well as WikiCite or Wikidata outreach activities, which often feature Scholia prominently"). Regards, HaeB (talk) 06:30, 17 May 2024 (UTC)[reply]

@HaeB: The source is https://toolviews.toolforge.org/ which is a Wikimedia platform service. Like so many of our fundamentals, documentation is lacking, but based on rumor that service has same specifications for reporting a view as wikitech:Tool:Pageviews, which is how we report Wikipedia traffic to the world and which we consider reliable. The narrative for both pageviews and toolviews is that the Wikimedia platform reports web traffic in the way standard to the field, so regardless of bot/human ratio, our reporting should be comparable to anyone else reporting any other web platform traffic. Bluerasberry (talk) 18:09, 17 May 2024 (UTC)[reply]
Thanks, that's helpful. Some quick observations:
  • based on rumor that service has same specifications for reporting a view as wikitech:Tool:Pageviews - from a quick look at the code (good point about documentation btw), that does not appear to be the case, as Toolviews does not use the same pageview definition. In particular, regarding my question above, it does not seem to attempt to detect automated traffic. (Which does not mean it is unreliable per se - it does what it does - just that the conclusions we can draw from this data are limited, as warned about by the authors of that 2018 paper, which included yourself.)
  • so regardless of bot/human ratio, our reporting should be comparable to anyone else reporting any other web platform traffic - that doesn't seem to be true. Bot pageviews are usually excluded from web traffic reporting. That's also the default setting in the Pageviews tool (in views like [1] you have to change the setting under "Agent" to include "Spider" and "Automated" views alongside human/"User" views; their FAQ has more on the distinction.)
  • One way to get closer to an answer to the question of how much human usage Scholia actually sees might be the unique daily visitor counts whose existence the Toolviews API documentation advertises. Unfortunately though they seem to be broken (the API claims 0 visitors for Scholia and all other tools for recent dates; and it turns out that MusikAnimal filed a bug about this last year already).
  • We can get a clue though by looking at the series of daily hits/pageviews for Scholia over time. Spot-checking January-April 2024, it's very interesting that while the numbers are indeed above 100k most days, there are several days where they are much smaller (e.g. 2024-01-02: 34, 2024-01-12: 12, 2024-02-08: 28, 2024-03-04: 26. Similarly a year earlier, e.g. 2023-02-15: 31, 2023-03-30: 8, 2023-04-17: 34). Such extremely large, isolated drops basically never occur for web traffic that is substantially human-generated. Now there might be some different explanations (e.g. further bugs in the Toolviews tool), but the most likely one is that Scholia sees very little direct human usage.
Regards, HaeB (talk) 07:49, 18 May 2024 (UTC)[reply]
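
The spot check described in the last bullet point can be sketched in a few lines of Python. This is a minimal illustration, not the method HaeB used. The two low counts are the ones quoted in the comment; the six-figure values for the surrounding days are hypothetical stand-ins for "above 100k most days", and the function name isolated_drops and the 1% threshold are assumptions chosen for the example.

```python
from statistics import median


def isolated_drops(daily_hits: dict[str, int], ratio: float = 0.01) -> list[str]:
    """Return the dates whose hit count is below `ratio` times the median.

    Hypothetical helper: the median stands in for a "typical" day, so a
    handful of near-zero days cannot drag the baseline down with them.
    """
    typical = median(daily_hits.values())
    return sorted(day for day, hits in daily_hits.items() if hits < ratio * typical)


# Counts for 2024-01-02 and 2024-01-12 are quoted from the comment above;
# every other value is a hypothetical placeholder for a typical day.
sample = {
    "2024-01-01": 120_000,  # hypothetical
    "2024-01-02": 34,       # quoted in the discussion
    "2024-01-03": 115_000,  # hypothetical
    "2024-01-11": 130_000,  # hypothetical
    "2024-01-12": 12,       # quoted in the discussion
    "2024-01-13": 125_000,  # hypothetical
}

print(isolated_drops(sample))  # ['2024-01-02', '2024-01-12']
```

A check like this only locates the anomalous days. It cannot say whether they come from gaps in the logging or from a real change in traffic, so it does not settle the question of human versus automated usage.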
PS, re the last bullet point: I noticed since that these weird drops are also showing for other tools (example), so the further bugs in the Toolviews tool possibility I mentioned is in fact the more likely one. And you already pointed that out elsewhere in the article (Many outages since 2022 are of the wikitech:Tool:Pageviews tool - sorry for overlooking that in my last comment; as I said, it consisted just of some quick observations). That's a bit in contrast though with which we consider reliable, no?
Either way, the question remains how much human usage the Scholia tool actually sees. I'm a bit surprised that this Op-ed describes it as a successful product citing nothing more than a metric that the 2018 paper warned should not be used for such conclusions.
Regards, HaeB (talk) 12:26, 20 May 2024 (UTC)[reply]
I don't quite follow everything being said here, but to whom it may concern, there will be a Toolviews visualization akin to Pageviews relatively soon. It's a volunteer project I've slowly been picking away at since 2019. It's maybe 90% of the way there, so stay tuned :) MusikAnimal talk 22:51, 20 May 2024 (UTC)[reply]
Belated apologies for pinging you here without indicating more clearly why. It was because I saw you had filed that bug about that unique visitor data in toolviews. It made me think that you might be able to provide more insight in general on how reliable the toolviews numbers are, in particular when it comes to distinguishing tool usage by humans and spiders/automata (as WMF does for content pageviews). I realize though that you are not responsible at WMF for addressing that kind of problem.
All that being said, I would be curious whether/how it is planned to alert users of that forthcoming visualization of the Toolviews data about such issues (will it include a FAQ like the Pageviews Analysis tool?).
Regards, HaeB (talk) 06:38, 10 September 2024 (UTC)[reply]
PS: While closing tabs I noticed the comments at phab:T87001#8899655 ff. (e.g. The question of bot traffic vs. human was the first thing that came up when I showed toolviews to people at the hackathon and The very streamlined data collection that toolviews is doing based on the front proxy nginx logs does not contain any user-agent based processing. The idea was to be able to show traffic patterns more than to provide any sort of detailed traffic analysis. I agree that we should probably put a FAQ section on the https://wikitech.wikimedia.org/wiki/Tool:Toolviews page that I've not yet bothered to create for this tool). So it definitely looks like this issue is unsolved and that this Signpost article should not have used these numbers in this way to imply conclusions about the popularity of Scholia among human users. Regards, HaeB (talk) 16:30, 10 September 2024 (UTC)[reply]
@HaeB: I admit near total ignorance in interpreting the metrics which the Wikimedia platform makes available, including the limits of their reliability and what to notice when they conflict with each other.
I will be presenting some of the information in this report in about a month at wikiconference:Submissions:2024/WikiCite_-_proposed_as_Wikimedia_sibling_project with care to make the corrections you have indicated. My presentation will include published slides and a recording, so if you have specific suggestions for what I should correct, then say so, as everything you have said so far is valuable to me. I will follow up with an email to ask you for a voice/video chat, and if you can meet me, I will take notes because you understand this and are the only person giving me feedback on this in all the times I have presented this. Thanks so much. Bluerasberry (talk) 18:06, 10 September 2024 (UTC)[reply]

What's the limiting factor?

I was surprised that WD, having about as many items as Commons has pictures, should be subject to severe strain that Commons is spared. I guess it's because WD gets a lot more queries. I bring up WikiShootMe more days than not, and Commons App almost as often, and probably each instance means multiple searches among those hundred million items most of which have nothing to do with geocoordinates. So, it makes sense that my favorite views, the ones that are all about geocoordinates, should be limited to a separate database for places, and users who want to know something about people will use a separate biographical one, and scholarly articles? Every one that was published in the past century? Yipes; that must be a terribly heavy load, so that means separation again. And so forth. Kind of a shame that separate but affiliated websites must be set up to work around technical limitations, but I hope they can all be made to work together without too much confusion. Jim.henderson (talk) 02:02, 18 May 2024 (UTC)[reply]

Wikidata has 125 million items. Does Commons have 125 million pictures? Ymblanter (talk) 07:08, 18 May 2024 (UTC)[reply]
Wikidata and Commons have a similar number of pages, around 100 million. Last time I checked, the issues with Wikidata were due to the frequency in updates. Wikidata has over 2 billion edits, while Commons has less than 1 billion (or about 20 million monthly edits vs. 10). Nemo 09:19, 18 May 2024 (UTC)[reply]
To amplify (first posting on this got lost): Commons has 105 million media files. There is no direct comparison. If the commons:SPARQL query service were used as much as the Wikidata SPARQL service, and the metadata for files were fully expressed in commons:Structured data, there would be a basis for comparison. It would likely show that few media files had really large metadata, that typically the metadata were not updated much after initial upload, and certainly that Commons had few projects for systematic metadata expansion (unless the role of categories changed). These differences can help to explain why, from the point of view of "big data" issues, you cannot really equate the two sister projects. For example, query.wikidata.org needs to refresh its working dump of Wikidata on a timescale of seconds to function properly, which is not a relevant requirement for Commons at present. Charles Matthews (talk) 09:53, 18 May 2024 (UTC)
So I would say that the "sheer volume" headline is more than a trifle misleading on the technical side. The worst-case WikiCite items are things like the Higgs boson article item with several thousand authors. But it is probably the average-case ones, say with 30 "cites-work" statements, that fatten up the graph, and these get edited often enough, for example with tools, to link to author items rather than just giving author strings. UniProt is twice the size of Wikidata in terms of items[2]. Charles Matthews (talk) 14:22, 18 May 2024 (UTC)
So, it's not the number of items as the headline might suggest, nor the size of items, nor even the traffic of queries as I suspected. Rather, the major problem is the heavy traffic of changes, and most of those changes are by bots. It's pleasant to know that we Commons users are not much the problem; mostly we are among the many who are at risk of suffering. Even though my uploads via WikiShootMe and Commons App are automatically linked in WD, that's a small part of the WD load. Yippee; it's not my fault; I'm just a victim! Jim.henderson (talk) 16:08, 18 May 2024 (UTC)[reply]
I'm not familiar with the exact issues involved, but I strongly suspect it is not simply data churn rate, but that the size of the underlying graph is a very significant issue here. Bawolff (talk) 08:58, 19 May 2024 (UTC)
I mentioned Uniprot, and https://sparql.uniprot.org/ is not having the same problems, I believe, despite having 150 billion triples. I'm no expert. It does seem to get by with an update every eight weeks. Charles Matthews (talk) 10:12, 19 May 2024 (UTC)[reply]
[I should preface this with: I'm just guessing here.] I'm more just saying that the factors are inter-related and we probably shouldn't assume it's just one single cause but a bunch of factors in combination. I suspect (albeit this is a total guess with nothing backing it up) things would be a lot easier if the graph size were much smaller or the churn rate were significantly lower. As far as churn rate goes, it should be noted that there is a significant difference between a db with a low churn rate and one with zero churn (a static db you have to rebuild if you want to change something). The static case opens up certain possibilities that just aren't possible if the data changes at all. Bawolff (talk) 18:58, 19 May 2024 (UTC)
I read a little more of some of the on-wiki pages. wikitech:Wikidata Query Service/ScalingStrategy seems to imply that there are two problems: capacity (overall graph size) and update lag (data churn rate). It sounds like some improvements were made on the update lag front, so it is less pressing at least in the near term, and that the larger concern at the moment is around capacity. The UniProt endpoint is using Virtuoso, which from what I understand makes some trade-offs to allow for really high capacity. In particular, it has bad support for incremental updates (so no live updates), and it also has a feature where, if the query is hard, it may return its best guess instead of the correct answer. Having the query engine be a static snapshot is a trade-off that might be acceptable to some users, but it is a pretty big trade-off and probably not to be taken lightly. Bawolff (talk) 08:04, 20 May 2024 (UTC)
Capacity is indeed the most important issue here. As of now, true horizontal scaling (sharding) of graph databases is an unsolved technological problem – but this would be the solution that is needed for Wikidata.
Currently, the entire graph representation of Wikidata needs to fit into the memory of a single computer for the Wikidata Query Service (WDQS), which is the most important interface for consuming data. Adding more memory to this computer (vertical scaling) becomes increasingly expensive and does not scale indefinitely anyway. Due to the lack of horizontal scaling for graph databases, it is currently not an option to distribute the graph across several computers that behave as if they were one database, as can easily be done for many other database types.
There are a few graph database engines that claim to have solved horizontal scaling, but usually this comes at a considerable cost in query performance (i.e. consuming data is usually significantly slower and less predictable when horizontal scaling is used), and those engines which have gotten this feature only got it rather recently. At this point it is not obvious which engine will emerge as the clear go-to option in the future for a graph of the size of Wikidata.
I'm afraid there is really not much that can be done differently than the proposed split at this point; it is not a question of resources, or lack of will. We have to accept that multiple graphs and query federation are the way to go. —MisterSynergy (talk) 19:33, 21 May 2024 (UTC)[reply]
Does Blazegraph really need everything in memory? It's not advertised as an in-memory database. Of course, regardless of that, memory pressure still has a significant effect on query performance as the data set gets larger. While I agree that vertical scaling can't go on forever, it does seem like we are very far away from the limit if money were no object. For example, AWS offers servers with 24 terabytes of RAM (e.g. u-24tb1.112xlarge). Such things are very expensive, but they do exist. Bawolff (talk) 22:50, 21 May 2024 (UTC)[reply]
As far as I am aware, the memory spec of the current setup is indeed relatively moderate compared to what is technically possible. However, please also consider that there are already north of 20 servers behind WDQS, distributed across two different data centers and partially reserved for internal and external usage. Each of those servers is capable of running queries on the entire graph, and the load from different requests is distributed more or less evenly among them.
If you want to scale vertically, you need to do this for all of these machines. Doing so might help for possibly a couple of years, but the fundamental problem—the absence of a viable technical solution—would very likely not be solved by then either. —MisterSynergy (talk) 21:29, 23 May 2024 (UTC)[reply]
It's important to distinguish the Wikidata Query Service from the Wikidata wiki. The database powering the Wikidata wiki is scaling fine. Wikimedia Commons is a bad comparison because the wiki itself is not having scaling issues; it is the query service that is having issues. Bawolff (talk) 08:51, 19 May 2024 (UTC)[reply]
Indeed the analysis of the alternatives has the number of triples and the frequency of their update as major criteria. I'd say it's debatable where the boundaries of "the wiki itself" end though: what about native search, for example? At one point ElasticSearch was struggling to handle the updates on Wikidata, as well, for pretty much the same reason (IIRC). Nemo 09:41, 19 May 2024 (UTC)[reply]

Well, what Lane has written is indeed an opinion piece, leaving room for other opinions and discussions. My own takeaways are more complex. Leaving aside implications for my own future WikiCite involvement, and the whole tech funding debate, here are some:

Enough for now, really. Charles Matthews (talk) 05:28, 19 May 2024 (UTC)[reply]

Missing aspects

The other day, I recommended this Signpost article as background reading in a Facebook discussion (about this recent announcement on the WDQS backend update). I received this feedback:

I think it should be stressed that this is an opinion piece and rather incomplete. The author writes it himself [...]
the author does not seem to be aware of https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/WDQS_backend_alternatives , which shows throughout the article. It's perfectly fine because of the disclaimer he adds himself. I just wanted to point that out.

Just wanted to post this here as context for other Signpost readers, and also a bit in the context of general discussions we have been having in the Signpost team on how much to vet or review opinion articles. (Again, obviously the author of this one put a lot of work into it and also, as the above quoted critic acknowledges, diligently included a disclaimer.)

Regards, HaeB (talk) 06:46, 10 September 2024 (UTC)[reply]

In fairness, I don't know how much that changes this article. Like yes, the plan seems to be: shard the graph in the short term, move to QLever in the longer (or at least medium) term. It is not like we're just sharding the graph and then totally out of ideas. The question still remains: will it work, and how much scaling does that buy us? The initial benchmarks do look extremely promising, but we're talking about a small number of benchmarks done in a non-production context, not under load, on a system that is quite frankly relatively immature (e.g. UPDATE support was just written recently). There is still a ton of uncertainty here. At a high level, I think this op-ed is essentially asking what the viability of WDQS is at the different levels of scale we might optimistically see Wikidata reach in the next 10 years, assuming no artificial limitations are placed on it. Community members want to know this because they need to make plans based on the answer to this question. Not knowing creates fear, uncertainty and doubt in the Wikidata community. The WDQS backend alternatives document doesn't answer that question. Bawolff (talk) 08:43, 3 October 2024 (UTC)[reply]



       
