Facto Post: a fresh take

Op-ed

Facto Post: a fresh take

Wikicite 2017

I come late to the vision thing. I remember still that when I was standing for the Foundation Board in 2006, one Wikimedian described my platform as "pragmatic", though not in a good way. I suppose I have usually felt that the main way to build an encyclopedia is an enormous amount of painstaking effort. Right now, though, I feel the need to kick up a fuss.

The catalyst was the latest in the WikiCite conference series. I missed the Vienna meeting in late May, but it was clearly vibrant in a way that can only be welcomed. I started the Facto Post mass message to bottle the buzz.

Backstory: Wikimedia integration

I count myself as a four-tab Wikimedian. This means that when I sit down to my machine, I have Wikipedia, Commons, Wikisource and Wikidata tabs open. I have been heavily involved with Wikisource since 2009, and Wikidata since 2014. I arrived on Wikipedia in June 2003. So, where is Wikimedia heading right now? I have taken part in the current Wikimedia movement strategy exercise, and have mixed feelings about it. Radicalism? I don't see it there.

I have tried thinking about Wikimedia integration around Wikidata. I think this is happening, but it is hard to explain to anyone not already a Wikimedian working on several of the sister projects. Some people seem to feel threatened by Wikidata. Others regard it, with rather more justification, as the sonic screwdriver of the Wikimedia universe: Brion Vibber is supposed to have said that it solves all problems.

Presentation and content

I put my head over the parapet with s:Wikisource talk:Wikimedia Strategy 2017#Greater scope for data, citation reform and integration on Wikipedia, and make the clear case for our place in education. What would I be meaning there?

"Citation reform" suggests something is broken. Not everyone would agree. But consider whether the reader is able to view Wikipedia references consistently, in a given style. Is there a setting in "Preferences" for that? No, there may be 100 different referencing styles used in Wikipedia, and by convention there has to be a good reason for an editor to change the referencing style in an article. Normally, and this is a strength of Wikipedia, the reader is the customer here. In the way references are presented, the original author of an article has more of the status of someone who is "always right", in selecting the citation style.

Software engineers are going to recognise the issue here, namely separation of presentation and content. The essential content of a reference can be displayed in numerous ways, e.g.: which comes first, given name or family name of an author (content)? The reader who really wants family name written first, which always reminds me of old library card indexes, could in principle have that option via "Preferences" (presentation). That is a futuristic idea: another is that we should actually know the area of text that a reference applies to. (Strange but true, we don't now.) In any case, Wikidata could do the job of implementing the separation.

Integration: a fresh take

ContentMine logo

Here and now, I'm still talking about integration, but in a more encyclopedic way. Crucially, too, in a community way. The input-output issues around Wikidata now seem like a good way to understand things in the large, not just Wikidata's place among sister projects. Wikidata inputs (automated, semi-automated, and via the fact mining which I'm working on at WikiFactMine project). Holding areas such as mix'n'match, potentially LibraryBase. Wikidata outputs, not just to infoboxes but via SPARQL, and some form of WikiCite export (in other words, reuse of bibliographic and citation data held in Wikidata).

What I was saying in detail about citation reform is a technical possibility once the WikiCite project takes hold. It is a good example of a way ahead. I would think less of a Wikimedia movement strategy that didn't mention such things.

So I mean to take "post-Wikidata" seriously. About five years since its inception, there is a new perspective available, coming from Wikidatans, but not only them. Librarians find it of interest, some of the open science crowd, those looking for the salvation of digital humanities.

Facto Post

I felt, already last summer, that Wikidata was undeniably doing something for the digital humanities, moving our take beyond GLAM. See Andrew Gray's blogpost in the first issue of Facto Post. People really should get behind new tech possibilities for Wikimedia, I say. I believe that the "technophile versus Luddite" stand-off is divisive rather than helpful. I respect the caveat-oriented scepticism that is appropriate to new technology, but the difference between entering a caveat and nitpicking is a judgement call. So, I will go so far as to question the judgement of those who can only find nay-saying in their hearts.

To get past the title, Facto Post is a play on words. Ex post facto is Latin for "retrospectively", so reversed is possibly "prospectively"? But the play is also from the middle of "WikiFactMine", on which I'm currently working: I have a summer job as Wikimedian in Residence, at ContentMine, whose project it is. Fact as in "fact mining", a subarea of text and data mining, for us the extraction of scientific facts from original papers. Some of them are headed for Wikidata, as referenced entries.

Tim Berners-Lee himself is planning a revised Web; he praised our governance, if adding that Wikipedia is not perfect. And it is not. We are still straining to adjust Wikipedia to the semantic Web concept, his previous version. In fact, the potential is only just becoming apparent in terms of Wikimedia content being much more easily manipulated. Taming the plethora of referencing styles is just a start. The excitement is emergent, not just another "next big thing". I sought to nail it in the Editorial to the first issue of Facto Post. No doubt several passes will be needed.

← Previous "Op-ed"

Next "Op-ed" →

In this issue

23 June 2017

News and notes

In the media

Op-ed

Discuss this story

These comments are automatically transcluded from this article's talk page. To follow comments, add the page to your watchlist. If your comment has not appeared here, you can try purging the cache.

One tried-and-true method of sharing sister-project data on Wikipedia was through links on the bottom of templates. Author, artist, and other templates were able to include links to the invaluable and informative pages at Wikiquote, Wikisource texts (adding the descriptor "texts" to "Wikisource" 'fixes' a not-quite-clear term), Commons, etc. After being used for years, a 2015 RfC, which included late-arriving evidence of their six-year problem-free use on the main {{Wikipedia}} template that even the closer missed, allowed editors to remove them. As far as I've seen only one editor has done so, and I've been advocating for their return ever since. Here is what one looks like. Adding these links back, and limiting them to three a template, will add a world of information to the readers and provide a logical sister-project crossover. Hopefully they will be returned by 2030 (if not next Monday). Thanks for your op-ed, nice work. Randy Kryn (talk) 10:26, 23 June 2017 (UTC)[reply]

Thanks, I wasn't aware of that issue, and will take it on board. It is the sort of area in which Wikidata might usefully be factored in, and clearly hasn't yet been. Charles Matthews (talk) 12:04, 23 June 2017 (UTC)[reply]

I like that this article got me interested and now I want to read more of this. This op-ed is a series of unanswered questions - "How would a data scientist say that Wikidata is underrepresented in the Wikimedia strategy planning?", "How can the Wikimedia community keep people focused on content by satisfying requests for personalized presentation of content?", "What tools are likely to become available to reuse Wikidata content, and why should the Wikimedia community care?". I have my own thoughts about these things even if I hardly understand them myself. It is nice to see Wikidata writing in The Signpost. Blue Rasberry (talk) 16:41, 23 June 2017 (UTC)[reply]

Tim Berners-Lee is concerned about the centralization of the Internet to a few large sites, and is working to reinvigorate decentralization; it seems to me that centralizing facts to a single database is a step in the wrong direction. You are selling a site where a FACT can quickly and easily be changed with minimal oversight. What is to stop Wikidata from becoming the go-to source for Fake Facts? Wikidata does serve a useful purpose though, in connecting the many different language Wikipedias. We don't allow content forks in a single language, but forks in other languages could serve a useful decentralization and quality-control function. Wikidata itself shouldn't be editable, rather it should constantly be monitoring the identical facts as stated in multiple languages. When all languages agree on a data item, then Wikidata may "publish" that as a "certified" data item. If even a single language edition has a different value for the data "fact", then Wikidata should flag this as a "disputed fact" for the attention of editors who will either revert the vandalism of the outlier or update all the out-of-date editions with the new value of the data item. wbm1058 (talk) 18:05, 23 June 2017 (UTC)[reply]

I think the first point is more of a debating point: surely the multiple routes for reusing Wikidata content, which is freely licensed, make the site different in kind from, say, Facebook. The second point should be well taken: Wikidata is at a point in its history where the community agenda should be turning in the direction of data integrity (people do use watchlists there, by the way), but also "manual of style", in other words more explicit discussion on "data models". As for multiple languages, I'd like to make the points in the other direction. Firstly, that Content translation has huge potential. One obstruction is that reusing references is often difficult. I think that underlines the need for some "citation reform" thinking. Secondly, anomalies between different language versions. "Death anomalies" have been monitored for years now. The idea could well be scaled up. Thank you for raising these big issues. Charles Matthews (talk) 06:58, 24 June 2017 (UTC)[reply]

Very worthy points, Charles. Tony (talk) 12:24, 24 June 2017 (UTC)[reply]

Mindless, automated content translation was pushed out too fast. Content translation might better help with ensuring data integrity. Flag content where the translated content of another language is at odds with what the content in your native language says, for further investigation. There is strength in having multiple points of failure, with each language representing a single point of failure. Pushing out Wikidata that is maintained in one database with a single point of failure, without adequate content controls in place, could cause more harm than good. wbm1058 (talk) 13:19, 24 June 2017 (UTC)[reply]

I guess (having given a talk on Thursday on Wikipedia's reliability) that the appropriate concept of "failure" is of failure to notice that data integrity has been harmed. It is more a question, therefore, of alertness.

Under the "many eyeballs" doctrine, a change made in 100 places really is more likely to be noticed quickly. Defining the question more quantitatively, as I suggested in my talk, can start with talking about median and mean times to fix.

We have to be concerned, fundamentally, with factors that keep the mean time high. Centralising to a so-called single point of failure does not harm the ambition to bring down the mean time to fix. Errors on Wikipedia can take between 48 hours (say – via someone's watchlist) and ten years to be corrected. Let's call that three orders of magnitude. Are you really saying that we should worry, when the choice is between a "broadcast" error, and individually maintained values on say, 100 wikis, only 10 of which may have patrolling comparable to what goes on here? Charles Matthews (talk) 13:55, 24 June 2017 (UTC)[reply]

There is too much reliance on "many eyeballs" to solve all the hard problems. That errors go years without being corrected is evidence that there is a shortage of eyeballs. The advantage of automated systems is that they can have thousands of eyeballs, while humans only have two. I can't be watching 100 wikis at once, but you can design automated systems that can watch 100 wikis at once, and throw a red flag when a data item on wiki #37 changes to something different than the other 99 wikis have. I spend relatively little time looking at my watchlist, because watchlists don't have any intelligence to filter out the benign changes and flag the problematic changes. The work queues I focus on have flagged with virtual 100% certainly an issue that needs to be addressed. I'm often the only editor working these queues. I have more queues than time. I'm like the boy with ten fingers trying to plug a Dutch dyke with 500 holes in it. You can propagate Wikidata out into the wild to 100 places in a few minutes that if bad can take human editors years to repair all the damage. How many eyeballs are there really watching the Wikidata anyway? Is there a list of Wikidatans by number of edits? My understanding is that the most prolific Wikidata editors are all bots. I don't know anything about what sort of BRFA process there might be on Wikidata, but on English Wikipedia WP:BRFA goes to extreme lengths to avoid allowing bots to spread trash en-masse. I have a bot that's been waiting six months for approval. wbm1058 (talk) 16:37, 24 June 2017 (UTC)[reply]

I have to say I think you may misunderstand. Let me make some comments. I think, though I have not tried this out myself, that it would be relatively easy for someone to copy their watchlist here on English Wikipedia to an equivalent one on Wikidata. That would be an application of the PagePile tool. In other words editors here who wanted to watch the content of infoboxes or anything else here drawn from Wikidata have a technically quite simple way to reproduce the sources of that content. (Things might be just a bit more complicated if an infobox used "arbitrary access" to pull in data from more than one Wikidata item.) A typical Wikidata item will be visited less often by bots doing incremental maintenance than a typical page here. So the Wikidata watchlist should be easier to monitor for substantive changes.

Further, what you say about propagating Wikidata errors doesn't make sense to me. That model is true for translation: if an enWP article is translated for deWP, and then a correction is made to the English version, there remains a correction to make in the German version. On the other hand, if an error is introduced on Wikidata, and then corrected, the correction propagates in just the same way that the error does. Charles Matthews (talk) 06:42, 25 June 2017 (UTC)[reply]

Charles, we're talking past each other a bit here. I had not seen PagePile before. "PagePile manages lists of pages". That doesn't really tell me much about how it might be useful to me. I don't see much in the way of documentation on how to use it. As I said, I don't make big use of my watchlist. I give it a cursory glance several times a day, yes, but I don't step through it item by item and examine each change. Most changes reported on my watchlist are ignored by me. As a "power user tool", it's rather weak. Perhaps some whose focus is edit filters or vandal patrol make better use of watchlists (mine isn't, I look more for good-faith errors – but my patrols for errors do catch a lot of vandalism as well). I patrol for these things.

I get what you're saying about the data being hosted on Wikidata, I think. The data is "transcluded" to English Wikipedia from Wikidata, so the data cannot directly be edited on Wikipedia, one must go to Wikidata to change the data. Can you show me some examples of that? Is there a category for all such items, so they can be monitored? wbm1058 (talk) 13:32, 25 June 2017 (UTC)[reply]

Yes, "talking past each other" is what writing an op-ed in the Signpost is designed to get round. PagePile is probably not adequately documented. The original blogpost gives a general idea: it is a utility that can translate lists and output them in various ways.

As for details of infobox use of Wikidata: Template:Infobox telescope is an example. It was converted to a Lua module that pulls in content from Wikidata around July 2016; so that the code contains #invoke. Your type of question is something like "which pages here use {{infobox telescope}}, and how would one get a list of the Wikidata items that correspond?" I'm not the cleverest at these things, but the query https://petscan.wmflabs.org/?psid=1133692 runs to list the English Wikipedia pages with the template, and the Wikidata items corresponding (final column). So, for this example, one can get a good idea where the data comes from.

For a fuller answer, I would use Category:Infobox templates using Wikidata, which has 94 entries right now. I would modify the Petscan query so that the Templates&Links page used "Has any of these templates" for that list of 94. Charles Matthews (talk) 15:33, 25 June 2017 (UTC)[reply]

It's interesting to see that an article like Green Bank Telescope's infobox is created by just {{Infobox Telescope}} – with no parameters at all! Essentially that makes Wikidata not a separate database at all, but a component of the Wikipedia database. As with Lua modules, an element that poses a learning curve. I've fiddled with Wikidata a bit, with mixed success. Sometimes I just leave something for others to fix, if I can't figure it out reasonably quickly. I find all those little pencils a bit distracting – isn't the [edit on Wikidata] at the bottom of the infobox sufficient? Noting also that other templates use Wikidata in a more subtle and less obvious way than {{Infobox Telescope}} does. A bit more challenging to vandalize... and fix. So not sure whether it's a net positive or not on a quality-assurance basis. Perhaps postive if equivalent "telescope" templates are used in other languages. Would be a plus with a system to ensure that the {{Convert|2.3|acre|m2}} in the source text of the article and the 2.3 obtained from Wikidata were in sync. Also the article cites a ref. for the 2.3 but that ref. hasn't made its way into Wikidata. wbm1058 (talk) 20:25, 25 June 2017 (UTC)[reply]

I'm afraid I must disagree fundamentally with Tim Berners-Lee's solutions to the problems he identifies. Control over personal data, spreading misinformation, and political advertising are real enough (although commercial advertising in general is at least as much a problem as political). Nevertheless I can't agree that the solution is to fragment the conduits for these activities. The solution is going to be the continued and increasing success of projects like Wikipedia, which are based on models of personal privacy, verifiability of data, and neutrality of content. From that perspective, it's important to build and grow Wikimedia projects, and to be vigilant in defending them from the sort of threats that Berners-Lee identifies. Centralising facts to a reliable, trusted, comprehensive database is a step in the right direction, because such a database that became the pre-eminent source of data – in the way that Wikipedia has become as a source of encyclopedic information – would be our strongest hedge against privacy bandits, fake-fact fabricators and targeted exploiters. The antidote to being lied to, is to find somebody whom you trust to tell the truth. --RexxS (talk) 12:43, 24 June 2017 (UTC)[reply]

You included a link to wikt:vision thing, which is unreferenced and therefore problematic. Shame on editors adding content without making it verifiable. I don't think Wiktionary should include phrases, at all. Something some guy said once isn't worthy of inclusion, to my mind. Further, there was a link from Wiktionary back to here, pointing to a disambig page where I had to remove the unreferenced definition per WP:DISAMBIG because we don't host content on disambig pages, just links to other pages. No amount of integration changes our poorly-written products created by slipshod editors. Chris Troutman (talk) 09:53, 24 June 2017 (UTC)[reply]

You are entitled to your opinion. It is not mine. As it happens, Wiktionary is set to be integrated into Wikidata. On Wikidata, checking what is referenced and what is not can be automated, making maintenance easier. Via tools, links to dab pages can be found, again simplifying that maintenance task. String searching to find page titles with spaces in ... need I go on? I think the perspective I'm advocating has something to say to about the sentences numbered 1, 3 and 5 in your comment. What you add in sentences 2, 4, and 6 provides examples of comment of another kind. As I say, the opinions you are entitled to. I certainly don't share that tone in our discussions, if I can at all help it. A tradition here of random abuse of the communities in smaller sister projects is more honoured in the breach than in the observance, I feel. Charles Matthews (talk) 11:32, 24 June 2017 (UTC)[reply]

You should not attribute to malice that which can be explained by simple indignation. Whereas you seem to perceive an attack leveled at you or Wikidata, I mean to communicate a general dislike of our editors, who leave us problems to fix. Chris Troutman (talk) 11:41, 24 June 2017 (UTC)[reply]

Well, what I perceived was a certain rhetorical strategy of interleaving technical comments with divisive remarks. Those apparently were aimed at the Wiktionary community. I certainly did not take it as ad hominem; and I would say the term and imputation "malice" is one you are introducing here. Anyway, the context is the Wikimedia movement strategy exercise. If you have contributed to it on the appropriate use of cross-project shaming, I'll give your view due consideration. Charles Matthews (talk) 11:59, 24 June 2017 (UTC)[reply]

It's really more than time that references were handled systematically. Your point that presentation should be separated from content must be the right way to go. The implication is that a citation template (or equivalent Wikidata) should become the standard, with each element (forename, surname, ...) separated. The user can then choose to see John C. Doe; Doe, John C.; Doe J.C.; or Doe JC exactly as she pleases: and the same for all other aspects of reference formatting. This would make both editing (all articles structured alike) and reading (you see refs as you like 'em) easier and more pleasant for everyone. Chiswick Chap (talk) 09:17, 26 June 2017 (UTC)[reply]

Glad someone is on my side! I would say that is a little way off, though. I recall that having dates displayed as the user wished (e.g. DD-MM-YY versus MM-DD-YY) was tried and withdrawn. The infrastructure to handle "references with preferences" would be hugely more. Charles Matthews (talk) 09:27, 26 June 2017 (UTC)[reply]

Charles, sorry to be the bearer of bad news. I can see you have a lot invested in this. I agree that the complexity of our referencing templates and systems is something that needs to be improved. So I went looking into that, and stumbled into another problem: Template:FJC Bio. This template was dependent on numeric ID numbers at the Biographical Directory of Federal Judges website, and these ID numbers were installed as a Wikidata item. Guess what, the website changed their URLS so that the old ID numbers don't work anymore! This was also reported at d:Property talk:P2736#Links broken. First, a patch was installed to use the Wayback Machine, then a new parameter was added that bypasses the obsolete Wikidata item. Now one of our most active and experienced editors (who ranks just outside the top 100 by all-time edit count) is using WP:AWB to add parameters back to the template. I'm not sure of the merits of porting third-party website IDs that are subject to change at any time, at the whim of their webmaster, to Wikidata. Is there a way to efficiently update these with an AWB-type of tool? wbm1058 (talk) 16:32, 26 June 2017 (UTC)[reply]

Indeed, bad news. The Wikidata view would be, naturally, that we are interested in stable identifiers. Which properties are admitted on Wikidata is a community decision, and identifier properties would be discussed on the assumption that the identifiers are stable. Which is only an assumption, of course: institutions do not feel bound to keep identifiers stable under all circumstances. So, in the case of Biographical Directory of Federal Judges ID (P2736) it seems that the webmaster or someone higher up the food chain wasn't so concerned?

With 4000+ properties, those being (I believe) over 50% of identifier type, this kind of glitch will turn up, regularly I suppose. With Art UK artist ID (P1367), which used to be "BBC Your Paintings artist ID", they migrated their site away from the BBC's; but had the savvy to leave redirects. That made it a quick bot job to harvest the new IDs, and the transition was quick.

So this case sounds a bit worse. If websites simply change everything (which happened for example with https://www.encyclopediaofmath.org, before Wikidata), then templates here stop working. And links from anywhere stop working. Charles Matthews (talk) 17:50, 26 June 2017 (UTC)[reply]

To answer your question on a technical level: if the only solution is to scrape or rescrape the website, the next step can be to use the mix'n'match tool. A new catalog for it can be created by a third party, using the import page for the tool, from some data in columns that includes a snippet from each biographical page. Then various types of matching can be applied (automated, semi-automated, gamified). Probably less skill required than AWB, in some sense. Charles Matthews (talk) 19:05, 26 June 2017 (UTC)[reply]

External sites will always be subject to the whims of their owners to restructure them. However, the penalty suffered in search engine rankings for sites that break incoming links is now so great that it is far less likely to happen on any established site. The solution for Wikidata is to drop properties that turn out to be unstable as being more trouble than they are worth. --RexxS (talk) 20:03, 26 June 2017 (UTC)[reply]

In the broader context of (for example) the 400+ datasets in mix'n'match, there is a clear need for periodic rescraping, anyway. The Web is a dynamic place: identifiers change piecemeal, and are added to, as well as being subject to clumsy mass link rot. I know the discussed type of solution here, which is still at the string-and-sealing-wax stage of stored regexes, AFAIK. But, while it would indeed be sensible to stop bothering about sites which show themselves to be unworried by the issue, I expect a practical form of power archiving for sites with identifiers to emerge out of the existing set of ideas.

I mean, saving bits of distinctive content so that search can be used is not beyond the wit of man; and automation can then be used. Doubtless sites that actually want to implement anti-scraping technical measures can spend resources to do that, creating an arms race. But that really is going to be effective only where it is done ab initio, not in preventing rescraping. Charles Matthews (talk) 04:57, 27 June 2017 (UTC)[reply]

Comment Many of us who write large amounts of Wikipedia want to see a fair bit of reference data within the text of the article as we edit (such as year of publication, title, pmid, etc). How would pulling data from WD affect this? A single identifier within the text is (1) unclear (2) too easy to vandalise Doc James (talk · contribs · email) 17:53, 26 June 2017 (UTC)[reply]

As far as I can see (and I'm certainly no expert on Lua, if that is the route taken), the various fields in a reference template could be filled up as an infobox is. The contents would be taken from a Wikidata item that housed the data for a given paper. You raise two points, firstly (I think) that this way should be something that could be previewed. I imagine that could be done. The second, that intentionally (or not) the wrong paper could be cited. Well, I think that is a criticism that could be made of any attempt to "tokenise" references in place, on Wikipedia. I'm not sure that I want to attempt a slick answer to that point. Vandalism is a background problem here.

It is certainly true that changing one token to another would not be perspicuous, from the point of view of patrolling: yet it would be clear enough that some change had been made. There would at least be a trade-off: a token could mean that, across all Wikipedias, one could see where a given paper was used. If findings were later revised, it would be possible to track all uses and see if the text needed to be changed. Translation could be simplified. Charles Matthews (talk) 19:20, 26 June 2017 (UTC)[reply]

We already have infoboxes that draw references from Wikidata and the Lua code is already written. For that case, there are no tokens involved as the reference is directly related to the claim it supports, so it can be fetched at the same time as the claim is. The main barrier to further development is the display problem. While we have no agreed standard for references, they will have to be displayed in the same manner as the owner of the article demands (the CITEVAR problem), which is impossible to code automatically. It would be nice to think we could separate the content of a reference from its presentation (e.g. the CS1-style citations are capable of being displayed in MLA-style by setting a parameter), but there's nowhere other than cookies to store a user's preference unless they are a registered editor. Part of the reason why the date formatting experiment failed was that less than 5% of our readers could make any use of it. --RexxS (talk) 20:03, 26 June 2017 (UTC)[reply]

Indeed, I referred to the issues round WP:CITEVAR in the article; and called the Preferences solution "futuristic", for that reason, and for the technical reasons RexxS mentions. CITEVAR is supported, in the guideline page, by reference to an ArbCom principle of 2006, from Wikipedia:Requests for arbitration/Sortan. I was one of the Arbitrators of the time, this being one of my early cases.

This principle could be revisited. I don't see why not: it wouldn't offend me to be told that I didn't have a crystal ball at the time. (There was a large backlog of cases, I recall. Fred Bauder made herculean efforts to get on top of it.) ArbCom has never been the legislature, and was concerned at the time about disruptive editing, as the decision page makes clear. Charles Matthews (talk) 04:39, 27 June 2017 (UTC)[reply]

What do you think of The Signpost? Share your feedback.

Home

About