A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
This paper, to be presented at next month's ICWSM conference, provides a dataset containing "all Wikipedia links posted on Twitter in the period 2006 to January 2021" - 35,252,782 URLs altogether, from 34,543,612 unique tweets. While framed as a dataset paper designed to enable future research, it also reports various exploratory data analysis results, for example on the distribution of links across Wikipedia languages:
"more than half of all links posted on Twitter (54%) are taken from the English language version. Links from the Japanese version account with 24% for the second highest share followed by Spanish, German and French."
The author notes that the Dutch Wikipedia received a high number of Twitter links relative to its share of pageviews.
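For readers who want to reproduce this kind of per-language breakdown on their own list of links, the following is a minimal sketch rather than the paper's actual pipeline. It assumes the links are plain URL strings whose host has the form `<lang>.wikipedia.org` (optionally with the mobile `m.` label). The function names and the sample URLs are hypothetical and chosen only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def wikipedia_language(url):
    """Return the language code of a Wikipedia URL, or None if it is not one."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Accept hosts such as "en.wikipedia.org" or the mobile form "en.m.wikipedia.org".
    if len(parts) >= 3 and parts[-2:] == ["wikipedia", "org"] and parts[0] != "www":
        return parts[0]
    return None


def language_shares(urls):
    """Map each language code to its share of all recognised Wikipedia links."""
    counts = Counter(lang for lang in map(wikipedia_language, urls) if lang)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}


# Illustrative input only; these URLs are not taken from the dataset.
sample_urls = [
    "https://en.wikipedia.org/wiki/BTS",
    "https://en.m.wikipedia.org/wiki/SB19",
    "https://ja.wikipedia.org/wiki/BTS",
    "https://nl.wikipedia.org/wiki/Amsterdam",
    "https://example.org/not-wikipedia",
]
shares = language_shares(sample_urls)
print(shares)
```

Dividing each language's link share by its share of pageviews would give the kind of relative measure behind the observation about the Dutch Wikipedia, although the paper's exact normalisation may differ.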
Analysing the linked articles by topic category (relying on the language-agnostic automated ORES article topic classification rather than Wikipedia categories), the author finds that
"The ranking of article meta categories from most frequent to least frequent is Culture, Geography, STEM and History & Society and this ranking does not change radically through the years. The popularity of Culture might be traced back to biography links which account for 21.3% of all linked items ..."
Concerning the popularity of concepts across languages (i.e. Wikidata items), the author finds that
"more than half of all concepts were only posted once and that the distribution is again highly skewed. Among the top five most popular concepts we do not find historical figures or events as one could expect, but two boy bands, the South Korean boy band Bangtan Boys (BTS) and the Filipino boy band SoundBreak 19 (SB19). While being among the most linked concepts they still account only for a very small percentage ..."
It is worth bearing in mind that Twitter links account for only a small percentage of the traffic that Wikipedia receives from external referrers (where search engines dominate). Moreover, in the weekly list of English Wikipedia articles receiving the most social media traffic, which the Wikimedia Foundation published until the end of last year, Twitter seems to have appeared less often as a referrer than Reddit or Facebook.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:
"We introduce the task of fact-checking in dialogue, which is a relatively unexplored area. We construct DIALFACT, a testing benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia. There are three sub-tasks in DIALFACT: 1) Verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) Evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) Claim verification task predicts a dialogue response to be supported, refuted, or not enough information."
As an example dialogue where the (itself automatically generated) response was successfully refuted by the automatically retrieved evidence snippet from Wikipedia, the authors offer the following:
From the abstract:
"... we present ProWD, a framework and tool for profiling the completeness of Wikidata [...]. ProWD measures the degree of completeness based on the Class-Facet-Attribute (CFA) profiles. A class denotes a collection of entities, which can be of multiple facets, allowing attribute completeness to be analyzed and compared, e.g., how does the completeness of the attribute "educated at" and "date of birth" compare between male, German computer scientists, and female, Indonesian computer scientists? ProWD generates summaries and visualizations for such analysis, giving insights into the KG [ knowledge graph] completeness."
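The Class-Facet-Attribute idea from the abstract can be illustrated with a small sketch. This is not ProWD's implementation: it assumes, hypothetically, that entities are available as flat Python dictionaries, and it takes "completeness" to mean the fraction of entities in a class and facet that have any value for an attribute. The records below are invented toy data.

```python
def completeness(entities, cls, facets, attribute):
    """Fraction of entities in the given class and facet that have the attribute.

    Returns None when no entity matches the class and facet.
    """
    selected = [
        e for e in entities
        if e.get("class") == cls
        and all(e.get(key) == value for key, value in facets.items())
    ]
    if not selected:
        return None
    filled = sum(1 for e in selected if e.get(attribute) is not None)
    return filled / len(selected)


# Toy records, invented for illustration only (not real Wikidata content).
entities = [
    {"class": "computer scientist", "gender": "male", "citizenship": "Germany",
     "educated at": "University A", "date of birth": "1970-01-01"},
    {"class": "computer scientist", "gender": "male", "citizenship": "Germany",
     "educated at": None, "date of birth": "1965-05-05"},
    {"class": "computer scientist", "gender": "female", "citizenship": "Indonesia",
     "educated at": "University B", "date of birth": None},
    {"class": "computer scientist", "gender": "female", "citizenship": "Indonesia",
     "educated at": None, "date of birth": None},
]

facet_a = {"gender": "male", "citizenship": "Germany"}
facet_b = {"gender": "female", "citizenship": "Indonesia"}

for attribute in ("educated at", "date of birth"):
    a = completeness(entities, "computer scientist", facet_a, attribute)
    b = completeness(entities, "computer scientist", facet_b, attribute)
    print(f"{attribute}: facet A = {a}, facet B = {b}")
```

Comparing the two facets attribute by attribute mirrors the question posed in the abstract. A real analysis would draw the entities from Wikidata rather than from hand-written records.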
From the abstract:
"The author used an extended example, the Wikipedia article on the Philippine–American War, to illustrate the unfortunate effects that accompany a lack of attention to the kind of sources used to produce narratives for the online encyclopaedia. [...]
Inattention to sources (a lack of bibliographical imagination) produces representational anomalies. Certain sources are privileged when they should not be and others are ignored or considered as sub-standard. Overall, the epistemological boundaries of the article in terms of what the editorial community considers reliable and what the community of scholars producing knowledge about the war think as reliable do not overlap to the extent that they should."
See also our coverage of earlier papers by the same author.
From the abstract:
"... we propose a task of detecting self-contradiction articles in Wikipedia. Based on the "self-contradictory" template, we create a novel dataset for the self-contradiction detection task. Conventional contradiction detection focuses on comparing pairs of sentences or claims, but self-contradiction detection needs to further reason the semantics of an article and simultaneously learn the contradiction-aware comparison from all pairs of sentences. Therefore, we present the first model, Pairwise Contradiction Neural Network (PCNN), to not only effectively identify self-contradiction articles, but also highlight the most contradiction pairs of contradiction sentences. [...] Experiments conducted on the proposed WikiContradiction dataset exhibit that PCNN can generate promising performance and comprehensively highlight the sentence pairs the contradiction locates."
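To make the "all pairs of sentences" framing concrete, here is a minimal sketch of enumerating and ranking sentence pairs within one article. It is emphatically not PCNN: the scoring function is a crude stand-in heuristic (word overlap combined with a negation mismatch) assumed here only so that the example runs. All names and the sample sentences are hypothetical.

```python
from itertools import combinations

NEGATIONS = {"not", "no", "never"}


def toy_score(a, b):
    """Crude placeholder score: word overlap, counted only if exactly one sentence is negated."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    shared = (tokens_a & tokens_b) - NEGATIONS
    union = (tokens_a | tokens_b) - NEGATIONS
    overlap = len(shared) / max(1, len(union))
    negation_mismatch = bool(tokens_a & NEGATIONS) != bool(tokens_b & NEGATIONS)
    return overlap if negation_mismatch else 0.0


def rank_pairs(sentences, score=toy_score):
    """Score every unordered sentence pair and return (score, i, j), highest first."""
    pairs = combinations(range(len(sentences)), 2)
    scored = [(score(sentences[i], sentences[j]), i, j) for i, j in pairs]
    return sorted(scored, reverse=True)


# Invented sentences for illustration; not taken from any Wikipedia article.
sentences = [
    "The bridge was opened to traffic.",
    "The bridge was not opened to traffic.",
    "It spans a river.",
]
ranked = rank_pairs(sentences)
print(ranked[0])
```

A learned model would replace `toy_score`. The surrounding loop only shows why the number of comparisons grows quadratically with the number of sentences in an article.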
As an example of a pair of contradictory sentences that were detected successfully (i.e. in accordance with the ground truth), the paper offers the following from the article The Silent Scream (1979 film):