Researchers from Google AI describe[supp 1] a new dataset for machine learning, composed of annotated images scraped from Wikipedia and intended for training multimodal visio-linguistic models. It is not quite the largest image dataset among the prior work the authors compare it to, but it has by far the largest amount of accompanying text, with more than 37M image-text associations. The text was derived from the article title and description, along with other contextual information and metadata such as image captions, alt text, and the title of the section an image appeared in. Interestingly, articles containing "hate speech" were excluded from the dataset (exactly how these were identified is not specified), perhaps to head off future problems with machine learning bias.
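The dataset is distributed as TSV files; the sketch below parses a tiny inline sample into image-text pairs of the kind described above. The column names and the sample row are illustrative placeholders mirroring the kinds of fields mentioned (page title, section title, caption, alt text), not the dataset's exact schema.

```python
import csv
import io

# Hypothetical TSV sample in the spirit of the WIT release; the header
# names below are assumptions for illustration, not the real schema.
sample_tsv = (
    "image_url\tpage_title\tsection_title\tcaption\talt_text\n"
    "https://example.org/a.jpg\tEiffel Tower\tHistory\t"
    "The tower in 1889\tEiffel Tower photo\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Build one image-text pair per row by concatenating all text fields.
pairs = [
    (
        r["image_url"],
        " ".join(
            filter(None, (r["page_title"], r["section_title"],
                          r["caption"], r["alt_text"]))
        ),
    )
    for r in rows
]
print(pairs)
```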
The Google AI researchers also announced that "we are hosting a competition with the WIT dataset in Kaggle in collaboration with Wikimedia Research and other external collaborators." In its own announcement,[supp 2] the Wikimedia Foundation's research team explained that it is hosting this competition with the aim of "foster[ing] the development of systems that can automatically associate images with their corresponding image captions and article titles. [...] you will be providing open, reusable systems that could help thousands of editors improve the visual content of the largest online encyclopedia".
In a new Scientific Reports paper titled "Ecology of the digital world of Wikipedia", the authors define the metrics "scatteredness" of editors and "complexity" of articles, then use these metrics to show how Wikipedia articles tend to improve over time. The metrics are defined in a recursive but computable way:
"...we define the scatteredness Di of an editor i, as the harmonic sum of the article complexities he or she edits. The complexity of an article is then naturally defined as a harmonic sum of the scatteredness values of the editors who edited the article..."
When plotted against each other, then tracked over time, the data suggest an evolutionary "flow" in which articles trend toward greater quality during their life (shown in accompanying graphic).
A blog post by the Wikimedia Foundation reports on the results of an experiment conducted in collaboration with the search site DuckDuckGo. The A/B test examined effects of the presence or absence of "Information modules, also referred to as 'knowledge panels' or 'information boxes,' [which] are the boxes on search result pages, generally to the right of the blue links. They often include a short summary of information from Wikipedia alongside images, facts, and links to relevant websites, including Wikipedia". When Google introduced them back in 2012, they soon gave rise to concerns that relieving (some) surfers of the need to click through to Wikipedia - by already excerpting some of its information onto the search engine results page - might be "killing Wikipedia", which derives a large majority of its traffic from Google (or at least substantially decrease its pageviews, edits and donations).
In contrast to these concerns, when the box was removed in the A/B test on DuckDuckGo, "95% of the clicks that would have gone to the Wikipedia information module instead went to Wikipedia blue links [in the standard search results list on the left]". Wikipedia's click-through rate (per SERP view) was actually higher when the information module was present (15.9%) than when it was missing (15.0%). "This indicates that the vast majority of people are not choosing Wikipedia just because it happens to be ranked high in Search and prominently in the information module but because they are explicitly looking for Wikipedia."
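For readers who want to gauge whether a gap like 15.9% vs. 15.0% is meaningful, a standard two-proportion z-test is one way to check. The rates below come from the blog post, but the impression counts per arm are made-up placeholders, since the post does not report sample sizes.

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic for H0: the two click-through rates are equal."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical impression counts per arm (NOT from the study).
n_with, n_without = 100_000, 100_000
z = two_proportion_z(int(0.159 * n_with), n_with,
                     int(0.150 * n_without), n_without)
print(round(z, 2))
```

With samples of this (assumed) size, a 0.9 percentage point difference would be well outside chance; with much smaller samples the same gap could be noise.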
This increase in click-through rate is not entirely surprising, given that the box usually contains at least one prominent additional link to Wikipedia (example). But it is in stark contrast to the earlier fears that it would decrease traffic. A 2017 study by McMahon, Johnson & Hecht[supp 3] had actually observed a decrease when removing the box in a lab experiment. But as the coauthor of a followup study pointed out, "a big limitation of this kind of [lab] study is that researchers have to select 'important' queries. But this very recent collab study from Wikimedia + DuckDuckGo bypasses that limitation."
Besides the A/B test, which was conducted on users from the US and Germany, the Foundation also analyzed existing aggregate data from DuckDuckGo from these countries, finding among other results that "Wikipedia is the most common result across all DuckDuckGo searches. It shows up either as a module or one of the top five blue links in more than 15% of searches in the United States, more than any other website."
Alongside other results, the post concludes that
"Wikipedia is central to the success of Search, and, in turn, Search is core to how people find Wikipedia. Wikipedia is ranked highly because people are looking for it."
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:
"This study investigates trends from 2002 to 2020 in citing two crowdsourced and two expert-based encyclopedias to investigate whether they fit differently into the research landscape: Wikipedia, Britannica, Baidu Baike, and Scholarpedia. [...] Scopus searches were used to count the number of documents citing the four encyclopedias in each year. Wikipedia was by far the most cited encyclopedia, with up to 1% of Scopus documents citing it in Computer Science. Citations to Wikipedia increased exponentially until 2010, then slowed down and started to decrease. Both the Britannica and Scholarpedia citation rates were increasing in 2020, however. Disciplinary and national differences include Britannica being popular in Arts and Humanities, Scholarpedia in Neuroscience, and Baidu Baike in Chinese-speaking countries/territories."
From the abstract:
"... we applied a string matching function to the text associated with each Wikipedia revision entry. The matching function uses a regular expression to identify trigram noun phrases to match entities like ‘The White House’, ‘Barack Hussein Obama’ or ‘Empire State Building’ for example. In this situation Transcendental Information Cascades form a network of article edits, linked together by the shared trigrams found within the edit revision text. By enriching the article edits with contextual knowledge about article categories from DBpedia (http://dbpedia.org) it was possible to find that this cascade network represents meaningful article relationships not available within the explicit network of linked Wikipedia articles. [... For example,] a burst of activity was observed featuring a series of edits made within a short duration of time beginning with identifiers found in edits on the article about Edward Snowden. The cascade then branched out to span across many other articles incorporating various identifiers related to Edward Snowden’s life. A detailed inspection of the time frame when the cascade emerged showed that it coincided with a presentation given by him at the SXSW conference. In other words, a relationship between an external phenomenon and a short, bursty cascade of edits within Wikipedia, which would not have been available to a more contextualized investigation, was uncovered using the method."
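The abstract's matching function can be approximated with a short regular expression that picks out three consecutive capitalized words, catching phrases like "The White House" or "Barack Hussein Obama". The exact pattern used in the paper is not given, so this sketch is an approximation under that assumption.

```python
import re

# Three consecutive capitalized words, e.g. "Empire State Building".
# A rough stand-in for the paper's trigram noun phrase matcher.
TRIGRAM = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+ [A-Z][a-z]+)\b")

text = ("Edward Snowden spoke remotely at the SXSW conference; "
        "Barack Hussein Obama had visited The White House earlier.")
print(TRIGRAM.findall(text))
```

Note that this crude pattern misses two-word names ("Edward Snowden") and all-caps tokens ("SXSW"); a production matcher would presumably use part-of-speech tagging rather than capitalization alone.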
From the abstract:
"In this paper we [are] analyzing the Wikipedia edit history to see how spontaneous individual editors are in initiating bursty periods of editing, i.e., individual-driven burstiness, and to what extent such editors’ behaviors are driven by interaction with other editors in those periods, i.e., interaction-driven burstiness. We quantify the degree of initiative (DoI) of an editor of interest in each Wikipedia article by using the statistics of bursty periods containing the editor’s edits. The integrated value of the DoI over all relevant timescales reveals which is dominant between individual-driven and interaction-driven burstiness. We empirically find that this value tends to be larger for weaker temporal correlations in the editor’s editing behavior and/or stronger editorial correlations [...]"
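A common first step in this kind of analysis is to segment an editor's timeline into bursty periods: maximal runs of edits whose inter-event times fall below a threshold. The sketch below illustrates only that segmentation on hypothetical timestamps; the paper's degree of initiative (DoI) is a more elaborate statistic over such periods and is not reproduced here.

```python
def bursty_periods(times, threshold):
    """Group sorted event times into bursts: consecutive events closer
    than `threshold` belong to the same burst."""
    bursts, current = [], [times[0]]
    for prev, t in zip(times, times[1:]):
        if t - prev < threshold:
            current.append(t)
        else:
            bursts.append(current)
            current = [t]
    bursts.append(current)
    return bursts

# Hypothetical edit timestamps in hours (NOT real Wikipedia data).
edit_times = [0.0, 0.1, 0.3, 5.0, 5.2, 5.3, 5.4, 30.0]
print(bursty_periods(edit_times, threshold=1.0))
```

Given the bursts, one could then ask, for an editor of interest, how often their edit opens a burst (individual-driven) versus arrives mid-burst after others' edits (interaction-driven), which is the distinction the DoI quantifies.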
From the abstract:
"We analyze a series of trials that randomly assigned Wikipedia users in Germany to different web banners soliciting donations. The trials varied framing or content of social information about how many other users are donating. Framing a given number of donors in a negative way increased donation rates. [e.g. "Our donation banner is viewed more than 20 million times a day, but only 115.000 people have donated so far" (negative) vs. "... Already 115.000 people have donated so far" (positive).] Variations in the communicated social information had no detectable effects. "