The Signpost
Single-page Edition
20 November 2023

In the media
Propaganda and photos, lunatics and a lunar backup
News and notes
Update on Wikimedia's financial health
Traffic report
If it bleeds, it leads
Recent research
Canceling disputes as the real function of ArbCom
Wikimania
Wikimania 2024 scholarships


Propaganda and photos, lunatics and a lunar backup

By Bri, Oltrepier, Smallbones, and HaeB


Bernie Sanders at a 2016 rally, one of many photos of public figures taken by Gage Skidmore, well-known for his photos of candidates for the American presidential elections that have been used on- and off-wiki

Il Post published a "flash" article (in Italian) highlighting American photographer Gage Skidmore, starting from a question asked by a Reddit user in a Wikipedia-related thread on the site last October: “Why is everyone's photo on Wikipedia a picture of them at San Diego Comic-Con?” The reason is Skidmore’s work: as mentioned by Il Post, since 2009 the Indiana-native photographer has attended each year’s Comic-Con to take pictures of guest actors and actresses (including Tom Holland and Scarlett Johansson, among many others), before publishing them on Wikimedia Commons under a Creative Commons license.

Throughout the years, the tens of thousands of pictures taken by Skidmore, who is also well-known for his photos of candidates for the American presidential elections, have been used not only on Wikipedia, but also by many media outlets, such as The Washington Post, The Atlantic, Associated Press and NPR. Those who want to help out themselves can consult Wikipedia:Uploading images. – O

Russian propagandist accused of plagiarizing Wikipedia in thesis

A flag with the top half blue, the bottom half yellow
The flag of Ukraine

The music video starts without music. We look down on a yellow wheat field, otherwise only seeing a young man walking away from us. Five seconds in, the view jumps so that the horizon divides the screen in half — the blue sky above, the yellow wheat below — a tableau of the Ukrainian flag. At the same time, the young man screams "I'm Russian", seeming to say that Ukraine (or its flag) is part of Russia. Strangely, that position is Vladimir Putin's justification for the invasion: that Ukraine was, is, and always will be part of Russia. Of course, the pop patriotic anthem genre is not unique to Russia. But some artists handle it very differently.

The singer Yaroslav Dronov, better known as Shaman, first became popular with the song Встанем ("We rise up"), released on February 23, 2022. Befitting any song released on Defender of the Fatherland Day, it praises Russians who sacrificed themselves to rid the world of fascism, and calls upon today's Russians to be prepared to take up the same cause. The next day, Russia invaded Ukraine, and suddenly the song had a different meaning. Five months later, he had another huge hit in Russia, the pop patriotic anthem discussed above, Я русский ("I'm Russian").

The Moscow Times wrote on November 7 that "critics (are) accusing the singer of acting as part of the Kremlin’s wartime propaganda machine." MT also wrote: "According to Dissernet, more than half of Shaman's 2016 thesis, which earned him the equivalent of a Ph.D. in art history at the Gnessin Academy of Music, contained excerpts lifted directly from other sources." Of the 35 pages in the thesis, 6 were plagiarized from Wikipedia and 13 were plagiarized from other sources, leaving only 16 pages (including the title page, the table of contents, and some appendices) that did not contain plagiarism. One should note that Dissernet only publishes preliminary reviews, which can be re-evaluated or deleted at any time. The review MT cited was deleted on November 8.

Putin is getting ready to declare his re-election campaign for the Russian presidency in mid-December, according to The Bell. Shaman, who "has become the latest symbol of Russian military propaganda", is expected to be part of a small group of influencers acting as key campaigners and supporters for Putin. – S

Wikipedia's billion-year lunar backup to be updated

According to a press release, the Arch Mission Foundation's "second installment of the historic Lunar Library will launch to the Moon's surface later this year aboard Astrobotic's Peregrine Lander." The library's "foundational components" include "the Wikipedia" alongside "collections from Project Gutenberg, the Internet Archive, and the Long Now Foundation's Rosetta Project and PanLex datasets." Stored in the form of laser-etched analogue images on thin sheets of nickel, the library is assumed to be "capable of lasting for up to billions of years on the Moon." In 2019, transportation of the first installment of the library onboard the Israeli Beresheet mission had ended in a crash landing, but Arch Mission stated at the time that the contents likely survived intact (Signpost coverage: "Vital Articles backed up on the Moon").
The "Lunar Library" project is not to be confused with the "Wikipedia to the Moon" effort championed by Wikimedia Germany, which was envisaged to bring a disc with a community-selected collection of articles to the Moon by 2017 (Signpost coverage in 2016: "Mixed reactions to Wikipedia's lunar time-capsule"). The chapter's partner (now called Planetary Transportation Systems after several renames and a bankruptcy) does not yet appear to have launched or participated in a Moon mission at the time of writing. – H

In brief

Have you seen this gentleman ... on X?
Wales covers a lot of ground. He is looking at using AI to identify errors in Wikipedia. AI can help the Wikipedia community do a better job, but it shouldn't be used in place of a human-mediated encyclopedia due to its propensity for errors and confabulation. 200,000 startups can't be effectively regulated: "I don't see a role for regulators that makes any sense". Incredible, positive things are coming, but there are real threats. Material on EU regulation starts at 03:45.
A man on a horse looking at the Sphinx
Napoleon à la Wikipedia
Two men in a formal setting, sitting in front of microphones, with a sign behind them concerning a "cramming workshop"
Cramming: we're here to help speakers brush up on a forgotten plot, or really, anybody

Do you want to contribute to "In the media" by writing a story or even just an "in brief" item? Edit our next issue in the Newsroom or leave a tip on the suggestions page.



Update on Wikimedia's financial health

By Andreas Kolbe, Bri, NightWolf1223, and Red-tailed hawk

Wikimedia Foundation publishes audit report for FY2022–2023

Bar chart showing green, red and black bars representing revenue, expenses and net assets in years 2003–2023
Wikimedia Foundation revenue, expenses and net assets (in US$), 2003–2023
Green: revenue (excluding direct donations to the endowment)
Red: expenses (including WMF payments into the endowment)
Black: net assets (excluding the endowment)

The Wikimedia Foundation has released the audit report for the fiscal year 2022–2023, prepared by its auditors, KPMG. You can read the full report here and a summary on Diff. The main takeaways are slowed financial growth in line with targets, and record donations income. Here are some key figures.

The table below shows the development of Wikimedia Foundation finances over the past ten years, as indicated by its audit reports. Annual support and revenue has more than tripled, expenses have more than quadrupled, and net assets at the end of the financial year (not including the Wikimedia Endowment, which is organizationally separate) have increased more than fivefold.

Year Source Revenue Expenses Asset rise Net assets at end of year
2022/2023 PDF $180,174,103 $169,095,381 $15,619,804 $254,971,336
2021/2022 PDF $154,686,521 $145,970,915 $8,173,996 $239,351,532
2020/2021 PDF $162,886,686 $111,839,819 $50,861,811 $231,177,536
2019/2020 PDF $129,234,327 $112,489,397 $14,674,300 $180,315,725
2018/2019 PDF $120,067,266 $91,414,010 $30,691,855 $165,641,425
2017/2018 PDF $104,505,783 $81,442,265 $21,619,373 $134,949,570
2016/2017 PDF $91,242,418 $69,136,758 $21,547,402 $113,330,197
2015/2016 PDF $81,862,724 $65,947,465 $13,962,497 $91,782,795
2014/2015 PDF $75,797,223 $52,596,782 $24,345,277 $77,820,298
2013/2014 PDF $52,465,287 $45,900,745 $8,285,897 $53,475,021
2012/2013 PDF $48,635,408 $35,704,796 $10,260,066 $45,189,124
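
The growth claims above can be checked directly against the table. The following is a minimal Python sketch, not part of the audit report: it copies the first and last rows of the table and computes the ten-year multiples, and it also confirms that the "Asset rise" column for 2022/2023 equals the year-on-year difference in net assets. The variable and function names are illustrative only.

```python
# Figures copied from the table above (US$); names are illustrative.
FY2012_13 = {"revenue": 48_635_408, "expenses": 35_704_796, "net_assets": 45_189_124}
FY2022_23 = {"revenue": 180_174_103, "expenses": 169_095_381, "net_assets": 254_971_336}
NET_ASSETS_2021_22 = 239_351_532


def growth_multiple(start: int, end: int) -> float:
    """How many times larger `end` is than `start`."""
    return end / start


# Ten-year multiple for each headline figure.
multiples = {key: growth_multiple(FY2012_13[key], FY2022_23[key]) for key in FY2012_13}

# The "Asset rise" column should be the change in net assets between years.
asset_rise_2022_23 = FY2022_23["net_assets"] - NET_ASSETS_2021_22

for key, value in multiples.items():
    print(f"{key}: grew {value:.2f}-fold between 2012/2013 and 2022/2023")
print(f"asset rise 2022/2023: ${asset_rise_2022_23:,}")
```

Under these figures, revenue lands between a threefold and fourfold increase, expenses between fourfold and fivefold, and net assets between fivefold and sixfold, which matches the wording in the text.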

The Foundation also made a belated correction to the Endowment figures published a few weeks ago (see previous Signpost coverage). The table provided in late September had erroneously indicated financial years ending 30 June; in fact, the Foundation said, the figures provided related to fiscal years ending 31 December, in line with the Tides Foundation's accounting period. – NW1223, AK

Help wanted: Sockpuppet investigations

An orange cat lying in a partially open dresser drawer full of socks
He investigates socks. You can, too!

A large backlog has developed at WP:Sockpuppet investigations, where there are dozens of cases pending in Category:SPI cases awaiting review and, at one point prior to publication of this issue, over 140 cases awaiting administrative finalization in Category:SPI cases awaiting archive.

Key to keeping this process running are the SPI clerks. Currently, only about a dozen are active. Clerks play an important part in keeping English Wikipedia aligned with the Wikimedia Foundation's Access to Nonpublic Personal Data Policy: they review cases carefully for evidence and endorse Checkuser use of tools that can reveal users' IP addresses and other private information. Such review and concurrence prior to use of the tools is important for maintaining community trust in pseudonymity and in the integrity of Checkuser tool use.

From the SPI Clerks page, this is what the clerks actually do:

Clerks analyze behavior, make findings, and either impose or decline to impose sanctions.
Clerks help to ensure the smooth operation of SPI pages, cases and processes.
  • Ensuring SPI cases and processes stay in good order, including obtaining reasonable and productive conduct by participants;
  • Endorsing or declining CheckUser requests;
  • Ensuring cases have proper evidence (especially for CheckUser requests) and requesting such evidence when not provided; and
  • Assisting with housekeeping tasks, including closing, archiving, merging and formatting of cases.

Any user in good standing is considered qualified to apply at Wikipedia:Sockpuppet investigations/SPI/Clerks, and a talkpage discussion there (begun by this Signpost contributor) has indicated interest in new applicants. Applicants go through a semi-formal training process; non-admin trainees usually show good experience and working knowledge of the community's policies and practices at the point they request traineeship, and clerking can be a step on the way to adminship for some. – B

Help wanted: Election for ArbCom

If you want to run for a place on the Arbitration Committee, you have until 23:59 UTC this Tuesday, November 21, to self-nominate in this year's election. Eight editors already have (in random order): Cabayi, ToBeFree, Sdrqaz, Z1720, Aoidh, HJ Mitchell, Maxim, and Firefly.

Qualifications include:

There is one week after the end of the self-nomination period before voting begins on Tuesday, November 28. Editors may use this period to ask questions of the candidates.

You may vote from Tuesday 00:00, 28 November 2023 (UTC) until Monday 23:59, 11 December 2023 (UTC) if you meet the following qualifications:

See WP:ACE2023 for further details. – S

Help wanted: Wikimedia New York City (WMNYC)'s first Executive Director

Founded in 2009, WMNYC is currently looking to hire its founding Executive Director.

Their duties will include:

Remote workers are allowed. See Wikimedia New York City/Jobs for further details. – S

WikiConference North America receives bomb threat

View of the corner of a multistory building with a brick facade
Toronto Reference Library, site of WikiConference North America, and still standing

WikiConference North America 2023 was held from November 9 to 12, 2023, at the Toronto Reference Library. The program was interrupted on the morning of Saturday, November 11, when the library received a bomb threat. According to local media, the threat was received at 8:44 A.M., and the building was then placed in a hold-and-secure state while police searched it. No explosive devices were found, and the hold-and-secure state was lifted by 11:45 A.M., allowing programming to resume.

The bomb threat comes two weeks after the Toronto Public Library system, of which the Toronto Reference Library is part, was hit with a ransomware attack on October 27. The ransomware attack resulted in staff social insurance numbers being compromised, and has caused prolonged outages in many of the library's digital systems.

On another note, many slide decks from the conference's various presentations are available on Commons. – R

Brief notes

Wikimedia Commons now contains more than 100 million uploaded files.



If it bleeds, it leads

By Igordebraga, Ollieisanerd, Death Editor 2, Ltbdl, CAWylie

This traffic report is adapted from the Top 25 Report, prepared with commentary by Igordebraga, Ollieisanerd, Death Editor 2, Ltbdl, and CAWylie.

Someone I'll always laugh with (October 29 to November 4)

Rank Article Views Notes/about
1 Matthew Perry 13,203,826 Two years ago, Friends: The Reunion had many Wikipedia readers searching for the portrayer of Chandler Bing given his withdrawn performance. And now even more went to Perry's article to mourn his death at 54, capping a hard life marked by struggles with alcoholism and drugs (mostly prescription ones) — which he recalls with the same Chandler-like self-deprecating humor in his memoir Friends, Lovers, and the Big Terrible Thing. Aside from Friends, Perry had roles in film and television including The Odd Couple, Go On, The Whole Nine Yards, and 17 Again.
2 2023 Cricket World Cup 4,882,524 The premier cricket tournament heats up with two teams having qualified for the semi-finals and two others already eliminated, the latter ironically including the reigning champions England. New Zealand, the runners-up of the previous edition, also seem to be struggling after a great start.
3 Cricket World Cup 3,674,514
4 John Bennett Perry 1,552,592 #1's father, who left Matthew's mother (a Canadian former beauty pageant contestant who would work for years with Prime Minister Pierre Trudeau) when Matthew was still a baby to pursue an acting career. That career would go on to include Old Spice commercials, movies like George of the Jungle and shows like Falcon Crest and, alongside his son, an episode of Friends and two occasions actually portraying the father of Matthew's character (the film Fools Rush In and an episode of Scrubs).
5 Five Nights at Freddy's (film) 1,475,816 After years in development hell, during which it got its lead in the killer animatronics genre taken by Willy's Wonderland and The Banana Splits Movie, this video game adaptation finally hit theaters. While it got negative reviews claiming the movie only works for previous fans, name recognition led to huge box office numbers — even if it was also available with a Peacock subscription — opening to $80 million in North America while costing only a fourth of that, and it has earned more than 10 times its budget with $200 million worldwide!
6 Leo (2023 Indian film) 1,244,556 Kollywood filmmaker Lokesh Kanagaraj has a cinematic universe to call his own, and the latest installment featuring Vijay as a man pursued by gangsters is one of India's highest-grossing films of the year. A sequel is expected, although Lokesh has three other movies to finish first.
7 2023 Israel–Hamas war 1,168,496 The war continues, with Israel bombing refugee camps, hospitals, and ambulances. Thousands of children have already died, and many more will continue to die. The light at the end of the tunnel will only get darker and darker as the war marches on.
8 Halloween 1,040,028 The spooky holiday, held on October 31, marks the 30th anniversary of the one movie that can be watched on both this day and December 25, The Nightmare Before Christmas.
9 Deaths in 2023 998,516 A favorite this time of year due to the above:
Spooky, Scary Skeletons,
Send shivers down your spine,
Shrieking skulls will shock your soul,
Seal your doom tonight.
10 Killers of the Flower Moon (film) 970,325 Martin Scorsese's epic about the Osage Indian murders that happened in Oklahoma earned lots of critical praise and, while it only earned half of its massive $200 million budget at the box office thus far, production company Apple Studios probably doesn't care — especially if big viewership numbers happen whenever it moves from theaters to Apple TV+.

It's like you're always stuck in second gear (November 5 to 11)

Rank Article Views Notes/about
1 2023 Cricket World Cup 5,430,721 India hosts the world championship of its most popular sport and dominates, winning all its games in the recently finished group stage. The semifinals have the Indians against New Zealand's Black Caps on one side, with Australia facing South Africa's Proteas on the other. (Also, even if these articles finally broke the 96% mobile views threshold that would warrant an exclusion, we'll give them a pass; it's only two more weeks anyway.)
2 Cricket World Cup 4,193,641
3 The Marvels 1,472,894 As She-Hulk: Attorney at Law mocked, the Marvel Cinematic Universe sadly reached a phase earning much contempt from the manosphere, who were rooting against the return of Brie Larson as Carol Danvers/Captain Marvel, now joined by Iman Vellani's Kamala Khan/Ms. Marvel (star of an eponymous show) and Teyonah Parris' Monica Rambeau (introduced as a child in Captain Marvel, but her adult form and powers first appeared in WandaVision). Even if analysts are expecting an unimpressive opening weekend due to, among other things, superhero fatigue and promotion being kneecapped by the actors' strike ending the day before The Marvels would open, critical reception has been mixed to positive, noting that it's a fun, unambitious project that can win over audiences that don't go in expecting to hate the movie. How could anyone disapprove of a project with a musical number straight out of Bollywood (ironically, with one of the three heroines being Pakistani-American)? In any case, at least 2024 will reduce the MCU's overexposure, as the Hollywood strikes ensured that Deadpool 3 will be the year's only theatrical release.
4 Deaths in 2023 918,367 Ev'ry Time We Say Goodbye
I wonder why a little
Why the gods above me
Who must be in the know
Think so little of me
They allow you to go
5 Matthew Perry 911,854 The world continues to mourn the unfortunate and accidental death of Matthew Perry, which has not been given an official cause but is certainly linked to years of substance abuse (his autobiography recalls periods where he took dozens of pills per day, and that he spent millions of dollars trying to stop drinking). Among the tributes, HBO Max added a dedication to Perry at the start of each season of Friends.
6 2023 Israel–Hamas war 888,327 Still happening, and still awful. And as if bombings and ground invasions weren't enough, Israel earns extra criticism from the international community for cutting off resources from the Gaza Strip to make life even worse for those caught in the crossfire, with the Gazan health system being particularly hindered.
7 Leo (2023 Indian film) 888,124 Co-written and directed by Lokesh Kanagaraj (pictured) and released in mid-October, Leo has become the second-highest grossing Tamil film of 2023. It is the third in the Lokesh Cinematic Universe.
8 Virat Kohli 841,214 Making his fourth appearance in a Cricket World Cup (see #1 and 2), Virat has scored 1,000 runs this year (the eighth time in his career), and, on his birthday (5 November), he broke the record for the fastest 49th century (277 innings).
9 Diana Nyad 750,906 Netflix released Nyad, in which Annette Bening plays this author and long-distance swimmer who in 2013 decided at the age of 64 to swim from Cuba to Florida.
10 Josh Dobbs 671,189 Since being drafted into the National Football League in 2017, Dobbs has played for seven teams (some twice). On 5 November, this perpetual second-string quarterback secured a comeback win for the Minnesota Vikings by throwing a touchdown pass, the first player at that position to do so consecutively for three different teams in one season.




Canceling disputes as the real function of ArbCom

By Bri, Ca, and Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"Canceling Disputes: How Social Capital Affects the Arbitration of Disputes on Wikipedia"

Reviewed by Bri

This provocative paper in Law and Social Inquiry[1] by a socio-legal scholar shows, through research mostly based on interviews with Wikipedia insiders, that the Arbitration Committee functions to cancel disputes, not to arbitrate to a compromise position, nor to reach a negotiated settlement, nor to actively promote truthful content (which one might naïvely have inferred from the name of the Committee).

Some of the arguments used in the paper are both arresting and concerning. This reviewer found the interpretive language, and the often verbatim quotes of people involved in the arbitration process — often deeply involved, including at least one described as a member of the Committee — more compelling than the light data analysis included in the paper. The author interviewed 28 editors: current and former members of the Committee, those who have been involved parties, those who have commented on cases, and those "who have knowledge of the dispute resolution process due to their long-standing involvement with Wikipedia" (not further defined).

"Social Capital and the Arbitration Committee's Remedies" (figure 2 from the paper)

The data analysis consisted of a breakdown of sanction severities against edit count (as a proxy for social capital). It found a negative correlation between social capital and severity, by examining edit count against light severity outcomes (admonishment) and heavy severity (up to and including site bans); see figure 2 above. The author presented two potential interpretations: one, the conventional one, that more mature and upstanding editors with deep social capital were more likely to obey norms; the other, that those editors with the social capital were free to disobey norms without severe consequences because of the wiki's empowerment of bad behavior through various means. In essence, this would validate the idea of a "cabal", or that a "too essential to be lost" mentality endows a "wiki aristocracy" capable of creating either true consensus or promoting their "version of the truth", to quote the paper (p. 15). It was the interview-based, non-data-driven part of the study that attempted to determine which of the competing theories was correct.

The key idea in the paper is that social capital — largely built up and represented by an editor's edit count regardless of their ability to peacefully coexist with other editors — is the most important factor when it comes to arbitration. The committee's purpose is to quash disputes in order for editing to continue, not to reach a "just" outcome in some broader sense. One way the social capital is expressed and brought to bear is essentially in the opening phases of an arbitration case, called preliminary statements. If one reads between the lines of the paper, the outcome is frequently predetermined by these opening phases and all that the committee can do is go along with the crowd. In fact, it is explicitly stated — again based on evidence gathered from insiders — that cases are frequently orchestrated off-wiki precisely in order to stack the deck against the other side.

[A] Wikipedia insider told me how a disputant prepared her "faction" for months before bringing a case before the Arbitration Committee (which she ended up winning). These efforts are usually made covertly, as Wikipedia norms prohibit what is called "canvassing"...for instance ... on a secret mailing list ... A long-standing editor who was described as a member of Wikipedia's "aristocracy" told me: "we are a tight clique of very long-standing editors and none of our words find their way onto the site"...
— p. 12

Black and white line-drawn cartoon of some hooded figures participating in a ritual
"There's no cabal" (a classic community cartoon, first posted on the French Wikipedia in 2006)

Sadly for Wikipedians, the author concludes that it is the Machiavellian use of power that holds true on Wikipedia, or in other words, that there is a cabal. One passage that comes across as especially skeptical of this structure is found on p. 17: "an editor compared the Arbitration Committee to 'riot cops' ... [who] can be compared to the 'repressive peacemakers' ... guaranteeing the level of social peace that is necessary for the Wikipedia project to unfold, even to the detriment of fairness." Then the author appears to equate the arbitration process to a trial by ordeal, a feudal concept eschewed by the West in favor of due process based legal proceedings, further saying that

My empirical findings are consistent with the argument that, despite its rhetoric of inclusiveness ("anyone can edit"), Wikipedia is a "unwelcoming and exclusive environment" for newcomers, which tends to reinforce the "hegemony" of a consensus that is mostly shaped and controlled by white Western men.
— p. 19

Summing up on the next page:

[W]hat emerges from the evidence I have collected, and is perhaps more conclusive, is that experienced editors with dense networks are well positioned to avoid the consequences of their own breaches and to use their power to prevail in disputes against weaker parties.
— p. 20

In other words, a system that puts the powerful above the law.

15% of datasets for fine-tuning language models use Wikipedia

Reviewed by Tilman Bayer

A new preprint titled "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI"[2] presents results from "a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace [...] the data lineage of 44 of the most widely used and adopted text data collections, spanning 1800+ finetuning datasets" that have been published on platforms such as Hugging Face or GitHub. The authors make their resulting annotated dataset of annotated datasets available online, searchable via a "Data Provenance Explorer".

The paper presents various quantitative results based on this dataset. Wikipedia was found to be the most widely used source domain, occurring in 14.9% (p. 14) or 14.6% (Table 4, p. 13) of the 1800+ datasets. This result illustrates the value Wikipedia provides for AI (although it also means, conversely, that over 85% of those datasets made no use of Wikipedia).

The paper highlights the following example of such a dataset that used Wikipedia:

Supervised Dataset Example: SQuAD

Rajpurkar et al. (2016) present a prototypical supervised dataset on reading comprehension. To create the dataset, the authors take paragraph-long excerpts from 539 popular Wikipedia articles and hire crowd-source workers to generate over 100,000 questions whose answers are contained in the excerpt. For example:

Wikipedia Excerpt In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity.

Worker-generated question: What causes precipitation to fall? Answer: Gravity

Here the authors use Wikipedia text as a basis for their data and their dataset contains 100,000 new question-answer pairs based on these texts.
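
To make the structure of such a dataset entry concrete, here is a minimal Python sketch of a single question-answer record built from the excerpt quoted above. The class and field names (`QARecord`, `context`, `answer_start`) are illustrative assumptions, not the schema used by the SQuAD authors or by the paper. The key property shown is that the answer is a span of the Wikipedia excerpt, so it can be located by character offset.

```python
from dataclasses import dataclass


@dataclass
class QARecord:
    """One reading-comprehension example: the answer must appear in the context."""
    context: str   # Wikipedia excerpt
    question: str  # worker-generated question
    answer: str    # answer text, expected to be a span of the context

    def answer_start(self) -> int:
        """Character offset of the answer in the context (case-insensitive), or -1."""
        return self.context.lower().find(self.answer.lower())


record = QARecord(
    context=(
        "In meteorology, precipitation is any product of the condensation "
        "of atmospheric water vapor that falls under gravity."
    ),
    question="What causes precipitation to fall?",
    answer="Gravity",
)

start = record.answer_start()
# Recover the span exactly as it is written in the excerpt.
span = record.context[start:start + len(record.answer)]
print(start >= 0, repr(span))
```

The lookup is case-insensitive because the worker's answer ("Gravity") is capitalized differently from the excerpt ("gravity").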

The bulk of the paper is of less interest to Wikimedians specifically, focusing instead on general questions about the sourcing information about these datasets ("we are in the midst of a crisis in dataset provenance") and their licenses (observing e.g. "sharp divides in composition and focus of commercially open vs. closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data"). An extensive "Legal Discussion" section acknowledges that the paper leaves out "several important related questions on the use of copyrighted works to create supervised datasets and on the copyrightability of training datasets." In particular, it does not examine whether the Wikipedia-based datasets satisfy the requirements of Wikipedia's CC BY-SA license. Regarding the use of CC-licensed datasets in AI in general, the authors note: "One of the challenges is that licenses like the Apache and the Creative Commons outline restrictions related to 'derivative' or 'adapted works' but it remains unclear if a trained model should be classified as a derivative work." They also remind readers that "In the U.S., the fair use exception may allow models to be trained on protected works," although "the application of fair use in the context is still evolving and several of these issues are currently being litigated".

(The datasets examined in the paper are to be distinguished from the much larger unlabeled text corpuses used for the initial unsupervised training of large language models (LLMs). There, Wikipedia is also known to have been used, alongside other sources such as Common Crawl, e.g. for the GPT-3 family that formed the basis of ChatGPT.)

Wikipedia biggest "loser" in recent Google Search update

A blog post[3] by Search Engine Optimization firm Amsive (recommended as "extensive (and fascinating) research" in a recent The Verge feature about the SEO industry) analyzes the impact of an August 2023 "core update" by Google Search. The post explains that

Google [...] announced a new signal in its December updates to the Search Quality Rater guidelines: “E” for experience. The “E” is a new member of the E-A-T family, now called E-E-A-T, and stands for experience, expertise, authoritativeness, and trustworthiness. According to Google, the amount of E-E-A-T required for a page or site to be considered high-quality depends on the nature of the content and the extent to which it can cause harm to users. [...] Search Quality Raters have been working off this new version of the Quality Guidelines to review the quality of Google’s results and evaluate E-E-A-T for 9 months now, giving Google plenty of time to update its algorithms with the feedback provided by quality raters.

The analysis of Google's August update focuses on "the list of the top 1,000 winners and losers in both absolute and percentage terms, using Sistrix Visibility Index scores using the U.S. index." (Sistrix's index, which is generally not freely available, is calculated based on search results for one million keywords, weighted by search volume and estimated click probability, and aggregated by domain.) Wikipedia tops the "Absolute Losers" list for Google's August 2023 update, with a larger score decrease than the sites ranked #2 and #3. Still, in relative terms, Wikipedia's score decline of 6.75% doesn't even make the "Percent Losers" list of the 250 sites with the biggest percentage declines. And in better news for Wikimedians, a Wikimedia sister project ranked #3 on the "Absolute Winners" list, and another also gained, reaching #38 on the same list (with an index increase of 37.38% in relative terms). What's more, in Amsive's similar analysis of Google's preceding March 2023 core update, which had been "highly anticipated given the significant changes affecting organic search" in the preceding months (of which the E-E-A-T announcement was just one), Wikipedia had conversely topped the "Absolute Winners" list, with a 10.16% relative increase. Then again, back then Wiktionary topped the March 2023 update's "Absolute Losers" list ahead of the sites ranked #2 and #3, although both of those had a larger relative decrease than Wiktionary's -22.66%. This may indicate that such changes are merely transient snapshots in the long timeline of Google Search. (And indeed Google has since conducted two further "core updates" for October and November 2023, which Amsive does not appear to have analyzed yet.) Still, these results illustrate that Wikipedia's prominence in search engine results is by no means guaranteed or static.
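
The distinction between "absolute" and "percent" losers is easier to see with a toy calculation. The sketch below is an assumption-laden illustration in Python, not Sistrix's actual (non-public) formula: it scores each domain as the sum, over keywords, of search volume times an estimated click probability for the domain's ranking position, then compares two snapshots. The click-probability values, keywords, volumes and `.example` domains are all hypothetical.

```python
from collections import defaultdict

# Hypothetical click probability by result position (not Sistrix's real values).
CLICK_PROBABILITY = {1: 0.30, 2: 0.15, 3: 0.10}


def visibility_index(results, search_volume):
    """Aggregate volume-weighted click probability per domain.

    results: {keyword: [domain at position 1, domain at position 2, ...]}
    search_volume: {keyword: searches per period}
    """
    scores = defaultdict(float)
    for keyword, ranked_domains in results.items():
        for position, domain in enumerate(ranked_domains, start=1):
            scores[domain] += search_volume[keyword] * CLICK_PROBABILITY.get(position, 0.0)
    return dict(scores)


def changes(before, after):
    """Per-domain (absolute change, percent change) between two snapshots."""
    out = {}
    for domain, old in before.items():
        delta = after.get(domain, 0.0) - old
        out[domain] = (delta, 100.0 * delta / old)
    return out


volume = {"kw1": 1000, "kw2": 1000, "kw3": 100}  # hypothetical keywords
before = visibility_index(
    {"kw1": ["big.example", "other.example"],
     "kw2": ["big.example", "other.example"],
     "kw3": ["small.example", "other.example"]},
    volume,
)
after = visibility_index(
    {"kw1": ["big.example", "other.example"],
     "kw2": ["other.example", "big.example"],   # big.example slips to position 2
     "kw3": ["other.example", "small.example"]},  # small.example slips to position 2
    volume,
)

for domain, (delta, percent) in changes(before, after).items():
    print(f"{domain}: {delta:+.1f} ({percent:+.1f}%)")
```

In this toy data the large site loses more index points in absolute terms, while the small site loses a larger share of its score. That is the same pattern as a site topping an "Absolute Losers" list without appearing on the corresponding "Percent Losers" list.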


Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Ca and Tilman Bayer

"Evaluation of Accuracy and Adequacy of Kimchi Information in Major Foreign Online Encyclopedias"

From the abstract:[4]

"In this study, we analyzed the content and quality of kimchi information in major foreign online encyclopedias, such as Baidu Baike, Encyclopædia Britannica, Citizendium, and Wikipedia. Our results revealed that the kimchi information provided by these encyclopedias was often inaccurate or inadequate, despite kimchi being a fundamental part of Korean cuisine. The most common inaccuracies were related to the definition and origins of kimchi and its ingredients and preparation methods."

"Speech Wikimedia: A 77 Language Multilingual Speech Dataset"

From the abstract:[5]


"The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models."

"WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections"

From the "Conclusion" section:[6]

"We created WIKITABLET, a dataset that contains Wikipedia article sections and their corresponding tabular data and various metadata. WIKITABLET contains millions of instances covering a broad range of topics and kinds of generation tasks. Our manual evaluation showed that humans are unable to differentiate the [original Wikipedia text] and model generations [by transformer models that the authors trained specifically for this task]. However, qualitative analysis showed that our models sometimes struggle with coherence and factuality, suggesting several directions for future work."

The authors of this 2021 paper note that they "did not experiment with pretrained models [such as the GPT series] because they typically use the entirety of Wikipedia, which would presumably overlap with our test set."

"Using natural language generation to bootstrap missing Wikipedia articles: A human-centric perspective"

From the abstract:[7]

"Recent advances in machine learning [this sentence appears to have been written in 2020] have made it possible to train NLG [natural language generation] systems that seek to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with Wikipedia readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-resourced Wikipedia language versions, which displays structured data from the Wikidata knowledge base on empty Wikipedia pages. We train a neural network to generate an introductory sentence from the Wikidata triples shown by the ArticlePlaceholder, and explore how Wikipedia users engage with it. The evaluation, which includes an automatic, a judgement-based, and a task-based component, shows that the summary sentences score well in terms of perceived fluency and appropriateness for Wikipedia, and can help editors bootstrap new articles."

The paper, published in 2022, does not yet mention the related Abstract Wikipedia project.

"XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages"

From the abstract:[8]

"Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages. But, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary."

The paper's "Related work" section provides a useful literature overview, noting e.g. that

"Automated generation of Wikipedia text has been a problem of interest for the past 5–6 years. Initial efforts in the fact-to-text (F2T) line of work focused on generating short text, typically the first sentence of Wikipedia pages using structured fact tuples. [...] Seq-2-seq neural methods [including various LSTM architectures and efforts based on pretrained transformers] have been popularly used for F2T. [...]
Besides generating short Wikipedia text, there have also been efforts to generate Wikipedia articles by summarizing long sequences. [...] For all of these datasets, the generated text is either the full Wikipedia article or text for a specific section."

The authors note that most of these efforts have been English-only.

See also our 2018(!) coverage of various fact-to-text efforts, going back to 2016: "Readers prefer summaries written by a neural network over those by Wikipedians 40% of the time — but it still suffers from hallucinations"


  1. ^ Grisel, Florian (2023-05-04). "Canceling Disputes: How Social Capital Affects the Arbitration of Disputes on Wikipedia". Law & Social Inquiry: 1–22. doi:10.1017/lsi.2023.15. ISSN 0897-6546.
  2. ^ Longpre, Shayne; Mahari, Robert; Chen, Anthony; Obeng-Marnu, Naana; Sileo, Damien; Brannon, William; Muennighoff, Niklas; Khazam, Nathan; Kabbara, Jad; Perisetla, Kartik; Wu, Xinyi; Shippole, Enrico; Bollacker, Kurt; Wu, Tongshuang; Villa, Luis; Pentland, Sandy; Roy, Deb; Hooker, Sara (2023-11-04), The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI, arXiv, doi:10.48550/arXiv.2310.16787
  3. ^ Ray, Lily (2023-09-12). "Google August 2023 Core Update: Winners, Losers & Analysis". Amsive blog.
  4. ^ Park, Sung Hoon; Lee, Chang Hyeon (2023). "Evaluation of Accuracy and Adequacy of Kimchi Information in Major Foreign Online Encyclopedias". Journal of the Korean Society of Food Culture. 38 (4): 203–216. doi:10.7318/KJFC/2023.38.4.203. ISSN 1225-7060. (in Korean, with English abstract)
  5. ^ Gómez, Rafael Mosquera; Eusse, Julián; Ciro, Juan; Galvez, Daniel; Hileman, Ryan; Bollacker, Kurt; Kanter, David (2023-08-29), Speech Wikimedia: A 77 Language Multilingual Speech Dataset, arXiv, doi:10.48550/arXiv.2308.15710
  6. ^ Chen, Mingda; Wiseman, Sam; Gimpel, Kevin (August 2021). "WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections". Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Findings 2021. Online: Association for Computational Linguistics. pp. 193–209. doi:10.18653/v1/2021.findings-acl.17. code, data and models
  7. ^ Kaffee, Lucie-Aimée; Vougiouklis, Pavlos; Simperl, Elena (2022-01-01). "Using natural language generation to bootstrap missing Wikipedia articles: A human-centric perspective". Semantic Web. 13 (2): 163–194. doi:10.3233/SW-210431. ISSN 1570-0844.
  8. ^ Taunk, Dhaval; Sagare, Shivprasad; Patil, Anupam; Subramanian, Shivansh; Gupta, Manish; Varma, Vasudeva (2023-04-30). "XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages". Proceedings of the ACM Web Conference 2023. WWW '23. New York, NY, USA: Association for Computing Machinery. pp. 1703–1713. doi:10.1145/3543507.3583405. ISBN 9781450394161. closed access, eprint version: Taunk, Dhaval; Sagare, Shivprasad; Patil, Anupam; Subramanian, Shivansh; Gupta, Manish; Varma, Vasudeva (2023-03-22), XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages, doi:10.1145/3543507.3583405, code and dataset



Wikimania 2024 scholarships

By Nadzik
This article is written by Nadzik, on behalf of the Wikimania 2024 Core Organizing Team.

Scholarship applications for Wikimania 2024 are now open!

Apply before 18 December 2023

Scholarship applications for Wikimania 2024 are open until 18 December. The Core Organizing Team is offering full and partial scholarships to attend Wikimania in person in Poland. Wikimania will take place either at the end of July or the beginning of August. The exact dates of the event will be announced soon.

Wikimania Spirit

Next year’s Wikimania Spirit will be “Collaboration of the Open” — a celebration of the ways we work together, in the open and for the larger open movement, to bring free knowledge to the world. It will once again be a hybrid event with virtual participation on Eventyay, an open-source event management platform.

More information about scholarships is on the Wikimania Wiki and in the Diff post.



The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0