Il Post published a "flash" article (in Italian) highlighting American photographer Gage Skidmore, starting from a question asked by a Reddit user in the site's Wikipedia-related thread last October: “Why is everyone's photo on Wikipedia a picture of them at San Diego Comic-Con?” The reason is actually Skidmore’s work: as mentioned by Il Post, since 2009 the Indiana-native photographer has attended each year’s Comic-Con to take pictures of guest actors and actresses (including Tom Holland and Scarlett Johansson, among many others), before publishing them on Wikimedia Commons under a Creative Commons license.
Over the years, the tens of thousands of pictures taken by Skidmore, who is also well known for his photos of candidates in American presidential elections, have been used not only on Wikipedia but also by many media outlets, such as The Washington Post, The Atlantic, Associated Press and NPR. Those who want to help out themselves can consult Wikipedia:Uploading images. – O
The music video starts without music. We look down on a yellow wheat field, otherwise only seeing a young man walking away from us. Five seconds in, the view jumps so that the horizon divides the screen in half — the blue sky above, the yellow wheat below — a tableau of the Ukrainian flag. At the same time, the young man screams "I'm Russian", seeming to say that Ukraine (or its flag) is part of Russia. Strangely, that position is Vladimir Putin's justification for the invasion: that Ukraine was, is, and always will be part of Russia. Of course, the pop patriotic anthem genre is not unique to Russia. But some artists handle it very differently.
The singer Yaroslav Dronov, better known as Shaman, first became popular with the song Встанем ("We rise up"), released on February 23, 2022. Befitting any song released on Defenders of the Fatherland Day, it praises Russians who sacrificed themselves to rid the world of fascism, and calls upon today's Russians to be prepared to take up the same cause. The next day, Russia invaded Ukraine, and suddenly the song had a different meaning. Five months later, he had another huge hit in Russia, the pop patriotic anthem discussed above, Я русский ("I'm Russian").
The Moscow Times wrote on November 7 that "critics (are) accusing the singer of acting as part of the Kremlin’s wartime propaganda machine." MT also wrote "According to Dissernet, more than half of Shaman's 2016 thesis, which earned him the equivalent of a Ph.D. in art history at the Gnessin Academy of Music, contained excerpts lifted directly from other sources." Out of 35 total pages in the thesis, 6 were plagiarized from Wikipedia and 13 other pages were plagiarized from other sources, leaving only 16 pages (including the title page, the table of contents, and some appendices) which did not contain plagiarism. One should note that Dissernet only publishes preliminary reviews, which can be re-evaluated or deleted at any time. The review MT cited was deleted on November 8.
Putin is getting ready to declare his re-election campaign for the Russian presidency in mid-December, according to The Bell. Shaman, who "has become the latest symbol of Russian military propaganda", is expected to be part of a small group of influencers acting as key campaigners and supporters for Putin. – S
According to a press release, the Arch Mission Foundation's "second installment of the historic Lunar Library will launch to the Moon's surface later this year aboard Astrobotic's Peregrine Lander." The library's "foundational components" include "the Wikipedia" alongside "collections from Project Gutenberg, the Internet Archive, and the Long Now Foundation's Rosetta Project and PanLex datasets." Stored in the form of laser-etched analogue images on thin sheets of nickel, the library is assumed to be "capable of lasting for up to billions of years on the Moon." In 2019, transportation of the first installment of the library onboard the Israeli Beresheet mission had ended in a crash landing, but Arch Mission stated at the time that the contents likely survived intact (Signpost coverage: "Vital Articles backed up on the Moon").
The "Lunar Library" project is not to be confused with the "Wikipedia to the Moon" effort championed by Wikimedia Germany, which was envisaged to bring a disc with a community-selected collection of articles to the Moon by 2017 (Signpost coverage in 2016: "Mixed reactions to Wikipedia's lunar time-capsule"). The chapter's partner (now called Planetary Transportation Systems after several renames and a bankruptcy) does not yet appear to have launched or participated in a Moon mission at the time of writing. – H
The Wikimedia Foundation has released the audit report for the fiscal year 2022–2023, prepared by its auditors, KPMG. You can read the full report here and a summary on Diff. The main takeaways are slowed financial growth in line with targets, and record donations income. Here are some key figures.
The table below shows the development of Wikimedia Foundation finances over the past ten years, as indicated by its audit reports. Annual support and revenue has more than tripled, expenses have more than quadrupled, and net assets at the end of the financial year (not including the Wikimedia Endowment, which is organizationally separate) have increased more than fivefold.
|Year||Source||Revenue||Expenses||Asset rise||Net assets at end of year|
The Foundation also made a belated correction to the Endowment figures published a few weeks ago (see previous Signpost coverage). The table provided in late September had erroneously indicated financial years ending 30 June; in fact, the Foundation said, the figures provided related to fiscal years ending 31 December, in line with the Tides Foundation's accounting period. – NW1223, AK
A large backlog has developed at WP:Sockpuppet investigations, where there are dozens of cases pending in Category:SPI cases awaiting review and, at one point prior to publication of this issue, over 140 cases awaiting administrative finalization in Category:SPI cases awaiting archive.
Key to keeping this process running are the SPI clerks. Currently, only about a dozen are active. Clerks play an important part in aligning the English Wikipedia with the Wikimedia Foundation's Access to Nonpublic Personal Data Policy: they review cases carefully for evidence and endorse Checkuser use of tools that can reveal users' IP addresses and other private information. Such review and concurrence prior to use of the tools is important for maintaining community trust in pseudonymity and in the integrity of Checkuser tool use.
From the SPI Clerks page, this is what the clerks actually do:
Any user in good standing is considered qualified to apply at Wikipedia:Sockpuppet investigations/SPI/Clerks, and a talkpage discussion there (begun by this Signpost contributor) has indicated interest in new applicants. Applicants go through a semi-formal training process; non-admin trainees usually show good experience and working knowledge of the community's policies and practices at the point they request traineeship, and clerking can be a step on the way to adminship for some. – B
If you want to run for a place on the Arbitration Committee, you have until 23:59 UTC this Tuesday, November 21, to self-nominate in this year's election. Eight editors have already done so.
There is one week after the end of the self-nomination period before voting begins on Tuesday, November 28. Editors may use this period to ask questions of the candidates.
You may vote from Tuesday 00:00, 28 November 2023 (UTC) until Monday 23:59, 11 December 2023 (UTC) if you meet the following qualifications:
Founded in 2009, WMNYC is currently looking to hire its founding Executive Director.
Their duties will include:
Wikiconference North America 2023 was held from November 9 to 12, 2023 at the Toronto Reference Library. The program was interrupted on the morning of Saturday, November 11, when the library received a bomb threat. According to local media, the threat was received at 8:44 A.M. and the building was placed in a hold-and-secure state thereafter while police searched the building. No explosive devices were found, and the hold-and-secure state was lifted by 11:45 A.M., allowing programming to resume following that point.
The bomb threat comes two weeks after the Toronto Public Library system, of which the Toronto Reference Library is part, was hit with a ransomware attack on October 27. The ransomware attack resulted in staff social insurance numbers being compromised, and has caused prolonged outages in many of the library's digital systems.
|1||Matthew Perry||13,203,826||Two years ago, Friends: The Reunion had many Wikipedia readers searching for the portrayer of Chandler Bing given his withdrawn performance. And now even more went to Perry's article to mourn his death at 54, capping a hard life marked by struggles with alcoholism and drugs (mostly prescription ones) — which he recalls with the same Chandler-like self-deprecating humor in his memoir Friends, Lovers, and the Big Terrible Thing. Aside from Friends, Perry had roles in film and television including The Odd Couple, Go On, The Whole Nine Yards, and 17 Again.|
|2||2023 Cricket World Cup||4,882,524||The premier cricket tournament heats up with two teams having qualified for the semi-finals and two others already disqualified, the latter of which ironically includes the reigning champions England. The runners up of the previous edition New Zealand also seem to be struggling after a great start.|
|3||Cricket World Cup||3,674,514|
|4||John Bennett Perry||1,552,592||#1's father, who left Matthew's mother (a Canadian former beauty pageant contestant who would work for years with Prime Minister Pierre Trudeau) when their son was still a baby to pursue an acting career. That career would go on to include Old Spice commercials, movies like George of the Jungle, shows like Falcon Crest and, alongside Matthew, an episode of Friends and two occasions actually portraying his father (the film Fools Rush In and an episode of Scrubs).|
|5||Five Nights at Freddy's (film)||1,475,816||After years in development hell, during which it got its lead in the killer animatronics genre taken by Willy's Wonderland and The Banana Splits Movie, this video game adaptation finally hit theaters. While it got negative reviews claiming the movie only works for previous fans, name recognition led to huge box office numbers — even if it was also available with a Peacock subscription — opening to $80 million in North America while costing only a fourth of that, and it has earned more than 10 times its budget with $200 million worldwide!|
|6||Leo (2023 Indian film)||1,244,556||Kollywood filmmaker Lokesh Kanagaraj has a cinematic universe to call his own, and the latest installment featuring Vijay as a man pursued by gangsters is one of India's highest-grossing films of the year. A sequel is expected, although Lokesh has three other movies to finish first.|
|7||2023 Israel–Hamas war||1,168,496||The war continues, with Israel bombing refugee camps, hospitals, and ambulances. Thousands of children have already died, and many more will continue to die. The light at the end of the tunnel will only get darker and darker as the war marches on.|
|8||Halloween||1,040,028||The spooky holiday, held on October 31, marks the 30th anniversary of the one movie that can be watched on both this day and December 25, The Nightmare Before Christmas.|
|9||Deaths in 2023||998,516||A favorite this time of year due to the above:|
Spooky, Scary Skeletons,
Send shivers down your spine,
Shrieking skulls will shock your soul,
Seal your doom tonight.
|10||Killers of the Flower Moon (film)||970,325||Martin Scorsese's epic about the Osage Indian murders that happened in Oklahoma earned lots of critical praise and, while it only earned half of its massive $200 million budget at the box office thus far, production company Apple Studios probably doesn't care — especially if big viewership numbers happen whenever it moves from theaters to Apple TV+.|
|1||2023 Cricket World Cup||5,430,721||India hosts the world championship of its most popular sport and dominates, winning all the games in the recently finished group stage. The semifinals have the Indians against New Zealand's Black Caps on one side, with Australia facing South Africa's Proteas on the other. (Also, even if these articles finally broke the 96% mobile views threshold that would warrant an exclusion, we'll give it a pass; it's only two more weeks anyway.)|
|2||Cricket World Cup||4,193,641|
|3||The Marvels||1,472,894||As She-Hulk: Attorney at Law mocked, the Marvel Cinematic Universe sadly reached a phase earning much contempt from the manosphere, who were rooting against the return of Brie Larson as Carol Danvers/Captain Marvel, now joined by Iman Vellani's Kamala Khan/Ms. Marvel (star of an eponymous show) and Teyonah Parris' Monica Rambeau (introduced as a child in Captain Marvel, but her adult form and powers first appeared in WandaVision). Even if analysts are expecting an unimpressive opening weekend due to, among other things, superhero fatigue and promotion being kneecapped by the actors' strike ending the day before The Marvels would open, critical reception has been mixed to positive, noting that it's a fun, unambitious project that can win over audiences that don't go in expecting to hate the movie. How could one disapprove of a project with a musical number straight out of Bollywood (ironically, with one of the three heroines being Pakistani-American)? In any case, at least 2024 will reduce the MCU's overexposure, as the Hollywood strikes ensured that Deadpool 3 will be the year's only theatrical release.|
|4||Deaths in 2023||918,367||Ev'ry Time We Say Goodbye|
I wonder why a little
Why the gods above me
Who must be in the know
Think so little of me
They allow you to go
|5||Matthew Perry||911,854||The world continues to mourn the unfortunate and accidental death of Matthew Perry, which has not been given an official cause but is certainly linked to years of substance abuse (his autobiography recalls periods where he took dozens of pills per day, and that he spent millions of dollars trying to stop drinking). Among the tributes, HBO Max added a dedication to Perry at the start of each season of Friends.|
|6||2023 Israel–Hamas war||888,327||Still happening, and still awful. And as if bombings and ground invasions weren't enough, Israel earns extra criticism from the international community for cutting off resources from the Gaza Strip to make life even worse for those caught in the crossfire, with the Gazan health system being particularly hindered.|
|7||Leo (2023 Indian film)||888,124||Co-written and directed by Lokesh Kanagaraj (pictured) and released in mid-October, Leo has become the second-highest grossing Tamil film of 2023. It is the third in the Lokesh Cinematic Universe.|
|8||Virat Kohli||841,214||Making his fourth appearance in a Cricket World Cup (see #1 and 2), Virat has scored 1,000 runs this year (the eighth time in his career), and, on his birthday (5 November), he broke the record for the fastest 49th century (277 innings).|
|9||Diana Nyad||750,906||Netflix released Nyad, where Annette Bening plays this author and long distance swimmer who in 2013 decided at the age of 64 to swim from Cuba to Florida.|
|10||Josh Dobbs||671,189||Since being drafted into the National Football League in 2017, Dobbs has played for seven teams (some twice). On 5 November, this perpetual second-string quarterback secured a comeback win for the Minnesota Vikings by throwing a touchdown pass, the first player at that position to do so consecutively for three different teams in one season.|
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
This provocative paper in Law and Social Inquiry by a socio-legal scholar shows, through research mostly based on interviews with Wikipedia insiders, that the Arbitration Committee functions to cancel disputes, not to arbitrate to a compromise position, nor to reach a negotiated settlement, nor to actively promote truthful content (which one might naïvely have inferred from the name of the Committee).
Some of the arguments used in the paper are both arresting and concerning. This reviewer found the interpretive language, and the often verbatim quotes of people involved in the arbitration process — often deeply involved, including at least one described as a member of the Committee — more compelling than the light data analysis included in the paper. The author interviewed 28 editors: current and former members of the Committee, those who have been involved parties, those who have commented on cases, and those "who have knowledge of the dispute resolution process due to their long-standing involvement with Wikipedia" (not further defined).
The data analysis consisted of a breakdown of sanction severities against edit count (as a proxy for social capital). It found a negative correlation between social capital and severity, by examining edit count against light severity outcomes (admonishment) and heavy severity (up to and including site bans); see figure 2 above. The author presented two potential interpretations: one, the conventional one, that more mature and upstanding editors with deep social capital were more likely to obey norms; the other, that those editors with the social capital were free to disobey norms without severe consequences because of the wiki's empowerment of bad behavior through various means. In essence, this would validate the idea of a "cabal", or that a "too essential to be lost" mentality endows a "wiki aristocracy" capable of creating either true consensus or promoting their "version of the truth", to quote the paper (p. 15). It was this non-data-driven approach that attempted to find which of the competing theories was correct.
The key idea in the paper is that social capital — largely built up and represented by an editor's edit count regardless of their ability to peacefully coexist with other editors — is the most important factor when it comes to arbitration. The committee's purpose is to quash disputes in order for editing to continue, not to reach a "just" outcome in some broader sense. One way the social capital is expressed and brought to bear is essentially in the opening phases of an arbitration case, called preliminary statements. If one reads between the lines of the paper, the outcome is frequently predetermined by these opening phases and all that the committee can do is go along with the crowd. In fact, it is explicitly stated — again based on evidence gathered from insiders — that cases are frequently orchestrated off-wiki precisely in order to stack the deck against the other side.
[A] Wikipedia insider told me how a disputant prepared her "faction" for months before bringing a case before the Arbitration Committee (which she ended up winning). These efforts are usually made covertly, as Wikipedia norms prohibit what is called "canvassing"...for instance ... on a secret mailing list ... A long-standing editor who was described as a member of Wikipedia's "aristocracy" told me: "we are a tight clique of very long-standing editors and none of our words find their way onto the site"...
— p. 12
Sadly for Wikipedians, the author concludes that it is the Machiavellian use of power that holds true on Wikipedia, or in other words, that there is a cabal. One passage that comes across as especially skeptical of this structure is found on p. 17: "an editor compared the Arbitration Committee to 'riot cops' ... [who] can be compared to the 'repressive peacemakers' ... guaranteeing the level of social peace that is necessary for the Wikipedia project to unfold, even to the detriment of fairness." Then the author appears to equate the arbitration process to a trial by ordeal, a feudal concept eschewed by the West in favor of due process based legal proceedings, further saying that
My empirical findings are consistent with the argument that, despite its rhetoric of inclusiveness ("anyone can edit"), Wikipedia is a "unwelcoming and exclusive environment" for newcomers, which tends to reinforce the "hegemony" of a consensus that is mostly shaped and controlled by white Western men.
— p. 19
Summing up on the next page:
[W]hat emerges from the evidence I have collected, and is perhaps more conclusive, is that experienced editors with dense networks are well positioned to avoid the consequences of their own breaches and to use their power to prevail in disputes against weaker parties.
— p. 20
In other words, a system that puts the powerful above the law.
A new preprint titled "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI" presents results from "a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace [...] the data lineage of 44 of the most widely used and adopted text data collections, spanning 1800+ finetuning datasets" that have been published on platforms such as Hugging Face or GitHub. The authors make their resulting annotated dataset of annotated datasets available online, searchable via a "Data Provenance Explorer".
The paper presents various quantitative results based on this dataset. wikipedia.org was found to be the most widely used source domain, occurring in 14.9% (p. 14) or 14.6% (Table 4, p. 13) of the 1800+ datasets. This result illustrates the value Wikipedia provides for AI (although it also means, conversely, that over 85% of those datasets made no use of Wikipedia).
The paper highlights the following example of such a dataset that used Wikipedia:
Supervised Dataset Example: SQuAD
Rajpurkar et al. (2016) present a prototypical supervised dataset on reading comprehension. To create the dataset, the authors take paragraph-long excerpts from 539 popular Wikipedia articles and hire crowd-source workers to generate over 100,000 questions whose answers are contained in the excerpt. For example:
Wikipedia Excerpt In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity.
Worker-generated question: What causes precipitation to fall? Answer: Gravity
Here the authors use Wikipedia text as a basis for their data and their dataset contains 100,000 new question-answer pairs based on these texts.
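The structure of the quoted question-answer pair can be sketched in a few lines of Python. This is only an illustration of the idea that each answer is contained in the Wikipedia excerpt: the `QAExample` class and its `answer_start` method are hypothetical names chosen here, not the actual SQuAD file format or any library API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QAExample:
    """One reading-comprehension record (hypothetical layout, for illustration)."""
    context: str   # paragraph-long Wikipedia excerpt
    question: str  # worker-generated question
    answer: str    # answer text, expected to occur inside the excerpt

    def answer_start(self) -> Optional[int]:
        """Return the character offset of the answer in the excerpt, or None.

        The match is case-insensitive because the quoted answer ("Gravity")
        is capitalized differently from the excerpt ("gravity").
        """
        index = self.context.lower().find(self.answer.lower())
        return index if index >= 0 else None


# The example quoted in the paper.
example = QAExample(
    context=(
        "In meteorology, precipitation is any product of the condensation "
        "of atmospheric water vapor that falls under gravity."
    ),
    question="What causes precipitation to fall?",
    answer="Gravity",
)

start = example.answer_start()
if start is not None:
    # Print the matching span of the excerpt.
    print(example.context[start:start + len(example.answer)])
```

Storing the offset rather than only the answer text is one possible design choice; it lets a reader of the data confirm that the answer really is a span of the source excerpt.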
The bulk of the paper is of less interest to Wikimedians specifically, focusing instead on general questions about the sourcing information about these datasets ("we are in the midst of a crisis in dataset provenance") and their licenses (observing e.g. "sharp divides in composition and focus of commercially open vs. closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data"). An extensive "Legal Discussion" section acknowledges that the paper leaves out "several important related questions on the use of copyrighted works to create supervised datasets and on the copyrightability of training datasets." In particular, it does not examine whether the Wikipedia-based datasets satisfy the requirements of Wikipedia's CC BY-SA license. Regarding the use of CC-licensed datasets in AI in general, the authors note: "One of the challenges is that licenses like the Apache and the Creative Commons outline restrictions related to 'derivative' or 'adapted works' but it remains unclear if a trained model should be classified as a derivative work." They also remind readers that "In the U.S., the fair use exception may allow models to be trained on protected works," although "the application of fair use in the context is still evolving and several of these issues are currently being litigated".
(The datasets examined in the paper are to be distinguished from the much larger unlabeled text corpuses used for the initial unsupervised training of large language models (LLMs). There, Wikipedia is also known to have been used, alongside other sources such as Common Crawl, e.g. for the GPT-3 family that formed the basis of ChatGPT.)
A blog post by Search Engine Optimization firm Amsive (recommended as "extensive (and fascinating) research" in a recent The Verge feature about the SEO industry) analyzes the impact of an August 2023 "core update" by Google Search. The post explains that
Google [...] announced a new signal in its December updates to the Search Quality Rater guidelines: “E” for experience. The “E” is a new member of the E-A-T family, now called E-E-A-T, and stands for experience, expertise, authoritativeness, and trustworthiness. According to Google, the amount of E-E-A-T required for a page or site to be considered high-quality depends on the nature of the content and the extent to which it can cause harm to users. [...] Search Quality Raters have been working off this new version of the Quality Guidelines to review the quality of Google’s results and evaluate E-E-A-T for 9 months now, giving Google plenty of time to update its algorithms with the feedback provided by quality raters.
The analysis of Google's August update focuses on "the list of the top 1,000 winners and losers in both absolute and percentage terms, using Sistrix Visibility Index scores using the Google.com U.S. index." (Sistrix' - generally not freely available - index is calculated based on search results for one million keywords, weighted by search volume and estimated click probability, and aggregated by domain.)
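As a rough sketch of how an index of this kind could be computed, the Python below sums search volume times an estimated click probability for every ranked result and aggregates the totals by domain. The actual Sistrix formula is not public, so everything here is an assumption: the function names, the click-probability table, and the sample keywords and volumes are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical click probabilities by 1-based result position.
# These are NOT Sistrix's values; they only illustrate the weighting idea.
CLICK_PROBABILITY: Dict[int, float] = {1: 0.30, 2: 0.15, 3: 0.10}

# (keyword, search volume, domains in ranked order)
KeywordResult = Tuple[str, int, List[str]]


def visibility_scores(results: Iterable[KeywordResult]) -> Dict[str, float]:
    """Aggregate volume-weighted, click-weighted visibility per domain."""
    scores: Dict[str, float] = defaultdict(float)
    for _keyword, search_volume, ranked_domains in results:
        for position, domain in enumerate(ranked_domains, start=1):
            # Positions missing from the table contribute nothing.
            weight = CLICK_PROBABILITY.get(position, 0.0)
            scores[domain] += search_volume * weight
    return dict(scores)


def absolute_change(before: float, after: float) -> float:
    """Change in index points (the basis of an 'absolute' ranking)."""
    return after - before


def percent_change(before: float, after: float) -> float:
    """Change relative to the starting score, in percent."""
    return (after - before) / before * 100.0


# Made-up sample data: two keywords with their ranked result domains.
sample = [
    ("kimchi", 1000, ["wikipedia.org", "britannica.com"]),
    ("moon", 500, ["britannica.com", "wikipedia.org"]),
]
print(visibility_scores(sample))
```

Under these assumptions, a domain with a very large score can lose the most index points in absolute terms while its percentage change stays small, which is why the absolute and percentage lists can rank the same sites very differently.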
wikipedia.org tops the "Absolute Losers" list for Google's August 2023 update, with a larger score decrease than youtube.com (#2) and amazon.com (#3). Still, in relative terms, Wikipedia's score decline of -6.75% doesn't even make the "Percent Losers" list of the 250 sites with the biggest percentage declines. And in better news for Wikimedians, wiktionary.org ranked #3 on the "Absolute Winners" list (right before britannica.com at #4). wikivoyage.org also gained, reaching #38 on the same list (with an index increase of 37.38% in relative terms). What's more, in Amsive's similar analysis of Google's preceding March 2023 core update, which had been "highly anticipated given the significant changes affecting organic search" in the preceding months (of which the E-E-A-T announcement was just one), wikipedia.org had conversely topped the "Absolute Winners" list, with a 10.16% relative increase. Then again, wiktionary.org topped the March 2023 update's "Absolute Losers" list ahead of urbandictionary.com (#2) and thefreedictionary.com (#3), although both had a larger relative decrease than Wiktionary's -22.66%. This may indicate that such changes are merely palimpsestuous snapshots of the long timeline of Google Search. (And indeed Google has since conducted two further "core updates" for October and November 2023, which Amsive does not appear to have analyzed yet.) Still, these results illustrate that Wikipedia's prominence in search engine results is by no means ubiquitous or static.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:
In this study, we analyzed the content and quality of kimchi information in major foreign online encyclopedias, such as Baidu Baike, Encyclopædia Britannica, Citizendium, and Wikipedia. Our results revealed that the kimchi information provided by these encyclopedias was often inaccurate or inadequate, despite kimchi being a fundamental part of Korean cuisine. The most common inaccuracies were related to the definition and origins of kimchi and its ingredients and preparation methods.
"The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models."
From the "Conclusion" section:
"We created WIKITABLET, a dataset that contains Wikipedia article sections and their corresponding tabular data and various metadata. WIKITABLET contains millions of instances covering a broad range of topics and kinds of generation tasks. Our manual evaluation showed that humans are unable to differentiate the [original Wikipedia text] and model generations [by transformer models that the authors trained specifically for this task]. However, qualitative analysis showed that our models sometimes struggle with coherence and factuality, suggesting several directions for future work."
The authors of this 2021 paper note that they "did not experiment with pretrained models [such as the GPT series] because they typically use the entirety of Wikipedia, which would presumably overlap with our test set."
From the abstract:
"Recent advances in machine learning [this sentence appears to have been written in 2020] have made it possible to train NLG [natural language generation] systems that seek to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with Wikipedia readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-resourced Wikipedia language versions, which displays structured data from the Wikidata knowledge base on empty Wikipedia pages. We train a neural network to generate an introductory sentence from the Wikidata triples shown by the ArticlePlaceholder, and explore how Wikipedia users engage with it. The evaluation, which includes an automatic, a judgement-based, and a task-based component, shows that the summary sentences score well in terms of perceived fluency and appropriateness for Wikipedia, and can help editors bootstrap new articles."
The paper, published in 2022, does not yet mention the related Abstract Wikipedia project.
From the abstract:
"Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages. But, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary."
The paper's "Related work" section provides a useful literature overview, noting e.g. that
"Automated generation of Wikipedia text has been a problem of interest for the past 5–6 years. Initial efforts in the fact-to-text (F2T) line of work focused on generating short text, typically the first sentence of Wikipedia pages using structured fact tuples. [...] Seq-2-seq neural methods [including various LSTM architectures and efforts based on pretrained transformers] have been popularly used for F2T. [...]
Besides generating short Wikipedia text, there have also been efforts to generate Wikipedia articles by summarizing long sequences. [...] For all of these datasets, the generated text is either the full Wikipedia article or text for a specific section."
The authors note that most of these efforts have been English-only.
See also our 2018(!) coverage of various fact-to-text efforts, going back to 2016: "Readers prefer summaries written by a neural network over those by Wikipedians 40% of the time — but it still suffers from hallucinations"
Scholarship applications for Wikimania 2024 are now open!
Scholarship applications for Wikimania 2024 are open until 18 December. The Core Organizing Team is offering full and partial scholarships to attend Wikimania in person in Poland. Wikimania will take place either at the end of July or the beginning of August. The exact dates of the event will be announced soon.
Next year’s Wikimania Spirit will be “Collaboration of the Open” — a celebration of the ways we work together, in the open and for the larger open movement, to bring free knowledge to the world. It will once again be a hybrid event with virtual participation on Eventyay, an open-source event management platform.