A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A preprint titled "Language Agents Achieve Superhuman Synthesis of Scientific Knowledge"[1] introduces
"PaperQA2, a frontier language model agent optimized for improved factuality, [which] matches or exceeds subject matter expert performance on three realistic literature research tasks. PaperQA2 writes cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than existing, human-written Wikipedia articles."
It was published by "FutureHouse", a San Francisco-based nonprofit working on "Automating scientific discovery" (with a focus on biology). FutureHouse was launched last year with funding from former Google CEO Eric Schmidt (at which time it was anticipated it would spend about $20 million by the end of 2024). Generating Wikipedia-like articles about science topics is only one of the applications of "PaperQA2, FutureHouse's scientific RAG [retrieval-augmented generation] system", which is designed to aid researchers. (For example, FutureHouse also recently launched a website called "Has Anyone", described as a "minimalist AI tool to search if anyone has ever researched a given topic.")
In more detail, the researchers "engineered a system called WikiCrow, which generates cited Wikipedia-style articles about human protein-coding genes by combining several PaperQA2 calls on topics such as the structure, function, interactions, and clinical significance of the gene." Each call contributes a section of the resulting article (somewhat similar to another recent system, see our review: "STORM: AI agents role-play as 'Wikipedia editors' and 'experts' to create Wikipedia-like articles"). The prompts include the instruction to "Write in the style of a Wikipedia article, with concise sentences and coherent paragraphs".
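As described, WikiCrow's article generation is an assembly pattern: one independent retrieval-augmented call per section, concatenated into a single article. A minimal sketch of that pattern, with a hypothetical `ask_paperqa` stub standing in for a real PaperQA2 query (section names are the four topics the paper mentions; everything else here is illustrative, not FutureHouse's code):

```python
# Illustrative sketch (not FutureHouse's code): assemble a Wikipedia-style
# article from one retrieval-augmented call per section, as WikiCrow is
# described as doing.
SECTIONS = ["Structure", "Function", "Interactions", "Clinical significance"]

def ask_paperqa(gene: str, topic: str) -> str:
    # Placeholder: a real system would run retrieval + LLM summarization here.
    return f"Cited summary of the {topic.lower()} of {gene}."

def write_article(gene: str) -> str:
    parts = [f"== {gene} =="]
    for topic in SECTIONS:
        # One independent RAG call per section, then concatenate.
        parts.append(f"=== {topic} ===\n{ask_paperqa(gene, topic)}")
    return "\n\n".join(parts)

article = write_article("MGAT5B")
```

Because each section is produced by a separate call, the sections can be generated in parallel, which is how the blog post explains writing thousands of articles in days.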
At an average cost of around $4.50 each, the generated articles tended to be longer than their Wikipedia counterparts and of higher quality, at least according to the paper's evaluation method:
We used WikiCrow to generate 240 articles on genes that already have non-stub Wikipedia articles to have matched comparisons. WikiCrow articles averaged 1219.0 ± 275.0 words (mean ± SD, N = 240), longer than the corresponding Wikipedia articles (889.6 ± 715.3 words). The average article was generated in 491.5 ± 324.0 seconds, and had an average cost of $4.48 ± $1.02 per article (including costs for search and LLM APIs). We compared WikiCrow and Wikipedia on 375 statements sampled from the 240 paired articles. [...] The initial article sampling excluded any Wikipedia articles that were "stubs" or incomplete articles. Statements were then shuffled and given, blinded, to human experts, who graded statements according to whether they were (1) cited and supported; (2) missing a citation; or (3) cited and unsupported. We found that WikiCrow had significantly fewer "cited and unsupported" statements than the paired Wikipedia articles (13.5% vs. 24.9%) (p = 0.0075, χ2 (1), N = 375 for all tests in this section). WikiCrow failed to cite sources at a 3.9x lower rate than human written articles, as only 3.5% of WikiCrow statements were uncited, vs. 13.6% for Wikipedia (p < 0.001). In addition, defining precision for WikiCrow as the ratio of cited and supported statements over all cited statements, we found that WikiCrow displayed significantly higher precision than human-written articles (86.1% vs. 71.2%, p = 0.0013).
For the judgment of whether a particular statement was "supported" by the cited references, the concrete question put to the graders (described as "expert researchers" in the paper) was:
"Is the information correct, as cited? In other words, is the information stated in the sentence correct according to the literature that it cites?"
In addition, among other more detailed instructions, the graders were advised to mark a statement correct as cited even if it was not directly supported by the source, as long as the statement consisted of "broad context" judged to be "undergraduate biology student common knowledge" (akin to an extreme interpretation of WP:BLUE).
The fact that these rating criteria appear to be more liberal than Wikipedia's own, combined with the well-known general reputation of LLMs for generating hallucinations, makes the "WikiCrow displayed significantly higher precision" result rather remarkable. The authors double-checked it by examining the data more closely:
The "cited and unsupported" evaluation category includes both inaccurate statements (e.g. true hallucinations or reasoning errors) and statements that are accurate with inappropriate citations. To investigate the nature of the errors in Wikipedia and WikiCrow further, we manually inspected all reported errors and attempted to classify the issues as follows: reasoning issues, i.e. the written information contradicts, over-extrapolates, or is unsupported by any included citations; attribution issues, i.e. the information is likely supported by another included source, but either the statement does not include the correct citation locally or the source is too broad (e.g. a database portal link); or trivial statements, which are true passages, but overly pedantic or unnecessary [...]. Surprisingly, we found that compared to Wikipedia, WikiCrow had significantly fewer reasoning errors (12 vs. 26, p = 0.0144, χ2 (1), N = 375) but a similar number of attribution errors (10 vs. 16, p = 0.21), suggesting that the improved factuality of WikiCrow over Wikipedia was largely due to improvements in reasoning.
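The comparisons quoted above are Pearson χ² tests on 2×2 tables with one degree of freedom. As a rough illustration, the test can be computed from scratch; note that the counts below are hypothetical, chosen only to roughly match the quoted percentages (13.5% vs. 24.9% "cited and unsupported"), not the paper's raw data:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]],
    without continuity correction, plus its p-value at 1 degree of freedom.
    At 1 df the statistic is a squared standard normal, so
    P(X > x) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts (unsupported vs. other graded statements per system),
# chosen only to approximate the quoted percentages:
stat, p = chi2_2x2(25, 160, 47, 143)
```

For larger tables (or to add Yates' continuity correction), `scipy.stats.chi2_contingency` computes the same kind of test.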
The authors caution that this result about Wikipedians "hallucinating" more frequently than the AI is specific to their WikiCrow system (and to the task of writing articles about genes), and should not be generalized to LLMs more broadly:
Although language models are clearly prone to reasoning errors (or hallucinations), in our task at least they appear to be less prone to such errors than Wikipedia authors or editors. This statement is specific to the agentic RAG setting presented here: language models like GPT-4 on their own, if asked to generate Wikipedia articles, would still be expected to hallucinate at high rates.
A previous, less capable version of the WikiCrow system had already been described in a December 2023 blog post, which discussed in more detail the motivation for focusing on the task of writing Wikipedia-like articles about genes. Rather than seeing it as an arbitrary benchmark demo for their LLM agent system (then in its earlier version, PaperQA), the authors described it as motivated by longstanding shortcomings of Wikipedia's gene coverage that seriously hamper the work of researchers who have come to rely on Wikipedia:
If you've spent time in molecular biology, you have probably encountered the "alphabet soup" problem of genomics. Experiments in genomics uncover lists of genes implicated in a biological process, like MGAT5B and ADGRA3. Researchers turn to tools like Google, Uniprot or Wikipedia to learn more, as the knowledge of 20,000 human genes is too broad for any single human to understand. However, according to our count, only 3,639 of the 19,255 human protein-coding genes recognized by the HGNC have high-quality (non-stub) summaries on [English] Wikipedia; the other 15,616 lack pages or are incomplete stubs. Often, plenty is known about the gene, but no one has taken the time to write up a summary. This is part of a much broader problem today: scientific knowledge is hard to access, and often locked up in impenetrable technical reports. To find out about genes like MGAT5B and ADGRA3, you'd end up sinking hours into reading the primary literature.
[The 2023 version of] WikiCrow is a first step towards automated synthesis of human scientific knowledge. As a first demo, we used WikiCrow to generate drafts of Wikipedia-style articles for all 15,616 of the Human protein-coding genes that currently lack articles or have stubs, using information from full-text articles that we have access to through our academic affiliations. We estimate that this task would have taken an expert human ~60,000 hours total (6.8 working years). By contrast, WikiCrow wrote all 15,616 articles in a few days (about 8 minutes per article, with 50 instances running in parallel), drawing on 14,819,358 pages from 871,000 scientific papers that it identified as relevant in the literature.
These challenges of covering the large number of relevant genes are not news to Wikipedians working in this area. Back in 2011, several papers in a special issue of Nucleic Acids Research on databases had explored Wikipedia as a database for structured biological data, e.g. asking "how to get scientists en masse to edit articles" in this area, and presenting English Wikipedia's "Gene Wiki" taskforce (which is currently inactive). In a 2020 article in eLife, a group of 30 researchers and Wikidata contributors similarly "describe[d] the breadth and depth of the biomedical knowledge contained within Wikidata," including its coverage of genes in general ("Wikidata contains items for over 1.1 million genes and 940 thousand proteins from 201 unique taxa") and human genetic variants ("Wikidata currently contains 1502 items corresponding to human genetic variants, focused on those with a clear clinical or therapeutic relevance").[2] But it seems that at least from the point of view of the FutureHouse researchers, Wikidata's gene coverage is not a substitute for Wikipedia's, perhaps because it does not offer the same kind of factual coverage (see also the review of a related dissertation below).
The current paper is not peer-reviewed, but conveys credibility by disclosing ample detail about the methodology for building and evaluating the PaperQA2 and WikiCrow systems (also in an accompanying technical blog post), and by releasing the underlying source code and data. The PaperQA2 system is available as an open-source software package. (This includes a "Setting to emulate the Wikipedia article writing used in our WikiCrow publication". However, the paper cautions that the released version does not include some additional tools that were used, and in particular does not provide "access to non-local full-text literature searches", which are "often bound by licensing agreements".) The generated articles are available online in rendered form and as Markdown source (see full list below, with links to their Wikipedia counterparts for comparison). The annotated expert ratings have been published as well.
The authors acknowledge "previous work on unconstrained document summarization, where the document must be found and then summarized, and even writing Wikipedia-style articles with RAG" (i.e. the aforementioned STORM project). But they highlight that
"These studies have not compared directly against Wikipedia with human evaluation. Instead, they used either LLMs to judge or [like STORM] compared ROUGE (text overlap) against ground-truth summaries. Here, we measure directly against human-generated Wikipedia with subject [matter] expert grading."
The "crow" moniker (already used in a predecessor project called "ChemCrow",[supp 1] an LLM agent working on chemistry tasks) is inspired by the fact that "Crows can talk – like a parrot – but their intelligence lies in tool use."
From the abstract of a dissertation titled "Exploiting semi-structured information in Wikipedia for knowledge graph construction":[3]
"[...] we address three main challenges in the field of automated knowledge graph construction using semi-structured data in Wikipedia as a data source. To create an ontology with expressive and fine-grained types, we present an approach that extracts a large-scale general-purpose taxonomy from categories and list pages in Wikipedia. We enhance the taxonomy's classes with axioms explicating their semantics. To increase the coverage of long-tail entities in knowledge graphs, we describe a pipeline of approaches that identify entity mentions in Wikipedia listings, integrate them into an existing knowledge graph, and enrich them with additional facts derived from the extraction context. As a result of applying the above approaches to semi-structured data in Wikipedia, we present the knowledge graph CaLiGraph. The graph describes more than 13 million entities with an ontology containing almost 1.3 million classes. To judge the value of CaLiGraph for practical tasks, we introduce a framework that compares knowledge graphs based on their performance on downstream tasks. We find CaLiGraph to be a valuable addition to the field of publicly available general-purpose knowledge graphs."
Why would one want to use Wikipedia as a source of structured data and build a new knowledge graph when Wikidata already exists? First, the thesis argues that Wikidata — even though it has surpassed other public knowledge graphs in the number of entities — is still very incomplete, especially when it comes to information about long-tail topics:
"The trend of entities added to publicly available KGs in recent years indicates they are far from complete. The number of entities in Wikidata [195], for example, grew by 26% in the time from October 2020 (85M) to October 2023 (107M) [206]. Wikidata describes the largest number of entities and comprises – in terms of entities – other public KGs to a large extent [66]. Consequently, this challenge of incompleteness applies to all public KGs, particularly when it comes to less popular entities [44]. [...]
On the other hand, an automated process for extracting structured information from Wikipedia may not yet be reliable enough to import the result directly without manual review:
While the performance of Open Information Extraction (OIE) systems (i.e., systems that extract information from general web text) has improved in recent years [159, 97, 112], the quality of extracted information has not yet reached a level where integration into public KGs like Wikidata or DBpedia [104] should be done without further filtering. [...]
[...] first "picking low-hanging fruit" by focusing on premium sources like Wikipedia to build a high-quality KG is crucial as it can serve as a solid foundation for approaches that target more challenging data sources. The extracted information may then be used as an additional anchor to make sense of less structured data.
Chapter 3 ("Knowledge Graphs on the Web") contains detailed comparisons of Wikidata with other public knowledge graphs, with observations including the following:
The main focus of DBpedia is on persons (and their careers), as well as places, works, and species. Wikidata also strongly focuses on works (mainly due to the import of entire bibliographic datasets), while Cyc, BabelNet and NELL show a more diverse distribution. [...]
[...] Wikidata has the largest number of instances and the largest detail level in most classes. However, there are differences from class to class. While Wikidata contains a large number of works, YAGO is a good source of events. NELL often has fewer instances, but a larger level of detail, which can be explained by its focus on more prominent instances.
Wikidata contains about twice as many persons as DBpedia and YAGO [..., which] contain almost no persons which are not contained in Wikidata. In conclusion, combining Wikidata with DBpedia or YAGO for better coverage of the Person class would not be beneficial.
(see also an earlier paper co-authored by the author, titled "Knowledge Graphs on the Web – an Overview")
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
A Wikidata taxonomy (from "city or town" to "entity") before and after refinement
From the abstract:[4]
"Wikidata is known to have a complex taxonomy, with recurrent issues like the ambiguity between instances and classes, the inaccuracy of some taxonomic paths, the presence of cycles, and the high level of redundancy across classes. Manual efforts to clean up this taxonomy are time-consuming and prone to errors or subjective decisions. We present WiKC, a new version of Wikidata taxonomy cleaned automatically using a combination of Large Language Models (LLMs) and graph mining techniques."
From the "Evaluation" section:
"As expected, WiKC is much simpler and much more concise than Wikidata taxonomy. Compared to WiKC, Wikidata taxonomy has a factor higher than 200 in the number of classes, and a factor higher than 10 in the average number of paths from an instance to the root class entity (Q35120)."
"WiKC consistently outperforms Wikidata across all depth ranges. WiKC shows significant accuracy gains at deeper levels (depth 10 or more), suggesting that WiKC has resolved many inconsistency issues in the lower levels of the Wikidata taxonomy."
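One of the defects WiKC targets, cycles in Wikidata's subclass-of hierarchy, is straightforward to detect mechanically. A minimal sketch over a toy edge list (not WiKC's actual pipeline, which combines LLMs with graph mining), using a depth-first search for back edges:

```python
# Toy cycle detection in a taxonomy: each class maps to its superclasses.
def find_cycle(edges: dict[str, list[str]]):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / done
    color = {node: WHITE for node in edges}

    def dfs(node, path):
        color[node] = GRAY
        for parent in edges.get(node, []):
            if color.get(parent, WHITE) == GRAY:   # back edge: cycle found
                return path + [node, parent]
            if color.get(parent, WHITE) == WHITE:
                found = dfs(parent, path + [node])
                if found:
                    return found
        color[node] = BLACK
        return None

    for node in edges:
        if color[node] == WHITE:
            found = dfs(node, [])
            if found:
                return found
    return None

# Hypothetical taxonomy with a deliberate cycle: city -> settlement -> place -> city
taxonomy = {"city": ["settlement"], "settlement": ["place"],
            "place": ["city"], "entity": []}
cycle = find_cycle(taxonomy)
```

Detection is the easy part; the paper's contribution is deciding automatically which edge in such a cycle (and which of the redundant classes) to remove, which is where the LLM comes in.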
From the abstract:[5]
"Hundreds of thousands of articles on English Wikipedia have zero or limited meaningful structure on Wikidata. Much work has been done in the literature to partially or fully automate the process of completing knowledge graphs, but little of it has been practically applied to Wikidata. This paper presents two interconnected practical approaches to speeding up the Wikidata completion task. The first is Wwwyzzerdd, a browser extension that allows users to quickly import statements from Wikipedia to Wikidata. Wwwyzzerdd has been used to make over 100 thousand edits to Wikidata. The second is Psychiq, a new model for predicting instance and subclass statements based on English Wikipedia articles. [...] One initial use is integrating the Psychiq model into the Wwwyzzerdd browser extension."
From the paper:[6]
"Translations help people understand content written in another language. However, even correct literal translations do not fulfill that goal when people lack the necessary background to understand them. Professional translators incorporate explicitations to explain the missing context by considering cultural differences between source and target audiences. [...] For example, the name “Dominique de Villepin” may be well known in French community while totally unknown to English speakers in which case the translator may detect this gap of background knowledge between two sides and translate it as “the former French Prime Minister Dominique de Villepin” instead of just “Dominique de Villepin”. [...]
This work introduces techniques for automatically generating explicitations, motivated by WIKIEXPL, a dataset that we collect from Wikipedia and annotate with human translators. [...]
Our generation is grounded in Wikidata and Wikipedia—rather than free-form text generation—to prevent hallucinations and to control length or the type of explanation. For SHORT explicitations, we fetch a word from instance of or country of from Wikidata [...]. For MID, we fetch a description of the entity from Wikidata [...]. For LONG type, we fetch three sentences from the first paragraph of Wikipedia."
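The three explicitation lengths described above amount to simple templates over data already fetched from Wikidata and Wikipedia. A sketch of that final assembly step (the `explicitate` helper and the dict fields are illustrative stand-ins, not the paper's code; the field names mirror the sources the paper names):

```python
# Illustrative only: assemble SHORT/MID/LONG explicitations from entity data
# assumed to be already fetched ("instance of" value and description from
# Wikidata, leading sentences from Wikipedia).
def explicitate(name: str, entity: dict, kind: str) -> str:
    if kind == "SHORT":   # a single word, e.g. the "instance of" value
        return f"the {entity['instance_of']} {name}"
    if kind == "MID":     # the entity's Wikidata description
        return f"{name} ({entity['description']})"
    if kind == "LONG":    # up to three sentences from the Wikipedia lead
        return f"{name}: {' '.join(entity['first_sentences'][:3])}"
    raise ValueError(f"unknown explicitation type: {kind}")

villepin = {
    "instance_of": "politician",
    "description": "former French Prime Minister",
    "first_sentences": [
        "Dominique de Villepin is a French politician.",
        "He served as Prime Minister of France from 2005 to 2007.",
    ],
}
short = explicitate("Dominique de Villepin", villepin, "SHORT")
```

Grounding the generated text in these fetched fields, rather than free-form generation, is what the authors credit with preventing hallucinated explanations.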
From the abstract:[7]
"Knowledge Graph Construction (KGC) can be seen as an iterative process starting from a high quality nucleus that is refined by knowledge extraction approaches in a virtuous loop. Such a nucleus can be obtained from knowledge existing in an open KG like Wikidata. However, due to the size of such generic KGs, integrating them as a whole may entail irrelevant content and scalability issues. We propose an analogy-based approach that starts from seed entities of interest in a generic KG, and keeps or prunes their neighboring entities. We evaluate our approach on Wikidata through two manually labeled datasets that contain either domain-homogeneous or -heterogeneous seed entities."
From the abstract:[8]
"By analyzing the edit history of Wikipedia’s ‘hyperpop’ page, we locate ongoing debates, controversies, and contestations that point to shaping forces around online genre formation. These potentially have a huge impact on how hyperpop is understood both inside and outside of the music community. In locating the most active editors of the hyperpop Wikipedia page and scrutinizing their edit histories as well as the discussions on the hyperpop page itself, we uncovered debates about artistic notability, biases toward specific sources, and attempts at associating or dissociating musical genre from non-musical identities (such as race, gender, and nationality)."
From the abstract:[9]
"Paradoxically, in each language [English/French/Portuguese Wikipedia], the airplane has a different inventor. Through online ethnography, this article explores the multilingual landscape of Wikipedia, looking not only at languages, but also at language varieties, and unpacking the intricate connections between language, country, and nationality in grassroots knowledge production online."
Discuss this story
If Chat GPT wasn't prone to hallucinating then this would not have happened. TarnishedPathtalk 04:10, 27 September 2024 (UTC)[reply]
How would one go about having ChatGPT write an article? Would one give it a template, a topic, and the sources, then it writes the page for you? JoJo Eumerus mobile (main talk) 17:31, 28 September 2024 (UTC)[reply]