This is a draft of a potential Signpost article, and should not be interpreted as a finished piece. Its content is subject to review by the editorial team and ultimately by JPxG, the editor in chief. Please do not link to this draft as it is unfinished and the URL will change upon publication. If you would like to contribute and are familiar with the requirements of a Signpost article, feel free to be bold in making improvements!
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Ashkinaze, Joshua; Guan, Ruijia; Kurek, Laura; Adar, Eytan; Budak, Ceren; Gilbert, Eric (2024-09-04), Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms, arXiv, doi:10.48550/arXiv.2407.04183
From the abstract: "Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors' simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult."
LLMs such as ChatGPT struggle with identifying neutrality violations on Wikipedia ("better than random individuals...but worse than expert editors")
"High-recall but low-precision editing" is (perhaps) a new term: it applies classifier metrics to text generation, the task at which large language models excel.
Precision and recall are concepts from statistical classification.
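To make the "high recall but low precision" framing concrete, here is a minimal Python sketch of how the two measures could be computed for word removals. It is an illustration only, not the paper's actual evaluation code: the function names, the whitespace tokenization, and the example sentences are all assumptions made for this sketch.

```python
from collections import Counter


def removed_words(before: str, after: str) -> Counter:
    """Multiset of words that appear in `before` but are missing from `after`.

    Assumption: naive lowercase whitespace tokenization, chosen for brevity.
    """
    return Counter(before.lower().split()) - Counter(after.lower().split())


def removal_precision_recall(original: str, editor_rewrite: str, model_rewrite: str):
    """Treat the editor's removals as ground truth and score the model's removals."""
    editor_removed = removed_words(original, editor_rewrite)
    model_removed = removed_words(original, model_rewrite)
    # Words removed by both the editor and the model.
    overlap = sum((editor_removed & model_removed).values())
    n_model = sum(model_removed.values())
    n_editor = sum(editor_removed.values())
    precision = overlap / n_model if n_model else 0.0   # share of model removals the editor also made
    recall = overlap / n_editor if n_editor else 0.0    # share of editor removals the model also made
    return precision, recall


# Invented example sentences (hypothetical, not taken from the paper's dataset).
original = "The brilliant senator gave a very long speech"
editor_rewrite = "The senator gave a very long speech"   # editor removes one loaded word
model_rewrite = "The senator gave a speech"              # model removes that word and two more

precision, recall = removal_precision_recall(original, editor_rewrite, model_rewrite)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the model removes everything the editor removed (recall of 1.0) but only one of its three removals matches the editor's (precision of about 0.33), which is the pattern the abstract describes.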
James C. Scott, Seeing Like a State
Large language models (LLMs) are trained on large, broad corpora but then used within smaller communities that have their own norms. To steer models towards specific norms and values, there is a growing trend of stating high-level rules as prompts.
— Introduction to the paper
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:
...
From the abstract:
...
From the abstract:
...