The Signpost


Recent research

YOUR ARTICLE'S DESCRIPTIVE TITLE HERE

By Tilman Bayer, ...


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


...

Reviewed by ...

...

Reviewed by ...

Seeing Like an AI

"Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms"

Reviewed by Bri

Ashkinaze, Joshua; Guan, Ruijia; Kurek, Laura; Adar, Eytan; Budak, Ceren; Gilbert, Eric (2024-09-04), Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms, arXiv, doi:10.48550/arXiv.2407.04183

From the abstract: "Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors' simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult."

LLMs such as ChatGPT struggle with identifying neutrality violations on Wikipedia ("better than random individuals...but worse than expert editors").

The abstract's phrase "high-recall but low-precision editing" is (perhaps) a new term: it applies two classifier metrics, precision and recall, to text generation, the task at which large language models excel.

High recall – Low precision

Precision and recall concepts of statistical classification
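To make the two metrics concrete for editing, here is a minimal Python sketch that treats the words removed by a Wikipedia editor as the reference set and the words removed by an LLM as the prediction. It is an illustration only, not the paper's evaluation code: the function names, the whitespace tokenization, and the example sentences are all assumptions made for this review.

```python
from collections import Counter


def removed_words(before: str, after: str) -> Counter:
    """Multiset of words that appear in `before` but are missing from `after`.

    Assumption: naive lowercase whitespace tokenization, not the paper's method.
    """
    return Counter(before.lower().split()) - Counter(after.lower().split())


def removal_precision_recall(original: str, editor_rewrite: str, llm_rewrite: str):
    """Score the LLM's removals against the editor's removals.

    recall    = share of editor-removed words that the LLM also removed
    precision = share of LLM-removed words that the editor also removed
    """
    editor_removed = removed_words(original, editor_rewrite)
    llm_removed = removed_words(original, llm_rewrite)
    overlap = sum((editor_removed & llm_removed).values())
    n_llm = sum(llm_removed.values())
    n_editor = sum(editor_removed.values())
    precision = overlap / n_llm if n_llm else 1.0
    recall = overlap / n_editor if n_editor else 1.0
    return precision, recall


# Hypothetical example sentences, invented for illustration.
original = "the brilliant senator bravely passed the very important bill"
editor_rewrite = "the senator passed the very important bill"
llm_rewrite = "the senator passed the bill"

precision, recall = removal_precision_recall(original, editor_rewrite, llm_rewrite)
print(precision, recall)
```

In this toy case the LLM removes both words the editor removed (full recall) but also strips "very important", which the editor kept, so only half of its removals match the editor's (low precision). That is the pattern the abstract describes.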

James C. Scott, Seeing Like a State

Large language models (LLMs) are trained on large, broad corpora but then used within smaller communities that have their own norms. To steer models towards specific norms and values, there is a growing trend of stating high-level rules as prompts.


— Introduction to the paper

Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by ...

"..."

From the abstract:

...

"..."

From the abstract:

...

"..."

From the abstract:

...

References

Supplementary references and notes:



The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0