This is a draft of a potential Signpost article, and should not be interpreted as a finished piece. Its content is subject to review by the editorial team and ultimately by JPxG, the editor in chief. Please do not link to this draft as it is unfinished and the URL will change upon publication. If you would like to contribute and are familiar with the requirements of a Signpost article, feel free to be bold in making improvements!
I'm working with a team of researchers to collect a high-quality dataset of fine-grained Wikipedia article assessments. Experienced editors (with at least 1,000 edits) can contribute — and get paid for it — at wikiannotate.org. We'll use this dataset to build better automated article assessment tools.
I've been working at Wiki Education since 2014, building software — like the Wiki Education Dashboard — to support programs that bridge the gap between Wikipedia and academia. Our flagship program — the Wikipedia Student Program — supports hundreds of higher education courses and thousands of students every term, as professors guide their students to improve Wikipedia in their areas of expertise and interest.
The widespread adoption of AI tools has been highly disruptive — as with many online domains — to Wiki Education and our work training student editors how to contribute effectively to the sum of all human knowledge. Teaching students how Wikipedia works — and how to reliably know things and share knowledge in ways that go beyond "just trust the AI" — is more important than ever (both for Wikipedia and for the students who are learning to learn in this AI-centric information environment). You can read a recap of much of our recent work in this area, but I think the impacts AI will have on Wikipedia are just beginning.
We can and will continue adapting to the changing landscape of AI usage, but one of the things holding us back is that we don't have good tools for measuring article quality systematically and automatically. The best software tool we currently have for automatically measuring aspects of article quality — Wikimedia Foundation's ‘articlequality’ model (formerly ORES) — can't differentiate between great content written by an experienced Wikipedian and an AI-slop imitation of what a great Wikipedian would write. It uses some basic metrics, like the amount of text, number of citations, headers, images, and so on, to predict the quality of an article, but can't address anything involving the quality or accuracy of the writing itself.
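To make that limitation concrete, here is a minimal Python sketch of the kind of surface-level counting described above. It is not the articlequality model's actual feature set or code: the function name `structural_features`, the regular expressions, and the sample wikitext are all illustrative assumptions.

```python
import re


def structural_features(wikitext: str) -> dict:
    """Count a few surface-level features of a wikitext string.

    Hypothetical helper for illustration only; not the real model's features.
    """
    # Opening or self-closing <ref> tags. "</ref>" and "<references />" are
    # not counted because "<ref" must be followed by whitespace, ">" or "/".
    refs = len(re.findall(r"<ref[\s>/]", wikitext, flags=re.IGNORECASE))

    # Section headers such as "== History ==" (levels 2 through 6).
    headers = len(
        re.findall(
            r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$",
            wikitext,
            flags=re.MULTILINE,
        )
    )

    # Embedded images: [[File:...]] or the older [[Image:...]] syntax.
    images = len(re.findall(r"\[\[(?:File|Image):", wikitext, flags=re.IGNORECASE))

    return {
        "chars": len(wikitext),  # raw length, markup included
        "refs": refs,
        "headers": headers,
        "images": images,
    }


# Made-up sample article text, used only to exercise the function.
sample = """'''Example''' is a topic.<ref>Smith 2020</ref>

== History ==
[[File:Example.jpg|thumb|Caption]]
It began long ago.<ref name="a">Jones 2019</ref><ref name="a" />

== References ==
<references />
"""

features = structural_features(sample)
print(features)
```

Nothing in these counts depends on whether the sentences are accurate or whether the cited sources say what the article claims. Polished but unreliable text would produce the same numbers as careful work of the same length and structure.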
For Wiki Education's programs, we have one powerful tool for catching slop: the Wiki Education Dashboard integrates with the AI detection service Pangram, automatically scanning larger edits for signs of LLM-generated text. For samples of at least a few hundred words, Pangram is very good at sorting human-written prose from text that came straight out of an LLM. However, real-world AI usage patterns are much more complicated, ranging from minor copyedits to LLM-generated text that gets extensively rewritten by hand (and everything in between). In many cases — like the increasingly AI-centric Grammarly service — it's not even obvious to a student just how much of their text came out of an LLM, because AI tools get integrated into conventional text editors. We can warn a student when we detect a high likelihood of LLM text, but that kind of strategy creates an antagonistic relationship. Students perceive that they've been accused of cheating with AI, and become defensive — and still don't get a clear indication of what the AI did badly or why we have rules against AI-written article content.
Hallucination is fundamental to the way LLMs work, but they can do a pretty good job in some respects: recent models can write understandable prose about encyclopedic topics, and they can generally follow our style guidelines when prompted to do so. Some of the things they do very badly — like accurately representing the content of individual sources — are also harder for a human to notice. (I've come to think of it like this: LLMs think they've read every book, but haven't actually read any. Everything they've trained on is a muddled mix, so they can't accurately represent any single source without accessing it directly.) But it's now possible to do much better.
We can build tools that use LLMs to explicitly evaluate an article against many aspects of our policies, guidelines and quality standards (like the detailed quality rubric of WP:ASSESS), and we can check against some of the ways we know AI usually fails catastrophically (like confabulating citations to sources that the AI didn't actually access).
That's what the "Wiki Education in the Age of Generative AI" research team is working on with wikiannotate.org. We want to collect a good dataset of fine-grained article quality assessments from experienced Wikipedians — covering general aspects of quality as well as some of the specific things that AI usually does wrong — so that we can build a tool for quantifying the ways that AI usage impacts article quality. We're looking for editors to help build this dataset, with compensation available for each completed batch of evaluations. Currently we're offering $21 USD for each batch of 5 articles.
With help from the Wikipedia editing community, we can build on the things that LLMs do well to mitigate some of the problems they are causing. Some of the possible applications include:
If you want to help, visit wikiannotate.org to sign up and do some article assessments. Each batch is expected to take 30 to 60 minutes on average, and you can complete multiple batches.
(All these em-dashes are my own. I've been overusing em-dashes my entire adult life, and I'm not about to stop.)
Discuss this story
This page was last edited by Bluerasberry at 15:47, 20 May 2026 (UTC).
"Please sign to support..."
I'd have to suggest that looks a little like canvassing. We shouldn't be telling Signpost readers which proposals to support. AndyTheGrump (talk) 15:32, 16 February 2026 (UTC)