The Signpost

[Image: How_wikiannotate.org_works.png, by Sage Ross (no AI, believe it or not!), CC-BY-SA]
Forum

WikiAnnotate: help us build a dataset of article quality evaluations

By Sage (Wiki Ed)
Editor's note – if you want to know more about how annotated datasets can be used to build tools for the community, as referenced in this article, you can start at Machine learning#Supervised learning.

TL;DR

I'm working with a team of researchers to collect a high-quality dataset of fine-grained Wikipedia article assessments. Experienced editors (with at least 1,000 edits) can contribute — and get paid for it — at wikiannotate.org. We'll use this dataset to build better automated article assessment tools.

Background

I've been working at Wiki Education since 2014, building software — like the Wiki Education Dashboard — to support programs that bridge the gap between Wikipedia and academia. Our flagship program — the Wikipedia Student Program — supports hundreds of higher education courses and thousands of students every term, as professors guide their students to improve Wikipedia in their areas of expertise and interest.

The widespread adoption of AI tools has been highly disruptive — as with many online domains — to Wiki Education and our work training student editors how to contribute effectively to the sum of all human knowledge. Teaching students how Wikipedia works — and how to reliably know things and share knowledge in ways that go beyond "just trust the AI" — is more important than ever (both for Wikipedia and for the students who are learning to learn in this AI-centric information environment). You can read a recap of much of our recent work in this area, but I think the impacts AI will have on Wikipedia are just beginning.

We can and will continue adapting to the changing landscape of AI usage, but one of the things holding us back is that we don't have good tools for measuring article quality systematically and automatically. The best software tool we currently have for automatically measuring aspects of article quality — Wikimedia Foundation's ‘articlequality’ model (formerly ORES) — can't differentiate between great content written by an experienced Wikipedian and an AI-slop imitation of what a great Wikipedian would write. It uses some basic metrics, like the amount of text, number of citations, headers, images, and so on, to predict the quality of an article, but can't address anything involving the quality or accuracy of the writing itself.
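To make that limitation concrete, here is a minimal sketch of what counting surface features looks like. It is not the actual 'articlequality' feature set or code: the function name `structural_features`, the regular expressions, and the two sample snippets are all assumptions made up for illustration, and the word count is deliberately crude (it counts markup tokens too).

```python
import re

# Hypothetical helper, NOT the real articlequality model: it only counts
# surface-level features of a wikitext string.
REF_PATTERN = re.compile(
    r"<ref\b[^>/]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE
)
HEADING_PATTERN = re.compile(r"^==+[^=\n]+==+\s*$", re.MULTILINE)
IMAGE_PATTERN = re.compile(r"\[\[(?:File|Image):", re.IGNORECASE)


def structural_features(wikitext: str) -> dict[str, int]:
    """Return crude counts of words, citations, headings and images."""
    refs = REF_PATTERN.findall(wikitext)
    # Remove footnotes before counting words so citation text is not prose.
    prose = REF_PATTERN.sub("", wikitext)
    return {
        "words": len(prose.split()),
        "citations": len(refs),
        "headings": len(HEADING_PATTERN.findall(wikitext)),
        "images": len(IMAGE_PATTERN.findall(wikitext)),
    }


# Two invented snippets: same structure, different (possibly wrong) claims.
careful = (
    "== History ==\n"
    "The bridge opened in 1932.<ref>Smith 2020, p. 4</ref>\n"
    "[[File:Bridge.jpg|thumb]]"
)
sloppy = (
    "== History ==\n"
    "The bridge opened in 1957.<ref>Jones 1999, p. 9</ref>\n"
    "[[File:Bridge.jpg|thumb]]"
)

print(structural_features(careful) == structural_features(sloppy))  # True
```

The two snippets disagree about the opening year and cite different sources, yet their feature counts are identical, so any predictor built only on counts like these would score them the same.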

For Wiki Education's programs, we have one powerful tool for catching slop: the Wiki Education Dashboard integrates with the AI detection service Pangram, automatically scanning larger edits for signs of LLM-generated text. For samples of at least a few hundred words, Pangram is very good at sorting human-written prose from text that came straight out of an LLM. However, real-world AI usage patterns are much more complicated, ranging from minor copyedits to LLM-generated text that gets extensively rewritten by hand (and everything in between). In many cases — like the increasingly AI-centric Grammarly service — it's not even obvious to a student just how much of their text came out of an LLM, because AI tools get integrated into conventional text editors. We can warn a student when we detect a high likelihood of LLM text, but that kind of strategy creates an antagonistic relationship. Students perceive that they've been accused of cheating with AI, and become defensive — and still don't get a clear indication of what the AI did badly or why we have rules against AI-written article content.

Hallucination is fundamental to the way LLMs work, but they can do a pretty good job in some respects: recent models can write understandable prose about encyclopedic topics, and they can generally follow our style guidelines when prompted to do so. Some of the things they do very badly — like accurately representing the content of individual sources — are also harder for a human to notice. (I've come to think of it like this: LLMs think they've read every book, but haven't actually read any. Everything they've trained on is a muddled mix, so they can't accurately represent any single source without accessing it directly.) But it's now possible to do much better.

wikiannotate.org

We can build tools that use LLMs to explicitly evaluate an article against many aspects of our policies, guidelines and quality standards (like the detailed quality rubric of WP:ASSESS), and we can check against some of the ways we know AI usually fails catastrophically (like confabulating citations to sources that the AI didn't actually access).

That's what the "Wiki Education in the Age of Generative AI" research team is working on with wikiannotate.org. We want to collect a good dataset of fine-grained article quality assessments from experienced Wikipedians — covering general aspects of quality as well as some of the specific things that AI usually does wrong — so that we can build a tool for quantifying the ways that AI usage impacts article quality. We're looking for editors to help build this dataset, with compensation available for each completed batch of evaluations. Currently we’re offering $21 USD for each batch of 5 articles.

With help from the Wikipedia editing community, we can build on the things that LLMs do well to mitigate some of the problems they are causing. Some of the possible applications include:

If you want to help, visit wikiannotate.org to sign up and do some article assessments. Each batch is expected to take 30 to 60 minutes on average, and you can complete multiple batches.
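For anyone weighing the time commitment, the figures quoted in this article ($21 USD per batch of 5 articles, 30 to 60 minutes per batch) work out as follows. This is only arithmetic on those numbers; the constant and function names are made up for illustration, and the rate is described above as current, so it may change.

```python
# Figures taken from the article text; names are illustrative only.
PAY_PER_BATCH_USD = 21.0
ARTICLES_PER_BATCH = 5
MINUTES_PER_BATCH_RANGE = (30, 60)  # expected average time per batch


def hourly_rate(pay_usd: float, minutes: float) -> float:
    """Convert a flat payment for `minutes` of work into USD per hour."""
    return pay_usd * 60 / minutes


pay_per_article = PAY_PER_BATCH_USD / ARTICLES_PER_BATCH
fast_rate = hourly_rate(PAY_PER_BATCH_USD, MINUTES_PER_BATCH_RANGE[0])
slow_rate = hourly_rate(PAY_PER_BATCH_USD, MINUTES_PER_BATCH_RANGE[1])

print(f"Per article: ${pay_per_article:.2f}")
print(f"Effective hourly rate: ${slow_rate:.2f} to ${fast_rate:.2f}")
```

That is $4.20 per article, and an effective rate between $21 and $42 per hour depending on how long a batch takes.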

(All these em-dashes are my own. I've been overusing em-dashes my entire adult life, and I'm not about to stop.)



Discuss this story


The project seems promising. I've tried to register, but it seems that only contributors with more than 1,000 contributions to the English Wikipedia are eligible. This raises the question of multilingualism. Very often NLP projects are in English and other languages are not even considered. This is known as the Bender Rule in the literature. Is your project English only? Is there any plan to include more languages? PAC2 (talk) 18:46, 22 May 2026 (UTC)


Hi all! I am part of the research team behind this study. Thanks to everyone who engaged in the annotation exercise and for all your feedback on it. I want to try to respond here to the main points that were raised.

---

Is your project English only? Is there any plan to include more languages?

@PAC2 for now, we do not have plans to expand to more languages. You are right in pointing out that the vast majority of NLP projects target only English, and I agree that we need more multilingual studies, but as of now, we do not have enough resources or community interest to support that scale of data collection. Given the scope of our project, in any case, it makes sense to start with English: we are analyzing edits in the context of WikiEdu, which only engages students in the United States and Canada.

---

the most important qualities I am checking in student work and in possibly-AI-generated work are notability, encyclopedic tone, and verifiability. The questions here allowed me to express some concerns about source-text integrity (though I badly wanted to be able to note when a source supported only part of a claim, a particularly common AI problem) but not these other concerns.

and

Does not allow you to flag plagiarism/copyvio of a source—is this intentional?

Thanks for pointing those out, @le 🌸 valyn and @buIdhe (no, it is not intentional). This is all extremely useful. Our rubric was largely inspired by this old community-driven metric used by WP:USPP. Clearly, that assessment is quite old and omits several important dimensions. We tried our best to augment it with things we thought would be relevant, especially considering common pitfalls of AI around fluffy verbosity and hallucinated/unrelated sources. But we would love Wikipedians' input to come up with a better version of this rubric: ideally, this would also be strongly community-driven and the result of many different perspectives. In this sense, I would definitely support an open multi-phase data collection process, taking into consideration your insights and those of others. Realistically, whether we'll be able to do that (and think about other potential expansions, e.g., the multilingual element raised by @PAC2) largely depends on the community response to this initial collection, and how much interest there is in the kind of tools that we will build.

---

I've run into a bug—while completing a batch, my browser crashed (unrelated), and when I returned to the site, the button "Annotate another batch" would only return "Current batch is not complete". However, there seems to be no way to return to the current batch. Also, your contact information only seems to be available on the consent page, not the main page once preliminary information has been given.

@Crow Basket, I am sorry you encountered a bug: I will look into this and will try to manually fix the database entry for your submission. As for the contact information, I agree that it should be available at all times: I will add it to the rest of the website. In the meantime, you can refer to our contacts on our research page.

---

Once again, thanks to everyone who has donated their time to engage, annotate articles, and provide feedback. If you wish to participate, you can still do so at https://wikiannotate.org/ (including if you have already annotated one or more batches). Feel free to also share this with others who could be interested. --TriggerOne (talk) 00:54, 26 May 2026 (UTC)




       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0