The Signpost

[Illustration: A human writer and a creature with the head and wings of a crow, both sitting and typing on their own laptops, experiencing mild hallucinations (DALL-E illustration)]
Recent research

GPT-4 writes better edit summaries than human Wikipedians

By Tilman Bayer


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


GPT-4 is better at writing edit summaries than human Wikipedia editors

A preprint[1] by researchers from EPFL and the Wikimedia Foundation presents

Edisum, which is, to the best of our knowledge, the first solution to automate the generation of highly-contextual Wikipedia edit summaries [given an edit diff] at large scale, [and] achieves performance similar to the human editors

Average aggregated human evaluation scores for edit summaries generated by language models and by the human editors who originally made the edits

The solution was designed to meet the performance and open-source requirements of a live service deployed on Wikimedia Foundation servers. It consists of a "very small" language model (ca. 220 million parameters) based on Google's LongT5, an extension of the company's T5 model from 2019 that is available under an Apache 2.0 license.

Separately, the authors also tested several contemporaneous large language models (GPT-4, GPT-3.5 and Llama 3 8B). GPT-4's edit summaries in particular were rated as significantly better than those provided by the human Wikipedia editors who originally made the edits in the sample – both by an automated scoring method based on semantic similarity and in a quality ranking by human raters (where, "to ensure high-quality results, instead of relying on the crowdsourcing platforms [like Mechanical Turk, frequently used in similar studies], we recruited 3 MSc students to perform the annotation").
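As a rough illustration of how such semantic-similarity scoring works (the paper's exact evaluation setup is not reproduced here), the following minimal sketch compares a generated summary against the original editor's summary in an embedding space, assuming the sentence-transformers library and an off-the-shelf embedding model:

```python
# Minimal sketch of semantic-similarity scoring between a model-generated
# edit summary and the human editor's original summary.
# Assumptions: the sentence-transformers library and the all-MiniLM-L6-v2
# checkpoint; the paper's own scoring setup may differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

human_summary = "added artist"
generated_summary = ("Added Stefan Brüggemann to the list of artists "
                     "whose works are included.")

# Encode both summaries and compare them with cosine similarity.
embeddings = model.encode([human_summary, generated_summary], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.3f}")
```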

Examples of summaries written by human editors and generated by Edisum and GPT-4
Edit | Human editor | Edisum [0%] | Edisum [100%] | GPT-4
Special:Diff/1172472704 | added the range | added location | Added location information | Added information about the location and altitude of Boveys Lookout and the orientation of the Burton Range.
Special:Diff/1169223741 | in 1982 | Added date and citation. | Added the year of the speech | Added the year of Fidel Castro's speech.
Special:Diff/1169701032 | added artist | Added a wikilink and a link | Added a new name to the list of artists | Added Stefan Brüggemann to the list of artists whose works are included.
Special:Diff/1170976899 | lede; infobox | infobox (inapplicable); lede; -cats | Clarified the nature of Denis-Auguste Affre's cause and added information about his title | Updated the description of Denis-Auguste Affre's religious status and clarified the stage of his canonization process.
Special:Diff/1170699096 | fixed link to aias awards, it was at 5th, not 7th | 5th | Corrected the award number from 7th to 5th | Corrected the year of the Annual Interactive Achievement Awards nomination.
Special:Diff/1169516533 | Amended lists so they are split at 100 m | "at least", not "at least" | Clarified the range of heights in the list | Added information about two different height categories for the ranking of buildings and structures in Greater Manchester.
Special:Diff/1173169263 | UPD Romelu Lukaku | updated Lukaku | Removed the end of the season | Removed incorrect future date for Romelu Lukaku's tenure at Inter.
Special:Diff/1171641782 | This "however" doesn't make sense here | rmv deprecated, ambiguous "however" per MOS:RELTIME | Removed unnecessary word | Removed the word "However," from the beginning of the sentence.
Special:Diff/1168487383 | Per feedback given in GA review, elaborated on David King Udall and Ella Stewart Udall in the lead and in Ella Udall's first mention; David Udall was already identified | Added link; improved phrasing and punctuation | Added additional information about the individuals' professions | Added professional details for David King Udall, Ella Stewart Udall, and Mary Ann Linton Morgan Udall.

This outcome joins some other recent research indicating that modern LLMs can match or even surpass the average Wikipedia editor in certain tasks (see e.g. our coverage: "'Wikicrow' AI less 'prone to reasoning errors (or hallucinations)' than human Wikipedia editors when writing gene articles").

A substantial part of the paper is devoted to showing that this particular task (generating good edit summaries) is both important and in need of improvements, motivating the use of AI to "overcome this problem and help editors write useful edit summaries":

"An edit summary is a succinct comment written by a Wikipedia editor explaining the nature of, and reasons for, an edit to a Wikipedia page. Edit summaries are crucial for maintaining the encyclopedia: they are the first thing seen by content moderators and they help them decide whether to accept or reject an edit. [...] Unfortunately, as we show, for many edits, summaries are either missing or incomplete."

In more detail:

"Given the dearth of data on the nature and quality of edit summaries on Wikipedia, we perform qualitative coding to guide our modeling decisions. Specifically, we analyze a sample of 100 random edits made in August 2023 to English Wikipedia [removing bot edits, edits with empty summaries and edits related to reverts] stratified among a diverse set of editor expertise levels. Two of the authors each coded all 100 summaries [...] by following criteria set by the English Wikipedia community (Wikimedia, 2024a) [...]. The vast majority (∼80%) of current edit summaries focus on [the] “what” of the edit, with only 30–40% addressing the “why”. [...] A sizeable minority (∼35%) of edit summaries were labeled as “misleading”, generally due to overly vague summaries or summaries that only mention part of the edit. [...] Almost no edit summaries are inappropriate, likely because highly inappropriate edit summaries would be deleted (Wikipedia, 2024c) by administrators and not appear in our dataset."

Facet definitions (per English Wikipedia guidance):
- Summary (what): attempts to describe what the edit did, e.g. "added links".
- Explain (why): attempts to describe why the edit was made, e.g. "Edited for brevity and easier reading".
- Misleading: overly vague or misleading, e.g. "updated" without explaining what was updated is too vague.
- Inappropriate: could be perceived as inappropriate or uncivil.
- Generate-able (what): could a language model feasibly describe the "what" of this edit based solely on the edit diff?
- Generate-able (why): could a language model feasibly describe the "why" of this edit based solely on the edit diff?

Metric | Summary (what) | Explain (why) | Misleading | Inappropriate | Generate-able (what) | Generate-able (why)
% Agreement | 0.89 | 0.80 | 0.77 | 0.98 | 0.97 | 0.80
Cohen's Kappa | 0.65 | 0.57 | 0.50 | -0.01 | 0.39 | 0.32
Overall (n=100) | 0.75 - 0.86 | 0.26 - 0.46 | 0.23 - 0.46 | 0.00 - 0.02 | 0.96 - 0.99 | 0.08 - 0.28
IP editors (n=25) | 0.76 - 0.88 | 0.20 - 0.44 | 0.40 - 0.64 | 0.00 - 0.08 | 0.92 - 0.96 | 0.04 - 0.16
Newcomers (n=25) | 0.76 - 0.84 | 0.36 - 0.48 | 0.24 - 0.52 | 0.00 - 0.00 | 0.92 - 1.00 | 0.12 - 0.20
Mid-experienced (n=25) | 0.76 - 0.88 | 0.28 - 0.52 | 0.16 - 0.36 | 0.00 - 0.00 | 1.00 - 1.00 | 0.08 - 0.28
Experienced (n=25) | 0.72 - 0.84 | 0.20 - 0.40 | 0.12 - 0.32 | 0.00 - 0.00 | 1.00 - 1.00 | 0.08 - 0.48

"Table 1: Statistics on agreement for qualitative coding for each facet and the proportion of how many edit summaries met each criteria. Ranges are a lower bound (both of the coders marked an edit) and an upper bound (at least one of the coders marked an edit). The majority of summaries are expressing only what was done in the edit, which we also expect a language model to do. A significant portion of edits is of low quality, i.e., misleading."

The paper discusses various other nuances and special cases in interpreting these results and in deriving suitable training data for the "Edisum" model. (For example, "edit summaries should ideally explain why the edit was performed, along with what was changed, which often requires external context" that is not available to the model – or really to any human apart from the editor who made the edit.) The authors' best-performing approach relies on fine-tuning the aforementioned LongT5 model on 100% synthetic data generated using an LLM (gpt-3.5-turbo) as an intermediate step.
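To illustrate what such a fine-tuning step involves in practice, here is a heavily abridged sketch using the Hugging Face transformers library. The checkpoint name, data format and hyperparameters are illustrative assumptions, not the authors' actual configuration:

```python
# Sketch of fine-tuning a small LongT5 model to map edit diffs to edit
# summaries, loosely following the approach described in the paper.
# Assumptions: Hugging Face transformers/datasets, the
# "google/long-t5-tglobal-base" checkpoint, and a JSONL file with
# "diff" and "summary" fields holding the synthetic training pairs.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                          LongT5ForConditionalGeneration, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LongT5ForConditionalGeneration.from_pretrained(checkpoint)

dataset = load_dataset("json", data_files="synthetic_summaries.jsonl")["train"]

def preprocess(example):
    # Tokenize the textual diff as input and the edit summary as target.
    inputs = tokenizer(example["diff"], max_length=2048, truncation=True)
    labels = tokenizer(text_target=example["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="edisum-sketch",
        per_device_train_batch_size=4,
        learning_rate=3e-4,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```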

Overall, they conclude that

while it should be used with caution due to a portion of unrelated summaries, the analysis confirms that Edisum is a useful option that can aid editors in writing edit summaries.

The authors wisely refrain from suggesting the complete replacement of human-generated edit summaries. (It is intriguing, however, to observe that Wikidata, a fairly successful sister project of Wikipedia, has been content with relying almost entirely on auto-generated edit summaries for many years. And the present paper exclusively focuses on English Wikipedia – Wikipedias in other languages might have fairly different guidelines or quality issues regarding edit summaries.)

Still, there might be great value in deploying Edisum as an opt-in tool for editors willing to be mindful of its potential pitfalls. (While the English Wikipedia community has rejected proposals for a policy or guideline about LLMs, a popular essay advises that while their use for generating original content is discouraged, "LLMs can be used for certain tasks (like copyediting, summarization, and paraphrasing) if the editor has substantial prior experience in the intended task and rigorously scrutinizes the results before publishing them.")

On that matter, it is worth noting that the paper was first published as a preprint ten months ago, in April 2024. (It appears to have been submitted for review at an ACL conference, but does not seem to have appeared in peer-reviewed form yet.) Given the extremely fast-paced developments in large language models since then, several of the constraints that Edisum was designed around are likely already outdated. Specifically, the authors write that

commercial LLMs [like GPT-4] are not well suited for [Edisum's] task, as they do not follow the open-source guidelines set by Wikipedia [referring to the Wikimedia Foundation's guiding principles]. [...Furthermore,] the open-source LLM, Llama 3 8B, underperforms even when compared to the finetuned Edisum models.

But the performance of open LLMs (at least those released under the kind of license that is regarded as open-source in the paper) has greatly improved over the past year, while the costs of using LLMs in general have dropped.

Besides the Foundation's licensing requirements, its hardware constraints also played a big role:

We intentionally use a very small model, because of limitations of Wikipedia’s infrastructure. In particular, Wikipedia [i.e. WMF] does not have access to many GPUs on which we could deploy big models (Wikitech, 2024), meaning that we have to focus on the ones that can run effectively on CPUs. Note that this task requires a model running virtually in real-time, as edit summaries should be created when edit is performed, and cannot be precalculated to decrease the latency.
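To make the "real-time on CPUs" constraint concrete, the following sketch generates a single edit summary with a small seq2seq model pinned to the CPU and times the call; the checkpoint and input text are placeholders, not the deployed Wikimedia service:

```python
# Rough sketch of CPU-only, per-edit inference with a small seq2seq model,
# illustrating the latency constraint described above.
# The checkpoint and diff text are placeholders, not the deployed service.
import time

import torch
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

checkpoint = "google/long-t5-tglobal-base"  # stand-in for a fine-tuned Edisum model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = LongT5ForConditionalGeneration.from_pretrained(checkpoint).to("cpu").eval()

diff_text = "... textual representation of the edit diff ..."
inputs = tokenizer(diff_text, return_tensors="pt", truncation=True, max_length=2048)

start = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
elapsed = time.perf_counter() - start

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
print(f"generation took {elapsed:.2f}s on CPU")
```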

Here too one wonders whether the situation might have improved over the past year since the paper was first published. Unlike much of the rest of the industry, the Wikimedia Foundation avoids NVIDIA GPUs because of their proprietary CUDA software layer and uses AMD GPUs instead, which are known to pose some challenges for running standard open LLMs – but conceivably, AMD's software support and performance optimizations for LLMs might have been improving. Also, given the size of WMF's overall budget, it is notable that compute budget constraints would apparently prevent the deployment of a better-performing tool for supporting editors in an important task.


Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

"Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs"

From the abstract:[2]

"Several initiatives have been undertaken to conceptually model the domain of scholarly data using ontologies and to create respective Knowledge Graphs. [...] Our main contributions include (a) an analysis of ontologies for representing scholarly data to identify gaps and relevant entities/properties in Wikidata, (b) semi-automated extraction – requiring (minimal) manual validation – of conference metadata (e.g., acceptance rates, organizer roles, programme committee members, best paper awards, keynotes, and sponsors) from websites and proceedings texts using LLMs. Finally, we discuss (c) extensions to visualization tools in the Wikidata context for data exploration of the generated scholarly data. Our study focuses on data from 105 Semantic Web-related conferences and extends/adds more than 6000 entities in Wikidata. It is important to note that the method can be more generally applicable beyond Semantic Web-related conferences for enhancing Wikidata's utility as a comprehensive scholarly resource."

"Migration and Segregated Spaces: Analysis of Qualitative Sources Such as Wikipedia Using Artificial Intelligence"

This study uses Wikipedia articles about neighborhoods in Madrid and Barcelona to predict immigrant concentration and segregation. From the abstract:[3]

"The scientific literature on residential segregation in large metropolitan areas highlights various explanatory factors, including economic, social, political, landscape, and cultural elements related to both migrant and local populations. This paper contrasts the impact of these factors individually, such as the immigrant rate and neighborhood segregation. To achieve this, a machine learning analysis was conducted on a sample of neighborhoods in the main Spanish metropolitan areas (Madrid and Barcelona), using a database created from a combination of official statistical sources and textual sources, such as Wikipedia. These texts were transformed into indexes using Natural Language Processing (NLP) and other artificial intelligence algorithms capable of interpreting images and converting them into indexes. [...] The novel application of AI and big data, particularly through ChatGPT and Google Street View, has enhanced model predictability, contributing to the scientific literature on segregated spaces."

"On the effective transfer of knowledge from English to Hindi Wikipedia"

From the abstract:[4]

"[On Wikipedia, t]here is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. In case the English Wikipedia page is not up-to-date, our framework extracts relevant information from external resources readily available (such as English books) and adapts it to align with Wikipedia's distinctive style, including its neutral point of view (NPOV) policy, using in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles respectively by 65% and 62% according to automatic and human judgment-based evaluations."


References

  1. ^ Šakota, Marija; Johnson, Isaac; Feng, Guosheng; West, Robert (2024-04-04), Edisum: Summarizing and Explaining Wikipedia Edits at Scale, arXiv, doi:10.48550/arXiv.2404.03428 Code / models
  2. ^ Mihindukulasooriya, Nandana; Tiwari, Sanju; Dobriy, Daniil; Nielsen, Finn Årup; Chhetri, Tek Raj; Polleres, Axel (2024-11-13), Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs, arXiv, doi:10.48550/arXiv.2411.08696 Code / dataset
  3. ^ López-Otero, Javier; Obregón-Sierra, Ángel; Gavira-Narváez, Antonio (December 2024). "Migration and Segregated Spaces: Analysis of Qualitative Sources Such as Wikipedia Using Artificial Intelligence". Social Sciences. 13 (12): 664. doi:10.3390/socsci13120664. ISSN 2076-0760.
  4. ^ Das, Paramita; Roy, Amartya; Chakraborty, Ritabrata; Mukherjee, Animesh (2024-12-07), On the effective transfer of knowledge from English to Hindi Wikipedia, arXiv, doi:10.48550/arXiv.2412.05708

Discuss this story

These comments are automatically transcluded from this article's talk page. To follow comments, add the page to your watchlist. If your comment has not appeared here, you can try purging the cache.

We should have a gadget using AI to write edit summaries. But of course, some will try to veto it because of anti-AI sentiment. In 20 years, when everyone is using AI for everything and the anti-AI Luddite sentiment dies out, maybe we will do a test run, I guess. Personal context: I am happily using AI to generate DYK hooks and article abstracts - of course, I am proofreading and fact checking them, and often copyediting further. But while I use edit summaries sometimes I am sure I could do it more, but, sorry, I do not consider it an efficient use of my time (also, because nobody ever complains about it), and this looks like a nice tool to have to popularize what is a best practice. --Piotr Konieczny aka Prokonsul Piotrus| reply here 06:32, 7 February 2025 (UTC)[reply]

Piotrus Edit summaries seems like a valid use case, god knows I have written some subpar edit summaries in my day. But using it for DYK hooks surprises me. To me, creating a hook is one of the most fun things an editor can do (aside from maybe writing a well done lead). Why outsource it to a machine? Also, what do you mean by "article abstracts"? CaptainEek Edits Ho Cap'n! 07:13, 7 February 2025 (UTC)[reply]
NPR recently had this same discussion with a professional musician who creates film scores. They found that the AI software created film scores as good as or better than the musician did. There were drawbacks; AI just isn't as creative as humans at this point. But if you need to make something that is required to look like/sound like/read like something else, then it might be a useful tool for the job. The musician in question was very upset and seriously considered that they might not have a job soon. Viriditas (talk) 10:18, 7 February 2025 (UTC)[reply]
To me creating a hook is a pain - and my job is being a writer. If it is fun for you, go nuts :) but for me having some help would remove a hassle from DYK. cyclopiaspeak! 10:44, 7 February 2025 (UTC)[reply]
@CaptainEek Hmmm, pretty much what @Cyclopia said. Maybe it's because I have written 1000+ DYKs - I am a bit burned out coming up with hooks; and I also prefer just writing another DYK than coming up with hooks. Particularly as I started this (AI hooks) for some DYKs where the reviewer or DYK admins complained that my hooks are not "interesting" and I was stumped about what to do - then I asked AI (after feeding it DYK rules and the article text), and it generated a bunch of hooks, some of which were pretty decent, and did satisfy the "boring" crowd. Frankly, now I just outsource most of my hooks to AI, because I no longer find coming up with my own worth my time (but of course, if you enjoy it, more power to you) :D And by article abstracts, sorry, I am a bit off my game today (fever, cold, etc.). I meant leads. After I finish my recent articles, I often ask AI to write Wiki MOS compliant leads (which then I copyedit and merge with my leads). AI does a pretty good job summarizing stuff. Obviously, it helps I am very familiar with my articles, so I can spot any errors AI makes (which are rare but happen). I would be more cautious using it for articles I haven't read - but people will do it, not much we can do about it (hopefully the issue of AI hallucination will be solved in the near future...). Piotr Konieczny aka Prokonsul Piotrus| reply here 11:18, 7 February 2025 (UTC)[reply]
I find the use of LLM for leads rather disappointing actually :( A lead is one of the only things most people read in an article. I often put as much time into a lead as I do the entire rest of an article. Thinking about what's important and how to best say it is so crucial. For example, along with the other regulars at American Civil War, I've spent years trying to come up with the perfect lead. We've had more discussions about the lead than anything else, agonizing over single words, and frankly we've come up with something rather amazing. No machine could make a better lead. CaptainEek Edits Ho Cap'n! 17:44, 7 February 2025 (UTC)[reply]
I am not incredibly impressed by the (current) capabilities of LLMs in generating elegant text, but remember that we are machines as well. There is no reason an algorithm cannot or will not ever generate a good lead. That said, apart from the issue of potential copyvio, I see no drawback in using LLMs to generate some initial ideas on which we humans can work on. cyclopiaspeak! 13:00, 10 February 2025 (UTC)[reply]
I'd personally argue the anti-AI sentiment (that I share) is not a result of opposition to the technology itself, but rather opposition to the unethical and wasteful nature of how the technology is being used. In other words, I wouldn't be so quick to dismiss our criticisms as "Luddite sentiment". /home/gracen/ (they/them) 16:18, 7 February 2025 (UTC)[reply]
@Gracen There are blurry boundaries, and all stuff can be misused, but to me it's more like missing the forest for the trees, and ignoring the potential for greater good due to mostly irrelevant concerns; I'd compare it to refusing to use electrical power because some of it comes from non-renewable sources, or criticizing the concept of medical treatment because some drugs come from companies that have behaved unethically, etc. Plus organizational inertia and fear of change ("we did not need AIs before so we don't need them now or forever, sonny boy cough, cough..."). Piotr Konieczny aka Prokonsul Piotrus| reply here 01:57, 12 February 2025 (UTC)[reply]
I appreciate your perspective, and I agree with you that criticizing AI overall is very similar to criticizing the concept of medical treatment because some drugs come from [unethical companies]. However, I and many others are not criticizing AI overall (although I won't say that nobody's irrationally opposed to AI), but we are in fact criticizing the unethical parts of it. (Skip to the second paragraph if you want to skip my AI rant.) I'm all for computer vision, text to speech (in cases that aren't deepfakes), and AI translation tools. However, I'm very much against LLMs due to the large amounts of energy they consume for what's essentially predictive text that's really good at pretending to think (however, not opposed to LMs in general). I'm also against generative image models due to the incredible levels of artist exploitation and stolen content that they are trained on. I'm also strongly opposed to the marketing of both of these technologies (LLMs and image models) as being things that they are not: i.e. machines capable of creativity and critical thinking.
To be clear, I believe that AI-assisted edit summaries have great potential. Editors should only have to explain the "why" of their edit in a summary, and leaving the "what" to a language model which is trained specifically for this purpose would be excellent. /home/gracen/ (they/them) 16:13, 13 February 2025 (UTC)[reply]
To be fair to the Luddites as well, they were very much left in the lurch by a transition to a system with atrocious working conditions, poor safety, and all-round disregard for ethics. Maybe there are similarities to opposition to AI, but have we considered that maybe the Luddites had a point, and that Luddism was, if not in the right, then at least not unambiguously worse than the government that violently suppressed it? Alpha3031 (tc) 05:48, 14 February 2025 (UTC)[reply]

Looking at those edit summary comparisons, I don't necessarily consider them "better". More verbose, certainly, but these are looking at them without the context of the actual edit. When comparing the diffs between two edits, "added artist", for example, is just as much of an explanation as "Added Stefan Brüggemann to the list of artists whose works are included", because the diff clearly shows that's what's happening. On a slightly different point, the summary "This "however" doesn't make sense here" is actually clearer than "Removed the word "However," from the beginning of the sentence", etc. The bigger problem is that all the LLM summaries (and some of the human ones) fail on one of the key points on what an edit summary is supposed to do, which isn't to explain what the edit was, but to explain why it was done. AI may be able to put in ten words what has been done, but the six words from a human explain why. - SchroCat (talk) 07:36, 7 February 2025 (UTC)[reply]

For human-written edit summaries, we do have a convention of being brevitous and clipped, although I would aver this is because it is unreasonable to have a 100-byte explanation for every 4-byte edit. If it was costless to actually describe the changes, I would much rather peruse a history filled with those than the current thing where there's just a solid row of 70 "ce" and "add date" edits and to find where a specific thing was added you have to manually bisect it 😭 jp×g🗯️ 08:25, 7 February 2025 (UTC)[reply]
You can pry ce from my cold dead hands. I do try to make more detailed summaries when it's more than just a ce haha Wilhelm Tell DCCXLVI (talk to me!/my edits) 17:38, 7 February 2025 (UTC)[reply]
Seconding! Help:Edit summary lists three reasons for edit summaries. Yes, as SchroCat says, they offer a rationale for the edit, but they should also describe the edit itself to save us from a binary search through the article history. Sure there's xtools:blame, but that's insufficient for characterizing deletions. ViridianPenguin🐧 (💬) 00:56, 8 February 2025 (UTC)[reply]
They are almost certainly "better" than no edit summaries, and I think no edit summaries is the rule, followed by auto-generated ones... :( Writing edit summaries is very rarely "fun", I think - most of us think it is a waste of our time, and it kind of is (writing a new sentence for an article is more productive than writing an edit summary; of course doing both is best but...). Piotr Konieczny aka Prokonsul Piotrus| reply here 11:19, 7 February 2025 (UTC)[reply]
Piotrus, the real problem with poor or non-existent edit summaries is that it wastes other editors' time having to check if the edit was a reasonable one. And I have no way of knowing whether or not someone with judgement I trust has already looked at the edit. Thus, it can waste the time of several editors. Edwardx (talk) 11:34, 7 February 2025 (UTC)[reply]
True, but since it is not required, most folks ignore it, like many minor best practices. Piotr Konieczny aka Prokonsul Piotrus| reply here 11:41, 7 February 2025 (UTC)[reply]
See, I just don't understand that. I won't claim I'm the most prolific editor, but I will claim that in all my time here, I've made exactly three edits in mainspace without an edit summary — and the last one was in May 2011.
(I definitely understand it not being required, though, because it's one of those things you can't enforce with technology. If edit summaries were required, we'd have an epidemic of edit summaries that read "edit", or ".", or "reghrhtrera". Require a certain number of characters, same thing but longer. There's no way to enforce a requirement for meaningful edit summaries, which would be the only requirement that would matter.) FeRDNYC (talk) 17:55, 7 February 2025 (UTC)[reply]
I do see a surprising number of cases where a reference says exactly the opposite of the claim it's supposed to be supporting. I suspect this is largely due to the human equivalent of "hallucinating." All the best: Rich Farmbrough 11:39, 7 February 2025 (UTC).[reply]
Wikidata, a fairly successful sister project of Wikipedia, has been content with relying almost entirely on auto-generated edit summaries for many years — who asked Wikidata users, though? It is simply impossible to add a summary in most cases on Wikidata; I don’t think anyone is ‘content’ with that, no one was asked whether they want edit summaries or not. As for the article more generally, I also agree with people who said that many of the AI-generated summaries are not at all better than human-written ones. stjn 13:35, 7 February 2025 (UTC)[reply]
Well, nobody is protesting either so... Piotr Konieczny aka Prokonsul Piotrus| reply here 14:05, 7 February 2025 (UTC)[reply]
Wikidata has many unsolved problems, like non-existing mobile editing, silent edit wars that happen without a semblance of edit summary in sight (since you can only add an edit summary there if you revert someone’s individual edit directly), and lack of more granular page protection. I don’t think anyone is ‘content’ with what I listed, they are just failures of governance that are put on the back burner by the fact that Wikidata is getting bigger and bigger and all other problems with it get smaller in importance. stjn 14:49, 7 February 2025 (UTC)[reply]
Maybe I should have used a different wording in the review. I actually agree with you that this is a significant shortcoming of Wikidata, it has annoyed me too when making edits on Wikidata. What I meant by "content" is that 1) WMDE (i.e. the people who make the actual decisions about Wikidata's interface design) doesn't seem to have felt a need to address this situation since the project's launch in 2012 (cf. phab:T47224), and 2) as Piotrus pointed out already, there don't seem to be widespread protests about it. Perhaps "complacent" would have been a better term. Still, I regard it as a relevant data point that Wikidata has been fairly successful (at least in attracting sustained participation) relying almost entirely on automated edit summaries.
So yes, I wouldn't disagree with failures of governance that are put on the back burner, although I would note that this expression could also be applied to the English Wikipedia's inability to address the longstanding and widespread problem of missing or misleading edit summaries. As I mention on my user page, a lot of my time as an editor here has been spent on checking edits on my watchlist and patrolling RC. And the aforementioned problem has a significant negative effect on this kind of work. (I do sometimes raise it with the editors responsible, although I have also received pushback.) Regards, HaeB (talk) 04:43, 8 February 2025 (UTC) (Tilman)[reply]
Like some others, I think this seems like a really sensible place to use LLMs, and I'd support a pilot. The question is how to make it workable. Probably the most flexible, energy-efficient way for a pilot is to have a button to click at time of publication that says "use AI summary", i.e. human input by default. We've probably all seen claims about how much energy queries take, and it would be a shame for Wikipedia to contribute to that -- parsing two versions of a page and summarizing the difference -- for a minor edit. If it works well, I could see a variety of use cases up to and including e.g. an experiment to turn AI summaries on by default for non autoconfirmed users. But yes, we don't want to completely replace human judgment, especially given edits frequently require context in past edit summaries, on the talk page, on other pages, etc. — Rhododendrites talk \\ 14:53, 7 February 2025 (UTC)[reply]
I would note that some of the problems with ‘ce’-type summaries are solved not by using AI, but by adding buttons to choose common edit summaries from, like Polish, Russian, Ukrainian et al. Wikipedias do by default, see ru:Википедия:Гаджеты/Кнопки описания правок. It is too easy to go to AI to solve interface problems that are solvable in an easier and environmentally friendlier fashion. stjn 15:02, 7 February 2025 (UTC)[reply]
We've probably all seen claims about how much energy queries take - which claims specifically, and how are they relevant for estimating the environmental impact of deploying a tool like Edisum?
There are a lot of wildly inaccurate claims out there about the energy use of current GenAI tools. See e.g. this new estimate, which points out flaws in earlier efforts and finds that typical ChatGPT queries using GPT-4o likely consume roughly 0.3 watt-hours [... which] is less than the amount of electricity that an LED lightbulb or a laptop consumes in a few minutes. And even for a heavy chat user, the energy cost of ChatGPT will be a small fraction of the overall electricity consumption of a developed-country resident. (Also, before anyone applies that estimate to the GPT-4 experiment in the paper: It is based on an output size of 500 tokens (~400 words, or roughly a full page of typed text), many times larger than typical edit summaries.)
What's more, as discussed in the review, 1) the authors of the present paper designed their model to use far fewer resources than GPT-4 and run on CPUs instead of GPUs, 2) WMF already operates a number of GPUs for other AI/ML purposes. And currently, every edit already triggers a cascade of computational processes on various servers, some of which incur nontrivial resource usage too, e.g. database operations, edit filter evaluations, and indeed processing in existing AI/ML models (Cluebot, ORES etc).
Overall, I'd encourage folks concerned about how Wikipedia's energy use contributes to climate change to take a more holistic view and pay attention to the Foundation's overall greenhouse gas emissions (m:Sustainability).
Regards, HaeB (talk) 08:56, 8 February 2025 (UTC) (Tilman)[reply]
But of course, some will try to veto it because of anti-AI sentiment. Me. I will try to veto it. Because it's a breathtakinginly unethical technology that is directly and fundamentally opposed to everything that an encyclopedia should stand for. Endorsing it teaches people to accept whatever bullshit the machine outputs instead of thinking and learning. XOR'easter (talk) 01:29, 8 February 2025 (UTC)[reply]
Certainly this, but I think even more than that. The reason people do still trust Wikipedia, to a great degree (and in spite of some people telling them not to) is precisely because it is written by actual people who have put actual thought into what they are doing. Replace that with AI, and we may as well be one more clickbait farm. Seraphimblade Talk to me 01:33, 8 February 2025 (UTC)[reply]
Our coming to rely upon AI would make us fucking hypocrites. XOR'easter (talk) 01:39, 8 February 2025 (UTC)[reply]
And no, the "it's just summaries, not articles" excuse won't fly. We give editors the boot for pumping slop into noticeboards and deletion debates. Bullshit is bullshit, whether in article space or not. XOR'easter (talk) 02:39, 8 February 2025 (UTC)[reply]
As already mentioned in the review, the English Wikipedia has rejected a blanket prohibition against use of AI, and what you claim won't fly is actually specifically highlighted as a possibly appropriate use in the nutshell of the popular WP:LLM essay.
You are evidently extremely emotional about this topic. Personally, I think such decisions should be made in a rational, fact-based manner. For example, while you didn't specify what you meant by AI being a breathtakinginly[sic] unethical technology, it's possible that you were in part worrying about its climate impact. A productive way to discuss such concerns would be to estimate the climate impact of this particular tool (if implemented), and how much it would contribute to the WMF's overall carbon emissions (see m:Sustainability). Given that the researchers already designed it to have low compute usage (running on CPU instead of GPU etc), I would be surprised if it would cause a substantial increase.
Regards, HaeB (talk) 03:58, 8 February 2025 (UTC) (Tilman)[reply]
You understand that snide condescension is a type of "extreme emotion" too, right? Parabolist (talk) 10:26, 8 February 2025 (UTC)[reply]
+1. WP:RGW attitudes have no place in our decision-making process. Thebiguglyalien (talk) 02:48, 9 February 2025 (UTC)[reply]
If nothing else, some years from now, this comment will be useful if I am accused of exaggerating a description of how fervently people would complain about LLMs back in the early 20s. Here I am not really sure what can really be objected to -- in an edit where I change "paralelled" to "paralleled", is it your actual opinion that we need a professional writer/editor (e.g. the level of competence we typically expect from editors) to manually type that out in an edit summary? Assuming that such a person can read the diff and type a simple edit summary like this in how many seconds? At a rate of how many dollars per hour? Is anyone volunteering to be sent an invoice for this? jp×g🗯️ 03:58, 11 February 2025 (UTC)[reply]
You don't need to write a whole paragraph about it, and you shouldn't, in that instance, write a whole paragraph about it; that junks up the history. It takes you what, all of two seconds to put "typo" in the edit summary? That's all that's needed—okay, you fixed a typo. No more than that is necessary. Seraphimblade Talk to me 04:02, 11 February 2025 (UTC)[reply]

I really don't think a lot of these AI generated summaries are needed. I would also note that I would still have to check and review a lot of these edits as the AI shows no signs of thought or credibility. Unless I know an article well, there is a chance that I don't even know what the edits are referring to, human or AI. One beneficial thing about human edits is that I can track patterns across edits. For example, a spam of edits by someone on an article may be a red flag, but if I see that it is going under a GA review and the editor is relatively well known, I don't feel like I am as needed to check it and I can spend my time elsewhere instead of tediously checking the veracity of every edit. ✶Quxyz 15:45, 9 February 2025 (UTC)[reply]




       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0