The Signpost

Essay

Machine-written articles: a new challenge for Wikipedia

By Rory Jaffe


Interview with ChatGPT

This interaction with ChatGPT was recorded on January 29, 2023.

Author: Are any Wikipedia articles written by ChatGPT?

ChatGPT: No, ChatGPT is not capable of writing articles on Wikipedia. It can only generate text based on the input it receives and its training data.

Author: What should I do with a Wikipedia article someone wrote using ChatGPT?

ChatGPT: Articles generated by language models like ChatGPT are not suitable for use on Wikipedia as they are not written by humans and lack context and accuracy. If you have come across such an article on Wikipedia, you should report it to the Wikipedia community for review and potential removal, as Wikipedia has strict policies regarding the sources and reliability of the information on the platform.

My Experience

Near the end of January, I ran across some odd-sounding articles in draftspace. All the articles had the same structure and layout, even though they were about different things. The articles also had impeccable English, though the author's comments did not. Suspecting something was up, I found an artificial intelligence (AI) article detector, which pronounced all these articles machine-generated with greater than 99.9% confidence. Not knowing what to do, I went to the calm environs of the Administrators' noticeboard for incidents. That discussion is worth reading, as it shows a group of knowledgeable Wikipedians struggling with and debating how to deal with these articles. Were they hoaxes? Copyright violations? Good for publication? How does generated text fit within Wikipedia's requirements for articles?

Policies?

As it turns out, an effort was already underway to develop a policy regarding articles written by ChatGPT and its relatives: Wikipedia:Large language models. There, and on its associated talk page, you can see the reasoning related to these articles. In short, AI-generated text is not reliably correct, may not have a neutral point of view, needs verification, can occasionally violate copyright, and can downright lie. All of this follows from how these models work. A model is trained on a large corpus of text, much of which would not meet Wikipedia's sourcing and neutrality criteria, and it synthesizes its output without regard to whether the text maps to a real source. To quote the ChatGPT general FAQ: "These models were trained on vast amounts of data from the internet written by humans, including conversations, so the responses it provides may sound human-like. It is important to keep in mind that this is a direct result of the system's design (i.e. maximizing the similarity between outputs and the dataset the models were trained on) and that such outputs may be inaccurate, untruthful, and otherwise misleading at times."

Finding More

I then started going through recent drafts and new articles looking for text reminiscent of what I had seen in the first articles I identified. It didn't take long to find more. The current limiting factors are that I lack tools for reviewing drafts as rapidly as I can review new articles as a new page patroller, and that I don't have the resources (including time and patience) needed to do this consistently and daily.

I've put the {{AI-generated}} template on those articles, and not one author has disagreed with the finding. You can search for the template by entering hastemplate:AI-generated in the Wikipedia search box. Expand the search to the Draft namespace to see the drafts so marked. About sixty articles have been tagged. Several of the previously tagged articles have either been deleted or de-tagged once the generated text was replaced with real text, but many remain. You can read those remaining examples to get a feel for AI-generated articles.
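The same search can also be run programmatically. The following is a minimal sketch, not a tool used for this essay: it only builds a query URL for the MediaWiki search API and reads titles out of a response. It assumes the English Wikipedia endpoint and that namespace 118 is the Draft namespace there; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(template="AI-generated", namespaces=(0, 118), limit=50):
    """Build a search URL for pages that transclude the given template.

    Namespace 0 is article space; 118 is assumed to be Draft space.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"hastemplate:{template}",
        "srnamespace": "|".join(str(n) for n in namespaces),
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def titles_from_response(data):
    """Extract page titles from a decoded JSON search response."""
    return [hit["title"] for hit in data.get("query", {}).get("search", [])]
```

Fetching the URL and decoding the JSON is left out on purpose, so the sketch makes no network calls; any HTTP client can be used for that step.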

Editor's note: If you are reading this in the future and none of these are available, some representative drafts have been retained more permanently at User:JPxG/LLM dungeon.

I have been conservative in identifying articles: I only tested articles that had a similar appearance, and I only tagged them when the detector reported greater than 99% confidence that the text was machine-generated. I am sure I've missed many more articles. I was looking for typical features such as "in conclusion" starting the last paragraph, repetitive use of the article's title without abbreviation or variation, and consistent sentence and paragraph length. A more sophisticated AI user would use better prompts to the AI software and produce harder-to-detect output; the ones I've found typically were produced by asking ChatGPT something like "Write a Wikipedia article about XXXX."
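The surface features described above can be turned into a rough pre-screen. What follows is an illustrative sketch only, not the procedure used for this essay: the function names are hypothetical, the sentence splitter is naive, and the returned numbers are signals for a human reviewer rather than a verdict.

```python
import re
import statistics


def _coefficient_of_variation(values):
    """Return stdev/mean; lower values mean more uniform lengths."""
    if len(values) < 2:
        return 0.0
    mean = statistics.mean(values)
    return statistics.pstdev(values) / mean if mean else 0.0


def style_signals(title, text):
    """Compute rough stylistic signals for a plain-text article body."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [
        s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s
    ]
    return {
        "last_paragraph_starts_with_in_conclusion": bool(paragraphs)
        and paragraphs[-1].lower().startswith("in conclusion"),
        "title_mentions": len(
            re.findall(re.escape(title), text, flags=re.IGNORECASE)
        ) if title else 0,
        "sentence_length_cv": _coefficient_of_variation(
            [len(s.split()) for s in sentences]
        ),
        "paragraph_length_cv": _coefficient_of_variation(
            [len(p.split()) for p in paragraphs]
        ),
    }
```

A low coefficient of variation together with a high title count and an "in conclusion" ending would only be a reason to look more closely; plenty of human-written text shares these traits.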

I test for articles typically using https://openai-openai-detector.hf.space/, though other sites exist, including https://detector.dng.ai/, https://gptzero.me/, https://platform.openai.com/ai-text-classifier, and https://contentatscale.ai/ai-content-detector/. Before testing, I remove headings, inline references, and other text and markup that appear to have been added after text generation, as those can confuse the analyzer.
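The clean-up step before testing can be approximated in a few lines. This is a sketch under assumptions, not the exact procedure used here: it assumes the input is raw wikitext, handles only common markup (section headings, ref tags, templates, wikilinks, bold and italic quotes), and the function name is hypothetical.

```python
import re


def strip_wikitext(text):
    """Remove common wikitext markup, leaving mostly plain prose."""
    # Self-closing references first, e.g. <ref name="a" />.
    text = re.sub(r"<ref\b[^>]*/>", "", text)
    # Paired references and their contents.
    text = re.sub(r"<ref\b[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Templates such as {{citation needed}}; repeat to unwrap nesting.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Section headings such as "== History ==".
    text = re.sub(r"^=+[^=\n]+=+[ \t]*$", "", text, flags=re.MULTILINE)
    # Wikilinks: keep the label of [[target|label]] or the target of [[target]].
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Bold and italic quote markup.
    text = re.sub(r"'{2,}", "", text)
    # Collapse runs of blank lines left behind by the removals.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Tables, images, and categories are not handled; for anything beyond a quick check, a real wikitext parser would be a better choice than regular expressions.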

Problem?

Is this a real problem? I believe it is. Many of the articles sound reasonable but may have serious errors. The conversation at the administrators' noticeboard includes an analysis of a generated article on geckos. The article contains a lot of specific plausible-sounding information (e.g., size range), much of which is wrong.

ChatGPT will even provide references if requested, but those references are synthesized from its input text and, while sounding correct, usually do not point to real articles. For example, when I asked it for references on an article ChatGPT wrote for me on Sabethes cyaneus (a mosquito), one of the references was "Sabethes cyaneus" (Encyclopedia of Life): https://eol.org/pages/133674. That page does exist, but is for Clavaria flavopurpurea, a fungus. Another reference it provided also had a link, but the link pointed to an article about a mink, and the reference itself was fictional.

There are efforts to improve these programs, and I am sure that eventually they will be successful. Currently, however, those efforts fall well short. One such effort, "Elicit" (elicit.org), only searches research papers and summarizes them. I asked Elicit "What are the characteristics of Sabethes cyaneus?" It summarized one reference as "Sabethes cyaneus is a species of frog."

These false but plausible answers are an inherent property of the current models and are called "hallucinations".

Even once these programs improve, there will still be significant concerns limiting the direct use of generated text, such as:

  1. Did the program only use sources acceptable to Wikipedia?
  2. Is the information up to date?
  3. Can the program identify correct references? (One of the biggest hurdles, as the inherent nature of current models is that the output is not linked to a specific source.)
  4. Will the program avoid hallucinating?

However, programs like ChatGPT are great for generating ideas for articles and helping to "mock up" a good article, if the user takes a sophisticated view of the output, using it more for inspiration than for a source of truth. The proposed policy Wikipedia:Large language models has additional information on how these programs can be used to improve Wikipedia.


Discuss this story

The links are in the "Finding More" section; the primary one is https://openai-openai-detector.hf.space/ Bri (talk) 18:21, 20 February 2023 (UTC)
Thanks! I didn't see a finding more section but I could just be dumb. ― Blaze Wolf (Talk, Blaze Wolf#6545) 18:23, 20 February 2023 (UTC)
ChatGPT creates plausible-sounding bullshit. In cases where it has a lot of very similar sources to draw from, such as mostly-empty space-filler articles about an upcoming racing video game (for which it would have about a thousand examples) it can generate something low on nonsense. For something more unique, the bullshit quota is higher. In all cases, though, you can't tell what's bullshit without checking it line by line, because it's all plausible-sounding. Similarly, the sources will always be nonsense, because it isn't generating text based on specific sources, it's generating plausible-sounding reference text bullshit, with no connection to anything. --PresN 19:29, 20 February 2023 (UTC)
Yes I'm not trying to argue that we should be using ChatGPT (because frankly no one should), simply that it isn't 100% bad all of the time. ― Blaze Wolf (Talk, Blaze Wolf#6545) 19:31, 20 February 2023 (UTC)
In fact I have encountered situations where it likes to hallucinate (I asked it a few things regarding Splatoon and it kept thinking the special gauge was the amount of ink the weapon had which is not true whatsoever) no matter what I tell it. ― Blaze Wolf (Talk, Blaze Wolf#6545) 19:33, 20 February 2023 (UTC)
One of the data sources for ChatGPT is Wikipedia, so if you ask it to write about something already in Wikipedia, there’s a likelihood that it will select correct information for its output. — rsjaffe 🗣️ 22:24, 20 February 2023 (UTC)
WP:Randy in Boise can also make good contributions most of the time, but the few times he's wrong still make him a net negative. AI seems to be a long way from getting past this level of ability. Daß Wölf 20:24, 24 February 2023 (UTC)

"I test for articles typically using https://openai-openai-detector.hf.space/" - this and various other currently available "ChatGPT detectors" (including OpenAI's own) are highly unreliable. https://openai-openai-detector.hf.space/ actually already says on the tin that it is a detector for GPT-2 (released in 2019 and very different from ChatGPT). Given the article's focus on the dangers of misinformation, it's a bit sad and ironic that the Signpost is itself providing such dubious recommendations here without any caveats.

Regards, HaeB (talk) 11:12, 21 February 2023 (UTC)

The article glosses over a lot of the issues regarding detection. It was just a brief intro. I emphasized in the article that I was using a very insensitive method of finding LLM-generated text. There were a couple of reasons I went about things as described there (and to note: I no longer rely solely on the GPT-2 detector). 1) at the time I started, other detectors available were very opaque as to how they were constructed; 2) the nature of the output, even though the models are different, has many similar characteristics, so a GPT-2 detector would have some sensitivity and specificity; 3) I intentionally minimized false positives, as those irritate article contributors, by doing a vigorous pre-screen of the text. As to point two, note that at least one of the recommended detectors (https://gptzero.me/) is not based on the GPT model, but rather on the text output characteristics. As to point three, I used the authors' feedback as an indicator of the false positive rate: getting no complaints after a lot of tags is a decent indicator that the false positive rate is low. — rsjaffe 🗣️ 18:53, 21 February 2023 (UTC)
Good to hear that you are proceeding diligently when patrolling new articles (and to be clear, this is very important work and it's good to call attention to this issue). But the part with the tool recommendations was not including any caveats about false positives, and should not have been published in this form.
"the nature of the output, even though the models are different, has many similar characteristics, so a GPT-2 detector would have some sensitivity and specificity" - what research is this claim based on? (I mean, of course any detection method has "some sensitivity and specificity", the question is whether they are good enough.)
"is not based on the GPT model, but rather on the text output characteristics" - it seems that there is some fundamental confusion here between the model that is doing the detection and the model whose output is being detected (and/or the features of its output). https://openai-openai-detector.hf.space/ is also not using "the GPT model" (there are many actually) to detect GPT-2 output, but RoBERTa instead.
Regards, HaeB (talk) 06:57, 24 February 2023 (UTC)


The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0