"AI" is a silly buzzword that I try to avoid whenever possible. First of all, it is poorly defined, and second of all, the definition is constantly changing for advertising and political reasons. If you want an example of this, look at this image, which illustrates our own article on "AI": it was generated using a single line of code in Mathematica. Simply put, the "AI effect" is that "AI" is always defined as "using computers to do things computers aren't currently good at", and once they're able to do it, people stop calling it "AI". If we just say the actual thing that most "AI" is – currently, neural networks for the most part – we will find the issue easier to approach. In fact, we have already approached it: the Objective Revision Evaluation Service has been running fine for several years.
With that said, here is some silly stuff that happened with a generative NLP model:
Meta, formerly Facebook, released their "Galactica" project this month, a big model accompanied by a long paper. Said paper boasted some impressive accomplishments, with benchmark performance surpassing current SoTA models like GPT-3, PaLM and Chinchilla – Jesus, those links aren't even blue yet, this field moves fast – on a variety of interesting tasks like equation solving, chemical modeling and general scientific knowledge. This is all very good and very cool. So why is there a bunch of drama over it? Some explanation of how it works is probably in order.
While we have made ample use of large language models in the Signpost, including two long articles in this August's issue which turned out pretty darn well, there is a certain art to using them to do actual writing: they are not mysterious pixie dust that magically understands your intentions and synthesizes information from nowhere. For the most part, all they do is predict the next token (i.e. a word, or a piece of one) in a sequence – really, that's it – after having been exposed to vast amounts of text to get an idea of which tokens are likely to come after which other tokens. If you want to get an idea of how this works on a more basic level, I wrote a gigantic technical wall of text at GPT-2. Anyway, the fact that they can form coherent sentences, paragraphs, poems, arguments, and treatises is purely a side effect of text completion (which has some rather interesting implications for human brain architecture, but that is beside the point right now). The important thing to know is that they just figure out what the next thing is going to be. If you type in "The reason Richard Nixon decided to invade Canada is because", the LLM will dutifully start explaining the implications of Canada being invaded by the USA in 1971. It's not going to go look up a bunch of sources and see whether that's true or not. It will just do what you're asking it to, which is to say some stuff.
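If "predict the next token" sounds abstract, here is a toy sketch of the idea in Python, using nothing but bigram counts over a made-up three-sentence corpus. This is a deliberately crude stand-in – real LLMs use neural networks over subword tokens and billions of documents – but the core behavior is the same: it continues whatever you type with whatever tended to come next, with no notion of whether the result is true.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus standing in for training data (hypothetical).
corpus = (
    "the cat sat on the mat . the cat sat on the rug . "
    "the dog chased the cat ."
).split()

# Count which token follows which: a bigram model, the simplest
# possible "predict the next token" machine.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def complete(prompt, length=6):
    """Greedily append the most frequent next token, over and over."""
    tokens = prompt.split()
    for _ in range(length):
        candidates = following.get(tokens[-1])
        if not candidates:
            break  # never seen this token before; nothing to predict
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)

print(complete("the cat"))  # → the cat sat on the cat sat on
```

Note that the completion happily loops back on itself and asserts nothing verifiable: the model only knows what tends to follow what, not what is the case. Scale that up by many orders of magnitude and you get fluent prose with exactly the same epistemic status.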
This would have been a great thing to explain on the demo page, but for some reason it was decided that the best way to showcase this prowess would be to throw a text box up on the Internet, encouraging users to type in whatever and generate large amounts of text, including scientific papers, essays... and Wikipedia articles.
During the three days the demo was up, we requested an article about The Signpost. The writing was quite impressive, and indeed was indistinguishable from a human's output. You could learn a lot from something like this! The problem is that we were learning a bunch of nonsense: for example, we apparently started out as a print publication. Unfortunately, we didn't save the damn thing, because we didn't think they were going to take everything down three days after putting it up. The outlaws at Wikipediocracy did, so you can see an archived copy of their own attempt at a Galactica self-portrait, which is full of howlers (compare to their article over here).
Ars Technica later wrote a scathing review of the demo, noting several issues. A little digging into their sources turns up a Twitter user who managed to get Galactica to write papers on the benefits of eating crushed glass. Several of these had the basic appearance of valid sources while containing claims like "Crushed glass is a source of dietary silicon, which is important for bone and connective tissue health", and one generated review paper described all the studies showing that feeding pigs crushed glass is great for improving weight gain and reducing mortality. Of course, if there were health benefits to eating crushed glass, this is probably what papers about it would look like, but as it stands, the utility of such text is dubious. The same goes for articles on the "benefits of antisemitism", which mrgreene1977 wisely did not quote from, but one can imagine what kind of tokens would come after what kind of other tokens.
Will Douglas Heaven's article for MIT Technology Review "Why Meta's latest large language model survived only three days online" leads with the statement, "Galactica was supposed to help scientists. Instead, it mindlessly spat out biased and incorrect nonsense", and things get worse from there. Apparently, the algorithm was prone to backing up its points (like a wiki article about spacefaring Soviet bears) with fake citations, sometimes from real scientists working in the field in question. Lovely! Well worth reading, with far too many great examples in there to quote, and even more if you follow their suggestion to look at Gary Marcus's blog post on it.
In their defense, the Galacticans did note, at the bottom of a long explanation of how much the website rules:
"Language Models can Hallucinate. There are no guarantees for truthful or reliable output from language models, even large ones trained on high-quality data like Galactica. NEVER FOLLOW ADVICE FROM A LANGUAGE MODEL WITHOUT VERIFICATION. [...] Galactica is good for generating content about well-cited concepts, but does less well for less-cited concepts and ideas, where hallucination is more likely. [...] Some of Galactica's generated text may appear very authentic and highly-confident, but might be subtly wrong in important ways. This is particularly the case for highly technical content."
But then, even when attempting to use it correctly, it had problems. The MIT Technology Review report links to an attempt by Michael Black, director at the Max Planck Institute for Intelligent Systems, to get Galactica to write on subjects he knew well; he ended up concluding that Galactica was dangerous: "Galactica generates text that's grammatical and feels real. This text will slip into real scientific submissions. It will be realistic but wrong or biased. It will be hard to detect. It will influence how people think." He instead suggests that those who want to do science should "stick with Wikipedia".
Perhaps it would be best to give the last, rather spiteful word to Yann LeCun, Meta's chief AI scientist: "Galactica demo is offline for now. It’s no longer possible to have some fun by casually misusing it. Happy?"
Most of the issues and controversies we run into with ML models follow a familiar pattern: some researcher decides that "Wikipedia" is an interesting application for a new model, and creates some bizarre contraption that serves basically no purpose for editors. Nobody wants more geostubs! But this is not a problem with the underlying technology.
The field of machine learning is growing extremely quickly, both in terms of engineering (the implementation of models) and in terms of science (the development of vastly more powerful models). Whatever opinion anyone has about these things today is simply going to be wrong a few months from now. These models will only grow in importance, and I think that any editor who does not try to read as much about them as possible and keep abreast of developments is doing themselves a disservice. Not wanting to be a man of all talk and no action, I wrote GPT-2 (while its successor model, GPT-3, is more relevant to current developments, it has essentially the same architecture as the old one, so if you read about GPT-2 you will understand GPT-3).
Moreover, we have already been tackling the issue of neural nets on our own terms: the Objective Revision Evaluation Service has been running fine for several years. It seems to me that, if we were to approach these technologies with open minds, it could be possible to resolve some of our most stubborn problems, and bring ourselves into the future with style and aplomb. I mean, anything is possible. For all we know, the Signpost might start putting out print editions.
Discuss this story
Let's forget about the print editions of The Signpost please! And maybe we should still define AI as artificial ignorance. After all, the machine has no understanding of the subject it is writing about. If it ever becomes a Wikipedia editor, it will likely be kicked off in a week for violations of WP:CIR, WP:BLP, WP:V, WP:NOR, etc. Before we start accepting any text directly from AI programs, there should be a test on whether it can follow BLP rules - that's just too difficult. Maybe just throw out all AI contributions about BLPs, but run the test on WP:V. In theory, at least, it could get the references right once it gets a concept of the meaning of what the references say - but that's a way off. Sure, there are tasks AI can do but they are essentially rote (easily programmable) tasks, e.g. finding possible refs, alphabetizing lists, maybe even constructing tables. Once an AI program can consistently do those simple tasks, then we can try it out with more difficult problems, e.g. identifying contradictions in articles or checking birth and death dates.
ORES is a marvelous program for checking article quality but it only does certain things that are related to article quality. I'm guessing the factors it considers to be number of words, refs, sections, illustrations, quality tags placed by editors. Maybe even incoming links and outgoing links. It can count real good and the added value is the correlation of these counts to real person evaluation of quality. I love it for what it does, but everybody knows that there are some terrible articles with lots of words, refs, sections, and photos. Smallbones(smalltalk) 16:51, 29 November 2022 (UTC)[reply]