The Signpost

News and notes

More large-scale errors at a "small" wiki

By Bri, Eddie891, and Smallbones

Large-scale errors at Malagasy Wiktionary

Growth of Malagasy Wiktionary, 99.23% due to bot edits

A small wiki audit of the Malagasy Wiktionary found that the project, the second-largest Wiktionary by number of entries (over 6,103,961), has had a large number of its pages automatically translated. Bot-Jagwar is a bot account run by Jagwar, the sole admin who has made edits. On the project, his bot has made more than 22 million edits (and counting). Jagwar also has a secondary bot account, Bot-Jagwar II, which has made a further 6,976 edits. Another major bot contributing to mg.wikt, making the exact same type of edit, is Ikotobaity, run by Lohataona, which made 2,456,748 edits before going inactive on 20 October 2017. These three bots have created 6,076,769 new mainspace pages, which is 99.23% of all mainspace pages on mg.wikt. (Jagwar also ran bot edits on his main account, so the true number of bot-created entries is likely 50,000 higher.)

In this blog post, Jagwar detailed the history of his bot and mg.wikt. The bot began editing in 2010, at a rate of 50,000 edits per day, initially simply importing foreign words from other wiktionaries. After the wiki reached 200,000 pages in 2011, he wrote a script that "upload[ed] the word forms of that language", which propelled the Malagasy Wiktionary to become the third largest. In 2012, Jagwar developed a more refined script, which uses NLP and automated translation to generate new entries, with no human intervention or oversight. In the blog post, he wrote that translation errors were estimated at less than 5%, though he had "no precise idea" of the actual rate.

There is no active editing community, and Jagwar is the sole active admin on the site. Jagwar himself has made only 6 edits in the last 90 days, of which only 3 were in mainspace. The audit noted that there are various mistakes in the entries. In a random survey of 100 non-Malagasy entries, the auditor concluded that 49 were "unusable", 29 "partially usable", and only 22 "fully correct and usable" (though they may still have minor errors). Of Malagasy entries, the report noted that:

There are 41,902 entries categorised as lacking any definition, most of which seem to be Malagasy entries, and around 30,000 of which are the result of the definitions being removed due to copyright violation many years ago. Although there are 1,150,182 Malagasy entries in total, most of these are inflected forms, which can generally be safely created by bots. These definitionless entries are not strictly speaking incorrect, but a definition is the most central function of a dictionary, so these entries fail to be a useful part of the dictionary as a whole.

The bots also ran 218,156 edits at chr.wikt from 2012 to 2014 and 127,389 edits at ku.wikt from 2012 to 2013. The audit concluded that "Even an editing community of the size of the biggest Wiktionary, en.wikt, would not be able to clean up after these bots by hand". It strongly recommended deleting all non-Malagasy entries, removing translation sections, and telling the bot owners to cease automated creation of entries, and weakly recommended deleting all definition-less entries. – adapted by Eddie891 from Large-scale errors at Malagasy Wiktionary, written by Metaknowledge, with help from Surjection, AryamanA, Erutuon, and Smashhoof, along with input from a fluent speaker of Malagasy who wishes to remain anonymous.

Inline parenthetical citations deprecated

A Request for Comment (RfC) to deprecate the inline parenthetical citation style was closed by Seraphimblade on 5 September as having reached consensus "that inline parenthetical referencing should be deprecated". The RfC, which was begun by CaptainEek on 5 August, drew a large amount of attention and discussion. A watchlist notice for the RfC was placed on 29 August after a discussion determined that it was a sufficiently high-profile RfC.

In closing the discussion, Seraphimblade noted that roughly 71% of the community had supported the proposal and that there was only consensus to deprecate "parenthetical style citations directly inlined into articles", rather than {{harv}} style-references in <ref></ref> tags. As a result of the RfC, the WP:PAREN and WP:CITEVAR guidelines need updating, though as of The Signpost's publication deadline, what the update would look like was still under discussion. Before the RfC, CITEVAR specifically stated that "editors should not attempt to change an article's established citation style merely on the grounds of personal preference" and cited a 2006 Arbitration Committee decision that "Wikipedia does not mandate styles in many different areas", including citation style. E

More news

Brief notes

S

Discuss this story

Kiev moves to Kyiv

@Bri and Smallbones: in "While most participants in the RfA cited Wikipedia's common name policy", is RfA meant to be RfC or RM? --DannyS712 (talk) 22:49, 27 September 2020 (UTC)

Jagwar

As Jayen466 said: "Whatever Wikipedia as a community is doing, it is more of a vehicle for contributors' self-indulgence than it is a concerted endeavour to bring free knowledge to the world." Our community coddles editors like Jagwar and AmaryllisGardener because that's what we're really about, which is sad. Chris Troutman (talk) 00:54, 28 September 2020 (UTC)

Total number of pageviews

This is a misleading and not very useful metric, as it includes bot and spider views. Admittedly, the fault largely lies with the Foundation's Wikistats 2.0 tool, which mentions this issue in the small print - "In this data we try to separate bot traffic and focus on human user page views" - but nevertheless presents the total views as the default. The trick for following that "focus on human user page views" recommendation is to click "Split by agent type" in the left sidebar and then use the checkboxes to select the right combination of metric components. (Also, while the linked data documentation fails to mention it, be aware that a substantial amount of views were reclassified as "automated" starting in May 2020.)

Separately, while it seems indeed true that there was a coronavirus-related traffic peak around April and May, pageviews drop considerably from April to August every year due to seasonal changes. So the comparison in the article ("... down from the recent high in April ...") is not very meaningful.

Regards, HaeB (talk) 02:07, 28 September 2020 (UTC)

The Tropical cyclone WikiProject

That's a rather usual name for a tropical cyclone... --Guy Macon (talk) 06:35, 28 September 2020 (UTC)

I'm not sure this is an official naming convention yet, maybe just an unofficial plan. I know the agency in charge of this has run out of regular people's names and moved on to Greek letters, e.g. tropical storm Beta. I don't see any reason that they couldn't use the names of various WikiEntities if they run out of Greek letters. So we could have tropical storm "Tropical storm", or Hurricane WMF, tropical storm Women in red, or category 5 hurricane The Signpost. Let me know what you think. Smallbones(smalltalk) 12:39, 28 September 2020 (UTC)
"Some people don't understand things as well as I do." --Gracie Allen
--Guy Macon (talk) 14:31, 28 September 2020 (UTC)

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0