The Signpost
Single-page Edition
WP:POST/1
30 August 2020

News and notes
The high road and the low road
In the media
Storytelling large and small
Featured content
Going for the goal
Special report
Wikipedia's not so little sister is finding its own way
Op-Ed
The longest-running hoax
Traffic report
Heart, soul, umbrellas, and politics
News from the WMF
Fourteen things we’ve learned by moving Polish Wikimedia conference online
Recent research
Detecting spam, and pages to protect; non-anonymous editors signal their intelligence with high-quality articles
Arbitration report
A slow couple of months
From the archives
Wikipedia for promotional purposes?
Obituaries
Marcus Sherman, Jerome West, and Pauline van Till
 

2020-08-30

The high road and the low road

By SnowFire and Nosebagbear

Scots Wikipedia language quality problems ripple around the Internet, make the news, and trigger Meta-Wiki response

King James I and VI, the actual person to have done the most damage to the Scots language in history (source). James moved his court from Scotland to London in 1603 and later commissioned the King James Version (Authorized Version) of the Bible in English only, not Scots. Both God and the government now spoke English.

The Scots Wikipedia is a quiet, sleepy, low-activity edition of Wikipedia written in the Scots language, the Anglic language traditionally spoken in the lowlands of Scotland. Nobody paid it much mind... until August 2020, when a Reddit thread entitled "I've discovered that almost every single article on the Scots version of Wikipedia is written by the same person – an American teenager who can’t speak Scots" spread across the Internet. This young volunteer, who dedicated a large amount of time over seven years to translating segments of the English Wikipedia into Scots, was apparently never told that maintaining English sentence structure and translating words 1:1 from a dictionary is no way to translate at all. Further investigation showed the quality problems ran deep: articles untouched by the prolific user in question were also written in poor-quality, ungrammatical Scots, meaning that many more articles on Scots Wikipedia may be essentially worthless. The author of the Reddit post called the incident "cultural vandalism on an unprecedented scale" and wrote that "This is going to sound incredibly hyperbolic and hysterical but I think this person has possibly done more damage to the Scots language than anyone else in history."

The story hit the news media, for both high and low reasons. For the high road, this was a massive and notable failure of Wikipedia, one that has likely poisoned training data sets for the Scots language used by translation algorithms, and led any curious human readers to think that Scots is simply English in an accent with a few funky words thrown in. For the low road, the hobbies and naivety of the prolific user were mocked. Some of the notable coverage includes:

Several of the tabloid-style sources omitted from this list got the story essentially wrong, confusing Scots with the Scottish Gaelic language, suggesting that the user might have just been writing in silly Groundskeeper Willie-ese, or that the user's admin status was relevant (a status much-misunderstood by the media). The problem was the user's edits: there has been no allegation of misuse of admin tools.

Within the Wikipedia community, several actions were kicked off. User:MJL, the only other active admin on Scots Wikipedia at the time, boldly set up their own "AMA" (short for 'Ask Me Anything') on the Scotland Subreddit to explain the situation as well as solicit interest in potential fixes for Scots Wikipedia. The prolific user apologized for his mistakes after being informed of his lack of proficiency in Scots and has withdrawn from editing for now. Various split discussions eventually coalesced into an RFC on Meta-Wiki: meta:Requests for comment/Large scale language inaccuracies on the Scots Wikipedia. The current short-term course of action with the most support seems to be having a bot perform some sort of mass rollback of affected articles if they meet criteria (which are still being determined), enlisting new admins, and some proposals for other new bots.

The long-term solution requires understanding how this disaster happened in the first place. On Wikipedia user page language templates, the prolific contributor only rated his own Scots proficiency at 2/5 and 3/5 (changing over time). If he was really that bad at Scots – more like a 1/5 – how did nobody notice? The answer: there simply wasn't anyone to notice. To the extent there ever was an authentic Scots-speaking Scots Wikipedia community, it had departed by 2012. The contributor's contributions were "Scots-y" enough to keep non-native speakers who paid mild attention to the wiki from realizing the extent of the problems, and the user himself was a young kid when this started, clearly without the best self-awareness. If even one or two native Scots speakers had been active, they could have sounded the alarm long before seven years of wasted, counterproductive effort had passed. The fundamental problem at Scots Wikipedia is the lack of a Scots-speaking community of editors. Perhaps not only bad things have emerged from the incident: the burst of publicity has drawn the attention of Scots language groups. If the end result is to expand the Scots Wikipedia community, then perhaps something good will have come of this. Sn

Interim Trust & Safety Case Review Committee

In early July, the Wikimedia Foundation announced the creation of the Interim Trust & Safety Case Review Committee (CRC), designed to allow appeal of certain less clear-cut cases decided by the WMF (both on-wiki and event bans), including appealing against a decision by T&S not to act on a complaint. A charter, a public call for applicants, and a Q&A with WMF Vice President of Community Resilience & Sustainability Maggie Dennis were also created. The CRC charter sets out the scope, objectives, and minimum candidate requirements.

The CRC is specifically temporary, designed to terminate with the creation of a permanent process as part of the Universal Code of Conduct. If those discussions have not concluded by July 1, 2021, then either a new call for candidates can be made for a new term, or a single extension of up to six months can be granted if there is a clear indication the process will wrap up by then (such as if an implementation date has been agreed).

Process: Maggie Dennis responded to a question: "Let's say user FooBar is blocked as a T&S office action and requests case review [...] What does the appeal process look like, both from FooBar's perspective and the review committee's perspective?"

Subject to process changing by the CRC, a rough outline was offered as follows:

  1. User emails inbox asking for a review
  2. WMF attorney confirms case is not within remit of "statutory, regulatory, employment, or legal policies", and so is subject to review
  3. User is notified it is under review and given likely timeline
  4. CRC Chair appoints 5 members who review the case for "appropriate handling; appropriate collection of evidence; appropriate outcomes"
  5. Members vote on whether to support, overturn (partially or fully), or return to the WMF for additional investigation
  6. WMF enacts that decision
  7. All involved users will be notified of decision

Overturning could occur on two main grounds: the sanction was inappropriately reached (the evidence didn't warrant the sanction) or the case did not fall within the T&S remit. This would indicate that a complaint could then be resubmitted at local community level (Arbitration Committee, Administrators' Noticeboard/Incidents (ANI) or equivalents). The publicly available documentation doesn't make it clear if a case could be simultaneously overturned on both grounds and whether that would still allow for a "double jeopardy" situation. Individuals may only make a single appeal per prohibition.

Candidates: the WMF imposes a number of eligibility requirements, including holding a current or prior advanced permissions role or being an experienced contributor as part of a Wikimedia affiliate. Candidates also need to be members in full good standing with no current sanctions and be fluent in English. Several roles were viewed as exclusive, including current/former WMF staff. The en-wiki community has decided to disallow currently serving arbitrators from acting as CRC members, which Maggie Dennis said would be accepted. Gender and linguistic diversity were also sought, the latter most likely also driving project diversity.

CRC members are intended to be able to spend up to five hours a week on the role, though there were repeated statements that it was anticipated to be less.

One particular requirement was part of a major theme: anonymity. As well as keeping all case information to themselves under a currently non-published reinforced non-disclosure agreement (NDA) – above and beyond the standard non-public information agreement – candidates made anonymous applications and are to keep both others' and their own membership secret. A number of changes were made after applications closed due to "negotiation between committee finalists and Deputy GC", including further limiting CRC membership knowledge to only three Board members but giving retired CRC members the right to self-disclose after 6 months.

The initial filter of applications was made by non-applying Stewards, with members chosen from that group by the WMF General Counsel Amanda Keton. The WMF is also hiring a contractor to support the committee.

Reporting: the CRC is to provide quarterly generalised reports (number of cases ratified, number of cases overturned). It's not clear whether additional information will also be provided, such as number of cases T&S prohibits from going to appeal. Nbb

Brief notes




2020-08-30

Storytelling large and small

By Smallbones and Jonatan Svensson Glad

Journalists often report on the workings of the large Wikipedia community by focusing on a few individuals. It's an old storytelling technique – older than Homer – that lets the audience identify with the "main actors" in a complex situation and draw general conclusions starting from the specific details embodied by the individuals. But does this technique reflect the true complexity of the Wikipedia community where so many editors interact? And what happens when the editing community is not so large?

Just another article on COVID-19 and Wikipedia?

"Covid-19 is one of Wikipedia’s biggest challenges ever. Here’s how the site is handling it." The Washington Post examines Wikipedia's response to the pandemic, focusing on the contributions of individual editors whom it identifies as Jason Moore, Netha Hussain, and Rosie Stephenson-Goodknight. Moore helped organize WikiProject COVID-19. Hussain, a doctor and researcher, wrote about COVID-19 and pregnancy. Stephenson-Goodknight wrote about fashion and the pandemic. They all contributed to the overall effort.

Our readers have likely seen articles like this before, though the Post does an exceptionally good job. Over a dozen articles in The Signpost have reported how Wikipedians have been affected by and reacted to the pandemic, including in our columns "Project report", "Community view", "Gallery", "Recent research", "Traffic report", "News from the WMF" and "From the editors". This column, "In the media", has reported over 7 months on about twenty stories published off-Wiki about Wikipedia's response, starting with Omer Benjakob's groundbreaking story published in Wired on February 9. Almost all these stories are highly complimentary to several individual editors, who deserve the recognition. Almost all report on the contributions of a broad segment of the community, which perhaps deserves even greater recognition.

A pleasant myth

"Why Wikipedia Decided to Stop Calling Fox a ‘Reliable’ Source" Noam Cohen in Wired traces Fox News's fall from the esteemed heights of being considered a "generally reliable" source on Wikipedia in the areas of science and politics. Starting with a series of challenges to Fox's reliability in the article Karen Bass by editor Muboshgu, Cohen ends with the reasoning of admin Lee Vilenski:

We don’t have to assume that Fox is acting in good or bad faith—we simply need to assess if we can trust the information being provided. In this case, a lot of users suggested using our policies that it couldn’t be trusted enough to be 'reliable' for these two topics.

In other words, Wikipedians simply needed to rationally reassess Fox's record in these two areas. It's compelling reading, and he accuses Wikipedians of being "old-school" and even of having "integrity". But many Wikipedians have distrusted Fox's reliability since the beginnings of the project. More likely this distrust simply grew stronger as time passed. Or perhaps the political balance of editors has changed over the years. Thanks for the kind words, Noam.

Kamala Harris and an unpleasant reality

In "The Wikipedia War That Shows How Ugly This Election Will Be" (August 13), The Atlantic examines the reactions to then-presumptive Democratic presidential nominee Joe Biden naming Kamala Harris as his vice-presidential running mate for the 2020 U.S. Presidential election. According to The Atlantic, several news sources, including Fox News, have crossed a line in their reporting on Harris. Perhaps the worst offender was an op-ed, now denounced by its publisher Newsweek, which argues that Harris is not eligible to run for the office which requires being a "natural born citizen". The author of the op-ed, John C. Eastman, doesn't question that Harris was born in Oakland, California, but was expounding on a novel theory of the meaning of "natural born citizen". According to Newsweek, this questioning of her eligibility is now being used by others to support the "racist lie of Birtherism" that was used against Barack Obama.

Wikipedia's reaction was fairly quick in reporting Biden's naming of Harris. Questioning Harris's racial identity and a sexist slur soon followed. One editor was banned. Within 45 minutes of the announcement, the article had been updated, vandalized, corrected, and semi-protected. The questioning of Harris's African American identity then moved to the talk page.

The Scots Wikipedia and smaller language communities

See News and notes for the main story on the Scots Wikipedia incident

"A Teen Threw Scots Wiki Into Chaos and It Highlights a Massive Problem With Wikipedia" looks at the language editions of Wikipedia that are supported by smaller editing communities, where problems can go undetected. One example cited by Gizmodo is the Croatian Wikipedia, whose admins have come under criticism for wide-ranging instant bans of editors who disagree politically with them. An article in The Signpost alerted the broader Wikipedia community to the problem, but an RFC is still pending a Steward close. Another example from Gizmodo is the Cebuano Wikipedia, the second largest Wikipedia by article count, yet almost entirely written by a non-native speaker from Sweden using a bot. A healthy community is essential to check the sanity of contributions and keep order, yet a look at List of Wikipedias shows that only 28 out of 313 language editions of Wikipedia have had more than 1000 active editors in the past 30 days. Only 80 editions have more than 100 active editors. Considering that many of these "active" accounts are bots, spammers, or passing admins banning the spammer, that's a lot of editions that need some love and care, both from enthusiasts and native speakers.

Whitewashing by cryptocurrency company

FT Alphaville (not paywalled) describes "something like an 'edit war'" on the article about Brad Garlinghouse, the CEO of Ripple Labs. Ripple is in the business of transferring money across borders using its own cryptocurrency. Garlinghouse was caught off-Wiki saying that SWIFT, a leader in the field of cross-border money transfer, had a 6% error rate – a claim which has been convincingly refuted. He has also had some legal difficulties. A controversy section which described these facts was removed several times, first by an anon whose IP address geolocates to a city near a known Ripple business address, then by a logged-in user who FT-A suggests may be a Ripple employee.

David Gerard, a Wikipedia administrator and noted cryptocurrency skeptic, reverted the removal of information about Garlinghouse four times over the course of three weeks, following a similar number of edits by others over two months. He was quoted as saying:

It’s not clear precisely who did this but, if it looks like corporate whitewashing and quacks like corporate whitewashing, then we’ll treat it as such.

The Signpost completely concurs with Gerard’s judgement on this matter. Cryptocurrency is a type of private token, something like money, issued on the web with a Rube Goldberg mechanism used to verify transactions. These digital wooden nickels have been commonly used in money laundering and other criminal transactions, and extensively advertised on Wikipedia. There are many more articles about cryptocurrency on Wikipedia that have suffered from whitewashing much more than this one.

Fundraising in India

The WMF published Wikimedia Foundation kicks-off fundraising campaign in India on August 5 and many Indian newspapers closely repeated the story, including Inventiva, News 18, The Quint and Live Mint. The Indian Express went well beyond the press release/blog, writing that "Its balancesheet however, tells a different story. According to a Wiki page on its fundraising statistics, the website was able to raise $28,653,256 between 2018-2019, bringing its total assets to $165,641,425. The previous financial year, it garnered $21,619,373 — a marked rise from the $56,666 it earned through donations in 2003."

In brief

Elon Musk
@elonmusk

Aliens built the pyramids obv

July 31, 2020
Elon Musk
@elonmusk

Please trash me on Wikipedia, I’m begging you

August 16, 2020



Do you want to contribute to "In the media" by writing a story or even just an "in brief" item? Edit next month's edition in the Newsroom or leave a tip on the suggestions page.




2020-08-30

Going for the goal

By Eddie891 and Gog the Mild
Connor Barth, a placekicker for the Tampa Bay Buccaneers, prepares to kick a field goal during the first quarter of the Bucs v. New York Giants National Football League military appreciation game at Raymond James Stadium in Tampa, Fla., Nov. 8, 2015.

This Signpost "Featured content" report covers material promoted from July 26 through August 22. For nominations and nominators, see the featured contents' talk pages.

An Orangutan
A clay tessera bearing a possible depiction of Odaenathus wearing a diadem
Apollo 15 Command Module Pilot Al Worden.
Lesser horseshoe bat (Rhinolophus hipposideros) with blue metallic identification band on left wing
A football card showing a portrait of Mann in his blue Yanks jersey
Cover of the first issue of Infinity Science Fiction; artwork by Robert Engle
A pilgrim makes a supplication in the direction of the Kaaba, the Muslim qibla, in the Sacred Mosque of Mecca.

19 featured articles were promoted this month.

Sigourney Weaver at the 2017 San Diego Comic-Con
Vilnius Historic Centre, a World Heritage Site in Lithuania.
Ernst van Dyk has won the Boston Marathon ten times, more than any other athlete.
The 2019 Wikimedian of the Year: Emna Mizouni
Clark Gable in a 1938 publicity still
Brad Pitt at the Washington, D.C premiere of Fury in 2014

20 featured lists were promoted this month.

20 featured pictures were promoted this month.

Bernardo Strozzi - Claudio Monteverdi (c.1630)

One featured topic was promoted this month.





2020-08-30

Wikipedia's not so little sister is finding its own way

By Lydia Pintscher
Wikidata is arguably one of Wikipedia's most successful sister projects. It has had a profound impact on Wikipedia in just a few years. Lydia Pintscher is the Product Manager for Wikidata at Wikimedia Germany. This essay was first published at Wikipedia @20 and has been licensed by the author under CC BY-SA 3.0.

In 2012, Wikipedia had grown and achieved so much in over a decade of creating an encyclopedia. But it was also at a point where fundamental change was needed: The world around Wikipedia was changing and Wikimedia had to find ways to make its content more accessible and support its editors in maintaining an ever increasing body of content in over 250 languages. The vision of a world in which every single human being can freely share in the sum of all knowledge was not achievable in this scattered way.

Ever since 2005 at the very first Wikimania, Wikimedia’s annual conference, one idea kept coming up: to make Wikipedia semantic and thus make its content accessible to machines. Machine-readability would enable intelligent machines to answer questions based on the content and make the content easier to reuse and remix. For example, it was not possible to easily find an answer to the question of what are the biggest cities with a female mayor because the necessary data was distributed over many articles and not machine-readable. Denny Vrandečić and Markus Krötzsch kept working on this idea and created Semantic MediaWiki, learning a lot about how to represent knowledge in a wiki along the way. Others had also started extracting content from Wikipedia, with varying degrees of success, and making the information available in machine-readable form.

So when the first line of code for the software that came to power Wikidata was written in 2012, it was an idea whose time had come. Wikidata was to be a free and open knowledge base for Wikipedia, its sister projects and the world that helps give more people more access to more knowledge. Today, it provides the underlying data for a lot of technology you use and the Wikipedia articles you read every day.

Being able to influence the world around you is such an important and empowering thing, and yet we are losing this ability a bit more everywhere every day. More and more in our daily lives depends on data, so let's make sure it stays open, free and editable for everyone in a world where we put people before data. Wikipedia showed how it can be done and now its sister Wikidata joins to contribute a new set of strengths.

Growing up

Wikidata always had bigger ambitions, but it started out by focusing on supporting Wikipedia. There were nearly 300 different language versions of Wikipedia, all covering overlapping (but not identical) topics without being able to share even basic data about these topics. Considering that most of these language versions had only a handful of editors, this was a problem. Small language versions were not able to keep up with the ever changing world and, depending on which language you could read, a vast amount of Wikipedia content was inaccessible to you. Perhaps someone famous had died? That information was usually available quickly on the largest Wikipedias but took a long time to be added to the smaller ones — if they even had an article about the person. Wikidata helps fix this problem by offering a central place to store general purpose data (like those found in the infoboxes on Wikipedia, such as the number of inhabitants of a city or the names of the actors in a movie) related to the millions of concepts covered in Wikipedia articles.

To start this knowledge base, Wikidata began by solving a simple but long-standing problem for Wikipedians, the headache of links between different language versions of an article. Each article contained links to all other language versions covering the same topic, but this was highly redundant and caused synchronisation issues. Wikidata’s first contribution was to store these links centrally and thereby eliminate needless duplication. With this first simple step, Wikidata helped eliminate over 240 million lines of unnecessary wikitext from Wikipedia and at the same time created pages for millions of concepts on Wikidata, providing the basis for the next stage. Once the initial set of concepts was created and connected to Wikipedia articles, it was time for the actual data to be added, introducing the ability to make statements about the concepts (e.g. Berlin is the capital of Germany). After that, last but not least, came the capability to use this data in Wikipedia articles. Now Wikipedia editors could enrich their infoboxes automatically with data coming from Wikidata.

Along the way, a fantastic community maintaining that data developed, much faster than the development team could have dreamed. This new community included new people who had never contributed to a Wikimedia project before and were now becoming interested because Wikidata was a good fit for them. It also included contributors from adjacent Wikimedia projects who were more interested in structuring information than writing encyclopedic articles and found their calling in Wikidata.

The number of concepts represented in Wikidata items
The number of editors on Wikidata since its start (the circles indicate the beginning and end of the mass-import of interwiki links)

Later, Wikidata's scope expanded to support other Wikimedia projects, such as Wikivoyage, Wikisource, and Wikimedia Commons, allowing them to benefit from a centralized knowledge base as Wikipedia did.

As it evolved, Wikidata became an attractive source for Wikimedia projects and those who used to data-scrape Wikipedia infoboxes. External websites, apps, and visualisations used this information as a basic ingredient: from a website for browsing artwork, to book inventory managers, to history teaching tools, to digital personal assistants. Now, Wikidata is used in countless places without most users even being aware of it.

Most recently, it became clear that we need to think beyond Wikidata to a large network of knowledge bases running the same software (Wikibase) to publish data in an open and collaborative way, called the Wikibase ecosystem. In this ecosystem, many different institutions, activists and companies are opening up their data and making it accessible to the world by connecting it with Wikidata and among each other. Wikidata doesn't need to be and shouldn't be the only place where people collaborate to produce open data.

At the time of writing of this chapter, Wikidata provides data about more than 55 million concepts. It includes data about such things as movies, people, scientific papers and genes. Additionally, it provides links to over 4,000 external databases, projects and catalogs, making even more data accessible. This data is added and maintained by more than 20,000 people every month and used in over half of all articles in Wikimedia projects.

Helping people (and machines) come together

Just like Wikipedia is not like any other encyclopedia, Wikidata is not like any other knowledge base. There are a number of things that set Wikidata apart. They are a result of striving to be a global knowledge base and covering a multitude of topics in a machine-readable way.

The most important differentiator is probably the acknowledgement that the world is complex and can’t easily be pressed into simple data. Did you know that there is a woman who married the Eiffel Tower? That the Earth is not a perfect sphere? A lot of technology today is trying to simplify the world by hiding necessary complexity and nuance. Conflicting worldviews need to be surfaced. Otherwise we take away people’s ability to talk about, understand, and ultimately resolve their differences. Wikidata is striving to change that by not trying to force one truth but by collecting different points of view with their sources and context intact. This additional context can, for example, include which official body disputes or supports which view on a territorial dispute. Without this focus on verifiability instead of truth and not trying to force agreement it would be impossible to bring together a community from different languages and cultures. For the same reason, Wikidata doesn’t have an enforced schema that restricts the data, but, rather, has a system of editor-defined constraints that highlight potential problems.

Being able to cover different points of view and nuance is not enough, however, for a truly global project. The data also needs to be accessible to everyone in their language without privileging any particular language by design. Because of this, every concept in Wikidata is identified by a unique ID instead of an English name. Q5, for instance, is the identifier for the concept of a human. It is then given labels in the different languages: “human” in English, “người” in Vietnamese and “ihminen” in Finnish. This way the underlying data is language-independent and everyone can see the data in their language when viewing or editing it. This of course does not eliminate the language issue, but it goes a long way towards more equity in contributing to Wikimedia’s content.
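The ID-plus-labels idea can be illustrated with a minimal sketch. Only the identifier Q5 and its three labels come from the text above; the dictionary layout, the function name `label_for`, and the choice to fall back to the bare ID are assumptions made for illustration and do not describe Wikidata's actual storage or API.

```python
# Toy model: concepts are keyed by a language-neutral ID, and each ID
# carries a separate label per language. Only Q5 and its three labels
# are taken from the text; everything else is a hypothetical layout.
LABELS = {
    "Q5": {"en": "human", "vi": "người", "fi": "ihminen"},
}


def label_for(item_id: str, language: str) -> str:
    """Return the label for item_id in the given language.

    If no label exists in that language, return the bare ID rather
    than silently preferring any one language (an assumption of this
    sketch, not a statement about Wikidata's fallback rules).
    """
    return LABELS.get(item_id, {}).get(language, item_id)


print(label_for("Q5", "en"))
print(label_for("Q5", "fi"))
print(label_for("Q5", "de"))  # no German label in this toy data
```

Because the statements about a concept hang off the ID rather than off any one label, adding a label in a new language changes nothing about the underlying data.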

Besides fabulous people, Wikidata’s ultimate secret sauce is its connections. All concepts in Wikidata are connected to each other through statements. The statement “Iron Man -> member of -> Avengers” for example tells us that Iron Man is a member of the Avengers. That one connection alone does not tell us much yet. But if you take a number of other similar connections you can easily get a list of all Avengers. And then make a list of the movies they first appeared in and the actors they were portrayed by. A lot of simple individual connections taken together are powerful. If you add on top of that the wide range of topics Wikidata covers, it becomes even more powerful because you can make connections that have not been made before. How about a list of species named after politicians? Now possible, thanks to these simple connections! And those are just the connections inside Wikidata itself; Wikidata also connects to a large number of external databases, catalogs and projects that make even more data available. Since Wikidata has such a large number of links to external resources, it can act as a hub through which you, and even more importantly any machine, can find a vast amount of additional information based on a single piece of data. If the ISBN of a book is known, then knowing its entry in the relevant national library is just a hop away. There might not be a direct link from an artist’s entry in the Louvre’s catalog to their entry in the Rijksmuseum’s catalog, but with Wikidata this connection is easily made, opening up yet more options for discovering knowledge.
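How many small connections add up to an answerable question can be shown with a toy example. The sketch below stores statements as plain (subject, property, value) tuples; the "Iron Man -> member of -> Avengers" statement is from the text, while the other rows, the property name "performer", and the helper functions are illustrative assumptions. Real Wikidata uses item and property IDs and a dedicated query service, none of which is modeled here.

```python
from collections import namedtuple

# A statement links a subject to a value through a property.
Statement = namedtuple("Statement", ["subject", "prop", "value"])

# Illustrative data only; plain strings stand in for Wikidata IDs.
STATEMENTS = [
    Statement("Iron Man", "member of", "Avengers"),
    Statement("Thor", "member of", "Avengers"),
    Statement("Iron Man", "performer", "Robert Downey Jr."),
    Statement("Thor", "performer", "Chris Hemsworth"),
]


def subjects_with(prop, value):
    """All subjects that have the given property set to the given value."""
    return [s.subject for s in STATEMENTS if s.prop == prop and s.value == value]


def values_of(subject, prop):
    """All values a subject has for the given property."""
    return [s.value for s in STATEMENTS if s.subject == subject and s.prop == prop]


# Step 1: follow "member of" backwards to list the team.
avengers = subjects_with("member of", "Avengers")

# Step 2: follow a second property from each member.
cast = {hero: values_of(hero, "performer") for hero in avengers}

print(avengers)
print(cast)
```

Each lookup is trivial on its own; the list of actors only emerges by chaining two of them, which is the point the paragraph above makes.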

Wikidata links to more than 4,000 external databases, projects and catalogs, creating a vast network.

Impacting Wikipedia

Its close connection to Wikipedia made all the difference for Wikidata, especially at the start. Without the community, experience, mindshare and tools that Wikipedia provided, Wikidata would not be where it is today. Wikidata gained a lot from its close association with Wikipedia. It is also giving back of course, not just by significantly lowering maintenance burdens through centralisation of data but also in a number of more subtle and indirect ways.

Before Wikidata, the different Wikimedia projects and language versions of each project worked in silos to a large degree. There was little collaboration on content across project and language boundaries. Wikimedia Commons had been around for a while as a central repository for media files that are shared between all Wikimedia projects, but by its nature it did not force a lot of collaboration. Because of this, a large part of the editors associated first and foremost with their language version of Wikipedia and only a distant second, if at all, with the Wikimedia Movement as a whole. Statements like “The Wikipedia in this and that language is terrible” were not uncommon when Wikidata started. The thought of using content that is shared with these other Wikipedias that were perceived as inferior was deemed frightening. Equally, the thought that the large Wikipedias could gain anything from contributions by smaller projects was unthinkable. By helping people connect across language and project boundaries, Wikidata has helped to steer Wikipedia away from a silo mentality towards a truly global movement where every project is recognized and valued for its contribution to the sum of all knowledge.

Improved search box using structured data from Wikidata

Wikidata also helps Wikipedia by being a fundamental building block for technical innovation - big and small. Simple changes like the improved search box when linking to another article in VisualEditor become possible thanks to structured data in Wikidata. Now the selector shows you the short description from Wikidata and you can select the right article to link to without having to look it up. Wikidata also makes possible more fundamental changes like overhauling Wikimedia Commons in order to make images more discoverable for Wikipedia editors and others. Wikidata provides the data necessary to build better experiences for Wikipedia’s editors and readers.

Through the data in Wikidata we can also understand Wikipedia better. We can analyse much more easily what content is covered and what is missing. Take the gender gap. It was known for a long time that Wikipedia’s content is skewed towards covering men. The simple fact that there are more Wikipedia articles about men than women is not very helpful for a big community though as it is too broad a problem to be motivated by and meaningfully make progress on. Wikidata allows us to see a more detailed picture and analyse the content by time period, country, profession of the person and other relevant characteristics. We can also see if there is a difference between the language versions of Wikipedia to see if any of them has a particularly narrow gender gap so we can learn from them. We can also see the geographic distribution of Wikipedia’s content and find blind spots on Wikipedia’s map of the world. The same can be done for any other content bias or gap that needs to be understood better. This way, Wikidata helps Wikipedia learn more about itself.

The gender gap on Wikipedia visualized per country of citizenship of the article’s subject. (tool by Envel Le Hir at denelezh.org)

Better understanding the knowledge that Wikipedia covers is a necessary first step towards countering biases and filling gaps. Wikidata can also help there by making it possible to generate automated worklists for a topic you care about. Interested in video games? You can make a list of all video games released in the last 10 years which are missing a publisher and start adding that data. How about party affiliations of politicians in your recent local election? Monuments in the city you last visited that are missing street addresses? All that is just a few clicks away, making it easier to contribute to collecting the sum of all human knowledge and making Wikipedia more complete.
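
A worklist of this kind is just a filter over structured records. The sketch below shows the video game example in Python under stated assumptions: the three records are invented sample data, and the field names and the `worklist` and `years_before` helpers are hypothetical, standing in for what a real query against Wikidata would return.

```python
from datetime import date

# Hypothetical sample records; a real worklist would come from a Wikidata query.
games = [
    {"label": "Game A", "released": date(2015, 3, 1), "publisher": "Studio X"},
    {"label": "Game B", "released": date(2018, 9, 12), "publisher": None},
    {"label": "Game C", "released": date(2005, 6, 30), "publisher": None},
]


def years_before(day, years):
    """Return the same calendar day `years` years earlier."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:  # 29 February when the target year is not a leap year
        return day.replace(year=day.year - years, day=28)


def worklist(items, today, years=10, missing="publisher"):
    """Labels of items released in the last `years` years that lack `missing`."""
    cutoff = years_before(today, years)
    return [
        item["label"]
        for item in items
        if item["released"] >= cutoff and not item.get(missing)
    ]


todo = worklist(games, today=date(2020, 8, 30))
```

With the sample data only "Game B" ends up on the list: "Game A" already has a publisher and "Game C" was released too long ago.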

And last but not least, Wikidata helps bring new contributors to Wikipedia. It opens up Wikimedia to new types of people, ones more interested in structuring information and connecting data points than writing long prose. And the small contributions that can be made on Wikidata lend themselves well to beginners who are initially overwhelmed by writing full articles. It also is a gateway for institutional contributors like galleries, libraries, archives and museums who want to make their content accessible.

Wikidata’s influence on Wikipedia far exceeds simply providing a few data points for infoboxes. It is a driver and supporter of change. Growing up with a big sister is not always easy. There’s the occasional disagreement and even fight but in the end you make up and stick together because you are the best team there could be. It is amazing to have someone to look up to. Wikidata is a project in its own right now, with its own reason for existence… but it will always be there to support Wikipedia.

Thank you, big sister! Wikidata owes you.





2020-08-30

The longest-running hoax

By Enwebb
Enwebb is the organizer of WikiProject Bats and founder of the Tree of Life Newsletter.

On August 7, WikiProject Palaeontology member Rextron discovered a suspicious taxon article, Mustelodon, which was created in November 2005. The article lacked references and the subsequent discussion on WikiProject Palaeontology found that the alleged type locality (where the fossil was first discovered) of Lago Nandarajo "near the northern border of Panama" was nonexistent. In fact, Panama does not even really have a northern border, as it is bounded along the north by the Caribbean Sea. No other publications or databases mentioned Mustelodon, save a fleeting mention in a 2019 book that presumably followed Wikipedia, Felines of the World.

The article also appeared in four other languages, Catalan, Spanish, Dutch, and Serbian. In Serbian Wikipedia, a note at the bottom of the page warned: "It is important to note here that there is no data on this genus in the official scientific literature, and all attached data on the genus Mustelodon on this page are taken from the English Wikipedia and are the only known data on this genus of mammals, so the validity of this genus is questionable."

This is not a Mustelodon.

Editors took action to alert our counterparts on other projects, and these versions were also removed. As the editor who reached out to Spanish and Catalan Wikipedia, I found it somewhat challenging to navigate these mostly foreign languages (I have a limited grasp of Spanish). I doubted that the article had very many watchers, so I knew I had to find some WikiProjects where I could post a machine translation advising of the hoax, and asking that users follow local protocols to remove the article. I was surprised to find, however, that Catalan Wikipedia does not tag articles for WikiProjects on talk pages, meaning I had to fumble around to find what I needed (turns out that WikiProjects are Viquiprojectes in Catalan!). Mustelodon remains on Wikidata, where its "instance of" property was swapped from "taxon" to "fictional taxon".

How did this article have such a long lifespan? Early intervention is critical for removing hoaxes. A 2016 report found that a hoax article that survives its first day has an 18% chance of lasting a year.[1] Additionally, hoax articles tend to have longer lifespans if they are in inconspicuous parts of Wikipedia, where they do not receive many views. Mustelodon was only viewed a couple times a day, on average.

Mustelodon survived a brush with death three years into its lifespan. The article was proposed for deletion in September 2008, with a deletion rationale of "No references given; cannot find any evidence in peer-reviewed journals that this alleged genus actually exists". Unfortunately, the proposed deletion was contested and the template removed, though the declining editor did not give a rationale. Upon its rediscovery in August 2020, Mustelodon was tagged for speedy deletion under CSD G3 as a "blatant hoax". This was challenged, and an Articles for Deletion discussion followed. On 12 August, the AfD was closed as a SNOW delete. WikiProject Palaeontology members ensured that any trace of it was scrubbed from legitimate articles. The fictional mammal was finally, truly extinct.

At the ripe old age of 14 years, 9 months, this is the longest-lived documented hoax on Wikipedia, topping the previous documented record of 14 years, 5 months, set by The Gates of Saturn, a fictitious television show, which was incidentally also discovered in August 2020. Based on the edit history of List of hoaxes on Wikipedia, new hoaxes are identified regularly at English Wikipedia. Dealing with this hoax and its fallout left me ruminating over some questions: How can we better identify hoaxes to keep them from reaching their tenth (or even fifteenth) birthdays? How can Wikipedia co-ordinate more readily across its different language versions once a hoax is discovered in one language? Does English Wikipedia harbor hoaxes that have been deleted elsewhere? Happy to hear your ideas.

References

  1. ^ Kumar, Srijan; West, Robert; Leskovec, Jure (April 2016). "Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes" (PDF). Proceedings of the 25th International World Wide Web Conference: 591–602. doi:10.1145/2872427.2883085.





2020-08-30

Heart, soul, umbrellas, and politics

By Igordebraga, Kingsif, Mcrsftdog, Rebestalic
This traffic report is adapted from the Top 25 Report, prepared with commentary by Igordebraga (July 26 to August 22), Kingsif (July 26 to August 16), Mcrsftdog (July 6 to August 1, August 9 to August 22) and Rebestalic (August 16 to 22)

Give me time and give me space. Give me real don't give me fake. Give me a cure for the COVID-19 pandemic that can't leave soon enough (to the point the view counts for that article are dropping...). And for those who prefer in those troubled quarantined times to move onto another "diseased" subject, tell me your own politik.

(data provided by the provisional Top 1000 report)

Give me heart and give me soul (July 26 to August 1)

Most Popular Wikipedia Articles of the Week (July 26 to August 1, 2020)
Rank Article Class Views Image Notes/about
1 John Lewis (civil rights leader) 1,507,358 The funerary services befitting such a figure as Congressman Lewis took place this week. After his funeral he lay in state at first in the Alabama State Capitol, and then the United States Capitol rotunda on Monday and Tuesday, the first African-American lawmaker to receive the honor. A second funeral ceremony was held in Atlanta on Thursday, where he was eulogized by former Presidents Clinton, W., and Obama, and he rests in Atlanta's South-View Cemetery. Lewis died on July 17, and now doubles the views his article had last week during a strangely slow period for Wikipedia, appearing on here for three consecutive weeks, unusual for a recent death: more unusual is only hitting #1 in the third week, which he does now thanks to many redirects for his common name.
2 Regis Philbin 1,505,819 American television has lost enough stars old and young this year to fill out several montages at the upcoming Emmys, but the most prominent is probably Regis, who died last week and now overtakes all the Sushant Singh Rajput-related entries. Whether it be every game show you can think of or the morning talk show named after him for over 20 years, just about every American (and a sizable number of people from around the world) has seen him host despite pulling back due to poor health in the 2010s. This poor health led to his fatal heart attack on July 24.
3 Olivia de Havilland 1,448,864 After Kirk Douglas in February, another centenarian from Hollywood's Golden Age leaves us with the passing of Dame Olivia Mary de Havilland, winner of an Academy Award for To Each His Own (only Luise Rainer, who almost got to her 105th birthday, lived longer among Oscar winners). De Havilland was also involved in classics such as The Adventures of Robin Hood and Gone with the Wind.
4 Herman Cain 1,331,901 Cain, a businessman who was once considered a front runner for the 2012 Republican nomination, died of COVID-19 complications on Thursday. He tested positive on June 29, only 9 days after attending a Trump rally maskless, and was hospitalized on July 1. Cain's death should be seen as a cautionary tale for the anti-mask movements. It won't, but it should.
5 Shakuntala Devi 1,097,470 The first Indian figure on the list this week is Devi, author of The World of Homosexuals which, fascinating as it sounds and groundbreaking as it was, is unrelated. Devi was best known as a human calculator (or the human calculator, such was her fame) and her amazing mind earned her an official Guinness World Record... in 1980. She died in 2013, and was only presented with the record this week, despite appearing in the GWR book. She's also the subject of a recent biopic, released Friday on Prime Video.
6 Rhea Chakraborty 1,095,924 Chakraborty was first reported as Sushant Singh Rajput's girlfriend after the latter committed suicide. On the 25th, the deceased's father filed a First Information Report, accusing her (and many others) of theft and abetting suicide for allegedly threatening Singh Rajput by saying he should be declared mentally unwell. She was arrested this past Tuesday.
7 Deaths in 2020 921,476 No I don't want to battle from beginning to end
I don't want a cycle of recycled revenge
I don't want to follow Death and All His Friends!
8 The Umbrella Academy (TV series) 686,289 Netflix released the much-anticipated second season adapting the comics written by musician Gerard Way and drawn by Gabriel Bá (pictured), where the remaining kids of a superpowered "family" time travel to prevent an apocalypse. "Family" being in inverted commas thanks to adoption that allowed for diverse casting: among its popular main cast are a British actor, an Irish actor, a Canadian, a teenager, and one of the original Broadway cast of #14's musical.
9 Dil Bechara 664,134 Director Mukesh Chhabra's (pictured) take on the teenage cancer of teenage cancer books, The Fault in Our Stars, was released for free streaming on Disney+ Hotstar on July 24, and was reportedly viewed 85 million times in its first 24 hours. It's either still getting hype or has been dragged into the new scandal (#6) about main actor Sushant Singh Rajput's suicide.
10 Jacob Elordi 632,000 This young Australian actor has seen a sudden rise to prominence thanks to his leading roles in two major franchises: TV's Euphoria and the Netflix movies about a kissing booth co-starring Joey King that are getting a lot of coverage at the moment. The second of the films was released this week.

Wounds that heal and cracks that fix (August 2 to 8)

Most Popular Wikipedia Articles of the Week (August 2 to 8, 2020)
Rank Article Class Views Image Notes/about
1 Lebanon 1,588,673 A small country beset by war and tragedy this week saw its capital city (#6) destroyed (#3) in a big explosion caused by incompetence (#5). Though not nuclear, the size and appearance of the mushroom cloud that resulted in earthquakes in mainland Europe has been likened to some notable bombings.
2 The Umbrella Academy (TV series) 1,538,754 Season 2 of the mystery superhero drama arrived on Netflix. Ellen Page (pictured) stars in it as Vanya, who is doing a hell of a lot better than in season 1. Page is also from Canada, where the show is filmed, and according to co-star Emmy Raver-Lampman she would take other castmembers out to local places while filming.
3 2020 Beirut explosions 1,207,762 In the port of Beirut (#6), capital of Lebanon (#1), there was a warehouse that since 2014 housed dangerous chemicals (#5) taken from an abandoned ship. On August 4, a fire broke out in said warehouse, leading to a blast that wrecked buildings in a 10-kilometer (6-mile) radius.
4 Shakuntala Devi 1,178,421 The subject of a new film from Amazon Prime, where she's portrayed by Vidya Balan (pictured). While Netflix is going action, Amazon has decided to go... math.
5 Ammonium nitrate 1,089,158 Ammonium nitrate is a highly unstable substance that has caused some big explosions, like #14 and #3, the latter of which turned Beirut, capital of #1, into rubble this week.
6 Beirut 961,178
7 Deaths in 2020 858,347 Will you defeat them
Your demons and all the non-believers?
The plans that they have made?
Because one day, I'll leave you
A phantom to lead you in the summer
To join
The Black Parade
8 Rhea Chakraborty 691,270 How's this for Bollywood drama: Chakraborty, the girlfriend of the late Sushant Singh Rajput, was originally arrested last week for something related to his suicide, but is now being investigated for money laundering. In a shocking turn of events in this whole suicide scandal, Singh Rajput's best friend and fellow Bollywood star, Sharma, killed himself this week.
9 Samir Sharma 665,074
10 Wilford Brimley 618,624 A moderately famous actor and sometime singer, Brimley is also the person who caused half of North America to pronounce diabetes as "diabeetus" – he was diagnosed with the condition in the 1970s and became a prominent campaigner, but one with a mountain accent. He died on August 1 from what appears to be a diabetes-related kidney problem.

Tell me all your politik (August 9 to 15)

Most Popular Wikipedia Articles of the Week (August 9 to 15, 2020)
Rank Article Class Views Image Notes/about
1 Kamala Harris 11,843,595 California lawyer and senator who was announced this week as the Democratic VP pick with running-mate #8. She was a popular choice, had a brief presidential campaign last year, and brings the rest of her family to the list. In the days after her selection, birtherism was reborn: though she was definitely born in California, with an American father, she is not white, which is enough to send certain people into discredit mode.
2 Shyamala Gopalan 1,851,954
As a result of #1 being chosen as a VP candidate, attention was brought in for the whole family – in order, her mother, her sister (above), her father, and her husband (below).
3 Maya Harris 1,644,390
4 Donald J. Harris 1,640,562
5 Douglas Emhoff 1,427,685
6 QAnon 1,370,205 Marjorie Taylor Greene, a vocal supporter of Q, won a primary to a safe seat in the United States House of Representatives on Tuesday. Trump twote in support the next morning, leading to a question in a briefing. Trump sidestepped it, without mentioning Q.
7 The Umbrella Academy (TV series) 986,180 Netflix released the second season of this a little while ago, setting the apocalypse in Dallas. The moral of the story seems to be that even when you try really hard, you can still get everything wrong? That, or join a cult.
8 Joe Biden 836,439 While current president Trump has spent a lot of time on this list, the Democrats are presently occupying a lot of the top 10. Biden is Trump's competition as the countdown to November's election continues. He picked a running mate, #1, this week.
9 Gunjan Saxena 814,894 A movie about the life of this Indian female air force pilot (where she's played by actress Janhvi Kapoor, pictured) was released August 12 on Netflix.
10 Deaths in 2020 813,025 They call me The Seeker
I've been searching low and high
I won't get to get what I'm after
Till the day I die

And open up your eyes (August 16 to 22)

Most Popular Wikipedia Articles of the Week (August 16 to 22, 2020)
Rank Article Class Views Image Notes/about
1 Kamala Harris 2,523,180 The 2020 Democratic National Convention was a four day television event taking place from Monday to Thursday, with an average audience of 21.6 million viewers. While the real stars of the show were Biden and Harris, viewers got to see appearances from all of their favorite characters from the Democratic primaries, and even a few teasers for the 2024 arc.
2 Joe Biden 1,852,528
3 QAnon 1,379,518 QAnon stands alone as the only major conspiracy theory that's supportive of the government. Imagine if David Icke thought there were lizards controlling everything and he openly campaigned to become one of them. Imagine someone thinking that the CIA killed Kennedy but also thanking them for it. Bizarre.
4 Jill Biden 1,252,629 #2's wife (and potential First Lady) appeared in a pre-taped video at the DNC on Tuesday night, talking about how capable a president her husband would be.
5 Elon Musk 980,785 In my skim of the news, Musk is doing something in Texas and has a new brain chip?
6 Donald Trump 847,079 Is seeking re-election.
7 Deaths in 2020 800,303 And when you're gone, who remembers your name?
Who keeps your flame?
Who tells your story?
8 Beau Biden 775,600 The last night of the DNC featured a tribute to the late son of #2 and Neilia Hunter, who died of brain cancer in 2015.
9 Ronald Koeman 767,192 FC Barcelona isn't what it used to be: when faced with Bayern München in the shortened/empty 2019–20 UEFA Champions League knockout phase, the usually victorious Spanish squad received an 8–2 thumping! Such a humiliation led to the dismissal of their coach, and in comes a Dutchman who was an old idol of the team, Ronald Koeman, most recently manager of his country's national team.
10 Betty Broderick 749,527 Netflix released season 2 of Dirty John, which tells some of Broderick's story – she, played there by Amanda Peet (pictured), killed her ex-husband and his new wife in 1989, and is still in jail for it.





2020-08-30

Fourteen things we’ve learned by moving Polish Wikimedia conference online

By Natalia Szafran-Kozakowska
Natalia Szafran-Kozakowska is the community support officer for Wikimedia Polska. She originally posted this essay on Diff (part 1, part 2), a new project hosted by the Wikimedia Foundation for the Wikipedia community. You can join Diff here.

Every year Polish Wikimedians convene to feel the human touch of the movement, and meet at conferences to learn, discuss and work together. This annual meeting, which gathers about 100 Wikimedians every year, is a great celebration of our community, movement and mission. When the COVID pandemic made it impossible for us to meet in person we decided that we would move the event online. And with that decision we started quite an adventure! Since online meetings are here to stay for a bit we would like to share some of the lessons we have learned.

Conference package – slippers, chocolates and a door hanger
  1. Do not replicate offline routines. You may be experienced in organizing live events, but the digital environment, the number of things that you can control, and the needs of your participants are different. Make a list of things that need to happen for your event to be successful and then ask yourself how you can make sure they do happen in the new environment. Think not only about big things (“people need to learn something useful”) but also about tiny ones (“people need to be in the right place at the right time”). Be creative! For example, we decided that the wellbeing of the attendees was a factor. This is why we had a lot of breaks so that everyone could step away from their devices, and a yoga session to bring some care to our tired spines.
  2. But in some aspects – do. Especially if you replace a regular event which had its place in people’s calendars with a digital one. Bring a bit of a feeling of an in-person conference to give a sense of continuity. We knew that our attendees were excited about the fact that the conference was supposed to take place in Cracow. This is why we organized a remote guided live city tour. We were able to enjoy the views and ask questions. We also had a group photo (instead of a typical group screen capture we went for a collection of selfies which made the photo more vibrant). As a replacement for coffee breaks, we sent chocolates to the participants. Also, in the registration process, attendees could choose whether they wanted a physical surprise package sent to their home or a digital one to download.
  3. Make it simple, and avoid adding confusion. Virtual events are still new for a lot of people. Participants need to know where and when to click, where to seek information and whom to ask for help. Keep as much information as you can on one page and, if possible, hold all (or most) of the sessions on just one or two links so that whenever the participants click, they will get to the conference room. Have a person and a separate communication channel (in our case, it was a Telegram group) assigned to give technical information and support.
  4. The time can get tricky. While facilitating a conference and making sure that everything is on time is a challenge, it is much more difficult at a digital event. The speakers can go over their assigned time and can easily miss (or even ignore on purpose) cues from the moderator. Muting a person while they speak is neither elegant nor kind. So instead, plan breaks a bit (5 minutes) longer than you actually want them to be – it will give you a time buffer and will let participants have time to re-energize even if the session gets a bit too long. You will also have flexibility to allow an interesting conversation to continue. Keep the buffer secret from the panelists or speakers, though, so that they won’t treat it as an actual session time.
  5. Think of all the things in which online conferences are better than live ones. And then make the most of it! Are there any people whom you’ve always wanted to invite but never could because of geographical distance or language barriers? Now it is possible! We took advantage and invited guest speakers from across the ocean and broadened our pool of participants by offering simultaneous translation. This way we could have attendees from all over the globe: from Russia to Sweden and from Ukraine to the U.S.! Online events give you a unique chance to broaden your audience and invite people outside of the Wikimedia Movement. We promoted our speakers using social media to boost interest from non-Wikimedians and invite them to our event.
  6. Why so serious? To the participants, we sent conference packages including a pair of comfortable home slippers and a door hanger saying “Do not disturb, I’m attending a conference” so that we could add some humour to the fact that the conference has unexpectedly moved to participants’ homes.
  7. Conference platforms – remember your priorities. Choosing a platform is not easy. Make a list of functionalities you need and put them in hierarchical order so that you will know: how important is it to you that the tool is open source? What feature is only nice to have? For example, Wikimedians use a very diverse set of browsers, so for us, having a tool that works on many different ones was a criterion.
  8. Test your conference platform, learn its constraints, and let the speakers test it again. Test it in different groups and in different technical conditions (browsers, devices, and so forth). Shortly before the event we decided to shift the conference to a different platform because the one we had planned had shortcomings that were a no-go for us. You may schedule a get-together for the speakers the day before – it will help everyone get acquainted with the tool before the serious work begins.
  9. From the attendees’ perspective, remote participation is less of a logistical effort. This extends to the period way before the event. In our case, participants (speakers, too) were often way less strict in honoring their commitments than they are at live events. They kept us waiting longer for their decisions about participating. They submitted the details of their talks later than they usually do.
  10. Plan a lot and prepare your speakers. If you have a scenario for a live session panel, discuss “theme entries” (and the number of them) with your guests earlier. It keeps you within the schedule, and makes everything less stressful!
  11. People need to move. And to take breaks. Sitting in front of the computer is much more tiring than being in a conference room. Which means: less session time, more breaks. We went for a 1-hour session/30-minute break schedule with one long (2 hours) lunch break and it was a perfect amount of time to keep everyone focused and well.
  12. Diversify your program. Don’t make it a series of webinars. Shift between discussions and lectures, workshops and panel discussions. Changing format will help your attendees keep their focus. We made the mistake of scheduling social activities in the late afternoon when people were tired. In retrospect it would have been better to plan them during the day.
  13. Be flexible. Not all our ideas went as planned. And it was OK. Rather than pushing them we followed our participants’ needs. We wanted to provide a place for conversations so we opened a participants’ Telegram group (a solution which worked perfectly during our live events) but people preferred to use the Zoom chat and Telegram became more of a place for announcements. We planned a Wikipedia scavenger hunt for the evening but people preferred to socialize by chatting. If your goals are met in a different way than the one you have planned, who cares! As long as they are met, right?
  14. Embrace the fact that things will go wrong. Because some will. The internet can go down, cats may jump on keyboards, the mics and the cameras may not cooperate, the speaker’s neighbours can decide to drill in their walls. There is a lot that can go wrong and not a lot of things you can control. Accept that the event doesn’t need to be perfect to be awesome. It’s not about perfection, it’s about connecting with each other. If obstacles come up, communicate them clearly to your participants and stay kind to yourself even if things go wrong. As long as you have that last one going – everything will be fine. Because kindness is the most important force in the Wikiverse!

And because of that I would like to thank my teammates Wojciech, Klara and Szymon for helping me with their insight in bringing all those lessons together!





2020-08-30

Detecting spam, and pages to protect; non-anonymous editors signal their intelligence with high-quality articles

By Matthew Sumpter and Tilman Bayer


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


"Protecting the Web from Misinformation" by detecting Wikipedia spammers and identifying pages to protect

Reviewed by Matthew Sumpter

This book chapter [1] discusses general trends in misinformation on the web. Misinformation can take many forms including vandalism, spam, rumors, hoaxes, counterfeit websites, fake product reviews, clickbait, and fake news. The chapter briefly describes each subtopic and presents examples of them in practice. The following section details a comprehensive set of NLP and network analysis studies that have been conducted both to gain further insight into each subtopic and to combat them.

The chapter concludes with a case study based on the authors' research to protect Wikipedia content quality. The open editing mechanism of Wikipedia is ripe for exploitation by bad actors. This occurs mainly by vandalism, but also through page spamming and the dissemination of false information. To combat vandalism, the authors developed the "DePP" system, which is a tool for detecting which Wikipedia article pages to protect. DePP achieves 92.1% accuracy across multiple languages in this task. This system is based on the following base features: 1) Total average time between revisions, 2) Total number of users making five or more revisions, 3) Total average number of revisions per user, 4) Total number of revisions by non-registered users, 5) Total number of revisions made from mobile devices, and 6) Total average size of revisions. Through careful statistical analysis to determine the standard behavior of these metrics, malicious revisions can be identified by a deviation from these standards.
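
To make the six base features concrete, here is a minimal Python sketch of how they could be computed from one page's revision history. It is not the authors' implementation: the record layout (the keys "time", "user", "registered", "mobile" and "size"), the function name `base_features` and the reading of "size" as bytes changed per revision are all assumptions made for illustration, and the classification step that DePP builds on top of such features is not shown.

```python
from collections import Counter
from datetime import datetime


def base_features(revisions):
    """Compute six page-level features from a time-sorted revision list.

    Each revision is a dict with the hypothetical keys "time" (datetime),
    "user" (str), "registered" (bool), "mobile" (bool) and "size" (int).
    """
    times = [r["time"] for r in revisions]
    # Seconds elapsed between each pair of consecutive revisions.
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    per_user = Counter(r["user"] for r in revisions)
    count = len(revisions)
    return {
        "avg_seconds_between_revisions": sum(gaps) / len(gaps) if gaps else 0.0,
        "users_with_5_or_more_revisions": sum(1 for n in per_user.values() if n >= 5),
        "avg_revisions_per_user": count / len(per_user) if per_user else 0.0,
        "revisions_by_unregistered_users": sum(1 for r in revisions if not r["registered"]),
        "revisions_from_mobile": sum(1 for r in revisions if r["mobile"]),
        "avg_revision_size": sum(r["size"] for r in revisions) / count if count else 0.0,
    }


# Invented sample history: five hourly edits by one registered user,
# followed by one larger mobile edit from an unregistered address.
history = [
    {"time": datetime(2020, 8, 1, hour), "user": "A", "registered": True,
     "mobile": False, "size": 100}
    for hour in range(5)
]
history.append({"time": datetime(2020, 8, 1, 5), "user": "192.0.2.1",
                "registered": False, "mobile": True, "size": 400})

features = base_features(history)
```

A page whose values drift far from what is typical for comparable pages would, in the spirit of the approach described above, be a candidate for protection.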

To combat spam, the authors developed the "Wikipedia Spammer Detector" (WiSDe). WiSDe uses a framework built upon features that research has revealed to be typical of spammers. These features most notably include the size of the edits, the time required to make edits, and the ratio of links to text within the edits. WiSDe achieved an 80.8% accuracy on a dataset of 4.2K users and 75.6K edits - an improvement of 11.1% over ORES. The case study concludes by providing some findings regarding the retention of new contributors to Wikipedia. They proposed a predictive model that achieved a high precision (0.99) in predicting users that would become inactive. This model relies on the observation that active users are more involved in edit wars, edit a wider variety of categories, and positively accept critiques.

See also our earlier coverage of related papers involving the first author: "Detecting Pages to Protect", "Spam Users Identification in Wikipedia Via Editing Behavior"


Editors successfully signal their intelligence by writing high-quality articles - but only when contributing non-anonymously

Reviewed by Tilman Bayer
Peacocks are a well-known example of signalling

An article[2] in the psychology journal Personality and Individual Differences reports on an experiment in a Wikipedia-like wiki, where editors with higher general intelligence scores write higher quality articles (as rated by readers) - but only when contributing non-anonymously. This is interpreted as evidence that contributors successfully "signal" their intelligence to readers (in the sense of signalling theory, which seeks to explain various behaviours in humans and animals that appear to have no direct benefit to the actor by positing that they serve to communicate certain traits or states to observers in an "honest", i.e. difficult to fake fashion).

The authors start out by wondering (like many have before) why "some people share knowledge online, often without tangible compensation", on sites such as Wikipedia, Reddit or YouTube. "Many contributions appear to be unconditionally altruistic and the system vulnerable to free riding. If the selfish gene hypothesis is correct, however, altruism must be apparent and compensated with fitness benefits. As such, our findings add to previous work that tests the costly signaling theory explanations for altruism." (Notably, not all researchers share this assumption about altruistic motivations, see e.g. the preprint by Pinto et al. listed below.)

An IQ test item in the style of a Raven's Progressive Matrices test. Given eight patterns, the subject must identify the missing ninth pattern

For the experiment, 98 undergraduate students, who had previously completed the Raven's Advanced Progressive Matrices (RPM) intelligence test, were asked to spend 30 minutes "to contribute to an ostensibly real wiki-style encyclopedia being created by the Department of Communication. Participants were told that the wiki would serve as a repository of information for incoming first-year students and that it would contain entries related to campus life, culture, and academics [...] The wiki resembled Wikipedia and contained a collection of preliminary articles." 38 of the participants were told their contributions would remain anonymous, whereas another 40 "were photographed and told that their photo would be placed next to their contribution", and their names were included with their contribution. (Curiously, the paper doesn't specify the treatment of the remaining 20 participants.) "The quality of all participants' contributions was rated by four undergraduate research assistants who were blind to hypotheses and experimental conditions. [...] The research assistants also judged the contributors' intelligence relative to other participants using a 7-point Likert-type scale (1 Much dumber than average, 7 Much smarter than average)".

The researchers "found that as individuals' scores on Ravens Progressive Matrices (RPM) increased, participants were judged to have written better quality articles, but only when identifiable and not when anonymous. Further, the effect of RPM scores on inferred intelligence was mediated by article quality, but only when signalers were identifiable." They note that their results leave several "important questions" still open, e.g. that "it remains unclear what benefits are gained by signalers who contribute to information pools." Citing previous research, they "doubt a direct relationship to reproductive success for altruism in signaling g in information pools. Technical abilities are not particularly sexually attractive (Kaufman et al., 2014), so it is likely that g mediates indirect fitness benefits in such contexts." It might be worth noting that the study's convenience sample likely differs in its demographics from those of Wikipedia editors, e.g. only 28 of the 98 participating students were male, whereas males are well known to form the vast majority of Wikipedia contributors.

The article is an important contribution to the existing body of literature on Wikipedia editors' motivations to contribute, even if it appears to be curiously unaware of it (none of the cited references contain "Wikipedia" or "wiki" in their title).


Briefly


Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer

6.7% of Wikipedia articles cite at least one academic journal article with DOI

From the abstract:[3]

"we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as being to books, journal articles or Web contents. We were thus able to extract 4.0M citations to scholarly publications with known identifiers -- including DOI, PMC, PMID, and ISBN -- and further labeled an extra 261K citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI. Scientific articles cited from Wikipedia correspond to 3.5% of all articles with a DOI currently indexed in the Web of Science."

"Science through Wikipedia: A novel representation of open knowledge through co-citation networks"

From the abstract:[4]

"... the sample was reduced to 847 512 references made by 193 802 Wikipedia articles to 598 746 scientific articles belonging to 14 149 journals indexed in Scopus. As highlighted results we found a significative presence of 'Medicine' and 'Biochemistry, Genetics and Molecular Biology' papers and that the most important journals are multidisciplinary in nature, suggesting also that high-impact factor journals were more likely to be cited. Furthermore, only 13.44% of Wikipedia citations are to Open Access journals."

See also earlier work by some of the same authors: "Mapping the backbone of the Humanities through the eyes of Wikipedia"


"Quantifying Engagement with Citations on Wikipedia"

From the abstract:[5]

"... we built client-side instrumentation for logging all interactions with links leading from English Wikipedia articles to cited references during one month, and conducted the first analysis of readers’ interactions with citations. We find that overall engagement with citations is low: about one in 300 page views results in a reference click (0.29% overall; 0.56% on desktop; 0.13% on mobile). [...] clicks occur more frequently on shorter pages and on pages of lower quality, suggesting that references are consulted more commonly when Wikipedia itself does not contain the information sought by the user. Moreover, we observe that recent content, open access sources, and references about life events (births, deaths, marriages, etc.) are particularly popular."

See also the research project page on Meta-wiki, and a video recording and slides of a presentation in the June 2020 Wikimedia Research Showcase

Presentation slide illustrating the instrumentation of reader interactions with citations


"Individual Factors that Influence Effort and Contributions on Wikipedia"

From the abstract and paper:[6]

"... [We] surveyed [Portuguese Wikipedia] community members and collected secondary data. After excluding outliers, we obtained a final sample with 212 participants. We applied exploratory factor analysis and structural equation modeling, which resulted in a model with satisfactory fit indices. The results indicate that effort influences active contributions, and attitude, altruism by reputation, and altruism by identification influence effort. None of the proposed factors are directly related to active contributions. Experience directly influences self-efficacy while it positively moderates the relation between effort and active contributions. [...] To reach [editors registered on Portuguese Wikipedia], we sent questionnaires to Wikimedia Brasil’s e-mail lists, made an announcement in Wikipedia’s notice section, and sent private messages to members through the platform itself."


"Approaches to Understanding Indigenous Content Production on Wikipedia"

From the abstract:[7]

"We examine pages with geotagged content in English Wikipedia in four categories, places with Indigenous majorities (of any size), Rural places, Urban Clusters, and Urban areas. We find significant differences in quality and editor attention for articles about places with Native American majorities, as compared to other places."


"Tabouid: a Wikipedia-based word guessing game"

This article describes the automatic generation of a Taboo-like game (where players have to describe a word while avoiding a given set of other words), also released as a free mobile app for Android and iOS. From the abstract:[8]

"We present Tabouid, a word-guessing game automatically generated from Wikipedia. Tabouid contains 10,000 (virtual) cards in English, and as many in French, covering not only words and linguistic expressions but also a variety of topics including artists, historical events or scientific concepts. Each card corresponds to a Wikipedia article, and conversely, any article could be turned into a card. A range of relatively simple NLP and machine-learning techniques are effectively integrated into a two-stage process."


"Vandalism Detection in Crowdsourced Knowledge Bases"

From the abstract:[9]

"In this thesis, we [...] develop novel machine learning-based vandalism detectors to reduce the manual reviewing effort [on Wikidata]. To this end, we carefully develop large-scale vandalism corpora, vandalism detectors with high predictive performance, and vandalism detectors with low bias against certain groups of editors. We extensively evaluate our vandalism detectors in a number of settings, and we compare them to the state of the art represented by the Wikidata Abuse Filter and the Objective Revision Evaluation Service by the Wikimedia Foundation. Our best vandalism detector achieves an area under the curve of the receiver operating characteristics of 0.991, significantly outperforming the state of the art; our fairest vandalism detector achieves a bias ratio of only 5.6 compared to values of up to 310.7 of previous vandalism detectors. Overall, our vandalism detectors enable a conscious trade-off between predictive performance and bias and they might play an important role towards a more accurate and welcoming web in times of fake news and biased AI systems."


"SchemaTree: Maximum-Likelihood Property Recommendation for Wikidata"

From the abstract:[10]

"We introduce a trie-based method that can efficiently learn and represent property set probabilities in RDF graphs. [...] We investigate how the captured structure can be employed for property recommendation, analogously to the Wikidata PropertySuggester. We evaluate our approach on the full Wikidata dataset and compare its performance to the state-of-the-art Wikidata PropertySuggester, outperforming it in all evaluated metrics. Notably we could reduce the average rank of the first relevant recommendation by 71%."


NPOV prevails in Hindi, Urdu, and English Wikipedia articles about the Jammu and Kashmir conflict

From the abstract:[11]

"This article asks to what degree Wikipedia articles in three languages --- Hindi, Urdu, and English --- achieve Wikipedia's mission of making neutrally-presented, reliable information on a polarizing, controversial topic available to people around the globe. We chose the topic of the recent revocation of Article 370 of the Constitution of India, which, along with other recent events in and concerning the region of Jammu and Kashmir, has drawn attention to related articles on Wikipedia. This work focuses on the English Wikipedia, being the preeminent language edition of the project, as well as the Hindi and Urdu editions. [...] We analyzed page view and revision data for three Wikipedia articles [on the English Wikipedia, these were Kashmir conflict, Article 370 of the Constitution of India, and Insurgency in Jammu and Kashmir]. Additionally, we interviewed editors from all three Wikipedias to learn differences in editing processes and motivations. [...] In Hindi and Urdu, as well as English, editors predominantly adhere to the principle of neutral point of view (NPOV), and these editors quash attempts by other editors to push political agendas."

See also the authors' conference poster


References

  1. ^ Spezzano, Francesca; Gurunathan, Indhumathi (2020). "Protecting the Web from Misinformation". In Mohammad A. Tayebi; Uwe Glässer; David B. Skillicorn (eds.). Open Source Intelligence and Cyber Crime: Social Media Analytics. Lecture Notes in Social Networks. Cham: Springer International Publishing. pp. 1–27. ISBN 9783030412517. (closed access)
  2. ^ Yoder, Christian N.; Reid, Scott A. (2019-10-01). "The quality of online knowledge sharing signals general intelligence". Personality and Individual Differences. 148: 90–94. doi:10.1016/j.paid.2019.05.013. ISSN 0191-8869. (closed access)
  3. ^ Singh, Harshdeep; West, Robert; Colavizza, Giovanni (2020-07-14). "Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia". arXiv:2007.07022 [cs]. Dataset
  4. ^ Arroyo-Machado, Wenceslao; Torres-Salinas, Daniel; Herrera-Viedma, Enrique; Romero-Frías, Esteban (2020-02-10). "Science through Wikipedia: A novel representation of open knowledge through co-citation networks". PLOS ONE. 15 (2): e0228713. doi:10.1371/journal.pone.0228713. ISSN 1932-6203.
  5. ^ Piccardi, Tiziano; Redi, Miriam; Colavizza, Giovanni; West, Robert (2020-04-20). "Quantifying Engagement with Citations on Wikipedia". Proceedings of The Web Conference 2020. WWW '20. New York, NY, USA: Association for Computing Machinery. pp. 2365–2376. doi:10.1145/3366423.3380300. ISBN 9781450370233. Closed access icon Author's copy
  6. ^ Pinto, Luiz F.; Santos, Carlos Denner dos; Onoyama, Silvia (2020-07-14). "Individual Factors that Influence Effort and Contributions on Wikipedia". arXiv:2007.07333 [cs].
  7. ^ Sethuraman, Manasvini; Grinter, Rebecca E.; Zegura, Ellen (2020-06-15). "Approaches to Understanding Indigenous Content Production on Wikipedia". Proceedings of the 3rd ACM SIGCAS Conference on Computing and Sustainable Societies. COMPASS '20. Ecuador: Association for Computing Machinery. pp. 327–328. doi:10.1145/3378393.3402249. ISBN 9781450371292. Closed access icon
  8. ^ Bernard, Timothée (July 2020). "Tabouid: a Wikipedia-based word guessing game". Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Online: Association for Computational Linguistics. pp. 24–29. doi:10.18653/v1/2020.acl-demos.4.
  9. ^ Heindorf, Stefan (2019). Vandalism Detection in Crowdsourced Knowledge Bases (Thesis). Paderborn, Germany: Paderborn University. S2CID 209517598. (dissertation)
  10. ^ Gleim, Lars C.; Schimassek, Rafael; Hüser, Dominik; Peters, Maximilian; Krämer, Christoph; Cochez, Michael; Decker, Stefan (2020). "SchemaTree: Maximum-Likelihood Property Recommendation for Wikidata". In Andreas Harth; Sabrina Kirrane; Axel-Cyrille Ngonga Ngomo; Heiko Paulheim; Anisa Rula; Anna Lisa Gentile; Peter Haase; Michael Cochez (eds.). The Semantic Web. Lecture Notes in Computer Science. Cham: Springer International Publishing. pp. 179–195. doi:10.1007/978-3-030-49461-2_11. ISBN 9783030494612.
  11. ^ Hickman, Molly G.; Pasad, Viral; Sanghavi, Harsh; Thebault-Spieker, Jacob; Lee, Sang Won (2020-06-17). "Wiki HUEs: Understanding Wikipedia practices through Hindi, Urdu, and English takes on evolving regional conflict". Proceedings of the 2020 International Conference on Information and Communication Technologies and Development. ICTD2020. Guayaquil, Ecuador: Association for Computing Machinery. pp. 1–5. doi:10.1145/3392561.3397586. ISBN 9781450387620. Closed access icon





2020-08-30

A slow couple of months

By Bri

Arbitration requests

Amendment requests

Amendment requests adjusting one editor's editing restrictions are not discussed here.

ArbCom member DGG included this statement in his decision: "[This case has] helped me settle my position on the more general question of DS (discretionary sanctions): I would abolish them, and then there would be no more such questions. among other merits of terminating the procedure, is that it leads to inappropriate requests for us to involve ourself in deciding content. What is within the scope of arb com is to end the concept of DS, and the only reason I do not now propose it by motion is that I do not think it would have a majority yet."
2) Editors are prohibited from making more than one revert per page per day on any page relating to genetically modified organisms, agricultural biotechnology, and agricultural chemicals, commercially produced agricultural chemicals and the companies that produce them, broadly construed and subject to the usual exemptions.

Declined/withdrawn

Unban

As part of Wikipedia:Arbitration/Requests/Case/Lightbreather, Lightbreather (talk · contribs) was site banned and subject to several restrictions. Following an appeal to ArbCom by email, a motion to unblock Lightbreather and lift the restrictions was posted for discussion on-wiki. The request was closed on 18 July after ArbCom decided not to reverse the ban.

Other matters





2020-08-30

Wikipedia for promotional purposes?

By Ral315
This article was first published 15 years ago on August 22, 2005, eight months after The Signpost was founded. It may be the first Signpost article about paid editing, but certainly hasn't been the last. An earlier article, Outside groups targeting Wikipedia spur fears about bias, published February 7, 2005, a month after The Signpost's first issue, raised similar questions about conflict-of-interest editing and canvassing.

Twice recently, television organizations have been accused of attempting to use Wikipedia for promotional purposes. The BBC recently added articles on Jamie Kane and Boy*d Upp, a fictional character and band existing in a BBC alternate-reality game. In another incident, G4's Attack of the Show program, to commemorate an appearance by Jimbo Wales, created User:Attackoftheshow, a user page which was used primarily as a sandbox for interested viewers to edit, raising questions over whether the usage was permissible or not.

Jamie Kane

On August 12, a new user created an article about Jamie Kane, asserting that the fictional star of a boy band was real. The article was quickly tagged for speedy deletion, then taken to VfD. Uncle G and other editors changed the article, expanding it and making note that the band was fictional. The VfD subsequently failed, though a series of unsigned and unregistered users attempted to vote.

Later, an article on the fictional band, Boy*d Upp, was created by an IP address inside the BBC, assumed to be a BBC employee. This article was also tagged for VfD, and was deleted, then redirected to Jamie Kane. BBC confirmed that an employee had written the article, but denied that it was meant to promote the game:

"The first posting was simply a case of a fan of the game getting into the spirit of alternative reality a little too much. The follow up posting was made by a fan of the game who happens to work in the BBC (where we've been beta-testing for the last month). This was unauthorized and made without the knowledge of anyone in the Jamie Kane Team or BBC Marketing. To confirm: the BBC would never use Wikipedia as a marketing tool."

Attack of the Show

On August 16, G4 aired an interview with Wikipedia founder Jimbo Wales. They created a user page for the show, where viewers could edit as they pleased. Vandalism ensued, and just a day after the episode aired, and over 1200 edits after the page was created, the page was protected. As of press time, the page is still protected to deal with vandalism.

Tony Sidaway protected the page immediately after it was created, but Jimbo unprotected it and instructed administrators to leave it open, because he had already talked with G4, and authorized the move.

Issues with using Wikipedia for marketing

From Wikipedia's point of view:

From the marketers' point of view, Wikipedia is a difficult choice:

Possibility of marketing spam in the future?

This raises the legitimate question of whether marketing spam may be a problem in the future. While this is a common occurrence on Special:Newpages patrol, a more confusing type of spamming such as the Jamie Kane articles may occur, where many users may be confused over whether the article's content is real, fake, or even vanity. Perhaps what is most reassuring is that all three pages were quickly found and taken care of. Nevertheless, this is a problem that may occur again in the near future.





2020-08-30

Marcus Sherman, Jerome West, and Pauline van Till

By Wikipedia editors

Marcus Sherman (Marcus334)

Marcus Sherman

Marcus Sherman (August 5, 1947 – April 25, 2020) from Cape Cod joined Wikipedia on 14 January 2007 and was keenly interested in improving content related to the protected areas in southern India on the English Wikipedia.[1][2][3]

Jerome West (Jcw69)

Jerome died on 19 July 2020. He was a South African contributor and administrator on the English Wikipedia. He made 9,265 edits. Jerome's death, from consequences of COVID-19, was announced by his widow on Facebook.

Pauline van Till (Pvt pauline)

Pauline Louise van Till
(The Hague, 25 July 1944 –
The Hague, 27 July 2020)

Last month, the Dutch Wikipedia lost a long-time member of the editing community, Pauline van Till. She volunteered at the Museum Sophiahof, which used her title "Barones" in its obituary.[1] The Dutch wiki Wikisage reports that she was the first female caddy on the PGA European Tour, where she got the nickname "Dutchess".

Van Till wrote articles in the areas of golf and the international world of golfers in the Dutch and English Wikipedias, and also contributed images from all over the world to Commons. One of her best-known pictures, widely used throughout the projects, is of Johan Cruijff as a golfer in 2009: File:Johan Cruijff golfer cropped.jpg. She was also known under her other accounts, Pvt pauline~commonswiki and Pvt pauline~enwiki.

References

  1. ^ "In memoriam Pauline van Till". museumsophiahof.nl. Retrieved 30 August 2020.








       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0