The Signpost

Recent research

Automatic detection of covert paid editing; Wiki Workshop 2020

By Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Automatic detection of undisclosed paid editing

Figure from the paper: "Article network: two articles are connected by an edge if they have been edited by a common user. Colors indicate articles create by the same sockpuppet group of undisclosed paid editors (UPEs). Negative articles (in gray) are articles never edited by an UPE."

In a paper[1] published in the proceedings of last month's (virtual) The Web Conference, four researchers from Boise State University (collaborating with an English Wikipedia administrator) present a machine learning framework for "automatically detecting Wikipedia undisclosed paid contributions, so that they can be quickly identified and flagged for removal."

Their approach is based on constructing two datasets, of articles and editors, each consisting of known cases of undisclosed paid editing (UPE; as previously confirmed by Wikipedia administrators) and a control group of articles/users assumed to be "benign" (i.e., not the result of, or engaged in, UPE). In more detail, the authors started from a previously published dataset that had collected the results of 23 past sockpuppet investigations,[2] yielding 1,006 known UPE accounts, and added 98 manually determined UPE accounts. A sample of articles newly created in March 2019 (limited to those created by users with fewer than 200 edits who were manually assessed as not being engaged in paid editing) was used to construct the benign parts of the two datasets.

For both articles and editors, the authors tested three different classification algorithms (logistic regression, support vector machine, and random forest) on a relatively simple set of features (e.g., for articles, the number of categories, or for editors, the average time between two consecutive edits made by the user). Still, the resulting method appears quite effective for detecting undisclosed paid articles:

"when we combine both article and user-based features, we improve our classification results upon each group of features individually: AUROC of 0.983 and average precision of 0.913. This means that both article content and information about the account that created the article are important for detecting undisclosed paid articles."
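For readers less familiar with the two metrics quoted above, here is a minimal sketch of how AUROC and average precision can be computed from classifier scores. The labels and scores are invented toy values, not data from the paper, and the function names are our own; libraries such as scikit-learn provide equivalent, more carefully tuned implementations.

```python
def auroc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k, taken at the rank k of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


# Toy example (hypothetical): 1 = UPE article, 0 = benign article.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

toy_auroc = auroc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
print(round(toy_auroc, 3))  # prints 0.889
print(round(toy_ap, 3))     # prints 0.917
```

AUROC summarizes how well the scores rank positives above negatives overall, while average precision gives more weight to the top of the ranking, which is the part a human reviewer would look at first.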

Among the most effective features was "the percentage of edits made by a user that are less than 10 bytes. Undisclosed paid editors try to become autoconfirmed users; thus they typically make around 10 minor edits before creating a promotional article."
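The two user-based features mentioned above are simple to compute from an account's edit history. The following is a sketch under stated assumptions: the input format (a list of timestamp and byte-change pairs) and the helper names are hypothetical, and we assume that "less than 10 bytes" refers to the absolute size change of an edit, which the quoted passage does not spell out.

```python
from datetime import datetime


def small_edit_share(size_deltas, threshold=10):
    """Fraction of edits whose absolute byte change is below `threshold`."""
    if not size_deltas:
        return 0.0
    small = sum(1 for delta in size_deltas if abs(delta) < threshold)
    return small / len(size_deltas)


def mean_gap_seconds(timestamps):
    """Average time in seconds between consecutive edits, or None if fewer than two."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical edit history: (timestamp, change in page size in bytes).
history = [
    (datetime(2019, 3, 1, 12, 0, 0), 4),
    (datetime(2019, 3, 1, 12, 1, 0), -2),
    (datetime(2019, 3, 1, 12, 3, 0), 7),
    (datetime(2019, 3, 1, 12, 6, 0), 3500),  # the new article itself
]

share = small_edit_share([delta for _, delta in history])
gap = mean_gap_seconds([ts for ts, _ in history])
```

In this toy history, three of the four edits are tiny and they follow each other within minutes, the kind of pattern the quoted feature is meant to pick up.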

Overall, the results appear to hold high promise for a practical application that could be of significant assistance to the editing community in combating the abuse of Wikipedia for promotional purposes, which is an ongoing and pervasive problem (compare e.g. this month's Signpost coverage of a recent investigation on the French Wikipedia). Obviously, any output of such an algorithm would need to be vetted manually, considering the relatively small but (in absolute terms) still considerable number of false positives. The paper contains little discussion of possible limitations of the sockpuppet investigations dataset used (e.g., how representative it might be of UPE efforts overall, as opposed to focused on the activities of some specific PR agencies), leaving open the possibility of overfitting.
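To illustrate why a small false positive rate can still translate into many false alarms, consider the following back-of-the-envelope sketch. All numbers are hypothetical and chosen only for illustration; none of them are taken from the paper.

```python
def expected_flags(n_articles, prevalence, recall, false_positive_rate):
    """Expected true flags, false flags and precision for a given base rate."""
    positives = n_articles * prevalence
    negatives = n_articles - positives
    true_flags = positives * recall
    false_flags = negatives * false_positive_rate
    precision = true_flags / (true_flags + false_flags)
    return true_flags, false_flags, precision


# Hypothetical scenario: 10,000 new articles, 1% of them UPE,
# a classifier that catches 90% of UPE and wrongly flags 2% of benign articles.
true_flags, false_flags, precision = expected_flags(
    n_articles=10_000, prevalence=0.01, recall=0.9, false_positive_rate=0.02
)
```

Under these assumed numbers, roughly 90 correct flags would be accompanied by about 198 incorrect ones, so fewer than a third of flagged articles would actually be UPE. How a real deployment would fare depends on the true prevalence of UPE among new articles, which is not known here.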

The paper also includes an analysis of the network of the articles in the dataset, with two articles connected by an edge if the same user had edited both (see figure). But its results do not appear to have been used in the detection method. Among the findings: "there is less user collaboration among positive articles [as measured by local clustering coefficient and PageRank]. UPEs only work on a limited number of Wikipedia titles that they are interested in promoting, whereas genuine users edit more pages related to their field of expertise."
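The network construction described above (articles as nodes, an edge whenever two articles share an editor) and the local clustering coefficient can be sketched in a few lines. This is an illustrative toy version with invented user and article names, not the authors' code; the paper's PageRank analysis is not reproduced here.

```python
from itertools import combinations


def build_article_graph(edits):
    """Build an undirected graph from (user, article) pairs.

    Returns a dict mapping each article to the set of articles
    that share at least one editor with it.
    """
    articles_by_user = {}
    graph = {}
    for user, article in edits:
        articles_by_user.setdefault(user, set()).add(article)
        graph.setdefault(article, set())
    for articles in articles_by_user.values():
        for a, b in combinations(sorted(articles), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def local_clustering(graph, node):
    """Share of a node's neighbour pairs that are themselves connected."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2 * links / (k * (k - 1))


# Hypothetical edit log: (user, article).
edit_log = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "B"), ("u2", "C"),
    ("u3", "A"), ("u3", "C"), ("u3", "D"),
    ("u4", "E"),
]

article_graph = build_article_graph(edit_log)
clustering_a = local_clustering(article_graph, "A")
```

An article that is edited only by a single-purpose account (like "E" above) ends up isolated, with a clustering coefficient of zero.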

The authors highlight the importance of sockpuppets, observing that "undisclosed paid editors typically act as a group of sockpuppet accounts" and basing most of their ground truth dataset on sockpuppet cases. A brief literature review covers previous research on the automatic detection of sockpuppets on Wikipedia, including a paper from the 2016 Web Conference[3] presenting a method able "to detect 99% of fake accounts," and an earlier stylometric method (cf. our 2013 coverage: "Sockpuppet evidence from automated writing style analysis" / "New sockpuppet corpus"). An ongoing research project by the Wikimedia Foundation (presented at last year's Wikimania) concerns the practical implementation of such a tool.


Wiki Workshop 2020

As part of The Web Conference, the annual Wiki Workshop "[brought] together researchers exploring all aspects of Wikipedia, Wikidata, and other Wikimedia projects", this year held as a one-day Zoom meeting with over 100 participants. Among the papers (see also proceedings):

"A Deeper Investigation of the Importance of Wikipedia Links to the Success of Search Engines"

From the abstract:[4]

"We find that Wikipedia links are extremely common in important search contexts, appearing in 67–84% of all SERPs [search engine results pages] for common and trending queries, but less often for medical queries. Furthermore, we observe that Wikipedia links often appear in 'Knowledge Panel' SERP elements and are in positions visible to users without scrolling, although Wikipedia appears less in prominent positions on mobile devices. Our findings reinforce the complementary notions that (1) Wikipedia content and research has major impact outside of the Wikipedia domain and (2) powerful technologies like search engines are highly reliant on free content created by volunteers."

See also slides

"Layered Graph Embedding for Entity Recommendation using Wikipedia in the Yahoo! Knowledge Graph"

From the abstract:[5]

"we describe an embedding-based entity recommendation framework for Wikipedia that organizes Wikipedia into a collection of graphs layered on top of each others, learns complementary entity representations from their topology and content, and combines them with a lightweight learning-to-rank approach to recommend related entities on Wikipedia. [...]. Balancing simplicity and quality, this framework provides default entity recommendations for English and other languages in the Yahoo! Knowledge Graph, which Wikipedia is a core subset of."

See also slides


"WikiHist.html: English Wikipedia's Full Revision History in HTML Format"

From the abstract:[6]

"researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is publicly available exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions. We solve these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and release WikiHist.html, English Wikipedia’s full revision history in HTML format. We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia’s hyperlinks, showing that over half of the wiki links present in HTML are missing from raw wikitext, and that the missing links are important for user navigation."

See also slides and the underlying 7 terabyte dataset with code


"Collaboration of Open Content News in Wikipedia: The Role and Impact of Gatekeepers"

From the abstract:[7]

"In the current proposed study, I aim to understand this new model of content generation process through the lens of gatekeepers in social media platforms such as Wikipedia. Specifically, I aim to discover ways to identify gatekeepers and assess their impact on information quality and content polarization."

See also slides


"Domain-Specific Automatic Scholar Profiling Based on Wikipedia"

From the abstract:[8]

"to extract some properties of a given scholar, structured data, like infobox in Wikipedia, are often used as training datasets. But it may lead to serious mis-labeling problems, such as institutions and alma maters, and a Fine-Grained Entity Typing method is expected. Thus, a novel Relation Embedding method based on local context is proposed to enhance the typing performance. Also, to highlight critical concepts in selective bibliographies of scholars, a novel Keyword Extraction method based on Learning to Rank is proposed to bridge the gap that conventional supervised methods fail to provide junior scholars with relative importance of keywords."

See also slides


"Matching Ukrainian Wikipedia Red Links with English Wikipedia's Articles"

From the abstract:[9]

"we propose a way to match red links in one Wikipedia edition to existent pages in another edition. We define the task as a Named Entity Linking problem because red link titles are mostly named entities. We solve it in a context of Ukrainian red links and English existing pages. We created a dataset of 3171 most frequent Ukrainian red links and a dataset of almost 3 million pairs of red links and the most probable candidates for the correspondent pages in English Wikipedia."

See also slides


"Beyond Performing Arts: Network Composition and Collaboration Patterns"

From the abstract:[10]

"we propose the reconstruction and analysis of the collaboration networks of performing artists registered in Wikidata. Our results suggest that different performing arts share similar collaboration patterns, as well as a mechanism of community formation that is consistent with observed social behaviors."

See also slides


"Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia"

From the abstract:[11]

"We leverage a large scale natural experiment to study how exogenous content contributions to Wikipedia articles affect the attention they attract and how that attention spills over to other articles in the network. Results reveal that exogenously added content leads to significant, substantial and long-term increases in both content consumption and subsequent contributions. Furthermore, we find significant attention spillover to downstream hyperlinked articles."

See also slides


"The Positioning Matters: Estimating Geographical Bias in the Multilingual Record of Biographies on Wikipedia"

From the abstract:[12]

"This article proposes that an appropriate assessment of the geographical bias in multilingual Wikipedia's content should consider not only the number of articles linked to places, but also their internal positioning –i.e. their location in different languages and their centrality in the network of references between articles. This idea is studied empirically, systematically evaluating the geographic concentration in the biographical coverage of globally recognized individuals (those whose biographies are found in more than 25 language versions of Wikipedia). Considering the internal positioning levels of these biographies, only 5 countries account for more than 62% of Wikipedia's biographical coverage. In turn, the inequality in coverage between countries reaches very high levels, estimated with a Gini coefficient of .84 and a Palma ratio of 207."
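For reference, the two inequality measures cited in this abstract can be computed as follows. This is a generic sketch with invented per-country values, not the paper's data or its exact procedure. We assume the common definition of the Palma ratio (the share held by the top 10% of units divided by the share held by the bottom 40%); how the cut-offs are rounded for a list of countries is our own choice.

```python
def gini(values):
    """Gini coefficient: 0 for perfect equality, approaching 1 for extreme concentration."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def palma_ratio(values):
    """Total of the top 10% of units divided by the total of the bottom 40%."""
    xs = sorted(values)
    n = len(xs)
    top_n = max(1, n // 10)             # assumption: round group sizes down
    bottom_n = max(1, (4 * n) // 10)
    return sum(xs[-top_n:]) / sum(xs[:bottom_n])


# Hypothetical coverage scores for ten countries.
coverage = [1, 1, 1, 1, 2, 2, 3, 4, 5, 30]

coverage_gini = gini(coverage)
coverage_palma = palma_ratio(coverage)
```

In this toy example a single country holds 60% of the total, giving a Gini coefficient of about 0.62 and a Palma ratio of 7.5, both far below the values reported in the abstract.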

"Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale"

From the abstract:[13]

"We present Citation Detective, a system designed to periodically run Citation Need models on a large number of articles in English Wikipedia, and release public, usable, monthly data dumps exposing sentences classified as missing citations. [...] We provide an example of a research direction enabled by Citation Detective, by conducting a large-scale analysis of citation quality in Wikipedia, showing that article citation quality is positively correlated with article quality, and that articles in Medicine and Biology are the most well sourced in English Wikipedia."

See also code and blog post.


For coverage of some other papers from Wiki Workshop 2020, see last month's issue ("What is trending on (which) Wikipedia?"), and upcoming issues. This blog post about the event covers several non-paper aspects of the schedule, including the keynote by Jess Wade.


References

  1. ^ Joshi, Nikesh; Spezzano, Francesca; Green, Mayson; Hill, Elijah (2020-04-20). "Detecting Undisclosed Paid Editing in Wikipedia". Proceedings of The Web Conference 2020. WWW '20. Taipei, Taiwan: Association for Computing Machinery. pp. 2899–2905. doi:10.1145/3366423.3380055. ISBN 9781450370233.
  2. ^ TonyBallioni; Heilman, James; Henry, Brian; Halfaker, Aaron (2018-04-24). "Known Undisclosed Paid Editors (English Wikipedia)". Figshare.
  3. ^ Yamak, Zaher; Saunier, Julien; Vercouter, Laurent (2016-04-11). "Detection of Multiple Identity Manipulation in Collaborative Projects". Proceedings of the 25th International Conference Companion on World Wide Web. WWW '16 Companion. Montréal, Québec, Canada: International World Wide Web Conferences Steering Committee. pp. 955–960. doi:10.1145/2872518.2890586. ISBN 9781450341448.
  4. ^ Vincent, Nicholas; Hecht, Brent (2020-04-21). "A Deeper Investigation of the Importance of Wikipedia Links to the Success of Search Engines". arXiv:2004.10265.
  5. ^ Ni, Chien-Chun; Sum Liu, Kin; Torzec, Nicolas (2020-04-20). "Layered Graph Embedding for Entity Recommendation using Wikipedia in the Yahoo! Knowledge Graph". Companion Proceedings of the Web Conference 2020. WWW '20. Taipei, Taiwan: Association for Computing Machinery. pp. 811–818. doi:10.1145/3366424.3383570. ISBN 9781450370240.
  6. ^ Mitrevski, Blagoj; Piccardi, Tiziano; West, Robert (2020-04-21). "WikiHist.html: English Wikipedia's Full Revision History in HTML Format". arXiv:2001.10256.
  7. ^ Li, Ang; Farzan, Rosta (2020-04-20). "Collaboration of Open Content News in Wikipedia: The Role and Impact of Gatekeepers". Companion Proceedings of the Web Conference 2020. WWW '20. Taipei, Taiwan: Association for Computing Machinery. pp. 802–805. doi:10.1145/3366424.3383568. ISBN 9781450370240.
  8. ^ Chuai, Ziang; Geng, Qian; Jin, Jian (2020-04-20). "Domain-Specific Automatic Scholar Profiling Based on Wikipedia". Companion Proceedings of the Web Conference 2020. WWW '20. Taipei, Taiwan: Association for Computing Machinery. pp. 786–793. doi:10.1145/3366424.3383565. ISBN 9781450370240.
  9. ^ Liubonko, Kateryna; Sáez-Trumper, Diego (2020-04-20). "Matching Ukrainian Wikipedia Red Links with English Wikipedia's Articles". Companion Proceedings of the Web Conference 2020. WWW '20. Taipei, Taiwan: Association for Computing Machinery. pp. 819–826. doi:10.1145/3366424.3383571. ISBN 9781450370240.
  10. ^ Herrera-Guzman, Yessica; Graells-Garrido, Eduardo; Caro, Diego. "Beyond Performing Arts: Network Composition and Collaboration Patterns". Wiki Workshop 2020. https://wikiworkshop.org/2020/papers/Wiki_Workshop_2020_paper_5.pdf
  11. ^ Zhu, Kai; Walker, Dylan; Muchnik, Lev. "Content Growth and Attention Contagion in Information Networks: Addressing Information Poverty on Wikipedia". Wiki Workshop 2020. https://wikiworkshop.org/2020/papers/Wiki_Workshop_2020_paper_7.pdf. Preprint: Zhu, Kai; Walker, Dylan; Muchnik, Lev (2018-06-05). "Content Growth and Attention Contagion in Information Networks: A Natural Experiment on Wikipedia". Rochester, NY: Social Science Research Network. doi:10.2139/ssrn.3191128.
  12. ^ Beytía, Pablo (2020-04-20). "The Positioning Matters: Estimating Geographical Bias in the Multilingual Record of Biographies on Wikipedia". Companion Proceedings of the Web Conference 2020. WWW '20. Taipei, Taiwan: Association for Computing Machinery. pp. 806–810. doi:10.1145/3366424.3383569. ISBN 9781450370240.
  13. ^ Chou, Ai-Jou; Gonçalves, Guilherme; Walton, Sam; Redi, Miriam. "Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale". Wiki Workshop 2020.

Discuss this story


AUROC of 0.983 and average precision of 0.913 that sounds pretty good to me! Looking forward to seeing the results of this once it's applied in a more practical setting. Sdkb (talk) 11:57, 1 June 2020 (UTC)

Looking at it, I suspect it might help to filter out company Paidcoi SPAs, which is certainly worthwhile, but the "professionals" can tweak their behaviour fairly easy to heavily minimise their appearance without that much more work (e.g. in edits used to gain AC). Nosebagbear (talk) 12:45, 1 June 2020 (UTC)
Goodhart's law applies. Nemo 14:16, 1 June 2020 (UTC)
I've been working in this arena for a while, and in fact have a credit in the paper for contributing labeled data that was used to train the model. We aren't sure how sophisticated some of these operations are but my feeling is there's a distinct break between the activities of the outfits catering to well-funded Global North entities (in particular corporations and their executives, entertainers/entertainment companies, and politicians and political groups) – probably what you mean by the "professionals" – and the rest. I wouldn't be surprised if the former are highly aware of the investigative techniques used on-Wiki, and adapt to whatever metrics and techniques we apply, but the latter are unable to, at least quickly. But the greatest volume of stuff that has to be dealt with is due to the less sophisticated group, and it would still be useful to have tools that willow that away so human effort can be focused on the remainder. ☆ Bri (talk) 16:30, 1 June 2020 (UTC)
Yes, or in other words it's easy to focus on the least consequential cases, while large-scale manipulations by well-funded enemies of the neutral point of view will be left untouched. Nemo 18:00, 1 June 2020 (UTC)
That's kind of the opposite of what I said. Enhanced tools can help identify the least consequential cases; and dogged and talented experts can detect large-scale manipulations by well-funded enemies of the neutral point of view. You should drop in at WP:COIN and see how it works. ☆ Bri (talk) 02:47, 2 June 2020 (UTC)
@Sdkb: I put together the dataset that the authors used and generated a lot of the features. When I last checked on unseen articles, it was classifying 50% as UPE... so clearly not much help in practice. Admittedly I wasn't aware they'd published this and they might have improved, but the metrics they were getting back then were pretty similar. I still think this is possible, but it requires a lot more work to generate the training data. SmartSE (talk) 17:16, 3 June 2020 (UTC)
  • Guy, some of your edits might score high on that axis, but I don't think you would be selected by the algorithm due to the additional features outlined in 4.2 User-based features. Both of these would score low: average time between two consecutive edits made by the same user in the same article and the percentage of edits made by a user that are less than 10 bytes. Unclear if the last thing is scored across the suspicious article or across the account's lifetime, but either way. ☆ Bri (talk) 19:03, 1 June 2020 (UTC)



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0