The Signpost

Recent research

Sentiment monitoring; Wikipedians and academics favor the same papers; UNESCO and systemic bias; How ideas flow on Wikiversity

By Piotr Konieczny, Oren Bochman, Taha Yasseri, Jonathan T. Morgan and Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

Too good to be true? Detecting COI, Attacks and Neutrality using Sentiment Analysis

Finn Årup Nielsen, Michael Etter and Lars Kai Hansen presented a technical report[1] on an online service they created for real-time monitoring of Wikipedia articles about companies. It performs sentiment analysis of edits, filtered by company and editor. Sentiment analysis is a new applied linguistics technology that is used in tasks ranging from author profiling to detecting fake reviews on online retail sites. The visualization provided by this tool makes deviations from linguistic neutrality easy to detect. However, as the authors point out, this analysis only gives a robust picture when used statistically and is more prone to mistakes when operating within a limited scope.

The service monitors recent changes using an IRC stream and detects company-related articles from a small hand-built list. It then retrieves the current version using the MediaWiki API and performs sentiment analysis using the AFINN sentiment-annotated word list. The project was developed by integrating a number of open source components such as NLTK and CouchDB. Unfortunately, the source code has not been made available and the service can only run queries on the shortlisted companies, which will limit the impact of this report on future Wikipedia research. However, it seems to have potential as a tool for detecting COI edits, which tend to tip neutrality by adding excess praise, as well as attacks, which tip the content in the other direction. We hope the researchers will open-source this tool like their prior work on the AFINN data set, or at least provide some UI to query articles not included in the original research.
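The word-list approach described above can be illustrated with a minimal sketch. It assumes a toy lexicon whose entries and valences are invented for illustration (they are not copied from AFINN), and the function names are hypothetical; it is not the authors' implementation, whose source code is unavailable.

```python
import re

# Toy lexicon in the style of a sentiment-annotated word list (word -> valence).
# These entries are illustrative assumptions, not values taken from AFINN.
TOY_LEXICON = {
    "excellent": 3,
    "innovative": 2,
    "leading": 2,
    "failed": -2,
    "scandal": -3,
    "fraud": -4,
}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the valences of all lexicon words found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def sentiment_delta(old_text, new_text, lexicon=TOY_LEXICON):
    """Change in sentiment between two revisions of an article."""
    return sentiment_score(new_text, lexicon) - sentiment_score(old_text, lexicon)
```

A strongly positive or negative delta on a single edit is only a hint. As the report itself notes, such scores are more reliable when aggregated over many edits.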

"A Comparative Study of Academic impact and Wikipedia Ranking"

A paper[2] with this title investigates the relation between the scientific reputation of scientific items (authors, papers, and keywords) and the impact of the same items on Wikipedia articles. The sample of scientific items consists of the entries in the ACM Digital Library, including more than 100,000 papers, 150,000 authors and 35,000 keywords. However, only a tiny subset of these could be found in English Wikipedia pages (the authors considered all Wikipedia pages in the English edition which contain at least two mentions of any of the scientific items in the sample). The academic reputation is calculated based on three criteria: frequency of appearance, number of citations each item receives from the others, and PageRank calculated on the citation network. The Wikipedia ranking is based on three popularity measures of all the pages that have mentioned the item: number of mentions, sum of the PageRank of all the mentioning pages, and sum of the in-degrees of all the mentioning pages in Wikipedia's hyperlink network.

These 3 × 3 choices give 9 combinations of academic ranking and Wikipedia ranking for each of the 3 types of scientific entities (authors, papers, keywords). All of the resulting 27 pairs are shown to be correlated according to Spearman's rank correlation, indicating that in general Wikipedia mentions are non-randomly driven by scientific reputation. However, most of the combinations are less significant. Surprisingly, the most relevant Wikipedia ranking criterion turns out to be the plain total number of mentions, rather than the more sophisticated ones, i.e., the PageRank and in-degree measures.
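For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the two rank vectors. The following is a minimal pure-Python sketch with tied values receiving average ranks. The function names are hypothetical and the paper's own tooling is not described here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` could hold, say, citation counts and `y` the number of Wikipedia mentions for the same items. Only the ordering of the values matters, not their magnitudes.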

In a separate part, the authors define two sets of scientific items: those which are mentioned in Wikipedia, and those which are not mentioned at all (the latter is larger in size by a factor of 2 for keywords, 100 for authors, and 300 for papers). They show that for all 3 types, the items which are mentioned in Wikipedia have a better academic rank on average.

1970s UNESCO debate applied to Wikipedia's systemic bias in the case of Cambodia

According to the author, the Angkor period dominates Cambodian historiography as well as tourist attention in the country, corresponding to an unevenness in the quality of Wikipedia articles
An article[3] in the Journal of the American Society for Information Science and Technology rated the quality of Wikipedia articles on the history of Cambodia (defined as those linked in the corresponding navbox), using four measures: 1) the article's ratio of citations to words, 2) the number of editors who have commented on its talk page, 3) the quality of the cited sources, rated in five categories ("traditional reference" like print encyclopedias, "news reports" including both newspapers and news websites such as CNN, "academic periodicals", "books", and "miscellany" like reports by governments or NGOs, or personal websites) and 4) "the number of unique authors cited", assuming that articles which are based on a larger variety of perspectives are of higher quality. The findings are summarized as follows:

The early history of Cambodia is represented by an extremely weak article, but there is an improvement in the articles dealing with the early kingdoms of Cambodia. The improvement ends abruptly with articles on the 'dark age' of Cambodia, the French Protectorate, the Japanese occupation, and early postindependence periods being of a much lower quality. Afterward, the quality picks up again with especially good articles on the American intervention in Cambodia, the Cambodian-Vietnamese War, and the People's Republic of Kampuchea. However, the quality does not last; as we near contemporary times, the articles take another turn for the worse.

From this, the author concludes that "the Wikipedia community is unconsciously mimicking the general historiography of the country", in particular a glorification of Angkor and other early kingdoms at the cost of later periods, and observes a "continuing dominance of the traditional historiographical narrative of Cambodian history in Wikipedia." The subsequent section of the paper tries to put these results into the context of the historical debates in the late 1970s and early 1980s about the New World Information and Communication Order (NWICO), a suggested remedy for problems with the under-representation of the developing world in the media, put forth by a UNESCO commission in the MacBride report (1980):

Wikipedia provides access—it is free to use by anyone with an Internet connection, and print versions can also be distributed. But the whole thrust of the NWICO argument is that content matters and those who create content matter perhaps even more, with the commission stressing that countries needed to 'achieve self-reliance in communication capacities and policies' ... Contrary to popular belief, in the new 'information age' content is, once again, the preserve of the few, not the many, and a geographically concentrated few at that.

The author's argument is somewhat weakened by asserting erroneously that "there exists no Cambodian-language Wikipedia", but generally aligns with other quantitative research that has found a geographic unevenness of coverage in Wikipedia. The author is an information studies professor at Singapore's Nanyang Technological University and previously published a related paper in the same journal examining the Wikipedia article History of the Philippines, reviewed in the August issue: "The limits of amateur NPOV history".

Julia Preusse, Jerome Kunegis, Matthias Thimm, Thomas Gottron and Steffen Staab investigate[4] mechanisms of change in a wiki that are structural in nature, i.e., that are a direct result of the wiki's linking structure. They examine whether the addition and removal of internal links between pages can be predicted using only information about the network connecting these articles. The study's innovation lies in also considering the removal of links, which accounts for a high proportion of removals and reverts. The authors performed an empirical study on Wikipedia, stating that traditional indicators of structural change used in the link analysis literature can be classified into four classes, which indicate growth, decay, stability and instability of links. These methods were then employed to identify the underlying reasons for individual additions and removals of knowledge links.

The network created by links between articles in Wikipedia is characterized by preferential attachment. Prior work on social networks has identified a phenomenon called "liability of newness", in which new connections are more likely to be broken than older ones. To provide a better predictive model of link evolution the team considered five hypotheses:

  1. Preferential attachment: The number of adjacent nodes is a good indicator for link addition.
  2. Embedding: The embeddedness of a link is suitable to predict the appearance of links and the non-disappearance of existing links.
  3. Reciprocity: The presence of a link makes the addition of a link in the opposite direction more likely and the removal of a reciprocal link less likely.
  4. Liability of Newness: Old age of an edge or a node is a good indicator for link persistence.
  5. Instability: The less stable two nodes are, the less stable the link connecting them is, or would be if it does not exist.
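Three of these hypotheses correspond to simple structural features of a link graph. The sketch below computes common textbook versions of them (degree product, number of shared neighbours, presence of the reverse link). These definitions and function names are assumptions for illustration and may differ from the exact indicators used in the paper.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build outgoing and incoming neighbour sets from (source, target) pairs."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for source, target in edges:
        out_links[source].add(target)
        in_links[target].add(source)
    return out_links, in_links


def neighbours(node, out_links, in_links):
    """All pages adjacent to `node`, ignoring link direction."""
    return out_links.get(node, set()) | in_links.get(node, set())


def preferential_attachment(u, v, out_links, in_links):
    """Product of the two nodes' neighbour counts (a common textbook score)."""
    return len(neighbours(u, out_links, in_links)) * len(neighbours(v, out_links, in_links))


def embeddedness(u, v, out_links, in_links):
    """Number of neighbours shared by u and v, excluding u and v themselves."""
    shared = neighbours(u, out_links, in_links) & neighbours(v, out_links, in_links)
    return len(shared - {u, v})


def is_reciprocated(u, v, out_links):
    """True if a link v -> u exists, i.e. a link u -> v would be reciprocal."""
    return u in out_links.get(v, set())
```

Each function returns a score for a candidate pair of pages, so the scores can be ranked and compared against the links that were actually added or removed later.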

To test these hypotheses, they created networks based on the history of the mainspace articles until 2011 of the top five Wikipedias after the English one. For example, in the French Wikipedia, 41.7 million links were added and 17.3 million removed during that time. The data was used to create a link creation predictor and a link removal predictor. These were then evaluated using the area under the receiver operating characteristic curve (AUC).
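The area under the ROC curve can be read as the probability that a randomly chosen positive example (e.g. a link that was actually added) receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise formulation follows. The function name is hypothetical, and the quadratic loop is only suitable for small samples, not for datasets of the size used in the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `labels` are truthy for positive examples. Ties between a positive and a
    negative score count as half a win.
    """
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to a predictor that ranks no better than chance, and 1.0 to one that ranks every positive above every negative.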

The results were that Preferential attachment and Embedding are good indicators of growth. Liability of Newness did not turn out to be a good indicator of link removal, but more of article instability. Reciprocity is also an indicator of growth, but is not as significant since most links in a wiki are not reciprocated.

Generation Z judges [[Generation Z]], questioning role of amphetamines

An article[5] in the Journal of Information Science, titled "Understanding trust formation in digital information sources: The case of Wikipedia", explores the criteria used by students to evaluate the credibility of Wikipedia articles. It contains an overview of various earlier studies about credibility judgments of Wikipedia articles (some of them reviewed previously in this space, for example: "Quality of featured articles doesn't always impress readers").

The authors asked "20 second-year undergraduate students and 30 Master’s students" in information studies to first spend 20 minutes reading "a copy of a two-page Wikipedia article on Generation Z, a topic with which students were expected to have some familiarity", and then to answer an open-ended question explaining how they would judge its trustworthiness. In a subsequent part, the respondents were asked to rank a list of factors for trustworthiness in the case of "either (a) the topic of an assignment, or (b) a minor medical condition from which they were suffering". One of the first findings was a "low pre-disposition to use [Wikipedia], possibly suggesting a propensity to distrust, grounded on debates and comments on the trustworthiness of Wikipedia" – possibly due to the fact that the example article contained an instance of vandalism, which was highlighted by several respondents (e.g. "started off as a valid entry ... due to citations strengthening this ... however came to the last paragraph and the whole document was marred by the insert of 'writing articles on Wikipedia while on amphetamines' [as purported hobby of Generation Z members]... just feels that you can't trust anything now").

Among the given trustworthiness factors, the following were ranked most highly:

authorship, currency, references, expert recommendation and triangulation/verification, with usefulness just below this threshold. In other words, participants valued having articles that were written by experts on the subject, that were up to date, and that they perceived to be useful (content factors). ... Interestingly these factors all seemed more or less equally important for both contexts, with the exception of references, which for predictable reasons were seen as having greater importance in the context of assignments.

Visualizing the "flow of ideas" on Wikiversity

In a conference paper titled "Analyzing the flow of ideas and profiles of contributors in an open learning community"[6] (see also audience notes from the presentation), the authors construct a graph from the revisions of a set of Wikiversity pages, with two kinds of edges: 1) "Update edges", linking a page's revision to the directly subsequent revision. These are understood as representing "knowledge flow over the course of the collaborative process on a single wiki page". 2) "Hyperlink edges" between two revisions of different pages with a wikilink between them, but pointing in the opposite direction, because the idea is that they indicate knowledge flowing from the linked page to the linking page. By defining the source node of a hyperlink edge "as the latest revision of the hyperlinked page at the moment of creation of the target revision", both kinds of links point forward in time, resulting in a two-relational directed acyclic graph (DAG), which is "depicting the knowledge flow over time." The authors then filter out "redundant" hyperlink edges and attach authorship information to each node (page revision).
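The construction described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the `Revision` record and function name are hypothetical, timestamps are assumed to be unique and comparable, and the filtering of "redundant" hyperlink edges is omitted because the summary does not specify it.

```python
from bisect import bisect_left
from collections import defaultdict, namedtuple

# Hypothetical record for one page revision; `links` holds the page titles it links to.
Revision = namedtuple("Revision", ["page", "timestamp", "author", "links"])


def build_knowledge_flow_graph(revisions):
    """Return (update_edges, hyperlink_edges, authors); nodes are (page, timestamp)."""
    by_page = defaultdict(list)
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        by_page[rev.page].append(rev)

    authors = {(r.page, r.timestamp): r.author for r in revisions}
    update_edges = []
    hyperlink_edges = []

    for page, revs in by_page.items():
        # Update edges: each revision points to the directly subsequent one.
        for earlier, later in zip(revs, revs[1:]):
            update_edges.append(((page, earlier.timestamp), (page, later.timestamp)))

        # Hyperlink edges: from the latest earlier revision of the linked page
        # to the linking revision, so that edges always point forward in time.
        for rev in revs:
            for linked_page in rev.links:
                if linked_page == page:
                    continue
                candidates = by_page.get(linked_page, [])
                times = [c.timestamp for c in candidates]
                index = bisect_left(times, rev.timestamp) - 1
                if index >= 0:
                    source = (linked_page, candidates[index].timestamp)
                    hyperlink_edges.append((source, (page, rev.timestamp)))

    return update_edges, hyperlink_edges, authors
```

Because every edge leads from an earlier revision to a later one, the resulting graph cannot contain cycles, which is what makes it a DAG.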

The authors apply this procedure to a set of Wikiversity articles in the area of medicine, starting with v:Gynecological History Taking. The results are interpreted as follows:

[In] the beginning, short after the category medicine was founded, the authors in this category built up the basic structure of the knowledge domain. The main relations and idea flows between the learning materials were established early in the development of the domain. After that the authors have been focusing on elaborating the articles without introducing new important hyperlinks. The overall picture of the learning process in this domain suggests a divergent evolution of ideas after an initial period of mutual fertilization between different topics. This conforms to the idea of groups of learners that followed different interests in the medicine domain with little inter-group collaboration on the creation of new shared learning resource.

The method is subsequently applied to profile the activities of various users.

The authors have integrated these algorithms, including visualization tools, into a "network analytics workbench ... used in the ongoing EU project SISOB which aims to measure the influence of science on society based on the analysis of (social) networks of researchers and created artifacts."

In brief

References

  1. ^ Finn Årup Nielsen, Michael Etter, Lars Kai Hansen: Real-time monitoring of sentiment in business related Wikipedia articles (conference paper, submitted). Informatics and Mathematical Modelling, Technical University of Denmark.
  2. ^ Xin Shuai, Zhuoren Jiang, Xiaozhong Liu, Johan Bollen: A Comparative Study of Academic impact and Wikipedia Ranking
  3. ^ Brendan Luyt: History on Wikipedia: In need of a NWICO (New World Information and Communication Order)? The case of Cambodia. Journal of the American Society for Information Science and Technology
  4. ^ Julia Preusse, Jerome Kunegis, Matthias Thimm, Thomas Gottron and Steffen Staab: Structural Dynamics of Knowledge Networks
  5. ^ Jennifer Rowley, Frances Johnson: Understanding trust formation in digital information sources: The case of Wikipedia. Journal of Information Science, first published on March 6, 2013. doi:10.1177/0165551513477820
  6. ^ Iassen Halatchliyski, Tobias Hecking, Tilman Göhnert, H. Ulrich Hoppe: Analyzing the flow of ideas and profiles of contributors in an open learning community. LAK '13 Proceedings of the Third International Conference on Learning Analytics and Knowledge (April 08 - 12 2013, Leuven, Belgium), Pages 66-74. ACM New York, NY, USA. http://dx.doi.org/10.1145/2460296.2460311
  7. ^ Marcus Messner & Marcia DiStaso: Wikipedia Vs. Encyclopedia Britannica: A Longitudinal Analysis to Identify the Impact of Social Media on the Standards of Knowledge. doi:10.1080/15205436.2012.732649
  8. ^ Georgios Fessakis, Maria Zoumpatianou: Wikipedia uses in learning design: A literature review
  9. ^ Michele Van Hoeck and Debra Hoffmann: From Audience to Authorship to Authority: Using Wikipedia to Strengthen Research and Critical Thinking Skills.
  10. ^ Co-authorship patterns around Pope Francis | Brian Keegan
  11. ^ Boston Marathon bombing | Brian Keegan
  12. ^ Mihai Georgescu, Dang Duc Pham, Nattiya Kanhabua, Sergej Zerr, Stefan Siersdorfer, Wolfgang Nejdl: Temporal Summarization of Event-Related Updates in Wikipedia. WWW 2013, May 13–17, 2013, Rio de Janeiro, Brazil.
  13. ^ Oliver Keyes: Why are users blocked on Wikipedia?
  14. ^ 50,000 Lessons on How to Read: a Relation Extraction Corpus
  15. ^ Young-Ho Eom, Klaus M. Frahm, András Benczúr, Dima L. Shepelyansky: "Time evolution of Wikipedia network ranking"
  16. ^ Yelena Mejova, Ilaria Bordino, Mounia Lalmas, Aristides Gionis: Searching for Interestingness in Wikipedia and Yahoo! Answers. WWW 2013 Companion, May 13–17, 2013, Rio de Janeiro, Brazil.
  17. ^ Luz Rello, Martin Pielot, Mari-Carmen Marcos, Roberto Carlini: Size Matters (Spacing not): 18 Points for a Dyslexic-friendly Wikipedia. W4A2013 - Technical May 13-15, 2013, Rio de Janeiro, Brazil. Co-Located with the 22nd International World Wide Web Conference.
  18. ^ Why We Publish Through the ACM Digital Library in 2013 | The Joint International Symposium on Open Collaboration
  19. ^ Requirements for a Suitable Publisher in 2014 | The Joint International Symposium on Open Collaboration
  20. ^ Roy Rosenzweig: Can History be Open Source? Wikipedia and the Future of the Past. The Journal of American History Volume 93, Number 1 (June, 2006): 117-46.
  21. ^ Paolo Missier, Ziyu Chen: Extracting PROV provenance traces from Wikipedia history pages. EDBT/ICDT ’13 March 18 - 22 2013, Genoa, Italy.

Discuss this story


Font size

According to the above, four researchers from Barcelona "recommend using 18-point font size when designing web text for readers with dyslexia". Can't dyslexics "zoom" their displays, like the rest of us do? --Orlady (talk) 15:08, 2 May 2013 (UTC)

Better yet, they should be able to use a font that is specifically designed for those with dyslexia. -- John Broughton (♫♫) 19:38, 2 May 2013 (UTC)

Opinionated

As usual, good work summarizing briefly a number of interesting activities. However, a minor grammar error brought me to a heightened state of alert, which made me notice the poorer quality of the next "Mining content removed" item. It's too long for an "in brief" bullet point, because the reviewer spends too many words pointing out what's wrong with the reviewed work. Other than that, it's a well-written page, rewarding the usual wait for our overdue Signpost. Jim.henderson (talk) 12:25, 3 May 2013 (UTC)

I was the reviewer. Thanks for the feedback; this is my first research review for Signpost and I'm still getting the hang of the genre conventions. - J-Mo Talk to Me Email Me 19:07, 3 May 2013 (UTC)

Usability study

To increase readability, I recommend a line width of 120 characters. You can do that with this code; copy it to your userspace. Have fun! --NaBUru38 (talk) 17:32, 3 May 2013 (UTC)

Provenance

The provenance of Wikipedia articles is per-character, but the W3C PROV descriptor is per-document, isn't it? 116.233.70.143 (talk) 01:00, 5 May 2013 (UTC)

As said in the brief summary, this tool appears to be based on "metadata from Wikipedia revision history and user contribution pages (e.g. the author of a particular revision, or articles edited by an editor)". This metadata provided by MediaWiki is per-revision; it does not include information tracking the authorship of particular parts of text. Some external tools like WikiBlame or WikiTrust do this, and there is currently a proposed Google Summer of Code project to integrate an optimized and streamlined version of the WikiTrust algorithm (only the authorship tracking part) into Wikipedia. We'll probably cover the accompanying WWW'2013 conference paper in the next issue. Regards, Tbayer (WMF) (talk) 18:38, 5 May 2013 (UTC)



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0