The Signpost

Recent research

Wikipedia articles vs. concepts; Wikipedia usage in Europe

By Thomas Niebler and Tilman Bayer

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

"Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia"

Reviewed by Thomas Niebler
This paper was also presented in the June 2017 WikiResearch showcase.

In several Wikipedia-based systems and scientific analyses, researchers have assumed that no two articles in Wikipedia represent the same concept, where a concept is a semantically closed description of a specific item, for example "New York City". In a paper published at CSCW '17,[1] however, Lin et al. showed that this “article-as-concept” assumption does not in fact hold: the above-mentioned article about "New York City" has a separate sub-article about the "History of New York City", which describes a topic very closely related to “New York City” and could just as easily be merged into the original article. This way of splitting lengthy articles into several smaller ones ("summary style", more specifically "article size") may improve readability for human users, but seriously impairs many studies based on the “article-as-concept” assumption. Using a simple classification approach with features based on the link structure as well as on semantic aspects of the title and the context, the authors found that 70.8% of the 1,000 most-visited pages have been split into a main article and sub-articles, with an average of 7.5 sub-articles per article. They conclude that the existence of sub-articles is not the exception, but the rule.

A drawback of the proposed sub-article relationship detection method, as stated in the paper, is that it is trained only on explicitly encoded sub-article relationships; it is not yet clear how to detect implicit relationships, i.e. those where no editor has linked the sub-article to the main article. Still, this is a first step toward a deeper analysis of the Wikipedia page network, one that could make it both more readable for humans and more easily exploitable by algorithms.
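To make the idea of title- and link-based features more concrete, here is a minimal Python sketch. It is not the classifier from the paper: the pattern list, the function names (`title_features`, `is_candidate_subarticle`) and the decision rule are all hypothetical, chosen only to illustrate how a main article and a candidate sub-article might be compared.

```python
import re

# Hypothetical title templates that often mark a sub-article.
# Illustrative only; NOT the feature set used by Lin et al.
SUBARTICLE_PATTERNS = (
    "History of {}",
    "Geography of {}",
    "Economy of {}",
    "Culture of {}",
    "Demographics of {}",
)


def title_features(parent: str, candidate: str) -> dict:
    """Compute simple title-based features for a (parent, candidate) pair."""
    p = parent.lower()
    c = candidate.lower()
    parent_tokens = set(re.findall(r"\w+", p))
    candidate_tokens = set(re.findall(r"\w+", c))
    return {
        # The candidate title embeds the parent title but is not identical to it.
        "contains_parent": p in c and p != c,
        # The candidate title matches one of the assumed templates.
        "matches_pattern": any(
            c == pattern.format(parent).lower() for pattern in SUBARTICLE_PATTERNS
        ),
        # Share of parent-title tokens that also occur in the candidate title.
        "token_overlap": (
            len(parent_tokens & candidate_tokens) / len(parent_tokens)
            if parent_tokens
            else 0.0
        ),
    }


def is_candidate_subarticle(
    parent: str, candidate: str, linked_from_parent: bool
) -> bool:
    """Toy rule: the title must embed the parent title, and the pair must be
    either explicitly linked or match a known template (both are assumptions)."""
    features = title_features(parent, candidate)
    return features["contains_parent"] and (
        linked_from_parent or features["matches_pattern"]
    )
```

Under this toy rule, "History of New York City" is flagged for "New York City" even without an explicit link, while an unlinked title that merely contains the city name is not. That mirrors the limitation noted above: relationships that no editor has encoded are the hard case.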

Briefly

85% of German scientists use Wikipedia, and other European media survey results

Summary by Tilman Bayer

A survey among 1,354 German academic researchers about their professional use of social media found Wikipedia to be the most widely used site as of 2015, with 84.7%.[2] Among German internet users in general, 79% use Wikipedia. Only 2% of these Wikipedia readers think it is "never reliable", and 80% hold that it is "mostly" ("größtenteils") reliable.[3] A report by the German Monopolkommission (which advises the government on antitrust matters) on potential monopoly problems in the Internet search engine market highlighted Wikipedia as the top-10 website in Germany that is by far the most dependent on Google, which accounts for around 80% of its traffic (according to third-party data from SimilarWeb that is not quite consistent with the Wikimedia Foundation's own data).[4]

In France, surveys by the Institut national de la statistique et des études économiques (INSEE) found that from 2011 to 2013, the share of people who use the internet to consult Wikipedia ("or any other collaborative online encyclopedia") rose from 39% to 51%. Wikipedia usage was higher among younger internet users and among those with degrees: 82% among 16-24 year olds, 54% among 25-54 year olds, and only 31% among 55-74 year olds.[5] The corresponding Eurostat data gave 45% for the entire European Union as of 2015.[6]

In contrast, Ofcom found that only 2-4% of UK 12-15 year olds use Wikipedia as their first stop for information as of 2015.[7]

In the meantime, a 2016 Knight Foundation report, based on a study by Nielsen, found that "Among mobile sites [in the US], Wikipedia reigns in terms of popularity (the app does well too) and amount of time users spend on the entity. Wikipedia’s site reaches almost one-third of the total mobile population each month".[8]

Conferences and events

See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.

Compiled by Tilman Bayer
This paper was also presented in the February 2017 Wikimedia Research showcase

References

  1. ^ Lin, Yilun; Yu, Bowen; Hall, Andrew; Hecht, Brent (2017). Problematizing and Addressing the Article-as-Concept Assumption in Wikipedia. CSCW '17. New York, NY, USA: ACM. pp. 2052–2067. doi:10.1145/2998181.2998274. ISBN 9781450343350.
  2. ^ Siegfried, Doreen (2015-11-06). "Social Media: Forschende nutzen am häufigsten Wikipedia". ZBW Website. (in German)
  3. ^ "Vier von fünf Internetnutzern recherchieren bei Wikipedia". Bitkom. 2016-01-11.
  4. ^ http://www.monopolkommission.de/images/PDF/SG/SG68/S68_volltext.pdf
  5. ^ "Ce que l'on sait sur les usages de Wikipedia en France". 2017-07-10.
  6. ^ "Individuals using the internet for consulting wiki". Eurostat - Tables, Graphs and Maps Interface (TGM).
  7. ^ Children's Media Use and Attitudes Report 2015 Section 6 - Knowledge and understanding of media among 8-15s (PDF). United Kingdom: Ofcom. 2015. p. 16.
  8. ^ Foundation, Knight (2016-05-11). "Mobile America: How Different Audiences Tap Mobile News".
  9. ^ Yun, Jinhyuk; Lee, Sang Hoon; Jeong, Hawoong. "Intellectual interchanges in the history of the massive online open-editing encyclopedia, Wikipedia". Physical Review E. 93 (1): 012307. doi:10.1103/PhysRevE.93.012307 (closed access). Preprint: Yun, Jinhyuk; Lee, Sang Hoon; Jeong, Hawoong (2016-01-22). "Intellectual interchanges in the history of the massive online open-editing encyclopedia, Wikipedia". Physical Review E. 93 (1). doi:10.1103/PhysRevE.93.012307. ISSN 2470-0053.
  10. ^ Johnson, Isaac L.; Lin, Yilun; Li, Toby Jia-Jun; Hall, Andrew; Halfaker, Aaron; Schöning, Johannes; Hecht, Brent (2016-05-07). Not at Home on the Range: Peer Production and the Urban/Rural Divide (PDF). SIGCHI. San Jose, USA: SIGCHI. p. 13. doi:10.1145/2858036.2858123. ISBN 978-1-4503-3362-7.
  11. ^ He, Yang (2015-12-09). "Understanding the Role of Participative Web within Collaborative Culture: The Case of Wikipedia". Current Trends in Publishing (Tendances de l'édition): student compilation étudiante. 1 (2).
  12. ^ Tanon, Thomas Pellissier; Vrandecic, Denny; Schaffert, Sebastian; Steiner, Thomas; Pintscher, Lydia (2016-04-11). From Freebase to Wikidata: The Great Migration (PDF). 25TH INTERNATIONAL WORLD WIDE WEB CONFERENCE. Montreal, Quebec, Canada. p. 10. doi:10.1145/2872427.2874809.

Discuss this story


... "History of New York City" ... describes a topic very closely related to “New York City” and could at the same time easily be merged into the original article. This way of splitting up lengthy articles into several smaller ones ("summary style", more specifically "article size") may improve readability for human users, but seriously impairs many studies based on the “article-as-concept” assumption.

Does it? The split isn't solely based off of prose size but is proportional to coverage in secondary sources. I could write a history of a small town too, but that wouldn't warrant a split if the sourcing is all primary (e.g., if no secondary sources have conceptually addressed the history of the town), so we'd pare the history section of the town's article down to due weight. The history of NYC, though, has many reams of books written on it. (I'm actively parsing several books on the history of NYC's schools specifically in the 1960s...) Perhaps this is better explained in the talk itself, but as for the summary, splits such as the "history of NYC" should be seen as separate concepts from NYC itself, not only content forks. And besides, embedded in the idea of a split is the practical concern that the amount that can be reliably written on the topic extends past what a general audience would want to read in the context of the given article. czar 02:29, 25 September 2017 (UTC)
Absolutely. Blithely assuming that a "History of X" article is the same as "X" is entirely unsafe. A topic of sufficient size can often have subsidiary articles on topics such as history, methods, cultural connections, and so on depending on its type. In the case of a major city with a large history, the primary article may contain a summarized history, with an article giving much more detail. Lists of books or films featuring the city would also rightly be subsidiary articles, not at all desirable in the primary article, even though they would unquestionably be "about" the city.
On a different point, if there is an unlinked article with "New York City" in its title, it cannot be difficult for a script to detect and propose a likely connection as a subsidiary article. "Vampires of New York City" (if it existed) for instance would presumably feature that city as an involved participant. Chiswick Chap (talk) 08:07, 27 September 2017 (UTC)

Blithely assuming that a "History of X" article is the same as "X" is entirely unsafe.

Where did I blithely assume this? My point was that "History of NYC" should be judged by sourcing specific to the topic intersection (on its own merits) and based on the reams of books specific to NYC's history, the split becomes appropriate. It's having the ability to determine when a subtopic is itself the subject of significant coverage, and not simply the result of a size split. czar 13:52, 27 September 2017 (UTC)
Erm, I wasn't addressing you, and I agree with your comments both above and below mine. Chiswick Chap (talk) 14:01, 27 September 2017 (UTC)
They aren't criticizing our decision to split. The issue is that some science/research projects use Wikipedia as a massive useful database of information. They feed it into software for automatic analysis. Their simplistic initial assumption was basically "every city has an article, and all information about the city is in that article". That works perfectly for the article "Tinytown, Ohio". The history of Tinytown is in that article, and they want it included. However they are now noticing that they haven't been pulling together all information about New York City. They are surprised and disappointed that their software is failing to include "History of New York City" in with the other New York City information. From their point of view, their New York City results are incomplete or biased. They understand and accept that any machine analysis is going to have flaws. From their point of view gathering subarticles in with the parent articles generally gives better results, even if the software occasionally screws up and incorporates an incorrect article. Alsee (talk) 10:51, 4 October 2017 (UTC)



       
