The Signpost

Op-ed

Wikipedia's lead sentence problem

By Kaldari
Thomas Spencer Baynes, genius or pedant?

In the 9th edition of the Encyclopædia Britannica, editor Thomas Spencer Baynes introduced the convention of including a person's birth and death year after their name in all biographical articles:

CAMPBELL, John, LL.D. (1708–1775), a miscellaneous author, was born at Edinburgh, March 8, 1708.

This allowed a reader to more easily distinguish between the 100+ notable people named John Campbell (only one of whom was actually lucky enough to get an article in the 9th edition). Although this convention was a bit awkward and redundant, it served a useful purpose (in the absence of disambiguation pages), and was kept in all subsequent editions.

When Wikipedia was created in 2001, it sought to emulate the successful model of the Encyclopædia Britannica, and many editors adopted the convention of including birth and death years in the lead sentence.[1] Here is the lead sentence for Christopher Columbus as it appeared on June 13, 2001:

Christopher Columbus (1451?–1506) was a probably Genovian sailor who crossed the Atlantic in service of Spain.

Little did Thomas Spencer Baynes realize that Wikipedia editors would eventually expand on his convention, including not only birth and death years, but entire birth and death dates, birth and death dates in alternate calendars, birth and death locations, alternate names, maiden names, foreign names, pronunciations, foreign pronunciations, and transliterations. Fifteen years later, here's what Christopher Columbus's lead sentence had become:

Christopher Columbus (/kəˈlʌmbəs/; Ligurian: Cristoffa Combo; Italian: Cristoforo Colombo; Spanish: Cristóbal Colón; Portuguese: Cristóvão Colombo; Latin: Christophorus Columbus; born between 31 October 1450 and 30 October 1451 in Genoa – died on 20 May 1506 in Valladolid) was an Italian explorer, navigator, colonizer, and citizen of the Republic of Genoa.

Flesch Reading Ease scores for the lead sentence of Christopher Columbus from 2002 to 2016

What began as a concise, encyclopedic sentence had slowly grown into a sprawling mess of multiplying metadata, a sentence so densely packed as to be unreadable.[2] This isn't just a subjective opinion. If you chart the Flesch Reading Ease score of the sentence over the years, you'll see an almost continuous decline since 2002. Nor is this an isolated example. The metadata virus has spread from biographical articles to other subjects as well, like geography:

Israel (/ˈɪzrəl/; Hebrew: יִשְׂרָאֵל Yisrā'el; Arabic: إِسْرَائِيل Isrāʼīl), officially the State of Israel (Hebrew: מְדִינַת יִשְׂרָאֵל Medīnat Yisrā'el [mediˈnat jisʁaˈʔel]; Arabic: دَوْلَة إِسْرَائِيل Dawlat Isrāʼīl [dawlat ʔisraːˈʔiːl]), is a country in the Middle East, on the southeastern shore of the Mediterranean Sea and the northern shore of the Red Sea.
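For readers who want to reproduce the kind of measurement charted above, here is a minimal sketch of the standard Flesch Reading Ease formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The function names are made up for illustration, and the syllable counter is a crude vowel-group heuristic rather than whatever tool produced the chart, so treat the absolute numbers as approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Crude assumption: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher scores mean easier reading."""
    # Words are runs of ASCII letters; digits and IPA symbols are ignored.
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("text contains no words")
    # A sentence ends at ., ! or ? followed by whitespace or end of text,
    # so the "?" in "1451?-1506" is not counted as a sentence break.
    sentences = max(1, len(re.findall(r"[.!?]+(?=\s|$)", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


short = "Columbus was a sailor who crossed the Atlantic."
long_ = ("Columbus, born between 31 October 1450 and 30 October 1451 in Genoa, "
         "was an Italian explorer, navigator, colonizer, and citizen of the "
         "Republic of Genoa.")
print(round(flesch_reading_ease(short), 1), round(flesch_reading_ease(long_), 1))
```

Because the formula penalizes both long sentences and long words, every extra name, date, and transliteration packed into a single sentence pushes the score down.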

The problem has become so noticeable that many reusers of Wikipedia content (including the WMF itself) have started stripping out parenthetical phrases from the lead sentence in certain contexts. If you search for "Christopher Columbus" on Google, you'll see a much more digestible description, both in the Knowledge Graph and under the Wikipedia search result. If you turn on the Page Previews beta feature in your preferences and hover over Christopher Columbus, you'll also see a much shorter version. The Wikipedia apps even experimented with removing parenthetical phrases from the lead sentences in the articles themselves. This has led to heated debates about whether or not we are potentially removing important information (as some parenthetical phrases consist of content other than metadata). Without a clear way to identify which parenthetical phrases are useful and which are detrimental, I'm sure these issues will remain unresolved. What's really needed is a vigorous debate by the Wikipedia community about how to bring this problem under control and make our articles readable again.
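To make the stripping approach concrete, here is a rough sketch of how a reuser might drop parenthetical phrases from a lead sentence. It is an assumption for illustration only, not the method used by Google, Page Previews, or the Wikipedia apps, and the name strip_parentheticals is hypothetical.

```python
import re


def strip_parentheticals(sentence: str) -> str:
    """Remove every (...) group, including nested ones, then tidy spacing."""
    kept = []
    depth = 0  # how many unclosed "(" we are currently inside
    for ch in sentence:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    text = re.sub(r"\s+", " ", "".join(kept))
    # Remove the stray space left before punctuation, e.g. "Israel ," -> "Israel,"
    text = re.sub(r"\s+([,.;:])", r"\1", text)
    return text.strip()


lead = ("Christopher Columbus (1451?\u20131506) was a probably Genovian sailor "
        "who crossed the Atlantic in service of Spain.")
print(strip_parentheticals(lead))
```

Note that this removes every parenthetical indiscriminately, which is exactly the concern raised above: it cannot tell metadata from a parenthetical that carries real content.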

If we don't take significant steps to address this problem, the metadata disease is only going to keep multiplying and spreading. If left unchecked, I fear this is what our future will look like:

[Excerpt from the Americapedia article about Wikipedia, copyright 2034, used with permission.]

...Like frogs in a pot of boiling water, the proliferation of lead sentence metadata happened so slowly that no one noticed until 2021 when John Seigenthaler's son published a devastating video on ClickNews in which he read aloud the lead sentence of his Wikipedia article, and then wept for 3 minutes.

John Michael SeigenthalerQ1701714 on Wikidata (English pronunciation: /ˈdʒɑn ˈmaɪkəl ˈsiːɡənθɔːlər/ ; German pronunciation: [ˈjuːˈan ˈmaɪkəl ˈziːkənθɔːlər] ; born December 21, 1955 in Nashville, TennesseeQ23197 on Wikidata, current resident of Weston, ConnecticutQ662537 on Wikidata (as of 2008), not yet deceased), also known as John Seigenthaler Jr. (English pronunciation: /ˈdʒɑn ˈsiːɡənθɔːlər ˈdʒunjəɹ/ ; German: John Seigenthaler jünger, pronounced [ˈjuːˈan ˈziːkənθɔːlər ˈdʒunjəɹ] ), is an American news anchor, most recently working for ClickNews.

Seigenthaler's video caught the attention of the recently re-elected Donald Trump, who only weeks before had dissolved The New York Times and Washington Post by executive order. Trump immediately posted a flurry of tweets eviscerating the venerable online encyclopedia. By the next day, Wikipedia was no more.

Let's avoid this sorry fate and make Wikipedia great again!

  1. ^ German Wikipedia also adopted the convention of preceding all death dates with a dagger (called a "Kreuz" in German), which has led to endless debates about whether or not the symbol is Christian and thus inappropriate to use for non-Christian biographies. Luckily, such a convention doesn't seem to exist in English encyclopedias!
  2. ^ Another famous example:
    Genghis Khan (English pronunciation:/ˈɡɛŋɡɪs ˈkɑːn/ or /ˈɛŋɡɪs ˈkɑːn/;[1][2]; Cyrillic: Чингис Хаан, Chingis Khaan, IPA: [tʃiŋɡɪs xaːŋ] ; Mongol script: , Činggis Qaɣan; Chinese: 成吉思汗; pinyin: Chéng Jí Sī Hán; probably May 31, 1162[3] – August 25, 1227), born Temujin (English pronunciation: /təˈmɪn/; Mongolian: Тэмүжин, Temüjin IPA: [tʰemutʃiŋ] ; Middle Mongolian: Temüjin;[4] traditional Chinese: 鐵木真; simplified Chinese: 铁木真; pinyin: Tiě mù zhēn) and also known by the temple name Taizu (Chinese: 元太祖; pinyin: Yuán Tàizǔ; Wade–Giles: T'ai-Tsu), was the founder and Great Khan (emperor) of the Mongol Empire, which became the largest contiguous empire in history after his death.

Discuss this story


Join the RfC in response to this article. A L T E R C A R I   06:59, 19 June 2017 (UTC) [reply]

Brilliant insight

Sometimes a problem is right in front of you, but you don't notice it until someone else points it out, at which point you see it everywhere. This essay is that sort of eye-opener. --Guy Macon (talk) 22:26, 8 May 2017 (UTC)[reply]

Exactly. Thank you, Kaldari, for bringing this up. --Saqib (talk) 06:24, 12 May 2017 (UTC)[reply]

Solutions

{{infobox medical condition}} introduced a |pronounce= parameter a while ago, which I think is a good solution. Alternate names/languages could be handled the same way in articles with infoboxes, e.g., as documented at Template:Infobox settlement#Name and transliteration.

Etymology is an endless problem (e.g., in anatomy articles), with some editors wanting it to be the first thing that you read, others wanting it last, and others not wanting it included at all. WhatamIdoing (talk) 18:53, 12 May 2017 (UTC)[reply]

Moving information to infoboxes seems like the right way to go. Inline parentheticals should be limited to what helps disambiguate the subject from plausible alternatives. – SJ + 22:16, 7 June 2017 (UTC)[reply]
It is key that the first sentence of English Wikipedia be in English as much as possible. Not sure which language pronunciations are written in, but it is not one I can read. Doc James (talk · contribs · email) 19:50, 8 June 2017 (UTC)[reply]
Yeah, it's unfortunate that we waste the most valuable real estate in the article on information that only 0.01% of readers are both interested in and can understand (don't quote me on that statistic). Kaldari (talk) 20:14, 8 June 2017 (UTC)[reply]
This is described as a metadata explosion issue. Wasn't Wikidata created as a solution to metadata surfacing issues? Maybe original-language pronunciations etc. could be toggled by the user and fetched from Wikidata? By the way, I agree that this is a problem for many articles. - Bri (talk) 21:32, 8 June 2017 (UTC)[reply]



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0