The Signpost

Technology report

What does it take to upload a file?

By Legoktm
The 2020 Picture of the Year was uploaded using the Commons UploadWizard.
Legoktm is a site reliability engineer for the Wikimedia Foundation. He wrote this in his volunteer capacity.

There is some irony in a piece of software being named MediaWiki while it struggles with media files, but it's not that surprising given that much of Wikipedia's focus and efforts go towards developing text. On the Main Page, you'll most likely have to scroll past multiple sections celebrating written text until you hit the day's featured photo.

Given the recent issues with uploading files, let's take a look into what it actually takes to upload a file to Wikimedia servers.

A brief history

In the very beginning, you needed to email a Bomis employee to place your photo on the server. The initial version of Magnus Manske's PHP-based wiki would accept any file from editors and administrators. Users had to select a checkbox which said, "I hereby affirm that this file is not copyrighted, or that I own the copyright for this file and donate it to Wikipedia." On the server, the only thing it checked was whether the hard drive was more than 96% full; if so, it disabled all uploads. It also had a polite request for users: "You can upload as many files you like. Please don't try to crash our server, ha ha."

It was not until 2004 that tagging images with copyright statements became a convention. The Creative Commons licenses and templates were introduced, and Wikimedia Commons was first proposed by Eloquence in March of that same year.

In 2009, the Usability Initiative (see past Signpost coverage) brought grant funding for improving the multimedia experience, leading to UploadWizard on Commons and better metadata extraction, among other things. The English Wikipedia's own File Upload Wizard was developed in 2012, offering users a guided method to upload non-free files.

More recently there has been an increased focus on tools to facilitate mass GLAM contributions, such as bulk uploaders like GWToolset and Pattypan.

How it works today

A rough overview of how media storage is organized (from 2014).

Today all media files are stored in an OpenStack Swift cluster, a cloud storage system similar to Amazon S3, so it's unlikely the disks will actually fill up. These files are made available in both of Wikimedia's two primary data centers in Virginia and Texas for redundancy. Users end up downloading these files either from one of the data centers or from one of Wikimedia's CDN servers in Amsterdam, San Francisco, and Singapore, whichever is geographically closest to them.

Most users will never see the original file that was uploaded. Instead, a piece of software named Thumbor generates smaller versions of each image (also stored in Swift), so users viewing an article download only the image sizes they actually see. Videos are scaled in the same way, generating lower-quality versions of high-quality uploads (just like on YouTube and other video-sharing sites).

There are three main ways MediaWiki accepts uploads in the backend, each with its pros and cons. First is a direct upload through Special:Upload, which is the original upload interface. This is the simplest form; the entire file is transferred in one go, and available for processing on the server immediately. However, because it is so direct, there's no opportunity for any nice user-facing progress bars, and any failure means the entire upload must be retried. It also only accepts files up to 100MB.

bigChunkedUpload.js is a gadget that lets users see the individual chunks being uploaded.

Then there's chunked uploading, in which a file is split into much smaller pieces, uploaded chunk-by-chunk, and then finally reassembled into one file (via the job queue) and processed. Tools like UploadWizard and bigChunkedUpload are able to provide progress bars for users, and individual chunks can be retried if there's a brief network interruption. It is more complex to implement, but more reliable and flexible, so most upload tools and bots use it. In theory, users can use chunked uploads to upload files up to the maximum file size of 4GB, but there may be some practical issues like server-side timeouts.
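The chunk-and-retry idea can be sketched locally without touching any real server. The following is a minimal illustration, not UploadWizard's or MediaWiki's actual code: `upload_in_chunks`, `FakeServer`, the 5 MB default chunk size, and the retry count are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Iterator


def iter_chunks(data: bytes, chunk_size: int) -> Iterator[tuple[int, bytes]]:
    """Yield (offset, chunk) pairs that cover the data in order."""
    for offset in range(0, len(data), chunk_size):
        yield offset, data[offset:offset + chunk_size]


def upload_in_chunks(
    data: bytes,
    send_chunk: Callable[[int, bytes], None],
    chunk_size: int = 5 * 1024 * 1024,  # hypothetical chunk size
    max_attempts: int = 3,              # hypothetical retry limit
) -> None:
    """Send each chunk in turn, retrying only the chunk that failed."""
    for offset, chunk in iter_chunks(data, chunk_size):
        for attempt in range(1, max_attempts + 1):
            try:
                send_chunk(offset, chunk)
                break  # this chunk is done, move on to the next one
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt


class FakeServer:
    """Stand-in for the server side: stores chunks, then reassembles them."""

    def __init__(self, fail_first_n: int = 0) -> None:
        self.chunks: dict[int, bytes] = {}
        self.failures_left = fail_first_n

    def receive(self, offset: int, chunk: bytes) -> None:
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("simulated network interruption")
        self.chunks[offset] = chunk

    def reassemble(self) -> bytes:
        # Join the stored chunks in offset order to rebuild the file.
        return b"".join(self.chunks[o] for o in sorted(self.chunks))
```

Because each chunk carries its own offset, a brief interruption costs only one chunk's worth of retransmission rather than the whole file, which is the trade-off the paragraph above describes.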

Finally, some editors and administrators can upload a file by specifying its URL. MediaWiki will download the file from the remote website, and then process it as if the user had uploaded it. This is convenient for users, as they don't need to download the file themselves before re-uploading it. However, this comes with limitations, as the server needs to be able to download and finish the entire upload in 180 seconds, and downloading from some sources (especially the Internet Archive) might be too slow for that.
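To get a feel for what the 180-second limit means in practice, here is a rough back-of-the-envelope sketch. It only models download time and ignores processing, so it is optimistic; the function names and the throughput figures are assumptions for illustration, not measurements of any real source.

```python
def fits_time_budget(
    size_bytes: int,
    bytes_per_second: float,
    budget_seconds: float = 180.0,  # the limit mentioned in the text
) -> bool:
    """Return True if a download at this speed finishes within the budget."""
    return size_bytes / bytes_per_second <= budget_seconds


def max_size_for_budget(
    bytes_per_second: float,
    budget_seconds: float = 180.0,
) -> int:
    """Largest file size (in bytes) that can be fetched within the budget."""
    return int(bytes_per_second * budget_seconds)
```

For example, at an assumed 1,000,000 bytes per second, `max_size_for_budget(1_000_000)` gives 180,000,000 bytes, so a 500 MB file from a source that slow would not make it in time.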

Once the file is actually on the server, MediaWiki does some security checks on each file. It's easy to hide arbitrary files (think malware or copyrighted stuff) inside JPEG images, which we want to reject. This was exploited by some users on Wikipedia Zero networks to share pirated films without the traffic counting against their data plans. SVG files can be written in a way that allows triggering cross-site scripting (XSS) attacks; MediaWiki rejects those too. There are also some validity checks, like making sure a file named "Foo.png" is actually a PNG file.
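The "Foo.png is actually a PNG" check can be illustrated by comparing the file's first bytes against well-known format signatures. This is a simplified sketch, not MediaWiki's real verification logic: the `MAGIC` table and `extension_matches_content` are hypothetical, cover only three formats, and do nothing about hidden payloads or malicious SVGs.

```python
from pathlib import PurePath

# Well-known leading byte signatures for a few image formats.
MAGIC = {
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".gif": (b"GIF87a", b"GIF89a"),
}


def extension_matches_content(filename: str, head: bytes) -> bool:
    """Check that the first bytes of a file agree with its extension."""
    ext = PurePath(filename).suffix.lower()
    signatures = MAGIC.get(ext)
    if signatures is None:
        return False  # unknown extension: this sketch simply rejects it
    return head.startswith(signatures)
```

A JPEG renamed to "Foo.png" fails this test because its first bytes are `FF D8 FF` rather than the PNG signature.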

MediaWiki displays the metadata it knows about at the bottom of each file description page.

At this point, MediaWiki will extract some metadata from the file, like its size, geolocation, and other Exif fields. This data is stored separately from the file itself for quicker retrieval. PDF and DjVu files that contain text will have that extracted and indexed for search.
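As a small illustration of metadata extraction, the pixel dimensions of a PNG can be read straight from its header without decoding the image. This is only a sketch of the general idea, not how MediaWiki does it; `read_png_dimensions` and `extract_metadata` are hypothetical helpers, and Exif or geolocation parsing is out of scope here.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) from the IHDR chunk at the start of a PNG."""
    if len(data) < 24 or not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file or header too short")
    # After the 8-byte signature: 4-byte chunk length, 4-byte chunk type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")
    # IHDR data begins with big-endian width and height.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def extract_metadata(data: bytes) -> dict[str, int]:
    """Build a small metadata record that could be stored apart from the file."""
    width, height = read_png_dimensions(data)
    return {"size_bytes": len(data), "width": width, "height": height}
```

Keeping such a record separate from the file means a page can show the dimensions without fetching or opening the file itself.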

Finally, the original file is uploaded to Swift in both of the primary data centers (Virginia and Texas). Even though this step takes place between Wikimedia servers, it is encrypted using HTTPS in case an attacker is able to tap into cross-data center communications. MediaWiki will also instruct Thumbor to pre-generate thumbnails for common sizes, so users see no delay when trying to use them in an article. If a new version of an existing file is uploaded, MediaWiki also renames the previous version in Swift and deletes all the old thumbnails.
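Pre-generating thumbnails amounts to computing a set of scaled-down dimensions up front. The sketch below shows one plausible way to do that arithmetic while preserving the aspect ratio; the list of target widths is an assumption made for the example, not Wikimedia's actual configuration, and no call to Thumbor is made.

```python
def thumbnail_sizes(
    width: int,
    height: int,
    target_widths: tuple[int, ...] = (120, 320, 640, 1280),  # assumed sizes
) -> list[tuple[int, int]]:
    """Return (width, height) pairs for thumbnails smaller than the original."""
    sizes = []
    for target in target_widths:
        if target >= width:
            continue  # never scale up beyond the original
        # Scale the height by the same factor, keeping at least one pixel.
        scaled_height = max(1, round(height * target / width))
        sizes.append((target, scaled_height))
    return sizes
```

For a 4000×3000 original this yields 120×90, 320×240, 640×480, and 1280×960, while a 500×250 original only gets the two smallest sizes.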

Once a file has been uploaded, there's no way to modify it in MediaWiki itself. Basic operations like rotating or cropping need to be done with external tools like CropTool.

A complex process

The process for uploading files has grown more and more complex as requirements and scale have increased. The current system is rather optimized for delivering users small images quickly, and less so for handling the upload and processing of very large files. For an encyclopedia that is still mostly focused on text and images, that may be fine. But as people ask for and expect more interactive and engaging elements, that may need to change.

There has been no dedicated Wikimedia Foundation development team focusing on backend media development in the past few years (it's debatable whether it ever had one), with critical components like Thumbor being entirely unmaintained at Wikimedia and outdated. For now, many of the gaps are being filled with various tools and gadgets by those who are interested.

And if you desire some nostalgia, you can still send an email (or rather, file a Phabricator task) and sysadmins will upload the files for you.



Discuss this story


We really need to offer more powerful (and well-maintained) upload tools, for both individual and institutional users. The fact that nobody owns this issue at the moment is simply crazy, also considering the Strategic Direction. --Gnom (talk) 16:24, 29 November 2021 (UTC)

Agreed, except that I'm not sure how the Strategic Direction is helpful in that regard - are you referring to this very brief mention there: "... there are many external factors that we must account for to plan for the future. Many readers now expect multimedia formats beyond text and images"? Regards, HaeB (talk) 16:38, 29 November 2021 (UTC)
@HaeB: Well, the first sentence of the Strategic Direction says that we want to become the essential infrastructure of the ecosystem of free knowledge. Allowing people to easily and professionally upload files on Commons is a pretty basic building block for an "essential infrastructure", I'd say. --Gnom (talk) 23:30, 29 November 2021 (UTC)
@Gnom: If this kind of handwavy interpretation of generic language from the strategy documents is the best we can do, then they are even less useful than I thought for the purpose of actually prioritizing work of strategic importance. I mean, I fully agree with you in this case, but it's easy to imagine the same nine words from the Strategic Direction (which are even less concrete than the WMF's mission statement) being similarly cited in support of all kinds of other less impactful efforts. A strategy that doesn't facilitate meaningful prioritization is not worth much.
Regards, HaeB (talk) 12:19, 1 December 2021 (UTC)
As Amir said in his excellent recent rant: "And it all boils down to not having a dedicated team on multimedia but in all fairness, it's not something you can fix overnight. You need to grow, hire, plan, etc. etc."
On that note, this Signpost article's otherwise great historical context section should have mentioned that starting in the mid 2010s, the WMF already had a dedicated multimedia team for a while (which the article alludes to in veiled form further down, but doesn't go into). Its formation was motivated by many of the same issues that persist today, see this 2013 announcement: "Breaking through walls of text: How we will create a richer Wikimedia experience [...] There has never been a well-resourced team fully dedicated to multimedia engineering work at the Wikimedia Foundation. This is about to change. ..." Of the challenges listed there eight years ago, the team successfully addressed some (in particular the lack of "a standard lightbox viewer for media embedded in an article"), but then was pulled into other directions and later dissolved before making any serious impact on the video UX or, apparently, the technical issues discussed in this Signpost article.
Regards, HaeB (talk) 01:21, 2 December 2021 (UTC)
@HaeB: I remember back when MediaViewer was going to be a "quick win" before working on more complicated projects. ROFL. Bawolff (talk) 07:35, 11 January 2022 (UTC)



       

The Signpost · written by many · served by Sinepost V0.9 · 🄯 CC-BY-SA 4.0