There are numerous industry conferences dedicated to web performance. We have attended and spoken at several of them, and noticed that important topics remain underrepresented. While the logistics of organizing a conference are too daunting for our small team, FOSDEM presents an appealing compromise.
The Wikimedia Performance Team organized the inaugural Web Performance devroom at FOSDEM 2020.
FOSDEM is the biggest Free and Open Source software conference in the world. It takes place in Brussels every year, is free to attend, and attracts over 8000 attendees. FOSDEM is known for its many self-organized conference tracks, called “devrooms”. FOSDEM takes care of the logistics, while we focus on programming the content. We ran our own CfP, curated and invited speakers, and emceed the event.
This year saw the completion of two milestones on the MediaWiki Multi-DC roadmap. Multi-DC is a cross-team initiative, driven by the Performance Team, to evolve MediaWiki for operation from multiple datacenters. It is motivated by higher resilience and by eliminating steps from switchover procedures, which eases or enables routine maintenance by allowing clusters to be turned off without a major switchover event.
The Multi-DC initiative has brought performance and resiliency improvements across the MediaWiki codebase, and at every level of our infrastructure. These gains are effective even in today’s single-DC operation. We resolved long-standing tech debt and improved extension interfaces, which increased developer productivity. We also reduced dependencies and coupling, restructured business logic, and implemented asynchronous eventual-consistency solutions.
This year we applied the Multi-DC strategy to MediaWiki’s ChronologyProtector (T254634), and started work on the MainStash DB (T212129).
Today we collect real-user data from pageviews, which alerts us when a regression happens, but doesn’t help us investigate why or how to fix it. Synthetic testing complements this for desktop browsers, but we have no equivalent for mobile devices. Desktop browsers have an “emulate mobile” option, but DevTools emulation is nothing like real mobile devices.
The goal of the mobile device lab is to find performance regressions on Wikipedia that are relevant to the experience of our mobile users. Alerts will include detailed profiles for investigation, as we have for desktop browsers today.
Starting in 2020, we give out a Web Perf Hero award to individuals who have gone above and beyond to improve site performance. It’s awarded (up to) once a quarter to individuals who demonstrate repeated care and discipline around performance.
Since 2018, we have run an ongoing survey measuring performance perception on several Wikipedias. You can find the main findings in last year’s blog post. An important take-away was that none of the standard or new metrics we tried correlates well with real user experience. The “best” metric (page load time) scored a Pearson correlation coefficient of a mere 0.14 (on a scale from -1 to 1, where 0 means no correlation). As such, it remains valuable to survey real perceived performance, as an empirical barometer to validate other performance monitoring.
Data from three cohorts, seen in Grafana. You can see that there is a loose correlation with page load time (“loadEventEnd”). When site performance degrades (time goes up), satisfaction gets worse too (the positive percentage goes down). Likewise, when load time improves (yellow goes down), satisfaction improves (green goes up).
“How to Logstash at Wikimedia” (🎥 watch, 📙 slides), explains how we monitor production errors with Logstash dashboards, and demonstrates setting up a triage workflow.
Existing frontend metrics correlated poorly with user-perceived performance. It became clear that the best way to understand perceived performance is still to ask people directly about their experience. We set out to run our own survey to do exactly that, and to look for correlations between a range of well-known and novel performance metrics and the lived experience. We partnered with Dario Rossi of Telecom ParisTech and with Wikimedia Research to carry out the study (T187299).
While machine learning failed to explain everything, the survey unearthed many key findings. It gave us newfound appreciation for the old-school Page Load Time metric, as the metric that best (or least terribly) correlated with the real human experience.
The Performance Team has been participating in web standards as individual “invited experts” for a while. We initiated the work for the Wikimedia Foundation to become a W3C member organization, and by March 2019 it was official.
As a member organization, we now collaborate in W3C working groups alongside other major stakeholders of the Web!
In the search for a better user experience metric, we tried out the upcoming Element Timing API for images. This is meant to measure when a given image is displayed on-screen. We enrolled wikipedia.org in the ongoing Google Chrome origin trial for the Element Timing API.
The upcoming Event Timing API is meant to help developers identify slow event handlers on web pages. This is an area of web performance that hasn’t gotten a lot of attention, but its effects can be very frustrating for users.
Via another Chrome origin trial, this experiment gave us an opportunity to gather data, discover bugs in several MediaWiki extensions, and provide early feedback on the W3C Editor’s Draft to the browser vendors designing this API.
We decided to commission the implementation of a browser feature that measures performance from an end-user perspective. The Paint Timing API measures when content appears on-screen for a visitor’s device. This was, until now, a largely Chrome-only feature. Being unable to measure such a basic user experience metric for Safari visitors risks long-term bias, negatively affecting over 20% of our audience. It’s essential that we maintain equitable access and keep Wikimedia sites fast for everyone.
We funded and oversaw the implementation of the Paint Timing API in WebKit. We contracted Noam Rosenthal, who brings experience in both web standards and upstream WebKit development.
ResourceLoader is Wikipedia’s delivery system for styles, scripts, and localization. It delivers JavaScript code on web pages in two stages. This design prioritizes the user experience through optimal cache performance of HTML and individual modules, and through a consistent experience between page views (i.e. no flip-flopping between pages based on when they were cached). It also achieves a great developer experience by ensuring we don’t mix incompatible versions of modules on the same page, and by ensuring that rollout (and rollback) of deployments completes worldwide in under 10 minutes.
This design rests on the first stage (the startup manifest) staying small. We carried out a large-scale audit that shrank the manifest size back down, and put monitoring and guidelines in place. This work was tracked under T202154.
Identify modules that are unused in practice. This included picking up unfinished or forgotten software deprecations, and removing code for obsolete browser compatibility.
Consolidate modules that did not represent an application entrypoint or logical bundle. Extensions are encouraged to use directories and file splitting for internal organization. Some extensions were registering internal files and directories as public module bundles (like a linker or autoloader), thus growing the startup manifest for all page views. (A registration sketch follows after this list.)
Shrink the registry holistically through clever math and improved compression.
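As an aside to the consolidation item above, here is a minimal sketch (the extension and file names are hypothetical) of how a single logical bundle can be registered with ResourceLoader using `packageFiles`, so that internal file organization no longer shows up as extra entries in the startup manifest:

```php
// Hypothetical "Example" extension: one public module bundle instead of
// one registered module per internal file.
$wgResourceModules['ext.example'] = [
	'localBasePath' => __DIR__ . '/resources',
	'remoteExtPath' => 'Example/resources',
	'packageFiles' => [
		'init.js',    // entry point
		'util.js',    // internal helper, not a separate public module
		'Dialog.js',
	],
	'styles' => [ 'example.css' ],
	'dependencies' => [ 'mediawiki.api' ],
];
```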
We wrote new frontend development guides as reference material, enabling developers to understand how each stage of the page load process is impacted by different types of changes. We merged and redirected various older guides in favor of this one.
We published our first AS report, which explores the experience of Wikimedia visitors by their IP network (such as mobile carriers and Internet service providers, also known as Autonomous Systems).
This new monthly report is notable for how it accounts for differences in device type and device performance, because device ownership and content choice are not equally distributed among people and regions. We believe our method creates a fair assessment that focuses specifically on the connectivity of mobile carriers and internet service providers to Wikimedia datacenters.
The goal is to watch the evolution of these metrics over time, allowing us to identify improvements and potential pain points.
Introduce automatic creation of performance metrics that measure specific chunks of MediaWiki code in core and extensions. Powered by WANObjectCache, via the new WANObjectCache keygroup dashboard in Grafana (T197849). A usage sketch follows after this list.
Develop and launch WikimediaDebug v2 featuring inline performance profiling, dark mode, and Beta Cluster support.
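The WANObjectCache item above refers to the caching interface most MediaWiki code uses. Here is a minimal usage sketch (the `Thing::loadFromDatabase()` loader and the key name are hypothetical); the keygroup dashboard aggregates metrics by cache keys like the one built here:

```php
use MediaWiki\MediaWikiServices;

$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$value = $cache->getWithSetCallback(
	// Key: the first component ("example-thing") identifies the key group.
	$cache->makeKey( 'example-thing', $id ),
	$cache::TTL_HOUR,
	function ( $oldValue, &$ttl, array &$setOpts ) use ( $id ) {
		// Regeneration callback: only runs on a cache miss.
		return Thing::loadFromDatabase( $id ); // hypothetical loader
	}
);
```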
TL;DR: On-wiki search “supports” a lot of “languages”. “Search supports more than 50 language varieties” is a defensible position to take. “Search supports more than 40 languages” is 100% guaranteed! Precise numbers present a philosophical conundrum.
Recently, someone asked the Wikimedia Search Platform Team how many languages we support.
This is a squishy question!
The definition of what qualifies as a language is very squishy. We can try to avoid some of the debate by outsourcing the decision to the language codes we use—different codes equal different languages—though it won’t save us.
Another squishy concept is what we mean by “support”, since the level of language-specific processing provided for each language varies wildly, and even what it means to be “language-specific” is open to interpretation. But before we unrecoverably careen off into the land of philosophy of language, let’s tackle the easier parts of the question.
Full Support
“Full” support for many languages means that we have a stemmer or tokenizer, a stop word list, and we do any necessary language-specific normalization. (See the Anatomy of Search series of blog posts, or the Bare-Bones Basics of Full-Text Search video for technical details on stemmers, tokenizers, stop words, normalization, and more.)
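For a concrete sense of what that means in practice, here is an illustrative sketch (the names are made up, not our production settings) of Elasticsearch analysis settings expressed as a PHP array, the way CirrusSearch assembles its configuration: a tokenizer, a stop word list, a stemmer, and general normalization such as lowercasing.

```php
// Illustrative analysis settings for a hypothetical language.
$analysisConfig = [
	'filter' => [
		'example_stop' => [ 'type' => 'stop', 'stopwords' => '_english_' ],
		'example_stemmer' => [ 'type' => 'stemmer', 'language' => 'english' ],
	],
	'analyzer' => [
		'text' => [
			'type' => 'custom',
			'tokenizer' => 'standard',
			'filter' => [ 'lowercase', 'example_stop', 'example_stemmer' ],
		],
	],
];
```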
CirrusSearch/Elasticsearch/Lucene
The wiki-specific custom component of on-wiki search is called CirrusSearch, which is built on the Elasticsearch search engine, which in turn is built on the Apache Lucene search library.
Sorani has language code ckb, and it is often called Central Kurdish in English.
Persian and Thai do not have stemmers, but that seems to be because they don’t need them.
Running Count: 33
Elasticsearch 7.10 also has two other language analyzers:
The “Brazilian” analyzer is for Brazilian Portuguese, which is represented by a sub-language code (pt-br). However, the Brazilian analyzer’s components are all separate from the Portuguese analyzer’s, and we do use it for the brwikimedia wiki (“Wiki Movimento Brasil”).
The “CJK” analyzer (which stands for “Chinese, Japanese, and Korean”) only normalizes non-standard half-width and full-width characters (ア→ア and A→A), breaks up CJK characters into overlapping bigrams (e.g., ウィキペディア is indexed as ウィ, ィキ, キペ, ペデ, ディ, and ィア), and applies some English stop words. That’s not really “full” support, so we won’t count it here. (We also don’t use it for Chinese or Korean.)
We will count Brazilian Portuguese as a language that we support, but also keep a running sub-tab of “maybe only sort of distinct” language varieties.
We’ll come back to Chinese, Japanese, Korean, and the CJK analyzer a bit later.
Running Count: 33–34 (33 languages + 1 major language variety)
We have found some open source software that does stemming or other processing for particular languages: some as Elasticsearch plugins, some as stand-alone Java code, and some in other programming languages. We have used, wrapped, or ported them as needed to make the algorithms available for our wikis.
We have open-source Serbian, Esperanto, and Slovak stemmers that we ported to Elasticsearch plugins.
There are currently no stop word lists for these languages. However, for a typical significantly inflected alphabetic Indo-European language,† a decent stemmer is the biggest single improvement that can be added to an analysis chain for that language. Stop words are very useful, but general word statistics will discount them even without an explicit stop word list.
Having a stemmer (for a language that needs one) can count as the bare minimum for “full” support.
[†] English is weird in that it is not significantly inflected. Non-Indo-European languages can have very different inflection patterns (like Inuit—so much!—or Chinese—so little!), and non-alphabetic writing systems (like Arabic or Chinese) can have significantly different needs beyond stemming to count as “fully” supported.
For Chinese (Mandarin) we have something beyond the not-so-smart (but much better than nothing!) CJK analyzer provided by Elasticsearch/Lucene. Chinese doesn’t really need a stemmer, but it does need a good tokenizer to break up strings of text without spaces into words. That’s the most important component for Chinese, and we found an open-source plugin to do that. Our particular instantiation of Chinese comes with additional complexity because we allow both Traditional and Simplified characters, often in the same sentence. We have an additional open-source plugin to convert everything to Simplified characters internally.
For Hebrew we found an open-source Elasticsearch plugin that does stemming. It also handles the ambiguity caused by the lack of vowels in Hebrew (by sometimes generating more than one stem).
For Korean, we have another open-source plugin that is much better than the very basic processing provided by the CJK analyzer. It does tokenizing and part-of-speech tagging and filtering.
For Polish and Ukrainian, we found an open-source plugin for each that provides a stemmer and stop word list. They both needed some tweaking to handle odd cases, but overall both were successes.
Running Count: 41–42 (41 languages + 1 major language variety)
Shared Configs
Some languages come in different varieties. As noted before, the distinction between “closely related languages” and “dialects” is partly historical, political, and cultural. Below are some named language varieties with distinct language codes that share language analysis configuration with another language. How you count these is a philosophical question, so we’ll incorporate them into our numerical range.
Egyptian Arabic and Moroccan Arabic use the same configuration as Standard Arabic. Originally they had some extra stop words, but it turned out to be better to use those stop words in Standard Arabic, too. Add two languages/language varieties.
Serbo-Croatian—also called Serbo-Croat, Serbo-Croat-Bosnian (SCB), Bosnian-Croatian-Serbian (BCS), and Bosnian-Croatian-Montenegrin-Serbian (BCMS)—is a pluricentric language with four mutually intelligible standard varieties, namely Serbian, Croatian, Bosnian, and Montenegrin. For various historical and cultural reasons, we have Serbian, Croatian, and Bosnian (but no Montenegrin) wikis, as well as Serbo-Croatian wikis. The Serbian and Serbo-Croatian Wikipedias support Latin and Cyrillic, while the Croatian and Bosnian Wikipedias are generally in Latin script. The Bosnian, Croatian, and Serbo-Croatian wikis use the same language analyzer as the Serbian wikis. Add three languages/language varieties.
Malay is very closely related to Indonesian—close enough that we can use the Elasticsearch Indonesian analyzer for Malay. (Indonesian is a standardized variety of Malay.) Add another language/language variety.
Running Count: 41–48 (41 languages + 7 major language varieties)
Moderate Language-Specific Processing
These languages have some significant language-specific(ish) processing that improves search, while still lacking some obvious component (like a stemmer or tokenizer).
For Japanese, we currently use the CJK analyzer (described above). This is the bare minimum of custom configuration that might be considered “moderate” support. It also stretches the definition of “language-specific”, since bigram tokenizing—which would be useful for many languages without spaces—isn’t really specific to any language, though the decision to apply it is language-specific.
There is a “full” support–level Japanese plugin (Kuromoji) that we tested years ago (and have configured in our code, even), but we decided not to use it because of some problems. We have a long-term plan to re-evaluate Kuromoji (and our ability to customize it for our use cases) and see if we could productively enable it for Japanese.
The Khmer writing system is very complex and—for Historical Technological Reasons™—there are lots of ways to write the same word that all look the same, but are underlyingly distinct sequences of characters. We developed a very complex system that normalizes most sequences to a canonical order. The ICU Tokenizer breaks up Khmer text (which doesn’t use spaces between words) into orthographic syllables, which are very often smaller than words. It’s somewhat similar to breaking up Chinese into individual characters—many larger “natural” units are lost, but all of their more easily detected sub-units are indexed for searching.
This is probably the maximum level of support that counts as “moderate”. It’s tempting to move it to “full” support, but true full support would require tokenizing the Khmer syllables into Khmer words, which requires a dictionary and more complex processing. On the other hand, our support for the wild variety of ways people can (and do!) write Khmer is one place where we currently outshine the big internet search engines.
For Mirandese, we were able to work with a community member to set up elision rules (for word-initial l’, d’, etc., as in some other Romance languages) and translate a Portuguese stop word list.
Running Count: —Full: 41–48 (41 languages + 7 major language varieties) —Moderate: 3
Azerbaijani, Crimean Tatar, Gagauz, Kazakh, and Tatar have the smallest possible amount of language-specific processing. Like Turkish, they use the uppercase/lowercase pairs İ/i and I/ı, so they have the Turkish version of lowercasing configured.
However, Tatar is generally written in Cyrillic (at least on-wiki). Kazakh is also generally in Cyrillic on-wiki, and the switch to using İ/i and I/ı in the Kazakh Latin script was only made in 2021, so maybe we should count that as half?
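As a sketch of just how small that “minimal” configuration is (the names are illustrative, not our exact production settings), it amounts to swapping the default lowercase token filter for its Turkish variant:

```php
// Turkish-aware lowercasing so that İ/i and I/ı are paired correctly.
$turkishAnalysis = [
	'filter' => [
		'lowercase_turkish' => [ 'type' => 'lowercase', 'language' => 'turkish' ],
	],
	'analyzer' => [
		'text' => [
			'type' => 'custom',
			'tokenizer' => 'standard',
			'filter' => [ 'lowercase_turkish' ],
		],
	],
];
```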
Running Count: —Full: 41–48 (41 languages + 7 major language varieties) —Moderate: 3 —Minimal: 4½–5
(Un)Intentional Specific Generic Support
Well there’s a noun phrase you don’t see every day—what does it even mean?
Sometimes a language-specific (or wiki community–specific) issue gets generalized to the point where there’s no trace of the motivating source. Conversely, a generic improvement can have an outsized impact on a specific language, wiki, or community.
For example, the Nias language uses lots of apostrophes, and some of the people in its Wikipedia community are apparently more comfortable composing articles in word processors, with the text then being copied to the Nias Wikipedia. Some word processors like to “smarten” quotes and apostrophes, automatically replacing them with the curly variants. This kind of variation makes searching hard. When I last looked (some time ago) it also resulted in Nias Wikipedia having article titles that only differ by apostrophe curliness—I assume people couldn’t find the one so they created the other. Once we got the Phab ticket, we added some Nias-specific apostrophe normalization that fixed a lot of their problems.
Does Nias-specific apostrophe normalization count as supporting Nias? It might arguably fall into the “minimal” category.
About a year later, we cautiously and deliberately tested similar apostrophe normalization for all wikis, and eventually added it as a default, which removed all Nias-specific config in our code.
Does general normalization inspired by a strong need from the Nias Wiki community (but not really inherent to the Nias language) count as supporting Nias? I don’t even know.
At another time, I extended some general normalization upgrades that remove “non-native” diacritics to a bunch of languages. An unexpectedly large benefit was in Basque: Basque searchers often omit Spanish diacritics on Spanish words, while editors use the correct diacritics in articles, creating a mismatch.
If I hadn’t bothered to do some analysis after going live, I wouldn’t have known about this specific noticeable improvement. On the other hand, if I’d known about the specific problem and there wasn’t a semi-generic solution, I would’ve wanted to implement something Basque-specific to solve it.
Does a general improvement that turns out to strongly benefit Basque count as supporting Basque? I don’t even know! (In practice, this is a slightly philosophical question, since Basque has a stemmer and stopword list, too, so it’s already otherwise on the “full support” list.)
I can’t think of any other language-specific cases that generalized so well—though Nias wasn’t the first or only case of apostrophe-like characters needing to be normalized.
Of course, general changes that were especially helpful to a particular language are easy to miss, if you don’t go looking for them. Even if you do, they can be subtle. The Basque case was much easier for me, personally, to notice, because I don’t speak Basque, but I know a little Spanish, so the Spanish words really stood out as such when looking at the data.
Running Count: —Full: 41–48 (41 languages + 7 major language varieties) —Moderate: 3 —Minimal: 4½–5 —I Don’t Even Know: 2+
Vague Categorical Support
It’s easy enough to say that the CJK analyzer supports Japanese (where we are currently using it) and that it would be supporting Chinese and Korean if we were using it for those languages—in small part because it has limited scope, and in large part because it seems specific to Chinese, Japanese, and Korean because of the meaning of “CJK”.
But what about a configuration that is not super specific, but still applied to a subset of languages?
Back in the day, we identified that “spaceless languages” (those whose writing system doesn’t put spaces between words) could benefit from (or be harmed by) specific configurations.
We identified the following languages as “spaceless”. We initially passed on enabling an alternate ranking algorithm (BM25) for them (Phab T152092), but we did deploy the ICU tokenizer for them by default.
Tibetan, Dzongkha, Gan, Japanese, Khmer, Lao, Burmese, Thai, Wu, Chinese, Classical Chinese, Cantonese, Buginese, Min Dong, Cree, Hakka, Javanese, and Min Nan.
14 of those are new.
We eventually did enable BM25 for them, but this list has often gotten special consideration and testing to make sure we don’t unexpectedly do bad things to them when we make changes that seem fine for languages with clearer word boundaries (like Phab T266027).
And what about the case where the “category” we are trying to support is “more or less all of them”? Our recent efforts at cross-wiki “harmonization”—making all language processing that is not language-specific as close to the same as possible on all wikis (see Phab T219550)—was a rising language tide that lifted all/most/many language boats. (An easy to understand example is acronym processing, so that NASA and N.A.S.A. can match more easily. However, some languages—because of their writing systems—have few if any native acronyms. Foreign acronyms (like N.A.S.A.) still show up, though.)
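As an example of the kind of harmonized, not-really-language-specific processing described above, acronym handling can be implemented as a character filter that strips the periods inside “N.A.S.A.” so it matches “NASA”. The sketch below is simplified and illustrative: the filter name and regex are not our production rule, which is considerably more careful about word boundaries and writing systems.

```php
// A pattern_replace character filter: remove a period that sits between
// two single letters, so "N.A.S.A" indexes like "NASA". (Simplified: the
// trailing period of "N.A.S.A." is left alone here.)
$acronymCharFilter = [
	'type' => 'pattern_replace',
	'pattern' => '(?<=\\b\\p{L})\\.(?=\\p{L}\\b)',
	'replacement' => '',
];
```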
Running Count: —Full: 41–48 (41 languages + 7 major language varieties) —Moderate: 3 —Minimal: 4½–5 —I Don’t Even Know: 0–∞
So far we’ve focussed on the most obviously languagey of the language support in Search, which is language analysis. However, there are other parts of our system that support particular wikis in a language-specific way.
Learning to Rank
Learning to Rank (LTR) is a plugin that uses machine learning—based on textual properties and user behavior data—to re-rank search results to move better results higher in the result list.
It makes use of many ranking signals, including making wiki-specific interpretations of textual properties—like word frequency stats, the number of words in a query or document, the distribution of matching terms, etc.
Arguably some of what the model learns is language-specific. Some is probably wiki-specific (say, because Wikipedia titles are organized differently than Wikisource titles), and some may be community-specific (say, searchers search differently on Wikipedia than they do on Wiktionary).
The results are the same or better than our previously hand-tuned ranking, and the models are regularly retrained, allowing them to keep up with changes to the way searchers behave in those languages on those wikis.
Does that count as minimal language-specific support? Maybe? Probably?
Running Count: —Full: 41–48 (41 languages + 7 major language varieties) —Moderate: 3 —Minimal: 4½–6 —I Don’t Even Know: 0–∞
Cross-Language Searching
Years ago we worked on a project on some wikis to do language detection on queries that got very few or no results, to see if we could provide results from another wiki. The process was complicated, so we only deployed it to nine of the largest (by search volume) Wikipedias:
Dutch, English, French, German, Italian, Japanese, Portuguese, Spanish, and Russian.
Those are all covered by language analyzers above. However, for each of those wikis, we limited the specific languages that could be identified by the language-ID tool (called TextCat), to maximize accuracy and relevance.
Nine of those are not covered by the language analyzers, and eight are not covered by the LTR plugin: Afrikaans, Breton, Burmese, Georgian, Icelandic, Latin, Tagalog, Telugu, and Urdu. (Vietnamese is covered both by Learning to Rank and TextCat.)
Does sending queries from the largest wikis to other wikis count as some sort of minimal support? Maybe. Arguably. Perhaps.
Running Count: —Full: 41–48 (41 languages + 7 major language varieties) —Moderate: 3 —Minimal: 4½–14 —I Don’t Even Know: 0–∞
Conclusions?
What, if any, specific conclusions can we draw? Let’s look again at the list we have so far (even though it is also right above).
“Final” Count: —Full: 41–48 (41 languages + 7 major language varieties) —Moderate: 3 —Minimal: 4½–14 —I Don’t Even Know: 0–∞
We have good to great support (“moderate” or “full”) for 44 inarguably distinct languages, though it’s very reasonable to claim 51 named language varieties.
The Search Platform team loves to make improvements to on-wiki search that are relevant to all or almost all languages (like acronym handling) or that help all wikis (like very basic parsing for East Asian languages on any wiki). So, how many on-wiki communities does the Search team support? All of them, of course!
Exactly how many languages is that? I don’t even know.
(This blog post is a snapshot from July 2024. If you are from the future, there may be updated details on mediawiki.org.)
Summary: this article shares the experience and lessons learned from migrating the Wikimedia Toolforge platform away from Kubernetes PodSecurityPolicy and onto Kyverno.
Wikimedia Toolforge is a Platform-as-a-Service, built with Kubernetes, and maintained by the Wikimedia Cloud Services team (WMCS). It is completely free and open, and we welcome anyone to use it to build and host tools (bots, webservices, scheduled jobs, etc) in support of Wikimedia projects.
We provide a set of platform-specific services, command line interfaces, and shortcuts to help with tasks like setting up webservices and jobs, building container images, and using databases. Using these interfaces makes the underlying Kubernetes system pretty much invisible to users. We also allow direct access to the Kubernetes API, and some advanced users do interact with it directly.
Each account has a Kubernetes namespace where they can freely deploy their workloads. We have a number of controls in place to ensure performance, stability, and fairness of the system, including quotas, RBAC permissions, and, until recently, PodSecurityPolicies (PSP). At the time of this writing, we had around 3,500 Toolforge tool accounts in the system. We adopted PSP early, in 2019, as a way to make sure Pods had the correct runtime configuration: we needed Pods to stay within the safe boundaries of a set of pre-defined parameters. Back when we adopted PSP there was already the option to use third-party agents, like Open Policy Agent Gatekeeper, but we decided not to invest in them and went with a native, built-in mechanism instead.
The WMCS team explored different alternatives for this migration, but eventually decided to go with Kyverno as a replacement for PSP. That decision began the journey described in this blog post.
First, we needed to refactor the source code of one of the key components of Toolforge Kubernetes: maintain-kubeusers. This custom piece of software, built in-house, contains the logic to fetch accounts from LDAP and do the necessary instrumentation on Kubernetes to accommodate each one: create the namespace, RBAC, quota, a kubeconfig file, and so on. With the refactor, we introduced a proper reconciliation loop, so that the software has a notion of what needs to be created, deleted, or upgraded for each account, and what is missing. This allows us to easily deploy new resources for each account, or iterate on their definitions.
The initial version of the refactor had a number of problems, though. For one, the new version of maintain-kubeusers was doing more filesystem interaction than the previous version, resulting in a slow reconciliation loop over all the accounts. We use NFS as the underlying storage system for Toolforge, and it can be very slow for reasons beyond the scope of this blog post. This was corrected within a few days of the initial refactor rollout. A side note on an implementation detail: we store a configmap in each account namespace with the state of each resource. Storing more state in this configmap was our solution to avoid additional NFS latency.
I initially estimated this refactor would take me a week to complete, but unfortunately it took around three weeks instead. Prior to the refactor, several manual steps and cleanups were required when updating the definition of a resource. The process is now automated, more robust, performant, efficient, and clean. So in my opinion it was worth it, even if it took more time than expected.
Then, we worked on the Kyverno policies themselves. Because we had a very particular PSP setup, we tried to replicate its semantics on a 1:1 basis as much as possible, to ease the transition. This involved things like transparent mutation of Pod resources followed by validation. Additionally, we had a different PSP definition for each account, so we decided to create a different Kyverno namespaced policy resource for each account namespace — remember, we had 3.5k accounts.
For developing and testing all this (maintain-kubeusers and the Kyverno bits), we had a project called lima-kilo: a local Kubernetes setup replicating production Toolforge. Each engineer used it on their laptop as a common development environment.
We had planned the migration from PSP to Kyverno policies in stages, like this:
update our internal template generators to make Pod security settings explicit
introduce Kyverno policies in Audit mode
observe how the cluster behaves with them, and correct any offending resources reported by the new policies
modify Kyverno policies and set them in Enforce mode
drop PSP
In stage 1, we updated things like the toolforge-jobs-framework and tools-webservice.
In stage 2, when we deployed the 3.5k Kyverno policy resources, our production cluster died almost immediately. Surprise. All the monitoring went red, the Kubernetes apiserver became unresponsive, and we were unable to perform any administrative actions on the Kubernetes control plane, or even the underlying virtual machines. All Toolforge users were impacted. This was a full-scale outage that required the energy of the whole WMCS team to recover from. We temporarily disabled Kyverno until we could learn what had occurred.
This incident happened despite prior testing in lima-kilo and in another pre-production cluster we had, called Toolsbeta. But we had not tested with that many policy resources. Clearly, this was scale-related. After the incident, I created 3.5k Kyverno policy resources on lima-kilo, and indeed I was able to reproduce the outage. We took a number of measures, corrected a few errors in our infrastructure, and reached out to the Kyverno upstream developers for advice. In the end, we did the following to adapt the setup to our needs:
corrected the external HAProxy health checks for the Kubernetes apiserver, from checking just for open TCP ports to actually checking the /healthz HTTP endpoint, which more accurately reflects the health of each apiserver.
created a more realistic development environment: in lima-kilo, we added a couple of helper scripts to create/delete 4000 policy resources, each in a different namespace.
greatly over-provisioned memory on the Kubernetes control plane servers, that is, more memory in the base virtual machines hosting the control plane. Scaling the memory headroom of the apiserver prevents it from running out of memory and therefore crashing the whole system. We went from 8GB of RAM per virtual machine to 32GB. In our cluster, a single apiserver pod could eat 7GB of memory on a normal day, so having 8GB on the base virtual machine was clearly not enough. I also sent a patch proposal to the Kyverno upstream documentation suggesting they clarify the additional memory pressure on the apiserver.
increased the number of replicas of the Kyverno admission controller to 7, so admission requests could be handled by Kyverno in a more timely fashion.
I have to admit, I was briefly tempted to drop Kyverno, and even to stop pursuing an external policy agent entirely and write our own custom admission controller, out of concerns over the performance of this architecture. However, after applying all the measures listed above, the system became very stable, so we decided to move forward. The second attempt at deploying it all went through just fine. No outage this time 🙂
When we were in stage 4, we detected another bug. We had been following the Kubernetes upstream documentation for setting securityContext to the right values. In particular, we were enforcing procMount to be set to the default value, which per the docs was ‘DefaultProcMount’. However, that string is the name of the internal variable in the source code, whereas the actual default value is the string ‘Default’. This caused pods to be rightfully rejected by Kyverno while we figured out the problem. We sent a patch upstream to fix this problem.
We finally had everything in place, reached stage 5, and were able to disable PSP. We unloaded the PSP controller from the Kubernetes apiserver and deleted every individual PSP definition. Everything was very smooth in this last step of the migration.
This whole PSP project, including the maintain-kubeusers refactor, the outage, and all the different migration stages, took roughly three months to complete.
For me, this project offered a number of valuable lessons. For one, scale is something to consider, and test for, when evaluating a new architecture or software component. Not doing so can lead to service outages or unexpectedly poor performance. This is in the first chapter of the SRE handbook, but we got a reminder the hard way 🙂
MediaWiki is the platform that powers Wikipedia and other Wikimedia projects. There is a lot of traffic to these sites. We want to serve our audience in a way that they get the best experience and performance possible. So efficiency of the MediaWiki platform is of great importance to us and our readers.
MediaWiki is a relatively large application with 645,000 lines of PHP code in 4,600 PHP files, and growing! (Reported by cloc.) When you have as much traffic as Wikipedia, working on such a project can create interesting problems.
MediaWiki uses an “autoloader” to find and import classes from PHP files into memory. In PHP, this happens on every single request, as each request gets its own process. In 2017, we introduced support for loading classes from PSR-4 namespace directories (in MediaWiki 1.31). This mechanism involves checking which directory contains a given class definition.
Problem statement
Kunal (@Legoktm) noticed that after MediaWiki 1.35, wikis became slower due to spending more time in fstat system calls. Syscalls make a program switch to kernel mode, which is expensive.
We learned that our Autoloader was the component making the fstat calls, to check for file existence. This logic powers the PSR-4 namespace feature, and actually existed before MediaWiki 1.35. But it only became noticeable after we introduced the HookRunner system, which loaded over 500 new PHP interfaces via the PSR-4 mechanism.
MediaWiki’s Autoloader has a class map array that maps class names to their file paths on disk. PSR-4 classes do not need to be present in this map. Before introducing HookRunner, very few classes in MediaWiki were loaded by PSR-4. The new hook files leveraged PSR-4, exposing many calls to file_exists() for PSR-4 directory searching on every request. This adds up quickly, degrading MediaWiki performance.
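To make the cost concrete, here is a greatly simplified sketch (not MediaWiki’s actual Autoloader code) of what PSR-4 resolution involves: every candidate directory is probed with file_exists(), and each probe translates into a filesystem stat syscall.

```php
// Simplified PSR-4 lookup: prefix match, then probe each mapped directory.
function findViaPsr4( string $class, array $psr4Prefixes ): ?string {
	foreach ( $psr4Prefixes as $prefix => $dirs ) {
		if ( strpos( $class, $prefix ) !== 0 ) {
			continue;
		}
		$relative = str_replace( '\\', '/', substr( $class, strlen( $prefix ) ) ) . '.php';
		foreach ( (array)$dirs as $dir ) {
			$file = "$dir/$relative";
			if ( file_exists( $file ) ) { // the expensive filesystem check
				return $file;
			}
		}
	}
	return null;
}
```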
See task T274041 on Phabricator for the collaborative investigation between volunteers and staff.
Solution: Optimized class map
Máté Szabó (@TK-999) took a deep dive and profiled a local MediaWiki install with php-excimer and generated a flame graph. He found that about 16.6% of request time was spent in the Autoloader::find() method, which is responsible for finding which file contains a given class.
Figure 1: Flame graph by Máté Szabó.
Checking for file existence during PSR-4 autoloading seems necessary because one namespace can correspond to multiple directories that promise to define some of its classes. The search logic has to check each directory until it finds a class file. Only when the class is not found anywhere may the program crash with a fatal error.
Máté avoided the directory searching cost by expanding MediaWiki’s Autoloader class map to include all classes, including those registered via PSR-4 namespaces. This solution makes use of a hash-map, where each class maps to one and only one file path on disk, making it a 1-to-1 mapping.
This means the Autoloader::find() method no longer has to search through the PSR-4 directories. It now knows upfront where each class is, by merely accessing the array from memory. This removes the need for file existence checks. This approach is similar to the autoloader optimization flag in Composer.
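For contrast with the PSR-4 search sketched earlier, here is a simplified sketch of the optimized lookup: with every class pre-listed in the class map, resolution is a single in-memory array access and no filesystem probing is needed.

```php
// Class map lookup: each class name maps to exactly one file path.
function findViaClassMap( string $class, array $classMap ): ?string {
	return $classMap[$class] ?? null;
}
```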
Impact
Máté’s optimization significantly reduced response time by optimizing the Autoloader::find() method. This is largely due to the elimination of file system calls.
After deploying the change to MediaWiki appservers in production, we saw a major shift in response times toward faster buckets: a ~20% increase in requests completed within 50ms, and a ~10% increase in requests served under 100ms (T274041#8379204).
Máté analyzed the baseline and classmap cases locally, benchmarking 4800 requests, controlled at exactly 40 requests per second. He found latencies reduced on average by ~12%:
Table 1: Difference in latencies between baseline and classmap autoloader.
Latencies      Baseline   Full classmap
p50 (median)   26.2ms     22.7ms (~13.3% faster)
p90            29.2ms     25.7ms (~11.8% faster)
p95            31.1ms     27.3ms (~12.3% faster)
We reproduced Máté’s findings locally as well. On the Git commit right before his patch, Autoloader::find() really stands out.
Figure 2: Profile before optimization.
Figure 3: Profile after optimization.
NOTE: We used ApacheBench to load the /wiki/Main_Page URL from a local MediaWiki installation with PHP 8.1 on Apple M1. We ran it both in a bare metal environment (PHP built-in webserver, 8 workers, no APCu), and in MediaWiki-Docker. We configured our benchmark to run 1000 requests with 7 concurrent requests. The profiles were captured using Excimer with a 1ms interval. The flame graphs were generated with Speedscope, and the box plots were created with Gnuplot.
In Figures 4 and 5, the “After” box plot has a lower median than the “Before” box plot. This means there is a reduction in latency. Also, the standard deviation in the “After” scenario shrank, which indicates that responses were more consistently fast (not only on average). This increases the percentage of our users that have an experience very close to the average response time of web requests. Fewer users now experience an extreme case of web response slowness.
Figure 4: Boxplot for requests on bare metal.
Figure 5: Boxplot for requests on Docker.
Web Perf Hero award
The Web Perf Hero award is given to individuals who have gone above and beyond to improve the web performance of Wikimedia projects. The initiative is led by the Performance Team and started mid-2020. It is awarded quarterly and takes the form of a Phabricator badge.
You might have already heard the buzz: the Wikimedia Hackathon is gearing up for an incredible event in Tallinn, Estonia, from May 3rd to 5th, 2024. Now, we’re thrilled to announce that the registration form, which also includes an optional scholarship application, is officially open until Friday, January 5th, 2024.
Participation in the in-person Wikimedia Hackathon in Tallinn is contingent upon registration. The registration portal will remain accessible until we hit our venue’s capacity, which is set at approximately 220 participants. Here’s the exciting part: the event itself is entirely free of charge, ensuring that everyone has an opportunity to join us for this fantastic experience. Please note that participants are required to make individual travel arrangements unless they have been awarded a scholarship. For comprehensive details about the scholarship process, committee, and eligibility criteria, please visit the dedicated page.
The registration and scholarship application form is powered by Pretix, an open-source third-party service, which may introduce additional terms. If you have inquiries regarding privacy and data handling, consult the privacy statement for more information.
Seize the Opportunity: Apply Now!
Register, apply for a scholarship, and join the vibrant technical community dedicated to making a difference and shaping the future of Wikimedia’s Technical Ecosystem.
Stay Connected: Join the Conversation
As the excitement builds, stay connected with the Wikimedia community. Engage in discussions, share your ideas, and connect with fellow participants on the talk page and explore the various channels mentioned here. Follow the event updates, announcements, and get ready for an enriching experience that goes beyond coding — it’s about building connections and leaving a lasting impact on the Wikimedia Technical projects.
Should you have any questions or encounter issues related to the registration form or scholarship application, don’t hesitate to reach out to the organizing team. You can connect with us via the talk page or through email at hackathon@wikimedia.org. We’re here to support you every step of the way!
We are thrilled to share the exciting news that the 2024 Wikimedia Hackathon is scheduled to unfold in the captivating city of Tallinn, Estonia, from May 3rd – 5th 2024!
A Celebration of Innovation and Collaboration
The Wikimedia Hackathon is not just an annual hacking event; it’s a celebration of innovation and collaboration, uniting the global Wikimedia technical community in a dynamic gathering focused on connection, innovation, and exploration. At this event, technical contributors hailing from all corners of the globe converge with a shared mission: to enhance the technological infrastructure and software that underpins and empowers Wikimedia projects.
The theme for this edition aligns with last year’s, emphasizing the gathering of individuals who have a track record of contributing to the technical aspects of Wikimedia projects. We’re looking for those who are well-versed in navigating the technical ecosystem and are adept at working autonomously or collaborating effectively on projects.
How to Get Involved
Participating in the Hackathon is easy! Simply mark your calendar for May 3rd – 5th, 2024, and register to attend. Stay tuned for registration and scholarship details, which will be announced on Monday November 27th 2023 on our MediaWiki page and social media channels.
Spread the Word!
Help us make the Hackathon a massive success by spreading the word. Share this announcement with community members and anyone who shares your passion for Wikimedia Technical Projects. Let’s make this gathering of brilliant minds an unforgettable experience!
The new “Excimer UI” option in WikimediaDebug generates flame graphs. What are flame graphs, and when do you need this?
A flame graph visualizes a tree of function calls across the codebase, and emphasizes the time each function spends. In 2014, we introduced Arc Lamp to help detect and diagnose performance issues in production. Arc Lamp samples live traffic and publishes daily flame graphs. This same diagnostic power is now available on-demand to debug sessions!
Debugging until now
WikimediaDebug is a browser extension for Firefox and Chromium-based browsers. It helps stage deployments and diagnose problems in backend requests. It can pin your browser to a given data center and server, send verbose messages to Logstash, and… capture performance profiles!
Our main debug profiler has been XHGui. XHGui is an upstream project that we first deployed in 2016. It’s powered by php-tideways under the hood, which favors accuracy in memory and call counts. This comes at the high cost of producing wildly inaccurate time measurements. The Tideways data model also can’t represent a call tree, needed to visualize a timeline (learn more, upstream change). These limitations have led to misinterpretations and inconclusive investigations. Some developers work around this manually with time-consuming instrumentation from a production shell. Others might repeatedly try fixing a problem until a difference is noticeable.
Screenshot of XHGui.
Accessible performance profiling
Our goal is to lower the barrier to performance profiling, such that it is accessible to any interested party, and quick enough to do often. This includes reducing knowledge barriers (internals of something besides your code), and mental barriers (context switch).
You might wonder (in code review, in chat, or reading a mailing list) why one thing is slower than another, what the bottlenecks are in an operation, or whether some complexity is “worth” it?
With WikimediaDebug, you flip a switch, find out, and continue your thought! It is part of a culture in which we can make things faster by default, and allows for a long tail of small improvements that add up.
Example: In reviewing a change, which proposes adding caching somewhere, I was curious. Why is that function slow? I opened the feature and enabled WikimediaDebug. That brought me to an Excimer profile where you can search (ctrl-F) for the changed function (“doDomain”). We find exactly how much time is spent in that particular function. You can verify our results, or capture your own!
Flame graph in Excimer UI via Speedscope (by Jamie Wong, MIT License).
What: Production vs Debugging
We measure backend performance in two categories: production and debugging.
“Production” refers to live traffic from the world at large. We collect statistics from MediaWiki servers, like latency, CPU/memory, and errors. These stats are part of the observability strategy and measure service availability (“SLO”). To understand the relationship between availability and performance, let’s look at an example. Given a browser that timed out after 30 seconds, can you tell the difference between a response that will never arrive (it’s lost), and a response that could arrive if you keep waiting? From the outside, you can’t!
When setting expectations, you thus actually define both “what” and “when”. This makes performance and availability closely intertwined concepts. When a response is slower than expected, it counts toward the SLO error budget. We do deliver most “too slow” responses to their respective browser (better than a hard error!). But above a threshold, a safeguard stops the request mid-way, and responds with a timeout error instead. This protects us against misuse that would drain web server and database capacity for other clients.
These high-level service metrics can detect regressions after software deployments. To diagnose a server overload or other regression, developers analyze backend traffic to identify the affected route (pageview, editing, login, etc.). Then, developers can dig one level deeper to function-level profiling, to find which component is at fault. On popular routes (like pageviews), Arc Lamp can find the culprit. Arc Lamp publishes daily flame graphs with samples from MediaWiki production servers.
Production profiling is passive. It happens continuously in the background and represents the shared experience of the public. It answers: What routes are most popular? Where is server time generally spent, across all routes?
“Debug” profiling is active. It happens on-demand and focuses on an individual request—usually your own. You can analyze any route, even less popular ones, by reproducing the slow request. Or, after drafting a potential fix, you can use debugging tools to stage and verify your change before deploying it worldwide.
These “unpopular” routes are more common than you might think. Wikipedia is among the largest sites with ~8 million requests per minute. About half a million are pageviews. Yet, looking at our essential workflows, anything that isn’t a pageview has too few samples for real-time monitoring. Each minute we receive a few hundred edits. Other workflows are another order of magnitude below that. We can take all edits, reviews of edits (“patrolling”), discussion replies, account blocks, page protections, etc; and their combined rate would be within the error budget of one high-traffic service.
Excimer to the rescue
Tim Starling on our team realized that we could leverage Excimer as the engine for a debug profiler. Excimer is the production-grade PHP sampling profiler used by Arc Lamp today, and was specifically designed for flame graphs and timelines. Its data model represents the full callstack.
Remember that we use XHGui with Tideways, which favors accurate call counts by intercepting every function call in the PHP engine. That costly choice skews time. Excimer instead favors low-overhead, through a sampling interval on a separate thread. This creates more representative time measures. Re-using Excimer felt obvious in retrospect, but when we first deployed the debug services in 2016, Excimer did not yet exist. As a proof of concept, we first created an Excimer recipe for local development.
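For a flavor of what such a recipe looks like, here is a minimal sketch of capturing an Excimer profile for a single request and writing it out in Speedscope format (the sampling period, event type, and output path are illustrative choices, not the exact production configuration):

```php
// Sample the wall-clock time of this request every 1ms on a separate timer.
$profiler = new ExcimerProfiler();
$profiler->setEventType( EXCIMER_REAL );
$profiler->setPeriod( 0.001 );
$profiler->start();

register_shutdown_function( static function () use ( $profiler ) {
	$profiler->stop();
	// Speedscope-compatible JSON, ready to open in the flame graph viewer.
	file_put_contents(
		'/tmp/excimer-profile.json',
		json_encode( $profiler->getLog()->getSpeedscopeData() )
	);
} );
```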
How it works
After completing the proof of concept, we identified four requirements to make Excimer accessible on-demand:
Capture the profiling information,
Store the information,
Visualize the profile in a way you can easily share or link to,
Discover and control it from an interface.
We took the capturing logic as-is from the proof of concept, and bundled it in mediawiki-config. This builds on the WikimediaDebug component, with an added conditional for the “excimer” option.
To visualize the data we selected Speedscope, an interactive profile data visualization tool that creates flame graphs. We did consider Brendan Gregg’s original flamegraph.pl script, which we use in Arc Lamp. flamegraph.pl specializes in aggregate data, using percentages and sample counts. This is great for Arc Lamp’s daily summaries, but when debugging a single request we actually know how much time has passed. It would be more intuitive to developers if we presented the time measurements, instead of losing that information. Speedscope can display time.
We store each captured profile in a MySQL key-value table, hosted in the Foundation’s misc database cluster. The cluster is maintained by SRE Data Persistence, and also hosts the databases of Gerrit, Phabricator, Etherpad, and XHGui.
Freely licensed software
We use Speedscope as the flame graph visualizer. Speedscope is an open source project by Jamie Wong. As part of this project we upstreamed two improvements, including a change to bundle a font rather than calling on a third-party CDN. This aligns with our commitment to privacy and independence.
The underlying profile data is captured by Excimer, a low-overhead sampling profiler for PHP. We developed Excimer in 2018 for Arc Lamp. To make the most of Speedscope’s feature set, we added support for time units and added the Speedscope JSON format as a built-in output type for Excimer.
We added Excimer to the php.net registry and submitted it to major Linux package managers (Debian, Ubuntu, Sury, and Remi’s RPM). Special thanks to Kunal Mehta as Debian Developer and fellow Wikimedian who packaged Excimer for Debian Linux. These packages make Excimer accessible to MediaWiki contributors and their local development environment (e.g. MediaWiki-Docker).
Our presence in the Debian repository carries special meaning: it signals trust, stability, and confidence in our software to the free software ecosystem. For example, we were pleased to learn that Sentry adopted Excimer to power their Sentry Profiling for PHP service!
Try it!
If you haven’t already, install WikimediaDebug in your Firefox or Chrome browser.
Navigate to any article on Wikipedia.
Set the widget to On, with the “Excimer UI” checked.
Reload the page.
Click the “Open profile” link in the WikimediaDebug popup.
Accessible debugging tools empower you to act on your intuitions and curiosities, as part of a culture where you feel encouraged to do so. What we want to avoid is filtering these intuitions down to big incidents only, where you can justify hours of work, or depend on specialists.
Learn why we transitioned the MediaWiki platform to serve traffic from multiple data centers, and the challenges we faced along the way.
Wikimedia Foundation provides access to information for people around the globe. When you visit Wikipedia, your browser sends a web request to our servers and receives a response. Our servers are located in multiple geographically separate datacenters. This gives us the ability to quickly respond to you from the closest possible location.
You can find out which data center is handling your requests by using the Network tab in your browser’s developer tools (e.g. right-click -> Inspect element -> Network). Refresh the page and click the top row in the table. In the “x-cache” response header, the first digit corresponds to a data center in the above map.
In the example above, we can tell from the 4 in “cp4043”, that San Francisco was chosen as my nearest caching data center. The cache did not contain a suitable response, so the 2 in “mw2393” indicates that Dallas was chosen as the application data center. These are the ones where we run the MediaWiki platform on hundreds of bare metal Apache servers. The backend response from there is then proxied via San Francisco back to me.
Why multiple data centers?
Our in-house Content Delivery Network (CDN) is deployed in multiple geographic locations. This lowers response time by reducing the distance that data must travel through (inter)national cables and other networking infrastructure from your ISP and Internet backbones. Each caching data center that makes up our CDN contains cache servers that remember previous responses to speed up delivery. Requests that have no matching cache entry yet must be forwarded to a backend server in the application data center.
If these backend servers are also deployed in multiple geographies, we lower the latency for requests that are missing from the cache, or that are uncachable. Operating multiple application data centers also reduces organizational risk from catastrophic damage or connectivity loss to a single data center. To achieve this redundancy, each application data center must contain all hardware, databases, and services required to handle the full worldwide volume of our backend traffic.
Multi-region evolution of our CDN
Wikimedia started running its first datacenter in 2004, in St Petersburg, Florida. This contained all our web servers, databases, and cache servers. We designed MediaWiki, the web application that powers Wikipedia, to support cache proxies that can handle our scale of Internet traffic. This involves including Cache-Control headers, sending HTTP PURGE requests when pages are edited, and intentional limitations to ensure content renders the same for different people. We originally deployed Squid as the cache proxy software, and later replaced it with Varnish and Apache Traffic Server.
In 2005, with only minimal code changes, we deployed cache proxies in Amsterdam, Seoul, and Paris. More recently, we’ve added caching clusters in San Francisco, Singapore, and Marseille. Each significantly reduces latency for users near those locations.
Adding cache servers increased the overhead of cache invalidation, as the backend would send an explicit PURGE request to each cache server. After ten years of growth both in Wikipedia’s edit rate and the number of servers, we adopted a more scalable solution in 2013 in the form of a one-to-many broadcast. This eventually reaches all caching servers, through a single asynchronous message (based on UDP multicast). This was later replaced with a Kafka-based system in 2020.
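To make the scaling difference concrete, here is a purely illustrative Python sketch contrasting the two models. The class and server names are made up; in production the broadcast is a purge event published to Kafka and consumed on each cache host.

```python
# Conceptual sketch: per-server PURGE requests vs a one-to-many broadcast.
CACHE_SERVERS = [f"cp{n}" for n in range(1001, 1006)]

def purge_direct(url: str) -> None:
    """Old model: the backend sends one HTTP PURGE per cache server (O(N) per edit)."""
    for server in CACHE_SERVERS:
        print(f"PURGE {url} -> {server}")

class PurgeBus:
    """New model: the backend publishes a single event; cache hosts subscribe."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, handler):
        self.subscribers.append(handler)
    def publish(self, url: str) -> None:
        for handler in self.subscribers:   # delivery is handled by the bus,
            handler(url)                   # not by the MediaWiki backend

bus = PurgeBus()
for server in CACHE_SERVERS:
    bus.subscribe(lambda url, s=server: print(f"{s}: evict {url}"))

purge_direct("/wiki/Belgium")   # backend does N requests
bus.publish("/wiki/Belgium")    # backend does a single publish
```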
When articles are temporarily restricted, “View source” replaces the familiar “Edit” link for most readers.
The traffic we receive from logged-in users is only a fraction of that of logged-out users, while also being difficult to cache. We forward such requests uncached to the backend application servers. When you browse Wikipedia on your device, the page can vary based on your name, interface preferences, and account permissions. Notice the elements highlighted in the example above. This kind of variation gets in the way of whole-page HTTP caching by URL.
Our highest-traffic endpoints are designed to be cacheable even for logged-in users. This includes our CSS/JavaScript delivery system (ResourceLoader), and our image thumbnails. The performance of these endpoints is essential to the critical path of page views.
Multi-region for application servers
Wikimedia Foundation began operating a secondary data center in 2014, as a contingency to facilitate a quick and full recovery within minutes in the event of a disaster. We exercise full switchovers annually, and we use it throughout the year to ease maintenance through partial switchovers of individual backend services.
Actively serving traffic from both data centers would add advantages over a cold-standby system:
Requests are forwarded to closer servers, which reduces latency.
Traffic load is spread across more hardware, instead of half sitting idle.
No need to “warm up” caches in a standby data center prior to switching traffic from one data center to another.
With multiple data centers in active use, there is institutional incentive to make sure each one can correctly serve live traffic. This avoids creation of services that are configured once, but not reproducible elsewhere.
We drafted several ideas into a proposal in 2015, to support multiple application data centers. Many components of the MediaWiki platform assumed operation from a single backend data center, such as assuming that a primary database is always reachable for querying, or that deleting a key from “the” Memcached cluster suffices to invalidate a cache. We needed to adopt new paradigms and patterns, deploy new infrastructure, and update existing components to accommodate these. Our seven-year journey ended in 2022, when we finally enabled concurrent use of multiple data centers!
The biggest changes that made this transition possible are outlined below.
HTTP verb traffic routing
MediaWiki was designed from the ground up to make liberal use of relational databases (e.g. MySQL). During most HTTP requests, the backend application makes several dozen round trips to its databases. This is acceptable when those databases are physically close to the web servers (<0.2ms ping time). But, this would accumulate significant delays if they are in different regions (e.g. 35ms ping time).
MediaWiki is also designed to strictly separate primary (writable) from replica (read-only) databases. This is essential at our scale. We have a CDN and hundreds of web servers behind it. As traffic grows, we can add more web servers and replica database servers as-needed. But, this requires that page views don’t put load on the primary database server — of which there can be only one! Therefore we optimize page views to rely only on queries to replica databases. This generally respects the “method” section of RFC 9110, which states that requests that modify information (such as edits) use HTTP POST requests, whereas read actions (like page views) only involve HTTP GET (or HTTP HEAD) requests.
The above pattern gave rise to the key idea that there could be a “primary” application datacenter for “write” requests, and “secondary” data centers for “read” requests. The primary databases reside in the primary datacenter, while we have MySQL replicas in both data centers. When the CDN has to forward a request to an application server, it chooses the primary datacenter for “write” requests (HTTP POST) and the closest datacenter for “read” requests (e.g. HTTP GET).
We cleaned up and migrated components of MediaWiki to fit this pattern. For pragmatic reasons, we did make a short list of exceptions. We allow certain GET requests to always route to the primary data center. The exceptions require HTTP GET for technical reasons, and change data at the same low frequency as POST requests. The final routing logic is implemented in Lua on our Apache Traffic Server proxies.
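As a rough illustration of that routing decision (not the actual Lua running on our Apache Traffic Server proxies), a sketch might look like the following. The data center names and the exception list are placeholders.

```python
# Illustrative sketch of verb-based routing: writes go to the primary DC,
# reads go to the nearest DC, with a small exception list of special GETs.
READ_METHODS = {"GET", "HEAD"}
PINNED_TO_PRIMARY = {"/hypothetical/legacy-endpoint"}  # illustrative exception list

def choose_application_dc(method: str, path: str, nearest_dc: str, primary_dc: str) -> str:
    """Return the application data center that should handle this request."""
    if method not in READ_METHODS:
        return primary_dc   # POST and other writes must reach the primary databases
    if path in PINNED_TO_PRIMARY:
        return primary_dc   # rare GETs that still modify data
    return nearest_dc       # cache misses for reads stay close to the user

print(choose_application_dc("GET", "/wiki/Belgium", "codfw", "eqiad"))   # codfw
print(choose_application_dc("POST", "/w/index.php", "codfw", "eqiad"))   # eqiad
```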
Media storage
Our first file storage and thumbnailing infrastructure relied on NFS. NetApp hardware provided mirroring to standby data centers.
By 2012, this required increasingly expensive hardware and proved difficult to maintain. We migrated media storage to Swift, a distributed file store.
As MediaWiki assumed direct file access, Aaron Schulz and Tim Starling introduced the FileBackend interface to abstract this. Each application data center has its own Swift cluster. MediaWiki writes to both clusters, and the “swiftrepl” background service reconciles any inconsistencies. When our CDN finds thumbnails absent from its cache, it forwards requests to the nearest Swift cluster.
Job queue
MediaWiki has featured a job queue system since 2009, for performing background tasks. We took our Redis-based job queue service and migrated it to Kafka in 2017. With Kafka, we support bidirectional and asynchronous replication. This allows MediaWiki to quickly and safely queue jobs locally within the secondary data center. Jobs are then relayed to and executed in the primary data center, near the primary databases.
The bidirectional queue helps support legacy features that discover data updates during a pageview or other HTTP GET request. Changing each of these features was not feasible in a reasonable time span. Instead, we designed the system to ensure queueing operations are equally fast and local to each data center.
In-memory object cache
MediaWiki uses Memcached as an LRU key-value store to cache frequently accessed objects. Though not as efficient as whole-page HTTP caching, this very granular cache is suitable for dynamic content.
Some MediaWiki extensions assumed that Memcached had strong consistency guarantees, or that a cache could be invalidated by setting new values at relevant keys when the underlying data changes. Although these assumptions were never valid, they worked well enough in a single data center.
We introduced WANObjectCache as a simple yet robust interface in MediaWiki. It takes care of dealing with multiple independent data centers. The system is backed by mcrouter, a Memcached proxy written by Facebook. WANObjectCache provides two basic functions: getWithSet and delete. It uses cache-aside in the local data center, and broadcasts invalidation to all data centers. We’ve migrated virtually all Memcached interactions in MediaWiki to WANObjectCache.
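The following is a minimal Python sketch of that cache-aside plus broadcast-purge pattern. It is illustrative only: the real WANObjectCache is a PHP interface backed by Memcached and mcrouter, and the names here are invented.

```python
# Minimal sketch of cache-aside ("getWithSet") with broadcast invalidation ("delete").
class WanCacheSketch:
    def __init__(self, local_cache: dict, broadcast_purge):
        self.local = local_cache                 # per-data-center cache (stand-in for Memcached)
        self.broadcast_purge = broadcast_purge   # relays an invalidation to every DC

    def get_with_set(self, key, compute):
        """Return a cached value, computing and storing it locally on a miss."""
        if key not in self.local:
            # Regenerate from a local DB replica; only the local DC cache is written.
            self.local[key] = compute()
        return self.local[key]

    def delete(self, key):
        """Invalidate a key everywhere; the next read in each DC recomputes it."""
        self.broadcast_purge(key)


# Example wiring: two data centers sharing a purge channel.
eqiad, codfw = {}, {}
purge_all = lambda key: [dc.pop(key, None) for dc in (eqiad, codfw)]

cache = WanCacheSketch(eqiad, purge_all)
print(cache.get_with_set("page:42", lambda: "rendered HTML"))   # miss, so compute
cache.delete("page:42")                                         # e.g. the page was edited
```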
Parser cache
Most of a Wikipedia page is the HTML rendering of the editable content. This HTML is the result of parsing wikitext markup and expanding template macros. MediaWiki stores this in the ParserCache to improve scalability and performance. Originally, Wikipedia used its main Memcached cluster for this. In 2011, we added MySQL as the lower tier key-value store. This improved resiliency from power outages and simplified Memcached maintenance. ParserCache databases use circular replication between data centers.
Ephemeral object stash
The MainStash interface provides MediaWiki extensions on the platform with a general key-value store. Unlike Memcached, this is a persistent store (disk-backed, to survive restarts) that replicates its values between data centers. Until now, in our single data center setup, we used Redis as our MainStash backend.
In 2022 we moved this data to MySQL, and replicate it between data centers using circular replication. Our access layer (SqlBagOStuff) adheres to a Last-Write-Wins consistency model.
Login sessions were similarly migrated away from Redis, to a new session store based on Cassandra. It has native support for multi-region clustering and tunable consistency models.
Reaping the rewards
Most Multi-DC work took the form of incremental improvements and infrastructure cleanup, spread over several years. While some individual changes did reduce latency, we mainly looked for improvements in availability and reliability.
The final switch to “turn on” concurrent traffic to both application data centers was the HTTP verb routing. We deployed it in two stages. The first stage applied the routing logic to 2% of web traffic, to reduce risk. After monitoring and functional testing, we moved to the second stage: route 100% of traffic.
We reduced latency of “read” requests by ~15ms for users west of our data center in Carrollton (Texas, USA). For example, logged-in users within East Asia. Previously, we forwarded their CDN cache-misses to our primary data center in Ashburn (Virginia, USA). Now, we could respond from our closer, secondary, datacenter in Carrollton. This improvement is visible in the 75th percentile TTFB (Time to First Byte) graph below. The time is in seconds. Note the dip after 03:36 UTC, when we deployed the HTTP verb routing logic.
For over 15 years, the Wikimedia Foundation has provided public dumps of the content of all wikis. They are not only useful for archiving or offline reader projects, but can also power tools for semi-automated (or bot) editing such as AutoWikiBrowser. For example, these tools comb through the dumps to generate lists of potential spelling mistakes in articles for editors to fix. For researchers, the dumps have become an indispensable data resource (footnote: Google Scholar lists more than 16,000 papers mentioning the phrase “Wikipedia dumps”). Especially in the area of natural language processing, the use of Wikipedia dumps has become almost ubiquitous with the advancement of large language models such as GPT-3 (and thus by extension also the recently published ChatGPT) or BERT. Virtually all language models are trained on Wikipedia content, especially multilingual models, which rely heavily on Wikipedia for many lower-resourced languages.
Over time, the research community has developed many tools to help folks who want to use the dumps. For instance, the mwxml Python library helps researchers work with the large XML files and iterate through the articles within them. Before analyzing the content of the individual articles, researchers must usually further preprocess them, since they come in wikitext format. Wikitext is the markup language used to format the content of a Wikipedia article in order to, for example, highlight text in bold or add links. In order to parse wikitext, the community has built libraries such as mwparserfromhell, developed over 10 years and comprising almost 10,000 lines of code. This library provides an easy interface to identify different elements of an article, such as links, templates, or just the plain text. This ecosystem of tooling lowers the technical barriers to working with the dumps because users do not need to know the details of XML or wikitext.
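For example, assuming mwparserfromhell is installed from PyPI, extracting links, templates, and plain text from a snippet of wikitext looks roughly like this:

```python
import mwparserfromhell

wikitext = "'''Brussels''' is the [[Capital city|capital]] of [[Belgium]].{{Infobox settlement}}"
code = mwparserfromhell.parse(wikitext)

print(code.filter_wikilinks())   # [[Capital city|capital]], [[Belgium]]
print(code.filter_templates())   # {{Infobox settlement}}
print(code.strip_code())         # roughly: "Brussels is the capital of Belgium."
```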
While convenient, there are severe drawbacks to working with the XML dumps containing articles in wikitext. In fact, MediaWiki translates wikitext into HTML, which is then displayed to the readers. Thus, some elements contained in the HTML version of the article are not readily available in the wikitext version; for example, due to the use of templates. As a result, researchers who parse only the wikitext may overlook important content that is displayed to readers. For example, a study by Mitrevski et al. found for English Wikipedia that of the 475M internal links in the HTML versions of the articles, only 171M (36%) were present in the wikitext version.
Therefore, it is often desirable to work with the HTML versions of the articles instead of the wikitext versions. In practice, however, this has remained largely out of reach for researchers. Using the MediaWiki APIs or scraping Wikipedia directly for the HTML is computationally expensive at scale and discouraged for large projects. Only recently have the Wikimedia Enterprise HTML dumps been introduced and made publicly available with regular monthly updates, so that researchers or anyone else may use them in their work.
However, while the data is available, it still requires lots of technical expertise by researchers, such as how different elements from wikitext get parsed into HTML elements. In order to lower the technical barriers and improve the accessibility of this incredible resource, we released the first version of mwparserfromhtml, a library that makes it easy to parse the HTML content of Wikipedia articles – inspired by the wikitext-oriented mwparserfromhell.
Figure 1. Examples of different types of elements that mwparserfromhtml can extract from an article
The tool is written in Python and available as a pip-installable package. It provides two main functionalities. First, it allows the user to access all articles in the dump files one by one in an iterative fashion. Second, it contains a parser for the individual HTML of the article. Using the Python library beautifulsoup, we can parse the content of the HTML and extract individual elements (see Figure 1 for examples):
Wikilinks (or internal links). These are annotated with additional information about the namespace of the target link, and whether it is a disambiguation page, redirect, red link, or interwiki link.
External links. We distinguish whether each is named, numbered, or autolinked.
Categories
Templates
References
Media. We capture the type of media (image, audio, or video) as well as the caption and alt text (if applicable).
Plain text of the articles
We also extract some properties of the elements that end users might care about, such as whether each element was originally included in the wikitext version or was transcluded from another page.
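To give a flavour of the underlying approach (without reproducing mwparserfromhtml’s own API, which is documented in its repository), here is a small BeautifulSoup sketch that pulls internal links and plain text out of a fragment of article HTML. It assumes the bs4 package is installed, and the HTML fragment is invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p><b>Brussels</b> is the <a href="/wiki/Capital_city" title="Capital city">capital</a>
of <a href="/wiki/Belgium" title="Belgium">Belgium</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Internal links in the HTML dumps point at /wiki/<Title>.
wikilinks = [a["href"] for a in soup.find_all("a") if a.get("href", "").startswith("/wiki/")]
plaintext = soup.get_text(" ", strip=True)

print(wikilinks)   # ['/wiki/Capital_city', '/wiki/Belgium']
print(plaintext)   # roughly: 'Brussels is the capital of Belgium .'
```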
Building the tool posed several challenges. First, it remains difficult to systematically test the output of the tool. While we can verify that we are correctly extracting the total number of links in an article, there is no “right” answer for what the plain text of an article should include. For example, should image captions or lists be included? We manually annotated a handful of example articles in English to evaluate the tool’s output, but it is almost certain that we have not captured all possible edge cases. In addition, other language versions of Wikipedia might provide other elements or patterns in the HTML than the tool currently expects. Second, while much of how an article is parsed is handled by the core of MediaWiki and well documented by the Wikimedia Foundation Content Transform Team and the editor community on English Wikipedia, article content can also be altered by wiki-specific Extensions. This includes important features such as citations, and documentation about some of these aspects can be scarce or difficult to track down.
The current version of mwparserfromhtml constitutes a first starting point. There are still many functionalities that we would like to add in the future, such as extracting tables, splitting the plain text into sections and paragraphs, or handling in-line templates used for unit conversion (for example, displaying lbs and kg). If you have suggestions for improvements or would like to contribute, please reach out to us on the repository, and file an issue or submit a merge request.
Finally, we want to acknowledge that the project was started as part of an Outreachy internship with the Wikimedia Foundation. We encourage folks to consider mentoring or applying to the Outreachy program as appropriate.
Wikimedia Commons is our open media repository. Like Wikipedia and its other sister projects, Commons runs on the MediaWiki platform. Commons is home to millions of photos, documents, videos, and other multimedia files.
MediaWiki has a built-in imagescaler that, until now, we used in production as well. To improve security isolation, we started an effort in 2015 to develop support in MediaWiki for external media handling services. We chose Thumbor, an open-source thumbnail generation service, for Wikimedia’s thumbnailing needs.
During routine post-deployment checks we found the p99 First Paint metric regressed from 4s to 20s. That’s quite a jump. The median and p75 during the same time period remained constant at their sub-second values.
Distribution of First Paint, which prompted our investigation.
After an investigation we learned that page load time and visual rendering metrics are often skewed in visually hidden browser tabs (such as tabs that are open in the background). The deployment had refactored code such that background tabs could deprioritize more of the rendering work. Rather than revert this, we decided to change how MediaWiki’s Navigation Timing client collects these metrics. We now only sample pageviews in browser tabs that are “visible” from their birth until the page finishes loading.
To understand why background tabs had such an impact on our global metrics, we also ran a simple JS counter for a few days. We found that over a three-day period, 8.4% of page views in capable browsers were visually hidden for at least part of their load time. (Measured using the Page Visibility API, which itself was available on 98% of the sampled pageviews.)
Browser support and distribution of page visibility on Wikipedia.
– Peter Hedenskog and Timo Tijhof.
Performance Inspector goes Beta
We had an idea to improve page load time performance on Wikipedia by providing performance metrics to editors through an in-article modal link (T117411). By using the Performance Inspector, tech-savvy Wikipedians could use this extra data to inform edits that make the article load faster. At least, that was the idea.
It turns out that in reality it’s hard for users to distinguish between costs due to the article content and costs of our own software features. It was hard for editors to actually do something that made a noticeable difference in page load time. We discontinued the Performance Inspector in favor of providing more developer-oriented tools.
— Peter Hedenskog.
The discontinued Perf Inspector offered a modal interface to list each bundle with its size in kilobytes.
The “mw.inspect” console utility for calculating bundle sizes.
Hello, HTTP/2!
Deploying HTTP/2 support to the Wikimedia CDN significantly changed how browsers negotiate and transfer data during the page load process. We anticipated a speed-up as part of the transition, and also identified specific opportunities to leverage HTTP/2 in our architecture for even faster page loads.
We also found unexpected regressions in page load performance during the HTTP/2 transition. In Chrome, pageviews using HTTP/2 initially had a slower Time to First Paint experience when compared to the previous HTTP/1 stack. We wrote about this in HTTP/2 performance revisited.
– Timo Tijhof and Peter Hedenskog.
Stylesheet-aware dependency tracking
2016 saw a new state-tracking mechanism for stylesheets in ResourceLoader (Wikipedia’s JS/CSS delivery system). The HTML we send from MediaWiki to the browser references a bundle of stylesheets. The server now also transmits a small metadata blob alongside that HTML, which tells the JS client which stylesheets the page already references. On the client side, we use this metadata to treat those stylesheets as already loaded, so they are never requested a second time.
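Conceptually, the mechanism boils down to a loader that skips bundles the server has already announced as present. The Python sketch below is only an analogy for ResourceLoader’s PHP/JavaScript implementation; the module names are examples.

```python
# Server-announced state: which stylesheet bundles the HTML already references.
server_announced = {"oojs-ui-core.styles": "ready"}

loaded = dict(server_announced)   # client-side registry seeded from the server metadata

def load_module(name: str, fetch) -> None:
    """Fetch a bundle only if it is not already marked as loaded."""
    if loaded.get(name) == "ready":
        return                    # stylesheet already arrived with the page
    fetch(name)
    loaded[name] = "ready"

load_module("oojs-ui-core.styles", fetch=lambda n: print("fetching", n))  # no-op
load_module("ext.popups", fetch=lambda n: print("fetching", n))           # fetches
```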
Why now
MediaWiki is built with semantic HTML and standardized CSS classes in both PHP-rendered and client-rendered elements alike. The server is responsible for loading the current skin stylesheets. We generally do not declare an explicit dependency from a JS feature to a specific skin stylesheet. This is by design, and allows us to separate concerns and give each skin control over how to style these elements.
The adoption of OOUI (our in-house UI framework that renders natively in both PHP and JavaScript) got to a point where an increasing number of features needed to load OOUI both as a stylesheet for server-rendered elements and as a JavaScript bundle for (unrelated) JS functionality, such as modal interactions elsewhere on the page. These JS-based interactions can happen on any page, including pages that don’t embed OOUI elements server-side, so the OOUI JS module must include its stylesheet in the bundle. This would have caused the stylesheet to sometimes download twice. We worked around this issue for OOUI through a boolean signal from the server to the JS client (in the HTML head), indicating whether OOUI styles were already referenced (change 267794).
Outcome
We turned our workaround into a small general-purpose mechanism built-in to ResourceLoader. It works transparently to developers, and is automatically applied to all stylesheets.
This enabled wider adoption of OOUI, and also applied the optimization to other reusable stylesheets in the wider MediaWiki ecosystem (such as for Gadgets). It also facilitates easy creation of multiple distinct OOUI bundles without developers having to manually track each with a boolean signal.
This tiny capability took only a few lines of code to implement, but brought huge bandwidth savings; both through relative improvements as well as through what we prevented from being incurred in the future.
Despite being small in code, we did plan for a multi-month migration (T92459). Over the years, some teams had begun to rely on a subtle bug in the old behavior. It was previously permitted to load a JavaScript bundle through a static stylesheet link. This wasn’t an intended feature of ResourceLoader, and would load only the stylesheet portion of the bundle. Their components would then load the same JS bundle a second time from the client side, disregarding the fact that it downloaded the CSS twice. We found that the reason some teams did this was to avoid a FOUC (first load the CSS for the server-rendered elements, then load the module in its entirety for client-side enhancements). In most cases, we mitigated this by splitting the module in question into two: a reusable stylesheet and a pure JS payload.
– Timo Tijhof.
One step closer to Multi-DC
Prior to 2015, numerous MediaWiki extensions treated Memcached (erroneously) as a linearizable “black box” that could be written to in a naive way. This approach, while somewhat intuitive, was based on dated and unrealistic assumptions:
That cache servers are always reachable for updates.
That transactions for database writes never fail, time out, or get rolled back later in the same request.
That database servers do not experience replication lag.
That there are no concurrent web requests also writing to the same database or cache in between our database reads.
That application and cache servers reside in a single data center region, with cache reads always reflecting prior writes.
The Flow extension, for example, made these assumptions and experienced anomalies even within our primary data center. The addition of multiple data centers would amplify these anomalies, forcing us to face the reality that these assumptions did not hold.
Flow was among the first to adopt WANCache, a new developer-friendly interface we built for Memcached, specifically to offer high resiliency when operating at Wikipedia scale.
Replication lag was especially important. In MySQL/MariaDB, database reads can enjoy an “isolation level” that offers session consistency with repeatable reads. MediaWiki implements this by wrapping queries from a web request in one transaction. This means web requests will interact with one consistent and internally stable point-in-time state of the database. For example, this ensures foreign keys reliably resolve to related rows, even when queried later in the same request. However, it also means these queries perceive more replication lag.
WANCache is built using the “cache aside” and “purge” strategies. This means callers let go of the fine-grained control of (problematically) directly writing cache values. In exchange, they enjoy the simplicity of only declaring a cache key and a closure that computes the value. Optionally, they can send a “purge” notification to invalidate a cache key during a (soon-to-be-committed) database write.
Instead of proactively writing new values to both the database and the cache, WANCache lets subsequent HTTP requests fill the cache on-demand from a local DB replica. During the database write, we merely purge relevant cache keys. This avoids having to wait for, and incur load on, the primary DB during the critical path of wiki edits and other user actions. WANCache’s tombstone system prevents lagged data from getting (back) into a long-lived cache.
We made numerous improvements to database performance across the platform. This is often in collaboration with SRE and/or with the engineering teams that build atop our platform. We regularly review incident reports, flame graphs, and other metrics; and look for ways to address infra problems at the source, in higher-level components and MediaWiki service classes.
For example, one incident involved a partial outage due to database unavailability, caused by significant network saturation on the Wikimedia Commons database replicas. The saturation occurred because the PdfHandler service fetched metadata from the database during every thumbnail transformation and every access to the PDF page count. We mitigated this by removing the need for metadata loads in the thumbnail handler, and by refactoring the page count to utilize WANCache.
Another time, we used our flame graphs to learn that one of the top three queries came from WikiModule::preloadTitleInfo. This DB query uses batching to improve latency, and would traditionally be difficult to cache due to variable keys that each relate to part of a large dataset. We applied WANCache to WikiModule and used the “checkKeys” feature to facilitate easy cache invalidation of a large category of cache keys through a single operation, without the need for any propagation or tracking.
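A rough Python sketch of the “checkKeys” idea, with invented names rather than the real PHP API: touching one shared check key marks every dependent value as stale, without tracking or purging the individual keys.

```python
import time

cache = {}        # key -> (generated_at, value)
check_keys = {}   # check key -> last-touched timestamp

def get_with_set(key, check, compute):
    """Return a cached value, recomputing it if it predates the check key."""
    entry = cache.get(key)
    stale_after = check_keys.get(check, 0.0)
    if entry is None or entry[0] < stale_after:
        entry = (time.time(), compute())   # regenerate, e.g. from a DB replica
        cache[key] = entry
    return entry[1]

def touch_check_key(check):
    """Invalidate every value that depends on this check key, in one operation."""
    check_keys[check] = time.time()

print(get_with_set("module:site-styles", "wikimodule-info", lambda: "v1"))
touch_check_key("wikimodule-info")   # e.g. a module page was edited
print(get_with_set("module:site-styles", "wikimodule-info", lambda: "v2"))
```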
Creating a Docker image for your service should be easy—cram your code and its dependencies into a container: boom. done.
But that’s never the whole story.
You have to build new images for each release, monitor them for vulnerabilities, and find a way to safely ship them to production.
You need a reliable process to create, test, and deploy images to Kubernetes. In short: you need a release pipeline.
Wikimedia’s service release pipeline 🚢
A “build and deployment expert” is an antipattern.
Jez Humble & David Farley, Continuous Delivery
Wikimedia has a little more than thirty microservices running atop our in-house Kubernetes infrastructure.
Back when we started moving to Kubernetes in 2017, we had a few aims:
Build trust – After you generate an image, build confidence through incremental testing and validation.
Streamlined image builds – Developer teams shouldn’t need to be experts to build an excellent image for their service.
Security – Build on known-good images, run as a non-root user, and monitor for common vulnerabilities and exposures (CVEs).
And we created two tools to help us achieve these goals:
Blubber – This tool ensures our Docker images are lean, safe, and built from our blessed subset of known-good base images.
PipelineLib – A Jenkins library that uses Blubber to produce, test, and promote images to our Docker registry after establishing trust.
But our migration from Jenkins to GitLab has required some changes to these tools.
Kokkuri: the pipeline from GitLab 🦊
Now that we’re migrating to GitLab, we’re replacing Jenkins and PipelineLib with a shared GitLab repository called Kokkuri.
What PipelineLib was for Jenkins, Kokkuri is for GitLab. You can extend Kokkuri jobs in your GitLab project’s `.gitlab-ci.yml` to build streamlined and secure Docker images for Wikimedia production, test them, and push them to our production registry.
We’re using this tooling today for two of our internal projects: Scap (our deployment tool for MediaWiki) and Blubber itself.
For now, Kokkuri is an internal tool for Wikimedia’s GitLab. Using it outside of our unique production environment wouldn’t make sense.
Blubber as a BuildKit Frontend 🐳
All of our Wikimedia production services use Blubber to build their Docker images. Blubber is an active, open project—for use both inside and outside Wikimedia 🎉 And as part of the migration to GitLab, we’ve made improvements.
Blubber used to generate opinionated Dockerfiles—now it’s a full-fledged BuildKit front-end. BuildKit is a project from Moby, the people who make Docker, and it’s now used by Docker itself to create images.
As with all in-progress migrations: we’re still missing some things.
Here’s what we’re working on next for our GitLab move:
Dependency caching – tests will be slow if they need to fetch a lot of dependencies for every run; we’re working on a few solutions, and you can follow along on Phabricator.
Visibility – we’re still missing all the nice integrations we have in our old systems:
Links between our bug tracker (Phabricator) and GitLab
IRC and Slack notifications—yes, we use both 😅
But why “Kokkuri”? 🦝
Tanukis: a crucial part of our pipeline.
Alright. Let’s unpack the name “kokkuri.”
Fun fact: the GitLab logo may look like a fox, but it’s a tanuki—a totally real raccoon/dog/fox-type thing `{{citation-needed}}`.
Tanukis are the real-life inspiration for a mythical trickster known as a “kokkuri-san”—an animal spirit bringing mischief, magic, and luck.
And to summon a kokkuri-san: you’d use a kokkuri—which is kinda like a Japanese Ouija board.
So.
To summon a mischievous and magical tanuki you use a kokkuri. And now you can summon our tricksy GitLab magic in the exact. same. way.
I’m happy to share that the second Web Perf Hero award of 2022 goes to Valentín Gutierrez!
This award is in recognition of Valentín’s work on the Wikimedia CDN over the past three months. In particular, Valentín dove deep into Apache Traffic Server. We use ATS as the second layer in our HTTP Caching strategy for MediaWiki. (The first layer is powered by Varnish.)
Cache miss
Valentín (@Vgutierrez) observed that ATS was treating many web requests as cache misses, despite holding a seemingly matching entry in the cache. To understand why, we have to talk about the Vary header.
If a page is served the same way to everyone, it can be cached under its URL and served as such to anyone navigating to that same URL. This is nearly true for us from a statistical viewpoint, except that we have editors with logged-in sessions, whose pageviews must bypass the CDN and its static HTML caches. In HTTP terminology, we say that MediaWiki server responses “vary” by cookies. Two clients with different cookies may get a different response. Two clients with the same cookies, or with no cookies, can enjoy the same cached response. But, log-in sessions aren’t the only cookies in town! For example, our privacy-conscious device counting metric also utilizes a cookie (“WMF-Last-Access”). It is a very low entropy cookie, but a cookie nonetheless. We also optionally use cookies for fundraising localisation, and various other JavaScript features. As such, a majority of connecting browsers will have at least one cookie.
The HTTP specification says that when a response for a URL varies by the value of a header (in our case, the Cookies header controls whether you’re logged-in), then cache proxies like ATS and Varnish must not re-use a cache entry, unless the original and current browser have the exact same cookies. For the cache to be effective, though, we must pay attention to the session cookie only, and ignore cookies related to metrics and JavaScript. For our Varnish cache, we do exactly that (through custom VCL code), but we never did this for ATS.
And so work began to implement Lua code for ATS to identify session cookies, and treat all other cookies as if they don’t exist — but only within the context of finding a match in the cache, restoring them right after.
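The gist of that normalization, sketched in Python rather than the actual ATS Lua plugin (the cookie names and matching rules here are illustrative only):

```python
# Keep only cookies that can change the response; drop the rest for the
# duration of the cache lookup so unrelated cookies don't fragment the cache.
SESSION_COOKIE_MARKERS = ("session", "token", "userid")   # assumed patterns

def cache_lookup_cookies(cookie_header: str) -> str:
    kept = []
    for part in cookie_header.split(";"):
        name = part.split("=", 1)[0].strip().lower()
        if any(marker in name for marker in SESSION_COOKIE_MARKERS):
            kept.append(part.strip())
    return "; ".join(kept)

print(cache_lookup_cookies("WMF-Last-Access=27-Feb; enwikiSession=abc123"))
# -> "enwikiSession=abc123": the metrics cookie no longer causes a cache miss
```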
In our Singapore data center, our ATS latency improved by 25% at the p75, e.g. from 475ms down to 350ms compared to the same time and day a week earlier. That’s a 125ms drop, which is one of the biggest reductions we’ve ever documented!
The reduction is due to more requests being served directly from the cache, instead of the backend rendering the page anew for each combination of unrelated cookies. We can also measure this as a ratio between cache hits and cache misses, known as the cache hit ratio. For the Amsterdam data center, ATS cache hits went from ~600/s to 1200/s. As a percentage of all backend traffic, that’s from 2% to 4%. (The CDN frontend enjoys a cache hit ratio of 90-99%, depending on entrypoint.)
Disk reads
In September, Valentín created a Grafana dashboard to explore metrics from internal operations within ATS. This is part of on-going work to establish a high-level SLO for ATS. ATS reads from disk as part of serving a cache hit. Valentín discovered that disk reads were regularly taking up to a whole second.
Most traffic passing through ATS is a cache miss, where we respond within 300ms at the p75 (latency shown earlier). For the subset where we serve a cache hit at the ATS layer, we generally respond within ~5ms, about 60 times faster. When we observed a cache hit taking 1000ms to respond, that is not only very slow, it is also notably slower than generating a fresh page from a MediaWiki server.
After ruling out timeout-related causes, Valentín traced the issue to the ATS cache_dir_sync operation. This operation synchronizes metadata about cache entries to disk, and runs once every few minutes. It takes about one minute, during which we consistently saw 0.1% of requests experience the delay. Cache reads had to wait for a safety lock held by a single sync for the entire server. Valentín worked around the issue by partitioning the cache into multiple volumes, with the sync (and its lock) applying only to a portion of the data. These are held for a shorter period of time, and less likely to overlap with a cache read in the first place. (our investigation, upstream issue)
On most ATS servers, the cache read p999 dropped from spiking at 1000ms down to a steady 1ms. That’s a 1000X reduction!
Note that this issue was not observable through the 75th percentile measure, because each sync affected a different 0.1% of requests, despite happening consistently throughout the day. This is why we don’t recommend p75 for backend objectives. Left unresolved, far more than 0.1% of clients would eventually experience the issue. Resolving it avoids constantly spending our SLO error budget, preserving that budget for more unusual and unforeseen issues down the line.
Web Perf Hero award
The Web Perf Hero award is given to individuals who have gone above and beyond to improve the web performance of Wikimedia projects. The initiative is led by the Performance Team and started mid-2020. It is awarded quarterly and takes the form of a Phabricator badge.
In 2016, the Wikimedia Foundation deployed HTTP/2 (or “H2”) support to our CDN. At the time, we used Nginx for TLS termination and two layers of Varnish for caching. We anticipated a possible speed-up as part of the transition, and also identified opportunities to leverage H2 in our architecture.
The HTTP/2 protocol was standardized through the IETF, with Google Chrome shipping support for the experimental SPDY protocol ahead of the standard. Brandon Black (SRE Traffic) led the deployment and had to make a choice between SPDY and H2. We launched with SPDY in 2015, as H2 support was still lacking in many browsers, and Nginx did not support having both. By May 2016, browser support had picked up and we switched to H2.
Goodbye domain sharding?
You can benefit more from HTTP/2 through domain consolidation. The following improvements were achieved by effectively undoing domain sharding:
Faster delivery of static CSS/JS assets. We changed ResourceLoader to no longer use a dedicated cookieless domain (“bits.wikimedia.org”), and folded our asset entrypoint back into the MediaWiki platform for faster requests local to a given wiki domain name (T107430).
Speed up mobile page loads, specifically mobile-device “m-dot” redirects. We consolidated the canonical and mobile domains behind the scenes, through DNS. This allows the browser to reuse and carry the same HTTP/2 connection over a cross-domain redirect (T124482).
Faster Geo service and faster localized fundraising banner rendering. The Geo service was moved from geoiplookup.wikimedia.org to /geoiplookup on each wiki. The service was later removed entirely, in favor of an even faster zero-roundtrip solution (0-RTT): an edge-injected cookie within the Wikimedia CDN (T100902, patch). This transfers the information directly alongside the pageview, without the delay of a JavaScript payload requesting it after the fact.
Could HTTP/2 be slower than HTTP/1?
During the SPDY experiment, Peter Hedenskog noticed early on that SPDY and HTTP/2 have a very real risk of being slower than HTTP/1. We observed this through our synthetic testing infrastructure.
In HTTP/1, all resources are considered equal. When your browser navigates to an article, it creates a dedicated connection and starts downloading HTML from the server. The browser streams, parses, and renders in real-time as each chunk arrives. The browser creates additional connections to fetch stylesheets and images when it encounters references to them. For a typical article, MediaWiki’s stylesheets are notably smaller than the body content. This means, despite naturally being discovered from within (and thus after the start of) the HTML download, the CSS download generally finishes first, while chunks from the HTML continue to trickle in. This is good, because it means we can achieve the First Paint and Visually Complete milestones (above-the-fold) on page views before the HTML has fully downloaded in the background.
Page load over HTTP/1.
In HTTP/2, the browser assigns a bandwidth priority to each resource, and resources share a single connection. This differs from HTTP/1, where each resource gets its own connection, and lower-level networks and routers divide bandwidth roughly equally between the seemingly unrelated connections. During the time when the HTML and CSS downloads overlap, each HTTP/1 connection enjoyed about half the available bandwidth. This was enough for the CSS to slip through without any apparent delay. With HTTP/2, we observed that Chrome was not receiving any CSS response data until the HTML was mostly done.
Page load over SPDY.
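A back-of-the-envelope calculation with made-up numbers shows why this hurts First Paint:

```python
# Illustrative numbers only: a 500 KB HTML stream and a 50 KB stylesheet
# sharing roughly 1000 KB/s of bandwidth.
html_kb, css_kb, bandwidth_kb_per_s = 500, 50, 1000

# HTTP/1: two connections split the bandwidth while both are active,
# so the small CSS file still completes quickly.
css_done_http1 = css_kb / (bandwidth_kb_per_s / 2)            # 0.10 s

# Misprioritized HTTP/2: the CSS effectively waits behind the HTML stream.
css_done_http2_bug = (html_kb + css_kb) / bandwidth_kb_per_s  # 0.55 s

# Correctly prioritized HTTP/2: CSS goes first and finishes even sooner.
css_done_http2_ok = css_kb / bandwidth_kb_per_s               # 0.05 s

print(f"CSS complete: HTTP/1 ~{css_done_http1:.2f}s, "
      f"H2 misprioritized ~{css_done_http2_bug:.2f}s, "
      f"H2 prioritized ~{css_done_http2_ok:.2f}s")
```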
This HTTP/2 feature can solve a similar issue in reverse. If a webpage suffers from large amounts of JavaScript code and below-the-fold images being downloaded during the page load, under HTTP/1 those low-priority resources would compete for bandwidth and starve the critical HTML and CSS downloads. The HTTP/2 priority system allows the browser and server to agree, and give more bandwidth to the important resources first. In our case, however, a bug in Chrome caused CSS to effectively have a lower priority relative to HTML (chromium #586938).
First paint regression correlated with SPDY rollout. (Ori Livneh, T96848#2199791)
We confirmed the hypothesis by disabling SPDY support on the Wikimedia CDN for a week (T125979). After Chrome resolved the bug, we transitioned from SPDY to HTTP/2 (T166129, T193221). This transition saw improvements both to how web browsers give signals to the server, and the way Nginx handled those signals.
As it stands today, page load time is overall faster on HTTP/2, and the CSS once again often finishes before the HTML. Thus, we achieve the same great early First Paint and Visually Complete milestones that we were used to from HTTP/1. But, we do still see edge cases where HTTP/2 is sometimes not able to re-negotiate priorities quickly enough, causing CSS to needlessly be held back by HTML chunks that have already filled up the network pipes for that connection (chromium #849106, still unresolved as of this writing).
Lessons learned
These difficulties in controlling bandwidth prioritization taught us that domain consolidation isn’t a cure-all. We decided to keep operating our thumbnail service at upload.wikimedia.org through a dedicated IP and thus a dedicated connection, for now (T116132).
Browsers may reuse connections for multiple domains if an existing HTTPS connection carries a TLS certificate that covers the other domain (for example, via its Subject Alternative Names), even when that domain corresponds to a different IP address in DNS. Under certain conditions, this can lead to a surprising HTTP 404 error (T207340, mozilla #1363451, mozilla #1222136). Emanuele Rocca from the SRE Traffic Team mitigated this by implementing HTTP 421 response codes in compliance with the spec. This way, visitors affected by non-compliant browsers and middleware will automatically recover and reconnect accordingly.