
Unifying our mobile and desktop domains

How we achieved 20% faster mobile response times, improved SEO, and reduced infrastructure load.

Until now, when you visited a wiki (like en.wikipedia.org), the server responded in one of two ways: a desktop page, or a redirect to the equivalent mobile URL (like en.m.wikipedia.org). This mobile URL in turn served the mobile version of the page from MediaWiki. Our servers have operated this way since 2011, when we deployed MobileFrontend.

Before: for requests from mobile clients, the Wikimedia CDN responds on en.wikipedia.org with a redirect to en.m.wikipedia.org, which then responds with the mobile HTML. After: the Wikimedia CDN responds directly with the mobile HTML.
Diagram of the technical change.

Over the past two months we unified the mobile and desktop domains for all wikis (timeline). This means we no longer redirect mobile users to a separate domain while the page is loading.

We completed the change on Wednesday 8 October after deploying to English Wikipedia. The mobile domains became dormant within 24 hours, which confirms that most mobile traffic arrived on Wikipedia via the standard domains and thus experienced a redirect until now.[1][2]

Why?

Why did we have a separate mobile domain? And, why did we believe that changing this might benefit us?

The year is 2008, and websites large and small have a mobile subdomain. The BBC, IMDb, Facebook, and newspapers around the world feature the iconic m-dot domain. For Wikipedia, a separate mobile domain made the mobile experiment low-risk to launch and avoided technical limitations. It became the default in 2011 by way of a redirect.

Fast-forward seventeen years, and much has changed. It is no longer common for websites to have m-dot domains. Wikipedia’s use of one is surprising to our present-day audience, and it may decrease the perceived strength of domain branding. The technical limitations we had in 2008 have long been solved, with the Wikimedia CDN having efficient and well-tested support for variable responses under a single URL. And above all, we had reason to believe Google had stopped supporting separate mobile domains, which motivated the project to start when it did.
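As a rough sketch of what “variable responses under a single URL” means in practice: the edge picks the mobile or desktop HTML for the same URL and tells caches to key on that decision. This is an illustrative Python sketch under those assumptions, not our actual CDN configuration, and real device classification is more careful than this:

```python
def render_desktop_html() -> str:
    return "<html><!-- desktop skin --></html>"

def render_mobile_html() -> str:
    return "<html><!-- mobile skin --></html>"

def choose_variant(request_headers: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Serve mobile or desktop HTML for the same URL, keyed by device class."""
    user_agent = request_headers.get("User-Agent", "")
    # Crude device classification, for illustration only.
    is_mobile = any(token in user_agent for token in ("Mobile", "Android", "iPhone"))
    body = render_mobile_html() if is_mobile else render_desktop_html()
    headers = {
        # Tell shared caches that this URL has per-device variants. A real CDN
        # would normalize the cache key (e.g. a device-class hint) rather than
        # vary on the raw User-Agent, to keep the cache effective.
        "Vary": "User-Agent",
        "Content-Type": "text/html; charset=utf-8",
    }
    return body, headers

body, headers = choose_variant({"User-Agent": "Mozilla/5.0 (iPhone) Mobile Safari"})
print(headers)
```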

You can find a detailed history and engineering analysis in the Mobile domain sunsetting RFC along with weekly updates on mediawiki.org.

Site speed

Google used to link from mobile search results directly to our mobile domain, but last year this stopped. This exposed a huge part of our audience to the mobile redirect and regressed mobile response times by 10-20%.[2]

Google supported mobile domains in 2008 by letting you advertise a separate mobile URL. While Google only indexed the desktop site for content, they stored this mobile URL and linked to it when searching from a mobile device.[3] This allowed Google referrals to skip over the redirect.

Google introduced a new crawler in 2016, and gradually re-indexed the Internet with it.[4-7] This new “mobile-first” crawler acts like a mobile device rather than a desktop device, and removes the ability to advertise a separate mobile or desktop link. It’s now one link for everyone! Wikipedia.org was among the last sites Google switched, with May 2024 as the apparent change window.[2] This meant the 60% of incoming pageviews referred by Google now had to wait for the same redirect that the other 40% of traffic had experienced since 2011.[8]

Persian Wikipedia saw a quarter second cut in the “responseStart” metric from 1.0s to 0.75s.

Unifying our domains eliminated the redirect and led to a 20% improvement in mobile response times.[2] This improvement is both a recovery and a net improvement because it applies to everyone! It recovers the regression that Google-referred traffic started to experience last year, but also improves response times for all other traffic by the same amount.

The graphs below show how the change was felt worldwide. The “Worldwide p50” corresponds to what you might experience in Germany or Italy, with fast connectivity close to our data centers. The “Worldwide p80” resembles what you might experience in Iran browsing the Persian Wikipedia.

Worldwide p80 regressed 11% from 0.63s to 0.70s, then dropped 18% from 0.73s to 0.60s. Worldwide p75 regressed 13% to 0.61s, then dropped 19% to 0.52s. Worldwide p50 regressed 22% to 0.33s, then dropped 21% to 0.27s. Full table in the linked comment on Phabricator.
Check the Perf report to explore the underlying data and other regions.

SEO

The first site affected was not Wikipedia but Commons. Wikimedia Commons is the free media repository used by Wikipedia and its sister projects. Tim Starling found in June that only half of the 140 million pages on Commons were known to Google.[9] Of these known pages, 20 million had also been delisted, a number that had been growing by one million pages every month.[10] The cause turned out to be the mobile redirect: the new Google crawler, just like your browser, has to follow it.

After following the redirect, the crawler reads our page metadata, which points back to the standard domain as the preferred one. This creates a loop that can prevent a page from being updated or listed in Google Search. Delisting is not a matter of ranking, but of whether a page is in the search index at all.
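To make the loop concrete, here is a small illustrative Python sketch of the check involved: fetch a page the way a crawler would, follow any redirect, and read the canonical link, which points back to the standard domain. The example URL and the regex-based parsing are simplifications, not how Google actually crawls:

```python
import re
import urllib.request

def canonical_of(url: str) -> tuple[str, str]:
    """Fetch a URL, following redirects, and return (final_url, canonical_url)."""
    request = urllib.request.Request(url, headers={"User-Agent": "canonical-check-sketch/1.0"})
    with urllib.request.urlopen(request) as response:  # follows HTTP redirects by default
        final_url = response.geturl()
        html = response.read().decode("utf-8", errors="replace")
    match = re.search(r'<link rel="canonical" href="([^"]+)"', html)
    return final_url, match.group(1) if match else ""

# Before the change, a mobile crawler fetching the standard domain was redirected
# to the m-dot domain, whose metadata pointed right back at the standard domain.
print(canonical_of("https://en.wikipedia.org/wiki/Example"))
```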

Tim and I disabled the mobile redirect for Googlebot on Commons through an emergency intervention on June 23rd. Referrals then began to come back and kept rising for eleven weeks in a row, until reaching a 100% increase in Google referrals: from a baseline of 3 million weekly pageviews up to 6 million. Google’s data on clickthroughs shows a similar increase, from 1M to 1.8M “clicks”.[9]

Pageviews to Wikimedia Commons with agent type “user” (meaning not a known bot or spider) and referrer Google. After July 2025, weekly pageviews increase from 3 million to 6 million.
Google-referred pageviews in 2025.
A stable 1.0 million clicks per week in June and early July, then an increase to 1.8 million clicks per week in mid-July, where it stayed.
Weekly clicks (according to Google Search Console).

We reversed last year’s regression and set a new all-time high. We think there are three reasons Commons reached new highs:

  1. The redirect consumed half of the crawl budget, thus limiting how many pages could be crawled.[10][11]
  2. Google switched Commons to its new crawler some years before Wikipedia.[12] The index had likely been shrinking for two years already.
  3. Pages on Commons have a sparse link graph. Wikipedia has a rich network of links between articles, whereas pages on Commons represent a photo with an image description that rarely links to other files. This unique page structure makes it hard to discover Commons pages through recursive crawling without a sitemap.

Unifying our domains lifted a ceiling we didn’t know was there!

The MediaWiki software has a built-in sitemap generator, but we disabled this on Wikimedia sites over a decade ago.[13] We decided to enable it for Commons and submitted it to Google on August 6th.[14][15] Google has since indexed 70 million new pages for Commons, up 140% since June.[9]
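For readers unfamiliar with sitemaps: a sitemap is an XML list of URLs (optionally with last-modified dates) handed to search engines, so crawling no longer depends on following links. The sketch below builds a minimal one in Python; it is illustrative only and unrelated to the actual MediaWiki Sitemap API:

```python
from datetime import date
from xml.sax.saxutils import escape

def build_sitemap(entries: list[tuple[str, date]]) -> str:
    """Return a minimal sitemap document per https://www.sitemaps.org/protocol.html."""
    urls = "\n".join(
        f"  <url><loc>{escape(loc)}</loc><lastmod>{lastmod.isoformat()}</lastmod></url>"
        for loc, lastmod in entries
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{urls}\n"
        "</urlset>\n"
    )

# Hypothetical example entry; a wiki the size of Commons would shard entries
# across many sitemap files referenced from a sitemap index.
print(build_sitemap([
    ("https://commons.wikimedia.org/wiki/File:Example.jpg", date(2025, 8, 6)),
]))
```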

We also found that less than 0.1% of videos on Commons were recognised by Google as video watch pages (for the Google Search “Videos” tab). I raised this in a partnership meeting with Google Search, and it may have been a bug on their end. Commons started showing up in Google Videos a week later.[16][17]

Link sharing UX

When you shared a link from a mobile device, that link hardcoded the mobile domain, so it opened the mobile site even when received on desktop. The “Desktop” link in the footer of the mobile site pointed to the standard domain and disabled the standard-to-mobile redirect for you, on the assumption that you had arrived at the mobile site via the redirect. But that choice was not remembered on the mobile domain itself, and there was no equivalent mobile-to-standard redirect for when you arrived there directly. This meant a shared mobile link always presented the mobile site, even after you had opted out on desktop.

Everyone now shares the same domain, which naturally shows the appropriate version.

There is a long tail of stable referrals from news articles, research papers, blogs, talk pages, and mailing lists that refer to the mobile domain. We plan to support this indefinitely. To limit operational complexity, we now serve these through a simple whole-domain redirect. This has the benefit of retroactively fixing the UX issue because old mobile links now redirect to the standard domain.[18]
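Conceptually, the whole-domain redirect just strips the m-dot label from the hostname and keeps the rest of the URL. Here is a minimal Python sketch of that mapping (the real rule lives in our CDN configuration, not in Python, and handles more cases than this):

```python
from urllib.parse import urlsplit, urlunsplit

def to_standard_domain(url: str) -> str:
    """Rewrite a legacy mobile-domain URL to its standard-domain equivalent.

    Illustrative only: assumes hostnames like en.m.wikipedia.org or
    commons.m.wikimedia.org, and keeps path, query, and fragment intact.
    """
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".")
    if "m" in host_labels:
        host_labels.remove("m")          # en.m.wikipedia.org -> en.wikipedia.org
    new_host = ".".join(host_labels)
    return urlunsplit((parts.scheme, new_host, parts.path, parts.query, parts.fragment))

assert to_standard_domain("https://en.m.wikipedia.org/wiki/Example#History") == \
    "https://en.wikipedia.org/wiki/Example#History"
```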

This resolves a long-standing bug with workarounds in the form of shared user scripts,[19] browser extensions,[20] and personal scripts.[24]

Infrastructure load

After publishing an edit, MediaWiki instructs the Wikimedia CDN to clear the cache of affected articles (“purge”). It has been a perennial concern of SRE teams at WMF that our CDN purge rates are unsustainable. For every purge from MediaWiki core, the MobileFrontend extension would add a copy for the mobile domain.

Daily purge workload.

After unifying our domains we turned off these duplicate purges, and cut the MediaWiki purge rate by 50%. Over the past weeks the Wikimedia CDN processed approximately 4 billion fewer purges a day. MediaWiki used to send purges at a baseline rate of 40K/second with spikes up to 300K/second, and both have been halved. Factoring in other services, the Wikimedia CDN now receives 20% to 40% fewer purges per second overall, depending on the edit activity.[18]
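The arithmetic behind the 50% cut is straightforward: before, each edit-triggered purge went out once for the standard URL and once more for its mobile-domain twin; now only the standard URL is purged, and the CDN’s per-device variants share it. A hypothetical Python sketch, not MediaWiki code:

```python
def urls_to_purge(title: str, unified_domains: bool) -> list[str]:
    """Return the CDN URLs to purge for one page after an edit (illustrative only)."""
    standard = f"https://en.wikipedia.org/wiki/{title}"
    if unified_domains:
        return [standard]              # one purge; cached variants share the URL
    mobile = f"https://en.m.wikipedia.org/wiki/{title}"
    return [standard, mobile]          # MobileFrontend duplicated each purge

print(urls_to_purge("Example", unified_domains=False))  # 2 purges per page
print(urls_to_purge("Example", unified_domains=True))   # 1 purge per page
```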

Footnotes

  1. T403510: Main rollout, Wikimedia Phabricator.
  2. T405429: Detailed traffic stats and performance reports, Wikimedia Phabricator.
  3. Running desktop and mobile versions of your site (2009), developers.google.com.
  4. Mobile-first indexing (2016), developers.google.com.
  5. Google makes mobile-first indexing default for new domains (2019), TechCrunch.
  6. Mobile-first indexing has landed (2023), developers.google.com.
  7. Mobile indexing vLast final final (Jun 2024), developers.google.com.
  8. Mobile domain sunsetting RFC § Footnote: Wikimedia pageviews (Feb 2025), mediawiki.org.
  9. T400022: Commons SEO review, Wikimedia Phabricator.
  10. T54647: Image pages not indexed by Google, Wikimedia Phabricator.
  11. Crawl Budget Management For Large Sites, developers.google.com.
  12. I don’t have a guesstimate for when Google switched Commons to its new crawler. I pinpointed May 2024 as the switch date for Wikipedia based on the new redirect impacting page load times (i.e. a non-zero fetch delay). For Commons, this fetch delay had already been non-zero since at least 2018. This suggests Google’s old crawler linked mobile users to Commons’ canonical domain, unlike Wikipedia, which it linked to the mobile domain until last year. Raw perf data: P73601.
  13. History of sitemaps at Wikimedia by Tim Starling, wikitech.wikimedia.org.
  14. T396684: Develop Sitemap API for MediaWiki, Wikimedia Phabricator.
  15. T400023: Deploy Sitemap API for Commons, Wikimedia Phabricator.
  16. T396168: Video pages not indexed by Google, Wikimedia Phabricator.
  17. Google Videos Search results for commons.wikimedia.org.
  18. T405931: Clean up and redirect, Wikimedia Phabricator.
  19. Wikipedia:User scripts/List on en.wikipedia.org. Featuring NeverUseMobileVersion, AutoMobileRedirect, and unmobilePlus.
  20. Redirector (10,000 users), Chrome Web Store.
  21. How can I force my desktop browser to never use mobile Wikipedia (2018), StackOverflow.
  22. Skip Mobile Wikipedia (726 users), Firefox Add-ons.
  23. Search for “mobile wikipedia”, Firefox Add-ons.
  24. Mobile domain sunsetting 2025 Announcement § Personal script workarounds (Sep 2025), mediawiki.org.

About this post

Featured image by PierreSelim, CC BY 3.0, via Wikimedia Commons.

Wikimedia Toolforge: migrating Kubernetes from PodSecurityPolicy to Kyverno

Summary: this article shares the experience and lessons learned from migrating away from Kubernetes PodSecurityPolicy to Kyverno on the Wikimedia Toolforge platform.

Christian David, CC BY-SA 4.0, via Wikimedia Commons

Wikimedia Toolforge is a Platform-as-a-Service, built with Kubernetes, and maintained by the Wikimedia Cloud Services team (WMCS). It is completely free and open, and we welcome anyone to use it to build and host tools (bots, webservices, scheduled jobs, etc) in support of Wikimedia projects. 

We provide a set of platform-specific services, command line interfaces, and shortcuts to help with setting up webservices and jobs, and with tasks like building container images or using databases. Using these interfaces makes the underlying Kubernetes system pretty much invisible to users. We also allow direct access to the Kubernetes API, and some advanced users do interact with it directly.

Each account has a Kubernetes namespace where they can freely deploy their workloads. We have a number of controls in place to ensure the performance, stability, and fairness of the system, including quotas, RBAC permissions, and, until recently, PodSecurityPolicies (PSP). At the time of this writing, we had around 3,500 Toolforge tool accounts in the system.

We adopted PSP early, in 2019, as a way to make sure Pods had the correct runtime configuration. We needed Pods to stay within the safe boundaries of a set of pre-defined parameters. Back when we adopted PSP there was already the option to use third-party agents like OpenPolicyAgent Gatekeeper, but we decided not to invest in them and went with a native, built-in mechanism instead.

In 2021 it was announced that the PSP mechanism would be deprecated and removed in Kubernetes 1.25. Even though we had been warned years in advance, we did not prioritize the migration away from PSP until we were on Kubernetes 1.24 and blocked, unable to upgrade further without taking action.

The WMCS team explored different alternatives for this migration, but eventually we decided to go with Kyverno as a replacement for PSP. With that decision began the journey described in this blog post.

First, we needed to refactor the source code of one of the key components of our Toolforge Kubernetes: maintain-kubeusers. This custom piece of software, built in-house, contains the logic to fetch accounts from LDAP and do the necessary instrumentation on Kubernetes to accommodate each one: create the namespace, RBAC, quota, a kubeconfig file, etc. With the refactor, we introduced a proper reconciliation loop, so that the software has a notion of what needs to be done for each account: what is missing, what to delete, what to upgrade, and so on. This allows us to easily deploy new resources for each account, or iterate on their definitions.
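As a rough illustration of the reconciliation idea (a simplified sketch, not the actual maintain-kubeusers code; the resource names are invented):

```python
# Simplified reconciliation sketch: for every account, compare the resources we
# want with the resources that exist, and create/delete the difference.
DESIRED_RESOURCES = ("namespace", "rbac", "quota", "kubeconfig")

def reconcile(accounts: dict[str, set[str]], existing: dict[str, set[str]]) -> list[str]:
    """accounts: account -> desired resources; existing: account -> present resources."""
    actions = []
    for account, desired in accounts.items():
        present = existing.get(account, set())
        for resource in desired - present:
            actions.append(f"create {resource} for {account}")
        for resource in present - desired:
            actions.append(f"delete {resource} for {account}")
    # Accounts gone from LDAP but still present in the cluster get cleaned up too.
    for account in existing.keys() - accounts.keys():
        actions.append(f"tear down all resources for {account}")
    return actions

print(reconcile({"tool-demo": set(DESIRED_RESOURCES)}, {"tool-demo": {"namespace"}}))
```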

The initial version of the refactor had a number of problems, though. For one, the new version of maintain-kubeusers did more filesystem interaction than the previous version, resulting in a slow reconciliation loop over all the accounts. We use NFS as the underlying storage system for Toolforge, and it can be very slow for reasons beyond the scope of this blog post. This was corrected in the days after the initial refactor rollout. A side note with an implementation detail: we store a ConfigMap in each account namespace with the state of each resource. Keeping more state in this ConfigMap was our solution for avoiding additional NFS latency.
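Here is a hedged sketch of that state-in-a-ConfigMap approach using the standard Kubernetes Python client; the ConfigMap name, keys, and namespace are invented for the example:

```python
from kubernetes import client, config

STATE_CONFIGMAP = "maintain-kubeusers-state"   # hypothetical name

def load_state(core: client.CoreV1Api, namespace: str) -> dict:
    """Read the per-account state ConfigMap, or return an empty state if absent."""
    try:
        return core.read_namespaced_config_map(STATE_CONFIGMAP, namespace).data or {}
    except client.exceptions.ApiException as error:
        if error.status == 404:
            return {}
        raise

def save_state(core: client.CoreV1Api, namespace: str, state: dict) -> None:
    """Persist reconciliation state in the API server instead of on slow NFS."""
    body = client.V1ConfigMap(metadata=client.V1ObjectMeta(name=STATE_CONFIGMAP), data=state)
    try:
        core.replace_namespaced_config_map(STATE_CONFIGMAP, namespace, body)
    except client.exceptions.ApiException as error:
        if error.status != 404:
            raise
        core.create_namespaced_config_map(namespace, body)

config.load_kube_config()                      # or load_incluster_config() inside the cluster
api = client.CoreV1Api()
state = load_state(api, "tool-demo")
state["rbac"] = "v2"                           # record which resource version was applied
save_state(api, "tool-demo", state)
```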

I initially estimated this refactor would take me a week to complete, but unfortunately it took around three weeks instead. Before the refactor, updating the definition of a resource required several manual steps and cleanups. The process is now automated, more robust, performant, efficient, and clean. So in my opinion it was worth it, even if it took more time than expected.

Then we worked on the Kyverno policies themselves. Because we had a very particular PSP setup, we tried to replicate its semantics on a 1:1 basis as much as possible in order to ease the transition. This involved things like transparent mutation of Pod resources followed by validation. Additionally, we had a different PSP definition for each account, so we decided to create a separate namespaced Kyverno policy resource for each account namespace (remember, we had 3.5k accounts).

We created a Kyverno policy template that we would then render and inject for each account.
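As an illustration of the render-and-inject step, here is a hedged sketch: a minimal, made-up namespaced Kyverno Policy as a template, rendered per account namespace and created through the Kubernetes custom-objects API. It is not one of our real Toolforge policies, and the rule shown is only an example:

```python
import yaml
from kubernetes import client, config

# Minimal, made-up namespaced Kyverno policy; the real Toolforge policies
# replicate our former PSP semantics (mutation plus validation).
POLICY_TEMPLATE = """
apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: toolforge-pod-baseline
  namespace: {namespace}
spec:
  validationFailureAction: Audit        # flipped to Enforce in a later stage
  rules:
    - name: require-run-as-non-root
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Pods must not run as root."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
"""

def inject_policy(namespace: str) -> None:
    body = yaml.safe_load(POLICY_TEMPLATE.format(namespace=namespace))
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kyverno.io", version="v1", namespace=namespace,
        plural="policies", body=body,
    )

config.load_kube_config()
for account_namespace in ["tool-demo"]:          # in production: ~3,500 namespaces
    inject_policy(account_namespace)
```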

For developing and testing all this (maintain-kubeusers and the Kyverno bits), we had a project called lima-kilo: a local Kubernetes setup replicating production Toolforge. Each engineer used it on their laptop as a common development environment.

We had planned the migration from PSP to Kyverno policies in stages, like this:

  1. update our internal template generators to make Pod security settings explicit
  2. introduce Kyverno policies in Audit mode
  3. see how the cluster behaved with them, and correct any offending resources reported by the new policies
  4. modify Kyverno policies and set them in Enforce mode
  5. drop PSP

In stage 1, we updated things like the toolforge-jobs-framework and tools-webservice.

In stage 2, when we deployed the 3.5k Kyverno policy resources, our production cluster died almost immediately. Surprise. All the monitoring went red, the Kubernetes apiserver became unresponsive, and we were unable to perform any administrative actions in the Kubernetes control plane, or even on the underlying virtual machines. All Toolforge users were impacted. This was a full-scale outage that required the energy of the whole WMCS team to recover from. We temporarily disabled Kyverno until we could learn what had occurred.

This incident happened despite prior testing in lima-kilo and in another pre-production cluster we had, called Toolsbeta. But we had not tested with that many policy resources. Clearly, this was something scale-related. After the incident, I went and created 3.5k Kyverno policy resources in lima-kilo, and indeed I was able to reproduce the outage. We took a number of measures, corrected a few errors in our infrastructure, and reached out to the Kyverno upstream developers for advice. In the end we did the following to accommodate the setup to our needs:

  • corrected the external HAProxy health checks for the Kubernetes apiservers, from checking just for open TCP ports to actually checking the /healthz HTTP endpoint, which more accurately reflects the health of each apiserver.
  • built a more realistic development environment: in lima-kilo, we created a couple of helper scripts to create/delete 4000 policy resources, each in a different namespace.
  • greatly over-provisioned memory on the Kubernetes control plane servers, that is, more memory on the base virtual machines hosting the control plane. Scaling the memory headroom of the apiserver prevents it from running out of memory and crashing the whole system. We went from 8GB of RAM per virtual machine to 32GB. In our cluster, a single apiserver pod could eat 7GB of memory on a normal day, so 8GB on the base virtual machine was clearly not enough. I also sent a patch proposal to the Kyverno upstream documentation suggesting they clarify the additional memory pressure on the apiserver.
  • corrected the resource requests and limits of Kyverno to more accurately describe our actual usage.
  • increased the number of replicas of the Kyverno admission controller to 7, so admission requests could be handled in a more timely manner by Kyverno.

I have to admit, I was briefly tempted to drop Kyverno, and even to stop pursuing an external policy agent entirely and write our own custom admission controller, out of concerns over the performance of this architecture. However, after applying all the measures listed above, the system became very stable, so we decided to move forward. The second attempt at deploying it all went through just fine. No outage this time 🙂

When we were in stage 4 we detected another bug. We had been following the Kubernetes upstream documentation for setting securityContext to the right values. In particular, we were enforcing procMount to be set to its default value, which per the docs was ‘DefaultProcMount’. However, that string is the name of the internal variable in the source code, whereas the actual default value is the string ‘Default’. This caused Pods to be rejected by Kyverno (correctly, per the policy as written) while we figured out the problem. We sent a patch upstream to fix this problem.
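To make the gotcha concrete, here is a tiny illustrative comparison of the two strings; the field lives in the container securityContext, and the dicts below are just examples, not our policy code:

```python
# What we enforced at first (taken from the docs) versus what the API actually stores.
wrong_pattern   = {"securityContext": {"procMount": "DefaultProcMount"}}  # internal Go constant name
correct_pattern = {"securityContext": {"procMount": "Default"}}           # actual default field value

running_container = {"securityContext": {"procMount": "Default"}}
assert running_container != wrong_pattern      # hence Kyverno rejected otherwise-fine Pods
assert running_container == correct_pattern
```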

We finally had everything in place, reached stage 5, and were able to disable PSP. We unloaded the PSP admission controller from the Kubernetes apiserver and deleted every individual PSP definition. Everything was very smooth in this last step of the migration.

This whole PSP project, including the maintain-kubeusers refactor, the outage, and all the different migration stages, took roughly three months to complete.

For me there are a number of valuable lessons to take from this project. For one, scale is something to consider, and test for, when evaluating a new architecture or software component. Not doing so can lead to service outages or unexpectedly poor performance. This is in the first chapter of the SRE handbook, but we got a reminder the hard way 🙂
