The rise of Mastodon has made me so much more aware of government services requiring us to use private companies’ systems to communicate with them and access services.
Sitting on a Dutch train just now I was shown on a screen “feeling unsafe in the train? Contact us via WhatsApp”.
Jones says the railway operator’s website also contains SMS reporting instructions, but that was not shown on the train itself.
One of the side effects of the decline of X, née Twitter, is the splintering of its de facto customer support and alert capabilities. Plenty of organizations still use it that way. But it should only be one option. Apps like WhatsApp should not be the preferred contact method, either. Private companies’ contact methods should be available, sure — meet people where they are — but a standard method should always be just as easily available.
Summary: this article shares the experience and lessons learned from migrating the Wikimedia Toolforge platform away from Kubernetes PodSecurityPolicy and onto Kyverno.
Wikimedia Toolforge is a Platform-as-a-Service, built with Kubernetes, and maintained by the Wikimedia Cloud Services team (WMCS). It is completely free and open, and we welcome anyone to use it to build and host tools (bots, webservices, scheduled jobs, etc.) in support of Wikimedia projects.
We provide a set of platform-specific services, command line interfaces, and shortcuts to help with setting up webservices, jobs, and tasks like building container images or using databases. Using these interfaces makes the underlying Kubernetes system pretty much invisible to users. We also allow direct access to the Kubernetes API, and some advanced users do interact with it directly.
Each account has a Kubernetes namespace where they can freely deploy their workloads. We have a number of controls in place to ensure performance, stability, and fairness of the system, including quotas, RBAC permissions, and, until recently, PodSecurityPolicies (PSP). At the time of this writing, we had around 3,500 Toolforge tool accounts in the system. We adopted PSP early, in 2019, as a way to make sure Pods had the correct runtime configuration: we needed Pods to stay within the safe boundaries of a set of pre-defined parameters. Back when we adopted PSP there was already the option to use third-party agents, like Open Policy Agent Gatekeeper, but we decided not to invest in them and went with a native, built-in mechanism instead.
The WMCS team explored different alternatives for this migration, but we eventually decided to go with Kyverno as a replacement for PSP. With that decision began the journey described in this blog post.
First, we needed a source code refactor of one of the key components of our Toolforge Kubernetes setup: maintain-kubeusers. This custom piece of software, built in-house, contains the logic to fetch accounts from LDAP and do the necessary instrumentation on Kubernetes to accommodate each one: create a namespace, RBAC, quota, a kubeconfig file, and so on. With the refactor, we introduced a proper reconciliation loop, so that the software has a notion of what needs to be done for each account: what is missing, what to delete, what to upgrade, and so on. This allows us to easily deploy new resources for each account, or iterate on their definitions.
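A reconciliation loop of this kind can be sketched in a few lines. The following is a minimal, hypothetical illustration using the official Kubernetes Python client; the function names and the exact set of per-account resources are assumptions for illustration, not the actual maintain-kubeusers code:

```python
# Minimal, hypothetical sketch of a reconciliation loop in the spirit of
# maintain-kubeusers. Names and the resource set are illustrative only.
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def ensure_namespace(core: client.CoreV1Api, name: str) -> None:
    """Create the account namespace if it does not already exist."""
    try:
        core.read_namespace(name)
    except ApiException as e:
        if e.status != 404:
            raise
        core.create_namespace(
            client.V1Namespace(metadata=client.V1ObjectMeta(name=name))
        )


def reconcile(accounts: list[str]) -> None:
    """Converge the cluster towards the desired state for each account."""
    config.load_kube_config()
    core = client.CoreV1Api()
    for account in accounts:
        namespace = f"tool-{account}"  # illustrative naming scheme
        ensure_namespace(core, namespace)
        # A real implementation would also ensure RBAC bindings, quotas,
        # kubeconfig credentials, and policies, and delete anything that
        # is no longer part of the account's definition.


if __name__ == "__main__":
    reconcile(["example-tool"])  # accounts would normally come from LDAP
```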
The initial version of the refactor had a number of problems, though. For one, the new version of maintain-kubeusers was doing more filesystem interaction than the previous one, resulting in a slow reconciliation loop over all the accounts. We used NFS as the underlying storage system for Toolforge, and it could be very slow, for reasons beyond the scope of this blog post. This was corrected in the days following the initial refactor rollout. A side note on an implementation detail: we store a ConfigMap in each account namespace with the state of each resource. Storing more state in this ConfigMap was our solution to avoid additional NFS latency.
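To make that detail concrete, here is a rough sketch of how per-account state could be cached in such a ConfigMap so the loop can skip slow NFS-backed checks; again, the names are made up and this is not the actual implementation:

```python
# Hypothetical sketch: keep per-resource state in a ConfigMap inside the
# account namespace, so reconciliation avoids extra NFS round trips.
from kubernetes import client
from kubernetes.client.rest import ApiException

STATE_CONFIGMAP = "maintain-kubeusers-state"  # illustrative name


def read_state(core: client.CoreV1Api, namespace: str) -> dict:
    """Return the cached state for this account, or an empty dict."""
    try:
        cm = core.read_namespaced_config_map(STATE_CONFIGMAP, namespace)
        return cm.data or {}
    except ApiException as e:
        if e.status == 404:
            return {}
        raise


def write_state(core: client.CoreV1Api, namespace: str, state: dict) -> None:
    """Create or update the ConfigMap holding the cached state."""
    body = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name=STATE_CONFIGMAP, namespace=namespace),
        data=state,
    )
    try:
        core.replace_namespaced_config_map(STATE_CONFIGMAP, namespace, body)
    except ApiException as e:
        if e.status != 404:
            raise
        core.create_namespaced_config_map(namespace, body)
```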
I initially estimated this refactor would take me a week to complete, but unfortunately it took around three weeks instead. Prior to the refactor, updating the definition of a resource required several manual steps and cleanups. The process is now automated, more robust, performant, efficient, and clean. So in my opinion it was worth it, even if it took more time than expected.
Then, we worked on the Kyverno policies themselves. Because we had a very particular PSP setup, in order to ease the transition we tried to replicate its semantics on a 1:1 basis as much as possible. This involved things like transparent mutation of Pod resources followed by validation. Additionally, we had a different PSP definition for each account, so we decided to create a different namespaced Kyverno policy resource for each account namespace — remember, we had 3.5k accounts.
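For readers unfamiliar with Kyverno, a namespaced policy combining a mutation rule with a validation rule looks roughly like the following. This is a simplified, hypothetical example created through the Kubernetes Python client, not one of our production policies:

```python
# Simplified, hypothetical namespaced Kyverno Policy: first mutate Pods to
# fill in a default, then validate the result. Not the production policy.
from kubernetes import client, config

policy = {
    "apiVersion": "kyverno.io/v1",
    "kind": "Policy",  # the namespaced counterpart of ClusterPolicy
    "metadata": {"name": "pod-security-example", "namespace": "tool-example"},
    "spec": {
        "validationFailureAction": "Audit",  # switched to Enforce later
        "rules": [
            {
                # Transparently add runAsNonRoot if the Pod did not set it.
                "name": "default-run-as-non-root",
                "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
                "mutate": {
                    "patchStrategicMerge": {
                        "spec": {"securityContext": {"+(runAsNonRoot)": True}}
                    }
                },
            },
            {
                # Then validate that the (possibly mutated) Pod complies.
                "name": "require-run-as-non-root",
                "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
                "validate": {
                    "message": "Pods must not run as root.",
                    "pattern": {
                        "spec": {"securityContext": {"runAsNonRoot": True}}
                    },
                },
            },
        ],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    "kyverno.io", "v1", "tool-example", "policies", policy
)
```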
For developing and testing all this (maintain-kubeusers and the Kyverno bits), we had a project called lima-kilo, a local Kubernetes setup replicating production Toolforge. It was used by each engineer on their laptop as a common development environment.
We had planned the migration from PSP to Kyverno policies in stages, like this:
1. update our internal template generators to make Pod security settings explicit
2. introduce the Kyverno policies in Audit mode
3. observe how the cluster behaves with them, and correct any offending resources reported by the new policies
4. modify the Kyverno policies and set them to Enforce mode
5. drop PSP
In stage 1, we updated things like the toolforge-jobs-framework and tools-webservice.
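As an illustration of what making Pod security settings explicit can look like, the snippet below builds a container securityContext with the values spelled out rather than left to be injected by PSP. It is a generic, hypothetical example, not the actual toolforge-jobs-framework or tools-webservice code:

```python
# Hypothetical example: generated Pod templates carry the security settings
# explicitly instead of relying on PSP to inject them.
from kubernetes import client


def explicit_security_context() -> client.V1SecurityContext:
    return client.V1SecurityContext(
        run_as_non_root=True,
        run_as_user=52771,  # illustrative UID
        allow_privilege_escalation=False,
        privileged=False,
        capabilities=client.V1Capabilities(drop=["ALL"]),
    )


container = client.V1Container(
    name="tool",
    image="example.org/tool:latest",  # illustrative image
    security_context=explicit_security_context(),
)
```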
In stage 2, when we deployed the 3.5k Kyverno policy resources, our production cluster died almost immediately. Surprise. All the monitoring went red, the Kubernetes apiserver became unresponsive, and we were unable to perform any administrative actions on the Kubernetes control plane, or even on the underlying virtual machines. All Toolforge users were impacted. This was a full-scale outage that required the energy of the whole WMCS team to recover from. We temporarily disabled Kyverno until we could learn what had occurred.
This incident happened despite having tested beforehand in lima-kilo and in another pre-production cluster we had, called Toolsbeta. But we had not tested with that many policy resources; clearly, this was something scale-related. After the incident, I went ahead and created 3.5k Kyverno policy resources in lima-kilo, and indeed I was able to reproduce the outage. We took a number of measures, corrected a few errors in our infrastructure, and reached out to the Kyverno upstream developers for advice. In the end, we did the following to adapt the setup to our needs:
corrected the external HAProxy health checks for the Kubernetes apiserver, from checking just for open TCP ports to actually checking the /healthz HTTP endpoint, which more accurately reflects the health of each k8s apiserver.
created a more realistic development environment: in lima-kilo, we added a couple of helper scripts to create and delete 4,000 policy resources, each in a different namespace (a rough sketch follows after this list).
greatly over-provisioned memory on the Kubernetes control plane servers, that is, more memory in the base virtual machines hosting the control plane. Increasing the memory headroom of the apiserver prevents it from running out of memory and crashing the whole system. We went from 8GB RAM per virtual machine to 32GB. In our cluster, a single apiserver pod could eat 7GB of memory on a normal day, so having 8GB on the base virtual machine was clearly not enough. I also sent a patch proposal to the Kyverno upstream documentation suggesting they clarify the additional memory pressure on the apiserver.
increased the number of replicas of the Kyverno admission controller to 7, so admission requests could be handled by Kyverno in a more timely manner.
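The load-testing helper mentioned above can be approximated with a short script like this one; the stub policy and naming are simplified stand-ins for the real thing:

```python
# Rough sketch of a load-testing helper: create or delete one namespaced
# Kyverno Policy in each of N namespaces. The policy is a minimal stand-in.
import sys
from kubernetes import client, config

STUB_RULE = {
    "name": "require-run-as-non-root",
    "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
    "validate": {
        "message": "Pods must not run as root.",
        "pattern": {"spec": {"securityContext": {"runAsNonRoot": True}}},
    },
}


def main(action: str, count: int = 4000) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    custom = client.CustomObjectsApi()
    for i in range(count):
        ns = f"loadtest-{i}"
        if action == "create":
            core.create_namespace(
                client.V1Namespace(metadata=client.V1ObjectMeta(name=ns))
            )
            custom.create_namespaced_custom_object(
                "kyverno.io", "v1", ns, "policies",
                {
                    "apiVersion": "kyverno.io/v1",
                    "kind": "Policy",
                    "metadata": {"name": "load-test", "namespace": ns},
                    "spec": {"validationFailureAction": "Audit",
                             "rules": [STUB_RULE]},
                },
            )
        else:
            core.delete_namespace(ns)  # also removes the namespaced policy


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "create")
```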
I have to admit, I was briefly tempted to drop Kyverno, stop pursuing an external policy agent entirely, and write our own custom admission controller, out of concern over the performance of this architecture. However, after applying all the measures listed above the system became very stable, so we decided to move forward. The second attempt at deploying it all went through just fine. No outage this time 🙂
When we were in stage 4 we detected another bug. We had been following the Kubernetes upstream documentation for setting securityContext to the right values. In particular, we were enforcing procMount to be set to the default value, which per the docs was ‘DefaultProcMount’. However, that string is the name of the internal variable in the source code, whereas the actual default value is the string ‘Default’. This caused pods to be rightfully rejected by Kyverno while we figured out the problem. We sent a patch upstream to fix this problem.
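For reference, this is what the distinction looks like when building a securityContext with the Kubernetes Python client; the API accepts the string ‘Default’, while ‘DefaultProcMount’ is only the name of the constant in the Kubernetes source code:

```python
# The procMount field accepts the string "Default"; "DefaultProcMount" is
# merely the constant's name in the Kubernetes source, not a valid value.
from kubernetes import client

security_context = client.V1SecurityContext(proc_mount="Default")
```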
We finally had everything in place, reached stage 5, and were able to disable PSP. We unloaded the PSP admission controller from the Kubernetes apiserver and deleted every individual PSP definition. This last step of the migration went very smoothly.
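In practice this meant removing the PodSecurityPolicy plugin from the apiserver’s --enable-admission-plugins flag and then cleaning up the objects. A minimal sketch of the cleanup, assuming a cluster and Python client old enough to still serve the deprecated policy/v1beta1 API:

```python
# Sketch: delete every remaining PodSecurityPolicy object, assuming the
# deprecated policy/v1beta1 API is still served by both cluster and client.
from kubernetes import client, config

config.load_kube_config()
policy_api = client.PolicyV1beta1Api()
for psp in policy_api.list_pod_security_policy().items:
    policy_api.delete_pod_security_policy(psp.metadata.name)
```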
This whole PSP project, including the maintain-kubeusers refactor, the outage, and all the different migration stages took roughly three months to complete.
For me there are a number of valuable lessons to learn from this project. For one, scale is something to consider, and test, when evaluating a new architecture or software component. Not doing so can lead to service outages or unexpectedly poor performance. This is in the first chapter of the SRE handbook, but we got a reminder the hard way 🙂
One big change Apple announced Thursday is that developers who include link-outs in their apps will no longer need to accept the newer version of its business terms — which requires they commit to paying the Core Technology Fee (CTF) the EU is investigating.
In another notable revision of approach, Apple is giving developers more flexibility around how they can communicate external offers and the types of offers they can promote through their iOS apps. Apple said developers will be able to inform users about offers available anywhere, not only on their own websites — such as through other apps and app marketplaces.
These are good changes. Users will also be able to turn off the scary alerts when using external purchasing mechanisms. But there is a catch.
There are two fees that are associated with directing customers to purchase options outside of the App Store. A 5 percent initial acquisition fee is paid for all sales of digital goods and services that the customer makes on any platform that occur within a 12-month period after an initial install. The fee does not apply to transactions made by customers that had an initial install before the new link changes, but is applicable for new downloads.
Apple says that the initial acquisition fee reflects the value that the App Store provides when connecting developers with customers in the European Union.
The other new fee is a Store Services Fee of 7% or 20% assessed annually. Apple says it “reflects the ongoing services and capabilities that Apple provides developers”:
[…] including app distribution and management; App Review; App Store trust and safety; re-discovery, re-engagement and promotional tools and services; anti-fraud checks; recommendations; ratings and reviews; customer support; and more.
Contrary to its name, this fee does not apply solely to apps acquired through the App Store; rather, it is assessed against any digital purchase made on any platform. If an app is first downloaded on an iPhone and then, within a year, the user ultimately purchases a subscription in the Windows version of the same app, Apple believes it deserves 7–20% of the cost of that subscription in perpetuity, plus 5% for the first year’s instance. This seems to be the case no matter whether the iPhone version of that app is ever touched again.
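As a back-of-the-envelope illustration of how these fees could stack, assuming they simply add up and apply to the full subscription price (which matches my reading but may not match the contract), consider a hypothetical €10-per-month subscription bought on another platform within a year of the iOS install:

```python
# Back-of-the-envelope illustration only; assumes the fees are additive and
# apply to the full price, which may not match the actual contract terms.
subscription_per_year = 10 * 12  # hypothetical €10/month subscription

initial_acquisition = 0.05        # first 12 months after install
store_services = (0.07, 0.20)     # reduced and standard rates

year_one = [round(subscription_per_year * (initial_acquisition + r), 2)
            for r in store_services]
later_years = [round(subscription_per_year * r, 2) for r in store_services]

print(year_one)     # [14.4, 30.0]  -> 14.40 to 30.00 euros in the first year
print(later_years)  # [8.4, 24.0]   -> 8.40 to 24.00 euros every year after
```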
I am not sure what business standards apply here or whether this is completely outlandish, but it sure feels that way. The App Store certainly helps with app discovery to some degree, and Apple does provide a lot of services whether developers want them or not. Yet this basically ties part of a developer’s entire revenue stream to Apple; how large a part is unknown, since it is determined by whichever customers happened to use the iPhone version of an app first.
I think I have all this right based on news reports from those briefed by Apple and the new contract (PDF), but I might have messed something up. Please let me know if I got some detail wrong. This is all very confusing and, though I do not think that is deliberate, I think Apple struggles to translate its priorities into straightforward policy. None of these changes applies to external purchases in the U.S., for example. But what I wrote at the time applies here just the same: it is championing this bureaucracy because it believes it is entitled to a significant finder’s fee, regardless of its actual contribution to a customer’s purchase.
Last quarter, Apple made about $22 billion in profit from products and $18 billion from Services. It’s the closest those two lines have ever come to each other.
This is what was buzzing in the back of my head as I was going over all the numbers on Thursday. We’re not quite there yet, but it’s hard to imagine that there won’t be a quarter in the next year or so in which Apple reports more total profit on Services than on products.
When that happens, is Apple still a products company? Or has it crossed some invisible line?
The most important thing Snell gets at in this article, I think, is that the “services” which likely generate the most revenue for Apple — the App Store, Apple Pay transactions, AppleCare, and the Google search deal — are all things which are tied specifically to its hardware. It sells subscriptions to its entertainment services elsewhere, for example, but they are probably not as valuable to the company as these four categories. It would be disappointing if Apple sees its hardware products increasingly as vehicles for recurring revenue.