
The evasive evitability of enshittification

Our company recently announced a fundraise. We were grateful for all the community support, but the Internet also raised a few of its collective eyebrows, wondering whether this meant the dreaded “enshittification” was coming next.

That word describes a very real pattern we’ve all seen before: products start great, grow fast, and then slowly become worse as the people running them trade user love for short-term revenue.

It’s a topic I find genuinely fascinating, and I've seen the downward spiral firsthand at companies I once admired. So I want to talk about why this happens, and more importantly, why it won't happen to us. That's big talk, I know. But it's a promise I'm happy for people to hold us to.

What is enshittification?

The term "enshittification" was first popularized in a blog post by Corey Doctorow, who put a catchy name to an effect we've all experienced. Software starts off good, then goes bad. How? Why?

Enshittification proposes not just a name, but a mechanism. First, a product is well loved and gains in popularity, market share, and revenue. In fact, it gets so popular that it starts to defeat competitors. Eventually, it's the primary product in the space: a monopoly, or as close as you can get. And then, suddenly, the owners, who are Capitalists, have their evil nature finally revealed and they exploit that monopoly to raise prices and make the product worse, so the captive customers all have to pay more. Quality doesn't matter anymore, only exploitation.

I agree with most of that thesis. I think Doctorow has that mechanism mostly right. But, there's one thing that doesn't add up for me:

Enshittification is not a success mechanism.

I can't think of any examples of companies that, in real life, enshittified because they were successful. What I've seen is companies that made their product worse because they were... scared.

A company that's growing fast can afford to be optimistic. They create a positive feedback loop: more user love, more word of mouth, more users, more money, more product improvements, more user love, and so on. Everyone in the company can align around that positive feedback loop. It's a beautiful thing. It's also fragile: miss a beat and it flattens out, and soon it's a downward spiral instead of an upward one.

So, if I were, hypothetically, running a company, I think I would be pretty hesitant to deliberately sacrifice any part of that positive feedback loop, the loop I and the whole company spent so much time and energy building, to see if I can grow faster. User love? Nah, I'm sure we'll be fine, look how much money and how many users we have! Time to switch strategies!

Why would I do that? Switching strategies is always a tremendous risk. When you switch strategies, it's triggered by passing a threshold, where something fundamental changes, and your old strategy becomes wrong.

Threshold moments and control

In Saint John, New Brunswick, there's a river that flows one direction at high tide, and the other way at low tide. Four times a day, gravity equalizes, then crosses a threshold to gently start pulling the other way, then accelerates. What doesn't happen is a rapidly flowing river in one direction "suddenly" shifts to rapidly flowing the other way. Yes, there's an instant where the limit from the left is positive and the limit from the right is negative. But you can see that threshold coming. It's predictable.

In my experience, for a company or a product, there are two kinds of thresholds like this, that build up slowly and then when crossed, create a sudden flow change.

The first one is control: if the visionaries in charge lose control, chances are high that their replacements won't "get it."

The new people didn't build the underlying feedback loop, and so they don't realize how fragile it is. There are lots of reasons for a change in control: financial mismanagement, boards of directors, hostile takeovers.

The worst one is temptation. Being a founder is, well, it actually sucks. It's oddly like being repeatedly punched in the face. When I look back at my career, I guess I'm surprised by how few times per day it feels like I was punched in the face. But, the constant face punching gets to you after a while. Once you've established a great product, and amazing customer love, and lots of money, and an upward spiral, isn't your creation strong enough yet? Can't you step back and let the professionals just run it, confident that they won't kill the golden goose?

Empirically, mostly no, you can't. Actually the success rate of control changes, for well loved products, is abysmal.

The saturation trap

The second trigger of a flow change comes from outside: saturation. Every successful product, at some point, reaches approximately all the users it's ever going to reach. Before that, you can watch its exponential growth rate slow down: the infamous S-curve of product adoption.

Saturation can lead us back to control change: the founders get frustrated and back out, or the board ousts them and puts in "real business people" who know how to get growth going again. Generally that doesn't work. Modern VCs consider founder replacement a truly desperate move. Maybe a last-ditch effort to boost short term numbers in preparation for an acquisition, if you're lucky.

But sometimes the leaders stay on despite saturation, and they try on their own to make things better. Sometimes that does work. Actually, it's kind of amazing how often it seems to work. Among successful companies, it's rare to find one that sustained hypergrowth, nonstop, without suffering through one of these dangerous periods.

(That's called survivorship bias. All companies have dangerous periods. The successful ones survived them. But of those survivors, suspiciously few are ones that replaced their founders.)

If you saturate and can't recover - either by growing more in a big-enough current market, or by finding new markets to expand into - then the best you can hope for is for your upward spiral to mature gently into decelerating growth. If so, and you're a buddhist, then you hire less, you optimize margins a bit, you resign yourself to being About This Rich And I Guess That's All But It's Not So Bad.

The devil's bargain

Alas, very few people reach that state of zen. Especially the kind of ambitious people who were able to get that far in the first place. If you can't accept saturation and you can't beat saturation, then you're down to two choices: step away and let the new owners enshittify it, hopefully slowly. Or take the devil's bargain: enshittify it yourself.

I would not recommend the latter. If you're a founder and you find yourself in that position, honestly, you won't enjoy doing it and you probably aren't even good at it and it's getting enshittified either way. Let someone else do the job.

Defenses against enshittification

Okay, maybe that section was not as uplifting as we might have hoped. I've gotta be honest with you here. Doctorow is, after all, mostly right. This does happen all the time.

Most founders aren't perfect for every stage of growth. Most product owners stumble. Most markets saturate. Most VCs get board control pretty early on and want hypergrowth or bust. In tech, a lot of the time, if you're choosing a product or company to join, that kind of company is all you can get.

As a founder, maybe you're okay with growing slowly. Then some copycat shows up, steals your idea, grows super fast, squeezes you out along with your moral high ground, and then runs headlong into all the same saturation problems as everyone else. Tech incentives are awful.

But, it's not a lost cause. There are companies (and open source projects) that keep a good thing going, for decades or more. What do they have in common?

  • An expansive vision that's not about money, and which opens you up to lots of users. A big addressable market means you don't have to worry about saturation for a long time, even at hypergrowth speeds. Google certainly never had an incentive to make Google Search worse.

    (Update 2025-06-14: A few people disputed that last bit. Okay. Perhaps Google has occasionally responded to what they thought were incentives to make search worse -- I wasn't there, I don't know -- but it seems clear in retrospect that when search gets worse, Google does worse. So I'll stick to my claim that their true incentives are to keep improving.)

  • Keep control. It's easy to lose control of a project or company at any point. If you stumble, and you don't have a backup plan, and there's someone waiting to jump on your mistake, then it's over. Too many companies "bet it all" on nonstop hypergrowth and have no way back, no room in the budget, if results slow down even temporarily.

    Stories abound of companies that scraped close to bankruptcy before finally pulling through. But far more companies scraped close to bankruptcy and then went bankrupt. Those companies are forgotten. Avoid it.

  • Track your data. Part of control is predictability. If you know how big your market is, and you monitor your growth carefully, you can detect incoming saturation years before it happens. Knowing the telltale shape of each part of that S-curve is a superpower (a rough sketch of fitting one appears below). If you can see the future, you can prevent your own future mistakes.

  • Believe in competition. Google used to have this saying they lived by: "the competition is only a click away." That was excellent framing, because it was true, and it will remain true even if Google captures 99% of the search market. The key is to cultivate a healthy fear of competing products, not of your investors or the end of hypergrowth. Enshittification helps your competitors. That would be dumb.

    (And don't cheat by using lock-in to make competitors not, anymore, "only a click away." That's missing the whole point!)

  • Inoculate yourself. If you have to, create your own competition. Linus Torvalds, the creator of the Linux kernel, famously also created Git, the greatest tool for forking (and maybe merging) open source projects that has ever existed. And then he said, this is my fork, the Linus fork; use it if you want; use someone else's if you want; and now if I want to win, I have to make mine the best. Git was created back in 2005, twenty years ago. To this day, Linus's fork is still the central one.

If you combine these defenses, you can be safe from the decline that others tell you is inevitable. If you look around for examples, you'll find that this does actually work. You won't be the first. You'll just be rare.
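For the "track your data" point above, here is a minimal sketch of what spotting saturation early can look like in practice: fitting a logistic S-curve to user counts and reading off the estimated ceiling. It uses Python with scipy, and every number in it is invented for illustration; real growth data is noisier and the choice of model is itself a judgment call.

```python
# Minimal sketch: fit a logistic (S-curve) to monthly active-user counts
# to estimate the saturation ceiling before growth visibly flattens.
# All numbers below are synthetic, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    # K = saturation ceiling, r = growth rate, t0 = inflection month
    return K / (1.0 + np.exp(-r * (t - t0)))

rng = np.random.default_rng(0)
months = np.arange(24.0)
# Pretend these are two years of observed monthly active users,
# still on the steep part of the curve.
observed = logistic(months, K=1_000_000, r=0.35, t0=20.0) * rng.normal(1.0, 0.03, months.size)

# Fit the three parameters; p0 is a rough initial guess.
(K, r, t0), _ = curve_fit(
    logistic, months, observed,
    p0=[observed.max() * 5, 0.3, 12.0], maxfev=10_000,
)

print(f"estimated ceiling: ~{K:,.0f} users, inflection around month {t0:.1f}")
print(f"latest month is already {observed[-1] / K:.0%} of the estimated ceiling")
```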

Side note: Things that aren't enshittification

I often see people worry about things that aren't enshittification. Those things might be good or bad, wise or unwise, but that's a different topic. Tools aren't inherently good or evil. They're just tools.

  1. "Helpfulness." There's a fine line between "telling users about this cool new feature we built" in the spirit of helping them, and "pestering users about this cool new feature we built" (typically a misguided AI implementation) to improve some quarterly KPI. Sometimes it's hard to see where that line is. But when you've crossed it, you know.

    Are you trying to help a user do what they want to do, or are you trying to get them to do what you want them to do?

    Look into your heart. Avoid the second one. I know you know how. Or you knew how, once. Remember what that feels like.

  2. Charging money for your product. Charging money is okay. Get serious. Companies have to stay in business.

    That said, I personally really revile the "we'll make it free for now and we'll start charging for the exact same thing later" strategy. Keep your promises.

    I'm pretty sure nobody but drug dealers breaks those promises on purpose. But, again, desperation is a powerful motivator. Growth slowing down? Costs way higher than expected? Time to capture some of that value we were giving away for free!

    In retrospect, that's a bait-and-switch, but most founders never planned it that way. They just didn't do the math up front, or they were too naive to know they would have to. And then they had to.

    Famously, Dropbox had a "free forever" plan that provided a certain amount of free storage. What they didn't count on was abandoned accounts, accumulating every year, with stored stuff they could never delete. Even if a healthy, fixed fraction of users upgraded to a paid plan each year, all the ones that didn't kept piling up... year after year... after year... until they had to start deleting old free accounts and the data in them. A similar story happened with Docker, which used to host unlimited container downloads for free. In hindsight that was mathematically unsustainable. Success guaranteed failure. (A back-of-envelope sketch of this dynamic follows the list below.)

    Do the math up front. If you're not sure, find someone who can.

  3. Value pricing. (ie. charging different prices to different people.) It's okay to charge money. It's even okay to charge money to some kinds of people (say, corporate users) and not others. It's also okay to charge money for an almost-the-same-but-slightly-better product. It's okay to charge money for support for your open source tool (though I stay away from that; it incentivizes you to make the product worse).

    It's even okay to charge immense amounts of money for a commercial product that's barely better than your open source one! Or for a part of your product that costs you almost nothing.

    But, you have to do the rest of the work. Make sure the reason your users don't switch away is that you're the best, not that you have the best lock-in. Yeah, I'm talking to you, cloud egress fees.

  4. Copying competitors. It's okay to copy features from competitors. It's okay to position yourself against competitors. It's okay to win customers away from competitors. But it's not okay to lie.

  5. Bugs. It's okay to fix bugs. It's okay to decide not to fix bugs; you'll have to sometimes, anyway. It's okay to take out technical debt. It's okay to pay off technical debt. It's okay to let technical debt languish forever.

  6. Backward incompatible changes. It's dumb to release a new version that breaks backward compatibility with your old version. It's tempting. It annoys your users. But it's not enshittification for the simple reason that it's phenomenally ineffective at maintaining or exploiting a monopoly, which is what enshittification is supposed to be about. You know who's good at monopolies? Intel and Microsoft. They don't break old versions.
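To make the "do the math up front" point concrete, here is the back-of-envelope sketch promised above for the free-storage case. Every number is invented; the only point is that the stock of free accounts grows without bound even when a healthy, fixed fraction of each cohort converts to paid.

```python
# Back-of-envelope: free accounts accumulate every year, so storage costs
# keep compounding even with a steady paid-conversion rate.
# Every number below is invented for illustration.
new_signups_per_year = 1_000_000
conversion_rate = 0.04        # 4% of each year's cohort eventually pays
gb_per_free_account = 2       # storage promised "free forever"
cost_per_gb_year = 0.02       # dollars per GB per year (assumed)

free_accounts = 0
for year in range(1, 11):
    free_accounts += new_signups_per_year * (1 - conversion_rate)
    annual_cost = free_accounts * gb_per_free_account * cost_per_gb_year
    print(f"year {year:2d}: {free_accounts:12,.0f} free accounts, "
          f"~${annual_cost:12,.0f}/year just to keep their data")
```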

Enshittification is real, and tragic. But let's protect a useful term and its definition! Those things aren't it.

Epilogue: a special note to founders

If you're a founder or a product owner, I hope all this helps. I'm sad to say, you have a lot of potential pitfalls in your future. But, remember that they're only potential pitfalls. Not everyone falls into them.

Plan ahead. Remember where you came from. Keep your integrity. Do your best.

I will too.

APIs as a product: Investing in the current and next generation of technical contributors

Wikipedia is coming up on its 25th birthday, and that would not have been possible without the Wikimedia technical volunteer community. Supporting technical volunteers is crucial to carrying forward Wikimedia’s free knowledge mission for generations to come. In line with this commitment, the Foundation is turning its attention to an important area of developer support—the Wikimedia web (HTTP) APIs. 

Both Wikimedia and the Internet have changed a lot over the last 25 years. Patterns that are now ubiquitous standards either didn’t exist or were still in their infancy as the first APIs allowing developers to extend features and automate tasks on Wikimedia projects emerged. In fact, the term “representational state transfer”, better known today as the REST framework, was first coined in 2000, just months before the very first Wikipedia post was published, and only 6 years before the Action API was introduced. Because we preceded what have since become industry standards, our most powerful and comprehensive API solution, the Action API, sticks out as being unlike other APIs – but for good reason, if you understand the history.

Wikimedia APIs are used within Foundation-authored features and by volunteer developers. A common sentiment surfaced through the recent API Listening Tour conducted with a mix of volunteers and Foundation staff is “Wikimedia APIs are great, once you know what you’re doing.” New developers first entering the Wikimedia community face a steep learning curve when trying to onboard due to unfamiliar technologies and complex APIs that may require a deep understanding of the underlying Wikimedia systems and processes. While recognizing the power, flexibility, and mission-critical value that developers created using the existing API solutions, we want to make it easier for developers to make more meaningful contributions faster. We have no plans to deprecate the Action API nor treat it as ‘legacy’. Instead, we hope to make it easier and more approachable for both new and experienced developers to use. We also aim to expand REST coverage to better serve developers who are more comfortable working in those structures.
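As a rough illustration of how the two families differ in shape (this is not an official example; the endpoints are the public English Wikipedia ones and the User-Agent string is a placeholder), the same page can be fetched through the Action API and through the MediaWiki REST API:

```python
# Sketch: the same page fetched via the Action API and the MediaWiki REST API.
# The User-Agent value is a placeholder; clients should identify themselves
# per the API etiquette guidelines.
import requests

HEADERS = {"User-Agent": "example-tool/0.1 (you@example.org)"}

# Action API: a single endpoint (api.php) whose behaviour is selected by parameters.
action = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "prop": "info", "titles": "Wikipedia", "format": "json"},
    headers=HEADERS,
    timeout=10,
).json()
page = next(iter(action["query"]["pages"].values()))
print("Action API:", page["title"], page["pageid"])

# MediaWiki REST API: resource-oriented URLs, one per kind of object.
rest = requests.get(
    "https://en.wikipedia.org/w/rest.php/v1/page/Wikipedia/bare",
    headers=HEADERS,
    timeout=10,
).json()
print("REST API:", rest["title"], rest["id"])
```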

We are focused on simplifying, modernizing, and standardizing Wikimedia API offerings as part of the Responsible Use of Infrastructure objective in the FY25-26 Annual Plan (see: the WE5.2 key result). Focusing on common infrastructure that encourages responsible use allows us to continue to prioritize reliable, free access to knowledge for the technical volunteer community, as well as the readers and contributors they support. Investing in our APIs and the developer experiences surrounding them will ensure a healthy technical community for years to come. To achieve these objectives, we see three main areas for improving the sustainability of our API offering: simplification, documentation, and communication.

Simplification

To reduce maintenance costs and ensure a seamless developer experience, we are simplifying our API infrastructure and bringing greater consistency across all APIs. Decades of organic growth without centralized API governance led to fragmented, bespoke implementations that now hinder technical agility and standardization. Beyond that, maintaining services is not free; we are paying for duplicative infrastructure costs, some of which are scaling directly with the amount of scraper traffic hitting our services.

In light of the above, we will focus on transitioning at least 70% of our public endpoints to common API infrastructure (see the WE 5.2 key result). Common infrastructure makes it easier to maintain and roll out changes across our APIs, in addition to empowering API authors to move faster. Instead of expecting API authors to build and manage their own solutions for things like routing and rate limiting, we will create centralized tools and processes that make it easier to follow the “golden path” of recommended standards. That will allow centralized governance mechanisms to drive more consistent and sustainable end-user experiences, while enabling flexible, federated API ownership. 

An example of simplified internal infrastructure will be introducing a common API Gateway for handling and routing all Wikimedia API requests. Our approach will start as an “invisible gateway” or proxy, with no changes to URL structure or functional behavior for any existing APIs. Centralizing API traffic will make observability across APIs easier, allowing us to make better data-driven decisions. We will use this data to inform endpoint deprecation and versioning, prioritize human and mission-oriented access first, and ultimately provide better support to our developer community.  

Centralized management and traffic identification will also allow us to have more consistent and transparent enforcement of our API policies. API policy enforcement enables us to protect our infrastructure and ensure continued access for all. Once API traffic is rerouted through a centralized gateway, we will explore simplifying options for developer identification mechanisms and standardizing how rate limits and other API access controls are applied. The goal is to make it easier for all developers to know exactly what is expected and what limitations apply.

As we update our API usage policies and developer requirements, we will avoid breaking existing community tools as much as possible. We will continue offering low-friction entry points for volunteer developers experimenting with new ideas, lightly exploring data, or learning to build in the Wikimedia ecosystem. But we must balance support for community creativity and innovation with the need to reduce abuse, such as scraping, Denial of Service (DoS) attacks, and other harmful activities. While open, unauthenticated API access for everyone will continue, we will need to make adjustments. To reduce the likelihood and impact of abuse, we may apply stricter rate limits to unauthenticated traffic and more consistent authentication requirements to better match our documented API policy, Robot policy, and API etiquette guidelines, as well as consolidate per-API access guidelines.

To continue supporting Wikimedia’s technical volunteer community and minimize disruption to existing tools, community developers will have simple ways to identify themselves and receive higher limits or other access privileges. In many cases, this won’t require additional steps. For example, instead of universally requiring new access tokens or authentication methods, we plan to use IP ranges from Wikimedia Cloud Services (WMCS) and User-Agent headers to grant elevated privileges to trusted community tools, approved bots, and research projects. 
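As a client-side sketch of what "knowing exactly what is expected" can look like (the backoff strategy and headers below are generic HTTP conventions, not a statement of the final Wikimedia policy, and the User-Agent is a placeholder), a well-behaved tool identifies itself and slows down when asked:

```python
# Sketch of a polite API client: descriptive User-Agent plus backoff on
# HTTP 429. The exact limits and headers Wikimedia will apply are not
# specified here; this only shows the general client-side pattern.
import time
import requests

HEADERS = {"User-Agent": "community-bot/1.2 (https://meta.wikimedia.org/wiki/User:Example; you@example.org)"}

def get_with_backoff(url, params=None, max_attempts=5):
    delay = 1.0
    for attempt in range(max_attempts):
        resp = requests.get(url, params=params, headers=HEADERS, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Honour Retry-After if present, otherwise back off exponentially.
        delay = float(resp.headers.get("Retry-After", delay * 2))
        time.sleep(delay)
    raise RuntimeError("gave up after repeated 429 responses")

resp = get_with_backoff(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
)
print(resp.json()["query"]["general"]["sitename"])
```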

Documentation

It is essential for any API to enable developers to self-serve their use cases through clear, consistent, and modern documentation experiences. However, Wikimedia API documentation is frequently spread across multiple wiki projects, generated sites, and communication channels, which can make it difficult for developers to find the information they need, when they need it. 

To address this, we are working towards a top-requested item coming out of the 2024 developer satisfaction survey: OpenAPI specs and interactive sandboxes for all of our APIs (including conducting experiments to see if we can use OpenAPI to describe the Action API). The MediaWiki Interfaces team began addressing this request through the REST Sandbox, which we released to a limited number of small Wikipedia projects on March 31, 2025. Our implementation approach allows us to generate an OpenAPI specification, which we then use to power a SwaggerUI sandbox. We are also using the OpenAPI specs to automatically validate our endpoints as part of our automated deployment testing, which helps ensure that the generated documentation always matches the actual endpoint behavior. 

In addition, the generated OpenAPI spec offers translation support (powered by Translatewiki) for critical and contextual information like endpoint and parameter descriptions. We believe this is a more equitable approach to API documentation for developers who don’t have English as their preferred language. In the coming year, we plan to transition from Swagger UI to a custom Codex implementation for our sandbox experiences, which will enable full translation support for sandbox UI labels and navigation, as well as a more consistent look and feel for Wikimedia developers. We will also expand coverage for OpenAPI specs and sandbox experiences by introducing repeatable patterns for API authors to publish their specs to a single location where developers can easily browse, learn, and make test calls across all Wikimedia API offerings. 
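As a sketch of why machine-readable specs are worth the effort, a client can discover documented endpoints straight from an OpenAPI document. The spec URL below is hypothetical; this post does not say where the generated specs will ultimately be published.

```python
# Sketch: listing endpoints from a generated OpenAPI spec.
# SPEC_URL is hypothetical -- the final publication location for Wikimedia's
# generated specs is not stated in this post.
import requests

SPEC_URL = "https://example.wikipedia.org/w/rest.php/specs/v0/module/-"  # hypothetical

spec = requests.get(SPEC_URL, timeout=10).json()
info = spec.get("info", {})
print(info.get("title"), info.get("version"))

# An OpenAPI document enumerates every documented path and method, which is
# what powers interactive sandboxes and automated endpoint validation.
for path, operations in sorted(spec.get("paths", {}).items()):
    print(f"{path}: {', '.join(op.upper() for op in operations)}")
```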

Communication

When new endpoints are released or breaking changes are required, we need a better way to keep developers informed. As information is shared through different channels, it can become challenging to keep track of the full picture. Over the next year, we will address this on a few fronts. 

First, from a technical change management perspective, we will introduce a centralized API changelog. The changelog will summarize new endpoints, as well as new versions, planned deprecations, and minor changes such as new optional parameters. This will help developers with troubleshooting, as well as help them to more easily understand and monitor the changes happening across the Wikimedia APIs.

In addition to the changelog, we remain committed to consistently communicating changes early and often. As another step towards this commitment, we will provide migration guides and, where needed, provide direct communication channels for developers impacted by the changes to help guarantee a smooth transition. Recognizing that the Wikimedia technical community is split across many smaller communities both on and off-wiki, we will share updates in the largest off-wiki communities, but we will need volunteer support in directing questions and feedback to the right on-wiki pages in various languages. We will also work with communities to make their purpose and audience clearer for new developers so they can more easily get support when they need it and join the discussion with fellow technical contributors. 

Over the next few months, we will also launch a new API beta program, where developers are invited to interact with new endpoints and provide feedback before the capabilities are locked into a long-term stable version. Introducing new patterns through a beta program will allow developers to directly shape the future of the Wikimedia APIs to better suit their needs. To demonstrate this pattern, we will start with changes to MediaWiki REST APIs, including introducing API modularization and consistent structures. 

What’s Next

We are still in the early stages – we are just taking the first steps on the journey to a unified API product offering. But we hope that by this time next year, we will be running towards it together. Your involvement and insights can help us shape a future that better serves the technical volunteers behind our knowledge mission. To keep you informed, we will continue to post updates on mailing lists, Diff, TechBlog, and other technical volunteer communication channels. We also invite you to stay actively engaged: share your thoughts on the WE5 objective in the annual plan, ask questions on the related discussion pages, review slides from the Future of Wikimedia APIs session we conducted at the Wikimedia Hackathon, volunteer for upcoming Listening Tour topics, or come talk to us at upcoming events such as Wikimania Nairobi.

Technical volunteers play an essential role in the growth and evolution of Wikipedia, as well as all other Wikimedia projects. Together, we can make a better experience for developers who can’t remember life before Wikipedia, and make sure that the next generation doesn’t have to live without it. Here’s to another 25 years! 

Promoting events and WikiProjects

Editatona Mujeres Artistas Mexicanas 2024, Museo Universitario de Arte Contemporáneo, Mexico City, Mexico (photo: Editatona_Mujeres_Artistas_Mexicanas_2024_10, by ProtoplasmaKid)

The Campaigns team at WMF has released two features that allow organizers to promote events and WikiProjects on the wikis: Invitation Lists and Collaboration List. These two tools are a part of the CampaignEvents extension, which is available on many wikis.

Invitation Lists

Product overview

Invitation Lists allows organizers to generate a list of people to invite to their WikiProjects, events, or other collaborative activities. It can be accessed by going to Special:GenerateInvitationList, if a wiki has the CampaignEvents extension enabled. You can watch this video demo to see how it works.

It works by looking at a list of articles that an organizer plans to focus on during an activity and then finding users to invite based on the following criteria: the bytes they contributed to the articles, the number of edits they made to the articles, their overall edit count on the wikis, and how recently they have edited the wikis. This makes it easier for organizers to invite people who are already interested in the activity’s topics, hence increasing the likelihood of participation.

With this work, we hope to empower organizers to seek out new audiences. We also hope to highlight the important work done by editors, who may be inspired or touched to receive an invitation to an activity based on their work. However, if someone does not want to receive invitations, they can opt out of being included in Invitation Lists via Preferences.

Technical overview

The “Invitation Lists” feature is part of the CampaignEvents extension for MediaWiki, designed to assist event organizers in identifying and reaching out to potential participants based on their editing activity.

Access and Permissions

  • Special Pages: The feature introduces two special pages:
    • Special:GenerateInvitationList: Allows organizers to create new invitation lists.
    • Special:InvitationList: Displays the generated list of recommended invitees.
  • User Rights: Access to these pages is restricted to users with the event-organizer right, ensuring that only authorized individuals can generate and view invitation lists.

Invitation List Generation Process

  1. Input Parameters:
    • List Name: Organizers provide a name for the invitation list.
    • Target Articles: A list of up to 300 articles relevant to the event’s theme.
      •  The articles will need to be on the wiki of the Invitation List.
    • Event Page Link: Optionally, a link to the event’s registration page can be included.
  2. Data Collection:
    • The system analyzes the specified articles to identify contributors.
    • For each contributor, it gathers metrics such as:
      • Bytes Added: The total number of bytes the user has added to the articles.
      • Edit Count: The number of edits made by the user on the specified articles.
      • Overall Edit Count: The user’s total edit count across the wiki.
      • Recent Activity: The recency of the user’s edits on the wiki.
  3. Scoring and Ranking:
    • Contributors are scored based on the collected metrics.
    • The scoring algorithm assigns weights to each metric to calculate a composite score for each user (a rough sketch of this idea follows the list below).
    • Users are then ranked and categorized into:
      • Highly Recommended to Invite: Top contributors with high relevance and recent activity.
      • Recommended to Invite: Contributors with moderate relevance and activity.
  4. Output:
    • The generated invitation list is displayed on the Special:InvitationList page.
    • Each listed user includes a link to their contributions page, facilitating further review by the organizer.
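For the curious, here is the rough sketch referenced above of what a composite score could look like. The weights, normalization caps, and bucket thresholds are illustrative assumptions, not the CampaignEvents extension's actual algorithm.

```python
# Illustrative sketch of a composite contributor score. The weights and
# thresholds are assumptions for illustration only; they are not the
# CampaignEvents extension's actual algorithm.
from dataclasses import dataclass

@dataclass
class Contributor:
    username: str
    bytes_added: int          # bytes added to the target articles
    article_edits: int        # edits made to the target articles
    overall_edits: int        # total edit count on the wiki
    days_since_last_edit: int

def score(c: Contributor) -> float:
    recency = max(0.0, 1.0 - c.days_since_last_edit / 365)   # 1.0 = edited today
    return (
        0.4 * min(c.bytes_added / 10_000, 1.0)
        + 0.3 * min(c.article_edits / 50, 1.0)
        + 0.1 * min(c.overall_edits / 5_000, 1.0)
        + 0.2 * recency
    )

def bucket(s: float) -> str:
    if s >= 0.6:
        return "Highly recommended to invite"
    if s >= 0.3:
        return "Recommended to invite"
    return "Not listed"

alice = Contributor("Alice", bytes_added=12_000, article_edits=40,
                    overall_edits=9_000, days_since_last_edit=3)
print(bucket(score(alice)), round(score(alice), 2))
```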

Technical Implementation Details

  • Backend Processing:
    • The extension utilizes MediaWiki’s job queue system to handle the processing of invitation lists asynchronously, ensuring that the generation process does not impact the performance of the wiki.
    • Jobs are queued upon submission of the article list and processed in the background.
    • Organizers can add a maximum of 300 articles, and the articles must be on the same wiki as the Invitation List.
  • Data Retrieval:
    • The extension interfaces with MediaWiki’s revision and user tables to extract the necessary contribution data.
    • Efficient querying and indexing strategies are employed to handle large datasets and ensure timely processing.
  • User Preferences and Privacy:
    • Users have the option to opt out of being included in invitation lists via their preferences.
    • The extension respects these preferences by excluding opted-out users from the generated lists.
  • Integration with Event Registration:
    • If an event page link is provided, the invitation list can be associated with the event’s registration data. This way, we can link their invitation data to their event registration data.

Collaboration List

Product overview

The Collaboration List is a list of events and WikiProjects. It can be accessed by going to Special:AllEvents, if a wiki has the CampaignEvents extension enabled.

The Collaboration List has two tabs: “Events” and “Communities.” The Events tab is a global, automated list of all events that use Event Registration. It also has search filters, so you can find events by start and end dates, meeting type (i.e., online, in person, or hybrid), event topic, event wikis, and by keyword searches. You can also find events that are both ongoing (i.e., started before but continue within the selected date range) and upcoming (i.e., events that start within the selected date range).

The Communities tab provides a list of WikiProjects on the local wiki. The WikiProject list is generated by using Wikidata, and it includes: WikiProject name, description, a link to the WikiProject page, and a link to the Wikidata item for the WikiProject. We aim to produce a symbiotic relationship with WikiProjects, in which people can find WikiProjects that interest them, and they can also enhance the Wikidata items for those projects, which in turn improves our project.

Additionally, you can embed the Collaboration List on any wiki page, if the CampaignEvents extension is enabled on that wiki. To do this, you transclude the Collaboration List on a wiki page. You can also choose to customize the Collaboration List through URL parameters, if you want. For example, you can choose to only display a certain number of events or to add formatting. You can read more about this on Help:Extension:CampaignEvents/Collaboration list/Transclusion.

With the Collaboration List, we hope to make it easier for people to find events and WikiProjects that interest them, so more people can find community and make impactful contributions on the wikis together.

Screenshot of the Collaboration List

Technical Overview: Events Tab of Collaboration List

  • Purpose: Displays a global list of events across all participating wikis.
  • Data Source: Event data stored centrally in Wikimedia’s X1 database cluster.
  • Displayed Information:
    • Event name and description
    • Event dates (start and end)
    • Event type (online, in-person, hybrid)
    • Associated wikis and event topics
  • Search and Filters:
    • Date range (start/end)
    • Meeting type (online, in-person, hybrid)
    • Event topics and wikis
    • Keyword search
    • Ongoing and upcoming event filtering
  • Technical Implementation:
    • The CampaignEvents extension retrieves event data directly from centralized tables within the X1 cluster.
    • Efficient SQL queries and indexing optimize performance for cross-wiki data retrieval.

This implementation ensures quick access and easy discoverability of events from across Wikimedia projects.

Technical Overview: Communities Tab of Collaboration List

  • Purpose: Displays a list of local WikiProjects on the wiki.
  • Data Source: Dynamically retrieved from Wikidata via the Wikidata Query Service (WDQS).
  • Displayed Information:
    • WikiProject name
    • Description from Wikidata
    • Link to the local WikiProject page
    • Link to the Wikidata item
  • Performance Optimization:
    • Query results from WDQS are cached locally using MediaWiki’s caching mechanisms (WANObjectCache).
    • Cache reduces repeated queries and ensures quick loading times.
  • Technical Implementation:
    • The WikimediaCampaignEvents extension retrieves data via SPARQL from WDQS (a rough sketch of such a query appears below).
    • The CampaignEvents extension renders the data on Special:AllEvents under the Communities tab.
  • Extension Communication:
    • The extensions communicate using MediaWiki’s hook system. The WikimediaCampaignEvents extension provides WikiProject data to the CampaignEvents extension through hook implementations.

This structure enables efficient collaboration between extensions, ensuring clear responsibilities, optimized performance, and simplified discoverability of WikiProjects.
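For developers who want to experiment with the same data source, here is the rough sketch of a WDQS query for WikiProjects referenced in the list above. The SPARQL shape and the use of Q16695773 ("WikiProject") are assumptions for illustration; the extension's real query and caching live in the WikimediaCampaignEvents code.

```python
# Rough sketch of fetching WikiProjects from the Wikidata Query Service.
# The query shape and Q16695773 ("WikiProject") are assumptions for
# illustration; the extension's actual SPARQL and caching differ.
import requests

WDQS = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?project ?projectLabel ?article WHERE {
  ?project wdt:P31 wd:Q16695773 .                         # assumed: instance of WikiProject
  ?article schema:about ?project ;
           schema:isPartOf <https://en.wikipedia.org/> .  # limit to one local wiki
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

resp = requests.get(
    WDQS,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "example-tool/0.1 (you@example.org)"},
    timeout=60,
)
for row in resp.json()["results"]["bindings"]:
    print(row["projectLabel"]["value"], "->", row["article"]["value"])
```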

Wikimedia Cloud VPS: IPv6 support

Dietmar Rabich, Cape Town (ZA), Sea Point, Nachtansicht — 2024 — 1867-70 – 2, CC BY-SA 4.0

Wikimedia Cloud VPS is a service offered by the Wikimedia Foundation, built using OpenStack and managed by the Wikimedia Cloud Services team. It provides cloud computing resources for projects related to the Wikimedia movement, including virtual machines, databases, storage, Kubernetes, and DNS.

A few weeks ago, in April 2025, we were finally able to introduce IPv6 to the cloud virtual network, enhancing the platform’s scalability, security, and future-readiness. This is a major milestone, many years in the making, and serves as an excellent point to take a moment to reflect on the road that got us here. There were definitely a number of challenges that needed to be addressed before we could get into IPv6. This post covers the journey to this implementation.

The Wikimedia Foundation was an early adopter of the OpenStack technology, and the original OpenStack deployment in the organization dates back to 2011. At that time, IPv6 support was still nascent and had limited implementation across various OpenStack components. In 2012, the Wikimedia cloud users formally requested IPv6 support.

When Cloud VPS was originally deployed, we had set up the network following some of the upstream-recommended patterns:

  • nova-networks as the engine in charge of the software-defined virtual network
  • using a flat network topology – all virtual machines would share the same network
  • using a physical VLAN in the datacenter
  • using Linux bridges to make this physical datacenter VLAN available to virtual machines
  • using a single virtual router as the edge network gateway, also executing a global egress NAT – barring some exceptions, using what was called “dmz_cidr” mechanism

In order for us to be able to implement IPv6 in a way that aligned with our architectural goals and operational requirements, pretty much all the elements in this list would need to change. First of all, we needed to migrate from nova-networks into Neutron, a migration effort that started in 2017. Neutron was the more modern component to implement software-defined networks in OpenStack. To facilitate this transition, we made the strategic decision to backport certain functionalities from nova-networks into Neutron, specifically the “dmz_cidr” mechanism and some egress NAT capabilities.

Once in Neutron, we started to think about IPv6. In 2018 there was an initial attempt to decide on the network CIDR allocations that Wikimedia Cloud Services would have. This initiative encountered unforeseen challenges and was subsequently put on hold. We focused on removing the previously backported nova-networks patches from Neutron.

Between 2020 and 2021, we initiated another significant network refresh. We were able to introduce the cloudgw project, as part of a larger effort to rework the Cloud VPS edge network. The new edge routers allowed us to drop all the custom backported patches we had in Neutron from the nova-networks era, unblocking further progress. It is worth mentioning that the cloudgw router uses nftables as its firewalling and NAT engine.

A pivotal decision in 2022 was to expose the OpenStack APIs to the internet, which crucially enabled infrastructure management via OpenTofu. This was key in the IPv6 rollout as will be explained later. Before this, management was limited to Horizon – the OpenStack graphical interface – or the command-line interface accessible only from internal control servers.

Later, in 2023, following the OpenStack project’s announcement of the deprecation of the neutron-linuxbridge-agent, we began to seriously consider migrating to the neutron-openvswitch-agent. This transition would, in turn, simplify the enablement of “tenant networks” – a feature allowing each OpenStack project to define its own isolated network, rather than all virtual machines sharing a single flat network.

Once we replaced neutron-linuxbridge-agent with neutron-openvswitch-agent, we were ready to migrate virtual machines to VXLAN. Demonstrating perseverance, we decided to execute the VXLAN migration in conjunction with the IPv6 rollout.

We prepared and tested several things, including the rework of the edge routing to be based on BGP/OSPF instead of static routing. In 2024 we were ready for the initial attempt to deploy IPv6, which failed for unknown reasons. There was a full network outage and we immediately reverted the changes. This quick rollback was feasible due to our adoption of OpenTofu: deploying IPv6 had been reduced to a single code change within our repository.

We started an investigation, corrected a few issues, and increased our network functional testing coverage before trying again. One of the problems we discovered was that Neutron would enable the “enable_snat” configuration flag for our main router when adding the new external IPv6 address.
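As an illustration of the kind of check involved, here is a sketch using the openstacksdk Python client to inspect and correct a router's enable_snat flag. The cloud and router names are placeholders, and this is not the exact tooling the WMCS team used.

```python
# Sketch: inspecting and correcting a Neutron router's enable_snat flag
# with openstacksdk. Cloud and router names are placeholders; this is not
# the exact tooling used by the WMCS team.
import openstack

conn = openstack.connect(cloud="example-cloud")          # reads clouds.yaml

router = conn.network.find_router("cloudinstances-gw")   # placeholder name
gw_info = router.external_gateway_info or {}

if gw_info.get("enable_snat"):
    # Adding an external IPv6 address flipped enable_snat on; turn it back
    # off because NAT is handled elsewhere (on the cloudgw hosts here).
    gw_info["enable_snat"] = False
    conn.network.update_router(router, external_gateway_info=gw_info)
    print(f"disabled enable_snat on {router.name}")
else:
    print(f"{router.name}: enable_snat already disabled")
```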

Finally, in April 2025, after many years in the making, IPv6 was successfully deployed.

Compared to the network from 2011, we now have:

  • Neutron as the engine in charge of the software-defined virtual network
  • Ready to use tenant-networks
  • Using a VXLAN-based overlay network
  • Using neutron-openvswitch-agent to provide networking to virtual machines
  • A modern and robust edge network setup

Over time, the WMCS team has skillfully navigated numerous challenges to ensure our service offerings consistently meet high standards of quality and operational efficiency. Often engaging in multi-year planning strategies, we have enabled ourselves to set and achieve significant milestones.

The successful IPv6 deployment stands as further testament to the team’s dedication and hard work over the years. I believe we can confidently say that the 2025 Cloud VPS represents its most advanced and capable iteration to date.

2025-06-08 Omnimax

In a previous life, I worked for a location-based entertainment company, part of a huge team of people developing a location for Las Vegas, Nevada. It was COVID, a rough time for location-based anything, and things were delayed more than usual. Coworkers paid a lot of attention to another upcoming Las Vegas attraction, one with a vastly larger budget but still struggling to make schedule: the MSG (Madison Square Garden) Sphere.

I will set aside jokes about it being a square sphere, but they were perhaps one of the reasons that it underwent a pre-launch rebranding to merely the Sphere. If you are not familiar, the Sphere is a theater and venue in Las Vegas. While it's known mostly for the video display on the outside, that's just marketing for the inside: a digital dome theater, with seating at a roughly 45 degree stadium layout facing a near hemisphere of video displays.

It is a "near" hemisphere because the lower section is truncated to allow a flat floor, which serves as a stage for events but is also a practical architectural decision to avoid completely unsalable front rows. It might seem a little bit deceptive that an attraction called the Sphere does not quite pull off even a hemisphere of "payload," but the same compromise has been reached by most dome theaters. While the use of digital display technology is flashy, especially on the exterior, the Sphere is not quite the innovation that it presents itself as. It is just a continuation of a long tradition of dome theaters. Only time will tell, but the financial difficulties of the Sphere suggest that it follows the tradition faithfully: towards commercial failure.

You could make an argument that the dome theater is hundreds of years old, but I will omit it. Things really started developing, at least in our modern tradition of domes, with the 1923 introduction of the Zeiss planetarium projector. Zeiss projectors and their siblings used a complex optical and mechanical design to project accurate representations of the night sky. Many auxiliary projectors, incorporated into the chassis and giving these projectors famously eccentric shapes, rendered planets and other celestial bodies. Rather than digital light modulators, the images from these projectors were formed by purely optical means: perforated metal plates, glass plates with etched metalized layers, and fiber optics. The large, precisely manufactured image elements and specialized optics created breathtaking images.

While these projectors had considerable entertainment value, especially in the mid-century when they represented some of the most sophisticated projection technology yet developed, their greatest potential was obviously in education. Although planetarium projectors were fantastically expensive (being hand-built in Germany with incredible component counts) [1], they were widely installed in science museums around the world. Most of us probably remember a dogbone-shaped Zeiss, or one of their later competitors like Spitz or Minolta, from our youths. Unfortunately, these marvels of artistic engineering were mostly retired as digital projection of near comparable quality became similarly priced in the 2000s.

But we aren't talking about projectors, we're talking about theaters. Planetarium projectors were highly specialized to rendering the night sky, and everything about them was intrinsically spherical. For both a reasonable viewing experience, and for the projector to produce a geometrically correct image, the screen had to be a spherical section. Thus the planetarium itself: in its most traditional form, rings of heavily reclined seats below a hemispherical dome. The dome was rarely a full hemisphere, but was usually truncated at the horizon. This was mostly a practical decision but integrated well into the planetarium experience, given that sky viewing is usually poor near the horizon anyway. Many planetaria painted a city skyline or forest silhouette around the lower edge to make the transition from screen to wall more natural. Later, theatrical lighting often replaced the silhouette, reproducing twilight or the haze of city lights.

Unsurprisingly, the application-specific design of these theaters also limits their potential. Despite many attempts, the collective science museum industry has struggled to find entertainment programming for planetaria much beyond Pink Floyd laser shows [2]. There just aren't that many things that you look up at. Over time, planetarium shows moved in more narrative directions. Film projection promised new flexibility---many planetaria with optical star projectors were also equipped with film projectors, which gave show producers exciting new options. Documentary video of space launches and animations of physical principles became natural parts of most science museum programs, but were a bit awkward on the traditional dome. You might project four copies of the image just above the horizon in the four cardinal directions, for example. It was very much a compromise.

With time, the theater adapted to the projection once again: the domes began to tilt. By shifting the dome in one direction, and orienting the seating towards that direction, you could create a sort of compromise point between the traditional dome and traditional movie theater. The lower central area of the screen was a reasonable place to show conventional film, while the full size of the dome allowed the starfield to almost fill the audience's vision. The experience of the tilted dome is compared to "floating in space," as opposed to looking up at the sky.

In true Cold War fashion, it was a pair of weapons engineers (one nuclear weapons, the other missiles) who designed the first tilted planetarium. In 1973, the planetarium of what is now called the Fleet Science Center in San Diego, California opened to the public. Its dome was tilted 25 degrees to the horizon, with the seating installed on a similar plane and facing in one direction. It featured a novel type of planetarium projector developed by Spitz and called the Space Transit Simulator. The STS was not the first, but still an early mechanical projector to be controlled by a computer---a computer that also had simultaneous control of other projectors and lighting in the theater, what we now call a show control system.

Even better, the STS's innovative optical design allowed it to warp or bend the starfield to simulate its appearance from locations other than earth. This was the "transit" feature: with a joystick connected to the control computer, the planetarium presenter could "fly" the theater through space in real time. The STS was installed in a well in the center of the seating area, and its compact chassis kept it low in the seating area, preserving the spherical geometry (with the projector at the center of the sphere) without blocking the view of audience members sitting behind it and facing forward.

And yet my main reason for discussing the Fleet planetarium is not the planetarium projector at all. It is a second projector, an "auxiliary" one, installed in a second well behind the STS. The designers of the planetarium intended to show film as part of their presentations, but they were not content with a small image at the center viewpoint. The planetarium commissioned a few of the industry's leading film projection experts to design a film projection system that could fill the entire dome, just as the planetarium projector did.

They knew that such a large dome would require an exceptionally sharp image. Planetarium projectors, with their large lithographed slides, offered excellent spatial resolution. They made stars appear as point sources, the same as in the night sky. 35mm film, spread across such a large screen, would be obviously blurred in comparison. They would need a very large film format.

Omnimax dome with work lights on at Chicago Museum of Science and Industry

Fortuitously, almost simultaneously the Multiscreen Corporation was developing a "sideways" 70mm format. This 15-perf format used 70mm film but fed it through the projector sideways, making each frame much larger than typical 70mm film. In its debut, at a temporary installation in the 1970 Expo Osaka, it was dubbed IMAX. IMAX made an obvious basis for a high-resolution projection system, and so the then-named IMAX Corporation was added to the planetarium project. The Fleet's film projector ultimately consisted of an IMAX film transport with a custom-built compact, liquid-cooled lamphouse and spherical fisheye lens system.

The large size of the projector, the complex IMAX framing system and cooling equipment, made it difficult to conceal in the theater's projector well. Threading film into IMAX projectors is quite complex, with several checks the projectionist must make during a pre-show inspection. The projectionist needed room to handle the large film, and to route it to and from the enormous reels. The projector's position in the middle of the seating area left no room for any of this. We can speculate that it was, perhaps, one designer's missile experience that led to the solution: the projector was serviced in a large projection room beneath the theater's seating. Once it was prepared for each show, it rose on near-vertical rails until just the top emerged in the theater. Rollers guided the film as it ran from a platter, up the shaft to the projector, and back down to another platter. Cables and hoses hung below the projector, following it up and down like the traveling cable of an elevator.

To advertise this system, probably the greatest advance in film projection since the IMAX format itself, the planetarium coined the term Omnimax.

Omnimax was not an easy or economical format. Ideally, footage had to be taken in the same format, using a 70mm camera with a spherical lens system. These cameras were exceptionally large and heavy, and the huge film format limited cinematographers to short takes. The practical problems with Omnimax filming were big enough that the first Omnimax films faked it, projecting to the larger spherical format from much smaller conventional negatives. This was the case for "Voyage to the Outer Planets" and "Garden Isle," the premiere films at the Fleet planetarium. The history of both is somewhat obscure, the latter especially.

"Voyage to the Outer Planets" was executive-produced by Preston Fleet, a founder of the Fleet center (which was ultimately named for his father, a WWII aviator). We have Fleet's sense of showmanship to thank for the invention of Omnimax: He was an accomplished business executive, particularly in the photography industry, and an aviation enthusiast who had his hands in more than one museum. Most tellingly, though, he had an eccentric hobby. He was a theater organist. I can't help but think that his passion for the theater organ, an instrument almost defined by the combination of many gizmos under electromechanical control, inspired "Voyage." The film, often called a "multimedia experience," used multiple projectors throughout the planetarium to depict a far-future journey of exploration. The Omnimax film depicted travel through space, with slide projectors filling in artist's renderings of the many wonders of space.

The ten-minute Omnimax film was produced by Graphic Films Corporation, a brand that would become closely associated with Omnimax in the following decades. Graphic was founded in the midst of the Second World War by Lester Novros, a former Disney animator who found a niche creating training films for the military. Novros's fascination with motion and expertise in presenting complicated 3D scenes drew him to aerospace, and after the war he found much of his business in the newly formed Air Force and NASA. He was also an enthusiast of niche film formats, and Omnimax was not his first dome.

For the 1964 New York World's Fair, Novros and Graphic Films had produced "To the Moon and Beyond," a speculative science film with thematic similarities to "Voyage" and more than just a little mechanical similarity. It was presented in Cinerama 360, a semi-spherical, dome-theater 70mm format presented in a special theater called the Moon Dome. "To the Moon and Beyond" was influential in many ways, leading to Graphic Films' involvement in "2001: A Space Odyssey" and its enduring expertise in domes.

The Fleet planetarium would not remain the only Omnimax for long. In 1975, the city of Spokane, Washington struggled to find a new application for the pavilion built for Expo '74 [3]. A top contender: an Omnimax theater, in some ways a replacement for the temporary IMAX theater that had been constructed for the actual Expo. Alas, this project was not to be, but others came along: in 1978, the Detroit Science Center opened the second Omnimax theater ("the machine itself looks like and is the size of a front loader," the Detroit Free Press wrote). The Science Museum of Minnesota, in St. Paul, followed shortly after.

Omnimax hit prime time the next year, with the 1979 announcement of an Omnimax theater at Caesars Palace in Las Vegas, Nevada. Unlike the previous installations, this 380-seat theater was purely commercial. It opened with the 1976 IMAX film "To Fly!," which had been optically modified to fit the Omnimax format. This choice of first film is illuminating. "To Fly!" is a 27 minute documentary on the history of aviation in the United States, originally produced for the IMAX theater at the National Air and Space Museum [4]. It doesn't exactly seem like casino fare.

The IMAX format, the flat-screen one, was born of world's fairs. It premiered at an Expo, reappeared a couple of years later at another one, and for the first years of the format most of the IMAX theaters built were associated with either a major festival or an educational institution. This noncommercial history is a bit hard to square with the modern IMAX brand, closely associated with major theater chains and the Marvel Cinematic Universe.

Well, IMAX took off, and in many ways it sold out. Over the decades since the 1970 Expo, IMAX has met widespread success with commercial films and theater owners. Simultaneously, the definition or criteria for IMAX theaters have relaxed, with smaller screens made permissible until, ultimately, the transition to digital projection eliminated the 70mm film and more or less reduced IMAX to just another ticket surcharge brand. It competes directly with Cinemark xD, for example. To the theater enthusiast, this is a pretty sad turn of events, a Westinghouse-esque zombification of a brand that once heralded the field's most impressive technical achievements.

The same never happened to Omnimax. The Caesar's Omnimax theater was an odd exception; the vast majority of Omnimax theaters were built by science museums and the vast majority of Omnimax films were science documentaries. Quite a few of those films had been specifically commissioned by science museums, often on the occasion of their Omnimax theater opening. The Omnimax community was fairly tight, and so the same names recur.

The Graphic Films Corporation, which had been around since the beginning, remained so closely tied to the IMAX brand that they practically shared identities. Most Omnimax theaters, and some IMAX theaters, used to open with a vanity card often known as "the wormhole." It might be hard to describe beyond "if you know, you know," but it certainly made an impression on everyone I know who grew up near a theater that used it. There are some videos, although unfortunately none of them are very good.

I have spent more hours of my life than I am proud to admit trying to untangle the history of this clip. Over time, it has appeared in many theaters with many different logos at the end, and several variations of the audio track. This is in part informed speculation, but here is what I believe to be true: the "wormhole" was originally created by Graphic Films for the Fleet planetarium specifically, and ran before "Voyage to the Outer Planets" and its double-feature companion "Garden Isle," both of which Graphic Films had worked on. This original version ended with the name Graphic Films, accompanied by an odd sketchy drawing that was also used as an early logo of the IMAX Corporation. Later, the same animation was re-edited to end with an IMAX logo.

This version ran in both Omnimax and conventional IMAX theaters, probably as a result of the extensive "cross-pollination" of films between the two formats. Many Omnimax films through the life of the format had actually been filmed for IMAX, with conventional lenses, and then optically modified to fit the Omnimax dome after the fact. You could usually tell: the reprojection process created an unusual warp in the image, and more tellingly, these pseudo-Omnimax films almost always centered the action at the middle of the IMAX frame, which was too high to be quite comfortable in an Omnimax theater (where the "frame center" was well above the "front center" point of the theater). Graphic Films had been involved in a lot of these as well, perhaps explaining the animation reuse, but it's just as likely that they had sold it outright to the IMAX Corporation, which used it as it pleased.

For some reason, this version also received new audio that is mostly the same but slightly different. I don't have a definitive explanation, but I think there may have been an audio format change between the very early Omnimax theaters and later IMAX/Omnimax systems, which might have required remastering.

Later, as Omnimax domes proliferated at science museums, the IMAX Corporation (which very actively promoted Omnimax to education) gave many of these theaters custom versions of the vanity card that ended with the science museum's own logo. I have personally seen two of these, so I feel pretty confident that they exist and weren't all that rare (basically 2 out of 2 Omnimax theaters I've visited used one), but I cannot find any preserved copies.

Another recurring name in the world of IMAX and Omnimax is MacGillivray Freeman Films. MacGillivray and Freeman were a pair of teenage friends from Laguna Beach who dropped out of school in the '60s to make skateboard and surf films. This is, of course, a rather cliché start for documentary filmmakers but we must allow that it was the '60s and they were pretty much the ones creating the cliché. Their early films are hard to find in anything better than VHS rip quality, but worth watching: Wikipedia notes their significance in pioneering "action cameras," mounting 16mm cinema cameras to skateboards and surfboards, but I would say that their cinematography was innovative in more ways than just one. The 1970 "Catch the Joy," about sandrails, has some incredible shots that I struggle to explain. There's at least one where they definitely cut the shot just a couple of frames before a drifting sandrail flung their camera all the way down the dune.

For some reason, I would speculate due to their reputation for exciting cinematography, the National Air and Space Museum chose MacGillivray and Freeman for "To Fly!". While not the first science museum IMAX documentary by any means (that was, presumably, "Voyage to the Outer Planets" given the different subject matter of the various Expo films), "To Fly!" might be called the first modern one. It set the pattern that decades of science museum films followed: a film initially written by science educators, punched up by producers, and filmed with the very best technology of the time. Fearing that the film's history content would be dry, they pivoted more towards entertainment, adding jokes and action sequences. "To Fly!" was a hit, running in just about every science museum with an IMAX theater, including Omnimax.

Sadly, Jim Freeman died in a helicopter crash shortly after production. Nonetheless, MacGillivray Freeman Films went on. Over the following decades, few IMAX science documentaries were made that didn't involve them somehow. Besides the films they produced, the company consulted on action sequences in most of the format's popular features.

Omnimax projection room at OMSI

I had hoped to present here a thorough history of the films that were actually produced in the Omnimax format. Unfortunately, this has proven very difficult: the fact that most of them were distributed only to science museums means that they are very spottily remembered, and besides, so many of the films that ran in Omnimax theaters were converted from IMAX presentations that it's hard to tell the two apart. I'm disappointed that this part of cinema history isn't better recorded, and I'll continue to put time into the effort. Science museum documentaries don't get a lot of attention, but many of them have involved formidable technical efforts.

Consider, for example, the cameras: befitting the large film, IMAX cameras themselves are very large. When filming "To Fly!", MacGillivray and Freeman complained that the technically very basic 80-pound cameras required a lot of maintenance, were complex to operate, and wouldn't fit into the "action cam" mounting positions they were used to. The cameras were so expensive, and so rare, that they had to be far more conservative than their usual approach out of fear of damaging a camera they would not be able to replace. It turns out that they had it easy. Later IMAX science documentaries would be filmed in space ("The Dream is Alive" among others) and deep underwater ("Deep Sea 3D" among others). These IMAX cameras, modified for simpler operation and housed for such difficult environments, weighed over 1,000 pounds. Astronauts had to be trained to operate the cameras; mission specialists on Hubble service missions counted wrangling a 70-pound handheld IMAX camera around the cabin, and developing its film in a darkroom bag, among their duties. There was a lot of film to handle: as a rule of thumb, one mile of IMAX film is good for eight and a half minutes.

I grew up in Portland, Oregon, and so we will make things a bit more approachable by focusing on one example: the Omnimax theater of the Oregon Museum of Science and Industry, which opened as part of the museum's new waterfront location in 1992. This 330-seat theater boasted a 10,000 sq ft dome and 15 kW of sound. The premiere feature was "Ring of Fire," a volcano documentary originally commissioned by the Fleet, the Fort Worth Museum of Science and History, and the Science Museum of Minnesota. By the 1990s, the later era of Omnimax, the dome format was all but abandoned as a commercial concept. There were, an announcement article notes, around 90 total IMAX theaters (including Omnimax) and 80 Omnimax films (including those converted from IMAX) in '92. Considering the heavy bias towards science museums among these theaters, it was very common for the films to be funded by consortia of those museums.

Considering the high cost of filming in IMAX, a lot of the documentaries had a sort of "mashup" feel. They would combine footage taken in different times and places, often originally for other projects, into a new narrative. "Ring of Fire" was no exception, consisting of a series of sections that were sometimes only loosely connected to the theme. The 1989 Loma Prieta earthquake was a focus, as were the eruption of Mt. St. Helens and lava flows in Hawaii. Perhaps one of the reasons it's hard to catalog IMAX films is this mashup quality: many of the titles carried at science museums were something along the lines of "another ocean one." I don't mean this as a criticism; many of the IMAX documentaries were excellent, but they were necessarily composed from painstakingly gathered fragments and had to cover wide topics.

Given that I have an announcement feature piece in front of me, let's also use the example of OMSI to discuss the technical aspects. OMSI's projector cost about $2 million and weighed about two tons. To avoid dust damaging the expensive prints, the "projection room" under the seating was a positive-pressure cleanroom. This was especially important since the paucity of Omnimax content meant that many films ran regularly for years. The 15 kW water-cooled lamp required replacement at 800 to 1,000 hours, but unfortunately, the price is not noted.

By the 1990s, Omnimax had become a rare enough system that the projection technology was a major part of the appeal. OMSI's installation, like most later Omnimax theaters, had the audience queue below the seating, separated from the projection room by a glass wall. The high cost of these theaters meant that they operated on fast turnarounds, so patrons would wait in line to enter immediately after the previous showing had exited. While they waited, they could watch the projectionist prepare the next show while a museum docent explained the equipment.

I have written before about multi-channel audio formats, and Omnimax gives us some more to consider. The conventional audio format for much of Omnimax's life was six-channel: left rear, left screen, center screen, right screen, right rear, and top. Each channel had an independent bass cabinet (in one theater, a "caravan-sized" enclosure with eight JBL 2245H 46cm woofers), and a crossover network fed the lowest end of all six channels to a "sub-bass" array at screen bottom. The original Fleet installation also had sub-bass speakers located beneath the audience seating, although that doesn't seem to have become common.
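As a rough illustration of what that crossover network is doing, here is a minimal Python sketch: it low-passes each of the six channels and sums the results into a single sub-bass feed. The 80 Hz corner frequency, filter order, and sample rate are my own assumptions for illustration, not details of the actual IMAX equipment.

```python
# A minimal sketch (not IMAX's actual crossover design) of deriving a mono
# "sub-bass" feed by low-passing and summing all six screen/rear/top channels.
import numpy as np
from scipy.signal import butter, lfilter

FS = 48_000          # sample rate in Hz (assumed)
CROSSOVER_HZ = 80    # assumed crossover point

def sub_bass_feed(channels: np.ndarray) -> np.ndarray:
    """channels: array of shape (6, n_samples) -> mono sub-bass signal."""
    b, a = butter(4, CROSSOVER_HZ, btype="low", fs=FS)
    lows = np.stack([lfilter(b, a, ch) for ch in channels])
    return lows.sum(axis=0)

# Example: six channels of noise standing in for program audio.
rng = np.random.default_rng(0)
program = rng.standard_normal((6, FS))  # one second of six-channel audio
print(sub_bass_feed(program).shape)     # (48000,)
```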

IMAX titles of the '70s and '80s delivered audio on eight-track magnetic tape, with the additional tracks used for synchronization to the film. By the '90s, IMAX had switched to distributing digital audio on three CDs (one for each pair of channels). OMSI's theater was equipped for both, and the announcement amusingly notes the availability of cassette decks. A semi-custom audio processor made for IMAX, the Sonics TAC-86, managed synchronization with film playback and applied equalization curves individually calibrated to the theater.

IMAX domes used perforated aluminum screens (also the norm in later planetaria), so the speakers were placed behind the screen in the scaffold-like superstructure that supported it. When I was young, OMSI used to start presentations with a demo program that explained the large size of IMAX film before illuminating work lights behind the screen to make the speakers visible. Much of this was the work of the surprisingly sophisticated show control system employed by Omnimax theaters, a descendant of the PDP-15 originally installed in the Fleet.

Despite Omnimax's almost complete consignment to science museums, there were some efforts at bringing in commercial films. Titles like Disney's "Fantasia" and "Star Wars: Episode III" were distributed to Omnimax theaters via optical reprojection, sometimes even from 35mm originals. Unfortunately, the quality of these adaptations was rarely satisfactory, and the short runtimes (and marketing and exclusivity deals) typical of major commercial releases did not always work well with science museum schedules. Still, the cost of converting an existing film to dome format is pretty low, so the practice continues today. "Star Wars: The Force Awakens," for example, ran on at least one science museum dome. This trickle of blockbusters was not enough to make commercial Omnimax theaters viable.

Caesars Palace closed, and then demolished, their Omnimax theater in 2000. The turn of the 21st century was very much the beginning of the end for the dome theater. IMAX was moving away from their film system and towards digital projection, but digital projection systems suitable for large domes were still a nascent technology and extremely expensive. The end of aggressive support from IMAX meant that filming costs became impractical for documentaries, so while some significant IMAX science museum films were made in the 2000s, the volume began to decline and the overall industry moved away from IMAX in general and Omnimax especially.

It's surprising how unforeseen this was, at least by some. A ten-screen commercial theater in Duluth opened an Omnimax theater in 1996! Perhaps due to the sunk cost, it ran until 2010, not a bad closing date for an Omnimax theater. Science museums, with their relatively tight budgets and less competitive nature, did tend to hold onto existing Omnimax installations well past their prime. Unfortunately, many didn't: OMSI, for example, closed its Omnimax theater in 2013 for replacement with a conventional digital theater that has a large screen but is not IMAX branded.

Fortunately, some operators hung onto their increasingly costly Omnimax domes long enough for modernization to become practical. The IMAX Corporation abandoned the Omnimax name as more of the theaters closed, but continued to support "IMAX Dome" with the introduction of a digital laser projector with spherical optics. There are only ten examples of this system. Others, including Omnimax's flagship at the Fleet Science Center, have been replaced by custom dome projection systems built by competitors like Sony.

Few Omnimax projectors remain. The Fleet, to their credit, installed the modern laser projectors in front of the projector well so that the original film projector could remain in place. It's still functional and used for reprises of Omnimax-era documentaries. IMAX projectors in general are a dying breed; a number of them have been preserved, but their complex, specialized design and the end of vendor support mean that it may become infeasible to keep them operating.

We are, of course, well into the digital era. While far from inexpensive, digital projection systems are now able to match the quality of Omnimax projection. The newest dome theaters, like the Sphere, dispense with projection entirely. Instead, they use LED display panels capable of far brighter and more vivid images than projection, and with none of the complexity of water-cooled arc lamps.

Still, something has been lost. There was once a parallel theater industry, a world with none of the glamor of Hollywood but for whom James Cameron hauled a camera to the depths of the ocean and Leonardo DiCaprio narrated repairs to the Hubble. In a good few dozen science museums, two-ton behemoths rose from beneath the seats, the zenith of film projection technology. After decades of documentaries, I think people forgot how remarkable these theaters were.

Science museums stopped promoting them as aggressively, and much of the showmanship faded away. Sometime in the 2000s, OMSI stopped running the pre-show demonstration, instead starting the film directly. They stopped explaining the projectionist's work in preparing the show, and as they shifted their schedule towards direct repetition of one feature, there was less for the projectionist to do anyway. It became just another museum theater, so it's no wonder that they replaced it with just another museum theater: a generic big-screen setup with the exceptionally dull name of "Empirical Theater."

From time to time, there have been whispers of a resurgence of 70mm film. Oppenheimer, for example, was distributed to a small number of theaters in this giant of film formats: 53 reels, 11 miles, 600 pounds of film. Even conventional IMAX is too costly for the modern theater industry, though. Omnimax has fallen completely by the wayside, with the few remaining dome operators doomed to recycling the same films with a sprinkling of newer reformatted features. It is hard to imagine a collective of science museums sending another film camera to space.

Omnimax poses a preservation challenge in more ways than one. Besides the lack of documentation on Omnimax theaters and films, there are precious few photographs of Omnimax theaters and even fewer videos of their presentations. Of course, the historian suffers where Madison Square Garden hopes to succeed: the dome theater is perhaps the ultimate in location-based entertainment. Photos and videos, represented on a flat screen, cannot reproduce the experience of the Omnimax theater. The 180 horizontal degrees of screen, the sound that was always a little too loud, in no small part to mask the sound of the projector that made its own racket in the middle of the seating. You had to be there.

Omnimax projector at St. Louis Science Center

IMAGES: Omnimax projection room at OMSI, Flickr user truk. Omnimax dome with work lights on at MSI Chicago, Wikimedia Commons user GualdimG. Omnimax projector at St. Louis Science Center, Flickr user pasa47.

[1] I don't have extensive information on pricing, but I know that in the 1960s an "economy" Spitz came in over $30,000 (~10x that much today).

[2] Pink Floyd's landmark album Dark Side of The Moon debuted in a release event held at the London Planetarium. This connection between Pink Floyd and planetaria, apparently much disliked by the band itself, has persisted to the present day. Several generations of Pink Floyd laser shows have been licensed by science museums around the world, and must represent by far the largest success of fixed-installation laser projection.

[3] Are you starting to detect a theme with these Expos? The World's Fairs, including in their various forms as Expos, were long one of the main markets for niche film formats. Any given weird projection format you run into, there's a decent chance that it was originally developed for some short film for an Expo. Keep in mind that it's the nature of niche projection formats that they cannot easily be shown in conventional theaters, so they end up coupled to these crowd events where a custom venue can be built.

[4] The Smithsonian Institution started looking for an exciting new theater in 1970. As an example of the various niche film formats at the time, the Smithsonian considered a dome (presumably Omnimax), Cinerama (a three-projector ultrawide system), and Circle-Vision 360 (known mostly for the few surviving Expo films at Disney World's EPCOT) before settling on IMAX. The Smithsonian theater, first planned for the Smithsonian Museum of Natural History before being integrated into the new National Air and Space Museum, was tremendously influential on the broader world of science museum films. That is perhaps an understatement; it is sometimes credited with popularizing IMAX in general, and the newspaper coverage the new theater received throughout North America lends credence to the idea. It is interesting, then, to imagine how different our world would be if they had chosen Circle-Vision. "Captain America: Brave New World" in Cinemark 360.

2025-05-27 the first smart homes

Sometimes I think I should pivot my career to home automation critic, because I have many opinions on the state of the home automation industry---and they're pretty much all critical. Virtually every time I bring up home automation, someone says something about the superiority of the light switch. Controlling lights is one of the most obvious applications of home automation, and there is a roughly century long history of developments in light control---yet, paradoxically, it is an area where consumer home automation continues to struggle.

An analysis of how and why billion-dollar tech companies fail to master the simple toggling of lights in response to human input will have to wait for a future article, because I will have a hard time writing one without descending into incoherent sobbing about the principles of scene control and the interests of capital. Instead, I want to just dip a toe into the troubled waters of "smart lighting" by looking at one of its earliest precedents: low-voltage lighting control.

A source I generally trust, the venerable "old internet" website Inspectapedia, says that low-voltage lighting control systems date back to about 1946. The earliest conclusive evidence I can find of these systems is a newspaper ad from 1948, but let's be honest, it's a holiday and I'm only making a half effort on the research. In any case, the post-war timing is not a coincidence. The late 1940s were a period of both rapid (sub)urban expansion and high copper prices, and the original impetus for relay systems seems to have been the confluence of these two.

But let's step back and explain what a relay or low-voltage lighting control system is. First, I am not referring to "low voltage lighting" meaning lights that run on 12 or 24 volts DC or AC, as was common in landscape lighting and is increasingly common today for integrated LED lighting. Low-voltage lighting control systems are used for conventional 120VAC lights. In the most traditional construction, e.g. in the 1940s, lights would be served by a "hot" wire that passed through a wall box containing a switch. In many cases the neutral (likely shared with other fixtures) went directly from the light back to the panel, bypassing the switch... running both the hot and neutral through the switch box did not become conventional until fairly recently, to the chagrin of anyone installing switches that require a neutral for their own power, like timers or "smart" switches.

The problem with this is that it lengthens the wiring runs. If you have a ceiling fixture with two different switches in a three-way arrangement, say in a hallway in a larger house, you could be adding nearly 100' in additional wire to get the hot to the switches and the runner between them. The cost of that wiring, in the mid-century, was quite substantial. Considering how difficult it is to find an employee to unlock the Romex cage at Lowe's these days, I'm not sure that's changed that much.

There are different ways of dealing with this. In the UK, the "ring main" served in part to reduce the gauge (and thus cost) of outlet wiring, but we never picked up that particular eccentricity in the US (for good reason). In commercial buildings, it's not unusual for lighting to run on 240v for similar reasons, but 240v is discouraged in US residential wiring. Besides, the mid-century was an age of optimism and ambition in electrical technology, the days of Total Electric Living. Perhaps the technology of the relay, refined by so many innovations of WWII, could offer a solution.

Switch wiring also had to run through wall cavities, an irritating requirement in single-floor houses where much of the lighting wiring could be contained to the attic. The wiring of four-way and other multi-switch arrangements could become complex and require a lot more wall runs, discouraging builders from providing switches in the most convenient places. What if relays also made multiple switches significantly easier to install and relocate?

You probably get the idea. In a typical low-voltage lighting control system, a transformer provides a low voltage like 24VAC, much the same as used by doorbells. The light switches simply toggle the 24VAC control power to the coils of relays. Some (generally older) systems powered the relay continuously, but most used latching relays. In this case, all light switches are momentary, with an "on" side and an "off" side. This could be a paddle that you push up or down (much like a conventional light switch), a bar that you push the left or right sides of, or a pair of two push buttons.
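For readers who think more easily in code than in wiring diagrams, here is a toy sketch of the latching arrangement. The class and method names are mine, purely for illustration; the point is that the switches only pulse the coils, and the relay holds its state with no continuous coil current.

```python
# A toy model (my own simplification, not any manufacturer's design) of a
# latching-relay lighting circuit: momentary ON/OFF switches pulse the relay
# coils over 24VAC; the relay contact then holds its state on its own.
class LatchingRelay:
    def __init__(self) -> None:
        self.closed = False   # contact state; drives the 120VAC light

    def pulse_on(self) -> None:   # 24VAC briefly applied to the "on" coil
        self.closed = True

    def pulse_off(self) -> None:  # 24VAC briefly applied to the "off" coil
        self.closed = False

# Several momentary switches anywhere in the house can share one relay,
# which is what makes cheap n-way switching possible.
hall_light = LatchingRelay()
hall_light.pulse_on()     # paddle pressed "on" at the front door
hall_light.pulse_on()     # pressing "on" again elsewhere is harmless
hall_light.pulse_off()    # "off" side pressed at the bedroom door
print(hall_light.closed)  # False
```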

In most installations, all of the relays were installed together in a single enclosure, usually in the attic where the high-voltage wiring to the actual lights would be fairly short. The 24VAC cabling to the switches was much smaller gauge, and depending on the jurisdiction might not require any sort of license to install.

Many systems had enclosures with separate high voltage and low voltage components, or mounted the relays on the outside of an enclosure such that the high voltage wiring was inside and low voltage outside. Both arrangements helped to meet code requirements for isolating high and low voltage systems and provided a margin of safety in the low voltage wiring. That provided additional cost savings as well; low voltage wiring was usually installed without any kind of conduit or sheathed cable.

By 1950, relay lighting controls were making common appearances in real estate listings. A feature piece on the "Melody House," a builder's model home, in the Tacoma News Tribune reads thus:

Newest features in the house are the low voltage touch plate and relay system lighting controls, with wide plates instead of snap buttons---operated like the stops of a pipe organ, with the merest flick of a finger.

The comparison to a pipe organ is interesting, first in its assumption that many readers were familiar with typical organ stops. Pipe organs were, increasingly, one of the technological marvels of the era: while the concept of the pipe organ is very old, this same era saw electrical control systems (replete with relays!) significantly reduce the cost and complexity of organ consoles. What's more, the tonewheel electric organ had become well-developed and started to find its way into homes.

The comparison is also interesting because of its deficiencies. The Touch-Plate system described used wide bars, which you pressed the left or right side of---you could call them momentary SPDT rocker switches if you wanted. There were organs with similar rocker stops but I do not think they were common in 1950. My experience is that such rocker switch stops usually indicate a fully digital control system, where they make momentary action unobtrusive and avoid state synchronization problems. I am far from an expert on organs, though, which is why I haven't yet written about them. If you have a guess at which type of pipe organ console our journalist was familiar with, do let me know.

Touch-Plate seems to have been one of the first manufacturers of these systems, although I can't say for sure that they invented them. Interestingly, Touch-Plate is still around today, but their badly broken WordPress site ("Welcome to the new touch-plate.com" despite it actually being touchplate.com) suggests they may not do much business. After a few pageloads their WordPress plugin WAF blocked me for "exceed[ing] the maximum number of page not found errors per minute for humans." This might be related to my frustration that none of the product images load. It seems that the Touch-Plate company has mostly pivoted to reselling imported LED lighting (touchplateled.com), so I suppose the controls business is withering on the vine.

The 1950s saw a proliferation of relay lighting control brands, with GE introducing a particularly popular system with several generations of fixtures. Kyle Switch Plates, who sell replacement switch plates (what else?), list options for Remcon, Sierra, Bryant, Pyramid, Douglas, and Enercon systems in addition to the two brands we have met so far. As someone who pays a little too much attention to light switches, I have personally seen four of these brands, three of them still in use and one apparently abandoned in place.

Now, you might be thinking that simply economizing wiring by relocating the switches does not constitute "home automation," but there are other features to consider. For one, low-voltage light control systems made it feasible to install a lot more switches. Houses originally built with them often go a little wild with the n-way switching, every room providing lightswitches at every door. But there is also the possibility of relay logic. From the same article:

The necessary switches are found in every room, but in the master bedroom there is a master control panel above the bed, from where the house and yard may be flooded with instant light in case of night emergency.

Such "master control panels" were a big attraction for relay lighting, and the finest homes of the 1950s and 1960s often displayed either a grid of buttons near the head of the master bed, or even better, a GE "Master Selector" with a curious system of rotary switches. On later systems, timers often served as auxiliary switches, so you could schedule exterior lights. With a creative installer, "scenes" were even possible by wiring switches to arbitrary sets of relays (this required DC or half-wave rectified control power and diodes to isolate the switches from each other).

Many of these relay control systems are still in use today. While they are quite outdated in a certain sense, the design is robust and the simple components mean that it's usually not difficult to find replacement parts when something does fail. The most popular system is the one offered by GE, using their RR series relays (RR3, RR4, etc., to the modern RR9). That said, GE suggests a modernization path to their LightSweep system, which is really a 0-10v analog dimming controller that has the add-on ability to operate relays.

The failure modes are mostly what you would expect: low voltage wiring can chafe and short, or the switches can become stuck. This tends to cause the lights to stick on or off, and the continuous current through the relay coil often burns it out. The fix requires finding the stuck switch or short and correcting it, and then replacing the relay.

One upside of these systems that persists today is density: the low voltage switches are small, so with most systems you can fit 3 per gang. Another is that they still make N-way switching easier. There is arguably a safety benefit, considering the reduction in mains-voltage wire runs.

Yet we rarely see such a thing installed in homes newer than around the '80s. I don't know that I can give a definitive explanation of the decline of relay lighting control, but reduced prices for copper wiring were probably a main factor. The relays added a failure point, which might lead to a perception of unreliability, and the declining familiarity of electricians means that installing a relay system could be expensive and frustrating today.

What really interests me about relay systems is that they weren't really replaced... the idea just went away. It's not like modern homes are providing a master control panel in the bedroom using some alternative technology. I mean, some do, those with prices in the eight digits, but you'll hardly ever see it.

That gets us to the tension between residential lighting and architectural lighting control systems. In higher-end commercial buildings, and in environments like conference rooms and lecture halls, there's a well established industry building digital lighting control systems. Today, DALI is a common standard for the actual lighting control, but if you look at a range of existing buildings you will find everything from completely proprietary digital distributed dimming to 0-10v analog dimming to central dimmer racks (similar to traditional theatrical lighting).

Relay lighting systems were, in a way, a nascent version of residential architectural lighting control. And the architectural lighting control industry continues to evolve. If there is a modern equivalent to relay lighting, it's something like Lutron QSX. That's a proprietary digital lighting (and shade) control system, marketed for both residential and commercial use. QSX offers a wide range of attractive wall controls, tight integration to Lutron's HomeSense home automation platform, and a price tag that'll make your eyes water. Lutron has produced many generations of these systems, and you could make an argument that they trace their heritage back to the relay systems of the 1940s. But they're just priced way beyond the middle-class home.

And, well, I suppose that requires an argument based on economics. Prices have gone up. Despite tract construction being a much older idea than people often realize, it seems clear that today's new construction homes have been "value engineered" to significantly lower feature and quality levels than those of the mid-century---but they're a lot bigger. There is a sort of maxim that today's home buyers don't care about anything but square footage, and if you've seen what Pulte or D. R. Horton are putting up... well, I never knew that 3,000 sq ft could come so cheap, and look it too.

Modern new-construction homes just don't come with the gizmos that older ones did, especially in the '60s and '70s. Looking at the sales brochure for a new development in my own Albuquerque ("Estates at La Cuentista"), besides 21st century suburbanization (Gated Community! "East Access to Paseo del Norte" as if that's a good thing!) most of the advertised features are "big." I'm serious! If you look at the "More Innovation Built In" section, the "innovations" are a home office (more square footage), storage (more square footage), indoor and outdoor gathering spaces (to be fair, only the indoor ones are square footage), "dedicated learning areas" for kids (more square footage), and a "basement or bigger garage" for a home gym (more square footage). The only thing in the entire innovation section that I would call a "technical" feature is water filtration. You can scroll down for more details, and you get to things like "space for a movie room" and a finished basement described eight different ways.

Things were different during the peak of relay lighting in the '60s. A house might only be 1,600 sq ft, but the builder would deck it out with an intercom (including multi-room audio of a primitive sort), burglar alarm, and yes, relay lighting. All of these technologies were a lot newer and people were more excited about them; I bring up Total Electric Living a lot because of an aesthetic obsession but it was a large-scale advertising and partnership campaign by the electrical industry (particularly Westinghouse) that gave builders additional cross-promotion if they included all of these bells and whistles.

Remember, that was when people were watching those old videos about the "kitchen of the future." What would a 2025 "Kitchen of the Future" promotional film emphasize? An island bigger than my living room and a nook for every meal, I assume. Features like intercoms and even burglar alarms have become far less common in new construction, and even if they were present I don't think most buyers would use them.

But that might seem a little odd, right, given the push towards home automation? Well, built-in home automation options have existed for longer than any of today's consumer solutions, but "built in" is a liability for a technology product. There are practical reasons, in that built-in equipment is harder to replace, but there's also a lamer commercial reason. Consumer technology companies want to sell their products like consumer technology, so they've recontextualized lighting control as "IoT" and "smart" and "AI" rather than something an electrician would hook up.

While I was looking into relay lighting control systems, I ran into an interesting example. The Lutron Lu Master Lumi 5. What a name! Lutron loves naming things like this. The Lumi 5 is a 1980s era product with essentially the same features as a relay system, but architected in a much stranger way. It is, essentially, five three-way switches in a box with remote controls. That means that each of the actual light switches in the house (which could also be dimmers) needs mains-voltage wiring, including a runner, back to the Lumi 5 "interface."

Pressing a button on one of the Lutron wall panels toggles the state of the relay in the "interface" cabinet, toggling the light. But, since it's all wired as a three-way switch, toggling the physical switch at the light does the same thing. As is typical when combining n-way switches and dimming, the Lumi 5 has no control over dimmers. You can only dim a light up or down at the actual local control, the Lumi 5 can just toggle the dimmer on and off using the 3-way runner. The architecture also means that you have two fundamentally different types of wall panels in your house: local switches or dimmers wired to each light, and the Lu Master panels with their five buttons for the five circuits, along with "all on" and "all off."
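Here is a minimal sketch of that toggling behavior, my own model for illustration rather than anything from Lutron's documentation: the light responds to the difference between the cabinet relay and the local switch, so flipping either one toggles it.

```python
# A toy model of the Lu Master's three-way arrangement: the relay in the
# "interface" cabinet and the local wall switch form the two ends of an
# ordinary three-way circuit, so either one toggles the light.
class ThreeWayCircuit:
    def __init__(self) -> None:
        self.relay = False         # relay contact state in the cabinet
        self.local_switch = False  # switch (or dimmer) at the fixture

    @property
    def light_on(self) -> bool:
        # In three-way wiring the lamp is energized when the two switch
        # positions differ (which side counts as "on" is just convention).
        return self.relay != self.local_switch

    def press_panel_button(self) -> None:  # Lu Master wall panel button
        self.relay = not self.relay

    def flip_local_switch(self) -> None:   # switch at the light itself
        self.local_switch = not self.local_switch

hall = ThreeWayCircuit()
hall.press_panel_button()   # turned on from the master panel
hall.flip_local_switch()    # turned off again at the local switch
print(hall.light_on)        # False
```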

The Lumi 5 "interface" uses simple relay logic to implement a few more features. Five mains-voltage-level inputs can be wired to time clocks, so that you can schedule any combination(s) of the circuits to turn on and off. The manual recommends models including one with an astronomical clock for sunrise/sunset. An additional input causes all five circuits to turn on; it's suggested for connection to an auxiliary relay on a burglar alarm to turn all of the lights on should the alarm be triggered.

The whole thing is strange and fascinating. It is basically a relay lighting control system, like so many before it, but using a distinctly different wiring convention. I think the main reason for the odd wiring was to accommodate dimmers, an increasingly popular option in the 1980s that relay systems could never really contend with. It doesn't have the cost advantages of relay systems at all, it will definitely be more expensive! But it adds some features over the fancy Lutron switches and dimmers you were going to install anyway.

The Lu Master is the transitional stage between relay lighting systems and later architectural lighting controls, and it also straddled the end of relay lighting control in homes. It gives an idea of where relay lighting control in homes might have evolved, had the whole technology not been doomed to the niche zone of conference centers and universities.

If you think about it, the Lu Master fills the most fundamental roles of home automation in lighting: control over multiple lights in a convenient place, scheduling and triggers, and an emergency function. It only lacks scenes, which I think we can excuse considering that the simple technology it uses does not allow it to adjust dimmers. And all of that with no Node-RED in sight!

Maybe that conveys what most frustrates me about the "home automation" industry: it is constantly reinventing the wheel, an oligopoly of tech companies trying to drag people's homes into their "ecosystem." They do so by leveraging the buzzword of the moment, from IoT to voice assistants to, I guess, now AI, to solve a basic set of problems that were pretty well solved at least as early as 1948.

That's not to deny that modern home automation platforms have features that old ones don't. They are capable of incredibly sophisticated things! But realistically, most of their users want only very basic functionality: control in convenient places, basic automation, scenes. It wouldn't sting so much if all these whiz-bang general purpose computers were good at those tasks, but they aren't. For the very most basic tasks, things like turning on and off a group of lights, major tech ecosystems like HomeKit provide a user experience that is significantly worse than the model home of 1950.

You could install a Lutron system, and it would solve those fundamental tasks much better... for a much higher price. But it's not like Lutron uses all that money to be an absolute technical powerhouse, a center of innovation at the cutting edge. No, even the latest Lutron products are really very simple, technically. The technical leaders here, Google, Apple, are the companies that can't figure out how to make a damn light switch.

The problem with modern home automation platforms is that they are too ambitious. They are trying to apply enormously complex systems to very simple tasks, and thus contaminating the simplest of electrical systems with all the convenience and ease of a Smart TV.

Sometimes that's what it feels like this whole industry is doing: adding complexity while the core decays. From automatic programming to AI coding agents, video terminals to Electron, the scope of the possible expands while the fundamentals become more and more irritating.

But back to the real point: I hope you learned about some cool light switches. Check out the Kyle Switch Plates reference and you'll start seeing these systems in buildings and homes, at least if you live in an area that built up during the era when they were common (the 1950s to the 1970s).

2025-05-11 air traffic control

Air traffic control has been in the news lately, on account of my country's declining ability to do it. Well, that's a long-term trend, resulting from decades of under-investment, severe capture by our increasingly incompetent defense-industrial complex, no small degree of management incompetence in the FAA, and long-lasting effects of Reagan crushing the PATCO strike. But that's just my opinion, you know, maybe airplanes got too woke. In any case, it's an interesting time to consider how weird parts of air traffic control are. The technical, administrative, and social aspects of ATC all seem two notches more complicated than you would expect. ATC is heavily influenced by its peculiar and often accidental development, a product of necessity that perpetually trails behind the need, and a beneficiary of hand-me-down military practices and technology.

Aviation Radio

In the early days of aviation, there was little need for ATC---there just weren't many planes, and technology didn't allow ground-based controllers to do much of value. There was some use of flags and signal lights to clear aircraft to land, but for the most part ATC had to wait for the development of aviation radio. The impetus for that work came mostly from the First World War.

Here we have to note that the history of aviation is very closely intertwined with the history of warfare. Aviation technology has always rapidly advanced during major conflicts, and as we will see, ATC is no exception.

By 1913, the US Army Signal Corps was experimenting with the use of radio to communicate with aircraft. This was pretty early in radio technology, and the aircraft radios were huge and awkward to operate, but it was also early in aviation and "huge and awkward to operate" could be similarly applied to the aircraft of the day. Even so, radio had obvious potential in aviation. The first military application for aircraft was reconnaissance. Pilots could fly past the front to find artillery positions and otherwise provide useful information, and then return with maps. Well, even better than returning with a map was providing the information in real-time, and by the end of the war medium-frequency AM radios were well developed for aircraft.

Radios in aircraft led naturally to another wartime innovation: ground control. Military personnel on the ground used radio to coordinate the schedules and routes of reconnaissance planes, and later to inform on the positions of fighters and other enemy assets. Without any real way to know where the planes were, this was all pretty primitive, but it set the basic pattern that people on the ground could keep track of aircraft and provide useful information.

Post-war, civil aviation rapidly advanced. The early 1920s saw numerous commercial airlines adopting radio, mostly for business purposes like schedule coordination. Once you were in contact with someone on the ground, though, it was only logical to ask about weather and conditions. Many of our modern practices like weather briefings, flight plans, and route clearances originated as more or less formal practices within individual airlines.

Air Mail

The government was not left out of the action. The Post Office operated what may have been the largest commercial aviation operation in the world during the early 1920s, in the form of Air Mail. The Post Office itself did not have any aircraft; all of the flying was contracted out---initially to the Army Air Service, and later to a long list of regional airlines. Air Mail was considered a high priority by the Post Office and proved very popular with the public. When the transcontinental route began proper operation in 1920, it became possible to get a letter from New York City to San Francisco in just 33 hours by transferring it between airplanes in a nearly non-stop relay race.

The Post Office's largesse in contracting the service to private operators provided not only the funding but the very motivation for much of our modern aviation industry. Air travel was not very popular at the time, being loud and uncomfortable, but the mail didn't complain. The many contract mail carriers of the 1920s grew and consolidated into what are now some of the United States' largest companies. For around a decade, the Post Office almost singlehandedly bankrolled civil aviation, and passengers were a side hustle [1].

Air mail's ambition brought more than economic benefit. Air mail routes were often longer and more challenging than commercial passenger routes. Transcontinental service required regular flights through sparsely populated parts of the interior, challenging the navigation technology of the time and making rescue of downed pilots a major concern. Notably, air mail operators did far more nighttime flying than any other commercial aviation in the 1920s. The Post Office became the government's de facto technical leader in civil aviation. Besides the network of beacons and markers built to guide air mail between cities, the Post Office built 17 Air Mail Radio Stations along the transcontinental route.

The Air Mail Radio Stations were the company radio system for the entire air mail enterprise, and the closest thing to a nationwide, public air traffic control service to then exist. They did not, however, provide what we would now call control. Their role was mainly to provide pilots with information (including, critically, weather reports) and to keep loose tabs on air mail flights so that a disappearance would be noticed in time to send search and rescue.

In 1926, the Air Commerce Act created the Aeronautics Branch of the Department of Commerce. The Aeronautics Branch assumed a number of responsibilities, but one of them was the maintenance of the Air Mail routes. Similarly, the Air Mail Radio Stations became Aeronautics Branch facilities, and took on the new name of Flight Service Stations. No longer just for the contract mail carriers, the Flight Service Stations made up a nationwide network of government-provided services to aviators. They were the first edifices in what we now call the National Airspace System (NAS): a complex combination of physical facilities, technologies, and operating practices that enable safe aviation.

In 1935, the first en-route air traffic control center opened, a facility in Newark owned by a group of airlines. The Aeronautics Branch, since renamed the Bureau of Air Commerce, supported the airlines in developing this new concept of en-route control that used radio communications and paperwork to track which aircraft were in which airways. The rising number of commercial aircraft made mid-air collisions a bigger problem, so the Newark control center was quickly followed by more facilities built on the same pattern. In 1936, the Bureau of Air Commerce took ownership of these centers, and ATC became a government function alongside the advisory and safety services provided by the flight service stations.

En route center controllers worked off of position reports from pilots via radio, but needed a way to visualize and track aircraft's positions and their intended flight paths. Several techniques helped: first, airlines shared their flight planning paperwork with the control centers, establishing "flight plans" that corresponded to each aircraft in the sky. Controllers adopted a work aid called a "flight strip," a small piece of paper with the key information about an aircraft's identity and flight plan that could easily be handed between stations. By arranging the flight strips on display boards full of slots, controllers could visualize the ordering of aircraft in terms of altitude and airway.

Second, each center was equipped with a large plotting table map where controllers pushed markers around to correspond to the position reports from aircraft. A small flag on each marker gave the flight number, so it could easily be correlated to a flight strip on one of the boards mounted around the plotting table. This basic concept of air traffic control, of a flight strip and a position marker, is still in use today.
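As a toy model of that bookkeeping, consider the sketch below; the field names and example flights are my own simplification, not a period-accurate strip format.

```python
# A toy sketch of the paper flight strip plus plotting-table marker that
# en route controllers worked from. Flights, types, and airways here are
# hypothetical examples, not historical records.
from dataclasses import dataclass

@dataclass
class FlightStrip:
    flight: str        # e.g. "TWA 12"
    aircraft: str      # type, e.g. "DC-3"
    altitude: int      # assigned altitude, feet
    airway: str        # e.g. "Green 4"
    route: str         # abbreviated flight plan

@dataclass
class PlotMarker:
    flight: str                     # flag on the marker, correlates to a strip
    position: tuple[float, float]   # where it sits on the table map

# The board of slots: strips grouped by airway and ordered by altitude, so a
# controller can see at a glance who is stacked where.
strips = [
    FlightStrip("TWA 12", "DC-3", 7000, "Green 4", "EWR-CLE"),
    FlightStrip("UAL 5", "DC-3", 9000, "Green 4", "EWR-ORD"),
]
board = sorted(strips, key=lambda s: (s.airway, s.altitude))
markers = {s.flight: PlotMarker(s.flight, (0.0, 0.0)) for s in strips}

# A radioed position report moves the marker; the strip stays in its slot.
markers["TWA 12"].position = (40.7, -74.6)
print([s.flight for s in board])
```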

Radar

The Second World War changed aviation more than any other event of history. Among the many advancements were two British inventions of particular significance: first, the jet engine, which would make modern passenger airliners practical. Second, the radar, and more specifically the magnetron. This was a development of such significance that the British government treated it as a secret akin to nuclear weapons; indeed, the UK effectively traded radar technology to the US in exchange for participation in US nuclear weapons research.

Radar created radical new possibilities for air defense, and complemented previous air defense development in Britain. During WWI, the organization tasked with defending London from aerial attack had developed a method called "ground-controlled interception" or GCI. Under GCI, ground-based observers identify possible targets and then direct attack aircraft towards them via radio. The advent of radar made GCI tremendously more powerful, allowing a relatively small number of radar-assisted air defense centers to monitor for inbound attack and then direct defenders with real-time vectors.

In the first implementation, radar stations reported contacts via telephone to "filter centers" that correlated tracks from separate radars to create a unified view of the airspace---drawn in grease pencil on a preprinted map. Filter center staff took radar and visual reports and updated the map by moving the marks. This consolidated information was then provided to air defense bases, once again by telephone.

Later technical developments in the UK made the process more automated. The invention of the "plan position indicator" or PPI, the type of radar scope we are all familiar with today, made the radar far easier to operate and interpret. Radar sets that automatically swept over 360 degrees allowed each radar station to see all activity in its area, rather than just aircraft passing through a defensive line. These new capabilities eliminated the need for much of the manual work: radar stations could see attacking aircraft and defending aircraft on one PPI, and communicated directly with defenders by radio.

It became routine for a radar operator to give a pilot navigation vectors by radio, based on real-time observation of the pilot's position and heading. A controller took strategic command of the airspace, effectively steering the aircraft from a top-down view. The ease and efficiency of this workflow was a significant factor in the outcome of the Battle of Britain, and its remarkable efficacy was noticed in the US as well.

At the same time, changes were afoot in the US. WWII was tremendously disruptive to civil aviation; while aviation technology rapidly advanced due to wartime needs, those same pressing demands led to a slowdown in nonmilitary activity. A heavy volume of military logistics flights and flight training, as well as growing concerns about defending the US from an invasion, meant that ATC was still a priority. A reorganization of the Bureau of Air Commerce replaced it with the Civil Aeronautics Authority, or CAA. The CAA's role greatly expanded as it assumed responsibility for airport control towers and commissioned new en route centers.

As WWII came to a close, CAA en route control centers began to adopt GCI techniques. By 1955, the name Air Route Traffic Control Center (ARTCC) had been adopted for en route centers and the first air surveillance radars were installed. In a radar-equipped ARTCC, the map where controllers pushed markers around was replaced with a large tabletop PPI built to a Navy design. The controllers still pushed markers around to track the identities of aircraft, but they moved them based on their corresponding radar "blips" instead of radio position reports.

Air Defense

After WWII, post-war prosperity and wartime technology like the jet engine led to huge growth in commercial aviation. During the 1950s, radar was adopted by more and more ATC facilities (both "terminal" at airports and "en route" at ARTCCs), but there were few major changes in ATC procedure. With more and more planes in the air, tracking flight plans and their corresponding positions became labor intensive and error-prone. A particular problem was the increasing range and speed of aircraft, and correspondingly longer passenger flights, which meant that many aircraft passed from the territory of one ARTCC into another. This required that controllers "hand off" the aircraft, informing the "next" ARTCC of the flight plan and position at which the aircraft would enter their airspace.

In 1956, 128 people died in a mid-air collision of two commercial airliners over the Grand Canyon. In 1958, 49 people died when a military fighter struck a commercial airliner over Nevada. These were not the only such incidents in the mid-1950s, and public trust in aviation started to decline. Something had to be done. First, in 1958 the CAA gave way to the Federal Aviation Administration. This was more than just a name change: the FAA's authority was greatly increased compared to the CAA, most notably by granting it authority over military aviation.

This is a difficult topic to explain succinctly, so I will only give broad strokes. Prior to 1958, military aviation was completely distinct from civil aviation, with no coordination and often no communication at all between the two. This was, of course, a factor in the 1958 collision. Further, the 1956 collision, while it did not involve the military, did result in part from communications issues between separate distinct CAA facilities and the airline's own control facilities. After 1958, ATC was completely unified into one organization, the FAA, which assumed the work of the military controllers of the time and some of the role of the airlines. The military continues to have its own air controllers to this day, and military aircraft continue to include privileges such as (practical but not legal) exemption from transponder requirements, but military flights over the US are still beholden to the same ATC as civil flights. Some exceptions apply, void where prohibited, etc.

The FAA's suddenly increased scope only made the practical challenges of ATC more difficult, and commercial aviation numbers continued to rise. As soon as the FAA was formed, it was understood that there needed to be major investments in improving the National Airspace System. While the first couple of years were dominated by the transition, the FAA's second administrator (Najeeb Halaby) prepared two lengthy reports examining the situation and recommending improvements. One of these, the Beacon report (also called Project Beacon), specifically addressed ATC. The Beacon report's recommendations included massive expansion of radar-based control (called "positive control" because of the controller's access to real-time feedback on aircraft movements) and new control procedures for airways and airports. Even better, for our purposes, it recommended the adoption of general-purpose computers and software to automate ATC functions.

Meanwhile, the Cold War was heating up. US air defense, a minor concern in the few short years after WWII, became a higher priority than ever before. The Soviet Union had long-range aircraft capable of reaching the United States, and nuclear weapons meant that only a few such aircraft had to make it through to cause massive destruction. The vast size of the United States (and, considering the new unified air defense command between the United States and Canada, all of North America) made this a formidable challenge.

During the 1950s, the newly minted Air Force worked closely with MIT's Lincoln Laboratory (an important center of radar research) and IBM to design a computerized, integrated, networked system for GCI. When the Air Force committed to purchasing the system, it was christened the Semi-Automatic Ground Environment, or SAGE. SAGE is a critical juncture in the history of the computer and computer communications, the first system to demonstrate many parts of modern computer technology and, moreover, perhaps the first large-scale computer system of any kind.

SAGE is an expansive topic that I will not take on here; I'm sure it will be the focus of a future article, but it's already pretty well known and well covered. I have not so far felt like I had much new to contribute, despite it being the first item on my "list of topics" for the last five years. But one of the things I want to tell you about SAGE, which is perhaps not so well known, is that SAGE was not used for ATC. SAGE was a purely military system. It was commissioned by the Air Force, and its numerous operating facilities (called "direction centers") were located on Air Force bases along with the interceptor forces they would direct.

However, there was obvious overlap between the functionality of SAGE and the needs of ATC. SAGE direction centers continuously received tracks from remote data sites using modems over leased telephone lines, and automatically correlated multiple radar tracks to a single aircraft. Once an operator entered information about an aircraft, SAGE stored that information for retrieval by other radar operators. When an aircraft with associated data passed from the territory of one direction center to another, the aircraft's position and related information were automatically transmitted to the next direction center by modem.
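Purely as an illustration of the data involved (every name below is hypothetical; SAGE itself was nothing like Java, or like any modern software environment), a cross-boundary handoff amounts to forwarding a correlated track record, plus whatever the operators have attached to it, to the neighboring center:

// Hypothetical sketch of a SAGE-style track handoff. The real system
// exchanged fixed-format messages over leased-line modems between
// direction centers; this only models the information being passed along.
import java.time.Instant;

public class TrackHandoffSketch {
    record Track(int trackNumber,      // one track correlated from several radar returns
                 double latitude,
                 double longitude,
                 int altitudeFeet,
                 String operatorData,  // operator-entered info, e.g. a flight plan match
                 Instant lastUpdate) {}

    interface DirectionCenterLink {    // stands in for the modem link to the next center
        void send(Track track);
    }

    // When a track approaches the boundary, forward it so the receiving
    // center starts with the same picture the current one has.
    static void handOff(Track track, DirectionCenterLink next) {
        next.send(track);
    }
}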

One of the key demands of air defense is the identification of aircraft---any unknown track might be routine commercial activity, or it could be an inbound attack. The air defense command received flight plan data on commercial flights (and more broadly all flights entering North America) from the FAA and entered them into SAGE, allowing radar operators to retrieve "flight strip" data on any aircraft on their scope.

Recognizing this interconnection with ATC, as soon as SAGE direction centers were being installed the Air Force started work on an upgrade called SAGE Air Traffic Integration, or SATIN. SATIN would extend SAGE to serve the ATC use-case as well, providing SAGE consoles directly in ARTCCs and enhancing SAGE to perform non-military safety functions like conflict warning and forward projection of flight plans for scheduling. Flight strips would be replaced by teletype output, and in general made less necessary by the computer's ability to filter the radar scope.

Experimental trial installations were made, and the FAA participated readily in the research efforts. Enhancement of SAGE to meet ATC requirements seemed likely to meet the Beacon report's recommendations and radically improve ARTCC operations, sooner and cheaper than development of an FAA-specific system.

As it happened, well, it didn't happen. SATIN became interconnected with another planned SAGE upgrade to the Super Combat Centers (SCC), deep underground combat command centers with greatly enhanced SAGE computer equipment. SATIN and SCC planners were so confident that the last three Air Defense Sectors scheduled for SAGE installation, including my own Albuquerque, were delayed under the assumption that the improved SATIN/SCC equipment should be installed instead of the soon-obsolete original system. SCC cost estimates ballooned, and the program's ambitions were reduced month by month until it was canceled entirely in 1960. Albuquerque never got a SAGE installation, and the Albuquerque air defense sector was eliminated by reorganization later in 1960 anyway.

Flight Service Stations

Remember those Flight Service Stations, the ones that were originally built by the Post Office? One of the oddities of ATC is that they never went away. FSS were transferred to the CAB, to the CAA, and then to the FAA. During the 1930s and 1940s many more were built, expanding coverage across much of the country.

Throughout the development of ATC, the FSS remained responsible for non-control functions like weather briefing and flight plan management. Because aircraft operating under instrument flight rules must closely comply with ATC, the involvement of FSS in IFR flights is very limited, and FSS mostly serve VFR traffic.

As ATC became common, the FSS gained a new and somewhat odd role: playing go-between for ATC. FSS were more numerous and often located in sparser areas between cities (while ATC facilities tended to be in cities), so especially in the mid-century, pilots were more likely to be able to reach an FSS than ATC. It was, for a time, routine for FSS to relay instructions between pilots and controllers. This is still done today, although improved communications have made the need much less common.

As weather dissemination improved (another topic for a future post), FSS gained access to extensive weather conditions and forecasting information from the Weather Service. This connectivity is bidirectional; during the midcentury FSS not only received weather forecasts by teletype but transmitted pilot reports of weather conditions back to the Weather Service. Today these communications have, of course, been computerized, although the legacy teletype format doggedly persists.

There has always been an odd schism between the FSS and ATC: they are operated by different departments, out of different facilities, with different functions and operating practices. In 2005, the FAA cut costs by privatizing the FSS function entirely. Flight service is now operated by Leidos, one of the largest government contractors. All FSS operations have been centralized to one facility that communicates via remote radio sites.

While flight service is still available, increasing automation has made the stations far less important, and the general perception is that flight service is in its last years. Last I looked, Leidos was not hiring for flight service and the expectation was that they would never hire again, retiring the service along with its staff.

Flight service does maintain one of my favorite internet phenomena, the phone number domain name: 1800wxbrief.com. One of the odd manifestations of the FSS/ATC schism and the FAA's very partial privatization is that Leidos maintains an online aviation weather portal that is separate from, and competes with, the Weather Service's aviationweather.gov. Since Flight Service traditionally has the responsibility for weather briefings, it is honestly unclear to what extent Leidos vs. the National Weather Service should be investing in aviation weather information services. For its part, the FAA seems to consider aviationweather.gov the official source, while it pays for 1800wxbrief.com. There's also weathercams.faa.gov, which duplicates a very large portion (maybe all?) of the weather information on Leidos's portal and some of the NWS's. It's just one of those things. Or three of those things, rather. Speaking of duplication due to poor planning...

The National Airspace System

Left in the lurch by the Air Force, the FAA launched its own program for ATC automation. While the Air Force was deploying SAGE, the FAA had mostly been waiting, and various ARTCCs had adopted a hodgepodge of methods ranging from one-off computer systems to completely paper-based tracking. By 1960 radar was ubiquitous, but different radar systems were used at different facilities, and correlation between radar contacts and flight plans was completely manual. The FAA needed something better, and with growing congressional support for ATC modernization, they had the money to fund what they called National Airspace System En Route Stage A.

Further bolstering historical confusion between SAGE and ATC, the FAA decided on a practical, if ironic, solution: buy their own SAGE.

In an upcoming article, we'll learn about the FAA's first fully integrated computerized air traffic control system. While the failed detour through SATIN delayed the development of this system, the nearly decade-long delay between the design of SAGE and the FAA's contract allowed significant technical improvements. This "New SAGE," while directly based on SAGE at a functional level, used later off-the-shelf computer equipment including the IBM System/360, giving it far more resemblance to our modern world of computing than SAGE with its enormous, bespoke AN/FSQ-7.

And we're still dealing with the consequences today!

[1] It also laid the groundwork for the consolidation of the industry, with a 1930 decision that took air mail contracts away from most of the smaller companies and awarded them instead to the precursors of United, TWA, and American Airlines.

2025-05-04 iBeacons

You know sometimes a technology just sort of... comes and goes? Without leaving much of an impression? And then gets lodged in your brain for the next decade? Let's talk about one of those: the iBeacon.

I think the reason that iBeacons loom so large in my memory is that the technology was announced at WWDC in 2013. Picture yourself in 2013: Steve Jobs had died only a couple of years before, Apple was still widely viewed as a visionary leader in consumer technology, and WWDC was still happening. Back then, pretty much anything announced at an Apple event was a Big Deal that got Big Coverage. Even, it turns out, if it was a minor development for a niche application. That's the iBeacon, a specific solution to a specific problem. It's not really that interesting, but the valence of its Apple origin makes it seem cool?

iBeacon Technology

Let's start out with what iBeacon is, as it's so simple as to be underwhelming. Way back in the '00s, a group of vendors developed a sort of "Diet Bluetooth": a wireless protocol that was directly based on Bluetooth but simplified and optimized for low-power, low-data-rate devices. This went through an unfortunate series of names, including the delightful Wibree, but eventually settled on Bluetooth Low Energy (BLE). BLE is not just lower-power, but also easier to implement, so it shows up in all kinds of smart devices today. Back in 2011, it was quite new, and Apple was one of the first vendors to adopt it.

BLE is far less connection-oriented than regular Bluetooth; you may have noticed that BLE devices are often used entirely without conventional "pairing." A lot of typical BLE profiles involve just broadcasting some data into the void for any device that cares (and is in short range) to receive, which is pretty similar to ANT+ and unsurprisingly appears in ANT+-like applications of fitness monitors and other sensors. Of course, despite the simpler association model, BLE applications need some way to find devices, so BLE provides an advertising mechanism in which devices transmit their identifying info at regular intervals.

And that's all iBeacon really is: a standard for very simple BLE devices that do nothing but transmit advertisements with a unique ID as the payload. Add a type field on the advertising packet to specify that the device is trying to be an iBeacon and you're done. You interact with an iBeacon by receiving its advertisements, so you know that you are near it. Any BLE device with advertisements enabled could be used this way, but iBeacons are built only for this purpose.

The applications for iBeacon are pretty much defined by its implementation in iOS; there's not much of a standard even if only for the reason that there's not much to put in a standard. It's all obvious. iOS provides two principal APIs for working with iBeacons: the region monitoring API allows an app to determine if it is near an iBeacon, including registering the region so that the app will be started when the iBeacon enters range. This is useful for apps that want to do something in response to the user being in a specific location.

The ranging API allows an app to get a list of all of the nearby iBeacons and a rough range from the device to the iBeacon. iBeacons can actually operate at substantial ranges---up to hundreds of meters for more powerful beacons with external power, so ranging mode can potentially be used as sort of a lightweight local positioning system to estimate the location of the user within a larger space.

iBeacon IDs are in the format of a UUID, followed by a "major" number and a "minor" number. There are different ways that these get used, especially if you are buying cheap iBeacons and not reconfiguring them, but the general idea is roughly that the UUID identifies the operator, the major a deployment, and the minor a beacon within the deployment. In practice this might be less common than just every beacon having its own UUID due to how they're sourced. It would be interesting to survey iBeacon applications to see which they do.
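To make the format concrete, here is a minimal parser for the widely documented iBeacon frame layout: Apple's manufacturer-specific data starts with the bytes 0x02 0x15, followed by the 16-byte UUID, big-endian major and minor, and a signed "measured power" byte (the calibrated RSSI at one meter). The class is my own sketch, not anything from Apple's SDKs:

import java.nio.ByteBuffer;
import java.util.UUID;

// Minimal iBeacon payload parser. 'data' is the manufacturer-specific data
// that follows Apple's company ID (0x004C) in a BLE advertisement.
public class IBeaconAdvertisement {
    public final UUID proximityUuid;
    public final int major;
    public final int minor;
    public final int measuredPower; // calibrated RSSI at 1 m, in dBm (signed)

    private IBeaconAdvertisement(UUID uuid, int major, int minor, int power) {
        this.proximityUuid = uuid;
        this.major = major;
        this.minor = minor;
        this.measuredPower = power;
    }

    public static IBeaconAdvertisement parse(byte[] data) {
        // 0x02 identifies the iBeacon subtype, 0x15 (21) is the payload length
        if (data == null || data.length < 23 || data[0] != 0x02 || data[1] != 0x15) {
            return null; // not an iBeacon frame
        }
        ByteBuffer buf = ByteBuffer.wrap(data, 2, 21); // fields are big-endian
        UUID uuid = new UUID(buf.getLong(), buf.getLong());
        int major = buf.getShort() & 0xFFFF;
        int minor = buf.getShort() & 0xFFFF;
        int power = buf.get();
        return new IBeaconAdvertisement(uuid, major, minor, power);
    }
}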

Promoted Applications

So where do you actually use these? Retail! Apple seems to have designed the iBeacon pretty much exclusively for "proximity marketing" applications in the retail environment. It goes something like this: when you're in a store and open that store's app, the app will know what beacons you are nearby and display relevant content. For example, in a grocery store, the grocer's app might offer e-coupons for cosmetics when you are in the cosmetics section.

That's, uhh, kind of the whole thing? The imagined universe of applications around the launch of iBeacon was pretty underwhelming to me, even at the time, and it still seems that way. That's presumably why iBeacon had so little success in consumer-facing applications. You might wonder, who actually used iBeacons?

Well, Apple did, obviously. During 2013 and into 2014 iBeacons were installed in all US Apple stores, and prompted the Apple Store app to send notifications about upgrade offers and other in-store deals. Unsurprisingly, this Apple Store implementation was considered the flagship deployment. It generated a fair amount of press, including speculation as to whether or not it would prove the concept for other buyers.

Around the same time, Apple penned a deal with Major League Baseball that would see iBeacons installed in MLB stadiums. For the 2014 season, MLB Advanced Media (MLBAM), a joint venture of team owners, had installed iBeacon technology in 20 stadiums.

Baseball fans will be able to utilize iBeacon technology within MLB.com At The Ballpark when the award-winning app's 2014 update is released for Opening Day. Complete details on new features being developed by MLBAM for At The Ballpark, including iBeacon capabilities, will be available in March.

What's the point? The iBeacons "enable the At The Ballpark app to play specific videos or offer coupons."

This exact story repeats for other retail companies that have picked the technology up at various points, including giants like Target and WalMart. The iBeacons are simply a way to target advertising based on location, with better indoor precision and lower power consumption than GPS. Aiding these applications along, Apple integrated iBeacon support into the iOS location framework and further blurred the lines between iBeacon and other positioning services by introducing location-based-advertising features that operated on geofencing alone.

Some creative thinkers did develop more complex applications for the iBeacon. One of the early adopters was a company called Exact Editions, which prepared the Apple Newsstand version of a number of major magazines back when "readable on iPad" was thought to be the future of print media. Exact Editions explored a "read for free" feature where partner magazines would be freely accessible to users at partnering locations like coffee shops and book stores. This does not seem to have been a success, but using the proximity of an iBeacon to unlock some paywalled media is at least a little creative, if probably ill-advised given the security issues we'll discuss later.

The world of applications raises interesting questions about the other half of the mobile ecosystem: how did this all work on Android? iOS has built-in support for iBeacons. An operating system service scans for iBeacons and dispatches notifications to apps as appropriate. On Android, there has never been this type of OS-level support, but Android apps have access to relatively rich low-level Bluetooth functionality and can easily scan for iBeacons themselves. Several popular libraries exist for this purpose, and it's not unusual for them to be used to give ported cross-platform apps more or less equivalent functionality. These apps do need to run in the background if they're to notify the user proactively, but especially back in 2013 Android was far more generous about background work than iOS.
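As a rough sketch of what those libraries do under the hood, using the standard android.bluetooth.le classes (imports, permissions, and lifecycle handling omitted; IBeaconAdvertisement is the illustrative parser from earlier):

// Scan for Apple manufacturer frames that carry the iBeacon prefix.
void startBeaconScan(BluetoothAdapter adapter) {
    ScanFilter appleFilter = new ScanFilter.Builder()
            .setManufacturerData(0x004C,                     // Apple's company ID
                    new byte[] {0x02, 0x15},                 // iBeacon subtype + length
                    new byte[] {(byte) 0xFF, (byte) 0xFF})   // match both prefix bytes
            .build();
    ScanSettings settings = new ScanSettings.Builder()
            .setScanMode(ScanSettings.SCAN_MODE_LOW_LATENCY)
            .build();
    ScanCallback callback = new ScanCallback() {
        @Override
        public void onScanResult(int callbackType, ScanResult result) {
            if (result.getScanRecord() == null) return;
            byte[] mfg = result.getScanRecord().getManufacturerSpecificData(0x004C);
            IBeaconAdvertisement beacon = IBeaconAdvertisement.parse(mfg);
            if (beacon != null) {
                Log.d("iBeacon", beacon.proximityUuid + " " + beacon.major + "/"
                        + beacon.minor + " rssi=" + result.getRssi());
            }
        }
    };
    adapter.getBluetoothLeScanner()
            .startScan(Collections.singletonList(appleFilter), settings, callback);
}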

iBeacons found expanded success through ShopKick, a retail loyalty platform that installed iBeacons in locations of some major retailers like American Eagle. These powered location-based advertising and offers in the ShopKick app as well as retailer-specific apps, which was kind of the start of a larger, more seamless network, but it doesn't seem to have caught on. Honestly, consumers just don't seem to want location-based advertising that much. Maybe because, when you're standing in an American Eagle, getting ads for products carried in the American Eagle is inane and irritating. iBeacons sort of foresaw Cooler Screens in this regard.

To be completely honest, I'm skeptical that anyone ever really believed in the location-based advertising thing. I mean, I don't know, the advertising industry is pretty good at self-deception, but I don't think there were ever any real signs of hyper-local smartphone-based advertising taking off. I think the play was always data collection, and advertising and special offers just provided a convenient cover story.

Real Applications

iBeacons are one of those technologies that feels like a flop from a consumer perspective but has, in actuality, enjoyed surprisingly widespread deployments. The reason, of course, is data mining.

To Apple's credit, they took a set of precautions in the design of the iBeacon iOS features that probably felt sufficient in 2013. Despite the fact that a lot of journalists described iBeacons as being used to "notify a user to install an app," that was never actually a capability (a very similar-seeming iOS feature attached to Siri actually used conventional geofencing rather than iBeacons). iBeacons only did anything if the user already had an app installed that either scanned for iBeacons when in the foreground or registered for region notifications.

In theory, this limited iBeacons to companies with which consumers already had some kind of relationship. What Apple may not have foreseen, or perhaps simply accepted, is the incredible willingness of your typical consumer brand to sell that relationship to anyone who would pay.

iBeacons became, in practice, just another major advancement in pervasive consumer surveillance. The New York Times reported in 2019 that popular applications were including SDKs that reported iBeacon contacts to third-party consumer data brokers. This data became one of several streams that was used to sell consumer location history to advertisers.

It's a little difficult to assign blame and credit here. Apple, to their credit, kept iBeacon features in iOS relatively locked down, which suggests that they weren't trying to facilitate massive location surveillance. That said, Apple always marketed iBeacon to developers based on exactly this kind of consumer tracking and micro-targeting; they just intended for it to be done under the auspices of a single brand. That the industry would form data exchanges and recruit random apps into reporting everything in your proximity isn't surprising, but maybe Apple failed to foresee it.

They certainly weren't the worst offender. Apple's promotion of iBeacon opened the floodgates for everyone else to do the same thing. During 2014 and 2015, Facebook started offering Bluetooth beacons to businesses that were ostensibly supposed to facilitate in-app special offers (though I'm not sure that those ever really materialized) but were pretty transparently just a location data collection play.

Google jumped into the fray in their signature Google style, with an offering that was confusing, semi-secret, incoherently marketed, and short-lived. Google's Project Beacon, or Google My Business, also shipped free Bluetooth beacons out to businesses to give Android location services a boost. Google My Business seems to have been the source of a fair amount of confusion even then, and we can virtually guarantee that (as reporters speculated at the time) Google was intentionally vague and evasive about the system to avoid negative attention from privacy advocates.

In the case of Facebook, well, they don't have the level of opsec that Google does so things are a little better documented:

Leaked documents show that Facebook worried that users would 'freak out' and spread 'negative memes' about the program. The company recently removed the Facebook Bluetooth beacons section from their website.

The real deployment of iBeacons and closely related third-party iBeacon-like products [1] occurred at massive scale but largely in secret. It became yet another dark project of the advertising-industrial complex, perhaps the most successful yet of a long-running series of retail consumer surveillance systems.

Payments

One interesting thing about iBeacon is how it was compared to NFC. The two really aren't that similar, especially considering the vast difference in usable ranges, but NFC was the first radio technology to be adopted for "location marketing" applications. "Tap your phone to see our menu," kinds of things. Back in 2013, Apple had rather notably not implemented NFC in its products, despite its increasing adoption on Android.

But, there is much more to this story than learning about new iPads and getting a surprise notification that you are eligible for a subsidized iPhone upgrade. What we're seeing is Apple pioneering the way mobile devices can be utilized to make shopping a better experience for consumers. What we're seeing is Apple putting its money where its mouth is when it decided not to support NFC. (MacObserver)

Some commentators viewed iBeacon as Apple's response to NFC, and I think there's more to that than you might think. In early marketing, Apple kept positioning iBeacon for payments. That's a little weird, right, because iBeacons are a purely one-way broadcast system.

Still, part of Apple's flagship iBeacon implementation was a payment system:

Here's how he describes the purchase he made there, using his iPhone and the EasyPay system: "We started by using the iPhone to scan the product barcode and then we had to enter our Apple ID, pretty much the way we would for any online Apple purchase [using the credit card data on file with one's Apple account]. The one key difference was that this transaction ended with a digital receipt, one that we could show to a clerk if anyone stopped us on the way out."

Apple Wallet only kinda-sorta existed at the time, although Apple was clearly already midway into a project to expand into consumer payments. It says a lot about this point in time in phone-based payments that several reporters talk about iBeacon payments as a feature of iTunes, since Apple was mostly implementing general-purpose billing by bolting it onto iTunes accounts.

It seems like what happened is that Apple committed to developing a pay-by-phone solution, but decided against NFC. To be competitive with other entrants in the pay-by-phone market, they had to come up with some kind of technical solution to interact with retail POS, and iBeacon was their choice. From a modern perspective this seems outright insane; like, Bluetooth broadcasts are obviously not the right way to initiate a payment flow, and besides, there's a whole industry-standard stack dedicated to that purpose... built on NFC.

But remember, this was 2013! EMV was not yet in meaningful use in the US; several major banks and payment networks had just committed to rolling it out in 2012 and every American can tell you that the process was long and torturous. Because of the stringent security standards around EMV, Android devices did not implement EMV until ARM secure enclaves became widely available. EMVCo, the industry body behind EMV, did not have a certification program for smartphones until 2016.

Android phones offered several "tap-to-pay" solutions, from Google's frequently rebranded Google Wallet^w^wAndroid Pay^w^wGoogle Wallet to Verizon's embarrassingly rebranded ISIS^wSoftcard and Samsung Pay. All of these initially relied on proprietary NFC protocols with bespoke payment terminal implementations. This was sketchy enough, and few enough phones actually had NFC, that the most successful US pay-by-phone implementations like Walmart's and Starbucks' used barcodes for communication. It would take almost a decade before things really settled down and smartphones all just implemented EMV.

So, in that context, Apple's decision isn't so odd. They must have figured that iBeacon could solve the same "initial handshake" problem as Walmart's QR codes, but more conveniently and using radio hardware that they already included in their phones. iBeacon-based payment flows used the iBeacon only to inform the phone of what payment devices were nearby, everything else happened via interaction with a cloud service or whatever mechanism the payment vendor chose to implement. Apple used their proprietary payments system through what would become your Apple Account, PayPal slapped together an iBeacon-based fast path to PayPal transfers, etc.

I don't think that Apple's iBeacon-based payments solution ever really shipped. It did get some use, most notably by Apple, but these all seem to have been early-stage implementations, and the complete end-to-end SDK that a lot of developers expected never landed.

You might remember that this was a very chaotic time in phone-based payments, solutions were coming and going. When Apple Pay was properly announced a year after iBeacons, there was little mention of Bluetooth. By the time in-store Apple Pay became common, Apple had given up and adopted NFC.

Limitations

One of the great weaknesses of iBeacon was the security design, or lack thereof. iBeacon advertisements were sent in plaintext with no authentication of any type. This did, of course, radically simplify implementation, but it also made iBeacon untrustworthy for any important purpose. It is quite trivial, with a device like an Android phone, to "clone" any iBeacon and transmit its identifiers wherever you want. This problem might have killed off the whole location-based-paywall-unlocking concept had market forces not already done so. It also opens the door to a lot of nuisance attacks on iBeacon-based location marketing, which may have limited the depth of iBeacon features in major apps.

iBeacon was also positioned as a sort of local positioning system, but it really wasn't. iBeacon offers no actual time-of-flight measurements, only RSSI-based estimation of range. Even with correct on-site calibration (which can be aided by adjusting a fixed RSSI-range bias value included in some iBeacon advertisements) this type of estimation is very inaccurate, and in my little experiments with a Bluetooth beacon location library I can see swings from 30m to 70m estimated range based only on how I hold my phone. iBeacon positioning has never been accurate enough to do more than assert whether or not a phone is "near" the beacon, and "near" can take on different values depending on the beacon's transmit power.
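For what it's worth, the usual estimate (roughly what beacon libraries implement; Apple's exact formula is undocumented) is a log-distance path-loss model driven by the measured-power byte in the advertisement:

// Log-distance path-loss estimate: measuredPower is the calibrated RSSI at
// 1 m carried in the iBeacon frame; n is an environment-dependent exponent
// (about 2 in free space, 2-4 indoors). Note how a modest RSSI swing moves
// the estimate by a large factor, which is exactly the instability above.
static double estimateRangeMeters(int rssi, int measuredPower, double n) {
    return Math.pow(10.0, (measuredPower - rssi) / (10.0 * n));
}

// With measuredPower = -59 dBm and n = 2.5:
//   rssi -59  ->  about 1 m
//   rssi -75  ->  about 4 m
//   rssi -85  ->  about 11 m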

Developers have long looked towards Bluetooth as a potential local positioning solution, and it's never quite delivered. The industry is now turning towards Ultra-Wideband or UWB technology, which combines a high-rate, high-bandwidth radio signal with a time-of-flight radio ranging protocol to provide very accurate distance measurements. Apple is, once again, a technical leader in this field and UWB radios have been integrated into the iPhone 11 and later.

Senescence

iBeacon arrived to some fanfare, quietly proliferated in the shadows of the advertising industry, and then faded away. The Wikipedia article on iBeacons hasn't really been updated since support on Windows Phone was relevant. Apple doesn't much talk about iBeacons any more, and their compatriots Facebook and Google both sunset their beacon programs years ago.

Part of the problem is, well, the pervasive surveillance thing. The idea of Bluetooth beacons cooperating with your phone to track your every move proved unpopular with the public, and so progressively tighter privacy restrictions in mobile operating systems and app stores have clamped down on every grocery store app selling location data to whatever broker bids the most. I mean, they still do, but it's gotten harder to use Bluetooth as an aid. Even Android, the platform of "do whatever you want in the background, battery be damned," strongly discourages Bluetooth scanning by non-foreground apps.

Still, the basic technology remains in widespread use. BLE beacons have absolutely proliferated, there are plenty of apps you can use to list nearby beacons and there almost certainly are nearby beacons. One of my cars has, like, four separate BLE beacons going on all the time, related to a phone-based keyless entry system that I don't think the automaker even supports any more. Bluetooth beacons, as a basic primitive, are so useful that they get thrown into all kinds of applications. My earbuds are a BLE beacon, which the (terrible, miserable, no-good) Bose app uses to detect their proximity when they're paired to another device. A lot of smart home devices like light bulbs are beacons. The irony, perhaps, of iBeacon-based location tracking is that it's a victim of its own success. There is so much "background" BLE beacon activity that you scarcely need to add purpose-built beacons to track users, and only privacy measures in mobile operating systems and the beacons themselves (some of which rotate IDs) save us.

Apple is no exception to the widespread use of Bluetooth beacons: iBeacon lives on in virtually every Apple device. If you do try out a Bluetooth beacon scanning app, you'll discover pretty much every Apple product in a 30 meter radius. From MacBooks Pro to AirPods, almost all Apple products transmit iBeacon advertisements to their surroundings. These are used for the initial handshake process of peer-to-peer features like AirDrop, and Find My/AirTag technology seems to be derived from the iBeacon protocol (in the sense that anything can be derived from such a straightforward design). Of course, pretty much all of these applications now randomize identifiers to prevent passive use of device advertisements for long-term tracking.

Here's some good news: iBeacons are readily available in a variety of form factors, and they are very cheap. Lots of libraries exist for working with them. If you've ever wanted some sort of location-based behavior for something like home automation, iBeacons might offer a good solution. They're neat, in an old technology way. Retrotech from the different world of 2013.

It's retro in more ways than one. It's funny, and a bit quaint, to read the contemporary privacy concerns around iBeacon. If only they had known how bad things would get! Bluetooth beacons were the least of our concerns.

[1] Things can be a little confusing here because the iBeacon is such a straightforward concept, and Apple's implementation is so simple. We could define "iBeacon" as including only officially endorsed products from Apple affiliates, or as including any device that behaves the same as official products (e.g. by using the iBeacon BLE advertisement type codes), or as any device that is performing substantially the same function (but using a different advertising format). I usually mean the latter of these three as there isn't really much difference between an iBeacon and ten million other BLE beacons that are doing the same thing with a slightly different identifier format. Facebook and Google's efforts fall into this camp.

2025-04-18 white alice

When we last talked about Troposcatter, it was Pole Vault. Pole Vault was the first troposcatter communications network, on the east coast of Canada. It would not be alone for long. By the time the first Pole Vault stations were complete, work was already underway on a similar network for Alaska: the White Alice Communication System, WACS.

USACE illustration of troposcatter antennas

Alaska has long posed a challenge for communications. In the 1860s, Western Union wanted to extend their telegraph network from the United States into Europe. Although the technology would be demonstrated shortly after, undersea telegraph cables were still notional and it seemed that a route that minimized the ocean crossing would be preferable---of course, that route maximized the length on land, stretching through present-day Alaska and Siberia on each side of the Bering Strait. This task proved more formidable than Western Union had imagined, and the first transatlantic telegraph cable (on a much more southerly crossing) was completed before the arctic segments of the overland route. The "Western Union Telegraph Expedition" abandoned its work, leaving a telegraph line well into British Columbia that would serve as one of the principal communications assets in the region for decades after.

This ill-fated telegraph line failed to link San Francisco to Moscow, but its aftermath included a much larger impact on Russian interests in North America: the purchase of Alaska in 1867. Shortly after, the US military began its expansion into the new frontier. The Army Signal Corps, mostly to fulfill its function in observing the weather, built and staffed small installations that stretched further and further west. Later, in the 1890s, a gold rush brought a sudden influx of American settlers to Alaska's rugged terrain. The sudden economic importance of the Klondike, and the rather colorful personalities of the prospectors looking to exploit it, created a much larger need for a military presence. Fortuitously, many of the forts present had been built by the Signal Corps, which had already started on lines of communication. Construction was difficult, though, and without Alaskan communications as a major priority there was only minimal coverage.

Things changed in 1900, when Congress appropriated a substantial budget to the Washington-Alaska Military Cable and Telegraph System. The Signal Corps set on Alaska like, well, like an army, and extensive telegraph and later telephone lines were built to link the various military outposts. Later renamed the Alaska Communications System, these cables brought the first telecommunication to much of Alaska. The arrival of the telegraph was quite revolutionary for remote towns, who could now receive news in real-time that had previously been delayed by as much as a year [1]. Telegraphy was important to civilians as well, something that Congress had anticipated: The original act authorizing the Alaska Communications System dictated that it would carry commercial traffic as well. The military had an unusual role in Alaska, and one aspect of it was telecommunications provider.

In 1925, an outbreak of diphtheria began to claim the lives of children in Nome, a town in far western Alaska on the Seward Peninsula. The daring winter delivery of antidiphtheria serum by dog sled is widely remembered due to its tangential connection to the Iditarod, but there were two sides of the "serum run." The message from Nome's sole doctor requesting the urgent shipment was transmitted from Nome to the Public Health Service in DC over the Alaska Communications System. It gives us some perspective on the importance of the telegraph in Alaska that the 600 mile route to Nome took five days and many feats of heroism---but at the same time could be crossed instantaneously by telegrams.

The Alaska Communications System included some use of radio from the beginning. A pair of HF radio stations specifically handled traffic for Nome, covering a 100-mile stretch too difficult for even the intrepid Signal Corps. While not a totally new technology to the military, radio was quite new to the telegraph business, and the ACS to Nome was probably the first commercial radiotelegraph system on the continent. By the 1930s, the condition of the Alaskan telegraph cables had decayed while demand for telephony had increased. Much of ACS was upgraded and modernized to medium-frequency radiotelephone links. In towns small and large, even in Anchorage itself, the sole telephone connection to the contiguous United States was an ACS telephone installed in the general store.

Alaskan communications became an even greater focus of the military with the onset of the Second World War. Six months after Pearl Harbor, the Japanese attacked Fort Mears in the Aleutian Islands. Fort Mears had no telecommunications connections, so despite the proximity of other airbases, support was slow to come. The lack of a telegraph or telephone line contributed to 43 deaths and focused attention on the ACS. By 1944, the Army Signal Corps had a workforce of 2,000 dedicated to Alaska.

WWII brought more than one kind of attention to Alaska. Several Japanese assaults on the Aleutian Islands represented the largest threats to American soil outside of Pearl Harbor, showing both Alaska's vulnerability and the strategic importance given to it by its relative proximity to Eurasia. WWII ended but, in 1949, the USSR demonstrated an atomic weapon. A combination of Soviet expansionism and the new specter of nuclear war turned military planners towards air defense. Like the Canadian Maritimes in the East, Alaska covered a huge swath of the airspace through which Soviet bombers might approach the US. Alaska was, once again, a battleground.

USAF photo

The early Cold War military buildup of Alaska was particularly heavy on air defense. During the late '40s and early '50s, more than a dozen new radar and control sites were built. The doctrine of ground-controlled interception requires real-time communication between radar centers, stressing the limited number of voice channels available on the ACS. As early as 1948, the Signal Corps had begun experiments to choose an upgrade path. Canadian early-warning radar networks, including the Distant Early Warning Line, were on the drawing board and would require many communications channels in particularly remote parts of Alaska.

Initially, point-to-point microwave was used in relatively favorable terrain (where the construction of relay stations about every 50 miles was practical). For the more difficult segments, the Signal Corps found that VHF radio could provide useful communications at ranges over 100 miles. VHF radiotelephones were installed at air defense radar stations, but there was a big problem: the airspace surveillance radar of the 1950s also operated in the VHF band, and caused so much interference with the radiotelephones that they were difficult to use. The radar stations were probably the most important users of the network, so VHF would have to be abandoned.

In 1954, a military study group was formed to evaluate options for the ACS. That group, in turn, requested a proposal from AT&T. Bell Laboratories had been involved in the design and evaluation of Pole Vault, the first sites of which had been completed two years before, so they naturally positioned troposcatter as the best option.

It is worth mentioning the unusual relationship AT&T had with Alaska, or rather, the lack of one. While the Bell System enjoyed a monopoly on telephony in most of the United States [2], they had never expanded into Alaska. Alaska was only a territory, after all, and a very sparsely populated one at that. The paucity of long-distance leads to or from Alaska (only one connected to Anchorage, for example) limited the potential for integration of Alaska into the broader Bell System anyway. Long-distance telecommunications in Alaska were a military project, and AT&T was involved only as a vendor.

Because of the high cost of troposcatter stations, proven during Pole Vault construction, a hybrid was proposed: microwave stations could be spaced every 50 miles along the road network, while troposcatter would cover the long stretches without roads.

In 1955, the Signal Corps awarded Western Electric a contract for the White Alice Communications System. The Corps of Engineers surveyed the locations of 31 sites, verifying each by constructing a temporary antenna tower. The Corps of Engineers led construction of the first 11 sites, and the final 20 were built on contract by Western Electric itself. All sites used radio equipment furnished by Western Electric and were built to Western Electric designs.

Construction was far from straightforward. Difficult conditions delayed completion of the original network until 1959, two years later than intended. A much larger issue, though, was the budget. The original WACS was expected to cost $38 million. By the time the first 31 sites were complete, the bill totaled $113 million---equivalent to over a billion dollars today. Western Electric had underestimated not only the complexity of the sites but the difficulty of their construction. A WECo report read:

On numerous occasions, the men were forced to surrender before the onslaught of cold, wind and snow and were immobilized for days, even weeks. This ordeal of waiting was oft times made doubly galling by the knowledge that supplies and parts needed for the job were only a few miles distant but inaccessible because the white wall of winter had become impenetrable.

WACS initial capability included 31 stations, of which 22 were troposcatter and the remainder only microwave (using Western Electric's TD-2). A few stations were equipped with both troposcatter and microwave, serving as relays between the two carriers.

In 1958, construction started on the Ballistic Missile Early Warning System or BMEWS. BMEWS was a long-range radar system intended to provide early warning of a Soviet missile attack. BMEWS would provide as little as 15 minutes of warning, requiring that alerts reach NORAD in Colorado as quickly as possible. One BMEWS set was installed in Greenland, where the Pole Vault system was expanded to provide communications. Similarly, the BMEWS set at Clear Missile Early Warning Station in central Alaska relied on White Alice. Planners were concerned about the ability of the Soviet Union to suppress an alert by destroying infrastructure, so two redundant chains of microwave sites were added to White Alice. One stretched from Clear to Ketchikan where it connected to an undersea cable to Seattle. The other went east, towards Canada, where it met existing telephone cables on the Alaska Highway.

A further expansion of White Alice started the next year, in 1959. Troposcatter sites were extended through the Aleutian Islands in "Project Stretchout" to serve new DEW Line stations. During the 1960s, existing WACS sites were expanded and new antennas were installed at Air Force installations. These were generally microwave links connecting the airbases to existing troposcatter stations.

In total, WACS reached 71 sites. Four large sites served as key switching points with multiple radio links and telephone exchanges. Pedro Dome, for example, had a 15,000 square foot communications building with dormitories, a power plant, and extensive equipment rooms. Support facilities included a vehicle maintenance building, storage warehouse, and extensive fuel tanks. A few WACS sites even had tramways for access between the "lower camp" (where equipment and personnel were housed) and the "upper camp" (where the antennas were located)... although they apparently did not fare well in the Alaskan conditions.

While Western Electric had initially planned for six people and 25 kW of power at each station, the final requirements were typically 20 people and 120-180 kW of generator capacity. Some sites stored over half a million gallons of fuel---conditions often meant that resupply was only possible during the summer.

Besides troposcatter and microwave radios, the equipment included tandem telephone exchanges. These are described in a couple of documents as "ATSS-4A," ATSS standing for Alaska Telephone Switching System. Based on the naming and some circumstantial evidence, I believe these were Western Electric 4A crossbar exchanges. They were later incorporated into AUTOVON, but also handled commercial long-distance traffic between Alaskan towns.

With troposcatter come large antennas, and depending on connection lengths, WACS troposcatter antennas ranged from 30' dishes to 120' "billboard" antennas similar to those seen at Pole Vault sites. The larger antennas handled up to 50 kW of transmit power. Some 60' and 120' antennas included their own fuel tanks and steam plants that heated the antennas through winter to minimize snow accumulation.

Nearly all of the equipment used by WACS was manufactured by Western Electric, with a lot of reuse of standard telephone equipment. For example, muxing on the troposcatter links used standard K-carrier (originally for telephone cables) and L-carrier (originally for coaxial cables). Troposcatter links operated at about 900 MHz with a wide bandwidth, and carried two L-carrier supergroups (60 channels each) and one K-carrier group (12 channels) for a nominal capacity of 132 channels, although most stations did not have fully-populated L-carrier groups so actual capacity varied based on need. This was standard telephone carrier equipment in widespread use on the long-distance network, but some output modifications were made to suit the radio application.

The exception to the Western Electric rule was the radio sets themselves. They were manufactured by Radio Engineering Laboratories, the same company that built the radios for Pole Vault. REL pulled out all of the tricks they had developed for Pole Vault, and the longer WACS links used two antennas at different positions for space diversity. Each antenna had two feed horns of orthogonal polarization, matching similar dual transmitters for further diversity. REL equipment selected the best signal of the four available receiver options.

USAF photo

WACS represented an enormous improvement in Alaskan communications. The entire system was multi-channel with redundancy in many key parts of the network. Outside of the larger cities, WACS often brought the first usable long-distance telephone service. Even in Anchorage, WACS provided the only multi-channel connection. Despite these achievements, WACS was set for much the same fate as other troposcatter systems: obsolescence after the invention of communications satellites.

The experimental satellites Telstar 1 and 2 launched in the early 1960s, and the military began a shift towards satellite communications shortly after. Besides, the formidable cost of WACS had become a political issue. Maintenance of the system overran estimates by just as much as construction, and placing this cost on taxpayers was controversial since much of the traffic carried by the system consisted of regular commercial telephone calls. Moreover, a general reluctance to allocate money to WACS had led to decay of the system. WACS capacity was insufficient for the rapidly increasing long-distance telephone traffic of the '60s, and due to decreased maintenance funding, reliability was beginning to decline.

The retirement of a Cold War communications system is not unusual, but the particular fate of WACS is. It entered a long second life.

After acting as the sole long-distance provider for 60 years, the military began its retreat. In 1969, Congress passed the Alaska Communications Disposal Act. It called for complete divestment of the Alaska Communications System and WACS, to a private owner determined by a bidding process. Several large independent communications companies bid, but the winner was RCA. Committing to a $28.5 million purchase price followed by $30 million in upgrades, RCA reorganized the Alaska Communications System as RCA Alascom.

Transfer of the many ACS assets from the military to RCA took 13 years, involving both outright transfer of property and complex lease agreements on sites colocated with military installations. RCA's interest in Alaskan communications was closely connected to the coming satellite revolution: RCA had just built the Bartlett Earth Station, the first satellite ground station in Alaska. While Bartlett was originally an ACS asset owned by the Signal Corps, it became just the first of multiple ground stations that RCA would build for Alascom. Several of the new ground stations were colocated with WACS sites, establishing satellite as an alternative to the troposcatter links.

Alascom appears to have been the first domestic satellite voice network in commercial use, initially relying on a Canadian communications satellite [3]. In 1975 and 1976, SATCOM 1 and 2 launched. These were not the first commercial communications satellites, but they represented a significant increase in capacity over previous commercial designs and are sometimes thought of as the true beginning of the satellite communications era. Both were built and owned by RCA, and Alascom took advantage of the new transponders.

At the same time, Alascom launched a modernization effort. 22 of the former WACS stations were converted to satellite ground stations, a project that took much of the '70s as Alascom struggled with the same conditions that had made WACS so challenging to begin with. Modernization also included the installation of DMS-10 telephone switches and conversion of some connections to digital.

A series of regulatory and business changes in the 1970s led RCA to step away from the domestic communications industry. In 1979, Alascom sold to Pacific Power and Light, this time for $200 million and $90 million in debt. PP&L continued on much the same trajectory, expanding the Alascom system to over 200 ground stations and launching the satellite Aurora I---the first of a small series of satellites that gave Alaska the distinction of being the only state with its own satellite communications network. From the '70s through the '00s, large parts of Alaska relied on satellite relay for calls between towns.

In a slight twist of irony considering its long lack of interest in the state, AT&T purchased parts of Alascom from PP&L in 1995, forming AT&T Alascom which has faded away as an independent brand. Other parts of the former ACS network, generally non-toll (or non-long-distance) operations, were split off into then PP&L subsidiary CenturyTel. While CenturyTel has since merged into CenturyLink, the Alaskan assets were first sold to Alaska Communications. Alaska Communications considers itself the successor of the ACS heritage, giving them a claim to over 100 years of communications history.

As electronics technology has continued to improve, penetration of microwave relays into inland Alaska has increased. Fewer towns rely on satellite today than in the 1970s, and the half-second latency to geosynchronous orbit is probably not missed. Alaska communications have also become more competitive, with long-distance connectivity available from General Communications (GCI) as well as AT&T and Alaska Communications.

Still, the legacy of Alaska's complex and expensive long-distance infrastructure still echoes in our telephone bills. State and federal regulators have allowed for extra fees on telephone service in Alaska and calls into Alaska, both intended to offset the high cost of infrastructure. Alaska is generally the most expensive long-distance calling destination in the United States, even when considering the territories.

But what of White Alice?

The history of the Alaska Communications System's transition to private ownership is complex and not especially well documented. While RCA's winning bid following the Alaska Communications Disposal Act set the big picture, the actual details of the transition were established by many individual negotiations spanning over a decade. Depending on the station, WACS troposcatter sites generally conveyed to RCA in 1973 or 1974. Some, colocated with active military installations, were leased rather than included in the sale. RCA generally decommissioned each WACS site once a satellite ground station was ready to replace it, either on-site or nearby.

For some WACS sites, this meant the troposcatter equipment was shut down in 1973. Others remained in use longer. The Boswell Bay troposcatter station seems to have been the last turned down, in 1985. The 1980s were decidedly the end of WACS. Alascom's sale to PP&L cemented the plan to shut down all troposcatter operations, and the 1980 Comprehensive Environmental Response, Compensation, and Liability Act led to the establishment of the Formerly Used Defense Sites (FUDS) program within DoD. Under FUDS, the Corps of Engineers surveyed the disused WACS sites and found nearly all had significant contamination by asbestos (used in seemingly every building material in the '50s and '60s) and leaked fuel oil.

As a result, most White Alice sites were demolished between 1986 and 1999. The cost of demolition and remediation in such remote locations was sometimes greater than the original construction. No WACS sites remain intact today.

USAF photo

Postscript:

A 1988 Corps of Engineers historical inventory of WACS, prepared due to the demolition of many of the stations, mentions that meteor burst communications might replace troposcatter. Meteor burst is a fascinating communications mode, similar in many ways to troposcatter but with the twist that the reflecting surface is not the troposphere but the ionized trail of meteors entering the atmosphere. Meteor burst connections only work when there is a meteor actively vaporizing in the upper atmosphere, but atmospheric entry of small meteors is common enough that meteor burst communications are practical for low-rate packetized communications. For example, meteor burst has been used for large weather and agricultural telemetry systems.

The Alaska Meteor Burst Communications System was implemented in 1977 by several federal agencies, and was used primarily for automated environmental telemetry. Unlike most meteor burst systems, though, it seems to have been used for real-time communications by the BLM and FAA. I can't find much information, but they seem to have built portable teleprinter terminals for this use.

Even more interesting, the Air Force's Alaskan Air Command built its own meteor burst network around the same time. This network was entirely for real-time use, and demonstrated the successful transmission of radar track data from radar stations across the state to Elmendorf Air Force base. Even better, the Air Force experimented with the use of meteor burst for intercept control by fitting aircraft with a small speech synthesizer that translated coded messages into short phrases. The Air Force experimented with several meteor burst systems during the Cold War, anticipating that it might be a survivable communications system in wartime. More details on these will have to fill a future article.

[1] Crews of the Western Union Telegraph Expedition reportedly continued work for a full year after the completion of the transatlantic telegraph cable, because news of it hadn't reached them yet.

[2] Eliding here some complexities like GTE and their relationship to the Bell System.

[3] Perhaps owing to the large size of the country and many geographical challenges to cable laying, Canada has often led North America in satellite communications technology.

Note: I have edited this post to add more information, a couple of hours after originally publishing it. I forgot about a source I had open in a tab. Sorry.

Error'd: Lucky Penny

High-roller Matthew D. fears Finance. "This is from our corporate expense system. Will they flag my expenses in the April-December quarter as too high? And do we really need a search function for a list of 12 items?"


Tightfisted Adam R. begrudges a trifling sum. "The tipping culture is getting out of hand. After I chose 'Custom Tip' for some takeout, they filled out the default tip with a few extra femtocents. What a rip!"


Cool Customer Reinier B. sums this up: "I got some free B&J icecream a while back. Since one of them was priced at €0.01, the other one obviously had to cost zero point minus 1 euros to make a total of zero euro. Makes sense. Or probably not."


An anonymous browniedad is ready to pack his poptart off for the summer. "I know {First Name} is really excited for camp..." Kudos on getting Mom to agree to that name choice!


Finally, another anonymous assembler's retrospective visualisation. "CoPilot rendering a graphical answer of the semantics of a pointer. Point taken. " There's no error'd here really, but I'm wondering how long before this kind of wtf illustration lands somewhere "serious".


[Advertisement] Keep the plebs out of prod. Restrict NuGet feed privileges with ProGet. Learn more.

CodeSOD: Recasting the Team

Nina's team has a new developer. They're not a junior developer, though Nina wishes they could replace this developer with a junior. Inexperience is better than whatever this Java code is.

Object[] test = (Object[]) options;
List<SchedulePlatform> schedulePlatformList = (List<SchedulePlatform>)((Object[])options)[0];
List<TableColumn> visibleTableCols = (List<TableColumn>)((Object[])options)[1];

We start by casting options into an array of Objects. That's already a code stench, but we actually don't even use the test variable and instead just redo the cast multiple times.

But worse than that, we cast to an array of Object, access an element, and then cast that element to a collection type. I do not know what is in the options variable, but based on how it gets used, I don't like it. What it seems to be is a class (holding different options as fields) rendered as an array (holding different options as elements).

The new developer (ab)uses this pattern everywhere.
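
If options really is just a bundle of those two lists- an assumption, since all we see is how it gets used- the conventional fix is to give that bundle a type and let the compiler do the checking instead of a pile of casts. A minimal sketch:

// Sketch only: a typed holder for the values currently smuggled around in Object[].
// Field names are assumptions based on how the array elements get used; assumes
// java.util.List and the project's own SchedulePlatform and TableColumn types.
public class ScheduleOptions {
    private final List<SchedulePlatform> schedulePlatforms;
    private final List<TableColumn> visibleTableColumns;

    public ScheduleOptions(List<SchedulePlatform> schedulePlatforms,
                           List<TableColumn> visibleTableColumns) {
        this.schedulePlatforms = schedulePlatforms;
        this.visibleTableColumns = visibleTableColumns;
    }

    public List<SchedulePlatform> getSchedulePlatforms() { return schedulePlatforms; }
    public List<TableColumn> getVisibleTableColumns() { return visibleTableColumns; }
}

Anything that needs the options then takes a ScheduleOptions parameter, and every cast above simply disappears.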

[Advertisement] ProGet’s got you covered with security and access controls on your NuGet feeds. Learn more.

CodeSOD: Format Identified

Many nations have some form of national identification number, especially around taxes. Argentina is no exception.

Their "CUIT" (Clave Única de Identificación Tributaria) and "CUIL" (Código Único de Identificación Laboral) are formatted as "##-########-#".

Now, as datasets often don't store things in their canonical representation, Nick's co-worker was given a task: "given a list of numbers, reformat them to look like CUIT/CUIL." That co-worker went off for five days, and produced this Java function.

public String normalizarCuitCuil(String cuitCuilOrigen){
	String valorNormalizado = new String();
	
	if (cuitCuilOrigen == null || "".equals(cuitCuilOrigen) || cuitCuilOrigen.length() < MINIMA_CANTIDAD_ACEPTADA_DE_CARACTERES_PARA_NORMALIZAR){
		valorNormalizado = "";
	}else{
		StringBuilder numerosDelCuitCuil = new StringBuilder(13);
		cuitCuilOrigen = cuitCuilOrigen.trim();
		
		// Se obtienen solo los números:
		Matcher buscadorDePatron =  patternNumeros.matcher(cuitCuilOrigen);
		while (buscadorDePatron.find()){
			numerosDelCuitCuil.append(buscadorDePatron.group());
		}
		
		// Se le agregan los guiones:
		valorNormalizado = numerosDelCuitCuil.toString().substring(0,2) 
							+ "-"
							+ numerosDelCuitCuil.toString().substring(2,numerosDelCuitCuil.toString().length()-1) 
							+ "-"
							+ numerosDelCuitCuil.toString().substring(numerosDelCuitCuil.toString().length()-1, numerosDelCuitCuil.toString().length());
		
	}
	return valorNormalizado;
}

We start with a basic sanity check that the string exists and is long enough. If it isn't, we return an empty string, which already annoys me, because an empty result is not a good way to communicate "I failed to parse".

But assuming we have data, we construct a StringBuilder and trim whitespace. And already we have a problem: we validated the length before trimming, so a string padded out with enough whitespace could pass the check and still come up short once trimmed. Now, maybe we can assume the data is good, but the next line implies that we can't rely on that- they create a regex matcher to identify numeric values, and for each numeric value they find, they append it to our StringBuilder. This implies that the string may contain non-numeric values which need to be rejected, which means our length validation was still wrong.

So either the data is clean and we're overvalidating, or the data is dirty and we're validating in the wrong order.

But all of that's a preamble to a terrible abuse of string builders, where they discard all the advantages of using a StringBuilder by calling toString again and again and again. Now, maybe toString caches its result, or the JIT can optimize away the repeated calls, but either way the result is a particularly unreadable blob of slicing code.

Now, this is ugly, but at least it works, assuming the input data is good. It definitely should never pass a code review, but it's not the kind of bad code that leaves one waking up in the middle of the night in a cold sweat.

No, what gets me about this is that it took five days to write. And according to Nick, the responsible developer wasn't just slacking off or going to meetings the whole time, they were at their desk poking at their Java IDE and looking confused for all five days.

And of course, because it took so long to write the feature, management didn't want to waste more time on kicking it back via a code review. So voila: it got forced through and released to production since it passed testing.
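
For contrast, here is a minimal sketch of the same transformation without the StringBuilder gymnastics- keeping the original's behavior of returning an empty string on bad input, and assuming the same "strip the non-digits, then hyphenate" intent. The guard value and pattern name are illustrative, not the original constants:

// Sketch only; requires java.util.regex.Pattern.
private static final Pattern NON_DIGITS = Pattern.compile("\\D");

public String normalizarCuitCuil(String cuitCuilOrigen) {
    if (cuitCuilOrigen == null) {
        return "";
    }
    String digits = NON_DIGITS.matcher(cuitCuilOrigen.trim()).replaceAll("");
    if (digits.length() < 3) { // illustrative minimum; the original uses a named constant
        return "";
    }
    // ##-########-#: two digits, the middle block, then the check digit.
    return digits.substring(0, 2) + "-"
            + digits.substring(2, digits.length() - 1) + "-"
            + digits.substring(digits.length() - 1);
}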

[Advertisement] Keep all your packages and Docker containers in one place, scan for vulnerabilities, and control who can access different feeds. ProGet installs in minutes and has a powerful free version with a lot of great features that you can upgrade when ready. Learn more.

The Missing Link of Ignorance

Our anonymous submitter, whom we'll call Craig, worked for GlobalCon. GlobalCon relied on an offshore team on the other side of the world for adding/removing users from the system, support calls, ticket tracking, and other client services. One day at work, an urgent escalated ticket from Martin, the offshore support team lead, fell into Craig's queue. Seated before his cubicle workstation, Craig opened the ticket right away:

[Image: a fictional example of a parcel delivery SMS phishing message]

The new GlobalCon support website is not working. Appears to have been taken over by ChatGPT. The entire support team is blocked by this.

Instead of feeling any sense of urgency, Craig snorted out loud from perverse amusement.

"What was that now?" The voice of Nellie, his coworker, wafted over the cubicle wall that separated them.

"Urgent ticket from the offshore team," Craig replied.

"What is it this time?" Nellie couldn't suppress her glee.

"They're dead in the water because the new support page was, quote, taken over by ChatGPT."

Nellie laughed out loud.

"Hey! I know humor is important to surviving this job." A level, more mature voice piped up behind Craig from the cube across from his. It belonged to Dana, his manager. "But it really is urgent if they're all blocked. Do your best to help, escalate to me if you get stuck."

"OK, thanks. I got this," Craig assured her.

He was already 99.999% certain that no part of their web domain had gone down or been conquered by a belligerent AI, or else he would've heard of it by now. To make sure, Craig opened support.globalcon.com in a browser tab: sure enough, it worked. Martin had supplied no further detail, no logs or screenshots or videos, and no steps to reproduce, which was sadly typical of most of these escalations. At a loss, Craig took a screenshot of the webpage, opened the ticket, and posted the following: Everything's fine on this end. If it's still not working for you, let's do a screenshare.

Granted, a screensharing session was less than ideal given the 12-hour time difference. Craig hoped that whatever nefarious shenanigans ChatGPT had allegedly committed were resolved by now.

The next day, Craig received an update. Still not working. The entire team is still blocked. We're too busy to do a screenshare, please resolve ASAP.

Craig checked the website again with both laptop and phone. He had other people visit the website for him, trying different operating systems and web browsers. Every combination worked. Two things mystified him: how was the entire offshore team having this issue, and how were they "too busy" for anything if they were all dead in the water? At a loss, Craig attached an updated screenshot to the ticket and typed out the best CYA response he could muster. The new support website is up and has never experienced any issues. With no further proof or steps to reproduce this, I don't know what to tell you. I think a screensharing session would be the best thing at this point.

The next day, Martin parroted his last message almost word for word, except this time he assented to a screensharing session, suggesting the next morning for himself.

It was deep into the evening when Craig set up his work laptop on his kitchen counter and started a call and session for Martin to join. "OK. Can you show me what you guys are trying to do?"

To his surprise, he watched Martin open up Microsoft Teams first thing. From there, Martin accessed a chat to the entire offshore support team from the CPO of GlobalCon. The message proudly introduced the new support website and outlined the steps for accessing it. One of those steps was to visit support.globalcon.com.

The web address was rendered as blue outlined text, a hyperlink. Craig observed Martin clicking the link. A web browser opened up. Lo and behold, the page that finally appeared was www.chatgpt.com.

Craig blinked with surprise. "Hang on! I'm gonna take over for a second."

Upon taking control of the session, Craig switched back to Teams and accessed the link's details. The link text was correct, but the link destination was ChatGPT. It seemed like a copy/paste error that the CPO had tried to fix, not realizing that they'd needed to do more than simply update the link text.

"This looks like a bad link," Craig said. "It got sent to your entire team. And all of you have been trying to access the support site with this link?"

"Correct," Martin replied.

Craig was glad he couldn't be seen frowning and shaking his head. "Lemme show you what I've been doing. Then you can show everyone else, OK?"

After surrendering control of the session, Craig patiently walked Martin through the steps of opening a web browser, typing support.globalcon.com into the address bar, and hitting Return. The site opened without any issue. From there, Craig taught Martin how to create a bookmark for it.

"Just click on that from now on, and it'll always take you to the right place," Craig said. "In the future, before you click on any hyperlink, make sure you hover your mouse over it to see where it actually goes. Links can be labeled one thing when they actually take you somewhere else. That's how phishing works."

"Oh," Martin said. "Thanks!"

The call ended on a positive note, but left Craig marveling at the irony of lecturing the tech support lead on Internet 101 in the dead of night.

[Advertisement] Picking up NuGet is easy. Getting good at it takes time. Download our guide to learn the best practice of NuGet for the Enterprise.

Classic WTF: Superhero Wanted

It's a holiday in the US today, so we're taking a long weekend. We flip back to a classic story of a company wanting to fill 15 different positions by hiring only one person. It's okay, Martin handles the database. Original - Remy

A curious email arrived in Phil's Inbox. "Windows Support Engineer required. Must have experience of the following:" and then a long list of Microsoft products.

Phil frowned. The location was convenient; the salary was fine, just the list of software seemed somewhat intimidating. Nevertheless, he replied to the agency saying that he was interested in applying for the position.

A few days later, Phil met Jason, the guy from the recruitment agency, in a hotel foyer. "It's a young, dynamic company," the recruiter explained. "They're growing really fast. They've got tons of funding and their BI Analysis Suite is positioning them to be a leading player in their field."

Phil nodded. "Ummm, I'm a bit worried about this list of products," he said, referring to the job description. "I've never dealt with Microsoft Proxy Server 1.0, and I haven't dealt with Windows 95 OSR2 for a long while."

"Don't worry," Jason assured, "The Director is more an idea man. He just made a list of everything he's ever heard of. You'll just be supporting Windows Server 2003 and their flagship application."

Phil winced. He was a vanilla network administrator – supporting a custom app wasn't quite what he was looking for, but he desperately wanted to get out of his current job.

A few days later, Phil arrived for his interview. The company had rented smart offices on a new business park on the edge of town. He was ushered into the conference room, where he was joined by The Director and The Manager.

"So", said The Manager. "You've seen our brochure?"

"Yeah", said Phil, glancing at the glossy brochure in front of him with bright, Barbie-pink lettering all over it.

"You've seen a demo version of our application – what do you think?"

"Well, I think that it's great!", said Phil. He'd done his research – there were over 115 companies offering something very similar, and theirs wasn't anything special. "I particularly like the icons."

"Wonderful!" The Director cheered while firing up PowerPoint. "These are our servers. We rent some rack space in a data center 100 miles away." Phil looked at the projected picture. It showed a rack of a dozen servers.

"They certainly look nice." said Phil. They did look nice – brand new with green lights.

"Now, we also rent space in another data center on the other side of the country," The Manager added.

"This one is in a former cold-war bunker!" he said proudly. "It's very secure!" Phil looked up at another photo of some more servers.

"What we want the successful applicant to do is to take care of the servers on a day to day basis, but we also need to move those servers to the other data center", said The Director. "Without any interruption of service."

"Also, we need someone to set up the IT for the entire office. You know, email, file & print, internet access – that kind of thing. We've got a dozen salespeople starting next week, they'll all need email."

"And we need it to be secure."

"And we need it to be documented."

Phil scribbled notes as best he could while the interviewing duo tag-teamed him with questions.

"You'll also provide second line support to end users of the application."

"And day-to-day IT support to our own staff. Any questions?"

Phil looked up. "Ah… which back-end database does the application use?" he asked, expecting the answer would be SQL Server or perhaps Oracle, but The Director's reply surprised him.

"Oh, we wrote our own database from scratch. Martin wrote it." Phil realized his mouth was open, and shut it. The Director saw his expression, and explained. "You see, off the shelf databases have several disadvantages – the data gets fragmented, they're not quick enough, and so on. But don't have to worry about that – Martin takes care of the database. Do you have any more questions?"

Phil frowned. "So, to summarize: you want a data center guy to take care of your servers. You want someone to migrate the application from one data center to another, without any outage. You want a network administrator to set up, document and maintain an entire network from scratch. You want someone to provide internal support to the staff. And you want a second line support person to support your flagship application."

"Exactly", beamed The Director paternally. "We want one person who can do all those things. Can you do that?"

Phil took a deep breath. "I don't know," he replied, and that was the honest answer.

"Right", The Manager said. "Well, if you have any questions, just give either of us a call, okay?"

Moments later, Phil was standing outside, clutching the garish brochure with the pink letters. His head was spinning. Could he do all that stuff? Did he want to? Was Martin a genius or a madman to reinvent the wheel with the celebrated database?

In the end, Phil was not offered the job and decided it might be best to stick it out at his old job for a while longer. After all, compared to Martin, maybe his job wasn't so bad after all.

[Advertisement] Plan Your .NET 9 Migration with Confidence
Your journey to .NET 9 is more than just one decision. Avoid migration migraines with the advice in this free guide. Download Free Guide Now!

Error'd: Mike's Job Search Job

Underqualified Mike S. is suffering a job hunt. "I could handle uD83D and uDC77 well enough, but I am a little short of uD83C and the all important uDFFE requirement."


Frank forecasts frustration. "The weather app I'm using seems to be a bit confused on my location as I'm on vacation right now." It would be a simple matter for the app to identify each location, if it can't meaningfully choose only one.


Marc Würth is making me hungry. Says Marc "I was looking through my Evernote notes for "transactional" (email service). It didn't find anything. Evernote, though, tried to be helpful and thought I was looking for some basil (German "Basilikum")."


"To be from or not from be," muses Michael R. Indeed, that is the question at Stansted Shakespeare airport.


"That is not the King," Brendan commented. "I posted this on Discord, and my friend responded with "They have succeeded in alignment. Their AI is truly gender blind." Not only gender-blind but apparently also existence-blind. I think the Bard might have something quotable here as well but it escapes me. Comment section is open.


...and angels sing thee to thy rest.
[Advertisement] Picking up NuGet is easy. Getting good at it takes time. Download our guide to learn the best practice of NuGet for the Enterprise.

CodeSOD: A Trying Block

Mark sends us a very simple Java function which has the job of parsing an integer from a string. Now, you might say, "But Java has a built in for that, Integer.parseInt," and have I got good news for you: they actually used it. It's just everything else they did wrong.

private int makeInteger(String s)
{
  int i=0;
  try
  {
    Integer.parseInt(s);
  }
  catch (NumberFormatException e)
  {
    i=0;
    return i;
  }
  i=Integer.parseInt(s);
  return i;
}

This function is really the story of variable i, the most useless variable ever. It's doing its best, but there's just nothing for it to do here.

We start by setting i to zero. Then we attempt to parse the integer, and do nothing with the result. If it fails, we set i to zero again, just for fun, and then return i. Why not just return 0? Because then what would poor i get to do?

Assuming we didn't throw an exception, we parse the input again, storing its result in i, and then return i. Again, we treat i like a child who wants to help paint the living room: we give it a dry brush and a section of wall we're not planning to paint and let it go to town. Nothing it does matters, but it feels like a participant.

Now, Mark went ahead and refactored this function basically right away, into a more terse and clear version:

private int makeInteger(String s)
{
  try
  {
    return Integer.parseInt(s);
  }
  catch (NumberFormatException e)
  {
    return 0;
  }
}

He went about his development work, and then a few days later came across makeInteger reverted back to its original version. For a moment, he wanted to be mad at someone for reverting his change, but no- this was in an entirely different class. With that information, Mark went and did a search for makeInteger in the code, only to find 39 copies of this function, with minor variations.

There are an unknown number of copies of the function where the name is slightly different than makeInteger, but a search for Integer.parseInt implies that there may be many more.

[Advertisement] Keep all your packages and Docker containers in one place, scan for vulnerabilities, and control who can access different feeds. ProGet installs in minutes and has a powerful free version with a lot of great features that you can upgrade when ready. Learn more.

CodeSOD: Buff Reading

Frank inherited some code that reads URLs from a file, and puts them into a collection. This is a delightfully simple task. What could go wrong?

static String[]  readFile(String filename) {
    String record = null;
    Vector vURLs = new Vector();
    int recCnt = 0;

    try {
        FileReader fr = new FileReader(filename);
        BufferedReader br = new BufferedReader(fr);

        record = new String();

        while ((record = br.readLine()) != null) {
            vURLs.add(new String(record));
            //System.out.println(recCnt + ": " + vURLs.get(recCnt));
            recCnt++;
        }
    } catch (IOException e) {
        // catch possible io errors from readLine()
        System.out.println("IOException error reading " + filename + " in readURLs()!\n");
        e.printStackTrace();
    }

    System.out.println("Reading URLs ...\n");

    int arrCnt = 0;
    String[] sURLs = new String[vURLs.size()];
    Enumeration eURLs = vURLs.elements();

    for (Enumeration e = vURLs.elements() ; e.hasMoreElements() ;) {
        sURLs[arrCnt] = (String)e.nextElement();
        System.out.println(arrCnt + ": " + sURLs[arrCnt]);
        arrCnt++;
    }

    if (recCnt != arrCnt++) {
        System.out.println("WARNING: The number of URLs in the input file does not match the number of URLs in the array!\n\n");
    }

    return sURLs;
} // end of readFile()

So, we start by using a FileReader and a BufferedReader, which is the basic pattern any Java tutorial on file handling will tell you to do.

What I see here is that the developer responsible didn't fully understand how strings work in Java. They initialize record to a new String() only to immediately discard that reference in their while loop. They also copy the record by doing a new String which is utterly unnecessary.

As they load the Vector of strings, they also increment a recCnt variable, which is superfluous since the collection can tell you how many elements are in it.

Once the Vector is populated, they need to copy all this data into a String[]. Instead of using the toArray function, which is built in and does that, they iterate across the Vector and put each element into the array.

As they build the array, they increment an arrCnt variable. Then, they do a check: if (recCnt != arrCnt++). Look at that line. Look at the post-increment on arrCnt, despite never using arrCnt again. Why is that there? Just for fun, apparently. Why is this check even there?

The only way it's possible for the counts to not match is if somehow an exception was thrown after vURLs.add(new String(record)); but before recCnt++, which doesn't seem likely. Certainly, if it happens, there's something worse going on.

Now, I'm going to be generous and assume that this code predates Java 8- it just looks old. But it's worth noting that in Java 8, the BufferedReader class got a lines() method, which returns a Stream<String> that can be converted directly to an array with toArray, making all of this code superfluous. Then again, so much of this code is superfluous anyway.
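
For the record, a sketch of the Java 8+ version, keeping the original's habit of printing and swallowing IO errors (which is its own discussion):

// Sketch only; requires java.io.BufferedReader, java.io.FileReader, java.io.IOException.
static String[] readFile(String filename) {
    try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
        // lines() yields a Stream<String>; toArray does the copying the original did by hand.
        return br.lines().toArray(String[]::new);
    } catch (IOException e) {
        e.printStackTrace();
        return new String[0];
    }
}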

Anyway, for a fun game, start making the last use of every variable be a post-increment before it goes out of scope. See how many code reviews you can sneak it through!

[Advertisement] Utilize BuildMaster to release your software with confidence, at the pace your business demands. Download today!

Representative Line: What the FFFFFFFF

Combining Java with lower-level bit manipulations is asking for trouble- not because the language is inadequate to the task, but because so many of the developers who work in Java are so used to working at a high level they might not quite "get" what they need to do.

Victor inherited one such project, which used bitmasks and bitwise operations a great deal, based on the network protocol it implemented. Here's how the developers responsible created their bitmasks:

private static long FFFFFFFF = Long.parseLong("FFFFFFFF", 16);

So, the first thing that's important to note is that Java does support hex literals: 0xFFFFFFFFL is a perfectly valid long literal (the trailing L matters- plain 0xFFFFFFFF is an int, which is -1 and sign-extends when widened to long). So we don't need to create a string and parse it. But we also don't need to make a constant simply named FFFFFFFF, which is just the old twenty = 20 constant pattern: technically you've made a constant but you haven't actually made the magic number go away.

Of course, this also isn't actually a constant- it's not declared final- so it's entirely possible that FFFFFFFF could hold a value which isn't 0xFFFFFFFF.
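
For what it's worth, a minimal sketch of the conventional declaration- the name here is made up, since presumably the real code means something protocol-specific by it:

// final makes it an actual constant; the trailing L makes it a long literal (4294967295),
// where plain 0xFFFFFFFF would be an int equal to -1.
private static final long LOW_32_BITS = 0xFFFFFFFFL;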

[Advertisement] Picking up NuGet is easy. Getting good at it takes time. Download our guide to learn the best practice of NuGet for the Enterprise.

Representative Line: Identifying the Representative

Kate inherited a system where Java code generates JavaScript (by good old fashioned string concatenation) and embeds it into an output template. The Java code was written by someone who didn't fully understand Java, but JavaScript was also a language they didn't understand, and the resulting unholy mess was buggy and difficult to maintain.

While trying to debug the JavaScript, Kate had to dig through the generated code, which led to this little representative line:

dojo.byId('html;------sites------fileadmin------en------fileadmin------index.html;;12').setAttribute('isLocked','true');

The byId function is an alias to the browser's document.getElementById function. The ID on display here is clearly generated by the Java code, resulting in an absolutely cursed ID for an element in the page. The semicolons are field separators, which means you can parse the ID to get other information about it. I have no idea what the 12 means, but it clearly means something. Then there's that long kebab-looking string. It seems like maybe some sort of hierarchy information? But maybe not, because fileadmin appears twice? Why are there so many dashes? If I got an answer to that question, would I survive it? Would I be able to navigate the world if I understood the dark secret of those dashes? Or would I have to give myself over to our Dark Lords and dedicate my life to bringing about the end of all things?
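
If something genuinely needs to pull those fields back apart, the mechanics are at least simple. A sketch (in Java, since that's where the string gets built), with the caveat that every field meaning here is a guess:

// Sketch only: splitting the generated ID on its apparent separators.
String id = "html;------sites------fileadmin------en------fileadmin------index.html;;12";
String[] fields = id.split(";", -1);    // ["html", "------sites...index.html", "", "12"]
String[] parts = fields[1].split("-+"); // ["", "sites", "fileadmin", "en", "fileadmin", "index.html"]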

Like all good representative lines, this one hints at darker, deeper evils in the codebase- the code that generates (or parses) this ID must be especially cursed.

The only element which needs to have its isLocked attribute set to true is the developer responsible for this: they must be locked away before they harm the rest of us.

[Advertisement] ProGet’s got you covered with security and access controls on your NuGet feeds. Learn more.

Error'd: Teamwork

Whatever would we do without teamwork.

David doesn't know. "Microsoft Teams seems to have lost count (it wasn't a very big copy/paste)"


A follow-up from an anonymous doesn't know either. "Teams doing its best impression of a ransom note just to say you signed out. At least it still remembers how to suggest closing your browser. Small victories."


Bob F. just wants to make memes. "I've been setting my picture widths in this document to 7.5" for weeks, and suddenly after the latest MS Word update, Microsoft thinks 7.5 is not between -22.0 and 22.0. They must be using AI math to determine this."


Ewan W. wonders "a social life: priceless...?". Ewan has some brand confusion but after the Boom Battle Bar I bet I know why.


Big spender Bob B. maybe misunderstands NaN. He gleefully exclaims "I'm very happy to get 15% off - Here's hoping the total ends up as NaN and I get it all free." Yikes. 191.78-NaN is indeed NaN, but that just means you're going to end up owing them NaN. Don't put that on a credit card!


[Advertisement] Keep the plebs out of prod. Restrict NuGet feed privileges with ProGet. Learn more.

CodeSOD: A Jammed Up Session

Andre has inherited a rather antique ASP .Net WebForms application. It's a large one, with many pages in it, but they all follow a certain pattern. Let's see if you can spot it.

protected void btnSearch_Click(object sender, EventArgs e)
{
    ArrayList paramsRel = new ArrayList();
    paramsRel["Name"] = txtNome.Text;
    paramsRel["Date"] = txtDate.Text;
    Session["paramsRel"] = paramsRel;
   
    List<Client> clients = Controller.FindClients();
    //Some other code
}

Now, at first glance, this doesn't look terrible. Using an ArrayList as a dictionary is weird, and frankly, so is storing it in the Session object, but neither is an automatic red flag. But wait, why is it called paramsRel? They couldn't be… no, they wouldn't…

public List<Client> FindClients()
{
    ArrayList paramsRel = (ArrayList)Session["paramsRel"];
    string name = (string)paramsRel["Name"];
    string dateStr = (string)paramsRel["Date"];
    DateTime date = DateTime.Parse(dateStr);
   
   //More code...
}

Now there's the red flag. paramsRel is how they pass parameters to functions. They stuff it into the Session, then call a function which retrieves it from that Session.

This pattern is used everywhere in the application. You can see that there's a vague gesture in the direction of trying to implement some kind of Model-View-Controller pattern (as FindClients is a member of the Controller object), but that modularization gets undercut by everything depending on Session as a pseudoglobal for passing state information around.

The only good news is that the Session object is synchronized so there's no thread safety issue here, though not for want of trying.

[Advertisement] Keep all your packages and Docker containers in one place, scan for vulnerabilities, and control who can access different feeds. ProGet installs in minutes and has a powerful free version with a lot of great features that you can upgrade when ready. Learn more.

CodeSOD: itouhhh…

Frequently in programming, we can make a tradeoff: use less (or more) CPU in exchange for using more (or less) memory. Lookup tables are a great example: use a big pile of memory to turn complicated calculations into O(1) operations.

So, for example, implementing itoa, the C library function for turning an integer into a character array (aka, a string), you could maybe make it more efficient using a lookup table.

I say "maybe", because Helen inherited some C code that, well, even if it were more efficient, it doesn't help because it's wrong.

Let's start with the lookup table:

char an[1000][3] = 
{
	{'0','0','0'},{'0','0','1'},{'0','0','2'},{'0','0','3'},{'0','0','4'},{'0','0','5'},{'0','0','6'},{'0','0','7'},{'0','0','8'},{'0','0','9'},
	{'0','1','0'},{'0','1','1'},{'0','1','2'},{'0','1','3'},{'0','1','4'},{'0','1','5'},{'0','1','6'},{'0','1','7'},{'0','1','8'},{'0','1','9'},
    …

I'm abbreviating the lookup table for now. This lookup table is meant to be used to convert every number from 0…999 into a string representation.

Let's take a look at how it's used.

int ll = f->cfg.len_len;
long dl = f->data_len;
// Prepare length
if ( NULL == dst )
{
    dst_len = f->data_len + ll + 1 ;
    dst = (char*) malloc ( dst_len );
}
else
//if( dst_len < ll + dl )
if( dst_len < (unsigned) (ll + dl) )
{
    // TO DOO - error should be processed
    break;
}
long i2;
switch ( f->cfg.len_fmt)
{
    case ASCII_FORM:
    {
        if ( ll < 2 )
        {
            dst[0]=an[dl][2];
        }
        else if ( ll < 3 )
        {
            dst[0]=an[dl][1];
            dst[1]=an[dl][2];
        }
        else if ( ll < 4 )
        {
            dst[0]=an[dl][0];
            dst[1]=an[dl][1];
            dst[2]=an[dl][2];
        }
        else if ( ll < 5 )
        {
            i2 = dl / 1000;
            dst[0]=an[i2][2];
            i2 = dl % 1000;
            dst[3]=an[i2][2];
            dst[2]=an[i2][1];
            dst[1]=an[i2][0];
        }
        else if ( ll < 6 )
        {
            i2 = dl / 1000;
            dst[0]=an[i2][1];
            dst[1]=an[i2][2];
            i2 = dl % 1000;
            dst[4]=an[i2][2];
            dst[3]=an[i2][1];
            dst[2]=an[i2][0];
        }
        else
        {
            // General case
            for ( int k = ll  ; k > 0  ; k-- )
            {
                dst[k-1] ='0' + dl % 10;
                dl/=10;
            }
        }

        dst[dl]=0;

        break;
    }
}

Okay, we start with some reasonable bounds checking. I have no idea what to make of a struct member called len_len- the length of the length? I'm lacking some context here.

Then we get into the switch statement. For all values with fewer than 4 digits, everything makes sense, more or less. I'm not sure what the point of using a 2D array for your lookup table is if you're also copying one character at a time, but for such a small number of copies I'm sure it's fine.

But then we get into the len_lens longer than 3, and we start dividing by 1000 so that our lookup table continues to work. Which, again, I guess is fine, but I'm still left wondering why we're doing this, why this specific chain of optimizations is what we need to do. And frankly, why we couldn't just use itoa or a similar library function which already does this and is probably more optimized than anything I'm going to write.

When we have an output longer than 5 characters, we just use a naive for-loop and some modulus as our "general" case.

So no, I don't like this code. It reeks of premature optimization, and it also has the vibe of someone starting to optimize without fully understanding the problem they were optimizing, and trying to change course midstream without changing their solution.

But there's a punchline to all of this. Because, you see, I skipped most of the lookup table. Would you like to see how it ends? Of course you do:

{'9','8','0'},{'9','8','1'},{'9','8','2'},{'9','8','3'},{'9','8','4'},{'9','8','5'},{'9','8','6'},{'9','8','7'},{'9','8','8'},{'9','8','9'}
};

The lookup table doesn't work for values from 990 to 999. There are just no entries there. All this effort to optimize converting integers to text and we end up here: with a function that doesn't work for 1% of the possible values it could receive. As declared here, with an explicit size of 1000, the missing rows are just zero-filled, so those values silently come out as NUL bytes instead of digits. And if the real declaration let the compiler size the array instead, it's an out-of-bounds read and everyone's favorite problem: undefined behavior. Usually it'll segfault, but who knows! Maybe it returns whatever bytes it finds? Maybe it sends the nasal demons after you. The compiler is allowed to do anything.

[Advertisement] ProGet’s got you covered with security and access controls on your NuGet feeds. Learn more.

CodeSOD: Exactly a Date

Alexandar sends us some C# date handling code. The best thing one can say is that they didn't reinvent any wheels, but that might be worse, because they used the existing wheels to drive right off a cliff.

try
{
    var date = DateTime.ParseExact(member.PubDate.ToString(), "M/d/yyyy h:mm:ss tt", null); 
    objCustomResult.PublishedDate = date;
}
catch (Exception datEx)
{
}

member.PubDate is a Nullable<DateTime>. So its ToString will return one of two things. If there is a value there, it'll return the DateTime's value as a string. If it's null, it'll just return an empty string. Attempting to parse the empty string will throw an exception, which we helpfully swallow, do nothing about, and leave objCustomResult.PublishedDate in whatever state it was in- I'm going to guess null, but I have no idea.

Part of this WTF is that they break the advantages of using nullable types- the entire point is to be able to handle null values without having to worry about exceptions getting tossed around. But that's just a small part.

The real WTF is taking a DateTime value, turning it into a string, only to parse it back out. But because this is in .NET, it's more subtle than just the generation of useless strings, because member.PubDate.ToString()'s return value may change depending on your culture info settings.

Which sure, this is almost certainly server-side code running on a single server with a well known locale configured. So this probably won't ever blow up on them, but it's 100% the kind of thing everyone thinks is fine until the day it's not.

The punchline is that ToString allows you to specify the format you want the date formatted in, which means they could have written this:

var date = DateTime.ParseExact(member.PubDate.ToString("M/d/yyyy h:mm:ss tt"), "M/d/yyyy h:mm:ss tt", null);

But if they did that, I suppose that would have possibly tickled their little grey cells and made them realize how stupid this entire block of code was?

[Advertisement] Utilize BuildMaster to release your software with confidence, at the pace your business demands. Download today!

CodeSOD: Would a Function by Any Other Name Still be WTF?

"Don't use exception handling for normal flow control," is generally good advice. But Andy's lead had a PhD in computer science, and with that kind of education, wasn't about to let good advice or best practices tell them what to do. That's why, when they needed to validate inputs, they wrote code C# like this:


    public static bool IsDecimal(string theValue)
    {
        try
        {
            Convert.ToDouble(theValue);
            return true;
        }
        catch
        {
            return false;
        }
    } 

They attempt to convert, and if they succeed, great, return true. If they fail, an exception gets caught, and they return false. What could be simpler?

Well, using the built-in TryParse function would be simpler. Despite its name, it actually avoids throwing an exception, even internally, because exceptions are expensive in .NET. And it is already implemented, so you don't have to do this.

Also, Decimal is a type in C#- a 16-byte floating point value. Now, I know they didn't actually mean Decimal, just "a value with 0 or more digits behind the decimal point", but pedantry is the root of clarity, and the naming convention makes this bad code unclear about its intent and purpose. Per the docs, there are Single and Double values which can't be represented as Decimal and trigger an OverflowException, and conversely, Decimal loses precision if converted to Double. This means a value that could be represented as a Decimal might not pass this function, and a value that can't be represented as a Decimal might pass it. None of this actually matters, but the name of the function is bad.

[Advertisement] ProGet’s got you covered with security and access controls on your NuGet feeds. Learn more.

Sky Extends A.I. Automation to Your Entire Mac

By: Nick Heer

Federico Viticci, MacStories:

For the past two weeks, I’ve been able to use Sky, the new app from the people behind Shortcuts who left Apple two years ago. As soon as I saw a demo, I felt the same way I did about Editorial, Workflow, and Shortcuts: I knew Sky was going to fundamentally change how I think about my macOS workflow and the role of automation in my everyday tasks.

Only this time, because of AI and LLMs, Sky is more intuitive than all those apps and requires a different approach, as I will explain in this exclusive preview story ahead of a full review of the app later this year.

Matthew Cassinelli has also been using an early version of Sky:

Sky bridges the gap between old-school scripting, modern automation, and new-age LLM technology, built with a deep love for working on the Mac as a platform.

This feels like the so-far-unfulfilled promise of Apple Intelligence — but more. The ways I want to automate iOS are limited. But the kinds of things I want help with on my Mac are boundless. Viticci shares the example of automatically sorting a disorganized folder in Finder, and that is absolutely something I want to do easier than I currently can. Yes, I could cobble together something with AppleScript or an Automator workflow, but it would be so much nicer if I could just tell my computer to do something in the most natural language I understand. This is fascinating.

⌥ Permalink

Judge Dismisses 2021 Rumble Antitrust Suit Against Google on Statute of Limitations Grounds

By: Nick Heer

Mike Scarcella, Reuters:

Alphabet’s Google has persuaded a federal judge in California to reject a lawsuit from video platform Rumble accusing the technology giant of illegally monopolizing the online video-sharing market.

In a ruling on Wednesday, U.S. District Judge Haywood Gilliam Jr said Rumble’s 2021 lawsuit seeking more than $2 billion in damages was untimely filed outside the four-year statute of limitations for antitrust claims.

Rumble is dishonest and irritating, but I thought its case in which it argued Google engages in self-preferencing could be interesting. It seems to rank YouTube videos more highly than those from other sources. This can be explained by YouTube’s overwhelming popularity — it consistently ranks in the top ten web services according to Cloudflare — yet I can see anyone’s discomfort in taking Google’s word for it, since it has misrepresented its ranking criteria.

This is an unsatisfying outcome, but it seems Rumble has another suit it is still litigating.

⌥ Permalink

Google is Burying the Web Alive

By: Nick Heer

John Herrman, New York magazine:

But I also don’t want to assume Google knows exactly how this stuff will play out for Google, much less what it will actually mean for millions of websites, and their visitors, if Google stops sending as many people beyond its results pages. Google’s push into productizing generative AI is substantially fear-driven, faith-based, and informed by the actions of competitors that are far less invested in and dependent on the vast collection of behaviors — websites full of content authentic and inauthentic, volunteer and commercial, social and antisocial, archival and up-to-date — that make up what’s left of the web and have far less to lose. […]

Very nearly since it launched, Google has attempted to answer users’ questions as immediately as possible. It had the “I’m Feeling Lucky” button since it was still a stanford.edu subdomain, and it has since steadily changed the results page to more directly respond to queries. But this seems entirely different — a way to benefit from Google’s decades-long ingestion of the web and giving almost nothing back. Or, perhaps, giving back something ultimately worse: invented answers users cannot trust, and will struggle to check because sources are intermingled and buried.

⌥ Permalink

The CIA’s 2010s Covert Communication Websites

By: Nick Heer

Ciro Santilli:

This article is about covert agent communication channel websites used by the CIA in many countries from the late 2000s until the early 2010s, when they were uncovered by counter intelligence of the targeted countries circa 2010-2013.

This is a pretty clever scheme in theory, but seems to have been pretty sloppy in practice. That is, many of the sites seem to share enough elements allowing an enterprising person to link the seemingly unrelated sites — even, as it turns out, years later and after they have been pulled offline. That apparently resulted in the deaths of, according to Foreign Policy, dozens of people.

⌥ Permalink

Apple Gets Its Annual Fraud Prevention Headlines

By: Nick Heer

Apple issued a news release today touting the safety of the App Store, dutifully covered without context by outlets like 9to5Mac, AppleInsider, and MacRumors. This has become an annual tradition in trying to convince people — specifically, developers and regulators — of the wisdom of allowing native software to be distributed for iOS only through the App Store. Apple published similar stats in 2021, 2022, 2023, and 2024, reflecting the company’s efforts in each preceding year. Each contains similar figures; for example:

  • In its new report, Apple says it “terminated more than 146,000 developer accounts over fraud concerns” in 2024, an increase from 118,000 in 2023, which itself was a decrease from 428,000 in 2022. Apple said the decrease between 2022 and 2023 was “thanks to continued improvements to prevent the creation of potentially fraudulent accounts in the first place”. Does the increase in 2024 reflect poorer initial anti-fraud controls, or an increase in fraud attempts? Is it possible to know either way?

  • Apple says it deactivated “nearly 129 million customer accounts” in 2024, a significant decrease from deactivating 374 million the year prior. However, it blocked 711 million account creations in 2024, which is several times greater than the 153 million blocked in the year before. Compare to 2022, when it disabled 282 million accounts and prevented the creation of 198 million potentially fraudulent accounts. In 2021, the same numbers were 170 million and 118 million; in 2020, 244 million and 424 million. These numbers are all over the place.

  • A new statistic Apple is publishing this year is “illicit app distribution”. It says that, in the past month, it “stopped nearly 4.6 million attempts to install or launch apps distributed illicitly outside the App Store or approved third-party marketplaces”. These are not necessarily fraudulent, pirated, or otherwise untoward apps. This statistic is basically a reflection of the control maintained by Apple over iOS regardless of user intentions.

There are plenty of numbers just like these in Apple’s press release. They all look impressive in large part because just about any statistic would be at Apple’s scale. Apple is also undeniably using the App Store to act as a fraud reduction filter, with mixed results. I do not expect a 100% success rate, but I still do not know how much can be gleaned from context-free numbers.

⌥ Permalink

Lawyers Keep Failing Clients By Relying on A.I.

By: Nick Heer

Nicholas Chrastil, the Guardian:

State officials have praised Butler Snow for its experience in defending prison cases – and specifically William Lunsford, head of the constitutional and civil rights litigation practice group at the firm. But now the firm is facing sanctions by the federal judge overseeing Johnson’s case after an attorney at the firm, working with Lunsford, cited cases generated by artificial intelligence – which turned out not to exist.

It is one of a growing number of instances in which attorneys around the country have faced consequences for including false, AI-generated information in official legal filings. A database attempting to track the prevalence of the cases has identified 106 instances around the globe in which courts have found “AI hallucinations” in court documents.

The database is now up to 120 cases, including some fairly high-profile ones like that against Timothy Burke.

Here is a little behind-the-scenes from this weekend’s piece about “nimble fingers” and Apple’s supply chain. The claim, as framed by Tripp Mickle, in the New York Times, is that “[y]oung Chinese women have small fingers, and that has made them a valuable contributor to iPhone production because they are more nimble at installing screws and other miniature parts”. This sounded suspicious to me because I thought about it for five seconds. There are other countries where small objects are carefully assembled by hand, for example, and attributing a characteristic like “small fingers” to hundreds of millions of “young Chinese women” seems reductive, to put it mildly. But this assumption had to come from somewhere, especially since Patrick McGee also mentioned it.

So I used both DuckDuckGo and Google to search for relevant keywords within a date range of the last fifteen years and excluding the past month or so. I could not quickly find anything of relevance; both thought I was looking for smartphones for use with small hands. So I thought this might be a good time to try ChatGPT. It immediately returned a quote from a 2014 report from an international labour organization, but did not tell me the title of the report or give me a link. I asked it for the title. ChatGPT responded it was actually a 2012 report that mentioned “nimble fingers” of young women being valuable, and gave me the title. But when I found copies of the report, there was no such quote or anything remotely relevant. I did, however, get the phrase “nimble fingers”, which sent me down the correct search path to finding articles documenting this longstanding prejudice.

Whether because of time crunch or laziness, it baffles me how law firms charging as much as they do have repeatedly failed to verify the claims generated by artificial intelligence tools.

⌥ Permalink

⌥ ‘Nimble Fingers’ Racism and iPhone Manufacturing

By: Nick Heer

Tripp Mickle, of the New York Times, wrote another one of those articles exploring the feasibility of iPhone manufacturing in the United States. There is basically nothing new here; the only reason it seems to have been published is because the U.S. president farted out yet another tariff idea, this time one targeted specifically at the iPhone at a rate of 25%.1

Anyway, there is one thing in this article — bizarrely arranged in a question-and-answer format — that is notable:

What does China offer that the United States doesn’t?

Small hands, a massive, seasonal work force and millions of engineers.

Young Chinese women have small fingers, and that has made them a valuable contributor to iPhone production because they are more nimble at installing screws and other miniature parts in the small device, supply chain experts said. In a recent analysis the company did to explore the feasibility of moving production to the United States, the company determined that it couldn’t find people with those skills in the United States, said two people familiar with the analysis who spoke on the condition of anonymity.

I will get to the racial component of this in a moment, but this answer has no internal logic. There are two sentences in that larger paragraph. The second posits that people in the U.S. do not have the “skills” needed to carefully assemble iPhones, but the skills as defined in the first sentence are small fingers — which is not a skill. I need someone from the Times to please explain to me how someone can be trained to shrink their fingers.

Anyway, this is racist trash. In response to a question from Julia Carrie Wong of the Guardian, Times communications director Charlie Stadtlander disputed the story was furthering “racial or genetic generalizations”, and linked to a podcast segment clipped by Mickle. In it, Patrick McGee, author of “Apple in China”, says:

The tasks that are often being done to make iPhones require little fingers. So the fact that it’s young Chinese women with little fingers — that actually matters. Like, Apple engineers will talk about this.

The podcast in question is, unsurprisingly, Bari Weiss’; McGee did not mention any of this when he appeared on, for example, the Daily Show.

Maybe some Apple engineers actually believe this, and maybe some supply chain experts do, too. But it is a longstanding sexist stereotype. (Thanks to Kat for the Feminist Review link.) It is ridiculous to see this published in a paper of record as though it is just one fact among many, instead of something which ought to be debunked.

The Times has previously reported why iPhones cannot really be made in the U.S. in any significant quantity. It has nothing to do with finger size, and everything to do with a supply chain the company has helped build for decades, as McGee talks about extensively in that Daily Show interview and, presumably, writes about in his book. (I do not yet have a copy.) Wages play a role, but it is the sheer concentration of manufacturing capability that explains why iPhones are made in China, and why it has been so difficult for Apple to extricate itself from the country.


  1. About which the funniest comment comes from Anuj Ahooja on Threads. ↥︎

Dara Khosrowshahi Knows Uber Is Just Reinventing the Bus

By: Nick Heer

Uber CEO Dara Khosrowshahi was on the Verge’s “Decoder” podcast with Nilay Patel, and was asked about Route Share:

I read this press release announcing Route Share, and I had this very mid-2010s reaction, which was what if Uber just invented a bus. Did you just invent a bus?

I think to some extent it’s inspired by the bus. If you step back a little bit, a part of us looking to expand and grow is about making Uber more affordable to more people. I think one of the things that makes tech companies different from most companies out there is that our goal is to lower prices. If we lower the price, then we can extend the audience.

There is more to Khosrowshahi’s answer, but I am going to interject with three objections. First, the idea that Route Share is “inspired” “to some extent” by a bus is patently ridiculous — it is a vehicle with multiple passengers who embark and disembark at fixed points along a fixed route. It is a bus. A bad one, but a bus.

Second, tech companies are not the only kinds of companies that want to lower prices. Basically every consumer business is routinely marketed on lowering prices and saving customers money. This is the whole entire concept of big box stores like Costco and Walmart. Whether they are actually saving people money is a whole different point.

Which brings me to my third objection, which is that Uber has been raising prices, not reducing them. In the past year, according to a Gridwise report, Uber’s fares increased by 7.2% in the United States, even though driver pay fell 3.4%. Uber has been steadily increasing its average fare since 2018, probably to set the groundwork for its 2019 initial public offering.

Patel does not raise any similar objections.

Anyway, back to Khosrowshahi:

There are two ways of lowering price as it relates to Route Share. One is you get more than one person to share a car because cars cost money, drivers’ time costs money, etc., or you reduce the size or price of the vehicle. And we’re doing that actively. For example, with two-wheelers and three-wheelers in a lot of countries. We’ve been going after this shared concept, which is a bus, for many, many years. We started with UberX Share, for example, which is on-demand sharing.

But this concept takes it to the next level. If you schedule and create consistency among routes, then I think we can up the matching quotient, so to speak, and then essentially pass the savings on to the consumer. So, call it a next-gen bus, but the goal is just to reduce prices to the consumer and then help with congestion and the environment. That’s all good as well.

Given the premise of “you get more than one person to share a car because cars cost money”, you might think Khosrowshahi would discuss the advantageous economics of increasing vehicle capacity. Instead, he cleverly pivots to smaller vehicles, despite Khosrowshahi and Patel discussing earlier how often their Uber ride occurs in a Toyota Highlander — a “mid-size” but still large SUV. This is an obviously inefficient way of moving one driver and one passenger around a city.

We just need better public transit. We should have an adequate supply of taxis, yes, but it is vastly better for everyone if we improve our existing infrastructure of trains and buses. Part of the magic of living in a city is the viability of shared public services like these.

⌥ Permalink

An Elixir of Production, Not of Craft

By: Nick Heer

Greg Storey begins this piece with a well-known quote from Plato’s “Phaedrus”, in which the invention of writing is decried as “an elixir not of memory, but of reminding”. Storey compares this to a criticism of large language models, and writes:

Even though Plato thought writing might kill memory, he still wrote it down.

But this was not Plato’s thought — it was the opinion of Socrates expressed through Thamus. Socrates was too dismissive of the written word for a reason he believed worthwhile — that memory alone is a sufficient marker of intelligence and wisdom.

If anything, I think Storey’s error in attribution actually reinforces the lesson we can draw from it. If we relied on the pessimism of Socrates, we might not know what he said today; after all, human memory is faulty. Because Plato bothered to write it down, we can learn from it. But the ability to interpret it remains ours.

What struck me most about this article, though, is this part:

The real threat to creativity isn’t a language model. It’s a workplace that rewards speed over depth, scale over care, automation over meaning. If we’re going to talk about what robs people of agency, let’s start there. […]

Thanks to new technologies — from writing to large language models, from bicycles to jets — we are able to dramatically increase the volume of work done in our waking hours and that, in turn, increases the pressure to produce even more. The economic term for this is “productivity”, which I have always disliked. It distills everything down to the ratio of input effort compared to output value. In its most raw terms, it rewards the simplistic view of what a workplace ought to be, as Storey expresses well.

⌥ Permalink

Reflecting on Tom Cruise’s Stunt Work

By: Nick Heer

Ryan Francis Bradley, New York Times Magazine:

Only — what if we did know exactly how he did the thing, and why? Before the previous installment of the franchise, “Dead Reckoning,” Paramount released a nine-minute featurette titled “The Biggest Stunt in Cinema History.” It was a behind-the-scenes look at that midair-motorbike moment, tracking how Cruise and his crew pulled it off. We saw a huge ramp running off the edge of a Norwegian fjord. We heard about Cruise doing endless motocross jumps as preparation (13,000 of them, the featurette claims) and skydiving repeatedly (more than 500 dives). We saw him touching down from a jump, his parachute still airborne above him, and giving the director Christopher McQuarrie a dap and a casual “Hey, McQ.” We heard a chorus of stunt trainers telling us how fantastic Cruise is (“an amazing individual,” his base-jumping coach says). And we hear from Cruise himself, asking his driving question: “How can we involve the audience?”

The featurette was an excellent bit of Tom Cruise propaganda and a compelling look at his dedication to (or obsession with) his own mythology (or pathology). But for the movie itself, the advance release of this featurette was completely undermining. When the jump scene finally arrived, it was impossible to ignore what you already knew about it. […]

Not only was the stunt compromised by the featurette, but the way it was shot and edited did not help matters. Something about it does not look quite right — maybe it is the perpetual late afternoon light — and the whole sequence feels unbelievable. That is, I know Cruise is the one performing the stunt, but if I found out each shot contained a computer-generated replacement for Cruise, it would not surprise me.

I am as excited for this instalment as anyone. I hope it looks as good as a $300 million blockbuster should. But the way this franchise has been shot since “Fallout” has been a sore spot for me and, with the same director, cinematographer, and editor as “Dead Reckoning”, I cannot imagine why it would be much different.

⌥ Permalink

Tim Cook Called Texas Governor to Stop App Store Age Checking Legislation

By: Nick Heer

Rolfe Winkler, Amrith Ramkumar, and Meghan Bobrowsky, Wall Street Journal:

Apple stepped up efforts in recent weeks to fight Texas legislation that would require the iPhone-maker to verify ages of device users, even drafting Chief Executive Tim Cook into the fight.

The CEO called Texas Gov. Greg Abbott last week to ask for changes to the legislation or, failing that, for a veto, according to people familiar with the call. These people said that the conversation was cordial and that it made clear the extent of Apple’s interest in stopping the bill.

Abbott has yet to say whether he will sign it, though it passed the Texas legislature with veto-proof majorities.

This comes just a few months after Apple announced it would be introducing age range APIs in iOS later this year. Earlier this month, U.S. lawmakers announced federal bills with the same intent. This is clearly the direction things are going. Is there something specific in Texas’ bill that makes it particularly objectionable? Or is it simply the case that Apple and Google would prefer a single federal law instead of individual state laws?

⌥ Permalink

Sponsor: Magic Lasso Adblock: 2.0× Faster Web Browsing in Safari

By: Nick Heer

Want to experience twice as fast load times in Safari on your iPhone, iPad, and Mac?

Then download Magic Lasso Adblock — the ad blocker designed for you.

Magic Lasso Adblock: 2.0× Faster Web Browsing in Safari

As an efficient, high performance and native Safari ad blocker, Magic Lasso blocks all intrusive ads, trackers, and annoyances – delivering a faster, cleaner, and more secure web browsing experience.

By cutting down on ads and trackers, common news websites load 2× faster and browsing uses less data while saving energy and battery life.

Rely on Magic Lasso Adblock to:

  • Improve your privacy and security by removing ad trackers

  • Block all YouTube ads, including pre-roll video ads

  • Block annoying cookie notices and privacy prompts

  • Double battery life during heavy web browsing

  • Lower data usage when on the go

With over 5,000 five star reviews, it’s simply the best ad blocker for your iPhone, iPad, and Mac.

And unlike some other ad blockers, Magic Lasso Adblock respects your privacy, doesn’t accept payment from advertisers, and is 100% supported by its community of users.

So, join over 350,000 users and download Magic Lasso Adblock today.

⌥ Permalink

U.S. Spy Agencies Get One-Stop Shop to Buy Personal Data

By: Nick Heer

Remember how, in 2023, the U.S. Office of the Director of National Intelligence published a report acknowledging mass stockpiling of third-party data it had purchased? It turns out there is so much private information about people that it is creating a big headache for the intelligence agencies — not because of any laws or ethical qualms, but simply because of the sheer volume.

Sam Biddle, the Intercept:

The Office of the Director of National Intelligence is working on a system to centralize and “streamline” the use of commercially available information, or CAI, like location data derived from mobile ads, by American spy agencies, according to contract documents reviewed by The Intercept. The data portal will include information deemed by the ODNI as highly sensitive, that which can be “misused to cause substantial harm, embarrassment, and inconvenience to U.S. persons.” The documents state spy agencies will use the web portal not just to search through reams of private data, but also run them through artificial intelligence tools for further analysis.

Apparently, the plan is to feed all this data purchased from brokers and digital advertising companies into artificial intelligence systems. The DNI says it has rules about purchasing and using this data, so there is nothing to worry about.

By the way, the DNI’s Freedom of Information Act page was recently updated to remove links to released records and FOIA logs. They were live on May 5 but, as of May 16, those pages have been removed, and direct links no longer resolve either. Strange.

Update: The ODNI told me its “website is currently under construction”.

⌥ Permalink

Speculating About the Hardware Ambitions of OpenAI

By: Nick Heer

Berber Jin, Wall Street Journal:

Altman and Ive offered a few hints at the secret project they have been working on [at a staff meeting]. The product will be capable of being fully aware of a user’s surroundings and life, will be unobtrusive, able to rest in one’s pocket or on one’s desk, and will be a third core device a person would put on a desk after a MacBook Pro and an iPhone.

Ambitious, albeit marginally less hubristic than considering it a replacement for either of those two device categories.

Stephen Hackett:

If OpenAI’s future product is meant to work with the iPhone and Android phones, then the company is opening a whole other set of worms, from the integration itself to the fact that most people will still prefer to simply pull their phone out of their pockets for basically any task.

I am reminded of an April 2024 article by Jason Snell at Six Colors:

The problem is that I’m dismissing the Ai Pin and looking forward to the Apple Watch specifically because of the control Apple has over its platforms. Yes, the company’s entire business model is based on tightly integrating its hardware and software, and it allows devices like the Apple Watch to exist. But that focus on tight integration comes at a cost (to everyone but Apple, anyway): Nobody else can have the access Apple has.

A problem OpenAI could have with this device is the same one faced by Humane, which is that Apple treats third-party hardware and software as second-class citizens in its post-P.C. ecosystem. OpenAI is laying the groundwork for better individual context. But this is a significant limitation, and I am curious to see how it will be overcome.

Whatever this thing is, it is undeniably interesting to me. OpenAI has become a household name on a foundation of an academic-sounding product that has changed the world. Jony Ive has been the name attached to entire eras of design. There is plenty to criticize about both. Yet the combination of these things is surely intriguing, inviting the kind of speculation that used to be commonplace in tech before it all became rote. I have little faith our world will become meaningfully better with another gadget in it. Yet I hope the result is captivating, at least, because we could use some of that.

⌥ Permalink

GeoGuessr Community Maps Go Dark in Protest of EWC Ties to Human Rights Abuses

By: Nick Heer

Jessica Conditt, Engadget:

A group of GeoGuessr map creators have pulled their contributions from the game to protest its participation in the Esports World Cup 2025, calling the tournament “a sportswashing tool used by the government of Saudi Arabia to distract from and conceal its horrific human rights record.” The protestors say the blackout will hold until the game’s publisher, GeoGuessr AB, cancels its planned Last Chance Wildcard tournament at the EWC in Riyadh, Saudi Arabia, from July 21 to 27.

Those participating in this blackout created some of the most popular and notable maps in the game. Good for them.

Update: GeoGuessr says it is withdrawing from the EWC.

⌥ Permalink

The Carbon Footprint Sham

By: Nick Heer

Thinking about the energy “footprint” of artificial intelligence products makes it a good time to re-link to Mark Kaufman’s excellent 2020 Mashable article in which he explores the idea of a carbon footprint:

The genius of the “carbon footprint” is that it gives us something to ostensibly do about the climate problem. No ordinary person can slash 1 billion tons of carbon dioxide emissions. But we can toss a plastic bottle into a recycling bin, carpool to work, or eat fewer cheeseburgers. “Psychologically we’re not built for big global transformations,” said John Cook, a cognitive scientist at the Center for Climate Change Communication at George Mason University. “It’s hard to wrap our head around it.”

Ogilvy & Mather, the marketers hired by British Petroleum, wove the overwhelming challenges inherent in transforming the dominant global energy system with manipulative tactics that made something intangible (carbon dioxide and methane — both potent greenhouse gases — are invisible), tangible. A footprint. Your footprint.

The framing of most of the A.I. articles I have seen thankfully shies away from ascribing individual blame; instead, they point to systemic flaws. This is preferable, but it still does little at the scale of electricity generation worldwide.

⌥ Permalink

The Energy Footprint of A.I.

By: Nick Heer

Casey Crownhart, MIT Technology Review:

Today, new analysis by MIT Technology Review provides an unprecedented and comprehensive look at how much energy the AI industry uses — down to a single query — to trace where its carbon footprint stands now, and where it’s headed, as AI barrels towards billions of daily users.

We spoke to two dozen experts measuring AI’s energy demands, evaluated different AI models and prompts, pored over hundreds of pages of projections and reports, and questioned top AI model makers about their plans. Ultimately, we found that the common understanding of AI’s energy consumption is full of holes.

This robust story comes on the heels of a series of other discussions about how much energy is used by A.I. products and services. Last month, for example, Andy Masley published a comparison of using ChatGPT against other common activities. The Economist ran another, and similar articles have been published before. As far as I can tell, they all come down to the same general conclusion: training A.I. models is energy-intensive, using A.I. products is not, lots of things we do online and offline have a greater impact on the environment, and the current energy use of A.I. is the lowest it will be from now on.

There are lots of good reasons to critique artificial intelligence. I am not sure its environmental impact is a particularly strong one; I think the true energy footprint of tech companies, of which A.I. is one part, is more relevant. Even more pressing, however, is our need to electrify our world as much as we can, and that will require a better and cleaner grid.

⌥ Permalink

Jony Ive’s ‘io’ Acquired by OpenAI; Ive to Remain as Designer

By: Nick Heer

Last month, the Information reported OpenAI was considering buying io Products — unfortunate capitalization theirs — for around $500 million. The company, founded by Jony Ive and employing several ex-Apple designers and engineers, was already known to be working with OpenAI, but it was still an external entity. Now, it is not, to the tune of over $6 billion in equity.

OpenAI today published a press release and video — set in LoveFrom’s distinctive proprietary serif face — featuring Ive and Sam Altman in conversation. There is barely a hint of what they are working on but, whether because of honesty or just clever packaging, it comes across as an earnest attempt to think about the new technologies OpenAI has successfully brought to the world as part of our broader cultural fabric. Of course, it will be expressed in something that can be assembled in a factory and sold for money, so let us not get too teary-eyed. We have heard a similar tune before.

The video promises to reveal something “next year”.

⌥ Permalink

Two Major Newspapers Published an A.I.-Generated Guide to Summer Books That Do Not Exist

By: Nick Heer

Albert Burneko, Defector:

Over this past weekend, the Chicago Sun-Times and Philadelphia Inquirer’s weekend editions included identical huge “Best of Summer” inserts; in the Inquirer’s digital edition the insert runs 54 pages, while the entire rest of the paper occupies 36. Before long, readers began noticing something strange about the “Summer reading list for 2025” section of the insert. Namely, that while the list includes some very well-known authors, most of the books listed in it do not exist.

This is the kind of fluffy insert long purchased by publishers to pad newspapers. In this case, it appears to be produced by Hearst Communications, which feels about right for something with Hearst’s name on it. I cannot imagine most publishers read these things very carefully; adding more work or responsibility is not the point of buying a guide like this.

What I found very funny today was watching the real-time reporting of this story in parallel with Google’s I/O presentation, at which it announced one artificial intelligence feature after another. On the one hand, A.I. features can help you buy event tickets or generate emails offering travel advice based on photos from trips you have taken. On the other, it is inventing books, experts, and diet advice.

⌥ Permalink

My blocking of some crawlers is an editorial decision unrelated to crawl volume

By: cks

Recently I read a lobste.rs comment on one of my recent entries that said, in part:

Repeat after me everyone: the problem with these scrapers is not that they scrape for LLM’s, it’s that they are ill-mannered to the point of being abusive. LLM’s have nothing to do with it.

This may be some people's view but it is not mine. For me, blocking web scrapers here on Wandering Thoughts is partly an editorial decision about whether I want any of my resources or my writing to be fed into whatever they're doing. I will certainly block scrapers for doing what I consider an abusive level of crawling, and in practice most of the scrapers that I block come to my attention due to their volume, but I will also block low-volume scrapers simply because I don't like what they're doing.

Are you a 'brand intelligence' firm that scrapes the web and sells your services to brands and advertisers? Blocked. In general, do you charge for access to whatever you're generating from scraping me? Probably blocked. Are you building a free search site for a cause (and with a point of view) that I don't particularly like? Almost certainly blocked. All of this is an editorial decision on my part on what I want to be even vaguely associated with and what I don't, not a technical decision based on the scraping's effects on my site.

I am not going to even bother trying to 'justify' this decision. It's a decision that needs no justification to some, and to others it's one that can never be justified. My view is that ethics matter. Technology and our decisions about what to do with technology are not politically neutral. We can make choices, and passively not doing anything is a choice too.

(I could say a lot of things here, probably badly, but ethics and politics are in part about what sort of a society we want, and there's no such thing as a neutral stance on that. See also.)

I would block LLM scrapers regardless of how polite they are. The only difference their being politer would make is that I would be less likely to notice (and then block) them. I'm probably not alone in this view.

Our Grafana and Loki installs have quietly become 'legacy software' here

By: cks

At this point we've been running Grafana for quite some time (since late 2018), and (Grafana) Loki for rather less time and on a more ad-hoc and experimental basis. However, over time both have become 'legacy software' here, by which I mean that we (I) have frozen their versions and don't update them any more, and we (I) mostly or entirely don't touch their configurations any more (including, with Grafana, building or changing dashboards).

We froze our Grafana version due to backward compatibility issues. With Loki I could say that I ran out of enthusiasm for going through updates, but part of it was that Loki explicitly deprecated 'promtail' in favour of a more complex solution ('Alloy') that seemed to mostly neglect the one promtail feature we seriously cared about, namely reading logs from the systemd/journald complex. Another factor was that it became increasingly obvious that Loki was not intended for our simple setup and future versions of Loki might well work even worse in it than our current version does.

Part of Grafana and Loki going without updates and becoming 'legacy' is that any future changes in them would be big changes. If we ever have to update our Grafana version, we'll likely have to rebuild a significant number of our current dashboards, because they use panels that aren't supported any more and the replacements have a quite different look and effect, requiring substantial dashboard changes for the dashboards to stay decently usable. With Loki, if the current version stopped working I'd probably either discard the idea entirely (which would make me a bit sad, as I've done useful things through Loki) or switch to something else that had similar functionality. Trying to navigate the rapids of updating to a current Loki is probably roughly as much work (and has roughly as much chance of requiring me to restart our log collection from scratch) as moving to another project.

(People keep mentioning VictoriaLogs (and I know people have had good experiences with it), but my motivation for touching any part of our Loki environment is very low. It works, it hasn't eaten the server it's on and shows no sign of doing that any time soon, and I'm disinclined to do any more work with smart log collection until a clear need shows up. Our canonical source of history for logs continues to be our central syslog server.)

Intel versus AMD is currently an emotional decision for me

By: cks

I recently read Michael Stapelberg's My 2025 high-end Linux PC. One of the decisions Stapelberg made was choosing an Intel (desktop) CPU because of better (ie lower) idle power draw. This is a perfectly rational decision to make, one with good reasoning behind it, and also as I read the article I realized that it was one I wouldn't have made. Not because I don't value idle power draw; like Stapelberg's machine but more so, my desktops spend most of their time essentially idle. Instead, it was because I realized (or confirmed my opinion) that right now, I can't stand to buy Intel CPUs.

I am tired of all sorts of aspects of Intel. I'm tired of their relentless CPU product micro-segmentation across desktops and servers, with things like ECC allowed in some but not all models. I'm tired of their whole dance of P-cores and E-cores, and also of having to carefully read spec sheets to understand the P-core and E-core tradeoffs for a particular model. I'm tired of Intel just generally being behind AMD and repeatedly falling on its face with desperate warmed over CPU refreshes that try to make up for its process node failings. I'm tired of Intel's hardware design failure with their 13th and 14th generation CPUs (see eg here). I'm sure AMD Ryzens have CPU errata too that would horrify me if I knew, but they're not getting rubbed in my face the way the Intel issue is.

At this point Intel has very little going for its desktop CPUs as compared to the current generation AMD Ryzens. Intel CPUs have better idle power levels, and may have better single-core burst performance. In absolute performance I probably won't notice much difference, and unlike Stapelberg I don't do the kind of work where I really care about build speed (and if I do, I have access to much more powerful machines). As far as the idle power goes, I likely will notice the better idle power level (some of the time), but my system is likely to idle at lower power in general than Stapelberg's will, especially at home where I'll try to use the onboard graphics if at all possible (so I won't have the (idle) power price of a GPU card).

(At work I need to drive two 4K displays at 60Hz and I don't think there are many motherboards that will do that with onboard graphics, even if the CPU's built in graphics system is up to it in general.)

But I don't care about the idle power issue. If or when I build a new home desktop, I'll eat the extra 20 watts or so of idle power usage for an AMD CPU (although this may vary in practice, especially with screens blanked). And I'll do it because right now I simply don't want to give Intel my money.

My GNU Emacs settings for the vertico package (as of mid 2025)

By: cks

As covered in my Emacs packages, vertico is one of the third party Emacs packages that I have installed to modify how minibuffer completion works for me, or at least how it looks. In my experience, vertico took a significant amount of customization before I really liked it (eventually including some custom code), so I'm going to write down some notes about why I made various settings.

Vertico itself is there to always show me a number of the completion targets, as a help to narrowing in on what I want; I'm willing to trade vertical space during completion for a better view of what I'm navigating around. It's not the only way to do this (there's fido-vertical-mode in standard GNU Emacs, for example), but it's what I started with and it has a number of settings that let me control both how densely the completions are presented (and so how many of them I get to see at once) and how they're presented.

The first thing I do with vertico is override its key binding for TAB, because I want standard Emacs minibuffer tab completion, not vertico's default behavior of inserting the candidate that completion is currently on. Specifically, my key bindings are:

 :bind (:map vertico-map
             ("TAB" . minibuffer-complete)
             ;; M-v is taken by vertico
             ("M-g M-c" . switch-to-completions)
             ;; Original tab binding, which we want sometimes when
             ;; using orderless completion.
             ("M-TAB" . vertico-insert))

I normally work by using regular tab completion and orderless's completion until I'm happy, then hitting M-TAB if necessary and then RET. I use M-g M-c so rarely that I'd forgotten it until writing this entry. Using M-TAB is especially likely for a long filename completion, where I might use the cursor keys (or theoretically the mouse) to move vertico's selection to a directory and then hit M-TAB to fill it in so I can then tab-complete within it.

Normally, vertico displays a single column of completion candidates, which potentially leaves a lot of wasted space on the right; I use marginalia to add information for some sorts of completion targets (such as Emacs Lisp function names) in this space. For other sorts of completions where there's no particular additional information, such as MH-E mail folder names, I use vertico's vertico-multiform-mode to switch to a vertico-grid so that I fill the space with several columns of completion candidates and reduce the number of vertical lines that vertico uses (both are part of vertico's extensions).

(I also have vertico-mouse enabled when I'm using Emacs under X, but in practice I mostly don't use it.)

Another important change (for me) is to turn off vertico's default behavior of remembering the history of your completions and putting recently used entries first in the list. This sounds like a fine idea, but in practice I want my completion order to be completely predictable and I'm rarely completing the same thing over and over again. The one exception is my custom MH-E folder completion, where I do enable history because I may be, for example, refiling messages into one of a few folders. This is done through another extension, vertico-sort, or at least I think it is.

(When vertico is installed as an ELPA or MELPA package and then use-package'd, you apparently get all of the extensions without necessarily having to specifically enable them and can just use bits from them.)

My feeling is that effective use of vertico probably requires this sort of customization if you regularly use minibuffer completion for anything beyond standard things where vertico (and possibly marginalia) can make good use of all of your horizontal space. Beyond what key bindings and other vertico behavior you can stand and what behavior you have to change, you want to figure out how to tune vertico so that it's significantly useful for each thing you regularly complete, instead of mostly showing you a lot of empty space and useless results. This is intrinsically a relatively personal thing.

PS: One area where vertico's completion history is not as useful as it looks is filename completion or anything that looks like it (such as standard MH-E folder completion). This is because Emacs filename completion and thus vertico's history happens component by component, while you probably want your history to give you the full path that you wound up completing.

PPS: I experimented with setting vertico-resize, but found that the resulting jumping around was too visually distracting.

A thought on JavaScript "proof of work" anti-scraper systems

By: cks

One of the things that people are increasingly using these days to deal with the issue of aggressive LLM and other web scrapers is JavaScript based "proof of work" systems, where your web server requires visiting clients to run some JavaScript to solve a challenge; one such system (increasingly widely used) is Xe Iaso's Anubis. One of the things that people say about these systems is that LLM scrapers will just start spending the CPU time to run this challenge JavaScript, and LLM scrapers may well have lots of CPU time available through means such as compromised machines. One of my thoughts is that things are not quite as simple for the LLM scrapers as they look.
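
To make the idea concrete, the usual shape of these challenges is 'find a value that, hashed together with a server-provided nonce, produces a hash with enough leading zero bits'. Here is a minimal Python sketch of that shape; the specific scheme is my own illustration and is not how Anubis actually implements its challenges.

  import hashlib
  import os

  def make_challenge(difficulty_bits=16):
      # Server side: hand the client a random nonce and a difficulty.
      return {"nonce": os.urandom(16).hex(), "difficulty": difficulty_bits}

  def leading_zero_bits(digest):
      bits = 0
      for byte in digest:
          if byte == 0:
              bits += 8
              continue
          bits += 8 - byte.bit_length()
          break
      return bits

  def solve(challenge):
      # Client side: brute force a counter until the hash is hard enough.
      counter = 0
      while True:
          h = hashlib.sha256(f"{challenge['nonce']}:{counter}".encode()).digest()
          if leading_zero_bits(h) >= challenge["difficulty"]:
              return counter
          counter += 1

  def verify(challenge, counter):
      # Server side: verifying takes one hash, while solving takes many.
      h = hashlib.sha256(f"{challenge['nonce']}:{counter}".encode()).digest()
      return leading_zero_bits(h) >= challenge["difficulty"]

The asymmetry (many hashes to solve, one hash to verify) is the whole point, and it is also why a client can't easily tell this apart from any other CPU-burning JavaScript without actually running it.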

An LLM scraper is operating in a hostile environment (although its operator may not realize this). In a hostile environment, dealing with JavaScript proof of work systems is not as simple as simply running them, because you can't particularly tell a JavaScript proof of work system from JavaScript that does other things. Letting your scraper run JavaScript means that it can also run JavaScript for other purposes, for example for people who would like to exploit your scraper's CPU to do some cryptocurrency mining, or simply have you run JavaScript for as long as you'll let it keep going (perhaps because they've recognized you as an LLM scraper and want to waste as much of your CPU as possible).

An LLM scraper can try to recognize a JavaScript proof of work system but this is a losing game. The other parties have every reason to make themselves look like a proof of work system, and the proof of work systems don't necessarily have an interest in being recognized (partly because this might allow LLM scrapers to short-cut their JavaScript with optimized host implementations of the challenges). And as both spammers and cryptocurrency miners have demonstrated, there is no honor among thieves. If LLM scrapers dangle free computation in front of people, someone will spring up to take advantage of it. This leaves LLM scrapers trying to pick a JavaScript runtime limit that doesn't cut them off from too many sites, while sites can try to recognize LLM scrapers and increase their proof of work difficulty if they see a suspect.

(This is probably not an original thought, but it's been floating around my head for a while.)

PS: JavaScript proof of work systems aren't the greatest thing, but they're going to happen unless someone convincingly demonstrates a better alternative.

The length of file names in early Unix

By: cks

If you use Unix today, you can enjoy relatively long file names on more or less any filesystem that you care to name. But it wasn't always this way. Research V7 had 14-byte filenames, and the System III/System V lineage continued this restriction until it merged with BSD Unix, which had significantly increased this limit as part of moving to a new filesystem (initially called the 'Fast File System', for good reasons). You might wonder where this unusual number came from, and for that matter, what the file name limit was on very early Unixes (it was 8 bytes, which surprised me; I vaguely assumed that it had been 14 from the start).

I've mentioned before that the early versions of Unix had a quite simple format for directory entries. In V7, we can find the directory structure specified in sys/dir.h (dir(5) helpfully directs you to sys/dir.h), which is so short that I will quote it in full:

#ifndef	DIRSIZ
#define	DIRSIZ	14
#endif
struct	direct
{
    ino_t    d_ino;
    char     d_name[DIRSIZ];
};

To fill in the last blank, ino_t was a 16-bit (two byte) unsigned integer (and field alignment on PDP-11s meant that this structure required no padding), for a total of 16 bytes. This directory structure goes back to V4 Unix. In V3 Unix and before, directory entries were only ten bytes long, with 8 byte file names.

(Unix V4 (the Fourth Edition) was when the kernel was rewritten in C, so that may have been considered a good time to do this change. I do have to wonder how they handled the move from the old directory format to the new one, since Unix at this time didn't have multiple filesystem types inside the kernel; you just had the filesystem, plus all of your user tools knew the directory structure.)

One benefit of the change in filename size is that 16-byte directory entries fit evenly into 512-byte disk blocks (or other power-of-two buffer sizes), at exactly 32 entries per block. You never have a directory entry that spans two disk blocks, so you can deal with directories a block at a time. Ten-byte directory entries don't have this property; eight-byte ones would, but then that would leave space for only six-character file names, and presumably that was considered too small even in Unix V1.
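
As an illustration of that convenience, here's a small Python sketch (mine, not anything from V7) that walks a 512-byte V7-style directory block in 16-byte steps; the '<' in the format string reflects the PDP-11's little-endian 16-bit integers.

  import struct

  BLOCK_SIZE = 512
  ENTRY_FMT = "<H14s"                      # 16-bit inode number, 14 bytes of name
  ENTRY_SIZE = struct.calcsize(ENTRY_FMT)  # 16 bytes, so 32 entries per block

  def entries(block):
      for off in range(0, BLOCK_SIZE, ENTRY_SIZE):
          ino, raw_name = struct.unpack_from(ENTRY_FMT, block, off)
          if ino == 0:
              continue      # an inode number of 0 marks an unused slot
          # Short names are NUL-padded; a full 14-byte name has no NUL at all.
          yield ino, raw_name.rstrip(b"\0").decode("ascii", "replace")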

PS: That inode numbers in V7 (and earlier) were 16-bit unsigned integers does mean what you think it means; there could only be at most 65,536 inodes in a single classical V7 filesystem. If you needed more files, you had better make more filesystems. Early Unix had a lot of low limits like that, some of them quite hard-coded.

What keeps Wandering Thoughts more or less free of comment spam (2025 edition)

By: cks

Like everywhere else, Wandering Thoughts (this blog) gets a certain amount of automated comment spam attempts. Over the years I've fiddled around with a variety of anti-spam precautions, although not all of them have worked out over time. It's been a long time since I've written anything about this, because one particular trick has been extremely effective ever since I introduced it.

That one trick is a honeypot text field in my 'write a comment' form. This field is normally hidden by CSS, and in any case the label for the field says not to put anything in it. However, for a very long time now, automated comment spam systems seem to operate by stuffing some text into every (text) form field that they find before they submit the form, which always trips this honeypot. I log the form field's text out of curiosity; sometimes it's garbage and sometimes it's (probably) meaningful for the spam comment that the system is trying to submit.
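
The server side of this is simple. Here is a sketch of the general technique in Python, with a hypothetical field name and form-handling functions; it is not DWiki's actual code, just the shape of the check.

  HONEYPOT_FIELD = "website2"   # hidden via CSS; its label says to leave it empty

  def handle_comment(form):
      trap = form.get(HONEYPOT_FIELD, "").strip()
      if trap:
          # An automated submitter stuffed text into the hidden field,
          # so log what it sent and reject the comment.
          log_spam_attempt(trap)
          return "rejected"
      return accept_comment(form["comment"])

  def log_spam_attempt(text):
      print("honeypot tripped:", text[:200])

  def accept_comment(text):
      return "accepted"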

Obviously this doesn't stop human-submitted spam, which I get a small amount of every so often. In general I don't expect anything I can reasonably do to stop humans who do the work themselves; we've seen this play out in email and I don't have any expectations that I can do better. It also probably wouldn't work if I was using a popular platform that had this as a general standard feature, because then it would be worth the time of the people writing automated comment spam systems to automatically recognize it and work around it.

Making comments on Wandering Thoughts has an additional small obstacle in the way of automated comment spammers, which is that you must preview your comment before you can submit it (although you don't have to submit the comment that you previewed; you can edit it after the first preview). Based on a quick look at my server logs, I don't think this matters to the current automated comment spam systems that try things here, as they only appear to try submitting once. I consider requiring people to preview their comment before posting it to be a good idea in general, especially since Wandering Thoughts uses a custom wiki-syntax and a forced preview gives people some chance of noticing any mistakes.

(I think some amount of people trying to write comments here do miss this requirement and wind up not actually posting their comment in the end. Or maybe they decide not to after writing one version of it; server logs give me only so much information.)

In a world that is increasingly introducing various sorts of aggressive precautions against LLM crawlers, including 'proof of work' challenges, all of this may become increasingly irrelevant. This could go either way; either the automated comment spammers die off as more and more systems have protections that are too aggressive for them to deal with, or the automated systems become increasingly browser-based and sidestep my major precaution because they no longer 'see' the honeypot field.

Fedora's DNF 5 and the curse of mandatory too-smart output

By: cks

DNF is Fedora's high(er) level package management system, which pretty much any system administrator is going to have to use to install and upgrade packages. Fedora 41 and later have switched from DNF 4 to DNF 5 as their normal (and probably almost mandatory) version of DNF. I ran into some problems with this switch, and since then I've found other issues, all of which boil down to one thing: DNF 5 insists on doing too-smart output.

Regardless of what you set your $TERM to and what else you do, if DNF 5 is connected to a terminal (and perhaps if it isn't), it will pretty-print its output in an assortment of ways. As far as I can tell it simply assumes ANSI cursor addressability, among other things, and will always fit its output to the width of your terminal window, truncating output as necessary. This includes output from RPM package scripts that are running as part of the update. Did one of them print a line longer than your current terminal width? Tough, it was probably truncated. Are you using script so that you can capture and review all of the output from DNF and RPM package scripts? Again, tough, you can't turn off the progress bars and other things that will make a complete mess of the typescript.

(It's possible that you can find the information you want in /var/log/dnf5.log in un-truncated and readable form, but if so it's buried in debug output and I'm not sure I trust dnf5.log in general.)

DNF 5 is far from the only offender these days. An increasing number of command line programs simply assume that they should always produce 'smart' output (ideally only if they're connected to a terminal). They have no command line option to turn this off and, since they always use 'ANSI' escape sequences, they ignore the tradition of '$TERM' and especially 'TERM=dumb' to turn that off. Some of them can specifically disable colour output (typically with one of a number of environment variables, which may or may not be documented, and sometimes with a command line option), but that's usually the limit of their willingness to stop doing things. The idea of printing one whole line at a time as you do things, and not printing progress bars, interleaved output, and so on, has increasingly become a non-starter for modern command line tools.
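
For contrast, the traditional checks that command line tools used to make before producing 'smart' output are easy to write; here is a sketch of them in Python (the exact set and precedence of checks here is my choice, not something any particular tool promises).

  import os
  import sys

  def want_smart_output(stream=sys.stdout):
      # Only decorate output when talking to an actual terminal...
      if not stream.isatty():
          return False
      # ...that isn't declared dumb (the old '$TERM' convention)...
      if os.environ.get("TERM", "dumb") == "dumb":
          return False
      # ...and when the user hasn't opted out of colours (no-color.org).
      if os.environ.get("NO_COLOR"):
          return False
      return True

  if want_smart_output():
      print("\x1b[1mprogress bars and colours allowed\x1b[0m")
  else:
      print("plain line-at-a-time output")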

(Another semi-offender is Debian's 'apt' and also 'apt-get' to some extent, although apt-get's progress bars can be turned off and 'apt' is explicitly a more user friendly front end to apt-get and friends.)

PS: I can't run DNF with its output directed into a file because it wants you to interact with it to approve things, and I don't feel like letting it run freely without that.

Thinking about what you'd want in a modern simple web server

By: cks

Over on the Fediverse, I said:

I'm currently thinking about what you'd want in a simple modern web server that made life easy for sites that weren't purely static. I think you want CGI, FastCGI, and HTTP reverse proxying, plus process supervision. Automatic HTTPS of course. Rate limiting support, and who knows what you'd want to make it easier to deal with the LLM crawler problem.

(This is where I imagine a 'stick a third party proxy in the middle' mode of operation.)

What I left out of my Fediverse post is that this would be aimed at small scale sites. Larger, more complex sites can and should invest in the power, performance, and so on of headline choices like Apache and Nginx. And yes, one obvious candidate in this area is Caddy, but at the same time, something that has "more scalable" (than alternatives) as a headline feature is not really targeting the same area as I'm thinking of.

This goal of simplicity of operation is why I put "process supervision" into the list of features. In a traditional reverse proxy situation (whether this is FastCGI or HTTP), you manage the reverse proxy process separately from the main webserver, but that requires more work from you. Putting process supervision into the web server has the goal of making all of that more transparent to you. Ideally, in common configurations you wouldn't even really care that there was a separate process handling FastCGI, PHP, or whatever; you could just put things into a directory or add some simple configuration to the web server and restart it, and everything would work. Ideally this would extend to automatically supporting PHP by just putting PHP files somewhere in the directory tree, just like CGI; internally the web server would start a FastCGI process to handle them or something.

(Possibly you'd implement CGI through a FastCGI gateway, but if so this would be more or less pre-configured into the web server and it'd ship with a FastCGI gateway for this (and for PHP).)

This is also the goal for making it easy to stick a third party filtering proxy in the middle of processing requests. Rather than having to explicitly set up two web servers (a frontend and a backend) with an anti-LLM filtering proxy in the middle, you would write some web server configuration bits and then your one web server would split itself into a frontend and a backend with the filtering proxy in the middle. There's no technical reason you can't do this, and even control what's run through the filtering proxy and what's served directly by the front end web server.

This simple web server should probably include support for HTTP Basic Authentication, so that you can easily create access restricted areas within your website. I'm not sure if it should include support for any other sort of authentication, but if it did it would probably be OpenID Connect (OIDC), since that would let you (and other people) authenticate through external identity providers.

It would be nice if the web server included some degree of support for more or less automatic smart in-memory (or on-disk) caching, so that if some popular site linked to your little server, things wouldn't explode (or these days, if a link to your site was shared on the Fediverse and all of the Fediverse servers that it propagated to immediately descended on your server). At the very least there should be enough rate limiting that your little server wouldn't fall over, and perhaps some degree of bandwidth limits you could set so that you wouldn't wake up to discover you had run over your outgoing bandwidth limits and were facing large charges.
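
The rate limiting part of this is at least conceptually simple. Here is a minimal token-bucket sketch of the kind of per-client limiting such a server could build in; the numbers and the choice to key on client IP are illustrative assumptions, not a recommendation.

  import time

  class TokenBucket:
      """Allow roughly `rate` requests per second, with bursts up to `burst`."""
      def __init__(self, rate, burst):
          self.rate = rate
          self.burst = burst
          self.tokens = burst
          self.last = time.monotonic()

      def allow(self):
          now = time.monotonic()
          self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
          self.last = now
          if self.tokens >= 1.0:
              self.tokens -= 1.0
              return True
          return False

  # One bucket per client IP: 2 requests/second sustained, bursts of 10.
  buckets = {}

  def allow_request(client_ip):
      bucket = buckets.setdefault(client_ip, TokenBucket(rate=2.0, burst=10.0))
      return bucket.allow()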

I doubt anyone is going to write such a web server, since this isn't likely to be the kind of web server that sets the world on fire, and probably something like Caddy is more or less good enough.

(Doing a good job of writing such a server would also involve a fair amount of research to learn what people want to run at a small scale, how much they know, what sort of server resources they have or want to use, what server side languages they wind up using, what features they need, and so on. I certainly don't know enough about the small scale web today.)

PS: One reason I'm interested in this is that I'd sort of like such a server myself. These days I use Apache and I'm quite familiar with it, but at the same time I know it's a big beast and sometimes it has entirely too many configuration options and special settings and so on.

The five platforms we have to cover when planning systems

By: cks

Suppose, not entirely hypothetically, that you're going to need a 'VPN' system that authenticates through OIDC. What platforms do you need this VPN system to support? In our environment, the answer is that we have five platforms that we need to care about, and they're the obvious four plus one more: Windows, macOS, iOS, Android, and Linux.

We need to cover these five platforms because people here use our services from all of them. Both Windows and macOS are popular on laptops (and desktops, which still linger around), and there are enough people who use Linux that it's something we need to care about. On mobile devices (phones and tablets), iOS and Android are obviously the two big options, with people using either or both. We don't usually worry about the versions of Windows and macOS and suggest that people stick to supported ones, but that may need to change with Windows 10.

Needing to support mobile devices unquestionably narrows our options for what we can use, at least in theory, because there are certain sorts of things you can semi-reasonably do on Linux, macOS, and Windows that are infeasible to do (at least for us) on mobile devices. But we have to support access to various of our services even on iOS and Android, which constrains us to certain sorts of solutions, and ideally ones that can deal with network interruptions (which are quite common on mobile devices in Toronto, as anyone who takes our subways is familiar with).

(And obviously it's easier for open source systems to support Linux, macOS, and Windows than it is for them to extend this support to Android and especially iOS. This extends to us patching and rebuilding them for local needs; with various modern languages, we can produce Windows or macOS binaries from modified open source projects. Not so much for mobile devices.)

In an ideal world it would be easy to find out the support matrix of platforms (and features) for any given project. In this world, the information can sometimes be obscure, especially for what features are supported on what platforms. One of my resolutions to myself is that when I find interesting projects but they seem to have platform limitations, I should note down where in their documentation they discuss this, so I can find it later to see if things have changed (or to discuss with people why certain projects might be troublesome).

Python, type hints, and feeling like they create a different language

By: cks

At this point I've only written a few, relatively small programs with type hints. At times when doing this, I've wound up feeling that I was writing programs in a language that wasn't quite exactly Python (but obviously was closely related to it). What was idiomatic in one language was non-idiomatic in the other, and I wanted to write code differently. This feeling of difference is one reason I've kept going back and forth over whether I should use type hints (well, in personal programs).

Looking back, I suspect that this is partly a product of a style where I tried to use typing.NewType a lot. As I found out, this may not really be what I want to do. Using type aliases (or just structural descriptions of the types) seems like it's going to be easier, since it's mostly just a matter of marking up things. I also suspect that this feeling that typed Python is a somewhat different language from plain Python is a product of my lack of experience with typed Python (which I can fix by doing more with types in my own code, perhaps revising existing programs to add type annotations).

However, I suspect some of this feeling of difference is that you (I) want to structure 'typed' Python code differently than untyped code. In untyped Python, duck typing is fine, including things like returning None or some meaningful type, and you can to a certain extent pass things around without caring what type they are. In this sort of situation, typed Python has pushed me toward narrowing the types involved in my code (although typing.Optional can help here). Sometimes this is a good thing; at other times, I wind up using '0.0' to mean 'this float value is not set' when in untyped Python I would use 'None' (because propagating the type difference of the second way through the code is too annoying). Or to put it another way, typed Python feels less casual, and there are good and bad aspects to this.
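
As a small (made-up) illustration of the difference in feel, here is roughly the same function written the casual untyped way and then the way typed Python nudges me toward, where the 'might not be there' case has to be spelled out:

  from typing import Optional

  # Untyped, casual style: None quietly stands in for 'no value'.
  def average_latency(samples):
      if not samples:
          return None
      return sum(samples) / len(samples)

  # Typed style: every caller is pushed to deal with the None case
  # (or you give in and use a 0.0 sentinel to keep the type simple).
  def average_latency_typed(samples: list[float]) -> Optional[float]:
      if not samples:
          return None
      return sum(samples) / len(samples)

  latency = average_latency_typed([1.2, 3.4])
  if latency is not None:
      print(f"average: {latency:.2f}")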

Unfortunately, one significant source of Python code that I work on is effectively off limits for type hints, and that's the Python code I write for work. For that code, I need to stick to the subset of Python that my co-workers know and can readily understand, and that subset doesn't include Python's type hints. I could try to teach my co-workers about type hints, but my view is that if I'm wrestling with whether it's worth it, my co-workers will be even less receptive to the idea of trying to learn and remember them (especially when they look at my Python code only infrequently). If we were constantly working with medium to large Python programs where type hints were valuable for documenting things and avoiding irritating errors it would be one thing, but as it is our programs are small and we can go months between touching any Python code. I care about Python type hints and have active exposure to them, and even I have to refresh my memory on them from time to time.

(Perhaps some day type hints will be pervasive enough in third party Python code and code examples that my co-workers will absorb and remember them through osmosis, but that day isn't today.)

The lack of a good command line way to sort IPv6 addresses

By: cks

A few years ago, I wrote about how 'sort -V' can sort IPv4 addresses into their natural order for you. Even back then I was smart enough to put in that 'IPv4' qualification and note that this didn't work with IPv6 addresses, and said that I didn't know of any way to handle IPv6 addresses with existing command line tools. As far as I know, that remains the case today, although you can probably build a Perl, Python, or other language program that does such sorting for you if you need to do this regularly.

Unix tools like 'sort' are pretty flexible, so you might innocently wonder why it can't be coerced into sorting IPv6 addresses. The first problem is that IPv6 addresses are written in hex without leading 0s, not decimal. Conventional sort will correctly sort hex numbers if all of the numbers are the same length, but IPv6 addresses are written in hex groups that conventionally drop leading zeros, so you will have 'ff' instead of '00ff' in common output (or '0' instead of '0000'). The second and bigger problem is the IPv6 '::' notation, which stands for the longest run of all-zero fields, ie some number of '0000' fields.

(I'm ignoring IPv6 scopes and zones for this, let's assume we have public IPv6 addresses.)

If IPv6 addresses were written out in full, with leading 0s on fields and all their 0000 fields, you could handle them as a simple conventional sort (you wouldn't even need to tell sort that the field separator was ':'). Unfortunately they almost never are, so you need to either transform them to that form, print them out, sort the output, and perhaps transform them back, or read them into a program as 128-bit numbers, sort the numbers, and print them back out as IPv6 addresses. Ideally your language of choice for this has a way to sort a collection of IPv6 addresses.
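
Python's standard library makes the second approach easy, since ipaddress.IPv6Address objects compare as 128-bit numbers. A sketch of a little filter that sorts IPv6 addresses read one per line from standard input:

  #!/usr/bin/env python3
  import ipaddress
  import sys

  addrs = [ipaddress.IPv6Address(line.strip())
           for line in sys.stdin if line.strip()]

  for addr in sorted(addrs):
      # .compressed is the usual short form; .exploded would print the
      # fully written out form that a plain 'sort' could handle.
      print(addr.compressed)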

The very determined can probably do this with awk with enough work (people have done amazing things in awk). But my feeling is that doing this in conventional Unix command line tools is a Turing tarpit; you might as well use a language where there's a type of IPv6 addresses that exposes the functionality that you need.

(And because IPv6 addresses are so complex, I suspect that GNU Sort will never support them directly. If you need GNU Sort to deal with them, the best option is a program that turns them into their full form.)

PS: People have probably written programs to sort IPv6 addresses, but with the state of the Internet today, the challenge is finding them.

It's not obvious how to verify TLS client certificates issued for domains

By: cks

TLS server certificate verification has two parts; you first verify that the TLS certificate is a valid, CA-signed certificate, and then you verify that the TLS certificate is for the host you're connecting to. One of the practical issues with TLS 'Client Authentication' certificates for host and domain names (which are on the way out) is that there's no standard for how you do the second part of this verification, or whether you even should. In particular, what host name are you validating the TLS client certificate against?

Some existing protocols provide the 'client host name' to the server; for example, SMTP has the EHLO command. However, existing protocols tend not to have explicitly standardized using this name (or any specific approach) for verifying a TLS client certificate if one is presented to the server, and large mail providers vary in what they send as a TLS client certificate in SMTP conversations. For example, Google's use of 'smtp.gmail.com' doesn't match any of the other names available, so its only meaning is 'this connection comes from a machine that has access to private keys for a TLS certificate for smtp.gmail.com', which hopefully means that it belongs to GMail and is supposed to be used for this purpose.

If there is no validation of the TLS client certificate host name, that is all that a validly signed TLS client certificate means; the connecting host has access to the private keys and so can be presumed to be 'part of' that domain or host. This isn't nothing, but it doesn't authenticate what exactly the client host is. If you want to validate the host name, you have to decide what to validate against and there are multiple answers. If you design the protocol you can have the protocol send a client host name and then validate the TLS certificate against the hostname; this is slightly better than using the TLS certificate's hostname as is in the rest of your processing, since the TLS certificate might have a wildcard host name. Otherwise, you might validate the TLS certificate host name against its reverse DNS, which is more complicated than you might expect and which will fail if DNS isn't working. If the TLS client certificate doesn't have a wildcard, you could also try to look up the IP addresses associated with the host names in the TLS certificate and see if any of the IP addresses match, but again you're depending on DNS.
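
To make the 'validate against a protocol-provided name' option concrete, here is a sketch of what a server might do with Python's ssl module, checking a claimed host name (say, from an SMTP EHLO) against the DNS names in a verified TLS client certificate. The crude single-label wildcard handling is mine; a real implementation would want something more careful.

  import ssl

  def client_cert_matches(tls_sock, claimed_host):
      # getpeercert() only returns a populated dict if the handshake
      # requested and verified a client certificate (CERT_OPTIONAL/REQUIRED).
      cert = tls_sock.getpeercert()
      if not cert:
          return False
      names = [value for (kind, value) in cert.get("subjectAltName", ())
               if kind == "DNS"]
      claimed = claimed_host.lower().rstrip(".")
      for name in names:
          name = name.lower().rstrip(".")
          if name == claimed:
              return True
          # Crude wildcard handling: '*.example.org' matches one label.
          if name.startswith("*.") and "." in claimed:
              if claimed.split(".", 1)[1] == name[2:]:
                  return True
      return False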

(You can require non-wildcard TLS certificate names in your protocol, but people may not like it for various reasons.)

This dependency on DNS for TLS client certificates is different from the DNS dependency for TLS server certificates. If DNS doesn't work for the server case, you're not connecting at all since you have no target IPs; if you can connect, you have a target hostname to validate against (in the straightforward case of using a hostname instead of an IP address). In the TLS client certificate case, the client can connect but then the TLS server may deny it access for apparently arbitrary reasons.

That your protocol has to specifically decide what verifying TLS client certificates means (and there are multiple possible answers) is, I suspect, one reason that TLS client certificates aren't used more in general Internet protocols. In turn this is a disincentive for servers implementing TLS-based protocols (including SMTP) from telling TLS clients that they can send a TLS client certificate, since it's not clear what you should do with it if one is sent.

Let's Encrypt drops "Client Authentication" from its TLS certificates

By: cks

The TLS news of the time interval is that Let's Encrypt certificates will no longer be usable to authenticate your client to a TLS server (via a number of people on the Fediverse). This is driven by a change in Chrome's "Root Program", covered in section 3.2, with a further discussion of this in Chrome's charmingly named Moving Forward, Together in the "Understanding dedicated hierarchies" section; apparently only half of the current root Certificate Authorities actually issue TLS server certificates. As far as I know this is not yet a CA/Browser Forum requirement, so this is all driven by Chrome.

In TLS client authentication, a TLS client (the thing connecting to a TLS server) can present its own TLS certificate to the TLS server, just as the TLS server presents its certificate to the client. The server can then authenticate the client certificate however it wants to, although how to do this is not as clear as when you're authenticating a TLS server's certificate. To enable this usage, a TLS certificate and the entire certificate chain must be marked as 'you can use these TLS certificates for client authentication' (and similarly, a TLS certificate that will be used to authenticate a server to clients must be marked as such). That marking is what Let's Encrypt is removing.
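
That marking is the certificate's Extended Key Usage extension. If you want to check whether a certificate you have is still marked for client authentication, a rough sketch with Python's third-party 'cryptography' package looks like this; the 'cert.pem' filename is just an example.

  from cryptography import x509
  from cryptography.x509.oid import ExtendedKeyUsageOID

  def allows_client_auth(pem_data):
      cert = x509.load_pem_x509_certificate(pem_data)
      try:
          eku = cert.extensions.get_extension_for_class(x509.ExtendedKeyUsage).value
      except x509.ExtensionNotFound:
          # No EKU extension at all; what that means is up to the verifier.
          return False
      return ExtendedKeyUsageOID.CLIENT_AUTH in eku

  with open("cert.pem", "rb") as f:
      print("clientAuth allowed:", allows_client_auth(f.read()))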

This doesn't affect public web PKI, which basically never used conventional CA-issued host and domain TLS certificates as TLS client certificates (websites that used TLS client certificates used other sorts of TLS certificates). It does potentially affect some non-web public TLS, where domain TLS certificates have seen small usage in adding more authentication to SMTP connections between mail systems. I run some spam trap SMTP servers that advertise that sending mail systems can include a TLS client certificate if the sender wants to, and some senders (including GMail and Outlook) do send proper public TLS certificates (and somewhat more SMTP senders include bad TLS certificates). Most mail servers don't, though, and given that one of the best sources of free TLS certificates has just dropped support for this usage, that's unlikely to change. Let's Encrypt's TLS certificates can still be used by your SMTP server for receiving email, but you'll no longer be able to use them for sending it.

On the one hand, I don't think this is going to have material effects on much public Internet traffic and TLS usage. On the other hand, it does cut off some possibilities in non-web public TLS, at least until someone starts up a free, ACME-enabled Certificate Authority that will issue TLS client certificates. And probably some number of mail servers will keep sending their TLS certificates to people as client certificates even though they're no longer valid for that purpose.

PS: If you're building your own system and you want to, there's nothing stopping you from accepting public TLS server certificates from TLS clients (although you'll have to tell your TLS library to validate them as TLS server certificates, not client certificates, since they won't be marked as valid for TLS client usage). Doing the security analysis is up to you but I don't think it's a fatally flawed idea.

Classical "Single user computers" were a flawed or at least limited idea

By: cks

Every so often people yearn for a lost (1980s or so) era of 'single user computers', whether these are simple personal computers or high end things like Lisp machines and Smalltalk workstations. It's my view that the whole idea of a 1980s style "single user computer" is not what we actually want and has some significant flaws in practice.

The platonic image of a single user computer in this style was one where everything about the computer (or at least its software) was open to your inspection and modification, from the very lowest level of the 'operating system' (which was more of a runtime environment than an OS as such) to the highest things you interacted with (both Lisp machines and Smalltalk environments often touted this as a significant attraction, and it's often repeated in stories about them). In personal computers this was a simple machine that you had full control over from system boot onward.

The problem is that this unitary, open environment is (or was) complex and often lacked resilience. Famously, in the case of early personal computers, you could crash the entire system with programming mistakes, and if there's one thing people do all the time, it's make mistakes. Most personal computers mitigated this by only doing one thing at once, but even then it was unpleasant, and the Amiga would let you blow multiple processes up at once if you could fit them all into RAM. Even on better protected systems, like Lisp and Smalltalk, you still had the complexity and connectedness of a unitary environment.

One of the things that we've learned from computing over the past N decades is that separation, isolation, and abstraction are good ideas. People can only keep track of so many things in their heads at once, and modularity (in the broad sense) is one large way we keep things within that limit (or at least closer to it). Single user computers were quite personal but usually not very modular. There are reasons that people moved to computers with things like memory protection, multiple processes, and various sorts of privilege separation.

(Let us not forget the great power of just having things in separate objects, where you can move around or manipulate or revert just one object instead of 'your entire world'.)

I think that there is a role for computers that are unapologetically designed to be used by only a single person who is in full control of everything and able to change it if they want to. But I don't think any of the classical "single user computer" designs are how we want to realize a modern version of the idea.

(As a practical matter I think that a usable modern computer system has to be beyond the understanding of any single person. There is just too much complexity involved in anything except very restricted computing, even if you start from complete scratch. This implies that an 'understandable' system really needs strong boundaries between its modules so that you can focus on the bits that are of interest to you without having to learn lots of things about the rest of the system or risk changing things you don't intend to.)

Two broad approaches to having Multi-Factor Authentication everywhere

By: cks

In this modern age, more and more people are facing more and more pressure to have pervasive Multi-Factor Authentication, with every authentication your people perform protected by MFA in some way. I've come to feel that there are two broad approaches to achieving this and one of them is more realistic than the other, although it's also less appealing in some ways and less neat (and arguably less secure).

The 'proper' way to protect everything with MFA is to separately and individually add MFA to everything you have that does authentication. Ideally you will have a central 'single sign on' system, perhaps using OIDC, and certainly your people will want you to have only one form of MFA even if it's not all run through your SSO. What this implies is that you need to add MFA to every service and protocol you have, which ranges from generally easy (websites) through being annoying to people or requiring odd things (SSH) to almost impossible at the moment (IMAP, authenticated SMTP, and POP3). If you opt to set it up with no exemptions for internal access, this approach to MFA ensures that absolutely everything is MFA protected, without any holes through which an un-MFA'd authentication can be done.

The other way is to create some form of MFA-protected network access (a VPN, a mesh network, a MFA-authenticated SSH jumphost, there are many options) and then restrict all non-MFA access to coming through this MFA-protected network access. For services where it's easy enough, you might support additional MFA authenticated access from outside your special network. For other services where MFA isn't easy or isn't feasible, they're only accessible from the MFA-protected environment and a necessary step for getting access to them is to bring up your MFA-protected connection. This approach to MFA has the obvious problem that if someone gets access to your MFA-protected network, they have non-MFA access to everything else, and the less obvious problem that attackers might be able to pass the MFA network access as one person and then do non-MFA authentication as another person on your systems and services.
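
In practice, the 'restrict all non-MFA access' part of this second approach often comes down to ordinary per-service firewall rules. As a minimal sketch (with a made-up MFA network of 10.10.0.0/24 and iptables, although any packet filter will do), restricting IMAP might look like:

# allow IMAP and IMAPS only from the MFA-protected network
iptables -A INPUT -p tcp -m multiport --dports 143,993 -s 10.10.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp -m multiport --dports 143,993 -j REJECT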

The proper way is quite appealing to system administrators. It gives us an array of interesting challenges to solve, neat technology to poke at, and appealingly strong security guarantees. Unfortunately the proper way has two downsides: there's essentially no chance of it covering your IMAP and authenticated SMTP services any time soon (unless you're willing to accept some significant restrictions), and it requires your people to learn and use a bewildering variety of special purpose, one-off interfaces and sometimes software (and when it needs software, there may be restrictions on what platforms the software is readily available on). Although it's less neat and less nominally secure, the practical advantage of the MFA protected network access approach is that it's universal and it's one single thing for people to deal with (and by extension, as long as the network system itself covers all platforms you care about, your services are fully accessible from all platforms).

(In practice the MFA protected network approach will probably be two things for people to deal with, not one, since if you have websites the natural way to protect them is with OIDC (or if you have to, SAML) through your single sign on system. Hopefully your SSO system is also what's being used for the MFA network access, so people only have to sign on to it once a day or whatever.)

Using awk to check your script's configuration file

By: cks

Suppose, not hypothetically, that you have a shell script with a relatively simple configuration file format that people can still accidentally get wrong. You'd like to check the configuration file for problems before you use it in the rest of your script, for example by using it with 'join' (where things like the wrong number or type of fields will be a problem). Recently on the Fediverse I shared how I was doing this with awk, so here's a slightly more elaborate and filled out version:

errs=$(awk '
         $1 ~ "^#" { next }
         NF != 3 {
            printf " line %d: wrong number of fields\n", NR;
            next }
         [...]
         ' "$cfg_file"
       )

if [ -n "$errs" ]; then
   echo "$prog: Errors found in '$cfg_file'. Stopping." 1>&2
   echo "$errs" 1>&2
   exit 1
fi

(Here I've chosen to have awk's diagnostic messages indented by one space when the script prints them out, hence the space before 'line %d: ...'.)

The advantage of having awk simply print out the errors it detects and letting the script deal with them later is that you don't need to mess around with awk's exit status; your awk program can simply print what it finds and be done. Using awk for the syntax checks is handy because it lets you express a fair amount of logic and checks relatively simply (you can even check for duplicate entries and so on), and it also gives you line numbers for free.

One trick with using awk in this way is to progressively filter things in your checks (by skipping further processing of the current line with 'next'). We start out by skipping all comments, then we report and otherwise skip every line with the wrong number of fields, and then every check after this can assume that at least we have the right number of fields, so it can confidently check what should be in each one. If the number of fields in a line is wrong there's no point in complaining about how one of them has the wrong sort of value, and the early check plus 'next' to skip the rest of the line's processing is the simple way to arrange that.
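
As an illustration of both the progressive filtering and the duplicate detection mentioned above, a check like the following (hypothetical, and assuming the first field is supposed to be unique) can go after the field-count check, where it's safe to trust $1:

# only reached by lines with the right number of fields
seen[$1]++ {
   printf " line %d: duplicate entry for %s\n", NR, $1;
   next }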

If you're also having awk process the configuration file later you might be tempted to have it check for errors at the same time, in an all-in-one awk program, but my view is that it's simpler to split the error checking from the processing. That way you don't have to worry about stopping the processing if you detect errors or intermingle processing logic with checking logic. You do have to make sure the two versions have the same handling of comments and so on, but in simple configuration file formats this is usually easy.

(Speaking from personal experience, you don't want to use '$1 == "#"' as your comment definition, because then you can't just stick a '#' in front of an existing configuration file line to comment it out. Instead you have to remember to make it '# ', and someday you'll forget.)

PS: If your awk program is big and complex enough, it might make more sense to use a here document to create a shell variable containing it, which will let you avoid certain sorts of annoying quoting problems.
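
As a minimal sketch of that here document approach (using the same checks as before), it might look like this:

# hold the awk program in a shell variable via a quoted here document,
# which sidesteps the usual problems with single quotes inside awk
awkprog=$(cat <<'EOF'
$1 ~ "^#" { next }
NF != 3 {
   printf " line %d: wrong number of fields\n", NR;
   next }
EOF
)
errs=$(awk "$awkprog" "$cfg_file")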

Our need for re-provisioning support in mesh networks (and elsewhere)

By: cks

In a comment on my entry on how WireGuard mesh networks need a provisioning system, vcarceler pointed me to Innernet (also), an interesting but opinionated provisioning system for WireGuard. However, two bits of it combined made me twitch a bit: Innernet only allows you to provision a given node once, and once a node is assigned an internal IP, that IP is never reused. This lack of support for re-provisioning machines would be a problem for us and we'd likely have to do something about it, one way or another. Nor is this an issue unique to Innernet, as a number of mesh network systems have it.

Our important servers have fixed, durable identities, and in practice these identities are both DNS names and IP addresses (we have some generic machines, but they aren't as important). We also regularly re-provision these servers, which is to say that we reinstall them from scratch, usually on new hardware. In the usual course of events this happens roughly every two years or every four years, depending on whether we're upgrading the machine for every Ubuntu LTS release or every other one. Over time this is a lot of re-provisionings, and we need the re-provisioned servers to keep their 'identity' when this happens.

We especially need to be able to rebuild a dead server as an identical replacement if its hardware completely breaks and eats its system disks. We're already in a crisis at that point; we don't want a worse crisis because we can't exactly replace the server and instead have to build a new server that fills the same role, or will once DNS is updated, configurations are changed, and so on.

This is relatively straightforward for regular Linux servers with regular networking; there's the issue of SSH host keys, but there's several solutions. But obviously there's a problem if the server is also a mesh network node and the mesh network system will not let it be re-provisioned under the same name or the same internal IP address. Accepting this limitation would make it difficult to use the mesh network for some things, especially things where we don't want to depend on DNS working (for example, sending system logs via syslog). Working around the limitation requires reverse engineering where the mesh network system stores local state and hopefully being able to save a copy elsewhere and restore it; among other things, this has implications for the mesh network system's security model.
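
One of those solutions is simply to save the host keys and put them back after the reinstall; a minimal sketch, with made-up paths, is:

# before reinstalling: squirrel away the host keys
tar czf /somewhere/safe/$(hostname)-hostkeys.tgz /etc/ssh/ssh_host_*

# after reinstalling: restore them and restart the SSH daemon
# (the service is 'ssh' on Ubuntu, 'sshd' on some other systems)
tar xzf /somewhere/safe/$(hostname)-hostkeys.tgz -C /
systemctl restart ssh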

For us, it would be better if mesh networking systems explicitly allowed this re-provisioning. They could make it a non-default setting that took explicit manual action on the part of the network administrator (and possibly required nodes to cooperate and extend more trust than normal to the central provisioning system). Or a system like Innernet could have a separate class of IP addresses, call them 'service addresses', that could be assigned and reassigned to nodes by administrators. A node would always have its unique identity but could also be assigned one or more service addresses.

(Of course our other option is to not use a mesh network system that imposes this restriction, even if it would otherwise make our lives easier. Unless we really need the system for some other reason or its local state management is explicitly documented, this is our more likely choice.)

PS: The other problem with permanently 'consuming' IP addresses as machines are re-provisioned is that you run out of them sooner or later unless you use gigantic network blocks that are many times larger than the number of servers you'll ever have (well, in IPv4, but we're not going to switch to IPv6 just to enable a mesh network provisioning system).

How and why typical (SaaS) pricing is too high for university departments

By: cks

One thing I've seen repeatedly is that companies that sell SaaS or SaaS like things and offer educational pricing (because they want to sell to universities too) are setting (initial) educational pricing that is in practice much too high. Today I'm going to work through a schematic example to explain what I mean. All of this is based on how it works in Canadian and I believe US universities; other university systems may be somewhat different.

Let's suppose that you're a SaaS vendor and like many vendors, you price your product at $X per person per month; I'll pick $5 (US, because most of the time the prices are in USD). Since you want to sell to universities and other educational institutions and you understand they don't have as much money to spend as regular companies, you offer a generous academic discount; they pay only $3 USD per person per month.

(If these numbers seem low, I'm deliberately stacking the deck in the favour of the SaaS company. Things get worse for your pricing as the numbers go up.)

The research and graduate student side of a large but not enormous university department is considering your software. They have 100 professors 'in' the department, 50 technical and administrative support staff (this is a low ratio), and professors have an average of 10 graduate students, research assistants, postdocs, outside collaborators, undergraduate students helping out with research projects, and so on around them, for a total of 1,000 additional people 'in' the department who will also have to be covered. These 1,150 people will cost the department $3,450 USD a month for your software, a total of $41,400 USD a year, which is a significant saving over what a commercial company would pay for the same number of people.

Unfortunately, unless your software is extremely compelling or absolutely necessary, this cost is likely to be a very tough sell. In many departments, that's enough money to fund (or mostly fund) an additional low-level staff position, and it's certainly enough money to hire more TAs, supplement more graduate student stipends (these are often the same thing, since hiring graduate students as TAs is one of the ways that you support them), or pay for summer projects, all of which are likely to be more useful and meaningful to the department than a year of your service. It's also more than enough money to cause people in the department to ask awkward questions like 'how much technical staff time will it take to put together an inferior but functional enough alternate to this', which may well not be $41,000 worth of time (especially not every year).

(Of course putting together a complete equivalent of your SaaS will cost much more than that, since you have multiple full time programmers working on it and you've invested years in your software at this point. But university departments are already used to not having nice things, and staff time is often considered almost free.)

If you decide to make your pricing nicer by only charging based on the actual number of people who wind up using your stuff, unfortunately you've probably made the situation worse for the university department. One thing that's worse than a large predictable bill is an uncertain but possibly large bill; the department will have to reserve and allocate the money in its budget to cover the full cost, and then figure out what to do with the unused budget at the end of the year (or the end of every month, or whatever). Among other things, this may lead to awkward conversations with higher powers about how the department's initial budget and actual spending don't necessarily match up.

As we can see from the numbers, one big part of the issue is those 1,000 non-professor, non-staff people. These people aren't really "employees" the way they would be in a conventional organization (and mostly don't think of themselves as employees), and the university isn't set up to support their work and spend money on them in the way it is for the people it considers actual employees. The university cares if a staff member or a professor can't get their work done, and having them work faster or better is potentially valuable to the university. This is mostly not true for graduate students and many other additional people around a department (and almost entirely not true if the person is an outside collaborator, an undergraduate doing extra work to prepare for graduate studies elsewhere, and so on).

In practice, most of those 1,000 extra people will and must be supported on a shoestring basis (for everything, not just for your SaaS). The university as a whole and their department in particular will probably only pay a meaningful per-person price for them for things that are either absolutely necessary or extremely compelling. At the same time, often the software that the department is considering is something that those people should be using too, and they may need a substitute if the department can't afford the software for them. And once the department has the substitute, it becomes budgetarily tempting and perhaps politically better if everyone uses the substitute and the department doesn't get your software at all.

(It's probably okay to charge a very low price for such people, as opposed to just throwing them in for free, but it has to be low enough that the department or the university doesn't have to think hard about it. One way to look at it is that regardless of the numbers, the collective group of those extra people is 'less important' to provide services to than the technical staff, the administrative staff, and the professors, and the costs probably should work out accordingly. Certainly the collective group of extra people isn't more important than the other groups, despite having a lot more people in it.)

Incidentally, all of this applies just as much (if not more so) when the 'vendor' is the university's central organizations and they decide to charge (back) people within the university for something on a per-person basis. If this is truly cost recovery and accurately represents the actual costs to provide the service, then it's not going to be something that most graduate students get (unless the university opts to explicitly subsidize it for them).

PS: All of this is much worse if undergraduate students need to be covered too, because there are even more of them. But often the department or the university can get away with not covering them, partly because their interactions with the university are often much narrower than those of graduate students.

Using WireGuard seriously as a mesh network needs a provisioning system

By: cks

One thing that my recent experience expanding our WireGuard mesh network has driven home to me is how (and why) WireGuard needs a provisioning system, especially if you're using it as a mesh networking system. In fact I think that if you use a mesh WireGuard setup at any real scale, you're going to wind up either adopting or building such a provisioning system.

In a 'VPN' WireGuard setup with a bunch of clients and one or a small number of gateway servers, adding a new client is mostly a matter of generating some critical information and giving it to the client. However, it's possible to more or less automate this and make it relatively easy for people who want to connect to you. You'll still need to update your WireGuard VPN server too, but at least you only have one of them (probably), and it may well be the host where you generate the client configuration and provide it to the client's owner.

The extra problem with adding a new client to a WireGuard mesh network is that there are many more WireGuard nodes that need to be updated (and also the new client needs a lot more information; it needs to know about all of the other nodes it's supposed to talk to). More broadly, every time you change the mesh network configuration, every node needs to update with the new information. If you add a client, remove a client, or a client changes its keys for some reason (perhaps it had to be re-provisioned because the hardware died), all of these mean that nodes need updates (or at least the nodes that talk to the changed node). In the VPN model, only the VPN server node (and the new client) needed updates.

Our little WireGuard mesh is operating at a small scale, so we can afford to do this by hand. As you have more WireGuard nodes and more changes in nodes, you're not going to want to manually update things one by one, any more than you want to do that for other system administration work. Thus, you're going to want some sort of provisioning system, where at a minimum you can say 'this is a new node' or 'this node has been removed' and all of your WireGuard configurations are regenerated, propagated to the WireGuard nodes, reloaded, and so on. Some amount of this can be relatively generic in your configuration management system, but not all of it.

(Many configuration systems can propagate client-specific files to clients on changes and then trigger client side actions when the files are updated. But you have to build the per-client WireGuard configuration.)
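
As a minimal sketch of what building that per-client configuration can look like, here's a hypothetical version that generates the '[Peer]' stanzas for one node from a made-up 'nodes.txt' of 'name pubkey endpoint:port wg-ip' lines, skipping the node itself:

self=node1
awk -v self="$self" '
   $1 ~ "^#" { next }
   $1 == self { next }
   {
      printf "[Peer]\n"
      printf "PublicKey = %s\n", $2
      printf "Endpoint = %s\n", $3
      printf "AllowedIPs = %s/32\n\n", $4
   }' nodes.txt >>wg0.conf.new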

PS: I haven't looked into systems that will do this for you, either as pure WireGuard provisioning systems or as bigger 'mesh networking using WireGuard' software, so I don't have any opinions on how you want to handle this. I don't even know if people have built and published things that are just WireGuard provisioning systems, or if everything out there is a 'mesh networking based on WireGuard' complex system.

Some notes on using 'join' to supplement one file with data from another

By: cks

Recently I said something vaguely grumpy about the venerable Unix 'join' tool. As the POSIX specification page for join will unhelpfully tell you, join is a 'relational database operator', which means that it implements the rough equivalent of SQL joins. One way to use join is to add additional information for some lines in your input data.

Suppose, not entirely hypothetically, that we have an input file (or data stream) that starts with a login name and contains some additional information, and that for some logins (but not all of them) we have useful additional data about them in another file. Using join, the simple case of this is easy, if the 'master' and 'suppl' files are already sorted:

join -1 1 -2 1 -a 1 master suppl

(I'm sticking to POSIX syntax here. Some versions of join accept '-j 1' as an alternative to '-1 1 -2 1'.)

Our specific options tell join to join each line of 'master' and 'suppl' on the first field in each (the login) and print them, and also print all of the lines from 'master' that didn't have a login in 'suppl' (that's the '-a 1' argument). For lines with matching logins, we get all of the fields from 'master' and then all of the extra fields from 'suppl'; for lines from 'master' that don't match, we just get the fields from 'master'. Generally you'll tell apart which lines got supplemented and which ones didn't by how many fields they have.

If we want something other than all of the fields in the order that they are in the existing data source, in theory we have the '-o <list>' option to tell join what fields from each source to output. However, this option has a little problem, which I will show you by quoting the important bit from the POSIX standard (emphasis mine):

The fields specified by list shall be written for all selected output lines. Fields selected by list that do not appear in the input shall be treated as empty output fields.

What that means is that if we're also printing non-joined lines from our 'master' file, our '-o' still applies and any fields we specified from 'suppl' will be blank and empty (unless you use '-e'). This can be inconvenient if you were re-ordering fields so that, for example, a field from 'suppl' was listed before some fields from 'master'. It also means that you want to use '1.1' to get the login from 'master', which is always going to be there, not '2.1', the login from 'suppl', which is only there some of the time.

(All of this assumes that your supplementary file is listed second and the master file first.)

On the other hand, using '-e' we can simplify life in some situations. Suppose that 'suppl' contains only one additional interesting piece of information, and it has a default value that you'll use if 'suppl' doesn't contain a line for the login. Then if 'master' has three fields and 'suppl' two, we can write:

join -1 1 -2 1 -a 1 -e "$DEFVALUE" -o '1.1,1.2,1.3,2.2' master suppl

Now we don't have to try to tell whether or not a line from 'master' was supplemented by counting how many fields it has; everything has the same number of fields, it's just sometimes the last (supplementary) field is the default value.

(This is harder to apply if you have multiple fields from the 'suppl' file, but possibly you can find a 'there is nothing here' value that works for the rest of your processing.)
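
To make the single extra field case concrete, here's a tiny made-up example, with both files already sorted on the login:

# master is 'login shell office', suppl is 'login status':
#   master:  alice /bin/bash rm101      suppl:  alice active
#            bob /bin/sh rm102
join -1 1 -2 1 -a 1 -e unknown -o '1.1,1.2,1.3,2.2' master suppl
# prints:
#   alice /bin/bash rm101 active
#   bob /bin/sh rm102 unknown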

In Apache, using OIDC instead of SAML makes for easier testing

By: cks

In my earlier installment, I wrote about my views on the common Apache modules for SAML and OIDC authentication, where I concluded that OpenIDC was generally easier to use than Mellon (for SAML). Recently I came up with another reason to prefer OIDC, one strong enough that we converted one of our remaining Mellon uses over to OIDC. The advantage is that OIDC is easier to test if you're building a new version of your web server under another name.

Suppose that you're (re)building a version of your Apache based web server with authentication on, for example, a new version of Ubuntu, using a test server name. You want to test that everything still works before you deploy it, including your authentication. If you're using Mellon, as far as I can see you have to generate an entirely new SP configuration using your test server's name and then load it into your SAML IdP. You can't use your existing SAML SP configuration from your existing web server, because it specifies the exact URL the SAML IdP needs to use for various parts of the SAML protocol, and of course those URLs point to your production web server under its production name. As far as I know, to get another set of URLs that point to your test server, you need to set up an entirely new SP configuration.

OIDC has an equivalent thing in its redirect URI, but the OIDC redirect URL works somewhat differently. OIDC identity providers typically allow you to list multiple allowed redirect URIs for a given OIDC client, and it's the client that tells the server what redirect URI to use during authentication. So when you need to test your new server build under a different name, you don't need to register a new OIDC client; you can just add some more redirect URIs to your existing production OIDC client registration to allow your new test server to provide its own redirect URI. In the OpenIDC module, this will typically require no Apache configuration changes at all (from the production version), as the module automatically uses the current virtual host as the host for the redirect URI. This makes testing rather easier in practice, and it also generally tests the Apache OIDC configuration you'll use in production, instead of a changed version of it.

(You can put a hostname in the Apache OIDCRedirectURI directive, but it's simpler to not do so. Even if you did use a full URL in this, that's a single change in a text file.)

Choosing between "it works for now" and "it works in the long term"

By: cks

A comment on my entry about how Netplan can only have WireGuard peers in one file made me realize one of my implicit system administration views (it's the first one by Jon). That is the tradeoff between something that works now and something that not only works now but is likely to keep working in the long term. In system administration this is a tradeoff, not an obvious choice, because what you want is different depending on the circumstances.

Something that works now is, for example, something that works because of how Netplan's code is currently written, where you can hack around an issue by structuring your code, your configuration files, or your system in a particular way. As a system administrator I do a surprising amount of this, for example to fix or work around issues in systemd units that people have written in less than ideal or simply mistaken ways.

Something that's going to keep working in the longer term is doing things 'correctly', which is to say in whatever way the software intends and supports. Sometimes this means doing things the hard way when the software doesn't actually implement some feature that would make your life better, even if you could work around it with something that works now but isn't necessarily guaranteed to keep working in the future.

When you need something to work and there's no other way to do it, you have to take a solution that (only) works now. Sometimes you take a 'works now' solution even if there's an alternative because you expect your works-now version to be good enough for the lifetime of this system, this OS release, or whatever; you'll revisit things for the next version (at least in theory, workarounds to get things going can last a surprisingly long time if they don't break anything). You can't always insist on a 'works now and in the future' solution.

On the other hand, sometimes you don't want to do a works-now thing even if you could. A works-now thing is in some sense technical debt, with all that that implies, and this particular situation isn't important enough to justify taking on such debt. You may solve the problem properly, or you may decide that the problem isn't big and important enough to solve at all and you'll leave things in their imperfect state. One of the things I think about when making this decision is how annoying it would be and how much would have to change if my works-now solution broke because of some update.

(Another is how ugly the works-now solution is, including how big of a note we're going to want to write for our future selves so we can understand what this peculiar load bearing thing is. The longer the note, the more I generally wind up questioning the decision.)

It can feel bad to not deal with a problem by taking a works-now solution. After all, it works, and otherwise you're stuck with the problem (or with less pleasant solutions). But sometimes it's the right option and the works-now solution is simply 'too clever'.

(I've undoubtedly made this decision many times over my career. But Jon's comment and my reply to it crystallized the distinction between a 'works now' and a 'works for the long term' solution in my mind in a way that I think I can sort of articulate.)

Netplan can only have WireGuard peers in one file

By: cks

We have started using WireGuard to build a small mesh network so that machines outside of our network can securely get at some services inside it (for example, to send syslog entries to our central syslog server). Since this is all on Ubuntu, we set it up through Netplan, which works but which I said 'has warts' in my first entry about it. Today I discovered another wart due to what I'll call the WireGuard provisioning problem:

Current status: provisioning WireGuard endpoints is exhausting, at least in Ubuntu 22.04 and 24.04 with netplan. So many netplan files to update. I wonder if Netplan will accept files that just define a single peer for a WG network, but I suspect not.

The core WireGuard provisioning problem is that when you add a new WireGuard peer, you have to tell all of the other peers about it (or at least all of the other peers you want to be able to talk to the new peer). When you're using Netplan, it would be convenient if you could put each peer in a separate file in /etc/netplan; then when you add a new peer, you just propagate the new Netplan file for the peer to everything (and do the special Netplan dance required to update peers).

(Apparently I should now call it 'Canonical Netplan', as that's what its front page calls it. At least that makes it clear exactly who is responsible for Netplan's state and how it's not going to be widely used.)

Unfortunately this doesn't work, and it fails in a dangerous way: Netplan only notices one set of WireGuard peers from one netplan file (at least on servers, using systemd-networkd as the backend). If you put each peer in its own file, only the first peer is picked up. If you define some peers in the file where you define your WireGuard private key, local address, and so on, and some peers in another file, only peers from whichever file comes first will be used (even if the first file only defines peers, which isn't enough to bring up a WireGuard device by itself). As far as I can see, Netplan doesn't report any errors or warnings to the system logs on boot about this situation; instead, you silently get incomplete WireGuard configurations.

This is visibly and clearly a Netplan issue, because on servers you can inspect the systemd-networkd files written by Netplan (in /run/systemd/network). When I do this, the WireGuard .netdev file has only the peers from one file defined in it (and the .netdev file matches the state of the WireGuard interface). This is especially striking when the netplan file with the private key and listening port (and some peers) is second; since the .netdev file contains the private key and so on, Netplan is clearly merging data from more than one netplan file, not completely ignoring everything except the first one. It's just ignoring any peers encountered after the first set of them.
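
A quick way to see what Netplan actually handed to systemd-networkd is to count the peer sections in the generated .netdev file and compare that with the live interface. The file name here assumes a Netplan-defined interface called 'wg0'; the exact name may differ on your system:

# how many peers did systemd-networkd get told about?
grep -c '^\[WireGuardPeer\]' /run/systemd/network/10-netplan-wg0.netdev
# and how many does the interface actually have?
wg show wg0 peers | wc -l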

My overall conclusion is that in Netplan, you need to put all configuration for a given WireGuard interface into a single file, however tempting it might be to try splitting it up (for example, to put core WireGuard configuration stuff in one file and then list all peers in another one).

I don't know if this is an already filed Netplan bug and I don't plan on bothering to file one for it, partly because I don't expect Canonical to fix Netplan issues any more than I expect them to fix anything else and partly for other reasons.

PS: I'm aware that we could build a system to generate the Netplan WireGuard file, or maybe find a YAML manipulating program that could insert and delete blocks that matched some criteria. I'm not interested in building yet another bespoke custom system to deal with what is (for us) a minor problem, since we don't expect to be constantly deploying or removing WireGuard peers.

I moved my local Firefox changes between Git trees the easy way

By: cks

Firefox recently officially switched to Git, in a completely different Git tree than their old mirror. This presented me with a bit of a problem, because I have a collection of local changes I make to my own Firefox builds, which I carry as constantly-rebased commits on top of the upstream Firefox tree. The change in upstream trees meant that I was going to have to move my commits to the new tree. When I wrote my first entry I thought I might try to do this in some clever way similar to rebasing my own changes on top of something that was rebased, but in the end I decided to do it the simple and brute force way, which I was confident would either work or leave me in a situation I could easily back out of.

This simple and brute force way was to get both my old tree and my new 'firefox' tree up to date, then export my changes with 'git format-patch' from the old tree and import them into the new tree with 'git am'. There were a few irritations along the way, of course. First I (re)discovered that 'git am' can't directly consume the directory of patches you create with 'git format-patch'. Git-am will consume a Maildir of patches, but git-format-patch will only give you a directory full of files with names like '00NN-<author>-<title>.patch', which is not a proper Maildir. The solution is to cat all of the .patch files together in order to some other file, which is now a mailbox that git-am will handle. The other minor thing is that git-am unsurprisingly has no 'dry-run' option (which would probably be hard to implement). Of course in my situation, I can always reset 'main' back to 'origin/main', which was one reason I was willing to try this.

(Looking at the 'git format-patch' manual page suggests that what I might have wanted was the '--stdout' option, which would have automatically created the mbox format version for me. On the other hand it was sort of nice to be able to look at the list of patches and see that they were exactly what I expected.)
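
Put together, the whole brute force move amounted to roughly the following; the directory names are made up, and this assumes my local changes sit on 'main' on top of 'origin/main' in both trees:

# in the old tree: export my local commits as patches and glue them
# together into a single mbox file
cd ~/src/gecko-dev
git format-patch -o /tmp/ffpatches origin/main..main
cat /tmp/ffpatches/*.patch >/tmp/ffpatches.mbox

# in the new tree: apply them
cd ~/src/firefox
git am /tmp/ffpatches.mbox
# if things go wrong: 'git am --abort', then 'git reset --hard origin/main'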

On the one hand, moving my changes in this brute force way (and to a completely separate new tree) feels like giving in to my unfamiliarity with git. There are probably clever git ways to do this move in a single tree without having to turn everything into patches and then apply them (even if most of that is automated). On the other hand, this got the job done with minimal hassles and time consumed, and sometimes I need to put a stop to my programmer's urge to be clever.

LLMs ('AI') are coming for our jobs whether or not they work

By: cks

Over on the Fediverse, I said something about this:

Hot take: I don't really know what vibe coding is but I can confidently predict that it's 'coming for', if not your job, then definitely the jobs of the people who work in internal development at medium to large non-tech companies. I can predict this because management at such companies has *always* wanted to get rid of programmers, and has consistently seized on every excuse presented by the industry to do so. COBOL, report generators, rule based systems, etc etc etc at length.

(The story I heard is that at one point COBOL's English language basis was at least said to enable non-programmers to understand COBOL programs and maybe even write them, and this was seen as a feature by organizations adopting it.)

The current LLM craze is also coming for the jobs of system administrators for the same reason; we're overhead, just like internal development at (most) non-tech companies. In most non-tech organizations, both internal development and system administration are something similar to janitorial services; you have to have them because otherwise your organization falls over, but you don't like them and you're happy to spend as little on them as possible. And, unfortunately, we have a long history in technology that shows the long term results don't matter for the people making short term decisions about how many people to hire and who.

(Are they eating their seed corn? Well, they probably don't think it matters to them, and anyway that's a collective problem, which 'the market' is generally bad at solving.)

As I sort of suggested by using 'excuse' in my Fediverse post, it doesn't really matter if LLMs truly work, especially if they work over the long run. All they need to do in order to get senior management enthused about 'cutting costs' is appear to work well enough over the short term, and appearing to work is not necessarily a matter of substance. In sort of a flipside of how part of computer security is convincing people, sometimes it's enough to simply convince (senior) people and not have obvious failures.

(I have other thoughts about the LLM craze and 'vibe coding', as I understand it, but they don't fit within the margins of this entry.)

PS: I know it's picky of me to call this an 'LLM craze' instead of an 'AI craze', but I feel I have to both as someone who works in a computer science department that does all sorts of AI research beyond LLMs and as someone who was around for a much, much earlier 'AI' craze (that wasn't all of AI either, cf).

These days, Linux audio seems to just work (at least for me)

By: cks

For a long time, the common perception was that 'Linux audio' was the punchline for a not particularly funny joke. I sort of shared that belief; although audio had basically worked for me for a long time, I had a simple configuration and dreaded having to make more complex audio work in my unusual desktop environment. But these days, audio seems to just work for me, even in systems that have somewhat complex audio options.

On my office desktop, I've wound up with three potential audio outputs and two audio inputs: the motherboard's standard sound system, a USB headset with a microphone that I use for online meetings, the microphone on my USB webcam, and (to my surprise) a HDMI audio output because my LCD displays do in fact have tiny little speakers built in. In PulseAudio (or whatever is emulating it today), I have the program I use for online meetings set to use the USB headset and everything else plays sound through the motherboard's sound system (which I have basic desktop speakers plugged into). All of this works sufficiently seamlessly that I don't think about it, although I do keep a script around to reset the default audio destination.
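
That reset script is essentially a one-liner around pactl; this is a hypothetical version, since the sink name is specific to my hardware ('pactl list short sinks' will show you what names your machine actually has):

# reset the default audio output to the motherboard's sound system
# (the sink name here is made up and machine-specific)
pactl set-default-sink alsa_output.pci-0000_00_1f.3.analog-stereo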

On my home desktop, for a long time I had a simple single-output audio system that played through the motherboard's sound system (plus a microphone on a USB webcam that was mostly not connected). Recently I got an outboard USB DAC and, contrary to my fears, it basically plugged in and just worked. It was easy to set the USB DAC as the default output in pavucontrol and all of the settings related to it stick around even when I put it to sleep overnight and it drops off the USB bus. I was quite pleased by how painless the USB DAC was to get working, since I'd been expecting much more hassles.

(Normally I wouldn't bother meticulously switching the USB DAC to standby mode when I'm not using it for an extended time, but I noticed that the case is clearly cooler when it rests in standby mode.)

This is still a relatively simple audio configuration because it's basically static. I can imagine more complex ones, where you have audio outputs that aren't always present and that you want some programs (or more generally audio sources) to use when they are present, perhaps even with priorities. I don't know if the Linux audio systems that Linux distributions are using these days could cope with that, or if they did would give you any easy way to configure it.

(I'm aware that PulseAudio and so on can be fearsomely complex under the hood. As far as the current actual audio system goes, I believe that what my Fedora 41 machines are using for audio is PipeWire (also) with WirePlumber, based on what processes seem to be running. I think this is the current Fedora 41 audio configuration in general, but I'm not sure.)

The HTTP status codes of responses from about 22 hours of traffic to here (part 2)

By: cks

A few months ago, I wrote an entry about this topic, because I'd started putting in some blocks against crawlers, including things that claimed to be old versions of browsers, and I'd also started rate-limiting syndication feed fetching. Unfortunately, my rules at the time were flawed, rejecting a lot of people that I actually wanted to accept. So here are some revised numbers from today, a day when my logs suggest that I've seen what I'd call broadly typical traffic and traffic levels.

I'll start with the overall numbers (for HTTP status codes) for all requests:

  10592 403		[26.6%]
   9872 304		[24.8%]
   9388 429		[23.6%]
   8037 200		[20.2%]
   1629 302		[ 4.1%]
    114 301
     47 404
      2 400
      2 206

This is a much more balanced picture of activity than the last time around, with a lot less of the overall traffic being HTTP 403s. The HTTP 403s are from aggressive blocks, the HTTP 304s and HTTP 429s are mostly from syndication feed fetchers, and the HTTP 302s are mostly from things with various flaws that I redirect to informative static pages instead of giving HTTP 403s. The two HTTP 206s were from Facebook's 'externalhit' agent on a recent entry. A disturbing amount of the HTTP 403s were from Bing's crawler and almost 500 of them were from something claiming to be an Akkoma Fediverse server. 8.5% of the HTTP 403s were from something using Go's default User-Agent string.
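
For what it's worth, numbers like these are easy to pull out of an Apache-style access log, where the HTTP status is the ninth field of the common and combined log formats (at least for typical request lines):

# count HTTP status codes in an access log, most frequent first
awk '{ counts[$9]++ } END { for (c in counts) print counts[c], c }' access.log |
   sort -rn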

The most popular User-Agent strings today for successful requests (of anything) were for versions of NetNewsWire, FreshRSS, and Miniflux, then Googlebot and Applebot, and then Chrome 130 on 'Windows NT 10'. Although I haven't checked, I assume that all of the first three were for syndication feeds specifically, with few or no fetches of other things. Meanwhile, Googlebot and Applebot can only fetch regular pages; they're blocked from syndication feeds.

The picture for syndication feeds looks like this:

   9923 304		[42%]
   9535 429		[40%]
   1984 403		[ 8.5%]
   1600 200		[ 6.8%]
    301 302
     34 301
      1 404

On the one hand it's nice that 42% of syndication feed fetches successfully did a conditional GET. On the other hand, it's not nice that 40% of them got rate-limited, or that there were clearly more explicitly blocked requests than there were HTTP 200 responses. On the sort of good side, 37% of the blocked feed fetches were from one IP that's using "Go-http-client/1.1" as its User-Agent (and which accounts for 80% of the blocks of that). This time around, about 58% of the requests were for my syndication feed, which is better than it was before but still not great.

These days, if certain problems are detected in a request I redirect the request to a static page about the problem. This gives me some indication of how often these issues are detected, although crawlers may be re-visiting the pages on their own (I can't tell). Today's breakdown of this is roughly:

   78%  too-old browser
   13%  too generic a User-Agent
    9%  unexpectedly using HTTP/1.0

There were slightly more HTTP 302 responses from requests to here than there were requests for these static pages, so I suspect that not everything that gets these redirects follows them (or at least doesn't bother re-fetching the static page).

I hope that the better balance in HTTP status codes here is a sign that I have my blocks in a better state than I did a couple of months ago. It would be even better if the bad crawlers would go away, but there's little sign of that happening any time soon.

The complexity of mixing mesh networking and routes to subnets

By: cks

One of the in things these days is encrypted (overlay) mesh networks, where you have a bunch of nodes and the nodes have encrypted connections to each other that they use for (at least) internal IP traffic. WireGuard is one of the things that can be used for this. A popular thing to add to such mesh network solutions is 'subnet routes', where nodes will act as gateways to specific subnets, not just endpoints in themselves. This way, if you have an internal network of servers at your cloud provider, you can establish a single node on your mesh network and route to the internal network through that node, rather than having to enroll every machine in the internal network.

(There are various reasons not to enroll every machine, including that on some of them it would be a security or stability risk.)

In simple configurations this is easy to reason about and easy to set up through the tools that these systems tend to give you. Unfortunately, our network configuration isn't simple. We have an environment with multiple internal networks, some of which are partially firewalled off from each other, and where people would want to enroll various internal machines in any mesh networking setup (partly so they can be reached directly). This creates problems for a simple 'every node can advertise some routes and you accept the whole bundle' model.

The first problem is what I'll call the direct subnet problem. Suppose that you have a subnet with a bunch of machines on it and two of them are nodes (call them A and B), with one of them (call it A) advertising a route to the subnet so that other machines in the mesh can reach it. The direct subnet problem is that you don't want B to ever send its traffic for the subnet to A; since it's directly connected to the subnet, it should send the traffic directly. Whether or not this happens automatically depends on various implementation choices the setup makes.
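
On Linux you can at least check what a given node will actually do here, because the kernel will tell you which route it would pick for a destination on the subnet; the addresses here are made up:

# on node B: does traffic to the directly attached subnet use the real
# interface, or get pulled into the mesh device?
ip route get 192.168.10.20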

The second problem is the indirect subnet problem. Suppose that you have a collection of internal networks that can all talk to each other (perhaps through firewalls and somewhat selectively). Not all of the machines on all of the internal networks are part of the mesh, and you want people who are outside of your networks to be able to reach all of the internal machines, so you have a mesh node that advertises routes to all of your internal networks. However, if a mesh node is already inside your perimeter and can reach your internal networks, you don't want it to go through your mesh gateway; you want it to send its traffic directly.

(You especially want this if mesh nodes have different mesh IPs from their normal IPs, because you probably want the traffic to come from the normal IP, not the mesh IP.)

You can handle the direct subnet case with a general rule like 'if you're directly attached to this network, ignore a mesh subnet route to it', or by some automatic system like route priorities. The indirect subnet case can't be handled automatically because it requires knowledge about your specific network configuration and what can reach what without the mesh (and what you want to reach what without the mesh, since some traffic you want to go over the mesh even if there's a non-mesh route between the two nodes). As far as I can see, to deal with this you need the ability to selectively configure or accept (subnet) routes on a mesh node by mesh node basis.

(In a simple topology you can get away with accepting or not accepting all subnet routes, but in a more complex one you can't. You might have two separate locations, each with their own set of internal subnets. Mesh nodes in each location want the other location's subnet routes, but not their own location's subnet routes.)

Being reminded that Git commits are separate from Git trees

By: cks

Firefox's official source repository has moved to Git, but to a completely new Git repository, not the Git mirror that I've used for the past few years. This led me to a lament on the Fediverse:

This is my sad face that Firefox's switch to using git of course has completely different commit IDs than the old not-official gecko-dev git repository, meaning that I get to re-clone everything from scratch (all ~8 GB of it). Oh well, so it goes in the land of commit hashes.

Then Tim Chase pointed out something that I should have thought of:

If you add the new repo as a secondary remote in your existing one and pull from it, would it mitigate pulling all the blobs (which likely remain the same), limiting your transfer to just the commit-objects (and possibly some treeish items and tags)?

Git is famously a form of content-addressed storage, or more specifically a tree of content addressed storage, where as much as possible is kept the same over time. This includes all the portions of the actual source tree. A Git commit doesn't directly include a source tree; instead it just has the hash of the source tree (well, its top level, cf).

What this means is that if you completely change the commits so that all of them have new hashes, for example by rebuilding your history from scratch in a new version of the repository, but you keep the actual tree contents the same in most or all of the commits, the only thing that actually changes is the commits. If you add this new repository (with its new commit history) as a Git remote to your existing repository and pull from it, most or all of the tree contents are the same across the two sets of commits and won't have to be fetched. So you don't fetch gigabytes of tree contents, you only fetch megabytes (one hopes) of commits.
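
In concrete terms, Tim Chase's suggestion amounts to something like the following in the existing clone; the remote name is arbitrary and the URL is a placeholder for wherever the new official repository lives:

# add the new official repository as a second remote and fetch it;
# trees and blobs that are unchanged won't be downloaded again
git remote add firefox https://github.com/mozilla-firefox/firefox.git
git fetch firefox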

As I mentioned on the Fediverse, I was told this too late to save me from re-fetching the entire new Firefox repository from scratch on my office desktop (which has lots of bandwidth). I may yet try this on my home desktop, or alternately use it on my office desktop to easily move my local changes on top of the new official Git history.

(I think this is effectively rebasing my own changes on top of something that's been rebased, which I've done before, although not recently. I'll also want to refresh my understanding of what 'git rebase' does.)

The appeal of keyboard launchers for (Unix) desktops

By: cks

A keyboard launcher is a big part of my (modern) desktop, but over on the Fediverse I recently said something about them in general:

I don't necessarily suggest that people use dmenu or some equivalent. Keyboard launchers in GUI desktops are an acquired taste and you need to do a bunch of setup and infrastructure work before they really shine. But if you like driving things by the keyboard and will write scripts, dmenu or equivalents can be awesome.

The basic job of a pure keyboard launcher is to let you hit a key, start typing, and then select and do 'something'. Generally the keyboard launcher will make a window appear so that you can see what you're typing and maybe what you could complete it to or select.

The simplest and generally easiest way to use a keyboard launcher, and how many of them come configured to work, is to use it to select and run programs. You can find a version of this idea in GNOME, and even Windows has a pseudo-launcher in that you can hit a key to pop up the Start menu and the modern Start menu lets you type in stuff to search your programs (and other things). One problem with the GNOME version, and many basic versions, is that in practice you don't necessarily launch desktop programs all that often or launch very many different ones, so you can have easier ways to invoke the ones you care about. One problem with the Windows version (at least in my experience) is that it will do too much, which is to say that no matter what garbage you type into it by accident, it will do something with that garbage (such as launching a web search).

The happy spot for a keyboard launcher is somewhere in the middle, where they do a variety of things that are useful for you but not without limits. The best keyboard launcher for your desktop is one that gives you fast access to whatever things you do a lot, ideally with completion so you type as little as possible. When you have it tuned up and working smoothly the feel is magical; I tap a key, type a couple of characters and then hit tab, hit return, and the right thing happens without me thinking about it, all fast enough that I can and do type ahead blindly (which then goes wrong if the keyboard launcher doesn't start fast enough).

The problem with keyboard launchers, and why they're not for everyone, is that everyone has a different set of things that they do a lot and that are useful for them to trigger entirely through the keyboard. No keyboard launcher will come precisely set up for what you do a lot in their default installation, so at a minimum you need to spend the time and effort to curate what the launcher will do and how it does it. If you're more ambitious, you may need to build supporting scripts that give the launcher a list of things to complete and then act on them when you complete one. If you don't curate the launcher and throw in the kitchen sink, you wind up with the Windows experience where it will certainly do something when you type things but perhaps not really what you wanted.

(For example, I routinely ssh to a lot of our machines, so my particular keyboard launcher setup lets me type a machine name (with completion) to start a session to it. But I had to build all of that, including sourcing the machine names I wanted included from somewhere, and this isn't necessarily useful for people who aren't constantly ssh'ing to machines.)
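
As an illustration rather than my actual setup, the ssh case can be as small as a script like this, assuming a plain list of machine names in a file and stock dmenu:

# pick a host from a (hypothetical) list of names and ssh to it in a
# new terminal; dmenu reads its menu entries from standard input
host=$(dmenu -p 'ssh:' <"$HOME/.machines") || exit 0
exec xterm -e ssh "$host"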

There are a variety of keyboard launchers for both X and Wayland, basically none of which I have any experience with. See the Arch Wiki section on application launchers. Someday I will have to get a Wayland equivalent to my particular modified dmenu, a thought that fills me with no more enthusiasm than any other part of replacing my whole X environment.

PS: Another issue with keyboard launchers is that sometimes you're wrong about what you want to do with them. I once built an entire keyboard launcher setup to select terminal windows and then later wound up abandoning it when I didn't use it enough.

Updating venv-based things by replacing the venv not updating it

By: cks

These days, we have mostly switched over to installing third-party Python programs (and sometimes things like Django) in virtual environments instead of various past practices. This is clearly the way Python expects you to do things and increasingly problems emerge if you don't. One of the issues I've been thinking about is how we want to handle updating these programs when they release new versions, because there are two approaches.

One option would be to update the existing venv in place, through various 'pip' commands. However, pip-based upgrades have some long standing issues, and also they give you no straightforward way to revert an upgrade if something goes wrong. The other option is to build a separate venv with the new version of the program (and all of its current dependency versions) and then swap the whole new venv into place, which works because venvs can generally be moved around. You can even work with symbolic links, creating a situation where you refer to 'dir/program', which is a symlink to 'dir/venvs/program-1.2.0' or 'dir/venvs/programs-1.3.0' or whatever you want today.
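Here's a minimal sketch of the symlink version of this in Python; the directory layout and program name are made up for illustration, and a real version would want error handling and probably pinned dependency versions.

import os
import subprocess
import sys

BASE = "/opt/apps"                      # hypothetical install area
VENVS = os.path.join(BASE, "venvs")

def deploy(program, version):
    venv_dir = os.path.join(VENVS, "%s-%s" % (program, version))
    # Build a fresh venv and install the new version (and its dependencies) into it.
    subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)
    subprocess.run([os.path.join(venv_dir, "bin", "pip"), "install",
                    "%s==%s" % (program, version)], check=True)
    # Atomically repoint 'BASE/program' at the new venv. The old venv stays
    # around, so reverting is just re-pointing the symlink.
    link = os.path.join(BASE, program)
    tmp = link + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(venv_dir, tmp)
    os.replace(tmp, link)

deploy("some-program", "1.3.0")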

In practice we're more likely to have 'dir/program' be a real venv and just create 'dir/program-new', rename directories, and so on. The full scale version with always versioned directories is likely to only be used for things, like Django, where we want to be able to easily see what version we're running and switch back very simply.

Our Django versions were always going to be handled by building entirely new venvs and switching to them (it's the venv version of what we did before). We haven't had upgrades of other venv based programs until recently, and when I started thinking about it, I reached the obvious conclusion: we'll update everything by building a new venv and replacing the old one, because this deals with pretty much all of the issues at the small cost of yet more disk space for yet more venvs.

(This feels quite obvious once I'd made the decision, but I want to write it down anyway. And who knows, maybe there are reasons to update venvs in place. The one that I can think of is to only change the main program version but not any of the dependencies, if they're still compatible.)

The glass box/opaque box unit testing argument in light of standards

By: cks

One of the traditional divides in unit testing is whether you should write 'glass box' or 'opaque box' tests (like GeePawHill I think I prefer those terms to the traditional ones), which is to say whether you should write tests exploiting your knowledge of the module's code or without it. Since I prefer testing inside my modules, I'm implicitly on the side of glass box tests; even if I'm testing public APIs, I write tests with knowledge of potential corner cases. Recently, another reason for this occurred to me, by analogy to standards.

I've read about standards (and read the actual standards) enough by now to have absorbed the lesson that it is very hard to write a (computer) standard that can't be implemented perversely. Our standards need good faith implementations and there's only so much you can do to make it hard for people implementing them in bad faith. After that, you have to let the 'market' sort it out (including the market of whether or not people want to use perverse implementations, which generally they don't).

(Of course sometimes the market doesn't give you a real choice. Optimizing C compilers are an example, where your only two real options (GCC and LLVM) have aggressively exploited arguably perverse readings of 'undefined behavior' as part of their code optimization passes. There's some recent evidence that this might not always be worth it [PDF], via.)

If you look at them in the right way, unit tests are also a sort of standard. And like standards, opaque box unit tests have a very hard time completely preventing perverse implementations. While people usually don't deliberately create perverse implementations, they can happen by accident or through misunderstandings, and bugs can create pockets of perverse behavior. Your cheapest assurance that you don't have a perverse implementation is to peer inside and then write glass box tests that in part target the areas where perverse problems could arise. If you write opaque box tests, you're basically hoping that you can imagine all of the perverse mistakes that you'll make.

(Some things are amenable to exhaustive testing, but usually not very many.)
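As a small made-up illustration of what I mean, here's a glass box test that deliberately targets an internal threshold in the code, something an opaque box test would have to guess at:

import unittest

def chunked_sum(values):
    # Hypothetical implementation detail: switch algorithms at 1024 elements.
    if len(values) < 1024:
        return sum(values)
    half = len(values) // 2
    return chunked_sum(values[:half]) + chunked_sum(values[half:])

class TestChunkedSum(unittest.TestCase):
    def test_around_internal_threshold(self):
        # Glass box knowledge: 1024 is where the code path changes, so test
        # just below, at, and just above it.
        for n in (1023, 1024, 1025):
            self.assertEqual(chunked_sum(list(range(n))), n * (n - 1) // 2)

if __name__ == "__main__":
    unittest.main()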

PS: One way to get perverse implementations is 'write code until all of the tests pass, then stop'. This doesn't guarantee a perverse implementation but it certainly puts the onus on the tests to force the implementation to do things, much like with standards (cf).

Trying to understand OpenID Connect (OIDC) and its relation to OAuth2

By: cks

The OIDC specification describes it as "a simple identity layer" on top of OAuth2. As I've been discovering, this is sort of technically true but also misleading. Since I think I've finally sorted this out, here's what I've come to understand about the relationship.

OAuth2 describes an HTTP-based protocol where a client (typically using a web browser) can obtain an access token from an authorization server and then present this token to a resource server to gain access to something. For example, your mail client works with a browser to obtain an access token from an OAuth2 identity provider, which it then presents to your IMAP server. However, the base OAuth2 specification is only concerned with the interaction between clients and the authorization server; it explicitly has nothing to say about issues like how a resource server validates and uses the access tokens. This is right at the start of RFC 6749:

The interaction between the authorization server and resource server is beyond the scope of this specification. [...]

Because it's purely about the client to authorization server flows, the base OAuth2 RFC provides nothing that will let your IMAP server validate the alleged 'OAuth2 access token' your mail client has handed it (or find out from the token who you are). There were customary ways to do this, and then later you had RFC 7662 Token Introspection or perhaps RFC 9068 JWT access tokens, but those are all outside basic OAuth2.

(This has obvious effects on interoperability. You can't write a resource server that works with arbitrary OAuth2 identity providers, or an OAuth2 identity provider of your own that everyone will be able to work with. I suspect that this is one reason why, for example, IMAP mail clients often only support a few big OAuth2 identity providers.)

OIDC takes the OAuth2 specification and augments it in a number of ways. In addition to an OAuth2 access token, an OIDC identity provider can also give clients (you) an ID Token that's a (signed) JSON Web Token (JWT) that has a specific structure and contains at least a minimal set of information about who authenticated. An OIDC IdP also provides an official Userinfo endpoint that will provide information about an access token, although this is different information than the RFC 7662 Token Introspection endpoint.

Both of these changes make resource servers and by extension OIDC identity providers much more generic. If a client hands a resource server either an OIDC ID Token or an OIDC Access Token, the resource server ('consumer') has standard ways to inspect and verify them. If your resource server isn't too picky (or is sufficiently configurable), I think it can work with either an OIDC Userinfo endpoint or an OAuth2 RFC 7662 Token Introspection endpoint (I believe this is true of Dovecot, cf).

(OIDC is especially convenient in cases like websites, where the client that gets the OIDC ID Token and Access Token is the same thing that uses them.)

An OAuth2 client can talk to an OIDC IdP as if it was an OAuth2 IdP and get back an access token, because the OIDC IdP protocol flow is compatible with the OAuth2 protocol flow. This access token could be described as an 'OAuth2' access token, but this is sort of meaningless to say since OAuth2 gives you nothing you can do with an access token. An OAuth2 resource server (such as an IMAP server) that expects to get 'OAuth2 access tokens' may or may not be able to interact with any particular OIDC IdP to verify those OIDC IdP provided tokens to its satisfaction; it depends on what the resource server supports and requires. For example, if the resource server specifically requires RFC 7662 Token Introspection you may be out of luck, because OIDC IdPs aren't required to support that and not all do.

In practice, I believe that OIDC has been around for long enough and has been popular enough that consumers of 'OAuth2 access tokens', like your IMAP server, will likely have been updated so that they can work with OIDC Access Tokens. Servers can do this either by verifying the access tokens through an OIDC Userinfo endpoint (with suitable server configuration to tell them what to look for) or by letting you tell them that the access token is a JWT and verifying the JWT. OIDC doesn't require the access token to be a JWT but OIDC IdPs can (and do) use JWTs for this, and perhaps you can actually have your client software send the ID Token (which is guaranteed to be a JWT) instead of the OIDC Access Token.

(It helps that OIDC is obviously better if you want to write 'resource server' side software that works with any IdP without elaborate and perhaps custom configuration or even programming for each separate IdP.)

(I have to thank Henryk Plötz for helping me understand OAuth2's limited scope.)

(The basic OAuth2 specification has been extended with multiple additional standards, see eg RFC 8414, and if enough of them are implemented in both your IdP and your resource servers, some of this is fixed. OIDC has also been extended somewhat, see eg OpenID Provider Metadata discovery.)

Looking at OIDC tokens and getting information on them as a 'consumer'

By: cks

In OIDC, roughly speaking and as I understand it, there are three possible roles: the identity provider ('OP'), a Client or 'Relying Party' (the program, website, or whatever that has you authenticate with the IdP and that may then use the resulting authentication information), and what is sometimes called a 'resource server', which uses the IdP's authentication information that it gets from you (your client, acting as a RP). 'Resource Server' is actually an OAuth2 term, which comes into the picture because OIDC is 'a simple identity layer' on top of OAuth2 (to quote from the core OIDC specification). A website authenticating you with OIDC can be described as acting both as a 'RP' and a 'RS', but in cases like IMAP authentication with OIDC/OAuth2, the two roles are separate; your mail client is a RP, and the IMAP server is a RS. I will broadly call both RPs and RSs 'consumers' of OIDC tokens.

When you talk to an OIDC IdP to authenticate, you can get back either or both of an ID Token and an Access Token. The ID Token is always a JWT with some claims in it, including the 'sub(ject)', the 'issuer', and the 'aud(ience)' (which is what client the token was requested by), although this may not be all of the claims you asked for and are entitled to. In general, to verify an ID Token (as a consumer), you need to extract the issuer, consult the issuer's provider metadata to find how to get their keys, and then fetch the keys so you can check the signature on the ID Token (and then proceed to do a number of additional verifications on the information in the token, as covered in the specification). You may cache the keys to save yourself the network traffic, which allows you to do offline verification of ID Tokens. Quite commonly, you'll only accept ID Tokens from pre-configured issuers, not any random IdP on the Internet (ie, you will verify that the 'iss' claim is what you expect). As far as I know, there's no particular way in OIDC to tell if the IdP still considers the ID Token valid or to go from an ID Token alone to all of the claims you're entitled to.
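As an illustration of the consumer side, here's a hedged sketch of ID Token verification using Python's PyJWT and requests packages; the issuer, client ID, and signing algorithm are placeholder assumptions, not things OIDC fixes for you.

import jwt          # the 'PyJWT' package
import requests

ISSUER = "https://idp.example.org"      # pre-configured, expected 'iss'
CLIENT_ID = "my-client-id"              # expected 'aud'

def verify_id_token(id_token):
    # Find the issuer's signing keys via its provider metadata.
    meta = requests.get(ISSUER + "/.well-known/openid-configuration",
                        timeout=10).json()
    jwks_client = jwt.PyJWKClient(meta["jwks_uri"])
    signing_key = jwks_client.get_signing_key_from_jwt(id_token)
    # PyJWT checks the signature, expiry, issuer, and audience for us;
    # the OIDC spec lists further checks beyond these.
    return jwt.decode(id_token, signing_key.key, algorithms=["RS256"],
                      issuer=ISSUER, audience=CLIENT_ID)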

The Access Token officially doesn't have to be anything more than an opaque string. To validate it and get the full set of OIDC claim information, including the token's subject (ie, who it's for), you can use the provider's Userinfo endpoint. However, this doesn't necessarily give you the 'aud' information that will let you verify that this Access Token was created for use with you and not someone else. If you have to know this information, there are two approaches, although an OIDC identity provider doesn't have to support either.

The first is that the Access Token may actually be a RFC 9068 JWT. If it is, you can validate it in the usual OIDC JWT way (as for an ID Token) and then use the information inside, possibly in combination with what you get from the Userinfo endpoint. The second is that your OAuth2 provider may support an RFC 7662 Token Introspection endpoint. This endpoint is not exposed in the issuer's provider metadata and isn't mandatory in OIDC, so your IdP may or may not support it (ours doesn't, although that may change someday).

(There's also an informal 'standard' way of obtaining information about Access Tokens that predates RFC 7662. For all of the usual reasons, this may still be supported by some large, well-established OIDC/OAuth2 identity providers.)

Under some circumstances, the ID Token and the Access Token may be tied together in that the ID Token contains a claim field that you can use to validate that you have the matching Access Token. Otherwise, if you're purely a Resource Server and someone hands you a theoretically matching ID Token and Access Token, all that you can definitely do is use the Access Token with the Userinfo endpoint and verify that the 'sub' matches. If you have a JWT Access Token or a Token Introspection endpoint, you can get more information and do more checks (and maybe the Userinfo endpoint also gives you an 'aud' claim).

If you're a straightforward Relying Party client, you get both the ID Token and the Access Token at the same time and you're supposed to keep track of them together yourself. If you're acting as a 'resource server' as well and need the additional claims that may not be in the ID Token, you're probably going to use the Access Token to talk to the Userinfo endpoint to get them; this is typically how websites acting as OIDC clients behave.

Because the only OIDC standard way to get additional claims is to obtain an Access Token and use it to access the Userinfo endpoint, I think that many OIDC clients that are acting as both a RP and a RS will always request both an ID Token and an Access Token. Unless you know the Access Token is a JWT, you want both; you'll verify the audience in the ID Token, and then use the Access Token to obtain the additional claims. Programs that are only getting things to pass to another server (for example, a mail client that will send OIDC/OAuth2 authentication to the server) may only get an Access Token, or in some protocols only obtain an ID Token.

(If you don't know all of this and you get a mail client testing program to dump the 'token' it obtains from the OIDC IdP, you can get confused because a JWT format Access Token can look just like an ID Token.)

This means that OIDC doesn't necessarily provide a consumer with a completely self-contained single object that both has all of the information about the person who authenticated and that lets you be sure that this object is intended for you. An ID Token by itself doesn't necessarily contain all of the claims, and while you can use any (opaque) Access Token to obtain a full set of claims, I believe that these claims don't have to include the 'aud' claim (although your OIDC IdP may choose to include it).

This is in a sense okay for OIDC. My understanding is that OIDC is not particularly aimed at the 'bearer token' usage case where the RP and the Resource Server are separate systems; instead, it's aimed at the 'website authenticating you' case where the RP is the party that will directly rely on the OIDC information. In this case the RP has (or can have) both the ID Token and the Access Token and all is fine.

(A lot of my understanding on this is due to the generosity of @Denvercoder9 and others after I was confused about this.)

Sidebar: Authorization flow versus implicit flow in OIDC authentication

In the implicit flow, you send people to the OIDC IdP and the OIDC IdP directly returns the ID Token and Access Token you asked for to your redirect URI, or rather has the person's browser do it. In this flow, the ID Token includes a partial hash of the Access Token and you use this to verify that the two are tied together. You need to do this because you don't actually know what happened in the person's browser to send them to your redirect URI, and it's possible things were corrupted by an attacker.

In the authorization flow, you send people to the OIDC IdP and it redirects them back to you with an 'authorization code'. You then use this code to call the OIDC IdP again at another endpoint, which replies with both the ID Token and the Access Token. Because you got both of these at once during the same HTTP conversation directly with the IdP, you automatically know that they go together. As a result, the ID Token doesn't have to contain any partial hash of the Access Token, although it can.

I think the corollary of this is that if you want to be able to hand the ID Token and the Access Token to a Resource Server and allow it to verify that the two are connected, you want to use the implicit flow, because that definitely means that the ID Token has the partial hash the Resource Server will need.
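For what it's worth, the tie-in is the 'at_hash' claim, and checking it is simple; here's a sketch assuming the ID Token is signed with RS256 (so the hash is SHA-256), with placeholder token values.

import base64
import hashlib

def compute_at_hash(access_token):
    # Hash the ASCII of the access token, keep the left half of the digest,
    # and base64url encode it without padding.
    digest = hashlib.sha256(access_token.encode("ascii")).digest()
    half = digest[:len(digest) // 2]
    return base64.urlsafe_b64encode(half).rstrip(b"=").decode("ascii")

def access_token_matches(id_token_claims, access_token):
    return id_token_claims.get("at_hash") == compute_at_hash(access_token)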

(There's also a hybrid flow which I'll let people read about in the standard.)

Chrome and the burden of developing a browser

By: cks

One part of the news of the time interval is that the US courts may require Google to spin off Chrome (cf). Over on the Fediverse, I felt this wasn't a good thing:

I have to reluctantly agree that separating Chrome from Google would probably go very badly¹. Browsers are very valuable but also very expensive public goods, and our track record of funding and organizing them as such in a way to not wind up captive to something is pretty bad (see: Mozilla, which is at best questionable on this). Google is not ideal but at least Chrome is mostly a sideline, not a main hustle.

¹ <Lauren Weinstein Fediverse post> [...]

One possible reaction to this is that it would be good for everyone if people stopped spending so much money on browsers and so everything involving them slowed down. Unfortunately, I don't think that this would work out the way people want, because popular browsers are costly beasts. To quote what I said on the Fediverse:

I suspect that the cost of simply keeping the lights on in a modern browser is probably on the order of plural millions of dollars a year. This is not implementing new things, this is fixing bugs, keeping up with security issues, monitoring CAs, and keeping the development, CI, testing, and update infrastructure running. This has costs for people, for servers, and for bandwidth.

The reality of the modern Internet is that browsers are load bearing infrastructure; a huge amount of things run through them, including and especially on minority platforms. Among other things, no browser is 'secure' and all of them are constantly under attack. We want browser projects that are used by lots of people to have enough resources (in people, build infrastructure, update servers, and so on) to be able to rapidly push out security updates. All browsers need a security team and any browser with addons (which should be all of them) needs a security team for monitoring and dealing with addons too.

(Browsers are also the people who keep Certificate Authorities honest, and Chrome is very important in this because of how many people use it.)

On the whole, it's a good thing for the web that Chrome is in the hands of an organization that can spend tens of millions of dollars a year on maintaining it without having to directly monetize it in some way. It would be better if we could collectively fund browsers as the public good that they are without having corporations in the way, because Google absolutely corrupts Chrome (also) and Mozilla has stumbled spectacularly (more than once). But we have to deal with the world that we have, not the world that we'd like to have, and in this world no government seems to be interested in seriously funding obvious Internet public goods (not only browsers but also, for example, free TLS Certificate Authorities).

(It's not obvious that a government funded browser would come out better overall, but at least there would be a chance of something different than the narrowing status quo.)

PS: Another reason that spending on browsers might not drop is that Apple (with Safari) and Microsoft (with Edge) are also in the picture. Both of these companies might take the opportunity to slow down, or they might decide that Chrome's potentially weak new position was a good moment to push for greater dominance and maybe lock-in through feature leads.

The many ways of getting access to information ('claims') in OIDC

By: cks

Any authentication and authorization framework, such as OIDC, needs a way for the identity provider (an 'OIDC OP') to provide information about the person or thing that was just authenticated. In OIDC specifically, what you get are claims that are grouped into scopes. You have to ask for specific scopes, and the IdP may restrict what scopes a particular client has access to. Well, that is not quite the full story, and the full story is complicated (more so than I expected when I started writing this entry).

When you talk to the OIDC identity server (OP) to authenticate, you (the program or website or whatever acting as the client) can get back either or both of an ID Token and an Access Token. I believe that in general your Access Token is an opaque string, although there's a standard for making it a JWT. Your ID Token is ultimately some JSON (okay, it's a JWT) and has certain mandatory claims like 'sub' (the subject) that you don't have to ask for with a scope. It would be nice if all of the claims from all of the scopes that you asked for were automatically included in the ID Token, but the OIDC standard doesn't require this. Apparently many but not all OIDC OPs include all the claims (at least by default); however, our OIDC OP doesn't currently do so, and I believe that Google's OIDC OP also doesn't include some claims.

(Unsurprisingly, I believe that there is a certain amount of OIDC-using software out there that assumes that all OIDC OPs return all claims in the ID Token.)

The standard approved and always available way to obtain the additional claims (which in some cases will be basically all claims) is to present your Access Token (not your ID Token) to the OIDC Userinfo endpoint at your OIDC OP. If your Access Token is (still) valid, what you will get back is either a plain, unsigned JSON listing of those claims (and their values) or perhaps a signed JWT of the same thing (which you can find out from the provider metadata). As far as I can see, you don't necessarily use the ID Token in this additional information flow, although you may want to be cautious and verify that the 'sub' claim is the same in the Userinfo response and the ID Token that is theoretically paired with your Access Token.
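A minimal sketch of this flow in Python (using the requests package; the endpoint URL and token are placeholders) looks something like this:

import requests

def fetch_userinfo(userinfo_endpoint, access_token):
    # Present the Access Token as a Bearer token; a valid token gets us
    # back the claims (as JSON here, or possibly a signed JWT).
    resp = requests.get(userinfo_endpoint,
                        headers={"Authorization": "Bearer " + access_token},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()

claims = fetch_userinfo("https://idp.example.org/userinfo", "ACCESS-TOKEN-HERE")
# If you also have a (verified) ID Token, it's prudent to check that
# claims["sub"] matches the ID Token's 'sub'.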

(As far as I can tell, the ID Token doesn't include a copy of the Access Token as another pseudo-claim. The two are provided to you at the same time (if you asked the OIDC OP for both), but are independent. The ID Token can't quite be verified offline because you need to get the necessary public key from the OIDC OP to check the signature.)

If I'm understanding things correctly (which may not be the case), in an OAuth2 authentication context, such as using OAUTHBEARER with the Dovecot IMAP server, I believe your local program will send the Access Token to the remote end and not do much with the ID Token, if it even requested one. The remote end then uses the Access Token with a pre-configured Userinfo endpoint to get a bunch of claims, and incidentally to validate that the Access Token is still good. In other protocols, such as the current version of OpenPubkey, your local program sends the ID Token (perhaps wrapped up) and so needs it to already include the full claims, and can't use the Userinfo approach. If what you have is a website that is both receiving the OIDC stuff and processing it, I believe that the website will normally ask for both the ID Token and the Access Token and then augment the ID Token information with additional claims from the Userinfo response (this is what the Apache OIDC module does, as far as I can see).

An OIDC OP may optionally allow clients to specifically request that certain claims be included in the ID Token that they get, through the "claims" request parameter on the initial request. One potential complication here is that you have to ask for specific claims, not simply 'all claims in this scope'; it's up to you to know what potentially non-standard claims you should ask for (and I believe that the claims you get have to be covered by the scopes you asked for and that the OIDC OP allows you to get). I don't know how widely implemented this is, but our OIDC OP supports it.

(An OIDC OP can list all of its available claims in its metadata, but doesn't have to. I believe that most OPs will list their scopes, although technically this is just 'recommended'.)

If you really want a self-contained signed object that has all of the information, I think you have to hope for an OIDC OP that either puts all claims in the ID Token by default or lets you ask for all of the claims you care about to be added for your request. Even if an OIDC OP gives you a signed userinfo response, it may not include all of the ID Token information and it might not be possible to validate various things later. You can always validate an Access Token by making a Userinfo request with it, but I don't know if there's any way to validate an ID Token.

We've chosen to 'modernize' all of our ZFS filesystems

By: cks

We are almost all of the way to the end of a multi-month process of upgrading our ZFS fileservers from Ubuntu 22.04 to 24.04 by also moving to more recent hardware. This involved migrating all of our pools and filesystems, terabytes of data in total. Our traditional way of doing this sort of migration (which we used, for example, when going from our OmniOS fileservers to our Linux fileservers) was the good old reliable 'zfs send | zfs receive' approach of sending snapshots over. This sort of migration is fast, reliable, and straightforward. However, it has one drawback, which is that it preserves all of the old filesystem's history, including things like the possibility of panics and possibly other problems.

We've been running ZFS for long enough that we had some ZFS filesystems that were still at ZFS filesystem version 4. In late 2023, we upgraded them all to ZFS filesystem version 5, and after that we got some infrequent kernel panics. We could never reproduce the kernel panics and they were very infrequent, but 'infrequent' is not the same as 'never' (the previous state of affairs), and it seemed likely that they were in some way related to upgrading our filesystem versions, which in turn was related to us having some number of very old filesystems. So in this migration, we deliberately decided to 'migrate' filesystems the hard way. Which is to say, rather than migrating the filesystems, we migrated the data with user level tools, moving it into pools and filesystems that were created from scratch on our new Ubuntu 24.04 fileservers (which led us to discover that default property values sometimes change in ways that we care about).

(The filesystems reused the same names as their old versions, because that keeps things easier for our people and for us.)

It's possible that this user level rewriting of all data has wound up laying things out in a better way (although all of this is on SSDs), and it's certainly ensured that everything has modern metadata associated with it and so on. The 'fragmentation' value of the new pools on the new fileservers is certainly rather lower than the value for most old pools, although what that means is a bit complicated.

There's a bit of me that misses the deep history of our old filesystems, some of which dated back to our first generation Solaris ZFS fileservers. However, on the whole I'm happy that we're now using filesystems that don't have ancient historical relics and peculiarities that may not be well supported by OpenZFS's code any more (and which were only likely to get less tested and more obscure over time).

(Our pools were all (re)created from scratch as part of our migration from OmniOS to Linux, and anyway would have been remade from scratch again in this migration even if we moved the filesystems with 'zfs send'.)

My Cinnamon desktop customizations (as of 2025)

By: cks

A long time ago I wrote up some basic customizations of Cinnamon, shortly after I started using Cinnamon (also) on my laptop of the time. Since then, the laptop got replaced with another one and various things changed in both the land of Cinnamon and my customizations (eg, also). Today I feel like writing down a general outline of my current customizations, which fall into a number of areas from the modest but visible to the large but invisible.

The large but invisible category is that just like on my main fvwm-based desktop environment, I use xcape (plus a custom Cinnamon key binding for a weird key combination) to invoke my custom dmenu setup (1, 2) when I tap the CapsLock key. I have dmenu set to come up horizontally on the top of the display, which Cinnamon conveniently leaves alone in the default setup (it has its bar at the bottom). And of course I make CapsLock into an additional Control key when held.

(On the laptop I'm using a very old method of doing this. On more modern Cinnamon setups in virtual machines, I do this with Settings → Keyboard → Layout → Options, and then in the CapsLock section set CapsLock to be an additional Ctrl key.)

To start xcape up and do some other things, like load X resources, I have a personal entry in Settings → Startup Applications that runs a script in my ~/bin/X11. I could probably do this in a more modern way with an assortment of .desktop files in ~/.config/autostart (which is where my 'Startup Applications' settings actually wind up) that run each thing individually, or perhaps with some systemd user units. But the current approach works and is easy to modify if I want to add or remove things (I can just edit the script).

I have a number of Cinnamon 'applets' installed on my laptop and my other Cinnamon VM setups. The ones I have everywhere are Spices Update and Shutdown Applet, the latter because if I tell the (virtual) machine to log me off, shut down, or restart, I generally don't want to be nagged about it. On my laptop I also have CPU Frequency Applet (set to only display a summary) and CPU Temperature Indicator, for no compelling reason. In all environments I also pin launchers for Firefox and (Gnome) Terminal to the Cinnamon bottom bar, because I start both of them often enough. I position the Shutdown Applet on the left side, next to the launchers, because I think of it as a peculiar 'launcher' instead of an applet (on the right).

(The default Cinnamon keybindings also start a terminal with Ctrl + Alt + T, which you can still find through the same process from several years ago provided that you don't cleverly put something in .local/share/glib-2.0/schemas and then run 'glib-compile-schemas .' in that directory. If I was a smarter bear, I'd understand what I should have done when I was experimenting with something.)

On my virtual machines with Cinnamon, I don't bother with the whole xcape and dmenu framework, but I do set up the applets and the launchers and fix CapsLock.

(This entry was sort of inspired by someone I know who just became a Linux desktop user (after being a long time terminal user).)

Sidebar: My Cinnamon 'window manager' custom keybindings

I have these (on my laptop) and perpetually forget about them, so I'm going to write them down now so perhaps that will change.

move-to-corner-ne=['<Alt><Super>Right']
move-to-corner-nw=['<Alt><Super>Left']
move-to-corner-se=['<Primary><Alt><Super>Right']
move-to-corner-sw=['<Primary><Alt><Super>Left']
move-to-side-e=['<Shift><Alt><Super>Right']
move-to-side-n=['<Shift><Alt><Super>Up']
move-to-side-s=['<Shift><Alt><Super>Down']
move-to-side-w=['<Shift><Alt><Super>Left']

I have some other keybindings on the laptop but they're even less important, especially once I added dmenu.

I feel that DANE is not a good use of DNS

By: cks

DANE is commonly cited as a "wouldn't it be nice" alternative to the current web TLS ('PKI') system. It's my view that DANE is an example of why global DNS isn't a database and shouldn't be used as one. The usual way to describe DANE is that 'it lets you publish your TLS certificates in DNS'. This is not actually what it does, because DNS does not 'publish' anything in the sense of a database or a global directory. DANE lets some unknown set of entities advertise some unknown set of TLS certificates for your site to an unknown set of people. Or at least you don't know the scope of the entities, the TLS certificates, and the people, apart from you, your TLS certificate, and the people who (maybe) come directly to you without being intercepted.

(This is in a theoretical world where DNSSEC is widely deployed and reaches all the way to programs that are doing DNS resolution. That is not this world, where DNSSEC has failed.)

DNS specifically allows servers (run by people) to make up answers to things they get asked. Obviously this would be bad when the answers are about your TLS certificates, so DANE and other things like it try to paper over the problem by adding a cascading hierarchy of signing. The problem is that this doesn't eliminate the issue, it merely narrows who can insert themselves into the chain of trust, from 'the entire world' to 'anyone already in the DNSSEC path or who can break into it', including the TLD operator for your domain's TLD.

There are a relatively small number of Certificate Authorities in the world and even large ones have had problems, never mind the one that got completely compromised. Our most effective tool against TLS mis-issuance is exactly a replicated, distributed global record of issued certificates. DNS and DANE bypass this, unless you require all DANE-obtained TLS certificates to be in Certificate Transparency logs just like CA-issued TLS certificates (and even then, Certificate Transparency is an after the fact thing; the damage has probably been done once you detect it).

In addition, there's no obvious way to revoke or limit DNSSEC the way there is for a mis-behaving Certificate Authority. If a TLD had its DNSSEC environment completely compromised, does anyone think it would be removed from global DNS, the way DigiNotar was removed from global trust? That's not very likely; the damage would be too severe for most TLDs. One of the reasons that Certificate Authorities can be distrusted is that what they do is limited and you can replace one with another. This isn't true for DNS and TLDs.

DNS is an extremely bad fit for a system where you absolutely want everyone to see the same 'answer' and to have high assurance that you know what that answer is (and that you are the only person who can put it there). It's especially bad if you want to globally limit who is trusted and allow that trust to be removed or limited without severe damage. In general, if security would be significantly compromised should people receive a different answer than the one you set up, DNS is not what you want to use.

(I know, this is how DNS and email mostly work today, but that is historical evolution and backward compatibility. We would not design email to work like that if we were doing it from scratch today.)

(This entry was sparked by ghop's comment mentioning DANE on my first entry.)

Tailscale's surprising interaction of DNS settings and 'exit nodes'

By: cks

Tailscale is a well regarded commercial mesh networking system, based on WireGuard, that can be pressed into service as a VPN as well. As part of its general features, it allows you to set up various sorts of DNS settings for your tailnet (your own particular Tailscale mesh network), including both DNS servers for specific (sub)domains (eg an 'internal.example.org') and all DNS as a whole. As part of optionally being VPN-like, Tailscale also lets you set up exit nodes, which let you route all traffic for the Internet out the exit node (if you want to route just some subnets to somewhere, that's a subnet router, a different thing). If you're a normal person, especially if you're a system administrator, you probably have a guess as to how these two features interact. Unfortunately, you may well be wrong.

As of today, if you use a Tailscale exit node, all of your DNS traffic is routed to the exit node regardless of Tailscale DNS settings. This applies to both DNS servers for specific subdomains and to any global DNS servers you've set for your tailnet (due to, for example, 'split horizon' DNS). Currently this is documented only in one little sentence in small type in the "Use Tailscale DNS settings" portion of the client preferences documentation.

In many Tailscale environments, all this does is make your DNS queries take an extra hop (from you to the exit node and then to the configured DNS servers). Your Tailscale exit nodes are part of your tailnet, so in ordinary configurations they will have your Tailscale DNS settings and be able to query your configured DNS servers (and they will probably get the same answers, although this isn't certain). However, if one of your exit nodes isn't set up this way, potential pain and suffering is ahead of you. Your tailnet nodes that are using this exit node will get wildly different DNS answers than you expect, potentially not resolving internal domains and maybe getting different answers than you'd expect (if you have split horizon DNS).

One reason that you might set an exit node machine to not use your Tailscale DNS settings (or subnet routes) is that you're only using it as an exit node, not as a regular participant in your tailnet. Your exit node machine might be placed on a completely different network (and in a completely different trust environment) than the rest of your tailnet, and you might have walled off its (less-trusted) traffic from the rest of your network. If the only thing the machine is supposed to be is an Internet gateway, there's no reason to have it use internal DNS settings, and it might not normally be able to reach your internal DNS servers (or the rest of your internal servers).

In my view, a consequence of this is that it's probably best to have any internal DNS servers directly on your tailnet, with their tailnet IP addresses. This makes them as reachable as possible to your nodes, independent of things like subnet routes.

PS: Routing general DNS queries through a tailnet exit node makes sense in this era of geographical DNS results, where you may get different answers depending on where in the world you are and you'd like these to match up with where your exit node is.

(I'm writing this entry because this issue was quite mysterious to us when we ran into it while testing Tailscale and I couldn't find much about it in online searches.)

The clever tricks of OpenPubkey and OPKSSH

By: cks

OPKSSH (also) is a clever way of using OpenID Connect (OIDC) to authenticate your OpenSSH sessions (it's not the only way to do this). How it works is sufficiently ingenious and clever that I want to write it up, especially as one underlying part uses a general trick.

OPKSSH itself is built on top of OpenPubkey, which is a trick to associate your keypair with an OIDC token. When you perform OIDC authentication, what you get back (at an abstract level) is a signed set of 'claims' and, crucially, a nonce. The nonce is supplied by the client that initiated the OIDC authentication so that it can know that the ID token it eventually gets back actually comes from this authentication session and wasn't obtained through some other one. The client initiating OIDC authentication doesn't get to ask the OIDC identity provider (OP) to include other fields.

What OpenPubkey does is turn the nonce into a signature for a combination of your public key and a second nonce of its own, by cryptographically hashing these together through a defined process. Because the OIDC IdP is signing a set of claims that include the calculated nonce, it is effectively signing a signature of your public key. If you give people the signed OIDC ID token, your public key, and your second nonce, they can verify this (and you can augment the ID Token you get back to produce a PK Token that embeds this additional information).

(As I understand it, calculating the OIDC ID Token nonce this way is safe because it still includes a random value (the inner nonce) and due to the cryptographic hashing, the entire calculated nonce is still effectively a non-repeating random value.)
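Here's a sketch of the general trick in Python; this is the shape of the idea, not OpenPubkey's exact encoding (the field names and hashing details here are made up).

import base64
import hashlib
import json
import secrets

def make_commitment_nonce(public_key_pem):
    # The OIDC nonce becomes a hash commitment over our public key plus a
    # random inner nonce, so the IdP's signature over the nonce indirectly
    # signs the public key.
    inner_nonce = secrets.token_urlsafe(32)
    payload = json.dumps({"upk": public_key_pem, "rz": inner_nonce},
                         sort_keys=True).encode("utf-8")
    nonce = base64.urlsafe_b64encode(hashlib.sha256(payload).digest()).decode("ascii")
    return nonce, inner_nonce    # send 'nonce' to the IdP, reveal 'inner_nonce' later

def verify_commitment(nonce, public_key_pem, inner_nonce):
    payload = json.dumps({"upk": public_key_pem, "rz": inner_nonce},
                         sort_keys=True).encode("utf-8")
    expected = base64.urlsafe_b64encode(hashlib.sha256(payload).digest()).decode("ascii")
    return secrets.compare_digest(nonce, expected)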

To smuggle this PK Token to the OpenSSH server, OPKSSH embeds it as an additional option field in an OpenSSH certificate (called 'openpubkey-pkt'). The certificate itself is for your generated PK Token private key and is (self) signed with it, but this is all perfectly fine with OpenSSH; SSH clients will send the certificate off to the server as a candidate authentication key and the server will read it in. Normally the server would reject it since it's signed by an unknown SSH certificate authority, but OPKSSH uses a clever trick with OpenSSH's AuthorizedKeysCommand server option to get its hands on the full certificate, which lets it extract the PK Token, verify everything, and tell the SSH server daemon that your public key is the underlying OpenPubkey key (which you have the private key for in your SSH client).

Smuggling information through OpenSSH certificates and then processing them with AuthorizedKeysCommand is a clever trick, but it's specific to OpenSSH. Turning a nonce into a signature is a general trick that was eye-opening to me, especially because you can probably do it repeatedly.

The appeal of serving your web pages with a single process

By: cks

As I slowly work on updating the software behind this blog to deal with the unfortunate realities of the modern web (also), I've found myself thinking (more than once) how much simpler my life would be if I was serving everything through a single process, instead of my eccentric, more or less stateless CGI-based approach. The simple great thing about doing everything through a single process (with threads, goroutines, or whatever inside it for concurrency) is that you have all the shared state you could ever want, and that shared state makes it so easy to do so many things.

Do you have people hitting one URL too often from a single IP address? That's easy to detect, track, and return HTTP 429 responses for until they cool down. Do you have an IP making too many requests across your entire site? You can track that sort of volume information. There's all sorts of potential bad stuff that it's at least easier to detect when you have easy shared global state. And the other side of this is that it's also relatively easy to add simple brute force caching in a single process with global state.

(Of course you have some practical concerns about memory and CPU usage, depending on how much stuff you're keeping track of and for how long.)
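To make this concrete, here's a minimal sketch (definitely not production code) of the kind of in-process, shared-state rate limiting that becomes trivial in a single process; the window and limit numbers are made up.

import time
from collections import defaultdict, deque

WINDOW = 60.0          # seconds
LIMIT = 30             # requests per IP per window
_requests = defaultdict(deque)

def too_many_requests(ip):
    # Keep a sliding window of request timestamps per client IP.
    now = time.monotonic()
    q = _requests[ip]
    while q and now - q[0] > WINDOW:
        q.popleft()
    q.append(now)
    return len(q) > LIMIT

# In a request handler: if too_many_requests(client_ip), return an
# HTTP 429 response (ideally with a Retry-After header).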

You can do a certain amount of this detection with a separate 'database' process of some sort (or a database file, like sqlite), and there's various specialized software that will let you keep this sort of data in memory (instead of on disk) and interact with it easily. But this is an extra layer or two of overhead over simply updating things in your own process, especially if you have to set up things like a database schema for what you're tracking or caching.

(It's my view that ease of implementation is especially useful when you're not sure what sort of anti-abuse measures are going to be useful. The easier it is to implement something and at least get logs of what and how much it would have done, the more you're going to try and the more likely you are to hit on something that works for you.)

Unfortunately it seems like we're only going to need more of this kind of thing in our immediate future. I don't expect the level of crawling and abuse to go down any time soon; if anything, I expect it to keep going up, especially as more and more websites move behind effective but heavyweight precautions and the crawlers turn more of their attention to the rest of us.

Looking at what NFSv4 clients have locked on a Linux NFS(v4) server

By: cks

A while ago I wrote an entry about (not) finding which NFSv4 client owns a lock on a Linux NFS(v4) server, where the best I could do was pick awkwardly through the raw NFS v4 client information in /proc/fs/nfsd/clients. Recently I discovered an alternative to doing this by hand, which is the nfsdclnts program, and as a result of digging into it and what I was seeing when I tried it out, I now believe I have a better understanding of the entire situation (which was previously somewhat confusing).

The basic thing that nfsdclnts will do is list 'locks' and some information about them with 'nfsdclnts -t lock', in addition to listing other state information such as 'open', for open files, and 'deleg', for NFS v4 delegations. The information it lists is somewhat limited, for example it will list the inode number but not the filesystem, but on the good side nfsdclnts is a Python program so you can easily modify it to report any extra information that exists in the clients/#/states files. However, this information about locks is not complete, because of how file level locks appear to normally manifest in NFS v4 client state.

(The information in the states files is limited, although it contains somewhat more than nfsdclnts shows.)

Here is how I understand NFS v4 locking and states. To start with, NFS v4 has a feature called delegations where the NFS v4 server can hand a lot of authority over a file to a NFS v4 client. When a NFS v4 client accesses a file, the NFS v4 server likes to give it a delegation if this is possible; it normally will be if no one else has the file open or active. Once a NFS v4 client holds a delegation, it can lock the file without involving the NFS v4 server. At this point, the client's 'states' file will report an opaque 'type: deleg' entry for the file (and this entry may or may not have a filename or instead be what nfsdclnts will report as 'disconnected dentry').

While a NFS v4 client has the file delegated, if any other NFS v4 client does anything with the file, including simply opening it, the NFS v4 server will recall the delegation from the original client. As a result, the original client now has to tell the NFS v4 server that it has the file locked. At this point a 'type: lock' entry for the file appears in the first NFS v4 client's states file. If the first NFS v4 client releases its lock while the second NFS v4 client is trying to acquire it, the second NFS v4 client will not have a delegation for the file, so its lock will show up as an explicit 'type: lock' entry in its states file.

An additional wrinkle is that a NFS v4 client holding a delegation doesn't immediately release it once all processes have released their locks, closed the file, and so on. Instead the delegation may linger on for some time. If another NFS v4 client opens the file during this time, the first client will lose the delegation but the second NFS v4 client may not get a delegation from the NFS v4 server, so its lock will be visible as a 'type: lock' states file entry.

A third wrinkle is that multiple clients may hold read-only delegations for a file and have fcntl() read locks on it at once, with each of them having a 'type: deleg, access: r' entry for it in their states files. These will only become visible 'type: lock' states entries if the clients have to release their delegations.

So putting this all together:

  • If there is a 'type: lock' entry for the file in any states file (or it's listed in 'nfsdclnts -t lock'), the file is definitely locked by whoever has that entry.

  • If there are no 'type: deleg' or 'type: lock' entries for the file, it's definitely not locked; you can also see this by whether nfsdclnts lists it as having delegations or locks.

  • If there are 'type: deleg' entries for the file, it may or may not be locked by the NFS v4 client (or clients) with the delegation. If the delegation is an 'access: w' delegation, you can see if someone actually has the file locked by accessing the file on another NFS v4 client, which will force the NFS v4 server to recall the delegation and expose the lock if there is one.

If the delegation is 'access: r' and might have multiple read-only locks, you can't force the NFS v4 server to recall the delegation by merely opening the file read-only (for example with 'cat file' or 'less file'). Instead the server will only recall the delegation if you open the file read-write. A convenient way to do this is probably to use 'flock -x <file> -c /bin/true', although this does require you to have more permissions for the file than simply the ability to read it.
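If you'd rather do this from a scripting language, here's a sketch of the same idea in Python (the path is a placeholder); opening the file read-write and briefly taking an exclusive lock should force the NFS v4 server to recall the read delegations.

import fcntl

def poke_delegations(path):
    # A read-write open (plus an exclusive lock, to be thorough) forces the
    # NFS v4 server to recall read delegations held by other clients.
    with open(path, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        fcntl.flock(f, fcntl.LOCK_UN)

poke_delegations("/some/nfs/mount/the/file")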

Sidebar: Disabling NFS v4 delegations on the server

Based on trawling various places, I believe this is done by writing a '0' to /proc/sys/fs/leases-enabled (or the equivalent 'fs.leases-enabled' sysctl) and then apparently restarting your NFS v4 server processes. This will disable all user level uses of fcntl()'s F_SETLEASE and F_GETLEASE as an additional effect, and I don't know if this will affect any important programs running on the NFS server itself. Based on a study of the kernel source code, I believe that you don't need to restart your NFS v4 server processes if it's sufficient for the NFS server to stop handing out new delegations but current delegations can stay until they're dropped.

(There have apparently been some NFS v4 server and client issues with delegations, cf, along with other NFS v4 issues. However, I don't know if the cure winds up being worse than the disease here, or if there's another way to deal with these stateid problems.)

The DNS system isn't a database and shouldn't be used as one

By: cks

Over on the Fediverse, I said something:

Thesis: DNS is not meaningfully a database, because it's explicitly designed and used today so that it gives different answers to different people. Is it implemented with databases? Sure. But treating it as a database is a mistake. It's a query oracle, and as a query oracle it's not trustworthy in the way that you would normally trust a database to be, for example, consistent between different people querying it.

It would be nice if we had a global, distributed, relatively easily queryable, consistent database system. It would make a lot of things pretty nice, especially if we could wrap some cryptography around it to make sure we were getting honest answers. However, the general DNS system is not such a database and can't be used as one, and as a result should not be pressed into service as one in protocols.

DNS is designed from the ground up to lie to you in unpredictable ways, and parts of the DNS system lie to you every day. We call these lies things like 'outdated cached data' or 'geolocation based DNS' (or 'split horizon DNS'), but they're lies, or at least inconsistent alternate versions of some truth. The same fundamental properties that allow these inconsistent alternate versions also allow for more deliberate and specific lies, and they also mean that no one can know with assurance what version of DNS anyone else is seeing.

(People who want to reduce the chance for active lies as much as possible must do a variety of relatively extreme things, like query DNS from multiple vantage points around the Internet and perhaps through multiple third party DNS servers. No, checking DNSSEC isn't enough, even when it's present (also), because that just changes who can be lying to you.)

Anything that uses the global DNS system should be designed to expect outdated, inconsistent, and varying answers to the questions it asks (and sometimes incorrect answers, for various reasons). Sometimes those answers will be lies (including the lie of 'that name doesn't exist'). If your design can't deal with all of this, you shouldn't be using DNS.

ZFS's delayed compression of written data (when compression is enabled)

By: cks

In a comment on my entry about how Unix files have at least two sizes, Leah Neukirchen said that 'ZFS compresses asynchronously' and noted that this could cause the reported block size of a just-written file to change over time. This way of describing ZFS's behavior made me twitch and it took me a bit of thinking to realize why. What ZFS does is delayed compression (which is asynchronous with your user level write() calls), but not true 'asynchronous compression' that happens later at an unpredictable time.

Like basically all filesystems, ZFS doesn't immediately start writing data to disk when you do a write() system call. Instead it buffers this data in memory for a while and only writes it later. As part of this, ZFS doesn't immediately decide where on disk the data will be written (this is often called 'delayed allocation' and is common in many filesystems) and otherwise prepare it to be written out. As part of this delayed allocation and preparation, ZFS doesn't immediately compress your written data, and as a result ZFS doesn't know how many disk blocks your data will take up. Instead your data is only compressed and has disk blocks allocated for it as part of ZFS's pipeline of actually performing IO, when the data is flushed to disk, and only then is its physical block size known.

However, once written to disk, the data's compression or lack of it is never changed (nor is anything else about it; ZFS never modifies data once it's written). For example, data isn't initially written in uncompressed form and then asynchronously compressed later. Nor is there anything that goes around asynchronously compressing or decompressing data if you turn on or off compression on a ZFS filesystem (or change the compression algorithm). This periodically irks people who wish they could turn compression on on an existing filesystem, or change the compression algorithm, and have this take effect 'in place' to shrink the amount of space the filesystem is using.

Delaying compressing data until you're writing it out is a sensible decision for a variety of reasons. One of them is that ZFS compresses your data in potentially large chunks, and you may not write() all of a chunk at once. If you wrote half a chunk now and the other half later, before it got flushed to disk, it would be a waste of effort to compress your half a chunk now and then throw away that work when you compressed the whole chunk.

(I also suspect that it was simpler to add compression to ZFS as part of its IO pipeline than to do it separately. ZFS already had a multi-stage IO pipeline, so adding compression and decompression as another step was probably relatively straightforward.)

Unix files have (at least) two sizes

By: cks

I'll start by presenting things in illustrated form:

; ls -l testfile
-rw-r--r-- 1 cks 262144 Apr 13 22:03 testfile
; ls -s testfile
1 testfile
; ls -slh testfile
512 -rw-r--r-- 1 cks 256K Apr 13 22:03 testfile

The two well known sizes that Unix files have are the logical 'size' in bytes and what stat.h describes as "the number of blocks allocated for this object", often converted to some number of bytes (as ls is doing here in the last command). A file's size in bytes is roughly speaking the last file offset that has been written to in the file, and not all of the bytes covered by it may have actually been written; when this is the case, the result is a sparse file. Sparse files are the traditional cause of a mismatch between the byte size and the number of blocks a file uses. However, that is not what is happening here.

This file is on a ZFS filesystem with ZFS's compression turned on, and it was created with 'dd if=/dev/zero of=testfile bs=1k count=256'. In ZFS, zeroes compress extremely well, and so ZFS has written basically no physical data blocks and faithfully reported that (minimal) number in the stat() st_blocks field. However, at the POSIX level we have indeed written data to all 256 KBytes of the file; it's not a sparse file. This is an extreme example of filesystem compression, and there are plenty of lesser ones.

This leaves us with a third size, which is the number of logical blocks for this file. When a filesystem is doing data compression, this number will be different from the number of physical blocks used. As far as I can tell, the POSIX stat.h description doesn't specify which one you have to report for st_blocks. As we can see, ZFS opts to report the physical block size of the file, which is probably the more useful number for the purposes of things like 'du'. However, it does leave us with no way of finding out the logical block size, which we may care about for various reasons (for example, if our backup system can skip unwritten sparse blocks but always writes out uncompressed blocks).
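This difference between the two sizes also shows up in tools built on top of them. With GNU coreutils, du will report either one, which makes for a quick check:

du -h testfile
du -h --apparent-size testfile

For the testfile above, the first reports the tiny physical space actually allocated and the second reports the 256K logical size; neither gives you the file's logical block count.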

This also implies that a non-sparse file can change its st_blocks number if you move it from one filesystem to another. One filesystem might have compression on and the other one have it off, or they might have different compression algorithms that give different results. In some cases this will cause the file's space usage to expand so that it doesn't actually fit into the new filesystem (or for a tree of files to expand their space usage).

(I don't know if there are any Unix filesystems that report the logical block size in st_blocks and only report the physical block size through a private filesystem API, if they report it at all.)

Mandatory short duration TLS certificates are probably coming soon

By: cks

The news of the time interval is that the maximum validity period for TLS certificates will be lowered to 47 days by March 2029, unless the CA/Browser Forum changes its mind (or is forced to) before then. The details are discussed in SC-081. In skimming the mailing list thread on the votes, a number of organizations that voted to abstain seem unenthused (and uncertain that it can actually be implemented), so this may not come to pass, especially on the timeline proposed here.

If and when this comes to pass, I feel confident that this will end manual certificate renewals at places that are still doing them. With that, it will effectively end Certificate Authorities that don't have an API that you can automatically get certificates through (not necessarily a free or public API). I'm not sure what it's going to do to the Certificate Authority business models for commercial CAs, but I also don't think the browsers care about that issue and the browsers are driving.

This will certainly cause pain. I know of places around the university that are still manually handling one-year TLS certificates; those places will have to change over the course of a few years. This pain will arrive well before 2029; based on the proposed changes, starting March 15, 2027, the maximum certificate validity period will be 100 days, which is short enough to be decidedly annoying. Even a 200 day validity period (starting March 15 2026) will be somewhat painful to do by hand.

I expect one consequence to be that some number of (internal) devices stop having valid TLS certificates, because they can only have certificates loaded into them manually and no one is going to do that every 40-odd or even every 90-odd days. You might manually get and load a valid TLS certificate every year; you certainly won't do it every three months (well, almost no one will).

I hope that this will encourage the creation and growth of more alternatives to Let's Encrypt, even if not all of them are free, since more and more CAs will be pushed to have an API and one obvious API to adopt is ACME.

(I can also imagine ways to charge for an ACME based API, even with standard ACME clients. One obvious way would be to only accept ACME requests for domains that the CA had some sort of site license with. You'd establish the site license through out of band means, not ACME.)
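Mechanically this is straightforward for standard ACME clients, since they let you point at an arbitrary CA's ACME directory URL, and External Account Binding covers the 'you must have an account with us' part. A hypothetical certbot invocation (the CA URL here is made up, and the EAB credentials would come from the CA out of band):

certbot certonly --standalone -d www.example.org \
  --server https://acme.example-ca.net/directory \
  --eab-kid <your-key-id> --eab-hmac-key <your-hmac-key>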

How I install personal versions of programs (on Unix)

By: cks

These days, Unixes are quite generous in what they make available through their packaging systems, so you can often get everything you want through packages that someone else worries about building, updating, managing, and so on. However, not everything is available that way; sometimes I want something that isn't packaged, and sometimes (especially on 'long term support' distributions) I want something that's more recent than what the system provides (for example, Ubuntu 22.04 only has Emacs 27.1). Over time, I've evolved my own approach for managing my personal versions of such things, which is somewhat derived from the traditional approach for multi-architecture Unixes here.

The starting point is that I have a ~/lib/<architecture> directory tree. When I build something personally, I tell it that its install prefix is a per-program directory within this tree, for example, '/u/cks/lib/<arch>/emacs-30.1'. These days I only have one active architecture inside ~/lib, but old habits die hard, and someday we may start using ARM machines or FreeBSD. If I install a new version of the program, it goes in a different (versioned) subdirectory, so I have 'emacs-29.4' and 'emacs-30.1' directory trees.

I also have both a general ~/bin directory, for general scripts and other architecture independent things, and a ~/bin/bin.<arch> subdirectory, for architecture dependent things. When I install a program into ~/lib/<arch>/<whatever> and want to use it, I will make either a symbolic link or a cover script in ~/bin/bin.<arch> for it, such as '~/bin/bin.<arch>/emacs'. This symbolic link or cover script always points to what I want to use as the current version of the program, and I update it when I want to switch.
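To make this concrete, installing a new Emacs version goes roughly like this, with '<arch>' standing in for the actual architecture name (and the usual caveat that configure and build details vary from program to program):

cd emacs-30.1
./configure --prefix=$HOME/lib/<arch>/emacs-30.1
make && make install
ln -sf $HOME/lib/<arch>/emacs-30.1/bin/emacs $HOME/bin/bin.<arch>/emacs

Switching to a different version later is just a matter of re-pointing that one symbolic link (or editing the cover script).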

(If I'm building and installing something from the latest development tree, I'll often call the subdirectory something like 'fvwm3-git' and then rename it later if I want to keep multiple versions around. This is not as good as real versioned subdirectories, but I tend to do this for things that I won't ever run two versions of at the same time; at most I'll switch back and forth.)

Some things I use, such as pipx, normally install programs (or symbolic links to them) into places like ~/.local/bin or ~/.cargo/bin. Because it's not worth fighting city hall on this one, I pretty much let them do so, but I don't add either directory to my $PATH. If I want to use a specific tool that they install and manage, I put a symbolic link or a cover script in my ~/bin/bin.<arch>. The one exception to this is Go, where I do have ~/go/bin in my $PATH because I use enough Go based programs that it's the path of least resistance.

This setup isn't perfect, because right now I don't have a good general approach for things that depend on the Ubuntu version (where an Emacs 30.1 built on 22.04 doesn't run on 24.04). If I ran into this a lot I'd probably make an additional ~/bin/bin.<something> directory for the Ubuntu version and then put version specific things there. And in general, Go and Cargo are not ready for my home directory to be shared between different binary architectures. For Go, I would probably wind up setting $GOPATH to something like ~/lib/<arch>/go. Cargo has a similar system for deciding where it puts stuff but I haven't looked into it in detail.

(From a quick skim of 'cargo help install' and my ~/.cargo, I suspect that I'd point $CARGO_INSTALL_ROOT into my ~/lib/<arch> but leave $CARGO_HOME unset, so that various bits of Cargo's own data remain shared between architectures.)
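If I ever do need to share my home directory across architectures again, the settings would probably look something like this sketch (the 'cargo' subdirectory name is just an illustration, and I haven't tested any of it):

export GOPATH=$HOME/lib/<arch>/go
export CARGO_INSTALL_ROOT=$HOME/lib/<arch>/cargo
# leave $CARGO_HOME unset so Cargo's own data stays shared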

(This elaborates a bit on a Fediverse conversation.)

PS: In theory I have a system for keeping track of the command lines used to build things (also), one that I'd forgotten about when I wrote the more recent entry on this system. In practice I've fallen out of the habit of using it when I build things for my ~/lib, although I should probably get back into it. For GNU Emacs, I put the ./configure command line into a file in ~/lib/<arch>, since I expected to build enough versions of Emacs over time.

One way to set up local programs in a multi-architecture Unix environment

By: cks

Back in the old days, it used to be reasonably routine to have 'multi-architecture' Unix environments with shared files (where here 'architecture' was a combination of the processor architecture and the Unix variant). The multi-architecture days have faded out, and as they've faded, so has information about how people made this work with things like local binaries.

In the modern era of large local disks and build farms, the default approach is probably to simply build complete copies of '/local' for each architecture type and then distribute the result around somehow. In the old days people were a lot more interested in reducing disk space by sharing common elements and then doing things like NFS-mounting your entire '/local', which made life more tricky. There likely were many solutions to this, but the one I learned at the university as a young sprout worked like the following.

The canonical paths everyone used and had in their $PATH were things like /local/bin, /local/lib, /local/man, and /local/share. However, you didn't (NFS) mount /local; instead, you NFS mounted /local/mnt (which was sort of an arbitrary name, as we'll see). In /local/mnt there were 'share' and 'man' directories, and also a per-architecture directory for every architecture you supported, with names like 'solaris-sparc' or 'solaris-x86'. These per-architecture directories contained 'bin', 'lib', 'sbin', and so on subdirectories.

(These directories contained all of the locally installed programs, all jumbled together, which did have certain drawbacks that became more and more apparent as you added more programs.)

Each machine had a /local directory on its root filesystem that contained /local/mnt, symlinks from /local/share and /local/man to 'mnt/share' and 'mnt/man', and then symlinks for the rest of the directories that went to 'mnt/<arch>/bin' (or sbin or lib). Then everyone mounted /local/mnt on, well, /local/mnt. Since /local and its contents were local to the machine, you could have different symlinks on each machine that used the appropriate architecture (and you could even have built them on boot if you really wanted to, although in practice they were created when the machine was installed).
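Concretely, setting up a machine's side of this looked something like the following sketch (with 'solaris-sparc' and the NFS server name as stand-ins; in practice the mount normally came from the machine's fstab or equivalent rather than being run by hand):

mkdir -p /local/mnt
ln -s mnt/share /local/share
ln -s mnt/man /local/man
ln -s mnt/solaris-sparc/bin /local/bin
ln -s mnt/solaris-sparc/lib /local/lib
ln -s mnt/solaris-sparc/sbin /local/sbin
mount fileserver:/export/local /local/mnt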

When you built software for this environment, you told it that its prefix was /local, and let it install itself (on a suitable build server) using /local/bin, /local/lib, /local/share and so on as the canonical paths. You had to build (and install) software repeatedly, once for each architecture, and it was on the software (and you) to make sure that /local/share/<whatever> was in fact the same from architecture to architecture. System administrators used to get grumpy when people accidentally put architecture dependent things in their 'share' areas, but generally software was pretty good about this in the days when it mattered.

(In some variants of this scheme, the mount points were a bit different because the shared stuff came from one NFS server and the architecture dependent parts from another, or might even be local if your machine was the only instance of its particular architecture.)

There were much more complicated schemes that various places did (often universities), including ones that put each separate program or software system into its own directory tree and then glued things together in various ways. Interested parties can go through LISA proceedings from the 1980s and early 1990s.

The problem of general OIDC identity provider support in clients

By: cks

I've written entries criticizing things that support using OIDC (OAuth2) authentication for not supporting it with general OIDC identity providers ('OPs' in OIDC jargon), only with specific (large) ones like Microsoft and Google (and often Github in tech-focused things). For example, there are almost no mail clients that support using your own IdP, and it's much easier to find web-based projects that support the usual few big OIDC providers and not your own OIDC OP. However, at the same time I want to acknowledge the practical problems with supporting arbitrary OIDC OPs in things, especially in things that ordinary people are going to be expected to set up themselves.

The core problem is that there is no way to automatically discover all of the information that you need to know in order to start OIDC authentication. If the person gives you their email address, perhaps you can use WebFinger to discover basic information through OIDC Identity Provider discovery, but that isn't sufficient by itself (and it also requires aligning a number of email addresses). In practice, the OIDC OP will require you to have a 'client identifier' and perhaps a 'client secret', both of which are essentially arbitrary strings. If you're a website, the OIDC standards require your 'redirect URI' to have been pre-registered with it. If you're a client program, hopefully you can supply some sort of 'localhost' redirect URI and have it accepted, but you may need to tell the person setting things up on the OIDC OP side that you need specific strings set.

(The client ID and especially the client secret are not normally supposed to be completely public; there are various issues if you publish them widely and then use them for a bunch of different things, cf.)
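For what it's worth, the part that can be discovered automatically is the OP's own metadata; if you know the issuer URL you can fetch its standard discovery document (the hostname here is a placeholder):

curl https://idp.example.org/.well-known/openid-configuration

This tells you the endpoints and the supported scopes and claims, but nothing in it gives you a client ID, a client secret, or a registered redirect URI; those still have to be set up and communicated by hand.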

If you need specific information, even to know who the authenticated person is, this isn't necessarily straightforward. You may have to ask for exactly the right information, neither too much nor too little, and you can't necessarily assume you know where a user or login name is; you may have to ask the person setting up the custom OIDC IdP where to get this. On the good side, there is at least a specific place for where people's email addresses are (but you can't assume that this is the same as someone's login).

(In OIDC terms, you may need to ask for specific scopes and then use a specific claim to get the user or login name. You can't assume that the always-present 'sub' claim is a login name, although it often is; it can be an opaque identifier that's only meaningful to the identity provider.)

Now imagine that you're the author of a mail client that wants to provide a user friendly experience to people. Today, the best you can do is provide a wall of text fields that people have to enter the right information into, with very little validation possible. If people get things even a little bit wrong, all you and they may see is inscrutable error messages. You're probably going to have to describe what people need to do and the information they need to get in technical OIDC terms that assume people can navigate their specific OIDC IdP (or that someone can navigate this for them). You could create a configuration file format for this where the OIDC IdP operator can write down all of the information, give it to the people using your software, and they import it (much like OpenVPN can provide canned configuration files), but you'll be inventing that format (cue xkcd).
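If you did invent such a format, what the OIDC IdP operator would write down for people might look something like this (a hypothetical sketch; every field name here is made up for illustration, although 'preferred_username' is a real OIDC claim):

issuer = https://idp.example.org
client_id = <assigned by the IdP operator>
client_secret = <assigned by the IdP operator>
scopes = openid email profile
username_claim = preferred_username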

If you have limited time and resources to develop your software and help people using it, it's much simpler to support only a few large, known OIDC identity providers. If things need specific setup on the OIDC IdP side, you can feasibly provide that in your documentation (since there's only a few variations), and you can pre-set everything in your program, complete with knowledge about things like OIDC scopes and claims. It's also going to be fairly easy to test your code and procedures against these identity providers, while if you support custom OIDC IdPs you may need to figure out how to set up one (or several), how to configure it, and so on.

Getting older, now-replaced Fedora package updates

By: cks

Over the history of a given Fedora version, Fedora will often release multiple updates to the same package (for example, kernels, but there are many others). When it does this, the older packages wind up being removed from the updates repository and are no longer readily available through mechanisms like 'dnf list --showduplicates <package>'. For a long time I used dnf's 'local' plugin to maintain a local archive of all packages I'd updated, so I could easily revert, but it turns out that as of Fedora 41's change to dnf5 (dnf version 5), that plugin is not available (presumably it hasn't been ported to dnf5, and may never be). So I decided to look into my other options for retrieving and installing older versions of packages, in case the most recent version has a bug that affects me (which has happened).

Before I take everyone on a long yak-shaving expedition, the simplest and best answer is to install the 'fedora-repos-archive' package, which installs an additional Fedora repository that has those replaced updates. After installing it, I suggest that you edit /etc/yum.repos.d/fedora-updates-archive.repo to disable it by default, which will save you time, bandwidth, and possibly aggravation. Then when you really want to see all possible versions of, say, Rust, you can do:

dnf list --showduplicates --enablerepo=updates-archive rust

You can then use 'dnf downgrade ...' as appropriate.
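As for disabling it by default, that means editing the stanza in /etc/yum.repos.d/fedora-updates-archive.repo so that it has 'enabled=0'; the relevant bit looks roughly like this (trimmed down, and your file will have more settings):

[updates-archive]
# ... other settings as shipped ...
enabled=0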

(Like the other Fedora repositories, updates-archive automatically knows your release version and picks packages from it. I think you can change this a bit with '--releasever=<NN>', but I'm not sure how deep the archive is.)

The other approach is to use Fedora Bodhi (also) and Fedora Koji (also) to fetch the packages for older builds, in much the same way as you can use Bodhi (and Koji) to fetch new builds that aren't in the updates or updates-testing repository yet. To start with, we're going to need to find out what's available. I think this can be done through either Bodhi or Koji, although Koji is presumably more authoritative. Let's do this for Rust in Fedora 41:

bodhi updates query --packages rust --releases f41
koji list-builds --state COMPLETE --no-draft --package rust --pattern '*.fc41'

Note that both of these listings are going to include package versions that were never released as updates for various reasons, and also versions built for the pre-release Fedora 41. Although Koji has a 'f41-updates' tag, I haven't been able to find a way to restrict 'koji list-builds' output to packages with that tag, so we're getting more than we'd like even after we use a pattern to restrict this to just Fedora 41.

(I think you may need to use the source package name, not a binary package one; if so, you can get it by running 'rpm -qi rust' or whatever and looking at the 'Source RPM' line and name.)

Once you've found the package version you want, the easiest and fastest way to get it is through the koji command line client, following the directions in Installing Kernel from Koji with appropriate changes:

mkdir /tmp/scr
cd /tmp/scr
koji download-build --arch=x86_64 --arch=noarch rust-1.83.0-1.fc41

This will get you a bunch of RPMs, and then you can do 'dnf downgrade /tmp/scr/*.rpm' to have dnf do the right thing (only downgrading things you actually have installed).

One reason you might want to use Koji is that this gets you a local copy of the old package in case you want to go back and forth between it and the latest version for testing. If you use the dnf updates-archive approach, you'll be re-downloading the old version at every cycle. Of course at that point you can also use Koji to get a local copy of the latest update too, or 'dnf download ...', although Koji has the advantage that it gets all the related packages regardless of their names (so for Rust you get the 'cargo', 'clippy', and 'rustfmt' packages too).

(In theory you can work through the Fedora Bodhi website, but in practice it seems to be extremely overloaded at the moment and very slow. I suspect that the bot scraper plague is one contributing factor.)

PS: If you're using updates-archive and you just want to download the old packages, I think what you want is 'dnf download --enablerepo=updates-archive ...'.

Fedora 41 seems to have dropped an old XFT font 'property'

By: cks

Today I upgraded my office desktop from Fedora 40 to Fedora 41, and as is traditional, there was a little issue:

Current status: it has been '0' days since a Fedora upgrade caused X font problems, this time because xft apparently no longer accepts 'encoding=...' as a font specification argument/option.

One of the small issues with XFT fonts is that they don't really have canonical names. As covered in the "Font Name" section of fonts.conf, a given XFT font is a composite of a family, a size, and a number of attributes that may be used to narrow down the selection of the XFT font until there's only one option left (or no option left). One way to write that in textual form is, for example, 'Sans:Condensed Bold:size=13'.

For a long time, one of the 'name=value' properties that XFT font matching accepted was 'encoding=<something>'. For example, you might say 'encoding=iso10646-1' to specify 'Unicode' (and back in the long ago days, this apparently could make a difference for font rendering). Although I can't find 'encoding=' documented in historical fonts.conf stuff, I appear to have used it for more than a decade, dating back to when I first converted my fvwm configuration from XLFD fonts to XFT fonts. It's still accepted today on Fedora 40 (although I suspect it does nothing):

: f40 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
DejaVuSans.ttf: "DejaVu Sans" "Regular"

However, it's no longer accepted on Fedora 41:

: f41 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
Unable to parse the pattern

Initially I thought this had to be a change in fontconfig, but that doesn't seem to be the case; both Fedora 40 and Fedora 41 use the same version, '2.15.0', just with different build numbers (partly because of a mass rebuild for Fedora 41). Freetype itself went from version 2.13.2 to 2.13.3, but the release notes don't seem to have anything relevant. So I'm at a loss. At least it was easy to fix once I knew what had happened; I just had to take the ':encoding=iso10646-1' bit out from the places I had it.

(The visual manifestation was that all of my fvwm menus and window title bars switched to a tiny font. For historical reasons all of my XFT font specifications in my fvwm configuration file used 'encoding=...', so in Fedora 41 none of them worked and fvwm reported 'can't load font <whatever>' and fell back to its default of an XLFD font, which was tiny on my HiDPI display.)

PS: I suspect that this change will be coming in other Linux distributions sooner or later. Unsurprisingly, Ubuntu 24.04's fc-match still accepts 'encoding=...'.

PPS: Based on ltrace output, FcNameParse() appears to be what fails on Fedora 41.
