Chris's Wiki :: blog

Our mixed assortment of DNS server software (as of December 2025)

By: cks
7 December 2025 at 04:12

Without deliberately planning it, we've wound up running an assortment of DNS server software on an assortment of DNS servers. A lot of this involves history, so I might as well tell the story of that history in the process. This starts with our three sets of DNS servers: our internal DNS master (with a duplicate) that holds both the internal and external views of our zones, our resolving DNS servers (which use our internal zones), and our public authoritative DNS server (carrying our external zones, along with various relics of the past). These days we also have an additional resolving DNS server that resolves from outside our networks and so gives the people who can use it an external view of our zones.

In the beginning we ran Bind on everything, as was the custom in those days (and I suspect we started out without a separation between the three types of DNS servers, but that predates my time here), and I believe all of the DNS servers ran Solaris. Eventually we moved the resolving DNS servers and the public authoritative DNS server to OpenBSD (and the internal DNS master to Ubuntu), still using Bind. Then OpenBSD switched which nameservers they liked from Bind to Unbound and NSD, so we went along with that. Our authoritative DNS server had a relatively easy NSD configuration, but our resolving DNS servers presented some challenges and we wound up with a complex Unbound plus NSD setup. Recently we switched our internal resolvers to using Bind on Ubuntu, and then we switched our public authoritative DNS server from OpenBSD to Ubuntu but kept it running NSD, since we already had a working NSD configuration for it.

This has wound up with us running the following setups:

  • Our internal DNS masters run Bind in a somewhat complex split horizon configuration.

  • Our internal DNS resolvers run Bind in a simpler configuration where they act as internal authoritative secondary DNS servers for our own zones and as general resolvers.

  • Our public authoritative DNS server (and its hot spare) run NSD as an authoritative secondary, doing zone transfers from our internal DNS masters.

  • We have an external DNS resolver machine that runs Unbound in an extremely simple configuration. We opted to build this machine with Unbound because we didn't need it to act as anything other than a pure resolver, and Unbound is simple to set up for that.

At one level, this is splitting our knowledge and resources among three different pieces of DNS server software rather than focusing on one. At another level, two of the three are being used in quite simple setups (and we already had the NSD setup written from prior use). Our only complex configurations are all Bind based, and we've explicitly picked Bind for complex setups because we feel we understand it fairly well from long experience with it.

(Specifically, I can configure a simple Unbound resolver faster and easier than I can do the same with Bind. I'm sure there's a simple resolver-only Bind configuration, it's just that I've never built one and I have built several simple and not so simple Unbound setups.)
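As a concrete illustration of why, a resolver-only Unbound configuration doesn't need much more than the following sketch (with a placeholder netblock standing in for real client networks):

# unbound.conf sketch: a pure recursive resolver for one set of clients.
# 192.0.2.0/24 is a placeholder; substitute your own client networks.
server:
    interface: 0.0.0.0
    access-control: 127.0.0.0/8 allow
    access-control: 192.0.2.0/24 allow

Most everything else can be left at Unbound's defaults for this sort of role.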

Getting out of being people's secondary authoritative DNS server is hard

By: cks
6 December 2025 at 03:28

Many, many years ago, my department operated one of the university's secondary authoritative DNS servers, which was used by most everyone with a university subdomain and as a result was listed in their DNS NS records. This DNS server was also the authoritative DNS server for our own domains, because this was in the era when servers were expensive and it made perfect sense to do this. At the time, departments who wanted a subdomain pretty much needed to have a Unix system administrator and probably run their own primary DNS server and so on. Over time, the university's DNS infrastructure shifted drastically, with central IT offering more and more support, and more than half a decade ago our authoritative DNS server stopped being a university secondary, after a lot of notice to everyone.

Experienced system administrators can guess what happened next. Or rather, what didn't happen next. References to our DNS server lingered in various places for years, both in the university's root zones as DNS glue records and in people's own DNS zone files as theoretically authoritative records. As late as the middle of last year, when I started grinding away on this, I believe that roughly half of our authoritative DNS server's traffic was for old zones we didn't serve and was getting DNS 'Refused' responses. The situation is much better today, after several rounds of finding other people's zones that were still pointing to us, but it's still not quite over and it took a bunch of tedious work to get this far.

(Why I care about this is that it's hard to see if your authoritative DNS server is correctly answering everything it should if things like tcpdumps of DNS traffic are absolutely flooded with bad traffic that your DNS server is (correctly) rejecting.)

In theory, what we should have done when we stopped being a university secondary authoritative DNS server was to switch the authoritative DNS server for our own domains to another name and another IP address; this would have completely cut off everyone else when we turned the old server off and removed its name from our DNS. In practice the transition was not clearcut, because for a while we kept on being a secondary for some other university zones that have long-standing associations with the department. Also, I think we were optimistic about how responsive people would be (and how many of them we could reach).

(Also, there's a great deal of history tied up in the specific name and IP address of our current authoritative DNS server. It's been there for a very long time.)

PS: Even when no one is incorrectly pointing to us, there's clearly a background Internet radiation of external machines throwing random DNS queries at us. But that's another entry.

In Linux, filesystems can and do have things with inode number zero

By: cks
5 December 2025 at 04:19

A while back I wrote about how in POSIX you could theoretically use inode (number) zero. Not all Unixes consider inode zero to be valid; prominently, OpenBSD's getdents(2) doesn't return valid entries with an inode number of 0, and by extension, OpenBSD's filesystems won't have anything that uses inode zero. However, Linux is a different beast.

Recently, I saw a Go commit message with the interesting description of:

os: allow direntries to have zero inodes on Linux

Some Linux filesystems have been known to return valid entries with zero inodes. This new behavior also puts Go in agreement with recent glibc.

This fixes issue #76428, and the issue has a simple reproduction to create something with inode numbers of zero. According to the bug report:

[...] On a Linux system with libfuse 3.17.1 or later, you can do this easily with GVFS:

# Create many dir entries
(cd big && printf '%04x ' {0..1023} | xargs mkdir -p)
gio mount sftp://localhost/$PWD/big

The resulting filesystem mount is in /run/user/$UID/gvfs (see the issue for the exact long path) and can be experimentally verified to have entries with inode numbers of zero (well, as reported by reading the directory). On systems using glibc 2.37 and later, you can look at this directory with 'ls' and see the zero inode numbers.
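If you want to look at the raw directory entry inode numbers yourself, here's a small Python sketch (it goes through your C library's readdir(), so like 'ls' it needs a new enough glibc to actually show the zero entries):

import os
import sys

# Print the inode number recorded in each directory entry, as reported
# by the directory scan itself (no separate stat() of each entry).
for entry in os.scandir(sys.argv[1]):
    print(entry.inode(), entry.name)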

(Interested parties can try their favorite non-C or non-glibc bindings to see if those environments correctly handle this case.)

That this requires glibc 2.37 is due to this glibc bug, first opened in 2010 (but rejected at the time for reasons you can read in the glibc bug) and then resurfaced in 2016 and eventually fixed in 2022 (and then again in 2024 for the thread safe version of readdir). The 2016 glibc issue has a bit of a discussion about the kernel side. As covered in the Go issue, libfuse returning a zero inode number may be a bug itself, but there are (many) versions of libfuse out in the wild that actually do this today.

Of course, libfuse (and gvfs) may not be the only Linux filesystems and filesystem environments that can create this effect. I believe there are alternate language bindings and APIs for the kernel FUSE (also, also) support, so they might have the same bug as libfuse does.

(Both Go and Rust have at least one native binding to the kernel FUSE driver. I haven't looked at either to see what they do about inode numbers.)

PS: My understanding of the Linux (kernel) situation is that if you have something inside the kernel that needs an inode number and you ask the kernel to give you one (through get_next_ino(), an internal function for this), the kernel will carefully avoid giving you inode number 0. A lot of things get inode numbers this way, so this makes life easier for everyone. However, a filesystem can decide on inode numbers itself, and when it does it can use inode number 0 (either explicitly or by zeroing out the d_ino field in the getdents(2) dirent structs that it returns, which I believe is what's happening in the libfuse situation).

Some things on X11's obscure DirectColor visual type

By: cks
4 December 2025 at 03:21

The X Window System has a long standing concept called 'visuals'; to simplify, an X visual determines how the values of your pixels are turned into colors. As I wrote about a number of years ago, these days X11 mostly uses 'TrueColor' visuals, which directly supply 8-bit values for red, green, and blue ('24-bit color'). However, X11 has a number of visual types, such as the straightforward PseudoColor indirect colormap (where every pixel value is an index into an RGB colormap; typically you'd get 8-bit pixels and 24-bit colormaps, so you could have 256 colors out of a full 24-bit gamut). One of the (now) obscure visual types is DirectColor. To quote:

For DirectColor, a pixel value is decomposed into separate RGB subfields, and each subfield separately indexes the colormap for the corresponding value. The RGB values can be changed dynamically.

(This is specific to X11; X10 had a different display color model.)

In a PseudoColor visual, each pixel's value is taken as a whole and used as an index into a colormap that gives the RGB values for that entry. In DirectColor, the pixel value is split apart into three values, one each for red, green, and blue, and each value indexes a separate colormap for that color component. Compared to a PseudoColor visual of the same pixel depth (size, eg each pixel is an 8-bit byte), you get less possible variety within a single color component and (I believe) no more colors in total.

When this came up in my old entry about TrueColor and PseudoColor visuals, in a comment Aristotle Pagaltzis speculated:

[...] maybe it can be implemented as three LUTs in front of a DAC’s inputs or something where the performance impact is minimal? (I’m not a hardware person.) [...]

I was recently reminded of this old entry and when I reread that comment, an obvious realization struck me about why DirectColor might make hardware sense. Back in the days of analog video, essentially every serious sort of video connection between your computer and your display carried the red, green, and blue components separately; you can see this in the VGA connector pinouts, and on old Unix workstations these might literally be separate wires connected to separate BNC connectors on your CRT display.

If you're sending the red, green, and blue signals separately you might also be generating them separately, with one DAC per color channel. If you have separate DACs, it might be easier to feed them from separate LUTs and separate pixel data, especially back in the days when much of a Unix workstation's graphics system was implemented in relatively basic, non-custom chips and components. You can split off the bits from the raw pixel value with basic hardware and then route each color channel to its own LUT, DAC, and associated circuits (although presumably you need to drive them with a common clock).

The other way to look at DirectColor is that it's a more flexible version of TrueColor. A TrueColor visual is effectively a 24-bit DirectColor visual where the color mappings for red, green, and blue are fixed rather than variable (this is in fact how it's described in the X documentation). Making these mappings variable costs you only a tiny bit of extra memory (you need 256 bytes for each color) and might require only a bit of extra hardware in the color generation process, and it enables the program using the display to change colors on the fly with small writes to the colormap rather than large writes to the framebuffer (which, back in the days, were not necessarily very fast). For instance, if you're looking at a full screen image and you want to brighten it, you could simply shift the color values in the colormaps to raise the low values, rather than recompute and redraw all the pixels.

(Apparently DirectColor was often used with 24-bit pixels, split into one byte for each color, which is the same pixel layout as a 24-bit TrueColor visual; see eg this section of the Starlink Project's Graphics Cookbook. Also, this seems to be how the A/UX X server worked. If you were going to do 8-bit pixels I suspect people preferred PseudoColor to DirectColor.)

These days this is mostly irrelevant and the basic simplicity of the TrueColor visual has won out. Well, what won out is PC graphics systems that followed the same basic approach of fixed 24-bit RGB color, and then X went along with it on PC hardware, which became more or less the only hardware.

(There probably was hardware with native DirectColor support. While X on PC Unixes will probably still claim to support DirectColor visuals, as reported in things like xdpyinfo, I suspect that it involves software emulation. Although these days you could probably implement DirectColor with GPU shaders at basically no cost.)

Sending DMARC reports is somewhat hazardous

By: cks
3 December 2025 at 03:10

DMARC has a feature where you can request that other mail systems send you aggregate reports about the DMARC results that they observed for email claiming to be from you. If you're a large institution with a sprawling, complex, multi-party mail environment and you're considering trying to make your DMARC policy stricter, it's very useful to get as many DMARC reports from as many people as possible. In particular, 'you' (in a broad sense) probably want to get as much information from mail systems run by sub-units as possible, and if you're a sub-unit, you want to report DMARC information up to the organization so they have as much visibility into what's going on as possible.
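Mechanically, the request is just a tag in a domain's DMARC DNS record; a domain that wants aggregate reports publishes something along these lines (a made-up record for a placeholder domain):

; The rua= tag is what asks other mail systems to email aggregate
; reports about your domain to that address.
_dmarc.example.org.  IN  TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.org"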

In related news, I've been looking into making our mail system send out DMARC reports, and I had what was in retrospect a predictable learning experience:

Today's discovery: if you want to helpfully send out DMARC reports to people who ask for them and you operate even a moderate sized email system, you're going to need to use a dedicated sending server and you probably don't want to. Because a) you'll be sending a lot of email messages and b) a lot of them will bounce because people's DMARC records are inaccurate and c) a decent number of them will camp out in your mail queue because see b, they're trying to go to non-responsive hosts.

Really, all of this DMARC reporting nonsense was predictable from first (Internet) principles, but I didn't think about it and was just optimistic when I turned our reporting on for local reasons. Of course people are going to screw up their DMARC reporting information (or, for spammers, just make it up); they screw everything up, and DMARC data will be no exception.

(Or they take systems and email addresses out of service without updating their DMARC records.)

If you operate even a somewhat modest email system that gets a wide variety of email, as we do, it doesn't take very long to receive email from hundreds of From: domains that have DMARC records in DNS that request reports. When you generate your DMARC reports (whether once a day or more often), you'll send out hundreds of email messages to those report addresses. If you send them through your regular outgoing email system, you'll have a sudden influx of a lot of messages and you may trigger any anti-flood ratelimits you have. Once your reporting system has dumped those hundreds of reports into your mail system, your mail system has to work through them; some of them will be delivered promptly, some of them will bounce (either directly or inside the remote mail system you hand them off to), and some of them will be theoretically destined for (currently) non-responsive hosts and thus will clog up your mail queue with repeated delivery attempts. If you're sending these reports through a general purpose mail system, your mail queue probably has a long timeout for stalled email, which is not really what you want in this case; your DMARC reports are more like 'best effort one time delivery attempt and then throw the message away' email. If this report doesn't get through and the issue is transient, you'll keep getting email with that From: domain and eventually one of your reports will go through. DMARC reports are definitely not 'gotta deliver them all' email.

So in my view, you're almost certainly going to have to be selective about what domains you send DMARC reports for. If you're considering this and you can, it may help to trawl your logs to see what domains are failing DMARC checks and pick out the ones you care about (such as, say, your organization's overall domain or domains). It's somewhat useful to report even successful DMARC results (where the email passes DMARC checks), but if you're considering acting on DMARC results, it's important to get false negatives fixed. If you want to send DMARC reports to everyone, you'll want to set up a custom mail system, perhaps on your local DMARC machine, which blasts everything out, efficiently handles potentially large queues and fast submission rates, and discards queued messages quickly (and obviously doesn't send you any bounces).

(Sending through a completely separate mail system also avoids the possibility that someone will decide to put your regular system on a blocklist because of your high rate of DMARC report email.)

PS: Some of those hundreds of From: domains with DMARC records that request reports will be spammer domains; I assume that putting a 'rua=' into your DMARC record makes it look more legitimate to (some) receiving systems. Spammers sending from their own domains can DKIM sign their messages, but having working reporting addresses requires extra work and extra exposure. And of course spammers often rotate through domains rapidly.

Password fields should usually have an option to show the text

By: cks
2 December 2025 at 03:46

I recently had to abruptly replace my smartphone, and because of how it happened I couldn't directly transfer data from the old phone to the new one; instead, I had to have the new phone restore itself from a cloud backup of the old phone (made on an OS version several years older than the new phone's OS). In the process, a number of passwords and other secrets fell off and I had to re-enter them. As I mentioned on the Fediverse, this didn't always go well:

I did get our work L2TP VPN to work with my new phone. Apparently the problem was a typo in one bit of one password secret, which is hard to see because of course there's no 'show the whole thing' option and you have to enter things character by character on a virtual phone keyboard I find slow and error-prone.

(Phone natives are probably laughing at my typing.)

(Some of the issue was that these passwords were generally not good ones for software keyboards.)

There are reasonable security reasons not to show passwords when you're entering them. In the old days, the traditional reason was shoulder surfing; today, we have to worry about various things that might capture the screen with a password visible. But at the same time, entering passwords and other secrets blindly is error prone, and especially these days the diagnostics of a failed password may be obscure and you might only get so many tries before bad things start happening.

(The smartphone approach of temporarily showing the last character you entered is a help but not a complete cure, especially if you're going back and forth three ways between the form field, the on-screen keyboard, and your saved or looked up copy of the password or secret.)

Partly as a result of my recent experiences, I've definitely come around to viewing those 'reveal the plain text of the password' options that some applications have as a good thing. I think a lot of applications should at least consider whether and how to do this, and how to make password entry less error prone in general. This especially applies if your application (and overall environment) doesn't allow pasting into the field (either from a memorized passwords system or by the person involved simply copying and pasting it from elsewhere, such as support site instructions).

In some cases, you might want to not even treat a 'password' field as a password (with hidden text) by default. Often things like wireless network 'passwords' or L2TP pre-shared keys are broadly known and perhaps don't need to be carefully guarded during input the way genuine account passwords do. If possible I'd still offer an option to hide the input text in whatever way is usual on your platform, but you could reasonably start the field out as not hidden.

Unfortunately, as of December 2025 I think there's no general way to do this in HTML forms in pure CSS, without JavaScript (there may be some browser-specific CSS attributes). I believe support for this is on the CSS roadmap somewhere, but that probably means at least several years before it starts being common.
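In the meantime, the usual approach is a small bit of JavaScript that flips the field's type back and forth; a minimal sketch (with made-up field names) looks like this:

<!-- A checkbox that toggles the field between hidden and visible text. -->
<input type="password" id="secret" name="secret">
<label>
  <input type="checkbox"
         onclick="document.getElementById('secret').type = this.checked ? 'text' : 'password'">
  Show the text
</label>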

(The good news is that a pure CSS system will presumably degrade harmlessly if the CSS isn't supported; the password will just stay hidden, which is no worse than today's situation with a basic form.)

Go still supports building non-module programs with GOPATH

By: cks
1 December 2025 at 02:52

When Go 1.18 was released, I said that it made module mode mandatory, which I wasn't a fan of because it can break backward compatibility in practice (and switching a program to Go modules can be non-trivial). Recently on the Fediverse, @thepudds very helpfully taught me that I wasn't entirely correct and Go still sort of supports non-module GOPATH usage, and in fact according to issue 60915, the current support is going to be preserved indefinitely.

Specifically, what's preserved today (and into the future) is support for using 'go build' and 'go install' in non-module mode (with 'GO111MODULE=off'). This inherits all of the behavior of Go 1.17 and earlier, including the use of things in the program's /vendor/ area (which can be important if you made local hacks). This allows you to rebuild and modify programs that you already have a complete GOPATH environment for (with all of their direct and indirect dependencies fetched). Since Go 1.22 and later don't support the non-module version of 'go get', assembling such an environment from scratch is up to you (if, for example, you need to modify an old non-module program). If you have a saved version of a suitable earlier version of Go, using that is probably the easiest way.

(Initially I thought Go 1.17 was the latest version you could use for this, but that was wrong; you can use anything up through Go 1.21. Go 1.17 is merely the latest version where you can do this without explicitly setting 'GO111MODULE=off'.)
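As a concrete sketch of what this looks like in practice (with a made-up import path and GOPATH location), rebuilding such a program with a current Go toolchain goes roughly like this:

# Assumes $HOME/gopath already holds the program plus all of its
# dependencies, fetched or vendored back when non-module 'go get' worked.
export GOPATH=$HOME/gopath
export GO111MODULE=off

cd $GOPATH/src/example.org/ourprog
go build                  # or: go install example.org/ourprog
go version -m ./ourprog   # see what toolchain and settings the binary was built with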

Of course you could just build your old non-module programs with your saved copy of Go 1.21 (if it still runs in your current OS and hardware environment), but rebuilding things with a modern version of Go has various advantages and may be required to support modern architectures and operating system versions that you're targeting. The latest versions of Go have compiler and runtime improvements and optimizations, standard library improvements, support for various more modern things in TLS and so on, and a certain amount of security fixes; you'll also get better support for using 'go version -m' on your built binaries (which is useful for tracking things later).

Learning this is probably going to get me to change how I handle some of our old programs. Even if I don't update their code, rebuilding them periodically on the latest Go version to update their binaries is probably a good thing, especially if they deal with cryptography (including SSH) or HTTP things.

(In retrospect this was implied by what the Go 1.18 release notes said. In fact even at the time I didn't read enough of the release notes; in forced 'Go modules off' mode, the Go 1.18 'go get' will still get things for you. That ability was removed later, in Go 1.22. Right up through Go 1.21, 'GO111MODULE=off go get [-u]' will do the traditional dependency fetching and so on for you.)

Discovering that my smartphone had infiltrated my life

By: cks
30 November 2025 at 02:45

While I have a smartphone, I think of myself as not particularly using it all that much. I got a smartphone quite late, it spends a lot of its life merely sitting there (not even necessarily in the same room as me, especially at home), and while I installed various apps (such as a SSH client) I rarely use them; they're mostly for weird emergencies. Then I suddenly couldn't use my current smartphone any more and all sorts of things came out of the woodwork, both things I sort of knew about but hadn't realized how much they'd affect me and things that I didn't even think about until I had a dead phone.

The really obvious and somewhat nerve-wracking thing I expected from the start is that plenty of things want to send you text messages (both for SMS authentication codes and to tell you what steps to take to, for example, get your new replacement smartphone). With no operating smartphone I couldn't receive them. I found myself on tenterhooks all through the replacement process, hoping very much that my bank wouldn't decide it needed to authenticate my credit card usage through either its smartphone app or a text message (and I was lucky that I could authenticate some things through another device). Had I been without a smartphone for a more extended time, I could see a number of things where I'd probably have had to make in-person visits to a bank branch.

(Another obvious thing I knew about is that my bike computer wants to talk to a smartphone app (also). At a different time of year this would have been a real issue, but fortunately my bike club's recreational riding season is over so all it did was delay me uploading one commute ride.)

In less obvious things, I use my smartphone as my alarm clock. With my smartphone unavailable I discovered that I had no good alternative (although I had some not so good ones that are too quiet). I've also become used to using my phone for a quick check of the weather on the way out the door, and to check the arrival time of TTC buses, neither of which were available. Nor could I check email (or text messages) on the way to pick up my new phone because with no smartphone I had no data coverage. I was lucky enough to have another wifi-enabled device available that I took with me, which turned out to be critical for the pickup process.

(It also felt weird and wrong to walk out of the door without the weight of my phone in my pocket, as if I was forgetting my keys or something equally important. And there were times on the trip to get the replacement phone when I found myself realizing that if I'd had an operating smartphone, I'd have taken it out for a quick look at this or that or whatever.)

On the level of mere inconveniences, over time I've gotten pulled into using my smartphone's payment setup for things like grocery purchases. I could still do that in several other ways even without a smartphone, but none of them would have been as nice an experience. There would also have been paper cuts in things like checking the balance on my public transit fare card and topping it up.

Having gone through this experience with my smartphone, I'm now wondering what other bits of technology have quietly infiltrated both my personal life and things at work without me noticing their actual importance. I suspect that there are some more and I'll only realize it when they break.

PS: The smartphone I had to replace is the same one I got back in late 2016, so I got a bit over nine years of usage out of it. This is pretty good by smartphone standards (although for the past few years I was carefully ignoring that it had questionable support for security bugs; there were some updates, but also some known issues that weren't being fixed).

Do you care about (all) HTTP requests from cloud provider IP address space?

By: cks
29 November 2025 at 04:21

About a month ago Mike Hoye wrote Raised Shields, in which Hoye said, about defending small websites from crawler abuse in this day and age:

If you only care about humans I strongly advise you to block every cloudhost subnet you can find, pretty easy given the effort they put into finding you. Most of the worst actors out there are living comfortably on Azure, GCP, Yandex and sometimes Huawei’s servers.

(As usual, there's no point in complaining about abusive crawlers to the cloud providers.)

I've said something similar on the Fediverse:

Today's idle thought: how many small web servers actually have any reason to accept requests from AWS or Google Cloud IP address space? If you search through your logs with (eg) grepcidr, you may find that there's little or nothing of value coming from there, and they sure are popular with LLM crawlers these days.

You definitely want to search your logs before doing this, and you may find that you want to make some exceptions even if you do opt for it. For example, you might want or need to let cloud-hosted things fetch your syndication feeds, because there are a fair number of people and feed readers that do their fetching from the cloud. Possibly you'll find that you have a significant number of real visitors that are using do it yourself personal VPN setups that have cloud exit points.
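If you want to put rough numbers on this, grepcidr makes the log search straightforward. A sketch, assuming an Apache combined format access log and a placeholder netblock standing in for a real provider range:

# How many requests came from this network block, and what were they after?
grepcidr 192.0.2.0/24 access.log | wc -l
grepcidr 192.0.2.0/24 access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head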

(How many exceptions you want to make may depend on how much of a hard line you want to take. I suspect that Mike Hoye's line is much harder than mine.)

However, I think that for a lot of small, personal web servers and web sites you'll find that almost nothing of genuine value comes from the big cloud provider networks, from AWS, Google Cloud, Azure, Oracle, and so on. You're probably not getting real visitors from these clouds, people who are interested in reading your work and engaging with it. Instead you'll most likely see an ever-growing horde of obvious crawlers, increasingly suspicious user agents, claims to be things that they aren't, and so on.

On the one hand, it's in some sense morally pure to not block these cloud areas unless they're causing your site active harm; it's certainly what the ethos was on the older Internet, and it was a good and useful ethos for those times. On the other hand, that view is part of what got us here. More and more, these days are the days of Raised Shields, as we react to the new environment (much as email had to react to the new environment of ever increasing spam).

If you're doing this, one useful trick you can play if you have the right web server environment is to do your blocking with HTTP 429 Too Many Requests responses. Using this HTTP code is in some sense inaccurate, but it has the useful effect that very few things will take it as a permanent error the way they may take, for example, HTTP 403 (or HTTP 404). This gives you a chance to monitor your web server logs and add a suitable exemption for traffic that you turn out to want after all, without your error responses doing anything permanent (like potentially removing your pages from search engine indexes). You can also arrange to serve up a custom error page for this case, with an explanation or a link to an explanation.
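What this looks like mechanically depends on your web server. As one hedged illustration (with a placeholder netblock and made-up names, and arranged so the explanation page itself isn't blocked), nginx can do it with a geo map plus a 429 return:

# Sketch for the http{} context; 192.0.2.0/24 stands in for whatever
# cloud provider ranges you've decided to block.
geo $from_cloud {
    default        0;
    192.0.2.0/24   1;
}

server {
    listen 80;
    server_name blog.example.org;
    root /var/www/blog;

    error_page 429 /why-blocked.html;

    # The explanation page itself stays reachable.
    location = /why-blocked.html {
    }

    location / {
        if ($from_cloud) {
            return 429;
        }
        # ... the rest of your normal configuration ...
    }
}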

(My view is that serving a 400-series HTTP error response is better than a HTTP 302 temporary redirect to your explanation, for various reasons. Possibly there are clever things you can do with error pages in general.)

We can't fund our way out of the free and open source maintenance problem

By: cks
28 November 2025 at 04:18

It's in the tech news a lot these days that there are 'problems' with free and open source maintenance. I put 'problems' in quotes because the issue is mostly that FOSS maintenance isn't happening as fast or as much as the people who've come to depend on it would like, and the people who maintain FOSS are increasingly saying 'no' when corporations turn up (cf, also). But even with all the corporate presence, there are still a reasonable number of people who use non-corporate FOSS operating systems like Debian Linux, FreeBSD, and so on, and they too suffer when parts of the FOSS software stack struggle with maintenance. Every so often, people will suggest that the problem would be solved if only corporations would properly fund this maintenance work. However, I don't believe this can actually work even in a world where corporations are willing to properly fund such things (in this world, they're very clearly not).

One big problem with 'funding' as a solution to the FOSS maintenance problems is that for many FOSS maintainers, there isn't enough work available to support them. Many FOSS people write and support only a small number of things that don't necessarily need much active development and bug fixing (people have done studies on this), and so can't feasibly provide full time employment (especially at something equivalent to a competitive salary). Certainly, there are plenty of large projects that are underfunded and could support one or more people working on them full time, but there's also a long tail of smaller, less obvious dependencies that are also important for various sorts of maintenance.

(In a way, the lack of funding pushes people toward small projects. With no funding, you have to do your projects in your spare time and the easiest way to make that work is to choose some small area or modest project that simply doesn't need that much time to develop or maintain.)

There are models where people who work on FOSS can be funded to do a bit of work on a lot of projects. But that's not the same as having funding to work full time on your own little project (or set of little projects). It's much more like regular work, in that you're being paid to do development work on other people's stuff (and I suspect that it will be much more time consuming than one might expect, since anyone doing this will have to come up to speed on a whole bunch of projects).

(I'm assuming the FOSS funding equivalent of a perfectly spherical frictionless object from physics examples, so we can wave away all other issues except that there is not enough work on individual projects. In the real world there are a huge host of additional problems with funding people for FOSS work that create significant extra friction (eg, potential liabilities).)

PS: Even though we can't solve the whole problem with funding, companies absolutely should be trying to use funding to solve as much of it as possible. That they manifestly aren't is one of many things that is probably going to bring everything down as pressure builds to do something.

(I'm sure I'm far from the first person to write about this issue with funding FOSS work. I just feel like writing it down myself, partly as elaboration on some parts of past Fediverse posts.)

Sidebar: It's full time work that matters

If someone is already working a regular full time job, their spare time is a limited resource and there are many claims on it. For various reasons, not everyone will take money to spend (potentially) most of their spare time maintaining their FOSS work. Many people will only be willing to spend a limited amount of their spare time on FOSS stuff, even if you could fund them at reasonable rates for all of their spare time. The only way to really get 'enough' time is to fund people to work full time, so their FOSS work replaces their regular full time job.

One of the reasons I suspect some people won't take money for their extra time is that they already have one job and they don't want to effectively get a second one. They do FOSS work deliberately because it's a break from 'job' style work.

(This points to another, bigger issue; there are plenty of people doing all sorts of hobbies, such as photography, who have no desire to 'go pro' in their hobby no matter how avid and good they are. I suspect there are people writing and maintaining important FOSS software who similarly have no desire to 'go pro' with their software maintenance.)

Duplicate metric labels and group_*() operations in Prometheus

By: cks
27 November 2025 at 02:44

Suppose that you have an internal master DNS server and a backup for that master server. The two servers are theoretically fed from the same data and so should have the same DNS zone contents, and especially they should have the same DNS zone SOAs for all zones in both of their internal and external views. They both run Bind and you use the Bind exporter, which provides the SOA serial values for every zone Bind is configured to be a primary or a secondary for. So you can write an alert with an expression like this:

bind_zone_serial{host="backup"}
  != on (view,zone_name)
    bind_zone_serial{host="primary"}

This is a perfectly good alert (well, alert rule), but it has lost all of the additional labels you might want in your alert. Especially, it has lost both host names. You could hard-code the host name in your message about the alert, but it would be nice to do better and propagate your standard labels into the alert. To do this you want to use one of group_left() and group_right(), but which one you want depends on where you want the labels to come from.

(Normally you have to choose between the two depending on which side has multiple matches, but in this case we have a one to one matching.)

For labels that are duplicated between both sides, the group_*() operators pick which side's labels you get, but backwards from their names. If you use group_right(), the duplicate label values come from the left; if you use group_left(), the duplicate label values come from the right. Here, we might change the backup host's name but we're probably not going to change the primary host's name, so we likely want to preserve the 'host' label from the left side and thus we use group_right():

bind_zone_serial{host="backup"}
  != on (view,zone_name)
    group_right (job,host,instance)
      bind_zone_serial{host="primary"}

One reason this little peculiarity is on my mind at the moment is that Cloudflare's excellent pint Prometheus rule linter recently picked up a new 'redundant label' lint rule that complains about this for custom labels such as 'host':

Query is trying to join the 'host' label that is already present on the other side of the query.

(It doesn't complain about job or instance, presumably because it understands why you might do this for those labels. As the pint message will tell you, to silence this you need to disable 'promql/impossible' for this rule.)

When I first saw pint's warning I didn't think about it and removed the 'host' label from the group_right(), but fortunately I actually tested what the result would be and saw that I was now getting the wrong host name.

(This is different from pulling in labels from other metrics, where the labels aren't duplicated.)

PS: I clearly knew this at some point, when I wrote the original alert rule, but then I forgot it by the time I was looking at pint's warning message. PromQL is the kind of complex thing where the details can fall out of my mind if I don't use it often enough, which I don't these days since our alert rules are relatively stable.

BSD PF versus Linux nftables for firewalls for us

By: cks
26 November 2025 at 03:48

One of the reactions I saw to our move from OpenBSD to FreeBSD for firewalls was to wonder why we weren't moving all the way to nftables based Linux firewalls. It's true that this would reduce the number of different Unixes we have to operate and probably get us more or less state of the art 10G network performance. However, I have some negative views on the choice of PF versus nftables, both in our specific situation and in general.

(I've written about this before but it was in the implicit context of Linux iptables.)

In our specific situation:

  • We have a lot of existing, relatively complex PF firewall rules; for example, our perimeter firewall has over 400 non-comment lines of rules, definitions, and so on. Translating these from OpenBSD PF to FreeBSD PF is easy, if it's necessary at all. Translating everything to nftables is a lot more work, and as far as I know there's no translation tool, especially not one that we could really trust. We'd probably have to basically rebuild each firewall from the ground up, which is both a lot of work and a high-stakes thing. We'd have to be extremely convinced that we had to do this in order to undertake it.

  • We have a lot of well developed tooling around operating, monitoring, and gathering metrics from PF-based firewalls, most of it locally created. Much or all of this tooling ports straight over from OpenBSD to FreeBSD, while we have no equivalent tooling for nftables and would have to develop (or find) equivalents.

  • We already know PF and almost all of that knowledge transfers over from OpenBSD PF to FreeBSD PF (and more will transfer with FreeBSD 15, which has some PF and PF syntax updates from modern OpenBSD).

In general (much of which also applies to our specific situation):

  • There are a number of important PF features that nftables at best has in incomplete, awkward versions. For example, nftables' version of pflog is awkward and half-baked compared to the real thing (also). While you may be able to put together some nftables based rough equivalent of BSD pfsync, casual reading suggests that it's a lot more involved and complex (and maybe less integrated with nftables).

  • The BSD PF firewall system is straightforward and easy to understand and predict. The Linux firewall system is much more complex and harder to understand, and this complexity bleeds through into nftables configuration, where you need to know chains and tables and so on. Much of this Linux complexity is not documented in ways that are particularly accessible. (There's a small sketch of this contrast after this list.)

  • Nftables documentation is opaque compared to the BSD pf.conf manual page (also). Partly this is because there is no 'nftables.conf' manual page; instead, your entry point is the nft manual page, which is both a command line tool and the documentation of the format of nftables rules. I find that these are two tastes that don't go well together.

    (This is somewhat forced by the nftables decision to retain compatibility with adding and removing rules on the fly. PF doesn't give you a choice, you load your entire ruleset from a file.)

  • nftables is already the third firewall rule format and system that the Linux kernel has had over the time that I've been writing Linux firewall rules (ipchains, iptables, nftables). I have no confidence that there won't be a fourth before too long. PF has been quite stable by comparison.
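To make the 'chains and tables' point above concrete, here's a deliberately tiny sketch of roughly the same policy written for both systems (the interface names and the policy itself are made up for illustration; neither is one of our real rulesets):

# pf.conf: default-deny inbound on the outside interface, allow SSH in.
# PF 'pass' rules are stateful by default, so return traffic handles itself.
ext_if = "em0"
block in on $ext_if all
pass in on $ext_if proto tcp to port 22

# nftables: you also supply the table, the base chain with its hook and
# priority, and the connection state handling yourself.
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        iif "lo" accept
        ct state established,related accept
        iifname "eth0" tcp dport 22 accept
    }
}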

What I mostly care about is what I have to write and read to get the IP filtering and firewall setup that we want (and then understand it later), not how it gets compiled down and represented in the kernel (this has come up before). Assuming that the nftables backend is capable enough and the result performs sufficiently well, I'd be reasonably happy with a PF like syntax (and semantics) on top of kernel nftables (although we'd still have things like the pflog and pfsync issues).

Can I get things done in nftables? Certainly, nftables is relatively inoffensive. Do I want to write nftables rules? No, not really, no more than I want to write iptables rules. I do write nftables and iptables rules when I need to do firewall and IP filtering things on a Linux machine, but for a dedicated machine for this purpose I'd rather use a PF-based environment (which is now FreeBSD).

As far as I can tell, the state of Linux IP filtering documentation is partly a result of the fact that Linux doesn't have a unified IP filtering system and environment the way that OpenBSD does and FreeBSD mostly does (or at least successfully appears to so far). When the IP filtering system is multiple more or less separate pieces and subsystems, you naturally tend to get documentation that looks at each piece in isolation and assumes you already know all of the rest.

(Let's also acknowledge that writing good documentation for a complex system is hard, and the Linux IP filtering system has evolved to be very complex.)

PS: There's no real comparison between PF and the older iptables system; PF is clearly far more high level than you can reasonably do in iptables, which by comparison is basically an IP filtering assembly language. I'm willing to tentatively assume that nftables can be used in a higher level way than iptables can (I haven't used it for enough to have a well informed view either way); if it can't, then there's again no real comparison between PF and nftables.

Making Polkit authenticate people like su does (with group wheel)

By: cks
25 November 2025 at 03:46

Polkit is how a lot of things on modern Linux systems decide whether or not to let people do privileged operations, including systemd's run0, which effectively functions as another su or sudo. Polkit normally has a significantly different authentication model than su or sudo, where an arbitrary login can authenticate for privileged operations by giving the password of any 'administrator' account (accounts in group wheel or group admin, depending on your Linux distribution).

Suppose, not hypothetically, that you want a su like model in Polkit, one where people in group 'wheel' can authenticate by providing the root password, while people not in group 'wheel' cannot authenticate for privileged operations at all. In my earlier entry on learning about Polkit and adjusting it I put forward an untested Polkit stanza to do this. Now I've tested it and I can provide an actual working version.

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // must exist but have a locked password
        return ["unix-user:nobody"];
    }
});

(This goes in /etc/polkit-1/rules.d/50-default.rules, and the filename is important because it has to replace the standard version in /usr/share/polkit-1/rules.d.)

This doesn't quite work the way 'su' does, where it will just refuse to work for people not in group wheel. Instead, if you're not in group wheel you'll be prompted for the password of 'nobody' (or whatever other locked login your rule returns), which you can never successfully supply because the password is locked.

As I've experimentally determined, it doesn't work to return an empty list ('[]'), or a Unix group that doesn't exist ('unix-group:nosuchgroup'), or a Unix group that exists but has no members. In all cases my Fedora 42 system falls back to asking for the root password, which I assume is a built-in default for privileged authentication. Instead you apparently have to return something that Polkit thinks it can plausibly use to authenticate the person, even if that authentication can't succeed. Hopefully Polkit will never get smart enough to work that out and stop accepting accounts with locked passwords.

(If you want to be friendly and you expect people on your servers to run into this a lot, you should probably create a login with a more useful name and GECOS field, perhaps 'not-allowed' and 'You cannot authenticate for this operation', that has a locked password. People may or may not realize what's going on, but at least they have a chance.)

PS: This is with the Fedora 42 version of Polkit, which is version 126. This appears to be the most recent version from the upstream project.

Sidebar: Disabling Polkit entirely

Initially I assumed that Polkit had explicit rules somewhere that authorized the 'root' user. However, as far as I can tell this isn't true; there's no normal rules that specifically authorize root or any other UID 0 login name, and despite that root can perform actions that are restricted to groups that root isn't in. I believe this means that you can explicitly disable all discretionary Polkit authorization with an '00-disable.rules' file that contains:

polkit.addRule(function(action, subject) {
    return polkit.Result.NO;
});

Based on experimentation, this disables absolutely everything, even actions that are considered generally harmless (like libvirt's 'virsh list', which I think normally anyone can do).

A slightly more friendly version can be had by creating a situation where there are no allowed administrative users. I think this would be done with a 50-default.rules file that contained:

polkit.addAdminRule(function(action, subject) {
    // must exist but have a locked password
    return ["unix-user:nobody"];
});

You'd also want to make sure that nobody is in any special groups that rules in /usr/share/polkit-1/rules.d use to allow automatic access. You can look for these by grep'ing for 'isInGroup'.

The (early) good and bad parts of Polkit for a system administrator

By: cks
24 November 2025 at 03:46

At a high level, Polkit is how a lot of things on modern Linux systems decide whether or not to let you do privileged operations. After looking into it a bit, I've wound up feeling that Polkit has both good and bad aspects from the perspective of a system administrator (especially a system administrator with multi-user Linux systems, where most of the people using them aren't supposed to have any special privileges). While I've used (desktop) Linuxes with Polkit for a while and relied on it for a certain amount of what I was doing, I've done so blindly, effectively as a normal person. This is the first I've looked at the details of Polkit, which is why I'm calling this my early reactions.

On the good side, Polkit is a single source of authorization decisions, much like PAM. On a modern Linux system, there are a steadily increasing number of programs that do privileged things, even on servers (such as systemd's run0). These could all have their own bespoke custom authorization systems, much as how sudo has its own custom one, but instead most of them have centralized on Polkit. In theory Polkit gives you a single thing to look at and a single thing to learn, rather than learning systemd's authentication system, NetworkManager's authentication system, etc. It also means that programs have less of a temptation to hard-code (some of) their authentication rules, because Polkit is very flexible.

(In many cases programs couldn't feasibly use PAM instead, because they want certain actions to be automatically authorized. For example, in its standard configuration libvirt wants everyone in group 'libvirt' to be able to issue libvirt VM management commands without constantly having to authenticate. PAM could probably be extended to do this but it would start to get complicated, partly because PAM configuration files aren't a programming language and so implementing logic in PAM gets awkward in a hurry.)

On the bad side, Polkit is a non-declarative authorization system, and a complex one with its rules not in any single place (instead they're distributed through multiple files in two different formats). Authorization decisions are normally made in (JavaScript) code, which means that they can encode essentially arbitrary logic (although there are standard forms of things). This means that the only way to know who is authorized to do a particular thing is to read its XML 'action' file and then look through all of the JavaScript code to find and then understand things that apply to it.

(Even 'who is authorized' is imprecise by default. Polkit normally allows anyone to authenticate as any administrative account, provided that they know its password and possibly other authentication information. This makes the passwords of people in group wheel or group admin very dangerous things, since anyone who can get their hands on one can probably execute any Polkit-protected action.)

This creates a situation where there's no way in Polkit to get a global overview of who is authorized to do what, or what a particular person has authorization for, since this doesn't exist in a declarative form and instead has to be determined on the fly by evaluating code. Instead you have to know what's customary, like the group that's 'administrative' for your Linux distribution (wheel or admin, typically) and what special groups (like 'libvirt') do what, or you have to read and understand all of the JavaScript and XML involved.

In other words, there's no feasible way to audit what Polkit is allowing people to do on your system. You have to trust that programs have made sensible decisions in their Polkit configuration (ones that you agree with), or run the risk of system malfunctions by turning everything off (or allowing only root to be authorized to do things).

(Not even Polkit itself can give you visibility into why a decision was made or fully predict it in advance, because the JavaScript rules have no pre-filtering to narrow down what they apply to. The only way to find out what a rule really does is to invoke it. Well, to invoke the function that addRule() or addAdminRule() added to the rule stack.)

This complexity (and the resulting opacity of authorization) is probably intrinsic in Polkit's goals. I even think they made the right decision by having you write logic in JavaScript rather than try to create their own language for it. However, I do wish Polkit had a declarative subset that could express all of the simple cases, reserving JavaScript rules only for complex ones. I think this would make the overall system much easier for system administrators to understand and analyze, so we had a much better idea (and much better control) over who was authorized for what.

Brief notes on learning and adjusting Polkit on modern Linuxes

By: cks
23 November 2025 at 04:07

Polkit (also, also) is a multi-faceted user level thing used to control access to privileged operations. It's probably used by various D-Bus services on your system, which you can more or less get a list of with pkaction, and there's a pkexec program that's like su and sudo. There are two reasons that you might care about Polkit on your system. First, there might be tools you want to use that use Polkit, such as systemd's run0 (which is developing some interesting options). The other is that Polkit gives people an alternate way to get access to root or other privileges on your servers and you may have opinions about that and what authentication should be required.

Unfortunately, Polkit configuration is arcane and as far as I know, there aren't really any readily accessible options for it. For instance, if you want to force people to authenticate for root-level things using the root password instead of their password, as far as I know you're going to have to write some JavaScript yourself to define a suitable Administrator identity rule. The polkit manual page seems to document what you can put in the code reasonably well, but I'm not sure how you test your new rules and some areas seem underdocumented (for example, it's not clear how 'addAdminRule()' can be used to say that the current user cannot authenticate as an administrative user at all).

(If and when I wind up needing to test rules, I will probably try to do it in a scratch virtual machine that I can blow up. Fortunately Polkit is never likely to be my only way to authenticate things.)

Polkit also has some paper cuts in its current setup. For example, as far as I can see there's no easy way to tell Polkit-using programs that you want to immediately authenticate for administrative access as yourself, rather than be offered a menu of people in group wheel (yourself included) and having to pick yourself. It's also not clear to me (and I lack a test system) if the default setup blocks people who aren't in group wheel (or group admin, depending on your Linux distribution flavour) from administrative authentication or if instead they too get to pick one of the administrative users and authenticate with that person's password. I suspect it's the latter.

(All of this makes Polkit seem like it's not really built for multi-user Linux systems, or at least multi-user systems where not everyone is an administrator.)

PS: Now that I've looked at it, I have some issues with Polkit from the perspective of a system administrator, but those are going to be for another entry.

Sidebar: Some options for Polkit (root) authentication

If you want everyone to authenticate as root for administrative actions, I think what you want is:

polkit.addAdminRule(function(action, subject) {
    return ["unix-user:0"];
});

If you want to restrict this to people in group wheel, I think you want something like:

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // might not work to say 'no'?
        return [];
    }
});

If you want people in group wheel to authenticate as themselves, not root, I think you return 'unix-user:' + subject.user instead of 'unix-user:0'. I don't know if people still get prompted by Polkit to pick a user if there's only one possible user.
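As an untested sketch of that last variant (with the same caveats as the rules above):

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        // authenticate as yourself, not as root
        return ["unix-user:" + subject.user];
    }
    // as above, this may not work to say 'no'
    return [];
});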

You can't (easily) ignore errors in Python

By: cks
22 November 2025 at 04:18

Yesterday I wrote about how there's always going to be a way to not write code for error handling. When I wrote that entry I deliberately didn't phrase it as 'ignoring errors', because in some languages it's either not possible to do that or at least very difficult, and one of them is Python.

As every Python programmer knows, errors raise exceptions in Python and you can catch those exceptions, either narrowly or (very) broadly (possibly by accident). If you don't handle an exception, it bubbles up and terminates your program (which is nice if that's what you want and does mean that errors can't be casually ignored). On the surface it seems like you can ignore errors by simply surrounding all of your code with a try:/except: block that catches everything. But if you do this, you're not ignoring errors in the same way as you do in a language where errors are return values. In a language where you can genuinely ignore errors, all of your code keeps on running when errors happen. But in Python, if you put a broad try block around your code, your code stops executing at the first exception that gets raised, rather than continuing on to the other code within the try block.

(If there's further code outside the try block, it will run but probably not work very well because there will likely be a lot that simply didn't happen inside the try block. Your code skipped right from the statement that raised the exception to the first statement outside the try block.)

To get the C or Go like experience that your program keeps running its code even after an exception, you need to effectively catch and ignore exceptions separately for each statement. You can write this out by hand, putting each statement in its own try: block, but you'll probably get tired of this very fast, the result will be hard to read, and it's very obviously not like regular Python. This is a sign that Python doesn't really let you ignore errors in any easy way. All Python lets you do easily is suppress messages about errors and potentially make them not terminate your program. The closer you want to get to actually ignoring all errors, the more work you'll have to do.
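As a minimal sketch of what that per-statement style looks like (with a stand-in process() function):

def process(data):
    # stand-in for the real work
    print(len(data))

try:
    fp = open("/some/file")
except Exception:
    fp = None
try:
    data = fp.read() if fp else ""
except Exception:
    data = ""
try:
    process(data)
except Exception:
    pass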

(There are probably clever things you can do with Python debugging hooks since I believe that Python debuggers can intercept exceptions, although I'm not sure if they can resume execution after unhandled ones. But this is not going to really be easy.)

There's always going to be a way to not code error handling

By: cks
21 November 2025 at 03:55

Over on the Fediverse, I said something:

My hot take on Rust .unwrap(): no matter what you do, people want convenient shortcut ways of not explicitly handling errors in programming languages. And then people will use them in what turn out to be inappropriate places, because people aren't always right and sometimes make mistakes.

Every popular programming language lets your code not handle errors in some way, taking an optimistic approach. If you're lucky, your program notices at runtime when there actually is an error.

The subtext for this is that Cloudflare had a global outage where one contributing factor was using Rust's .unwrap(), which will panic your program if an error actually happens.

Every popular programming language has something like this. In Python you can ignore the possibility of exceptions, in C and Go you can ignore or explicitly discard error returns, in Java you can catch and ignore all exceptions, and so on. What varies from language to language is what the consequences are. In Python and Rust, your program dies (with an uncaught exception or a panic, respectively). In Go, your program either sails on making an increasingly big mess or panics (for example, if another return value is nil when there's an error and you try to do something with it that requires a non-nil value).

(Some languages let you have it either way. The default state of the Bourne shell is to sail onward in the face of failures, but you can change that with 'set -e' (mostly) and even get good error reports sometimes.)

These features don't exist because language designers are idiots (especially since error handling isn't a solved problem). They ultimately exist because people want a way to not so much ignore errors as not write code to 'handle' them. These people don't expect errors, they think in practice errors will either be extremely infrequent or not happen, and they don't want to write code that will deal with them anyway (if they're forced to write code that does something, often their choice will be to end the program).

You could probably create a programming language that didn't allow you to do this (possibly Haskell and other monad-using functional languages are close to it). I suspect it would be unpopular. If it wasn't unpopular, I suspect people would write their own functions or whatever to ignore the possibility of errors (either with or without ending the program if an error actually happens). People want to not have to write error handling, and they'll make it happen one way or another.

(Then, as I mentioned, some of the time they'll turn out to be wrong about errors not happening.)

Automatically scrubbing ZFS pools periodically on FreeBSD

By: cks
20 November 2025 at 03:17

We've been moving from OpenBSD to FreeBSD for firewalls. One advantage of this is giving us a mirrored ZFS pool for the machine's filesystems; we have a lot of experience operating ZFS and it's a simple, reliable, and fully supported way of getting mirrored system disks on important machines. ZFS has checksums and you want to periodically 'scrub' your ZFS pools to verify all of your data (in all of its copies) through these checksums (ideally relatively frequently). All of this is part of basic ZFS knowledge, so I was a little bit surprised to discover that none of our FreeBSD machines had ever scrubbed their root pools, despite some of them having been running for months.

It turns out that while FreeBSD comes with a configuration option to do periodic ZFS scrubs, the option isn't enabled by default (as of FreeBSD 14.3). Instead you have to know to enable it, which admittedly isn't too hard to find once you start looking.

FreeBSD has a general periodic(8) system for triggering things on a daily, weekly, monthly, or other basis. As covered in the manual page, the default configuration for this is in /etc/defaults/periodic.conf and you can override things by creating or modifying /etc/periodic.conf. ZFS scrubs are a 'daily' periodic setting, and as of 14.3 the basic thing you want is an /etc/periodic.conf with:

# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"

FreeBSD will normally scrub each pool a certain number of days after its previous scrub (either a manual scrub or an automatic scrub through the periodic system). The default number of days is 35, which is a bit high for my tastes, so I suggest that you shorten it, making your periodic.conf stanza be:

# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="14"

There are other options you can set that are covered in /etc/defaults/periodic.conf.

(That the daily automatic scrubs happen some number of days after the pool was last scrubbed means that you can adjust their timing by doing a manual scrub. If you have a bunch of machines that you set up at the same time, you can get them to space out their scrubs by scrubbing one a day by hand, and so on.)

Looking at the other ZFS periodic options, I might also enable the daily ZFS status report, because I'm not certain if there's anything else that will alert you if or when ZFS starts reporting errors:

# Find out about ZFS errors?
daily_status_zfs_enable="YES"

You can also tell ZFS to TRIM your SSDs every day. As far as I can see there's no option to do the TRIM less often than once a day; I guess if you want that you have to create your own weekly or monthly periodic script (perhaps by copying the 801.trim-zfs daily script and modifying it appropriately). Or you can just do 'zpool trim ...' every so often by hand.
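As an illustration of the by-hand version (assuming the FreeBSD installer's default pool name of 'zroot'; substitute whatever 'zpool list' shows for you):

zpool trim zroot
zpool status -t zroot

The second command shows the TRIM status of each device, so you can see whether the TRIM is still running or when it last finished.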

We're (now) moving from OpenBSD to FreeBSD for firewalls

By: cks
19 November 2025 at 04:17

A bit over a year ago I wrote about why we'd become interested in FreeBSD; to summarize, FreeBSD appeared promising as a better, easier to manage host operating system for PF-based things. Since then we've done enough with FreeBSD to have decided that we actively prefer it to OpenBSD. It's been relatively straightforward to convert our firewall OpenBSD PF rulesets to FreeBSD PF and the resulting firewalls have clearly better performance on our 10G network than our older OpenBSD ones did (with less tuning).

(It's possible that the very latest OpenBSD has significantly improved bridging and routing firewall performance so that it no longer requires the fastest single-core CPU performance you can get to go decently. But pragmatically it's too late; FreeBSD had that performance earlier and we now have more confidence in FreeBSD's performance in the firewall role than OpenBSD's.)

There are some nice things about FreeBSD, like root on ZFS, and broadly I feel that it's more friendly than OpenBSD. But those are secondary to its firewall network performance (and PF compatibility); if its network performance was no better than OpenBSD (or worse), we wouldn't be interested. Since it is better, it's now displacing OpenBSD for our firewalls and our latest VPN servers. We've stopped building new OpenBSD machines, so as firewalls come up for replacement they get rebuilt as FreeBSD machines.

(We have a couple of non-firewall OpenBSD machines that will likely turn into Ubuntu machines when we replace them, although we can't be sure until it actually happens.)

Would we consider going back to OpenBSD? Maybe, but probably not. Now that we've migrated a significant number of firewalls, moving the remaining ones to FreeBSD is the easiest approach, even if new OpenBSD firewalls would equal their performance. And the FreeBSD 10G firewall performance we're getting is sufficiently good that it leaves OpenBSD relatively little ground to exceed it.

(There are some things about FreeBSD that we're not entirely enthused about. We're going to be doing more firewall upgrades than we used to with OpenBSD, for one.)

PS: As before, I don't think there's anything wrong with OpenBSD if it meets your needs. We used it happily for years until we started being less happy with its performance on 10G Ethernet. A lot of people don't have that issue.

A surprise with how '#!' handles its program argument in practice

By: cks
18 November 2025 at 03:54

Every so often I get to be surprised about some Unix thing. Today's surprise is the actual behavior of '#!' in practice on at least Linux, FreeBSD, and OpenBSD, which I learned about from a comment by Aristotle Pagaltzis on my entry on (not) using '#!/usr/bin/env'. I'll quote the starting part here:

In fact the shebang line doesn’t require absolute paths, you can use relative paths too. The path is simply resolved from your current directory, just as any other path would be – the kernel simply doesn’t do anything special for shebang line paths at all. [...]

I found this so surprising that I tested it on our Linux servers as well as a FreeBSD and an OpenBSD machine. On the Linux servers (and probably on the others too), the kernel really does accept the full collection of relative paths in '#!'. You can write '#!python3', '#!bin/python3', '#!../python3', '#!../../../usr/bin/python3', and so on, and provided that your current directory is in the right place in the filesystem, they all worked.

(On FreeBSD and OpenBSD I only tested the '#!python3' case.)
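A quick way to see this for yourself, assuming your system's python3 is /usr/bin/python3:

$ printf '#!python3\nprint("hello")\n' >/tmp/demo
$ chmod +x /tmp/demo
$ (cd /usr/bin && /tmp/demo)
hello
$ (cd /tmp && ./demo)     # fails, because there's no ./python3 in /tmp

The '#!python3' path is resolved relative to whatever your current directory is at the time, not relative to where the script lives.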

As far as I can tell, this behavior goes all the way back to 4.2 BSD (which isn't quite the origin point of '#!' support in the Unix kernel but is about as close as we can get). The execve() kernel implementation in sys/kern_exec.c finds the program from your '#!' line with a namei() call that uses the same arguments (apart from the name) as it did to find the initial executable, and that initial executable can definitely be a relative path.

Although this is probably the easiest way to implement '#!' inside the kernel, I'm a little bit surprised that it survived in Linux (in a completely independent implementation) and in OpenBSD (where the security people might have done a double-take at some point). But given Hyrum's Law, there are probably people out there who are depending on this behavior, so we're now stuck with it.

(In the kernel, you'd have to go at least a little bit out of your way to check that the new path starts with a '/' or use a kernel name lookup function that only resolves absolute paths. Using a general name lookup function that accepts both absolute and relative paths is the simplest approach.)

PS: I don't have access to Illumos based systems, other BSDs (NetBSD, etc), or macOS, but I'd be surprised if they had different behavior. People with access to less mainstream Unixes (including commercial ones like AIX) can give it a try to see if there are any Unixes that don't support relative paths in '#!'.

People are sending HTTP requests with X-Forwarded-For across the Internet

By: cks
17 November 2025 at 03:49

Over on the Fediverse, I shared a discovery that came from turning over some rocks here on Wandering Thoughts:

This is my face when some people out there on the Internet send out HTTP requests with X-Forwarded-For headers, and maybe even not maliciously or lying. Take a bow, ZScaler.

The HTTP X-Forwarded-For header is something that I normally expect to see only on something behind a reverse proxy, where the reverse proxy frontend is using it to tell the backend the real originating IP (which is otherwise not available when the HTTP requests are forwarded with HTTP). As a corollary of this usage, if you're operating a reverse proxy frontend you want to remove or rename any X-Forwarded-For headers that you receive from the HTTP client, because it may be trying to fool your backend about who it is. You can use another X- header name for this purpose if you want, but using X-Forwarded-For has the advantage that it's a de-facto standard and so random reverse proxy aware software is likely to have an option to look at X-Forwarded-For.

(See, for example, the security and privacy concerns section of the MDN page.)
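As an illustration of the frontend side, in a hypothetical nginx reverse proxy this overwriting is a one-liner that replaces whatever X-Forwarded-For the client sent:

location / {
    # hand the backend the address nginx actually saw, not what the
    # client claimed
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_pass http://127.0.0.1:8000;
}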

Wandering Thoughts doesn't run behind a reverse proxy, and so I assumed that I wouldn't see X-Forwarded-For headers if I looked for them. More exactly, I assumed that I could take the presence of an X-Forwarded-For header as an indication of a bad request. As I found out, this doesn't seem to be the case; one source of apparently legitimate traffic to Wandering Thoughts appears to attach what are probably legitimate X-Forwarded-For headers to requests going through it. I believe this particular place operates partly as a (forward) HTTP proxy; if they aren't making up the X-Forwarded-For IP addresses, they're willing to leak the origin IPs of people using them to third parties.

All of this makes me more curious than usual to know what HTTP headers and header values show up on requests to Wandering Thoughts. But not curious enough to stick in logging, because that would be quite verbose unless I could narrow things down to only some requests. Possibly I should stick in logging that can be quickly turned on and off, so I can dump header information only briefly.

(These days I've periodically wound up in a mood to hack on DWiki, the underlying engine behind Wandering Thoughts. It reminds me that I enjoy programming.)

We haven't seen ZFS checksum failures for a couple of years

By: cks
16 November 2025 at 04:04

Over on the Fediverse I mentioned something about our regular ZFS scrubs:

Another weekend, another set of ZFS scrubs of work's multiple terabytes of data sitting on a collection of consumer 4 TB SSDs (mirrored, we aren't crazy, and also we have backups). As usual there is not a checksum error to be seen. I think it's been years since any came up.

I accept that SSDs decay (we've had some die, of course) and random read errors happen, but our ZFS-based experience across both HDDs and SSDs has been that the rate is really low for us. Probably we're not big enough.

We regularly scrub our pools through automation, currently once every few weeks. Back in 2022 I wrote about us seeing only a few errors since we moved to SSDs in 2018, and then I had the impression that everything had been quiet since then. Hand-checking our records tells me that I'm slightly wrong about this and we had some errors on our fileservers in 2023, but none since then.

  • starting in January of 2023, one particular SSD began experiencing infrequent read and checksum errors that persisted (off and on) through early March of 2023, when we gave in and replaced it. This was a relatively new 4 TB SSD that had only been in service for a few months at the time.

  • In late March of 2023 we saw a checksum error on a disk that later in the year (in November) experienced some read errors, and then in late February of 2024 had read and write errors. We replaced the disk at that point.

I believe these two SSDs are the only ones that we've replaced since 2022, although I'm not certain and we've gone through a significant amount of SSD shuffling since then for reasons outside the scope of this entry. That shuffling means that I'm not going to try to give any number for what percentage of our fileserver SSDs have had problems.

In the first case, the checksum errors were effectively a lesser form of the read errors we saw at the same time, so it was obvious the SSD had problems. In the second case the checksum error may have been a very early warning sign of what later became an obvious slow SSD failure. Or it could be coincidence.

(It also could be that modern SSDs have so much internal error checking and correction that if there is some sort of data rot or mis-read it's most likely to be noticed inside the SSD and create a read failure at the protocol level (SAS, SATA, NVMe, etc).)

I definitely believe that disk read errors and slow disk failures happen from time to time, and if you have a large enough population of disks (SSDs or HDDs or both) you definitely need to worry about these problems. We get all sorts of benefits from ZFS checksums and ZFS scrubs, and the peace of mind about this is one of them. But it looks like we're not big enough to have run into this across our fileserver population.

(At the moment we have 114 4 TB SSDs in use across our production fileservers.)

OIDC, Identity Providers, and avoiding some obvious security exposures

By: cks
15 November 2025 at 04:40

OIDC (and OAuth2) has some frustrating elements that make it harder for programs to support arbitrary identity providers (as discussed in my entry on the problems facing MFA-enabled IMAP in early 2025). However, my view is that these elements exist for good reason, and the ultimate reason is that an OIDC-like environment is by default an obvious security exposure (or several of them). I'm not sure there's any easy way around the entire set of problems that push towards these elements or something quite like them.

Let's imagine a platonically ideal OIDC-like identity provider for clients to use, something that's probably much like the original vision of OpenID. In this version, people (with accounts) can authenticate to the identity provider from all over the Internet, and it will provide them with a signed identity token. The first problem is that we've just asked identity providers to set up an Internet-exposed account and password guessing system. Anyone can show up, try it out, and best of all if it works they don't just get current access to something, they get an identity token.

(Within a trusted network, such as an organization's intranet, this exposed authentication endpoint is less of a concern.)

The second problem is the identity token itself, because the IdP doesn't actually provide the identity token to the person; it provides the token to something that asked for it. One of the uses of that identity token is to present it to other things to demonstrate that you're acting on the person's behalf; for example, your IMAP client presents it to your IMAP server. If what the identity token is valid for is not restricted in some way, a malicious party could get you to 'sign up with your <X> ID' for their website, take the identity token it got from the IdP, and reuse it with your IMAP server.

To avoid these issues, an identity token must have a limited scope (and everything that uses identity tokens needs to check that a given token is actually for them). This implies that you can't just ask for an identity token in general; you have to ask for one for use with something specific. As a further safety measure the identity provider doesn't want to give such a scoped token to anything except the thing that's supposed to get it. You (an attacker) should not be able to tell the identity provider 'please create a token for webserver X, and give it to me, not webserver X' (this is part of the restrictions on OIDC redirect URIs).

In OIDC, what deals with many of these risks is client IDs, optionally client secrets, and redirect URIs. Client IDs are used to limit what an identity token can be used for and where it can be sent to (in combination with redirect URIs), and a client secret can be used by something getting a token to prove that it really is the client ID it claims to be. If you don't have the right information, the OIDC IdP won't even talk to you. However, this means that all of this information has to be given to the client, or at least obtained by the client and stored by it.

(These days OIDC has a specification for Dynamic Client Registration and can support 'open' dynamic registration of clients, if desired (although it's apparently not widely implemented). But clients do have to register to get the risk-mitigating information for the main IdP endpoint, and I don't know how this is supposed to handle the IMAP situation if the IMAP server wants to verify that the OIDC token it receives was intended for it, since each dynamic client will have a different client ID.)
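To make the moving parts a bit more concrete, the start of an OIDC authorization flow is a redirect to something like the following (all of the names here are made up); the client_id and redirect_uri have to match what was registered with the IdP, and the ID token that eventually comes back carries the client ID in its 'aud' (audience) claim, which is what relying parties are supposed to check:

https://idp.example.org/authorize?response_type=code
    &client_id=webmail-frontend
    &redirect_uri=https://webmail.example.org/oidc/callback
    &scope=openid+email
    &state=<random anti-CSRF value>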

My script to 'activate' Python virtual environments

By: cks
14 November 2025 at 03:27

After I wrote about Python virtual environments and source code trees, I impulsively decided to set up the development tree of our Django application to use a Django venv instead of a 'pip install --user' version of Django. Once I started doing this, I quickly decided that I wanted a general script that would switch me into a venv. This sounds a little bit peculiar if you know Python virtual environments so let me explain.

Activating a Python virtual environment mostly means making sure that its 'bin' directory is first on your $PATH, so that 'python3' and 'pip' and so on come from it. Venvs come with files that can be sourced into common shells in order to do this (with the one for Bourne shells called 'activate'), but for me this has three limits. You have to use the full path to the script, they change your current shell environment instead of giving you a new one that you can just exit to discard this 'activation', and I use a non-standard shell that they don't work in. My 'venv' script is designed to work around all three of those limitations. As a script, it starts a new shell (or runs a command) instead of changing my current shell environment, and I set it up so that it knows my standard place to keep virtual environments (and then I made it so that I can use symbolic links to create 'django' as the name of 'whatever my current Django venv is').

(One of the reasons I want my 'venv' command to default to running a shell for me is that I'm putting the Python LSP server into my Django venvs, so I want to start GNU Emacs from an environment with $PATH set properly to get the right LSP server.)

My initial version only looked for venvs in my standard location for development related venvs. But almost immediately after starting to use it, I found that I wanted to be able to activate pipx venvs too, so I added ~/.local/pipx/venvs to what I really should consider to be a 'venv search path' and should formalize into an environment variable with a default value.
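With that addition, the general shape of the script is something like the following sketch (this is not my actual script and the default locations are illustrative):

#!/bin/sh
# 'venv NAME [command ...]': put NAME's bin directory first on $PATH,
# then run the command there or start a shell.
dirs="${VENVDIRS:-$HOME/lib/venvs:$HOME/.local/pipx/venvs}"
name="$1"; shift
IFS=:
for d in $dirs; do
    if [ -x "$d/$name/bin/python3" ]; then
        PATH="$d/$name/bin:$PATH"; export PATH
        [ "$#" -eq 0 ] && exec "${SHELL:-/bin/sh}"
        exec "$@"
    fi
done
echo "venv: no venv called '$name' in $dirs" 1>&2
exit 1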

I've stuffed a few other features into the venv script. It will print out the full path to the venv if I ask it to (in addition to running a command, which can be just 'true'), or something to set $PATH. I also found I sometimes wanted it to change directory to the root of the venv. Right now I'm still experimenting with how I want to build other scripts on top of this one, so some of this will probably change in time.

One of my surprises about writing the script is how much nicer it's made working with venvs (or working with things in venvs). There's nothing it does that wasn't possible before, but the script has removed friction (more friction than I realized was there, which is traditional for me).

PS: This feels like a sufficiently obvious idea that I suspect that a lot of people have written 'activate a venv somewhere along a venv search path' scripts. There's unlikely to be anything special about mine, but it works with my specific shell.

Getting feedback as a small web crawler operator

By: cks
13 November 2025 at 04:17

Suppose, hypothetically, that you're trying to set up a small web crawler for a good purpose. These days you might be focused on web search for text-focused sites, or small human-written sites, or similar things, and given the bad things that are happening with the major crawlers, we could certainly use such crawlers. As a small crawler, you might want to get feedback and problem reports from web site operators about what your crawler is doing (or not doing). As it happens, I have some advice and views on this.

  • Above all, remember that you are not Google or even Bing. Web site operators need Google to crawl them, and they have no choice but to bend over backward for Google and to send out plaintive signals into the void if Googlebot is doing something undesirable. Since you're not Google and you need websites much more than they need you, the simplest thing for website operators to do with and about your crawler is to ignore the issue, potentially block you if you're causing problems, and move on.

    You cannot expect people to routinely reach out to you. Anyone who does reach out to you is axiomatically doing you a favour, at the expense of some amount of their limited time and at some risk to themselves.

  • Website operators have no reason to trust you or trust that problem reports will be well received. This is a lesson plenty of people have painfully learned from reporting spam (email or otherwise) and other abuse; a lot of the time your reports can wind up in the hands of people who aren't well intentioned toward you (either going directly to them or 'helpfully' being passed on by the ISP). At best you confirm that your email address is alive and get added to more spam address lists; at worst you get abused in various ways.

    The consequence of this is that if you want to get feedback, you should make it as low-risk as possible for people. The lowest risk way (to website operators) is for you to have a feedback form on your site that doesn't require email or other contact methods. If you require that website operators reveal their email addresses, social media handles, or whatever, you will get much less feedback (this includes VCS forge handles if you force them to make issue reports on some VCS forge).

    (This feedback form should be easy to find, for example being directly linked from the web crawler information URL in your User-Agent.)

  • As far as feedback goes, both your intentions and your views on the reasonableness of what your web crawler is doing (and how someone's website behaves) are irrelevant. What matters is the views of website operators, who are generally doing you a favour by not simply blocking or ignoring your crawler and moving on. If you disagree with their feedback, the best thing to do is be quiet (and maybe say something neutral if they ask for a reply). This is probably most important if your feedback happens through a public VCS forge issue tracker, where future people who are thinking about filing an issue the way you asked may skim over past issues to see how they went.

    (You may or may not ignore website operator feedback that you disagree with depending on how much you want to crawl (all of) their site.)

At the moment, most website operators who notice a previously unknown crawler will likely assume that it's an (abusive) LLM crawler. One way to lower the chances of this is to follow social conventions around crawlers for things like crawler User-Agents and not setting the Referer header. I don't think you have to completely imitate how Googlebot, bingbot, Applebot, the archive.org bot and so on format their User-Agent strings, but it's going to help to generally look like them and clearly put the same sort of information into yours. Similarly, if you can it will help to crawl from clearly identified IPs with reverse DNS. The more that people think you're legitimate and honest, the more likely they are to spend the time and take the risk to give you feedback; the more sketchy or even uncertain you look, the less likely you are to get feedback.

(In general, any time you make website operators uncertain about an aspect of your web crawler, some number of them will not be charitable in their guess. The more explicit and unambiguous you are in the more places, the better.)
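To illustrate the User-Agent point, the expected shape is something like this entirely made-up example:

ExampleCrawler/0.3 (+https://crawler.example.org/about.html; crawl@example.org)

where the URL points at a page that explains what the crawler does, how to give feedback, and what IP ranges it crawls from.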

Building and running a web crawler is not an easy thing on today's web. It requires both technical knowledge of various details of HTTP and how you're supposed to react to things (eg), and current social knowledge of what is customary and expected of web crawlers, as well as what you may need to avoid (for example, you may not want to start your User-Agent with 'Mozilla/5.0' any more, and in general the whole anti-crawling area is rapidly changing and evolving right now). Many website operators revisit blocks and other reactions to 'bad' web crawlers only infrequently, so you may only get one chance to get things right. This expertise can't be outsourced to a random web crawling library because many of them don't have it either.

(While this entry was sparked by a conversation I had on the Fediverse, I want to be explicit that it is in no way intended as a subtoot of that conversation. I just realized that I had some general views that didn't fit within the margins of Fediverse posts.)

Firefox's sudden weird font choice and fixing it

By: cks
12 November 2025 at 04:03

Today, while I was in the middle of using my normal browser instance, it decided to switch from DejaVu Sans to Noto Sans as my default font:

Dear Firefox: why are you using Noto Sans all of a sudden? I have you set to DejaVu Sans (and DejaVu everything), and fc-match 'sans' and fc-match serif both say they're DejaVu (and give the DejaVu TTF files). This is my angry face.

This is a quite noticeable change for me because it changes the font I see on Wandering Thoughts, my start page, and other things that don't set any sort of explicit font. I don't like how Noto Sans looks and I want DejaVu Sans.

(I found out that it was specifically Noto Sans that Firefox was using all of a sudden through the Web Developer tools 'Font' information, and confirmed that Firefox should still be using DejaVu through the way to see this in Settings.)

After some flailing around, it appears that what I needed to do to fix this was to explicitly set about:config's font.name.serif.x-western, font.name.sans-serif.x-western, and font.name.monospace.x-western to specific values instead of leaving them set to nothing, which seems to have caused Firefox to arrive at Noto Sans through some mysterious process (since the generic system font name 'sans' was still mapping to DejaVu Sans). I don't know if these are exposed through the Fonts advanced options in Settings β†’ General, which are (still) confusing in general. It's possible that these are what are used for 'Latin'.

(I used to be using the default 'sans', 'serif', and 'monospace' font names that cascaded through to the DejaVu family. Now I've specifically set everything to the DejaVu set, because if something in Fedora or Firefox decides that the default mapping should be different, I don't want Firefox to follow it, I want it to stay with DejaVu.)
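Concretely, the preferences involved are the following; the values are whatever your DejaVu faces are called in fc-list (these are the usual family names):

font.name.serif.x-western       DejaVu Serif
font.name.sans-serif.x-western  DejaVu Sans
font.name.monospace.x-western   DejaVu Sans Mono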

I don't know why Firefox would suddenly decide these pages are 'western' instead of 'unicode'; all of them are served as or labeled as UTF-8, and nothing about that has changed recently. Unfortunately, as far as I know there's no way to get Firefox to tell you what font.name preference name it used to pick (default) fonts for a HTML document. When it sends HTTP 304 Not Modified responses, Wandering Thoughts doesn't include a Content-Type header (with the UTF-8 character set), but as far as I know that's a standard behavior and browsers presumably cope with it.

(Firefox does see 'Noto Sans' as a system UI font, which it uses on things like HTML form buttons, so it didn't come from nowhere.)

It makes me sad that Firefox continues to have no global default font choice. You can set 'Unicode' but as I've just seen, this doesn't make what you set there the default for unset font preferences, and the only way to find out what unset font preferences you have is to inspect about:config.

PS: For people who aren't aware of this, it's possible for Firefox to forget some of your about:config preferences. Working around this probably requires using Firefox policies (via), which can force-set arbitrary about:config preferences (among other things).

Discovering orphaned binaries in /usr/sbin on Fedora 42

By: cks
11 November 2025 at 04:10

Over on the Fediverse, I shared a somewhat unwelcome discovery I made after upgrading to Fedora 42:

This is my face when I have quite a few binaries in /usr/sbin on my office Fedora desktop that aren't owned by any package. Presumably they were once owned by packages, but the packages got removed without the files being removed with them, which isn't supposed to happen.

(My office Fedora install has been around for almost 20 years now without being reinstalled, so things have had time to happen. But some of these binaries date from 2021.)
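If you want to check your own system, the basic test is whether 'rpm -qf' can name an owning package for a file; a quick brute force version for /usr/sbin is:

for f in /usr/sbin/*; do
    rpm -qf "$f" >/dev/null 2>&1 || echo "$f"
done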

There seem to be two sorts of these lingering, unowned /usr/sbin programs. One sort, such as /usr/sbin/getcaps, seems to have been left behind when its package moved things to /usr/bin, possibly due to this RPM bug (via). The other sort is genuinely unowned programs dating to anywhere from 2007 (at the oldest) to 2021 (at the newest), which have nothing else left of them sitting around. The newest programs are what I believe are wireless management programs: iwconfig, iwevent, iwgetid, iwlist, iwpriv, and iwspy, and also "ifrename" (which I believe was also part of a 'wireless-tools' package). I had the wireless-tools package installed on my office desktop until recently, but I removed it some time during Fedora 40, probably sparked by the /sbin to /usr/sbin migration, and it's possible that binaries didn't get cleaned up properly due to that migration.

The most interesting orphan is /usr/sbin/sln, dating from 2018, when apparently various people discovered it as an orphan on their system. Unlike the other orphan programs, sln still has its manual page shipped as part of the standard 'man-pages' package, so you can read sln(8) online. Based on the manual page, it sounds like it may have been part of glibc at one point.

(Another orphaned program from 2018 is pam_tally, although it's coupled to pam_tally2.so, which did get removed.)

I don't know if there's any good way to get mappings from files to RPM packages for old Fedora versions. If there is, I'd certainly pick through it to try to find where various of these files came from originally. Unfortunately I suspect that for sufficiently old Fedora versions, much of this information is either offline or can't be processed by modern versions of things like dnf.

(The basic information is used by eg 'dnf provides' and can be built by hand from the raw RPMs, but I have no desire to download all of the RPMs for decade-old Fedora versions even if they're still available somewhere. I'm curious but not that curious.)

PS: At the moment I'm inclined to leave everything as it is until at least Fedora 43, since RPM bugs are still being sorted out here. I'll have to clean up genuinely orphaned files at some point but I don't think there's any rush. And I'm not removing any more old packages that use '/sbin/<whatever>', since that seems like it has some bugs.

Python virtual environments and source code trees

By: cks
10 November 2025 at 04:22

Python virtual environments are mostly great for actually deploying software. Provided that you're using the same version of Python (3) everywhere (including CPU architecture), you can make a single directory tree (a venv) and then copy and move it around freely as a self-contained artifact. It's also relatively easy to use venvs to switch the version of packages or programs you're using, for example Django. However, venvs have their frictions, at least for me, and often I prefer to do Python development outside of them (especially for our Django web application).

(This means using 'pip install --user' to install things like Django, to the extent that it's still possible.)

One point of friction is in their interaction with working on the source code of our Django web application. As is probably common, this source code lives in its own version control system controlled directory tree (we use Mercurial for this for reasons). If Django is installed as a user package, the native 'python3' will properly see it and be able to import Django modules, so I can directly or indirectly run Django commands with the standard Python and my standard $PATH.

If Django is installed in a venv, I have two options. The manual way is to always make sure that this Django venv is first on my $PATH before the system Python, so that 'python3' is always from the venv and not from the system. This has a little bit of a challenge with Python scripts, and is one of the few places where '#!/usr/bin/env python3' makes sense. In my particular environment it requires extra work because I don't use a standard Unix shell and so I can't use any of the venv bin/activate things to do all the work for me.

The automatic way is to make all of the convenience scripts that I use to interact with Django explicitly specify the venv python3 (including for things like running a test HTTP server and invoking local management commands), which works fine since a program can be outside the venv it uses. This leaves me with the question of where the Django venv should be, and especially if it should be outside the source tree or in a non-VCS-controlled path inside the tree. Outside the source tree is the pure option but leaves me with a naming problem that has various solutions. Inside the source tree (but not VCS controlled) is appealingly simple but puts a big blob of otherwise unrelated data into the source tree.

(Of course I could do both at once by having a 'venv' symlink in the source tree, ignored by Mercurial, that points to wherever the Django venv is today.)
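As an illustration of the automatic way, the convenience scripts only need to name the venv's python3 explicitly; a made-up minimal version is:

#!/bin/sh
# run Django management commands with the venv's python3, regardless of
# what is (or isn't) on $PATH; the venv location here is invented.
venv="$HOME/lib/venvs/django"
cd "$(dirname "$0")" && exec "$venv/bin/python3" ./manage.py "$@"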

Since 'pip install --user' seems more and more deprecated as time goes by, I should probably move to developing with a Django venv sooner or later. I will probably use a venv outside the source tree, and I haven't decided about an in-tree symlink.

(I'll still have the LSP server problem but I have that today. Probably I'll install the LSP server into the Django venv.)

PS: Since this isn't a new problem, the Python community has probably come up with some best practices for dealing with it. But in today's Internet search environment I have no idea how to find reliable sources.

A HTTP User-Agent that claims to be Googlebot is now a bad idea

By: cks
9 November 2025 at 04:04

Once upon a time, people seem to have had a little thing for mentioning Googlebot in their HTTP User-Agent header, much like browsers threw in claims to make them look like Firefox or whatever (the ultimate source of the now-ritual 'Mozilla/5.0' at the start of almost every browser's User-Agent). People might put in 'allow like Googlebot' or just say 'Googlebot' in their User-Agent. Some people are still doing this today, for example:

Gwene/1.0 (The gwene.org rss-to-news gateway) Googlebot

This is now an increasingly bad idea on the web and if you're doing it, you should stop. The problem is that there are various malicious crawlers out there claiming to be Googlebot, and Google publishes their crawler IP address ranges. Anything claiming to be Googlebot that is not from a listed Google IP is extremely suspicious and in this day and age of increasing anti-crawler defenses, blocking all 'Googlebot' activity that isn't from one of their listed IP ranges is an obvious thing to do. Web sites may go even further and immediately taint the IP address or IP address range involved in impersonating Googlebot, blocking or degrading further requests regardless of the User-Agent.

(Gwene is not exactly claiming to be Googlebot but they're trying to get simple Googlebot-recognizers to match them against Googlebot allowances. This is questionable at best. These days such attempts may do more harm than good as they get swept up in precautions against Googlebot forgery, or rules that block Googlebot from things it shouldn't be fetching, like syndication feeds.)

A similar thing applies to bingbot and the User-Agent of any other prominent web search engines, and Bing does publish their IP address ranges. However, I don't think I've ever seen someone impersonate bingbot (which probably doesn't surprise anyone). I don't know if anyone ever impersonates Archive.org (no one has in the past week here), but it's possible that crawler operators will fish to see if people give special allowances to them that can be exploited.

(The corollary of this is that if you have a website, an extremely good signal of bad stuff is someone impersonating Googlebot and maybe you could easily block that. I think this would be fairly easy to do in an Apache <If> clause that then Allow's from Googlebot's listed IP addresses and Denies everything else, but I haven't actually tested it.)
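An untested sketch of what this might look like with Apache 2.4's <If>, using one of Google's published ranges as a stand-in for the full list from their JSON:

<If "%{HTTP_USER_AGENT} =~ /Googlebot/">
    # only requests from Google's own networks may claim to be
    # Googlebot; everything else making the claim gets denied.
    Require ip 66.249.64.0/19
</If>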

Containers and giving up on expecting good software installation practices

By: cks
8 November 2025 at 03:58

Over on the Fediverse, I mentioned a grump I have about containers:

As a sysadmin, containers irritate me because they amount to abandoning the idea of well done, well organized, well understood, etc installation of software. Can't make your software install in a sensible way that people can control and limit? Throw it into a container, who cares what it sprays where across the filesystem and how much it wants to be the exclusive owner and controller of everything in sight.

(This is a somewhat irrational grump.)

To be specific, it's by and large abandoning the idea of well done installs of software on shared servers. If you're only installing software inside a container, your software can spray itself all over the (container) filesystem, put itself in hard-coded paths wherever it feels like, and so on, even if you have completely automated instructions for how to get it to do that inside a container image that's being built. Some software doesn't do this and is well mannered when installed outside a container, but some software does and you'll find notes to the effect that the only supported way of installing it is 'here is this container image', or 'here is the automated instructions for building a container image'.

To be fair to containers, some of this is due to missing Unix APIs (or APIs that theoretically exist but aren't standardized). Do you want multiple Unix logins for your software so that it can isolate different pieces of itself? There's no automated way to do that. Do you run on specific ports? There's generally no machine-readable way to advertise that, and people may want you to build in mechanisms to vary those ports and then specify the new ports to other pieces of your software (that would all be bundled into a container image). And so on. A container allows you to put yourself in an isolated space of Unix UIDs, network ports, and so on, one where you won't conflict with anyone else and won't have to try to get the people who want to use your software to create and manage the various details (because you've supplied either a pre-built image or reliable image building instructions).

But I don't have to be happy that software doesn't necessarily even try, that we seem to be increasingly abandoning much of the idea of running services in shared environments. Shared environments are convenient. A shared Unix environment gives you a lot of power and avoids a lot of complexity that containers create. Fortunately there's still plenty of software that is willing to be installed on shared systems.

(Then there is the related grump that the modern Linux software distribution model seems to be moving toward container-like things, which has a whole collection of issues associated with it.)

Go's runtime may someday start explicitly freeing some internal memory

By: cks
7 November 2025 at 03:30

One of my peculiar hobbies is that I read every commit message for the Go (development) repository. Often this is boring, but sometimes I discover things I find amusing:

This is my amused face when Go is adding explicit, non-GC freeing of memory from within the runtime and compiler-generated code under some circumstances. It's perfectly sensible, but still.

It turns out that right now, the only thing that's been added is a 'GOEXPERIMENT=runtimefree' Go experiment, which you can set without build errors. There's no actual use of it in the current development tree.

The proposal that led to this doesn't seem to currently be visible in a mainline commit in the Go proposal repository, but until it surfaces you can access Directly freeing user memory to reduce GC work from the (proposed?) change (update: see below for the final version), and also Go issue 74299: runtime, cmd/compile: add runtime.free, runtime.freetracked and GOEXPERIMENT=runtimefree and the commit itself, which only adds the Go experiment flag. A preview of performance results (from a link in issue 74299) is in the message of slices: free intermediate memory in Collect via runtime.freeSlice.

(Looking into this has caused me to find the Go Release Dashboard, and see eg the pending proposals section, where you can find multiple things for this proposal.)

Update: The accepted proposal is now merged in the Go proposals repository, Directly freeing user memory to reduce GC work.

I feel the overall idea is perfectly sensible, for all that it feels a bit peculiar in a language with a mark and sweep garbage collector. As the proposal points out, there are situations where the runtime knows that something doesn't escape but it has to allocate it on the heap instead of the stack, and also situations where the runtime knows that some value is dead but the compiler can't prove it. In both situations we can reduce pressure on memory allocation and to some extent garbage collection by explicitly marking the objects as free right away. A runtime example cited in the proposal is when maps grow and split, which is safe since map values are unaddressable so no one can have (validly formed) pointers to them.

(Because unused objects aren't traversed by the garbage collector, this doesn't directly reduce the amount of work GC has to do but it does mean GC might not have to run as much.)

Sadly, so far only the GOEXPERIMENT setting has landed in the Go development tree so there's nothing to actually play with (and no code to easily read). We have to look from afar and anticipate, and at this point it's possible no actual code will land until after Go 1.26 (since based on the usual schedule there will be a release freeze soon, leaving not very much time to land all of these changes).

(The whole situation turns out to be less exciting than I thought when I read the commit message and made my Fediverse post, but that's one reason to write these entries.)

PS: In general, garbage collected languages can also have immediate freeing of memory, for example if they use reference counting. CPython is an example and CPython people can be quite used to deterministic, immediate collection of unreferenced objects along with side effects such as closing file descriptors. Sometimes this can mask bugs.
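For example, this sort of CPython-specific pattern is easy to write without noticing that you're relying on prompt finalization:

def first_line(path):
    # No explicit close(); in CPython the file object's reference count
    # drops to zero when the function returns, so the file is closed
    # right away.  Under a GC without prompt finalization it could stay
    # open (and hold its file descriptor) until some later collection.
    return open(path).readline()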

A problem for downloading things with curl

By: cks
6 November 2025 at 04:24

For various reasons, I'm working to switch from wget to curl, and generally this has been going okay. However, I've now run into one situation where I don't know how to make curl do what I want. It is, of course, a project that doesn't bother to do easily-fetched downloads, but in a very specific way. In fact it's Django (again).

The Django URLs for downloads look like this:

https://www.djangoproject.com/download/5.2.8/tarball/

The way the websites of many projects turn these into actual files is to provide a filename in the HTTP Content-Disposition header in the reply. In curl, these websites can be handled with the -J (--remote-header-name) option used together with -O, which takes the filename from the Content-Disposition if there is one.

Unfortunately, Django's current website does not operate this way. Instead, the URL above is a HTTP redirection to the actual .tar.gz file (on media.djangoproject.com). The .tar.gz file is then served without a Content-Disposition header as an application/octet-stream. Wget will handle this with --trust-server-names, but as far as I can tell from searching through the curl manpage, there is no option that will do this in curl.

(In optimistic hope I even tried --location-trusted, but no luck.)

If curl is directed straight to the final URL, 'curl -O' alone is enough to get the right file name. However, if curl goes through a redirection, there seems to be no option that will cause it to re-evaluate the 'remote name' based on the new URL; the initial URL and the name derived from it sticks, and you get a file unhelpfully called 'tarball' (in this case). If you try to be clever by running the initial curl without -O but capturing any potential redirection with "-w '%{redirect_url}\n'" so you can manually follow it in a second curl command, this works (for one level of redirections) but leaves you with a zero-length file called 'tarball' from the first curl.

It's possible that this means curl is the wrong tool for the kind of file downloads I want to do from websites like this, and I should get something else entirely. However, that something else should at least be a completely self contained binary so that I can easily drag it around to all of the assorted systems where I need to do this.

(I could always try to write my own in Go, or even take this as an opportunity to learn Rust, but that way lies madness and a lot of exciting discoveries about HTTP downloads in the wild. The more likely answer is that I hold my nose and keep using wget for this specific case.)

PS: I think it's possible to write a complex script using curl that more or less works here, but one of the costs is that you have to make first a HEAD and then a GET request to the final target, and that irritates me.
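For the record, a rough and untested sketch of that two-request approach, which only copes with a single level of redirection:

#!/bin/sh
# Ask where the URL redirects to without downloading anything, then
# fetch the final URL so that -O derives the file name from it.
url="$1"
final=$(curl -sI -o /dev/null -w '%{redirect_url}' "$url")
exec curl -O "${final:-$url}"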

Some notes on duplicating xterm windows

By: cks
5 November 2025 at 03:45

Recently on the Fediverse, Dave Fischer mentioned a neat hack:

In the decades-long process of getting my fvwm config JUST RIGHT, my xterm right-click menu now has a "duplicate" command, which opens a new xterm with the same geometry, on the same node, IN THE SAME DIRECTORY. (Directory info aquired via /proc.)

[...]

(See also a followup note.)

This led to @grawity sharing an xterm-native approach to this, using xterm's spawn-new-terminal() internal function that's available through xterm's keybindings facility.
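Hooking this up is a matter of xterm's translations resource; an untested sketch (check xterm(1) for the exact action name and arguments) is:

XTerm*VT100.translations: #override \n\
    Ctrl Shift <Key>N: spawn-new-terminal()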

I have a long-standing shell function in my shell that attempts to do this (imaginatively called 'spawn'), but this is only available in environments where my shell is set up, so I was quite interested in the whole area and did some experiments. The good news is that xterm's 'spawn-new-terminal' works, in that it will start a new xterm and the new xterm will be in the right directory. The bad news for me is that that's about all that it will do, and in my environment this has two limitations that will probably make it not something I use a lot.

The first limitation is that this starts an xterm that doesn't copy the command line state or settings of the parent xterm. If you've set special options on the parent xterm (for example, you like your root xterms to have a red foreground), these won't be carried over to the new xterm. Similarly, if you've increased (or decreased) the font size in your current xterm or otherwise changed its settings, spawn-new-terminal doesn't duplicate these; you get a default xterm. This is reasonable but disappointing.

(While spawn-new-terminal takes arguments that I believe it will pass to the new xterm, as far as I know there's no way to retrieve the current xterm's command line arguments to insert them here.)

The larger limitation for me is that when I'm at home, I'm often running SSH inside of an xterm in order to log in to some other system (I have a 'sshterm' script to automate all the aspects of this). What I really want when I 'duplicate' such an xterm is not a copy of the local xterm running a local shell (or even starting another SSH to the remote system), but the remote (shell) context, with the same (remote) current directory and so on. This is impossible to get in general and difficult to set up even for situations where it's theoretically possible. To use spawn-new-terminal effectively, you basically need either all local xterms or copious use of remote X forwarded over SSH (where the xterm is running on the remote system, so a duplicate of it will be as well and can get the right current directory).

Going through this experience has given me some ideas on how to improve the situation overall. Probably I should write a 'spawn' shell script to replace or augment my 'spawn' shell function so I can readily have it in more places. Then when I'm ssh'd in to a system, I can make the 'spawn' script at least print out a command line or two for me to copy and paste to get set up again.

(Two command lines is the easiest approach, with one command that starts the right xterm plus SSH combination and the other a 'cd' to the right place that I'd execute in the new logged in window. It's probably possible to combine these into an all-in-one script but that starts to get too clever in various ways, especially as SSH has no straightforward way to pass extra information to a login shell.)

My GPS bike computer is less distracting than the non-computer option

By: cks
4 November 2025 at 02:44

I have a GPS bike computer primarily for following pre-planned routes, because it became a better supported option than our old paper cue sheets. One of the benefits of switching from paper cue sheets to a GPS unit was better supported route following, but after I made the switch, I found that the GPS unit was also less distracting than the cue sheets. On the surface this might sound paradoxical, since people often say that computer screens are more distracting. It's true that a GPS bike computer has a lot that you can look at, but for route following, a GPS bike computer also has features that let me not pay attention to it.

When I used paper cue sheets, I always had to pay a certain amount of attention to following the route. I needed to keep track of where we were on the cue sheet's route, and either remember what the next turn was or look at the cue sheet frequently enough that I could be sure I wouldn't miss it. I also needed to devote a certain amount of effort to scanning street signs to recognize the street we'd be turning on to. All of this distracted me from looking around and enjoying the ride; I could never check out completely from route following.

When I follow a route on my GPS bike computer, it's much easier to not pay attention to route following most of the time. My GPS bike computer will beep at me and display a turn alert when we get close to a turn, and I always have it display the distance to the next turn so I can take a quick glance to reassure myself that we're nowhere near the turn. If there's any ambiguity about where to turn, I can look at the route's trace on a map and see that the turn is, for example, two streets ahead, and of course the GPS bike computer is always keeping track of where in the route I am.

Because the GPS bike computer can tell me when I need to pay attention to following the route, I'm free to not pay attention at other times. I can stop thinking about the route at all and look around at the scenery, talk with my fellow club riders, and so on.

(When I look around there are similar situations at work, with some of our systems. Our metrics, monitoring, and alerting system often has the net effect that I don't even look at how things are going because I assume that silence means all is okay. And if I want to do the equivalent of glancing at my GPS bike computer to check the distance to the next turn, I can look at our dashboards.)

How I handle URLs in my unusual X desktop

By: cks
3 November 2025 at 04:34

I have an unusual X desktop environment that has evolved over a long period, and as part of that I have an equally unusual and slowly evolved set of ways to handle URLs. By 'handle URLs', what I mean is going from an URL somewhere (email, text in a terminal, etc) to having the URL open in one of my several browser environments. Tied into this is handling non-URL things that I also want to open in a browser, for example searching for various sorts of things in various web places.

The simplest place to start is at the end. I have several browser environments and to go along with them I have a script for each that opens URLs provided as command line arguments in a new window of that browser. If there are no command line arguments, the scripts open a default page (usually a blank page, but for my main browser it's a special start page of links). For most browsers this works by running 'firefox <whatever>' and so will start the browser if it's not already running, but for my main browser I use a lightweight program that uses Firefox's X-based remote control protocol, which means I have to start the browser outside of it.

Layered on top of these browser specific scripts is a general script to open URLs that I call 'openurl'. The purpose of openurl is to pick a browser environment based on the particular site I'm going to. For example, if I'm opening the URL of a site where I know I need JavaScript, the script opens the URL in my special 'just make it work' JavaScript enabled Firefox. Most URLs open in my normal, locked down Firefox. I configure programs like Thunderbird to open URLs through this openurl script, sometimes directly and sometimes indirectly.

(I haven't tried to hook openurl into the complex mechanisms that xdg-open uses to decide how to open URLs. Probably I should but the whole xdg-open thing irritates me.)
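
To illustrate the idea (not the actual script), a stripped down openurl might look like this; the wrapper script names and the site list are made up for the example.

#!/usr/bin/env python3
# Hypothetical sketch of an 'openurl' style dispatcher: pick a browser
# wrapper script based on the URL's host. Names here are made up.
import subprocess
import sys
from urllib.parse import urlparse

JS_SITES = {"videosite.example", "bank.example"}   # sites that need JavaScript

def browser_for(url):
    host = urlparse(url).hostname or ""
    if any(host == s or host.endswith("." + s) for s in JS_SITES):
        return "ff-js"      # wrapper for the JavaScript enabled Firefox
    return "ff-main"        # wrapper for the normal, locked down Firefox

def main(urls):
    if not urls:
        subprocess.run(["ff-main"])     # no URL: open the start page
        return
    for url in urls:
        subprocess.run([browser_for(url), url])

if __name__ == "__main__":
    main(sys.argv[1:])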

Layered on top of openurl and the specific browser scripts is a collection of scripts that read the X selection and do a collection of URL-related things with it. One script reads the X selection, looks for it being a URL, and either feeds the URL to openurl or just runs openurl to open my start page. Other scripts feed the URL to alternate browser environments or do an Internet search for the selection. Then I have a fvwm menu with all of these scripts in it and one of my fvwm mouse button bindings brings up this menu. This lets me select a URL in a terminal window, bring up the menu, and open it in either the default browser choice or a specific browser choice.

(I also have a menu entry for 'open the selection in my main browser' in one of my main fvwm menus, the one attached to the middle mouse button, which makes it basically reflexive to open a new browser window or open some URL in my normal browser.)

The other way I handle URLs is through dmenu. One of the things my dmenu environment does is recognize URLs and open them in my default browser environment. I also have short dmenu commands to open URLs in my other browser environments, or open URLs based on the parameters I pass the command (such as a 'pd' script that opens Python documentation for a standard library module). Dmenu itself can paste in the current X selection with a keystroke, which makes it convenient to move URLs around. Dmenu is also how I typically open a URL if I'm typing it in instead of copying it from the X selection, rather than opening a new browser window, focusing the URL bar, and entering the URL there.

(I have dmenu set up to also recognize 'about:*' as URLs and have various Firefox about: things pre-configured as hidden completions in dmenu, along with some commonly used website URLs.)

As mentioned, dmenu specifically opens plain URLs in my default browser environment rather than going through openurl. I may change this someday but in practice there aren't enough special sites that it's an issue. Also, I've made dedicated little dmenu-specific scripts that open up the various sites I care about in the appropriate browser, so I can type 'mastodon' in dmenu to open up my Fediverse account in the JavaScript-enabled Firefox instance.

Trying to understand Firefox's approaches to tracking cookie isolation

By: cks
2 November 2025 at 02:50

As I learned recently, modern versions of Firefox have two different techniques that try to defeat (unknown) tracking cookies. As covered in the browser addon JavaScript API documentation, in Tracking protection, these are called first-party isolation and dynamic partitioning (or storage partitioning, the documentation seems to use both). Of these two, first party isolation is the easier to describe and understand. To quote the documentation:

When first-party isolation is on, cookies are qualified by the domain of the original page the user visited (essentially, the domain shown to the user in the URL bar, also known as the "first-party domain").

(In practice, this appears to be the top level domain of the site, not necessarily the site's domain itself. For example, Cookie Manager reports that a cookie set from '<...>.cs.toronto.edu' has the first party domain 'toronto.edu'.)

Storage partitioning is harder to understand, and again I'll quote the Storage partitioning section of the cookie API documentation:

When using dynamic partitioning, Firefox partitions the storage accessible to JavaScript APIs by top-level site while providing appropriate access to unpartitioned storage to enable common use cases. [...]

Generally, top-level documents are in unpartitioned storage, while third-party iframes are in partitioned storage. If a partition key cannot be determined, the default (unpartitioned storage) is used. [...]

If you read non-technical writeups like Firefox rolling out Total Cookie Protection (from 2022), it certainly sounds like they're describing first-party isolation. However, if you check things like Status of partitioning in Firefox and the cookies API documentation on first-party isolation, as far as I can tell what Firefox actually normally uses for "Total Cookie Protection" is storage partitioning.

Based on what I can decode from the two descriptions and from the fact that Tor Browser defaults to first-party isolation, it appears that first-party isolation is better and stricter than storage partitioning. Presumably it also causes problems on more websites, enough so that Firefox either no longer uses it for Total Cookie Protection or never did, despite their description sounding like first-party isolation.

(So far I haven't run into any issues with first-party isolation in my cookie-heavy browser environment. It's possible that websites have switched how they do things to avoid problems.)

First-party isolation can be enabled in about:config by setting privacy.firstparty.isolate to true. If and when you do this, the normal Settings → Privacy and Security will show a warning banner at the top to the effect of:

You are using First Party Isolation (FPI), which overrides some of Firefox's cookie settings.

All of this is relevant to me because one of my add-ons, Cookie AutoDelete, probably works with first-party isolation but almost certainly doesn't work with storage partitioning (ie, it will fail to delete some cookies under storage partitioning, although I believe it can still delete unpartitioned cookies). Given what I've learned, I'm likely to turn on first-party isolation in my main browser environment soon.

If Cookie Manager is reporting correct information to me, it's possible to have cookies that are both first-party isolated and partitioned; the one I've seen so far is from Youtube. Cookie Manager can't seem to remove these cookies. Based on what I've read about (storage or dynamic) partitioned cookies, I suspect that these are created by embedded iframes.

(Turning on or off first-party isolation effectively drops all of the cookies you currently have, so it's probably best to do it when you restart your browser.)

My mistake with swallowing EnvironmentError errors in our Django application

By: cks
1 November 2025 at 02:50

We have a little Django application to handle requests for Unix accounts. Once upon a time it was genuinely little, but it's slowly accreted features over the years. One of the features it grew over the years was a command line program (a Django management command) to bulk-load account request information from files. We use this to handle things like each year's new group of incoming graduate students; rather than force the new graduate students to find the web form on their own, we get information on all of them from the graduate program people and load them into the system in bulk.

One of the things that regularly happens with new graduate students is that they were already involved on the research side of the department. For example, as an undergraduate you might work on a research project with a professor, and then you get admitted as a graduate student (maybe with that professor, or maybe with someone else). When this happens, the new graduate student already has an account and we don't want to give them another one (for various reasons). To detect situations where someone already has an existing account, the bulk loader reads some historical data out of a couple of files and looks through it to match any existing accounts to the new graduate students.

When I originally wrote the code to load data from files, for some reason I decided that it wasn't particularly bad if the files didn't exist or couldn't be read, so I wrote code that looked more or less like this:

try:
  fp = open(fname, "r")
  [process file]
  fp.close()
except EnvironmentError:
  pass

Of course, for testing purposes (and other reasons, for example to suppress this check) we should be able to change where the data files were read from, so I made the file names of the data files be argparse options, set the default values to the standard locations that the production application recorded things, and called it all good.

Except that for the past two years, one of the default file names was wrong; when I added this specific file, I made a typo in the file name. Using the command line option to change the file name worked, so this passed my initial testing when I added the specific type of historical data, but in production, using my typo'd default file name, we silently never detected existing Unix logins for new graduate students (and others) through this particular type of historical data.

All of this happened because I made a deliberate design decision to silently swallow all EnvironmentError exceptions when trying to open and read these files, instead of either failing or at least reporting a warning. When I made the decision (back in 2013, it turns out), I was probably thinking that the only source of errors was if you ran it as the wrong user or deliberately supplied nonexistent files; I doubt it ever occurred to me that I could make an embarrassing typo in the name of any of the production files. One of the lessons I draw from this is that I don't always even understand the possible sources of errors, which makes it all the more dangerous to casually ignore them.

(Even silently ignoring nonexistent files is rather questionable in retrospect. I don't really know what I was thinking in 2013.)
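
As an illustration of the alternative, here's a minimal sketch of the pattern I'd reach for now; it's not the actual code in our application, just the shape of it. A missing file can still be tolerated when that's explicitly wanted, but never silently.

import sys

def load_history(fname, optional=False):
    # Read one of the historical data files. A missing or unreadable
    # file is only acceptable if the caller explicitly says so, and
    # even then we complain instead of staying silent.
    try:
        with open(fname, "r") as fp:
            return fp.readlines()    # stand-in for the real processing
    except EnvironmentError as e:
        if not optional:
            raise
        print("warning: skipping %s: %s" % (fname, e), file=sys.stderr)
        return []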

Removing Fedora's selinux-policy-targeted package is mostly harmless so far

By: cks
1 November 2025 at 01:32

A while back I discussed why I might want to remove the selinux-policy-targeted RPM package for a Fedora 42 upgrade. Today, I upgraded my office workstation from Fedora 41 to Fedora 42, and as part of preparing for that upgrade I removed the selinux-policy-targeted package (and all of the packages that depended on it). The result appears to work, although there were a few things that came up during the upgrade and I may reinstall at least selinux-policy-targeted itself to get rid of them (for now).

The root issue appears to be that when I removed the selinux-policy-targeted package, I probably should have edited /etc/selinux/config to set SELINUXTYPE to some bogus value, not left it set to "targeted". For entirely sensible reasons, various packages have postinstall scripts that assume that if your SELinux configuration says your SELinux type is 'targeted', they can do things that implicitly or explicitly require things from the package or from the selinux-policy package, which got removed when I removed selinux-policy-targeted.

I'm not sure if my change to SELINUXTYPE will completely fix things, because I suspect that there are other assumptions about SELinux policy programs and data files being present lurking in standard, still-installed package tools and so on. Some of these standard SELinux related packages definitely can't be removed without gutting Fedora of things that are important to me, so I'll either have to live with periodic failures of postinstall scripts or put selinux-policy-targeted and some other bits back. On the whole, reinstalling selinux-policy-targeted is probably the safest choice, and the issue that caused me to remove it only applies during Fedora version upgrades and might be fixed in Fedora 42 anyway.

What this illustrates to me is that regardless of package dependencies, SELinux is not really optional on Fedora. The Fedora environment assumes that a functioning SELinux environment is there and if it isn't, things are likely to go wrong. I can't blame Fedora for this, or for not fully capturing this in package dependencies (and Fedora did protect the selinux-policy-targeted package from being removed; I overrode that by hand, so what happens afterward is on me).

(Although I haven't checked modern versions of Fedora, I suspect that there's no official way to install Fedora without getting a SELinux policy package installed, and possibly selinux-policy-targeted specifically.)

PS: I still plan to temporarily remove selinux-policy-targeted when I upgrade my home desktop to Fedora 42. A few package postinstall glitches are better than not being able to read DNF output due to the package's spam.

Firefox, the Cookie AutoDelete add-on, and "Total Cookie Protection"

By: cks
31 October 2025 at 03:15

In a comment on my entry on flailing around with Firefox's Multi-Account Containers, Ian Z aka nobrowser asked a good question:

The Cookie Autodelete instructions with respect to Total Cookie Protection mode are very confusing. Reading them makes me think this extension is not for me, as I have Strict Mode on in all windows, private or not. [...]

This is an interesting question (and, it turns out, relevant to my usage too) so I did some digging. The short answer is that I suspect the warning on Cookie AutoDelete's add-on page is out of date and it works fine. The long answer starts with the history of HTTP cookies.

Back in the old days, HTTP cookies were global, which is to say that browsers kept a global pool of HTTP cookies (both first party, from the website you were on, and third-party cookies), and they would send any appropriate cookie on any HTTP request to its site. This enabled third-party tracking cookies and a certain number of CSRF attacks, since the browser would happily send your login cookies along with that request initiated by the JavaScript on some sketchy website you'd accidentally wound up on (or JavaScript injected through an ad network).

This was obviously less than ideal and people wound up working to limit the scope of HTTP cookies, starting with things like Firefox's containers and eventually escalating to first-party cookie isolation, where a cookie is restricted to whatever the first-party domain was when it was set. If you're browsing example.org and the page loads google.com/tracker, which sets a tracker cookie, that cookie will not be sent when you browse example.com and the page also loads google.com/tracker; the first tracking cookie is isolated to example.org.

(There is also storage isolation for cookies, but I think that's been displaced by first-party cookie isolation.)

However, first-party isolation has the potential to break things you expect to work, as covered in this Firefox FAQ. As a result of this, my impression is that browsers have been cautious and slow to roll out first-party isolation by default. However, they have made it available as an option or part of an option. Firefox calls this Total Cookie Protection (also, also).

(Firefox is working to go even further, blocking all third-party cookies.)

Firefox add-ons have special APIs that allow them to do privileged things, and these include an API for dealing with cookies. When first-party cookie isolation came to pass, these APIs needed to be updated to deal with such isolated cookies (and cookie tracking protection in general). For instance, cookies.remove() has to be passed a special parameter to remove a first-party isolated cookie. As covered in the documentation, an add-on using the cookies APIs without the necessary updates would only see non-isolated cookies, if there were any. So at the time the message on Cookie AutoDelete's add-on page was written, I suspect that it hadn't been updated for first-party isolation. However, based on checking the source code of Cookie AutoDelete, I believe that it currently supports first-party isolation for cookies, and in fact may have done so for some time, perhaps since v3.5.0 or v3.4.0, or even earlier.

(It's also possible that this support is incomplete or buggy, or that there are still some things that you can't easily do through it that matter to Cookie AutoDelete.)

Cookie AutoDelete itself is potentially useful even if you have Firefox set to block all third-party cookies, because it will also clean up unwanted first-party cookies (assuming that it truly works with first-party isolation). Part of my uncertainty is that I'm not sure how you reliably find out what cookies you have in a browser world with first-party isolation. There's theoretically some information about this in Settings → Privacy & Security → Cookies and Site Data → "Manage Data...", but since that's part of the normal Settings UI that normal people use, I'm not sure if it's simplifying things.

PS: Now that I've discovered all of this, I'm not certain if my standard Cookie Quick Manager add-on properly supports first-party isolated cookies. There's this comment on an issue that suggests it does support first-party isolation but not storage partitioning (also). The available Firefox documentation and Settings UI are not entirely clear about whether first-party isolation is now on more or less by default.

(That comment points to Cookie Manager as a potential partition-aware cookie manager.)

My flailing around with Firefox's Multi-Account Containers

By: cks
30 October 2025 at 02:43

I have two separate Firefox environments. One of them is quite locked down so that it blocks JavaScript by default, doesn't accept cookies, and so on. Naturally this breaks a lot of things, so I have a second "just make it work" environment that runs all the JavaScript, accepts all the cookies, and so on (although of course I use uBlock Origin, I'm not crazy). This second environment is pretty risky in the sense that it's going to be heavily contaminated with tracking cookies and so on, so to mitigate the risk (and make it a better environment to test things in), I have this Firefox set to discard cookies, caches, local storage, history, and so on when it shuts down.

In theory how I use this Firefox is that I start it when I need to use some annoying site I want to just work, use the site briefly, and then close it down, flushing away all of the cookies and so on. In practice I've drifted into having a number of websites more or less constantly active in this "accept everything" Firefox, which means that I often keep it running all day (or longer at home) and all of those cookies stick around. This is less than ideal, and is a big reason why I wish Firefox had an 'open this site in a specific profile' feature. Yesterday, spurred on by Ben Zanin's Fediverse comment, I decided to make my "accept everything" Firefox environment more complicated in the pursuit of doing better (ie, throwing away at least some cookies more often).

First, I set up a combination of Multi-Account Containers for the basic multi-container support and FoxyTab to assign wildcarded domains to specific containers. My reason to use Multi-Account Containers and to confine specific domains to specific containers is that both M-A C itself and my standard Cookie Quick Manager add-on can purge all of the cookies and so on for a specific container. In theory this lets me manually purge undesired cookies, or all cookies except desired ones (for example, my active Fediverse login). Of course I'm not likely to routinely manually delete cookies, so I also installed Cookie AutoDelete with a relatively long timeout and with its container awareness turned on, and exemptions configured for the (container-confined) sites that I'm going to want to retain cookies from even when I've closed their tab.

(It would be great if Cookie AutoDelete supported different cookie timeouts for different containers. I suspect it's technically possible, along with other container-aware cookie deletion, since Cookie AutoDelete applies different retention policies in different containers.)

In FoxyTab, I've set a number of my containers to 'Limit to Designated Sites'; for example, my 'Fediverse' container is set this way. The intention is that when I click on an external link in a post while reading my Fediverse feed, any cookies that external site sets don't wind up in the Fediverse container; instead they go either in the default 'no container' environment or in any specific container I've set up for them. As part of this I've created a 'Cookie Dump' container that I've assigned as the container for various news sites and so on where I actively want a convenient way to discard all their cookies and data (which is available through Multi-Account Containers).

Of course if you look carefully, much of this doesn't really require Multi-Account Containers and FoxyTab (or containers at all). Instead I could get almost all of this just by using Cookie AutoDelete to clean out cookies from closed sites after a suitable delay. Containers do give me a bit more isolation between the different things I'm using my "just make it work" Firefox for, and maybe that's important enough to justify the complexity.

(I still have this Firefox set to discard everything when it exits. This means that I have to re-log-in every so often even for the sites where I have Cookie AutoDelete keep cookies, but that's fine.)

I wish Firefox Profiles supported assigning websites to profiles

By: cks
29 October 2025 at 03:23

One of the things that Firefox is working on these days is improving Firefox's profiles feature so that it's easier to use them. Firefox also has an existing feature that is similar to profiles, in containers and the Multi-Account Containers extension. The reason Firefox is tuning up profiles is that containers only separate some things, while profiles separate pretty much everything. A profile has a separate set of about:config settings, add-ons, add-on settings, memorized logins, and so on. I deliberately use profiles to create two separate and rather different Firefox environments. I'd like to have at least two or three more profiles, but one reason I've been lazy is that the more profiles I have, the more complex getting URLs into the right profile is (even with tooling to help).

This leads me to my wish for profiles, which is for profiles to support the kind of 'assign website to profile' and 'open website in profile' features that you currently have with containers, especially with the Multi-Account Containers extension. Actually I would like a somewhat better version than Multi-Account Containers currently offers, because as far as I can see you can't currently say 'all subdomains under this domain should open in container X' and that's a feature I very much want for one of my use cases.

(Multi-Account Containers may be able to do wildcarded subdomains with an additional add-on, but on the other hand apparently it may have been neglected or abandoned by Mozilla.)

Another way to get much of what I want would be for some of my normal add-ons to be (more) container aware. I could get a lot of the benefit of profiles (although not all of them) by using Multi-Account Containers with container aware cookie management in, say, Cookie AutoDelete (which I believe does support that, although I haven't experimented). Using containers also has the advantage that I wouldn't have to maintain N identical copies of my configuration for core extensions and bookmarklets and so on.

(I'm not sure what you can copy from one profile to a new one, and you currently don't seem to get any assistance from Firefox for it, at least in the old profile interface. This is another reason I haven't gone wild on making new Firefox profiles.)

Modern Linux filesystem mounts are rather complex things

By: cks
28 October 2025 at 03:04

Once upon a time, Unix filesystem mounts worked by putting one inode on top of another, and this was also how they worked in very early Linux. It wasn't wrong to say that mounts were really about inodes, with the names only being used to find the inodes. This is no longer how things work in Linux (and perhaps other Unixes, but Linux is what I'm most familiar with for this). Today, I believe that filesystem mounts in Linux are best understood as namespace operations.

Each separate (unmounted) filesystem is a tree of names (a namespace). At a broad level, filesystem mounts in Linux take some name from that filesystem tree and project it on top of something in an existing namespace, generally with some properties attached to the projection. A regular conventional mount takes the root name of the new filesystem and puts the whole tree somewhere, but for a long time Linux's bind mounts took some other name in the filesystem as their starting point (what we could call the root inode of the mount). In modern Linux, there can also be multiple mount namespaces in existence at one time, with different contents and properties. A filesystem mount does not necessarily appear in all of them, and different things can be mounted at the same spot in the tree of names in different mount namespaces.

(Some mount properties are still global to the filesystem as a whole, while other mount properties are specific to a particular mount. See mount(2) for a discussion of general mount properties. I don't know if there's a mechanism to handle filesystem specific mount properties on a per mount basis.)

This can't really be implemented with an inode-based view of mounts. You can somewhat implement traditional Linux bind mounts with an inode based approach, but mount namespaces have to be separate from the underlying inodes. At a minimum a mount point must be a pair of 'this inode in this namespace has something on top of it', instead of just 'this inode has something on top of it'.

(A pure inode based approach has problems going up the directory tree even in old bind mounts, because the parent directory of a particular directory depends on how you got to the directory. If /usr/share is part of /usr and you bind mounted /usr/share to /a/b, the value of '..' depends on if you're looking at '/usr/share/..' or '/a/b/..', even though /usr/share and /a/b are the same inode in the /usr filesystem.)

If I'm reading manual pages correctly, Linux still normally requires the initial mount of any particular filesystem be of its root name (its true root inode). Only after that initial mount is made can you make bind mounts to pull out some subset of its tree of names and then unmount the original full filesystem mount. I believe that a particular filesystem can provide ways to sidestep this with a filesystem specific mount option, such as btrfs's subvol= mount option that's covered in the btrfs(5) manual page (or 'btrfs subvolume set-default').

You can add arbitrary zones to NSD (without any glue records)

By: cks
27 October 2025 at 03:29

Suppose, not hypothetically, that you have a very small DNS server for a captive network situation, where the DNS server exists only to give clients answers for a small set of hosts. One of the ways you can implement this is with an authoritative DNS server, such as NSD, that simply has an extremely minimal set of DNS data. If you're using NSD for this, you might be curious how minimal you can be and how much you need to mimic ordinary DNS structure.

Here, by 'mimic ordinary DNS structure', I mean inserting various levels of NS records so there is a more or less conventional path of NS delegations from the DNS root ('.') down to your name. If you're providing DNS clients with 'dog.example.org', you might conventionally have a NS record for '.', a NS record for 'org.', and a NS record for 'example.org.', mimicking what you'd see in global DNS. Of course all of your NS records are going to point to your little DNS server, but they're present if anything looks.

Perhaps unsurprisingly, NSD doesn't require this and DNS clients normally don't either. If you say:

zone:
  name: example.org
  zonefile: example-stub

and don't have any other DNS data, NSD won't object and it will answer queries for 'dog.example.org' with your minimal stub data. This works for any zone, including completely made up ones:

zone:
  name: beyond.internal
  zonefile: beyond-stub

The actual NSD stub zone files can be quite minimal. An older OpenBSD NSD appears to be happy with zone files that have only a $ORIGIN, a $TTL, a '@ IN SOA' record, and what records you care about in the zone.
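
For illustration, such an 'example-stub' zone file could be as small as something like this (all of the values are made up):

$ORIGIN example.org.
$TTL 3600
@    IN SOA  ns.example.org. hostmaster.example.org. (
             1 3600 900 604800 300 )
dog  IN A    192.0.2.10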

Once I thought about it, I realized I should have expected this. An authoritative DNS server normally only holds data for a small subset of zones and it has to be willing to answer queries about the data it holds. Some authoritative DNS servers (such as Bind) can also be used as resolving name servers so they'd sort of like to have information about at least the root nameservers, but NSD is a pure authoritative server so there's no reason for it to care.

As for clients, they don't normally do DNS resolution starting from the root downward. Instead, they expect to operate by sending the entire query to whatever their configured DNS resolver is, which is going to be your little NSD setup. In a number of configurations, clients either can't talk directly to outside DNS or shouldn't try to do DNS resolution that way because it won't work; they need to send everything to their configured DNS resolver so it can do, for example, "split horizon" DNS.

(Yes, the modern vogue for DNS over HTTPS puts a monkey wrench into split horizon DNS setups. That's DoH's problem, not ours.)

Since this works for a .net zone, you can use it to try to disable DNS over HTTPS resolvers in your stub DNS environment by providing a .net zone with 'use-application-dns CNAME .' or the like, to trigger at least Firefox's canary domain detection.
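
As a sketch, the NSD side of that might look like the following, with 'net-stub' being another minimal zone file in the same style as above whose extra record is the 'use-application-dns IN CNAME .' line:

zone:
  name: net
  zonefile: net-stub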

(I'm not going to address whether you should have such a minimal stub DNS environment or instead count on your firewall to block traffic and have a normal DNS environment, possibly with split horizon or response policy zones to introduce your special names.)

Some of the things that ZFS scrubs will detect

By: cks
26 October 2025 at 02:41

Recently I saw a discussion of my entry on how ZFS scrubs don't really check the filesystem structure where someone thought that ZFS scrubs only protected you from the disk corrupting data at rest, for example due to sectors starting to fail (here). While ZFS scrubs have their limits, they do manage to check somewhat more than this.

To start with, ZFS scrubs check the end to end hardware path for reading all your data (and implicitly for writing it). There are a variety of ways that things in the hardware path can be unreliable; for example, you might have slowly failing drive cables that are marginal and sometimes give you errors on data reads (or worse, data writes). A ZFS scrub has some chance to detect this; if a ZFS scrub passes, you know that as of that point in time you can reliably read all your data from all your disks and that all the data was reliably written.

If a scrub passes, you also know that the disks haven't done anything obviously bad with your data. This can be important if you're doing operations that you consider somewhat exotic, such as telling SSDs to discard unused sectors. If you have ZFS send TRIM commands to a SSD and then your scrub passes, you know that the SSD didn't incorrectly discard some sectors that were actually used.

Related to this, if you do a ZFS level TRIM and then the scrub passes, you know that ZFS itself didn't send TRIM commands that told the SSD to discard sectors that were actually used. In general, if ZFS has a serious problem where it writes the wrong thing to the wrong place, a scrub will detect it (although the scrub can't fix it). Similarly, a scrub will detect if a disk itself corrupted the destination of a write (or a read), or if things were corrupted somewhere in the lower level software and hardware path of the write.

There are a variety of ZFS level bugs that could theoretically write the wrong thing to the wrong place, or do something that works out to the same effect. ZFS could have a bug in free space handling (so that it incorrectly thinks some in use sectors are free and overwrites them), or it could write too much or too little, or it could correctly allocate and write data but record the location of the data incorrectly in higher level data structures, or it could accidentally not do a write (for example, if it's supposed to write a duplicate copy of some data but forgets to actually issue the IO). ZFS scrubs can detect all of these issues under the right circumstances.

(To a limited extent a ZFS scrub also checks the high level metadata of filesystems and snapshots, since it has to traverse that metadata to find the object set for each dataset and similar things. Since a scrub just verifies checksums, this won't cross check dataset level metadata like information on how much data was written in each snapshot, or the space usage.)

What little I want out of web "passkeys" in my environment

By: cks
25 October 2025 at 03:19

WebAuthn is yet another attempt to do an API for web authentication that doesn't involve passwords but that instead allows browsers, hardware tokens, and so on to do things more securely. "Passkeys" (also) is the marketing term for a "WebAuthn credential", and an increasing number of websites really, really want you to use a passkey for authentication instead of any other form of multi-factor authentication (they may or may not still require your password).

Most everyone that wants you to use passkeys also wants you to specifically use highly secure ones. The theoretically most secure are physical hardware security keys, followed by passkeys that are stored and protected in secure enclaves in various ways by the operating system (provided that the necessary special purpose hardware is available). Of course the flipside of 'secure' is 'locked in', whether locked in to your specific hardware key (or keys, generally you'd better have backups) or locked in to a particular vendor's ecosystem because their devices are the only ones that can possibly use your encrypted passkey vault.

(WebAuthn neither requires nor standardizes passkey export and import operations, and obviously security keys are built to not let anyone export the cryptographic material from them, that's the point.)

I'm extremely not interested in the security versus availability tradeoff that passkeys make in favour of security. I care far more about preserving availability of access to my variety of online accounts than about nominal high security. So if I'm going to use passkeys at all, I have some requirements:

Linux people: is there a passkeys implementation that does not use physical hardware tokens (software only), is open source, works with Firefox, and allows credentials to be backed up and copied to other devices by hand, without going through some cloud service?

I don't think I'm asking for much, but this is what I consider the minimum for me actually using passkeys. I want to be 100% sure of never losing them because I have multiple backups and can use them on multiple machines.

Apparently KeePassXC more or less does what I want (when combined with its Firefox extension), and it can even export passkeys in a plain text format (well, JSON). However, I don't know if anything else can ingest those plain text passkeys, and I don't know if KeePassXC can be told to only do passkeys with the browser and not try to take over passwords.

(But at least a plain text JSON backup of your passkeys can be imported into another KeePassXC instance without having to try to move, copy, or synchronize a KeePassXC database.)

Normally I would ignore passkeys entirely, but an increasing number of websites are clearly going to require me to use some form of multi-factor authentication, no matter how stupid this is (cf), and some of them will probably require passkeys or at least make any non-passkey option very painful. And it's possible that reasonably integrated passkeys will be a better experience than TOTP MFA with my janky minimal setup.

(Of course KeePassXC also supports TOTP, and TOTP has an extremely obvious import process that everyone supports, and I believe KeePassXC will export TOTP secrets if you ask nicely.)

While KeePassXC is okay, what I would really like is for Firefox to support 'memorized passkeys' right along with its memorized passwords (and support some kind of export and import along with it). Should people use them? Perhaps not. But it would put that choice firmly in the hands of the people using Firefox, who could decide on how much security they did or didn't want, not in the hands of websites who want to force everyone to face a real risk of losing their account so that the website can conduct security theater.

(Firefox will never support passkeys this way for an assortment of reasons. At most it may someday directly use passkeys through whatever operating system services expose them, and maybe Linux will get a generic service that works the way I want it to. Nor is Firefox ever going to support 'memorized TOTP codes'.)

Two reasons why Unix traditionally requires mount points to exist

By: cks
24 October 2025 at 02:29

Recently on the Fediverse, argv minus one asked a good question:

Why does #Linux require #mount points to exist?

And are there any circumstances where a mount can be done without a pre-existing mount point (i.e. a mount point appears out of thin air)?

I think there is one answer for why this is a good idea in general and otherwise complex to do, although you can argue about it, and then a second historical answer based on how mount points were initially implemented.

The general problem is directory listings. We obviously want and need mount points to appear in readdir() results, but in the kernel, directory listings are historically the responsibility of filesystems and are generated and returned in pieces on the fly (which is clearly necessary if you have a giant directory; the kernel doesn't read the entire thing into memory and then start giving your program slices out of it as you ask). If mount points never appear in the underlying directory, then they must be inserted at some point in this process. If mount points can sometimes exist and sometimes not, it's worse; you need to somehow keep track of which ones actually exist and then add the ones that don't at the end of the directory listing. The simplest way to make sure that mount points always exist in directory listings is to require them to have an existence in the underlying filesystem.

(This was my initial answer.)

The historical answer is that in early versions of Unix, filesystems were actually mounted on top of inodes, not directories (or filesystem objects). When you passed a (directory) path to the mount(2) system call, all it was used for was getting the corresponding inode, which was then flagged as '(this) inode is mounted on' and linked (sort of) to the new mounted filesystem on top of it. All of the things that dealt with mount points and mounted filesystem did so by inode and inode number, with no further use of the paths and the root inode of the mounted filesystem being quietly substituted for the mounted-on inode. All of the mechanics of this needed the inode and directory entry for the name to actually exist (and V7 required the name to be a directory).

I don't think modern kernels (Linux or otherwise) still use this approach to handling mounts, but I believe it lingered on for quite a while. And it's a sufficiently obvious and attractive implementation choice that early versions of Linux also used it (see the Linux 0.96c version of iget() in fs/inode.c).

Sidebar: The details of how mounts worked in V7

When you passed a path to the mount(2) system call (called 'smount()' in sys/sys3.c), it used the name to get the inode and then set the IMOUNT flag from sys/h/inode.h on it (and put the mount details in a fixed size array of mounts, which wasn't very big). When iget() in sys/iget.c was fetching inodes for you and you'd asked for an IMOUNT inode, it gave you the root inode of the filesystem instead, which worked in cooperation with name lookup in a directory (the name lookup in the directory would find the underlying inode number, and then iget() would turn it into the mounted filesystem's root inode). This gave Research Unix a simple, low code approach to finding and checking for mount points, at the cost of pinning a few more inodes into memory (not necessarily a small thing when even a big V7 system only had at most 200 inodes in memory at once, but then a big V7 system was limited to 8 mounts, see h/param.h).

We can't really do progressive rollouts of disruptive things

By: cks
23 October 2025 at 02:49

In a comment on my entry on how we reboot our machines right after updating their kernels, Jukka asked a good question:

While I do not know how many machines there are in your fleet, I wonder whether you do incremental rolling, using a small snapshot for verification before rolling out to the whole fleet?

We do this to some extent but we can't really do it very much. The core problem is that the state of almost all of our machines is directly visible and exposed to people. This is because we mostly operate an old fashioned Unix login server environment, where people specifically use particular servers (either directly by logging in to them or implicitly because their home directory is on a particular NFS fileserver). About the only genuinely generic machines we have are the nodes in our SLURM cluster, where we can take specific unused nodes out of service temporarily without anyone noticing.

(Some of these login servers are in use all of the time; others we might find idle if we're extremely lucky. But it's hard to predict when someone will show up to try to use a currently empty server.)

This means that progressively rolling out a kernel update (and rebooting things) to our important, visible core servers requires multiple people-visible reboots of machines, instead of one big downtime when everything is rebooted. Generally we feel that repeated disruptions are much more annoying and disruptive overall to people; it's better to get the pain of reboot disruptions over all at once. It's also much easier to explain to people, and we don't have to annoy them with repeated notifications that yet another subset of our servers and services will be down for a bit.

(To make an incremental deployment more painful for us, these will normally have to be after-hours downtimes, which means that we'll be repeatedly staying late, perhaps once a week for three or four weeks as we progressively work through a rollout.)

In addition to the nodes of our SLURM cluster, there are a number of servers that can be rebooted in the background to some degree without people noticing much. We will often try the kernel update out on a few of them in advance, and then update others of them earlier in the day (or the day before) both as a final check and to reduce the number of systems we have to cover at the actual out of hours downtime. But a lot of our servers cannot really be tested much in advance, such as our fileservers or our web server (which is under constant load for reasons outside the scope of this entry). We can (and do) update a test fileserver or a test web server, but neither will see a production load and it's under production loads that problems are most likely to surface.

This is a specific example of how the 'cattle' model doesn't fit all situations. To have a transparent rolling update that involves reboots (or anything else that's disruptive on a single machine), you need to be able to transparently move people off of machines and then back on to them. This is hard to get in any environment where people have long term usage of specific machines, where they have login sessions and running compute jobs and so on, and where you have non-redundant resources on a single machine (such as NFS fileservers without transparent failover from server to server).

We don't update kernels without immediately rebooting the machine

By: cks
22 October 2025 at 03:07

I've mentioned this before in passing (cf, also) but today I feel like saying it explicitly: our habit with all of our machines is to never apply a kernel update without immediately rebooting the machine into the new kernel. On our Ubuntu machines this is done by holding the relevant kernel packages; on my Fedora desktops I normally run 'dnf update --exclude "kernel*"' unless I'm willing to reboot on the spot.

The obvious reason for this is that we want to switch to the new kernel under controlled, attended conditions when we'll be able to take immediate action if something is wrong, rather than possibly have the new kernel activate at some random time without us present and paying attention if there's a power failure, a kernel panic, or whatever. This is especially acute on my desktops, where I use ZFS by building my own OpenZFS packages and kernel modules. If something goes wrong and the kernel modules don't load or don't work right, an unattended reboot can leave my desktops completely unusable and off the network until I can get to them. I'd rather avoid that if possible (sometimes it isn't).

(In general I prefer to reboot my Fedora machines with me present because weird things happen from time to time and sometimes I make mistakes, also.)

The less obvious reason is that when you reboot a machine right after applying a kernel update, it's clear in your mind that the machine has switched to a new kernel. If there are system problems in the days immediately after the update, you're relatively likely to remember this and at least consider the possibility that the new kernel is involved. If you apply a kernel update, walk away without rebooting, and the machine reboots a week and a half later for some unrelated reason, you may not remember that one of the things the reboot did was switch to a new kernel.

(Kernels aren't the only thing that this can happen with, since not all system updates and changes take effect immediately when made or applied. Perhaps one should reboot after making them, too.)

I'm assuming here that your Linux distribution's package management system is sensible, so there's no risk of losing old kernels (especially the one you're currently running) merely because you installed some new ones but didn't reboot into them. This is how Debian and Ubuntu behave (if you don't 'apt autoremove' kernels), but not quite how Fedora's dnf does it (as far as I know). Fedora dnf keeps the N most recent kernels around and probably doesn't let you remove the currently running kernel even if it's more than N kernels old, but I don't believe it tracks whether or not you've rebooted into those N kernels and stretches the N out if you haven't (or removes more recent installed kernels that you've never rebooted into, instead of older kernels that you did use at one point).

PS: Of course if kernel updates were perfect this wouldn't matter. However this isn't something you can assume for the Linux kernel (especially as patched by your distribution), as we've sometimes seen. Although big issues like that are relatively uncommon.

We (I) need a long range calendar reminder system

By: cks
21 October 2025 at 03:05

About four years ago I wrote an entry about how your SMART drive database of attribute meanings needs regular updates. That entry was written on the occasion of updating the database we use locally on our Ubuntu servers, and at the time we were using a mix of Ubuntu 18.04 and Ubuntu 20.04 servers, both of which had older drive databases that probably dated from early 2018 and early 2020 respectively. It is now late 2025 and we use a mix of Ubuntu 24.04 and 22.04 servers, both of which have drive databases that are from after October of 2021.

Experienced system administrators know where this one is going: today I updated our SMART drive database again, to a version of the SMART database that was more recent than the one shipped with 24.04 instead of older than it.

It's a fact of life that people forget things. People especially forget things that are a long way away, even if they make little notes in their worklog message when recording something that they did (as I did four years ago). It's definitely useful to plan ahead in your documentation and write these notes, but without an external thing to push you or something to explicitly remind you, there's no guarantee that you'll remember.

All of which leads me to the view that it would be useful for us to have a long range calendar reminder system, something that could be used to set reminders for more than a year into the future and ideally allow us to write significant email messages to our future selves to cover all of the details (although there are hacks around that, such as putting the details on a web page and having the calendar mail us a link). Right now the best calendar reminder system we have is the venerable calendar, which we can arrange to have email one-line notes to our general address that reaches all sysadmins, but calendar doesn't let you include the year in the reminder date.

(For SMART drive database updates, we could get away with mailing ourselves once a year in, say, mid-June. It doesn't hurt to update the drive database more than every Ubuntu LTS release. But there are situations where a reminder several years in the future is what we want.)

PS: Of course it's not particularly difficult to build an ad-hoc script system to do this, with various levels of features. But every local ad-hoc script that we write is another little bit of overhead, and I'd like to avoid that kind of thing if at all possible in favour of a standard solution (that isn't a shared cloud provider calendar).
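
To illustrate what I mean by an ad-hoc script system, a minimal version might be nothing more than a daily cron job along these lines; the data file location and mail alias here are made up.

#!/usr/bin/env python3
# Hypothetical minimal long range reminder script, run daily from cron.
# It reads lines of 'YYYY-MM-DD reminder text' and mails any due today.
import datetime
import smtplib
from email.message import EmailMessage

REMINDERS = "/var/local/reminders"      # made up data file
TO = "sysadmins@example.org"            # made up mail alias

def main():
    today = datetime.date.today().isoformat()
    with open(REMINDERS) as fp:
        for line in fp:
            date, _, text = line.strip().partition(" ")
            if date == today and text:
                msg = EmailMessage()
                msg["Subject"] = "Reminder: " + text
                msg["From"] = TO
                msg["To"] = TO
                msg.set_content(text)
                with smtplib.SMTP("localhost") as s:
                    s.send_message(msg)

if __name__ == "__main__":
    main()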

We need to start doing web blocking for non-technical reasons

By: cks
20 October 2025 at 03:37

My sense is that for a long time, technical people (system administrators, programmers, and so on) have seen the web as something that should be open by default and by extension, a place where we should only block things for 'technical' reasons. Common technical reasons are a harmful volume of requests or clear evidence of malign intentions, such as probing for known vulnerabilities. Otherwise, if it wasn't harming your website and wasn't showing any intention to do so, you should let it pass. I've come to think that in the modern web this is a mistake, and we need to be willing to use blocking and other measures for 'non-technical' reasons.

The core problem is that the modern web seems to be fragile and is kept going in large part by a social consensus, not technical things such as capable software and powerful servers. However, if we only react to technical problems, there's very little that preserves and reinforces this social consensus, as we're busy seeing. With little to no consequences for violating the social consensus, bad actors are incentivized to skate right up to and even over the line of causing technical problems. When we react by taking only narrow technical measures, we tacitly reward the bad actors for their actions; they can always find another technical way. They have no incentive to be nice or to even vaguely respect the social consensus, because we don't punish them for it.

So I've come to feel that if something like the current web is to be preserved, we need to take action not merely when technical problems arise but also when the social consensus is violated. We need to start blocking things for what I called editorial reasons. When software or people do things that merely shows bad manners and doesn't yet cause us technical problems, we should still block it, either soft (temporarily, perhaps with HTTP 429 Too Many Requests) or hard (permanently). We need to take action to create the web that we want to see, or we aren't going to get it or keep it.

To put it another way, if we want to see good, well behaved browsers, feed readers, URL fetchers, crawlers, and so on, we have to create disincentives for ones that are merely bad (as opposed to actively damaging). In its own way, this is another example of the refutation of Postel's Law. If we accept random crap to be friendly, we get random crap (and the quality level will probably trend down over time).

To answer one potential criticism, it's true that in some sense, blocking and so on for social reasons is not good and is in some theoretical sense arguably harmful for the overall web ecology. On the other hand, the current unchecked situation itself is also deeply harmful for the overall web ecology and it's only going to get worse if we do nothing, with more and more things effectively driven off the open web. We only get to pick the poison here.

I wish SSDs gave you CPU performance style metrics about their activity

By: cks
19 October 2025 at 02:54

Modern CPUs have an impressive collection of performance counters for detailed, low level information on things like cache misses, branch mispredictions, various sorts of stalls, and so on; on Linux you can use 'perf list' to see them all. Modern SSDs (NVMe, SATA, and SAS) are all internally quite complex, and their behavior under load depends on a lot of internal state. It would be nice to have CPU performance counter style metrics to expose some of those details. For a relevant example that's on my mind (cf), it certainly would be interesting to know how often flash writes had to stall while blocks were hastily erased, or the current erase rate.

Having written this, I checked some of our SSDs (the ones I'm most interested in at the moment) and I see that our SATA SSDs do expose some of this information as (vendor specific) SMART attributes, with things like 'block erase count' and 'NAND GB written' to TLC or SLC (as well as the host write volume and other stuff you'd expect). NVMe does this in a different way that doesn't have the sort of easy flexibility that SMART attributes do, so a random one of ours that I checked doesn't seem to provide this sort of lower level information.

It's understandable that SSD vendors don't necessarily want to expose this sort of information, but it's quite relevant if you're trying to understand unusual drive performance. For example, for your workload do you need to TRIM your drives more often, or do they have enough pre-erased space available when you need it? Since TRIM has an overhead, you may not want to blindly do it on a frequent basis (and its full effects aren't entirely predictable since they depend on how much the drive decides to actually erase in advance).

(Having looked at SMART 'block erase count' information on one of our servers, it's definitely doing something when the server is under heavy fsync() load, but I need to cross-compare the numbers from it to other systems in order to get a better sense of what's exceptional and what's not.)

I'm currently more focused on write related metrics, but there's probably important information that could be exposed for reads and for other operations. I'd also like it if SSDs provided counters for how many of various sorts of operations they saw, because while your operating system can in theory provide this, it often doesn't (or doesn't provide them at the granularity of, say, how many writes with 'Force Unit Access' or how many 'Flush' operations were done).

(In Linux, I think I'd have to extract this low level operation information in an ad-hoc way with eBPF tracing.)
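
For example, I believe something like this bpftrace one-liner (which I haven't carefully verified) would count block requests by their 'rwbs' flag string, where a plain flush shows up as 'F' and a FUA write as something like 'WFS':

bpftrace -e 'tracepoint:block:block_rq_issue { @[str(args->rwbs)] = count(); }'

In real life you'd probably want to filter by device, but this is the general idea.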

A (filesystem) journal can be a serialization point for durable writes

By: cks
18 October 2025 at 02:57

Suppose that you have a filesystem that uses some form of a journal to provide durability (as many do these days) and you have a bunch of people (or processes) writing and updating things all over the filesystem that they want to be durable, so these processes are all fsync()'ing their work on a regular basis (or the equivalent system call or synchronous write operation). In a number of filesystem designs, this creates a serialization point on the filesystem's journal.

This is related to the traditional journal fsync() problem, but that one is a bit different. In the traditional problem you have a bunch of changes from a bunch of processes, some of which one process wants to fsync() and most of which it doesn't; this can be handled by only flushing necessary things. Here we have a bunch of processes making a bunch of relatively independent changes but approximately all of the processes want to fsync() their changes.

The simple way to get durability (and possibly integrity) for fsync() is to put everything that gets fsync()'d into the journal (either directly or indirectly) and then force the journal to be durably committed to disk. If the filesystem's journal is a linear log, as is usually the case, this means that multiple processes mostly can't be separately writing and flushing journal entries at the same time. Each durable commit of the journal is a bottleneck for anyone who shows up 'too late' to get their change included in the current commit; they have to wait for the current commit to be flushed to disk before they can start adding more entries to the journal (but then everyone can be bundled into the next commit).

In some filesystems, processes can readily make durable writes outside of the journal (for example, overwriting something in place); such processes can avoid serializing on a linear journal. Even if they have to put something in the journal, you can perhaps minimize the direct linear journal contents by having them (durably) write things to various blocks independently, then put only compact pointers to those out of line blocks into the linear journal with its serializing, linear commits. The goal is to avoid having someone show up wanting to write megabytes 'to the journal' and forcing everyone to wait for their fsync(); instead people serialize only on writing a small bit of data at the end, and writing the actual data happens in parallel (assuming the disk allows that).

(I may have made this sound simple but the details are likely fiendishly complex.)

If you have a filesystem in this situation, and I believe one of them is ZFS, you may find you care a bunch about the latency of disks flushing writes to media. Of course you need the workload too, but there are certain sorts of workloads that are prone to this (for example, traditional Unix mail spools).

I believe that you can also see this sort of thing with databases, although they may be more heavily optimized for concurrent durable updates.

Sidebar: Disk handling of durable writes can also be a serialization point

Modern disks (such as NVMe SSDs) broadly have two mechanisms to force things to durable storage. You can issue specific writes of specific blocks with 'Force Unit Access' (FUA) set, which causes the disk to write those blocks (and not necessarily any others) to media, or you can issue a general 'Flush' command to the disk and it will write anything it currently has in its write cache to media.

If you issue FUA writes, you don't have to wait for anything else other than your blocks to be written to media. If you issue 'Flush', you get to wait for everyone's blocks to be written out. This means that for speed you want to issue FUA writes when you want things on media, but on the other hand you may have already issued non-FUA writes for some of the blocks before you found out that you wanted them on media (for example, if someone writes a lot of data, so much that you start writeback, and then they issue a fsync()). And in general, the block IO programming model inside your operating system may favour issuing a bunch of regular writes and then inserting a 'force everything before this point to media' fencing operation into the IO stream.

NVMe SSDs and the question of how fast they can flush writes to flash

By: cks
17 October 2025 at 03:17

Over on the Fediverse, I had a question I've been wondering about:

Disk drive people, sysadmins, etc: would you expect NVMe SSDs to be appreciably faster than SATA SSDs for a relatively low bandwidth fsync() workload (eg 40 Mbytes/sec + lots of fsyncs)?

My naive thinking is that AFAIK the slow bit is writing to the flash chips to make things actually durable when you ask, and it's basically the same underlying flash chips, so I'd expect NVMe to not be much faster than SATA SSDs on this narrow workload.

This is probably at least somewhat wrong. This 2025 SSD hierarchy article doesn't explicitly cover forced writes to flash (the fsync() case), but it does cover writing 50 GBytes of data in 30,000 files, which is probably enough to run any reasonable consumer NVMe SSD out of fast write buffer storage (either RAM or fast flash). The write speeds they get on this test from good NVMe drives are well over the maximum SATA data rates, so there's clearly a sustained write advantage to NVMe SSDs over SATA SSDs.

In replies on the Fediverse, several people pointed out that NVMe SSDs are likely using newer controllers than SATA SSDs and these newer controllers may well be better at handling writes. This wasn't surprising once I thought about it, especially in light of NVMe perhaps overtaking SATA for SSDs, although apparently 'enterprise' SATA/SAS SSDs are still out there and probably seeing improvements (unlike consumer SATA SSDs where price is the name of the game).

Also, apparently the real bottleneck in writing to the actual flash is finding erased blocks or, if you're unlucky, having to wait for blocks to be erased. Actual writes to the flash chips may be able to go at something close to the PCIe 3.0 (or better) bandwidth, which would help explain the Tom's Hardware large write figures (cf).

(If this is the case, then explicitly telling SSDs about discarded blocks is especially important for any write workload that will be limited by flash write speeds, including fsync() heavy workloads.)

PS: The reason I'm interested in this is that we have a SATA SSD based system that seems to have periodic performance issues related to enough write IO combined with fsync()s (possibly due to write buffering interactions), and I've been wondering how much moving it to be NVMe based might help. Since this machine uses ZFS, perhaps one thing we should consider is manually doing some ZFS 'TRIM' operations.
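
(For the record, the manual version is simple, with a made up pool name here:

zpool trim tank
zpool status -t tank    # per-vdev TRIM progress and when it last ran

OpenZFS also has an 'autotrim' pool property, but it deliberately skips ranges it considers too small, so the documentation suggests an occasional full 'zpool trim' is still worthwhile.)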

The strange case of 'mouse action traps' in GNU Emacs with (slower) remote X

By: cks
16 October 2025 at 02:18

Some time back over on the Fediverse, I groused about GNU Emacs tooltips. That grouse was a little imprecise; the situation I usually see problems with is specifically running GNU Emacs in SSH-forwarded X from home, which has a somewhat high latency. This high latency caused me to change how I opened URLs from GNU Emacs, and it seems to be the root of the issues I'm seeing.

The direct experience I was having with tooltips was that being in a situation where Emacs might want to show a GUI tooltip would cause Emacs to stop responding to my keystrokes for a while. If the tooltip was posted and visible it would stay visible, but the stall could happen without that. However, it doesn't seem to be tooltips as such that cause this problem, because even with tooltips disabled as far as I can tell (and certainly not appearing), the cursor and my interaction with Emacs can get 'stuck' in places where there's mouse actions available.

(I tried both setting the tooltip delay times to very large numbers and setting tooltip-functions to do nothing.)

This is especially visible to me because my use of MH-E is prone to this in two cases. First, when composing email flyspell mode will attach a 'correct word' button-2 popup menu to misspelled words, which can then stall things if I move the cursor to them (especially if I use a mouse click to do so, perhaps because I want to make the word into an X selection). Second, when displaying email that has links in it, these links can be clicked on (and have hover tooltips to display what the destination URL is); what I frequently experience is that after I click on a link, when I come back to the GNU Emacs (X) window I can't immediately switch to the next message, scroll the text of the current message, or otherwise do things.

This 'trapping' and stall doesn't usually happen when I'm in the office, which is still using remote X but over a much faster and lower latency 1G network connection. Disabling tooltips themselves isn't ideal because it means I no longer get to see where links go, and anyway it's relatively pointless if it doesn't fix the real problem.

When I thought this was an issue specific to tooltips, it made sense to me because I could imagine that GNU Emacs needed to do a bunch of relatively synchronous X operations to show or clear a tooltip, and those operations could take a while over my home link. Certainly displaying regular GNU Emacs (X) menus isn't particularly fast. Without tooltips displaying it's more mysterious, but it's still possible that Emacs is doing a bunch of X operations when it thinks a mouse or tooltip target is 'active', or perhaps there's something else going on.

(I'm generally happy with GNU Emacs but that doesn't mean it's perfect or that I don't have periodic learning experiences.)

PS: In theory there are tools that can monitor and report on the flow of X events (by interposing themselves into it). In practice it's been a long time since I used any of them, and anyway there's probably nothing I can do about it if GNU Emacs is doing a lot of X operations. Plus it's probably partly the GTK toolkit at work, not GNU Emacs itself.

PPS: Having taken a brief look at the MH-E code, I'm pretty sure that it doesn't even begin to work with GNU Emacs' TRAMP (also) system for working with remote files. TRAMP has some support for running commands remotely, but MH-E has its own low-level command execution and assumes that it can run commands rapidly, whenever it feels like, and then read various results out of the filesystem. Probably the most viable approach would be to use sshfs to mount your entire ~/Mail locally, have a local install of (N)MH, and then put shims in for the very few MH commands that have to run remotely (such as inc and the low level post command that actually sends out messages you've written). I don't know if this would work very well, but it would almost certainly be better than trying to run all those MH commands remotely.

Staring at code can change what I see (a story from long ago)

By: cks
15 October 2025 at 03:14

I recently read Hillel Wayne's Sapir-Whorf does not apply to Programming Languages (via), which I will characterize as being about how programming can change how you see things even though the Sapir-Whorf hypothesis doesn't apply (Hillel Wayne points to the Tetris Effect). As it happens, long ago I experienced a particular form of this that still sticks in my memory.

Many years ago, I was recruited to be a TA for the university's upper year Operating Systems course, despite being an undergraduate at the time. One of the jobs of TAs was to mark assignments, which we did entirely by hand back in those days; any sort of automated testing was far in the future, and for these assignments I don't think we even ran the programs by hand. Instead, marking was mostly done by having students hand in printouts of their modifications to the course's toy operating system and we three TAs collectively scoured the result to see if they'd made the necessary changes and spot errors.

Since this was an OS course, some assignments required dealing with concurrency, which meant that students had to properly guard and insulate their changes (in, for example, memory handling) from various concurrency problems. Failure to completely do so would cost marks, so the TAs were on the lookout for such problems. Over the course of the course, I got very good at spotting these concurrency problems entirely by eye in the printed out code. I didn't really have to think about it, I'd be reading the code (or scanning it) and the problem would jump out at me. In the process I formed a firm view that concurrency is very hard for people to deal with, because so many students made so many mistakes (whether obvious or subtle).

(Since students were modifying the toy OS to add or change features, there was no set form that their changes had to follow; people implemented the new features in various different ways. This meant that their concurrency bugs had common patterns but not specific common forms.)

I could have thought that I was spotting these problems because I was a better programmer than these other undergraduate students (some of whom were literally my peers, it was just that I'd taken the OS course a year earlier than they had because it was one of my interests). However, one of the most interesting parts of the whole experience was getting pretty definitive proof that I wasn't, and it was my focused experience that made the difference. One of the people taking this course was a fellow undergraduate who I knew and I knew was a better programmer than I was, but when I was marking his version of one assignment I spotted what I viewed at the time as a reasonably obvious concurrency issue. So I wasn't seeing these issues when the undergraduates doing the assignment missed them because I was a better programmer, since here I wasn't: I was seeing the bugs because I was more immersed in this than they were.

(This also strongly influenced my view of how hard and tricky concurrency is. Here was a very smart programmer, one with at least some familiarity with the whole area, and they'd still made a mistake.)

Uses for DNS server delegation

By: cks
14 October 2025 at 03:52

A commentator on my entry on systemd-resolved's new DNS server delegation feature asked:

My memory might fail me here, but: wasn't something like this a feature introduced in ISC's BIND 8, and then considered to be a bad mistake and dropped again in BIND 9 ?

I don't know about Bind, but what I do know is that this feature is present in other DNS resolvers (such as Unbound) and that it has a variety of uses. Some of those uses can be substituted with other features and some can't be, at least not as-is.

The quick version of 'DNS server delegation' is that you can send all queries under some DNS zone name off to some DNS server (or servers) of your choice, rather than have DNS resolution follow any standard NS delegation chain that may or may not exist in global DNS. In Unbound, this is done through, for example, Forward Zones.
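
For illustration, the Unbound version looks something like this (with made up server IPs); queries for anything at or under the zone get sent to the listed servers instead of following normal resolution:

forward-zone:
    name: "sandbox."
    forward-addr: 192.0.2.10
    forward-addr: 192.0.2.11

(If the servers you're pointing at are authoritative rather than recursive, Unbound's 'stub-zone:' with 'stub-addr:' is the more precise feature, but forward zones are the general purpose hammer.)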

DNS server delegation has at least three uses that I know of. First, you can use it to insert entire internal TLD zones into the view that clients have. People use various top level names for these zones, such as .internal, .kvm, .sandbox (our choice), and so on. In all cases you have some authoritative servers for these zones and you need to direct queries to these servers instead of having your queries go to the root nameservers and be rejected.

(Obviously you will be sad if IANA ever assigns your internal TLD to something, but honestly if IANA allows, say, '.internal', we'll have good reason to question their sanity. The usual 'standard DNS environment' replacement for this is to move your internal TLD to be under your organizational domain and then implement split horizon DNS.)

Second, you can use it to splice in internal zones that don't exist in external DNS without going to the full overkill of split horizon authoritative data. If all of your machines live in 'corp.example.org' and you don't expose this to the outside world, you can have your public example.org servers with your public data and your corp.example.org authoritative servers, and you splice in what is effectively a fake set of NS records through DNS server delegation. Related to this, if you want you can override public DNS simply by having an internal and an external DNS server, without split horizon DNS; you use DNS server delegation to point to the internal DNS server for certain zones.

(This can be replaced with split horizon DNS, although maintaining split horizon DNS is its own set of headaches.)

Finally, you can use this to short-cut global DNS resolution for reliability in cases where you might lose external connectivity. For example, there are within-university ('on-campus' in our jargon) authoritative DNS servers for .utoronto.ca and .toronto.edu. We can use DNS server delegation to point these zones at these servers to be sure we can resolve university names even if the university's external Internet connection goes down. We can similarly point our own sub-zone at our authoritative servers, so even if our link to the university backbone goes down we can resolve our own names.

(This isn't how we actually implement this; we have a more complex split horizon DNS setup that causes our resolving DNS servers to have a complete copy of the inside view of our zones, acting as caching secondaries.)

The early Unix history of chown() being restricted to root

By: cks
13 October 2025 at 03:37

A few years ago I wrote about the divide in chown() about who got to give away files, where BSD and V7 were on one side, restricting it to root, while System III and System V were on the other, allowing the owner to give them away too. At the time I quoted the V7 chown(2) explanation of this:

[...] Only the super-user may execute this call, because if users were able to give files away, they could defeat the (nonexistent) file-space accounting procedures.

Recently, for reasons, chown(2) and its history was on my mind and so I wondered if the early Research Unixes had always had this, or if a restriction was added at some point.

The answer is that the restriction was added in V6, where the V6 chown(2) manual page has the same wording as V7. In Research Unix V5 and earlier, people can chown(2) away their own files; this is documented in the V4 chown(2) manual page and is what the V5 kernel code for chown() does. This behavior runs all the way back to the V1 chown() manual page, with an extra restriction that you can't chown() setuid files.

(Since I looked it up, the restriction on chown()'ing setuid files was lifted in V4. In V4 and later, a setuid file has its setuid bit removed on chown; in V3 you still can't give away such a file, according to the V3 chown(2) manual page.)

At this point you might wonder where the System III and System V unrestricted chown came from. The surprising to me answer seems to be that System III partly descends from PWB/UNIX, and PWB/UNIX 1.0, although it was theoretically based on V6, has pre-V6 chown(2) behavior (kernel source, manual page). I suspect that there's a story both to why V6 made chown() more restricted and also why PWB/UNIX specifically didn't take that change from V6, but I don't know if it's been documented anywhere (a casual Internet search didn't turn up anything).

(The System III chown(2) manual page says more or less the same thing as the PWB/UNIX manual page, just more formally, and the kernel code is very similar.)

Maybe why OverlayFS had its readdir() inode number issue

By: cks
12 October 2025 at 02:53

A while back I wrote about readdir()'s inode numbers versus OverlayFS, which discussed an issue where for efficiency reasons, OverlayFS sometimes returned different inode numbers in readdir() than in stat(). This is not POSIX legal unless you do some pretty perverse interpretations (as covered in my entry), but lots of filesystems deviate from POSIX semantics every so often. A more interesting question is why, and I suspect the answer is related to another issue that's come up, the problem of NFS exports of NFS mounts.

What's common in both cases is that NFS servers and OverlayFS both must create an 'identity' for a file (a NFS filehandle and an inode number, respectively). In the case of NFS servers, this identity has some strict requirements; OverlayFS has a somewhat easier life, but in general it still has to create and track some amount of information. Based on reading the OverlayFS article, I believe that OverlayFS considers this expensive enough to only want to do it when it has to.

OverlayFS definitely needs to go to this effort when people call stat(), because various programs will directly use the inode number (the POSIX 'file serial number') to tell files on the same filesystem apart. POSIX technically requires OverlayFS to do this for readdir(), but in practice almost everyone that uses readdir() isn't going to look at the inode number; they look at the file name and perhaps the d_type field to spot directories without needing to stat() everything.

If there was a special 'not a valid inode number' signal value, OverlayFS might use that, but there isn't one (in either POSIX or Linux, which is actually a problem). Since OverlayFS needs to provide some sort of arguably valid inode number, and since it's reading directories from the underlying filesystems, passing through their inode numbers from their d_ino fields is the simple answer.
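
If you want to check a directory for this mismatch yourself, it's easy to do from Python, because os.scandir() hands back the d_ino value without stat()'ing anything; this is just a quick sketch:

import os, sys

where = sys.argv[1] if len(sys.argv) > 1 else "."
for entry in os.scandir(where):
    d_ino = entry.inode()      # readdir()'s d_ino
    st_ino = os.stat(entry.path, follow_symlinks=False).st_ino
    if d_ino != st_ino:
        print(f"{entry.name}: d_ino {d_ino} != st_ino {st_ino}")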

(This entry was inspired by Kevin Lyda's comment on my earlier entry.)

Sidebar: Why there should be a 'not a valid inode number' signal value

Because both standards and common Unix usage include a d_ino field in the structure readdir() returns, they embed the idea that the stat()-visible inode number can easily be recovered or generated by filesystems purely by reading directories, without needing to perform additional IO. This is true in traditional Unix filesystems, but it's not obvious that you would do that all of the time in all filesystems. The on disk format of directories might only have some sort of object identifier for each name that's not easily mapped to a relatively small 'inode number' (which is required to be some C integer type), and instead the 'inode number' is an attribute you get by reading file metadata based on that object identifier (which you'll do for stat() but would like to avoid for reading directories).

But in practice if you want to design a Unix filesystem that performs decently well and doesn't just make up inode numbers in readdir(), you must store a potentially duplicate copy of your 'inode numbers' in directory entries.

Keeping notes is for myself too, illustrated (once again)

By: cks
11 October 2025 at 03:18

Yesterday I wrote about restarting or redoing something after a systemd service restarts. The non-hypothetical situation that caused me to look into this was that after we applied a package update to one system, systemd-networkd on it restarted and wiped out some critical policy based routing rules. Since I vaguely remembered this happening before, I sighed and arranged to have our rules automatically reapplied on both systems with policy based routing rules, following the pattern I worked out.

Wait, two systems? And one of them didn't seem to have problems after the systemd-networkd restart? Yesterday I ignored that and forged ahead, but really it should have set off alarm bells. The reason the other system wasn't affected was I'd already solved the problem the right way back in March of 2024, when we first hit this networkd behavior and I wrote an entry about it.

However, I hadn't left myself (or my co-workers) any notes about that March 2024 fix; I'd put it into place on the first machine (then the only machine we had that did policy based routing) and forgotten about it. My only theory is that I wanted to wait and be sure it actually fixed the problem before documenting it as 'the fix', but if so, I made a mistake by not leaving myself any notes that I had a fix in testing. When I recently built the second machine with policy based routing I copied things from the first machine, but I didn't copy the true networkd fix because I'd forgotten about it.

(It turns out to have been really useful that I wrote that March 2024 entry because it's the only documentation I have, and I'd probably have missed the real fix if not for it. I rediscovered it in the process of writing yesterday's entry.)

I know (and knew) that keeping notes is good, and that my memory is fallible. And I still let this slip through the cracks for whatever reason. Hopefully the valuable lesson I've learned from this will stick a bit so I don't stub my toe again.

(One obvious lesson is that I should make a note to myself any time I'm testing something that I'm not sure will actually work. Since it may not work I may want to formally document it in our normal system for this, but a personal note will keep me from completely losing track of it. You can see the persistence of things 'in testing' as another example of the aphorism that there's nothing as permanent as a temporary fix.)

Restarting or redoing something after a systemd service restarts

By: cks
10 October 2025 at 03:21

Suppose, not hypothetically, that your system is running some systemd based service or daemon that resets or erases your carefully cultivated state when it restarts. One example is systemd-networkd, although you can turn that off (or parts of it off, at least), but there are likely others. To clean up after this happens, you'd like to automatically restart or redo something after a systemd unit is restarted. Systemd supports this, but I found it slightly unclear how you want to do this and today I poked at it, so it's time for notes.

(This is somewhat different from triggering one unit when another unit becomes active, which I think is still not possible in general.)

First, you need to put whatever you want to do into a script and a .service unit that will run the script. The traditional way to run a script through a .service unit is:

[Unit]
....

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/your/script/here

[Install]
WantedBy=multi-user.target

(The 'RemainAfterExit' is load-bearing, also.)

To get this unit to run after another unit is started or restarted, what you need is PartOf=, which causes your unit to be stopped and started when the other unit is, along with 'After=' so that your unit starts after the other unit instead of racing it (which could be counterproductive when what you want to do is fix up something from the other unit). So you add:

[Unit]
...
PartOf=systemd-networkd.service
After=systemd-networkd.service

(This is what works for me in light testing. This assumes that the unit you want to re-run after is normally always running, as systemd-networkd is.)

In testing, you don't need to have your unit specifically enabled by itself, although you may want it to be for clarity and other reasons. Even if your unit isn't specifically enabled, systemd will start it after the other unit because of the PartOf=. If the other unit is started all of the time (as is usually the case for systemd-networkd), this effectively makes your unit enabled, although not in an obvious way (which is why I think you should specifically 'systemctl enable' it, to make it obvious). I think you can have your .service unit enabled and active without having the other unit enabled, or even present.

You can declare yourself PartOf a .target unit, and some stock package systemd units do for various services. And a .target unit can be PartOf a .service; on Fedora, 'sshd-keygen.target' is PartOf sshd.service in a surprisingly clever little arrangement to generate only the necessary keys through a templated 'sshd-keygen@.service' unit.

I admit that the whole collection of Wants=, Requires=, Requisite=, BindsTo=, PartOf=, Upholds=, and so on are somewhat confusing to me. In the past, I've used the wrong version and suffered the consequences, and I'm not sure I have them entirely right in this entry.

Note that as far as I know, PartOf= has those Requires= consequences, where if the other unit is stopped, yours will be too. In a simple 'run a script after the other unit starts' situation, stopping your unit does nothing and can be ignored.

(If this seems complicated, well, I think it is, and I think one part of the complication is that we're trying to use systemd as an event-based system when it isn't one.)

Systemd-resolved's new 'DNS Server Delegation' feature (as of systemd 258)

By: cks
9 October 2025 at 03:04

A while ago I wrote an entry about things that resolved wasn't for as of systemd 251. One of those things was arbitrary mappings of (DNS) names to DNS servers, for example if you always wanted queries for *.internal.example.org to go to a special DNS server. Systemd-resolved didn't have a direct feature for this and attempting to attach such name-to-server mappings to a network interface could go wrong in various ways. Well, time marches on and as of systemd v258 this is no longer the state of affairs.

Systemd v258 introduces systemd.dns-delegate files, which allow you to map DNS names to DNS servers independently from network interfaces. The release notes describe this as:

A new DNS "delegate zone" concept has been introduced, which are additional lookup scopes (on top of the existing per-interface and the one global scope so far supported in resolved), which carry one or more DNS server addresses and a DNS search/routing domain. It allows routing requests to specific domains to specific servers. Delegate zones can be configured via drop-ins below /etc/systemd/dns-delegate.d/*.dns-delegate.

Since systemd v258 is very new I don't have any machines where I can actually try this out, but based on the systemd.dns-delegate documentation, you can use this both for domains that you merely want diverted to some DNS server and also domains that you also want on your search path. Per resolved.conf's Domains= documentation, the latter is 'Domains=example.org' (example.org will be one of the domains that resolved tries to find single-label hostnames in, a search domain), and the former is 'Domains=~example.org' (where we merely send queries for everything under 'example.org' off to whatever DNS= you set, a route-only domain).
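
Since I haven't been able to run v258 yet, this is only my reading of the documentation, but I believe a delegation drop-in looks roughly like the following, with a made up name and IP (treat the section name and settings as assumptions on my part):

# /etc/systemd/dns-delegate.d/corp.dns-delegate
[Delegate]
DNS=192.0.2.53
Domains=~corp.example.org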

(While resolved.conf's Domains= officially promises to check your search domains in the order you listed them, I believe this is strictly for a single 'Domains=' setting for a single interface. If you have multiple 'Domains=' settings, for example in a global resolved.conf, a network interface, and now in a delegation, I think systemd-resolved makes no promises.)

Right now, these DNS server delegations can only be set through static files, not manipulated through resolvectl. I believe fiddling with them through resolvectl is on the roadmap, but for now I guess we get to restart resolved if we need to change things. In fact resolvectl doesn't expose anything to do with them, although I believe read-only information is available via D-Bus and maybe varlink.

Given the timing of systemd v258's release relative to Fedora releases, I probably won't be able to use this feature until Fedora 44 in the spring (Fedora 42 is current and Fedora 43 is imminent, which won't have systemd v258 given that v258 was released only a couple of weeks ago). My current systemd-resolved setup is okay (if it wasn't I'd be doing something else), but I can probably find uses for these delegations to improve it.

Why I have a GPS bike computer

By: cks
8 October 2025 at 03:42

(This is a story about technology. Sort of.)

Many bicyclists with a GPS bike computer probably have it primarily to record their bike rides and then upload them to places like Strava. I'm a bit unusual in that while I do record my rides and make some of them public, and I've come to value this, it's not my primary reason to have a GPS bike computer. Instead, my primary reason is following pre-made routes.

When I started with my recreational bike club, it was well before the era of GPS bike computers. How you followed (or led) our routes back then was through printed cue sheets, which had all of the turns and so on listed in order, often with additional notes. One of the duties of the leader of the ride was printing out a sufficient number of cue sheets in advance and distributing them to interested parties before the start of the ride. If you were seriously into using cue sheets, you'd use a cue sheet holder (nowadays you can only find these as 'map holders', which is basically the same job); otherwise you might clip the cue sheet to a handlebar brake or gear cable or fold it up and stick it in a back jersey pocket.

Printed cue sheets have a number of nice features, such as giving you a lot of information at a glance. One of them is that a well done cue sheet was and is a lot more than just a list of all of the turns and other things worthy of note; it's an organized, well formatted list of these. The cues would be broken up into sensibly chosen sections, with whitespace between them to make it easier to narrow in on the current one, and you'd lay out the page (or pages) so that the cue or section breaks happened at convenient spots to flip the cue sheet around in cue holders or clips. You'd emphasize important turns, cautions, or other things in various ways. And so on. Some cue sheets even had a map of the route printed on the back.

(You needed to periodically flip the cue sheet around and refold it because many routes had too many turns and other cues to fit in a small amount of printed space, especially if you wanted to use a decently large font size for easy readability.)

Starting in the early 2010s, more and more TBN people started using GPS bike computers or smartphones (cf). People began converting our cue sheet routes to computerized GPS routes, with TBN eventually getting official GPS routes. Over time, more and more members got smartphones and GPS units and there was more and more interest in GPS routes and less and less interest in cue sheets. In 2015 I saw the writing on the wall for cue sheets and the club more or less deprecated them, so in August 2016 I gave in and got a GPS unit (which drove me to finally get a smartphone, because my GPS unit assumed you had one). Cue sheet first routes lingered on for some years afterward, but they're all gone by now; everything is GPS route first.

You can still get cue sheets for club routes (the club's GPS routes typically have turn cues and you can export these into something you can print). But what we don't really have any more is the old school kind of well done, organized cue sheets, and it's basically been a decade since ride leaders would turn up with any printed cue sheets at all. These days it's on you to print your own cue sheet if you need it, and also on you to make a good cue sheet from the basic cue sheet (if you care enough to do so). There are some people who still use cue sheets, but they're a decreasing minority and they probably already had the cue sheet holders and so on (which are now increasingly hard to find). A new rider who wanted to use cue sheets would have an uphill struggle and they might never understand why long time members could be so fond of them.

Cue sheets are still a viable option for route following (and they haven't fundamentally changed). They're just not very well supported any more in TBN because they stopped being popular. If you insist on sticking with them, you still can, but it's not going to be a great experience. I didn't move to a GPS unit because I couldn't possibly use cue sheets any more (I still have my cue sheet holder); I moved because I could see the writing on the wall about which one would be the more convenient, more usable option.

Applications to the (computing) technologies of your choice are left as an exercise for the reader.

PS: As a whole I think GPS bike computers are mostly superior to cue sheets for route following, but that's a different discussion (and it depends on what sort of bicycling you're doing). There are points on both sides.

A Firefox issue and perhaps how handling scaling is hard

By: cks
7 October 2025 at 03:09

Over on the Fediverse I shared a fun Firefox issue I've just run into:

Today's fun Firefox bug: if I move my (Nightly) Firefox window left and right across my X display, the text inside the window reflows to change its line wrapping back and forth. I have a HiDPI display with non-integer scaling and some other settings, so I'm assuming that Firefox is now suffering from rounding issues where the exact horizontal pixel position changes its idea of the CSS window width, triggering text reflows as it jumps back and forth by a CSS pixel.

(I've managed to reproduce this in a standard Nightly, although so far only with some of my settings.)

Close inspection says that this isn't quite what's happening, and the underlying problem is happening more often than I thought. What is actually happening is that as I move my Firefox window left and right, a thin vertical black line usually appears and disappears at the right edge of the window (past a scrollbar if there is one). Since I can see it on my HiDPI display, I suspect that this vertical line is at least two screen pixels wide. Under the right circumstances of window width, text size, and specific text content, this vertical black bar takes enough width away from the rest of the window to cause Firefox to re-flow and re-wrap text, creating easily visible changes as the window moves.

A variation of this happens when the vertical black bar isn't drawn but things on the right side of the toolbar and the URL bar area will shift left and right slightly as the window is moved horizontally. If the window is showing a scrollbar, the position of the scroll target in the scrollbar will move left and right, with the right side getting ever so slightly wider or returning back to being symmetrical. It's easiest to see this if I move the window sideways slowly, which is of course not something I do often (usually I move windows rapidly).

(This may be related to how X has a notion of sizing windows in non-pixel units if the window asks for it. Firefox in my configuration definitely asks for this; it asserts that it wants to be resized in units of 2 (display) pixels both horizontally and vertically. However, I can look at the state of a Firefox window in X and see that the window size in pixels doesn't change between the black bar appearing and disappearing.)
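
For what it's worth, the standard X tools will show this sort of thing; both of these let you click on the window in question, with the first reporting the 'program specified resize increment' and the second the window's current size in pixels:

xprop WM_NORMAL_HINTS
xwininfo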

All of this is visible partly because under X and my window manager, windows can redisplay themselves even during an active move operation. If the window contents froze while I dragged windows around, I probably wouldn't have noticed this for some time. Text reflowing as I moved a Firefox window sideways created a quite attention-getting shimmer.

It's probably relevant that I need unusual HiDPI settings and I've also set Firefox's layout.css.devPixelsPerPx to 1.7 in about:config. That was part of why I initially assumed this was a scaling and rounding issue, and why I still suspect that area of Firefox a bit.

(I haven't filed this as a Firefox bug yet, partly because I just narrowed down what was happening in the process of writing this entry.)

What (I think) you need to do basic UDP NAT traversal

By: cks
6 October 2025 at 03:52

Yesterday I wished for a way to do native "blind" WireGuard relaying, without needing to layer something on top of WireGuard. I wished for this both because it's the simplest approach for getting through NATs and the one you need in general under some circumstances. The classic and excellent work on all of the complexities of NAT traversal is Tailscale's How NAT traversal works, which also winds up covering the situation where you absolutely have to have a relay. But, as I understand things, in a fair number of situations you can sort of do without a relay and have direct UDP NAT traversal, although you need to do some extra work to get it and you need additional pieces.

Following RFC 4787, we can divide NAT into two categories, endpoint-independent mapping (EIM) and endpoint-dependent mapping (EDM). In EIM, the public IP and port of your outgoing NAT'd traffic depend only on your internal IP and port, not on the destination (IP or port); in EDM they (also) depend on the destination. NAT'ing firewalls normally NAT based on what could be called "flows". For TCP, flows are a real thing; you can specifically identify a single TCP connection and it's difficult to fake one. For UDP, a firewall generally has no idea of what is a valid flow, and the best it can do is accept traffic that comes from the destination IP and port, which in theory is replies from the other end.

This leads to the NAT traffic traversal trick that we can do for UDP specifically. If we have two machines that want to talk to each other on each other's UDP port 51820, the first thing they need is to learn the public IP and port being used by the other machine. This requires some sort of central coordination server as well as the ability to send traffic to somewhere on UDP port 51820 (or whatever port you care about). In the case of WireGuard, you might as well make this a server on a public IP running WireGuard and have an actual WireGuard connection to it, and the discount 'coordination server' can then be basically the WireGuard peer information from 'wg' (the 'endpoint' is the public IP and port you need).

Once the two machines know each other's public IP and port, they start sending UDP port 51820 (or whatever) packets to each other, to the public IP and port they learned through the coordination server. When each of them sends their first outgoing packet, this creates a 'flow' on their respective NAT firewall which will allow the other machine's traffic in. Depending on timing, the first few packets from the other machine may arrive before your firewall has set up its state to allow them in and will get dropped, so each side needs to keep sending until it works or until it's clear that at least one side has an EDM (or some other complication).

(For WireGuard, you'd need something that sets the peer's endpoint to your now-known host and port value and then tries to send it some traffic to trigger the outgoing packets.)
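
With the plain WireGuard tools this part is simple (the interface name, key, and addresses here are all made up):

wg set wg0 peer 'PeerPublicKeyInBase64=' endpoint 198.51.100.7:51820
ping -c 3 10.100.0.2     # the peer's internal WireGuard IP, to force outgoing packets
wg show wg0 endpoints    # what endpoint WireGuard currently believes in

The awkward part is learning the public IP and port in the first place, not telling WireGuard about them.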

As covered in Tailscale's article, it's possible to make direct NAT traversal work in some additional circumstances with increasing degrees of effort. You may be lucky and have a local EDM firewall that can be asked to stop doing EDM for your UDP port (via a number of protocols for this), and otherwise it may be possible to feel your way around one EDM firewall.

If you can arrange a natural way to send traffic from your UDP port to your coordination server, the basic NAT setup can be done without needing the deep cooperation of the software using the port; all you need is a way to switch what remote IP and port it uses for a particular peer. Your coordination server may need special software to listen to traffic and decode which peer is which, or you may be able to exploit existing features of your software (for example, by making the coordination server a WireGuard peer). Otherwise, I think you need either some cooperation from the software involved or gory hacks.

Wishing for a way to do 'blind' (untrusted) WireGuard relaying

By: cks
5 October 2025 at 02:32

Over on the Fediverse, I sort of had a question:

I wonder if there's any way in standard WireGuard to have a zero-trust network relay, so that two WG peers that are isolated from each other (eg both behind NAT) can talk directly. The standard pure-WG approach has a public WG endpoint that everyone talks to and which acts as a router for the internal WG IPs of everyone, but this involves decrypting and re-encrypting the WG traffic.

By 'talk directly' I mean that each of the peers has the WireGuard keys of the other and the traffic between the two of them stays encrypted with those keys all the way through its travels. The traditional approach to the problem of two NAT'd machines that want to talk to each other with WireGuard is to have a WireGuard router that both of them talk to over WireGuard, but this means that the router sees the unencrypted traffic between them. This is less than ideal if you don't want to trust your router machine, for example because you want to make it a low-trust virtual machine rented from some cloud provider.
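
For concreteness, the traditional setup is each NAT'd machine having a peer stanza for the router that claims the entire internal network, something like this (all names and addresses made up):

[Peer]
# the public WireGuard router
PublicKey = RouterPublicKeyInBase64=
Endpoint = 203.0.113.10:51820
AllowedIPs = 10.100.0.0/24
PersistentKeepalive = 25

Traffic to any other peer's internal IP gets encrypted to the router, which decrypts it and re-encrypts it to the real destination; that middle step is exactly what I'd like to avoid.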

Since we love indirection in computer science, you can in theory solve this with another layer of traffic encapsulation (with a lot of caveats). The idea is that all of the 'public' endpoint IPs of WireGuard peers are actually on a private network, and you route the private network through your public router. Getting the private network packets to and from the router requires another level of encapsulation and unless you get very clever, all your traffic will go through the router even if two WireGuard peers could talk directly. Since WireGuard automatically keeps track of the current public IPs of peers, it would be ideal to do this with WireGuard, but I'm not sure that WG-in-WG can have the routing maintained the way we want.

This untrusted relay situation is of course one of the things that 'automatic mesh network on top of WireGuard' systems give you, but it would be nice to be able to do this with native features (and perhaps without an explicit control plane server that machines talk to, although that seems unlikely). As far as I know such systems implement this with their own brand of encapsulation, which I believe requires running their WireGuard stack.

(On Linux you might be able to do something clever with redirecting outgoing WireGuard packets to a 'tun' device connected to a user level program, which then wrapped them up, sent them off, received packets back, and injected the received packets into the system.)

Using systems because you know them already

By: cks
4 October 2025 at 03:35

Every so often on the Fediverse, people ask for advice on a monitoring system to run on their machine (desktop or server), and some of the time Prometheus comes up, and when it does I wind up making awkward noises. On the one hand, we run Prometheus (and Grafana) and are happy with it, and I run separate Prometheus setups on my work and home desktops. On the other hand, I don't feel I can recommend picking Prometheus for a basic single-machine setup, despite running it that way myself.

Why do I run Prometheus on my own machines if I don't recommend that you do so? I run it because I already know Prometheus (and Grafana), and in fact my desktops (re)use much of our production Prometheus setup (but they scrape different things). This is a specific instance (and example) of a general thing in system administration, which is that not infrequently it's simpler for you to use something you already know even if it's not necessarily an exact fit (or even a great fit) for the problem. For example, if you're quite familiar with operating PostgreSQL databases, it might be simpler to use PostgreSQL for a new system where SQLite could do perfectly well and other people would find SQLite much simpler. Especially if you have canned setups, canned automation, and so on all ready to go for PostgreSQL, and not for SQLite.

(Similarly, our generic web server hammer is Apache, even if we're doing things that don't necessarily need Apache and could be done perfectly well or perhaps better with nginx, Caddy, or whatever.)

This has a flipside, where you use a tool because you know it even if there might be a significantly better option, one that would actually be easier overall even accounting for needing to learn the new option and build up the environment around it. What we could call "familiarity-driven design" is a thing, and it can even be a confining thing, one where you shape your problems to conform to the tools you already know.

(And you may not have chosen your tools with deep care and instead drifted into them.)

I don't think there's any magic way to know which side of the line you're on. Perhaps the best we can do is be a little bit skeptical about our reflexive choices, especially if we seem to be sort of forcing them in a situation that feels like it should have a simpler or better option (such as basic monitoring of a single machine).

(In a way it helps that I know so much about Prometheus because it makes me aware of various warts, even if I'm used to them and I've climbed the learning curves.)

Apache .htaccess files are important because they enable delegation

By: cks
3 October 2025 at 03:03

Apache's .htaccess files have a generally bad reputation. For example, lots of people will tell you that they can cause performance problems and you should move everything from .htaccess files into your main Apache configuration, using various pieces of Apache syntax to restrict what configuration directives apply to. The result can even be clearer, since various things can be confusing in .htaccess files (eg rewrites and redirects). Despite all of this, .htaccess files are important and valuable because of one property, which is that they enable delegation of parts of your server configuration to other people.

The Apache .htaccess documentation even spells this out in reverse, in When (not) to use .htaccess files:

In general, you should only use .htaccess files when you don't have access to the main server configuration file. [...]

If you operate the server and would be writing the .htaccess file, you can put the contents of the .htaccess in the main server configuration and make your life easier and Apache faster (and you probably should). But if the web server and its configuration isn't managed as a unitary whole by one group, then .htaccess files allow the people managing the overall Apache configuration to safely delegate things to other people on a per-directory basis, using Unix ownership. This can both enable people to do additional things and reduce the amount of work the central people have to do, letting things scale better.
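
The mechanism for this delegation is AllowOverride (and AllowOverrideList), set in the main configuration for the directory trees you're willing to delegate. As a sketch, with a path and override classes that are just examples rather than our configuration:

<Directory "/home/*/public_html">
    AllowOverride AuthConfig FileInfo Indexes Limit
    Require all granted
</Directory>

People who own directories under there can then adjust those aspects of Apache's behaviour through .htaccess files without ever touching the main configuration.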

(The other thing that .htaccess files allow is dynamic updates without having to restart or reload the whole server. In some contexts this can be useful or important, for example if the updates are automatically generated at unpredictable times.)

I don't think it's an accident that .htaccess files emerged in Apache, because one common environment Apache was initially used in was old fashioned multi-user Unix web servers where, for example, every person with a login on the web server might have their own UserDir directory hierarchy. Hence features like suEXEC, so you could let people run CGIs without those CGIs having to run as the web user (a dangerous thing), and also hence the attraction of .htaccess files. If you have a bunch of (graduate) students with their own web areas, you definitely don't want to let all of them edit your departmental web server's overall configuration.

(Apache doesn't solve all your problems here, at least not in a simple configuration; you're still left with the multiuser PHP problem. Our solution to this problem is somewhat brute force.)

These environments are uncommon today but they're not extinct, at least at universities like mine, and .htaccess files (and Apache's general flexibility) remain valuable to us.

Readdir()'s inode numbers versus OverlayFS

By: cks
2 October 2025 at 03:09

Recently I re-read Deep Down the Rabbit Hole: Bash, OverlayFS, and a 30-Year-Old Surprise (via) and this time around, I stumbled over a bit in the writeup that made me raise my eyebrows:

Bash’s fallback getcwd() assumes that the inode [number] from stat() matches one returned by readdir(). OverlayFS breaks that assumption.

I wouldn't call this an 'assumption' so much as 'sane POSIX semantics', although I'm not sure that POSIX absolutely requires this.

As we've seen before, POSIX talks about 'file serial number(s)' instead of inode numbers. The best definition of these is covered in sys/stat.h, where we see that a 'file identity' is uniquely determined by the combination of the inode number and the device ID (st_dev), and POSIX says that 'at any given time in a system, distinct files shall have distinct file identities' while hardlinks have the same identity. The POSIX description of readdir() and dirent.h don't caveat the d_ino file serial numbers from readdir(), so they're implicitly covered by the general rules for file serial numbers.

In theory you can claim that the POSIX guarantees don't apply here since readdir() is only supplying d_ino, the file serial number, not the device ID as well. I maintain that this fails due to a POSIX requirement:

[...] The value of the structure's d_ino member shall be set to the file serial number of the file named by the d_name member. [...]

If readdir() gives one file serial number and a fstatat() of the same name gives another, a plain reading of POSIX is that one of them is lying. Files don't have two file serial numbers, they have one. Readdir() can return duplicate d_ino numbers for files that aren't hardlinks to each other (and I think legitimately may do so in some unusual circumstances), but it can't return something different than what fstatat() does for the same name.

The perverse argument here turns on POSIX's 'at any given time'. You can argue that the readdir() is at one time and the stat() is at another time and the system is allowed to entirely change file serial numbers between the two times. This is certainly not the intent of POSIX's language but I'm not sure there's anything in the standard that rules it out, even though it makes file serial numbers fairly useless since there's no POSIX way to get a bunch of them at 'a given time' so they have to be coherent.

So to summarize, OverlayFS has chosen what are effectively non-POSIX semantics for its readdir() inode numbers (under some circumstances, in the interests of performance) and Bash used readdir()'s d_ino in a traditional Unix way that caused it to notice. Unix filesystems can depart from POSIX semantics if they want, but I'd prefer if they were a bit more shamefaced about it. People (ie, programs) count on those semantics.

(The truly traditional getcwd() way wouldn't have been a problem, because it predates readdir() having d_ino and so doesn't use it (it stat()s everything to get inode numbers). I reflexively follow this pre-d_ino algorithm when I'm talking about doing getcwd() by hand (cf), but these days you want to use the dirent d_ino and if possible d_type, because they're much more efficient than stat()'ing everything.)

How part of my email handling drifted into convoluted complexity

By: cks
1 October 2025 at 01:50

Once upon a time, my email handling was relatively simple. I wasn't on any big mailing lists, so I had almost everything delivered straight to my inbox (both in the traditional /var/mail mbox sense and then through to MH's own inbox folder directory). I did some mail filtering with procmail, but it was all for things that I basically never looked at, so I had procmail write them to mbox files under $HOME/.mail. I moved email from my Unix /var/mail inbox to MH's inbox with MH's inc command (either running it directly or having exmh run it for me). Rarely, I had a mbox file procmail had written that I wanted to read, and at that point I inc'd it either to my MH +inbox or to some other folder.

Later, prompted by wanting to improve my breaks and vacations, I diverted a bunch of mailing lists away from my inbox. Originally I had procmail write these diverted messages to mbox files, then later I'd inc the files to read the messages. Then I found that outside of vacations, I needed to make this email more readily accessible, so I had procmail put them in MH folder directories under Mail/inbox (one of MH's nice features is that your inbox is a regular folder and can have sub-folders, just like everything else). As I noted at the time, procmail only partially emulates MH when doing this, and one of the things it doesn't do is keep track of new, unread ('unseen') messages.

(MH has a general purpose system for keeping track of 'sequences' of messages in a MH folder, so it tracks unread messages based on what is in the special 'unseen' sequence. Inc and other MH commands update this sequence; procmail doesn't.)

Along with this procmail setup I wrote a basic script, called mlists, to report how many messages each of these 'mailing list' inboxes had. After a while I started diverting lower-priority status emails and so on through this system (and stopped reading the mailing lists); if I got a type of email in any volume that I didn't want to read right away during work, it probably got shunted to these side inboxes. At some point I made mlists optionally run the MH scan command to show me what was in each inbox folder (well, for the inbox folders where this was potentially useful information). The mlists script was still mostly simple and the whole system still made sense, but it was a bit more complex than before, especially when it also got a feature where it auto-reset the current message number in each folder to the first message.

A couple of years ago, I switched the MH frontend I used from exmh to MH-E in GNU Emacs, which changed how I read my email in practice. One of the changes was that I started using the GNU Emacs Speedbar, which always displays a count of messages in MH folders and especially wants to let you know about folders with unread messages. Since I had the hammer of my mlists script handy, I proceeded to mutate it to be what a comment in the script describes as "a discount maintainer of 'unseen'", so that MH-E's speedbar could draw my attention to inbox folders that had new messages.

This is not the right way to do this. The right way to do this is to have procmail deliver messages through MH's rcvstore, which as an MH command can update the 'unseen' sequence properly. But using rcvstore is annoying, partly because you have to use another program to add the locking it needs, so at every point the path of least resistance was to add a few more hacks to what I already had. I had procmail, and procmail could deliver to MH folder directories, so I used it (and at the time the limitations were something I considered a feature). I had a script to give me basic information, so it could give me more information, and then it could do one useful thing while it was giving me information, and then the one useful thing grew into updating 'unseen'.

And since I have all of this, it's not even worth the effort of switching to the proper rcvstore approach and throwing a bunch of it away. I'm always going to want the 'tell me stuff' functionality of my mlists script, so part of it has to stay anyway.

Can I see similarities between this and how various of our system tools have evolved, mutated, and become increasingly complex? Of course. I think it's much the same obvious forces involved, because each step seems reasonable in isolation, right up until I've built a discount environment that duplicates much of rcvstore.

Sidebar: an extra bonus bit of complexity

It turns out that part of the time, I want to get some degree of live notification of messages being filed into these inbox folders. I may not look at all or even many of them, but there are some periodic things that I do want to pay attention to. So my discount special hack is basically:

tail -f .mail/procmail-log |
  egrep -B2 --no-group-separator 'Folder: /u/cks/Mail/inbox/'

(This is a script, of course, and I run it in a terminal window.)

This could be improved in various ways but then I'd be sliding down the convoluted complexity slope and I'm not willing to do that. Yet. Give it a few years and I may be back to write an update.

More on the tools I use to read email affecting my email reading

By: cks
30 September 2025 at 03:32

About two years ago I wrote an entry about how my switch from reading email with exmh to reading it in GNU Emacs with MH-E had affected my email reading behavior more than I expected. As time has passed and I've made more extensive customizations to my MH-E environment, this has continued. One of the recent ways I've noticed is that I'm slowly making more and more use of the fact that GNU Emacs is a multi-window editor ('multi-frame' in Emacs terminology) and reading email with MH-E inside it still leaves me with all of the basic Emacs facilities. Specifically, I can create several Emacs windows (frames) and use this to be working in multiple MH folders at the same time.

Back when I used exmh extensively, I mostly had MH pull my email into the default 'inbox' folder, where I dealt with it all at once. Sometimes I'd wind up pulling some new email into a separate folder, but since exmh only really gave me a view of a single folder at a time, and a system administrator needs to respond to email regularly, that was a bit awkward. At first my use of MH-E mostly followed that pattern; I had a single Emacs MH-E window (frame) and within that window I switched between folders. But lately I've been creating more new windows when I want to spend time reading a non-inbox folder, and in turn this has made me much more willing to put new email directly into different (MH) folders rather than funnel it all into my inbox.

(I don't always make a new window to visit another folder, because I don't spend long on many of my non-inbox folders for new email. But for various mailing lists and so on, reading through them may take at least a bit of time so it's more likely I'll decide I want to keep my MH inbox folder still available.)

One thing that makes this work is that MH-E itself has reasonably good support for displaying and working on multiple folders at once. There are probably ways to get MH-E to screw this up and run MH commands with the wrong MH folder as the current folder, so I'm careful that I don't try to have MH-E carry out its pending MH operations in two MH-E folders at the same time. There are areas where MH-E is less than ideal when I'm also using command-line MH tools, because MH-E changes MH's global notion of the current folder any time I have it do things like show a message in some folder. But at least MH-E is fine (in normal circumstances) if I use MH commands to change the current folder; MH-E will just switch it back the next time I have it show another message.

PS: On a purely pragmatic basis, another change in my email handling is that I'm no longer as irritated with HTML emails because GNU Emacs is much better at displaying HTML than exmh was. I've actually left my MH-E setup showing HTML by default, instead of forcing multipart/alternative email to always show the text version (my exmh setup). GNU Emacs and MH-E aren't up to the level of, say, Thunderbird, and sometimes this results in confusing emails, but it's better than it was.

(The situation that seems tricky for MH-E is that people sometimes include inlined images, for example screenshots as part of problem reports, and MH-E doesn't always give any indication that it's even omitting something.)

Syndication feed fetchers, HTTP redirects, and conditional GET

By: cks
29 September 2025 at 03:49

In response to my entry on how ETag values are specific to a URL, a Wandering Thoughts reader asked me in email what a syndication feed reader (fetcher) should do when it encounters a temporary HTTP redirect, in the context of conditional GET. I think this is a good question, especially if we approach it pragmatically.

The specification-compliant answer is that every final (non-redirected) URL must have its ETag and Last-Modified values tracked separately. If you make a conditional GET for URL A because you know its ETag or Last-Modified (or both) and you get a temporary HTTP redirection to another URL B that you don't have an ETag or Last-Modified for, you can't make a conditional GET. This means you have to ensure that If-None-Match and especially If-Modified-Since aren't copied from the original HTTP request to the newly re-issued redirect target request. And when you make another request for URL A later, you can't send a conditional GET using ETag or Last-Modified values you got from successfully fetching URL B; you either have to use the last values observed for URL A or make an unconditional GET. In other words, saved ETag and Last-Modified values should be per-URL properties, not per-feed properties.

(Unfortunately this may not fit well with feed reader code structures, data storage, or uses of low-level HTTP request libraries that hide things like HTTP redirects from you.)
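
To sketch the bookkeeping this implies in Go (the names here are mine, and a real fetcher would also want error handling and expiry), the cache of validators wants to be keyed by the final URL you actually fetched, and conditional headers are only added when you have saved values for exactly that URL:

package feedfetch

import "net/http"

// validators holds the ETag and Last-Modified we last saw for one
// specific final (non-redirected) URL.
type validators struct {
    etag         string
    lastModified string
}

// cache maps final URLs to their validators.
var cache = map[string]validators{}

// addConditional adds If-None-Match/If-Modified-Since only if we have
// saved validators for exactly this URL, not for "the feed" in general.
func addConditional(req *http.Request) {
    if v, ok := cache[req.URL.String()]; ok {
        if v.etag != "" {
            req.Header.Set("If-None-Match", v.etag)
        }
        if v.lastModified != "" {
            req.Header.Set("If-Modified-Since", v.lastModified)
        }
    }
}

// remember saves validators under the URL the response actually came
// from (resp.Request.URL is the final request after any redirects),
// never under the original feed URL.
func remember(resp *http.Response) {
    if resp.StatusCode != http.StatusOK {
        return
    }
    cache[resp.Request.URL.String()] = validators{
        etag:         resp.Header.Get("ETag"),
        lastModified: resp.Header.Get("Last-Modified"),
    }
}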

Pragmatically, when you get a temporary HTTP redirect for a feed, you can probably get away with re-issuing the conditional GET to the redirect target using the feed's original saved ETag and Last-Modified information. There are three likely cases for a temporary HTTP redirection of a syndication feed that I can think of:

  • You're receiving a generic HTTP redirection to some sort of error page that isn't a valid syndication feed. Your syndication feed fetcher isn't going to do anything with a successful fetch of it (except maybe add an 'error' marker to the feed), so a conditional GET that fools you with "nothing changed" is harmless.

  • You're being redirected to an alternate source of the normal feed, for example a feed that's normally dynamically generated might serve a (temporary) HTTP redirect to a static copy under high load. If the conditional GET matches the ETag (probably unlikely in practice) or the Last-Modified (more possible), then you almost certainly have the most current version and are fine, and you've saved the web server some load.

  • You're being (temporarily) redirected to some kind of error feed; a valid syndication feed that contains one or more entries that are there to tell the person seeing them about a problem. Here, the worst thing that happens if your conditional GET fools you with "nothing has changed" is that the person reading the feed doesn't see the error entry (or entries).

The third case is a special variant of an unlikely general case where the normal URL and the redirected URL are both versions of the feed but each has entries that the other doesn't. In this general case, a conditional GET that fools you with a '304 Not Modified' will cause you to miss some entries. However, this should cure itself when the temporary HTTP redirect stops happening (or when a new entry is published to the temporary location, which should change its ETag and reset its Last-Modified date to more or less now).

A feed reader that keeps a per-feed 'Last-Modified' value and updates it after following a temporary HTTP redirect is living dangerously. You may not have the latest version of the non-redirected feed, but the target of the HTTP redirection may still be 'more recent' than the real feed for various reasons (even if it's a valid feed; if it's not a valid feed, then blindly saving its ETag and Last-Modified is probably quite dangerous). When the temporary HTTP redirection goes away and the normal feed's URL resumes responding with the feed again, using the target's "Last-Modified" value for a conditional GET of the original URL could cause you to receive "304 Not Modified" until the feed is updated again (and its Last-Modified moves to be after your saved value), whenever that happens. Some feeds update frequently; others may only update days or weeks later.

Given this and the potential difficulties of even noticing HTTP redirects (if they're handled by some underlying library or tool), my view is that if a feed provides both an ETag and a Last-Modified, you should save and use only the ETag unless you're sure you're going to handle HTTP redirects correctly. An ETag could still get you into trouble if used across different URLs, but it's much less likely (see the discussion at the end of my entry about Last-Modified being specific to the URL).

(All of this is my view as someone providing syndication feeds, not someone writing syndication feed fetchers. There may be practical issues I'm unaware of, since the world of feeds is very large and it probably contains a lot of weird feed behavior (to go with the weird feed fetcher behavior).)

The HTTP Last-Modified value is specific to the URL (technically so is the ETag value)

By: cks
28 September 2025 at 01:08

Last time around I wrote about how If-None-Match values (which come from ETag values) must come from the actual URL itself, not (for example) from another URL that you were at one point redirected to. In practice, this is only an issue of moderate concern for ETag/If-None-Match; you can usually make a conditional GET using an ETag from another URL and get away with it. This is very much an issue if you make the mistake of doing the same thing with an If-Modified-Since header based on another URL's Last-Modified header. This is because the Last-Modified header value isn't unique to a particular document in the way that ETag values often are.

If you take the Last-Modified timestamp from URL A and perform a conditional GET for URL B with an 'If-Modified-Since' of that timestamp, the web server may well give you exactly what you asked for but not what you wanted by saying 'this hasn't been modified since then' even though the contents of those URLs are entirely different. You told the web server to decide purely on the basis of timestamps without reference to anything that might even vaguely specify the content, and so it did. This can happen even if the server is requiring an exact timestamp match (as it probably should), because there are any number of ways for the 'Last-Modified' timestamp of a whole bunch of URLs to be exactly the same because some important common element of them was last updated at that point.

(This is how DWiki works. The Last-Modified date of a page is the most recent timestamp of all of the elements that went into creating it, so if I change some shared element, everything will promptly take on the Last-Modified of that element.)
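
As an illustration (a hedged Go sketch, not DWiki's actual code), computing a page's Last-Modified as the newest timestamp of its ingredients looks something like this, and it's easy to see how one shared ingredient gives many URLs the same value:

package pages

import (
    "net/http"
    "os"
    "time"
)

// lastModified returns the newest modification time of all the pieces
// that go into rendering a page. If a shared piece (a template, a
// sidebar) changes, every page that uses it picks up the new time.
func lastModified(paths ...string) time.Time {
    var newest time.Time
    for _, p := range paths {
        if fi, err := os.Stat(p); err == nil && fi.ModTime().After(newest) {
            newest = fi.ModTime()
        }
    }
    return newest
}

// setLastModified formats the time the way HTTP wants it.
func setLastModified(w http.ResponseWriter, t time.Time) {
    w.Header().Set("Last-Modified", t.UTC().Format(http.TimeFormat))
}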

This means that if you're going to use Last-Modified in conditional GETs, you must handle HTTP redirects specially. It's actively dangerous (to actually getting updates) to mingle Last-Modified dates from the original URL and the redirection URL; you either have to not use Last-Modified at all, or track the Last-Modified values separately. For things that update regularly, any 'missing the current version' problems will cure themselves eventually, but for infrequently updated things you could go quite a while thinking that you have the current content when you don't.

In theory this is also true of ETag values; the specification allows them to be calculated in ways that are URL-specific (the specification mentions that the ETag might be a 'revision number'). A plausible implementation of serving a collection of pages from a Git repository could use the repository's Git revision as the common ETag for all pages; after all, the URL (the page) plus that git revision uniquely identifies it, and it's very cheap to provide under the right circumstances (eg, you can record the checked out git revision).

In practice, common ways of generating ETags will make them different across different URLs, except perhaps when the contents are the same. DWiki generates ETag values using a cryptographic hash, so two different URLs will only have the same ETag if they have the same contents, which I believe is a common approach for pages that are generated dynamically. Apache generates ETag values for static files using various file attributes that will be different for different files, which is probably also a common approach for things that serve static files. Pragmatically you're probably much safer sending an ETag value from one URL in an If-None-Match header to another URL (for example, by repeating it while following a HTTP redirection). It's still technically wrong, though, and it may cause problems someday.
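
As a contrast, here's a hedged Go sketch (again, not DWiki's actual code) of the content-hash approach: the ETag depends only on the bytes being served, so two URLs share an ETag only when they serve identical content, and a matching If-None-Match gets a 304. A real server would parse If-None-Match as a list rather than doing an exact string comparison.

package pages

import (
    "crypto/sha256"
    "encoding/hex"
    "net/http"
)

// serveWithETag writes body with a strong, content-derived ETag and
// answers a matching If-None-Match with 304 Not Modified.
func serveWithETag(w http.ResponseWriter, r *http.Request, body []byte) {
    sum := sha256.Sum256(body)
    etag := `"` + hex.EncodeToString(sum[:]) + `"`
    w.Header().Set("ETag", etag)
    // Simplification: If-None-Match can be a comma-separated list or "*".
    if r.Header.Get("If-None-Match") == etag {
        w.WriteHeader(http.StatusNotModified)
        return
    }
    w.Write(body)
}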

(This feels obvious but it was only today that I realized how it interacts with conditional GETs and HTTP redirects.)

Go's builtin 'new()' function will take an expression in Go 1.26

By: cks
27 September 2025 at 03:20

An interesting little change recently landed in the development version of Go, and so will likely appear in Go 1.26 when it's released. The change is that the builtin new() function will be able to take an expression, not just a type. This change stems from the proposal in issue 45624, which dates back to 2021 (and earlier for earlier proposals). The new specification language is covered in, for example, this comment on the issue. An example is in the current development documentation for the release notes, but it may not sound very compelling.

A variety of uses came up in the issue discussion, some of which were a surprise to me. One case that's apparently surprisingly common is to start with a pointer and want to make another pointer to a (shallow) copy of its value. With the change to 'new()', this is:

np = new(*p)

Today you can write this as a generic function (apparently often called 'ref()'), or do it with a temporary variable, but in Go 1.26 this will (probably) be a built-in feature, and perhaps the Go compiler will be able to optimize it in various ways. This sort of thing is apparently more common than you might expect.
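
For reference, such a generic helper is tiny; a plausible version (the name varies from codebase to codebase) is just:

// ref returns a pointer to a (shallow) copy of v; new(expr) makes
// this helper unnecessary.
func ref[T any](v T) *T {
    return &v
}

// np := ref(*p)    // pointer to a shallow copy of *p's value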

Another obvious use for the new capability is if you're computing a new value and then creating a pointer to it. Right now, this has to be written using a temporary variable:

t := <some expression>
p := &t

With 'new(expr)' this can be written as one line, without a temporary variable (although as before a 'ref()' generic function can do this today).

The usage example from the current documentation is a little bit peculiar, at least as far as providing a motivation for this change. In a slightly modified form, the example is:

type Person struct {
    Name string `json:"name"`
    Age  *int   `json:"age"` // age if known; nil otherwise
}

func newPerson(name string, age int) *Person {
    return &Person{
        Name: name,
        Age:  new(age),
    }
}

The reason this is a bit peculiar is that today you can write 'Age: &age' and it works the same way. Well, at a semantic level it works the same way. The theoretical but perhaps not practical complication is inlining combined with escape analysis. If newPerson() is inlined into a caller, then the caller's variable for the 'age' parameter may be unused after the (inlined) call to newPerson, and so could get mapped to 'Age: &callervar', which in turn could force escape analysis to put that variable in the heap, which might be less efficient than keeping the variable in the stack (or registers) until right at the end.

A broad language reason is that allowing new() to take an expression removes the special privilege that structs and certain other compound data structures have had, where you could construct pointers to initialized versions of them. Consider:

type ints struct { i int }
[...]
t := 10
ip := &t
isp := &ints{i: 10}

You can create a pointer to the int wrapped in a struct on a single line with no temporary variable, but a pointer to a plain int requires you to materialize a temporary variable. This is a bit annoying.

A pragmatic part of adding this is that people appear to write and use equivalents of new(value) a fair bit. The popularity of an expression is not necessarily the best reason to add a built-in equivalent to the language, but it does suggest that this feature will get used (or will eventually get used, since the existing uses won't exactly get converted instantly for all sorts of reasons).

This strikes me as a perfectly fine change for Go to make. The one thing that's a little bit non-ideal is that 'new()' of constant numbers has less type flexibility than the constant numbers themselves. Consider:

var ui uint
var uip *uint

ui = 10       // okay
uip = new(10) // type mismatch error

The current error that the compiler reports is 'cannot use new(10) (value of type *int) as *uint value in assignment', which is at least relatively straightforward.

(You fix it by casting ('converting') the untyped constant number to whatever type you need. The 'default type' of a constant, which is now more relevant than before, is covered in the specification section on Constants.)

The broad state of ZFS on Illumos, Linux, and FreeBSD (as I understand it)

By: cks
26 September 2025 at 02:45

Once upon a time, Sun developed ZFS and put it in Solaris, which was good for us. Then Sun open-sourced Solaris as 'OpenSolaris', including ZFS, although not under the GPL (a move that made people sad and that Scott McNealy is on record as regretting). ZFS development continued in Solaris and thus in OpenSolaris until Oracle bought Sun and soon afterward closed Solaris source again (in 2010); while Oracle continued ZFS development in Oracle Solaris, we can ignore that. OpenSolaris was transmogrified into Illumos, and various Illumos distributions formed, such as OmniOS (which we used for our second generation of ZFS fileservers).

Well before Oracle closed Solaris, separate groups of people ported ZFS into FreeBSD and onto Linux, where the effort was known as "ZFS on Linux". Since the Linux kernel community felt that ZFS's license wasn't compatible with the kernel's license, ZoL was an entirely out of (kernel) tree effort, while FreeBSD was able to accept ZFS into their kernel tree (I believe all the way back in 2008). Both ZFS on Linux and FreeBSD took changes from OpenSolaris into their versions up until Oracle closed Solaris in 2010. After that, open source ZFS development split into three mostly separate strands.

(In theory OpenZFS was created in 2013. In practice I think OpenZFS at the time was not doing much beyond coordination of the three strands.)

Over time, a lot more people wanted to build machines using ZFS on top of FreeBSD or Linux (including us) than wanted to keep using Illumos distributions. Not only was Illumos a different environment, but Illumos and its distributions didn't see the level of developer activity that FreeBSD and Linux did, which resulted in driver support issues and other problems (cf). For ZFS, the consequence of this was that many more improvements to ZFS itself started happening in ZFS on Linux and in FreeBSD (I believe to a lesser extent) than were happening in Illumos or OpenZFS, the nominal upstream. Over time the split of effort between Linux and FreeBSD became an obvious problem and eventually people from both sides got together. This resulted in ZFS on Linux v2.0.0 becoming 'OpenZFS 2.0.0' in 2020 (see also the Wikipedia history) and also becoming portable to FreeBSD, where it became the FreeBSD kernel ZFS implementation in FreeBSD 13.0 (cf).

The current state of OpenZFS is that it's co-developed for both Linux and FreeBSD. The OpenZFS ZFS repository routinely has FreeBSD specific commits, and as far as I know OpenZFS's test suite is routinely run on a variety of FreeBSD machines as well as a variety of Linux ones. I'm not sure how OpenZFS work propagates into FreeBSD itself, but it does (some spelunking of the FreeBSD source repository suggests that there are periodic imports of the latest changes). On Linux, OpenZFS releases and development versions propagate to Linux distributions in various ways (some of them rather baroque), including people simply building their own packages from the OpenZFS repository.

Illumos continues to use and maintain its own version of ZFS, which it considers separate from OpenZFS. There is an incomplete Illumos project discussion on 'consuming' OpenZFS changes (via, also), but my impression is that very few changes move from OpenZFS to Illumos. My further impression is that there is basically no one on the OpenZFS side who is trying to push changes into Illumos; instead, OpenZFS people consider it up to Illumos to pull changes, and Illumos people aren't doing much of that for various reasons. At this point, if there's an attractive ZFS change in OpenZFS, the odds of it appearing in Illumos on a timely basis appear low (to put it one way).

(Some features have made it into Illumos, such as sequential scrubs and resilvers, which landed in issue 10405. This feature originated in what was then ZoL and was ported into Illumos.)

Even if Illumos increases the pace of importing features from OpenZFS, I don't ever expect it to be on the leading edge and I think that's fine. There have definitely been various OpenZFS features that needed some time before they became fully ready for stable production use (even after they appeared in releases). I think there's an ecological niche for a conservative ZFS that only takes solidly stable features, and that fits Illumos's general focus on stability.

PS: I'm out of touch with the Illumos world these days, so I may have mis-characterized the state of affairs there. If so, I welcome corrections and updates in the comments.

If-None-Match values must come from the actual URL itself

By: cks
24 September 2025 at 16:55

Because I recently looked at the web server logs for Wandering Thoughts, I said something on the Fediverse:

It's impressive how many ways feed readers screw up ETag values. Make up their own? Insert ETags obtained from the target of a HTTP redirect of another request? Stick suffixes on the end? Add their own quoting? I've seen them all.

(And these are just the ones that I can readily detect from the ETag format being wrong for the ETags my techblog generates.)

(Technically these are If-None-Match values, not ETag values; it's just that the I-N-M value is supposed to come from an ETag you returned.)

One of these mistakes deserves special note, and that's the HTTP redirect case. Suppose you request a URL, receive a HTTP 302 temporary redirect, follow the redirect, and get a response at the new URL with an ETag value. As a practical matter, you cannot then present that ETag value in an If-None-Match header when you re-request the original URL, although you could if you re-requested the URL that you were redirected to. The two URLs are not the same and they don't necessarily have the same ETag values or even the same format of ETags.

(This is an especially bad mistake for a feed fetcher to make here, because if you got a HTTP redirect that gives you a different format of ETag, it's because you've been redirected to a static HTML page served directly by Apache (cf) and it's obviously not a valid syndication feed. You shouldn't be saving the ETag value for responses that aren't valid syndication feeds, because you don't want to get them again.)

This means that feed readers can't just store 'an ETag value' for a feed. They need to associate the ETag value with a specific, final URL, which may not be the URL of the feed (because said feed URL may have been redirected). They also need to (only) make conditional requests when they have an ETag for that specific URL, and not copy the If-None-Match header from the initial GET into a redirected GET.

This probably clashes with many low level HTTP client APIs, which I suspect want to hide HTTP redirects from the caller. For feed readers, such high level APIs are a mistake. They actively need to know about HTTP redirects so that, for example, they can consider updating their feed URL if they get permanent HTTP redirects to a new URL. And also, of course, to properly handle conditional GETs.
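
For what it's worth, in Go's net/http you can keep redirects visible by telling the client not to follow them; a minimal sketch (with my own function name) is below. On a redirect you then decide yourself whether to re-request the Location target, and you know not to copy If-None-Match or If-Modified-Since into that second request.

package feedfetch

import "net/http"

// fetchNoFollow does a GET that doesn't silently follow redirects, so
// the feed fetcher sees the 3xx status and the Location header itself
// and can keep its ETag bookkeeping per final URL.
func fetchNoFollow(url string) (*http.Response, error) {
    client := &http.Client{
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            // Hand back the redirect response instead of following it.
            return http.ErrUseLastResponse
        },
    }
    return client.Get(url)
}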

A hack: outsourcing web browser/client checking to another web server

By: cks
24 September 2025 at 03:18

A while back on the Fediverse, I shared a semi-cursed clever idea:

Today I realized that given the world's simplest OIDC IdP (one user, no password, no prompting, the IdP just 'logs you in' if your browser hits the login URL), you could put @cadey's Anubis in front of anything you can protect with OIDC authentication, including anything at all on an Apache server (via mod_auth_openidc). No need to put Anubis 'in front' of anything (convenient for eg static files or CGIs), and Anubis doesn't even have to be on the same website or machine.

This can be generalized, of course. There are any number of filtering proxies and filtering proxy services out there that will do various things for you, either for free or on commercial terms; one example of a service is geoblocking that's maintained by someone else who's paid to be on top of it and be accurate. Especially with services, you may not want to put them in front of your main website (that gives the service a lot of power), but you would be fine with putting a single-purpose website behind the service or the proxy, if your main website can use the result. With the world's simplest OIDC IdP, you can do that, at least for anything that will do OIDC.

(To be explicit, yes, I'm partly talking about Cloudflare.)

This also generalizes in the other direction, in that you don't necessarily need to use OIDC. You just need some system for passing authenticated information back and forth between your main website and your filtered, checked, proxied verification website. Since you don't need to carry user identity information around this can be pretty simple (although it's going to involve some cryptography, so I recommend just using OIDC or some well-proven option if you can). I've thought about this a bit and I'm pretty certain you can make a quite simple implementation.

(You can also use SAML if you happen to have an extremely simple SAML server and appropriate SAML clients, but really, why. OIDC is today's all-purpose authentication hammer.)

A custom system can pass arbitrary information back and forth between the main website and the verifier, so you can know (for example) if the two saw the same client details. I think you can do this to some extent with OIDC as well if you have a custom IdP, because nothing stops your IdP and your OIDC client from agreeing on some very custom OIDC claims, such as (say) 'clientip'.
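
To give a concrete (and hedged) sense of how simple the cryptographic core of such a custom scheme could be, here's a Go sketch where the verification site mints a short-lived HMAC-signed token over the client IP and a timestamp, and the main site checks it with the same shared key. This is an illustration with made-up names and field choices, not a recommendation over just using OIDC.

package verifytoken

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/base64"
    "fmt"
    "strconv"
    "strings"
    "time"
)

// Mint is run by the verification site after its checks pass; the
// result gets handed back to the main site (eg in a query parameter
// or a cookie).
func Mint(key []byte, clientIP string) string {
    payload := clientIP + "|" + strconv.FormatInt(time.Now().Unix(), 10)
    mac := hmac.New(sha256.New, key)
    mac.Write([]byte(payload))
    sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
    return payload + "|" + sig
}

// Check is run by the main site; it verifies the signature, that the
// token is recent, and that it was issued for the same client IP.
func Check(key []byte, token, clientIP string, maxAge time.Duration) error {
    parts := strings.SplitN(token, "|", 3)
    if len(parts) != 3 {
        return fmt.Errorf("malformed token")
    }
    payload := parts[0] + "|" + parts[1]
    mac := hmac.New(sha256.New, key)
    mac.Write([]byte(payload))
    want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
    if !hmac.Equal([]byte(want), []byte(parts[2])) {
        return fmt.Errorf("bad signature")
    }
    ts, err := strconv.ParseInt(parts[1], 10, 64)
    if err != nil || time.Since(time.Unix(ts, 0)) > maxAge {
        return fmt.Errorf("token expired or has a bad timestamp")
    }
    if parts[0] != clientIP {
        return fmt.Errorf("token issued for a different client IP")
    }
    return nil
}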

(I don't know of any such minimal OIDC server, although I wouldn't be surprised if one exists, probably as a demonstration or test server. And I suppose you can always put a banner on your OIDC IdP's login page that tells people what login and password to use, if you can only find a simple IdP that requires an actual login.)

Unix mail programs have had two approaches to handling your mail

By: cks
23 September 2025 at 02:34

Historically, Unix mail programs (what we call 'mail clients' or 'mail user agents' today) have had two different approaches to handling your email, what I'll call the shared approach and the exclusive approach, with the shared approach being the dominant one. To explain the shared approach, I have to back up to talk about what Unix mail transfer agents (MTAs) traditionally did. When a Unix MTA delivered email to you, at first it delivered email into a single file in a specific location (such as '/usr/spool/mail/<login>') in a specific format, initially mbox; even then, this could be called your 'inbox'. Later, when the maildir mailbox format became popular, some MTAs gained the ability to deliver to maildir format inboxes.

(There have been a number of Unix mail spool formats over the years, which I'm not going to try to get into here.)

A 'shared' style mail program worked directly with your inbox in whatever format it was in and whatever location it was in. This is how the V7 'mail' program worked, for example. Naturally these programs didn't have to work on your inbox; you could generally point them at another mailbox in the same format. I call this style 'shared' because you could use any number of different mail programs (mail clients) on your mailboxes, provided that they all understood the format and that all of them agreed on how to lock your mailbox against modifications, including against your system's MTA delivering new email right at the point where your mail program was, for example, trying to delete some.

(Locking issues are one of the things that maildir was designed to help with.)

An 'exclusive' style mail program (or system) was designed to own your email itself, rather than try to share your system mailbox. Of course it had to access your system mailbox a bit to get at your email, but broadly the only thing an exclusive mail program did with your inbox was pull all your new email out of it, write it into the program's own storage format and system, and then usually empty out your system inbox. I call this style 'exclusive' because you generally couldn't hop back and forth between mail programs (mail clients) and would be mostly stuck with your pick, since your main mail program was probably the only one that could really work with its particular storage format.

(Pragmatically, only locking your system mailbox for a short period of time and only doing simple things with it tended to make things relatively reliable. Shared style mail programs had much more room for mistakes and explosions, since they had to do more complex operations, at least on mbox format mailboxes. Being easy to modify is another advantage of the maildir format, since it outsources a lot of the work to your Unix filesystem.)

This shared versus exclusive design choice turned out to have some effects when mail moved to being on separate servers and accessed via POP and then later IMAP. My impression is that 'exclusive' systems coped fairly well with POP, because the natural operation with POP is to pull all of your new email out of the server and store it locally. By contrast, shared systems coped much better with IMAP than exclusive ones did, because IMAP is inherently a shared mail environment where your mail stays on the IMAP server and you manipulate it there.

(Since IMAP is the dominant way that mail clients/user agents get at email today, my impression is that the 'exclusive' approach is basically dead at this point as a general way of doing mail clients. Almost no one wants to use an IMAP client that immediately moves all of their email into a purely local data storage of some sort; they want their email to stay on the IMAP server and be accessible from and by multiple clients and even devices.)

Most classical Unix mail clients are 'shared' style programs, things like Alpine, Mutt, and the basic Mail program. One major 'exclusive' style program, really a system, is (N)MH (also). MH is somewhat notable because in its time it was popular enough that a number of other mail programs and mail systems supported its basic storage format to some degree (for example, procmail can deliver messages to MH-format directories, although it doesn't update all of the things that MH would do in the process).

Another major source of 'exclusive' style mail handling systems is GNU Emacs. I believe that both rmail and GNUS normally pull your email from your system inbox into their own storage formats, partly so that they can take exclusive ownership and don't have to worry about locking issues with other mail clients. GNU Emacs has a number of mail reading environments (cf, also) and I'm not sure what the others do (apart from MH-E, which is a frontend on (N)MH).

(There have probably been other 'exclusive' style systems. Also, it's a pity that as far as I know, MH never grew any support for keeping its messages in maildir format directories, which are relatively close to MH's native format.)

Maybe I should add new access control rules at the front of rule lists

By: cks
22 September 2025 at 03:14

Not infrequently I wind up maintaining slowly growing lists of filtering rules to either allow good things or weed out bad things. Not infrequently, traffic can potentially match more than one filtering rule, either because it has multiple bad (or good) characteristics or because some of the match rules overlap. My usual habit has been to add new rules to the end of my rule lists (or the relevant section of them), so the oldest rules are at the top and the newest ones are at the bottom.

After writing about how access control rules need some form of usage counters, it's occurred to me that maybe I want to reverse this, at least in typical systems where the first matching rule wins. The basic idea is that the rules I'm most likely to want to drop are the oldest rules, but by having them first I'm hindering my ability to see if they've been made obsolete by newer rules. If an old rule matches some bad traffic, a new rule matches all of the bad traffic, and the new rule is last, any usage counters will show a mix of the old rule and the new rule, making it look like the old rule is still necessary. If the order was reversed, the new rule would completely occlude the old rule and usage counters would show me that I could weed the old rule out.

(My view is that it's much less likely that I'll add a new rule at the bottom that's completely ineffectual because everything it matches is already matched by something earlier. If I'm adding a new rule, it's almost certainly because something isn't being handled by the collection of existing rules.)

Another possible advantage to this is that it will keep new rules at the top of my attention, because when I look at the rule list (or the section of it) I'll probably start at the top. Currently, the top is full of old rules that I usually ignore, but if I put new rules first I'll naturally see them right away.

(I think that most things I deal with are 'first match wins' systems. A 'last match wins' system would naturally work right here, but it has other confusing aspects. I also have the impression that adding new rules at the end is a common thing, but maybe it's just in the cultural water here.)

Our Django model class fields should include private, internal names

By: cks
21 September 2025 at 01:30

Let me tell you about a database design mistake I made in our Django web application for handling requests for Unix accounts. Our current account request app evolved from a series of earlier systems, and one of the things that these earlier systems asked people for was their 'status' with the university; were they visitors, graduate students, undergraduate students, (new) staff, and so on. When I created the current system I copied this, and so the database schema includes a 'Status' model class. The only thing I put in this model class was a text field that people picked from in our account request form, and I didn't really think of the text there as what you could call load-bearing. It was just a piece of information we asked people for because we'd always asked people for it, and faithfully duplicating the old CGI was the easy way to implement the web app.

Before too long, it turned out that we wanted to do some special things if people were graduate students (for example, notifying the department's administrative people so they could update their records to include the graduate student's Unix login and email address here). The obvious simple way to implement this was to do a text match on the value of the 'status' field for a particular person; if their 'status' was "Graduate Student", we knew they were a graduate student and we could do various special things. Over time, knowledge of the exact people-visible "Graduate Student" status text wormed its way into a whole collection of places around our account systems.

For reasons beyond the scope of this entry, we now (recently) want to change the people-visible text to be not exactly "Graduate Student" any more. Now we have a problem, because a bunch of places know that exact text (in fact I'm not sure I remember where all of those places are).

The mistake I made, way back when we first wanted things to know that an account or account request was a 'graduate student', was in not giving our 'Status' model an internal 'label' field, one not shown to people, in addition to the text that is shown to them. You can practically guarantee that anything you show to people will want to change sooner or later, so just as you shouldn't make actual people-exposed fields into primary or foreign keys, none of your code should care about their values. The correct solution is an additional field that acts as the internal label of a Status (with values that make sense to us), and then using this internal label any time the code wants to match on or find the 'Graduate Student' status.

(In theory I could use Django's magic 'id' field for this, since we're having Django create automatic primary keys for everything, including the Status model. In practice, the database IDs are completely opaque and I'd rather have something less opaque in code instead of everything knowing that ID '14' is the Graduate Student status ID.)

Fortunately, I've had a good experience with my one Django database migration so far, so this is a fixable problem. Threading the updates through all of the code (and finding all of the places that need updates, including in outside programs) will be a bit of work, but that's what I get for taking the quick hack approach when this first came up.

(I'm sure I'm not the only person to stub my toe this way, and there's probably a well known database design principle involved that would have told me better if I'd known about it and paid attention at the time.)

These days, systemd can be a cause of restrictions on daemons

By: cks
20 September 2025 at 02:59

One of the traditional rites of passage for Linux system administrators is having a daemon not work in the normal system configuration (eg, when you boot the system) but work when you manually run it as root. The classical cause of this on Unix was that $PATH wasn't fully set in the environment the daemon was running in but was in your root shell. On Linux, another traditional cause of this sort of thing has been SELinux and a more modern source (on Ubuntu) has sometimes been AppArmor. All of these create hard to see differences between your root shell (where the daemon works when run by hand) and the normal system environment (where the daemon doesn't work). These days, we can add another cause, an increasingly common one, and that is systemd service unit restrictions, many of which are covered in systemd.exec.

(One pernicious aspect of systemd as a cause of these restrictions is that they can appear in new releases of the same distribution. If a daemon has been running happily in an older release and now has surprise issues in a new Ubuntu LTS, I don't always remember to look at its .service file.)

Some of systemd's protective directives simply cause failures to do things, like access user home directories if ProtectHome= is set to something appropriate. Hopefully your daemon complains loudly here, reporting mysterious 'permission denied' or 'file not found' errors. Some systemd settings can have additional, confusing effects, like PrivateTmp=. A standard thing I do when troubleshooting a chain of programs executing programs executing programs is to shim in diagnostics that dump information to /tmp, but with PrivateTmp= on, my debugging dump files are mysteriously not there in the system-wide /tmp.

(On the other hand, a daemon may not complain about missing files if it's expected that the files aren't always there. A mailer usually can't really tell the difference between 'no one has .forward files' and 'I'm mysteriously not able to see people's home directories to find .forward files in them'.)

Sometimes you don't get explicit errors, just mysterious failures to do some things. For example, you might set IP address access restrictions with the intention of blocking inbound connections but wind up also blocking DNS queries (and this will also depend on whether or not you use systemd-resolved). The good news is that you're mostly not going to find standard systemd .service files for normal daemons shipped by your Linux distribution with IP address restrictions. The bad news is that at some point .service files may start showing up that impose IP address restrictions with the assumption that DNS resolution is being done via systemd-resolved as opposed to direct DNS queries.

(I expect some Linux distributions to resist this, for example Debian, but others may declare that using systemd-resolved is now mandatory in order to simplify things and let them harden service configurations.)

Right now, you can usually test if this is the problem by creating a version of the daemon's .service file with any systemd restrictions stripped out of it and then seeing if using that version makes life happy. In the future it's possible that some daemons will assume and require some systemd restrictions (for instance, assuming that they have a /tmp all of their own), making things harder to test.

Some stuff on how Linux consoles interact with the mouse

By: cks
19 September 2025 at 01:24

On at least x86 PCs, Linux text consoles ('TTY' consoles or 'virtual consoles') support some surprising things. One of them is doing some useful stuff with your mouse, if you run an additional daemon such as gpm or the more modern consolation. This is supported on both framebuffer consoles and old 'VGA' text consoles. The experience is fairly straightforward; you install and activate one of the daemons, and afterward you can wave your mouse around, select and paste text, and so on. How it works and what you get is not as clear, and since I recently went diving into this area for reasons, I'm going to write down what I now know before I forget it (with a focus on how consolation works).

The quick summary is that the console TTY's mouse support is broadly like a terminal emulator. With a mouse daemon active, the TTY will do "copy and paste" selection stuff on its own. A mouse aware text mode program can put the console into a mode where mouse button presses are passed through to the program, just as happens in xterm or other terminal emulators.

The simplest TTY mode is when a non-mouse-aware program or shell is active, which is to say a program that wouldn't try to intercept mouse actions itself if it was run in a regular terminal window and would leave mouse stuff up to the terminal emulator. In this mode, your mouse daemon reads mouse input events and then uses sub-options of the TIOCLINUX ioctl to inject activities into the TTY, for example telling it to 'select' some text and then asking it to paste that selection to some file descriptor (normally the console itself, which delivers it to whatever foreground program is taking terminal input at the time).

(In theory you can use the mouse to scroll text back and forth, but in practice that was removed in 2020, both for the framebuffer console and for the VGA console. If I'm reading the code correctly, a VGA console might still have a little bit of scrollback support depending on how much spare VGA RAM you have for your VGA console size. But you're probably not using a VGA console any more.)

The other mode the console TTY can be in is one where some program has used standard xterm-derived escape sequences to ask for xterm-compatible "mouse tracking", which is the same thing it might ask for in a terminal emulator if it wanted to handle the mouse itself. What this does in the kernel TTY console driver is set a flag that your mouse daemon can query with TIOCL_GETMOUSEREPORTING; the kernel TTY driver still doesn't directly handle or look at mouse events. Instead, consolation (or gpm) reads the flag and, when the flag is set, uses the TIOCL_SELMOUSEREPORT sub-sub-option to TIOCLINUX's TIOCL_SETSEL sub-option to report the mouse position and button presses to the kernel (instead of handling mouse activity itself). The kernel then turns around and sends mouse reporting escape codes to the TTY, as the program asked for.

(As I discovered, we got a CVE this year related to this, where the kernel let too many people trigger sending programs 'mouse' events. See the stable kernel commit message for details.)
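
To illustrate the program side of this, here's a hedged Go sketch of a mouse-aware program (it assumes the classic xterm '1000' tracking mode and uses golang.org/x/term for raw mode). Run on a text console with consolation or gpm active, button presses should arrive as 'ESC [ M' reports on standard input.

package main

import (
    "fmt"
    "os"

    "golang.org/x/term"
)

func main() {
    // Raw mode so the mouse report bytes reach us directly.
    old, err := term.MakeRaw(int(os.Stdin.Fd()))
    if err != nil {
        fmt.Println("not a terminal:", err)
        return
    }
    defer term.Restore(int(os.Stdin.Fd()), old)

    // Ask for xterm-style "normal tracking" mouse reporting; on the
    // Linux console this sets the flag that the mouse daemon polls for.
    os.Stdout.WriteString("\x1b[?1000h")
    defer os.Stdout.WriteString("\x1b[?1000l")

    // Button presses arrive as ESC [ M followed by three bytes that
    // encode the button and the 1-based column and row.
    buf := make([]byte, 64)
    for i := 0; i < 5; i++ {
        n, err := os.Stdin.Read(buf)
        if err != nil {
            break
        }
        fmt.Printf("read %q\r\n", buf[:n])
    }
}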

A mouse daemon like consolation doesn't have to pay attention to the kernel's TTY 'mouse reporting' flag. As far as I can tell from the current Linux kernel code, if the mouse daemon ignores the flag it can keep on doing all of its regular copy and paste selection and mouse button handling. However, sending mouse reports is only possible when a program has specifically asked for it; the kernel will report an error if you ask it to send a mouse report at the wrong time.

(As far as I can see there's no notification from the kernel to your mouse daemon that someone changed the 'mouse reporting' flag. Instead you have to poll it; it appears consolation does this every time through its event loop before it handles any mouse events.)

PS: Some documentation on console mouse reporting was written as a 2020 kernel documentation patch (alternate version) but it doesn't seem to have made it into the tree. According to various sources, eg, the mouse daemon side of things can only be used by actual mouse daemons, not by programs, although programs do sometimes use other bits of TIOCLINUX's mouse stuff.

PPS: It's useful to install a mouse daemon on your desktop or laptop even if you don't intend to ever use the text TTY. If you ever wind up in the text TTY for some reason, perhaps because your regular display environment has exploded, having mouse cut and paste is a lot nicer than not having it.

Free and open source software is incompatible with (security) guarantees

By: cks
18 September 2025 at 02:53

If you've been following the tech news, one of the recent things that's happened is that there has been another incident where a bunch of popular and widely used packages on a popular package repository for a popular language were compromised, this time with a self-replicating worm. This is very inconvenient to some people, especially to companies in Europe, for some reason, and so some people have been making the usual noises. On the Fediverse, I had a hot take:

Hot take: free and open source is fundamentally incompatible with strong security *guarantees*, because FOSS is incompatible with strong guarantees about anything. It says so right there on the tin: "without warranty of any kind, either expressed or implied". We guarantee nothing by default, you get the code, the project, everything, as-is, where-is, how-is.

Of course companies find this inconvenient, especially with the EU CRA looming, but that's not FOSS's problem. That's a you problem.

To be clear here: this is not about the security and general quality of FOSS (which is often very good), or the responsiveness of FOSS maintainers. This is about guarantees, firm (and perhaps legally binding) assurances of certain things (which people want for software in general). FOSS can provide strong security in practice but it's inimical to FOSS's very nature to provide a strong guarantee of that or anything else. The thing that makes most of FOSS possible is that you can put out software without that guarantee and without legal liability.

An individual project can solemnly say it guarantees its security, and if it does so it's an open legal question whether that writing trumps the writing in the license. But in general a core and absolutely necessary aspect of free and open source is that warranty disclaimer, and that warranty disclaimer cuts across any strong guarantees about anything, including security and lack of bugs.

Are the compromised packages inconvenient to a lot of companies? They certainly are. But neither the companies nor commentators can say that the compromise violated some general strong security guarantee about packages, because there is and never will be such a guarantee with FOSS (see, for example, Thomas Depierre's I am not a supplier, which puts into words a sentiment a lot of FOSS people have).

(But of course the companies and sympathetic commentators are framing it that way because they are interested in the second vision of "supply chain security", where using FOSS code is supposed to magically absolve companies of the responsibility that people want someone to take.)

The obvious corollary of this is that widespread usage of FOSS packages and software, especially with un-audited upgrades of package versions (however that happens), is incompatible with having any sort of strong security or quality guarantee about the result. The result may have strong security and high quality, but if so, those come without guarantees; you've just been lucky. If you want guarantees, you will have to arrange them yourself and it's very unlikely you can achieve strong guarantees while using the typical ever-changing pile of FOSS code.

(For example, if dependencies auto-update before you can audit them and their changes, or faster than you can keep up, you have nothing in practice.)

My Fedora machines need a cleanup of their /usr/sbin for Fedora 42

By: cks
17 September 2025 at 03:06

One of the things that Fedora is trying to do in Fedora 42 is unifying /usr/bin and /usr/sbin. In an ideal (Fedora) world, your Fedora machines will have /usr/sbin be a symbolic link to /usr/bin after they're upgraded to Fedora 42. However, if your Fedora machines have been around for a while, or perhaps have some third party packages installed, what you'll actually wind up with is a /usr/sbin that is mostly symbolic links to /usr/bin but still has some actual programs left.
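
If you want a quick survey of what's left, here's a hedged little Go sketch that lists the /usr/sbin entries that are still real files rather than symbolic links (a shell loop or 'find' would do just as well); you can then 'rpm -qf' the survivors to see what owns them.

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    entries, err := os.ReadDir("/usr/sbin")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, e := range entries {
        // Anything that isn't a symlink is still genuinely installed
        // in /usr/sbin by some package (or by hand).
        if e.Type()&os.ModeSymlink == 0 {
            fmt.Println(filepath.Join("/usr/sbin", e.Name()))
        }
    }
}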

One source of these remaining /usr/sbin programs is old packages from past versions of Fedora that are no longer packaged in Fedora 41 and Fedora 42. Old packages are usually harmless, so it's easy for them to linger around if you're not disciplined; my home and office desktops (which have been around for a while) still have packages from as far back as Fedora 28.

(An added complication of tracking down file ownership is that some RPMs haven't been updated for the /sbin to /usr/sbin merge and so still believe that their files are /sbin/<whatever> instead of /usr/sbin/<whatever>. A 'rpm -qf /usr/sbin/<whatever>' won't find these.)

Obviously, you shouldn't remove old packages without being sure of whether or not they're important to you. I'm also not completely sure that all packages in the Fedora 41 (or 42) repositories are marked as '.fc41' or '.fc42' in their RPM versions, or if there are some RPMs that have been carried over from previous Fedora versions. Possibly this means I should wait until a few more Fedora versions have come to pass so that other people find and fix the exceptions.

(On what is probably my cleanest Fedora 42 test virtual machine, there are a number of packages that 'dnf list --extras' doesn't list that have '.fc41' in their RPM version. Some of them may have been retained un-rebuilt for binary compatibility reasons. There's also the 'shim' UEFI bootloaders, which date from 2024 and don't have Fedora releases in their RPM versions, but those I expect to basically never change once created. But some others are a bit mysterious, such as 'libblkio', and I suspect that they may have simply been missed by the Fedora 42 mass rebuild.)

PS: In theory anyone with access to the full Fedora 42 RPM repository could sweep the entire thing to find packages that still install /usr/sbin files or even /sbin files, which would turn up any relevant not yet rebuilt packages. I don't know if there's any easy way to do this through dnf commands, although I think dnf does have access to a full file list for all packages (which is used for certain dnf queries).

Access control rules need some form of usage counters

By: cks
16 September 2025 at 03:15

Today, for reasons outside the scope of this entry, I decided to spend some time maintaining and pruning the access control rules for Wandering Thoughts, this blog. Due to the ongoing crawler plague (and past abuses), Wandering Thoughts has had to build up quite a collection of access control rules, which are mostly implemented as a bunch of things in an Apache .htaccess file (partly 'Deny from ...' for IP address ranges and partly as rewrite rules based on other characteristics). The experience has left me with a renewed view of something, which is that systems with access control rules need some way of letting you see which rules are still being used by your traffic.

It's in the nature of systems with access control rules to accumulate more and more rules over time. You hit another special situation, you add another rule, perhaps to match and block something or perhaps to exempt something from blocking. These rules often interact in various ways, and over time you'll almost certainly wind up with a tangled thicket of rules (because almost no one goes back to carefully check and revisit all existing rules when they add a new one or modify an existing one). The end result is a mess, and one of the ways to reduce the mess is to weed out rules that are now obsolete. One way a rule can be obsolete is that it's not used any more, and often these are the easiest rules to drop once you can recognize them.

(A rule that's still being matched by traffic may be obsolete for other reasons, and rules that aren't currently being matched may still be needed as a precaution. But it's a good starting point.)

If you have the necessary log data, you can sometimes establish whether a rule was ever actually used by manually checking your logs. For example, if you have logs of rejected traffic (or logs of all traffic), you can search them for an IP address range to see if a particular IP address rule ever matched anything. But this requires tedious manual effort, which means that only determined people will go through it, especially on a regular basis. The better way is to either have this information provided directly, such as by counters on firewall rules, or to have something in your logs that makes deriving it easy.

(An Apache example would be to augment any log line that was matched by some .htaccess rule with a name or a line number or the like. Then you could readily go through your logs to determine which lines were matched and how often.)
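As a sketch of what I mean, if your log lines carried a (hypothetical) 'acl=<name>' tag whenever they matched a rule, a few lines of Python would give you a usage count per rule:

    #!/usr/bin/python3
    # Tally rule usage from log lines on standard input, assuming each line
    # that matched an access control rule carries a hypothetical 'acl=<name>'
    # tag. Rules that never show up here are candidates for removal.
    import re
    import sys
    from collections import Counter

    tag = re.compile(r'\bacl=(\S+)')
    counts = Counter()
    for line in sys.stdin:
        m = tag.search(line)
        if m:
            counts[m.group(1)] += 1

    for rule, hits in counts.most_common():
        print(f"{hits:8d}  {rule}")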

The next time I design an access control rule system, I'm hopefully going to remember this and put something in its logging to (optionally) explain its decisions.

(Periodically I write something that has an access control rule system of some sort. Unfortunately all of mine to date have been quiet on this, so I'm not at all without sin here.)

The idea of /usr/sbin has failed in practice

By: cks
15 September 2025 at 03:17

One of the changes in Fedora Linux 42 is unifying /usr/bin and /usr/sbin, by moving everything in /usr/sbin to /usr/bin. To some people, this probably smacks of anathema, and to be honest, my first reaction was to bristle at the idea. However, the more I thought about it, the more I had to concede that the idea of /usr/sbin has failed in practice.

We can tell /usr/sbin has failed in practice by asking how many people routinely operate without /usr/sbin in their $PATH. In a lot of environments, the answer is that very few people do, because sooner or later you run into a program that you want to run (as yourself) to obtain useful information or do useful things. Let's take FreeBSD 14.3 as an illustrative example (to make this not a Linux biased entry); looking at /usr/sbin, I recognize iostat, manctl (you might use it on your own manpages), ntpdate (which can be run by ordinary people to query the offsets of remote servers), pstat, swapinfo, and traceroute. There are probably others that I'm missing, especially if you use FreeBSD as a workstation and so care about things like sound volumes and keyboard control.

(And if you write scripts and want them to send email, you'll care about sendmail and/or FreeBSD's 'mailwrapper', both in /usr/sbin. There's also DTrace, but I don't know if you can DTrace your own binaries as a non-root user on FreeBSD.)

For a long time, there has been no strong organizing principle to /usr/sbin that would draw a hard line and create a situation where people could safely leave it out of their $PATH. We could have had a principle of, for example, "programs that don't work unless run by root", but no such principle was ever followed for very long (if at all). Instead programs were more or less shoved in /usr/sbin if developers thought they were relatively unlikely to be used by normal people. But 'relatively unlikely' is not 'never', and shortly after people got told to 'run traceroute' and got 'command not found' when they tried, /usr/sbin (probably) started appearing in $PATH.

(And then when you asked 'how does my script send me email about something', people told you about /usr/sbin/sendmail and another crack appeared in the wall.)

If /usr/sbin is more of a suggestion than a rule and it appears in everyone's $PATH because no one can predict which programs you want to use will be in /usr/sbin instead of /usr/bin, I believe this means /usr/sbin has failed in practice. What remains is an unpredictable and somewhat arbitrary division between two directories, where which directory something appears in operates mostly as a hint (a hint that's invisible to people who don't specifically look where a program is).

(This division isn't entirely pointless and one could try to reform the situation in a way short of Fedora 42's "burn the entire thing down" approach. If nothing else the split keeps the size of both directories somewhat down.)

PS: The /usr/sbin like idea that I think is still successful in practice is /usr/libexec. Possibly a bunch of things in /usr/sbin should be relocated to there (or appropriate subdirectories of it).

My machines versus the Fedora selinux-policy-targeted package

By: cks
14 September 2025 at 02:26

I upgrade Fedora on my office and home workstations through an online upgrade with dnf, and as part of this I read (or at least scan) DNF's output to look for problems. Usually this goes okay, but DNF5 has a general problem with script output, and when I did a test upgrade from Fedora 41 to Fedora 42 on a virtual machine, a script run by selinux-policy-targeted generated a huge amount of output, reporting "Old compiled fcontext format, skipping" over and over for various .bin files in /etc/selinux/targeted/contexts/files. The volume of output made the rest of DNF's output essentially unreadable. I would like to avoid this when I actually upgrade my office and home workstations to Fedora 42 (which I still haven't done, partly because of this issue).

(You can't make this output easier to read because DNF5 is too smart for you. This particular error message reportedly comes from 'semodule -B', per this Fedora discussion.)

The 'targeted' policy is one of several SELinux policies that are supported or at least packaged by Fedora (although I suspect I might see similar issues with the other policies too). My main machines don't use SELinux and I have it completely disabled, so in theory I should be able to remove the selinux-policy-targeted package to stop it from repeatedly complaining during the Fedora 42 upgrade process. In practice, selinux-policy-targeted is a 'protected' package that DNF will normally refuse to remove. Such packages are listed in /etc/dnf/protected.d/ in various .conf files; selinux-policy-targeted installs (well, includes) a .conf file to protect itself from removal once installed.
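If you're curious about what's protected on a machine and by which file, a quick sketch along these lines will list it (as far as I can tell, the .conf files are simply package names, one per line):

    #!/usr/bin/python3
    # List which /etc/dnf/protected.d/*.conf file protects which packages.
    # As far as I can tell the files are just package names, one per line.
    import glob
    import os

    for conf in sorted(glob.glob("/etc/dnf/protected.d/*.conf")):
        with open(conf) as f:
            pkgs = [ln.strip() for ln in f
                    if ln.strip() and not ln.lstrip().startswith("#")]
        print(os.path.basename(conf) + ":", " ".join(pkgs))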

(Interestingly, sudo protects itself but there's nothing specifically protecting su and the rest of util-linux. I suspect util-linux is so pervasively a dependency that other protected things hold it down, or alternately no one has ever worried about people removing it and shooting themselves in the foot.)

I can obviously remove this .conf file and then DNF will let me remove selinux-policy-targeted, which will force the removal of some other SELinux policy packages (both selinux-policy packages themselves and some '*-selinux' sub-packages of other packages). I tried this on another Fedora 41 test virtual machine and nothing obvious broke, but that doesn't mean that nothing broke at all. It seems very likely that almost no one tests Fedora without the selinux-policy collective installed and I suspect it's not a supported configuration.

I could reduce my risks by removing the packages only just before I do the upgrade to Fedora 42 and putting them back afterward (well, unless I run into a dnf issue as a result, although that issue is from 2024). Also, now that I've investigated this, I could in theory delete the .bin files in /etc/selinux/targeted/contexts/files before the upgrade, hopefully leaving selinux-policy-targeted with less to complain about, or nothing at all. Since I'm not using SELinux, hopefully the lack of these files won't cause any problems, but of course this is a less certain fix than removing selinux-policy-targeted (for example, perhaps the .bin files would get automatically rebuilt early on in the upgrade process as packages are shuffled around, and bring the problem back with them).

Really, though, I wish DNF5 didn't have its problem with script output. All of this is hackery to deal with that underlying issue.

Some notes on (Tony Finch's) exponential rate limiting in practice

By: cks
13 September 2025 at 03:43

After yesterday's entry where I discovered it, I went and implemented Tony Finch's exponential rate limiting for HTTP request rate limiting in DWiki, the engine underlying this blog, replacing the more brute force and limited version I had initially implemented. I chose exponential rate limiting over GCRA or leaky buckets because I found it much easier to understand how to set the limits (partly because I'm somewhat familiar with the whole thing from Exim). Exponential rate limiting needed me to pick a period of time and a number of (theoretical) requests that can be made in that time interval, which was easy enough; GCRA 'rate' and 'burst' numbers were less clear to me. However, exponential rate limiting has some slightly surprising things that I want to remember.

(Exponential ratelimits don't have a 'burst' rate as such but you can sort of achieve this by your choice of time intervals.)

In my original simple rate limiting, any rate limit record that had a time outside of my interval was irrelevant and could be dropped in order to reduce space usage (my current approach uses basically the same hack as my syndication feed ratelimits, so I definitely don't want to let its space use grow without bound). This is no longer necessarily true in exponential rate limiting, depending on how big a rate the record (the source) had built up before it took a break. This old rate 'decays' at a rate I will helpfully put in a table for my own use:

Time since last seen    Old rate multiplied by
1x interval             0.37
2x interval             0.13
3x interval             0.05
4x interval             0.02

(This is, eg, 'exp(-1)' when we last saw the source one 'interval' ago.)

Where this becomes especially relevant is if you opt for 'strict' rate limiting instead of 'leaky', where every time the source makes a request you increase its recorded rate even if you reject the request for being rate limited. A high-speed source that insists on hammering you for a while can build up a very large current rate under a strict rate limit policy, and that means its old past behavior can affect it (ie, possibly cause it to be rate limited) well beyond your nominal rate limit interval. Especially with 'strict' rate limiting, you could opt to cap the maximum age a valid record could have and drop everything that you last saw over, say, 3x your interval ago; this would be generous to very high rate old sources, but not too generous (since their old rate would be reduced to 0.05 or less of what it was even if you counted it).

As far as I can see, the behavior with leaky rate limiting and a cost of 1 (for the simple case of all HTTP requests having the same cost) is that if the client keeps pounding away at you, one of its requests will get through on a semi-regular basis. The client will make a successful request, the request will push its rate just over your limit, it will get rate limited some number of times, then enough time will have passed since its last successful request that its new request will be just under the rate limit and succeed. In some environments, this is fine and desired. However, my current goal is to firmly cut off clients that are making requests too fast, so I don't want this; instead, I implemented the 'strict' behavior, so you don't get through at all until your request rate, given the interval since your last recorded request, drops low enough.
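To make the 'strict' versus 'leaky' distinction concrete, here's a minimal sketch of the sort of thing I'm talking about. The names are mine and this isn't Tony Finch's exact code; in particular it simply adds the full cost on every request, while his version computes the per-request contribution more carefully, which is where the 'max(interval, 1.0e-10)' step discussed below comes in.

    import math
    import time

    # Per-source record: when we last saw it and its decayed request rate,
    # measured in requests per 'period' seconds.
    class Entry:
        def __init__(self):
            self.last = 0.0
            self.rate = 0.0

    def check(entry, limit, period, cost=1.0, strict=True, now=None):
        # Returns True if the request is allowed, False if rate limited.
        if now is None:
            now = time.time()
        # Decay the old rate by how long ago we last saw this source; after
        # 1x period this multiplies it by ~0.37, after 2x by ~0.13, and so on.
        newrate = entry.rate * math.exp(-(now - entry.last) / period) + cost
        allowed = newrate <= limit
        if allowed or strict:
            # 'strict' charges the source even for rejected requests, so a
            # source that keeps hammering away never drops back under the limit.
            entry.last = now
            entry.rate = newrate
        # 'leaky' leaves the record untouched on rejection, which is what lets
        # an over-limit client get the occasional request through.
        return allowed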

Mathematically, a client that makes requests with little or no gap between them (to the precision of your timestamps) can wind up increasing its rate by slightly over its 'cost' per request. If I'm understanding the math correctly, how much over the cost is capped by Tony Finch's 'max(interval, 1.0e-10)' step, with 1.0e-10 being a small but non-zero number that you can move up or down depending on, eg, your language and its floating point precision. Having looked at it, in Python the resulting factor with 1.0e-10 is '1.000000082740371', so you and I probably don't need to worry about this. If the client doesn't make requests quite that fast, its rate will go up each time by slightly less than the 'cost' you've assigned. In Python, a client that makes a request every millisecond has a factor for this of '0.9995001666249781' of the cost; slower request rates make this factor smaller.

This is probably mostly relevant if you're dumping or reporting the calculated rates (for example, when a client hits the rate limit) and get puzzled by the odd numbers that may be getting reported.

I don't know how to implement proper ratelimiting (well, maybe I do now)

By: cks
12 September 2025 at 01:53

In theory I have a formal education as a programmer (although it was a long time ago). In practice my knowledge from it isn't comprehensive, and every so often I run into an area where I know there's relevant knowledge and algorithms but I don't know what they are and I'm not sure how to find them. Today's area is scalable rate-limiting with low storage requirements.

Suppose, not hypothetically, that you want to ratelimit a collection of unpredictable sources and not use all that much storage per source. One extremely simple and obvious approach is to store, for each source, a start time and a count. Every time the source makes a request, you check to see if the start time is within your rate limit interval; if it is, you increase the count (or ratelimit the source), and if it isn't, you reset the start time to now and the count to 1.

(Every so often you can clean out entries with start times before your interval.)
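In code form, the simple approach is something like this (a sketch, with names of my own choosing):

    import time

    # The simple approach: per source, a window start time and a count.
    windows = {}    # source -> (window start time, count)

    def allowed(source, limit, interval, now=None):
        if now is None:
            now = time.time()
        start, count = windows.get(source, (0.0, 0))
        if now - start >= interval:
            # The window has expired: forget the past and start over.
            windows[source] = (now, 1)
            return True
        if count >= limit:
            return False
        windows[source] = (start, count + 1)
        return True

    def expire(interval, now=None):
        # The periodic cleanup: drop sources whose window has already ended.
        if now is None:
            now = time.time()
        for src in [s for s, (start, _) in windows.items()
                    if now - start >= interval]:
            del windows[src]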

The disadvantage of this simple approach is that it completely forgets about the past history of each source periodically. If your rate limit intervals are 20 minutes, a prolific source gets to start over from scratch every 20 minutes and run up its count until it gets rate limited again. Typically you want rate limiting not to forget about sources so fast.

I know there are algorithms that maintain decaying averages or moving (rolling) averages. The Unix load average is maintained this way, as is Exim ratelimiting. The Unix load average has the advantage that it's updated on a regular basis, which makes the calculation relatively simple. Exim has to deal with erratic updates that come at unpredictable intervals from the previous update, and the comment in the source is a bit opaque to me. I could probably duplicate the formula in my code but I'd have to do a bunch of work to convince myself the result was correct.

(And now I've found Tony Finch's exponential rate limiting (via), which I'm going to have to read carefully, along with the previous GCRA: leaky buckets without the buckets.)

Given that rate limiting is such a common thing these days, I suspect that there are a number of algorithms for this with various different choices about how the limits work. Ideally, it would be possible to readily find writeups of them with internet searches, but of course as you know internet search is fairly broken these days.

(For example you can find a lot of people giving high level overviews of rate limiting without discussing how to actually implement it.)

Now that I've found Tony Finch's work I'm probably going to rework my hacky rate limiting code to do things better, because my brute force approach is using the same space as leaky buckets (as covered in Tony Finch's article) with inferior results. This shows the usefulness of knowing algorithms instead of just coding away.

(Improving the algorithm in my code will probably make no practical difference, but sometimes programming is its own pleasure.)

ZFS snapshots aren't as immutable as I thought, due to snapshot metadata

By: cks
11 September 2025 at 03:29

If you know about ZFS snapshots, you know that one of their famous properties is that they're immutable; once a snapshot is made, its state is frozen. Or so you might casually describe it, but that description is misleading. What is frozen in a ZFS snapshot is the state of the filesystem (or zvol) that it captures, and only that. In particular, the metadata associated with the snapshot can and will change over time.

(When I say it this way it sounds obvious, but for a long time my intuition about how ZFS operated was misled by thinking that all aspects of a snapshot had to be immutable once it was made, which left me trying to figure out how ZFS worked around that.)

One visible place where ZFS updates the metadata of a snapshot is to maintain information about how much unique space the snapshot is using. Another is that when a ZFS snapshot is deleted, other ZFS snapshots may require updates to adjust the list of snapshots (every snapshot points to the previous one) and the ZFS deadlist of blocks that are waiting to be freed.

Mechanically, I believe that various things in a dsl_dataset_phys_t are mutable, with the exception of things like the creation time and the creation txg, and also the block pointer, which points to the actual filesystem data of the snapshot. Things like the previous snapshot information have to be mutable (you might delete the previous snapshot), and things like the deadlist and the unique bytes are mutated as part of operations like snapshot deletion. The other things I'm not sure of.

(See also my old entry on a broad overview of how ZFS is structured on disk. A snapshot is a 'DSL dataset' and it points to the object set for that snapshot. The root directory of a filesystem DSL dataset, snapshot or otherwise, is at a fixed number in the object set; it's always object 1. A snapshot freezes the object set as of that point in time.)

PS: Another mutable thing about snapshots is their name, since 'zfs rename' can change that. The manual page even gives an example of using (recursive) snapshot renaming to keep a rolling series of daily snapshots.

How I think OpenZFS's 'written' and 'written@<snap>' dataset properties work

By: cks
10 September 2025 at 03:25

Yesterday I wrote some notes about ZFS's 'written' dataset property, where the short summary is that 'written' reports the amount of space written in a snapshot (ie, that wasn't in the previous snapshot), and 'written@<snapshot>' reports the amount of space written since the specified snapshot (up to either another snapshot or the current state of the dataset). In that entry, I left un-researched the question of how ZFS actually gives us those numbers; for example, whether there was a mechanism in place similar to the complicated one for 'used' space. I've now looked into this and as far as I can see the answer is that ZFS determines this information on the fly.

The guts of the determination are in dsl_dataset_space_written_impl(), which has a big comment that I'm going to quote wholesale:

Return [...] the amount of space referenced by "new" that was not referenced at the time the bookmark corresponds to. "New" may be a snapshot or a head. The bookmark must be before new, [...]

The written space is calculated by considering two components: First, we ignore any freed space, and calculate the written as new's used space minus old's used space. Next, we add in the amount of space that was freed between the two time points, thus reducing new's used space relative to old's. Specifically, this is the space that was born before zbm_creation_txg, and freed before new (ie. on new's deadlist or a previous deadlist).

(A 'bookmark' here is an internal ZFS thing.)

When this talks about 'used' space, this is not the "used" snapshot property; this is the amount of space the snapshot or dataset refers to, including space shared with other snapshots. If I'm understanding the code and the comment right, the reason we add back in freed space is because otherwise you could wind up with a negative number. Suppose you wrote a 2 GB file, made one snapshot, deleted the file, and then made a second snapshot. The difference in space referenced between the two snapshots is slightly less than negative 2 GB, but we can't report that as 'written', so we go through the old stuff that got deleted and add its size back in to make the number positive again.
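To put toy numbers on that example (these are made up, and I'm ignoring metadata overhead):

    # Toy numbers for the example above, in GB: a 2 GB file written before
    # the old snapshot and deleted before "new", plus 0.25 GB of new data.
    old_used = 10.0          # space referenced at the old snapshot
    new_used = 8.25          # space referenced by "new"; the 2 GB file is gone
    freed_since_old = 2.0    # born before the old snapshot, freed before "new"

    # Naively new_used - old_used is -1.75 GB, which can't be 'written', so
    # the freed space gets added back in:
    written = (new_used - old_used) + freed_since_old
    print(written)           # 0.25 GB of genuinely new data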

To determine the amount of space that's been freed between the bookmark and "new", the ZFS code walks backward through all snapshots from "new" to the bookmark, calling another ZFS function to determine how much relevant space got deleted. This uses the ZFS deadlists that ZFS is already keeping track of to know when it can free an object.

This code is used both for 'written@<snap>' and 'written'; the only difference between them is that when you ask for 'written', the ZFS kernel code automatically finds the previous snapshot for you.

Some notes on OpenZFS's 'written' dataset property

By: cks
9 September 2025 at 03:28

ZFS snapshots and filesystems have a 'written' property, and a related 'written@snapshot' one. These are documented as:

written
The amount of space referenced by this dataset, that was written since the previous snapshot (i.e. that is not referenced by the previous snapshot).

written@snapshot
The amount of referenced space written to this dataset since the specified snapshot. This is the space that is referenced by this dataset but was not referenced by the specified snapshot. [...]

(Apparently I never noticed the 'written' property before recently, despite it being there from very long ago.)

The 'written' property is related to the 'used' property, and it's both more confusing and less confusing as it relates to snapshots. Famously (but not famously enough), for snapshots the used property ('USED' in the output of 'zfs list') only counts space that is exclusive to that snapshot. Space that's only used by snapshots but that is shared by more than one snapshot is in 'usedbysnapshots'.

To understand 'written' better, let's do an experiment: we'll make a snapshot, write a 2 GByte file, make a second snapshot, write another 2 GByte file, make a third snapshot, and then delete the first 2 GB file. Since I've done this, I can tell you the results.

If there are no other snapshots of the filesystem, the first snapshot's 'written' value is the full size of the filesystem at the time it was made, because everything was written before it was made. The second snapshot's 'written' is 2 GBytes, the data file we wrote between the first and the second snapshot. The third snapshot's 'written' is another 2 GB, for the second file we wrote. However, at the end, after we delete one of the data files, the filesystem's 'written' is small (certainly not 2 GB), and so would be the 'written' of a fourth snapshot if we made one.

The reason the filesystem's 'written' is so small is that ZFS is counting concrete on-disk (new) space. Deleting a 2 GB file frees up a bunch of space but it doesn't require writing very much to the filesystem, so the 'written' value is low.

If we look at the 'used' values for all three snapshots, they're all going to be really low. This is because neither 2 GByte data file is unique to a single snapshot: the first file is shared between the second and third snapshots (so its space winds up in the filesystem's 'usedbysnapshots' rather than in either snapshot's 'used'), and the second file is still referenced by the live filesystem as well as by the third snapshot, so it's not unique to that snapshot either.

(ZFS has a somewhat complicated mechanism to maintain all of this information.)

There is one interesting 'written' usage that appears to show you deleted space, but it is a bit tricky. The manual page implies that the normal usage of 'written@<snapshot>' is to ask for it for the filesystem itself; however, in experimentation you can ask for it for a snapshot too. So take the three snapshots above, and the filesystem after deleting the first data file. If you ask for 'written@first' for the filesystem, you will get 2 GB, but if you ask for 'written@first' for the third snapshot, you will get 4 GB. What the filesystem appears to be reporting is how much still-live data has been written between the first snapshot and now, which is only 2 GB because we deleted the other 2 GB. Meanwhile, all four GB are still alive in the third snapshot.

My conclusion from looking into this is that I can use 'written' as an indication of how much new data a snapshot has captured, but I can't use it as an indication of how much changed in a snapshot. As I've seen, deleting data is a potentially big change but a small 'written' value. If I'm understanding 'written' correctly, one useful thing about it is that it shows roughly how much data an incremental 'zfs send' of just that snapshot would send. Under some circumstances it will also give you an idea of how much data your backup system may need to back up; however, this works best if people are creating new files (and deleting old ones), instead of updating or appending to existing files (where ZFS only updates some blocks but a backup system probably needs to re-save the whole thing).

Why Firefox's media autoplay settings are complicated and imperfect

By: cks
8 September 2025 at 03:25

In theory, a website that wanted to play video or audio could throw in a '<video controls ...>' or '<audio controls ...>' element in the HTML of the page and be done with it. This would make handling media playback simple and blocking autoplay reliable; the browser would ignore the autoplay attribute, and the person using the browser would trigger playback by interacting with controls that the browser itself drew, so the browser could know for sure that a person had directly clicked on them and that the media should be played.

As anyone who's seen websites with audio and video on the web knows, in practice almost no one does it this way, with browser controls on the <video> or <audio> element. Instead, everyone displays controls of their own somehow (eg as HTML elements styled through CSS), attaches JavaScript actions to them, and then uses the HTMLMediaElement browser API to trigger playback and various other things. As a result of this use of JavaScript, browsers in general and Firefox in particular no longer have a clear, unambiguous view of your intentions to play media. At best, all they can know is that you interacted with the web page, this interaction triggered some JavaScript, and the JavaScript requested that media play.

(Browsers can know somewhat of how you interacted with a web page, such as whether you clicked or scrolled or typed a key.)

On good, well behaved websites, this interaction is with visually clear controls (such as a visual 'play' button) and the JavaScript that requests media playing is directly attached to those controls. And even on these websites, JavaScript may later legitimately act asynchronously to request more playing of things, or you may interact with media playback in other ways (such as spacebar to pause and then restart media playing). On not so good websites, well, any piece of JavaScript that manages to run can call HTMLMediaElement.play() to try to start playing the media. There are lots of ways to have JavaScript run automatically and so a web page can start trying to play media the moment its JavaScript starts running, and it can keep trying to trigger playback over and over again if it wants to through timers or suchlike.

If Firefox only blocked the actual autoplay attribute and allowed JavaScript to trigger media playback any time it wanted to, that would be a pretty obviously bad 'Block Autoplay' experience, so Firefox must try harder. Firefox's approach is to (also) block use of HTMLMediaElement.play() until you have done some 'user gesture' on the page. As far as I can tell from Firefox's description of this, the list of 'user gestures' is fairly expansive and covers much of how you interact with a page. Certainly, if a website can cause you to click on something, regardless of what it looks like, this counts as a 'user gesture' in Firefox.

(I'm sure that Firefox's selection of things that count as 'user gestures' are drawn from real people on real hardware doing things to deliberately trigger playback, including resuming playback after it's been paused by, for example, tapping spacebar.)

In Firefox, this makes it quite hard to actually stop a bad website from playing media while preserving your ability to interact with the site. Did you scroll the page with the spacebar? I think that counts as a user gesture. Did you use your mouse scroll wheel? Probably a user gesture. Did you click on anything at all, including to dismiss some banner? Definitely a user gesture. As far as I can tell, the only reliable way you can prevent a web page from starting media playback is to immediately close the page. Basically anything you do to use it is dangerous.

Firefox does have a very strict global 'no autoplay' policy that you can turn on through about:config, which they call click-to-play, where Firefox tries to limit HTMLMediaElement.play() to being called as the direct result of a JavaScript handler for a user input event (such as a click or a key press). However, their wiki notes that this can break some (legitimate) websites entirely (well, for media playback), and it's a global setting that gets in the way of some things I want; you can't set it only for some sites. And even with click-to-play, if a website can get you to click on something of its choice, it's game over as far as I know; if you have to click or tap a key to dismiss an on-page popup banner, the page can trigger media playing from that event handler.

All of this is why I'd like a per-website "permanent mute" option for Firefox. As far as I know, there's literally no other way in standard Firefox to reliably prevent a potentially bad website (or advertising network that it uses) from playing media on you.

(I suspect that you can defeat a lot of such websites with click-to-play, though.)

PS: Muting a tab in Firefox is different from stopping media playback (or blocking it from starting). All it does is stop Firefox from outputting audio from that tab (to wherever you're having Firefox send audio). Any media will 'play' or continue to play, including videos displaying moving things and being distracting.

We can't expect people to pick 'good' software

By: cks
7 September 2025 at 02:35

One of the things I've come to believe in (although I'm not consistent about it) is that we can't expect people to pick software that is 'good' in a technical sense. People certainly can and do pick software that is good in that it works nicely, has a user interface that works for them, and so on, which is to say all of the parts of 'good' that they can see and assess, but we can't expect people to go beyond that, to dig deeply into the technical aspects to see how good their choice of software is. For example, how efficiently an IMAP client implements various operations at the protocol level is more or less invisible to most people. Even if you know enough to know about potential technical quality aspects, realistically you have to rely on any documentation the software provides (if it provides anything). Very few people are going to set up an IMAP server test environment and point IMAP clients at it to see how they behave, or try to read the source code of open source clients.

(Plus, you have to know a lot to set up a realistic test environment. A lot of modern software varies its behavior in subtle ways depending on the surrounding environment, such as the server (or client) at the other end, what your system is like, and so on. To extend my example, the same IMAP client may behave differently when talking to two different IMAP server implementations.)

Broadly, the best we can do is get software to describe important technical aspects of itself, to document them even if the software doesn't, and to explain to people why various aspects matter and thus what they should look for if they want to pick good software. I think this approach has seen some success in, for example, messaging apps, where 'end to end encrypted' and similar things have become technical quality measures that are typically relatively legible to people. Other technical quality measures in other software are much less legible to people in general, including in important software like web browsers.

(One useful way to make technical aspects legible is to create some sort of scorecard for them. Although I don't think it was built for this purpose, there's caniuse for browsers and their technical quality for various CSS and HTML5 features.)

To me, one corollary to this is that there's generally no point in yelling at people (in various ways) or otherwise punishing them because they picked software that isn't (technically) good. It's pretty hard for a non-specialist to know what is actually good or who to trust to tell them what's actually good, so it's not really someone's fault if they wind up with not-good software that does undesirable things. This doesn't mean that we should always accept the undesirable things, but it's probably best to either deal with them or reject them as gracefully as possible.

(This definitely doesn't mean that we should blindly follow Postel's Law, because a lot of harm has been done to various ecosystems by doing so. Sometimes you have to draw a line, even if it affects people who simply had bad luck in what software they picked. But ideally there's a difference between drawing a line and yelling at people about them running into the line.)

Our too many paths to 'quiet' Prometheus alerts

By: cks
6 September 2025 at 02:54

One of the things our Prometheus environment has is a notion of different sorts of alerts, and in particular of less important alerts that should go to a subset of people (ie, me). There are various reasons for this, including that the alert is in testing, or it concerns a subsystem that only I should have to care about, or that it fires too often for other people (for example, a reboot notification for a machine we routinely reboot).

For historical reasons, there are at least four different ways that this can be done in our Prometheus environment:

  • a special label can be attached to the Prometheus alert rule, which is appropriate if the alert rule itself is in testing or otherwise is low priority.

  • a special label can be attached to targets in a scrape configuration, although this has some side effects that can be less than ideal. This affects all alerts that trigger based on metrics from, for example, the Prometheus host agent (for that host).

  • our Prometheus configuration itself can apply alert relabeling to add the special label for everything from a specific host, as indicated by a "host" label that we add. This is useful if we have so many exporters being scraped from a particular host that labeling each target separately would be annoying, or if I want to keep metric continuity (ie, the metrics not changing their label set) when a host moves into production.

  • our Alertmanager configuration can specifically route certain alerts about certain machines to the 'less important alerts' destination.

The drawback of these assorted approaches is that now there are at least three places to check and possibly to update when a host moves from being a testing host into being a production host. A further drawback is that some of these (the first two) are used a lot more often than others (the last two). When you have multiple things, some of which are infrequently used, and fallible humans have to remember to check them all, you can guess what can happen next.

And that is the simple version of why alerts about one of our fileservers wouldn't have gone to everyone here for about the past year.

How I discovered the problem was that I got an alert about one of the fileserver's Prometheus exporters restarting, and decided that I should update the alert configuration to make it so that alerts about this service restarting only went to me. As I was in the process of doing this, I realized that the alert already had only gone to me, despite there being no explicit configuration in the alert rule or the scrape configuration. This set me on an expedition into the depths of everything else, where I turned up an obsolete bit in our general Prometheus configuration.

On the positive side, now I've audited our Prometheus and Alertmanager configurations for any other things that shouldn't be there. On the negative side, I'm now not completely sure that there isn't a fifth place that's downgrading (some) alerts about (some) hosts.

Could NVMe disks become required for adequate performance?

By: cks
5 September 2025 at 03:34

It's not news that full speed NVMe disks are extremely fast, as well as extremely good at random IO and doing a lot of IO at once. In fact they have performance characteristics that upset general assumptions about how you might want to design systems, at least for reading data from disk (for example, you want to generate a lot of simultaneous outstanding requests, either explicitly in your program or implicitly through the operating system). I'm not sure how much write bandwidth normal NVMe drives can really deliver for sustained write IO, but I believe that they can absorb very high write rates for a short period as you flush out a few hundred megabytes or more. This is a fairly big sea change from even SATA SSDs (and I believe SAS SSDs), never mind HDDs.

About a decade ago, I speculated that everyone was going to be forced to migrate to SATA SSDs because developers would build programs that required SATA SSD performance. It's quite common for developers to build programs and systems that run well on their hardware (whether that's laptops, desktops, or servers, cloud or otherwise), and developers often use the latest and best. These days, that's going to have NVMe SSDs, and so it wouldn't be surprising if developers increasingly developed for full NVMe performance. Some of this may be inadvertent, in that the developer doesn't realize what the performance impact of their choices are on systems with less speedy storage. Some of this will likely be deliberate, as developers choose to optimize for NVMe performance or even develop systems that only work well with that level of performance.

This is a potential problem because there are a number of ways to not have that level of NVMe performance. Most obviously, you can simply not have NVMe drives; instead you may be using SATA SSDs (as we mostly are, including in our fileservers), or even HDDs (as we are in our Prometheus metrics server). Less obviously, you may have NVMe drives but be driving them in ways that don't give you the full NVMe bandwidth. For instance, you might have a bunch of NVMe drives behind a 'tri-mode' HBA, or have (some of) your NVMe drives hanging off the chipset with shared PCIe lanes to the CPU, or have to drive some of your NVMe drives with fewer than x4 PCIe because of limits on slots or lanes.

(Dedicated NVMe focused storage servers will be able to support lots of NVMe devices at full speed, but such storage servers are likely to be expensive. People will inevitably build systems with lower end setups, us included, and I believe that basic 1U servers are still mostly SATA/SAS based.)

One possible reason for optimism is that in today's operating systems, it can take careful system design and unusual programming patterns to really push NVMe disks to high performance levels. This may make it less likely that software accidentally winds up being written so it only performs well on NVMe disks; if it happens, it will be deliberate and the project will probably tell you about it. This is somewhat unlike the SSD/HDD situation a decade ago, where the difference in (random) IO operations per second was both massive and easily achieved.

(This entry was sparked in part by reading this article (via), which I'm not taking a position on.)

HTTP headers that tell syndication feed fetchers how soon to come back

By: cks
4 September 2025 at 03:17

Programs that fetch syndication feeds should fetch them only every so often. But how often? There are a variety of ways to communicate this, and for my own purposes I want to gather them in one place.

I'll put the summary up front. For Atom syndication feeds, your HTTP feed responses should contain a Cache-Control: max-age=... HTTP header that gives your desired retry interval (in seconds), such as '3600' for pulling the feed once an hour. If and when people trip your rate limits and get HTTP 429 responses, your 429s should include a Retry-After header with how long you want feed readers to wait (although they won't).

There are two syndication feed formats in general usage, Atom and RSS2. Although generally not great (and to be avoided), RSS2 format feeds can optionally contain a number of elements to explicitly tell feed readers how frequently they should poll the feed. The Atom syndication feed format has no standard element to communicate polling frequency. Instead, the nominally standard way to do this is through a general Cache-Control: max-age=... HTTP header, which gives a (remaining) lifetime in seconds. You can also set an Expires header, which gives an absolute expiry time, but not both.

(This information comes from Daniel Aleksandersen's Best practices for syndication feed caching. One advantage of HTTP headers over feed elements is that they can be returned on HTTP 304 Not Modified responses; one drawback is that you need to be able to set HTTP headers.)

If you have different rate limit policies for conditional GET requests and unconditional ones, you have a choice to make about the time period you advertise on successful unconditional GETs of your feed. Every feed reader has to do an unconditional GET the first time it fetches your feed, and many of them will periodically do unconditional GETs for various reasons. You could choose to be optimistic, assume that the feed reader's next poll will be a conditional GET, and give it the conditional GET retry interval, or you could be pessimistic and give it a longer unconditional GET one. My personal approach is to always advertise the conditional GET retry interval, because I assume that if you're not going to do any conditional GETs you're probably not paying attention to my Cache-Control header either.

As rachelbythebay's ongoing work on improving feed reader behavior has uncovered, a number of feed readers will come back a bit earlier than your advertised retry interval. So my view is that if you have a rate limit, you should advertise a retry interval that is larger than it. On Wandering Thoughts my current conditional GET feed rate limit is 45 minutes, but I advertise a one hour max-age (and I would like people to stick to once an hour).

(Unconditional GETs of my feeds are rate limited down to once every four hours.)

Once people trip your rate limits and start getting HTTP 429 responses, you theoretically can signal how soon they can come back with a Retry-After header. The simplest way to implement this is to have a constant value that you put in this header, even if your actual rate limit implementation would allow a successful request earlier. For example, if you rate limit to one feed fetch every half hour and a feed fetcher polls after 20 minutes, the simple Retry-After value is '1800' (half an hour in seconds), although if they tried again in just over ten minutes they could succeed (depending on how you implement rate limits). This is what I currently do, with a different Retry-After (and a different rate limit) for conditional GET requests and unconditional GETs.
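In case it's useful, here's roughly what this looks like in a generic Python WSGI application. This is not DWiki's actual code; rate_limited() and render_feed() are hypothetical stand-ins, and the numbers are the half-hour example from above.

    # Not DWiki's actual code: a generic WSGI sketch of the two headers,
    # using the half-hour rate limit example from above. rate_limited()
    # and render_feed() are hypothetical stand-ins.
    def feed_app(environ, start_response):
        if rate_limited(environ):
            start_response("429 Too Many Requests", [
                ("Content-Type", "text/plain"),
                # The simple constant version: come back in half an hour.
                ("Retry-After", "1800"),
            ])
            return [b"Too many feed requests; please slow down.\n"]
        start_response("200 OK", [
            ("Content-Type", "application/atom+xml"),
            # Advertise a retry interval larger than the actual rate limit.
            ("Cache-Control", "max-age=3600"),
        ])
        return [render_feed()]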

My suspicion is that there are almost no feed fetchers that ignore your Cache-Control max-age setting but that honor your HTTP 429 Retry-After setting (or that react to 429s at all). Certainly I see a lot of feed fetchers here behaving in ways that very strongly suggest they ignore both, such as rather frequent fetch attempts. But at least I tried.

Sidebar: rate limit policies and feed reader behavior

When you have a rate limit, one question is whether failed (rate limited) requests should count against the rate limit, or whether only successful ones count. If you nominally allow one feed fetch every 30 minutes and a feed reader fetches at T (successfully), T+20, and T+33, this is the difference between the third fetch failing (since it's less than 30 minutes from the previous attempt) and succeeding (since it's more than 30 minutes from the last successful fetch).

There are various situations where the right answer is that your rate limit counts from the last request even if the last request failed (what Exim calls a strict ratelimit). However, based on observed feed reader behavior, doing this strict rate limiting on feed fetches will result in quite a number of syndication feed readers never successfully fetching your feed, because they will never slow down and drop under your rate limit. You probably don't want this.

Mapping from total requests per day to average request rates

By: cks
3 September 2025 at 03:43

Suppose, not hypothetically, that a single IP address with a single User-Agent has made 557 requests for your blog's syndication feed in about 22 and a half hours (most of which were rate-limited and got HTTP 429 replies). If we generously assume that these requests were distributed evenly over one day (24 hours), what was the average interval between requests (the rate of requests)? The answer is easy enough to work out and it's about two and a half minutes between requests, if they were evenly distributed.

I've been looking at numbers like this lately and I don't feel like working out the math each time, so here is a table of them for my own future use.

Total requests    Theoretical interval (rate)
6                 Four hours
12                Two hours
24                One hour
32                45 minutes
48                30 minutes
96                15 minutes
144               10 minutes
288               5 minutes
360               4 minutes
480               3 minutes
720               2 minutes
1440              One minute
2880              30 seconds
5760              15 seconds
8640              10 seconds
17280             5 seconds
43200             2 seconds
86400             One second

(This obviously isn't comprehensive; instead I want it to give me a ballpark idea, and I care more about higher request counts than lower ones. But not too high because I mostly don't deal with really high rates. Every four hours and every 45 minutes are relevant to some ratelimiting I do.)
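Since the table is just 86,400 seconds in a day divided by the request count, it's easy to regenerate or extend:

    #!/usr/bin/python3
    # Regenerate the table above: the theoretical interval is 86,400 seconds
    # in a day divided by the request count, assuming even spacing.
    counts = [6, 12, 24, 32, 48, 96, 144, 288, 360, 480, 720,
              1440, 2880, 5760, 8640, 17280, 43200, 86400]
    for n in counts:
        secs = 86400 / n
        if secs >= 3600:
            print(f"{n:6d}  every {secs / 3600:g} hour(s)")
        elif secs >= 60:
            print(f"{n:6d}  every {secs / 60:g} minute(s)")
        else:
            print(f"{n:6d}  every {secs:g} second(s)")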

Yesterday there were about 20,240 requests for the main syndication feed for Wandering Thoughts, which is an aggregate rate of more than one request every five seconds. About 10,570 of those requests weren't blocked in various ways or ratelimited, which is still more than one request every ten seconds (if they were evenly spread out, which they probably weren't).

(There were about 48,000 total requests to Wandering Thoughts, and about 18,980 got successful responses, although almost 2,000 of those successful responses were a single rogue crawler that's now blocked. This is of course nothing compared to what a busy website sees. Yesterday my department's web server saw 491,900 requests, although that seems to have been unusually high. Interested parties can make their own tables for that sort of volume level.)

It's a bit interesting to see this table written out this way. For example, if I thought about it I knew there was a factor of ten difference between one request every ten seconds and one request every second, but it's more concrete when I see the numbers there with the extra zero.

In GNU Emacs, I should remember that the basics still work

By: cks
2 September 2025 at 03:42

Over on the Fediverse, I said something that has a story attached:

It sounds obvious to say it, but I need to remember that I can always switch buffers in GNU Emacs by just switching buffers, not by using, eg, the MH-E commands to switch (back) to another folder. The MH-E commands quite sensibly do additional things, but sometimes I don't want them.

GNU Emacs has a spectrum of things that range from assisting your conventional editing (such as LSP clients) to what are essentially nearly full-blown applications that happen to be embedded in GNU Emacs, such as magit and MH-E and the other major modes for reading your email (or Usenet news, or etc). One of my personal dividing lines is to what extent the mode takes over from regular Emacs keybindings and regular Emacs behaviors. On this scale, MH-E is quite high on the 'application' side; in MH-E folder buffers, you mostly do things through custom keybindings.

(Well, sort of. This is actually overselling the case because I use regular Emacs buffer movement and buffer searching commands routinely, and MH-E uses Emacs marks to select ranges of messages, which you establish through normal Emacs commands. But actual MH-E operations, like switching to another folder, are done through custom keybindings that involve MH-E functions.)

My dominant use of GNU Emacs at the moment is as a platform for MH-E. When I'm so embedded in an MH-E mindset, it's easy to wind up with a form of tunnel vision, where I think of the MH-E commands as the only way to do something like 'switch to another (MH) folder'. Sometimes I do need or want to use the MH-E commands, and sometimes they're the easiest way, but part of the power of GNU Emacs as a general purpose environment is that ultimately, MH-E's displays of folders and messages, the email message I'm writing, and so on, are all just Emacs buffers being displayed in Emacs windows. I don't have to switch between these things through MH-E commands if I don't want to; I can just switch buffers with 'C-x b'.

(Provided that the buffer already exists. If the buffer doesn't exist, I need to use the MH-E command to create it.)

Sometimes the reason to use native Emacs buffer switching is that there's no MH-E binding for the functionality, for example to switch from a mail message I'm writing back to my inbox (either to look at some other message or to read new email that just came in). Sometimes it's because, for example, the MH-E command to switch to a folder wants to rescan the MH folder, which forces me to commit or discard any pending deletions and refilings of email.

One of the things that makes this work is that MH-E uses a bunch of different buffers for things. For example, each MH folder gets its own separately named buffer, instead of MH-E simply loading the current folder (whatever it is) into a generic 'show a folder' buffer. Magit does something similar with buffer naming, where its summary buffer isn't called just 'magit' but 'magit: <directory>' (I hadn't noticed that until I started writing this entry, but of course Magit would do it that way as a good Emacs citizen).

Now that I've written this, I've realized that a bit of my MH-E customization uses a fixed buffer name for a temporary buffer, instead of a buffer name based on the current folder. I'm in good company on this, since a number of MH-E status display commands also use fixed-name buffers, but perhaps I should do better. On the other hand, using a fixed buffer name does avoid having a bunch of these buffers linger around just because I used my command.

(This is using with-output-to-temp-buffer, and a lot of use of it in GNU Emacs' standard Lisp is using fixed names, so maybe my usage here is fine. The relevant Emacs Lisp documentation doesn't have style and usage notes that would tell me either way.)

Some thoughts on Ubuntu automatic ('unattended') package upgrades

By: cks
1 September 2025 at 02:46

The default behavior of a stock Ubuntu LTS server install is that it enables 'unattended upgrades', by installing the package unattended-upgrades (which creates /etc/apt/apt.conf.d/20auto-upgrades, which controls this). Historically, we haven't believed in unattended automatic package upgrades and eventually built a complex semi-automated upgrades system (which has various special features). In theory this has various potential advantages; in practice it mostly results in package upgrades being applied after some delay that depends on when they come out relative to working days.

I have a few machines that actually are stock Ubuntu servers, for reasons outside the scope of this entry. These machines naturally have automated upgrades turned on and one of them (in a cloud, using the cloud provider's standard Ubuntu LTS image) even appears to automatically reboot itself if kernel updates need that. These machines are all in undemanding roles (although one of them is my work IPv6 gateway), so they aren't necessarily indicative of what we'd see on more complex machines, but none of them have had any visible problems from these unattended upgrades.

(I also can't remember the last time that we ran into a problem with updates when we applied them. Ubuntu updates still sometimes have regressions and other problems, forcing them to be reverted or reissued, but so far we haven't seen problems ourselves; we find out about these problems only through the notices in the Ubuntu security lists.)

If we were starting from scratch today in a greenfield environment, I'm not sure we'd bother building our automation for manual package updates. Since we have the automation and it offers various extra features (even if they're rarely used), we're probably not going to switch over to automated upgrades (including in our local build of Ubuntu 26.04 LTS when that comes out next year).

(The advantage of switching over to standard unattended upgrades is that we'd get rid of a local tool that, like all local tools, is all our responsibility. The fewer local weird things we have, the better, especially since we have so many as it is.)

I wish Firefox had some way to permanently mute a website

By: cks
31 August 2025 at 02:27

Over on the Fediverse, I had a wish:

My kingdom for a way to tell Firefox to never, ever play audio and/or video for a particular site. In other words, a permanent and persistent mute of that site. AFAIK this is currently impossible.

(For reasons, I cannot set media.autoplay.blocking_policy to 2 generally. I could if Firefox had a 'all subdomains of ...' autoplay permission, but it doesn't, again AFAIK.)

(This is in a Firefox setup that doesn't have uMatrix and that runs JavaScript.)

Sometimes I visit sites in my 'just make things work' Firefox instance that has JavaScript and cookies and so on allowed (and throws everything away when it shuts down), and it turns out that those sites have invented exceedingly clever ways to defeat Firefox's default attempts to let you block autoplaying media (and possibly their approach is clever enough to defeat even the strict 'click to start' setting for media.autoplay.blocking_policy). I'd like to frustrate those sites, especially ones that I keep winding up back on for various reasons, and never hear unexpected noises from Firefox.

(In general I'd probably like to invert my wish, so that Firefox never played audio or video by default and I had to specifically enable it on a site by site basis. But again this would need an 'all subdomains of' option. This version might turn out to be too strict, I'd have to experiment.)

You can mute a tab, but only once it starts playing, and your mute isn't persistent. As far as I know there's no (native) way to get Firefox to start a tab muted, or especially to always start tabs for a site in a muted state, or to disable audio and/or video for a site entirely (the way you can deny permission for camera or microphone access). I'm somewhat surprised that Firefox doesn't have any option for 'this site is obnoxious, put them on permanent mute', because there are such sites out there.

Both uMatrix and apparently NoScript can selectively block media, but I'd have to add either of them to this profile and I broadly want it to be as plain as reasonable. I do have uBlock Origin in this profile (because I have it in everything), but as far as I can tell it doesn't have a specific (and selective) media blocking option, although it's possible you can do clever things with filter rules, especially if you care about one site instead of all sites.

(I also think that Firefox should be able to do this natively, but evidently Firefox disagrees with me.)

PS: If Firefox actually does have an apparently well hidden feature for this, I'd love to know about it.
