My flailing around with Firefox's Multi-Account Containers

By: cks

I have two separate Firefox environments. One of them is quite locked down so that it blocks JavaScript by default, doesn't accept cookies, and so on. Naturally this breaks a lot of things, so I have a second "just make it work" environment that runs all the JavaScript, accepts all the cookies, and so on (although of course I use uBlock Origin, I'm not crazy). This second environment is pretty risky in the sense that it's going to be heavily contaminated with tracking cookies and so on, so to mitigate the risk (and make it a better environment to test things in), I have this Firefox set to discard cookies, caches, local storage, history, and so on when it shuts down.

In theory how I use this Firefox is that I start it when I need to use some annoying site I want to just work, use the site briefly, and then close it down, flushing away all of the cookies and so on. In practice I've drifted into having a number of websites more or less constantly active in this "accept everything" Firefox, which means that I often keep it running all day (or longer at home) and all of those cookies stick around. This is less than ideal, and is a big reason why I wish Firefox had an 'open this site in a specific profile' feature. Yesterday, spurred on by Ben Zanin's Fediverse comment, I decided to make my "accept everything" Firefox environment more complicated in the pursuit of doing better (ie, throwing away at least some cookies more often).

First, I set up a combination of Multi-Account Containers for the basic multi-container support and FoxyTab to assign wildcarded domains to specific containers. My reason to use Multi-Account Containers and to confine specific domains to specific containers is that both M-A C itself and my standard Cookie Quick Manager add-on can purge all of the cookies and so on for a specific container. In theory this lets me manually purge undesired cookies, or all cookies except desired ones (for example, my active Fediverse login). Of course I'm not likely to routinely manually delete cookies, so I also installed Cookie AutoDelete with a relatively long timeout and with its container awareness turned on, and exemptions configured for the (container-confined) sites that I'm going to want to retain cookies from even when I've closed their tab.

(It would be great if Cookie AutoDelete supported different cookie timeouts for different containers. I suspect it's technically possible, along with other container-aware cookie deletion, since Cookie AutoDelete already applies different retention policies in different containers.)

In FoxyTab, I've set a number of my containers to 'Limit to Designated Sites'; for example, my 'Fediverse' container is set this way. The intention is that when I click on an external link in a post while reading my Fediverse feed, any cookies that external site sets don't wind up in the Fediverse container; instead they go either in the default 'no container' environment or in any specific container I've set up for them. As part of this I've created a 'Cookie Dump' container that I've assigned as the container for various news sites and so on where I actively want a convenient way to discard all their cookies and data (which is available through Multi-Account Containers).

Of course if you look carefully, much of this doesn't really require Multi-Account Containers and FoxyTab (or containers at all). Instead I could get almost all of this just by using Cookie AutoDelete to clean out cookies from closed sites after a suitable delay. Containers do give me a bit more isolation between the different things I'm using my "just make it work" Firefox for, and maybe that's important enough to justify the complexity.

(I still have this Firefox set to discard everything when it exits. This means that I have to re-log-in every so often even for the sites where I have Cookie AutoDelete keep cookies, but that's fine.)

I wish Firefox Profiles supported assigning websites to profiles

By: cks

One of the things that Firefox is working on these days is improving Firefox's profiles feature so that it's easier to use them. Firefox also has an existing feature that is similar to profiles, in containers and the Multi-Account Containers extension. The reason Firefox is tuning up profiles is that containers only separate some things, while profiles separate pretty much everything. A profile has a separate set of about:config settings, add-ons, add-on settings, memorized logins, and so on. I deliberately use profiles to create two separate and rather different Firefox environments. I'd like to have at least two or three more profiles, but one reason I've been lazy is that the more profiles I have, the more complex getting URLs into the right profile is (even with tooling to help).

This leads me to my wish for profiles, which is for profiles to support the kind of 'assign website to profile' and 'open website in profile' features that you currently have with containers, especially with the Multi-Account Containers extension. Actually I would like a somewhat better version than Multi-Account Containers currently offers, because as far as I can see you can't currently say 'all subdomains under this domain should open in container X' and that's a feature I very much want for one of my use cases.

(Multi-Account Containers may be able to do wildcarded subdomains with an additional add-on, but on the other hand apparently it may have been neglected or abandoned by Mozilla.)

Another way to get much of what I want would be for some of my normal add-ons to be (more) container aware. I could get a lot of the benefits of profiles (although not all of them) by using Multi-Account Containers with container aware cookie management in, say, Cookie AutoDelete (which I believe does support that, although I haven't experimented). Using containers also has the advantage that I wouldn't have to maintain N identical copies of my configuration for core extensions and bookmarklets and so on.

(I'm not sure what you can copy from one profile to a new one, and you currently don't seem to get any assistance from Firefox for it, at least in the old profile interface. This is another reason I haven't gone wild on making new Firefox profiles.)

Modern Linux filesystem mounts are rather complex things

By: cks

Once upon a time, Unix filesystem mounts worked by putting one inode on top of another, and this was also how they worked in very early Linux. It wasn't wrong to say that mounts were really about inodes, with the names only being used to find the inodes. This is no longer how things work in Linux (and perhaps other Unixes, but Linux is what I'm most familiar with for this). Today, I believe that filesystem mounts in Linux are best understood as namespace operations.

Each separate (unmounted) filesystem is a tree of names (a namespace). At a broad level, filesystem mounts in Linux take some name from that filesystem tree and project it on top of something in an existing namespace, generally with some properties attached to the projection. A regular conventional mount takes the root name of the new filesystem and puts the whole tree somewhere, but Linux's bind mounts have long been able to take some other name in the filesystem as their starting point (what we could call the root inode of the mount). In modern Linux, there can also be multiple mount namespaces in existence at one time, with different contents and properties. A filesystem mount does not necessarily appear in all of them, and different things can be mounted at the same spot in the tree of names in different mount namespaces.
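As a concrete illustration of per-namespace mounts, here's a sketch using util-linux's unshare (the directory and the choice of a tmpfs are made up for the example; recent versions of unshare make the new namespace's mounts private by default, so the mount doesn't leak back):

  # as root, in one terminal: a new mount namespace with its own mount
  mkdir -p /tmp/demo
  unshare --mount sh -c 'mount -t tmpfs demo /tmp/demo; touch /tmp/demo/inside; ls /tmp/demo; sleep 300'

  # in another terminal, still in the original mount namespace, the same
  # name is just the plain empty directory; the tmpfs only exists over there
  ls /tmp/demo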

(Some mount properties are still global to the filesystem as a whole, while other mount properties are specific to a particular mount. See mount(2) for a discussion of general mount properties. I don't know if there's a mechanism to handle filesystem specific mount properties on a per mount basis.)

This can't really be implemented with an inode-based view of mounts. You can somewhat implement traditional Linux bind mounts with an inode based approach, but mount namespaces have to be separate from the underlying inodes. At a minimum a mount point must be a pair of 'this inode in this namespace has something on top of it', instead of just 'this inode has something on top of it'.

(A pure inode based approach has problems going up the directory tree even in old bind mounts, because the parent directory of a particular directory depends on how you got to the directory. If /usr/share is part of /usr and you bind mounted /usr/share to /a/b, the value of '..' depends on whether you're looking at '/usr/share/..' or '/a/b/..', even though /usr/share and /a/b are the same inode in the /usr filesystem.)

If I'm reading manual pages correctly, Linux still normally requires that the initial mount of any particular filesystem be of its root name (its true root inode). Only after that initial mount is made can you make bind mounts to pull out some subset of its tree of names and then unmount the original full filesystem mount. I believe that a particular filesystem can provide ways to sidestep this with a filesystem specific mount option, such as btrfs's subvol= mount option that's covered in the btrfs(5) manual page (or 'btrfs subvolume set-default').
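As a sketch of that dance (the device names and mount points here are hypothetical): you mount the filesystem's true root somewhere, bind mount the subtree you actually want, and then unmount the full filesystem; btrfs lets you skip the first step with its subvol= option.

  # the initial mount has to be the filesystem's root
  mount /dev/sdb1 /mnt/full
  # pull out just the subtree we care about
  mount --bind /mnt/full/data/projects /srv/projects
  # the original full mount can now go away; the bind mount keeps the
  # filesystem itself mounted
  umount /mnt/full

  # btrfs sidesteps this with a filesystem specific mount option
  mount -o subvol=data/projects /dev/sdb1 /srv/projects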

You can add arbitrary zones to NSD (without any glue records)

By: cks

Suppose, not hypothetically, that you have a very small DNS server for a captive network situation, where the DNS server exists only to give clients answers for a small set of hosts. One of the ways you can implement this is with an authoritative DNS server, such as NSD, that simply has an extremely minimal set of DNS data. If you're using NSD for this, you might be curious how minimal you can be and how much you need to mimic ordinary DNS structure.

Here, by 'mimic ordinary DNS structure', I mean inserting various levels of NS records so there is a more or less conventional path of NS delegations from the DNS root ('.') down to your name. If you're providing DNS clients with 'dog.example.org', you might conventionally have a NS record for '.', a NS record for 'org.', and a NS record for 'example.org.', mimicking what you'd see in global DNS. Of course all of your NS records are going to point to your little DNS server, but they're present if anything looks.

Perhaps unsurprisingly, NSD doesn't require this and DNS clients normally don't either. If you say:

zone:
  name: example.org
  zonefile: example-stub

and don't have any other DNS data, NSD won't object and it will answer queries for 'dog.example.org' with your minimal stub data. This works for any zone, including completely made up ones:

zone:
  name: beyond.internal
  zonefile: beyond-stub

The actual NSD stub zone files can be quite minimal. An older OpenBSD NSD appears to be happy with zone files that have only a $ORIGIN, a $TTL, a '@ IN SOA' record, and what records you care about in the zone.
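For concreteness, here's a sketch of what such a minimal 'example-stub' file could look like (the names, addresses, and SOA values are all made up for illustration):

  $ORIGIN example.org.
  $TTL 3600
  @    IN SOA ns.example.org. hostmaster.example.org. ( 1 3600 900 604800 3600 )
  ns   IN A   192.0.2.1
  dog  IN A   192.0.2.10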

Once I thought about it, I realized I should have expected this. An authoritative DNS server normally only holds data for a small subset of zones and it has to be willing to answer queries about the data it holds. Some authoritative DNS servers (such as Bind) can also be used as resolving name servers so they'd sort of like to have information about at least the root nameservers, but NSD is a pure authoritative server so there's no reason for it to care.

As for clients, they don't normally do DNS resolution starting from the root downward. Instead, they expect to operate by sending the entire query to whatever their configured DNS resolver is, which is going to be your little NSD setup. In a number of configurations, clients either can't talk directly to outside DNS or shouldn't try to do DNS resolution that way because it won't work; they need to send everything to their configured DNS resolver so it can do, for example, "split horizon" DNS.

(Yes, the modern vogue for DNS over HTTPS puts a monkey wrench into split horizon DNS setups. That's DoH's problem, not ours.)

Since this works for a .net zone, you can use it to try to disable DNS over HTTPS resolvers in your stub DNS environment by providing a .net zone with 'use-application-dns CNAME .' or the like, to trigger at least Firefox's canary domain detection.
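A sketch of this, with an invented zone file name and mirroring the record from above:

  zone:
    name: net
    zonefile: net-stub

with 'net-stub' being along the lines of:

  $ORIGIN net.
  $TTL 3600
  @                   IN SOA ns.net. hostmaster.net. ( 1 3600 900 604800 3600 )
  use-application-dns IN CNAME .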

(I'm not going to address whether you should have such a minimal stub DNS environment or instead count on your firewall to block traffic and have a normal DNS environment, possibly with split horizon or response policy zones to introduce your special names.)

Some of the things that ZFS scrubs will detect

By: cks

Recently I saw a discussion of my entry on how ZFS scrubs don't really check the filesystem structure where someone thought that ZFS scrubs only protected you from the disk corrupting data at rest, for example due to sectors starting to fail (here). While ZFS scrubs have their limits, they do manage to check somewhat more than this.

To start with, ZFS scrubs check the end to end hardware path for reading all your data (and implicitly for writing it). There are a variety of ways that things in the hardware path can be unreliable; for example, you might have slowly failing drive cables that are marginal and sometimes give you errors on data reads (or worse, data writes). A ZFS scrub has some chance to detect this; if a ZFS scrub passes, you know that as of that point in time you can reliably read all your data from all your disks and that all the data was reliably written.

If a scrub passes, you also know that the disks haven't done anything obviously bad with your data. This can be important if you're doing operations that you consider somewhat exotic, such as telling SSDs to discard unused sectors. If you have ZFS send TRIM commands to a SSD and then your scrub passes, you know that the SSD didn't incorrectly discard some sectors that were actually used.

Related to this, if you do a ZFS level TRIM and then the scrub passes, you know that ZFS itself didn't send TRIM commands that told the SSD to discard sectors that were actually used. In general, if ZFS has a serious problem where it writes the wrong thing to the wrong place, a scrub will detect it (although the scrub can't fix it). Similarly, a scrub will detect if a disk itself corrupted the destination of a write (or a read), or if things were corrupted somewhere in the lower level software and hardware path of the write.

There are a variety of ZFS level bugs that could theoretically write the wrong thing to the wrong place, or do something that works out to the same effect. ZFS could have a bug in free space handling (so that it incorrectly thinks some in use sectors are free and overwrites them), or it could write too much or too little, or it could correctly allocate and write data but record the location of the data incorrectly in higher level data structures, or it could accidentally not do a write (for example, if it's supposed to write a duplicate copy of some data but forgets to actually issue the IO). ZFS scrubs can detect all of these issues under the right circumstances.

(To a limited extent a ZFS scrub also checks the high level metadata of filesystems and snapshots, since it has to traverse that metadata to find the object set for each dataset and similar things. Since a scrub just verifies checksums, this won't cross check dataset level metadata like information on how much data was written in each snapshot, or the space usage.)

What little I want out of web "passkeys" in my environment

By: cks

WebAuthn is yet another attempt to do an API for web authentication that doesn't involve passwords but that instead allows browsers, hardware tokens, and so on to do things more securely. "Passkeys" (also) is the marketing term for a "WebAuthn credential", and an increasing number of websites really, really want you to use a passkey for authentication instead of any other form of multi-factor authentication (they may or may not still require your password).

Most everyone that wants you to use passkeys also wants you to specifically use highly secure ones. The theoretically most secure are physical hardware security keys, followed by passkeys that are stored and protected in secure enclaves in various ways by the operating system (provided that the necessary special purpose hardware is available). Of course the flipside of 'secure' is 'locked in', whether locked in to your specific hardware key (or keys, generally you'd better have backups) or locked in to a particular vendor's ecosystem because their devices are the only ones that can possibly use your encrypted passkey vault.

(WebAuthn neither requires nor standardizes passkey export and import operations, and obviously security keys are built to not let anyone export the cryptographic material from them, that's the point.)

I'm extremely not interested in the security versus availability tradeoff that passkeys make in favour of security. I care far more about preserving availability of access to my variety of online accounts than about nominal high security. So if I'm going to use passkeys at all, I have some requirements:

Linux people: is there a passkeys implementation that does not use physical hardware tokens (software only), is open source, works with Firefox, and allows credentials to be backed up and copied to other devices by hand, without going through some cloud service?

I don't think I'm asking for much, but this is what I consider the minimum for me actually using passkeys. I want to be 100% sure of never losing them because I have multiple backups and can use them on multiple machines.

Apparently KeePassXC more or less does what I want (when combined with its Firefox extension), and it can even export passkeys in a plain text format (well, JSON). However, I don't know if anything else can ingest those plain text passkeys, and I don't know if KeePassXC can be told to only do passkeys with the browser and not try to take over passwords.

(But at least a plain text JSON backup of your passkeys can be imported into another KeePassXC instance without having to try to move, copy, or synchronize a KeePassXC database.)

Normally I would ignore passkeys entirely, but an increasing number of websites are clearly going to require me to use some form of multi-factor authentication, no matter how stupid this is (cf), and some of them will probably require passkeys or at least make any non-passkey option very painful. And it's possible that reasonably integrated passkeys will be a better experience than TOTP MFA with my janky minimal setup.

(Of course KeePassXC also supports TOTP, and TOTP has an extremely obvious import process that everyone supports, and I believe KeePassXC will export TOTP secrets if you ask nicely.)

While KeePassXC is okay, what I would really like is for Firefox to support 'memorized passkeys' right along with its memorized passwords (and support some kind of export and import along with it). Should people use them? Perhaps not. But it would put that choice firmly in the hands of the people using Firefox, who could decide on how much security they did or didn't want, not in the hands of websites who want to force everyone to face a real risk of losing their account so that the website can conduct security theater.

(Firefox will never support passkeys this way for an assortment of reasons. At most it may someday directly use passkeys through whatever operating system services expose them, and maybe Linux will get a generic service that works the way I want it to. Nor is Firefox ever going to support 'memorized TOTP codes'.)

Two reasons why Unix traditionally requires mount points to exist

By: cks

Recently on the Fediverse, argv minus one asked a good question:

Why does #Linux require #mount points to exist?

And are there any circumstances where a mount can be done without a pre-existing mount point (i.e. a mount point appears out of thin air)?

I think there is one answer for why this is a good idea in general and otherwise complex to do, although you can argue about it, and then a second historical answer based on how mount points were initially implemented.

The general problem is directory listings. We obviously want and need mount points to appear in readdir() results, but in the kernel, directory listings are historically the responsibility of filesystems and are generated and returned in pieces on the fly (which is clearly necessary if you have a giant directory; the kernel doesn't read the entire thing into memory and then start giving your program slices out of it as you ask). If mount points never appear in the underlying directory, then they must be inserted at some point in this process. If mount points can sometimes exist and sometimes not, it's worse; you need to somehow keep track of which ones actually exist and then add the ones that don't at the end of the directory listing. The simplest way to make sure that mount points always exist in directory listings is to require them to have an existence in the underlying filesystem.

(This was my initial answer.)

The historical answer is that in early versions of Unix, filesystems were actually mounted on top of inodes, not directories (or filesystem objects). When you passed a (directory) path to the mount(2) system call, all it was used for was getting the corresponding inode, which was then flagged as '(this) inode is mounted on' and linked (sort of) to the new mounted filesystem on top of it. All of the things that dealt with mount points and mounted filesystems did so by inode and inode number, with no further use of the paths, and with the root inode of the mounted filesystem quietly substituted for the mounted-on inode. All of the mechanics of this needed the inode and directory entry for the name to actually exist (and V7 required the name to be a directory).

I don't think modern kernels (Linux or otherwise) still use this approach to handling mounts, but I believe it lingered on for quite a while. And it's a sufficiently obvious and attractive implementation choice that early versions of Linux also used it (see the Linux 0.96c version of iget() in fs/inode.c).

Sidebar: The details of how mounts worked in V7

When you passed a path to the mount(2) system call (called 'smount()' in sys/sys3.c), it used the name to get the inode and then set the IMOUNT flag from sys/h/inode.h on it (and put the mount details in a fixed size array of mounts, which wasn't very big). When iget() in sys/iget.c was fetching inodes for you and you'd asked for an IMOUNT inode, it gave you the root inode of the filesystem instead, which worked in cooperation with name lookup in a directory (the name lookup in the directory would find the underlying inode number, and then iget() would turn it into the mounted filesystem's root inode). This gave Research Unix a simple, low code approach to finding and checking for mount points, at the cost of pinning a few more inodes into memory (not necessarily a small thing when even a big V7 system only had at most 200 inodes in memory at once, but then a big V7 system was limited to 8 mounts, see h/param.h).

We can't really do progressive rollouts of disruptive things

By: cks

In a comment on my entry on how we reboot our machines right after updating their kernels, Jukka asked a good question:

While I do not know how many machines there are in your fleet, I wonder whether you do incremental rolling, using a small snapshot for verification before rolling out to the whole fleet?

We do this to some extent but we can't really do it very much. The core problem is that the state of almost all of our machines is directly visible and exposed to people. This is because we mostly operate an old fashioned Unix login server environment, where people specifically use particular servers (either directly by logging in to them or implicitly because their home directory is on a particular NFS fileserver). About the only genuinely generic machines we have are the nodes in our SLURM cluster, where we can take specific unused nodes out of service temporarily without anyone noticing.

(Some of these login servers are in use all of the time; others we might find idle if we're extremely lucky. But it's hard to predict when someone will show up to try to use a currently empty server.)

This means that progressively rolling out a kernel update (and rebooting things) to our important, visible core servers requires multiple people-visible reboots of machines, instead of one big downtime when everything is rebooted. Generally we feel that repeated disruptions are much more annoying and disruptive overall to people; it's better to get the pain of reboot disruptions over all at once. It's also much easier to explain to people, and we don't have to annoy them with repeated notifications that yet another subset of our servers and services will be down for a bit.

(To make an incremental deployment more painful for us, these will normally have to be after-hours downtimes, which means that we'll be repeatedly staying late, perhaps once a week for three or four weeks as we progressively work through a rollout.)

In addition to the nodes of our SLURM cluster, there are a number of servers that can be rebooted in the background to some degree without people noticing much. We will often try the kernel update out on a few of them in advance, and then update others of them earlier in the day (or the day before) both as a final check and to reduce the number of systems we have to cover at the actual out of hours downtime. But a lot of our servers cannot really be tested much in advance, such as our fileservers or our web server (which is under constant load for reasons outside the scope of this entry). We can (and do) update a test fileserver or a test web server, but neither will see a production load and it's under production loads that problems are most likely to surface.

This is a specific example of how the 'cattle' model doesn't fit all situations. To have a transparent rolling update that involves reboots (or anything else that's disruptive on a single machine), you need to be able to transparently move people off of machines and then back on to them. This is hard to get in any environment where people have long term usage of specific machines, where they have login sessions and running compute jobs and so on, and where you have non-redundant resources on a single machine (such as NFS fileservers without transparent failover from server to server).

We don't update kernels without immediately rebooting the machine

By: cks

I've mentioned this before in passing (cf, also) but today I feel like saying it explicitly: our habit with all of our machines is to never apply a kernel update without immediately rebooting the machine into the new kernel. On our Ubuntu machines this is done by holding the relevant kernel packages; on my Fedora desktops I normally run 'dnf update --exclude "kernel*"' unless I'm willing to reboot on the spot.
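For illustration, the mechanics look something like the following (the Ubuntu package names are the usual metapackages and may differ for your kernel flavour):

  # Ubuntu: hold the kernel packages so routine updates skip them
  apt-mark hold linux-image-generic linux-headers-generic
  # when we're ready to reboot right afterward:
  apt-mark unhold linux-image-generic linux-headers-generic
  apt-get update && apt-get upgrade

  # Fedora: update everything except kernels until we can reboot
  dnf update --exclude "kernel*"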

The obvious reason for this is that we want to switch to the new kernel under controlled, attended conditions when we'll be able to take immediate action if something is wrong, rather than possibly have the new kernel activate at some random time without us present and paying attention if there's a power failure, a kernel panic, or whatever. This is especially acute on my desktops, where I use ZFS by building my own OpenZFS packages and kernel modules. If something goes wrong and the kernel modules don't load or don't work right, an unattended reboot can leave my desktops completely unusable and off the network until I can get to them. I'd rather avoid that if possible (sometimes it isn't).

(In general I prefer to reboot my Fedora machines with me present because weird things happen from time to time and sometimes I make mistakes, also.)

The less obvious reason is that when you reboot a machine right after applying a kernel update, it's clear in your mind that the machine has switched to a new kernel. If there are system problems in the days immediately after the update, you're relatively likely to remember this and at least consider the possibility that the new kernel is involved. If you apply a kernel update, walk away without rebooting, and the machine reboots a week and a half later for some unrelated reason, you may not remember that one of the things the reboot did was switch to a new kernel.

(Kernels aren't the only thing that this can happen with, since not all system updates and changes take effect immediately when made or applied. Perhaps one should reboot after making them, too.)

I'm assuming here that your Linux distribution's package management system is sensible, so there's no risk of losing old kernels (especially the one you're currently running) merely because you installed some new ones but didn't reboot into them. This is how Debian and Ubuntu behave (if you don't 'apt autoremove' kernels), but not quite how Fedora's dnf does it (as far as I know). Fedora dnf keeps the N most recent kernels around and probably doesn't let you remove the currently running kernel even if it's more than N kernels old, but I don't believe it tracks whether or not you've rebooted into those N kernels and stretches the N out if you haven't (or removes more recent installed kernels that you've never rebooted into, instead of older kernels that you did use at one point).

PS: Of course if kernel updates were perfect this wouldn't matter. However this isn't something you can assume for the Linux kernel (especially as patched by your distribution), as we've sometimes seen. Although big issues like that are relatively uncommon.

We (I) need a long range calendar reminder system

By: cks

About four years ago I wrote an entry about how your SMART drive database of attribute meanings needs regular updates. That entry was written on the occasion of updating the database we use locally on our Ubuntu servers, and at the time we were using a mix of Ubuntu 18.04 and Ubuntu 20.04 servers, both of which had older drive databases that probably dated from early 2018 and early 2020 respectively. It is now late 2025 and we use a mix of Ubuntu 24.04 and 22.04 servers, both of which have drive databases that are from after October of 2021.

Experienced system administrators know where this one is going: today I updated our SMART drive database again, to a version that was more recent than the one shipped with 24.04 instead of older than it.

It's a fact of life that people forget things. People especially forget things that are a long way away, even if they make little notes in their worklog message when recording something that they did (as I did four years ago). It's definitely useful to plan ahead in your documentation and write these notes, but without an external thing to push you or something to explicitly remind you, there's no guarantee that you'll remember.

All of which leads me to the view that it would be useful for us to have a long range calendar reminder system, something that could be used to set reminders for more than a year into the future and ideally allow us to write significant email messages to our future selves to cover all of the details (although there are hacks around that, such as putting the details on a web page and having the calendar mail us a link). Right now the best calendar reminder system we have is the venerable calendar, which we can set up to email one-line notes to our general address that reaches all sysadmins, but calendar doesn't let you include the year in the reminder date.

(For SMART drive database updates, we could get away with mailing ourselves once a year in, say, mid-June. It doesn't hurt to update the drive database more than every Ubuntu LTS release. But there are situations where a reminder several years in the future is what we want.)

PS: Of course it's not particularly difficult to build an ad-hoc script system to do this, with various levels of features. But every local ad-hoc script that we write is another little bit of overhead, and I'd like to avoid that kind of thing if at all possible in favour of a standard solution (that isn't a shared cloud provider calendar).
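To illustrate the sort of small ad-hoc system I mean (a sketch only, with made up paths and addresses): a flat file of dated reminder lines plus a daily cron job that mails any entries that have come due.

  # /local/adm/reminders holds lines of the form 'YYYY-MM-DD text ...', eg:
  #   2029-06-15 Update the SMART drive database again (see worklog)

  #!/bin/sh
  # daily cron job: mail today's reminders, if there are any
  today=$(date +%Y-%m-%d)
  due=$(grep "^$today " /local/adm/reminders)
  if [ -n "$due" ]; then
      echo "$due" | mail -s "long range reminders for $today" sysadmins@example.org
  fi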

We need to start doing web blocking for non-technical reasons

By: cks

My sense is that for a long time, technical people (system administrators, programmers, and so on) have seen the web as something that should be open by default and by extension, a place where we should only block things for 'technical' reasons. Common technical reasons are a harmful volume of requests or clear evidence of malign intentions, such as probing for known vulnerabilities. Otherwise, if it wasn't harming your website and wasn't showing any intention to do so, you should let it pass. I've come to think that in the modern web this is a mistake, and we need to be willing to use blocking and other measures for 'non-technical' reasons.

The core problem is that the modern web seems to be fragile and is kept going in large part by a social consensus, not technical things such as capable software and powerful servers. However, if we only react to technical problems, there's very little that preserves and reinforces this social consensus, as we're busy seeing. With little to no consequences for violating the social consensus, bad actors are incentivized to skate right up to and even over the line of causing technical problems. When we react by taking only narrow technical measures, we tacitly reward the bad actors for their actions; they can always find another technical way. They have no incentive to be nice or to even vaguely respect the social consensus, because we don't punish them for it.

So I've come to feel that if something like the current web is to be preserved, we need to take action not merely when technical problems arise but also when the social consensus is violated. We need to start blocking things for what I called editorial reasons. When software or people do things that merely show bad manners and don't yet cause us technical problems, we should still block them, either soft (temporarily, perhaps with HTTP 429 Too Many Requests) or hard (permanently). We need to take action to create the web that we want to see, or we aren't going to get it or keep it.
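For instance, if you happen to run Apache, a soft block by User-Agent can be a few lines of mod_rewrite (the crawler name here is invented; the R= flag accepts non-redirect status codes such as 429, and [F] would give a hard 403 instead):

  RewriteEngine On
  # soft-block a badly behaved but not yet damaging crawler
  RewriteCond %{HTTP_USER_AGENT} "ExampleRudeCrawler" [NC]
  RewriteRule ^ - [R=429,L]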

To put it another way, if we want to see good, well behaved browsers, feed readers, URL fetchers, crawlers, and so on, we have to create disincentives for ones that are merely bad (as opposed to actively damaging). In its own way, this is another example of the refutation of Postel's Law. If we accept random crap to be friendly, we get random crap (and the quality level will probably trend down over time).

To answer one potential criticism, it's true that in some sense, blocking and so on for social reasons is not good and is in some theoretical sense arguably harmful for the overall web ecology. On the other hand, the current unchecked situation itself is also deeply harmful for the overall web ecology and it's only going to get worse if we do nothing, with more and more things effectively driven off the open web. We only get to pick the poison here.

I wish SSDs gave you CPU performance style metrics about their activity

By: cks

Modern CPUs have an impressive collection of performance counters for detailed, low level information on things like cache misses, branch mispredictions, various sorts of stalls, and so on; on Linux you can use 'perf list' to see them all. Modern SSDs (NVMe, SATA, and SAS) are all internally quite complex, and their behavior under load depends on a lot of internal state. It would be nice to have CPU performance counter style metrics to expose some of those details. For a relevant example that's on my mind (cf), it certainly would be interesting to know how often flash writes had to stall while blocks were hastily erased, or the current erase rate.

Having written this, I checked some of our SSDs (the ones I'm most interested in at the moment) and I see that our SATA SSDs do expose some of this information as (vendor specific) SMART attributes, with things like 'block erase count' and 'NAND GB written' to TLC or SLC (as well as the host write volume and so on stuff you'd expect). NVMe does this in a different way that doesn't have the sort of easy flexibility that SMART attributes do, so a random one of ours that I checked doesn't seem to provide this sort of lower level information.
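On Linux, smartmontools will show you these attributes if the drive exposes them; the attribute names and numbers are vendor specific, so this is only a sketch of where to look:

  # dump all SMART attributes for a SATA SSD
  smartctl -A /dev/sda
  # then look for vendor specific entries with names along the lines of
  # 'Block_Erase_Count' or 'NAND_Writes_GiB'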

It's understandable that SSD vendors don't necessarily want to expose this sort of information, but it's quite relevant if you're trying to understand unusual drive performance. For example, for your workload do you need to TRIM your drives more often, or do they have enough pre-erased space available when you need it? Since TRIM has an overhead, you may not want to blindly do it on a frequent basis (and its full effects aren't entirely predictable since they depend on how much the drive decides to actually erase in advance).

(Having looked at SMART 'block erase count' information on one of our servers, it's definitely doing something when the server is under heavy fsync() load, but I need to cross-compare the numbers from it to other systems in order to get a better sense of what's exceptional and what's not.)

I'm currently more focused on write related metrics, but there's probably important information that could be exposed for reads and for other operations. I'd also like it if SSDs provided counters for how many of various sorts of operations they saw, because while your operating system can in theory provide this, it often doesn't (or doesn't provide them at the granularity of, say, how many writes with 'Force Unit Access' or how many 'Flush' operations were done).

(In Linux, I think I'd have to extract this low level operation information in an ad-hoc way with eBPF tracing.)
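As a sketch of that ad-hoc approach, bpftrace can count block layer requests by their flag string, which encodes flushes, FUA writes, discards, and so on (roughly, a leading 'F' is a flush and a trailing 'F' is FUA):

  # count block requests by type ('rwbs' flags) as they're issued
  bpftrace -e 'tracepoint:block:block_rq_issue { @[str(args->rwbs)] = count(); }'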

A (filesystem) journal can be a serialization point for durable writes

By: cks

Suppose that you have a filesystem that uses some form of a journal to provide durability (as many do these days) and you have a bunch of people (or processes) writing and updating things all over the filesystem that they want to be durable, so these processes are all fsync()'ing their work on a regular basis (or the equivalent system call or synchronous write operation). In a number of filesystem designs, this creates a serialization point on the filesystem's journal.

This is related to the traditional journal fsync() problem, but that one is a bit different. In the traditional problem you have a bunch of changes from a bunch of processes, some of which one process wants to fsync() and most of which it doesn't; this can be handled by only flushing necessary things. Here we have a bunch of processes making a bunch of relatively independent changes but approximately all of the processes want to fsync() their changes.

The simple way to get durability (and possibly integrity) for fsync() is to put everything that gets fsync()'d into the journal (either directly or indirectly) and then force the journal to be durably committed to disk. If the filesystem's journal is a linear log, as is usually the case, this means that multiple processes mostly can't be separately writing and flushing journal entries at the same time. Each durable commit of the journal is a bottleneck for anyone who shows up 'too late' to get their change included in the current commit; they have to wait for the current commit to be flushed to disk before they can start adding more entries to the journal (but then everyone can be bundled into the next commit).

In some filesystems, processes can readily make durable writes outside of the journal (for example, overwriting something in place); such processes can avoid serializing on a linear journal. Even if they have to put something in the journal, you can perhaps minimize the direct linear journal contents by having them (durably) write things to various blocks independently, then put only compact pointers to those out of line blocks into the linear journal with its serializing, linear commits. The goal is to avoid having someone show up wanting to write megabytes 'to the journal' and forcing everyone to wait for their fsync(); instead people serialize only on writing a small bit of data at the end, and writing the actual data happens in parallel (assuming the disk allows that).

(I may have made this sound simple but the details are likely fiendishly complex.)

If you have a filesystem in this situation, and I believe one of them is ZFS, you may find you care a bunch about the latency of disks flushing writes to media. Of course you need the workload too, but there are certain sorts of workloads that are prone to this (for example, traditional Unix mail spools).

I believe that you can also see this sort of thing with databases, although they may be more heavily optimized for concurrent durable updates.

Sidebar: Disk handling of durable writes can also be a serialization point

Modern disks (such as NVMe SSDs) broadly have two mechanisms to force things to durable storage. You can issue specific writes of specific blocks with 'Force Unit Access' (FUA) set, which causes the disk to write those blocks (and not necessarily any others) to media, or you can issue a general 'Flush' command to the disk and it will write anything it currently has in its write cache to media.

If you issue FUA writes, you don't have to wait for anything else other than your blocks to be written to media. If you issue 'Flush', you get to wait for everyone's blocks to be written out. This means that for speed you want to issue FUA writes when you want things on media, but on the other hand you may have already issued non-FUA writes for some of the blocks before you found out that you wanted them on media (for example, if someone writes a lot of data, so much that you start writeback, and then they issue a fsync()). And in general, the block IO programming model inside your operating system may favour issuing a bunch of regular writes and then inserting a 'force everything before this point to media' fencing operation into the IO stream.

NVMe SSDs and the question of how fast they can flush writes to flash

By: cks

Over on the Fediverse, I had a question I've been wondering about:

Disk drive people, sysadmins, etc: would you expect NVMe SSDs to be appreciably faster than SATA SSDs for a relatively low bandwidth fsync() workload (eg 40 Mbytes/sec + lots of fsyncs)?

My naive thinking is that AFAIK the slow bit is writing to the flash chips to make things actually durable when you ask, and it's basically the same underlying flash chips, so I'd expect NVMe to not be much faster than SATA SSDs on this narrow workload.

This is probably at least somewhat wrong. This 2025 SSD hierarchy article doesn't explicitly cover forced writes to flash (the fsync() case), but it does cover writing 50 GBytes of data in 30,000 files, which is probably enough to run any reasonable consumer NVMe SSD out of fast write buffer storage (either RAM or fast flash). The write speeds they get on this test from good NVMe drives are well over the maximum SATA data rates, so there's clearly a sustained write advantage to NVMe SSDs over SATA SSDs.

In replies on the Fediverse, several people pointed out that NVMe SSDs are likely using newer controllers than SATA SSDs and these newer controllers may well be better at handling writes. This isn't surprising when I thought about it, especially in light of NVMe perhaps overtaking SATA for SSDs, although apparently 'enterprise' SATA/SAS SSDs are still out there and probably seeing improvements (unlike consumer SATA SSDs where price is the name of the game).

Also, apparently the real bottleneck in writing to the actual flash is finding erased blocks or, if you're unlucky, having to wait for blocks to be erased. Actual writes to the flash chips may be able to go at something close to the PCIe 3.0 (or better) bandwidth, which would help explain the Tom's Hardware large write figures (cf).

(If this is the case, then explicitly telling SSDs about discarded blocks is especially important for any write workload that will be limited by flash write speeds, including fsync() heavy workloads.)

PS: The reason I'm interested in this is that we have a SATA SSD based system that seems to have periodic performance issues related to enough write IO combined with fsync()s (possibly due to write buffering interactions), and I've been wondering how much moving it to be NVMe based might help. Since this machine uses ZFS, perhaps one thing we should consider is manually doing some ZFS 'TRIM' operations.
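If we do go down that road, the ZFS side of it is simple enough (the pool name here is hypothetical); we can trim by hand and watch the progress, or turn on continuous automatic trimming:

  # manually TRIM all eligible free space in the pool
  zpool trim tank
  # watch how the trim is going
  zpool status -t tank
  # or have ZFS trim freed space continuously
  zpool set autotrim=on tank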

The strange case of 'mouse action traps' in GNU Emacs with (slower) remote X

By: cks

Some time back over on the Fediverse, I groused about GNU Emacs tooltips. That grouse was a little imprecise; the situation I usually see problems with is specifically running GNU Emacs in SSH-forwarded X from home, which has a somewhat high latency. This high latency caused me to change how I opened URLs from GNU Emacs, and it seems to be the root of the issues I'm seeing.

The direct experience I was having with tooltips was that being in a situation where Emacs might want to show a GUI tooltip would cause Emacs to stop responding to my keystrokes for a while. If the tooltip was posted and visible it would stay visible, but the stall could happen without that. However, it doesn't seem to be tooltips as such that cause this problem, because even with tooltips disabled as far as I can tell (and certainly not appearing), the cursor and my interaction with Emacs can get 'stuck' in places where there's mouse actions available.

(I tried both setting the tooltip delay times to very large numbers and setting tooltip-functions to do nothing.)

This is especially visible to me because my use of MH-E is prone to this in two cases. First, when composing email flyspell mode will attach a 'correct word' button-2 popup menu to misspelled words, which can then stall things if I move the cursor to them (especially if I use a mouse click to do so, perhaps because I want to make the word into an X selection). Second, when displaying email that has links in it, these links can be clicked on (and have hover tooltips to display what the destination URL is); what I frequently experience is that after I click on a link, when I come back to the GNU Emacs (X) window I can't immediately switch to the next message, scroll the text of the current message, or otherwise do things.

This 'trapping' and stall doesn't usually happen when I'm in the office, which is still using remote X but over a much faster and lower latency 1G network connection. Disabling tooltips themselves isn't ideal because it means I no longer get to see where links go, and anyway it's relatively pointless if it doesn't fix the real problem.

When I thought this was an issue specific to tooltips, it made sense to me because I could imagine that GNU Emacs needed to do a bunch of relatively synchronous X operations to show or clear a tooltip, and those operations could take a while over my home link. Certainly displaying regular GNU Emacs (X) menus isn't particularly fast. Without tooltips displaying it's more mysterious, but it's still possible that Emacs is doing a bunch of X operations when it thinks a mouse or tooltip target is 'active', or perhaps there's something else going on.

(I'm generally happy with GNU Emacs but that doesn't mean it's perfect or that I don't have periodic learning experiences.)

PS: In theory there are tools that can monitor and report on the flow of X events (by interposing themselves into it). In practice it's been a long time since I used any of them, and anyway there's probably nothing I can do about it if GNU Emacs is doing a lot of X operations. Plus it's probably partly the GTK toolkit at work, not GNU Emacs itself.

PPS: Having taken a brief look at the MH-E code, I'm pretty sure that it doesn't even begin to work with GNU Emacs' TRAMP (also) system for working with remote files. TRAMP has some support for running commands remotely, but MH-E has its own low-level command execution and assumes that it can run commands rapidly, whenever it feels like, and then read various results out of the filesystem. Probably the most viable approach would be to use sshfs to mount your entire ~/Mail locally, have a local install of (N)MH, and then put shims in for the very few MH commands that have to run remotely (such as inc and the low level post command that actually sends out messages you've written). I don't know if this would work very well, but it would almost certainly be better than trying to run all those MH commands remotely.
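A sketch of that sshfs idea, with invented host names and entirely untested:

  # mount the remote MH mail storage locally
  sshfs mailhost:Mail ~/Mail
  # then point the local (N)MH install at it in ~/.mh_profile:
  #   Path: Mail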

Staring at code can change what I see (a story from long ago)

By: cks

I recently read Hillel Wayne's Sapir-Whorf does not apply to Programming Languages (via), which I will characterize as being about how programming can change how you see things even though the Sapir-Whorf hypothesis doesn't apply (Hillel Wayne points to the Tetris Effect). As it happens, long ago I experienced a particular form of this that still sticks in my memory.

Many years ago, I was recruited to be a TA for the university's upper year Operating Systems course, despite being an undergraduate at the time. One of the jobs of TAs was to mark assignments, which we did entirely by hand back in those days; any sort of automated testing was far in the future, and for these assignments I don't think we even ran the programs by hand. Instead, marking was mostly done by having students hand in printouts of their modifications to the course's toy operating system and we three TAs collectively scoured the result to see if they'd made the necessary changes and spot errors.

Since this was an OS course, some assignments required dealing with concurrency, which meant that students had to properly guard and insulate their changes (in, for example, memory handling) from various concurrency problems. Failure to completely do so would cost marks, so the TAs were on the lookout for such problems. Over the course of the course, I got very good at spotting these concurrency problems entirely by eye in the printed out code. I didn't really have to think about it, I'd be reading the code (or scanning it) and the problem would jump out at me. In the process I formed a firm view that concurrency is very hard for people to deal with, because so many students made so many mistakes (whether obvious or subtle).

(Since students were modifying the toy OS to add or change features, there was no set form that their changes had to follow; people implemented the new features in various different ways. This meant that their concurrency bugs had common patterns but not specific common forms.)

I could have thought that I was spotting these problems because I was a better programmer than these other undergraduate students (some of whom were literally my peers, it was just that I'd taken the OS course a year earlier than they had because it was one of my interests). However, one of the most interesting parts of the whole experience was getting pretty definitive proof that I wasn't, and it was my focused experience that made the difference. One of the people taking this course was a fellow undergraduate who I knew and I knew was a better programmer than I was, but when I was marking his version of one assignment I spotted what I viewed at the time as a reasonably obvious concurrency issue. So I wasn't seeing these issues when the undergraduates doing the assignment missed them because I was a better programmer, since here I wasn't: I was seeing the bugs because I was more immersed in this than they were.

(This also strongly influenced my view of how hard and tricky concurrency is. Here was a very smart programmer, one with at least some familiarity with the whole area, and they'd still made a mistake.)

Uses for DNS server delegation

By: cks

A commentator on my entry on systemd-resolved's new DNS server delegation feature asked:

My memory might fail me here, but: wasn't something like this a feature introduced in ISC's BIND 8, and then considered to be a bad mistake and dropped again in BIND 9 ?

I don't know about Bind, but what I do know is that this feature is present in other DNS resolvers (such as Unbound) and that it has a variety of uses. Some of those uses can be substituted with other features and some can't be, at least not as-is.

The quick version of 'DNS server delegation' is that you can send all queries under some DNS zone name off to some DNS server (or servers) of your choice, rather than have DNS resolution follow any standard NS delegation chain that may or may not exist in global DNS. In Unbound, this is done through, for example, Forward Zones.

DNS server delegation has at least three uses that I know of. First, you can use it to insert entire internal TLD zones into the view that clients have. People use various top level names for these zones, such as .internal, .kvm, .sandbox (our choice), and so on. In all cases you have some authoritative servers for these zones and you need to direct queries to these servers instead of having your queries go to the root nameservers and be rejected.

(Obviously you will be sad if IANA ever assigns your internal TLD to something, but honestly if IANA allows, say, '.internal', we'll have good reason to question their sanity. The usual 'standard DNS environment' replacement for this is to move your internal TLD to be under your organizational domain and then implement split horizon DNS.)
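In Unbound, a sketch of this first use looks like the following, using our '.sandbox' as the example (the server addresses are hypothetical):

  forward-zone:
    name: "sandbox."
    forward-addr: 192.0.2.53
    forward-addr: 192.0.2.54

(If the resolver does DNSSEC validation, you probably also want a 'domain-insecure: "sandbox."' server option, since a made up TLD can't validate against the root.)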

Second, you can use it to splice in internal zones that don't exist in external DNS without going to the full overkill of split horizon authoritative data. If all of your machines live in 'corp.example.org' and you don't expose this to the outside world, you can have your public example.org servers with your public data and your corp.example.org authoritative servers, and you splice in what is effectively a fake set of NS records through DNS server delegation. Related to this, if you want you can override public DNS simply by having an internal and an external DNS server, without split horizon DNS; you use DNS server delegation to point to the internal DNS server for certain zones.

(This can be replaced with split horizon DNS, although maintaining split horizon DNS is its own set of headaches.)

Finally, you can use this to short-cut global DNS resolution for reliability in cases where you might lose external connectivity. For example, there are within-university ('on-campus' in our jargon) authoritative DNS servers for .utoronto.ca and .toronto.edu. We can use DNS server delegation to point these zones at these servers to be sure we can resolve university names even if the university's external Internet connection goes down. We can similarly point our own sub-zone at our authoritative servers, so even if our link to the university backbone goes down we can resolve our own names.

(This isn't how we actually implement this; we have a more complex split horizon DNS setup that causes our resolving DNS servers to have a complete copy of the inside view of our zones, acting as caching secondaries.)

The early Unix history of chown() being restricted to root

By: cks

A few years ago I wrote about the divide in chown() about who got to give away files, where BSD and V7 were on one side, restricting it to root, while System III and System V were on the other, allowing the owner to give them away too. At the time I quoted the V7 chown(2) explanation of this:

[...] Only the super-user may execute this call, because if users were able to give files away, they could defeat the (nonexistent) file-space accounting procedures.

Recently, for reasons, chown(2) and its history were on my mind and so I wondered if the early Research Unixes had always had this, or if a restriction was added at some point.

The answer is that the restriction was added in V6, where the V6 chown(2) manual page has the same wording as V7. In Research Unix V5 and earlier, people can chown(2) away their own files; this is documented in the V4 chown(2) manual page and is what the V5 kernel code for chown() does. This behavior runs all the way back to the V1 chown() manual page, with an extra restriction that you can't chown() setuid files.

(Since I looked it up, the restriction on chown()'ing setuid files was lifted in V4. In V4 and later, a setuid file has its setuid bit removed on chown; in V3 you still can't give away such a file, according to the V3 chown(2) manual page.)

At this point you might wonder where the System III and System V unrestricted chown came from. The surprising to me answer seems to be that System III partly descends from PWB/UNIX, and PWB/UNIX 1.0, although it was theoretically based on V6, has pre-V6 chown(2) behavior (kernel source, manual page). I suspect that there's a story both to why V6 made chown() more restricted and also why PWB/UNIX specifically didn't take that change from V6, but I don't know if it's been documented anywhere (a casual Internet search didn't turn up anything).

(The System III chown(2) manual page says more or less the same thing as the PWB/UNIX manual page, just more formally, and the kernel code is very similar.)

Maybe why OverlayFS had its readdir() inode number issue

By: cks

A while back I wrote about readdir()'s inode numbers versus OverlayFS, which discussed an issue where for efficiency reasons, OverlayFS sometimes returned different inode numbers in readdir() than in stat(). This is not POSIX legal unless you do some pretty perverse interpretations (as covered in my entry), but lots of filesystems deviate from POSIX semantics every so often. A more interesting question is why, and I suspect the answer is related to another issue that's come up, the problem of NFS exports of NFS mounts.

What's common in both cases is that NFS servers and OverlayFS both must create an 'identity' for a file (a NFS filehandle and an inode number, respectively). In the case of NFS servers, this identity has some strict requirements; OverlayFS has a somewhat easier life, but in general it still has to create and track some amount of information. Based on reading the OverlayFS article, I believe that OverlayFS considers this expensive enough to only want to do it when it has to.

OverlayFS definitely needs to go to this effort when people call stat(), because various programs will directly use the inode number (the POSIX 'file serial number') to tell files on the same filesystem apart. POSIX technically requires OverlayFS to do this for readdir(), but in practice almost everyone that uses readdir() isn't going to look at the inode number; they look at the file name and perhaps the d_type field to spot directories without needing to stat() everything.

If there was a special 'not a valid inode number' signal value, OverlayFS might use that, but there isn't one (in either POSIX or Linux, which is actually a problem). Since OverlayFS needs to provide some sort of arguably valid inode number, and since it's reading directories from the underlying filesystems, passing through their inode numbers from their d_ino fields is the simple answer.
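As an illustration (my own sketch, not anything from the OverlayFS discussion), Python's os.scandir() hands back readdir()'s d_ino through DirEntry.inode(), while os.lstat() reports the stat() inode number, so you can watch the two disagree on an OverlayFS mount (the '/merged' mount point here is hypothetical):

import os

def compare_inodes(path):
    # DirEntry.inode() is the d_ino value from reading the directory;
    # os.lstat() asks the kernel via stat() and so gets the 'real' number.
    with os.scandir(path) as entries:
        for entry in entries:
            d_ino = entry.inode()
            st_ino = os.lstat(entry.path).st_ino
            note = "" if d_ino == st_ino else "  <-- differs"
            print(f"{entry.name}: readdir {d_ino} stat {st_ino}{note}")

compare_inodes("/merged")   # hypothetical OverlayFS mount point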

(This entry was inspired by Kevin Lyda's comment on my earlier entry.)

Sidebar: Why there should be a 'not a valid inode number' signal value

Because both standards and common Unix usage include a d_ino field in the structure readdir() returns, they embed the idea that the stat()-visible inode number can easily be recovered or generated by filesystems purely by reading directories, without needing to perform additional IO. This is true in traditional Unix filesystems, but it's not obvious that you would do that all of the time in all filesystems. The on disk format of directories might only have some sort of object identifier for each name that's not easily mapped to a relatively small 'inode number' (which is required to be some C integer type), and instead the 'inode number' is an attribute you get by reading file metadata based on that object identifier (which you'll do for stat() but would like to avoid for reading directories).

But in practice if you want to design a Unix filesystem that performs decently well and doesn't just make up inode numbers in readdir(), you must store a potentially duplicate copy of your 'inode numbers' in directory entries.

Keeping notes is for myself too, illustrated (once again)

By: cks

Yesterday I wrote about restarting or redoing something after a systemd service restarts. The non-hypothetical situation that caused me to look into this was that after we applied a package update to one system, systemd-networkd on it restarted and wiped out some critical policy based routing rules. Since I vaguely remembered this happening before, I sighed and arranged to have our rules automatically reapplied on both systems with policy based routing rules, following the pattern I worked out.

Wait, two systems? And one of them didn't seem to have problems after the systemd-networkd restart? Yesterday I ignored that and forged ahead, but really it should have set off alarm bells. The reason the other system wasn't affected was that I'd already solved the problem the right way back in March of 2024, when we first hit this networkd behavior and I wrote an entry about it.

However, I hadn't left myself (or my co-workers) any notes about that March 2024 fix; I'd put it into place on the first machine (then the only machine we had that did policy based routing) and forgotten about it. My only theory is that I wanted to wait and be sure it actually fixed the problem before documenting it as 'the fix', but if so, I made a mistake by not leaving myself any notes that I had a fix in testing. When I recently built the second machine with policy based routing I copied things from the first machine, but I didn't copy the true networkd fix because I'd forgotten about it.

(It turns out to have been really useful that I wrote that March 2024 entry because it's the only documentation I have, and I'd probably have missed the real fix if not for it. I rediscovered it in the process of writing yesterday's entry.)

I know (and knew) that keeping notes is good, and that my memory is fallible. And I still let this slip through the cracks for whatever reason. Hopefully the valuable lesson I've learned from this will stick a bit so I don't stub my toe again.

(One obvious lesson is that I should make a note to myself any time I'm testing something that I'm not sure will actually work. Since it may not work, I may not want to formally document it in our normal system for this, but a personal note will keep me from completely losing track of it. You can see the persistence of things 'in testing' as another example of the aphorism that there's nothing as permanent as a temporary fix.)

Restarting or redoing something after a systemd service restarts

By: cks

Suppose, not hypothetically, that your system is running some systemd based service or daemon that resets or erases your carefully cultivated state when it restarts. One example is systemd-networkd, although you can turn that off (or parts of it off, at least), but there are likely others. To clean up after this happens, you'd like to automatically restart or redo something after a systemd unit is restarted. Systemd supports this, but I found it slightly unclear how you want to do this, and today I poked at it, so it's time for notes.

(This is somewhat different from triggering one unit when another unit becomes active, which I think is still not possible in general.)

First, you need to put whatever you want to do into a script and a .service unit that will run the script. The traditional way to run a script through a .service unit is:

[Unit]
....

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/your/script/here

[Install]
WantedBy=multi-user.target

(The 'RemainAfterExit' is load-bearing, also.)

To get this unit to run after another unit is started or restarted, what you need is PartOf=, which causes your unit to be stopped and started when the other unit is, along with 'After=' so that your unit starts after the other unit instead of racing it (which could be counterproductive when what you want to do is fix up something from the other unit). So you add:

[Unit]
...
PartOf=systemd-networkd.service
After=systemd-networkd.service

(This is what works for me in light testing. This assumes that the unit you want to re-run after is normally always running, as systemd-networkd is.)
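Putting those two fragments together, a complete unit for my non-hypothetical case might look something like this (a sketch only; the script path is a placeholder in the style of the earlier example, and you'd name the unit whatever makes sense for you):

[Unit]
Description=Re-apply our policy based routing rules after networkd restarts
PartOf=systemd-networkd.service
After=systemd-networkd.service

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/your/script/here

[Install]
WantedBy=multi-user.target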

In testing, you don't need to have your unit specifically enabled by itself, although you may want it to be for clarity and other reasons. Even if your unit isn't specifically enabled, systemd will start it after the other unit because of the PartOf=. If the other unit is started all of the time (as is usually the case for systemd-networkd), this effectively makes your unit enabled, although not in an obvious way (which is why I think you should specifically 'systemctl enable' it, to make it obvious). I think you can have your .service unit enabled and active without having the other unit enabled, or even present.

You can declare yourself PartOf a .target unit, and some stock package systemd units do for various services. And a .target unit can be PartOf a .service; on Fedora, 'sshd-keygen.target' is PartOf sshd.service in a surprisingly clever little arrangement to generate only the necessary keys through a templated 'sshd-keygen@.service' unit.

I admit that the whole collection of Wants=, Requires=, Requisite=, BindsTo=, PartOf=, Upholds=, and so on are somewhat confusing to me. In the past, I've used the wrong version and suffered the consequences, and I'm not sure I have them entirely right in this entry.

Note that as far as I know, PartOf= has those Requires= consequences, where if the other unit is stopped, yours will be too. In a simple 'run a script after the other unit starts' situation, stopping your unit does nothing and can be ignored.

(If this seems complicated, well, I think it is, and I think one part of the complication is that we're trying to use systemd as an event-based system when it isn't one.)

Systemd-resolved's new 'DNS Server Delegation' feature (as of systemd 258)

By: cks

A while ago I wrote an entry about things that resolved wasn't for as of systemd 251. One of those things was arbitrary mappings of (DNS) names to DNS servers, for example if you always wanted *.internal.example.org to query a special DNS server. Systemd-resolved didn't have a direct feature for this, and attempting to attach such name-to-server mappings to a network interface could go wrong in various ways. Well, time marches on and as of systemd v258 this is no longer the state of affairs.

Systemd v258 introduces systemd.dns-delegate files, which allow you to map DNS names to DNS servers independently from network interfaces. The release notes describe this as:

A new DNS "delegate zone" concept has been introduced, which are additional lookup scopes (on top of the existing per-interface and the one global scope so far supported in resolved), which carry one or more DNS server addresses and a DNS search/routing domain. It allows routing requests to specific domains to specific servers. Delegate zones can be configured via drop-ins below /etc/systemd/dns-delegate.d/*.dns-delegate.

Since systemd v258 is very new I don't have any machines where I can actually try this out, but based on the systemd.dns-delegate documentation, you can use this both for domains that you merely want diverted to some DNS server and also domains that you also want on your search path. Per resolved.conf's Domains= documentation, the latter is 'Domains=example.org' (example.org will be one of the domains that resolved tries to find single-label hostnames in, a search domain), and the former is 'Domains=~example.org' (where we merely send queries for everything under 'example.org' off to whatever DNS= you set, a route-only domain).

(While resolved.conf's Domains= officially promises to check your search domains in the order you listed them, I believe this is strictly for a single 'Domains=' setting for a single interface. If you have multiple 'Domains=' settings, for example in a global resolved.conf, a network interface, and now in a delegation, I think systemd-resolved makes no promises.)
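As a concrete sketch of what a drop-in might look like (I haven't been able to test this, and I'm assuming the '[Delegate]' section name and the DNS=/Domains= settings from the systemd.dns-delegate documentation; the server IP and domain are made up):

# /etc/systemd/dns-delegate.d/internal.dns-delegate
[Delegate]
# Route everything under internal.example.org to this DNS server; the '~'
# makes it a route-only domain rather than a search domain.
DNS=192.0.2.53
Domains=~internal.example.org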

Right now, these DNS server delegations can only be set through static files, not manipulated through resolvectl. I believe fiddling with them through resolvectl is on the roadmap, but for now I guess we get to restart resolved if we need to change things. In fact resolvectl doesn't expose anything to do with them, although I believe read-only information is available via D-Bus and maybe varlink.

Given the timing of systemd v258's release relative to Fedora releases, I probably won't be able to use this feature until Fedora 44 in the spring (Fedora 42 is current and Fedora 43 is imminent, which won't have systemd v258 given that v258 was released only a couple of weeks ago). My current systemd-resolved setup is okay (if it wasn't I'd be doing something else), but I can probably find uses for these delegations to improve it.

Why I have a GPS bike computer

By: cks

(This is a story about technology. Sort of.)

Many bicyclists with a GPS bike computer probably have it primarily to record their bike rides and then upload them to places like Strava. I'm a bit unusual in that while I do record my rides and make some of them public, and I've come to value this, it's not my primary reason to have a GPS bike computer. Instead, my primary reason is following pre-made routes.

When I started with my recreational bike club, it was well before the era of GPS bike computers. How you followed (or led) our routes back then was through printed cue sheets, which had all of the turns and so on listed in order, often with additional notes. One of the duties of the leader of the ride was printing out a sufficient number of cue sheets in advance and distributing them to interested parties before the start of the ride. If you were seriously into using cue sheets, you'd use a cue sheet holder (nowadays you can only find these as 'map holders', which is basically the same job); otherwise you might clip the cue sheet to a handlebar brake or gear cable or fold it up and stick it in a back jersey pocket.

Printed cue sheets have a number of nice features, such as giving you a lot of information at a glance. One of them is that a well done cue sheet was and is a lot more than just a list of all of the turns and other things worthy of note; it's an organized, well formatted list of these. The cues would be broken up into sensibly chosen sections, with whitespace between them to make it easier to narrow in on the current one, and you'd lay out the page (or pages) so that the cue or section breaks happened at convenient spots to flip the cue sheet around in cue holders or clips. You'd emphasize important turns, cautions, or other things in various ways. And so on. Some cue sheets even had a map of the route printed on the back.

(You needed to periodically flip the cue sheet around and refold it because many routes had too many turns and other cues to fit in a small amount of printed space, especially if you wanted to use a decently large font size for easy readability.)

Starting in the early 2010s, more and more TBN people started using GPS bike computers or smartphones (cf). People began converting our cue sheet routes to computerized GPS routes, with TBN eventually getting official GPS routes. Over time, more and more members got smartphones and GPS units and there was more and more interest in GPS routes and less and less interest in cue sheets. In 2015 I saw the writing on the wall for cue sheets and the club more or less deprecated them, so in August 2016 I gave in and got a GPS unit (which drove me to finally get a smartphone, because my GPS unit assumed you had one). Cue sheet first routes lingered on for some years afterward, but they're all gone by now; everything is GPS route first.

You can still get cue sheets for club routes (the club's GPS routes typically have turn cues and you can export these into something you can print). But what we don't really have any more is the old school kind of well done, organized cue sheets, and it's basically been a decade since ride leaders would turn up with any printed cue sheets at all. These days it's on you to print your own cue sheet if you need it, and also on you to make a good cue sheet from the basic cue sheet (if you care enough to do so). There are some people who still use cue sheets, but they're a decreasing minority and they probably already had the cue sheet holders and so on (which are now increasingly hard to find). A new rider who wanted to use cue sheets would have an uphill struggle and they might never understand why long time members could be so fond of them.

Cue sheets are still a viable option for route following (and they haven't fundamentally changed). They're just not very well supported any more in TBN because they stopped being popular. If you insist on sticking with them, you still can, but it's not going to be a great experience. I didn't move to a GPS unit because I couldn't possibly use cue sheets any more (I still have my cue sheet holder); I moved because I could see the writing on the wall about which one would be the more convenient, more usable option.

Applications to the (computing) technologies of your choice are left as an exercise for the reader.

PS: As a whole I think GPS bike computers are mostly superior to cue sheets for route following, but that's a different discussion (and it depends on what sort of bicycling you're doing). There are points on both sides.

A Firefox issue and perhaps how handling scaling is hard

By: cks

Over on the Fediverse I shared a fun Firefox issue I've just run into:

Today's fun Firefox bug: if I move my (Nightly) Firefox window left and right across my X display, the text inside the window reflows to change its line wrapping back and forth. I have a HiDPI display with non-integer scaling and some other settings, so I'm assuming that Firefox is now suffering from rounding issues where the exact horizontal pixel position changes its idea of the CSS window width, triggering text reflows as it jumps back and forth by a CSS pixel.

(I've managed to reproduce this in a standard Nightly, although so far only with some of my settings.)

Close inspection says that this isn't quite what's happening, and the underlying problem is happening more often than I thought. What is actually happening is that as I move my Firefox window left and right, a thin vertical black line usually appears and disappears at the right edge of the window (past a scrollbar if there is one). Since I can see it on my HiDPI display, I suspect that this vertical line is at least two screen pixels wide. Under the right circumstances of window width, text size, and specific text content, this vertical black bar takes enough width away from the rest of the window to cause Firefox to re-flow and re-wrap text, creating easily visible changes as the window moves.

A variation of this happens when the vertical black bar isn't drawn, but things on the right side of the toolbar and the URL bar area will shift left and right slightly as the window is moved horizontally. If the window is showing a scrollbar, the position of the scroll target in the scrollbar will move left and right, with the right side getting ever so slightly wider or returning to being symmetrical. It's easiest to see this if I move the window sideways slowly, which is of course not something I do often (usually I move windows rapidly).

(This may be related to how X has a notion of sizing windows in non-pixel units if the window asks for it. Firefox in my configuration definitely asks for this; it asserts that it wants to be resized in units of 2 (display) pixels both horizontally and vertically. However, I can look at the state of a Firefox window in X and see that the window size in pixels doesn't change between the black bar appearing and disappearing.)

All of this is visible partly because under X and my window manager, windows can redisplay themselves even during an active move operation. If the window contents froze while I dragged windows around, I probably wouldn't have noticed this for some time. Text reflowing as I moved a Firefox window sideways created a quite attention-getting shimmer.

It's probably relevant that I need unusual HiDPI settings and I've also set Firefox's layout.css.devPixelsPerPx to 1.7 in about:config. That was part of why I initially assumed this was a scaling and rounding issue, and why I still suspect that area of Firefox a bit.

(I haven't filed this as a Firefox bug yet, partly because I just narrowed down what was happening in the process of writing this entry.)

What (I think) you need to do basic UDP NAT traversal

By: cks

Yesterday I wished for a way to do native "blind" WireGuard relaying, without needing to layer something on top of WireGuard. I wished for this both because it's the simplest approach for getting through NATs and the one you need in general under some circumstances. The classic and excellent work on all of the complexities of NAT traversal is Tailscale's How NAT traversal works, which also winds up covering the situation where you absolutely have to have a relay. But, as I understand things, in a fair number of situations you can sort of do without a relay and have direct UDP NAT traversal, although you need to do some extra work to get it and you need additional pieces.

Following RFC 4787, we can divide NAT into two categories, endpoint-independent mapping (EIM) and endpoint-dependent mapping (EDM). In EIM, the public IP and port of your outgoing NAT'd traffic depend only on your internal IP and port, not on the destination (IP or port); in EDM they (also) depend on the destination. NAT'ing firewalls normally NAT based on what could be called "flows". For TCP, flows are a real thing; you can specifically identify a single TCP connection and it's difficult to fake one. For UDP, a firewall generally has no idea of what is a valid flow, and the best it can do is accept traffic that comes from the destination IP and port, which in theory is replies from the other end.

This leads to the NAT traversal trick that we can do for UDP specifically. If we have two machines that want to talk to each other on each other's UDP port 51820, the first thing they need is to learn the public IP and port being used by the other machine. This requires some sort of central coordination server as well as the ability to send traffic to somewhere on UDP port 51820 (or whatever port you care about). In the case of WireGuard, you might as well make this a server on a public IP running WireGuard and have an actual WireGuard connection to it, and the discount 'coordination server' can then be basically the WireGuard peer information from 'wg' (the 'endpoint' is the public IP and port you need).

Once the two machines know each other's public IP and port, they start sending UDP port 51820 (or whatever) packets to each other, to the public IP and port they learned through the coordination server. When each of them sends their first outgoing packet, this creates a 'flow' on their respective NAT firewall which will allow the other machine's traffic in. Depending on timing, the first few packets from the other machine may arrive before your firewall has set up its state to allow them in and will get dropped, so each side needs to keep sending until it works or until it's clear that at least one side has an EDM (or some other complication).

(For WireGuard, you'd need something that sets the peer's endpoint to your now-known host and port value and then tries to send it some traffic to trigger the outgoing packets.)
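To make the mechanics concrete, here's a minimal sketch (mine, not from any real system) of the hole-punching step in Python, assuming each side has already learned the other's public IP and port from the coordination server (the peer endpoint here is made up):

import socket

LOCAL_PORT = 51820                     # the UDP port we want reachable
PEER = ("203.0.113.7", 28391)          # peer's public endpoint, learned out of band

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", LOCAL_PORT))
sock.settimeout(1.0)

# Keep sending until something comes back; the first packets in either
# direction may be dropped before both NAT firewalls have created their
# 'flow' state for this pair of endpoints.
for attempt in range(30):
    sock.sendto(b"punch", PEER)
    try:
        data, addr = sock.recvfrom(1500)
    except socket.timeout:
        continue
    print(f"got {data!r} from {addr}; traffic is flowing")
    break
else:
    print("no reply; at least one side probably has an EDM NAT (or worse)")

Each side runs the same loop with its own local port and the other's public endpoint; once either side gets a reply, both NATs have the state they need and the real traffic can take over the port.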

As covered in Tailscale's article, it's possible to make direct NAT traversal work in some additional circumstances with increasing degrees of effort. You may be lucky and have a local EDM firewall that can be asked to stop doing EDM for your UDP port (via a number of protocols for this), and otherwise it may be possible to feel your way around one EDM firewall.

If you can arrange a natural way to send traffic from your UDP port to your coordination server, the basic NAT setup can be done without needing the deep cooperation of the software using the port; all you need is a way to switch what remote IP and port it uses for a particular peer. Your coordination server may need special software to listen to traffic and decode which peer is which, or you may be able to exploit existing features of your software (for example, by making the coordination server a WireGuard peer). Otherwise, I think you need either some cooperation from the software involved or gory hacks.

Wishing for a way to do 'blind' (untrusted) WireGuard relaying

By: cks

Over on the Fediverse, I sort of had a question:

I wonder if there's any way in standard WireGuard to have a zero-trust network relay, so that two WG peers that are isolated from each other (eg both behind NAT) can talk directly. The standard pure-WG approach has a public WG endpoint that everyone talks to and which acts as a router for the internal WG IPs of everyone, but this involves decrypting and re-encrypting the WG traffic.

By 'talk directly' I mean that each of the peers has the WireGuard keys of the other and the traffic between the two of them stays encrypted with those keys all the way through its travels. The traditional approach to the problem of two NAT'd machines that want to talk to each other with WireGuard is to have a WireGuard router that both of them talk to over WireGuard, but this means that the router sees the unencrypted traffic between them. This is less than ideal if you don't want to trust your router machine, for example because you want to make it a low-trust virtual machine rented from some cloud provider.

Since we love indirection in computer science, you can in theory solve this with another layer of traffic encapsulation (with a lot of caveats). The idea is that all of the 'public' endpoint IPs of WireGuard peers are actually on a private network, and you route the private network through your public router. Getting the private network packets to and from the router requires another level of encapsulation and unless you get very clever, all your traffic will go through the router even if two WireGuard peers could talk directly. Since WireGuard automatically keeps track of the current public IPs of peers, it would be ideal to do this with WireGuard, but I'm not sure that WG-in-WG can have the routing maintained the way we want.

This untrusted relay situation is of course one of the things that 'automatic mesh network on top of WireGuard' systems give you, but it would be nice to be able to do this with native features (and perhaps without an explicit control plane server that machines talk to, although that seems unlikely). As far as I know such systems implement this with their own brand of encapsulation, which I believe requires running their WireGuard stack.

(On Linux you might be able to do something clever with redirecting outgoing WireGuard packets to a 'tun' device connected to a user level program, which then wrapped them up, sent them off, received packets back, and injected the received packets into the system.)

Using systems because you know them already

By: cks

Every so often on the Fediverse, people ask for advice on a monitoring system to run on their machine (desktop or server), and some of the time Prometheus comes up, and when it does I wind up making awkward noises. On the one hand, we run Prometheus (and Grafana) and are happy with it, and I run separate Prometheus setups on my work and home desktops. On the other hand, I don't feel I can recommend picking Prometheus for a basic single-machine setup, despite running it that way myself.

Why do I run Prometheus on my own machines if I don't recommend that you do so? I run it because I already know Prometheus (and Grafana), and in fact my desktops (re)use much of our production Prometheus setup (but they scrape different things). This is a specific instance (and example) of a general thing in system administration, which is that not infrequently it's simpler for you to use something you already know even if it's not necessarily an exact fit (or even a great fit) for the problem. For example, if you're quite familiar with operating PostgreSQL databases, it might be simpler to use PostgreSQL for a new system where SQLite could do perfectly well and other people would find SQLite much simpler. Especially if you have canned setups, canned automation, and so on all ready to go for PostgreSQL, and not for SQLite.

(Similarly, our generic web server hammer is Apache, even if we're doing things that don't necessarily need Apache and could be done perfectly well or perhaps better with nginx, Caddy, or whatever.)

This has a flipside, where you use a tool because you know it even if there might be a significantly better option, one that would actually be easier overall even accounting for needing to learn the new option and build up the environment around it. What we could call "familiarity-driven design" is a thing, and it can even be a confining thing, one where you shape your problems to conform to the tools you already know.

(And you may not have chosen your tools with deep care and instead drifted into them.)

I don't think there's any magic way to know which side of the line you're on. Perhaps the best we can do is be a little bit skeptical about our reflexive choices, especially if we seem to be sort of forcing them in a situation that feels like it should have a simpler or better option (such as basic monitoring of a single machine).

(In a way it helps that I know so much about Prometheus because it makes me aware of various warts, even if I'm used to them and I've climbed the learning curves.)

Apache .htaccess files are important because they enable delegation

By: cks

Apache's .htaccess files have a generally bad reputation. For example, lots of people will tell you that they can cause performance problems and you should move everything from .htaccess files into your main Apache configuration, using various pieces of Apache syntax to restrict where particular configuration directives apply. The result can even be clearer, since various things can be confusing in .htaccess files (eg rewrites and redirects). Despite all of this, .htaccess files are important and valuable because of one property, which is that they enable delegation of parts of your server configuration to other people.

The Apache .htaccess documentation even spells this out in reverse, in When (not) to use .htaccess files:

In general, you should only use .htaccess files when you don't have access to the main server configuration file. [...]

If you operate the server and would be writing the .htaccess file, you can put the contents of the .htaccess in the main server configuration and make your life easier and Apache faster (and you probably should). But if the web server and its configuration isn't managed as a unitary whole by one group, then .htaccess files allow the people managing the overall Apache configuration to safely delegate things to other people on a per-directory basis, using Unix ownership. This can both enable people to do additional things and reduce the amount of work the central people have to do, letting things scale better.
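As a sketch of what this delegation looks like in practice (my own illustrative example, with made-up paths): the central configuration decides what .htaccess files are allowed to override in people's areas, and then each person manages their own directory without ever touching the server configuration.

# In the main server configuration, maintained by the central people:
<Directory "/home/*/public_html">
    # Let per-directory .htaccess files control authentication, redirects,
    # indexes, and access limits, but nothing else.
    AllowOverride AuthConfig FileInfo Indexes Limit
    Require all granted
</Directory>

# A .htaccess file that one person drops into their own directory:
AuthType Basic
AuthName "Lab members only"
AuthUserFile /home/someuser/.htpasswd
Require valid-user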

(The other thing that .htaccess files allow is dynamic updates without having to restart or reload the whole server. In some contexts this can be useful or important, for example if the updates are automatically generated at unpredictable times.)

I don't think it's an accident that .htaccess files emerged in Apache, because one common environment Apache was initially used in was old fashioned multi-user Unix web servers where, for example, every person with a login on the web server might have their own UserDir directory hierarchy. Hence features like suEXEC, so you could let people run CGIs without those CGIs having to run as the web user (a dangerous thing), and also hence the attraction of .htaccess files. If you have a bunch of (graduate) students with their own web areas, you definitely don't want to let all of them edit your departmental web server's overall configuration.

(Apache doesn't solve all your problems here, at least not in a simple configuration; you're still left with the multiuser PHP problem. Our solution to this problem is somewhat brute force.)

These environments are uncommon today but they're not extinct, at least at universities like mine, and .htaccess files (and Apache's general flexibility) remain valuable to us.

Readdir()'s inode numbers versus OverlayFS

By: cks

Recently I re-read Deep Down the Rabbit Hole: Bash, OverlayFS, and a 30-Year-Old Surprise (via) and this time around, I stumbled over a bit in the writeup that made me raise my eyebrows:

Bash’s fallback getcwd() assumes that the inode [number] from stat() matches one returned by readdir(). OverlayFS breaks that assumption.

I wouldn't call this an 'assumption' so much as 'sane POSIX semantics', although I'm not sure that POSIX absolutely requires this.

As we've seen before, POSIX talks about 'file serial number(s)' instead of inode numbers. The best definition of these is covered in sys/stat.h, where we see that a 'file identity' is uniquely determined by the combination of the inode number and the device ID (st_dev), and POSIX says that 'at any given time in a system, distinct files shall have distinct file identities' while hardlinks have the same identity. The POSIX description of readdir() and dirent.h don't caveat the d_ino file serial numbers from readdir(), so they're implicitly covered by the general rules for file serial numbers.

In theory you can claim that the POSIX guarantees don't apply here since readdir() is only supplying d_ino, the file serial number, not the device ID as well. I maintain that this fails due to a POSIX requirement:

[...] The value of the structure's d_ino member shall be set to the file serial number of the file named by the d_name member. [...]

If readdir() gives one file serial number and a fstatat() of the same name gives another, a plain reading of POSIX is that one of them is lying. Files don't have two file serial numbers, they have one. Readdir() can return duplicate d_ino numbers for files that aren't hardlinks to each other (and I think legitimately may do so in some unusual circumstances), but it can't return something different than what fstatat() does for the same name.

The perverse argument here turns on POSIX's 'at any given time'. You can argue that the readdir() is at one time and the stat() is at another time and the system is allowed to entirely change file serial numbers between the two times. This is certainly not the intent of POSIX's language but I'm not sure there's anything in the standard that rules it out, even though it makes file serial numbers fairly useless since there's no POSIX way to get a bunch of them at 'a given time' so they have to be coherent.

So to summarize, OverlayFS has chosen what are effectively non-POSIX semantics for its readdir() inode numbers (under some circumstances, in the interests of performance) and Bash used readdir()'s d_ino in a traditional Unix way that caused it to notice. Unix filesystems can depart from POSIX semantics if they want, but I'd prefer if they were a bit more shamefaced about it. People (ie, programs) count on those semantics.

(The truly traditional getcwd() way wouldn't have been a problem, because it predates readdir() having d_ino and so doesn't use it (it stat()s everything to get inode numbers). I reflexively follow this pre-d_ino algorithm when I'm talking about doing getcwd() by hand (cf), but these days you want to use the dirent d_ino and if possible d_type, because they're much more efficient than stat()'ing everything.)
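For illustration, here's a rough sketch (mine) of that pre-d_ino, stat()-everything approach to getcwd(); it walks up through '..' and identifies each directory's name by matching (st_dev, st_ino):

import os

def getcwd_by_hand():
    names = []
    here = "."
    while True:
        cur = os.lstat(here)
        parent = os.path.join(here, "..")
        par = os.lstat(parent)
        if (par.st_dev, par.st_ino) == (cur.st_dev, cur.st_ino):
            break                       # '..' is ourselves, so this is the root
        with os.scandir(parent) as entries:
            for entry in entries:
                st = os.lstat(entry.path)
                if (st.st_dev, st.st_ino) == (cur.st_dev, cur.st_ino):
                    names.append(entry.name)
                    break
        here = parent
    return "/" + "/".join(reversed(names))

print(getcwd_by_hand(), os.getcwd())    # the two should normally agree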

How part of my email handling drifted into convoluted complexity

By: cks

Once upon a time, my email handling was relatively simple. I wasn't on any big mailing lists, so I had almost everything delivered straight to my inbox (both in the traditional /var/mail mbox sense and then through to MH's own inbox folder directory). I did some mail filtering with procmail, but it was all for things that I basically never looked at, so I had procmail write them to mbox files under $HOME/.mail. I moved email from my Unix /var/mail inbox to MH's inbox with MH's inc command (either running it directly or having exmh run it for me). Rarely, I had a mbox file procmail had written that I wanted to read, and at that point I inc'd it either to my MH +inbox or to some other folder.

Later, prompted by wanting to improve my breaks and vacations, I diverted a bunch of mailing lists away from my inbox. Originally I had procmail write these diverted messages to mbox files, then later I'd inc the files to read the messages. Then I found that outside of vacations, I needed to make this email more readily accessible, so I had procmail put them in MH folder directories under Mail/inbox (one of MH's nice features is that your inbox is a regular folder and can have sub-folders, just like everything else). As I noted at the time, procmail only partially emulates MH when doing this, and one of the things it doesn't do is keep track of new, unread ('unseen') messages.

(MH has a general purpose system for keeping track of 'sequences' of messages in a MH folder, so it tracks unread messages based on what is in the special 'unseen' sequence. Inc and other MH commands update this sequence; procmail doesn't.)

Along with this procmail setup I wrote a basic script, called mlists, to report how many messages each of these 'mailing list' inboxes had in them. After a while I started diverting lower priority status emails and so on through this system (and stopped reading the mailing lists); if I got a type of email in any volume that I didn't want to read right away during work, it probably got shunted to these side inboxes. At some point I made mlists optionally run the MH scan command to show me what was in each inbox folder (well, for the inbox folders where this was potentially useful information). The mlists script was still mostly simple and the whole system still made sense, but it was a bit more complex than before, especially when it also got a feature where it auto-reset the current message number in each folder to the first message.

A couple of years ago, I switched the MH frontend I used from exmh to MH-E in GNU Emacs, which changed how I read my email in practice. One of the changes was that I started using the GNU Emacs Speedbar, which always displays a count of messages in MH folders and especially wants to let you know about folders with unread messages. Since I had the hammer of my mlists script handy, I proceeded to mutate it to be what a comment in the script describes as "a discount maintainer of 'unseen'", so that MH-E's speedbar could draw my attention to inbox folders that had new messages.

This is not the right way to do this. The right way to do this is to have procmail deliver messages through MH's rcvstore, which as a MH command can update the 'unseen' sequence properly. But using rcvstore is annoying, partly because you have to use another program to add the locking it needs, so at every point the path of least resistance was to add a few more hacks to what I already had. I had procmail, and procmail could deliver to MH folder directories, so I used it (and at the time the limitations were something I considered a feature). I had a script to give me basic information, so it could give me more information, and then it could do one useful thing while it was giving me information, and then the one useful thing grew into updating 'unseen'.

And since I have all of this, it's not even worth the effort of switching to the proper rcvstore approach and throwing a bunch of it away. I'm always going to want the 'tell me stuff' functionality of my mlists script, so part of it has to stay anyway.

Can I see similarities between this and how various of our system tools have evolved, mutated, and become increasingly complex? Of course. I think it's much the same obvious forces involved, because each step seems reasonable in isolation, right up until I've built a discount environment that duplicates much of rcvstore.

Sidebar: an extra bonus bit of complexity

It turns out that part of the time, I want to get some degree of live notification of messages being filed into these inbox folders. I may not look at all or even many of them, but there are some periodic things that I do want to pay attention to. So my discount special hack is basically:

tail -f .mail/procmail-log |
  egrep -B2 --no-group-separator 'Folder: /u/cks/Mail/inbox/'

(This is a script, of course, and I run it in a terminal window.)

This could be improved in various ways but then I'd be sliding down the convoluted complexity slope and I'm not willing to do that. Yet. Give it a few years and I may be back to write an update.

More on the tools I use to read email affecting my email reading

By: cks

About two years ago I wrote an entry about how my switch from reading email with exmh to reading it in GNU Emacs with MH-E had affected my email reading behavior more than I expected. As time has passed and I've made more extensive customizations to my MH-E environment, this has continued. One of the recent ways I've noticed is that I'm slowly making more and more use of the fact that GNU Emacs is a multi-window editor ('multi-frame' in Emacs terminology) and reading email with MH-E inside it still leaves me with all of the basic Emacs facilities. Specifically, I can create several Emacs windows (frames) and use this to be working in multiple MH folders at the same time.

Back when I used exmh extensively, I mostly had MH pull my email into the default 'inbox' folder, where I dealt with it all at once. Sometimes I'd wind up pulling some new email into a separate folder, but exmh only really giving me a view of a single folder at a time, combined with a system administrator's need to be regularly responding to email, made that a bit awkward. At first my use of MH-E mostly followed that; I had a single Emacs MH-E window (frame) and within that window I switched between folders. But lately I've been creating more new windows when I want to spend time reading a non-inbox folder, and in turn this has made me much more willing to put new email directly into different (MH) folders rather than funnel it all into my inbox.

(I don't always make a new window to visit another folder, because I don't spend long on many of my non-inbox folders for new email. But for various mailing lists and so on, reading through them may take at least a bit of time so it's more likely I'll decide I want to keep my MH inbox folder still available.)

One thing that makes this work is that MH-E itself has reasonably good support for displaying and working on multiple folders at once. There are probably ways to get MH-E to screw this up and run MH commands with the wrong MH folder as the current folder, so I'm careful that I don't try to have MH-E carry out its pending MH operations in two MH-E folders at the same time. There are areas where MH-E is less than ideal when I'm also using command-line MH tools, because MH-E changes MH's global notion of the current folder any time I have it do things like show a message in some folder. But at least MH-E is fine (in normal circumstances) if I use MH commands to change the current folder; MH-E will just switch it back the next time I have it show another message.

PS: On a purely pragmatic basis, another change in my email handling is that I'm no longer as irritated with HTML emails because GNU Emacs is much better at displaying HTML than exmh was. I've actually left my MH-E setup showing HTML by default, instead of forcing multipart/alternative email to always show the text version (my exmh setup). GNU Emacs and MH-E aren't up to the level of, say, Thunderbird, and sometimes this results in confusing emails, but it's better than it was.

(The situation that seems tricky for MH-E is that people sometimes include inlined images, for example screenshots as part of problem reports, and MH-E doesn't always give any indication that it's even omitting something.)

Syndication feed fetchers, HTTP redirects, and conditional GET

By: cks

In response to my entry on how ETag values are specific to a URL, a Wandering Thoughts reader asked me in email what a syndication feed reader (fetcher) should do when it encounters a temporary HTTP redirect, in the context of conditional GET. I think this is a good question, especially if we approach it pragmatically.

The specification compliant answer is that every final (non-redirected) URL must have its ETag and Last-Modified values tracked separately. If you make a conditional GET for URL A because you know its ETag or Last-Modified (or both) and you get a temporary HTTP redirection to another URL B that you don't have an ETag or Last-Modified for, you can't make a conditional GET. This means you have to ensure that If-None-Match and especially If-Modified-Since aren't copied from the original HTTP request to the newly re-issued redirect target request. And when you make another request for URL A later, you can't send a conditional GET using ETag or Last-Modified values you got from successfully fetching URL B; you either have to use the last values observed for URL A or make an unconditional GET. In other words, saved ETag and Last-Modified values should be per-URL properties, not per-feed properties.

(Unfortunately this may not fit well with feed reader code structures, data storage, or uses of low-level HTTP request libraries that hide things like HTTP redirects from you.)
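Sketched in Python with the requests library (my illustration; the in-memory 'validators' store and the function are hypothetical), the specification compliant approach looks roughly like this:

import requests
from urllib.parse import urljoin

validators = {}     # final URL -> {"etag": ..., "last-modified": ...}

def fetch_feed(url):
    for _ in range(10):             # follow at most a handful of redirects
        saved = validators.get(url, {})
        headers = {}
        if saved.get("etag"):
            headers["If-None-Match"] = saved["etag"]
        if saved.get("last-modified"):
            headers["If-Modified-Since"] = saved["last-modified"]
        resp = requests.get(url, headers=headers, allow_redirects=False)
        if resp.is_redirect:
            # Follow the redirect ourselves; the next pass will look up
            # (or fail to find) the target URL's own saved validators
            # instead of reusing the original URL's.
            url = urljoin(url, resp.headers["Location"])
            continue
        if resp.status_code == 200:
            validators[url] = {
                "etag": resp.headers.get("ETag"),
                "last-modified": resp.headers.get("Last-Modified"),
            }
        return resp                 # the feed, or a 304 Not Modified
    raise RuntimeError("too many redirects")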

Pragmatically, you can probably get away with re-doing the conditional GET when you get a temporary HTTP redirect for a feed, with the feed's original saved ETag and Last-Modified information. There are three likely cases for a temporary HTTP redirection of a syndication feed that I can think of:

  • You're receiving a generic HTTP redirection to some sort of error page that isn't a valid syndication feed. Your syndication feed fetcher isn't going to do anything with a successful fetch of it (except maybe add an 'error' marker to the feed), so a conditional GET that fools you with "nothing changed" is harmless.

  • You're being redirected to an alternate source of the normal feed, for example a feed that's normally dynamically generated might serve a (temporary) HTTP redirect to a static copy under high load. If the conditional GET matches the ETag (probably unlikely in practice) or the Last-Modified (more possible), then you almost certainly have the most current version and are fine, and you've saved the web server some load.

  • You're being (temporarily) redirected to some kind of error feed; a valid syndication feed that contains one or more entries that are there to tell the person seeing them about a problem. Here, the worst thing that happens if your conditional GET fools you with "nothing has changed" is that the person reading the feed doesn't see the error entry (or entries).

The third case is a special variant of an unlikely general case where the normal URL and the redirected URL are both versions of the feed but each has entries that the other doesn't. In this general case, a conditional GET that fools you with a '304 Not Modified' will cause you to miss some entries. However, this should cure itself when the temporary HTTP redirect stops happening (or when a new entry is published to the temporary location, which should change its ETag and reset its Last-Modified date to more or less now).

A feed reader that keeps a per-feed 'Last-Modified' value and updates it after following a temporary HTTP redirect is living dangerously. You may not have the latest version of the non-redirected feed but the target of the HTTP redirection may be 'more recent' than it for various reasons (even if it's a valid feed; if it's not a valid feed then blindly saving its ETag and Last-Modified is probably quite dangerous). When the temporary HTTP redirection goes away and the normal feed's URL resumes responding with the feed again, using the target's "Last-Modified" value for a conditional GET of the original URL could cause you to receive "304 Not Modified" until the feed is updated again (and its Last-Modified moves to be after your saved value), whenever that happens. Some feeds update frequently; others may only update days or weeks later.

Given this and the potential difficulties of even noticing HTTP redirects (if they're handled by some underlying library or tool), my view is that if a feed provides both an ETag and a Last-Modified, you should save and use only the ETag unless you're sure you're going to handle HTTP redirects correctly. An ETag could still get you into trouble if used across different URLs, but it's much less likely (see the discussion at the end of my entry about Last-Modified being specific to the URL).

(All of this is my view as someone providing syndication feeds, not someone writing syndication feed fetchers. There may be practical issues I'm unaware of, since the world of feeds is very large and it probably contains a lot of weird feed behavior (to go with the weird feed fetcher behavior).)

The HTTP Last-Modified value is specific to the URL (technically so is the ETag value)

By: cks

Last time around I wrote about how If-None-Match values (which come from ETag values) must come from the actual URL itself, not (for example) from another URL that you were at one point redirected to. In practice, this is only an issue of moderate concern for ETag/If-None-Match; you can usually make a conditional GET using an ETag from another URL and get away with it. This is very much an issue if you make the mistake of doing the same thing with an If-Modified-Since header based on another URL's Last-Modified header. This is because the Last-Modified header value isn't unique to a particular document, in a way that ETag values can often be.

If you take the Last-Modified timestamp from URL A and perform a conditional GET for URL B with an 'If-Modified-Since' of that timestamp, the web server may well give you exactly what you asked for but not what you wanted by saying 'this hasn't been modified since then' even though the contents of those URLs are entirely different. You told the web server to decide purely on the basis of timestamps without reference to anything that might even vaguely specify the content, and so it did. This can happen even if the server is requiring an exact timestamp match (as it probably should), because there are any number of ways for the 'Last-Modified' timestamp of a whole bunch of URLs to be exactly the same, for example because some important common element of them was last updated at that point.

(This is how DWiki works. The Last-Modified date of a page is the most recent timestamp of all of the elements that went into creating it, so if I change some shared element, everything will promptly take on the Last-Modified of that element.)

This means that if you're going to use Last-Modified in conditional GETs, you must handle HTTP redirects specially. It's actively dangerous (to actually getting updates) to mingle Last-Modified dates from the original URL and the redirection URL; you either have to not use Last-Modified at all, or track the Last-Modified values separately. For things that update regularly, any 'missing the current version' problems will cure themselves eventually, but for infrequently updated things you could go quite a while thinking that you have the current content when you don't.

In theory this is also true of ETag values; the specification allows them to be calculated in ways that are URL-specific (the specification mentions that the ETag might be a 'revision number'). A plausible implementation of serving a collection of pages from a Git repository could use the repository's Git revision as the common ETag for all pages; after all, the URL (the page) plus that git revision uniquely identifies it, and it's very cheap to provide under the right circumstances (eg, you can record the checked out git revision).

In practice, common ways of generating ETags will make them different across different URLs, potentially unless the contents are the same. DWiki generates ETag values using a cryptographic hash, so two different URLs will only have the same ETag if they have the same contents, which I believe is a common approach for pages that are generated dynamically. Apache generates ETag values for static files using various file attributes that will be different for different files, which is probably also a common approach for things that serve static files. Pragmatically you're probably much safer sending an ETag value from one URL in an If-None-Match header to another URL (for example, through repeating it while following a HTTP redirection). It's still technically wrong, though, and it may cause problems someday.

(This feels obvious but it was only today that I realized how it interacts with conditional GETs and HTTP redirects.)

Go's builtin 'new()' function will take an expression in Go 1.26

By: cks

An interesting little change recently landed in the development version of Go, and so will likely appear in Go 1.26 when it's released. The change is that the builtin new() function will be able to take an expression, not just a type. This change stems from the proposal in issue 45624, which dates back to 2021 (and earlier for earlier proposals). The new specification language is covered in, for example, this comment on the issue. An example is in the current development documentation for the release notes, but it may not sound very compelling.

A variety of uses came up in the issue discussion, some of which were a surprise to me. One case that's apparently surprisingly common is to start with a pointer and want to make another pointer to a (shallow) copy of its value. With the change to 'new()', this is:

np = new(*p)

Today you can write this as a generic function (apparently often called 'ref()'), or do it with a temporary variable, but in Go 1.26 this will (probably) be a built in feature, and perhaps the Go compiler will be able to optimize it in various ways. This sort of thing is apparently more common than you might expect.

Another obvious use for the new capability is if you're computing a new value and then creating a pointer to it. Right now, this has to be written using a temporary variable:

t := <some expression>
p := &t

With 'new(expr)' this can be written as one line, without a temporary variable (although as before a 'ref()' generic function can do this today).

The usage example from the current documentation is a little bit peculiar, at least as far as providing a motivation for this change. In a slightly modified form, the example is:

type Person struct {
    Name string `json:"name"`
    Age  *int   `json:"age"` // age if known; nil otherwise
}

func newPerson(name string, age int) *Person {
    return &Person{
        Name: name,
        Age:  new(age),
    }
}

The reason this is a bit peculiar is that today you can write 'Age: &age' and it works the same way. Well, at a semantic level it works the same way. The theoretical but perhaps not practical complication is inlining combined with escape analysis. If newPerson() is inlined into a caller, the 'age' parameter becomes the caller's variable, so 'Age: &age' effectively becomes 'Age: &callervar'; that can force escape analysis to put the caller's variable in the heap even if it's unused after the (inlined) call, which might be less efficient than keeping the variable in the stack (or registers) until right at the end.

A broad language reason is that allowing new() to take an expression removes the special privilege that structs and certain other compound data structures have had, where you could construct pointers to initialized versions of them. Consider:

type ints struct { i int }
[...]
t := 10
ip := &t
isp := &ints{i: 10}

You can create a pointer to the int wrapped in a struct on a single line with no temporary variable, but a pointer to a plain int requires you to materialize a temporary variable. This is a bit annoying.

A pragmatic part of adding this is that people appear to write and use equivalents of new(value) a fair bit. The popularity of an expression is not necessarily the best reason to add a built-in equivalent to the language, but it does suggest that this feature will get used (or will eventually get used, since the existing uses won't exactly get converted instantly for all sorts of reasons).

This strikes me as a perfectly fine change for Go to make. The one thing that's a little bit non-ideal is that 'new()' of constant numbers has less type flexibility than the constant numbers themselves. Consider:

var ui uint
var uip *uint

ui = 10       // okay
uip = new(10) // type mismatch error

The current error that the compiler reports is 'cannot use new(10) (value of type *int) as *uint value in assignment', which is at least relatively straightforward.

(You fix it by casting ('converting') the untyped constant number to whatever you need. The now more relevant than before 'default type' of a constant is covered in the specification section on Constants.)

The broad state of ZFS on Illumos, Linux, and FreeBSD (as I understand it)

By: cks

Once upon a time, Sun developed ZFS and put it in Solaris, which was good for us. Then Sun open-sourced Solaris as 'OpenSolaris', including ZFS, although not under the GPL (a move that made people sad and that Scott McNealy is on record as regretting). ZFS development continued in Solaris and thus in OpenSolaris until Oracle bought Sun and soon afterward closed Solaris source again (in 2010); while Oracle continued ZFS development in Oracle Solaris, we can ignore that. OpenSolaris was transmogrified into Illumos, and various Illumos distributions formed, such as OmniOS (which we used for our second generation of ZFS fileservers).

Well before Oracle closed Solaris, separate groups of people ported ZFS into FreeBSD and onto Linux, where the effort was known as "ZFS on Linux". Since the Linux kernel community felt that ZFS's license wasn't compatible with the kernel's license, ZoL was an entirely out of (kernel) tree effort, while FreeBSD was able to accept ZFS into their kernel tree (I believe all the way back in 2008). Both ZFS on Linux and FreeBSD took changes from OpenSolaris into their versions up until Oracle closed Solaris in 2010. After that, open source ZFS development split into three mostly separate strands.

(In theory OpenZFS was created in 2013. In practice I think OpenZFS at the time was not doing much beyond coordination of the three strands.)

Over time, a lot more people wanted to build machines using ZFS on top of FreeBSD or Linux (including us) than wanted to keep using Illumos distributions. Not only was Illumos a different environment, but Illumos and its distributions didn't see the level of developer activity that FreeBSD and Linux did, which resulted in driver support issues and other problems (cf). For ZFS, the consequence of this was that many more improvements to ZFS itself started happening in ZFS on Linux and in FreeBSD (I believe to a lesser extent) than were happening in Illumos or OpenZFS, the nominal upstream. Over time the split of effort between Linux and FreeBSD became an obvious problem and eventually people from both sides got together. This resulted in ZFS on Linux v2.0.0 becoming 'OpenZFS 2.0.0' in 2020 (see also the Wikipedia history) and also becoming portable to FreeBSD, where it became the FreeBSD kernel ZFS implementation in FreeBSD 13.0 (cf).

The current state of OpenZFS is that it's co-developed for both Linux and FreeBSD. The OpenZFS ZFS repository routinely has FreeBSD specific commits, and as far as I know OpenZFS's test suite is routinely run on a variety of FreeBSD machines as well as a variety of Linux ones. I'm not sure how OpenZFS work propagates into FreeBSD itself, but it does (some spelunking of the FreeBSD source repository suggests that there are periodic imports of the latest changes). On Linux, OpenZFS releases and development versions propagate to Linux distributions in various ways (some of them rather baroque), including people simply building their own packages from the OpenZFS repository.

Illumos continues to use and maintain its own version of ZFS, which it considers separate from OpenZFS. There is an incomplete Illumos project discussion on 'consuming' OpenZFS changes (via, also), but my impression is that very few changes move from OpenZFS to Illumos. My further impression is that there is basically no one on the OpenZFS side who is trying to push changes into Illumos; instead, OpenZFS people consider it up to Illumos to pull changes, and Illumos people aren't doing much of that for various reasons. At this point, if there's an attractive ZFS change in OpenZFS, the odds of it appearing in Illumos on a timely basis appear low (to put it one way).

(Some features have made it into Illumos, such as sequential scrubs and resilvers, which landed in issue 10405. This feature originated in what was then ZoL and was ported into Illumos.)

Even if Illumos increases the pace of importing features from OpenZFS, I don't ever expect it to be on the leading edge and I think that's fine. There have definitely been various OpenZFS features that needed some time before they became fully ready for stable production use (even after they appeared in releases). I think there's an ecological niche for a conservative ZFS that only takes solidly stable features, and that fits Illumos's general focus on stability.

PS: I'm out of touch with the Illumos world these days, so I may have mis-characterized the state of affairs there. If so, I welcome corrections and updates in the comments.

If-None-Match values must come from the actual URL itself

By: cks

Because I recently looked at the web server logs for Wandering Thoughts, I said something on the Fediverse:

It's impressive how many ways feed readers screw up ETag values. Make up their own? Insert ETags obtained from the target of a HTTP redirect of another request? Stick suffixes on the end? Add their own quoting? I've seen them all.

(And these are just the ones that I can readily detect from the ETag format being wrong for the ETags my techblog generates.)

(Technically these are If-None-Match values, not ETag values; it's just that the I-N-M value is supposed to come from an ETag you returned.)

One of these mistakes deserves special note, and that's the HTTP redirect case. Suppose you request a URL, receive a HTTP 302 temporary redirect, follow the redirect, and get a response at the new URL with an ETag value. As a practical matter, you cannot then present that ETag value in an If-None-Match header when you re-request the original URL, although you could if you re-requested the URL that you were redirected to. The two URLs are not the same and they don't necessarily have the same ETag values or even the same format of ETags.

(This is an especially bad mistake for a feed fetcher to make here, because if you got a HTTP redirect that gives you a different format of ETag, it's because you've been redirected to a static HTML page served directly by Apache (cf) and it's obviously not a valid syndication feed. You shouldn't be saving the ETag value for responses that aren't valid syndication feeds, because you don't want to get them again.)

This means that feed readers can't just store 'an ETag value' for a feed. They need to associate the ETag value with a specific, final URL, which may not be the URL of the feed (because said feed URL may have been redirected). They also need to (only) make conditional requests when they have an ETag for that specific URL, and not copy the If-None-Match header from the initial GET into a redirected GET.

This probably clashes with many HTTP client APIs, which I suspect want to hide HTTP redirects from the caller. For feed readers, such high level APIs are a mistake. They actively need to know about HTTP redirects so that, for example, they can consider updating their feed URL if they get permanent HTTP redirects to a new URL. And also, of course, to properly handle conditional GETs.
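
As a rough illustration, here is a minimal Python sketch of a feed fetcher that does this bookkeeping itself, using the third-party 'requests' package (the cache structure and names are my own, and a real feed reader needs more, such as updating its stored feed URL on permanent redirects and only saving ETags from responses that are actually feeds):

import requests
from urllib.parse import urljoin

etags = {}   # exact URL -> the ETag that this URL itself returned

def fetch(url, depth=0):
    # Only send If-None-Match if we have an ETag for this exact URL.
    headers = {}
    if url in etags:
        headers["If-None-Match"] = etags[url]
    resp = requests.get(url, headers=headers, allow_redirects=False)
    if resp.is_redirect and depth < 5:
        # Re-request the redirect target with *its* stored ETag (if
        # any), instead of copying our If-None-Match header over.
        return fetch(urljoin(url, resp.headers["Location"]), depth + 1)
    if resp.status_code == 200 and "ETag" in resp.headers:
        # Remember the ETag under the URL that actually returned it.
        etags[url] = resp.headers["ETag"]
    return resp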

A hack: outsourcing web browser/client checking to another web server

By: cks

A while back on the Fediverse, I shared a semi-cursed clever idea:

Today I realized that given the world's simplest OIDC IdP (one user, no password, no prompting, the IdP just 'logs you in' if your browser hits the login URL), you could put @cadey's Anubis in front of anything you can protect with OIDC authentication, including anything at all on an Apache server (via mod_auth_openidc). No need to put Anubis 'in front' of anything (convenient for eg static files or CGIs), and Anubis doesn't even have to be on the same website or machine.

This can be generalized, of course. There are any number of filtering proxies and filtering proxy services out there that will do various things for you, either for free or on commercial terms; one example of such a service is geoblocking maintained by someone else who's paid to stay on top of it and keep it accurate. Especially with services, you may not want to put them in front of your main website (that gives the service a lot of power), but you would be fine with putting a single-purpose website behind the service or the proxy, if your main website can use the result. With the world's simplest OIDC IdP, you can do that, at least for anything that will do OIDC.

(To be explicit, yes, I'm partly talking about Cloudflare.)

This also generalizes in the other direction, in that you don't necessarily need to use OIDC. You just need some system for passing authenticated information back and forth between your main website and your filtered, checked, proxied verification website. Since you don't need to carry user identity information around this can be pretty simple (although it's going to involve some cryptography, so I recommend just using OIDC or some well-proven option if you can). I've thought about this a bit and I'm pretty certain you can make a quite simple implementation.

(You can also use SAML if you happen to have an extremely simple SAML server and appropriate SAML clients, but really, why. OIDC is today's all-purpose authentication hammer.)

A custom system can pass arbitrary information back and forth between the main website and the verifier, so you can know (for example) if the two saw the same client details. I think you can do this to some extent with OIDC as well if you have a custom IdP, because nothing stops your IdP and your OIDC client from agreeing on some very custom OIDC claims, such as (say) 'clientip'.
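
For illustration, here is a minimal Python sketch of the sort of simple signed token I have in mind, passing the client IP and a timestamp between the verifier and the main website. The names, the field layout, and the shared secret handling are all my own invention, and a real system would also want replay protection and key rotation:

import hmac, hashlib, time

SECRET = b"a shared secret between the verifier and the main site"

def make_token(clientip, now=None):
    # The verifier issues this once its checks have passed.
    now = int(time.time()) if now is None else int(now)
    payload = "%s|%d" % (clientip, now)
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "|" + sig

def check_token(token, clientip, max_age=300):
    # The main website verifies the signature, the client IP, and the age.
    try:
        ip, ts, sig = token.rsplit("|", 2)
        ts = int(ts)
    except ValueError:
        return False
    payload = "%s|%d" % (ip, ts)
    want = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return (hmac.compare_digest(sig, want)
            and ip == clientip
            and 0 <= time.time() - ts <= max_age)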

(I don't know of any such minimal OIDC server, although I wouldn't be surprised if one exists, probably as a demonstration or test server. And I suppose you can always put a banner on your OIDC IdP's login page that tells people what login and password to use, if you can only find a simple IdP that requires an actual login.)

Unix mail programs have had two approaches to handling your mail

By: cks

Historically, Unix mail programs (what we call 'mail clients' or 'mail user agents' today) have had two different approaches to handling your email, what I'll call the shared approach and the exclusive approach, with the shared approach being the dominant one. To explain the shared approach, I have to back up to talk about what Unix mail transfer agents (MTAs) traditionally did. When a Unix MTA delivered email to you, it originally did so by appending it to a single file in a specific location (such as '/usr/spool/mail/<login>') in a specific format, initially mbox; even then, this could be called your 'inbox'. Later, when the maildir mailbox format became popular, some MTAs gained the ability to deliver to maildir format inboxes.

(There have been a number of Unix mail spool formats over the years, which I'm not going to try to get into here.)

A 'shared' style mail program worked directly with your inbox in whatever format it was in and whatever location it was in. This is how the V7 'mail' program worked, for example. Naturally these programs didn't have to work on your inbox; you could generally point them at another mailbox in the same format. I call this style 'shared' because you could use any number of different mail programs (mail clients) on your mailboxes, providing that they all understood the format and also provided that all of them agreed on how to lock your mailbox against modifications, including against your system's MTA delivering new email right at the point where your mail program was, for example, trying to delete some.

(Locking issues are one of the things that maildir was designed to help with.)

An 'exclusive' style mail program (or system) was designed to own your email itself, rather than try to share your system mailbox. Of course it had to access your system mailbox a bit to get at your email, but broadly the only thing an exclusive mail program did with your inbox was pull all your new email out of it, write it into the program's own storage format and system, and then usually empty out your system inbox. I call this style 'exclusive' because you generally couldn't hop back and forth between mail programs (mail clients) and would be mostly stuck with your pick, since your main mail program was probably the only one that could really work with its particular storage format.

(Pragmatically, only locking your system mailbox for a short period of time and only doing simple things with it tended to make things relatively reliable. Shared style mail programs had much more room for mistakes and explosions, since they had to do more complex operations, at least on mbox format mailboxes. Being easy to modify is another advantage of the maildir format, since it outsources a lot of the work to your Unix filesystem.)

This shared versus exclusive design choice turned out to have some effects when mail moved to being on separate servers and accessed via POP and then later IMAP. My impression is that 'exclusive' systems coped fairly well with POP, because the natural operation with POP is to pull all of your new email out of the server and store it locally. By contrast, shared systems coped much better with IMAP than exclusive ones did, because IMAP is inherently a shared mail environment where your mail stays on the IMAP server and you manipulate it there.

(Since IMAP is the dominant way that mail clients/user agents get at email today, my impression is that the 'exclusive' approach is basically dead at this point as a general way of doing mail clients. Almost no one wants to use an IMAP client that immediately moves all of their email into a purely local data storage of some sort; they want their email to stay on the IMAP server and be accessible from and by multiple clients and even devices.)

Most classical Unix mail clients are 'shared' style programs, things like Alpine, Mutt, and the basic Mail program. One major 'exclusive' style program, really a system, is (N)MH (also). MH is somewhat notable because in its time it was popular enough that a number of other mail programs and mail systems supported its basic storage format to some degree (for example, procmail can deliver messages to MH-format directories, although it doesn't update all of the things that MH would do in the process).

Another major source of 'exclusive' style mail handling systems is GNU Emacs. I believe that both rmail and GNUS normally pull your email from your system inbox into their own storage formats, partly so that they can take exclusive ownership and don't have to worry about locking issues with other mail clients. GNU Emacs has a number of mail reading environments (cf, also) and I'm not sure what the others do (apart from MH-E, which is a frontend on (N)MH).

(There have probably been other 'exclusive' style systems. Also, it's a pity that as far as I know, MH never grew any support for keeping its messages in maildir format directories, which are relatively close to MH's native format.)

Maybe I should add new access control rules at the front of rule lists

By: cks

Not infrequently I wind up maintaining slowly growing lists of filtering rules to either allow good things or weed out bad things. Not infrequently, traffic can potentially match more than one filtering rule, either because it has multiple bad (or good) characteristics or because some of the match rules overlap. My usual habit has been to add new rules to the end of my rule lists (or the relevant section of them), so the oldest rules are at the top and the newest ones are at the bottom.

After writing about how access control rules need some form of usage counters, it's occurred to me that maybe I want to reverse this, at least in typical systems where the first matching rule wins. The basic idea is that the rules I'm most likely to want to drop are the oldest rules, but by having them first I'm hindering my ability to see if they've been made obsolete by newer rules. If an old rule matches some bad traffic, a new rule matches all of the bad traffic, and the new rule is last, any usage counters will show a mix of the old rule and the new rule, making it look like the old rule is still necessary. If the order was reversed, the new rule would completely occlude the old rule and usage counters would show me that I could weed the old rule out.

(My view is that it's much less likely that I'll add a new rule at the bottom that's completely ineffectual because everything it matches is already matched by something earlier. If I'm adding a new rule, it's almost certainly because something isn't being handled by the collection of existing rules.)

Another possible advantage to this is that it will keep new rules at the top of my attention, because when I look at the rule list (or the section of it) I'll probably start at the top. Currently, the top is full of old rules that I usually ignore, but if I put new rules first I'll naturally see them right away.

(I think that most things I deal with are 'first match wins' systems. A 'last match wins' system would naturally work right here, but it has other confusing aspects. I also have the impression that adding new rules at the end is a common thing, but maybe it's just in the cultural water here.)

Our Django model class fields should include private, internal names

By: cks

Let me tell you about a database design mistake I made in our Django web application for handling requests for Unix accounts. Our current account request app evolved from a series of earlier systems, and one of the things that these earlier systems asked people for was their 'status' with the university; were they visitors, graduate students, undergraduate students, (new) staff, and so on. When I created the current system I copied this, and so the database schema includes a 'Status' model class. The only thing I put in this model class was a text field that people picked from in our account request form, and I didn't really think of the text there as what you could call load bearing. It was just a piece of information we asked people for because we'd always asked people for it, and faithfully duplicating the old CGI was the easy way to implement the web app.

Before too long, it turned out that we wanted to do some special things if people were graduate students (for example, notifying the department's administrative people so they could update their records to include the graduate student's Unix login and email address here). The obvious simple way to implement this was to do a text match on the value of the 'status' field for a particular person; if their 'status' was "Graduate Student", we knew they were a graduate student and we could do various special things. Over time, knowledge of the exact people-visible "Graduate Student" status text wormed its way into a whole collection of places around our account systems.

For reasons beyond the scope of this entry, we now (recently) want to change the people-visible text to be not exactly "Graduate Student" any more. Now we have a problem, because a bunch of places know that exact text (in fact I'm not sure I remember where all of those places are).

The mistake I made, way back when we first wanted things to know that an account or account request was a 'graduate student', was in not giving our 'Status' model an internal 'label' field that wasn't shown to people in addition to the text shown to people. You can practically guarantee that anything you show to people will want to change sooner or later, so just like you shouldn't make actual people-exposed fields into primary or foreign keys, none of your code should care about their value. The correct solution is an additional field that acts as the internal label of a Status (with values that make sense to us), and then using this internal label any time the code wants to match on or find the 'Graduate Student' status.

(In theory I could use Django's magic 'id' field for this, since we're having Django create automatic primary keys for everything, including the Status model. In practice, the database IDs are completely opaque and I'd rather have something less opaque in code instead of everything knowing that ID '14' is the Graduate Student status ID.)
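
A minimal sketch of what I mean, with invented field names (the real model obviously has more to it):

from django.db import models

class Status(models.Model):
    # What people see and pick from in the account request form;
    # free to change whenever we want to reword it.
    display = models.CharField(max_length=100)
    # A stable internal label that code matches on and that is
    # never shown to people.
    label = models.SlugField(max_length=40, unique=True)

    def __str__(self):
        return self.display

# Code then looks statuses up by the internal label, not the visible text:
#    grad = Status.objects.get(label="grad-student")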

Fortunately, I've had a good experience with my one Django database migration so far, so this is a fixable problem. Threading the updates through all of the code (and finding all of the places that need updates, including in outside programs) will be a bit of work, but that's what I get for taking the quick hack approach when this first came up.

(I'm sure I'm not the only person to stub my toe this way, and there's probably a well known database design principle involved that would have told me better if I'd known about it and paid attention at the time.)

These days, systemd can be a cause of restrictions on daemons

By: cks

One of the traditional rites of passage for Linux system administrators is having a daemon not work in the normal system configuration (eg, when you boot the system) but work when you manually run it as root. The classical cause of this on Unix was that $PATH wasn't fully set in the environment the daemon was running in but was in your root shell. On Linux, another traditional cause of this sort of thing has been SELinux and a more modern source (on Ubuntu) has sometimes been AppArmor. All of these create hard to see differences between your root shell (where the daemon works when run by hand) and the normal system environment (where the daemon doesn't work). These days, we can add another cause, an increasingly common one, and that is systemd service unit restrictions, many of which are covered in systemd.exec.

(One pernicious aspect of systemd as a cause of these restrictions is that they can appear in new releases of the same distribution. If a daemon has been running happily in an older release and now has surprise issues in a new Ubuntu LTS, I don't always remember to look at its .service file.)

Some of systemd's protective directives simply cause failures to do things, like access user home directories if ProtectHome= is set to something appropriate. Hopefully your daemon complains loudly here, reporting mysterious 'permission denied' or 'file not found' errors. Some systemd settings can have additional, confusing effects, like PrivateTmp=. A standard thing I do when troubleshooting a chain of programs executing programs executing programs is to shim in diagnostics that dump information to /tmp, but with PrivateTmp= on, my debugging dump files are mysteriously not there in the system-wide /tmp.

(On the other hand, a daemon may not complain about missing files if it's expected that the files aren't always there. A mailer usually can't really tell the difference between 'no one has .forward files' and 'I'm mysteriously not able to see people's home directories to find .forward files in them'.)

Sometimes you don't get explicit errors, just mysterious failures to do some things. For example, you might set IP address access restrictions with the intention of blocking inbound connections but wind up also blocking DNS queries (and this will also depend on whether or not you use systemd-resolved). The good news is that you're mostly not going to find standard systemd .service files for normal daemons shipped by your Linux distribution with IP address restrictions. The bad news is that at some point .service files may start showing up that impose IP address restrictions with the assumption that DNS resolution is being done via systemd-resolved as opposed to direct DNS queries.

(I expect some Linux distributions to resist this, for example Debian, but others may declare that using systemd-resolved is now mandatory in order to simplify things and let them harden service configurations.)

Right now, you can usually test if this is the problem by creating a version of the daemon's .service file with any systemd restrictions stripped out of it and then seeing if using that version makes life happy. In the future it's possible that some daemons will assume and require some systemd restrictions (for instance, assuming that they have a /tmp all of their own), making things harder to test.
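
One way to do the stripping-out test without editing the shipped .service file is a drop-in override via 'systemctl edit somedaemon.service' (the unit name is a placeholder, and which directives you need to override depends on what the unit actually sets). As a sketch:

[Service]
# Turn off the restrictions being tested; list settings such as
# IPAddressDeny= are reset by assigning them an empty value.
ProtectHome=no
PrivateTmp=no
IPAddressDeny=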

Some stuff on how Linux consoles interact with the mouse

By: cks

On at least x86 PCs, Linux text consoles ('TTY' consoles or 'virtual consoles') support some surprising things. One of them is doing some useful stuff with your mouse, if you run an additional daemon such as gpm or the more modern consolation. This is supported on both framebuffer consoles and old 'VGA' text consoles. The experience is fairly straightforward; you install and activate one of the daemons, and afterward you can wave your mouse around, select and paste text, and so on. How it works and what you get is not as clear, and since I recently went diving into this area for reasons, I'm going to write down what I now know before I forget it (with a focus on how consolation works).

The quick summary is that the console TTY's mouse support is broadly like a terminal emulator. With a mouse daemon active, the TTY will do "copy and paste" selection stuff on its own. A mouse aware text mode program can put the console into a mode where mouse button presses are passed through to the program, just as happens in xterm or other terminal emulators.

The simplest TTY mode is when a non-mouse-aware program or shell is active, which is to say a program that wouldn't try to intercept mouse actions itself if it was run in a regular terminal window and would leave mouse stuff up to the terminal emulator. In this mode, your mouse daemon reads mouse input events and then uses sub-options of the TIOCLINUX ioctl to inject activities into the TTY, for example telling it to 'select' some text and then asking it to paste that selection to some file descriptor (normally the console itself, which delivers it to whatever foreground program is taking terminal input at the time).

(In theory you can use the mouse to scroll text back and forth, but in practice that was removed in 2020, both for the framebuffer console and for the VGA console. If I'm reading the code correctly, a VGA console might still have a little bit of scrollback support depending on how much spare VGA RAM you have for your VGA console size. But you're probably not using a VGA console any more.)

The other mode the console TTY can be in is one where some program has used standard xterm-derived escape sequences to ask for xterm-compatible "mouse tracking", which is the same thing it might ask for in a terminal emulator if it wanted to handle the mouse itself. What this does in the kernel TTY console driver is set a flag that your mouse daemon can query with TIOCL_GETMOUSEREPORTING; the kernel TTY driver still doesn't directly handle or look at mouse events. Instead, consolation (or gpm) reads the flag and, when the flag is set, uses the TIOCL_SELMOUSEREPORT sub-sub-option to TIOCLINUX's TIOCL_SETSEL sub-option to report the mouse position and button presses to the kernel (instead of handling mouse activity itself). The kernel then turns around and sends mouse reporting escape codes to the TTY, as the program asked for.

(As I discovered, we got a CVE this year related to this, where the kernel let too many people trigger sending programs 'mouse' events. See the stable kernel commit message for details.)

A mouse daemon like consolation doesn't have to pay attention to the kernel's TTY 'mouse reporting' flag. As far as I can tell from the current Linux kernel code, if the mouse daemon ignores the flag it can keep on doing all of its regular copy and paste selection and mouse button handling. However, sending mouse reports is only possible when a program has specifically asked for it; the kernel will report an error if you ask it to send a mouse report at the wrong time.

(As far as I can see there's no notification from the kernel to your mouse daemon that someone changed the 'mouse reporting' flag. Instead you have to poll it; it appears consolation does this every time through its event loop before it handles any mouse events.)

PS: Some documentation on console mouse reporting was written as a 2020 kernel documentation patch (alternate version) but it doesn't seem to have made it into the tree. According to various sources, eg, the mouse daemon side of things can only be used by actual mouse daemons, not by programs, although programs do sometimes use other bits of TIOCLINUX's mouse stuff.

PPS: It's useful to install a mouse daemon on your desktop or laptop even if you don't intend to ever use the text TTY. If you ever wind up in the text TTY for some reason, perhaps because your regular display environment has exploded, having mouse cut and paste is a lot nicer than not having it.

Free and open source software is incompatible with (security) guarantees

By: cks

If you've been following the tech news, one of the recent things that's happened is that there has been another incident where a bunch of popular and widely used packages on a popular package repository for a popular language were compromised, this time with a self-replicating worm. This is very inconvenient to some people, especially to companies in Europe, for some reason, and so some people have been making the usual noises. On the Fediverse, I had a hot take:

Hot take: free and open source is fundamentally incompatible with strong security *guarantees*, because FOSS is incompatible with strong guarantees about anything. It says so right there on the tin: "without warranty of any kind, either expressed or implied". We guarantee nothing by default, you get the code, the project, everything, as-is, where-is, how-is.

Of course companies find this inconvenient, especially with the EU CRA looming, but that's not FOSS's problem. That's a you problem.

To be clear here: this is not about the security and general quality of FOSS (which is often very good), or the responsiveness of FOSS maintainers. This is about guarantees, firm (and perhaps legally binding) assurances of certain things (which people want for software in general). FOSS can provide strong security in practice but it's inimical to FOSS's very nature to provide a strong guarantee of that or anything else. The thing that makes most of FOSS possible is that you can put out software without that guarantee and without legal liability.

An individual project can solemnly say it guarantees its security, and if it does so it's an open legal question whether that writing trumps the writing in the license. But in general a core and absolutely necessary aspect of free and open source is that warranty disclaimer, and that warranty disclaimer cuts across any strong guarantees about anything, including security and lack of bugs.

Are the compromised packages inconvenient to a lot of companies? They certainly are. But neither the companies nor commentators can say that the compromise violated some general strong security guarantee about packages, because there is and never will be such a guarantee with FOSS (see, for example, Thomas Depierre's I am not a supplier, which puts into words a sentiment a lot of FOSS people have).

(But of course the companies and sympathetic commentators are framing it that way because they are interested in the second vision of "supply chain security", where using FOSS code is supposed to magically absolve companies of the responsibility that people want someone to take.)

The obvious corollary of this is that widespread usage of FOSS packages and software, especially with un-audited upgrades of package versions (however that happens), is incompatible with having any sort of strong security or quality guarantee about the result. The result may have strong security and high quality, but if so, those come without guarantees; you've just been lucky. If you want guarantees, you will have to arrange them yourself and it's very unlikely you can achieve strong guarantees while using the typical ever-changing pile of FOSS code.

(For example, if dependencies auto-update before you can audit them and their changes, or faster than you can keep up, you have nothing in practice.)

My Fedora machines need a cleanup of their /usr/sbin for Fedora 42

By: cks

One of the things that Fedora is trying to do in Fedora 42 is unifying /usr/bin and /usr/sbin. In an ideal (Fedora) world, your Fedora machines will have /usr/sbin be a symbolic link to /usr/bin after they're upgraded to Fedora 42. However, if your Fedora machines have been around for a while, or perhaps have some third party packages installed, what you'll actually wind up with is a /usr/sbin that is mostly symbolic links to /usr/bin but still has some actual programs left.

One source of these remaining /usr/sbin programs is old packages from past versions of Fedora that are no longer packaged in Fedora 41 and Fedora 42. Old packages are usually harmless, so it's easy for them to linger around if you're not disciplined; my home and office desktops (which have been around for a while) still have packages from as far back as Fedora 28.

(An added complication of tracking down file ownership is that some RPMs haven't been updated for the /sbin to /usr/sbin merge and so still believe that their files are /sbin/<whatever> instead of /usr/sbin/<whatever>. A 'rpm -qf /usr/sbin/<whatever>' won't find these.)

Obviously, you shouldn't remove old packages without being sure of whether or not they're important to you. I'm also not completely sure that all packages in the Fedora 41 (or 42) repositories are marked as '.fc41' or '.fc42' in their RPM versions, or if there are some RPMs that have been carried over from previous Fedora versions. Possibly this means I should wait until a few more Fedora versions have come to pass so that other people find and fix the exceptions.

(On what is probably my cleanest Fedora 42 test virtual machine, there are a number of packages that 'dnf list --extras' doesn't list that have '.fc41' in their RPM version. Some of them may have been retained un-rebuilt for binary compatibility reasons. There's also the 'shim' UEFI bootloaders, which date from 2024 and don't have Fedora releases in their RPM versions, but those I expect to basically never change once created. But some others are a bit mysterious, such as 'libblkio', and I suspect that they may have simply been missed by the Fedora 42 mass rebuild.)

PS: In theory anyone with access to the full Fedora 42 RPM repository could sweep the entire thing to find packages that still install /usr/sbin files or even /sbin files, which would turn up any relevant not yet rebuilt packages. I don't know if there's any easy way to do this through dnf commands, although I think dnf does have access to a full file list for all packages (which is used for certain dnf queries).

Access control rules need some form of usage counters

By: cks

Today, for reasons outside the scope of this entry, I decided to spend some time maintaining and pruning the access control rules for Wandering Thoughts, this blog. Due to the ongoing crawler plague (and past abuses), Wandering Thoughts has had to build up quite a collection of access control rules, which are mostly implemented as a bunch of things in an Apache .htaccess file (partly 'Deny from ...' for IP address ranges and partly as rewrite rules based on other characteristics). The experience has left me with a renewed view of something, which is that systems with access control rules need some way of letting you see which rules are still being used by your traffic.

It's in the nature of systems with access control rules to accumulate more and more rules over time. You hit another special situation, you add another rule, perhaps to match and block something or perhaps to exempt something from blocking. These rules often interact in various ways, and over time you'll almost certainly wind up with a tangled thicket of rules (because almost no one goes back to carefully check and revisit all existing rules when they add a new one or modify an existing one). The end result is a mess, and one of the ways to reduce the mess is to weed out rules that are now obsolete. One way a rule can be obsolete is that it's not used any more, and often these are the easiest rules to drop once you can recognize them.

(A rule that's still being matched by traffic may be obsolete for other reasons, and rules that aren't currently being matched may still be needed as a precaution. But it's a good starting point.)

If you have the necessary log data, you can sometimes establish if a rule was actually ever used by manually checking your logs. For example, if you have logs of rejected traffic (or logs of all traffic), you can search it for an IP address range to see if a particular IP address rule ever matched anything. But this requires tedious manual effort and that means that only determined people will go through it, especially regularly. The better way is to either have this information provided directly, such as by counters on firewall rules, or to have something in your logs that makes deriving it easy.

(An Apache example would be to augment any log line that was matched by some .htaccess rule with a name or a line number or the like. Then you could go readily through your logs to determine which lines were matched and how often.)

The next time I design an access control rule system, I'm hopefully going to remember this and put something in its logging to (optionally) explain its decisions.
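
As a minimal Python sketch of the sort of thing I mean (the rule format and names here are invented for illustration), a first-match-wins rule list can carry per-rule hit counters and log which rule made each decision:

import ipaddress, logging

class Rule:
    def __init__(self, name, network, action):
        self.name = name
        self.network = ipaddress.ip_network(network)
        self.action = action      # "allow" or "deny"
        self.hits = 0             # usage counter, for later weeding

class RuleList:
    def __init__(self, rules, default="allow"):
        self.rules = rules
        self.default = default

    def decide(self, ip):
        addr = ipaddress.ip_address(ip)
        for rule in self.rules:
            if addr in rule.network:
                rule.hits += 1
                # Explain the decision in the logs, so 'which rules
                # still match traffic' can be answered later.
                logging.info("%s: %s by rule %s (%d hits)",
                             ip, rule.action, rule.name, rule.hits)
                return rule.action
        return self.default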

(Periodically I write something that has an access control rule system of some sort. Unfortunately all of mine to date have been quiet on this, so I'm not at all without sin here.)

The idea of /usr/sbin has failed in practice

By: cks

One of the changes in Fedora Linux 42 is unifying /usr/bin and /usr/sbin, by moving everything in /usr/sbin to /usr/bin. To some people, this probably smacks of anathema, and to be honest, my first reaction was to bristle at the idea. However, the more I thought about it, the more I had to concede that the idea of /usr/sbin has failed in practice.

We can tell /usr/sbin has failed in practice by asking how many people routinely operate without /usr/sbin in their $PATH. In a lot of environments, the answer is that very few people do, because sooner or later you run into a program that you want to run (as yourself) to obtain useful information or do useful things. Let's take FreeBSD 14.3 as an illustrative example (to make this not a Linux biased entry); looking at /usr/sbin, I recognize iostat, manctl (you might use it on your own manpages), ntpdate (which can be run by ordinary people to query the offsets of remote servers), pstat, swapinfo, and traceroute. There are probably others that I'm missing, especially if you use FreeBSD as a workstation and so care about things like sound volumes and keyboard control.

(And if you write scripts and want them to send email, you'll care about sendmail and/or FreeBSD's 'mailwrapper', both in /usr/sbin. There's also DTrace, but I don't know if you can DTrace your own binaries as a non-root user on FreeBSD.)

For a long time, there has been no strong organizing principle to /usr/sbin that would draw a hard line and create a situation where people could safely leave it out of their $PATH. We could have had a principle of, for example, "programs that don't work unless run by root", but no such principle was ever followed for very long (if at all). Instead programs were more or less shoved in /usr/sbin if developers thought they were relatively unlikely to be used by normal people. But 'relatively unlikely' is not 'never', and shortly after people got told to 'run traceroute' and got 'command not found' when they tried, /usr/sbin (probably) started appearing in $PATH.

(And then when you asked 'how does my script send me email about something', people told you about /usr/sbin/sendmail and another crack appeared in the wall.)

If /usr/sbin is more of a suggestion than a rule and it appears in everyone's $PATH because no one can predict which programs you want to use will be in /usr/sbin instead of /usr/bin, I believe this means /usr/sbin has failed in practice. What remains is an unpredictable and somewhat arbitrary division between two directories, where which directory something appears in operates mostly as a hint (a hint that's invisible to people who don't specifically look where a program is).

(This division isn't entirely pointless and one could try to reform the situation in a way short of Fedora 42's "burn the entire thing down" approach. If nothing else the split keeps the size of both directories somewhat down.)

PS: The /usr/sbin like idea that I think is still successful in practice is /usr/libexec. Possibly a bunch of things in /usr/sbin should be relocated to there (or appropriate subdirectories of it).

My machines versus the Fedora selinux-policy-targeted package

By: cks

I upgrade Fedora on my office and home workstations through an online upgrade with dnf, and as part of this I read (or at least scan) DNF's output to look for problems. Usually this goes okay, but DNF5 has a general problem with script output and when I did a test upgrade from Fedora 41 to Fedora 42 on a virtual machine, it generated a huge amount of repeated output from a script run by selinux-policy-targeted, repeatedly reporting "Old compiled fcontext format, skipping" for various .bin files in /etc/selinux/targeted/contexts/files. The volume of output made the rest of DNF's output essentially unreadable. I would like to avoid this when I actually upgrade my office and home workstations to Fedora 42 (which I still haven't done, partly because of this issue).

(You can't make this output easier to read because DNF5 is too smart for you. This particular error message reportedly comes from 'semodule -B', per this Fedora discussion.)

The 'targeted' policy is one of several SELinux policies that are supported or at least packaged by Fedora (although I suspect I might see similar issues with the other policies too). My main machines don't use SELinux and I have it completely disabled, so in theory I should be able to remove the selinux-policy-targeted package to stop it from repeatedly complaining during the Fedora 42 upgrade process. In practice, selinux-policy-targeted is a 'protected' package that DNF will normally refuse to remove. Such packages are listed in /etc/dnf/protected.d/ in various .conf files; selinux-policy-targeted installs (well, includes) a .conf file to protect itself from removal once installed.

(Interestingly, sudo protects itself but there's nothing specifically protecting su and the rest of util-linux. I suspect util-linux is so pervasively a dependency that other protected things hold it down, or alternately no one has ever worried about people removing it and shooting themselves in the foot.)

I can obviously remove this .conf file and then DNF will let me remove selinux-policy-targeted, which will force the removal of some other SELinux policy packages (both selinux-policy packages themselves and some '*-selinux' sub-packages of other packages). I tried this on another Fedora 41 test virtual machine and nothing obvious broke, but that doesn't mean that nothing broke at all. It seems very likely that almost no one tests Fedora without the selinux-policy collective installed and I suspect it's not a supported configuration.

I could reduce my risks by removing the packages only just before I do the upgrade to Fedora 42 and putting them back afterward (well, unless I run into a dnf issue as a result, although that issue is from 2024). Also, now that I've investigated this, I could in theory delete the .bin files in /etc/selinux/targeted/contexts/files before the upgrade, hopefully leaving selinux-policy-targeted with less to complain about, or nothing at all. Since I'm not using SELinux, hopefully the lack of these files won't cause any problems, but of course this is less certain a fix than removing selinux-policy-targeted (for example, perhaps the .bin files would get automatically rebuilt early on in the upgrade process as packages are shuffled around, and bring the problem back with them).

Really, though, I wish DNF5 didn't have its problem with script output. All of this is hackery to deal with that underlying issue.

Some notes on (Tony Finch's) exponential rate limiting in practice

By: cks

After yesterday's entry where I discovered it, I went and implemented Tony Finch's exponential rate limiting for HTTP request rate limiting in DWiki, the engine underlying this blog, replacing the more brute force and limited version I had initially implemented. I chose exponential rate limiting over GCRA or leaky buckets because I found it much easier to understand how to set the limits (partly because I'm somewhat familiar with the whole thing from Exim). Exponential rate limiting needed me to pick a period of time and a number of (theoretical) requests that can be made in that time interval, which was easy enough; GCRA 'rate' and 'burst' numbers were less clear to me. However, exponential rate limiting has some slightly surprising things that I want to remember.

(Exponential ratelimits don't have a 'burst' rate as such but you can sort of achieve this by your choice of time intervals.)

In my original simple rate limiting, any rate limit record that had a time outside of my interval was irrelevant and could be dropped in order to reduce space usage (my current approach uses basically the same hack as my syndication feed ratelimits, so I definitely don't want to let its space use grow without bound). This is no longer necessarily true in exponential rate limiting, depending on how big of a rate the record (the source) had built up before it took a break. This old rate 'decays' at a rate I will helpfully put in a table for my own use:

Time since last seen    Old rate multiplied by
1x interval             0.37
2x interval             0.13
3x interval             0.05
4x interval             0.02

(This is, eg, 'exp(-1)' for when we last saw the source one 'interval' of time ago.)

Where this becomes especially relevant is if you opt for 'strict' rate limiting instead of 'leaky', where every time the source makes a request you increase its recorded rate even if you reject the request for being rate limited. A high-speed source that insists on hammering you for a while can build up a very large current rate under a strict rate limit policy, and that means its old past behavior can affect it (ie, possibly cause it to be rate limited) well beyond your nominal rate limit interval. Especially with 'strict' rate limiting, you could opt to cap the maximum age a valid record could have and drop everything that you last saw over, say, 3x your interval ago; this would be generous to very high rate old sources, but not too generous (since their old rate would be reduced to 0.05 or less of what it was even if you counted it).

As far as I can see, the behavior with leaky rate limiting and a cost of 1 (for the simple case of all HTTP requests having the same cost) is that if the client keeps pounding away at you, one of its requests will get through on a semi-regular basis. The client will make a successful request, the request will push its rate just over your limit, it will get rate limited some number of times, then enough time will have passed since its last successful request that its new request will be just under the rate limit and succeed. In some environments, this is fine and desired. However, my current goal is to firmly cut off clients that are making requests too fast, so I don't want this; instead, I implemented the 'strict' behavior so you don't get through at all until your request rate and the interval since your last request drops low enough.

Mathematically, a client that makes requests with little or no gap between them (to the precision of your timestamps) can wind up increasing its rate by slightly over its 'cost' per request. If I'm understanding the math correctly, how much over the cost is capped by Tony Finch's 'max(interval, 1.0e-10)' step, with 1.0e-10 being a small but non-zero number that you can move up or down depending on, eg, your language and its floating point precision. Having looked at it, in Python the resulting factor with 1.0e-10 is '1.000000082740371', so you and I probably don't need to worry about this. If the client doesn't make requests quite that fast, its rate will go up each time by slightly less than the 'cost' you've assigned. In Python, a client that makes a request every millisecond has a factor for this of '0.9995001666249781' of the cost; slower request rates make this factor smaller.

This is probably mostly relevant if you're dumping or reporting the calculated rates (for example, when a client hits the rate limit) and get puzzled by the odd numbers that may be getting reported.
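
Putting the pieces above together, here is a minimal Python sketch of the update step as I understand it, with the rate measured in requests per 'period' and the 1e-10 floor applied to the normalized elapsed time; the class layout and names are mine, and Tony Finch's article is the authoritative version:

import math

class ExpRateLimit:
    def __init__(self, period, limit, strict=True):
        self.period = period    # smoothing interval, in seconds
        self.limit = limit      # maximum smoothed rate, per period
        self.strict = strict    # charge rejected requests too?
        self.rate = 0.0
        self.last = None

    def request(self, now, cost=1.0):
        if self.last is None:
            # First sight of this source: its rate is just this cost.
            newrate = cost
        else:
            # Normalized elapsed time; the floor keeps a zero gap from
            # blowing up the division below (the max(..., 1e-10) step).
            x = max((now - self.last) / self.period, 1e-10)
            decay = math.exp(-x)
            # Old behavior decays away; this request adds slightly less
            # than 'cost' (or a hair more at the floor, thanks to
            # floating point).
            newrate = self.rate * decay + cost * (1 - decay) / x
        allowed = newrate <= self.limit
        # 'strict' charges the source even for rejected requests, so a
        # client that keeps hammering away stays rate limited; 'leaky'
        # only charges requests that get through.
        if allowed or self.strict:
            self.rate = newrate
            self.last = now
        return allowed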

I don't know how to implement proper ratelimiting (well, maybe I do now)

By: cks

In theory I have a formal education as a programmer (although it was a long time ago). In practice my knowledge from it isn't comprehensive, and every so often I run into an area where I know there's relevant knowledge and algorithms but I don't know what they are and I'm not sure how to find them. Today's area is scalable rate-limiting with low storage requirements.

Suppose, not hypothetically, that you want to ratelimit a collection of unpredictable sources and not use all that much storage per source. One extremely simple and obvious approach is to store, for each source, a start time and a count. Every time the source makes a request, you check to see if the start time is within your rate limit interval; if it is, you increase the count (or ratelimit the source), and if it isn't, you reset the start time to now and the count to 1.

(Every so often you can clean out entries with start times before your interval.)
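
In Python, the brute force scheme is roughly this (a sketch; the dictionary and the names are mine):

import time

counts = {}     # source -> (window start time, request count)

def too_many(source, limit, interval, now=None):
    now = time.time() if now is None else now
    start, count = counts.get(source, (now, 0))
    if now - start >= interval:
        # Outside the window: forget the past entirely and start over.
        start, count = now, 0
    count += 1
    counts[source] = (start, count)
    return count > limit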

The disadvantage of this simple approach is that it completely forgets about the past history of each source periodically. If your rate limit intervals are 20 minutes, a prolific source gets to start over from scratch every 20 minutes and run up its count until it gets rate limited again. Typically you want rate limiting not to forget about sources so fast.

I know there are algorithms that maintain decaying averages or moving (rolling) averages. The Unix load average is maintained this way, as is Exim ratelimiting. The Unix load average has the advantage that it's updated on a regular basis, which makes the calculation relatively simple. Exim has to deal with erratic updates that are unpredictable intervals from the previous update, and the comment in the source is a bit opaque to me. I could probably duplicate the formula in my code but I'd have to do a bunch of work to convince myself the result was correct.

(And now I've found Tony Finch's exponential rate limiting (via), which I'm going to have to read carefully, along with the previous GCRA: leaky buckets without the buckets.)

Given that rate limiting is such a common thing these days, I suspect that there are a number of algorithms for this with various different choices about how the limits work. Ideally, it would be possible to readily find writeups of them with internet searches, but of course as you know internet search is fairly broken these days.

(For example you can find a lot of people giving high level overviews of rate limiting without discussing how to actually implement it.)

Now that I've found Tony Finch's work I'm probably going to rework my hacky rate limiting code to do things better, because my brute force approach is using the same space as leaky buckets (as covered in Tony Finch's article) with inferior results. This shows the usefulness of knowing algorithms instead of just coding away.

(Improving the algorithm in my code will probably make no practical difference, but sometimes programming is its own pleasure.)

ZFS snapshots aren't as immutable as I thought, due to snapshot metadata

By: cks

If you know about ZFS snapshots, you know that one of their famous properties is that they're immutable; once a snapshot is made, its state is frozen. Or so you might casually describe it, but that description is misleading. What is frozen in a ZFS snapshot is the state of the filesystem (or zvol) that it captures, and only that. In particular, the metadata associated with the snapshot can and will change over time.

(When I say it this way it sounds obvious, but for a long time my intuition about how ZFS operated was misled by me thinking that all aspects of a snapshot had to be immutable once made and trying to figure out how ZFS worked around that.)

One visible place where ZFS updates the metadata of a snapshot is to maintain information about how much unique space the snapshot is using. Another is that when a ZFS snapshot is deleted, other ZFS snapshots may require updates to adjust the list of snapshots (every snapshot points to the previous one) and the ZFS deadlist of blocks that are waiting to be freed.

Mechanically, I believe that various things in a dsl_dataset_phys_t are mutable, with the exception of things like the creation time and the creation txg, and also the block pointer, which points to the actual filesystem data of the snapshot. Things like the previous snapshot information have to be mutable (you might delete the previous snapshot), and things like the deadlist and the unique bytes are mutated as part of operations like snapshot deletion. The other things I'm not sure of.

(See also my old entry on a broad overview of how ZFS is structured on disk. A snapshot is a 'DSL dataset' and it points to the object set for that snapshot. The root directory of a filesystem DSL dataset, snapshot or otherwise, is at a fixed number in the object set; it's always object 1. A snapshot freezes the object set as of that point in time.)

PS: Another mutable thing about snapshots is their name, since 'zfs rename' can change that. The manual page even gives an example of using (recursive) snapshot renaming to keep a rolling series of daily snapshots.

How I think OpenZFS's 'written' and 'written@<snap>' dataset properties work

By: cks

Yesterday I wrote some notes about ZFS's 'written' dataset property, where the short summary is that 'written' reports the amount of space written in a snapshot (ie, that wasn't in the previous snapshot), and 'written@<snapshot>' reports the amount of space written since the specified snapshot (up to either another snapshot or the current state of the dataset). In that entry, I left un-researched the question of how ZFS actually gives us those numbers; for example, if there was a mechanism in place similar to the complicated one for 'used' space. I've now looked into this and as far as I can see the answer is that ZFS determines information on the fly.

The guts of the determination are in dsl_dataset_space_written_impl(), which has a big comment that I'm going to quote wholesale:

Return [...] the amount of space referenced by "new" that was not referenced at the time the bookmark corresponds to. "New" may be a snapshot or a head. The bookmark must be before new, [...]

The written space is calculated by considering two components: First, we ignore any freed space, and calculate the written as new's used space minus old's used space. Next, we add in the amount of space that was freed between the two time points, thus reducing new's used space relative to old's. Specifically, this is the space that was born before zbm_creation_txg, and freed before new (ie. on new's deadlist or a previous deadlist).

(A 'bookmark' here is an internal ZFS thing.)

When this talks about 'used' space, this is not the "used" snapshot property; this is the amount of space the snapshot or dataset refers to, including space shared with other snapshots. If I'm understanding the code and the comment right, the reason we add back in freed space is because otherwise you could wind up with a negative number. Suppose you wrote a 2 GB file, made one snapshot, deleted the file, and then made a second snapshot. The difference in space referenced between the two snapshots is slightly less than negative 2 GB, but we can't report that as 'written', so we go through the old stuff that got deleted and add its size back in to make the number positive again.

To determine the amount of space that's been freed between the bookmark and "new", the ZFS code walks backward through all snapshots from "new" to the bookmark, calling another ZFS function to determine how much relevant space got deleted. This uses the ZFS deadlists that ZFS is already keeping track of to know when it can free an object.

This code is used both for 'written@<snap>' and 'written'; the only difference between them is that when you ask for 'written', the ZFS kernel code automatically finds the previous snapshot for you.

Some notes on OpenZFS's 'written' dataset property

By: cks

ZFS snapshots and filesystems have a 'written' property, and a related 'written@snapshot' one. These are documented as:

written
The amount of space referenced by this dataset, that was written since the previous snapshot (i.e. that is not referenced by the previous snapshot).

written@snapshot
The amount of referenced space written to this dataset since the specified snapshot. This is the space that is referenced by this dataset but was not referenced by the specified snapshot. [...]

(Apparently I never noticed the 'written' property before recently, despite it being there from very long ago.)

The 'written' property is related to the 'used' property, and it's both more confusing and less confusing as it relates to snapshots. Famously (but not famously enough), for snapshots the used property ('USED' in the output of 'zfs list') only counts space that is exclusive to that snapshot. Space that's only used by snapshots but that is shared by more than one snapshot is in 'usedbysnapshots'.

To understand 'written' better, let's do an experiment: we'll make a snapshot, write a 2 GByte file, make a second snapshot, write another 2 GByte file, make a third snapshot, and then delete the first 2 GB file. Since I've done this, I can tell you the results.

If there are no other snapshots of the filesystem, the first snapshot's 'written' value is the full size of the filesystem at the time it was made, because everything was written before it was made. The second snapshot's 'written' is 2 GBytes, the data file we wrote between the first and the second snapshot. The third snapshot's 'written' is another 2 GB, for the second file we wrote. However, at the end, after we delete one of the data files, the filesystem's 'written' is small (certainly not 2 GB), and so would be the 'written' of a fourth snapshot if we made one.

The reason the filesystem's 'written' is so small is that ZFS is counting concrete on-disk (new) space. Deleting a 2 GB file frees up a bunch of space but it doesn't require writing very much to the filesystem, so the 'written' value is low.

If we look at the 'used' values for all three snapshots, they're all going to be really low. This is because neither 2 GByte data file is exclusive to a single snapshot: the first file is shared between the second and the third snapshots (so it counts in 'usedbysnapshots' but not in either snapshot's 'used'), and the second file is still referenced by the live filesystem as well as by the third snapshot.

(ZFS has a somewhat complicated mechanism to maintain all of this information.)

There is one interesting 'written' usage that appears to show you deleted space, but it is a bit tricky. The manual page implies that the normal usage of 'written@<snapshot>' is to ask for it for the filesystem itself; however, in experimentation you can ask for it for a snapshot too. So take the three snapshots above, and the filesystem after deleting the first data file. If you ask for 'written@first' for the filesystem, you will get 2 GB, but if you ask for 'written@first' for the third snapshot, you will get 4 GB. What the filesystem appears to be reporting is how much still-live data has been written between the first snapshot and now, which is only 2 GB because we deleted the other 2 GB. Meanwhile, all four GB are still alive in the third snapshot.

My conclusion from looking into this is that I can use 'written' as an indication of how much new data a snapshot has captured, but I can't use it as an indication of how much changed in a snapshot. As I've seen, deleting data is a potentially big change but a small 'written' value. If I'm understanding 'written' correctly, one useful thing about it is that it shows roughly how much data an incremental 'zfs send' of just that snapshot would send. Under some circumstances it will also give you an idea of how much data your backup system may need to back up; however, this works best if people are creating new files (and deleting old ones), instead of updating or appending to existing files (where ZFS only updates some blocks but a backup system probably needs to re-save the whole thing).

Why Firefox's media autoplay settings are complicated and imperfect

By: cks

In theory, a website that wanted to play video or audio could throw in a '<video controls ...>' or '<audio controls ...>' element in the HTML of the page and be done with it. This would make handling media playback simple and blocking autoplay reliable: the browser could ignore the autoplay attribute, and the person using the browser would directly trigger playback by interacting with controls that the browser itself drew, so the browser could know for sure that a person had directly clicked on them and that the media should be played.

As anyone who's seen websites with audio and video on the web knows, in practice almost no one does it this way, with browser controls on the <video> or <audio> element. Instead, everyone displays controls of their own somehow (eg as HTML elements styled through CSS), attaches JavaScript actions to them, and then uses the HTMLMediaElement browser API to trigger playback and various other things. As a result of this use of JavaScript, browsers in general and Firefox in particular no longer have a clear, unambiguous view of your intentions to play media. At best, all they can know is that you interacted with the web page, this interaction triggered some JavaScript, and the JavaScript requested that media play.

(Browsers can know somewhat of how you interacted with a web page, such as whether you clicked or scrolled or typed a key.)

On good, well behaved websites, this interaction is with visually clear controls (such as a visual 'play' button) and the JavaScript that requests media playing is directly attached to those controls. And even on these websites, JavaScript may later legitimately act asynchronously to request more playing of things, or you may interact with media playback in other ways (such as spacebar to pause and then restart media playing). On not so good websites, well, any piece of JavaScript that manages to run can call HTMLMediaElement.play() to try to start playing the media. There are lots of ways to have JavaScript run automatically and so a web page can start trying to play media the moment its JavaScript starts running, and it can keep trying to trigger playback over and over again if it wants to through timers or suchlike.

If Firefox only blocked the actual autoplay attribute and allowed JavaScript to trigger media playback any time it wanted to, that would be a pretty obviously bad 'Block Autoplay' experience, so Firefox must try harder. Its approach is to (also) block use of HTMLMediaElement.play() until you have done some 'user gesture' on the page. As far as I can tell from Firefox's description of this, the list of 'user gestures' is fairly expansive and covers much of how you interact with a page. Certainly, if a website can cause you to click on something, regardless of what it looks like, this counts as a 'user gesture' in Firefox.

(I'm sure that Firefox's selection of things that count as 'user gestures' are drawn from real people on real hardware doing things to deliberately trigger playback, including resuming playback after it's been paused by, for example, tapping spacebar.)

In Firefox, this makes it quite hard to actually stop a bad website from playing media while preserving your ability to interact with the site. Did you scroll the page with the spacebar? I think that counts as a user gesture. Did you use your mouse scroll wheel? Probably a user gesture. Did you click on anything at all, including to dismiss some banner? Definitely a user gesture. As far as I can tell, the only reliable way you can prevent a web page from starting media playback is to immediately close the page. Basically anything you do to use it is dangerous.

Firefox does have a very strict global 'no autoplay' policy that you can turn on through about:config, which they call click-to-play, where Firefox tries to limit HTMLMediaElement.play() to being called as the direct result of a JavaScript event handler. However, their wiki notes that this can break some (legitimate) websites entirely (well, for media playback), and it's a global setting that gets in the way of some things I want; you can't set it only for some sites. And even with click-to-play, if a website can get you to click on something of its choice, it's game over as far as I know; if you have to click or tap a key to dismiss an on-page popup banner, the page can trigger media playing from that event handler.

All of this is why I'd like a per-website "permanent mute" option for Firefox. As far as I know, there's literally no other way in standard Firefox to reliably prevent a potentially bad website (or advertising network that it uses) from playing media on you.

(I suspect that you can defeat a lot of such websites with click-to-play, though.)

PS: Muting a tab in Firefox is different from stopping media playback (or blocking it from starting). All it does is stop Firefox from outputting audio from that tab (to wherever you're having Firefox send audio). Any media will 'play' or continue to play, including videos displaying moving things and being distracting.

We can't expect people to pick 'good' software

By: cks

One of the things I've come to believe in (although I'm not consistent about it) is that we can't expect people to pick software that is 'good' in a technical sense. People certainly can and do pick software that is good in that it works nicely, has a user interface that works for them, and so on, which is to say all of the parts of 'good' that they can see and assess, but we can't expect people to go beyond that, to dig deeply into the technical aspects to see how good their choice of software is. For example, how efficiently an IMAP client implements various operations at the protocol level is more or less invisible to most people. Even if you know enough to know about potential technical quality aspects, realistically you have to rely on any documentation the software provides (if it provides anything). Very few people are going to set up an IMAP server test environment and point IMAP clients at it to see how they behave, or try to read the source code of open source clients.

(Plus, you have to know a lot to set up a realistic test environment. A lot of modern software varies its behavior in subtle ways depending on the surrounding environment, such as the server (or client) at the other end, what your system is like, and so on. To extend my example, the same IMAP client may behave differently when talking to two different IMAP server implementations.)

Broadly, the best we can do is get software to describe important technical aspects of itself, to document them even if the software doesn't, and to explain to people why various aspects matter and thus what they should look for if they want to pick good software. I think this approach has seen some success in, for example, messaging apps, where 'end to end encrypted' and similar things have become technical quality measures that are typically relatively legible to people. Other technical quality measures in other software are much less legible to people in general, including in important software like web browsers.

(One useful way to make technical aspects legible is to create some sort of scorecard for them. Although I don't think it was built for this purpose, there's caniuse for browsers and their technical quality for various CSS and HTML5 features.)

To me, one corollary to this is that there's generally no point in yelling at people (in various ways) or otherwise punishing them because they picked software that isn't (technically) good. It's pretty hard for a non-specialist to know what is actually good or who to trust to tell them what's actually good, so it's not really someone's fault if they wind up with not-good software that does undesirable things. This doesn't mean that we should always accept the undesirable things, but it's probably best to either deal with them or reject them as gracefully as possible.

(This definitely doesn't mean that we should blindly follow Postel's Law, because a lot of harm has been done to various ecosystems by doing so. Sometimes you have to draw a line, even if it affects people who simply had bad luck in what software they picked. But ideally there's a difference between drawing a line and yelling at people about them running into the line.)

Our too many paths to 'quiet' Prometheus alerts

By: cks

One of the things our Prometheus environment has is a notion of different sorts of alerts, and in particular of less important alerts that should go to a subset of people (ie, me). There are various reasons for this, including that the alert is in testing, or it concerns a subsystem that only I should have to care about, or that it fires too often for other people (for example, a reboot notification for a machine we routinely reboot).

For historical reasons, there are at least four different ways that this can be done in our Prometheus environment:

  • a special label can be attached to the Prometheus alert rule, which is appropriate if the alert rule itself is in testing or otherwise is low priority.

  • a special label can be attached to targets in a scrape configuration, although this has some side effects that can be less than ideal. This affects all alerts that trigger based on metrics from, for example, the Prometheus host agent (for that host).

  • our Prometheus configuration itself can apply alert relabeling to add the special label for everything from a specific host, as indicated by a "host" label that we add. This is useful if we have a lot of exporters being scraped on a particular host, or if I want to keep metric continuity (ie, the metrics not changing their label set) when a host moves into production.

  • our Alertmanager configuration can specifically route certain alerts about certain machines to the 'less important alerts' destination.

The drawback of these assorted approaches is that now there are at least three places to check and possibly to update when a host moves from being a testing host into being a production host. A further drawback is that some of these (the first two) are used a lot more often than others of these (the last two). When you have multiple things, some of which are infrequently used, and fallible humans have to remember to check them all, you can guess what can happen next.

And that is the simple version of why alerts about one of our fileservers wouldn't have gone to everyone here for about the past year.

How I discovered the problem was that I got an alert about one of the fileserver's Prometheus exporters restarting, and decided that I should update the alert configuration to make it so that alerts about this service restarting only went to me. As I was in the process of doing this, I realized that the alert already had only gone to me, despite there being no explicit configuration in the alert rule or the scrape configuration. This set me on an expedition into the depths of everything else, where I turned up an obsolete bit in our general Prometheus configuration.

On the positive side, now I've audited our Prometheus and Alertmanager configurations for any other things that shouldn't be there. On the negative side, I'm now not completely sure that there isn't a fifth place that's downgrading (some) alerts about (some) hosts.

Could NVMe disks become required for adequate performance?

By: cks

It's not news that full speed NVMe disks are extremely fast, as well as extremely good at random IO and doing a lot of IO at once. In fact they have performance characteristics that upset general assumptions about how you might want to design systems, at least for reading data from disk (for example, you want to generate a lot of simultaneous outstanding requests, either explicitly in your program or implicitly through the operating system). I'm not sure how much write bandwidth normal NVMe drives can really deliver for sustained write IO, but I believe that they can absorb very high write rates for a short period as you flush out a few hundred megabytes or more. This is a fairly big sea change from even SATA SSDs (and I believe SAS SSDs), never mind HDDs.

About a decade ago, I speculated that everyone was going to be forced to migrate to SATA SSDs because developers would build programs that required SATA SSD performance. It's quite common for developers to build programs and systems that run well on their hardware (whether that's laptops, desktops, or servers, cloud or otherwise), and developers often use the latest and best. These days, that hardware is going to have NVMe SSDs, and so it wouldn't be surprising if developers increasingly developed for full NVMe performance. Some of this may be inadvertent, in that the developer doesn't realize what the performance impact of their choices is on systems with less speedy storage. Some of this will likely be deliberate, as developers choose to optimize for NVMe performance or even develop systems that only work well with that level of performance.

This is a potential problem because there are a number of ways to not have that level of NVMe performance. Most obviously, you can simply not have NVMe drives; instead you may be using SATA SSDs (as we mostly are, including in our fileservers), or even HDDs (as we are in our Prometheus metrics server). Less obviously, you may have NVMe drives but be driving them in ways that don't give you the full NVMe bandwidth. For instance, you might have a bunch of NVMe drives behind a 'tri-mode' HBA, or have (some of) your NVMe drives hanging off the chipset with shared PCIe lanes to the CPU, or have to drive some of your NVMe drives with fewer than x4 PCIe because of limits on slots or lanes.

(Dedicated NVMe focused storage servers will be able to support lots of NVMe devices at full speed, but such storage servers are likely to be expensive. People will inevitably build systems with lower end setups, us included, and I believe that basic 1U servers are still mostly SATA/SAS based.)

One possible reason for optimism is that in today's operating systems, it can take careful system design and unusual programming patterns to really push NVMe disks to high performance levels. This may make it less likely that software accidentally winds up being written so it only performs well on NVMe disks; if it happens, it will be deliberate and the project will probably tell you about it. This is somewhat unlike the SSD/HDD situation a decade ago, where the difference in (random) IO operations per second was both massive and easily achieved.

(This entry was sparked in part by reading this article (via), which I'm not taking a position on.)

HTTP headers that tell syndication feed fetchers how soon to come back

By: cks

Programs that fetch syndication feeds should fetch them only every so often. But how often? There are a variety of ways to communicate this, and for my own purposes I want to gather them in one place.

I'll put the summary up front. For Atom syndication feeds, your HTTP feed responses should contain a Cache-Control: max-age=... HTTP header that gives your desired retry interval (in seconds), such as '3600' for pulling the feed once an hour. If and when people trip your rate limits and get HTTP 429 responses, your 429s should include a Retry-After header with how long you want feed readers to wait (although they won't).

There are two syndication feed formats in general usage, Atom and RSS2. Although generally not great (and to be avoided), RSS2 format feeds can optionally contain a number of elements to explicitly tell feed readers how frequently they should poll the feed. The Atom syndication feed format has no standard element to communicate polling frequency. Instead, the nominally standard way to do this is through a general Cache-Control: max-age=... HTTP header, which gives a (remaining) lifetime in seconds. You can also set an Expires header, which gives an absolute expiry time, but not both.

(This information comes from Daniel Aleksandersen's Best practices for syndication feed caching. One advantage of HTTP headers over feed elements is that they can be returned on HTTP 304 Not Modified responses; one drawback is that you need to be able to set HTTP headers.)
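
As an illustration of the HTTP header side, here is a minimal WSGI sketch in Python. It's my own, with a made-up feed body and ETag rather than anything from DWiki, and it advertises a one-hour retry interval on both full responses and 304s.

from wsgiref.simple_server import make_server

FEED_BODY = b"<feed xmlns='http://www.w3.org/2005/Atom'>...</feed>"
FEED_ETAG = '"example-etag"'
MAX_AGE = "max-age=3600"    # ask feed readers to come back in an hour

def feed_app(environ, start_response):
    # Advertise the retry interval on every feed response, including
    # 304s, so conditional GETs see it too.
    headers = [("Cache-Control", MAX_AGE), ("ETag", FEED_ETAG)]
    if environ.get("HTTP_IF_NONE_MATCH") == FEED_ETAG:
        start_response("304 Not Modified", headers)
        return [b""]
    headers.append(("Content-Type", "application/atom+xml"))
    start_response("200 OK", headers)
    return [FEED_BODY]

if __name__ == "__main__":
    make_server("localhost", 8000, feed_app).serve_forever()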

If you have different rate limit policies for conditional GET requests and unconditional ones, you have a choice to make about the time period you advertise on successful unconditional GETs of your feed. Every feed reader has to do an unconditional GET the first time it fetches your feed, and many of them will periodically do unconditional GETs for various reasons. You could choose to be optimistic, assume that the feed reader's next poll will be a conditional GET, and give it the conditional GET retry interval, or you could be pessimistic and give it a longer unconditional GET one. My personal approach is to always advertise the conditional GET retry interval, because I assume that if you're not going to do any conditional GETs you're probably not paying attention to my Cache-Control header either.

As rachelbythebay's ongoing work on improving feed reader behavior has uncovered, a number of feed readers will come back a bit earlier than your advertised retry interval. So my view is that if you have a rate limit, you should advertise a retry interval that is larger than it. On Wandering Thoughts my current conditional GET feed rate limit is 45 minutes, but I advertise a one hour max-age (and I would like people to stick to once an hour).

(Unconditional GETs of my feeds are rate limited down to once every four hours.)

Once people trip your rate limits and start getting HTTP 429 responses, you theoretically can signal how soon they can come back with a Retry-After header. The simplest way to implement this is to have a constant value that you put in this header, even if your actual rate limit implementation would allow a successful request earlier. For example, if you rate limit to one feed fetch every half hour and a feed fetcher polls after 20 minutes, the simple Retry-After value is '1800' (half an hour in seconds), although if they tried again in just over ten minutes they could succeed (depending on how you implement rate limits). This is what I currently do, with a different Retry-After (and a different rate limit) for conditional GET requests and unconditional GETs.
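
Continuing the hypothetical WSGI sketch above, the constant Retry-After approach is about this simple; the 3600 and 14400 second values just mirror the one hour and four hour figures mentioned in this entry.

def rate_limited_response(start_response, conditional):
    # Advertise a fixed come-back time rather than computing exactly when
    # the rate limiter would next allow this client through.
    retry_after = "3600" if conditional else "14400"
    start_response("429 Too Many Requests",
                   [("Retry-After", retry_after),
                    ("Content-Type", "text/plain")])
    return [b"You are fetching this feed too often; slow down.\n"]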

My suspicion is that there are almost no feed fetchers that ignore your Cache-Control max-age setting but that honor your HTTP 429 Retry-After setting (or that react to 429s at all). Certainly I see a lot of feed fetchers here behaving in ways that very strongly suggest they ignore both, such as rather frequent fetch attempts. But at least I tried.

Sidebar: rate limit policies and feed reader behavior

When you have a rate limit, one question is whether failed (rate limited) requests should count against the rate limit, or if only successful ones count. If you nominally allow one feed fetch every 30 minutes and a feed reader fetches at T (successfully), T+20, and T+33, this is the difference between the third fetch failing (since it's less than 30 minutes from the previous attempt) or succeeding (since it's more than 30 minutes from the last successful fetch).

There are various situations where the right answer is that your rate limit counts from the last request even if the last request failed (what Exim calls a strict ratelimit). However, based on observed feed reader behavior, doing this strict rate limiting on feed fetches will result in quite a number of syndication feed readers never successfully fetching your feed, because they will never slow down and drop under your rate limit. You probably don't want this.
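
Here is a small Python sketch of the difference, using the 30 minute example above (my own illustration, not any particular rate limiter's code): a 'strict' limiter counts every attempt against the clock, while a lenient one only counts successes.

import time

class RateLimiter:
    def __init__(self, interval, strict=False):
        self.interval = interval   # minimum seconds between allowed fetches
        self.strict = strict       # if True, failed attempts also reset the clock
        self.last = None

    def allow(self, now=None):
        now = time.time() if now is None else now
        ok = self.last is None or (now - self.last) >= self.interval
        if ok or self.strict:
            self.last = now
        return ok

# A reader polling at T, T+20m, and T+33m against a 30 minute limit:
lenient = RateLimiter(30 * 60)
strict = RateLimiter(30 * 60, strict=True)
for t in (0, 20 * 60, 33 * 60):
    print(t, lenient.allow(t), strict.allow(t))
# lenient: True, False, True; strict: True, False, False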

Mapping from total requests per day to average request rates

By: cks

Suppose, not hypothetically, that a single IP address with a single User-Agent has made 557 requests for your blog's syndication feed in about 22 and a half hours (most of which were rate-limited and got HTTP 429 replies). If we generously assume that these requests were distributed evenly over one day (24 hours), what was the average interval between requests (the rate of requests)? The answer is easy enough to work out and it's about two and a half minutes between requests, if they were evenly distributed.

I've been looking at numbers like this lately and I don't feel like working out the math each time, so here is a table of them for my own future use.

Total requests    Theoretical interval (rate)
6                 Four hours
12                Two hours
24                One hour
32                45 minutes
48                30 minutes
96                15 minutes
144               10 minutes
288               5 minutes
360               4 minutes
480               3 minutes
720               2 minutes
1440              One minute
2880              30 seconds
5760              15 seconds
8640              10 seconds
17280             5 seconds
43200             2 seconds
86400             One second

(This obviously isn't comprehensive; instead I want it to give me a ballpark idea, and I care more about higher request counts than lower ones. But not too high because I mostly don't deal with really high rates. Every four hours and every 45 minutes are relevant to some ratelimiting I do.)
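
The table is easy to regenerate or extend; here is a small Python sketch that prints the ballpark interval for a given number of requests per day, including the 557-request example from the start of this entry.

def interval(per_day):
    secs = 86400 / per_day
    if secs >= 3600:
        return "%.1f hours" % (secs / 3600)
    if secs >= 60:
        return "%.1f minutes" % (secs / 60)
    return "%.1f seconds" % secs

for n in (6, 12, 24, 32, 48, 96, 144, 288, 360, 480, 720,
          1440, 2880, 5760, 8640, 17280, 43200, 86400):
    print("%6d  %s" % (n, interval(n)))

print(interval(557))   # about 2.6 minutes, as in the example above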

Yesterday there were about 20,240 requests for the main syndication feed for Wandering Thoughts, which is an aggregate rate of more than one request every five seconds. About 10,570 of those requests weren't blocked in various ways or ratelimited, which is still more than one request every ten seconds (if they were evenly spread out, which they probably weren't).

(There were about 48,000 total requests to Wandering Thoughts, and about 18,980 got successful responses, although almost 2,000 of those successful responses were a single rogue crawler that's now blocked. This is of course nothing compared to what a busy website sees. Yesterday my department's web server saw 491,900 requests, although that seems to have been unusually high. Interested parties can make their own tables for that sort of volume level.)

It's a bit interesting to see this table written out this way. For example, if I thought about it I knew there was a factor of ten difference between one request every ten seconds and one request every second, but it's more concrete when I see the numbers there with the extra zero.

In GNU Emacs, I should remember that the basics still work

By: cks

Over on the Fediverse, I said something that has a story attached:

It sounds obvious to say it, but I need to remember that I can always switch buffers in GNU Emacs by just switching buffers, not by using, eg, the MH-E commands to switch (back) to another folder. The MH-E commands quite sensibly do additional things, but sometimes I don't want them.

GNU Emacs has a spectrum of things that range from assisting your conventional editing (such as LSP clients) to what are essentially nearly full-blown applications that happen to be embedded in GNU Emacs, such as magit and MH-E and the other major modes for reading your email (or Usenet news, or etc). One of my personal dividing lines is to what extent the mode takes over from regular Emacs keybindings and regular Emacs behaviors. On this scale, MH-E is quite high on the 'application' side; in MH-E folder buffers, you mostly do things through custom keybindings.

(Well, sort of. This is actually overselling the case because I use regular Emacs buffer movement and buffer searching commands routinely, and MH-E uses Emacs marks to select ranges of messages, which you establish through normal Emacs commands. But actual MH-E operations, like switching to another folder, are done through custom keybindings that involve MH-E functions.)

My dominant use of GNU Emacs at the moment is as a platform for MH-E. When I'm so embedded in an MH-E mindset, it's easy to wind up with a form of tunnel vision, where I think of the MH-E commands as the only way to do something like 'switch to another (MH) folder'. Sometimes I do need or want to use the MH-E commands, and sometimes they're the easiest way, but part of the power of GNU Emacs as a general purpose environment is that ultimately, MH-E's displays of folders and messages, the email message I'm writing, and so on, are all just Emacs buffers being displayed in Emacs windows. I don't have to switch between these things through MH-E commands if I don't want to; I can just switch buffers with 'C-x b'.

(Provided that the buffer already exists. If the buffer doesn't exist, I need to use the MH-E command to create it.)

Sometimes the reason to use native Emacs buffer switching is that there's no MH-E binding for the functionality, for example to switch from a mail message I'm writing back to my inbox (either to look at some other message or to read new email that just came in). Sometimes it's because, for example, the MH-E command to switch to a folder wants to rescan the MH folder, which forces me to commit or discard any pending deletions and refilings of email.

One of the things that makes this work is that MH-E uses a bunch of different buffers for things. For example, each MH folder gets its own separately named buffer, instead of MH-E simply loading the current folder (whatever it is) into a generic 'show a folder' buffer. Magit does something similar with buffer naming, where its summary buffer isn't called just 'magit' but 'magit: <directory>' (I hadn't noticed that until I started writing this entry, but of course Magit would do it that way as a good Emacs citizen).

Now that I've written this, I've realized that a bit of my MH-E customization uses a fixed buffer name for a temporary buffer, instead of a buffer name based on the current folder. I'm in good company on this, since a number of MH-E status display commands also use fixed-name buffers, but perhaps I should do better. On the other hand, using a fixed buffer name does avoid having a bunch of these buffers linger around just because I used my command.

(This is using with-output-to-temp-buffer, and a lot of use of it in GNU Emacs' standard Lisp is using fixed names, so maybe my usage here is fine. The relevant Emacs Lisp documentation doesn't have style and usage notes that would tell me either way.)

Some thoughts on Ubuntu automatic ('unattended') package upgrades

By: cks

The default behavior of a stock Ubuntu LTS server install is that it enables 'unattended upgrades', by installing the package unattended-upgrades (which creates /etc/apt/apt.conf.d/20auto-upgrades, which controls this). Historically, we haven't believed in unattended automatic package upgrades and eventually built a complex semi-automated upgrades system (which has various special features). In theory this has various potential advantages; in practice it mostly results in package upgrades being applied after some delay that depends on when they come out relative to working days.

I have a few machines that actually are stock Ubuntu servers, for reasons outside the scope of this entry. These machines naturally have automated upgrades turned on and one of them (in a cloud, using the cloud provider's standard Ubuntu LTS image) even appears to automatically reboot itself if kernel updates need that. These machines are all in undemanding roles (although one of them is my work IPv6 gateway), so they aren't necessarily indicative of what we'd see on more complex machines, but none of them have had any visible problems from these unattended upgrades.

(I also can't remember the last time that we ran into a problem with updates when we applied them. Ubuntu updates still sometimes have regressions and other problems, forcing them to be reverted or reissued, but so far we haven't seen problems ourselves; we find out about these problems only through the notices in the Ubuntu security lists.)

If we were starting from scratch today in a greenfield environment, I'm not sure we'd bother building our automation for manual package updates. Since we have the automation and it offers various extra features (even if they're rarely used), we're probably not going to switch over to automated upgrades (including in our local build of Ubuntu 26.04 LTS when that comes out next year).

(The advantage of switching over to standard unattended upgrades is that we'd get rid of a local tool that, like all local tools, is all our responsibility. The fewer local weird things we have, the better, especially since we have so many as it is.)

I wish Firefox had some way to permanently mute a website

By: cks

Over on the Fediverse, I had a wish:

My kingdom for a way to tell Firefox to never, ever play audio and/or video for a particular site. In other words, a permanent and persistent mute of that site. AFAIK this is currently impossible.

(For reasons, I cannot set media.autoplay.blocking_policy to 2 generally. I could if Firefox had a 'all subdomains of ...' autoplay permission, but it doesn't, again AFAIK.)

(This is in a Firefox setup that doesn't have uMatrix and that runs JavaScript.)

Sometimes I visit sites in my 'just make things work' Firefox instance that has JavaScript and cookies and so on allowed (and throws everything away when it shuts down), and it turns out that those sites have invented exceedingly clever ways to defeat Firefox's default attempts to let you block autoplaying media (and possibly their approach is clever enough to defeat even the strict 'click to start' setting for media.autoplay.blocking_policy). I'd like to frustrate those sites, especially ones that I keep winding up back on for various reasons, and never hear unexpected noises from Firefox.

(In general I'd probably like to invert my wish, so that Firefox never played audio or video by default and I had to specifically enable it on a site by site basis. But again this would need an 'all subdomains of' option. This version might turn out to be too strict, I'd have to experiment.)

You can mute a tab, but only once it starts playing, and your mute isn't persistent. As far as I know there's no (native) way to get Firefox to start a tab muted, or especially to always start tabs for a site in a muted state, or to disable audio and/or video for a site entirely (the way you can deny permission for camera or microphone access). I'm somewhat surprised that Firefox doesn't have any option for 'this site is obnoxious, put them on permanent mute', because there are such sites out there.

Both uMatrix and apparently NoScript can selectively block media, but I'd have to add either of them to this profile and I broadly want it to be as plain as reasonable. I do have uBlock Origin in this profile (because I have it in everything), but as far as I can tell it doesn't have a specific (and selective) media blocking option, although it's possible you can do clever things with filter rules, especially if you care about one site instead of all sites.

(I also think that Firefox should be able to do this natively, but evidently Firefox disagrees with me.)

PS: If Firefox actually does have an apparently well hidden feature for this, I'd love to know about it.

Argparse will let you have multiple long (and short) options for one thing

By: cks

Argparse is the standard Python module for handling (Unix style) command line options, in the expected way (which not all languages follow). Or at least more or less the expected way; people are periodically surprised that by default argparse allows you to abbreviate long options (although you can safely turn that off if you assume Python 3.8 or later and you remember this corner case).

What I think of as the typical language API for specifying short and long options allows you to specify (at most) one of each; this is the API of, for example, the Go package I use for option handling. When I've written Python programs using argparse, I've followed this usage without thinking very much about it. However, argparse doesn't actually require you to restrict yourself this way. The add_argument() method accepts a list of option strings, and although the documentation's example shows a single short option and a single long option, you can give it more than one of each and it will work.

So yes, you can perfectly reasonably create an argparse option that can be invoked as either '--ns' or '--no-something', so that on the one hand you have a clear canonical version and on the other hand you have something short for convenience. If I'm going to do this (and sometimes I am), the thing I want to remember is that argparse's help output will report these options in the order I gave them to add_argument(), so I probably want to list the long one first, as the canonical and clearest form. In other words:

parser.add_argument("--no-something", "--ns", ....)

so that the -h output I get says:

--no-something, --ns     Don't do something

(If you have multiple '--no-...' options, abbreviated options aren't as compact as this '--ns' style. Of course it's a little bit unusual to have several long options that mean the same thing, but my view is that long options are sort of a zoo anyway and you might as well be convenient.)

Having multiple short (single letter) options for the same thing is also possible but much less in the Unix style, so I'm not sure I'd ever use it. One plausible use is mapping old short options to your real ones for compatibility (or just options that people are accustomed to using for some particular purpose from other programs, and keep using with yours).
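
Putting this together, here is a small self-contained example (the option names are purely illustrative); it shows both the multiple long options and the multiple short options cases.

import argparse

parser = argparse.ArgumentParser(description="argparse alias demo")
# Long option first, so -h lists the canonical form before the short alias.
parser.add_argument("--no-something", "--ns", action="store_true",
                    help="Don't do something")
# Multiple single-letter options also work, eg for compatibility aliases.
parser.add_argument("-q", "-s", "--quiet", dest="quiet",
                    action="store_true", help="Be quiet")
args = parser.parse_args()
print(args.no_something, args.quiet)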

(This is probably not news to anyone who's really used argparse. I'm partly writing this down so that I'll remember it in the future.)

You can only customize GNU Emacs so far due to primitives

By: cks

GNU Emacs is famous as an editor written largely in itself, well, in Emacs Lisp, with a C core for some central high performance things and things that have to be done in C (called 'primitives' in Emacs jargon). It's perhaps popular to imagine that the overall structure of this is that the C parts of GNU Emacs expose a minimal and direct API that's mostly composed of primitive operations, so that as much of Emacs as possible can be implemented in Emacs Lisp. Unfortunately, this isn't really the case, or at least not necessarily as you'd like it, and one consequence of this is to limit the amount of customization you can feasibly do to GNU Emacs.

An illustration of this is in how GNU Emacs de-iconifies frames in X. In a minimal C API version of GNU Emacs, there might be various low level X primitives, including 'x-deiconify-frame', and the Emacs Lisp code for frame management would call these low level X primitives when running under X, and other primitives when running under Windows, and so on. In the actual GNU Emacs, deiconification of frames happens at multiple points and the exposed primitives are things like raise-frame and make-frame-visible. As their names suggest, these primitives aren't there to give Emacs Lisp code access to low level X operations, they're there to do certain higher level logical things.

This is a perfectly fair and logical decision by the GNU Emacs developers. To put it one way, GNU Emacs is opinionated. It and its developers have a certain model of how it works and how things should behave, what it means for the program to be 'GNU Emacs' as opposed to a hypothetical editor construction kit, and what the C code does is a reflection of that. To the Emacs developers, 'make a frame visible' is a sensible thing to do and is best done in C, so they did it that way.

(Buffers are another area where Emacs is quite opinionated on how it wants to work. This sometimes gets awkward, as anyone who's wrestled with temporarily displaying some information from Emacs Lisp may have experienced.)

The drawback of this is that sometimes you can only easily customize GNU Emacs in ways that line up with what the developers expected, since you can't change the inside of C level primitives. If your concept of an operation you want to hook, modify, block, or otherwise fiddle with matches how GNU Emacs sees things, all is probably good. But if your concept of 'an operation' doesn't match up with how GNU Emacs sees it, you may find that what you want to touch is down inside the C layer and isn't exposed as a separate primitive.

(Even if it is exposed as a primitive in its own right, you can have problems, because when you advise a primitive, this doesn't affect calls to the primitive from other C functions. If there was a separate 'x-deiconify-frame' primitive, I could hook it for calls from Lisp, but not a call from 'make-frame-visible' if that was still a primitive. So to really have effective hooking of a primitive, you need it to be only called from Lisp code (at least for cases you care about).)

PS: This can lead to awkward situations even when everything you want to modify is in Emacs Lisp code, because the specific bit you want to change may be in the middle of a large function. Of course with Emacs Lisp you can always redefine the function, copying its code and modifying it to taste, but there are still drawbacks. You can make this somewhat more reliable in the face of changes (via a comment on this entry), but it's still not great.

The Bash Readline bindings and settings that I want

By: cks

Normally I use Bash (and Readline in general) in my own environment, where I have a standard .inputrc set up to configure things to my liking (although it turns out that one particular setting doesn't work now (and may never have), and I didn't notice). However, sometimes I wind up using Bash in foreign environments, for example if I'm su'd to root at the moment, and when that happens the differences can be things that I get annoyed by. I spent a bit of today running into this again and being irritated enough that this time I figured out how to fix it on the fly.

The general Bash command to do readline things is 'bind', and I believe it accepts all of the same syntax as readline init files do, both for keybindings and for turning off (mis-)features like bracketed paste (which we dislike enough that turning it off for root is a standard feature of our install framework). This makes it convenient if I forget the exact syntax, because I can just look at my standard .inputrc and copy lines from it.

What I want to do is the following:

  • Switch Readline to the Unix word erase behavior I want:

    set bind-tty-special-chars off
    Control-w: backward-kill-word

    Both of these are necessary because without the first, Bash will automatically bind Ctrl-w (my normal word-erase character) to 'unix-word-rubout' and not let you override that with your own binding.

    (This is the difference that I run into all the time, because I'm very used to being able to use Ctrl-W to delete only the most recent component of a path. I think this partly comes from habit and partly because you tab-complete multi-component paths a component at a time, so if I mis-complete the latest component I want to Ctrl-W just it. M-Del is a standard Readline binding for this, but it's less convenient to type and not something I remember.)

  • Make readline completion treat symbolic links to directories as if they were directories:

    set mark-symlinked-directories on

    When completing paths and so on, I mostly don't bother thinking about the difference between an actual directory (such as /usr/bin) and a symbolic link to a directory (such as /bin on modern Linuxes). If I type '/bi<TAB>' I want this to complete to '/bin/', not '/bin', because it's basically guaranteed that I will go on to tab-complete something in '/bin/'. If I actually want the symbolic link, I'll delete the trailing '/' (which does happen every so often, but much less frequently than I want to tab-complete through the symbolic link).

  • Make readline forget any random edits I did to past history lines when I hit Return to finally do something:

    set revert-all-at-newline on

    The behavior I want from readline is that past history is effectively immutable. If I edit some bit of it and then abandon the edit by moving to another command in the history (or just start a command from scratch), the edited command should revert to being what I actually typed back when I executed it, no later than when I hit Return on the current command and start a new one. It infuriates me when I cursor-up (on a fresh command) and don't see exactly the past commands that I typed.

    (My notes say I got this from Things You Didn't Know About GNU Readline.)

This is more or less in the order I'm likely to fix them. The different (and to me wrong) behavior of C-w is a relatively constant irritation, while the other two are less frequent.

(If this irritates me enough on a particular system, I can probably do something in root's .bashrc, if only to add an alias to use 'bind -f ...' on a prepared file. I can't set these in /root/.inputrc, because my co-workers don't particularly agree with my tastes on these and would probably be put out if standard readline behavior they're used to suddenly changed on them.)

(In other Readline things I want to remember, there's Readline's support for fishing out last or first or Nth arguments from earlier commands.)

Why Wandering Thoughts has fewer comment syndication feeds than yesterday

By: cks

Over on the Fediverse I said:

My techblog used to offer Atom syndication feeds for the comments on individual entries. I just turned that off because it turns out to be a bad idea on the modern web when you have many years of entries. There are (were) any number of 'people' (feed things) that added the comment feeds for various entries years ago and then never took them out, despite those entries being years old and in some cases never having gotten comments in the first place.

DWiki, the engine behind Wandering Thoughts, is nothing if not general. Syndication feeds, for example, are a type of 'view' over a directory hierarchy, and are available for both pages and comments. A regular (page) syndication feed view can only be done over (on) a directory, because if it was applied to an individual page the feed would only ever contain that page. However, when I wrote DWiki it was obvious that a comment syndication feed for a particular page made sense; it would give you all of the comments 'under' that page (ie, on it). And so for almost all of the time that Wandering Thoughts has been in operation, you could have looked down to the bottom of an entry's page (on the web) and seen in small type 'Atom Syndication: Recent Comments' (with the 'recent comments' being a HTML link giving you the URL of that page's comment feed).

(The comment syndication feed for a directory is all comments on all pages underneath the directory.)

That's gone now, because I decided that it didn't make sense in what Wandering Thoughts has become and because I was slowly accumulating feed readers that were pulling the comment syndication feeds for more and more entries. This is exactly the behavior I should have expected from feed readers from the start; once someone puts a feed in, that feed is normally forever even if it's extremely inactive or has never had an entry. The feed reader will dutifully poll every feed for years to come (well, certainly every feed that responds with HTTP success and a valid syndication feed, which all of my comment feeds did).

(There weren't very many pages having their comment syndication feeds hit, but there were enough that I kept noticing them, especially when I added things like hacky rate limiting for feed fetching. I actually put in some extra hacks to deal with how requests for these feeds interacted with my rate limiting.)

There are undoubtedly places on the Internet where discussion (in the form of comments) continues on for years on certain pages, and so a comment feed for an individual page could make sense; you really might keep up (in your feed reader) with a slow moving conversation that lasts years. Other places on the Internet put definite cut-offs on further discussion (comments) on individual pages, which provides a natural deadline to turn off the page's comment syndication feed. But neither of those profiles describes Wandering Thoughts, where my entries remain open for comments more or less forever (and sometimes people do comment on quite old entries), but comments and discussions don't tend to go on for very long.

Of course, the other thing that this change prevents is that it stops (LLM) web crawlers from trying to crawl all of those URLs for comment syndication feeds. You can't crawl URLs that aren't advertised any more and no longer exist (well, sort of, they technically exist but the code for handling them arranges to return 404s if the new 'no comment feeds for actual pages' configuration option is turned on).

Giving up on Android devices using IPv6 on our general-access networks

By: cks

We have a couple of general purpose, general access networks that anyone can use to connect their devices to; one is a wired network (locally, it's called our 'RED' network after the colour of the network cables used for it), and the other is a departmental wireless network that's distinct from the centrally run university-wide network. However, both of these networks have a requirement that we need to be able to more or less identify who is responsible for a machine on them. Currently, this is done through (IPv4) DHCP and registering the Ethernet address of your device. This is a problem for any IPv6 deployment, because the Android developers refuse to support DHCPv6.

We're starting to look more seriously at IPv6, including sort of planning out how our IPv6 subnets will probably work, so I came back to thinking about this issue recently. My conclusion and decision was to give up on letting Android devices use IPv6 on our networks. We can't use SLAAC (StateLess Address AutoConfiguration) because that doesn't require any sort of registration, and while Android devices apparently can use IPv6 Prefix Delegation, that would consume /64s at a prodigious rate using reasonable assumptions. We'd also have to build a system to do it. So there's no straightforward answer, and while I can think of potential hacks, I've decided that none of them are particularly good options compared to the simple choice to not support IPv6 for Android by way of only supporting DHCPv6.

(Our requirement for registering a fixed Ethernet address also means that any device that randomizes its wireless Ethernet address on every connection has to turn that off. Hopefully all such devices actually have such an option.)

I'm only a bit sad about this, because you can only hope that a rock rolls uphill for so long before you give up. IPv6 is still not a critical thing in my corner of the world (as shown by how no one is complaining to us about the lack of it), so some phones continuing to not have IPv6 is not likely to be a big deal to people here.

(Android devices that can be connected to wired networking will be able to get IPv6 on some research group networks. Some research groups ask for their network to be open and not require pre-registration of devices (which is okay if it only exists in access-controlled space), and for IPv6 I expect we'll do this by turning on SLAAC on the research group's network and calling it a day.)

Connecting M.2 drives to various things (and not doing so)

By: cks

As a result of discovering that (M.2) NVMe SSDs seem to have become the dominant form of SSDs, I started looking into what you could connect M.2 NVMe SSDs to. In particular, I wanted to see if you could turn M.2 NVMe SSDs into SATA SSDs, so you could connect high capacity M.2 NVMe SSDs to, for example, your existing stock of ZFS fileservers (which use SATA SSDs). The short version is that as far as I can tell, there's nothing that does this, and once I started thinking about it I wasn't as surprised as I might have been.

What you can readily find is passive adapters from M.2 NVMe or M.2 SATA to various other forms of either NVMe or SATA, depending. For example, there are M.2 NVMe to U.2 cases, and M.2 SATA to SATA cases; these are passive because they're just wiring things through, with no protocol conversion. There are also some non-passive products that go the other way; they're a M.2 'NVMe' 2280 card that has four SATA ports on it (and presumably a PCIe SATA controller). However, the only active M.2 NVMe product (one with protocol conversion) that I can find is M.2 NVMe to USB, generally in the form of external enclosures.

(NVMe drives are PCIe devices, so an 'M.2 NVMe' connector is actually providing some PCIe lanes to the M.2 card. Normally these lanes are connected to an NVMe controller, but I don't believe there's any intrinsic reason that you can't connect them to other PCIe things. So you can have 'PCIe SATA controller on an M.2 PCB' and various other things.)

When I thought about it, I realized the problem with my hypothetical 'obvious' M.2 NVMe to SATA board (and case): since it involves protocol conversion (between NVMe and SATA), someone would have to make the controller chipset for it. You can't make a M.2 NVMe to SATA adapter until someone goes to the expense of designing and fabricating (and probably programming) the underlying chipset, and presumably no one has yet found it commercially worthwhile to do so. Since (M.2) NVMe to USB adapters exist, protocol conversion is certainly possible, and since such adapters are surprisingly inexpensive, presumably there's enough demand to drive down the price of the underlying controller chipsets.

(These chipsets are, for example, the Realtek RTL9210B-CG or the ASMedia ASM3242.)

Designing a chipset is not merely expensive, it's very expensive, which to me explains why there aren't any high-priced options for connecting a NVMe drive up via SATA, the way there are high-priced options for some uncommon things (like connecting multiple NVMe drives to a single PCIe slot without PCIe bifurcation, which can presumably be done with the right existing PCIe bridge chipset).

(Since I checked, there also don't currently seem to be any high capacity M.2 SATA SSDs (which in theory could just be a controller chipset swap from the M.2 NVMe version). If they existed, you could use a passive M.2 SATA to 2.5" SATA adapter to get them into the form factor you want.)

It seems like NVMe SSDs have overtaken SATA SSDs for high capacities

By: cks

For a long time, NVMe SSDs were the high end option; as the high end option they cost more than SATA SSDs of the same capacity, and SATA SSDs were generally available in higher capacity than NVMe SSDs (at least at prices you wanted to pay). This is why my home desktop wound up with a storage setup with a mirrored pair of 2 TB NVMe SSDs (which felt pretty indulgent) and a mirrored pair of 4 TB SATA SSDs (which felt normal-ish). Today, for reasons outside the boundary of this entry, I wound up casually looking to see how available large SSDs were. What I expected to find was that large-capacity SATA SSDs would now be reasonably available and not too highly priced, while NVMe SSDs would top out at perhaps 4TB and high prices.

This is not what I found, at least at some large online retailers. Instead, SATA SSDs seem to have almost completely stagnated at 4 TB, with capacities larger than that only available from a few specialty vendors at eye-watering prices. By contrast, 8 TB NVMe SSDs seem readily available at somewhat reasonable prices from mainstream drive vendors like WD (they aren't inexpensive but they're not unreasonable given the prices of 4 TB NVMe, which is roughly the price I remember 4 TB SATA SSDs being at). This makes me personally sad, because my current home desktop has more SATA ports than M.2 slots or even PCIe x1 slots.

(You can get PCIe x1 cards that mount a single NVMe SSD, and I think I'd get somewhat better than SATA speeds out of them. I have one to try out in my office desktop, but I haven't gotten around to it yet.)

At one level this makes sense. Modern motherboards have a lot more M.2 slots than they used to, and I speculated several years ago that M.2 NVMe drives would eventually be cheaper to make than 2.5" SSDs. So in theory I'm not surprised that probable consumer (lack of) demand has basically extinguished SATA SSDs above 4 TB. In practice, I am surprised and it feels disconcerting for NVMe SSDs to now look like the 'mainstream' choice.

(This is also potentially inconvenient for work, where we have a bunch of ZFS fileservers that currently use 4 TB 2.5" SATA SSDs (an update from their original 2 TB SATA SSDs). If there are no reasonably priced SATA SSDs above 4 TB, then our options for future storage expansion become more limited. In the long run we may have to move to U.2 to get hotswappable 4+ TB SSDs. On the other hand, apparently there are inexpensive M.2 to U.2 adapters, and we've done worse sins with our fileservers.)

Websites and web developers mostly don't care about client-side problems

By: cks

In response to my entry on the fragility of the web in the face of the crawler plague, Jukka said in a comment:

While I understand the server-side frustrations, I think the corresponding client-side frustrations have largely been lacking from the debates around the Web.

For instance, CloudFlare now imposes heavy-handed checks that take a few seconds to complete. [...]

This is absolutely true but it's not new, and it goes well beyond anti-crawler and anti-robot defenses. As covered by people like Alex Russell, it's routine for websites to ignore most real world client side concerns (also, and including on desktops). Just recently (as of August 2025), Github put out a major update that many people are finding immensely slow even on developer desktops. If we can't get web developers to care about common or majority experiences for their UI, which in some sense has relatively little on the line, the odds of web site operators caring when their servers are actually experiencing problems (or at least annoyances) are basically nil.

Much like browsers have most of the power in various relationships with, for example, TLS certificate authorities, websites have most of the power in their relationship to clients (ie, us). If people don't like what a website is doing, their only option is generally a boycott. Based on the available evidence, any boycotts over things like CAPTCHA challenges have been ineffective so far. Github can afford to give people a UI with terrible performance because the switching costs are sufficiently high that they know most people won't leave.

(Another view is that the server side mostly doesn't notice or know that they're losing people; the lost people are usually invisible, with websites only having much visibility into the people who stick around. I suspect that relatively few websites do serious measurement of how many people bounce off or stop using them.)

Thus, in my view, it's not so much that client-side frustrations have been 'lacking' from debates around the web, which makes it sound like client side people haven't been speaking up, as that they've been actively ignored because, roughly speaking, no one on the server side cares about client-side frustrations. Maybe they vaguely sympathize, but they care a lot more about other things. And it's the web server side who decides how things operate.

(The fragility exposed by LLM crawler behavior demonstrates that clients matter in one sense, but it's not a sense that encourages website operators to cooperate or listen. Rather the reverse.)

I'm in no position to throw stones here, since I'm actively making editorial decisions that I know will probably hurt some real clients. Wandering Thoughts has never been hammered by crawler load the way some sites have been; I merely decided that I was irritated enough by the crawlers that I was willing to throw a certain amount of baby out with the bathwater.

Getting the Cinnamon desktop environment to support "AppIndicator"

By: cks

The other day I wrote about what "AppIndicator" is (a protocol) and some things about how the Cinnamon desktop appeared to support it, except they weren't working for me. Now I actually understand what's going on, more or less, and how to solve my problem of a program complaining that it needed AppIndicator.

Cinnamon directly implements the AppIndicator notification protocol in xapp-sn-watcher, part of Cinnamon's xapp(s) package. Xapp-sn-watcher is started as part of your (Cinnamon) session. However, it has a little feature, namely that it will exit if no one is asking it to do anything:

XApp-Message: 22:03:57.352: (SnWatcher) watcher_startup: ../xapp-sn-watcher/xapp-sn-watcher.c:592: No active monitors, exiting in 30s

In a normally functioning Cinnamon environment, something will soon show up to be an active monitor and stop xapp-sn-watcher from exiting:

Cjs-Message: 22:03:57.957: JS LOG: [LookingGlass/info] Loaded applet xapp-status@cinnamon.org in 88 ms
[...]
XApp-Message: 22:03:58.129: (SnWatcher) name_owner_changed_signal: ../xapp-sn-watcher/xapp-sn-watcher.c:162: NameOwnerChanged signal received (n: org.x.StatusIconMonitor.cinnamon_0, old: , new: :1.60
XApp-Message: 22:03:58.129: (SnWatcher) handle_status_applet_name_owner_appeared: ../xapp-sn-watcher/xapp-sn-watcher.c:64: A monitor appeared on the bus, cancelling shutdown

This something is a standard Cinnamon desktop applet. In System Settings → Applets, it's way down at the bottom and is called "XApp Status Applet". If you've accidentally wound up with it not turned on, xapp-sn-watcher will (probably) not have a monitor active after 30 seconds, and then it will exit (and in the process of exiting, it will log alarming messages about failed GLib assertions). Not having this xapp-status applet turned on was my problem, and turning it on fixed things.

(I don't know how it got turned off. It's possible I went through the standard applets at some point and turned some of them off in an excess of ignorant enthusiasm.)

As I found out from leigh scott in my Fedora bug report, the way to get this debugging output from xapp-sn-watcher is to run 'gsettings set org.x.apps.statusicon sn-watcher-debug true'. This will cause xapp-sn-watcher to log various helpful and verbose things to your ~/.xsession-errors (although apparently not the fact that it's actually exiting; you have to deduce that from the timestamps stopping 30 seconds later and that being the timestamps on the GLib assertion failures).
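
For my own future reference, the whole check is short enough to do in a couple of commands (the gsettings bit is the one from above; the rest is ordinary shell):

# turn on xapp-sn-watcher's debug output
gsettings set org.x.apps.statusicon sn-watcher-debug true

# then watch your session log for its messages, including the
# "No active monitors, exiting in 30s" countdown
tail -f ~/.xsession-errors | grep SnWatcher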

(I don't know why there's both a program and an applet involved in this and I've decided not to speculate.)

The current (2025) crawler plague and the fragility of the web

By: cks

These days, more and more people are putting more and more obstacles in the way of the plague of crawlers (many of them apparently doing it for LLM 'AI' purposes), me included. Some of these obstacles involve attempting to fingerprint unusual aspects of crawler requests, such as using old browser User-Agents or refusing to accept compressed things in an attempt to avoid gzip bombs; other obstacles may involve forcing visitors to run JavaScript, using CAPTCHAs, or relying on companies like Cloudflare to block bots with various techniques.

On the one hand, I sort of agree that these 'bot' (crawler) defenses are harmful to the overall ecology of the web. On the other hand, people are going to do whatever works for them for now, and none of the current alternatives are particularly good. There's a future where much of the web simply isn't publicly available any more, at least not to anonymous people.

One thing I've wound up feeling from all this is that the current web is surprisingly fragile. A significant amount of the web seems to have been held up by implicit understandings and bargains, not by technology. When LLM crawlers showed up and decided to ignore the social things that had kept those parts of the web going, things started coming down all over the place.

(This isn't new fragility; the fragility was always there.)

Unfortunately, I don't see a technical way out from this (and I'm not sure I see any realistic way in general). There's no magic wand that we can wave to make all of the existing websites, web apps, and so on not get impaired by LLM crawlers when the crawlers persist in visiting everything despite being told not to, and on top of that we're not going to make bandwidth free. Instead I think we're looking at a future where the web ossifies for and against some things, and more and more people see catgirls.

(I feel only slightly sad about my small part in ossifying some bits of the web stack. Another part of me feels that a lot of web client software has gotten away with being at best rather careless for far too long, and now the consequences are coming home to roost.)

What an "AppIndicator" is in Linux desktops and some notes on it

By: cks

Suppose, not hypothetically, that you start up some program on your Fedora 42 Cinnamon desktop and it helpfully tells you "<X> requires AppIndicator to run. Please install the AppIndicator plugin for your desktop". You are likely confused, so here are some notes.

'AppIndicator' itself is the name of an application notification protocol, apparently originally from KDE, and some desktop environments may need a (third party) extension to support it, such as the Ubuntu one for GNOME Shell. Unfortunately for me, Cinnamon is not one of those desktops. It theoretically has native support for this, implemented in /usr/libexec/xapps/xapp-sn-watcher, part of Cinnamon's xapps package.

The actual 'AppIndicator' protocol is done over D-Bus, because that's the modern way. Since this started as a KDE thing, the D-Bus name is 'org.kde.StatusNotifierWatcher'. What provides certain D-Bus names is found in /usr/share/dbus-1/services, but not all names are mentioned there and 'org.kde.StatusNotifierWatcher' is one of the missing ones. In this case /etc/xdg/autostart/xapp-sn-watcher.desktop mentions the D-Bus name in its 'Comment=', but that's probably not something you can count on to find what your desktop is (theoretically) using to provide a given D-Bus name. I found xapp-sn-watcher somewhat through luck.

There are probably a number of ways to see what D-Bus names are currently registered and active. The one that I used when looking at this is 'dbus-send --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames'. As far as I know, there's no easy way to go from an error message about 'AppIndicator' to knowing that you want 'org.kde.StatusNotifierWatcher'; in my case I read the source of the thing complaining which was helpfully in Python.

(I used the error message to find the relevant section of code, which showed me what it wasn't finding.)
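
If you just want to know whether anything on your session bus is currently providing the name, a quick (if crude) check is to run the ListNames query from above and filter its output:

# list registered session bus names and look for the watcher
dbus-send --print-reply --dest=org.freedesktop.DBus \
  /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep StatusNotifierWatcher

# a working setup should show something like:
#   string "org.kde.StatusNotifierWatcher"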

I have no idea how to actually fix the problem, or if there is a program that implements org.kde.StatusNotifierWatcher as a generic, more or less desktop independent program the way that stalonetray does for system tray stuff (or one generation of system tray stuff, I think there have been several iterations of it, cf).

(Yes, I filed a Fedora bug, but I believe Cinnamon isn't particularly supported by Fedora so I don't expect much. I also built the latest upstream xapps tree and it also appears to fail in the same way. Possibly this means something in the rest of the system isn't working right.)

Some notes on DMARC policy inheritance and a gotcha

By: cks

When you use DMARC, you get to specify a policy that people should apply to email that claims to be from your domain but doesn't pass DMARC checks (people are under no obligation to pay attention to this and they may opt to be stricter). These policies are set in DNS TXT records, and in casual use we can say that the policies of subdomains in your domain can be 'inherited'. This recently confused me and now I have some answers.

Your top level domain can specify a separate policy for itself (eg 'user@example.org') and subdomains (eg 'user@foo.example.org'); these are the 'p=' and 'sp=' bits in a DMARC DNS TXT record. Your domain's subdomain policy is used only for subdomains that don't set a policy themselves; an explicitly set subdomain policy overrides the domain policy, for better or worse. If your organization wants to force some minimum DMARC policy, you can't do it with a simple DNS record; you have to somehow forbid subdomains from publishing their own conflicting DMARC policies in your DNS.
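
As a concrete sketch (the domains and policies here are made up, but the record format is standard DMARC), the two knobs look like this in DNS:

; organization level: p= covers example.org itself, sp= covers
; subdomains that don't publish their own DMARC record
_dmarc.example.org.      TXT  "v=DMARC1; p=quarantine; sp=reject"

; a subdomain that publishes its own record overrides that sp=
_dmarc.foo.example.org.  TXT  "v=DMARC1; p=none"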

The flipside of this is that it's not as bad as it could be to set a strict subdomain policy in your domain DMARC record, because subdomains that care can override it (and may already be doing so implicitly if they've published DMARC records themselves).

However, strictly speaking DMARC policies aren't inherited as we usually think about it. Instead, as I once knew but forgot since then, people using DMARC will check for an applicable policy in only two places: on the direct domain or host name that they care about, and on your organization's top level domain. What this means in concrete terms is that if example.org and foo.example.org both have DMARC records and someone sends email as 'user@bar.foo.example.org', the foo.example.org DMARC record won't be checked. Instead, people will look for DMARC only at 'bar.foo.example.org' (where any regular 'p=' policy will be used) and at 'example.org' (where the subdomain policy, 'sp=', will be used).

(As a corollary, a 'sp=' policy setting in the foo.example.org DMARC record will never be used.)

One place this gets especially interesting is if people send email using the domain 'nonexistent.foo.example.org' in the From: header (either from inside or outside your organization). Since this host name isn't in DNS, it has no DMARC policy of its own, and so people will go straight to the 'example.org' subdomain policy without even looking at the policy of 'foo.example.org'.

(Since traditional DNS wildcard records can only wildcard the leftmost label and DMARC records are looked up on a special '_dmarc.' DNS sub-name, it's not simple to give arbitrary names under your subdomain a DMARC policy.)

How not to check or poll URLs, as illustrated by Fediverse software

By: cks

Over on the Fediverse, I said some things:

[on April 27th:]
A bit of me would like to know why the Akkoma Fediverse software is insistently polling the same URL with HEAD then GET requests at five minute intervals for days on end. But I will probably be frustrated if I turn over that rock and applying HTTP blocks to individual offenders is easier.

(I haven't yet blocked Akkoma in general, but that may change.)

[the other day:]
My patience with the Akkoma Fediverse server software ran out so now all attempts by an Akkoma instance to pull things from my techblog will fail (with a HTTP redirect to a static page that explains that Akkoma mis-behaves by repeatedly fetching URLs with HEAD+GET every few minutes). Better luck in some future version, maybe, although I doubt the authors of Akkoma care about this.

(The HEAD and GET requests are literally back to back, with no delay between them that I've ever observed.)

Akkoma is derived from Pleroma and I've unsurprisingly seen Pleroma also do the HEAD then GET thing, but so far I haven't seen any Pleroma server showing up with the kind of speed and frequency that (some) Akkoma servers do.

These repeated HEADs and GETs are for Wandering Thoughts entries that haven't changed. DWiki is carefully written to supply valid HTTP Last-Modified and ETag, and these values are supplied in replies to both HEAD and GET requests. Despite all of this, Akkoma is not doing conditional GETs and is not using the information from the HEAD to avoid doing a GET if neither header has changed its value from the last time. Since Akkoma is apparently completely ignoring the result of its HEAD request, it might as well not make the HEAD request in the first place.
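
For illustration, here's roughly what the polite version looks like at the HTTP level, done with curl (the header values are made up; a real client would remember whatever the server actually sent):

# first fetch: note the validators the server hands back
curl -sI https://example.org/some/entry | grep -iE '^(etag|last-modified):'

# later polls: send them back; if nothing has changed, the server
# answers with a small 304 instead of the full page
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'If-None-Match: "some-etag-value"' \
  -H 'If-Modified-Since: Mon, 25 Aug 2025 01:02:03 GMT' \
  https://example.org/some/entry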

If you're going to repeatedly poll a URL, especially every five or ten minutes, and you want me to accept your software, you must do conditional GETs. I won't like you and may still arrange to give you HTTP 429s for polling so fast, but I most likely won't block you outright. Polling every five or ten minutes without conditional GET is completely unacceptable, at least to me (other people probably don't notice or care).

My best guess as to why Akkoma is polling the URL at all is that it's for "link previews". If you link to something in a Fediverse post, various Fediverse software will do the common social media thing of trying to embed some information about the target of the URL into the post as it presents it to local people; for plain links with no special handling, this will often show the page title. As far as the (rapid) polling goes, I can only guess that Akkoma has decided that it is extremely extra special and it must update its link preview information very rapidly should the linked URL do something like change the page title. However, other Fediverse server implementations manage to do link previews without repeatedly polling me (much less the HEAD then immediately a GET thing).

(On the global scale of things this amount of traffic is small beans, but it's my DWiki and I get to be irritated with bad behavior if I want to, even if it's small scale bad behavior.)

Getting Linux nflog and tcpdump packet filters to sort of work together

By: cks

So, suppose that you have a brand new nflog version of OpenBSD's pflog, so you can use tcpdump to watch dropped packets (or in general, logged packets). And further suppose that you specifically want to see DNS requests to your port 53. So of course you do:

# tcpdump -n -i nflog:30 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented

Perhaps we can get clever by reading from the interface in one tcpdump and sending it to another to be interpreted, forcing the pcap filter to be handled entirely in user space instead of the kernel:

# tcpdump --immediate-mode -w - -U -i nflog:30 | tcpdump -r - 'port 53'
tcpdump: listening on nflog:30, link-type NFLOG (Linux netfilter log messages), snapshot length 262144 bytes
reading from file -, link-type NFLOG (Linux netfilter log messages), snapshot length 262144
tcpdump: NFLOG link-layer type filtering not implemented

Alas we can't.

As far as I can determine, what's going on here is that the netfilter log system, 'NFLOG', uses a 'packet' format that isn't the same as any of the regular formats (Ethernet, PPP, etc) and adds some additional (meta)data about the packet to every packet you capture. I believe the various attributes this metadata can contain are listed in the kernel's nfnetlink_log.h.

(I believe it's not technically correct to say that this additional stuff is 'before' the packet; instead I believe the packet is contained in a NFULA_PAYLOAD attribute.)

Unfortunately for us, tcpdump (or more exactly libpcap) doesn't know how to create packet capture filters for this format, not even ones that are interpreted entirely in user space (as happens when tcpdump reads from a file).

I believe that you have two options. First, you can use tshark with a display filter, not a capture filter:

# tshark -i nflog:30 -Y 'udp.port == 53 or tcp.port == 53'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'nflog:30'
[...]

(Tshark capture filters are subject to the same libpcap inability to work on NFLOG formatted packets as tcpdump has.)

Alternately and probably more conveniently, you can tell tcpdump to use the 'IPV4' datalink type instead of the default, as mentioned in passing (somewhat opaquely) in the tcpdump manual page:

# tcpdump -i nflog:30 -L
Data link types for nflog:30 (use option -y to set):
  NFLOG (Linux netfilter log messages)
  IPV4 (Raw IPv4)
# tcpdump -i nflog:30 -y ipv4 -n 'port 53'
tcpdump: data link type IPV4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on nflog:30, link-type IPV4 (Raw IPv4), snapshot length 262144 bytes
[...]

Of course this is only applicable if you're only doing IPv4. If you have some IPv6 traffic that you want to care about, I think you have to use tshark display filters (which means learning how to write Wireshark display filters, something I've avoided so far).

I think there is some potentially useful information in the extra NFLOG data, but to get it or to filter on it I think you'll need to use tshark (or Wireshark) and consult the NFLOG display filter reference, although that doesn't seem to give you access to all of the NFLOG stuff that 'tshark -i nflog:30 -V' will print about packets.

(Or maybe the trick is that you need to match 'nflog.tlv_type == <whatever> and nflog.tlv_value == <whatever>'. I believe that some NFLOG attributes are available conveniently, such as 'nflog.prefix', which corresponds to NFULA_PREFIX. See packet-nflog.c.)
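
As an untested illustration, if your nftables rule used 'log prefix "ssh-drop " group 30', I'd expect something like this to narrow the display down to just those packets:

tshark -i nflog:30 -Y 'nflog.prefix contains "ssh-drop"'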

PS: There's some information on the NFLOG format in the NFLOG linktype documentation and tcpdump's supported data link types in the link-layer header types documentation.

An interesting thing about people showing up to probe new DNS resolvers

By: cks

Over on the Fediverse, I said something:

It appears to have taken only a few hours (or at most a few hours) from putting a new resolving DNS server into production to seeing outside parties specifically probing it to see if it's an open resolver.

I assume people are snooping activity on authoritative DNS servers and going from there, instead of spraying targeted queries at random IPs, but maybe they are mass scanning.

There turns out to be some interesting aspects to these probes. This new DNS server has two network interfaces, both firewalled off from outside queries, but only one is used as the source IP on queries to authoritative DNS servers. In addition, we have other machines on both networks, with firewalls, so I can get a sense of the ambient DNS probes.

Out of all of these various IPs, the IP that the new DNS server used for querying authoritative DNS servers, and only that IP, very soon saw queries that were specifically tuned for it:

124.126.74.2.54035 > 128.100.X.Y.53: 16797 NS? . (19)
124.126.74.2.7747 > 128.100.X.Y.7: UDP, length 512
124.126.74.2.54035 > 128.100.X.Y.53: 17690 PTR? Y.X.100.128.in-addr.arpa. (47)

This was a consistent pattern from multiple IPs; they all tried to query for the root zone, tried to check the UDP echo port, and then tried a PTR query for the machine's IP itself. Nothing else saw this pattern; not the machine's other IP on a different network, not another IP on the same network, and so on. This pattern, and its absence everywhere else, is what's led me to assume that people are somehow identifying probe targets based on what source IPs they see making upstream queries.

(There are a variety of ways that you could do this without having special access to DNS servers. APNIC has long used web ad networks and special captive domains and DNS servers for them to do various sorts of measurements, and you could do similar things to discover who was querying your captive DNS servers.)

How you want to have the Unbound DNS server listen on all interfaces

By: cks

Suppose, not hypothetically, that you have an Unbound server with multiple network interfaces, at least two (which I will call A and B), and you'd like Unbound to listen on all of the interfaces. Perhaps these are physical interfaces and there are client machines on both, or perhaps they're virtual interfaces and you have virtual machines on them. Let's further assume that these are routed networks, so that in theory people on A can talk to IP addresses on B and vice versa.

The obvious and straightforward way to have Unbound listen on all of your interfaces is with a server stanza like this:

server:
  interface: 0.0.0.0
  interface: ::0
  # ... probably some access-control statements

This approach works 99% of the time, which is probably why it appears all over the documentation. The other 1% of the time is when a DNS client on network A makes a DNS request to Unbound's IP address on network B; when this happens, the network A client will not get any replies. Well, it won't get any replies that it accepts. If you use tcpdump to examine network traffic, you will discover that Unbound is sending replies to the client on network A using its network A IP address as the source address (which is the default behavior if you send packets to a network you're directly attached to; you normally want to use your IP on that network as the source IP). This will fail with almost all DNS client libraries because DNS clients reject replies from unexpected sources, which is to say any IP other than the IP they sent their query to.

(One way this might happen is if the client moves from network B to network A without updating its DNS configuration. Or you might be testing to see if Unbound's network B IP address answers DNS requests.)
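
One way to see this in action (the addresses are placeholders) is to query the 'wrong side' IP from a client on network A and watch where the replies go out on the Unbound server:

# on a client on network A, query Unbound's network B address
dig @<network-B-IP> example.org A

# on the Unbound server; with a plain 'interface: 0.0.0.0' setup you'll
# see the replies leave with the network A source address rather than
# <network-B-IP>, and the client will discard them
tcpdump -n -i any 'udp port 53 and host <client-A-IP>'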

The other way to listen on all interfaces in modern Unbound is to use 'interface-automatic: yes' (in server options), like this:

server:
  interface-automatic: yes

The important bit of what interface-automatic does for you is mentioned in passing in its documentation, and I've emphasized it here:

Listen on all addresses on all (current and future) interfaces, detect the source interface on UDP queries and copy them to replies.

As far as I know, you can't get this 'detect the source interface' behavior for UDP queries in any other way if you use 'interface: 0.0.0.0' to listen on everything. You get it if you listen on specific interfaces, perhaps with 'ip-transparent: yes' for safety:

server:
  interface: 127.0.0.1
  interface: ::1
  interface: <network A>.<my-A-IP>
  interface: <network B>.<my-B-IP>
  # ensure we always start
  ip-transparent: yes

Since 'interface-automatic' is marked as an experimental option I'd love to be wrong, but I can't spot an option in skimming the documentation and searching on some likely terms.

(I'm a bit surprised that Unbound doesn't always copy the IP address it received UDP packets on and use that for replies, because I don't think things work if you have the wrong IP there. But this is probably an unusual situation and so it gets papered over, although now I'm curious how this interacts with default routes.)

Another reason to use expendable email addresses for everything

By: cks

I'm a long time advocate of using expendable email addresses any time you have to give someone an email address (and then making sure you can turn them off or more broadly apply filters to them). However, some of the time I've trusted the people who were asking for an email address, didn't have an expendable address already prepared for them, and gave them my regular email address. Today I discovered (or realized) another reason to not do this and to use expendable addresses for absolutely everything, and it's not the usual reason of "the people you gave your email address to might get compromised and have their address collection extracted and sold to spammers". The new problem is mailing service providers, such as Mailchimp.

It's guaranteed that some number of spammers make use of big mailing service providers, so you will periodically get spam email from such MSPs to any exposed email address, most likely including your real, primary one. At the same time, these days it's quite likely that anyone you give your email address to will at some point wind up using an MSP, if only to send out a cheerful notification of, say, "we moved from street address A to street address B, please remember this when planning your next appointment" (because if you want to send out such a mass mailing, you basically have to outsource it to an MSP to get it done, even if you normally use, eg, GMail for your regular organizational activities).

If you've given innocent trustworthy organizations your main email address, it's potentially dangerous or impossible to block a particular MSP from sending email to it. In searching your email archive, you may find that such an organization is already using the MSP to send you stuff that you want, or for big MSPs you might decide that the odds of that are simply too high to risk a block. But if you've given separate expendable email addresses to all such organizations, you know that they're not going to be sending anything to your main email address, including through some MSP that you've just got spam from, and it's much safer to block that MSP's access to your main email address.

This issue hadn't occurred to me back when I apparently gave one organization my main email address, but it became relevant recently. So now I'm writing it down, if only for my future self as a reminder of why I don't want to do that.

Implementing a basic equivalent of OpenBSD's pflog in Linux nftables

By: cks

OpenBSD's and FreeBSD's PF system has a very convenient 'pflog' feature, where you put in a 'log' bit in a PF rule and this dumps a copy of any matching packets into a pflog pseudo-interface, where you can both see them with 'tcpdump -i pflog0' and have them automatically logged to disk by pflogd in pcap format. Typically we use this to log blocked packets, which gives us both immediate and after the fact visibility of what's getting blocked (and by what rule, also). It's possible to mostly duplicate this in Linux nftables, although with more work and there's less documentation on it.

The first thing you need is nftables rules with one or two log statements of the form 'log group <some number>'. If you want to be able to both log packets for later inspection and watch them live, you need two 'log group' statements with different numbers; otherwise you only need one. You can use different (group) numbers on different nftables rules if you want to be able to, say, look only at accepted but logged traffic or only dropped traffic. In the end this might wind up looking something like:

tcp dport ssh counter log group 30 log group 31 drop;

As the nft manual page will tell you, this uses the kernel 'nfnetlink_log' to forward the 'logs' (packets) to a netlink socket, where exactly one process (at most) can subscribe to a particular group to receive those logs (ie, those packets). If we want to both log the packets and be able to tcpdump them, we need two groups so we can have ulogd getting one and tcpdump getting the other.

To see packets from any particular log group, we use the special 'nflog:<N>' pseudo-interface that's hopefully supported by your Linux version of tcpdump. This is used as 'tcpdump -i nflog:30 ...' and works more or less like you'd want it to. However, as far as I know there's no way to see meta-information about the nftables filtering, such as what rule was involved or what the decision was; you just get the packet.

To log the packets to disk for later use, the default program is ulogd, which in Ubuntu is called 'ulogd2'. Ulogd(2) isn't as automatic as OpenBSD's and FreeBSD's pf logging; instead you have to configure it in /etc/ulogd.conf, and on Ubuntu make sure you have the 'ulogd2-pcap' package installed (along with ulogd2 itself). Based merely on getting it to work, what you want in /etc/ulogd.conf is the following three bits:

# A 'stack' of source, handling, and destination
stack=log31:NFLOG,base1:BASE,pcap31:PCAP

# The source: NFLOG group 31, for IPv4 traffic
[log31]
group=31
# addressfamily=10 for IPv6

# the file path is correct for Ubuntu
[pcap31]
file="/var/log/ulog/ulogd.pcap"
sync=0

(On Ubuntu 24.04, any .pcap files in /var/log/ulog will be automatically rotated by logrotate, although I think by default it's only weekly, so you might want to make it daily.)

The ulogd documentation suggests that you will need to capture IPv4 and IPv6 traffic separately, but I've only used this on IPv4 traffic so I don't know. This may imply that you need separate nftables rules to log (and drop) IPv6 traffic so that you can give it a separate group number for ulogd (I'm not sure if it needs a separate one for tcpdump or if tcpdump can sort it out).
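
If you do want IPv6 as well, my guess (based on the 'addressfamily' comment above and the ulogd documentation, but untested) is that you'd add a second stack fed from its own nftables log group, something like:

# untested: a separate stack for IPv6 traffic logged with 'log group 41'
stack=log41:NFLOG,base2:BASE,pcap41:PCAP

[log41]
group=41
addressfamily=10

[pcap41]
file="/var/log/ulog/ulogd-v6.pcap"
sync=0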

Ulogd can also log to many different things besides PCAP format, including JSON and databases. It's possible that there are ways to enrich the ulogd pcap logs, or maybe just the JSON logs, with additional useful information such as the network interface involved and other things. I find the ulogd documentation somewhat opaque on this (and also it's incomplete), and I haven't experimented.

(According to this, the JSON logs can be enriched or maybe default to that.)

Given the assorted limitations and other issues with ulogd, I'm tempted to not bother with it and only have our nftables setups support live tcpdump of dropped traffic with a single 'log group <N>'. This would save us from the assorted annoyances of ulogd2.

PS: One reason to log to pcap format files is that then you can use all of the tcpdump filters that you're already familiar with in order to narrow in on (blocked) traffic of interest, rather than having to put together a JSON search or something.

The 'nft' command may not show complete information for iptables rules

By: cks

These days, nftables is the Linux network firewall system that you want to use, and especially it's the system that Ubuntu will use by default even if you use the 'iptables' command. The nft command is the official interface to nftables, and it has a 'nft list ruleset' sub-command that will list your NFT rules. Since iptables rules are implemented with nftables, you might innocently expect that 'nft list ruleset' will show you the proper NFT syntax to achieve your current iptables rules.

Well, about that:

# iptables -vL INPUT
[...] target prot opt in  out  source   destination         
[...] ACCEPT tcp  --  any any  anywhere anywhere    match-set nfsports dst match-set nfsclients src
# nft list ruleset
[...]
      ip protocol tcp xt match "set" xt match "set" counter packets 0 bytes 0 accept
[...]

As they say, "yeah no". As the documentation tells you (eventually), somewhat reformatted:

xt TYPE NAME

TYPE := match | target | watcher

This represents an xt statement from xtables compat interface. It is a fallback if translation is not available or not complete. Seeing this means the ruleset (or parts of it) were created by iptables-nft and one should use that to manage it.

Nftables has a native set type (and also maps), but, quite reasonably, the old iptables 'ipset' stuff isn't translated to nftables sets by the iptables compatibility layer. Instead the compatibility layer uses this 'xt match' magic that the nft command can only imperfectly tell you about. To nft's credit, it prints a warning comment (which I've left out) that the rules are being managed by iptables-nft and you shouldn't touch them. Here, all of the 'xt match "set"' bits in the nft output are basically saying "opaque stuff happens here".

This still makes me a little bit sad because it makes it that bit harder to bootstrap my nftables knowledge from what iptables rules convert into. If I wanted to switch to nftables rules and nftables sets (for example for my now-simpler desktop firewall rules), I'd have to do that from relative scratch instead of getting to clean up what the various translation tools would produce or report.
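
For what it's worth, my understanding is that the native version of that iptables rule would look roughly like the following (the set names come from the iptables output; the ports, addresses, and table layout are made up for illustration, and I haven't tested this):

nft add table inet filter
nft add chain inet filter input '{ type filter hook input priority 0; }'
nft add set inet filter nfsports '{ type inet_service; }'
nft add set inet filter nfsclients '{ type ipv4_addr; }'
nft add element inet filter nfsports '{ 2049 }'
nft add element inet filter nfsclients '{ 192.0.2.10, 192.0.2.11 }'
nft add rule inet filter input tcp dport @nfsports ip saddr @nfsclients counter accept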

(As a side effect it makes it less likely that I'll convert various iptables things to being natively nft/nftables based, because I can't do a fully mechanical conversion. If they still work with iptables-nft, I'm better off leaving them as is. Probably this also means that iptables-nft support is likely to have a long, long life.)

Servers will apparently run for a while even when quite hot

By: cks

This past Saturday (yesterday as I write this), a university machine room had an AC failure of some kind:

It's always fun times to see a machine room temperature of 54C and slowly climbing. It's not our machine room but we have switches there, and I have a suspicion that some of them will be ex-switches by the time this is over.

This machine room and its AC has what you could call a history; in 2011 it flooded partly due to an AC failure, then in 2016 it had another AC issue, and another in 2024 (and those are just the ones I remember and can find entries for).

Most of this machine room is a bunch of servers from another department, and my assumption is that they are what created all of the heat when the AC failed. Both we and the other department have switches in the room, but networking equipment is usually relatively low-heat compared to active servers. So I found it interesting that the temperature graph rises in a smooth arc to its maximum temperature (and then drops abruptly, presumably as the AC starts to get fixed). To me this suggests that many of the servers in the room kept running, despite the ambient temperature hitting 54C (and their internal temperatures undoubtedly being much higher). If some servers powered off from the heat, it wasn't enough to stabilize the heat level of the room; it was still increasing right up to when it started dropping rapidly.

(Servers may well have started thermally throttling various things, and it's possible that some of them crashed without powering off and thus potentially without reducing the heat load. I have second hand information that some UPS units reported battery overheating.)

It's one thing to be fairly confident that server thermal limits are set unrealistically high. It's another thing to see servers (probably) keep operating at 54C, rather than fall over with various sorts of failures. For example, I wouldn't have been surprised if power supplies overheated and shut down (or died entirely).

(I think desktop PSUs are often rated as '0C to 50C', but I suspect that neither end of that rating is actually serious, and this was over 50C anyway.)

I rather suspect that running at 50+C for a while has increased the odds of future failures and shortened the lifetime of everything in this machine room (our switches included). But it still amazes me a bit that things didn't fall over and fail, even above 50C.

(When I started writing this entry I thought I could make some fairly confident predictions about the servers keeping running purely from the temperature graph. But the more I think about it, the less I'm sure of that. There are a lot of things that could be going on, including server failures that leave them hung or locked up but still with PSUs running and pumping out heat.)

My policy of semi-transience and why I have to do it

By: cks

Some time back I read Simon Tatham's Policy of transience (via) and recognized both points of similarity and points of drastic departure between Tatham and me. Both Tatham and I use transient shell history, transient terminal and application windows (sort of for me), and don't save our (X) session state, and in general I am a 'disposable' usage pattern person. However, I depart from Tatham in that I have a permanently running browser and I normally keep my login sessions running until I reboot my desktops. But broadly I'm a 'transient' or 'disposable' person, where I mostly don't keep inactive terminal windows or programs around in case I might want them again, or even immediately re-purpose them from one use to another.

(I do have some permanently running terminal windows, much like I have permanently present other windows on my desktop, but that's because they're 'in use', running some program. And I have one inactive terminal window but that's because exiting that shell ends my entire X session.)

The big way that I depart from Tatham is already visible in my old desktop tour, in the form of a collection of iconified browser windows (in carefully arranged spots so I can in theory keep track of them). These aren't web pages I use regularly, because I have a different collection of schemes for those. Instead they're a collection of URLs that I'm keeping around to read later or in general to do something with. This is anathema to Tatham, who keeps track of URLs to read in other ways, but I've found that it's absolutely necessary for me.

Over and over again I've discovered that if something isn't visible to me, shoved in front of my nose, it's extremely likely to drop completely out of my mind. If I file email into a 'to be dealt with' or 'to be read later' or whatever folder, or if I write down URLs to visit later and explanations of them, or any number of other things, I almost might as well throw those things away. Having a web page in an iconified Firefox window in no way guarantees that I'll ever read it, but writing its URL down in a list guarantees that I won't. So I keep an optimistic collection of iconified Firefox windows around (and every so often I look at some of them and give up on them).

It would be nice if I didn't need to do this and could de-clutter various bits of my electronic life. But by now I've made enough attempts over a long enough period of time to be confident that my mind doesn't work that way and is unlikely to ever change its ways. I need active, ongoing reminders for things to stick, and one of the best forms is to have those reminders right on my desktop.

(And because the reminders need to be active and ongoing, they also need to be non-intrusive. Mailing myself every morning with 'here are the latest N URLs you've saved to read later' wouldn't work, for example.)

PS: I also have various permanently running utility programs and their windows, so my desktop is definitely not minimalistic. A lot of this is from being a system administrator and working with a bunch of systems, where I want various sorts of convenient fast access and passive monitoring of them.

The problem of Python's version dependent paths for packages

By: cks

A somewhat famous thing about Python is that more or less all of the official ways to install packages put them somewhere on the filesystem that contains the Python series version (which is things like '3.13' but not '3.13.5'). This is true for site packages, for 'pip install --user' (to the extent that it still works), and for virtual environments, however you manage them. And this is a problem because it means that any time you change to a new release, such as going from 3.12 to 3.13, all of your installed packages disappear (unless you keep around the old Python version and keep your virtual environments and so on using it).
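
You can see the version dependent paths directly; the exact directories vary by distribution, but it looks something like this:

$ python3 -c 'import sysconfig; print(sysconfig.get_path("purelib"))'
/usr/lib/python3.13/site-packages
$ python3 -m venv /tmp/demo-venv && ls /tmp/demo-venv/lib
python3.13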

In general, a lot of people would like to update to new Python releases. Linux distributions want to ship the latest Python (and usually do), various direct users of Python would like the new features, and so on. But these version-dependent paths and their consequences make version upgrades more painful and so to some extent cause them to be done less often.

In the beginning, Python had at least two reasons to use these version-dependent paths. Python doesn't promise that either its bytecode (and thus the .pyc files it generates from .py files) or its C ABI (which is depended on by any compiled packages, in .so form on Linux) are stable from version to version. Python's standard installation and bytecode processing used to put both bytecode files and compiled files alongside the .py files rather than separating them out. Since pure Python packages can depend on compiled packages, putting the two together has a certain sort of logic; if a compiled package no longer loads because it's for a different Python release, your pure Python packages may no longer work.

(Python bytecode files aren't so tightly connected so some time ago Python moved them into a '__pycache__' subdirectory and gave them a Python version suffix, eg '<whatever>.cpython-312.pyc'. Since they're in a subdirectory, they'll get automatically removed if you remove the package itself.)

An additional issue is that even pure Python packages may not be completely compatible with a new version of Python (and often definitely not with a sufficiently old version). So updating to a new Python version may call for a package update as well, not just using the same version you currently have.

Although I don't like the current situation, I don't know what Python could do to make it significantly better. Putting .py files (ie, pure Python packages) into a version independent directory structure would work some of the time (perhaps a lot of the time if you only went forward in Python versions, never backward) but blow up at other times, sometimes in obvious ways (when a compiled package couldn't be imported) and sometimes in subtle ones (if a package wasn't compatible with the new version of Python).

(It would probably also not be backward compatible to existing tools.)

Abuse systems should handle email reports that use MIME message/rfc822 parts

By: cks

Today I had reason to report spam to Mailchimp (some of you are laughing already, I know). As I usually do, I forwarded the spam message we'd received to them as a message/rfc822 MIME part, with a prequel plain text part saying that it was spam. Forwarding email as a MIME message/rfc822 part is unambiguously the correct way to do so. It's in the MIME RFCs, if done properly (by the client) it automatically includes all headers, and because it's a proper MIME part, tools can recognize the forwarded email message, scan over just it, and so on.
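
Structurally there's nothing exotic about such a report; it's a two part MIME message along these lines (the boundary and wording are arbitrary):

Content-Type: multipart/mixed; boundary="report-parts"

--report-parts
Content-Type: text/plain; charset=us-ascii

The attached message, sent through your service, is spam.

--report-parts
Content-Type: message/rfc822

[the original spam message goes here, complete with all of its headers]

--report-parts--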

So of course Mailchimp sent me back an autoreply to the effect that they couldn't find any spam mail message in my report. They're not the only people who've replied this way, although sometimes the reply says "we couldn't handle this .eml attachment". So I had to re-forward the spam message in what I called literal plaintext format. This time around either some human or some piece of software found the information and maybe correctly interpreted it.

I think it's perfectly fine and maybe even praiseworthy when email abuse handling systems (and people) are willing to accept these literal plaintext format forwarded spam messages. The more formats you accept abuse reports in, the better. But every abuse handling system should accept MIME message/rfc822 format messages too, as a minimum thing. Not just because it's a standard, but also because it's what a certain amount of mail clients will produce by default if you ask them to forward a message. If you refuse to accept these messages, you're reducing the amount of abuse reports you'll accept, for arbitrary (but of course ostensibly convenient for you) reasons.

I know, I'm tilting at windmills. Mailchimp and all of the other big places doing this don't care one bit what I want and may or may not even do anything when I send them reports.

(I suspect that many places see reducing the number of 'valid' abuse reports they receive as a good thing, so the more hoops they can get away with and the more reports they can reject, the better. In theory this is self-defeating in the long run, but in practice that hasn't worked with the big offenders so far.)

Responsibility for university physical infrastructure can be complicated

By: cks

One of the perfectly sensible reactions to my entry on realizing that we needed two sorts of temperature alerts is to suggest that we directly monitor the air conditioners in our machine rooms, so that we don't have to try to assess how healthy they are from second hand, indirect sources like the temperature of the rooms. There are some practical problems, but a broader problem is that by and large they're not 'our' air conditioners. By this I mean that while the air conditioners and the entire building belongs to the university, neither 'belong' to my department and we can't really do stuff to them.

There are probably many companies who have some split between who's responsible for maintaining a building (and infrastructure things inside it) and who is currently occupying (parts of) the building, but my sense is that universities (or at least mine) take this to a more extreme level than usual. There's an entire (administrative) department that looks after buildings and other physical infrastructure, and they 'own' much of the insides of buildings, including the air conditioning units in our machine rooms (including the really old one). Because those air conditioners belong to the building and the people responsible for it, we can't go ahead and connect monitoring up to the AC units or tap into any native monitoring they might have.

(Since these aren't our AC units, we haven't even asked. Most of the AC units are old enough that they probably don't have any digital monitoring, and for the new units the manufacturer probably considers that an extra cost option. Nor can we particularly monitor their power consumption; these are industrial units, with dedicated high-power circuits that we're not even going to get near. Only university electricians are supposed to touch that sort of stuff.)

I believe that some parts of the university have a multi-level division of responsibility for things. One organization may 'own' the building, another 'owns' the network wiring in the walls and is responsible for fixing it if something goes wrong, and a third 'owns' the space (ie, gets to use it) and has responsibility for everything inside the rooms. Certainly there's a lot of wiring within buildings that is owned by specific departments or organizations; they paid to put it in (although possibly through shared conduits), and now they're the people who control what it can be used for.

(We have run a certain amount of our own fiber between building floors, for example. I believe that things can get complicated when it comes to renovating space for something, but this is fortunately not one of the areas we have to deal with; other people in the department look after that level of stuff.)

I've been inside the university for long enough that all of this feels completely normal to me, and it even feels like it makes sense. Within a university, who is using space is something that changes over time, not just within an academic department but also between departments. New buildings are built, old buildings are renovated, and people move around, so separating maintaining the buildings from who occupies them right now feels natural.

(In general, space is a constant struggle at universities.)

My approach to testing new versions of Exim for our mail servers

By: cks

When I wrote about how Exim's ${run ...} string expansion operator changed how it did quoting, I (sort of) mentioned that I found this when I tested a new version of Exim. Some people would do testing like this in a thorough, automated manner, but I don't go that far. Instead I have a written down test plan, with some resources set up for it in advance. Well, it's more accurate to say that I have test plans, because I have a separate test plan for each of our important mail servers because they have different features and so need different things tested.

In the beginning I simply tested all of the important features of a particular mail server by hand and from memory when I rebuilt it on a new version of Ubuntu. Eventually I got tired of having to reinvent my test process from scratch (or from vague notes) every time around (for each mail server), so I started writing it down. In the process of writing my test process down the natural set of things happened; I made it more thorough and systematic, and I set up various resources (like saved copies of the EICAR test file) to make testing more cut and paste. Having an organized, written down test plan, even as basic as ours is, has made it easier to test new builds of our Exim servers and made that testing more comprehensive.

I test most of our mail servers primarily by using swaks to send various bits of test email to them and then watching what happens (both in the swaks SMTP session and in the Exim logs). So a lot of the test plan is 'run this swaks command and ...', with various combinations of sending and receiving addresses, starting with the very most basic test of 'can it deliver from a valid dummy address to a valid dummy address'. To do some sorts of testing, such as DNS blocklist tests, I take advantage of the fact that all of the IP-based DNS blocklists we use include 127.0.0.2, so that part of the test plan is 'use swaks on the mail machine itself to connect from 127.0.0.2'.
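
The individual tests are mostly small variations on commands like these (the hostnames and addresses here are placeholders, not our real ones):

# the most basic test: a valid dummy sender to a valid dummy recipient
swaks --server newmail.test.example.com --from testfrom@example.com --to testto@example.com

# DNS blocklist handling: run on the mail machine itself so the
# connection comes from 127.0.0.2, which our IP-based DNS blocklists list
swaks --server 127.0.0.1 --local-interface 127.0.0.2 --from testfrom@example.com --to testto@example.com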

(Some of our mail servers can apply different filtering rules to different local addresses, so I have various pre-configured test addresses set up to make it easy to test that per-address filtering is working.)

The actual test plans are mostly a long list of 'run more or less this swaks command, pointing it at your test server, to test this thing, and you should see the following result'. This is pretty close to cut and paste, which makes it relatively easy and fast for me to run through.

One qualification is that these test plans aren't attempting to be an exhaustive check of everything we do in our Exim configurations. Instead, they're mostly about making sure that the basics work, like delivering straightforward email, and that Exim can interact properly with the outside world, such as talking to ClamAV and rspamd or running external programs (which also tests that the programs themselves work on the new Ubuntu version). Testing every corner of our configurations would be exhausting and my feeling is that it would generally be pointless. Exim is stable software and mostly doesn't change or break things from version to version.

(Part of this is pragmatic experience with Exim and knowledge of what our configuration does conditionally and what it checks all of the time. If Exim does a check all of the time and basic mail delivery works, we know we haven't run into, say, an issue with tainted data.)

The unusual way I end my X desktop sessions

By: cks

I use an eccentric X 'desktop' that is not really a desktop as such in the usual sense but instead a window manager and various programs that I run (as a sysadmin, there's a lot of terminal windows). One of the ways that my desktop is unusual is in how I exit from my X session. First, I don't use xdm or any other graphical login manager; instead I run my session through xinit. When you use an xinit based session, you give xinit a program or a script to run, and when the program exits, xinit terminates the X server and your session.

(If you gave xinit a shell script, whatever foreground program the script ended with was your keystone program.)

Traditionally, this keystone program for your X session was your window manager. At one level this makes a lot of sense; your window manager is basically the core of your X session anyway, so you might as well make quitting from it end the session. However, for a very long time I've used a do-nothing iconified xterm running a shell as my keystone program.

(If you look at FvwmIconMan's strip of terminal windows in my (2011) desktop tour, this is the iconified 'console-ex' window.)
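
In outline, and much simplified (the names here are illustrative, not my actual setup), the script I hand to xinit looks something like this:

#!/bin/sh
# X settings come first (xrdb, xsetroot, and so on)
xrdb -load $HOME/.Xresources

# start the window manager early and in the background, so it manages
# everything else the session starts
fvwm &

# ... start other long-running programs here ...

# the keystone: an iconified, otherwise unused xterm; when the shell
# inside it exits, xinit tears down the whole X session
exec xterm -iconic -name console-ex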

The minor advantage to having an otherwise unused xterm as my session keystone program is that I can start my window manager basically at the start of my (rather complex) session startup, so that I can immediately have it manage all of the other things I start (technically I run a number of commands to set up X settings before I start fvwm, but it's the first program I start that will actually show anything on the screen). The big advantage is that using something else as my keystone program means that I can kill and restart my window manager if something goes badly wrong, and more generally that I don't have to worry about restarting it. This doesn't happen very often, but when it does happen I'm very glad that I can recover my session instead of having to abruptly terminate everything. And should I have to terminate fvwm, this 'console' xterm is a convenient idle xterm in which to restart it (or in general, any other program of my session that needs restarting).

(The 'console' xterm is deliberately placed up at the top of the screen, in an area that I don't normally put non-fvwm windows in, so that if fvwm exits and everything de-iconifies, it's highly likely that this xterm will be visible so I can type into it. If I put it in an ordinary place, it might wind up covered up by a browser window or another xterm or whatever.)

I don't particularly have to use an (iconified) xterm with a shell in it; I could easily have written a little Tk program that displayed a button saying 'click me to exit'. However, the problem with such a program (and the advantage of my 'console' xterm) is that it would be all too easy to accidentally click the button (and force-end my session). With the iconified xterm, I need to do a bunch of steps to exit; I have to deiconify that xterm, focus the window, and Ctrl-D the shell to make it exit (causing the xterm to exit). This is enough out of the way that I don't think I've ever done it by accident.

PS: I believe modern desktop environments like GNOME, KDE, and Cinnamon have moved away from making their window manager be the keystone program and now use a dedicated session manager program that things talk to. One reason for this may be that modern desktop shells seem to be rather more prone to crashing for various reasons, which would be very inconvenient if that ended your session. This isn't all bad, at least if there's a standard D-Bus protocol for ending a session so that you can write an 'exit the session' thing that will work across environments.

Understanding reading all available things from a Go channel (with a timeout)

By: cks

Recently I saw this example Go code (via), and I had to stare at it a while in order to understand what it was doing and how it worked (and why it had to be that way). The goal of waitReadAll() is to either receive (read) all currently available items from a channel (possibly a buffered one) or to time out if nothing shows up in time. This requires two nested selects, with the inner one in a for loop.

The outer select has this form:

select {
  case v, ok := <- c:
    if !ok {
      return ...
    }
    [... inner code ...]

  case <-time.After(dur): // wants go 1.23+
    return ...
}

This is doing three things. First (and last in the code), it's timing out if the duration expires before anything is received on the channel. Second, it's returning right away if the channel is closed and empty; in this case the channel receive from c will succeed, but ok will be false. And finally, in the code I haven't put in, it has received the first real value from the channel and now it has to read the rest of them.

The job of the inner code is to receive any (additional) currently ready items from the channel but to give up if the channel is closed or when there are no more items. It has the following form (trimmed of the actual code to properly accumulate things and so on, see the playground for the full version):

.. setup elided ..
for {
  select {
    case v, ok := <- c:
      if ok {
        // accumulate values
      } else {
        // channel closed and empty
        return ...
      }
    default:
      // out of items
      return ...
  }
}

There's no timeout in this inner code because the 'default' case means that we never wait for the channel to be ready; either the channel is ready with another item (or it's been closed), or we give up immediately.
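
Putting the two pieces together, a complete version of the pattern looks something like the following. This is my own sketch, assuming a channel of ints and a 'values, ok' style return; the actual code in the playground link differs in its details:

package main

import (
    "fmt"
    "time"
)

// waitReadAll receives everything currently available on c, or gives up
// if nothing at all arrives within dur. The bool result is false if we
// timed out or the channel was closed before we received anything.
func waitReadAll(c <-chan int, dur time.Duration) ([]int, bool) {
    select {
    case v, ok := <-c:
        if !ok {
            // Channel closed and empty.
            return nil, false
        }
        vals := []int{v}
        for {
            select {
            case v, ok := <-c:
                if !ok {
                    // Closed and drained; keep what we have.
                    return vals, true
                }
                vals = append(vals, v)
            default:
                // Nothing more is ready right now.
                return vals, true
            }
        }
    case <-time.After(dur):
        return nil, false
    }
}

func main() {
    c := make(chan int, 4)
    c <- 1
    c <- 2
    fmt.Println(waitReadAll(c, time.Second))
}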

One of the reasons this Go code initially confused me is that I started out misreading it as receiving as much as it could from a channel until it reached a timeout. Code that did that would do a lot of the same things (obviously it needs a timeout and a select that has that as one of the cases), and you could structure it somewhat similarly to this code (although I think it's more clearly written without a nested loop).

(This is one of those entries that I write partly to better understand something myself. I had to read this code carefully to really grasp it and I found it easy to mis-read on first impression.)

Starting scripts with '#!/usr/bin/env <whatever>' is rarely useful

By: cks

In my entry on getting decent error reports in Bash for 'set -e', I said that even if you were on a system where /bin/sh was Bash and so my entry worked if you started your script with '#!/bin/sh', you should use '#!/bin/bash' instead for various reasons. A commentator took issue with this direct invocation of Bash and suggested '#!/usr/bin/env bash' instead. It's my view that using env this way, especially for Bash, is rarely useful and thus is almost always unnecessary and pointless (and sometimes dangerous).

The only reason to start your script with '#!/usr/bin/env <whatever>' is if you expect your script to run on a system where Bash or whatever else isn't where you expect (or when it has to run on systems that have '<whatever>' in different places, which is probably most common for third party packages). Broadly speaking this only happens if your script is portable and will run on many different sorts of systems. If your script is specific to your systems (and your systems are uniform), this is pointless; you know where Bash is and your systems aren't going to change it, not if they're sane. The same is true if you're targeting a specific Linux distribution, such as 'this is intrinsically an Ubuntu script'.

(In my case, the script I was doing this to is intrinsically specific to Ubuntu and our environment. It will never run on anything else.)

It's also worth noting that '#!/usr/bin/env <whatever>' only works if (the right version of) <whatever> can be found on your $PATH, and in fact the $PATH of every context where you will run the script (including, for example, from cron). If the system's default $PATH doesn't include the necessary directories, this will likely fail some of the time. This makes using 'env' especially dangerous in an environment where people may install their own version of interpreters like Python, because your script's use of 'env' may find their Python on their $PATH instead of the version that you expect.

(These days, one of the dangers with Python specifically is that people will have a $PATH that (currently) points to a virtual environment with some random selection of Python packages installed and not installed, instead of the system set of packages.)
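
As a concrete illustration of the $PATH issue (the crontab entry, paths, and script name here are all made up):

# cron runs jobs with a minimal default PATH, often just /usr/bin:/bin.
# If nightly-report starts with '#!/usr/bin/env python3', running it
# from here gets whatever python3 that minimal PATH finds (or nothing
# at all), not the /opt/ourtools/bin/python3 its author tested with.
# A direct '#!/usr/bin/python3' (or '#!/opt/ourtools/bin/python3')
# shebang pins the interpreter no matter what $PATH cron uses.
0 3 * * *   /our/scripts/nightly-report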

As a practical matter, pretty much every mainstream Linux distribution has a /bin/bash (assuming that you install Bash, and I'm sorry, Nix and so on aren't mainstream). If you're targeting Linux in general, assuming /bin/bash exists is entirely reasonable. If a Linux distribution relocates Bash, in my view the resulting problems are on them. A lot of the time, similar things apply for other interpreters, such as Python, Perl, Ruby, and so on. '#!/usr/bin/python3' on Linux is much more likely to get you a predictable Python environment than '#!/usr/bin/env python3', and if it fails it will be a clean and obvious failure that's easy to diagnose.

Another issue is that even if your script is fixed to use 'env' to run Bash, it may or may not work in such an alternate environment because other things you expect to find in $PATH may not be there. Unless you're actually testing on alternate environments (such as Nix or FreeBSD), using 'env' may suggest more portability than you're actually able to deliver.

My personal view is that for most people, '#!/usr/bin/env' is a reflexive carry-over that they inherited from a past era of multi-architecture Unix environments, when much less was shipped with the system and so was in predictable locations. In that past Unix era, using '#!/usr/bin/env python' was a reasonably sensible thing; you could hope that the person who wanted to run your script had Python, but you couldn't predict where. For most people, those days are over, especially for scripts and programs that are purely for your internal use and that you won't be distributing to the world (much less inviting people to run your 'written on X' script on a Y, such as a FreeBSD script being run on Linux).

The XLibre project is explicitly political and you may not like the politics

By: cks

A commentator on my 2024 entry on the uncertain possible futures of Unix graphical desktops brought up the XLibre project. XLibre is ostensibly a fork of the X server that will be developed by a new collection of people, which on the surface sounds unobjectionable and maybe a good thing for people (like me) who want X to keep being viable; as a result it has gotten a certain amount of publicity from credulous sources who don't look behind the curtain. Unfortunately for everyone, XLibre is an explicitly political project, and I don't mean that in the sense of disagreements about technical directions (the sense that you could say that 'forking is a political action', because it's the manifestation of a social disagreement). Instead I mean it in the regular sense of 'political', which is that the people involved in XLibre (especially its leader) have certain social values and policies that they espouse, and the XLibre project is explicitly manifesting some of them.

(Plus, a project cannot be divorced from the people involved in it.)

I am not going to summarize here; instead, you should read the Register article and its links, and also the relevant sections of Ariadne Conill's announcement of Wayback and their links. However, even if you "don't care" about politics, you should see this correction to earlier XLibre changes where the person making the earlier changes didn't understand what '2^16' did in C (it's bitwise XOR, not exponentiation, so it evaluates to 18 rather than 65536). (I would say that the people who reviewed the changes also missed it, but there didn't seem to be anyone doing so, which ought to raise your eyebrows when it comes to the X server.)

Using XLibre, shipping it as part of a distribution, or advocating for it is not a neutral choice. To do so is to align yourself, knowingly or unknowingly, with the politics of XLibre and with the politics of its leadership and the people its leadership will attract to the project. This is always true to some degree with any project, but it's especially true when the project is explicitly manifesting some of its leadership's values, out in the open. You can't detach XLibre from its leader.

My personal view is that I don't want to have anything to do with XLibre and I will think less of any Unix or Linux distribution that includes it, especially ones that intend to make it their primary X server. At a minimum, I feel those distributions haven't done their due diligence.

In general, my personal guess is that a new (forked) standalone X server is also the wrong approach to maintaining a working X server environment over the long term. Wayback combined with XWayland seems like a much more stable base because each of them has more support in various ways (eg, there are a lot of people who are going to want old X programs to keep working for years or decades to come and so lots of demand for most of XWayland's features).

(This elaborates on my comment on XLibre in this entry. I also think that a viable X based environment is far more likely to stop working due to important programs becoming Wayland-only than because you can no longer get a working X server.)

Some practical challenges of access management in 'IAM' systems

By: cks

Suppose that you have a shiny new IAM system, and you take the 'access management' part of it seriously. Global access management is (or should be) simple; if you disable or suspend someone in your IAM system, they should wind up disabled everywhere. Well, they will wind up unable to authenticate. If they have existing credentials that are used without checking with your IAM system (including things like 'an existing SSH login'), you'll need some system to propagate the information that someone has been disabled in your IAM to consumers, and to arrange that existing sessions, credentials, and so on get shut down and revoked.

(This system will involve both IAM software features and features in the software that uses the IAM to determine identity.)

However, this only covers global access management. You probably have some things that only certain people should have access to, or that treat certain people differently. This is where our experiences with a non-IAM environment suggest to me that things start getting complex. For pure access, the simplest thing probably is if every separate client system or application has a separate ID and directly talks to the IAM, and the IAM can tell it 'this person cannot authenticate (to you)' or 'this person is disabled (for you)'. This starts to go wrong if you ever put two or more services or applications behind the same IAM client ID, for example if you set up a web server for one application (with an ID) and then host another application on the same web server because of convenience (your web server is already there and already set up to talk to the IAM and so on).

This gets worse if there is a layer of indirection involved, so that systems and applications don't talk directly to your IAM but instead talk to, say, an LDAP server or a Radius server or whatever that's fed from your IAM (or that is the party that talks to your IAM). I suspect that this is one reason why IAM software has a tendency to directly support a lot of protocols for identity and authentication.

(One thing that's sort of an extra layer of indirection is what people are trying to do, since they may have access permission for some things but not others.)

Another approach is for your IAM to only manage what 'groups' people are in and provide that information to clients, leaving it up to clients to make access decisions based on group membership. On the one hand, this is somewhat more straightforward; on the other hand, your IAM system is no longer directly managing access. It has to count on clients doing the right thing with the group information it hands them. At a minimum this gives you much less central visibility into what your access management rules are.
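
As a minimal sketch of what this looks like from the client application's side (the group names, and the assumption that the IAM hands over a plain list of group names via OIDC claims, LDAP, or whatever, are invented for illustration):

# The application decides; 'groups' is whatever list of group names
# our IAM (or the thing in front of it) handed us for this person.
ALLOWED_GROUPS = {"sysadmins", "finance-app-users"}

def may_use_app(groups):
    """Return True if any of the person's groups grants access here."""
    return bool(ALLOWED_GROUPS & set(groups))

# may_use_app(["staff", "finance-app-users"]) -> True
# may_use_app(["staff"]) -> False

The IAM's only involvement is supplying the group list; whether this check exists at all, and which groups it honours, lives entirely in the application.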

People not infrequently want complicated access control conditions for individual applications (including things like privilege levels). In any sort of access management system, you need to be able to express these conditions in rules. There's no uniform approach or language for expressing access control conditions, so your IAM will use one, your Unix systems will use one (or more) that you probably get to craft by hand using PAM tricks, your web applications will use one or more depending on what they're written in, and so on and so forth. One of the reasons that these languages differ is that the capabilities and concepts of each system will differ; a mesh VPN has different access control concerns than a web application. Of course these differences make it challenging to handle all of their access management in one single spot in an IAM system, leaving you with the choice of either not being able to do everything you want to but having it all in the IAM or having partially distributed access management.

A change in how Exim's ${run ...} string expansion operator does quoting

By: cks

The Exim mail server has, among other features, a string expansion language with quite a number of expansion operators. One of those expansion operators is '${run}', which 'expands' by running a command and substituting in its output. As is commonly the case, ${run} is given the command to run and all of its command line arguments as a single string, without any explicit splitting into separate arguments:

${run {/some/command -a -b foo -c ...} [...]}

Any time a program does this, a very important question to ask is how this string is split up into separate arguments in order to be exec()'d. In Exim's case, the traditional answer is that it was rather complicated and not well documented, in a way that required you to explicitly quote many arguments that came from variables. In my entry on this I called Exim's then current behavior dangerous and wrong but also said it was probably too late to change it. Fortunately, the Exim developers did not heed my pessimism.

In Exim 4.96, this behavior of ${run} changed. To quote from the changelog:

The ${run} expansion item now expands its command string elements after splitting. Previously it was before; the new ordering makes handling zero-length arguments simpler. The old ordering can be obtained by appending a new option "preexpand", after a comma, to the "run".

(The new way is more or less the right way to do it, although it can create problems with some sorts of command string expansions.)

This is an important change because it's not backward compatible if you used deliberate quoting in your ${run} command strings. For example, if you ever expanded a potentially dangerous Exim variable in a ${run} command (for example, one that might have a space in it), you previously had to wrap it in ${quote}:

${run {/some/command \
         --subject ${quote:$header_subject:} ...

(As seen in my entry on our attachment type logging with Exim.)

In Exim 4.96 and later, this same ${run} string expansion will add spurious quote marks around the email message's Subject: header as your program sees it. This is because ${quote:...} will add them, since you asked it to generate a quoted version of its argument, and then ${run} won't strip them out as part of splitting the command string apart into arguments because the command string has already been split before the ${quote:} was done. What this shows is that you probably don't need explicit quoting in ${run} command strings any more, unless you're doing tricky expansions with string expressions (in which case you'll have to switch back to the old way of doing it).
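
For example, with 4.96 and later the straightforward version simply drops the ${quote:} (this is a sketch with a made-up program name and options, not our actual attachment logging command):

${run {/usr/local/sbin/log-subject --subject $h_subject: --from $sender_address}}

Because the command string is split into arguments before each element is expanded, a Subject: header containing spaces still reaches the program as a single argument, with no quoting needed.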

To be clear, I'm all for this change. It makes straightforward and innocent use of ${run} much safer and more reliable (and it plays better with Exim's new rules about 'tainted' strings from the outside world, such as the subject header). Having to remove my use of ${quote:...} is a minor price to pay, and learning this sort of stuff in advance is why I build test servers and have test plans.

(This elaborates on a Fediverse post of mine.)

My system administrator's view of IAM so far (from the outside)

By: cks

Over on the Fediverse I said something about IAM:

My IAM choices appear to be "bespoke giant monolith" or "DIY from a multitude of OSS pieces", and the natural way of life appears to be that you start with the latter because you don't think you need IAM and then you discover maybe you have to blow up the world to move to the first.

At work we are the latter: /etc/passwd to LDAP to a SAML/OIDC server depending on what generation of software and what needs. With no unified IM or AM, partly because no rules system for expressing it.

Identity and Access Management (IAM) isn't the same thing as (single sign on) authentication, although I believe it's connected to authorization if you take the 'Access' part seriously, and a bunch of IAM systems will also do some or all of authentication so that everything is in one place. However, all of these things can be separated, and in complex environments they are (for example, the university's overall IAM environment, also).

(If you have an IAM system you're presumably going to want to feed information from it to your authentication system, so that it knows who is (still) valid to authenticate and perhaps how.)

I believe that one thing that makes IAM systems complicated is interfacing with what could be called 'legacy systems', which in this context includes garden variety Unix systems. If you take your IAM system seriously, everything that knows about 'logins' or 'users' needs to somehow be drawing data from the IAM system, and the IAM system has to know how to provide each with the information it needs. Or alternately your legacy systems need to somehow merge local identity information (Unix home directories, UIDs, GIDs, etc) with the IAM information. Since people would like their IAM system to do it all, I think this is one driver of IAM system complexity and those bespoke giant monoliths that want to own everything in your environment.

(The reason to want your IAM system to do it all is that if it doesn't, you're building a bunch of local tools and then your IAM information is fragmented. What UID is this person on your Unix systems? Only your Unix systems know, not your central IAM database. For bonus points, the person might have different UIDs on different Unix systems, depending.)

If you start out with a green field new system, you can probably build in this central IAM from the start (assuming that you can find and operate IAM software that does what you want and doesn't make you back away in terror). But my impression is that central IAM systems are quite hard to set up, so the natural alternative is that you start without an IAM system and then are possibly faced with trying to pull all of your /etc/passwd, Apache authentication data, LDAP data, and so on into a new IAM system that is somehow going to take over the world. I have no idea how you'd pull off this transition, although presumably people have.

(In our case, we started our Unix systems well before IAM systems existed. There are accounts here that have existed since the 1980s, partly because professors and retired professors tend to stick around for a long time.)

The difficulty of moving our environment to anything like an IAM system leaves me looking at the whole thing from the outside. If we had to add an 'IAM system', it would likely be because something else we wanted to do needed to be fed data from some IAM system using some IAM protocol. The IAM system would probably not become the center of identity and access management, but just another thing that we pushed information into and updated information in.

Another thing V7 Unix gave us is environment variables

By: cks

Simon Tatham recently wondered "Why is PATH called PATH?". This made me wonder the closely related question of when environment variables appeared in Unix, and the answer is that the environment and environment variables appeared in V7 Unix as another of the things that made it so important to Unix history (also).

Up through V6, the exec family of system calls took two arguments, the path and the argument list; we can see this in both the V6 exec(2) manual page and the implementation of the system call in the kernel. As bonus trivia, it appears that the V6 exec() limited you to 510 characters of arguments (and probably V1 through V5 had a similarly low limit, but I haven't looked at their kernel code).

In V7, the exec(2) manual page now documents a possible third argument, and the kernel implementation is much more complex, plus there's an environ(5) manual page about it. Based on h/param.h, V7 also had a much higher size limit on the combined size of arguments and environment variables, which isn't all that surprising given the addition of the environment. Commands like login.c were updated to put some things into the new environment; login sets a default $PATH and a $HOME, for example, and environ(5) documents various other uses (which I haven't checked in the source code).

This implies that the V7 shell is where $PATH first appeared in Unix, where the manual page describes it as 'the search path for commands'. This might make you wonder how the V6 shell handled locating commands, and where it looked for them. The details are helpfully documented in the V6 shell manual page, and I'll just quote what it has to say:

If the first argument is the name of an executable file, it is invoked; otherwise the string `/bin/' is prepended to the argument. (In this way most standard commands, which reside in `/bin', are found.) If no such command is found, the string `/usr' is further prepended (to give `/usr/bin/command') and another attempt is made to execute the resulting file. (Certain lesser-used commands live in `/usr/bin'.)

('Invoked' here is carrying some extra freight, since this may not involve a direct kernel exec of the file. An executable file that the kernel didn't like would be directly run by the shell.)
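
In modern C (and emphatically not the actual V6 shell source), that lookup order amounts to something like this sketch:

#include <stdio.h>
#include <unistd.h>

/* Try the name as given, then with /bin/ prepended, then /usr/bin/.
   No $PATH is consulted because there is no $PATH; the locations are
   wired in. Each execv() returns only if it failed. */
static void v6_style_invoke(char *name, char *argv[])
{
    char buf[512];

    execv(name, argv);                                /* as given */
    snprintf(buf, sizeof(buf), "/bin/%s", name);
    execv(buf, argv);                                 /* /bin/<name> */
    snprintf(buf, sizeof(buf), "/usr/bin/%s", name);
    execv(buf, argv);                                 /* /usr/bin/<name> */
    /* if we get here, nothing could be executed */
}

int main(int argc, char **argv)
{
    if (argc > 1)
        v6_style_invoke(argv[1], &argv[1]);
    return 1;
}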

I suspect that '$PATH' was given such a short name (instead of a longer, more explicit one) simply as a matter of Unix style at the time. Pretty much everything in V7 was terse and short in this style for various reasons, and verbose environment variable names would have eaten into that limited exec argument space.

Python argparse and the minor problem of a variable valid argument count

By: cks

Argparse is the standard Python module for handling arguments to command line programs, and because Python makes using things outside the standard library quite annoying for small programs, it's the one I use in my Python based utility programs. Recently I found myself dealing with a little problem for which argparse doesn't have a good answer, partly because you can't nest argument groups.

Suppose, not hypothetically, that you have a program that can properly take zero, two, or three command line arguments (which are separate from options), and the command line arguments are of different types (the first is a string and the second two are numbers). Argparse makes it easy to handle having either two or three arguments, no more and no less; the first two arguments have no nargs set, and then the third sets 'nargs="?"'. However, as far as I can see argparse has no direct support for handling the zero-argument case, or rather for forbidding the one-argument one.

(If the first two arguments were of the same type we could easily gather them together into a two-element list with 'nargs=2', but they aren't, so we'd have to tell argparse that both are strings and then try the 'string to int' conversion of the second argument ourselves, losing argparse's handling of it.)

If you set all three arguments to 'nargs="?"' and give them usable default values, you can accept zero, two, or three arguments, and things will work if you supply only one argument (because the second argument will have a usable default). This is the solution I've adopted for my particular program because I'm not stubborn enough to try to roll my own validation on top of argparse, not for a little personal tool.
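
As a sketch of the shape this takes (the argument names, types, and defaults here are invented for illustration; my actual program's are different):

import argparse

def parse_args():
    p = argparse.ArgumentParser()
    # All three are optional; usable defaults make the zero-argument
    # (and, unavoidably, the one-argument) cases work.
    p.add_argument("host", nargs="?", default="localhost",
                   help="what to query")
    p.add_argument("port", nargs="?", type=int, default=80,
                   help="port number")
    p.add_argument("count", nargs="?", type=int, default=1,
                   help="how many queries to make")
    return p.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.host, args.port, args.count)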

If argparse supported nested groups for arguments, you could potentially make a mutually exclusive argument group that contained two sub-groups, one with nothing in it and one that handled the two and three argument case. This would require argparse not only to support nested groups but to support empty nested groups (and not ignore them), which is at least a little bit tricky.

Alternately, argparse could support a global specification of what numbers of arguments are valid. Or it could support a 'validation' callback that is called with information about what argparse detected and which could signal errors to argparse that argparse handled in its standard way, giving you uniform argument validation and error text and so on.

Unix had good reasons to evolve since V7 (and had to)

By: cks

There's a certain sort of person who feels that the platonic ideal of Unix is somewhere around Research Unix V7 and it's almost all been downhill since then (perhaps with the exception of further Research Unixes and then Plan 9, although very few people got their hands on any of them). For all that I like Unix and started using it long ago when it was simpler (although not as far back as V7), I reject this view and think it's completely mistaken.

V7 Unix was simple but it was also limited, both in its implementation (which often took shortcuts (also, also, also)) and in its overall features (such as short filenames). Obviously V7 didn't have networking, but it also lacked things that most people now think of as perfectly reasonable and good Unix features, like '#!' support for shell scripts in the kernel and processes being in multiple groups at once. That V7 was a simple and limited system meant that its choices were to grow to meet people's quite reasonable needs or to fall out of use.

(Some of these needs were for features and some of them were for performance. The original V7 filesystem was quite simple but also suffered from performance issues, ones that often got worse over time.)

I'll agree that the path that the growth of Unix has taken since V7 is not necessarily ideal; we can all point to various things about modern Unixes that we don't like. Any particular flaws came about partly because people don't necessarily make ideal decisions and partly because we haven't necessarily had perfect understandings of the problems when people had to do something, and then once they'd done something they were constrained by backward compatibility.

(In some ways Plan 9 represents 'Unix without the constraint of backward compatibility', and while I think there are a variety of reasons that it failed to catch on in the world, that lack of compatibility is one of them. Even if you had access to Plan 9, you had to be fairly dedicated to do your work in a Plan 9 environment (and that was before the web made it worse).)

PS: It's my view that the people who are pushing various Unixes forward aren't incompetent, stupid, or foolish. They're rational and talented people who are doing their best in the circumstances that they find themselves in. If you want to throw stones, don't throw them at the people, throw them at the overall environment that constrains and shapes how everything in this world is pushed to evolve. Unix is far from the only thing shaped in potentially undesirable ways by these forces; consider, for example, C++.

(It's also clear that a lot of people involved in the historical evolution of BSD and other Unixes were really quite smart, even if you don't like, for example, the BSD sockets API.)

Mostly stopping GNU Emacs from de-iconifying itself when it feels like it

By: cks

Over on the Fediverse I had a long standing GNU Emacs gripe:

I would rather like to make it so that GNU Emacs never un-iconifies itself when it completes (Lisp-level) actions. If I have Emacs iconified I want it to stay that way, not suddenly appear under my mouse cursor like an extremely large modal popup. (Modal popups suck, they are a relic of single-tasking windowing environments.)

For those of you who use GNU Emacs and have never been unlucky enough to experience this, if you start some long operation in GNU Emacs and then decide to iconify it to get it out of your face, a lot of the time GNU Emacs will abruptly pop itself back open when it finishes, generally with completely unpredictable timing so that it disrupts whatever else you switched to in the mean time.

(This only happens in some X environments. In others, the desktop or window manager ignores what Emacs is trying to do and leaves it minimized in your taskbar.)

To cut straight to the answer, you can avoid a lot of this with the following snippet of Emacs Lisp:

(add-to-list 'display-buffer-alist '(t nil (inhibit-switch-frame . t)))

I believe that this has some side effects, but they will generally be that Emacs doesn't yank around your mouse focus or suddenly raise windows to be on top of everything.

GNU Emacs doesn't have a specific function that it calls to de-iconify a frame, what Emacs calls a top level window. Instead, the deiconification happens in C code inside C-level functions like raise-frame and make-frame-visible, which also do other things and which are called from many places. For instance, one of make-frame-visible's jobs is actually displaying the frame's X level window if it doesn't already exist on the screen.

(There's an iconify-or-deiconify-frame function but if you look that's a Lisp function that calls make-frame-visible. It's only used a little bit in the Emacs Lisp code base.)

A determined person could probably hook these C-level functions through advice-add to make them do nothing if they were called on an existing, mapped frame that was just iconified. That would be the elegant way to do what I want. The inelegant way is to discover, via use of the Emacs Lisp debugger, that everything I seem to care about is going through 'display-buffer' (eventually calling window--maybe-raise-frame), and that display-buffer's behavior can be customized to not 'switch frames', which will wind up causing things to not call window--maybe-raise-frame and not de-iconify GNU Emacs windows on me.

To understand display-buffer-alist I relied on Demystifying Emacs’s Window Manager. My addition to display-buffer-alist has three elements:

  • the t tells display-buffer to always use this alist entry.
  • the nil tells display-buffer that I don't have any special action functions I want to use here and it should just use its regular ones. I think an empty list might be more proper here, but nil works.
  • the '(inhibit-switch-frame . t)' sets the important customization, which will be merged with any other things set by other (matching) alist entries.

The net effect is that 'display-buffer' will see 'inhibit-switch-frame' set for every buffer it's asked to switch to, and so will not de-iconify, raise, or otherwise monkey around with frame things in the process of displaying buffers. It's possible that this will have undesirable side effects in some circumstances, but as far as I can tell things like 'speedbar' and 'C-x 5 <whatever>' still work for me afterward, so new frames are getting created when I want them to be.

(I could change the initial 't' to something more complex, for example to only apply this to MH-E buffers, which is where I mostly encounter the problem. See Demystifying Emacs’s Window Manager for a discussion of how to do this based on the major mode of the buffer.)
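
For example, something like the following would scope the setting to MH-E folder buffers. This is a sketch I haven't adopted myself, and it assumes a recent Emacs (29 or later) where the condition can be a buffer-match-p style form:

(add-to-list 'display-buffer-alist
             '((derived-mode . mh-folder-mode)
               nil
               (inhibit-switch-frame . t)))

This would confine the 'don't switch frames' behavior to buffers whose major mode derives from mh-folder-mode, leaving display-buffer's normal behavior alone everywhere else.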

To see if you're affected by this, you can run the following Emacs Lisp in the scratch buffer and then immediately minimize or iconify the window.

(progn
  (sleep-for 5)
  (display-buffer "*scratch*"))

If you're affected, the Emacs window will pop back open in a few seconds (five or less, depending on how fast you minimized the window). If the Emacs window stays minimized or iconified, your desktop environment is probably overriding whatever Emacs is trying to do.

For me this generally happens any time some piece of Emacs Lisp code is taking a long time to get a buffer ready for display and then calls 'display-buffer' at the end to show the buffer. One trigger for this is if the buffer to be displayed contains a bunch of unusual Unicode characters (possibly ones that my font doesn't have anything for). The first time the characters are used, Emacs will apparently stall working out how to render them and then de-iconify itself if I've iconified it out of impatience.

(It's quite possible that there's a better way to do this, and if so I'd love to know about it.)

Sending drawing commands to your display server versus sending images

By: cks

One of the differences between X and Wayland is that in the classical version of X you send drawing commands to the server while in Wayland you send images; this can be called server side rendering versus client side rendering. Client side rendering doesn't preclude a 'network transparent' display protocol, but it does mean that you're shipping around images instead of drawing commands. Is this less efficient? In thinking about it recently, I realized that the answer is that it depends on a number of things.

Let's start out by assuming that the display server and the display clients are equally powerful and capable as far as rendering the graphics goes, so the only question is where the rendering happens (and what makes it better to do it in one place instead of another). The factors that I can think of are:

  • How many different active client (machines) there are; if there are enough, the active client machines have more aggregate rendering capacity than the server does. But probably you don't usually have all that many different clients all doing rendering at once (that would be a very busy display).

  • The number of drawing commands as compared to the size of the rendered result. In an extreme case in favor of client side rendering, a client executes a whole bunch of drawing commands in order to render a relatively small image (or window, or etc). In an extreme case the other way, a client can send only a few drawing commands to render a large image area.
  • The amount of input data the drawing commands need compared to the output size of the rendered result. An extreme case in favour of client side rendering is if the client is compositing together a (large) stack of things to produce a single rendered result.
  • How efficiently you can encode (and decode) the rendered result or the drawing commands (and their inputs). There's a tradeoff of space used to encoding and decoding time, where you may not be able to afford aggressive encoding because it gets in the way of fast updates.

    What these add up to is the aggregate size of the drawing commands and all of the inputs that they need relative to the rendered result, possibly cleverly encoded on both sides.

  • How much changes from frame to frame and how easily you can encode that in some compact form. Encoding changes in images is a well studied thing (we call it 'video'), but a drawing command model might be able to send only a few commands to change a little bit of what it sent previously for an even bigger saving.

    (This is affected by how a server side rendering server holds the information from clients. Does it execute their draw commands then only retain the final result, as X does, or does it hold their draw commands and re-execute them whenever it needs to re-render things? Let's assume it holds the rendered result, so you can draw over it with new drawing commands rather than having to send a new full set of 'draw this from now onward' commands.)

    A pragmatic advantage of client side rendering is that encoding image to image changes can be implemented generically after any style of rendering; all you need is to retain a copy of the previous frame (or perhaps more frames than that, depending). In a server rendering model, the client needs specific support for determining a set of drawing operations to 'patch' the previous result, and this doesn't necessarily cooperate with an immediate mode approach where the client regenerates the entire set of draw commands from scratch any time it needs to re-render a frame.

I was going to say that the network speed is important too but while it matters, what I think it does is magnify or shrink the effect of the relative size of drawing commands compared to the final result. The faster and lower latency your network is, the less it matters if you ship more data in aggregate. On a slow network, it's much more important.

There's probably other things I'm missing, but even with just these I've wound up feeling that the tradeoffs are not as simple and obvious as I believed before I started thinking about it.

(This was sparked by an offhand Fediverse remark and joke.)

Getting decent error reports in Bash when you're using 'set -e'

By: cks

Suppose that you have a shell script that's not necessarily complex but is at least long. For reliability, you use 'set -e' so that the script will immediately stop on any unexpected errors from commands, and sometimes this happens. Since this isn't supposed to happen, it would be nice to print some useful information about what went wrong, such as where it happened, what the failing command's exit status was, and what the command was. The good news is that if you're willing to make your script specifically a Bash script, you can do this quite easily.

The Bash trick you need is:

trap 'echo "Exit status $? at line $LINENO from: $BASH_COMMAND"' ERR

This uses three Bash features: the special '$LINENO' and '$BASH_COMMAND' variables (which give you the line number and the command that was being executed when the trap fired), and the special 'ERR' Bash 'trap' condition that causes your 'trap' statement to be invoked right when 'set -e' is causing your script to fail and exit.

Using 'ERR' instead of 'EXIT' (or '0' if you're a traditionalist like me) is necessary in order to get the correct line number in Bash. If you switch this to 'trap ... EXIT', the line number that Bash will report is the line that the 'trap' was defined on, not the line that the failing command is on (although the command being executed remains the same). This makes a certain amount of sense from the right angle; the shell is currently on that line as it's exiting.
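
Put together, a skeleton of such a script looks like this (the commands are just placeholders):

#!/bin/bash
set -e
trap 'echo "Exit status $? at line $LINENO from: $BASH_COMMAND"' ERR

echo "starting up"
false          # stands in for a command failing unexpectedly
echo "never reached"

Running this prints something like 'Exit status 1 at line 6 from: false' and then exits because of the 'set -e'.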

As far as I know, no other version of the Bourne shell can do all of this. The OpenBSD version of /bin/sh has a '$LINENO' variable and 'trap ... 0' preserves its value (instead of resetting it to the line of the 'trap'), but it has no access to the current command. The FreeBSD version of /bin/sh resets '$LINENO' to the line of your 'trap ... 0', so the best you can do is report the exit status. Dash, the Ubuntu 24.04 default /bin/sh, doesn't have '$LINENO', effectively putting you in the same situation as FreeBSD.

(On Fedora, /bin/sh is Bash, and the Fedora version of Bash supports all of 'trap .. ERR', $LINENO, and $BASH_COMMAND even when invoked as '#!/bin/sh' by your script. You probably shouldn't count on this; if you want Bash, use '#!/bin/bash'.)

NFS v4 delegations on a Linux NFS server can act as mandatory locks

By: cks

Over on the Fediverse, I shared an unhappy learning experience:

Linux kernel NFS: we don't have mandatory locks.
Also Linux kernel NFS: if the server has delegated a file to a NFS client that's now not responding, good luck writing to the file from any other machine. Your writes will hang.

NFS v4 delegations are a feature where the NFS server, such as your Linux fileserver, hands a lot of authority over a particular file to a client that is using that file. There are various sorts of delegations, but even a basic read delegation will force the NFS server to recall the delegation if anything else wants to write to the file or to remove it. Recalling a delegation requires notifying the NFS v4 client that it has lost the delegation and then having the client accept and respond to that. NFS v4 clients have to respond to the loss of a delegation because they may be holding local state that needs to be flushed back to the NFS server before the delegation can be released.

(After all the NFS v4 server promised the client 'this file is yours to fiddle around with, I will consult you before touching it'.)

Under some circumstances, when the NFS v4 server is unable to contact the NFS v4 client, it will simply sit there waiting and as part of that will not allow you to do things that require the delegation to be released. I don't know if there's a delegation recall timeout, although I suspect that there is, and I don't know how to find out what the timeout is, but whatever the value is, it's substantial (it may be the 90 second 'default lease time' from nfsd4_init_leases_net(), or perhaps the 'grace', also probably 90 seconds, or perhaps the two added together).

(90 seconds is not what I consider a tolerable amount of time for my editor to completely freeze when I tell it to write out a new version of the file. When NFS is involved, I will typically assume that something has gone badly wrong well before then.)

As mentioned, the NFS v4 RFC also explicitly notes that NFS v4 clients may have to flush file state in order to release their delegation, and this itself may take some time. So even without an unavailable client machine, recalling a delegation may stall for some possibly arbitrary amount of time (depending on how the NFS v4 server behaves; the RFC encourages NFS v4 servers to not be hasty if the client seems to be making a good faith effort to clear its state). Both the slow client recall and the hung client recall can happen even in the absence of any actual file locks; in my case, the now-unavailable client merely having read from the file was enough to block things.

This blocking recall is effectively a mandatory lock, and it affects both remote operations over NFS and local operations on the fileserver itself. Short of waiting out whatever timeout applies, you have two realistic choices to deal with this (the non-realistic choice is to reboot the fileserver). First, you can bring the NFS client back to life, or at least something that's at its IP address and responds to the server with NFS v4 errors. Second, I believe you can force everything from the client to expire through /proc/fs/nfsd/clients/<ID>, by writing 'expire' to the client's 'ctl' file. You can find the right client ID by grep'ing for something in all of the clients/*/info files.
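
In concrete terms, the second option looks roughly like this (the IP address and the client ID '42' are made up; check your own fileserver's info files):

# Find which client ID belongs to the unresponsive machine:
grep -l '128.100.3.50' /proc/fs/nfsd/clients/*/info
# Then force-expire all of its state, delegations included:
echo expire > /proc/fs/nfsd/clients/42/ctl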

Discovering this makes me somewhat more inclined than before to consider entirely disabling 'leases', the underlying kernel feature that is used to implement these NFS v4 delegations (I discovered how to do this when investigating NFS v4 client locks on the server). This will also affect local processes on the fileserver, but that now feels like a feature since hung NFS v4 delegation recalls will stall or stop even local operations.
