Suppose, not entirely hypothetically, that you've made local
changes to an Ubuntu package using dgit
and now Ubuntu has come out with an update to that package that you
want to switch to, with your local changes still on top. Back when
I wrote about moving local changes to a new Ubuntu release with
dgit, I wrote an appendix with a theory
of how to do this, based on a conversation. Now that I've
actually done this, I've discovered that there is a minor variation
and I'm going to write it down explicitly (with additional notes
because I forgot some things between then
and now).
I'll assume we're starting from an existing dgit based repository
with a full setup of local changes,
including an updated debian/changelog. Our first step, for safety,
is to make a branch to capture the current state of our repository.
I suggest you name this branch after the current upstream package
version that you're on top of, for example if the current upstream
version you're adding local changes to can be summarized as
'ubuntu2.6':
git branch cslab-2.6
Making a branch allows you to use 'git diff cslab-2.6..' later to
see exactly what changed between your versions. A useful thing to
do here is to exclude the 'debian/' directory from diffs, which can
be done with 'git diff cslab-2.6.. -- . :!debian', although
your shell may require you to quote the '!' (cf).
Then we need to use dgit to fetch
the upstream updates:
dgit fetch -d ubuntu
We need to use '-d ubuntu', at least in current versions of dgit,
or 'dgit fetch' gets confused and fails. At this point we have the
updated upstream in the remote tracking branch
'dgit/dgit/jammy,-security,-updates' but our local tree is still
not updated.
(All of dgit's remote tracking branches start with 'dgit/dgit/',
while all of its local branches start with just 'dgit/'. This is
less than optimal for my clarity.)
Normally you would now rebase to shift your local changes on top
of the new upstream, but we don't want to immediately do that. The
problem is that our top commit is our own dgit-based change to
debian/changelog, and we don't want to rebase that commit; instead
we'll make a new version of it after we rebase our real local
changes. So our first step is to discard our top commit:
git reset --hard HEAD~
(In my original theory I didn't realize
we had to drop this commit before the rebase, not after, because
otherwise things get confused. At a minimum, you wind up with
debian/changelog out of order, and I don't know if dropping your
HEAD commit after the rebase works right. It's possible you might
get debian/changelog rebase conflicts as well, so I feel dropping
your debian/changelog change before the rebase is cleaner.)
Now we can rebase, for which the simpler two-argument form does
work (but not plain rebasing,
or at least I didn't bother testing plain rebasing):
git rebase dgit/dgit/jammy,-security,-updates dgit/jammy,-security,-updates
(If you are wondering how this command possibly works, as I was
part way through writing this entry, note that the first branch is
'dgit/dgit/...', ie our remote tracking branch, and then second
branch is 'dgit/...', our local branch with our changes on it.)
At this point we should have all of our local changes stacked on top
of the upstream changes, but no debian/changelog entry for them that
will bump the package version. We create that with:
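(The exact invocation depends on the local versioning convention you adopted originally; the '+cslab' suffix and the message here are just illustrative.)
dch --local +cslab "Rebuild with our local changes on top of the new upstream update"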
Then we can build with 'dpkg-buildpackage -uc -b', and afterward
do 'git clean -xdf; git reset --hard' to reset your tree back to
its pristine state.
(My view is that while you can prepare a source package for your
work if you want to, the 'source' artifact you really want to save
is your dgit VCS repository. This will be (much) less bulky when
you clean it up to get rid of all of the stuff (to be polite) that
dpkg-buildpackage leaves behind.)
Suppose, not entirely hypothetically, that you've traditionally
used /etc/resolv.conf on your Ubuntu servers but you're considering
switching to systemd-resolved, partly for fast failover if your
normal primary DNS server is unavailable
and partly because it feels increasingly dangerous not to, since
resolved is the normal configuration and what software is likely
to expect. One of the ways that resolv.conf is nice is that you can
set the configuration by simply copying a single file that isn't
used for anything else. On Ubuntu, this is unfortunately not the
case for systemd-resolved.
Canonical expects you to operate all of your Ubuntu server networking
through Canonical Netplan. In reality,
Netplan will render things down to a systemd-networkd configuration,
which has some important effects and
creates some limitations. Part of
that rendered networkd configuration is your DNS resolution settings,
and the natural effect of this is that they have to be associated
with some interface, because that's the resolved model of the
world. This means that Netplan specifically attaches DNS server information to specific network interfaces in your Netplan configuration. As a result, you must find the
specific device name and then modify settings within it, and those
settings are intermingled (in the same file) with settings you can't
touch.
Netplan does not give you a good way to do this from scripts; if anything, Netplan goes out of its way to not do so. For example, Netplan can dump its
full or partial configuration, but it does so in YAML form with no
option for JSON (which you could readily search through in a script
with jq). However, if you want to modify the Netplan YAML without
editing it by hand, 'netplan set' sometimes requires JSON as input.
Lack of any good way to search or query Netplan's YAML matters
because for things like DNS settings, you need to know the right
interface name. Without support for this in Netplan, you wind
up doing hacks to try to get the right interface name.
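For example, one hack is to ask iproute2 rather than Netplan which interface carries the default route, and feed that back into 'netplan set' (a sketch; the DNS server address here is made up):
IFACE="$(ip -j route show default | jq -r '.[0].dev')"
netplan set "network.ethernets.$IFACE.nameservers.addresses=[128.100.1.1]"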
Netplan also doesn't provide you any good way to remove settings.
The current Ubuntu 26.04 beta installer writes a Netplan configuration
that locks your interfaces to specific MAC addresses:
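The relevant piece of the installer's YAML looks something like this (the MAC address and interface name here are made up, and I've left out the addressing details):
network:
  version: 2
  ethernets:
    enp1s0:
      match:
        macaddress: "52:54:00:12:34:56"
      set-name: "enp1s0"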
This is rather undesirable if you may someday swap network cards
or transplant server disks from one chassis to another, so we would
like to automatically take it out. Netplan provides no support for
this; 'netplan set' can't be given a blank replacement, for example
(and 'netplan set "network.ethernets.enp1s0.match={}"' doesn't
do anything). If Netplan would give you all of the enp1s0 block in
JSON format, maybe you could edit the JSON and replace the whole
thing, but that's not available so far.
(For extra complication you also need to delete the set-name, which
is only valid with a 'match:'.)
Another effect of not being able to delete things in scripts is
that you can't write scripts that move things out to a different
Netplan .yaml file that has only your settings for what you care
about. If you could reliably get the right interface name and you
could delete DNS settings from the file the installer wrote, you
could fairly readily create a '/etc/netplan/60-resolv.yaml' file
that was something close to a drop-in /etc/resolv.conf. But as it
is, you can't readily do that.
There are all sorts of modifications you might want to make through
a script, such as automatically configuring a known set of VLANs to attach them to whatever the appropriate
host interface is. Scripts are good for automation and they're also
good for avoiding errors, especially if you're doing repetitive
things with slight differences (such as setting up a dozen VLANs
on your DHCP server). Netplan fights you almost all the way about
doing anything like this.
My best guess is that all of Canonical's uses of Netplan either use
internal tooling that reuses Netplan's (C) API or simply
re-write Netplan files from scratch (based on, for example, cloud
provider configuration information).
(To save other people the time, the netplan Python package on
PyPI seems to be a third party
package and was last updated in 2019. Which is a pity, because it
theoretically has a quite useful command line tool.)
One bleakly amusing thing I've found out through using 'netplan
set' on Ubuntu 26.04 is that the Ubuntu server installer and Netplan
itself have slightly different views on how Netplan files should
be written. The original installer version of the above didn't have
the quotes around the strings; 'netplan set' added them.
(All of this would be better if there was a widely agreed on,
generally shipped YAML equivalent of 'jq', or better yet something
that could also modify YAML in place as well as query it in forms
that were useful for automation. But the 'jq for YAML' ecosystem
appears to be fragmented at best.)
The other day I wrote about a brute force approach to mapping
IPv4 /24 subnets to Autonomous System Numbers (ASNs), where I built a big, somewhat
sparse file of four-byte records, with the record for each /24 at
a fixed byte position determined by its first three octets (so
0.0.0.0/24's ASN, if any, is at byte 0, 0.0.1.0/24 is at byte 4,
and so on). My initial approach was to open, lseek(), and read()
to access the data; in a comment, Aristotle Pagaltzis wondered if mmap() would perform better.
The short answer is that for my specific case I think it would be
worse, but the issue is interesting to talk about.
(In general, my view is that you should use mmap() primarily if it
makes the code cleaner and simpler. Using mmap() for performance
is a potentially fraught endeavour that you need to benchmark.)
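To make the comparison concrete, here is roughly what the two lookup approaches look like, sketched in modern Python rather than my actual Python 2 code, and assuming the records are big-endian 32-bit ASNs:
import mmap
import struct

def rec_offset(ip):
    # each /24 has a fixed 4-byte slot indexed by its first three octets
    a, b, c, _ = (int(x) for x in ip.split("."))
    return ((a << 16) | (b << 8) | c) * 4

def asn_read(path, ip):
    # lookup with open(), lseek(), and read()
    with open(path, "rb") as f:
        f.seek(rec_offset(ip))
        return struct.unpack(">I", f.read(4))[0]

def asn_mmap(path, ip):
    # lookup with open(), mmap(), and touching one page
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        try:
            off = rec_offset(ip)
            return struct.unpack(">I", m[off:off + 4])[0]
        finally:
            m.close()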
In my case I have two strikes against mmap() likely being a performance
advantage: I'm working in Python (and specifically Python 2) so I
can't really directly use the mmap()'d memory, and I'm normally
only making a single lookup in the typical case (because my program
is running as a CGI). In the non-mmap() case I expect to do an
open(), an lseek(), and a read() (which will trigger the kernel
possibly reading from disk and then definitely copying data to me).
In the mmap() case I would do open(), mmap(), and then access some
page, triggering possible kernel IO and then causing the kernel to
manipulate process memory mappings to map the page into my address
space. In general, it seems unlikely that mmap() plus the page
access handling will be cheaper than lseek() plus read().
(In both the mmap() and read() cases I expect two transitions into
and out of the kernel. As far as I know, lseek() is a cheap system
call (and certainly it seems unlikely to be more expensive than
mmap(), which has to do a bunch of internal kernel work), and the
extra work the read() does to copy data from the kernel to user
space is probably no more work than the kernel manipulating page
tables, and could be less.)
If I was doing more lookups in a single process, I could possibly
win with the mmap() approach but it's not certain. A lot depends
on how often I would be looking up something on an already mapped
page and how expensive mapping in a new page is compared to some
number of lseek() plus read() system calls (or pread() system calls
if I had access to that, which cuts the number of system calls in
half). In some scenarios, such as a burst of traffic from the same
network or a closely related set of networks, I could see a high
hit rate on already mapped pages. In others, the IPv4 addresses are
basically random and widely distributed, so many lookups would
require mapping new pages.
(Using mmap() makes it unnecessary to keep my own in-process cache,
but I don't think it really changes what the kernel will cache for
me. Both read()'ing from pages and accessing them through mmap()
keeps them recently used.)
Things would also be better in a language where I could easily make
zero-copy use of data right out of the mmap()'d pages themselves.
Python is not such a language, and I believe that basically any
access to the mmap()'d data is going to create new objects and copy
some bytes around. I expect that this results in as many intermediate
objects and so on as if I used Python's read() stuff.
(Of course if I really cared there's no substitute for actually
benchmarking some code. I don't care that much, and the code is
simpler with the regular IO approach because I have to use the
regular IO approach when writing the data file.)
As far as I know, virt-manager and virsh don't directly allow you
to switch a virtual machine between BIOS and UEFI after it's been
created, partly because the result is probably not going to boot
(unless you deliberately set up the OS inside the VM with both an
EFI boot and a BIOS MBR boot environment). Within virt-manager, you
can only select BIOS or UEFI at setup time, so you have to destroy
your virtual machine and recreate it. This works, but it's a bit
annoying.
(On the other hand, if you've had some virtual machines sitting
around for years and years, you might want to refresh all of their
settings anyway.)
It's possible to change between BIOS and UEFI by directly editing
the libvirt XML to transform the <os> node.
You may want to remove any old snapshots first because I don't know
what happens if you revert from a 'changed to UEFI' machine to a
snapshot where your virtual machine was a BIOS one. In my view, the
easiest way to get the necessary XML is to create (or recreate)
another virtual machine with UEFI, and then dump and copy its XML
with some minor alterations.
For me, on Fedora with the latest libvirt and company, the <os>
XML of a BIOS booting machine is:
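(The machine type here is just representative; yours will differ.)
<os>
  <type arch="x86_64" machine="pc-q35-9.2">hvm</type>
  <boot dev="hd"/>
</os>
and the UEFI version is something like this (the firmware paths and nvram details are a sketch and will vary with your libvirt and edk2 versions):
<os firmware="efi">
  <type arch="x86_64" machine="pc-q35-9.2">hvm</type>
  <firmware>
    <feature enabled="no" name="enrolled-keys"/>
    <feature enabled="no" name="secure-boot"/>
  </firmware>
  <loader readonly="yes" type="pflash">/usr/share/edk2/ovmf/OVMF_CODE.fd</loader>
  <nvram template="/usr/share/edk2/ovmf/OVMF_VARS.fd">/var/lib/libvirt/qemu/nvram/[machine-name]_VARS.fd</nvram>
  <boot dev="hd"/>
</os>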
Here the '[machine-name]' bit is the libvirt name of my virtual
machine, such as 'vmguest1'. This nvram file doesn't have to exist
in advance; libvirt will create it the first time you start up the
virtual machine. I believe it's used to provide snapshots of the
UEFI variables and so on to go with snapshots of your physical disks
and snapshots of the virtual machine configuration.
(This feature may have landed in libvirt 10.10.0, if I'm
reading release notes correctly. Certainly reading the release
notes suggests that I don't want to use anything before then
with UEFI snapshots.)
Manually changing the XML on one of my scratch machines has worked
fine to switch it from BIOS MBR to UEFI booting as far as I can
tell, but I carefully cleared all of its disk state and removed all
of its snapshots before I tried this. I suspect that I could switch
it back to BIOS if I wanted to. Over time, I'll probably change
over all of my as yet unchanged scratch virtual machines to UEFI
through direct XML editing, because it's the less annoying approach
for me. Now that I've looked this up, I'll probably do it through
'virsh edit ...' rather than virt-manager, because that way I get
my real editor.
(This is the kind of entry I write for my future use because I
don't want to have to re-derive this stuff.)
Today I made an unpleasant discovery about virt-manager on my (still)
Fedora 42 machines that I shared on the Fediverse:
This is my face that Fedora virt-manager appears to have been
defaulting to external snapshots for some time and SURPRISE, external
snapshots can't be reverted by virsh. This is my face, especially as
it seems to have completely screwed up even deleting snapshots on some
virtual machines.
(I only discovered this today because today is the first time I
tried to touch such a snapshot, either to revert to it or to clean
it up. It's possible that there is some hidden default for what
sort of snapshot to make and it's only been flipped for me.)
Neither virt-manager nor virsh will clearly tell you about this.
In virt-manager you need to click on each snapshot and if it says
'external disk only', congratulations, you're in trouble. In virsh,
'virsh snapshot-list --external <vm>' will list external snapshots,
and then 'virsh snapshot-list --tree <vm>' will tell you if they
depend on any internal snapshots.
My largest problems came from virtual machines where I had earlier
internal snapshots and then I took more snapshots, which became
external snapshots from Fedora 41 onward. You definitely can't
revert to an external snapshot in this situation, at least not
with virsh or virt-manager, and the error messages I got were
generic ones about not being able to revert external snapshots.
I haven't tested reverting external snapshots for a VM with no
internal ones.
Update: you can revert an external snapshot in the latest libvirt
if all of your snapshots are external. You can't revert them if
libvirt helpfully gave you external snapshots on top of internal
ones by switching the default type of snapshots (probably in Fedora
41).
If you have internal snapshots and you're willing to throw away the
external snapshot and what's built on it, you can use virsh or
virt-manager to revert to an internal snapshot and then delete the
external snapshot. This leaves the external snapshot's additional
disk file or files dangling around for you to delete by hand.
If you have only an external snapshot, it appears that libvirt will
let you delete the snapshot through 'virsh snapshot-delete <vm>
<external-snapshot>', which preserves the current state of the
machine's disks. This only helps if you don't want the snapshot any
more, but this is one of my common cases (where I take precautionary
snapshots before significant operations and then get rid of them
later when I'm satisfied, or at least committed).
The worst situation appears to be if you have an external snapshot
made after (and thus on top of) an earlier internal snapshot and
you want to keep the live state of things while getting rid of the
snapshots. As far as I can tell, it's impossible to do this through
libvirt, although some of the documentation suggests that you should
be able to. The process outlined in libvirt's Merging disk image
chains
didn't work for me (see also Disk image chains).
(If it worked, this operation would implicitly invalidate the
snapshots and I don't know how you get rid of them inside libvirt,
since you can't delete them normally. I suspect that to get rid of
them, you need to shut down all of the libvirt daemons and then
delete the XML files that (on Fedora) you'll find in
/var/lib/libvirt/qemu/snapshot/<domain>.)
One reason to delete external snapshots you don't need is if you
ever want to be able to easily revert snapshots in the future. I
wouldn't trust making internal snapshots on top of external ones,
if libvirt even lets you, so if you want to be able to easily revert,
it currently appears that you need to have and use only internal
snapshots. Certainly you can't mix new external snapshots with old
internal snapshots, as I've seen.
(The 5.1.0 virt-manager release
will warn you to not mix snapshot modes and defaults to whatever
snapshot mode you're already using. I don't know what it defaults
to if you don't have any snapshots; I haven't tried that yet.)
Sidebar: Cleaning this up on the most tangled virtual machine
$ virsh snapshot-delete hl-fedora-36 fedora41-preupgrade
error: Failed to delete snapshot fedora41-preupgrade
error: Operation not supported: deleting external snapshot that has internal snapshot as parent not supported
This VM has an internal snapshot as the parent because I didn't
clean up the first snapshot (taken before a Fedora 41 upgrade)
before making the second one (taken before a Fedora 42 upgrade).
In theory one can use 'virsh blockcommit' to reduce everything down
to a single file, per the knowledge base section on this.
In practice it doesn't work in this situation:
$ virsh blockcommit hl-fedora-36 vda --verbose --pivot --active
error: invalid argument: could not find base image in chain for 'vda'
(I tried with --base too and that didn't help.)
I was going to attribute this to the internal snapshot but then I
tried 'virsh blockcommit' on another virtual machine with only an
external snapshot and it failed too. So I have no idea how this is
supposed to work.
Since I could take a ZFS snapshot of the entire disk storage, I
chose violence, which is to say direct usage of qemu-img. First,
I determined that I couldn't trivially delete the internal snapshot
before I did anything else:
$ qemu-img snapshot -d fedora40-preupgrade fedora35.fedora41-preupgrade
qemu-img: Could not delete snapshot 'fedora40-preupgrade': snapshot not found
The internal snapshot is in the underlying file 'fedora35.qcow2'.
Maybe I could have deleted it safely even with an external thing
sitting on top of it, but I decided not to do that yet and proceed
to the main show:
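folding the external overlay back into its backing file (a sketch of the shape of the operation; the overlay file is the one from the qemu-img error above):
$ qemu-img commit fedora35.fedora41-preupgrade
The commit merges the overlay's changes down into fedora35.qcow2, after which the overlay file itself is no longer needed.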
Using 'qemu-img info fedora35.qcow2' showed that the internal snapshot
was still there, so I removed it with 'qemu-img snapshot -d' (this time
on fedora35.qcow2).
All of this left libvirt's XML drastically out of step with the
underlying disk situation. So I removed the XML for the snapshots
(after saving a copy), made sure all libvirt services weren't
running, and manually edited the VM's XML, where it turned out that
all I needed to change was the name of the disk file. This appears
to have worked fine.
I suspect that I could have skipped manually removing the internal
snapshot and its XML and libvirt would then have been happy to see
it and remove it.
(I'm writing all of the commands and results down partly for my
future reference.)
Traditionally, Wayland compositors have taken on the role of the
window manager as well, but this is not in fact a necessary step to
solve the architectural problems with X11. Although, I do not know
for sure why the original Wayland authors chose to combine the window
manager and Wayland compositor, I assume it was simply the path of
least resistance. [...]
Unfortunately, I believe that there are excellent reasons to put
the window manager into the display server the way Wayland has, and
the Wayland people (who were also X people) were quite familiar
with them and how X has had problems over the years because of its
split.
One large and more or less core problem is that event handling is
deeply entwined with window management. As an example, consider
this sequence of (input) events:
1. Your mouse starts out over one window. You type some characters.
2. You move your mouse over to a second window. You type some more characters.
3. You click a mouse button without moving the mouse.
4. You type more characters.
Your window manager is extremely involved in the decisions about
where all of those input events go and whether the second window
receives a mouse button click event in the third step. If the window
manager is separate from whatever is handling input events, then either some things trigger synchronous delays in further event handling, or sufficiently fast typeahead and actions are in a race with the window manager: either it updates where future events should go fast enough, or some of your typing and other actions are misdirected to the wrong place because the window manager is lagging.
Embedding the window manager in the display server is the simple
and obvious approach to ensuring that the window manager can see
and react to all events without lag, and can freely intercept and
modify all events as it wishes without clients having to care.
The window manager can even do this using extremely local knowledge
if it wants. Do you want your window manager to have key bindings
that only apply to browser windows, where the same keys are passed
through to other programs? An embedded window manager can easily
do that (let's assume it can reliably identify browser windows).
(An outdated example of how complicated you can make mouse
button bindings, never mind keyboard bindings, is my mouse
button bindings in fvwm.)
X has a collection of mechanisms that try to allow window managers
to manage 'focus' (which window receives keyboard input), intercept
(some) keys at a window manager level, and do other things that
modify or intercept events. The whole system is complex, imperfect,
and limited, and a variety of these mechanisms have weird side
effects on the X events that regular programs receive; you can often
see this with a program such as xev. Historically, not all X
programs have coped gracefully with all of the interceptions that
window managers like fvwm can do.
X's mechanisms also impose limits on what they'll allow a window
manager to do. One famous example is that in X, mouse scroll wheel
events always go to the X window under the mouse cursor. Even if your window manager uses
'click (a window) to make it take input', mouse scroll wheel input
is special and cannot be directed to a window this way. In Wayland,
a full server has no such limitations; its window manager portion
can direct all events, including mouse scroll wheels, to wherever
it feels like.
Approximately all RPM packages are signed by GPG keys (or maybe
they're supposed to be called PGP keys), which your system stores
in the RPM database as pseudo-packages (because why not). If your
Fedora install has been around long enough, as mine have, you will
have accumulated a drift of old keys and sometimes you either want
to clean them up or something unfortunate will happen to one of
those keys (I'll get to one such case).
One basic command to see your collection of GPG keys in the RPM
database is (taken from this gist):
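rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n'
(Or some close variant of that query format; the important parts are the package name, which is what you'd give to 'rpm -e', and the summary, which tells you whose key it is.)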
On some systems this will give you a nice short list of keys. On others,
your list may be very long.
Since Fedora 42 (cf), DNF
has functionality (I believe more or less built in) that should
offer to remove old GPG keys that have actually expired. This is
in the 'expired PGP keys plugin'
which comes from the 'libdnf5-plugin-expired-pgp-keys' package if you don't
have it installed (with a brief manpage that's called
'libdnf5-expired-pgp-keys'). I believe there was a similar DNF4
plugin. However, there are two situations where this seems to not
work correctly.
The first situation is now-obsolete GPG keys that haven't expired
yet, for various reasons; these may be for past versions of Fedora,
for example. These days, the metadata for every DNF repository you
use should list a URL for its GPG keys (see the various .repo files
in /etc/yum.repos.d/ and look for the 'gpgkey=' lines). So one
way to clean up obsolete keys is to fetch all of the current keys
for all of your current repositories (or at least the enabled ones),
and then remove anything you have that isn't on that list. This
process is automated for you by the 'clean-rpm-gpg-pubkey' command
and package, which is mentioned in some Fedora upgrade instructions.
This will generally clean out most of your obsolete keys, although
rare people will have keys that are so old that it chokes on them.
The second situation is apparently a repository operator who is
sufficiently clever to have re-issued an expired key using the same
key ID and fingerprint but a new expiry date in the future; this
fools RPM and related tools and everything chokes. This is unfortunate,
since it will often stall all DNF updates unless you disable the
repo. One repository operator who has done this is Google, for their
Fedora Chrome repository. To fix this you'll have to manually
remove the relevant GPG key or keys. Once you've used
clean-rpm-gpg-pubkey to reduce your list of GPG keys to a reasonable
level, you can use the RPM command I showed above to list all your
remaining keys, spot the likely key or keys (based on who owns it,
for example), and then use 'rpm -e --allmatches
gpg-pubkey-d38b4796-570c8cd3' (or some other appropriate gpg-pubkey
name) to manually scrub out the GPG key. Doing a DNF operation
such as installing or upgrading a package from the repository should
then re-import the current key.
(This also means that it's theoretically harmless to overshoot and
remove the wrong key, because it will be fetched back the next time
you need it.)
(When I wrote my Fediverse post about discovering clean-rpm-gpg-pubkey, I apparently
thought I would remember it without further prompting. This was
wrong, and in fact I didn't even remember to use it when I upgraded
my home desktop. This time it will hopefully stick, and if not, I
have it written down here where it will probably be easier to find.)
We've been operating Ubuntu servers for a long time and for most
of that time we've booted them through traditional MBR BIOS boots.
Initially it was entirely through MBR and then later it was still
mostly through MBR (somewhat depending on who installed a particular
server; my co-workers are more tolerant of UEFI than I am). But
when we built the 24.04 version of our customized install media,
my co-worker wound up making it UEFI only, and so for the past two
years all of our 24.04 machines have been UEFI (with us switching
BIOSes on old servers into UEFI mode as we updated them). The
headline news is that it's gone okay, more or less as you'd expect
and hope by now.
All of our servers have mirrored system disks, and the one UEFI
thing we haven't really had to deal with so far is fixing Ubuntu's
UEFI boot disk redundancy stuff after one disk fails. I think we know how to do it in
theory but we haven't had to go through it in practice. It will
probably work out okay but it does make me a bit nervous, along
with the related issue that the Ubuntu installer makes it hard to
be consistent about which disk your '/boot/efi' filesystem comes
from.
(In the installer, /boot/efi winds up on the first disk that you
set as the boot device, but the disks aren't always presented in
order so you can do this on 'the first disk' in the installer and
discover that the first disk it listed was /dev/sdb.)
The Ubuntu 24.04 default bootloader is GRUB, so that's what we've
wound up with even though as a UEFI-only environment we could in
theory use simpler ones, such as systemd-boot.
I'm not particularly enthused about GRUB but in practice it does
what we want, which is to reliably boot our servers, and it has the
huge benefit that it's actively supported by Ubuntu (okay, Canonical)
so they're going to make sure it works right, including with their
UEFI disk redundancy stuff. If Ubuntu
switches default UEFI bootloaders in their server installs, I expect
we'll follow along.
(I don't know if Canonical has any plans to switch away from GRUB
to something else. I suspect that they'll stick with GRUB for as
long as they support MBR booting, which I suspect will be a while,
especially as people look more and more likely to hold on to old
hardware for much longer than normally expected.)
PS: One reason I'm writing this down is that I've been unenthused
about UEFI for a long time, so I'm not sure I would have predicted
our lack of troubles in advance. So I'm going to admit it, UEFI has
been actually okay. And in its favour, UEFI has regularized some
things that used to be pretty odd in the MBR BIOS era.
(I'm still not happy about the UEFI non-story around redundant
system disks, but I've accepted that hacks like the Ubuntu approach are the best we're going to get. I
don't know what distributions such as Fedora are doing here; my
Fedora machines are MBR based and staying that way until the hardware
gets replaced, which on current trends won't be any time soon.)
The other day I covered how I think systemd's IPAddressAllow and
IPAddressDeny restrictions work, which unfortunately can only be limited to specific (local) ports if you set up the sockets for those ports in a separate systemd.socket
unit. Naturally this raises the question of whether there is a good,
scalable way to restrict access to specific ports in eBPF that
systemd (or other interested parties) could use. I think the answer
is yes, so here is a sketch of how I think you'd do this.
We care about a 'scalable' way to do this because systemd generates and installs its eBPF programs on the fly. Since tcpdump
can do this sort of cross-port matching, we could write an eBPF
program that did it directly. But such a program could get complex
if we were matching a bunch of things, and that complexity might
make it hard to generate on the fly (or at least make it complex
enough that systemd and other programs didn't want to). So we'd
like a way that still allows you to generate a simple eBPF program.
Systemd uses cgroup socket SKB eBPF programs, which
attach to a cgroup and filter all network packets on ingress or
egress. As far as I can understand from staring at code, these are
implemented by extracting the IPv4 or IPv6 address of the other side from the SKB and then querying what eBPF calls an LPM (Longest
Prefix Match) map. The
normal way to use an LPM map is to use the CIDR
prefix length and the start of the CIDR network as the key (for
individual IPv4 addresses, the prefix length is 32), and then match
against them, so this is what systemd's cgroup program does. This
is a nicely scalable way to handle the problem; the eBPF program
itself is basically constant, and you have a couple of eBPF maps
(for the allow and deny sides) that systemd populates with the
relevant information from IPAddressAllow and IPAddressDeny.
However, there's nothing in eBPF that requires the keys to be just
CIDR prefixes plus IP addresses. An LPM map key
has to start with a 32-bit prefix, but the size of the rest of the
key can vary. This means that we can make our keys be 16 bits longer
and stick the port number in front of the IP address (and increase
the CIDR prefix size appropriately). So to match packets to port
22 from 128.100.0.0/16, your key would be (u32) 32 for the prefix
length then something like 0x00 0x16 0x80 0x64 0x00 0x00 (if I'm
doing the math and understanding the structure right). When you
query this LPM map, you put the appropriate port number in front
of the IP address.
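As a concrete illustration of the key layout, here's the idea sketched in Python (this is just my sketch, not anything systemd does, and the byte ordering of the prefix length field is something you'd have to check against the kernel's actual LPM key structure):
import struct

def port_lpm_key(port, cidr):
    # key data: a 16-bit port in front of the IPv4 address bytes,
    # with the u32 prefix length extended to cover the port bits
    net, _, plen = cidr.partition("/")
    a, b, c, d = (int(x) for x in net.split("."))
    prefixlen = 16 + int(plen)
    return struct.pack("=I", prefixlen) + struct.pack(">H4B", port, a, b, c, d)

# port 22, 128.100.0.0/16: prefix length 32, data 00 16 80 64 00 00
print(port_lpm_key(22, "128.100.0.0/16").hex())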
This does mean that each separate port with a separate set of IP
address restrictions needs its own set of map entries. If you wanted
a set of ports to all have a common set of restrictions, you could
use a normally structured LPM map and a second plain hash map where the
keys are port numbers. Then you check the port and the IP address
separately, rather than trying to combine them in one lookup. And
there are more complex schemes if you need them.
Which scheme you'd use depends on how you expect port based access
restrictions to be used. Do you expect several different ports,
each with its own set of IP access restrictions (or only one port)?
Then my first scheme is only a minor change from systemd's current
setup, and it's easy to extend it to general IP address controls
as well (just use a port number of zero to mean 'this applies to
all ports'). If you expect sets of ports to all use a common set
of IP access controls, or several sets of ports with different
restrictions for each set, then you might want a scheme with more
maps.
(In theory you could write this eBPF program and set up these maps
yourself, then use systemd resource control features to attach
them to your .service unit.
In practice, at that point you probably should write host firewall
rules instead, it's likely to be simpler. But see this blog post and the
related VCS repository,
although that uses a more hard-coded approach.)
I recently wrote about things that make me so attached to xterm. One of those things is xterm's ziconbeep
feature, which causes xterm to visibly
and perhaps audibly react when it's iconified or minimized and gets
output. A commentator suggested that this feature should ideally
be done in the window manager, where it could be more general.
Unfortunately we can't do the equivalent of ziconbeep in the window
manager, or at least we can't do all of it.
A window manager can sound an audible alert when a specific type
of window changes its title in a certain way. This would give us
the 'beep' part of ziconbeep in a general way, although we're
treading toward a programmable window manager. But then, Gnome Shell
now does a lot of stuff in JavaScript and its extensions are written
in JS and the whole thing doesn't usually blow up. So we've got
prior art for writing an extension that reacts to window title
changes and does stuff.
What the window manager can't really do is reliably detect when the
window has new output, in order to trigger any beeping and change
the visible window title. As far as I know, neither X nor Wayland
give you particularly good visibility into whether the program is
rendering things, and in some ways of building GUIs, you're always
drawing things. In theory, a program might
opt to detect that it's been minimized and isn't visible and so not
render any updates at all (although it will be tracking what to
draw for when it's no longer minimized), but in practice I think this is
unfashionable because it gets in the way of various sorts of live
previews of minimized windows (where you want the window's drawing
surface to reflect its current state).
Another limitation of this as a general window manager feature is
that the window manager doesn't know what changes in the appearance
of a window are semantically meaningful and which ones are happening
because, for example, you just changed some font preference and the
program is picking up on that. Only the program itself knows what's
semantically meaningful enough to signal for people's attention.
A terminal program can have a simple definition but other programs
don't necessarily; your mail client might decide that only certain
sorts of new email should trigger a discreet 'pay attention to me'
marker.
(Even in a terminal program you might want more control over this
than xterm gives you. For example, you might want the terminal
program to not trigger 'zicon' stuff for text output but instead
to do it when the running program finishes and you return to the
shell prompt. This is best done by being able to signal the terminal
program through escape sequences.)
Among the systemd resource controls
are IPAddressAllow= and IPAddressDeny=,
which allow you to limit what IP addresses your systemd thing can
interact with. This is implemented with eBPF.
A limitation of these as applied to systemd .service units is that
they restrict all traffic, both inbound connections and things your
service initiates (like, say, DNS lookups), while you may want
only a simple inbound connection filter.
However, you can also set these on systemd.socket
units. If you do, your IP address restrictions apply only to the socket (or
sockets), not to the service unit that it starts. To quote the
documentation:
Note that for socket-activated services, the IP access list configured
on the socket unit applies to all sockets associated with it directly,
but not to any sockets created by the ultimately activated services
for it.
So if you have a systemd socket activated service, you can control
who can access the socket without restricting who the service itself
can talk to.
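For example, a drop-in for ssh.socket could look like this (the file name is just my choice and the network is made up):
# /etc/systemd/system/ssh.socket.d/access.conf
[Socket]
IPAddressDeny=any
IPAddressAllow=128.100.0.0/16 localhost
With this, only the listed addresses can connect to the SSH listening socket, while sshd itself can still talk to anything.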
In general, systemd IP access controls are done through eBPF programs
set up on cgroups. If you set up IP access controls on a socket,
such as ssh.socket in Ubuntu 24.04, you do get such eBPF programs
attached to the ssh.socket cgroup (and there is a ssh.socket cgroup,
perhaps because of the eBPF programs):
# pwd
/sys/fs/cgroup/system.slice
# bpftool cgroup list ssh.socket
ID AttachType AttachFlags Name
12 cgroup_inet_ingress multi sd_fw_ingress
11 cgroup_inet_egress multi sd_fw_egress
However, if you look there are no processes or threads in the
ssh.socket cgroup, which is not really surprising but also means
there is nothing there for these eBPF programs to apply to. And if
you dump the eBPF program itself (with 'bpftool prog dump xlated id 12'), it doesn't really look like it checks for the port number.
What I think must be going on is that the eBPF filtering program
is connected to the SSH socket itself. Since I can't find any
relevant looking uses in the systemd code of the `SO_ATTACH_*'
BPF related options from socket(7) (which
would be used with setsockopt(2) to
directly attach programs to a socket), I assume that what happens
is that if you create or perhaps start using a socket within a
cgroup, that socket gets tied to the cgroup and its eBPF programs,
and this attachment stays when the socket is passed to another
program in a different cgroup.
(I don't know if there's any way to see what eBPF programs are
attached to a socket or a file descriptor for a socket.)
If this is what's going on, it unfortunately means that there's no
way to extend this feature of socket units to get per-port IP
access control in .service units. Systemd
isn't writing special eBPF filter programs for socket units that
only apply to those exact ports, which you could in theory reuse
for a service unit; instead, it's arranging to connect (only)
specific sockets to its general, broad IP access control eBPF
programs. Programs that make their own listening sockets won't be
doing anything to get eBPF programs attached to them (and only
them), so we're out of luck.
(One could experiment with relocating programs between cgroups,
with the initial cgroup in which the program creates its listening
sockets restricted and the other not, but I will leave that up to
interested parties.)
I've said before in various contexts (eg)
that I'm very attached to the venerable xterm as my terminal
(emulator) program, and I'm not looking forward to the day that I
may have to migrate away from it due to Wayland (although I probably
can keep running it under XWayland, now that I think about it). But
I've never tried to write down a list of the things that make me
so attached to it over other alternatives like urxvt, much less
more standard ones like gnome-terminal. Today I'm going to try to
do that, although my list is probably going to be incomplete.
The ability to turn off all terminal colours, because they
often don't work in my preferred terminal colours. Other terminal programs have somewhat
different and sometimes less annoying colours, but it's still far too easy for programs to display things in unreadable colours.
Yes, I can set my shell environment and many programs to not use
colours, but I can't set all of them; some modern programs simply
always use colours on terminals. Xterm can be set to completely
ignore them.
I'm very used to xterm's specific behavior when it comes to what
is a 'word' for double-click selection. You can read the full
details in the xterm manual page's section on character classes.
I'm not sure if it's possible to fully emulate this behavior in other
terminal programs; I once made an incomplete attempt in urxvt, while gnome-terminal is quite different and has little
or no options for customizing that behavior (in the Gnome way).
Generally the modern double click selection behavior is too broad for
me.
(For instance, I'm extremely attached to double-click selecting
only individual directories in full paths, rather than the entire
thing. I can always swipe to select an entire path, but if I
can't pick out individual path elements with a double click my
only choice is character by character selection, which is a giant
pain.)
Based on a quick experiment, I think I can make KDE's konsole
behave more or less the way I want by clearing out its entire set
of "Word characters" in profiles. I think this isn't quite how
xterm behaves but it's probably close enough for my reflexes.
Xterm doesn't treat text specially because of its contents, for
example by underlining URLs or worse, hijacking clicks on them to
do things. I already have well evolved systems for dealing with
things like URLs and I don't want my terminal emulator to provide
any 'help'. I believe that KDE's konsole can turn this off, but
gnome-terminal doesn't seem to have any option for it.
Many of xterm's behaviors can be controlled from command line
switches. Some other terminal emulators (like gnome-terminal)
force you to bundle these behaviors together as 'profiles' and
only let you select a profile. Similarly, a lot of xterm's behavior
can be temporarily changed on the fly through its context menus,
without having to change the profile's settings (and then change
them back).
Every xterm window is a completely separate program that starts
from scratch, and xterm is happy to run on remote servers without
complications; this isn't something I can say for all other
competitors. Starting from scratch also means things like not
deciding to place yourself where your last window was, which is
konsole's behavior (and infuriates me).
Of these, the hardest two to duplicate are probably xterm's double
click selection behavior of what is a word and xterm's large selection
behavior. The latter is hard because it requires the terminal program
to not use mouse button 3 for a popup menu.
I use some other xterm features, like key binding,
including duplicating windows, but I
could live without them, especially if the alternate terminal program
directly supports modern cut and paste
in addition to xterm's traditional style. And I'm accustomed to a
few of xterm's special control characters,
especially Ctrl-space, but I think this may be pretty universally
supported by now (Ctrl-space is in gnome-terminal).
There are probably things that other terminal programs like konsole,
gnome-terminal and so on do that I don't want them to (and that
xterm doesn't). But since I don't use anything other than xterm
(and a bit of gnome-terminal and once in a while a bit of urxvt),
I don't know what those undesired features are. Experimenting with
konsole for this entry taught me some things I definitely don't
want, such as it automatically placing itself where it was before
(including placing a new konsole window on top of one of the existing
ones, if you have multiple ones).
One of the famous things that people run into with the Bourne shell
is that it draws a distinction between plain shell variables and
special exported shell variables, which are put into the environment
of processes started by the shell. This distinction is a source of
frustration when you set a variable, run a program, and the program
doesn't have the variable available to it:
$ GODEBUG=...
$ go-program
[doesn't see your $GODEBUG setting]
It's also a source of mysterious failures, because more or less all
of the environment variables that are present automatically become
exported shell variables. So whether or not 'GODEBUG=..; echo
running program; go-program' works can depend on whether $GODEBUG
was already set when your shell started. The environment variables
of regular shell sessions are usually fairly predictable, but the
environment variables present when shell scripts get run can be
much more varied. This makes it easy to write a shell script that
only works right for you, because in your environment it runs with
certain environment variables set and so they automatically become
exported shell variables.
I've told you all of that because despite these pains, I believe
that the Bourne shell made the right choice here, in addition to a
pragmatically necessary choice at the time it was created, in V7
(Research) Unix. So let's start with the pragmatics.
The Bourne shell was created alongside environment variables
themselves, and on the comparatively
small machines that V7 ran on, you didn't have much room for the
combination of program arguments and the new environment. If either
grew too big, you got 'argument list too long'
when you tried to run programs. This made it important to minimize
and control the size of the environment that the shell gave to new
processes. If you want to do that without limiting the use of shell
variables so much, a split between plain shell variables and exported
ones makes sense and requires only a minor bit of syntax (in the
form of 'export').
Both machines and exec() size limits are much larger now, so you
might think that getting rid of the distinction is a good thing.
The Bell Labs Research Unix people thought so, so they did do this
in Tom Duff's rc shell for V10 Unix and Plan 9. Having used both
the Bourne shell and a version of rc for
many years, I both agree and disagree with them.
For interactive use, having no distinction between shell variables
and exported shell variables is generally great. If I set $GODEBUG,
$PYTHONPATH, or any number of any other environment variables that
I want to affect programs I run, I don't have to remember to do a
special 'export' dance; it just works. This is a sufficiently
nice (and obvious) thing that it's an option for the POSIX 'sh',
in the form of 'set -a'
(and this set option is present in more or less all modern Bourne
shells, including Bash).
('Set -a' wasn't in the V7 sh, but
I haven't looked to see where it came from. I suspect that it may
have come from ksh, since POSIX took a lot of the specification for
their 'sh' from ksh.)
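For example, with 'set -a' in effect, the failing sequence from the start of this entry works:
$ set -a
$ GODEBUG=...
$ go-program
[now sees your $GODEBUG setting]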
For shell scripting, however, not having a distinction is messy and
sometimes painful. If I write an rc script, every shell variable
that I use to keep track of something will leak into the environment
of programs that I run. The shell variables for intermediate results,
the shell variables for command line options, the shell variables
used for for loops, you name it, it all winds up in the environment
unless I go well out of my way to painfully scrub them all out. For
shell scripts, it's quite useful to have the Bourne shell's strong
distinction between ordinary shell variables, which are local to
your script, and exported shell variables, which you deliberately
act to make available to programs.
(This comes up for shell scripts and not for interactive use because
you commonly use a lot more shell variables in shell scripts than
you do in interactive sessions.)
For a new Unix shell today that's made primarily or almost entirely
for interactive use, automatically exporting shell variables into
the environment is probably the right choice. If you wanted to be
slightly more selective, you could make it so that shell variables
with upper case names are automatically exported and everything
else can be manually exported. But for a shell that's aimed at
scripting, you want to be able to control and limit variable scope,
only exporting things that you explicitly want to.
Well, it does (as far as I can tell, without deep testing). If you
want to limit how much of the system's memory people who log in can
use so that system services don't explode, you can set MemoryMin=
on system.slice to guarantee some amount of memory to it and all
things under it. Alternately, you can set MemoryMax=
on user.slice, collectively limiting all user sessions to that
amount of memory. In either case my view is that you might want to
set MemorySwapMax=
on user.slice so that user sessions don't spend all of their time
swapping. Which one you set things on depends on which is easier
and you trust more; my inclination is MemoryMax, although that
means you need to dynamically size it depending on this machine's
total memory.
(If you want to limit user memory use you'll need to make sure that
things like user cron jobs are forced into user sessions, rather than running under cron.service in
system.slice.)
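As a sketch of the second approach, a drop-in for user.slice might look like this (the file name is my own choice and the sizes are made up; you'd size them to the machine):
# /etc/systemd/system/user.slice.d/90-memory-limits.conf
[Slice]
MemoryMax=24G
MemorySwapMax=2G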
Of course this is what you should expect, given systemd's documentation
and the kernel documentation.
On the other hand, the Linux kernel cgroup and memory system is
sufficiently opaque and ever changing that I feel the need to verify
that things actually do work (in our environment) as I expect them
to. Sometimes there are surprises, or settings that nominally work
but don't really affect things the way I expect.
This does raise the question of how much memory you want to reserve
for the system. It would be nice if you could use systemd-cgtop
to see how much memory your system.slice is currently using, but
unfortunately the number it will show is potentially misleadingly
high. This is because the memory attributed to any cgroup includes
(much) more than program RAM usage.
For example, on our servers it seems
typical for system.slice to be using under a gigabyte of 'user' RAM
but also several gigabytes of filesystem cache and other kernel
memory. You probably want to allow for some of that in what memory
you reserve for system.slice, but maybe not all of the current
usage.
(You can get the current version of the 'memdu' program I use
as memdu.py.)
Ah yes, GNOME, it is of course my mistake that I used gconf-editor
instead of dconf-editor. But at least now Gnome-Terminal no longer
intercepts F11, so I can possibly use g-t to enter F11 into serial
consoles to get the attention of a BIOS. If everything works in UEFI
land.
Gnome has had at least two settings systems, GSettings/dconf (also) and the older GConf. If you're using a modern
Gnome program, especially a standard Gnome program like gnome-terminal,
it will use GSettings and you will want to use dconf-editor
to modify its settings outside of whatever Preferences dialogs it
gives you (or doesn't give you). You can also use the gsettings or dconf programs from the command
line.
If the program you're using hasn't been updated to the latest things
that Gnome is doing, for example Thunderbird (at least as of 2024), then it will
still be using GConf. You need to edit its settings using
gconf-editor or gconftool-2, or possibly you'll need to look
at the GConf version of general Gnome settings. I don't know if
there's anything in Gnome that synchronizes general Gnome GSettings
settings into GConf settings for programs that haven't yet been
updated.
(This is relevant for programs, like Thunderbird, that use
general Gnome settings for things like 'how to open a particular
sort of thing'. Although I think modern Gnome may not have very
many settings for this because it always goes to the GTK GIO
system, based on the Arch Wiki's page
on Default Applications.)
Because I've made this mistake between gconf-editor and dconf-editor
more than once, I've now created a personal gconf-editor cover script
that prints an explanation of the situation when I run it without a
special --really argument. Hopefully this will keep me sorted out the
next time I run gconf-editor instead of dconf-editor.
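(The cover script itself is nothing special; it's roughly the following, sitting earlier in my $PATH than the real binary, whose location I'm assuming here:)
#!/bin/sh
# remind me that gconf-editor is the old GConf system, not GSettings/dconf
if [ "$1" != "--really" ]; then
    echo "gconf-editor edits the old GConf settings; you probably want dconf-editor." 1>&2
    echo "Use 'gconf-editor --really' if you actually mean GConf." 1>&2
    exit 1
fi
shift
exec /usr/bin/gconf-editor "$@"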
PS: Probably I want to use gsettings instead of dconf-editor and
dconf as much as possible, since gsettings works through the
GSettings layer and so apparently has more safety checks than
dconf-editor and dconf do.
For reasons outside of the scope of this entry, I want to test how
various systemd memory resource limits
work and interact with each other (which means that I'm really
digging into cgroup v2 memory controls).
When I started trying to do this, it turned out that I had no good
test program (or programs), although I had some that gave me
partial answers.
There are two complexities in memory usage testing programs in a
cgroups environment. First, you may be able to allocate more memory
than you can actually use, depending on your system's settings for
strict overcommit. So it's not enough to see
how much memory you can allocate using the mechanism of your choice
(I tend to use mmap() rather than
go through language allocators). After you've either determined how
much memory you can allocate or allocated your target amount, you
have to at least force the kernel to materialize your memory by
writing something to every page of it. Since the kernel can probably
swap out some amount of your memory, you may need to keep repeatedly
reading all of it.
The second issue is that if you're not in strict overcommit (and
sometimes even if you are), the kernel
can let you allocate more memory than you can actually use and then, when you try to use it, hit you with the OOM killer. For my testing, I
care about the actual usable amount of memory, not how much memory
I can allocate, so I need to deal with this somehow (and this is
where my current test programs are inadequate). Since the OOM killer
can't be caught by a process (that's sort of the point), the simple
approach is probably to have my test program progressively report
on how much memory it's touched so far, so I can see how far it got
before it was OOM-killed. A more complex approach would be to do
the testing in a child process with progress reports back to the
parent so it could try to narrow in on how much it could use rather
than me guessing that I wanted progress reports every, say, 16
MBytes or 32 MBytes of memory touching.
(Hopefully the OOM killer would only kill the child and not the
parent, but with the OOM killer you can never be sure.)
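A minimal sketch of the simple version, in Python (the use of an anonymous mmap(), the reporting interval, and the progress format are all just my choices here):
import mmap
import sys

CHUNK = 32 * 1024 * 1024          # report every 32 MBytes touched

def touch_memory(total_bytes):
    # allocate anonymous memory, then force the kernel to materialize it
    # by writing to every page; print progress as we go so we can see
    # how far we got if the OOM killer steps in.
    mem = mmap.mmap(-1, total_bytes)
    page = mmap.PAGESIZE
    touched = 0
    for off in range(0, total_bytes, page):
        mem[off] = 1
        touched += page
        if touched % CHUNK == 0:
            print("touched %d MBytes" % (touched // (1024 * 1024)), flush=True)
    print("touched all %d bytes" % total_bytes, flush=True)

if __name__ == "__main__":
    # argument is how many MBytes to try to use
    touch_memory(int(sys.argv[1]) * 1024 * 1024)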
I'm probably not the first person to have this sort of need, so I
suspect that other people have written test programs and maybe even
put them up somewhere. I don't expect to be able to find them in
today's ambient Internet search noise, plus this is very close to
the much more popular issue of testing your RAM.
(Will I put up my little test program when I hack it up? Probably
not, it's too much work to do it properly, with actual documentation
and so on. And these days I'm not very enthused about putting more
repositories on Github, so I'd need to find some alternate place.)
The original Bill Joy vi famously only had a single level of undo
(which is part of what makes it a product of its time). The 'u' command either undid your latest
change or it redid the change, undo'ing your undo. When POSIX and
the Single Unix Specification wrote vi into the standard, they
required this behavior; the vi
specification requires 'u' to work the same as it does in ex, where
it is specified as:
Reverse the changes made by the last command that modified the
contents of the edit buffer, including undo.
This is one particular piece of POSIX compliance that I think
everyone should ignore.
Vim and its
derivatives ignore the POSIX requirement and implement multi-level
undo and redo in the usual and relatively obvious way. The vim
'u' command only undoes changes but it can undo lots of them, and
to redo changes you use Ctrl-r ('r' and 'R' were already taken).
Because 'u' (and Ctrl-r) are regular commands they can be used with
counts, so you can undo the last 10 changes (or redo the last 10
undos). Vim can be set to vi compatible behavior if you want.
I believe that vim's multi-level undo and redo is the default
even when it's invoked as 'vi' in an unconfigured environment,
but I can't fully test that.
Nvi has opted to remain POSIX
compliant and operate in the traditional vi way, while still
supporting multi-level undo. To get multi-level undo in nvi, you
extend the first 'u' with '.' commands, so 'u..' undoes the most
recent three changes. The 'u' command can be extended with '.'
in either of its modes (undo'ing or redo'ing), so 'u..u..' is
a no-op. The '.' operation doesn't appear to take a count in nvi,
so there is no way to do multiple undos (or redos) in one action;
you have to step through them by hand. I'm not sure how nvi reacts
if you want to do things like move your cursor position during an undo
or redo sequence (my limited testing suggests that it can perturb
the sequence, so that '.' now doesn't continue undoing or redoing
the way vim will continue if you use 'u' or Ctrl-r again).
The vi emulation package evil
for GNU Emacs inherits GNU Emacs' multi-level undo and nominally
binds undo and redo to 'u' and Ctrl-r respectively. However, I don't
understand its actual stock undo behavior. It appears to do multi-level
undo if you enter a sequence of 'u' commands and accepts a count
for that, but it doesn't feel vi or vim compatible if you intersperse
'u' commands with things like cursor movement, and I don't understand
redo at all (evil has some customization settings for undo behavior,
especially evil-undo-system).
I haven't investigated Evil extensively and this undo and redo stuff
makes me less likely to try using it in the future.
The BusyBox implementation
of vi is minimal but it can be built with support for 'u' and
multi-level undo, which is done by repeatedly invoking 'u'. It
doesn't appear to have any redo support, which makes a certain
amount of sense in an environment where your biggest concern may be
reverting things so they're no worse than they started out. The
Ubuntu and Fedora versions of busybox appear to be built this way,
but your distance may vary on other Linuxes.
My personal view is that the vim undo and redo behavior is the best
and most human friendly option. Undo and redo are predictable and
you can predictably intersperse undo and redo operations with other
operations that don't modify the buffer, such as moving the cursor,
searching, and yanking portions of text. The nvi behavior essentially
creates a special additional undo mode, where you have to remember
that you're in a sequence of undo or redo operations and you can't
necessarily do other vi operations in the middle (such as cursor
movement, searches, or yanks). This matters a lot to me because I
routinely use multi-level undo when I'm writing text to rewind my
buffer to a previous state and yank out some wording that I've
decided I like better than its replacement.
(For additional vi versions, on the Fediverse, I was also
pointed to nextvi, which appears to use
vim's approach to undo and redo; I believe neatvi also does this but I can't
spot any obvious documentation on it. There are vi-inspired editors
such as vile and vis, but they're not things people
would normally use as a direct replacement for vi. I believe that
vile follows the nvi approach of 'u.' while vis follows the vim
model of 'uu' and Ctrl-r.)
I recently discovered a surprising path to accessing localhost
URLs and services, where
instead of connecting to 127.0.0.1 or the IPv6 equivalent, you
connect to 0.0.0.0 (or the IPv6 equivalent). In that entry I
mentioned that I didn't know if systemd's IPAddressDeny
would block this. I've now tested this, and the answer is that
systemd's restrictions do block this. If you set
'IPAddressDeny=localhost', the service or whatever is blocked from
the 0.0.0.0 variation as well (for both outbound and inbound
connections). This is exactly the way it should be, so you might
wonder why I was uncertain and felt I needed to test it.
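For concreteness, the setting in question is just a line in the unit file or in a drop-in override, along these lines (the unit name here is made up):
# /etc/systemd/system/example.service.d/ipfilter.conf
[Service]
IPAddressDeny=localhost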
There are a variety of ways at different levels that you might
implement access controls on a process (or a group of processes)
in Linux, for IP addresses or anything else. For example, you might
create an eBPF program that filtered the system calls and system
call arguments allowed and attach it to a process and all of its
children using seccomp(2).
Alternately, for filtering IP connections specifically, you might
use a cgroup socket address eBPF program
(also), which is among
the cgroup program types
that are available. Or perhaps you'd prefer to use a cgroup socket
buffer program.
How a program such as systemd implements filtering has implications
for what sort of things it has to consider and know about when doing
the filtering. For example, if we reasonably conclude that the
kernel will have mapped 0.0.0.0 to 127.0.0.1 by the time it invokes
cgroup socket address eBPF programs, such a program doesn't need
to have any special handling to block access to localhost by people
using '0.0.0.0' as the target address to connect to. On the other
hand, if you're filtering at the system call level, the kernel has
almost certainly not done such mapping at the time it invokes you,
so your connect() filter had better know that '0.0.0.0' is equivalent
to 127.0.0.1 and it should block both.
This diversity is why I felt I couldn't be completely sure about
systemd's behavior without actually testing it. To be honest, I
didn't know what the specific options were until I researched them
for this entry. I knew systemd used eBPF for IPAddressDeny
(because it mentions that in the manual page in passing), but I vaguely
knew there were a lot of ways and places to use eBPF, and I didn't know
whether systemd's way needed to specifically know about 0.0.0.0 or, if
so, whether systemd actually did.
Sidebar: What systemd uses
As I found out through use of 'bpftool cgroup list
/sys/fs/cgroup/<relevant thing>' on a systemd service that I knew
uses systemd IP address filtering, systemd uses cgroup socket
buffer programs, and
is presumably looking for good and bad IP addresses and netblocks
in those programs. This unfortunately means that it would be hard for
systemd to have different filtering for inbound connections as opposed
to outgoing connections, because at the socket buffer level it's all
packets.
Recently I saw another discussion of how some people are very
attached to the original, classical vi and its behaviors (cf).
I'm quite sympathetic to this view, since I too am very attached
to the idiosyncratic behavior of various programs I've gotten used
to (such as xterm's very specific behavior in various areas), but
at the same time I had a hot take over on the Fediverse:
Hot take: basic vim (without plugins) is mostly what vi should have
been in the first place, and much of the differences between vi
and vim are improvements. Multi-level undo and redo in an obvious
way? Windows for easier multi-file, cross-file operations? Yes please,
sign me up.
Basic vi is a product of its time, namely the early 1980s, and the
rather limited Unix machines of the time (yes a VAX 11/780 was
limited).
(The touches of vim superintelligence, not so much, and I turn them
off.)
For me, vim is a combination of genuine improvements in vi's core
editing behavior (cf), frustrating (to
me) bits of trying too hard to be smart (which I mostly disable
when I run across them), and an extension mechanism I ignore but
people use to make vim into a superintelligent editor with things
like LSP
integrations.
Some of the improvements and additions to vi's core editing may be
things that Bill Joy either didn't think of or didn't think were
important enough. However, I feel strongly that some or even many
of the omitted features and differences are a product of the limited
environments vi had to operate in. The poster child for this is
vi's support of only a single level of undo, which drastically
constrains the potential memory requirements (and implementation
complexity) of undo, especially since a single editing operation
in vi can make sweeping changes across a large file (consider a
whole-file ':...s/../../' substitution, for example).
(The lack of split windows might be one part memory limitations and
one part that splitting an 80 by 24 serial terminal screen is much
less useful than splitting, say, an 80 by 50 terminal window.)
Vim isn't the only improved version of vi that has added features
like multi-level undo and split windows so you can see multiple
files at once (or several parts of the same file); there's also at
least nvi. I'm used to vim so I'm biased, but I happen to think
that a lot of vim's choices for things like multi-level undo are
good ones, ones that will be relatively obvious and natural to new
people and avoid various sorts of errors and accidents. But other
people like nvi and I'm not going to say they're wrong.
I do feel strongly that giving stock vi to anyone who doesn't
specifically ask for it is doing them a disservice, and this includes
installing stock vi as 'vi' on new Unix installs. At this point,
what new people are introduced to and what is the default on systems
should be something better and less limited than stock vi. Time has
moved on and Unix systems should move on with it.
(I have similar feelings about the default shell for new accounts
for people, as opposed to system accounts. Giving people bare Bourne
shell is not doing them any favours and is not likely to make a
good first impression. I don't care what you give them but it should
at least support cursor editing, file completion, and history, and
those should be on by default.)
PS: I have complicated feelings about Unixes that install stock vi
as 'vi' and something else under its full name, because on the one
hand that sounds okay but on the other hand there is so much stuff
out there that says to use 'vi' because that's the one name that's
universal. And if you then make 'vi' the name of the default (visual)
editor, well, it certainly feels like you're steering new people
into it and doing them a disservice.
(I don't expect to change the mind of any Unix that is still shipping
stock vi as 'vi'. They've made their cultural decisions a long time
ago and they're likely happy with the results.)
One of the important roles of Linux system package managers like
dpkg and RPM is providing a single
interface to building programs from source even though the programs
may use a wide assortment of build processes. One of the source
building features that both dpkg and RPM included (I believe from
the start) is patching the upstream source code, as well as providing
additional files along with it. My impression is that today this is
considered much less important in package managers, and some may make
it at least somewhat awkward to patch the source release on the fly.
Recently I realized that there may be a reason for this potential
oddity in dpkg and RPM.
Both dpkg and RPM are very old (by Linux standards). As covered in
Andrew Nesbitt's Package Manager Timeline, both
date from the mid-1990s (dpkg in January 1994, RPM in September
1995). Linux itself was quite new at the time and the Unix world
was still dominated by commercial Unixes (partly because the march
of x86 PCs was only just starting). As a
result, Linux was a minority target for a lot of general Unix free
software (although obviously not for Linux specific software). I
suspect that this was compounded by limitations in early Linux libc,
where apparently it had some issues with standards (see eg this,
also,
also,
also).
As a minority target, I suspect that Linux regularly had problems
compiling upstream software, and for various reasons not all upstreams
were interested in fixing (or changing) that (especially if it
involved accepting patches to cope with a non standards compliant
environment; one reply was to tell Linux to get standards compliant).
This probably left early Linux distributions regularly patching
software in order to make it build on (their) Linux, leading to
first class support for patching upstream source code in early
package managers.
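As an illustrative sketch of what this first class support looks like in RPM terms, a spec file can declare and apply a patch with not much more than the following (the patch name is invented, and %autosetup applies all declared patches):
Patch0: fix-linux-build.patch
%prep
%autosetup -p1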
(I don't know for sure because at that time I
wasn't using Linux or x86 PCs, and I might have been vaguely in the
incorrect 'Linux isn't Unix' camp. My first Linux came somewhat later.)
These days things have changed drastically. Linux is much more
standards compliant and of course it's a major platform. Free
software that works on non-Linux Unixes but doesn't build cleanly
on Linux is a rarity, so it's much easier to imagine (or have) a
package manager that is focused on building upstream source code
unaltered and where patching is uncommon and not as easy (or trivial)
as dpkg and RPM make it.
(You still need to be able to patch upstream releases to handle
security patches and so on, since projects don't necessarily publish
new releases for them. I believe some projects simply issue patches
and tell you to apply them to their current release. And you may
have to backport a patch yourself if you're sticking on an older
release of the project that they no longer do patches for.)
Today's other work achievement: getting a UEFI booted FreeBSD 15
machine to use a serial console on its second serial port, not its
first one. Why? Because the BMC's Serial over Lan stuff appears to be
hardwired to the second serial port, and life is too short to wire up
physical serial cables to test servers.
The basics of serial console support for your FreeBSD machine are
covered in the loader.conf manual page,
under the 'console' setting (in the 'Default Settings' section).
But between UEFI and FreeBSD's various consoles, things get
complicated, and for me the manual pages didn't do a great job of
putting the pieces together clearly. So I'll start with my descriptions
of all of the loader.conf variables that are relevant:
console="efi,comconsole"
Sets both the bootloader console and
the kernel console to both the EFI console and the serial port,
by default COM1 (ttyu0, Linux ttyS0). This is somewhat harmful if
your UEFI BIOS is already echoing console output to the serial
port (or at least to the serial port you want); you'll get doubled
serial output from the FreeBSD bootloader, but not doubled output
from the kernel.
boot_multicons="YES"
As covered in loader_simp(8),
this establishes multiple low level consoles for kernel messages.
It's not necessary if your UEFI BIOS is already echoing console
output to the serial port (and the bootloader and kernel can
recognize this), but it's harmless to set it just in case.
comconsole_speed="115200"
Sets the serial console speed
(and in theory 115200 is the default). It's not necessary if the
UEFI BIOS has set things up but it's harmless. See loader_simp(8)
again.
comconsole_port="0x2f8"
Sets the serial port used to COM2.
It's not necessary if the UEFI BIOS has set things up, but again
it's harmless. You can use 0x3f8 to specify COM1, although it's
the default. See loader_simp(8).
hw.uart.console="io:0x2f8,br:115200"
This tells the kernel
where the serial console is and what baud rate it's at, here COM2
and 115200 baud. The loader will automatically set it for you if
you set the comconsole_* variables, either because you also
need a 'console=' setting or because you're being redundant.
See loader.efi(8) (and
then loader_simp(8) and uart(4)).
(That the loader does this even without a 'comconsole' in your
nonexistent 'console=' line may some day be considered a bug and
fixed.)
If they agree with each other, you can safely set both hw.uart.console
and the comconsole_* variables.
On a system where the UEFI BIOS isn't echoing the UEFI console
output to a serial port, the basic way to have FreeBSD use both
the video console (settings for which are in vt(4)) and the
serial console (on the default of COM1), with the video console
as the primary, is a loader.conf setting of:
console="efi,comconsole"
boot_multicons="YES"
This will change both the bootloader console and the kernel console
after boot. If your UEFI BIOS is already echoing 'console' output
to the serial port, bootloader output will be doubled and you'll
get to see fun bootloader output like:
If you see this (or already know that your UEFI BIOS is doing this),
the minimal alternate loader.conf settings (for COM1) are:
# for COM1 / ttyu0
hw.uart.console="io:0x3f8,br:115200"
(The details are covered in loader.efi(8)'s
discussion of console considerations.)
If you don't need a 'console=' setting because of your UEFI BIOS,
you must set either hw.uart.console or the comconsole_*
settings. Technically, setting hw.uart.console is the correct
approach; the fact that setting only comconsole_* still works may be a
bug.
If you don't explicitly set a serial port to use, FreeBSD will use
COM1 (ttyu0, Linux ttyS0) for the bootloader and kernel. This is
only possible if you're using 'console=', because otherwise you
have to directly or indirectly set 'hw.uart.console', which directly
tells the kernel which serial port to use (and the bootloader will
use whatever UEFI tells it to). To change the serial port to COM2,
you need to set the appropriate one of 'comconsole_port' and
'hw.uart.console' from 0x3f8 (COM1) to the right PC port value
of 0x2f8.
So our more or less final COM2 /boot/loader.conf for a case where
you can turn off or ignore the BIOS echoing to the serial console
is:
console="efi,comconsole"
boot_multicons="YES"
comconsole_speed="115200"
# For the COM2 case
comconsole_port="0x2f8"
If your UEFI BIOS is already echoing 'console' output to the serial
port, the minimal version of the above (again for COM2) is:
# For the COM2 case
hw.uart.console="io:0x2f8,br:115200"
(As with Linux, the FreeBSD kernel will only use one serial port
as the serial console; you can't send kernel messages to two serial
ports. FreeBSD at least makes this explicit in its settings.)
As covered in conscontrol and elsewhere,
FreeBSD has a high level console, represented by /dev/console,
and a low level console, used directly by the kernel for things
like kernel messages. The high level console can only go to one
device, normally the first one; this is either the first one in
your 'console=' line or whatever UEFI considers the primary
console. The low level console can go to multiple devices. Unlike
Linux, this can be changed on the fly once the system is up through
conscontrol (and its state can also be checked).
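For the record, running conscontrol with no arguments reports the current state, and as I understand it 'conscontrol add' and 'conscontrol delete' change the low level console list on the fly (the device name here is illustrative):
conscontrol
conscontrol add ttyu1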
Conveniently, you don't need to do anything to start a serial login
on your chosen console serial port. All four possible (PC) serial
ports, /dev/ttyu0 through /dev/ttyu3, come pre-set in /etc/ttys
with 'onifconsole' (and 'secure'), so that if the kernel is using
one of them, there's a getty started on it. I haven't tested what
happens if you use conscontrol to change the console on the
fly.
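(The stock /etc/ttys entries look roughly like this, although the getty class and terminal type may differ between FreeBSD versions:
ttyu0	"/usr/libexec/getty 3wire"	vt100	onifconsole secure
with ttyu1 through ttyu3 following the same pattern.)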
Booting FreeBSD on a UEFI based system is covered through the manual
page series of uefi(8), boot(8),
loader.efi(8), and loader(8). It's
not clear to me if loader.efi is the EFI specific version of
loader(8), or if the one loads and starts the other in a multi-stage
boot process. I suspect it's the former.
Sidebar: What we may wind up with in loader.conf
Here's what I think is a generic commented block for serial console
support:
# Uncomment if the UEFI BIOS does not echo to serial port
#console="efi,comconsole"
boot_multicons="YES"
comconsole_speed="115200"
# Uncomment for COM2
#comconsole_port="0x2f8"
# change 0x3f8 (COM1) to 0x2f8 for COM2
hw.uart.console="io:0x3f8,br:115200"
All of this works for me on FreeBSD 15, but your distance may vary.
The abstract way to describe why
is to say that Linux distributions had to assemble a whole thing
from separate pieces; the kernel came from one place, libc from
another, coreutils from a third, and so on. The concrete version
is to think about what problems you'd have without a package manager.
Suppose that you assembled a directory tree of all of the source
code of the kernel, libc, coreutils, GCC, and so on. Now you need
to build all of these things (or rebuild, let's ignore bootstrapping
for the moment).
Building everything is complicated partly because everything goes
about it differently. The kernel has its own configuration and build
system, a variety of things use autoconf but not necessarily with
the same set of options to control things like features, GCC has a
multi-stage build process, Perl has its own configuration and
bootstrapping process, X is frankly weird and vaguely terrifying,
and so on. Then not everyone uses 'make install' to actually install
their software, so you have another set of variations for all of
this.
(The less said about the build processes for either TeX or GNU
Emacs in the early to mid 1990s, the better.)
If you do this at any scale, you need to keep track of all of this
information (cf) and you want
a uniform interface for 'turn this piece into a compiled and ready
to unpack blob'. That is, you want a source package (which encapsulates all of the 'how to do
it' knowledge) and a command that takes a source package and does
a build with it. Once you're building things that you can turn into
blobs, it's simpler to always ship a new version of the blob
whenever you change anything.
(You want the 'install' part of 'build and install' to result in a
blob rather than directly installing things on your running system
because until it finishes, you're not entirely sure the build and
install has fully worked. Also, this gives you an easy way to split
the overall system up into multiple pieces, some of which people don't
have to install. And in the very early days, to split them across
multiple floppy disks, as SLS did.)
Now you almost have a system package manager with source packages
and binary packages. You're building all of the pieces of your Linux
distribution in a standard way from something that looks a lot like
source packages, and you pretty much want to create binary blobs
from them rather than dump everything into a filesystem. People
will obviously want a command that takes a binary blob and 'installs'
it by unpacking it on their system (and possibly extra stuff),
rather than having to run 'tar whatever' all the time themselves,
and they'll also want to automatically keep track of which of your
packages they've installed rather than having to keep their own
records. Now you have all of the essential parts of a system package
manager.
(Both dpkg and RPM also keep track of which package installed what
files, which is important for upgrading and removing packages, along
with things having versions.)
We have a collection of VPN
servers, some OpenVPN based and some L2TP based. They used to be
based on OpenBSD, but we're moving from OpenBSD to FreeBSD and the VPN servers recently
moved too. We also have a system for collecting Prometheus metrics
on VPN usage, which worked by
parsing the output of things.
For OpenVPN, our scripts just kept working when we switched to
FreeBSD because the two OSes use basically the same OpenVPN setup.
This was not the case for our L2TP VPN server.
OpenBSD does L2TP using npppd,
which supports a handy command line control program, npppctl, that can readily extract and
report status information. On FreeBSD, we wound up using mpd5. Unfortunately,
mpd5 has no equivalent of npppctl. Instead, as covered (sort of)
in its user manual,
you get your choice of a TCP based console that's clearly intended
for interactive use and a web interface that is also sort of
intended for interactive use (and isn't all that well documented).
Fortunately, one convenient thing about the web interface is that
it uses HTTP Basic authentication, which means that you can easily
talk to it through tools like curl. To do status
scraping through the web interface, first you need to turn it on
and then you need an unprivileged mpd5 user you'll use for this:
set web self 127.0.0.1 5006
set web open
set user metrics <some-password> user
At this point you can use curl to get responses from the mpd5
web server (from the local host, ie your VPN server itself):
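For example, fetching the JSON status dump described below looks like this, using the 'metrics' user and whatever password you set above:
curl -s -u metrics:<some-password> http://localhost:5006/json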
There are two useful things you can ask the web server interface
for. First, you can ask it for a complete dump of its status in
JSON format, by asking for 'http://localhost:5006/json' (although
the documentation claims that the information returned is what 'show
summary' in the console would give you, it is more than that). If
you understand mpd5 and like parsing and processing JSON, this is
probably a good option. We did not opt to do this.
The other option is that you can ask the web interface to run console
(interface) commands for you, and then give you the output in either
a 'pleasant' HTML page or in a basic plain text version. This is
done by requesting either '/cmd?<command>' or '/bincmd?<command>'
respectively. For statistics scraping, the most useful version is
the 'bincmd' one, and the command we used is 'show session':
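Via curl that request looks something like this; I believe the space in the command has to be URL-encoded, so the '%20' is my assumption:
curl -s -u metrics:<some-password> 'http://localhost:5006/bincmd?show%20session'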
(I assume 'RESULT: 0' would be something else if there was some
sort of problem.)
Of the fields in its output, the useful ones for us are the first, which gives the
local network device, the second, which gives the internal VPN IP
of this connection, and the last two, which give us the VPN user
and their remote IP. The others are internal MPD things that we
(hopefully) don't have to care about. The internal VPN IP isn't
necessary for (our) metrics but may be useful for log correlation.
To get traffic volume information, you need to extract the usage
information from each local network device that an L2TP session is
using (ie, 'ng1' and its friends). As far as I know, the only tool
for this in (base) FreeBSD is netstat. Although you can
invoke it interface by interface, probably the better thing to do
(and what we did) is to use 'netstat -ibn -f link' to dump
everything at once and then pick through the output to get the lines
that give you packet and byte counts for each L2TP interface, such
as ng1.
(I'm not sure if dropped packets is relevant for these interfaces;
if you think it might be, you want 'netstat -ibnd -f link'.)
FreeBSD has a general system, 'libxo', for producing output from
many commands in a variety of handy formats. As covered in
xo_options,
this can be used to get this netstat output in JSON if you find
that more convenient. I opted to get the plain text format and use
field numbers for the information I wanted for our VPN traffic
metrics.
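As a sketch of the sort of field-based extraction I mean, where the specific field numbers for the input and output byte counts are my assumption and you should check them against your own netstat header line:
netstat -ibn -f link | awk '$1 ~ /^ng[0-9]+$/ {print $1, $8, $11}'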
(Partly this was because I could ultimately reuse a lot of my metrics
generation tools from the OpenBSD npppctl parsing. Both environments
generated two sets of line and field based information, so a
significant amount of the work was merely shuffling around which
field was used for what.)
PS: Because of how mpd5 behaves, my view is that you don't want to
let anyone but system staff log on to the server where you're using
it. It is an old C code base and I would not trust it if people can
hammer on its TCP console or its web server. I certainly wouldn't
expose the web server to a non-localhost network, even apart from
the bit where it definitely doesn't support HTTPS.
The news of the time interval is that Linux's usual telnetd has
had a giant security vulnerability for a decade. As
people on the Fediverse observed, we've been here before; Solaris
apparently had a similar bug 20 or so years ago (which was
CVE-2007-0882, cf,
via), and AIX in
the mid 1990s (CVE-1999-0113, source, also), and also apparently
SGI Irix, and no doubt many others (eg). It's not necessarily
telnetd at fault, either, as I believe it's sometimes been rlogind.
All of these bugs have a simple underlying cause; in a way that
root cause is people using Unix correctly and according to its
virtue of modularity, where each program does one thing and you
string programs together to achieve your goal. Telnetd and rlogind
have the already complicated job of talking a protocol to the
network, setting up ptys, and so on, so obviously they should leave
the also complex job of logging the user in to login, which already
exists to do that. In theory this should work fine.
The problem with this is that from more or less the beginning, login
has had several versions of its job. From no later than V3 in 1972,
login
could also be used to switch from one user to another, not just log
in initially. In 4.2 BSD, login
was modified and reused to become part of rlogind's authentication
mechanism (really; .rhosts is checked in the 4.2BSD login.c,
not in rlogind). Later, various versions of login were modified to
support 'automatic' logins, without challenging for a password (see
eg FreeBSD login(1),
OpenBSD login(1), and Linux
login(1);
use of -f for this appears to date back to around 4.3 Tahoe).
Sometimes this was explicitly for the use of things that were running
as root and had already authenticated the login.
In theory this is all perfectly Unixy. In practice, login figured
out which of these variations of its basic job it was being used
for based on a combination of command line arguments and what UID
it was running as, which made it absolutely critical that programs
running as root that reused login never allowed login to be invoked
with arguments that would shift it to a different mode than they
expected. Telnetd and rlogind have traditionally run as root,
creating this exposure.
People are fallible, programmers included, and attackers are very
ingenious. Over the years any number of people have found any number
of ways to trick network daemons running as root into running login
with 'bad' arguments.
The one daemon I don't think has ever been tricked this way is
OpenSSH, because from very early on sshd refused to delegate logging
people in to login. Instead, sshd has its own code to log people
in to the system. This has had its complexities but has also shielded
sshd from all of these (login) context problems.
In my view, this is one of the unfortunate times when the ideals
of Unix run up against the uncomfortable realities of the world.
Network daemons delegating logging people in to login is the
correct Unix answer, but in practice it has repeatedly gone wrong
and the best answer is OpenSSH's.
Except, if you look at an actual VLAN configuration as materialized
by Netplan (or written out by hand), you'll discover a problem.
Your VLANs don't normally have .link files, only .netdev
and .network
files (and even your normal Ethernet links may not have .link files).
The AlternativeName= setting is only valid in .link files, because
networkd is like that.
(The AlternativeName= is a '[Link]' section setting and
.network files also have a '[Link]' section, but they allow
completely different sets of '[Link]' settings. The .netdev file,
which is where you define virtual interfaces, doesn't have a '[Link]'
section at all, although settings like AlternativeName= apply
to them just as much as to regular devices. Alternately, .netdev
files could support setting altnames for virtual devices in the
'[NetDev]' section alongside the mandatory 'Name=' setting.)
You can work around this indirectly, because you can create a .link
file for a virtual network device and have it work:
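(What follows is my reconstruction of such a .link file; the interface name and the altname are illustrative.)
# /etc/systemd/network/10-vlan22-mlab.link
[Match]
OriginalName=vlan22-mlab
[Link]
AlternativeName=vlan22-matterlab
AlternativeNamesPolicy=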
Networkd does the right thing here even though 'vlan22-mlab' doesn't
exist when it starts up; when vlan22-mlab comes into existence, it
matches the .link file and has the altname stapled on.
Given how awkward this is (and that not everything accepts or sees
altnames), I think it's probably not worth bothering with unless
you have a very compelling reason to give an altname to a virtual
interface. In my case, this is clearly too much work simply to give
a VLAN interface its 'proper' name.
Since I tested, I can also say that this works on a Netplan-based
Ubuntu server where the underlying VLAN is specified in Netplan.
You have to hand write the .link file and stick it in /etc/systemd/network,
but after that it cooperates reasonably well with a Netplan VLAN
setup.
This is my (sad) face that Linux interfaces have a maximum
name length. What do you mean I can't call this VLAN interface
'vlan22-matterlab'?
Also, this is my annoyed face that Canonical Netplan doesn't check
or report this problem/restriction. Instead your VLAN interface just
doesn't get created, and you have to go look at system logs to find
systemd-networkd telling you about it.
(This is my face about Netplan in general, of course. The sooner it
gets yeeted the better.)
Based on both some Internet searches and looking at kernel headers,
I believe the limit is 15 characters for the primary name of an
interface. In headers, you will find this called IFNAMSIZ (the
kernel) or IF_NAMESIZE (glibc), and it's defined to be 16 but
that includes the trailing zero byte for C strings.
(I can be confident that the limit is 15, not 16, because
'vlan22-matterlab' is exactly 16 characters long without a trailing
zero byte. Take one character off and it works.)
At the level of ip
commands, the error message you get is on the unhelpful side:
# ip link add dev vlan22-matterlab type wireguard
Error: Attribute failed policy validation.
(I picked the type for illustration purposes.)
Systemd-networkd gives you a much better error message:
/run/systemd/network/10-netplan-vlan22-matterlab.netdev:2: Interface name is not valid or too long, ignoring assignment: vlan22-matterlab
(Then you get some additional errors because there's no name.)
As mentioned in my Fediverse post, Netplan tells
you nothing. One direct consequence of this is that in any context
where you're writing down your own network interface names, such
as VLANs or WireGuard interfaces, simply having 'netplan try' or 'netplan
apply' succeed without errors does not mean that your configuration
actually works. You'll need to look at error logs and perhaps
inventory all your network devices.
As covered in the ip link manual
page, network interfaces can have either or both of aliases and
'altname' properties. These alternate names can be (much) longer
than 16 characters, and the 'ip link property' altname property can
be used in various contexts to make things convenient (I'm not sure
what good aliases are, though). However this is somewhat irrelevant
for people using Netplan, because the current Netplan YAML doesn't
allow you to set interface altnames.
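For the record, you can add an altname to an interface by hand with 'ip link property'; the device name here is illustrative:
ip link property add dev enp5s0 altname vlan22-matterlab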
You can set altnames in networkd .link files, as covered in the
systemd.link
manual page. The direct thing you want is AlternativeName=,
but apparently you may also want to set a blank alternative names
policy, AlternativeNamesPolicy=.
Of course this probably only helps if you're using systemd-networkd
directly, instead of through Netplan.
PS: Netplan itself has the notion of Ethernet interfaces having
symbolic names, such as 'vlanif0', but this is purely internal to
Netplan; it's not manifested as an actual interface altname in the
'rendered' systemd-networkd control files that Netplan writes out.
Netplan is Canonical's more or less mandatory
method of specifying networking on Ubuntu. Netplan has a collection of
limitations and irritations, and recently I ran into a new one, which
is how VLANs can and can't be specified. To explain this, I can start
with the YAML configuration language. To quote
the top level version, it looks like:
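(Paraphrasing roughly rather than quoting exactly:)
network:
  version: 2
  vlans:
    <vlan name>:
      id: <VLAN id>
      link: <underlying interface>
      # plus the usual interface properties (addresses and so on)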
To translate this, you specify VLANs separately from your Ethernet or
other networking devices. On the one hand, this is nicely flexible. On
the other hand it creates a problem, because here is what you have
to write for VLAN properties:
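Concretely, a couple of VLANs come out something like this (all of the names are invented):
network:
  version: 2
  ethernets:
    enp5s0:
      dhcp4: false
  vlans:
    vlan10:
      id: 10
      link: enp5s0
    vlan20:
      id: 20
      link: enp5s0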
Every VLAN is on top of some networking device, and because VLANs
are specified as a separate category of top level devices, you have
to name the underlying device in every VLAN (which gets very annoying
and old very fast if you have ten or twenty VLANs to specify). Did
you decide to switch from a 1G network port to a 10G network port
for the link with all of your VLANs on it? Congratulations, you get
to go through every 'vlans:' entry and change its 'link:' value.
We hope you don't overlook one.
(Or perhaps you had to move the system disks from one model of 1U
server to another model of 1U server because the hardware failed.
Or you would just like to write generic install instructions with
a generic block of YAML that people can insert directly.)
The best way for Netplan to deal with this would be to allow you
to also specify VLANs as part of other devices, especially Ethernet
devices. Then you could write:
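(This nested syntax doesn't exist today; it's a sketch of what I mean:)
network:
  version: 2
  ethernets:
    enp5s0:
      dhcp4: false
      vlans:
        vlan10:
          id: 10
        vlan20:
          id: 20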
Every VLAN specified in enp5s0's configuration would implicitly use
enp5s0 as its underlying link device, and you could rename all of
them trivially. This also matches how I think most people think of
and deal with VLANs, which is that (obviously) they're tied to some
underlying device, and you want to think of them as 'children' of
the other device.
(You can have an approach to VLANs where they're more free-floating
and the interface that delivers any specific VLAN to your server
can change, for load balancing or whatever. But you could still do
this, since Netplan will need to keep supporting the separate
'vlans:' section.)
If you want to work around this today, you have to go for the far
less convenient approach of artificial network names.
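Reconstructing the sort of thing I mean (the real device name and the VLAN are illustrative):
network:
  version: 2
  ethernets:
    vlanif0:
      match:
        name: enp5s0
  vlans:
    vlan10:
      id: 10
      link: vlanif0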
This way you only need to change one thing if your VLAN network
interface changes, but at the cost of doing a non-standard way of
setting up the base interface. (Yes, Netplan accepts it, but it's
not how the Ubuntu installer will create your netplan files and who
knows what other Canonical tools will have a problem with it as a
result.)
We have one future Ubuntu server where we're going to need to set
up a lot of VLANs on one underlying physical interface. I'm not
sure which option we're going to pick, but the 'vlanif0' option
is certainly tempting. If nothing else, it probably means we can
put all of the VLANs into a separate, generic Netplan file.
Current status: doing extremely "I don't know what I'm really doing,
I'm copying from a website¹" things with Linux tc to see if I can
improve my home Internet latency under load without doing too much
damage to bandwidth or breaking my firewall rules. So far, it seems to
work and things² claim to like the result.
What started this was running into a Fediverse post about the
bufferbloat test,
trying it, and discovering that (as expected) my home DSL link
performed badly, with significant increased latency during downloads,
uploads, or both. My memory is that reported figures went up to the
area of 400 milliseconds.
Conveniently for me, my Linux home desktop is also my DSL router;
it speaks PPPoE directly through my DSL modem. This means that doing
traffic shaping on my Linux desktop should cover everything, without
any need to wrestle with a limited router OS environment. And there
were some more or less cut and paste directions on the site.
So my outbound configuration was simple and obviously not harmful:
tc qdisc add root dev ppp0 cake bandwidth 7.6Mbit
The bandwidth is a guess, although one informed by checking both
my raw DSL line rate and what testing sites told me.
The inbound configuration was copied from the documentation and
it's where I don't understand what I'm doing:
ip link add name ifb4ppp0 type ifb
tc qdisc add dev ppp0 handle ffff: ingress
tc qdisc add dev ifb4ppp0 root cake bandwidth 40Mbit besteffort
ip link set ifb4ppp0 up
tc filter add dev ppp0 parent ffff: matchall action mirred egress redirect dev ifb4ppp0
Here is what I understand about this. As covered in the tc manual page,
traffic shaping and scheduling happens only on 'egress', which is
to say for outbound traffic. To handle inbound traffic, we need a
level of indirection to a special ifb (Intermediate Functional
Block)
(also) device,
which is apparently used only for our (inbound) tc qdisc.
So we have two pieces. The first is the actual traffic shaping on
the IFB link, ifb4ppp0, and setting the link 'up' so that it will
actually handle traffic instead of throwing it away. The second is
that we have to push inbound traffic on ppp0 through ifb4ppp0 to
get its traffic shaping. To do this we add a special 'ingress' qdisc
to ppp0, which applies to inbound traffic, and then we use a tc
filter that matches all (ingress) traffic and
redirects it to ifb4ppp0 as 'egress' traffic. Since
it's now egress traffic, the tc shaping on ifb4ppp0 will now apply
to it and do things.
When I set this up I wasn't certain if it was going to break my
non-trivial firewall rules on the ppp0 interface. However, everything
seems to be fine, and the only thing the tc redirect is affecting is
traffic shaping. My firewall blocks and NAT rules are still working.
Applying these tc rules definitely improved my latency scores on
the test site; my link went
from an F rating to an A rating (and a C rating for downloads and
uploads happening at once). Does this improve my latency in practice
for things like interactive SSH connections while downloads and
uploads are happening? It's hard for me to tell, partly because I
don't do such downloads and uploads very often, especially while I'm
doing interactive stuff over SSH.
(Of course partly this is because I've sort of conditioned myself
out of trying to do interactive SSH while other things are happening
on my DSL link.)
The most I can say is that this probably improves things, and that
since my DSL connection has drifted into having relatively bad
latency to start with (by my standards), it probably helps to
minimize how much worse it gets under load.
I do seem to get slightly less bandwidth for transfers than I did
before; experimentation says that how much less can be fiddled with
by adjusting the tc 'bandwidth' settings, although that also changes
latency (more bandwidth creates worse latency). Given that I rarely
do large downloads or uploads, I'm willing to trade off slightly
lower bandwidth for (much) less of a latency hit. One reason that
my bandwidth numbers are approximate anyway is that I'm not sure
how much PPPoE DSL framing compensation I need.
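As I understand it, cake has keywords to account for link framing overhead, so the fix would be something like the following, where the specific overhead figure is only a guess on my part:
tc qdisc change root dev ppp0 cake bandwidth 7.6Mbit overhead 34
(Cake also has compound keywords for common DSL encapsulations, but I haven't worked out which one applies to my link.)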
Sidebar: A rewritten command order for ingress traffic
If my understanding is correct, we can rewrite the commands to set
up inbound traffic shaping to be more clearly ordered:
# Create and enable ifb link
ip link add name ifb4ppp0 type ifb
ip link set ifb4ppp0 up
# Set CAKE with bandwidth limits for
# our actual shaping, on ifb link.
tc qdisc add dev ifb4ppp0 root cake bandwidth 40Mbit besteffort
# Wire ifb link (with tc shaping) to inbound
# ppp0 traffic.
tc qdisc add dev ppp0 handle ffff: ingress
tc filter add dev ppp0 parent ffff: matchall action mirred egress redirect dev ifb4ppp0
The 'ifb4ppp0' name is arbitrary but conventional, set up as
'ifb4<whatever>'.
When I described my current ideal Linux source package format, I said that it should be embedded in
the source code of the software being packaged. In a comment,
bitprophet had a perfectly reasonable
and good preference the other way:
Re: other points: all else equal I think I vaguely prefer the Arch
"repo contains just the extras/instructions + a reference to the
upstream source" approach as it's cleaner overall, and makes it easier
to do "more often than it ought to be" cursed things like "apply
some form of newer packaging instructions against an older upstream
version" (or vice versa).
The Arch approach is isomorphic to the source RPM format, which has
various extras and instructions plus a pre-downloaded set of upstream
sources. It's not really isomorphic to the Debian source format
because you don't normally work with the split up version; the split
up version is just a package distribution thing (as dgit shows).
(I believe the Arch approach is also how the FreeBSD and OpenBSD
ports trees work. Also, the source package format you work in is
not necessarily how you bundle up and distribute source packages,
again as shown by Debian.)
Let's call these two packaging options the inline approach (Debian)
and the out of line approach (Arch, RPM). My view is that which
one you want depends on what you want to do with software and
packages. The out of line approach makes it easier to build unmodified
packages, and as bitprophet comments it's easy to do weird build
things. If you start from a standard template for the type of build
and install the software uses, you can practically write the packaging
instructions yourself. And the files you need to keep are quite
compact (and if you want, it's relatively easy to put a bunch of
them into a single VCS repository, each in its own subdirectory).
However, the out of line approach makes modifying upstream software
much more difficult than a good version of the inline approach (such
as, for example, dgit). To modify upstream
software in the out of line approach you have to go through some
process similar to what you'd do in the inline approach, and then
turn your modifications into patches that your packaging instructions
apply on top of the pristine upstream. Moving changes from version
to version may be painful in various ways, and in addition to those
nice compact out of line 'extras/instructions' package repos, you
may want to keep around your full VCS work tree that you built the
patches from.
(Out of line versus inline is a separate issue from whether or not
the upstream source code should include packaging instructions in
any form; I think that generally the upstream should not.)
As a system administrator, I'm biased toward easy modification
of upstream packages and thus upstream source
because that's most of why I need to build my own packages. However,
these days I'm not sure if that's what a Linux distribution should
be focusing on. This is especially true for 'rolling' distributions
that mostly deal with security issues and bugs not by patching their
own version of the software but by moving to a new upstream version
that has the security fix or bug fix. If most of what a distribution
packages is unmodified from the upstream version, optimizing for
that in your (working) source package format is perfectly sensible.
A related thing I've taken to doing before potential lurching changes
(like Linux distribution upgrades) is to take screenshots and window
images. Because comparing a now and then image is a heck of a lot
easier than restoring backups, and I can look at it repeatedly as I
fix things on the new setup.
Linux distributions and the software they package have a long history
of deciding to change things for your own good. They will tinker
with font choices, font sizes, default DPI determinations, the size
of UI elements, and so on, not quite at the drop of a hat but
definitely when you do something like upgrade your distribution and
bring in a bunch of significant package version changes (and new
programs to replace old programs).
Some people are perfectly okay with these changes. Other people,
like me, are quite attached to the specifics of how their current
desktop environment looks and will notice and be unhappy about even
relatively small changes (eg,
also). However, because we're fallible
humans, people like me can't always recognize exactly what changed
and remember exactly what the old version looked like (these two
are related); instead, sometimes all we have is the sense that
something changed but we're not quite sure exactly what or exactly
how.
Screenshots and window images are the fix for that unspecific
feeling. Has something changed? You can call up an old screenshot
to check, and to examine what changed (and then maybe work out how to reverse
it, or decide to live with the change). Screenshots aren't perfect;
for example, they won't necessarily tell you what the old fonts
were called or what sizes were being used. But they're a lot better
than trying to rely on memory or other options.
It would probably also do me good to get into the habit of taking
screenshots periodically, even outside of distribution upgrades.
Looking back over time every so often is potentially useful to see
more subtle, more long term changes, and perhaps ask myself either
why I'm not doing something any more or why I'm still doing it.
(Currently I'm somewhat lackadaisical about taking screenshots even
before distribution upgrades. I have a distribution upgrade process
but I haven't made screenshots part of it, and I don't have an
explicit checklist for the process. Which I definitely should
create. Possibly I should also
try to capture font information in text form, to the extent that
I can find it.)
I've written recently on why source packages are complicated and why packages should be declarative (in contrast to Arch style shell
scripts), but I haven't said anything about what I'd like in a
source package format, which will mostly be from the perspective
of a system administrator who sometimes needs to modify upstream
packages or package things myself.
A source package format is a compromise. After my recent experiences
with dgit, I now feel that the best
option is that a source package is a VCS repository directory tree
(Git by default) with special control files in a subdirectory.
Normally this will be the upstream VCS repository with packaging
control files and any local changes merged in as VCS commits. You
perform normal builds in this checked out repository, which has the
advantage of convenience and the disadvantage that you have to clean
up the result, possibly with liberal use of 'git clean' and 'git
reset'. Hermetic builds are done by some tool that copies the checked
out files to a build area, or clones the repository, or some other
option. If a binary package is built in an environment where this
information is available, its metadata should include the exact
current VCS commit it was built from, and I would make binary
packages not build if there were uncommitted changes.
(Making the native source package a VCS tree with all of the source
code makes it easy to work on but mingles package control files
with the program source. In today's environment with good distributed
VCSes I think this is the right tradeoff.)
The control files should be as declarative as possible, and they
should directly express major package metadata such as version
numbers (unlike the Debian package format, where the version number
is derived from debian/changelog). There should be a changelog but
it should be relatively free-form, like RPM changelogs. Changelogs
are especially useful for local modifications because they go along
with the installed binary package, which means that you can get an
answer to 'what did we change in this locally modified package'
without having to find your source. The main metadata file that
controls everything should be kept simple; I would go as far as to
say it should have a format that doesn't allow for multi-line
strings, and anything that requires multi-line strings should go
in additional separate files (including the package description).
You could make it TOML but I don't think you should make it YAML.
Both the build time actions, such as configuring and compiling
the source, and the binary package install time actions should by
default be declarative; you should be able to say 'this is an
autoconf based program and it should have the following additional
options', and the build system will take care of everything else.
Similarly you should be able to directly express that the binary
package needs certain standard things done when it's installed,
like adding system users and enabling services. However, this will
never be enough so you should also be able to express additional
shell script level things that are done to prepare, build, install,
upgrade, and so on the package. Unlike RPM and Debian source packages
but somewhat like Arch packages, these should be separate files in
the control directory, eg 'pkgmeta/build.sh'. Making these separate
files makes it much easier to do things like run shellcheck on them
or edit them in syntax-aware editor environments.
(It should be possible to combine standard declarative prepare and
build actions with additional shell or other language scripting.
We want people to be able to do as much as possible with standard,
declarative things. Also, although I used '.sh', you should be able
to write these actions in other languages too, such as Python or
Perl.)
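To make this concrete, a hypothetical control directory in this scheme might look something like the following, where all of the names are invented:
pkgmeta/
  package.toml    # simple single-line metadata: name, version, build dependencies
  description     # free-form multi-line package description
  changelog       # relatively free-form changelog
  build.sh        # extra build steps beyond the declarative ones
  install.sh      # extra install time actions
  files.main      # file list for the main binary package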
I feel that, like RPMs, you should at least by default have to
explicitly declare what files and directories are included in the
binary package. Like RPMs, these installed files should be analyzed
to determine the binary package dependencies rather than force you
to try to declare them in the (source) package metadata (although
you'll always have to declare build dependencies in the source
package metadata). Like build and install scripts, these file lists
should be in separate files, not in the main package metadata file.
The RPM collection of magic ways to declare file locations is complex
but useful so that, for example, you don't have to keep editing
your file lists when the Python version changes. I also feel that
you should have to specifically mark files in the file lists with
unusual permissions, such as setuid or setgid bits.
The natural way to start packaging something new in this system would
be to clone its repository and then start adding the package control
files. The packaging system could make this easier by having
additional tools that you ran in the root of your just-cloned
repository and looked around to find indications of things like the
name, the version (based on repository tags), the build system in
use, and so on, and then wrote out preliminary versions of the
control files. More tools could be used incrementally for things
like generating the file lists; you'd run the build and 'install'
process, then have a tool inventory the installed files for you
(and in the process it could recognize places where it should change
absolute paths into specially encoded ones for things like 'the
current Python package location').
This sketch leaves a lot of questions open, such as what 'source
packages' should look like when published by distributions. One
answer is to publish the VCS repository but that's potentially quite
heavyweight, so you might want a more minimal form. However, once
you create a 'source only' minimal form without the VCS history,
you're going to want a way to disentangle your local changes from
the upstream source.
A commentator on my entry on why Debian and RPM (source) packages
are complicated suggested looking at Arch
Linux packaging, where most of the information is in a single file
as more or less a shell script (example).
Unfortunately, I'm not a fan of this sort of shell script or shell
script like format, ultimately because it's only declarative by
convention (although I suspect Arch enforces some of those conventions).
One reason that declarative formats are important is that you can
analyze and understand what they do without having to execute code.
Another reason is that such formats naturally standardize things,
which makes it much more likely that any divergence from the standard
approach is something that matters, instead of a style difference.
Being able to analyze and manipulate declarative (source) packaging
is useful for large scale changes within a distribution. The RPM
source package format uses standard,
more or less declarative macros to build most software, which I
understand has made it relatively easy to build a lot of software
with special C and C++ hardening options. You can inject similar
things into a shell script based environment, but then you wind up
with ad-hoc looking modifications in some circumstances, as we
see in the Dovecot example.
Some things about declarative source packages versus Arch style
minimalism are issues of what could be called 'hygiene'. RPM packages
push you to list and categorize what files will be included in the
built binary package, rather than simply assuming that everything
installed into a scratch hierarchy should be packaged. This can be
frustrating (and there are shortcuts), but it does give you a chance
to avoid accidentally shipping unintended files. You could do this
with shell script style minimal packaging if you wanted to, of
course. Both RPM and Debian packages have standard and relatively
declarative ways to modify a pristine upstream package, and while
you can do that in Arch packages, it's not declarative, which hampers
various sorts of things.
Basically my feeling is that at scale, you're likely to wind up
with something that's essentially as formulaic as a declarative
source package format without having its assured benefits. There
will be standard templates that everyone is supposed to follow and
they mostly will, and you'll be able to mostly analyze the result,
and that 'mostly' qualification will be quietly annoying.
(On the positive side, the Arch package format does let you run
shellcheck on your shell stanzas, which isn't straightforward to
do in the RPM source format.)
A commentator on my early notes on dgit
mentioned that they found packaging in Debian overly complicated
(and I think perhaps RPMs as well) and would rather build and ship
a container. On the one hand, this is in a way fair; my impression
is that the process of specifying and building a container is rather
easier than for source packages. On the other hand, Debian and RPM
source packages are complicated for good reasons.
Any reasonably capable source package format needs to contain a
number of things. A source package needs to supply the original
upstream source code, some amount of distribution changes, instructions
for building and 'installing' the source, a list of (some) dependencies
(for either or both build time and install time), a list of files
and directories it packages, and possibly additional instructions
for things to do when the binary package is installed (such as
creating users, enabling services, and so on). Then generally you
need some system for 'hermetic' builds,
ones that don't depend on things in your local (Linux) login
environment. You'll also want some amount of metadata to go with
the package, like a name, a version number, and a description. Good
source package formats also support building multiple binary packages
from a single source package, because sometimes you want to split
up the built binary files to reduce the amount of stuff some people
have to install. A built binary package contains a subset of this;
it has (at least) the metadata, the dependencies, a file list, all
of the files in the file list, and those install and upgrade time
instructions.
Built containers are a self contained blob plus some metadata. You
don't need file lists or dependencies or install and removal actions
because all of those are about interaction with the rest of the
system and by design containers don't interact with the rest of the
system. To build a container you still need some of the same
information that a source package has, but you need less and it's
deliberately more self-contained and freeform. Since the built
container is a self contained artifact you don't need a file list,
I believe it's uncommon to modify upstream source code as part of
the container build process (instead you patch it in advance in
your local repository), and your addition of users, activation of
services, and so on is mostly free form and at container build time;
once built the container is supposed to be ready to go. And my
impression is that in practice people mostly don't try to do things
like multiple UIDs in a single container.
(You may still want or need to understand what things you install
where in the container image, but that's your problem to keep track
of; the container format itself only needs a little bit of information
from you.)
Containers have also learned from source packages in that they can
be layered, which is to say that you can build your container by
starting from some other container, either literally or by sticking
another level of build instructions on the end. Layered source
packages don't make any sense when you're thinking like a distribution,
but they make a lot of sense for people who need to modify the
distribution's source packages (this is what dgit makes much
easier, partly because Git is effectively
a layering system; that's one way to look at a sequence of Git
commits).
(My impression of container building is that it's a lot more ad-hoc
than package building. Both Debian and RPM have tried to standardize
and automate a lot of the standard source code building steps, like
running autoconf, but the cost of this is that each of them has a
bespoke set of 'convenient' automation to learn if you want to build
a package from scratch. With containers, you can probably mostly
copy the upstream's shell-based build instructions (or these days,
their Dockerfile).)
Dgit based building of (potentially modified) Debian packages can
be surprisingly close to the container building experience. Like
containers, you first prepare your modifications in a repository
and then you run some relatively simple commands to build the
artifacts you'll actually use. Provided that your modifications
don't change the dependencies, files to be packaged, and so on, you
don't have to care about how Debian defines and manipulates those,
plus you don't even need to know exactly how to build the software
(the Debian stuff takes care of that for you, which is to say that
the Debian package builders have already worked it out).
In general I don't think you can get much closer to the container
build experience than with the dgit build experience or the general
RPM experience (if you're starting from scratch). Packaging takes
work because packages aren't isolated, self
contained objects; they're objects that need to be integrated into
a whole system in a reversible way (ie, you can uninstall them, or
upgrade them even though the upgraded version has a somewhat different
set of files). You need more information, more understanding, and
a more complicated build process.
(Well, I suppose there are flatpaks (and snaps). But these
mostly don't integrate with the rest of your system; they're
explicitly designed to be self-contained, standalone artifacts that
run in a somewhat less isolated environment than containers.)
Suppose, not entirely hypothetically, that you've made local changes
to an Ubuntu package on one Ubuntu release, such as 22.04 ('jammy'),
and now you want to move to another Ubuntu release such as 24.04
('noble'). If you're working with straight 'apt-get source' Ubuntu
source packages, this is done by tediously copying all of your
patches over (hopefully the package uses quilt)
to duplicate and recreate your 22.04 work.
If you're using dgit, this is
much easier. Partly this is because dgit is based on Git, but partly
this is because dgit has an extremely convenient feature where it
can have several different releases in the same Git repository. So
here's what we want to do, assuming you have a dgit repository for
your package already.
(For safety you may want to do this in a copy of your repository.
I make rsync'd copies of Git repositories all the time for stuff
like this.)
Our first step is to fetch the new 24.04 ('noble') version of the
package into our dgit repository as a new dgit branch, and then
check out the branch:
dgit fetch -d ubuntu noble,-security,-updates
dgit checkout noble,-security,-updates
We could do this in one operation but I'd rather do it in two, in
case there are problems with the fetch.
The Git operation we want to do now is to cherry-pick (also) our changes to the 22.04
version of the package onto the 24.04 version of the package. If
this goes well the changes will apply cleanly and we're done.
However, there is a complication. If we've followed the usual
process for making dgit-based local changes,
the last commit on our 22.04 version is an update to debian/changelog.
We don't want that change, because we need to do our own 'gbp dch'
on the 24.04 version after we've moved our own changes over to make
our own 24.04 change to debian/changelog (among other things, the
22.04 changelog change has the wrong version number for the 24.04
package).
In general, cherry-picking all our local changes is 'git cherry-pick
old-upstream..old-local'. To get all but the last change, we want
'old-local~' instead. Dgit has long and somewhat obscure branch
names; its upstream for our 22.04 changes is
'dgit/dgit/jammy,-security,-updates' (ie, the full 'suite' name we
had to use with 'dgit clone' and 'dgit fetch'), while our local
branch is 'dgit/jammy,-security,-updates'. So our full command,
with a 'git log' beforehand to be sure we're getting what we want,
is:
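Substituting those branch names into the 'old-upstream..old-local~'
form gives something like this (a sketch; look over the 'git log'
output first to make sure it covers only your local changes):
git log --oneline dgit/dgit/jammy,-security,-updates..dgit/jammy,-security,-updates~
git cherry-pick dgit/dgit/jammy,-security,-updates..dgit/jammy,-security,-updates~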
(We've seen this dgit/dgit/... stuff before when doing 'gbp dch'.)
Then we need to make our debian/changelog update. Here, as an
important safety tip, don't blindly copy the command you used while
building the 22.04 package, using 'jammy,...' in the --since argument,
because that will try to create a very confused changelog of
everything between the 22.04 version of the package and the 24.04
version. Instead, you obviously need to update it to your new 'noble'
24.04 upstream, making it:
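A sketch of the updated command (here '+cslab' is just an example
local version suffix; use whatever gbp dch options you used for your
22.04 build):
gbp dch --since=dgit/dgit/noble,-security,-updates --local=+cslab --commit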
('git reset --hard HEAD~' may be useful if you make a mistake here.
As they say, ask me how I know.)
If the cherry-pick doesn't apply cleanly, you'll have to resolve
that yourself. If the cherry-pick applies cleanly but the result
doesn't build or perhaps doesn't work because the code has changed
too much, you'll be using various ways to modify and update your
changes. But at
least this is a bunch easier than trying to sort out and update a
quilt-based patch series.
Appendix: Dealing with Ubuntu package updates
Based on this conversation,
if Ubuntu releases a new version of the package, what I think I need
to do is to use 'dgit fetch' and then explicitly rebase:
dgit fetch -d ubuntu
You have to use '-d ubuntu' here or 'dgit fetch' gets confused and
fails. There may be ways to fix this with git config settings, but
setting them all is exhausting and if you miss one it explodes, so
I'm going to have to use '-d ubuntu' all the time (unless dgit fixes
this someday).
Dgit repositories don't have an explicit Git upstream set, so I
don't think we can use plain rebase. Instead I think we need
the more complicated form:
git rebase dgit/dgit/jammy,-security,-updates dgit/jammy,-security,-updates
(Until I do it for real, these arguments are speculative. I believe
they should work if I understand 'git rebase' correctly, but I'm not
completely sure. I might need the full three argument form and to make
the 'upstream' a commit hash.)
Then, as above, we need to drop our debian/changelog change and redo
it:
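As the appendix itself says, this is speculative, but a minimal
sketch of dropping the old changelog commit and redoing it would be
something like:
git reset --hard HEAD~
gbp dch --since=dgit/dgit/jammy,-security,-updates --local=+cslab --commit
(with '+cslab' again standing in for whatever local suffix you use).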
(There may be a clever way to tell 'git rebase' to skip the last
change, or you can do an interactive rebase (with '-i') instead of
a non-interactive one and delete it yourself.)
I would really like to be able to patch and rebuild Ubuntu packages
from a git repository with our local changes (re)based on top of
upstream git. It would be much better than quilt'ing and debuild'ing
.dsc packages (I have non-complimentary opinions on the Debian source
package format). This news gives me hope that it'll be possible
someday, but especially for Ubuntu I have no idea how soon or how well
documented it will be.
(It could even be better than RPMs.)
The subsequent discussion got me to try out dgit, especially
since it had an attractive dgit-user(7)
manual page that gave very simple directions on how to make a local
change to an upstream package. It turns out that things aren't
entirely smooth on Ubuntu, but they're workable.
The starting point is 'dgit clone', but on Ubuntu you currently get
to use special arguments that aren't necessary on Debian:
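The shape of the command is something like this (a sketch, with
<package> standing in for the actual source package name):
dgit clone -d ubuntu <package> jammy,-security,-updates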
(You don't have to do this on a machine running 'jammy' (Ubuntu
22.04); it may be more convenient to do it from another one, perhaps
with a more up to date dgit.)
The latest Ubuntu package for something may be in either their
<release>-security or their <release>-updates 'suite', so you need
both. I think this is equivalent to what 'apt-get source' gets you,
but you might want to double check. Once you've gotten the source
in a Git repository, you can modify it and commit those modifications
as usual, for example through Magit.
If you have an existing locally patched version of the package that
you did with quilt, you can import all
of the quilt patches, either one by one or all at once and then
using Magit's selective commits to sort things out.
Having made your modifications, whether tentative or otherwise,
you can now automatically modify debian/changelog:
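Something along these lines (a sketch; '+cslab' is just an example
local version suffix):
gbp dch --since=dgit/dgit/jammy,-security,-updates --local=+cslab --commit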
(You might want to use -S for snapshots when testing modifications
and builds, I don't know. Our
practice is to use --local to add a local suffix on the upstream
package number, so we can keep our packages straight.)
The special bit is the 'dgit/dgit/<whatever you used in dgit clone>',
which tells gbp-dch
(part of the gbp suite
of stuff) where to start the changelog from. Using --commit is
optional; what I did was to first run 'gbp dch' without it, then
use 'git diff' to inspect the resulting debian/changelog changes,
and then 'git restore debian/changelog' and re-run it with a better
set of options until eventually I added the '--commit'.
You can then install build-deps (if necessary) and build the binary
packages with the dgit-user(7) recommended 'dpkg-buildpackage
-uc -b'. Normally I'd say that you absolutely want to build source
packages too, but since you have a Git repository with the state
frozen that you can rebuild from, I don't think it's necessary here.
(After the build finishes you can admire 'git status' output that
will tell you just how many files in your source tree the Debian
or Ubuntu package building process modified. One of the nice things
about using Git and building from a Git repository is that you can
trivially fix them all, rather than the usual set of painful
workarounds.)
The dgit-user(7) manual page suggests but doesn't confirm that
if you're bold, you can build from a tree with uncommitted changes.
Personally, even if I was in the process of developing changes I'd
commit them and then make liberal use of rebasing, git-absorb, and so on to keep updating
my (committed) changes.
It's not clear to me how to integrate upstream updates (for example,
a new Ubuntu update to the Dovecot package) with your local changes.
It's possible that 'dgit pull' will automatically rebase your
changes, or give you the opportunity to do that. If not, you can
always do another 'dgit clone' and then manually import your Git
changes as patches.
(A disclaimer: at this point I've only cloned, modified, and built
one package, although it's a real one we use. Still, I'm sold; the
ability to reset the tree after a build is valuable all by itself,
never mind having a better way than quilt to handle making changes.)
Over the years this difference between OpenBSD and FreeBSD was
a common point of discussion, often in overly generalised (and
as a result, deeply inaccurate) terms. Thanks to recent efforts
by Kristof Provost and Kajetan
Staszkiewicz focused on aligning FreeBSD's pf with the one in
OpenBSD, that discussion can be put to rest.
A change that's important for us in FreeBSD 15.0 is that OpenBSD
style integrated NAT rules are now supported in the FreeBSD PF.
Last year as we were exploring FreeBSD, I wrote about OpenBSD
versus FreeBSD syntax for NAT,
where a single OpenBSD rule that both passed traffic and NAT'd it
had to be split into two FreeBSD rules in the basic version. With
FreeBSD 15, we can write NAT rules using the OpenBSD version of
syntax.
(I'm talking about syntax here because I don't care about how it's
implemented behind the scenes. PF already performs some degree of
ruleset transformations, so if the syntax works and the semantics
don't change, we're happy even if a peek under the hood would show
two rules. But I believe that the FreeBSD 15 changes mean that
FreeBSD now has the OpenBSD implementation of this too.)
So far we've converted two
firewall rulesets to the old PF NAT syntax, one a simple case that's
now in production and a second, more complex one that's not yet in
production. We were holding off on our most complex PF NAT firewall,
which is complex partly because it uses some stuff that's close to
policy based routing. The release
of FreeBSD 15 will make it easier to migrate this firewall (in the
new year, we don't make big firewall changes shortly before our
winter break).
In general, I'm quite happy that FreeBSD and OpenBSD have reached
close to parity in their PF as of FreeBSD 15, because that makes
it easier to choose between them based on what other aspects of them
you like.
(I say 'close to' based on Kristof Provost's comment about the
situation on this entry. The
situation will get even better (ie, closer) in future FreeBSD
versions.)
If you use systemd units or
systemd-run to conveniently capture
output from scripts and programs into the systemd journal, one of
the things that it looks like you don't get is message priorities
and (syslog) facilities. Fortunately, systemd's journal support is
a bit more sophisticated than that.
When you print out regular output and systemd captures it into the
journal, systemd assigns it a default priority that's set with
SyslogLevel=;
this is normally 'info', which is a good default choice. Similarly,
you can pick the syslog facility associated with your unit or your
systemd-run
invocation with SyslogFacility=.
Systemd defaults to 'daemon', which may not entirely be what you
want. On the other hand, the choice of syslog facility matters less
if you're primarily working with journalctl,
where what you usually care about is the systemd unit name.
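In a unit file this looks something like the following; 'local0' is
just an illustrative facility choice:
[Service]
SyslogFacility=local0
SyslogLevel=notice
I believe systemd-run can set the same properties on the fly with
'-p', as in 'systemd-run -p SyslogFacility=local0 -p SyslogLevel=notice
/some/command'.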
(You can use journalctl to select messages by priority
or syslog facility
with the -p and --facility options. You can also select by syslog
identifier
with the -t option. This is probably going to be handy for searching
the journal for messages from some of our programs that use syslog
to report things.)
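For example (the unit name and identifier here are made up):
journalctl -u local-whatever.service -p warning
journalctl -t myscript --since today
journalctl --facility auth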
If you know that you're logging to systemd (or you don't care that
your regular output looks a bit weird in spots), you can also print
messages with special priority markers, as covered in sd-daemon(3).
Now that I know about this, I may put it to use in some of our
scripts and programs. Sadly, unlike the normal Linux logger and its
--prio-prefix option, you can't change the syslog facility this
way, but if you're doing pure journald logging you probably don't
care about that.
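The markers are just a '<N>' priority-number prefix on each line of
output, so in a script whose output is going to the journal you can
do things like:
echo '<3>something went badly wrong'
echo '<7>debugging detail that normally stays hidden'
(3 is 'err' and 7 is 'debug' in the usual syslog numbering.)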
(It's possible that sd-daemon(3) actually supports the logger
behavior of changing the syslog facility too, but if so it's not
documented and you shouldn't count on it. Instead you should assume
that you have to control the syslog facility through setting
SyslogFacility=, which unfortunately means you can't log just
authentication things to 'auth' and everything else to 'daemon' or
some other appropriate facility.)
PS: Unfortunately, as far as I know journalctl has no way to augment
its normal syslog-like output with some additional fields, such as
the priority or the syslog facility. Instead you have to go all the
way to a verbose dump of information in one of the supported
formats for field selection.
The venerable 'logger' command has been around so long it's part
of the Single Unix Specification (really, logger - log messages).
Although syslog(3)
is in 4.2 BSD (along with syslog(8),
the daemon), it doesn't seem to have been until 4.3 BSD that we got
logger(1),
with more or less the same arguments as the POSIX version.
Unfortunately, if you want to do more than throw messages into your
syslog and actually create well-formed, useful syslog messages, 'logger' has some annoyances and
flaws.
The flaw is front and center in the manual page and the POSIX
specification,
if you read the description of the -i option carefully:
-i: Log the process ID of the logger process with each
message.
(Emphasis mine.)
In shell scripts where you want to report the script's activities
to syslog, it's not unusual to want to report more than one thing.
In well-formed syslog messages,
these would all have the same PID, so that you can tell that they
all came from the same invocation of your script. Logger doesn't
support this; if you run logger several times over the course of
your script and use '-i', every log message will have a different
PID. In some environments (such as FreeBSD and Linux with systemd),
logger usually puts in its own PID whether you like it or not.
(The traditional fake for this was to not use '-i' and then embed
your script's PID into your syslog identifier (FreeBSD even recommends
this in their logger(1)
manual page). This worked okay when syslog identifiers were nothing
more than what got stuck on the front of the message in your log
files, but these days it's not necessarily ideal even if your
'logger' environment doesn't add a PID itself. If you're sending
syslog to a log aggregation system, the identifier can be meaningful
and important and you want it to be a constant for a given message
source so you can search on it.)
Since it's a front end to syslog, logger inherits the traditional
syslog issues that you have to select a meaningful syslog facility,
priority, and identifier (traditionally, the basename of your script).
On the positive side, you can easily vary these from message to
message; on the not so great side, you have to supply them for every
logger invocation and it's on you to make sure all of your uses of
logger use the same ones. Logger doesn't insist that you provide
these and it doesn't have any mechanism (such as a set of environment
variables) for you to provide defaults. This was a bigger issue in
the days before shell functions, since these days you can write a
'logit' function for your shell script that invokes logger correctly
(for your environment). This function is also a good place to
automatically embed your script's PID in the logged message (perhaps
as 'pid=... <supplied message>').
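A sketch of such a function, where the 'myscript' identifier and the
'daemon.info' facility and priority are things you'd pick for your
own script:
logit() {
    logger -t myscript -p daemon.info "pid=$$ $*"
}
logit "starting nightly cleanup"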
Out of the three of these, the syslog identifier is the easiest to
do a good job of (since you should be picking a meaningful name for
your script anyway) but the traditional syslog environment makes
the identifier relatively meaningless.
It's possible to send all of the output of your script to syslog,
or with a bunch of work you can send just
standard error to syslog (and perhaps repeat it again). But doing
either of these requires wrapping the body of your script up and
feeding all of it to logger:
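A sketch of the 'wrap the whole body up' approach, with made-up
identifier, facility, and priority:
{
    echo "starting work"
    some-command --some-flag
    echo "all done"
} 2>&1 | logger -t myscript -p daemon.info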
(Everything will have the same facility and priority, but if it's
really important to log things at a different priority you can put
in direct 'logger' invocations in the body of the script.)
I suspect that people who used logger a lot probably wrote a
wrapper script (you could call it 'stderr-to-syslog') and ran all
of the real scripts under it.
All of this adds up to a collection of small annoyances. It's not
impossible to use logger in scripts to push things into syslog, but
generally it has to be relatively important to capture the information.
There's nothing off the shelf that makes it easy. And if you want
to have portable logging for your scripts, this basic logger use
is all you get.
In a recent entry, I said
in passing that the venerable logger utility had some
amount of annoyances associated with it. In order to explain those
annoyances, I need to first talk about what goes into a well-formed,
useful Unix syslog entry in a traditional Unix syslog environment.
(This is 'well-formed' in a social sense, not in a technical sense
of simply conforming to the syslog message format. There are a lot
of ways to produce technically 'correct' syslog messages that are
neither well formed nor useful.)
A well-formed syslog entry is made up from a number of pieces:
A timestamp, the one thing that you don't have to worry about because
your syslog environment should automatically generate it for you.
(Your syslog environment will also assign a hostname, which you
also don't worry about.)
An appropriate syslog facility, chosen from the assorted options
that you generally find listed in your local syslog(3) (the available
facilities vary from Unix to Unix). Your program may
need to log to multiple different facilities depending on what
the messages are about; for example, a network daemon that does
authentication should probably send authentication related messages
to 'auth' or 'authpriv' and general things to 'daemon'.
An appropriate syslog level (aka priority), where you need to at
least distinguish between informational reports ('info'), things
only of interest during debugging problems ('debug', and probably
normally not logged), and active errors that need attention
('error'). Using more levels is useful if they make sense in your
program.
A meaningful and unique identifier ('tag' in logger) that
identifies your program as the source of the syslog entry and
groups all of its syslog entries together. This is normally
expected to be the name of your program or perhaps your system.
All syslog entries from your program should have this identifier.
Your process ID (PID), to uniquely identify this instance of your
program. Your syslog entries should include a PID even if only
one instance of your program is ever running at a time, because
that lets system administrators match your syslog messages up
with other PID-based information and also tell if and when your
program was restarted.
(Under normal circumstances, all messages logged by a single
instance of your program should use the same PID, because that's
how people match up messages to get all of the ones this particular
instance generated.)
The content and importance of your message text should match the
syslog level of the syslog entry; if your text says 'ERROR' but you logged
at level 'info', this isn't really a well-formed syslog entry. This
goes double if you're using a semi-structured message text format,
so that you actually logged 'level=error ...' at level 'info' (or
the other way around).
All of this is in service to letting people find your program's
syslog entries, pick out the important ones, understand them, and
categorize both your syslog entries and syslog entries from other
programs. If a busy sysadmin wants to see an overview of all
authentication activity, they should be able to look at where they're
sending 'auth' logs. If they want to look for problems, they can
look for 'error' or higher priority logs. And the syslog facility
your program uses should be sensible in general, although there
aren't many options these days (and you should probably allow the
local system administrators to pick what facility you normally use,
so they can assign you a unique local one to collect just your logs
somewhere).
A good library or tool for making syslog entries should make it as
easy as possible to create well-formed, useful syslog entries. I
will note in passing that the traditional syslog(3) API is not
ideal for this, because it assumes that your program will log all
entries in a single facility, which is not necessarily true for
programs that do authentication and something else.
(The short summary is that you probably want to use systemd-run with
a specific unit name that you pick.)
Systemd-cat is
very roughly the systemd equivalent of logger. As you'd
expect, things that it puts in the systemd journal flow through to
anywhere that regular journal entries would, including things
that directly get fed from the journal and syslog (including
remote syslog destinations). The
most convenient way to use systemd-cat is to just have it run a
command, at which point it will capture all of the output from the
command and put it in the journal. However, there is a little issue
with using just 'systemd-cat /some/command', which is that the
journal log identifiers that systemd-cat generates in this case
will be the direct name of whatever program produced the output.
If /some/command is a script that runs a variety of programs that
produce output (perhaps it echos some status information itself
then runs a program, which produces output on its own), you'll
get a mixture of identifier names in the resulting log:
Journal logs written by systemd-cat also inherit whatever unit it
was in (a session unit, cron.service, etc), and the combination can
make it hard to clearly see all of the logs from running your script.
To do better you need to give systemd-cat an explicit identifier,
'systemd-cat -t <something> /some/command', at which point everything
is logged with that name, but still in whatever systemd unit
systemd-cat ran in.
Generally you want your script to report all its logs under a single
unit name, so you can find them and sort them out from all of the
other things your system is logging. To do this you need to use
systemd-run with an explicit unit name:
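A minimal sketch, with a made-up unit name:
systemd-run --unit=local-myscript /some/script
Afterward 'journalctl -u local-myscript.service' will find all of its
output in one place.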
I believe you can then hook this into any systemd service unit
infrastructure you want, such as sending email if the unit fails (if you do, you probably want to add
'--service-type=oneshot'). Using systemd-run this way gets you the
best of both systemd-cat worlds; all of the output from /some/script
will be directly labeled with what program produced it, but you can
find it all using the unit name.
Systemd-run will refuse to activate a unit with a name that duplicates
an existing unit, including existing systemd-run units. In many
cases this is a feature for script use, since you basically get
'run only one copy' locking for free (although the error message
is noisy, so you may want to do your own quiet locking). If you
want to always run your program even if another instance is running,
you'll have to generate non-constant unit names (or let systemd-run
do it for you).
Systemd-cat has some features that systemd-run doesn't offer, such
as setting the priority of messages (and setting a different priority
for standard error output). If these features are important to you,
I'd suggest nesting systemd-cat (with no '-t' argument) inside
systemd-run, so you get both the searchable unit name and the
systemd-cat features. If you're already in an environment with a
useful unit name and you just need to divert log messages from
wherever else the environment wants to send them into the system
journal, bare systemd-cat will do the job.
(Arguably this is the case for things run from cron, if you're
content to look for all of them under cron.service (or crond.service,
depending on your Linux distribution). Running things under systemd-cat
puts their output in the journal instead of having them send you
email, which may be good enough and saves you having to invent and
then remember a bunch of unit names.)
Suppose, not hypothetically,
that you have a third party tool that you need to run periodically.
This tool prints things to standard output (or standard error) that
are potentially useful to capture somehow. You want this captured
output to be associated with the program (or your general system
for running the program) and timestamped, and it would be handy if
the log output wound up in all of the usual places in your systems
for output. Unix has traditionally had some solutions for this, such
as logger
for sending things to syslog, but they all have a certain amount of
annoyances associated with them.
(If you directly run your script or program from cron, you will
automatically capture the output in a nice dated form, but you'll
also get email all the time. Let's assume we want a quieter experience
than email from cron, because you don't need to regularly see the
output, you just want it to be available if you go looking.)
On modern Linux systems, the easy and lazy thing to do is to run
your script or program from a systemd service unit, because systemd
will automatically do this for you and send the result into the
systemd journal (and anything that pulls data from that) and, if
configured, into whatever overall systems you have for handling
syslog logs. You want a unit like this:
[Unit]
Description=Local: Do whatever
ConditionFileIsExecutable=/root/do-whatever
[Service]
Type=oneshot
ExecStart=/root/do-whatever
Unlike the usual setup for running scripts as systemd services, we
don't set 'RemainAfterExit=True' because we want to be able to
repeatedly trigger our script with, for example, 'systemctl start
local-whatever.service'. You can even arrange to get email if
this unit (ie, your script) fails.
You can run this directly from cron through suitable /etc/cron.d
files that use 'systemctl start', or set up a systemd timer unit
(possibly with a randomized start time).
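A sketch of such a timer unit (the schedule and delay here are just
examples); by default a timer named local-whatever.timer starts the
matching local-whatever.service:
[Unit]
Description=Local: periodically do whatever
[Timer]
OnCalendar=daily
RandomizedDelaySec=30m
[Install]
WantedBy=timers.target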
The advantage of a systemd timer unit is that you definitely won't
ever get email about this unless you specifically configure it. If
you're setting up a relatively unimportant and throwaway thing,
having it be reliably silent is probably a feature.
(Setting up a systemd timer unit also keeps everything within the
systemd ecosystem rather than worrying about various aspects of
running 'systemctl start' from scripts or crontabs or etc.)
On the one hand, it feels awkward to go all the way to a systemd
service unit simply to get easy to handle logs; it feels like there
should be a better solution somewhere. On the other hand, it works
and it only needs one extra file over what you'd already need (the
.service).
A while back I wrote about how in POSIX you could theoretically
use inode (number) zero. Not all
Unixes consider inode zero to be valid; prominently, OpenBSD's
getdents(2) doesn't return
valid entries with an inode number of 0, and by extension, OpenBSD's
filesystems won't have anything that uses inode zero. However, Linux
is a different beast.
Some Linux filesystems have been known to return valid directory
entries with an inode number of zero, and Go now accepts such entries
when reading directories. This new behavior also puts Go in agreement
with recent glibc.
This fixes issue #76428,
and the issue has a simple reproduction to create something with inode
numbers of zero. According to the bug report:
[...] On a Linux system with libfuse 3.17.1 or later, you can do this
easily with GVFS:
# Create many dir entries
(cd big && printf '%04x ' {0..1023} | xargs mkdir -p)
gio mount sftp://localhost/$PWD/big
The resulting filesystem mount is in /run/user/$UID/gvfs (see the
issue for the exact
long path) and can be experimentally verified to have entries with
inode numbers of zero (well, as reported by reading the directory).
On systems using glibc 2.37 and later, you can look at this directory
with 'ls' and see the zero inode numbers.
(Interested parties can try their favorite non-C or non-glibc
bindings to see if those environments correctly handle this case.)
That this requires glibc 2.37 is due to this glibc bug, first
opened in 2010 (but rejected at the time for reasons you can read
in the glibc bug) and then resurfaced in 2016 and eventually
fixed in 2022 (and then again in 2024 for the thread safe version
of readdir). The 2016 glibc issue has a bit
of a discussion about the kernel side. As covered in the Go issue,
libfuse returning a zero inode number may be a bug itself, but there are
(many) versions of libfuse out in the wild that actually do this
today.
Of course, libfuse (and gvfs) may not be the only Linux filesystems
and filesystem environments that can create this effect. I believe
there are alternate language bindings and APIs for the kernel FUSE (also, also) support, so they might
have the same bug as libfuse does.
(Both Go and Rust have at least one native binding to the kernel
FUSE driver. I haven't looked at either to see what they do about
inode numbers.)
PS: My understanding of the Linux (kernel) situation is that if you
have something inside the kernel that needs an inode number and you
ask the kernel to give you one (through get_next_ino(), an
internal function for this), the kernel will carefully avoid giving
you inode number 0. A lot of things get inode numbers this way, so
this makes life easier for everyone. However, a filesystem can
decide on inode numbers itself, and when it does it can use inode
number 0 (either explicitly or by zeroing out the d_ino field
in the getdents(2) dirent
structs that it returns, which I believe is what's happening in the
libfuse situation).
The X Window System
has a long standing concept called 'visuals'; to simplify, an X
visual determines how your pixel values are turned into colors. As I
wrote about a number of years ago, these days X11 mostly uses
'TrueColor' visuals, which directly supply
8-bit values for red, green, and blue ('24-bit color'). However X11
has a number of visual types, such as
the straightforward PseudoColor indirect colormap (where every
pixel value is an index into an RGB colormap; typically you'd get
8-bit pixels and 24-bit colormaps, so you could have 256 colors out
of a full 24-bit gamut). One of the (now) obscure visual types is
DirectColor. To quote:
For DirectColor, a pixel value is decomposed into separate RGB
subfields, and each subfield separately indexes the colormap for the
corresponding value. The RGB values can be changed dynamically.
In a PseudoColor visual, each pixel's value is taken as a whole and
used as an index into a colormap that gives the RGB values for that
entry. In DirectColor, the pixel value is split apart into three
values, one each for red, green, and blue, and each value indexes
a separate colormap for that color component. Compared to a PseudoColor
visual of the same pixel depth (size, eg each pixel is an 8-bit
byte), you get less possible variety within a single color component
and (I believe) no more colors in total.
[...] maybe it can be implemented as three LUTs in front of a DAC's
inputs or something where the performance impact is minimal? (I'm
not a hardware person.) [...]
I was recently reminded of this old entry and when I reread that
comment, an obvious realization struck me about why DirectColor
might make hardware sense. Back in the days of analog video,
essentially every serious sort of video connection between your
computer and your display carried the red, green, and blue components
separately; you can see this in the VGA connector pinouts, and on old
Unix workstations these might literally be separate wires connected
to separate BNC connectors
on your CRT display.
If you're sending the red, green, and blue signals separately you
might also be generating them separately, with one DAC per color
channel. If you have separate DACs, it might be easier to feed them
from separate LUTs
and separate pixel data, especially back in the days when much of
a Unix workstation's graphics system was implemented in relatively
basic, non-custom chips and components. You can split off the bits
from the raw pixel value with basic hardware and then route each
color channel to its own LUT, DAC, and associated circuits (although
presumably you need to drive them with a common clock).
The other way to look at DirectColor is that it's a more flexible
version of TrueColor. A TrueColor visual is effectively a 24-bit
DirectColor visual where the color mappings for red, green, and
blue are fixed rather than variable (this is in fact how it's
described in the X documentation). Making
these mappings variable costs you only a tiny bit of extra memory
(you need 256 bytes for each color) and might require only a bit
of extra hardware in the color generation process, and it enables
the program using the display to change colors on the fly with small
writes to the colormap rather than large writes to the framebuffer
(which, back in the days, were not necessarily very fast). For
instance, if you're looking at a full screen image and you want to
brighten it, you could simply shift the color values in the colormaps
to raise the low values, rather than recompute and redraw all the
pixels.
These days this is mostly irrelevant and the basic simplicity of
the TrueColor visual has won out. Well, what won out is PC graphics
systems that followed the same basic approach of fixed 24-bit RGB
color, and then X went along with it on PC hardware, which became
more or less the only hardware.
(There probably was hardware with DirectColor support. While X on PC Unixes will probably still
claim to support DirectColor visuals, as reported in things like
xdpyinfo, I suspect that it involves software emulation. Although
these days you could probably implement DirectColor with GPU
shaders at basically no cost.)
Polkit is how a lot
of things on modern Linux systems decide whether or not to let
people do privileged operations, including systemd's run0,
which effectively functions as another su or sudo. Polkit normally
has a significantly different authentication model than su or sudo,
where an arbitrary login can authenticate for privileged operations by
giving the password of any 'administrator' account (accounts in group
wheel or group admin, depending on your Linux distribution).
Suppose, not hypothetically, that you want a su like model in Polkit,
one where people in group 'wheel' can authenticate by providing the root
password, while people not in group 'wheel' cannot authenticate for
privileged operations at all. In my earlier entry on learning about
Polkit and adjusting it I put forward an
untested Polkit stanza to do this. Now I've tested it and I can provide
an actual working version.
polkit.addAdminRule(function(action, subject) {
if (subject.isInGroup("wheel")) {
return ["unix-user:0"];
} else {
// must exist but have a locked password
return ["unix-user:nobody"];
}
});
(This goes in /etc/polkit-1/rules.d/50-default.rules, and the
filename is important because it has to replace the standard version
in /usr/share/polkit-1/rules.d.)
This doesn't quite work the way 'su' does, where it will just refuse
to work for people not in group wheel. Instead, if you're not in
group wheel you'll be prompted for the password of 'nobody' (or
whatever other login you're using), which you can never successfully
supply because the password is locked.
As I've experimentally determined, it doesn't work to return an
empty list ('[]'), or a Unix group that doesn't exist
('unix-group:nosuchgroup'), or a Unix group that exists but has no
members. In all cases my Fedora 42 system falls back to asking for
the root password, which I assume is a built-in default for privileged
authentication. Instead you apparently have to return something that
Polkit thinks it can plausibly use to authenticate the person, even if
that authentication can't succeed. Hopefully Polkit will never get
smart enough to work that out and stop accepting accounts with locked
passwords.
(If you want to be friendly and you expect people on your servers
to run into this a lot, you should probably create a login with a
more useful name and GECOS field, perhaps 'not-allowed' and 'You
cannot authenticate for this operation', that has a locked password.
People may or may not realize what's going on, but at least they
have a chance.)
PS: This is with the Fedora 42 version of Polkit, which is version
126. This appears to be the most recent version from the upstream
project.
Sidebar: Disabling Polkit entirely
Initially I assumed that Polkit had explicit rules somewhere that
authorized the 'root' user. However, as far as I can tell this isn't
true; there's no normal rules that specifically authorize root or
any other UID 0 login name, and despite that root can perform actions
that are restricted to groups that root isn't in. I believe this
means that you can explicitly disable all discretionary Polkit
authorization with an '00-disable.rules' file that contains:
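Something along these lines, using Polkit's standard 'deny' result
(a sketch):
polkit.addRule(function(action, subject) {
    return polkit.Result.NO;
});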
Based on experimentation, this disables absolutely everything, even
actions that are considered generally harmless (like libvirt's
'virsh list', which I think normally anyone can do).
A slightly more friendly version can be had by creating a situation
where there are no allowed administrative users. I think this would
be done with a 50-default.rules file that contained:
polkit.addAdminRule(function(action, subject) {
// must exist but have a locked password
return ["unix-user:nobody"];
});
You'd also want to make sure that nobody is in any special groups
that rules in /usr/share/polkit-1/rules.d use to allow automatic
access. You can look for these by grep'ing for 'isInGroup'.
At a high level, Polkit
is how a lot of things on modern Linux systems decide whether or
not to let you do privileged operations. After looking into it a
bit, I've wound up feeling that Polkit
has both good and bad aspects from the perspective of a system
administrator (especially a system administrator with multi-user
Linux systems, where most of the people using them aren't supposed
to have any special privileges). While I've used (desktop) Linuxes
with Polkit for a while and relied on it for a certain amount of
what I was doing, I've done so blindly, effectively as a normal
person. This is the first I've looked at the details of Polkit,
which is why I'm calling this my early reactions.
On the good side, Polkit is a single source of authorization
decisions, much like PAM. On a modern Linux system, there are a
steadily increasing number of programs that do privileged things,
even on servers (such as systemd's run0). These
could all have their own bespoke custom authorization systems, much
as how sudo has its own custom one, but instead most of them have
centralized on Polkit. In theory Polkit gives you a single thing to
look at and a single thing to learn, rather than learning systemd's
authentication system, NetworkManager's authentication system, etc.
It also means that programs have less of a temptation to hard-code
(some of) their authentication rules, because Polkit is very flexible.
(In many cases programs couldn't feasibly use PAM instead, because
they want certain actions to be automatically authorized. For
example, in its standard configuration libvirt wants everyone in
group 'libvirt' to be able to issue libvirt VM management commands
without constantly having to authenticate. PAM could probably be
extended to do this but it would start to get complicated, partly
because PAM configuration files aren't a programming language and
so implementing logic in PAM gets awkward in a hurry.)
On the bad side, Polkit is a non-declarative authorization system,
and a complex one with its rules not in any single place (instead
they're distributed through multiple files in two different formats).
Authorization decisions are normally made in (JavaScript) code,
which means that they can encode essentially arbitrary logic (although
there are standard forms of things). This means that the only way
to know who is authorized to do a particular thing is to read its
XML 'action' file and then look through all of the JavaScript code
to find and then understand things that apply to it.
(Even 'who is authorized' is imprecise by default. Polkit normally
allows anyone to authenticate as any administrative account, provided
that they know its password and possibly other authentication
information. This makes the passwords of people in group wheel or
group admin very dangerous things, since anyone who can get their
hands on one can probably execute any Polkit-protected action.)
This creates a situation where there's no way in Polkit to get a
global overview of who is authorized to do what, or what a particular
person has authorization for, since this doesn't exist in a declarative
form and instead has to be determined on the fly by evaluating code.
Instead you have to know what's customary, like the group that's
'administrative' for your Linux distribution (wheel or admin,
typically) and what special groups (like 'libvirt') do what, or you
have to read and understand all of the JavaScript and XML involved.
In other words, there's no feasible way to audit what Polkit is
allowing people to do on your system. You have to trust that programs
have made sensible decisions in their Polkit configuration (ones
that you agree with), or run the risk of system malfunctions by
turning everything off (or allowing only root to be authorized to
do things).
(Not even Polkit itself can give you visibility into why a decision
was made or fully predict it in advance, because the JavaScript
rules have no pre-filtering to narrow down what they apply to. The
only way you find out what a rule really does is invoking it. Well,
invoking the function that the addRule() or addAdminRule() added to
the rule stack.)
This complexity (and the resulting opacity of authorization) is
probably intrinsic in Polkit's goals. I even think they made the
right decision by having you write logic in JavaScript rather than
try to create their own language for it. However, I do wish Polkit
had a declarative subset that could express all of the simple cases,
reserving JavaScript rules only for complex ones. I think this would
make the overall system much easier for system administrators to
understand and analyze, so we had a much better idea (and much
better control) over who was authorized for what.
Polkit (also, also) is a multi-faceted
user level thing used to control access to privileged operations.
It's probably used by various D-Bus services on your system, which
you can more or less get a list of with pkaction, and
there's a pkexec program
that's like su and sudo. There are two reasons that you might
care about Polkit on your system. First, there might be tools you
want to use that use Polkit, such as systemd's run0 (which
is developing some interesting options). The other is
that Polkit gives people an alternate way to get access to root or
other privileges on your servers and you may have opinions about
that and what authentication should be required.
Unfortunately, Polkit configuration is arcane and as far as I know,
there aren't really any readily accessible options for it. For
instance, if you want to force people to authenticate for root-level
things using the root password instead of their password, as far
as I know you're going to have to write some JavaScript yourself
to define a suitable Administrator identity rule. The
polkit manual page seems
to document what you can put in the code reasonably well, but I'm
not sure how you test your new rules and some areas seem underdocumented
(for example, it's not clear how 'addAdminRule()' can be used to
say that the current user cannot authenticate as an administrative
user at all).
(If and when I wind up needing to test rules, I will probably try
to do it in a scratch virtual machine that I can blow up. Fortunately
Polkit is never likely to be my only way to authenticate things.)
Polkit also has some paper cuts in its current setup. For example,
as far as I can see there's no easy way to tell Polkit-using programs
that you want to immediately authenticate for administrative access
as yourself, rather than be offered a menu of people in group wheel
(yourself included) and having to pick yourself. It's also not clear
to me (and I lack a test system) if the default setup blocks people
who aren't in group wheel (or group admin, depending on your Linux
distribution flavour) from administrative authentication or if
instead they get to pick authenticating using one of your passwords.
I suspect it's the latter.
(All of this makes Polkit seem like it's not really built for
multi-user Linux systems, or at least multi-user systems where not
everyone is an administrator.)
PS: Now that I've looked at it, I have some issues with Polkit from
the perspective of a system administrator, but those are going to be
for another entry.
Sidebar: Some options for Polkit (root) authentication
If you want everyone to authenticate as root for administrative
actions, I think what you want is:
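That is, a sketch along the lines of the group version below, minus
the group check:
polkit.addAdminRule(function(action, subject) {
    return ["unix-user:0"];
});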
If you want to restrict this to people in group wheel, I think
you want something like:
polkit.addAdminRule(function(action, subject) {
if (subject.isInGroup("wheel")) {
return ["unix-user:0"];
} else {
// might not work to say 'no'?
return [];
}
});
If you want people in group wheel to authenticate as themselves,
not root, I think you return 'unix-user:' + subject.user instead
of 'unix-user:0'. I don't know if people still get prompted by
Polkit to pick a user if there's only one possible user.
We've been moving from OpenBSD to FreeBSD for firewalls. One advantage of this is giving
us a mirrored ZFS pool for the machine's filesystems; we have a lot of experience operating ZFS
and it's a simple, reliable, and fully supported way of getting
mirrored system disks on important machines. ZFS has checksums and
you want to periodically 'scrub' your ZFS pools to verify all of
your data (in all of its copies) through these checksums (ideally
relatively frequently). All
of this is part of basic ZFS knowledge, so I was a little bit
surprised to discover that none of our FreeBSD machines had ever
scrubbed their root pools, despite some of them having been running
for months.
It turns out that while FreeBSD comes with a configuration option
to do periodic ZFS scrubs, the option isn't enabled by default (as
of FreeBSD 14.3). Instead you have to know to enable it, which
admittedly isn't too hard to find once you start looking.
FreeBSD has a general periodic(8) system for
triggering things on a daily, weekly, monthly, or other basis. As
covered in the manual page, the default configuration for this is
in /etc/defaults/periodic.conf and you can override things by
creating or modifying /etc/periodic.conf. ZFS scrubs are a 'daily'
periodic setting, and as of 14.3 the basic thing you want is an
/etc/periodic.conf with:
# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"
FreeBSD will normally scrub each pool a certain number of days after
its previous scrub (either a manual scrub or an automatic scrub
through the periodic system). The default number of days is 35, which
is a bit high for my tastes, so I suggest that you shorten it, making
your periodic.conf stanza be:
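I believe the knob for this is 'daily_scrub_zfs_default_threshold'
(check your /etc/defaults/periodic.conf), so something like this,
where 14 days is just my example choice:
# Enable ZFS scrubs
daily_scrub_zfs_enable="YES"
# Days between automatic scrubs (the shipped default is 35)
daily_scrub_zfs_default_threshold="14"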
There are other options you can set that are covered in
/etc/defaults/periodic.conf.
(That the daily automatic scrubs happen some number of days after
the pool was last scrubbed means that you can adjust their timing
by doing a manual scrub. If you have a bunch of machines that you
set up at the same time, you can get them to space out their scrubs
by scrubbing one a day by hand, and so on.)
Looking at the other ZFS periodic options, I might also enable the
daily ZFS status report, because I'm not certain if there's anything
else that will alert you if or when ZFS starts reporting errors:
# Find out about ZFS errors?
daily_status_zfs_enable="YES"
You can also tell ZFS to TRIM your SSDs every day. As far as I can
see there's no option to do the TRIM less often than once a day; I
guess if you want that you have to create your own weekly or monthly
periodic script (perhaps by copying the 801.trim-zfs daily script
and modifying it appropriately). Or you can just do 'zpool trim
...' every so often by hand.
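If you do want the daily TRIM, I believe the periodic.conf knob is:
# TRIM SSDs in the daily periodic run
daily_trim_zfs_enable="YES"
(As always, check /etc/defaults/periodic.conf for the details.)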
Every so often I get to be surprised about some Unix thing. Today's
surprise is the actual behavior of '#!' in practice on at least
Linux, FreeBSD, and OpenBSD, which I learned about from a comment
by Aristotle Pagaltzis on my entry
on (not) using '#!/usr/bin/env'. I'll quote
the starting part here:
In fact the shebang line doesn't require absolute paths, you can
use relative paths too. The path is simply resolved from your current
directory, just as any other path would be - the kernel simply
doesn't do anything special for shebang line paths at all. [...]
I found this so surprising that I tested it on our Linux servers
as well as a FreeBSD and an OpenBSD machine. On the Linux servers
(and probably on the others too), the kernel really does accept the
full collection of relative paths in '#!'. You can write '#!python3',
'#!bin/python3', '#!../python3', '#!../../../usr/bin/python3', and so
on, and provided that your current directory is in the right place in
the filesystem, they all worked.
(On FreeBSD and OpenBSD I only tested the '#!python3' case.)
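If you want to see this for yourself, here's a quick sketch (the
file name and locations are arbitrary, and it assumes /usr/bin/python3
exists):
printf '#!python3\nimport sys\nprint(sys.executable)\n' >/tmp/relshebang
chmod +x /tmp/relshebang
cd /usr/bin && /tmp/relshebang    # works; 'python3' resolves relative to /usr/bin
cd /tmp && ./relshebang           # fails, unless you happen to have a /tmp/python3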
As far as I can tell, this behavior goes all the way back to 4.2
BSD (which isn't quite the origin point of '#!' support in the
Unix kernel but is about as close as we can
get). The execve() kernel implementation in sys/kern_exec.c
finds the program from your '#!' line with a namei() call that uses
the same arguments (apart from the name) as it did to find the
initial executable, and that initial executable can definitely be
a relative path.
Although this is probably the easiest way to implement '#!' inside
the kernel, I'm a little bit surprised that it survived in Linux
(in a completely independent implementation) and in OpenBSD (where
the security people might have had a double-take at some point). But
given Hyrum's Law there are probably
people out there who are depending on this behavior so we're now
stuck with it.
(In the kernel, you'd have to go at least a little bit out of your
way to check that the new path starts with a '/' or use a kernel
name lookup function that only resolves absolute paths. Using a
general name lookup function that accepts both absolute and relative
paths is the simplest approach.)
PS: I don't have access to Illumos based systems, other BSDs (NetBSD,
etc), or macOS, but I'd be surprised if they had different behavior.
People with access to less mainstream Unixes (including commercial
ones like AIX) can give it a try to see if there are any Unixes
that don't support relative paths in '#!'.
This is my face when I have quite a few binaries in /usr/sbin on my
office Fedora desktop that aren't owned by any package. Presumably
they were once owned by packages, but the packages got removed without
the files being removed with them, which isn't supposed to happen.
(My office Fedora install has been around for almost 20 years now
without being reinstalled, so things have had time to happen. But some
of these binaries date from 2021.)
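(Finding these is easy enough; something like this lists anything in
/usr/sbin that rpm doesn't think belongs to a package:
for f in /usr/sbin/*; do rpm -qf "$f" >/dev/null 2>&1 || echo "$f"; done
)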
There seem to be two sorts of these lingering, unowned /usr/sbin
programs. One sort, such as /usr/sbin/getcaps, seems to have been
left behind when its package moved things to /usr/bin, possibly due
to this RPM bug
(via). The
other sort is genuinely unowned programs dating to anywhere from
2007 (at the oldest) to 2021 (at the newest), which have nothing
else left of them sitting around. The newest programs are what I
believe are wireless management programs: iwconfig, iwevent, iwgetid,
iwlist, iwpriv, and iwspy, and also "ifrename" (which I believe was
also part of a 'wireless-tools' package). I had the wireless-tools
package installed on my office desktop until recently, but I removed
it some time during Fedora 40, probably sparked by the /sbin to
/usr/sbin migration, and it's possible that binaries didn't get
cleaned up properly due to that migration.
The most interesting orphan is /usr/sbin/sln, dating from 2018,
when apparently various people discovered it as an orphan on their
system. Unlike all the other orphan programs, the sln manual page
is still shipped as part of the standard 'man-pages' package and
so you can read sln(8) online.
Based on the manual page, it sounds like it may have been part
of glibc at one point.
(Another orphaned program from 2018 is pam_tally,
although it's coupled to pam_tally2.so, which did get removed.)
I don't know if there's any good way to get mappings from files to
RPM packages for old Fedora versions. If there is, I'd certainly
pick through it to try to find where various of these files came
from originally. Unfortunately I suspect that for sufficiently old
Fedora versions, much of this information is either offline or can't
be processed by modern versions of things like dnf.
(The basic information is used by eg 'dnf provides' and can be built
by hand from the raw RPMs, but I have no desire to download all of
the RPMs for decade-old Fedora versions even if they're still
available somewhere. I'm curious but not that curious.)
PS: At the moment I'm inclined to leave everything as it is until
at least Fedora 43, since RPM bugs are still being sorted out here.
I'll have to clean up genuinely orphaned files at some point but I
don't think there's any rush. And I'm not removing any more old
packages that use '/sbin/<whatever>', since that seems like it has
some bugs.
In the decades-long process of getting my fvwm config JUST RIGHT, my
xterm right-click menu now has a "duplicate" command, which opens
a new xterm with the same geometry, on the same node, IN THE SAME
DIRECTORY. (Directory info acquired via /proc.)
I have a long-standing shell function in my shell that attempts to
do this (imaginatively called 'spawn'), but this is only available
in environments where my shell is set up, so I was quite interested
in the whole area and did some experiments. The good news is that
xterm's 'spawn-new-terminal' works, in that it will start a new
xterm and the new xterm will be in the right directory. The bad
news for me is that that's about all that it will do, and in my
environment this has two limitations that will probably make it
not something I use a lot.
The first limitation is that this starts an xterm that doesn't copy
the command line state or settings of the parent xterm. If you've
set special options on the parent xterm (for example, you like your
root xterms to have a red foreground), this won't be carried over
to the new xterm. Similarly, if you've increased (or decreased) the
font size in your current xterm or otherwise changed its settings,
spawn-new-terminal doesn't duplicate these; you get a default xterm.
This is reasonable but disappointing.
(While spawn-new-terminal takes arguments that I believe it will
pass to the new xterm, as far as I know there's no way to retrieve
the current xterm's command line arguments to insert them here.)
The larger limitation for me is that when I'm at home, I'm often
running SSH inside of an xterm in order to log in to some other
system (I have a 'sshterm' script to automate all the aspects of
this). What I really want when I 'duplicate' such an xterm is not
a copy of the local xterm running a local shell (or even starting
another SSH to the remote system), but the remote (shell) context,
with the same (remote) current directory and so on. This is impossible
to get in general and difficult to set up even for situations where
it's theoretically possible. To use spawn-new-terminal effectively,
you basically need either all local xterms or copious use of remote
X forwarded over SSH (where the xterm is running on the remote
system, so a duplicate of it will be as well and can get the right
current directory).
Going through this experience has given me some ideas on how to
improve the situation overall. Probably I should write a 'spawn'
shell script to replace or augment my 'spawn' shell function so I
can readily have it in more places. Then when I'm ssh'd in to a
system, I can make the 'spawn' script at least print out a command
line or two for me to copy and paste to get set up again.
(Two command lines is the easiest approach, with one command that
starts the right xterm plus SSH combination and the other a 'cd'
to the right place that I'd execute in the new logged in window.
It's probably possible to combine these into an all-in-one script
but that starts to get too clever in various ways, especially as
SSH has no straightforward way to pass extra information to a login
shell.)
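(A minimal sketch of the printing version, with 'sshterm' being my
existing script and everything else made up for illustration:
#!/bin/sh
# spawn: print commands that recreate this shell's context in a new xterm
echo "sshterm $(hostname -s)"
echo "cd $(pwd)"
)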
The root issue appears to be that when I removed the
selinux-policy-targeted package, I probably should have edited
/etc/selinux/config to set SELINUXTYPE to some bogus value, not
left it set to "targeted". For entirely sensible reasons, various
packages have postinstall scripts that assume that if your SELinux
configuration says your SELinux type is 'targeted', they can do
things that implicitly or explicitly require things from the package
or from the selinux-policy package, which got removed when I removed
selinux-policy-targeted.
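(Concretely, what I should have left in /etc/selinux/config is something
like this; the exact bogus value doesn't matter as long as it isn't the
name of a real policy that packages will try to act on:
SELINUX=disabled
SELINUXTYPE=removed
)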
I'm not sure if my change to SELINUXTYPE will completely fix
things, because I suspect that there are other assumptions about
SELinux policy programs and data files being present lurking in
standard, still-installed package tools and so on. Some of these
standard SELinux related packages definitely can't be removed without
gutting Fedora of things that are important to me, so I'll either
have to live with periodic failures of postinstall scripts or put
selinux-policy-targeted and some other bits back. On the whole,
reinstalling selinux-policy-targeted is probably the safest course;
the issue that caused me to remove it only applies during Fedora
version upgrades and might be
fixed in Fedora 42 anyway.
What this illustrates to me is that regardless of package dependencies,
SELinux is not really optional on Fedora. The Fedora environment
assumes that a functioning SELinux environment is there and if it
isn't, things are likely to go wrong. I can't blame Fedora for this,
or for not fully capturing this in package dependencies (and Fedora
did protect the selinux-policy-targeted package from being removed;
I overrode that by hand, so what happens afterward is on me).
(Although I haven't checked modern versions of Fedora, I suspect
that there's no official way to install Fedora without getting a
SELinux policy package installed, and possibly selinux-policy-targeted
specifically.)
PS: I still plan to temporarily remove selinux-policy-targeted when
I upgrade my home desktop to Fedora 42. A few package postinstall
glitches are better than not being able to read DNF output due to
the package's spam.
In the recent referendum on the voluntary ending of life we can see an attempt to finally carry out the program of the French revolution. In the times before the bourgeois revolution, the churches in Europe held a monopoly over the course of life. Births, marriages, death, the main stations of life: the church governed them with its sacraments. The bourgeois state in time took over the registration of births and marriages; by nationalizing control over the population it freed people from the church's dominion. With free decision-making about bearing children it secured free management of life, from birth to death. But not of death itself! The referendum of 23 November could have freed us on this point as well, the last stronghold of church power. But it did not!
Why did citizens not want to shake off this last medieval yoke? The reasons for voting "against" were varied, the experts instruct us. The ideology of the "sanctity of life" and obedience to the papal church are obvious reasons. But more important for us is the reason of those who thought the law was bad. The law really did greatly complicate the voluntary ending of life. The more legal procedures are complicated, the greater the likelihood that the possibilities for evading the law will also grow, multiplying "legal loopholes".
In general, the bourgeois legal state can "liberate" individuals only with the help of the "rule of law", that is, of legal fetishism. But that is not the emancipation of the human being. Karl Marx published a critique of the French revolution as early as 1844 in On the Jewish Question. The French revolution, he wrote, liberates the individual only politically and thereby shatters society into atomized individuals who are connected only by law.
Put simply: the bourgeois revolution introduces the war of all against all within the limits of the laws of the bourgeois legal state. Among other things, this is also the foundation of the "free" labor contract between the proletarian and the capitalist, and that contract is the basis of capitalist exploitation. Bourgeois legal fetishism does not guarantee free coexistence in a society of solidarity.
Many who voted "against" probably also did so to protest against the policies of the current government. But one could vote against the government for various reasons: for example, because it did not establish a solid public health system, or because it did not finish privatizing health care. That is, for completely opposite reasons.
From the fact that different, even opposing reasons lead to the same decision in a referendum, we can read off the general limitation of the bourgeois political system. The law was drafted by a party, supported by the coalition parties, and adopted by a party parliament. While the law was being prepared, the public discussed the question of the voluntary ending of life somewhat, though not very intently. But how much of that discussion was taken into account in writing the law was decided by the parties.
Before the referendum the debate about the law was admittedly lively, but it could no longer influence the law. The ideas presented in the debate before the referendum could not be used creatively. They were trapped in the meager choice of being for or against the law as the parties had framed it. Party democracy cannot draw on all of society's intellectual powers. In elections, too, we can only decide for or against party lists. In the end we settle for the least bad option. Bourgeois democracy is not democratic.
So we are again facing the old question: must the bourgeois revolution first be carried through to the end (human rights, the rule of law, bourgeois parliamentarism), or is the socialist revolution on the agenda, so that we must fight for the abolition of classes and exploitation, for the socialization of the means of production and of decision-making, for coexistence in solidarity?
The promises of the bourgeois revolution cannot be fulfilled under capitalism. Without unpaid work in the household, capitalism would collapse. In every period it has needed unfree labor, from "traditional" relations in the colonies to slavery in the American south and migrant labor today.
Tinkering with capitalism would therefore only leave us stamping in place in a historical dead end. The socialist revolution is on the agenda. Especially for us, who practiced it until recently.
This guest column was written by Rastko Močnik.
GUEST CONTRIBUTION // RP is an open platform and publishes contributions from authors that touch on progressive struggles and questions.
Once upon a time, Unix filesystem mounts worked by putting one
inode on top of another, and this
was also how they worked in very early Linux. It wasn't wrong to
say that mounts were really about inodes, with the names only being
used to find the inodes. This is no longer how things work in Linux
(and perhaps other Unixes, but Linux is what I'm most familiar with
for this). Today, I believe that filesystem mounts in Linux are
best understood as namespace operations.
Each separate (unmounted) filesystem is a tree of names (a
namespace). At a broad level, filesystem mounts in Linux take some
name from that filesystem tree and project it on top of something
in an existing namespace,
generally with some properties attached to the projection. A regular
conventional mount takes the root name of the new filesystem and
puts the whole tree somewhere, but for a long time Linux's bind
mounts took some other name in the filesystem as their starting
point (what we could call the root inode of the
mount). In modern Linux, there can also be multiple mount namespaces in
existence at one time, with different contents and properties. A
filesystem mount does not necessarily appear in all of them, and
different things can be mounted at the same spot in the tree of
names in different mount namespaces.
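(To make this concrete, here are both sorts of operations as commands;
the paths are just examples, and the second one needs root or a user
namespace:
mount --bind /usr/share /mnt/share
unshare --mount sh -c 'mount -t tmpfs tmpfs /mnt; ls /mnt'
The bind mount projects an existing name somewhere else in the tree;
the unshare version makes a mount that's only visible inside its own
private mount namespace.)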
(Some mount properties are still global to the filesystem as a
whole, while other mount properties are specific to a particular
mount. See mount(2)
for a discussion of general mount properties. I don't know if there's
a mechanism to handle filesystem specific mount properties on a per
mount basis.)
This can't really be implemented with an inode-based view of mounts.
You can somewhat implement traditional Linux bind mounts with an
inode based approach, but mount namespaces have to be separate from
the underlying inodes. At a minimum a mount point must be a pair
of 'this inode in this namespace has something on top of it', instead
of just 'this inode has something on top of it'.
(A pure inode based approach has problems going up the directory
tree even in old bind mounts, because the parent directory of a
particular directory depends on how you got to the directory. If
/usr/share is part of /usr and you bind mounted /usr/share to /a/b,
the value of '..' depends on if you're looking at '/usr/share/..'
or '/a/b/..', even though /usr/share and /a/b are the same inode
in the /usr filesystem.)
If I'm reading manual pages correctly, Linux still normally requires
the initial mount of any particular filesystem be of its root name
(its true root inode). Only after that initial mount is made can
you make bind mounts to pull out some subset of its tree of names
and then unmount the original full filesystem mount. I believe that
a particular filesystem can provide ways to sidestep this with a
filesystem specific mount option, such as btrfs's subvol= mount
option that's covered in the btrfs(5) manual page (or 'btrfs
subvolume set-default').
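(For instance, with btrfs you can make the initial mount be just a
subvolume; the device and subvolume names here are made up:
mount -o subvol=home /dev/sdb2 /home
)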
And are there any circumstances where a mount can be done without a
pre-existing mount point (i.e. a mount point appears out of thin air)?
I think there is one answer for why requiring an existing mount point
is a good idea in general (and why doing otherwise would be complex),
although you can argue about it, and then a second, historical answer
based on how mount points were initially implemented.
The general problem is directory listings. We obviously want and
need mount points to appear in readdir() results, but in the kernel,
directory listings are historically the responsibility of filesystems
and are generated and returned in pieces on the fly (which is clearly
necessary if you have a giant directory; the kernel doesn't read
the entire thing into memory and then start giving your program
slices out of it as you ask). If mount points never appear in the
underlying directory, then they must be inserted at some point in
this process. If mount points can sometimes exist and sometimes
not, it's worse; you need to somehow keep track of which ones
actually exist and then add the ones that don't at the end of the
directory listing. The simplest way to make sure that mount points
always exist in directory listings is to require them to have an
existence in the underlying filesystem.
The historical answer is that in early versions of Unix, filesystems
were actually mounted on top of inodes, not directories (or filesystem
objects). When you passed a (directory) path to the mount(2) system
call, all it was used for was getting the corresponding inode, which
was then flagged as '(this) inode is mounted on' and linked (sort
of) to the new mounted filesystem on top of it. All of the things
that dealt with mount points and mounted filesystems did so by inode
and inode number, with no further use of the paths and the root
inode of the mounted filesystem being quietly substituted for the
mounted-on inode. All of the mechanics of this needed the inode and
directory entry for the name to actually exist (and V7 required the
name to be a directory).
I don't think modern kernels (Linux or otherwise) still use this
approach to handling mounts, but I believe it lingered on for quite
a while. And it's a sufficiently obvious and attractive implementation
choice that early versions of Linux also used it (see the Linux
0.96c version of iget() in fs/inode.c).
Sidebar: The details of how mounts worked in V7
When you passed a path to the mount(2) system call (called 'smount()'
in sys/sys3.c), it
used the name to get the inode and then set the IMOUNT flag from
sys/h/inode.h
on it (and put the mount details in a fixed size array of mounts,
which wasn't very big). When
iget() in sys/iget.c was
fetching inodes for you and you'd asked for an IMOUNT inode, it
gave you the root inode of the filesystem instead, which worked in
cooperation with name lookup in a directory (the name lookup in the
directory would find the underlying inode number, and then iget()
would turn it into the mounted filesystem's root inode). This gave
Research Unix a simple, low code approach to finding and checking
for mount points, at the cost of pinning a few more inodes into
memory (not necessarily a small thing when even a big V7 system
only had at most 200 inodes in memory at once, but then a big V7
system was limited to 8 mounts, see h/param.h).
I've mentioned this before in passing (cf,
also) but today I feel like saying it
explicitly: our habit with all
of our machines is to never apply a kernel update without immediately
rebooting the machine into the new kernel. On our Ubuntu machines
this is done by holding the relevant kernel packages; on my Fedora
desktops I normally run 'dnf update --exclude "kernel*"' unless I'm
willing to reboot on the spot.
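(On Ubuntu the holding is just apt-mark; the exact kernel package names
depend on what flavour you run, so these are illustrative:
apt-mark hold linux-image-generic linux-headers-generic
apt-mark unhold linux-image-generic linux-headers-generic   # when we're ready to update and reboot
)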
The obvious reason for this is that we want to switch to the new
kernel under controlled, attended conditions when we'll be able to
take immediate action if something is wrong, rather than possibly
have the new kernel activate at some random time without us present
and paying attention if there's a power failure, a kernel panic,
or whatever. This is especially acute on my desktops, where I use
ZFS by building my own OpenZFS packages
and kernel modules. If something goes wrong and the kernel modules
don't load or don't work right, an unattended reboot can leave my
desktops completely unusable and off the network until I can get
to them. I'd rather avoid that if possible (sometimes it isn't).
(In general I prefer to reboot my Fedora machines with me present
because weird things happen from time to time and sometimes I
make mistakes, also.)
The less obvious reason is that when you reboot a machine right
after applying a kernel update, it's clear in your mind that the
machine has switched to a new kernel. If there are system problems
in the days immediately after the update, you're relatively
likely to remember this and at least consider the possibility that
the new kernel is involved. If you apply a kernel update, walk away
without rebooting, and the machine reboots a week and a half later
for some unrelated reason, you may not remember that one of the
things the reboot did was switch to a new kernel.
(Kernels aren't the only thing that this can happen with, since not
all system updates and changes take effect immediately when made or
applied. Perhaps one should reboot after making them, too.)
I'm assuming here that your Linux distribution's package management
system is sensible, so there's no risk of losing old kernels
(especially the one you're currently running) merely because you
installed some new ones but didn't reboot into them. This is how
Debian and Ubuntu behave (if you don't 'apt autoremove' kernels),
but not quite how Fedora's dnf does it (as far as I know). Fedora
dnf keeps the N most recent kernels around and probably doesn't let
you remove the currently running kernel even if it's more than N
kernels old, but I don't believe it tracks whether or not you've
rebooted into those N kernels and stretches the N out if you haven't
(or removes more recent installed kernels that you've never rebooted
into, instead of older kernels that you did use at one point).
PS: Of course if kernel updates were perfect this wouldn't matter.
However this isn't something you can assume for the Linux kernel
(especially as patched by your distribution), as we've sometimes
seen. Although big issues like that are
relatively uncommon.
A few years ago I wrote about the divide in chown() about who got
to give away files, where BSD and V7 were
on one side, restricting it to root, while System III and System V
were on the other, allowing the owner to give them away too. At the
time I quoted the V7 chown(2)
explanation of this:
[...] Only the super-user may execute this call, because if users
were able to give files away, they could defeat the (nonexistent)
file-space accounting procedures.
Recently, for reasons,
chown(2) and its history was on my mind and so I wondered if the early
Research Unixes had always had this, or if a restriction was added at
some point.
(Since I looked it up, the restriction on chown()'ing setuid files
was lifted in V4. In V4 and later, a setuid file has its setuid bit
removed on chown; in V3 you still can't give away such a file,
according to the V3 chown(2) manual page.)
At this point you might wonder where the System III
and System V unrestricted chown came from. The surprising to me
answer seems to be that System III partly descends from PWB/UNIX, and PWB/UNIX 1.0, although
it was theoretically based on V6, has pre-V6 chown(2) behavior
(kernel source,
manual page). I
suspect that there's a story both to why V6 made chown() more
restricted and also why PWB/UNIX specifically didn't take that
change from V6, but I don't know if it's been documented anywhere
(a casual Internet search didn't turn up anything).
(The System III chown(2) manual page
says more or less the same thing as the PWB/UNIX manual page, just
more formally, and the kernel code is very similar.)
What's common in both cases is that NFS servers and OverlayFS both
must create an 'identity' for a file (a NFS filehandle and an inode number, respectively). In the
case of NFS servers, this identity has some strict requirements;
OverlayFS has a somewhat easier life, but in general it still has
to create and track some amount of information. Based on reading
the OverlayFS article,
I believe that OverlayFS considers this expensive enough to only
want to do it when it has to.
OverlayFS definitely needs to go to this effort when people call
stat(), because various programs will directly use the inode number
(the POSIX 'file serial number') to tell files on the same filesystem
apart. POSIX technically requires OverlayFS to do this for readdir(),
but in practice almost everyone that uses readdir() isn't going to
look at the inode number; they look at the file name and perhaps
the d_type field to spot directories without needing to stat()
everything.
If there was a special 'not a valid inode number' signal value,
OverlayFS might use that, but there isn't one (in either POSIX or
Linux, which is actually a problem). Since OverlayFS needs to provide
some sort of arguably valid inode number, and since it's reading
directories from the underlying filesystems, passing through their
inode numbers from their d_ino fields is the simple answer.
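(You can see this from user level with something like the following,
where '/merged/dir' stands in for a directory on your overlayfs mount;
I believe plain 'ls -i' normally reports d_ino straight from readdir()
when it doesn't need anything else, while stat() takes the expensive
path, so on overlayfs the two can disagree:
ls -1i /merged/dir
stat -c '%i %n' /merged/dir/*
)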
Sidebar: Why there should be a 'not a valid inode number' signal value
Because both standards and common Unix usage include a d_ino
field in the structure readdir() returns, they embed the idea that
the stat()-visible inode number can easily be recovered or generated
by filesystems purely by reading directories, without needing to
perform additional IO. This is true in traditional Unix filesystems,
but it's not obvious that you would do that all of the time in all
filesystems. The on disk format of directories might only have some
sort of object identifier for each name that's not easily mapped
to a relatively small 'inode number' (which is required to be some
C integer type), and instead the 'inode number' is an attribute you
get by reading file metadata based on that object identifier (which
you'll do for stat() but would like to avoid for reading directories).
But in practice if you want to design a Unix filesystem that performs
decently well and doesn't just make up inode numbers in readdir(),
you must store a potentially duplicate copy of your 'inode numbers'
in directory entries.
Suppose, not hypothetically, that your system is running some systemd
based service or daemon that resets or erases your carefully cultivated
state when it restarts. One example is systemd-networkd, although you can turn that off (or
parts of it off, at least), but there are likely others. To clean
up after this happens, you'd like to automatically restart or redo
something after a systemd unit is restarted. Systemd supports this,
but I found it slightly unclear how you want to do this and today
I poked at it, so it's time for notes.
First, you need to put whatever you want to do into a script and a
.service unit that will run the script. The traditional way to run
a script through a .service unit is:
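# (a sketch; the description and script path are whatever yours are)
[Unit]
Description=Redo my network settings after systemd-networkd restarts

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/redo-network-state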
To get this unit to run after another unit is started or restarted,
what you need is PartOf=,
which causes your unit to be stopped and started when the other
unit is, along with 'After=' so that your unit starts after the
other unit instead of racing it (which could be counterproductive
when what you want to do is fix up something from the other unit).
So you add:
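# (using systemd-networkd as the example other unit)
[Unit]
PartOf=systemd-networkd.service
After=systemd-networkd.service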
(This is what works for me in light testing. This assumes that
the unit you want to re-run after is normally always running,
as systemd-networkd is.)
In testing, you don't need to have your unit specifically enabled
by itself, although you may want it to be for clarity and other
reasons. Even if your unit isn't specifically enabled, systemd will
start it after the other unit because of the PartOf=. If the other
unit is started all of the time (as is usually the case for
systemd-networkd), this effectively makes your unit enabled, although
not in an obvious way (which is why I think you should specifically
'systemctl enable' it, to make it obvious). I think you can have
your .service unit enabled and active without having the other unit
enabled, or even present.
You can declare yourself PartOf a .target unit, and some stock
package systemd units do for various services. And a .target unit
can be PartOf a .service; on Fedora, 'sshd-keygen.target' is PartOf
sshd.service in a surprisingly clever little arrangement to generate
only the necessary keys through a templated 'sshd-keygen@.service'
unit.
I admit that the whole collection of Wants=, Requires=, Requisite=,
BindsTo=,
PartOf=, Upholds=, and so on are somewhat confusing to me. In the
past, I've used the wrong version and suffered the consequences, and I'm not sure I have them entirely
right in this entry.
Note that as far as I know, PartOf= has those Requires= consequences, where if the other unit is stopped,
yours will be too. In a simple 'run a script after the other unit
starts' situation, stopping your unit does nothing and can be
ignored.
(If this seems complicated, well, I think it is, and I think one
part of the complication is that we're trying to use systemd as
an event-based system when it isn't one.)
A while ago I wrote an entry about things that resolved wasn't
for as of systemd 251. One of those things
was arbitrary mappings of (DNS) names to DNS servers, for example
if you always wanted *.internal.example.org to query a special
DNS server. Systemd-resolved
didn't have a direct feature for this and attempting to attach your
DNS names to DNS server mappings to a network interface could go
wrong in various ways. Well, time marches on and as of systemd
v258 this
is no longer the state of affairs.
Systemd v258 introduces systemd.dns-delegate
files, which allow you to map DNS names to DNS servers independently
from network interfaces. The release notes describe
this as:
A new DNS "delegate zone" concept has been introduced, which are
additional lookup scopes (on top of the existing per-interface
and the one global scope so far supported in resolved), which
carry one or more DNS server addresses and a DNS search/routing
domain. It allows routing requests to specific domains to specific
servers. Delegate zones can be configured via drop-ins below
/etc/systemd/dns-delegate.d/*.dns-delegate.
Since systemd v258 is very new I don't have any machines where I
can actually try this out, but based on the systemd.dns-delegate
documentation, you can use this both for domains that you merely
want diverted to some DNS server and also domains that you also
want on your search path. Per resolved.conf's Domains=
documentation, the latter is 'Domains=example.org' (example.org
will be one of the domains that resolved tries to find single-label
hostnames in, a search domain), and the former is 'Domains=~example.org'
(where we merely send queries for everything under 'example.org'
off to whatever DNS=
you set, a route-only domain).
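(Based on my reading of the documentation, and untested since I don't
have a v258 machine, a delegate drop-in would look something like this;
the file name, the server IP, and the domain are all made up:
# /etc/systemd/dns-delegate.d/internal.dns-delegate
[Delegate]
DNS=192.0.2.53
Domains=~internal.example.org
)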
(While resolved.conf's Domains=
officially promises to check your search domains in the order you
listed them, I believe this is strictly for a single 'Domains='
setting for a single interface. If you have multiple 'Domains='
settings, for example in a global resolved.conf, a network interface,
and now in a delegation, I think systemd-resolved makes no promises.)
Right now, these DNS server delegations can only be set through
static files, not manipulated through resolvectl.
I believe fiddling with them through resolvectl is on the roadmap, but
for now I guess we get to restart resolved if we need to change things.
In fact resolvectl doesn't expose anything to do with them, although I
believe read-only information is available via D-Bus and maybe varlink.
Given the timing of systemd v258's release relative to Fedora
releases, I probably won't be able to use this feature until Fedora
44 in the spring (Fedora 42 is current and Fedora 43 is imminent,
which won't have systemd v258 given that v258 was released only a
couple of weeks ago). My current systemd-resolved setup is okay
(if it wasn't I'd be doing something else), but I can probably find
uses for these delegations to improve it.
Bash's fallback getcwd() assumes that the inode [number] from stat()
matches one returned by readdir(). OverlayFS breaks that assumption.
I wouldn't call this an 'assumption' so much as 'sane POSIX semantics',
although I'm not sure that POSIX absolutely requires this.
As we've seen before, POSIX talks about
'file serial number(s)' instead of inode numbers. The best definition
of these is covered in sys/stat.h,
where we see that a 'file identity' is uniquely determined by the
combination of the inode number and the device ID (st_dev), and
POSIX says that 'at any given time in a system, distinct files shall
have distinct file identities' while hardlinks have the same identity.
The POSIX description of readdir()
and dirent.h
don't caveat the d_ino file serial numbers from readdir(),
so they're implicitly covered by the general rules for file serial
numbers.
In theory you can claim that the POSIX guarantees don't apply here
since readdir() is only supplying d_ino, the file serial number,
not the device ID as well. I maintain that this fails due to a POSIX
requirement:
[...] The value of the structure's d_ino member shall be set to
the file serial number of the file named by the d_name member.
[...]
If readdir() gives one file serial number and a fstatat()
of the same name gives another, a plain reading of POSIX is that
one of them is lying. Files don't have two file serial numbers,
they have one. Readdir() can return duplicate d_ino numbers for
files that aren't hardlinks to each other (and I think legitimately
may do so in some unusual circumstances), but it can't return
something different than what fstatat() does for the same name.
The perverse argument here turns on POSIX's 'at any given time'.
You can argue that the readdir() is at one time and the stat() is
at another time and the system is allowed to entirely change file
serial numbers between the two times. This is certainly not the
intent of POSIX's language but I'm not sure there's anything in the
standard that rules it out, even though it makes file serial numbers
fairly useless, since there's then no POSIX way to get a bunch of them
at 'a given time' and so have them be coherent with each other.
So to summarize, OverlayFS has chosen what are effectively non-POSIX
semantics for its readdir() inode numbers (under some circumstances,
in the interests of performance) and Bash used readdir()'s d_ino
in a traditional Unix way that caused it to notice. Unix filesystems
can depart from POSIX semantics if they want, but I'd prefer if
they were a bit more shamefaced about it. People (ie, programs)
count on those semantics.
(The truly traditional getcwd() way
wouldn't have been a problem, because it predates readdir() having
d_ino and so doesn't use it (it stat()s everything to get inode
numbers). I reflexively follow this pre-d_ino algorithm when
I'm talking about doing getcwd() by hand (cf), but these days you want to use the
dirent d_ino and if possible d_type, because they're much
more efficient than stat()'ing everything.)
Historically, Unix mail programs (what we call 'mail clients' or
'mail user agents' today) have had two different approaches to
handling your email, what I'll call the shared approach and the
exclusive approach, with the shared approach being the dominant
one. To explain the shared approach, I have to back up to talk about
what Unix mail transfer agents (MTAs) traditionally did. When a
Unix MTA delivered email to you, at first it delivered email into
a single file in a specific location (such as '/usr/spool/mail/<login>')
in a specific format, initially mbox; even then, this could be
called your 'inbox'. Later, when the maildir mailbox format became
popular, some MTAs gained the ability to deliver to maildir format
inboxes.
(There have been a number of Unix mail spool formats over the
years, which I'm not going to try to get into here.)
A 'shared' style mail program worked directly with your inbox in
whatever format it was in and whatever location it was in. This is
how the V7 'mail' program worked,
for example. Naturally these programs didn't have to work on your
inbox; you could generally point them at another mailbox in the
same format. I call this style 'shared' because you could use any
number of different mail programs (mail clients) on your mailboxes,
providing that they all understood the format and also provided that
all of them agreed on how to lock your mailbox against modifications,
including against your system's MTA delivering new email right at the
point where your mail program was, for example, trying to delete some.
(Locking issues are one of the things that maildir was designed
to help with.)
An 'exclusive' style mail program (or system) was designed to own
your email itself, rather than try to share your system mailbox.
Of course it had to access your system mailbox a bit to get at your
email, but broadly the only thing an exclusive mail program did
with your inbox was pull all your new email out of it, write it
into the program's own storage format and system, and then usually
empty out your system inbox. I call this style 'exclusive' because
you generally couldn't hop back and forth between mail programs
(mail clients) and would be mostly stuck with your pick, since your
main mail program was probably the only one that could really work
with its particular storage format.
(Pragmatically, only locking your system mailbox for a short period
of time and only doing simple things with it tended to make things
relatively reliable. Shared style mail programs had much more room
for mistakes and explosions, since they had to do more complex
operations, at least on mbox format mailboxes. Being easy to modify
is another advantage of the maildir format, since it outsources
a lot of the work to your Unix filesystem.)
This shared versus exclusive design choice turned out to have some
effects when mail moved to being on separate servers and accessed
via POP and then later IMAP. My impression is that 'exclusive'
systems coped fairly well with POP, because the natural operation
with POP is to pull all of your new email out of the server and
store it locally. By contrast, shared systems coped much better
with IMAP than exclusive ones did, because IMAP is inherently a
shared mail environment where your mail stays on the IMAP server
and you manipulate it there.
(Since IMAP is the dominant way that mail clients/user agents get
at email today, my impression is that the 'exclusive' approach is
basically dead at this point as a general way of doing mail clients.
Almost no one wants to use an IMAP client that immediately moves
all of their email into a purely local data storage of some sort;
they want their email to stay on the IMAP server and be accessible
from and by multiple clients and even devices.)
Most classical Unix mail clients are 'shared' style programs, things
like Alpine, Mutt, and the basic Mail program. One major 'exclusive'
style program, really a system, is (N)MH (also). MH is somewhat notable because in
its time it was popular enough that a number of other mail programs
and mail systems supported its basic storage format to some degree
(for example, procmail can deliver messages to MH-format directories,
although it doesn't update all of the things that MH would do in
the process).
Another major source of 'exclusive' style mail handling systems is
GNU Emacs. I believe that both rmail and
GNUS
normally pull your email from your system inbox into their own
storage formats, partly so that they can take exclusive ownership
and don't have to worry about locking issues with other mail clients.
GNU Emacs has a number of mail reading environments (cf,
also) and I'm not sure what the
others do (apart from MH-E, which is a frontend on (N)MH).
(There have probably been other 'exclusive' style systems. Also,
it's a pity that as far as I know, MH never grew any support for
keeping its messages in maildir format directories, which are
relatively close to MH's native format.)
One of the traditional rites of passage for Linux system administrators
is having a daemon not work in the normal system configuration (eg,
when you boot the system) but work when you manually run it as root.
The classical cause of this on Unix was that $PATH wasn't fully set
in the environment the daemon was running in but was in your root
shell. On Linux, another traditional cause of this sort of thing
has been SELinux and a more modern source (on
Ubuntu) has sometimes been AppArmor. All of these create hard to
see differences between your root shell (where the daemon works
when run by hand) and the normal system environment (where the
daemon doesn't work). These days, we can add another cause, an
increasingly common one, and that is systemd service unit
restrictions, many of which are
covered in systemd.exec.
(One pernicious aspect of systemd as a cause of these restrictions is
that they can appear in new releases of the same distribution. If a
daemon has been running happily in an older release and now has surprise
issues in a new Ubuntu LTS, I don't always remember to look at its
.service file.)
Some of systemd's protective directives simply cause failures to
do things, like access user home directories if ProtectHome=
is set to something appropriate. Hopefully your daemon complains
loudly here, reporting mysterious 'permission denied' or 'file not
found' errors. Some systemd settings can have additional, confusing
effects, like PrivateTmp=.
A standard thing I do when troubleshooting a chain of programs
executing programs executing programs is to shim in diagnostics
that dump information to /tmp, but with PrivateTmp= on, my debugging
dump files are mysteriously not there in the system-wide /tmp.
(On the other hand, a daemon may not complain about missing files
if it's expected that the files aren't always there. A mailer
usually can't really tell the difference between 'no one has .forward
files' and 'I'm mysteriously not able to see people's home
directories to find .forward files in them'.)
Sometimes you don't get explicit errors, just mysterious failures
to do some things. For example, you might set IP address access
restrictions with the intention of blocking inbound connections but
wind up also blocking DNS queries (and
this will also depend on whether or not you use systemd-resolved).
The good news is that you're mostly not going to find standard
systemd .service files for normal daemons shipped by your Linux
distribution with IP address restrictions. The bad news is that at
some point .service files may start showing up that impose IP address
restrictions with the assumption that DNS resolution is being done
via systemd-resolved as opposed to direct DNS queries.
(I expect some Linux distributions to resist this, for example
Debian, but others may declare that using systemd-resolved is now
mandatory in order to simplify things and let them harden service
configurations.)
Right now, you can usually test if this is the problem by creating
a version of the daemon's .service file with any systemd restrictions
stripped out of it and then seeing if using that version makes life
happy. In the future it's possible that some daemons will assume
and require some systemd restrictions (for instance, assuming that
they have a /tmp all of their own), making things harder to test.
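(Two quick things to do here, with 'mydaemon' standing in for whatever
daemon you're fighting with: ask systemd how locked down the unit is,
and make a temporary copy of the unit with the restrictions stripped
out:
systemd-analyze security mydaemon.service
systemctl edit --runtime --full mydaemon.service   # delete the Protect*/Private*/IPAddress* lines, then restart
)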
On at least x86 PCs, Linux text consoles
('TTY' consoles or 'virtual consoles') support some surprising
things. One of them is doing some useful stuff with your mouse, if
you run an additional daemon such as gpm or the
more modern consolation. This is
supported on both framebuffer consoles
and old 'VGA' text consoles. The experience is fairly straightforward;
you install and activate one of the daemons, and afterward you can
wave your mouse around, select and paste text, and so on. How it
works and what you get is not as clear, and since I recently went
diving into this area for reasons, I'm going to
write down what I now know before I forget it (with a focus on how
consolation works).
The quick summary is that the console TTY's mouse support is broadly
like a terminal emulator. With a mouse daemon active, the TTY will
do "copy and paste" selection stuff on its own. A mouse aware text
mode program can put the console into a mode where mouse button
presses are passed through to the program, just as happens in xterm
or other terminal emulators.
The simplest TTY mode is when a non-mouse-aware program
or shell is active, which is to say a program that wouldn't try to
intercept mouse actions itself if it was run in a regular terminal
window and would leave mouse stuff up to the terminal emulator. In
this mode, your mouse daemon reads mouse input events and then uses
sub-options of the TIOCLINUX ioctl
to inject activities into the TTY, for example telling it to 'select'
some text and then asking it to paste that selection to some file
descriptor (normally the console itself, which delivers it to
whatever foreground program is taking terminal input at the time).
(In theory you can use the mouse to scroll text back and forth, but
in practice that was removed in 2020, both for the framebuffer
console
and for the VGA console.
If I'm reading the code correctly, a VGA console might still have
a little bit of scrollback support depending on how much spare VGA
RAM you have for your VGA console size. But you're probably not
using a VGA console any more.)
The other mode the console TTY can be in is one where some program
has used standard xterm-derived escape sequences
to ask for xterm-compatible "mouse tracking", which is the same
thing it might ask for in a terminal emulator if it wanted to handle
the mouse itself. What this does in the kernel TTY console driver
is set a flag that your mouse daemon can query with
TIOCL_GETMOUSEREPORTING; the kernel TTY driver still doesn't
directly handle or look at mouse events. Instead, consolation (or
gpm) reads the flag and, when the flag is set, uses the
TIOCL_SELMOUSEREPORT sub-sub-option to TIOCLINUX's TIOCL_SETSEL
sub-option to report the mouse position and button presses to the
kernel (instead of handling mouse activity itself). The kernel then
turns around and sends mouse reporting escape codes to the TTY, as
the program asked for.
A mouse daemon like consolation doesn't have to pay attention to
the kernel's TTY 'mouse reporting' flag. As far as I can tell from
the current Linux kernel code, if the mouse daemon ignores the flag
it can keep on doing all of its regular copy and paste selection
and mouse button handling. However, sending mouse reports is only
possible when a program has specifically asked for it; the kernel
will report an error if you ask it to send a mouse report at the
wrong time.
(As far as I can see there's no notification from the kernel to
your mouse daemon that someone changed the 'mouse reporting' flag.
Instead you have to poll it; it appears consolation does this every
time through its event loop before it handles any mouse events.)
PS: Some documentation on console mouse reporting was written as
a 2020 kernel documentation patch
(alternate version)
but it doesn't seem to have made it into the tree. According
to various sources, eg,
the mouse daemon side of things can only be used by actual mouse
daemons, not by programs, although programs do sometimes use other
bits of TIOCLINUX's mouse stuff.
PPS: It's useful to install a mouse daemon on your desktop or laptop
even if you don't intend to ever use the text TTY. If you ever wind
up in the text TTY for some reason, perhaps because your regular
display environment has exploded, having mouse cut and paste is a
lot nicer than not having it.
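(On Debian and Ubuntu this amounts to installing the package; I believe
the packaged service gets enabled for you, but it doesn't hurt to check:
apt-get install consolation
systemctl status consolation.service
)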
One of the things that Fedora is trying to do in Fedora 42 is
unifying /usr/bin and /usr/sbin. In an
ideal (Fedora) world, your Fedora machines will have /usr/sbin be
a symbolic link to /usr/bin after they're upgraded to Fedora 42.
However, if your Fedora machines have been around for a while, or
perhaps have some third party packages installed, what you'll
actually wind up with is a /usr/sbin that is mostly symbolic links
to /usr/bin but still has some actual programs left.
One source of these remaining /usr/sbin programs is old packages
from past versions of Fedora that are no longer packaged in Fedora
41 and Fedora 42. Old packages are usually harmless, so it's easy
for them to linger around if you're not disciplined; my home and
office desktops (which have been around for a while) still have packages from as far back as
Fedora 28.
(An added complication of tracking down file ownership is that some
RPMs haven't been updated for the /sbin to /usr/sbin merge and so
still believe that their files are /sbin/<whatever> instead of
/usr/sbin/<whatever>. A 'rpm -qf /usr/sbin/<whatever>' won't find
these.)
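(So before declaring a program an orphan, it's worth asking rpm about
both names; 'foo' is a stand-in here:
rpm -qf /usr/sbin/foo
rpm -qf /sbin/foo
)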
Obviously, you shouldn't remove old packages without being sure of
whether or not they're important to you. I'm also not completely
sure that all packages in the Fedora 41 (or 42) repositories are
marked as '.fc41' or '.fc42' in their RPM versions, or if there are
some RPMs that have been carried over from previous Fedora versions.
Possibly this means I should wait until a few more Fedora versions
have come to pass so that other people find and fix the exceptions.
(On what is probably my cleanest Fedora 42 test virtual machine,
there are a number of packages that 'dnf list --extras' doesn't
list that have '.fc41' in their RPM version. Some of them may
have been retained un-rebuilt for binary compatibility reasons.
There's also the 'shim' UEFI bootloaders, which date from 2024
and don't have Fedora releases in their RPM versions, but those
I expect to basically never change once created. But some others
are a bit mysterious, such as 'libblkio', and I suspect that they
may have simply been missed by the Fedora 42 mass rebuild.)
PS: In theory anyone with access to the full Fedora 42 RPM repository
could sweep the entire thing to find packages that still install
/usr/sbin files or even /sbin files, which would turn up any relevant
not yet rebuilt packages. I don't know if there's any easy way to
do this through dnf commands, although I think dnf does have access
to a full file list for all packages (which is used for certain dnf
queries).
One of the changes in Fedora Linux 42 is unifying /usr/bin and
/usr/sbin,
by moving everything in /usr/sbin to /usr/bin. To some people, this
is probably anathema, and to be honest, my first reaction
was to bristle at the idea. However, the more I thought about it,
the more I had to concede that the idea of /usr/sbin has failed in
practice.
We can tell /usr/sbin has failed in practice by asking how many
people routinely operate without /usr/sbin in their $PATH. In a lot
of environments, the answer is that very few people do, because
sooner or later you run into a program that you want to run (as
yourself) to obtain useful information or do useful things. Let's
take FreeBSD 14.3 as an illustrative example (to make this not a
Linux biased entry); looking at /usr/sbin, I recognize iostat,
manctl (you might use it on your own manpages), ntpdate (which can
be run by ordinary people to query the offsets of remote servers),
pstat, swapinfo, and traceroute. There are probably others that I'm
missing, especially if you use FreeBSD as a workstation and so care
about things like sound volumes and keyboard control.
(And if you write scripts and want them to send email, you'll care
about sendmail and/or FreeBSD's 'mailwrapper', both in /usr/sbin.
There's also DTrace, but I don't know if you can DTrace your own
binaries as a non-root user on FreeBSD.)
For a long time, there has been no strong organizing principle to
/usr/sbin that would draw a hard line and create a situation where
people could safely leave it out of their $PATH. We could have had
a principle of, for example, "programs that don't work unless run
by root", but no such principle was ever followed for very long (if
at all). Instead programs were more or less shoved in /usr/sbin if
developers thought they were relatively unlikely to be used by
normal people. But 'relatively unlikely' is not 'never', and shortly
after people got told to 'run traceroute' and got 'command not
found' when they tried, /usr/sbin (probably) started appearing in
$PATH.
(And then when you asked 'how does my script send me email about
something', people told you about /usr/sbin/sendmail and another
crack appeared in the wall.)
If /usr/sbin is more of a suggestion than a rule and it appears in
everyone's $PATH because no one can predict which programs you want
to use will be in /usr/sbin instead of /usr/bin, I believe this
means /usr/sbin has failed in practice. What remains is an unpredictable
and somewhat arbitrary division between two directories, where which
directory something appears in operates mostly as a hint (a hint
that's invisible to people who don't specifically look where a
program is).
(This division isn't entirely pointless and one could try to reform
the situation in a way short of Fedora 42's "burn the entire thing
down" approach. If nothing else the split keeps the size of both
directories somewhat down.)
PS: The /usr/sbin like idea that I think is still successful in
practice is /usr/libexec. Possibly a bunch of things in /usr/sbin
should be relocated to there (or appropriate subdirectories of it).
I upgrade Fedora on my office and home workstations through an
online upgrade with dnf,
and as part of this I read (or at least scan) DNF's output to look
for problems. Usually this goes okay, but DNF5 has a general
problem with script output and when
I did a test upgrade from Fedora 41 to Fedora 42 on a virtual
machine, it generated a huge amount of repeated output from a script
run by selinux-policy-targeted, repeatedly reporting "Old compiled
fcontext format, skipping" for various .bin files in
/etc/selinux/targeted/contexts/files. The volume of output made the
rest of DNF's output essentially unreadable. I would like to avoid
this when I actually upgrade my office and home workstations to
Fedora 42 (which I still haven't done, partly because of this issue).
The 'targeted' policy is one of several SELinux policies
that are supported or at least packaged by Fedora (although I suspect
I might see similar issues with the other policies too). My main
machines don't use SELinux and I have it completely disabled, so
in theory I should be able to remove the selinux-policy-targeted
package to stop it from repeatedly complaining during the Fedora
42 upgrade process. In practice, selinux-policy-targeted is a
'protected' package that DNF will normally refuse to remove. Such
packages are listed in /etc/dnf/protected.d/ in various .conf files;
selinux-policy-targeted installs (well, includes) a .conf file to
protect itself from removal once installed.
(Interestingly, sudo protects itself but there's nothing specifically
protecting su and the rest of util-linux. I suspect util-linux is
so pervasively a dependency that other protected things hold it
down, or alternately no one has ever worried about people removing
it and shooting themselves in the foot.)
I can obviously remove this .conf file and then DNF will let me
remove selinux-policy-targeted, which will force the removal of
some other SELinux policy packages (both selinux-policy packages
themselves and some '*-selinux' sub-packages of other packages).
I tried this on another Fedora 41 test virtual machine and nothing
obvious broke, but that doesn't mean that nothing broke at all. It
seems very likely that almost no one tests Fedora without the
selinux-policy collective installed and I suspect it's not a supported
configuration.
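(Mechanically the removal is simple enough, something like:
grep -l selinux-policy /etc/dnf/protected.d/*.conf
rm /etc/dnf/protected.d/<whichever file that is>
dnf remove selinux-policy-targeted
)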
I could reduce my risks by removing the packages only just before
I do the upgrade to Fedora 42 and put them back later (well, unless
I run into a dnf issue as a result,
although that issue is from 2024). Also, now that I've investigated
this, I could in theory delete the
.bin files in /etc/selinux/targeted/contexts/files before the
upgrade, hopefully making it so that selinux-policy-targeted has
less or nothing to complain about. Since I'm not using SELinux,
hopefully the lack of these files won't cause any problems, but of
course this is less certain a fix than removing selinux-policy-targeted
(for example, perhaps the .bin files would get automatically rebuilt
early on in the upgrade process as packages are shuffled around,
and bring the problem back with them).
Really, though, I wish DNF5 didn't have its problem with script
output. All of this is hackery to deal with that underlying issue.
The default behavior of a stock Ubuntu LTS server install is that
it enables 'unattended upgrades', by installing the package
unattended-upgrades (which creates /etc/apt/apt.conf.d/20auto-upgrades,
which controls this). Historically, we
haven't believed in unattended automatic package upgrades and
eventually built a complex semi-automated upgrades system (which has various special features). In theory this has various potential
advantages; in practice it mostly results in package upgrades being
applied after some delay that depends on when they come out relative
to working days.
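(For reference, 20auto-upgrades normally contains just the two apt
settings that turn this on, something like:
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
Setting the second one to "0", or removing the package, is enough to
turn unattended upgrades back off.)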
I have a few machines that actually are stock Ubuntu servers, for
reasons outside the scope of this entry. These machines naturally
have automated upgrades turned on and one of them (in a cloud, using
the cloud provider's standard Ubuntu LTS image) even appears to
automatically reboot itself if kernel updates need that. These
machines are all in undemanding roles (although one of them is my
work IPv6 gateway), so they aren't
necessarily indicative of what we'd see on more complex machines,
but none of them have had any visible problems from these unattended
upgrades.
(I also can't remember the last time that we ran into a problem
with updates when we applied them. Ubuntu updates still sometimes
have regressions and other problems, forcing them to be reverted
or reissued, but so far we haven't seen problems ourselves; we find
out about these problems only through the notices in the Ubuntu
security lists.)
If we were starting from scratch today in a greenfield environment,
I'm not sure we'd bother building our automation for manual package
updates. Since we have the automation and it offers various extra
features (even if they're rarely used), we're probably not going
to switch over to automated upgrades (including in our local build
of Ubuntu 26.04 LTS when that comes out next year).
(The advantage of switching over to standard unattended upgrades
is that we'd get rid of a local tool that, like all local tools,
is all our responsibility. The less local weird things we have,
the better, especially since we have so many as it is.)
The other day I wrote about what "AppIndicator" is (a protocol) and some things about how the Cinnamon
desktop appeared to support it, except they weren't working for me.
Now I actually understand what's going on, more or less, and how
to solve my problem of a program complaining that
it needed AppIndicator.
Cinnamon directly implements the AppIndicator notification protocol
in xapp-sn-watcher, part of Cinnamon's xapp(s) package. Xapp-sn-watcher is started as
part of your (Cinnamon) session. However, it has a little feature,
namely that it will exit if no one is asking it to do anything:
XApp-Message: 22:03:57.352: (SnWatcher) watcher_startup: ../xapp-sn-watcher/xapp-sn-watcher.c:592: No active monitors, exiting in 30s
In a normally functioning Cinnamon environment, something will soon
show up to be an active monitor and stop xapp-sn-watcher from
exiting:
Cjs-Message: 22:03:57.957: JS LOG: [LookingGlass/info] Loaded applet xapp-status@cinnamon.org in 88 ms
[...]
XApp-Message: 22:03:58.129: (SnWatcher) name_owner_changed_signal: ../xapp-sn-watcher/xapp-sn-watcher.c:162: NameOwnerChanged signal received (n: org.x.StatusIconMonitor.cinnamon_0, old: , new: :1.60
XApp-Message: 22:03:58.129: (SnWatcher) handle_status_applet_name_owner_appeared: ../xapp-sn-watcher/xapp-sn-watcher.c:64: A monitor appeared on the bus, cancelling shutdown
This something is a standard Cinnamon desktop applet. In System
Settings → Applets, it's way down at the bottom and is called "XApp
Status Applet". If you've accidentally wound up with it not turned
on, xapp-sn-watcher will (probably) not have a monitor active after
30 seconds, and then it will exit (and in the process of exiting,
it will log alarming messages about failed GLib assertions). Not
having this xapp-status applet turned on was my problem, and turning
it on fixed things.
(I don't know how it got turned off. It's possible I went through
the standard applets at some point and turned some of them off in
an excess of ignorant enthusiasm.)
As I found out from leigh scott in my Fedora bug report, the way to
get this debugging output from xapp-sn-watcher is to run 'gsettings
set org.x.apps.statusicon sn-watcher-debug true'. This will cause
xapp-sn-watcher to log various helpful and verbose things to your
~/.xsession-errors (although apparently not the fact that it's
actually exiting; you have to deduce that from the timestamps
stopping 30 seconds later and that being the timestamps on the GLib
assertion failures).
(I don't know why there's both a program and an applet involved
in this and I've decided not to speculate.)
Suppose, not hypothetically, that you start up some program on your
Fedora 42 Cinnamon
desktop and it helpfully tells you "<X> requires AppIndicator to
run. Please install the AppIndicator plugin for your desktop". You
are likely confused, so here are some notes.
'AppIndicator' itself is the name of an application notification
protocol, apparently originally from KDE, and some desktop environments
may need a (third party) extension to support it, such as the
Ubuntu one for GNOME Shell.
Unfortunately for me,
Cinnamon is not one of those desktops. It theoretically has native
support for this, implemented in /usr/libexec/xapps/xapp-sn-watcher,
part of Cinnamon's xapps package.
The actual 'AppIndicator' protocol is done over D-Bus, because that's
the modern way. Since this started as a KDE thing, the D-Bus name is
'org.kde.StatusNotifierWatcher'. What provides certain D-Bus names is
found in /usr/share/dbus-1/services, but not all names are mentioned
there and 'org.kde.StatusNotifierWatcher' is one of the missing ones.
In this case /etc/xdg/autostart/xapp-sn-watcher.desktop mentions the
D-Bus name in its 'Comment=', but that's probably not something you
can count on to find what your desktop is (theoretically) using to
provide a given D-Bus name. I found xapp-sn-watcher somewhat through
luck.
There are probably a number of ways to see what D-Bus names are
currently registered and active. The one that I used when looking
at this is 'dbus-send --print-reply --dest=org.freedesktop.DBus
/org/freedesktop/DBus org.freedesktop.DBus.ListNames'. As far as
I know, there's no easy way to go from an error message about
'AppIndicator' to knowing that you want 'org.kde.StatusNotifierWatcher';
in my case I read the source of the thing complaining which was helpfully
in Python.
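(If you already have a guess at the name, you can narrow down the
dbus-send output with a plain grep, for example:
dbus-send --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames | grep -i statusnotifier
This is only a convenience; it still won't tell you what program is
supposed to be providing the name.)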
I have no idea how to actually fix the problem, or if there is a
program that implements org.kde.StatusNotifierWatcher as a generic,
more or less desktop independent program the way that stalonetray does for system tray
stuff (or one generation of system tray stuff, I think there have
been several iterations of it, cf).
(Yes, I filed a Fedora bug, but I believe
Cinnamon isn't particularly supported by Fedora so I don't expect
much. I also built the latest upstream xapps tree and it also appears
to fail in the same way. Possibly this means something in the rest of
the system isn't working right.)
So, suppose that you have a brand new nflog version of OpenBSD's
pflog, so you can use tcpdump to watch
dropped packets (or in general, logged packets). And further suppose
that you specifically want to see DNS requests to your port 53. So of course you do:
# tcpdump -n -i nflog:30 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented
Perhaps we can get clever by reading from the interface in one
tcpdump and sending it to another to be interpreted, forcing the
pcap filter to be handled entirely in user space instead of the
kernel:
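# tcpdump -n -i nflog:30 -w - | tcpdump -n -r - 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented
(That's roughly the command I mean, and it fails the same way, because
it's the NFLOG link-layer type itself that libpcap can't build filters
for, regardless of whether the filter would run in the kernel or in
user space.)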
As far as I can determine, what's going on here is that the netfilter
log system, 'NFLOG', uses a 'packet' format that isn't the same as
any of the regular formats (Ethernet, PPP, etc) and adds some
additional (meta)data about the packet to every packet you capture.
I believe the various attributes this metadata can contain are
listed in the kernel's nfnetlink_log.h.
(I believe it's not technically correct to say that this additional
stuff is 'before' the packet; instead I believe the packet is
contained in a NFULA_PAYLOAD attribute.)
Unfortunately for us, tcpdump (or more exactly libpcap) doesn't
know how to create packet capture filters for this
format, not even ones that are interpreted entirely in user space
(as happens when tcpdump reads from a file).
I believe that you have two options. First, you can use tshark with a display
filter, not a capture filter:
# tshark -i nflog:30 -Y 'udp.port == 53 or tcp.port == 53'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'nflog:30'
[...]
(Tshark capture filters are subject to the same libpcap inability
to work on NFLOG formatted packets as tcpdump has.)
Alternately and probably more conveniently, you can tell tcpdump
to use the 'IPV4' datalink type instead of the default, as mentioned
in (opaque) passing in the tcpdump manual page:
# tcpdump -i nflog:30 -L
Data link types for nflog:30 (use option -y to set):
NFLOG (Linux netfilter log messages)
IPV4 (Raw IPv4)
# tcpdump -i nflog:30 -y ipv4 -n 'port 53'
tcpdump: data link type IPV4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on nflog:30, link-type IPV4 (Raw IPv4), snapshot length 262144 bytes
[...]
Of course this is only applicable if you're only doing IPv4. If you
have some IPv6 traffic that you want to care about, I think you
have to use tshark display filters (which means learning how to
write Wireshark display filters, something I've avoided so far).
I think there is some potentially useful information in the extra
NFLOG data, but to get it or to filter on it I think you'll need
to use tshark (or Wireshark) and consult the NFLOG display filter
reference,
although that doesn't seem to give you access to all of the NFLOG
stuff that 'tshark -i nflog:30 -V' will print about packets.
(Or maybe the trick is that you need to match 'nflog.tlv_type ==
<whatever> and nflog.tlv_value == <whatever>'. I believe that
some NFLOG attributes are available conveniently, such as 'nflog.prefix',
which corresponds to NFULA_PREFIX. See packet-nflog.c.)
OpenBSD's and FreeBSD's PF system has a very convenient 'pflog'
feature, where you put in a 'log' bit in a PF rule and this
dumps a copy of any matching packets into a pflog pseudo-interface, where you can
both see them with 'tcpdump -i pflog0' and have them automatically
logged to disk by pflogd in
pcap format. Typically we use this to log blocked packets, which
gives us both immediate and after the fact visibility of what's
getting blocked (and by what rule,
also). It's possible to mostly
duplicate this in Linux nftables,
although with more work and there's less documentation on it.
The first thing you need is nftables rules with one or two log
statements
of the form 'log group <some number>'. If you want to be able to
both log packets for later inspection and watch them live, you need
two 'log group' statements with different numbers; otherwise you
only need one. You can use different (group) numbers on different
nftables rules if you want to be able to, say, look only at accepted
but logged traffic or only dropped traffic. In the end this might
wind up looking something like:
tcp dport ssh counter log group 30 log group 31 drop;
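(For context, that fragment lives inside a chain in some table; a
minimal sketch of a full ruleset using it might look like:
table inet filter {
	chain input {
		type filter hook input priority 0; policy accept;
		tcp dport ssh counter log group 30 log group 31 drop;
	}
}
with the table and chain being whatever you already use locally.)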
As the nft manual page will tell you, this uses the kernel
'nfnetlink_log' to forward the 'logs' (packets) to a netlink
socket, where exactly one process (at most) can subscribe to a
particular group to receive those logs (ie, those packets). If we
want to both log the packets and be able to tcpdump them, we need
two groups so we can have ulogd getting one and
tcpdump getting the other.
To see packets from any particular log group, we use the special
'nflog:<N>' pseudo-interface that's hopefully supported by your
Linux version of tcpdump. This is used as 'tcpdump -i nflog:30
...' and works more or less like you'd want it to. However, as
far as I know there's no way to see meta-information about the
nftables filtering, such as what rule was involved or what the
decision was; you just get the packet.
To log the packets to disk for later use, the default program is
ulogd, which in Ubuntu is called 'ulogd2'. Ulogd(2) isn't as automatic
as OpenBSD's and FreeBSD's pf logging; instead you have to configure
it in /etc/ulogd.conf, and on Ubuntu make sure you have the
'ulogd2-pcap' package installed (along with ulogd2 itself). Based
merely on getting it to work, what you want in /etc/ulogd.conf is
the following three bits:
# A 'stack' of source, handling, and destination
stack=log31:NFLOG,base1:BASE,pcap31:PCAP
# The source: NFLOG group 31, for IPv4 traffic
[log31]
group=31
# addressfamily=10 for IPv6
# the file path is correct for Ubuntu
[pcap31]
file="/var/log/ulog/ulogd.pcap"
sync=0
(On Ubuntu 24.04, any .pcap files in /var/log/ulog will be automatically
rotated by logrotate, although I think by default it's only weekly,
so you might want to make it daily.)
The ulogd documentation suggests that you will need to capture IPv4
and IPv6 traffic separately, but I've only used this on IPv4 traffic
so I don't know. This may imply that you need separate nftables
rules to log (and drop) IPv6 traffic so that you can give it a
separate group number for ulogd (I'm not sure if it needs a separate
one for tcpdump or if tcpdump can sort it out).
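(If you do want IPv6 too, I believe the ulogd side is just a second
stack with its own group number and 'addressfamily=10', something like:
stack=log32:NFLOG,base1:BASE,pcap32:PCAP
[log32]
group=32
addressfamily=10
[pcap32]
file="/var/log/ulog/ulogd-v6.pcap"
sync=0
but I haven't tested this, and the group number and file name here are
made up.)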
Ulogd can also log to many different things than PCAP format,
including JSON and databases. It's possible that there are ways to
enrich the ulogd pcap logs, or maybe just the JSON logs, with
additional useful information such as the network interface involved
and other things. I find the ulogd documentation somewhat opaque
on this (and also it's incomplete), and I haven't experimented.
(According to this,
the JSON logs can be enriched or maybe default to that.)
Given the assorted limitations and other issues with ulogd, I'm
tempted to not bother with it and only have our nftables setups
support live tcpdump of dropped traffic with a single 'log group
<N>'. This would save us from the assorted annoyances of ulogd2.
PS: One reason to log to pcap format files is that then you can
use all of the tcpdump filters that you're already familiar with
in order to narrow in on (blocked) traffic of interest, rather
than having to put together a JSON search or something.
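(For example, 'tcpdump -n -r /var/log/ulog/ulogd.pcap port 53' should
work the way you'd expect, since as far as I can tell the packets that
ulogd saves are plain IP packets rather than NFLOG-wrapped ones.)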
These days, nftables is the Linux
network firewall system that you want to use, and especially it's
the system that Ubuntu will use by default even if you use the
'iptables' command. The nft command is
the official interface to nftables, and it has a 'nft list ruleset'
sub-command that will list your NFT rules. Since iptables rules are
implemented with nftables, you might innocently expect that 'nft list
ruleset' will show you the proper NFT syntax to achieve your current
iptables rules.
Well, about that:
# iptables -vL INPUT
[...] target prot opt in out source destination
[...] ACCEPT tcp -- any any anywhere anywhere match-set nfsports dst match-set nfsclients src
# nft list ruleset
[...]
ip protocol tcp xt match "set" xt match "set" counter packets 0 bytes 0 accept
[...]
This represents an xt statement from xtables compat interface. It is a
fallback if translation is not available or not complete. Seeing this
means the ruleset (or parts of it) were created by iptables-nft and
one should use that to manage it.
Nftables has a native set type (and
also maps),
but, quite reasonably, the old iptables 'ipset' stuff isn't translated to nftables sets
by the iptables compatibility layer. Instead the compatibility layer
uses this 'xt match' magic that the nft command can only imperfectly
tell you about. To nft's credit, it prints a warning comment (which
I've left out) that the rules are being managed by iptables-nft and
you shouldn't touch them. Here, all of the 'xt match "set"' bits
in the nft output are basically saying "opaque stuff happens here".
This still makes me a little bit sad because it makes it that bit
harder to bootstrap my nftables knowledge from what iptables rules
convert into. If I wanted to switch to nftables rules and nftables
sets (for example for my now-simpler desktop firewall rules), I'd have to do that from relative
scratch instead of getting to clean up what the various translation
tools would produce or report.
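(For what it's worth, my understanding is that the hand-written native
version would use named nftables sets, something like:
nft add set inet filter nfsclients '{ type ipv4_addr; }'
nft add element inet filter nfsclients '{ 192.0.2.10, 192.0.2.11 }'
nft add rule inet filter input tcp dport @nfsports ip saddr @nfsclients accept
with a similar 'nfsports' set of type inet_service, and with the table
and chain assumed to already exist. The addresses are illustrative and
this is a sketch, not rules I've actually converted to.)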
(As a side effect it makes it less likely that I'll convert various
iptables things to being natively nft/nftables based, because I
can't do a fully mechanical conversion. If they still work with
iptables-nft, I'm better off leaving them as is. Probably this also
means that iptables-nft support is likely to have a long, long
life.)
I use an eccentric X 'desktop' that
is not really a desktop as such in the usual sense but instead a
window manager and various programs that I run
(as a sysadmin, there's a lot of terminal windows). One of the ways
that my desktop is unusual is in how I exit from my X session.
First, I don't use xdm or any other graphical login
manager; instead I run my session through xinit. When
you use an xinit based session, you give xinit a program or a script
to run, and when the program exits, xinit terminates the X server
and your session.
(If you give xinit a shell script, whatever foreground program the
script ends with is your keystone program.)
Traditionally, this keystone program for your X session was your
window manager. At one level this makes a lot of sense; your window
manager is basically the core of your X session anyway, so you might
as well make quitting from it end the session. However, for a very
long time I've used a do-nothing iconified xterm running a shell
as my keystone program.
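(As an illustration of the structure, not my actual setup, the tail
end of such an xinit script looks something like:
xrdb -merge $HOME/.Xresources
fvwm &
# ... start various other programs ...
exec xterm -iconic -name console
The 'exec xterm' at the end is the keystone; everything else either
runs to completion or is started in the background first.)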
The minor advantage to having an otherwise unused xterm as my session
keystone program is that I can start my window manager basically
at the start of my (rather complex) session startup, so that I can
immediately have it manage all of the other things I start (technically
I run a number of commands to set up X settings before I start fvwm,
but it's the first program I start that will actually show anything
on the screen). The big advantage is that using something else as
my keystone program means that I can kill and restart my window
manager if something goes badly wrong, and more generally that I
don't have to worry about restarting it. This doesn't happen very
often, but when it does happen I'm very glad that I can recover my
session instead of having to abruptly terminate everything. And
should I have to terminate fvwm, this 'console' xterm is a convenient
idle xterm in which to restart it (or in general, any other program
of my session that needs restarting).
(The 'console' xterm is deliberately placed up at the top of the
screen, in an area that I don't normally put non-fvwm windows in,
so that if fvwm exits and everything de-iconifies, it's highly
likely that this xterm will be visible so I can type into it. If
I put it in an ordinary place, it might wind up covered up by a
browser window or another xterm or whatever.)
I don't particularly have to use an (iconified) xterm with a shell
in it; I could easily have written a little Tk program that displayed
a button saying 'click me to exit'. However, the problem with such
a program (and the advantage of my 'console' xterm) is that it would
be all too easy to accidentally click the button (and force-end my
session). With the iconified xterm, I need to do a bunch of steps
to exit; I have to deiconify that xterm, focus the window, and
Ctrl-D the shell to make it exit (causing the xterm to exit). This
is enough out of the way that I don't think I've ever done it by
accident.
PS: I believe modern desktop environments like GNOME, KDE, and
Cinnamon have moved away from making their window manager be the
keystone program and now use a dedicated session manager program
that things talk to. One reason for this may be that modern desktop
shells seem to be rather more prone to crashing for various reasons,
which would be very inconvenient if that ended your session. This
isn't all bad, at least if there's a standard D-Bus protocol for
ending a session so that you can write an 'exit the session' thing
that will work across environments.
In my entry on getting decent error reports in Bash for 'set -e', I said that even if you were
on a system where /bin/sh was Bash and so my entry worked if you
started your script with '#!/bin/sh', you should use '#!/bin/bash'
instead for various reasons. A commentator took issue with this
direct invocation of Bash and suggested '#!/usr/bin/env bash'
instead. It's my view that using env this way, especially for Bash,
is rarely useful and thus is almost always unnecessary and pointless
(and sometimes dangerous).
The only reason to start your script with '#!/usr/bin/env <whatever>'
is if you expect your script to run on a system where Bash or
whatever else isn't where you expect (or when it has to run on
systems that have '<whatever>' in different places, which is probably
most common for third party packages). Broadly speaking this only
happens if your script is portable and will run on many different
sorts of systems. If your script is specific to your systems (and
your systems are uniform), this is pointless; you know where Bash
is and your systems aren't going to change it, not if they're sane.
The same is true if you're targeting a specific Linux distribution,
such as 'this is intrinsically an Ubuntu script'.
(In my case, the script I was doing this to is intrinsically
specific to Ubuntu and our environment. It will never run on
anything else.)
It's also worth noting that '#!/usr/bin/env <whatever>' only works
if (the right version of) <whatever> can be found on your $PATH,
and in fact the $PATH of every context where you will run the script
(including, for example, from cron). If the system's default $PATH
doesn't include the necessary directories, this will likely fail
some of the time. This makes using 'env' especially dangerous in
an environment where people may install their own version of
interpreters like Python, because your script's use of 'env' may
find their Python on their $PATH instead of the version that you
expect.
(These days, one of the dangers with Python specifically is that
people will have a $PATH that (currently) points to a virtual
environment with some random selection of Python packages installed
and not installed, instead of the system set of packages.)
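(To put the difference concretely:
#!/usr/bin/env python3    (runs whatever 'python3' is first on $PATH, a venv included)
#!/usr/bin/python3        (always the system Python, or a clean failure)
The same applies to Bash, although people are less likely to have
private Bash builds sitting on their $PATH.)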
As a practical matter, pretty much every mainstream Linux distribution
has a /bin/bash (assuming that you install Bash, and I'm sorry, Nix
and so on aren't mainstream). If you're targeting Linux in general,
assuming /bin/bash exists is entirely reasonable. If a Linux
distribution relocates Bash, in my view the resulting problems are
on them. A lot of the time, similar things apply for other
interpreters, such as Python, Perl, Ruby, and so on. '#!/usr/bin/python3'
on Linux is much more likely to get you a predictable Python
environment than '#!/usr/bin/env python3', and if it fails it
will be a clean and obvious failure that's easy to diagnose.
Another issue is that even if your script is fixed to use 'env' to
run Bash, it may or may not work in such an alternate environment
because other things you expect to find in $PATH may not be there.
Unless you're actually testing on alternate environments (such as
Nix or FreeBSD), using 'env' may suggest more portability than
you're actually able to deliver.
My personal view is that for most people, '#!/usr/bin/env' is a
reflexive carry-over that they inherited from a past era of
multi-architecture Unix environments,
when much less was shipped with the system and so fewer things were in predictable
locations. In that past Unix era, using '#!/usr/bin/env python' was
a reasonably sensible thing; you could hope that the person who
wanted to run your script had Python, but you couldn't predict
where. For most people, those days are over, especially for scripts
and programs that are purely for your internal use and that you
won't be distributing to the world (much less inviting people to
run your 'written on X' script on a Y, such as a FreeBSD script
being run on Linux).
A commentator on my 2024 entry on the uncertain possible futures
of Unix graphical desktops brought up the
XLibre project. XLibre is ostensibly a fork of the X server that
will be developed by a new collection of people, which on the surface
sounds unobjectionable and maybe a good thing for people (like me)
who want X to keep being viable; as a result it has gotten a certain
amount of publicity from credulous sources who don't look behind
the curtain. Unfortunately for everyone, XLibre is an explicitly
political project,
and I don't mean that in the sense of disagreements about technical
directions (the sense that you could say that 'forking is a political
action', because it's the manifestation of a social disagreement).
Instead I mean it in the regular sense of 'political', which is that
the people involved in XLibre (especially its leader) have certain
social values and policies that they espouse, and the XLibre project
is explicitly manifesting some of them.
I am not going to summarize here; instead, you should read the
Register article and its links,
and also the relevant sections of Ariadne Conill's announcement
of Wayback
and their links. However, even if you "don't care" about politics,
you should see this correction to earlier XLibre changes where the person
making the earlier changes
didn't understand what '2^16' did in C (I would say that the
people who reviewed the changes also missed it, but there didn't
seem to be anyone doing so, which ought to raise your eyebrows when
it comes to the X server).
Using XLibre, shipping it as part of a distribution, or advocating for
it is not a neutral choice. To do so is to align yourself,
knowingly or unknowingly, with the politics of XLibre and with the
politics of its leadership and the people its leadership will attract
to the project. This is always true to some degree with any project,
but it's especially true when the project is explicitly manifesting
some of its leadership's values, out in the open. You can't detach
XLibre from its leader.
My personal view is that I don't want to have anything to do with
XLibre and I will think less of any Unix or Linux distribution that
includes it, especially ones that intend to make it their primary
X server. At a minimum, I feel those distributions haven't done
their due diligence.
In general, my personal guess is that a new (forked) standalone X
server is also the wrong approach to maintaining a working X server
environment over the long term. Wayback combined with XWayland seems
like a much more stable base because each of them has more support
in various ways (eg, there are a lot of people who are going to
want old X programs to keep working for years or decades to come
and so lots of demand for most of XWayland's features).
(This elaborates on my comment on XLibre in this entry. I also think that a viable X based environment
is far more likely to stop working due to important programs becoming
Wayland-only than because you can no longer get a working X server.)
Since Union finance minister Nirmala Sitharaman's announcement last week that India's Goods and Services Tax (GST) rates will be rationalised anew from September 22, I've been seeing a flood of pieces all in praise – and why not?
The GST regime has been somewhat controversial since its launch because, despite simplifying compliance for businesses and industry, it increased the costs for consumers. The Indian government exacerbated that pain point by undermining the fiscal federalism of the Union, increasing its revenues at the expense of states' as well as cutting allocations.
While there is (informed) speculation that the next Finance Commission will further undercut the devolution of funds to the states, GST 2.0 offers some relief to consumers in the form of making various products more affordable. Populism is popular, after all.
However, increasing affordability isn't always a good thing even if your sole goal is to increase consumption. This is particularly borne out in the food and nutrition domain.
For example, under the new tax regime, from September 22, the GST on pizza bread will slip from 5% to zero. This means both sourdough pizza bread and maida (refined flour) pizza bread will go from 5% to zero. However, because there is more awareness of maida as an ingredient in the populace and less so of sourdough, and because maida as a result enjoys a higher economy of scale and is thus less expensive (before tax), the demand for maida bread is likely to increase more than the demand for sourdough bread.
This is unfortunate: ideally, sourdough bread should be more affordable – or, alternatively, the two breads should be equally affordable as well as have threshold-based front-of-pack labelling. That is to say, liberating consumers to be able to buy new food products or more of the old ones without simultaneously empowering consumers to make more informed choices could tilt demand in favour of unhealthier foods.
Ultimately, the burden of non-communicable diseases in the population will increase, as will consumers' expenses on healthcare, dietary interventions, and so on. I explained this issue in The Hindu on September 9, 2025, and set out solutions that the Indian government must implement in its food regulation apparatus posthaste.
Without these measures, GST 2.0 will likely be bad news for India's dietary and nutritional ambitions.
In science, paradoxes often appear when familiar rules are pushed into unfamiliar territory. One of them is Parrondo's paradox, a curious mathematical result showing that when two losing strategies are combined, they can produce a winning outcome. This might sound like trickery but the paradox has deep connections to how randomness and asymmetry interact in the physical world. In fact its roots can be traced back to a famous thought experiment explored by the US physicist Richard Feynman, who analysed whether one could extract useful work from random thermal motion. The link between Feynman's thought experiment and Parrondo's paradox demonstrates how chance can be turned into order when the conditions are right.
Imagine two games. Each game, when played on its own, is stacked against you. In one, the odds are slightly less than fair, e.g. you win 49% of the time and lose 51%. In another, the rules are even more complex, with the chances of winning and losing depending on your current position or capital. If you keep playing either game alone, the statistics say you will eventually go broke.
But then there's a twist. If you alternate the games – sometimes playing one, sometimes the other – your fortune can actually grow. This is Parrondo's paradox, proposed in 1996 by the Spanish physicist Juan Parrondo.
The answer to how combining losing games can result in a winning streak lies in how randomness interacts with structure. In Parrondo's games, the rules are not simply fair or unfair in isolation; they have hidden patterns. When the games are alternated, these patterns line up in such a way that random losses become rectified into net gains.
Say there's a perfectly flat surface in front of you. You place a small bead on it and then you constantly jiggle the surface. The bead jitters back and forth. Because the noise you're applying to the bead's position is unbiased, the bead simply wanders around in different directions on the surface. Now, say you introduce a switch that alternates the surface between two states. When the switch is ON, an ice-tray shape appears on the surface. When the switch is OFF, it becomes flat again. This ice-tray shape is special: the cups are slightly lopsided because there's a gentle downward slope from left to right in each cup. At the right end, there's a steep wall. If you're jiggling the surface when the switch is OFF, the bead diffuses a little towards the left, a little towards the right, and so on. When you throw the switch to ON, the bead falls into the nearest cup. Because each cup is slightly tilted towards the right, the bead eventually settles near the steep wall there. Then you move the switch to OFF again.
As you repeat these steps with more and more beads over time, you'll see they end up a little to the right of where they started. This is Parrondo's paradox. The jittering motion you applied to the surface caused each bead to move randomly. The switch you used to alter the shape of the surface allowed you to expend some energy in order to rectify the beads' randomness.
The reason why Parrondo's paradox isn't just a mathematical trick lies in physics. At the microscopic scale, particles of matter are in constant, jittery motion because of heat. This restless behaviour is known as Brownian motion, named after the botanist Robert Brown, who observed pollen grains dancing erratically in water under a microscope in 1827. At this scale, randomness is unavoidable: molecules collide, rebound, and scatter endlessly.
Scientists have long wondered whether such random motion could be tapped to extract useful work, perhaps to drive a microscopic machine. This was Feynman's thought experiment as well, involving a device called the Brownian ratchet, a.k.a. the Feynman-Smoluchowski ratchet. The Polish physicist Marian Smoluchowski dreamt up the idea in 1912, and Feynman popularised it in a lecture 50 years later, in 1962.
Picture a set of paddles immersed in a fluid, constantly jolted by Brownian motion. A ratchet and pawl mechanism is attached to the paddles. The ratchet allows the paddles to rotate in one direction but not the other. It seems plausible that the random kicks from molecules would turn the paddles, which the ratchet would then lock into forward motion. Over time, this could spin a wheel or lift a weight.
In one of his famous physics lectures in 1962, Feynman analysed the ratchet. He showed that the pawl itself would also be subject to Brownian motion. It would jiggle, slip, and release under the same thermal agitation as the paddles. When everything is at the same temperature, the forward and backward slips would cancel out and no net motion would occur.
This insight was crucial: it preserved the rule that free energy can't be extracted from randomness at equilibrium. If motion is to be biased in only one direction, there needs to be a temperature difference between different parts of the ratchet. In other words, random noise alone isn't enough: you also need an asymmetry, or what physicists call nonequilibrium conditions, to turn randomness into work.
Let's return to Parrondo's paradox now. The paradoxical games are essentially a discrete-time abstraction of Feynman's ratchet. The losing games are like unbiased random motion: fluctuations that on their own can't produce net gain because the gains become cancelled out. But when they're alternated cleverly, they mimic the effect of adding asymmetry. The combination rectifies the randomness, just as a physical ratchet can rectify the molecular jostling when a gradient is present.
This is why Parrondo explicitly acknowledged his inspiration from Feynman's analysis of the Brownian ratchet. Where Feynman used a wheel and pawl to show how equilibrium noise can't be exploited without a bias, Parrondo created games whose hidden rules provided the bias when they were combined. Both cases highlight a universal theme: randomness can be guided to produce order.
The implications of these ideas extend well beyond thought experiments. Inside living cells, molecular motors like kinesin and myosin actually function like Brownian ratchets. These proteins move along cellular tracks by drawing energy from random thermal kicks with the aid of a chemical energy gradient. They demonstrate that life itself has evolved ways to turn thermal noise into directed motion by operating out of equilibrium.
Parrondo's paradox also has applications in economics, evolutionary biology, and computer algorithms. For example, alternating between two investment strategies, each of which is poor on its own, may yield better long-term outcomes if the fluctuations in markets interact in the right way. Similarly, in genetics, when harmful mutations alternate in certain conditions, they can produce beneficial effects for populations. The paradox provides a framework to describe how losing at one level can add up to winning at another.
Feynman's role in this story is historical as well as philosophical. By dissecting the Brownian ratchet, he demonstrated how deeply the laws of thermodynamics constrain what's possible. His analysis reminded physicists that intuition about randomness can be misleading and that only careful reasoning could reveal the real rules.
In 2021, a group of scientists from Australia, Canada, France, and Germany wrote in Cancers that the mathematics of Parrondo's paradox could also illuminate the biology of cancerous tumours. Their starting point was the observation that cancer cells behave in ways that often seem self-defeating: they accumulate genetic and epigenetic instability, devolve into abnormal states, sometimes stop dividing altogether, and often migrate away from their original location and perish. Each of these traits looks like a "losing strategy" – yet cancers that use these "strategies" together are often persistent.
The group suggested that the paradox arises because cancers grow in unstable, hostile environments. Tumour cells deal with low oxygen, intermittent blood supply, attacks by the immune system, and toxic drugs. In these circumstances, no single survival strategy is reliable. A population of only stable tumour cells would be wiped out when the conditions change. Likewise a population of only unstable cells would collapse under its own chaos. But by maintaining a mix, the group contended, cancers achieve resilience. Stable, specialised cells can exploit resources efficiently while unstable cells with high plasticity constantly generate new variations, some of which could respond better to future challenges. Together, the team continued, the cancer can alternate between the two sets of cells so that it can win.
The scientists also interpreted dormancy and metastasis of cancers through this lens. Dormant cells are inactive and can lie hidden for years, escaping chemotherapy drugs that are aimed at cells that divide. Once the drugs have faded, they restart growth. While a migrating cancer cell has a high chance of dying off, even one success can seed a tumor in a new tissue.
On the flip side, the scientists argued that cancer therapy can also be improved by embracing Parrondo's paradox. In conventional chemotherapy, doctors repeatedly administer strong drugs, creating a strategy that often backfires: the therapy kills off the weak, leaving the strong behind – but in this case the strong are the very cells you least want to survive. By contrast, adaptive approaches that alternate periods of treatment with rest or that mix real drugs with harmless lookalikes could harness evolutionary trade-offs inside the tumor and keep it in check. Just as cancer may use Parrondo's paradox to outwit the body, doctors may one day use the same paradox to outwit cancer.
On August 6, physicists from Lanzhou University in China published a paper in Physical Review E discussing just such a possibility. They focused on chemotherapy, which is usually delivered in one of two main ways. The first, called the maximum tolerated dose (MTD), uses strong doses given at intervals. The second, called low-dose metronomic (LDM), uses weaker doses applied continuously over time. Each method has been widely tested in clinics and each one has drawbacks.
MTD often succeeds at first by rapidly killing off drug-sensitive cancer cells. In the process, however, it also paves the way for the most resistant cancer cells to expand, leading to relapse. LDM on the other hand keeps steady pressure on a tumor but can end up either failing to control sensitive cells if the dose is too low or clearing them so thoroughly that resistant cells again dominate if the dose is too strong. In other words, both strategies can be losing games in the long run.
The question the study's authors asked was whether combining these two flawed strategies in a specific sequence could achieve better results than deploying either strategy on its own. This is the sort of situation Parrondo's paradox describes, even if not exactly. While the paradox is concerned with combining outright losing strategies, the study has discussed combining two ineffective strategies.
To investigate, the researchers used mathematical models that treated tumors as ecosystems containing three interacting populations: healthy cells, drug-sensitive cancer cells, and drug-resistant cancer cells. They applied equations from evolutionary game theory that tracked how the fractions of these groups shifted in different conditions.
The models showed that in a purely MTD strategy, the resistant cells soon took over, and in a purely LDM strategy, the outcomes depended strongly on drug strength but still ended badly. But when the two schedules were alternated, the tumor behaved differently. The more sensitive cells were suppressed but not eliminated while their persistence prevented the resistant cells from proliferating quickly. The team also found that the healthy cells survived longer.
Of course, tumours are not well-mixed soups of cells; in reality they have spatial structure. To account for this, the team put together computer simulations where individual cells occupied positions on a grid; grew, divided or died according to fixed rules; and interacted with their neighbours. This agent-based approach allowed the team to examine how pockets of sensitive and resistant cells might compete in more realistic tissue settings.
Their simulations only confirmed the previous set of results. A therapeutic strategy that alternated between MTD and LDM schedules extended the amount of time before the resistant cells took over and while the healthy cells dominated. When the model started with the LDM phase in particular, the sensitive cancer cells were found to compete with the resistant cancer cells and the arrival of the MTD phase next applied even more pressure on the latter.
This is an interesting finding because it suggests that the goal of therapy may not always be to eliminate every sensitive cancer cell as quickly as possible but, paradoxically, that sometimes it may be wiser to preserve some sensitive cells so that they can compete directly with resistant cells and prevent them from monopolising the tumor. In clinical terms, alternating between high- and low-dose regimens may delay resistance and keep tumours tractable for longer periods.
Then again, this is cancer – the "emperor of all maladies" – and in silico evidence from a physics-based model is only the start. Researchers will have to test it in real, live tissue in animal models (or organoids) and subsequently in human trials. They will also have to assess whether certain cancers, followed by a specific combination of drugs for those cancers, will benefit more (or less) from taking the Parrondo's paradox way.
[University of London mathematical oncologist Robert] Noble ... says that the method outlined in the new study may not be ripe for a real-world clinical setting. "The alternating strategy fails much faster, and the tumor bounces back, if you slightly change the initial conditions," adds Noble. Liu and colleagues, however, plan to conduct in vitro experiments to test their mathematical model and to select regimen parameters that would make their strategy more robust in a realistic setting.
Union finance minister Nirmala Sitharaman announced sweeping changes to the GST rates on September 3. However, I think the rate for software services (HSN 99831) will remain unchanged at 18%. This is a bummer because every time I renew my WordPress.com site or purchase software over the internet in rupees, the total cost increases by almost a fifth.
The disappointment is compounded by the fact that WordPress.com and many other software service providers offer adjusted rates for users in India in order to offset the country's lower purchasing power per capita. For example, the lowest WordPress and Ghost plans by WordPress.com and MagicPages.co, respectively, cost $4 and $12 a month. But for users in India, the WordPress.com plan costs Rs 200 a month while MagicPages.co offers a Rs 450 per month plan, both with the same feature set – a big difference. The 18% GST however wipes out some, not all, of these gains.
Paying for software services over the internet when they're billed in dollars rather than rupees isn't much different. While GST doesn't apply, the rupee-to-dollar rate has become abysmal. [Checks] Rs 88.14 to the dollar at 11 am. Ugh.
I also hoped for a GST rate cut on software services because if content management software in particular becomes more affordable, more people would be able to publish on the internet.
Up through V6, the exec system call and family of system calls took
two arguments, the path and the argument list; we can see this in
both the V6 exec(2) manual page and
the implementation of the system call in the kernel. As
bonus trivia, it appears that the V6 exec() limited you to 510
characters of arguments (and probably V1 through V5 had a similarly
low limit, but I haven't looked at their kernel code).
In V7, the exec(2) manual page now
documents a possible third argument, and the kernel implementation
is much more complex, plus
there's an environ(5) manual page about it.
Based on h/param.h, V7
also had a much higher size limit on the combined size of arguments
and environment variables, which isn't all that surprising given
the addition of the environment. Commands like login.c were
updated to put some things into the new environment; login sets a
default $PATH and a $HOME, for example, and environ(5)
documents various other uses (which I haven't checked in the source
code).
This implies that the V7 shell is where $PATH first appeared in
Unix, where the manual page
describes it as 'the search path for commands'. This might make you
wonder how the V6 shell handled
locating commands, and where it looked for them. The details are
helpfully documented in the V6 shell manual page,
and I'll just quote what it has to say:
If the first argument is the name of an executable file, it is
invoked; otherwise the string `/bin/' is prepended to the argument.
(In this way most standard commands, which reside in `/bin', are
found.) If no such command is found, the string `/usr' is further
prepended (to give `/usr/bin/command') and another attempt is made to
execute the resulting file. (Certain lesser-used commands live in
`/usr/bin'.)
('Invoked' here is carrying some extra freight, since this may not
involve a direct kernel exec of the file. An executable file that
the kernel didn't like would be directly run by the shell.)
I suspect that '$PATH' was given such a short name (instead of a
longer, more explicit one) simply as a matter of Unix style at the
time. Pretty much everything in V7 was terse and short in this style
for various reasons, and verbose environment variable names would
have reduced that limited exec argument space.
There's a certain sort of person who feels that the platonic ideal
of Unix is somewhere around Research Unix V7 and it's almost all
been downhill since then (perhaps with the exception of further
Research Unixes and then Plan 9, although very few people got their
hands on any of them). For all that I like Unix and started using
it long ago when it was simpler (although not as far back as V7),
I reject this view and think it's completely mistaken.
(Some of the needs driving Unix's growth were for features and some were for
performance. The original V7 filesystem was quite simple but also
suffered from performance issues, ones that often got worse over
time.)
I'll agree that the path that the growth of Unix has taken since
V7 is not necessarily ideal; we can all point to various things
about modern Unixes that we don't like. Any particular flaws came
about partly because people don't necessarily make ideal decisions
and partly because we haven't necessarily had perfect understandings
of the problems when people had to do something, and then once
they'd done something they were constrained by backward compatibility.
(In some ways Plan 9 represents 'Unix without the constraint of
backward compatibility', and while I think there are a variety of
reasons that it failed to catch on in the world, that lack of
compatibility is one of them. Even if you had access to Plan 9, you
had to be fairly dedicated to do your work in a Plan 9 environment
(and that was before the web made it worse).)
PS: It's my view that the people who are pushing various Unixes
forward aren't incompetent, stupid, or foolish. They're rational
and talented people who are doing their best in the circumstances
that they find themselves. If you want to throw stones, don't throw
them at the people, throw them at the overall environment that
constrains and shapes how everything in this world is pushed to
evolve. Unix is far from the only thing shaped in potentially
undesirable ways by these forces; consider, for example, C++.
(It's also clear that a lot of people involved in the historical
evolution of BSD and other Unixes were really quite smart, even if
you don't like, for example, the BSD sockets API.)
Linux kernel NFS: we don't have mandatory locks.
Also Linux kernel NFS: if the server has delegated a file to a NFS
client that's now not responding, good luck writing to the file from
any other machine. Your writes will hang.
NFS v4 delegations are
a feature where the NFS server, such as your Linux fileserver, hands a lot of authority over a particular
file to a client that is using that file. There are various
sorts of delegations, but even a basic read delegation will force
the NFS server to recall the delegation if anything else wants
to write to the file or to remove it. Recalling a delegation requires
notifying the NFS v4 client that it has lost the delegation and
then having the client accept and respond to that. NFS v4 clients
have to respond to the loss of a delegation because they may be
holding local state that needs to be flushed back to the NFS server
before the delegation can be released.
(After all, the NFS v4 server promised the client 'this file is yours
to fiddle around with, I will consult you before touching it'.)
Under some circumstances, when the NFS v4 server is unable to contact
the NFS v4 client, it will simply sit there waiting and as part of
that will not allow you to do things that require the delegation
to be released. I don't know if there's a delegation recall timeout,
although I suspect that there is, and I don't know how to find out
what the timeout is, but whatever the value is, it's substantial
(it may be the 90 second 'default lease time' from
nfsd4_init_leases_net(),
or perhaps the 'grace', also probably 90 seconds, or perhaps the
two added together).
(90 seconds is not what I consider a tolerable amount of time for
my editor to completely freeze when I tell it to write out a new
version of the file. When NFS is involved, I will typically assume
that something has gone badly wrong well before then.)
As mentioned, the NFS v4 RFC also explicitly notes that NFS v4
clients may have to flush file state in order to release their
delegation, and this itself may take some time. So even without an
unavailable client machine, recalling a delegation may stall for
some possibly arbitrary amount of time (depending on how the NFS
v4 server behaves; the RFC encourages NFS v4 servers to not be hasty
if the client seems to be making a good faith effort to clear its
state). Both the slow client recall and the hung client recall can
happen even in the absence of any actual file locks; in my case,
the now-unavailable client merely having read from the file was
enough to block things.
This blocking recall is effectively a mandatory lock, and it affects
both remote operations over NFS and local operations on the fileserver
itself. Short of waiting out whatever timeout applies, you have two
realistic choices to deal with this (the non-realistic choice is
to reboot the fileserver). First, you can bring the NFS client back
to life, or at least something that's at its IP address and responds
to the server with NFS v4 errors. Second, I believe you can force
everything from the client to expire through /proc/fs/nfsd/clients/<ID>,
by writing 'expire' to the client's 'ctl' file. You can find the
right client ID by grep'ing for something in all of the clients/*/info
files.
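(In concrete terms this is something like:
# grep -l 192.0.2.50 /proc/fs/nfsd/clients/*/info
/proc/fs/nfsd/clients/42/info
# echo expire > /proc/fs/nfsd/clients/42/ctl
where the IP address and the client ID are made up for illustration.)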
Discovering this makes me somewhat more inclined than before to
consider entirely disabling 'leases', the underlying kernel feature
that is used to implement these NFS v4 delegations (I discovered
how to do this when investigating NFS v4 client locks on the
server). This will also affect local
processes on the fileserver, but that now feels like a feature since
hung NFS v4 delegation recalls will stall or stop even local
operations.
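(If I'm remembering the knob correctly, disabling leases is a sysctl,
'sysctl -w fs.leases-enable=0', which you'd want to make persistent in
/etc/sysctl.d/ if you decide to commit to it.)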
Suppose that you're on Ubuntu 24.04, using NFS v4 filesystems mounted
from a Linux NFS fileserver, and at some point
you do a 'ls -l' or a 'ls -ld' of something you don't own. You may then
be confused and angered:
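$ ls -ld /h/somedir
ls: /h/somedir: Permission denied
drwx--x--x 11 someone somegroup 25 Sep 10 12:34 /h/somedir
(That's a reconstruction rather than pasted output, but it has the
right shape: ls complains about permissions on something that a plain
'ls -ld' should be perfectly able to show you, and then shows it
anyway.)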
(There are situations where this doesn't happen or doesn't repeat,
which I don't understand but which I'm assuming are NFS caching in
action.)
If you apply strace to the problem, you'll find that the failing
system call is listxattr(2), which
is trying to list 'extended attributes'. On Ubuntu 24.04, ls comes
from Coreutils, and
Coreutils apparently started using listxattr() in version 9.4.
The Linux NFS v4 code supports extended attributes (xattrs), which
are from RFC 8276; they're
supported in both the client and the server since mid-2020 if I'm
reading git logs correctly. Both the normal Ubuntu 22.04 LTS and
24.04 LTS server kernels are recent enough to include this support
on both the server and clients, and I don't believe there's any way
to turn them off by themselves in the kernel server (although if you disable
NFS v4.2 they may disappear too).
However, the NFS v4 server doesn't treat listxattr() operations the
way the kernel normally does. Normally, the kernel will let you do
listxattr() on an object (a directory, a file, etc) that you don't
have read permissions on, just as it will let you do stat() on it.
However, the NFS v4 server code specifically requires that you have
read access to the object. If you don't, you get EACCES (no second
S).
(The sausage is made in nfsd_listxattr() in fs/nfsd/vfs.c,
specifically in the fh_verify() call that uses NFSD_MAY_READ
instead of NFSD_MAY_NOP, which is what eg GETATTR uses.)
Normally we'd have found this last year, but we've been slow to
roll out Ubuntu 24.04 LTS machines and apparently until now no one
ever did a 'ls -l' of unreadable things on one of them (well, on a
NFS mounted filesystem).
(This elaborates on a Fediverse post. Our patch is
somewhat different than the official one.)
I've used OpenZFS on my office and home desktops (on Linux) for
what is a long time now, and over that time I've consistently used
the development version of OpenZFS, updating to the latest git tip
on a regular basis (cf). There have been
occasional issues but I've said, and continue to say, that the code
that goes into the development version is generally well tested and
I usually don't worry too much about it. But I do worry somewhat,
and I do things like read every commit message for the development
version and I sometimes hold off on
updating my version if a particular significant change has recently
landed.
But, well, sometimes things go wrong in a development version. As
covered in Rob Norris's An (almost) catastrophic OpenZFS bug and
the humans that made it (and Rust is here too)
(via), there
was a recently discovered bug in the development version of OpenZFS
that could or would have corrupted RAIDZ vdevs. When I saw the
fix commit
go by in the development version, I felt extremely lucky that I use
mirror vdevs, not raidz, and so avoided being affected by this.
(While I might have detected this at the first scrub after some
data was corrupted, the data would have been gone and at a minimum
I'd have had to restore it from backups. Which I don't currently
have on my home desktop.)
In general this is a pointed reminder that the development version
of OpenZFS isn't perfect, no matter how long I and other people
have been lucky with it. You might want to think twice before
running the development version in order to, for example, get support
for the very latest kernels that are used by distributions like
Fedora. Perhaps you're better off delaying your kernel upgrades a
bit longer and sticking to released branches.
I don't know if this is going to change my practices around running
the development version of OpenZFS on my desktops. It may make me
more reluctant to update to the very latest version on my home
desktop; it would be straightforward to have that run only time-delayed
versions of what I've already run through at least one scrub cycle
on my office desktop (where I have backups). And I probably won't
switch to the next release version when it comes out, partly because
of kernel support issues.
I recently read systemd has been a complete, utter, unmitigated
success
(via
among other places), where I found a mention of an interesting
systemd piece that I'd previously been unaware of, systemd-socket-proxyd.
As covered in the article, the major purpose of systemd-socket-proxyd
is to bridge between systemd dynamic socket activation and a
conventional program that listens on some socket, so that you can
dynamically activate the program when a connection comes in.
Unfortunately the systemd-socket-proxyd manual page is a little bit
opaque about how it works for this purpose (and what the limitations
are). Even though I'm familiar with systemd stuff, I had to think
about it for a bit before things clicked.
A systemd socket unit
activates the corresponding service unit when a connection comes
in on the socket. For simple services that are activated separately
for each connection (with 'Accept=yes'),
this is actually a templated unit,
but if you're using it to activate a regular daemon like sshd (with 'Accept=no') it will be
a single .service unit. When systemd activates this unit, it will
pass the socket to it either through systemd's native mechanism
or an inetd-compatible mechanism using standard input.
If your listening program supports either mechanism, you don't need
systemd-socket-proxyd and your life is simple. But plenty of
interesting programs don't; they expect to start up and bind to
their listening socket themselves. To work with these programs,
systemd-socket-proxyd accepts a socket (or several) from systemd
and then proxies connections on that socket to the socket your
program is actually listening to (which will not be the official
socket, such as port 80 or 443).
All of this is perfectly fine and straightforward, but the question
is, how do we get our real program to be automatically started when
a connection comes in and triggers systemd's socket activation?
The answer, which isn't explicitly described in the manual page but
which appears in the examples, is that we make the socket's .service
unit (which will run systemd-socket-proxyd) also depend on the
.service unit for our real service with a 'Requires=' and an 'After='.
When a connection comes in on the main socket that systemd is doing
socket activation for, call it 'fred.socket', systemd will try to
activate the corresponding .service unit, 'fred.service'. As it does
this, it sees that fred.service depends on 'realthing.service' and
must be started after it, so it will start 'realthing.service' first.
Your real program will then start, bind to its local socket, and then
have systemd-socket-proxyd proxy the first connection to it.
To automatically stop everything when things are idle, you set
systemd-socket-proxyd's --exit-idle-time
option and also set StopWhenUnneeded=true
on your program's real service unit ('realthing.service' here).
Then when systemd-socket-proxyd is idle for long enough, it will
exit, systemd will notice that the 'fred.service' unit is no longer
active, see that there's nothing that needs your real service unit
any more, and shut that unit down too, causing your real program
to exit.
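To make this concrete, here is a minimal sketch of the three units
involved, using the hypothetical 'fred' and 'realthing' names from
above; the port, paths, and the realthing command line are made up,
and the real nginx-based example is in the systemd-socket-proxyd
manual page.

# fred.socket -- systemd listens on the public port for us
[Socket]
ListenStream=80

[Install]
WantedBy=sockets.target

# fred.service -- started on the first connection, runs the proxy
[Unit]
Requires=realthing.service
After=realthing.service

[Service]
# --exit-idle-time needs a reasonably recent systemd
ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=5min 127.0.0.1:8080

# realthing.service -- your actual program, on a private local port
[Unit]
StopWhenUnneeded=true

[Service]
ExecStart=/usr/local/sbin/realthing --listen 127.0.0.1:8080

With all of this in place you 'systemctl enable --now fred.socket'
and leave fred.service and realthing.service to be started on demand.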
The obvious limitation of using systemd-socket-proxyd is that your
real program no longer knows the actual source of the connection.
If you use systemd-socket-proxyd to relay HTTP connections on port
80 to an nginx instance that's activated on demand (as shown in the
examples in the systemd-socket-proxyd manual page), that nginx
sees and will log all of the connections as local ones. There are
usage patterns where this information will be added by something
else (for example, a frontend server that is a reverse proxy to
a bunch of activated on demand backend servers), but otherwise you're out of
luck as far as I know.
Another potential issue is that systemd's idea of when the .service
unit for your real program has 'started', and thus when it can start
running systemd-socket-proxyd, may not match when your real program
actually gets around to setting up its socket. I don't know if
systemd-socket-proxyd will wait and try a bit to cope with the
situation where it gets started a bit faster than your real program
can get its socket ready.
(Systemd has ways that your real program can signal readiness, but
if your program can use these ways it may well also support being
passed sockets from systemd as a direct socket activated thing.)
Linux's NFS export handling system has a very convenient option
where you don't have to put all of your exports into one file,
/etc/exports, but can instead write them into a bunch of separate
files in /etc/exports.d. This is very convenient for allowing you
to manage filesystem exports separately from each other and to add,
remove, or modify only a single filesystem's exports. Also, one of
the things that exportfs(8) can do
is 'reexport' all current exports, synchronizing the system state
to what is in /etc/exports and /etc/exports.d; this is 'exportfs
-r', and is a handy thing to do after you've done various manipulations
of files in /etc/exports.d.
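(As an illustration, a hypothetical /etc/exports.d/data.exports
might hold one filesystem's exports, along the lines of:

/data/fs1   nfsclient.example.org(rw,mountpoint,no_subtree_check)

with 'exportfs -r' run afterward to re-synchronize the kernel's
export list whenever such a file is added, changed, or removed.
The file name, path, and client here are made up.)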
Although it's not documented and not explicit in 'exportfs -v -r'
(which will claim to be 'exporting ...' for various things), I have
an important safety tip which I discovered today: exportfs does
nothing on a re-export if you have any problems in your exports.
In particular, if any single file in /etc/exports.d has a problem,
no files from /etc/exports.d get processed and no exports are
updated.
One potential problem with such files is syntax errors, which is
fair enough as a 'problem'. But another problem is that they refer
to directories that don't exist, for example because you have
lingering exports for a ZFS pool that you've temporarily exported
(which deletes the directories that the pool's filesystems may have
previously been mounted on). A missing directory is an error even
if the exportfs options include 'mountpoint', which only does the
export if the directory is a mount point.
When I stubbed my toe on this I was surprised. What I'd vaguely
expected was that the error would cause only the particular file
in /etc/exports.d to not be processed, and that it wouldn't be a
fatal error for the entire process. Exportfs itself prints no notices
about this being a fatal problem, and it will happily continue to
process other files in /etc/exports.d (as you can see with 'exportfs
-v -r' with the right ordering of where the problem file is) and
claim to be exporting them.
A variety of things in typical graphical desktop sessions communicate
through the use of environment variables; for example, X's $DISPLAY
environment variable. Somewhat famously, modern desktops run a
lot of things as systemd user units, and
it might be nice to do that yourself (cf). When you put these two
facts together, you wind up with a question, namely how the environment
works in systemd user units and what problems you're going to run
into.
The simplest case is using systemd-run to
run a user scope unit ('systemd-run --user --scope --'), for example
to run a CPU heavy thing with low priority. In this situation, the new scope
will inherit your entire current environment and nothing else. As
far as I know, there's no way to do this with other sorts of things
that systemd-run will start.
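(For example, something like this runs a CPU heavy command at low
priority in its own scope; the actual command is just a stand-in:

systemd-run --user --scope -- nice -n 19 make -j8

Since the scope inherits your full environment, this behaves the
same as running the command directly, and you could also attach
resource control properties to the scope with '-p' if you wanted.)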
Non-scope user units by default inherit their environment from your
user "systemd manager". I believe that there is always only a single
user manager for all sessions of a particular user, regardless of
how you've logged in. When starting things via 'systemd-run', you
can selectively pass environment variables from your current
environment with 'systemd-run --user -E <var> -E <var> -E ...'. If
the variable is unset in your environment but set in the user systemd
manager, this will unset it for the new systemd-run started unit.
As you can tell, this will get very tedious if you want to pass a
lot of variables from your current environment into the new unit.
You can manipulate your user "systemd manager environment block",
as systemctl
describes it in Environment Commands.
In particular, you can export current environment settings to it
with 'systemctl --user import-environment VAR VAR2 ...'. If you
look at this with 'systemctl --user show-environment', you'll see
that your desktop environment has pushed a lot of environment
variables into the systemd manager environment block, including
things like $DISPLAY (if you're on X). All of these environment
variables for X, Wayland, DBus, and so on are probably part of how
the assorted user units that are part of your desktop session talk
to the display and so on.
You may now see a little problem. What happens if you're logged in
with a desktop X session, and then you go elsewhere and SSH in to
your machine (maybe with X forwarding) and try to start a graphical
program as a systemd user unit? Since you only have a single systemd
manager regardless of how many sessions you have, the systemd user
unit you started from your SSH session will inherit all of the
environment variables that your desktop session set and it will
think it has graphics and open up a window on your desktop (which
is hopefully locked, and in any case it's not useful to you over
SSH). If you import the SSH session's $DISPLAY (or whatever) into
the systemd manager's environment, you'll damage your desktop
session.
For specific environment variables, you can override or remove them
with 'systemd-run --user -E ...' (for example, to override or remove
$DISPLAY). But hunting down all of the session environment variables
that may trigger undesired effects is up to you, making systemd-run's
user scope units by far the easiest way to deal with this.
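(For instance, to run a graphical program from an SSH session using
that session's display settings instead of whatever the manager has,
something like the following sketch, where the program is a stand-in:

systemd-run --user -E DISPLAY -E XAUTHORITY -- some-x-program

Here each '-E' passes that variable's value from your current
environment, overriding the manager's version for this unit.)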
(I don't know if there's something extra-special about scope units
that enables them and only them to be passed your entire environment,
or if this is simply a limitation in systemd-run that it doesn't
try to implement this for anything else.)
The reason I find all of this regrettable is that it makes putting
applications and other session processes into their own units much
harder than it should be. Systemd-run's scope units inherit your
session environment but can't be detached, so at a minimum you
have extra systemd-run processes sticking around (and putting
everything into scopes when some of them might be services is
unaesthetic). Other units can be detached but don't inherit your
environment, requiring assorted contortions to make things work.
PS: Possibly I'm missing something obvious about how to do this
correctly, or perhaps there's an existing helper that can be used
generically for this purpose.
If you read manual pages, such as Linux's errno(3), you'll
soon discover an important and peculiar seeming limitation of
looking at errno. To quote the Linux version:
The value in errno is significant only when the return value of the
call indicated an error (i.e., -1 from most system calls; -1 or NULL
from most library functions); a function that succeeds is allowed to
change errno. The value of errno is never set to zero by any system
call or library function.
This is also more or less what POSIX says in errno,
although in standards language that's less clear. All of this is a sign
of what has traditionally been going on behind the scenes in Unix.
The classical Unix approach to kernel system calls doesn't return
multiple values, for example the regular return value and errno.
Instead, Unix kernels have traditionally returned either a success
value or the errno value along with an indication of failure, telling
them apart in various ways (such as the PDP-11 return method). At the C library level, the simple approach taken
in early Unix was that system call wrappers only bothered to set
the C level errno if the kernel signaled an error. See, for
example, the V7 libc/crt/cerror.s
combined with libc/sys/dup.s,
where the dup() wrapper only jumps to cerror and sets errno if
the kernel signals an error. The system call wrappers could all
have explicitly set errno to 0 on success, but they didn't.
The next issue is that various C library calls may make a number
of system calls themselves, some of which may fail without the
library call itself failing. The classical case is stdio checking
to see whether stdout is connected to a terminal and so should be
line buffered, which was traditionally implemented by trying to do
a terminal-only ioctl() to the file descriptors, which would fail
with ENOTTY on non-terminal file descriptors. Even if stdio did a
successful write() rather than only buffering your output, the
write() system call wrapper wouldn't change the existing ENOTTY
errno value from the failed ioctl(). So you can have a fwrite()
(or printf() or puts() or other stdio call) that succeeds while
'setting' errno to some value such as ENOTTY.
When ANSI C and POSIX came along, they inherited this existing
situation and there wasn't much they could do about it (POSIX
was mostly documenting existing practice). I believe
that they also wanted to allow a situation where POSIX functions
were implemented on top of whatever oddball system calls you wanted
to have your library code do, even if they set errno. So the only
thing POSIX could really require was the traditional Unix behavior
that if something failed and it was documented to set errno on
failure, you could then look at errno and have it be meaningful.
(This was what existing Unixes were already mostly doing and
specifying it put minimal constraints on any new POSIX environments,
including POSIX environments on top of other operating systems.)
Broadly, there have been three approaches to command history in
Unix shells. In the beginning there was none, which was certainly
simple but which led people to be unhappy. Then csh gave us in-memory
command history, which could be recalled and edited with shell
builtins like '!!' but which lasted only as long as that shell
process did. Finally, people started putting 'readline style'
interactive command editing into shells, which included some history
of past commands that you could get back with cursor-up, and picked
up the GNU Readline feature of a $HISTORY file. Broadly speaking,
the shell would save the in-memory (readline) history to $HISTORY
when it exited and load the in-memory (readline) history from
$HISTORY when it started.
I use a reimplementation of rc, the shell created by Tom Duff, and my version of the shell started
out with a rather different and more minimal mechanism for history.
In the initial release of this rc, all the shell itself did was
write every command executed to $history (if that variable was
set). Inspecting and reusing commands from a $history file was
left up to you, although rc provided a helper program that could
be used in a variety of ways. For
example, in a terminal window I commonly used '-p' to print the
last command and then either copied and pasted it with the mouse
or used an rc function I wrote to repeat it directly.
(You didn't have to set $history to the same file in every
instance of rc. I arranged to have a per-shell history file that
was removed when the shell exited, because I was only interested
in short term 'repeat a previous command' usage of history.)
Later, the version of rc that I use got support for GNU Readline
and other line editing environments (and I started using it). GNU Readline maintains its own in-memory command
history, which is used for things like cursor-up to the previous
line. In rc, this in-memory command history is distinct from the
$history file history, and things can get confusing if you mix
the two (for example, cursor-up to an invocation of your 'repeat
the last command' function won't necessarily repeat the command you
expect).
It turns out that at least for GNU Readline, the current implementation
in rc does the obvious thing; if $history is set when rc
starts, the commands from it are read into GNU Readline's in-memory
history.
This is one half of the traditional $HISTORY behavior. Rc's current
GNU Readline code doesn't attempt to save its in-memory history
back to $history on exit, because if $history is set the regular
rc code has already been recording all of your commands there. Rc
otherwise has no shell builtins to manipulate GNU Readline's command
history, because GNU Readline and other line editing alternatives are
just optional extra features that have relatively minimal hooks into
the core of rc.
(In theory this allows thenshell to
inject a synthetic command history into rc on startup, but it
requires thenshell to know exactly how I handle my per-shell
history file.)
Sidebar: How I create per-shell history in this version of rc
The version of rc that I use
doesn't have an 'initialization' shell function that runs when the
shell is started, but it does support a 'prompt' function that's run
just before the prompt is printed. So my prompt function keeps track
of the 'expected shell PID' in a variable and compares it to the
actual PID. If there's a mismatch (including the variable being
unset), the prompt function goes through a per-shell initialization,
including setting up my per-shell $history value.
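As a rough sketch, this looks something like the following, assuming
the reimplementation's 'fn prompt' hook and its $pid variable; the
variable name and history file location are made up, and cleaning up
the history file when the shell exits is left out:

fn prompt {
	if (! ~ $myshellpid $pid) {
		myshellpid=$pid
		history=/tmp/.rchist.$pid
		# ... other per-shell initialization ...
	}
}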
Suppose, not hypothetically, that you have
a central CUPS print server, and that people also have Linux desktops
or laptops that they point at your print server to print to your
printers. As of at least Ubuntu 24.04, if you're doing this you
probably want to get people to turn off and disable cups-browsed on their machines.
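(On a typical systemd based client that's something like:

systemctl disable --now cups-browsed.service

assuming the unit is called cups-browsed.service, as it is on current
Ubuntu.)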
If you don't, your central print server may see a constant flood
of connections from client machines running cups-browsed. You're
probably running it, as I believe that cups-browsed is installed
and activated by default these days in most desktop Linux environments.
(We didn't really notice this in prior Ubuntu versions, although
it's possible cups-browsed was always doing something like this
and what's changed in the Ubuntu 24.04 version is that it's doing
it more and faster.)
I'm not entirely sure why this happens, and I'm also not sure what
the CUPS requests typically involve, but one pattern that we see
is that such clients will make a lot of requests to the CUPS server's
/admin/ URL. I'm not sure what's in these requests, because CUPS
immediately rejects them as unauthenticated. Another thing we've
seen is frequent attempts to get printer attributes for printers
that don't exist and that have name patterns that look like local
printers. One of the reasons that the clients are hitting the /admin/
endpoint may be to somehow add these printers to our CUPS server,
which is definitely not going to work.
(We've also seen signs that some Ubuntu 24.04 applications can
repeatedly spam the CUPS server, probably with status requests for
printers or print jobs. This may be something enabled or encouraged
by cups-browsed.)
My impression is that modern Linux desktop software, things like
cups-browsed included, is not really spending much time thinking
about larger scale, managed Unix environments where there are a
bunch of printers (or at least print queues), the 'print server'
is not on your local machine and not run by you, anything random
you pick up through broadcast on the local network is suspect, and
so on. I broadly sympathize with this, because such environments
are a small minority now, but it would be nice if client side CUPS
software didn't cause problems in them.
(I suspect that cups-browsed and its friends are okay in an environment
where either the 'print server' is local or it's operated by you
and doesn't require authentication, there's only a few printers,
everyone on the local network is friendly and if you see a printer
it's definitely okay to use it, and so on. This describes a lot of
Linux desktop environments, including my home desktop.)
I recently wrote about how the X Window System didn't immediately
have (thin client) X terminals. X terminals
are now a relatively obscure part of history and it may not be
obvious to people today why they were a relatively significant deal
at the time. So today I'm going to add some additional notes about
X terminals in their heyday, from their introduction around 1989
through the mid 1990s.
One of the reactions to my entry that I've seen is to wonder if
there was much point to X terminals, since it seems like they should
have cost close to what much more functional normal computers did,
with all you'd save being perhaps storage. Practically this wasn't
the case in 1989 when they were introduced; NCD's initial models
cost substantially less than, say, a Sparcstation 1 (also introduced
in 1989), apparently less than half the cost of even a diskless
Sparcstation 1.
I believe that one reason for this is that memory was comparatively
more expensive in those days and X terminals could get away with
much, much less of it, since they didn't need to run a Unix kernel
and enough of a Unix user space to boot up the X server (and I
believe that some or all of the software was run directly from ROM
instead of being loaded into precious RAM).
(The NCD16
apparently started at 1 MByte of RAM and the NCD19 at 2 MBytes,
for example. You could apparently get a Sparcstation 1 with that
little memory but you probably didn't want to use it for much.)
In one sense, early PCs were competition for X terminals in that
they put computation on people's desks, but in another sense they
weren't, because you couldn't use them as an inexpensive way to get
Unix on people's desks. There eventually was at least one piece of
software for this, DESQview/X, but it appeared
later and you'd have needed to also buy the PC to run it on, as
well as a 'high resolution' black and white display card and monitor.
Of course, eventually the march of PCs made all of that cheap, which was part of the diminishing interest
in X terminals in the later part of the 1990s and onward.
(I suspect that one reason that X terminals had lower hardware costs
was that they probably had what today we would call a 'unified
memory system', where the framebuffer's RAM was regular RAM instead
of having to be separate because it came on a separate physical
card.)
You might wonder how well X terminals worked over the 10 MBit
Ethernet that was all you had at the time. With the right programs
it could work pretty well, because the original approach of X was
that you sent drawing commands to the X server, not rendered
bitmaps. If you were using things
that could send simple, compact rendering commands to your X terminal,
such as xterm, 10M Ethernet could be perfectly okay. Anything that
required shipping bitmapped graphics could be not as impressive,
or even not something you'd want to touch, but for what you typically
used monochrome X for between 1989 and 1995 or so, this was generally
okay.
(Today many things on X want to ship bitmaps around, even for things
like displaying text. But back in the day text was shipped as, well,
text, and it was the X server that rendered the fonts.)
When looking at the servers you'd need for a given number of diskless
Unix workstations or X terminals, the X terminals required less
server side disk space but potentially more server side memory and
CPU capacity, and were easier to administer. As noted by some
commentators here,
you might also save on commercial software licensing costs if you
could license it only for your few servers instead of your lots of
Unix workstations. I don't know how the system administration load
actually compared to a similar number of PCs or Macs, but in my
Unix circles we thought we scaled much better and could much more
easily support many seats (and many potential users if you had, for
example, many more students than lab desktops).
My perception is that what killed off X terminals as particularly
attractive, even for Unix places, was that on the one hand the extra
hardware capabilities PCs needed over X terminals kept getting
cheaper and cheaper and on the other hand people started demanding
more features and performance, like decent colour displays. That
brought the X terminal 'advantage' more or less down to easier
administration, and in the end that wasn't enough (although some X
terminals and X 'thin client' setups clung on quite late, eg the
SunRay, which we had some of in the 2000s).
(We ran what were effectively
X terminals quite late, but the last few generations were basic PCs
running LTSP not dedicated hardware. All our
Sun Rays got retired well before the LTSP machines.)
(I think that the 'personal computer' model has or at least had
some significant pragmatic advantages over the 'terminal' model,
but that's something for another entry.)
Back in the early days of GPU computation, the hardware, drivers,
and software were so relatively untrustworthy that our early GPU
machines had to be specifically reserved by people and that reservation
gave them the ability to remotely power cycle the machine to recover
it (this was in the days before our SLURM cluster). Things have gotten much better since
then, with things like hardware and driver changes so that programs
with bugs couldn't hard-lock the GPU hardware. But every so often
we run into odd failures where something funny is going on that we
don't understand.
We have one particular SLURM GPU node that has been flaky for a
while, with the specific issue being that every so often the NVIDIA
GPU would throw up its hands and drop off the PCIe bus until we
rebooted the system. This didn't happen every time it was used, or
with any consistent pattern, although some people's jobs seemed to
regularly trigger this behavior. Recently I dug up a simple to use
GPU stress test program, and
when this machine's GPU did its disappearing act this Saturday, I
grabbed the machine, rebooted it, ran the stress test program, and
promptly had the GPU disappear again. Success, I thought, and since
it was Saturday, I stopped there, planning to repeat this process
today (Monday) at work, while doing various monitoring things.
Since I'm writing a Wandering Thoughts entry about it,
you can probably guess the punchline. Nothing has changed on this
machine since Saturday, but all today the GPU stress test program
could not make the GPU disappear. Not with the same basic usage I'd
used Saturday, and not with a different usage that took the GPU to
full power draw and a reported temperature of 80C (which was a
higher temperature and power draw than the GPU had been at when it
disappeared, based on our Prometheus metrics). If I'd been unable
to reproduce the failure at all with the GPU stress program, that
would have been one thing, but reproducing it once and then not
again is just irritating.
(The machine is an assembled from parts one, with an RTX 4090 and
a Ryzen Threadripper 1950X in an X399 Taichi motherboard that is
probably not even vaguely running the latest BIOS, seeing as the
base hardware was built many years ago, although the GPU has been
swapped around since then. Everything is in a pretty roomy 4U case,
but if the failure was consistent we'd have assumed cooling issues.)
I don't really have any theories for what could be going on, but I
suppose I should try to find a GPU stress test program that exercises
every last corner of the GPU's capabilities at full power rather
than using only one or two parts at a time. On CPUs, different
loads light up different functional units, and
I assume the same is true on GPUs, so perhaps the problem is in one
specific functional unit or a combination of them.
(Although this doesn't explain why the GPU stress test program was
able to cause the problem on Saturday but not today, unless a full
reboot didn't completely clear out the GPU's state. Possibly we
should physically power this machine off entirely for long enough
to dissipate any lingering things.)
For a while, X terminals
were a reasonably popular way to give people comparatively inexpensive
X desktops. These X terminals relied on X's network transparency
so that only the X server had to run on the X terminal itself, with
all of your terminal windows and other programs running on a server
somewhere and just displaying on the X terminal. For a long time,
using a big server and a lab full of X terminals was significantly
cheaper than setting up a lab full of actual workstations (until
inexpensive and capable PCs showed up).
Given that X started with network transparency and X terminals are
so obvious, you might be surprised to find out that X didn't start
with them.
In the early days, X ran on workstations. Some of them were diskless
workstations, and on some of them (especially the diskless ones),
you would log in to a server somewhere to do a lot of your more
heavy duty work. But they were full workstations, with a full local
Unix environment and you expected to run your window manager and
other programs locally even if you did your real work on servers.
Although probably some people who had underpowered workstations
sitting around experimented with only running the X server locally,
with everything else done remotely (except perhaps the window
manager).
The first X terminals arrived only once X was reasonably well
established as the successful cross-vendor Unix windowing system.
NCD,
who I suspect were among the first people to make an X terminal,
was founded only in 1987 and of course didn't immediately ship a
product (it may have shipped its first product in 1989). One
indication of the delay in X terminals is that XDM was only
released with X11R3, in October of 1988. You technically didn't
need XDM to have an X terminal, but it made life much easier, so
its late arrival is a sign that X terminals didn't arrive much
before then.
(It's quite possible that the possibility for an 'X terminal' was
on people's minds even in the early days of X. The Bell Labs Blit was a
'graphical terminal' that had papers written and published about
it sometime in 1983 or 1984, and the Blit was definitely known in
various universities and so on. Bell Labs even gave people a few
of them, which is part of how I wound up using one for a while.
Sadly I'm not sure what happened to it in the end, although by now
it would probably be a historical artifact.)
If you didn't have XDM available or didn't want to have to rely on
it, you could give your X terminal the ability to open up a local
terminal window that ran a telnet client. To start up an X environment,
people would telnet into their local server, set $DISPLAY (or have
it automatically set by the site's login scripts), and start at
least their window manager by hand. This required your X terminal
to not use any access control (at least when you were doing the
telnet thing), but strong access control wasn't exactly an X terminal
feature in the first place.
I wrote about a performance mystery with WireGuard on 10G Ethernet, and since then I've done additional
measurements with results that both give some clarity and leave me
scratching my head a bit more. So here is what I know about the
general performance characteristics of Linux kernel WireGuard on a
mixture of Ubuntu 22.04 and 24.04 servers with stock settings, and
using TCP streams inside the WireGuard tunnels (because the high
bandwidth thing we care about runs over
TCP).
CPU performance is important even when WireGuard isn't saturating
the CPU.
CPU performance seems to be more important on the receiving side
than on the sending side. If you have two machines, one faster
than the other, you get more bandwidth sending a TCP stream from
the slower machine to the faster one. I don't know if this is an
artifact of the Linux kernel implementation or if the WireGuard
protocol requires the receiver to do more work than the sender.
There seems to be a single-peer bandwidth limit (related to CPU
speeds). You can increase the total WireGuard bandwidth of a given
server by talking to more than one peer.
When talking to a single peer, there's both a unidirectional
bandwidth limit and a bidirectional bandwidth limit. If you
send and receive to a single peer at once, you don't get the
sum of the unidirectional send and unidirectional receive; you
get less.
There's probably also a total WireGuard bandwidth limit that, in our
environment, falls short of 10G bandwidth (ie, a server talking
WireGuard to multiple peers can't saturate its 10G connection,
although maybe it could if I had enough peers in my test setup).
The best performance between a pair of WireGuard peers I've gotten
is from two servers with Xeon E-2226G CPUs; these can push their
10G Ethernet to about 850 MBytes/sec of WireGuard bandwidth in one
direction and about 630 MBytes/sec in each direction if they're
both sending and receiving. These servers (and other servers with
slower CPUs) can basically saturate their 10G-T network links with
plain (non-WireGuard) TCP.
If I was to build a high performance 'WireGuard gateway' today, I'd
build it with a fast CPU and dual 10G networks, with WireGuard
traffic coming in (and going out) one 10G interface and the resulting
gatewayed traffic using the other. WireGuard on fast CPUs can run
fast enough that a single 10G interface could limit total bandwidth
under the right (or wrong) circumstances; segmenting WireGuard and
clear traffic onto different interfaces avoids that.
(A WireGuard gateway that only served clients at 1G or less would
likely be perfectly fine with a single 10G interface and reasonably
fast CPUs. But I'd want to test how many 1G clients it took to reach
the total WireGuard bandwidth limit on a 10G WireGuard server before
I was completely confident about that.)
As a followup on discovering that WireGuard can saturate a 1G
Ethernet (on Linux), I set up WireGuard on
some slower servers here that have 10G networking. This isn't an
ideal test but it's more representative of what we would see with
our actual fileservers, since I used spare
fileserver hardware. What I got out of it was a performance and CPU
usage mystery.
What I expected to see was that WireGuard performance would top out
at some level above 1G as the slower CPUs on both the sending and
the receiving host ran into their limits, and I definitely wouldn't
see them drive the network as fast as they could without WireGuard.
What I actually saw was that WireGuard did hit a speed limit but
the CPU usage didn't seem to saturate, either for kernel WireGuard
processing or for the iperf3 process. These machines can manage to
come relatively close to 10G bandwidth with bare TCP, while with
WireGuard they were running around 400 MBytes/sec of on the wire
bandwidth (which translates to somewhat less inside the WireGuard
connection, due to overheads).
One possible explanation for this is increased packet handling
latency, where the introduction of WireGuard adds delays that keep
things from running at full speed. Another possible explanation is
that I'm running into CPU limits that aren't obvious from simple
tools like top and htop. One interesting thing is that if I do a
test in both directions at once (either an iperf3 bidirectional
test or two iperf3 sessions, one in each direction), the bandwidth
in each direction is slightly over half the unidirectional bandwidth
(while a bidirectional test without WireGuard runs at full speed
in both directions at once). This certainly makes it look like
there's a total WireGuard bandwidth limit in these servers somewhere;
unidirectional traffic gets basically all of it, while bidirectional
traffic splits it fairly between each direction.
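(For reference, these tests are essentially plain iperf3 runs aimed
at the other end's WireGuard-internal address; the address below is
a stand-in for whatever you've assigned:

iperf3 -s                          # on the receiving peer
iperf3 -c 192.168.200.2            # unidirectional test
iperf3 -c 192.168.200.2 --bidir    # both directions at once

The '--bidir' option needs a reasonably recent iperf3, but the
versions in Ubuntu 22.04 and 24.04 both have it.)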
I looked at 'perf top' on the receiving 10G machine and kernel spin
lock stuff seems to come in surprisingly high. I tried having a 1G
test machine also send WireGuard traffic to the receiving 10G test
machine at the same time and the incoming bandwidth does go up by
about 100 Mbytes/sec, so perhaps on these servers I'm running into
a single-peer bandwidth limitation. I can probably arrange to test
this tomorrow.
(I can't usefully try both of my 1G WireGuard test machines at once
because they're both connected to the same 1G switch, with a 1G
uplink into our 10G switch fabric.)
PS: The two 10G servers are running Ubuntu 24.04 and Ubuntu 22.04
respectively with standard kernels; the faster server with more
CPUs was the 'receiving' server here, and is running 24.04. The two
1G test servers are running Ubuntu 24.04.
I'm used to thinking of encryption as a slow thing that can't deliver
anywhere near to network saturation, even on basic gigabit Ethernet
connections. This is broadly the experience we see with our current VPN servers,
which struggle to turn in more than relatively anemic bandwidth
with OpenVPN and L2TP, and so for a long time I assumed it would
also be our experience with WireGuard
if we tried to put anything serious behind it. I'd seen the 2023
Tailscale blog post about this
but discounted it as something we were unlikely to see; their
kernel throughput on powerful sounding AWS nodes was anemic by 10G
standards, so I assumed our likely less powerful servers wouldn't
even get 1G rates.
Today, for reasons beyond the scope of this entry, I wound up
wondering how fast we could make WireGuard go. So I grabbed a
couple of spare servers we had with reasonably modern CPUs (by our
limited standards), put our standard Ubuntu 24.04 on them, and took
a quick look to see how fast I could make them go over 1G networking.
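For reference, a basic two-peer setup of this sort looks something
like the following wg-quick configuration (the addresses, hostname,
and key placeholders here are made up, not our actual values):

# /etc/wireguard/wg0.conf on the first server
[Interface]
Address = 192.168.200.1/24
ListenPort = 51820
PrivateKey = <this server's private key>

[Peer]
PublicKey = <the other server's public key>
AllowedIPs = 192.168.200.2/32
Endpoint = server2.example.org:51820

You bring this up with 'wg-quick up wg0' on both servers (with the
addresses, keys, and endpoint swapped around on the second one), run
'iperf3 -s' on one end, and point 'iperf3 -c' at the other end's
192.168.200.* address.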
To my surprise, the answer is that WireGuard can saturate that 1G
network with no particularly special tuning, and the system CPU
usage is relatively low (4.5% on the client iperf3 side, 8% on the
server iperf3 side; each server has a single Xeon E-2226G). The low
usage suggests that we could push well over 1G of WireGuard bandwidth
through a 10G link, which means that I'm going to set one up for
testing at some point.
While the Xeon E-2226G is not a particularly impressive CPU, it's
better than the CPUs our NFS fileservers
have (the current hardware has Xeon Silver 4410Ys). But I suspect
that we could sustain over 1G of WireGuard bandwidth even on them,
if we wanted to terminate WireGuard on the fileservers instead of
on a 'gateway' machine with a fast CPU (and a 10G link).
More broadly, I probably need to reset my assumptions about the
relative speed of encryption as compared to network speeds. These
days I suspect a lot of encryption methods can saturate a 1G network
link, at least in theory, since I don't think WireGuard is exceptionally
good in this respect (as I understand it, encryption speed wasn't
particularly a design goal; it was designed to be secure first).
Actual implementations may vary for various reasons so perhaps our
VPN servers need some tuneups.
(The actual bandwidth achieved inside WireGuard is less than the
1G data rate because simply being encrypted adds some overhead.
This is also something I'm going to have to remember when doing
future testing; if I want to see how fast WireGuard is driving the
underlying networking, I should look at the underlying networking
data rate, not necessarily WireGuard's rate.)