I'm not sure we'd use AppArmor much even if we could

By: cks

The news of the time interval is a string of local privilege escalation vulnerabilities in Linux (in part in the kernel). We very much need the security boundary of Unix logins, and some of these vulnerabilities are mitigated or blocked by various Linux kernel security modules ('LSMs') (cf), so I've recently been thinking about whether we'd use AppArmor, the LSM that Ubuntu supports.

(AppArmor didn't block as many of the vulnerabilities as a proper SELinux setup did, but SELinux needs distribution buy-in and that's not what Canonical provides.)

We've traditionally disabled AppArmor because it's had issues in our environment, where NFS home directories live in our own locations for them (also). So let's assume that AppArmor magically works now for NFS home directories and other directories (or can easily be made to work with tuning knobs), and still provides meaningful security afterward. Setting up AppArmor for our environment will take some amount of work (cf), so the question is how much protection against local privilege escalation we get.

Roughly speaking, our systems fall into two categories; systems that normal people can access and run programs on, and systems that are purely for services (including things such as IMAP mail). For services, in theory we (or the people writing AppArmor profiles) can work out what the services should be allowed to do and not do, and thus lock things down against local privilege escalations in kernel systems that the services shouldn't be touching anyway (and other vulnerabilities, such as information disclosure from reading files the service shouldn't be accessing). However, this protects against an unlikely set of chained issues, where there's both a vulnerability in a service itself and then an additional vulnerability in the kernel.

(If these issues aren't unlikely, we have bigger problems.)

That leaves the systems where normal people can run their own programs (which are the ones where we really need the security boundary of logins). On these systems we have to assume that an attacker can gain the ability to run relatively arbitrary programs, either by compromising an account outright or through, for example, a compromised package that people are using in the code they're writing for their research (or a compromised editor extension, or etc; there are lots of ways in). Since people are effectively running arbitrary code, we can't protect ourselves by having AppArmor restrict what specific programs can do the way we can on service-based machines. Instead, we have to find and inventory kernel features that people will never legitimately use, and then block them through AppArmor rules.

(This is how a strict SELinux setup appears to protect against the recent vulnerabilities; a normal login is simply not allowed to use, eg, RDS sockets.)

The Linux kernel has a lot of features and facilities, although some of them are blocked off because we don't allow user namespaces, and people doing CS research do a lot of things, some of them at least unusual. Could an AppArmor profile (or a set of them) be written so that people would be allowed access to what they use and not allowed access to things that they don't? Probably (although AppArmor is more focused on programs than on people, well, logins). Would we be able to find an out of the box set of AppArmor rules and so on that worked? Maybe, and this depends on exploits not being found in areas that people pretty much have to be given access to.

If we had a reliable set of AppArmor or SELinux profiles, we might well use them because it would be easy enough. Without a reliable set of AppArmor profiles, I'm not sure we'd try to build some ourselves unless we were desperate. And if we were going to do the work, it appears that we might get more results for less effort through things like explicitly blocking all the loadable kernel modules for Linux socket types that we don't use.

(Some people even block all kernel modules that their current configuration doesn't use. I'm not sure I'd go that far, but I suppose you can always un-block things like the netfilter modules if you turn out to want to add some nftables rules later.)
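As a rough sketch of the module-blocking idea, a modprobe.d fragment can map each unwanted socket-family module to a command that always fails. Apart from RDS, the module names below are assumptions picked for illustration, as is the file name. The script writes to a scratch directory rather than /etc/modprobe.d so it can be tried safely.

```shell
#!/bin/sh
# Emit modprobe.d lines that make on-demand loading fail for each named module.
# 'install MODULE COMMAND' tells modprobe to run COMMAND instead of loading
# the module; /bin/false always fails, so the module is never loaded this way.
block_modules() {
    for mod in "$@"; do
        printf 'install %s /bin/false\n' "$mod"
    done
}

# Hypothetical module list; audit what your own systems actually need.
outdir=${OUTDIR:-$(mktemp -d)}
block_modules rds tipc dccp sctp > "$outdir/block-sockets.conf"
cat "$outdir/block-sockets.conf"
```

Copying the resulting file into /etc/modprobe.d/ would apply it. This only affects modules loaded through modprobe; it does nothing for code built into the kernel.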

Getting C code navigation even for Debian (or Ubuntu) packages

By: cks

Every so often, I want (or need) to make modifications to programs in an Ubuntu package, and often the programs are written in C (and these days I'm using dgit to manipulate the package). One of my challenges when I do this is that I generally don't start out knowing where and how to change the code to do what I want; instead, I have to navigate around an unfamiliar code base and work out enough of its structure to find the specific bit of code I need to change.

These days, the dominant way to get smart code navigation and other code knowledge things is through LSP servers and clients. A variety of modern and semi-modern languages have LSP servers that you can immediately use in your editor of choice and then navigate around random code bases with handy features like 'find definition' and 'find references' (for example, Go, Python, and Rust). Unfortunately, C isn't such a language. In the general case, understanding C code requires knowing how it's compiled, and that means you often have to tell C LSP servers this information. Well, specifically you have to tell this stuff to clangd, the dominant LSP server for C and C++.

(There's also ccls, which may work out part of this information on its own, but it seems to be less popular and I have no experience with it.)

Fortunately for people like me, there is a simple way to gather this compilation information even if the program's build system doesn't do it for you, and that's Bear (which is available as a standard Ubuntu package for extra convenience). Bear operates as a front-end on however you normally build your program; you build your program (or collection of programs) with 'bear -- <build command>', and Bear monitors compiler execution and records everything. This is slower than a normal build (sometimes significantly so), but you get a compilation database out of it and then you can use LSP tooling to jump around the source code.

(My understanding is that gcc, clang, and so on can generate this compilation information if they're asked, and modern build systems often ask them to do so, but an old fashioned build system using things like 'make' won't include the magic compiler options necessary. Possibly you can include them yourself by hand, but Bear takes care of the work for you.)

Somewhat to my surprise, Bear not only works with programs built by 'make', it also works when you build Debian or Ubuntu packages under Bear with 'bear -- dpkg-buildpackage -uc -b'. If you're building a substantial package (such as Dovecot), you're definitely going to notice the slowdown, but you do get LSP based code intelligence out of it (and you only have to do this once, not every time you change the code).

(Under some circumstances you may have to edit the generated compile_commands.json to take out gcc options that clang doesn't support, but fortunately the JSON file is in a human friendly format where each compiler option is on its own line. Possibly there's a way to manipulate the Debian/Ubuntu package build process to not use such options in the first place.)
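Because each compiler option sits on its own line, removing an option can be done with a line filter instead of by hand. This is a minimal sketch under that assumption; the function name and the option in the usage example are hypothetical, and it assumes the option is never the last element of its JSON array (where removing it would leave a dangling comma on the previous line).

```shell
#!/bin/sh
# strip_option OPTION < compile_commands.json > filtered.json
# Drops every line that consists solely of the quoted OPTION, optionally
# followed by a comma. Uses an exact string comparison, so options that
# merely start with OPTION are left alone.
strip_option() {
    awk -v opt="\"$1\"" '{
        line = $0
        # Trim leading indentation and any trailing comma or whitespace.
        gsub(/^[ \t]+|,?[ \t]*$/, "", line)
        if (line != opt) print
    }'
}
```

Usage would look like 'strip_option -fexample-gcc-only < compile_commands.json > new.json', followed by replacing the original file once the result looks sane.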

Building Debian and Ubuntu packages contaminates your source directory, so once you've run a build under Bear to generate the compile_commands.json file, you need to move the file to safety and then reset your source directory somehow. If you're using dgit (which I very much think you should be), I believe this can be done with a variant of the standard dgit source directory reset instructions:

git clean -xdf -e compile_commands.json
git reset --hard

The process I suspect I'm going to follow in future dgit modifications of Ubuntu packages is to set up the package with dgit, build it once under Bear in unmodified state, rm the generated .deb and .ddeb files, and then start poking around the source code with LSP intelligence to find where I need to make my modifications (and then commit them and do a dgit build as usual).

(This elaborates on some Fediverse posts.)

In praise of the Linux kernel netconsole (in the right circumstances)

By: cks

The Linux kernel's netconsole is a kernel module that will "log kernel printk messages over UDP" to a remote system, which makes it another form of kernel (message) console. These days it can be activated either on boot or after boot, and in the past I've had mixed views of it. However, I recently had a nice experience with netconsole that's left me better inclined toward it in specific situations.

A while back, my home desktop started locking up every once in a while. Several years ago my home desktop had a somewhat similar problem that was due to hardware issues, but the lockups this time were different, in that the machine would lock up for a bit and then reboot on its own. Local logs showed nothing, but I happen to have another machine sitting around so I thought I might as well try netconsole again. These days netconsole can be enabled on the fly:

modprobe netconsole
cd /sys/kernel/config/netconsole
mkdir heedra
cd heedra
echo em0 >dev_name
echo 192.168.X.Y >remote_ip
echo 1 >enabled

(This other machine is called heedra for obscure reasons.)
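For the activate-on-boot case, the same target can be expressed as a module parameter instead of configfs writes. The sketch below only prints a modprobe.d 'options' line; the parameter layout follows the kernel's netconsole documentation, while the function name and the 192.0.2.10 address are placeholders of mine.

```shell
#!/bin/sh
# Print a modprobe.d 'options' line that configures netconsole at module load.
# Parameter layout (from the kernel's netconsole documentation):
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# Fields left empty take the kernel's defaults.
netconsole_options() {
    dev=$1
    target_ip=$2
    target_port=${3:-6666}
    printf 'options netconsole netconsole=@/%s,%s@%s/\n' \
        "$dev" "$target_port" "$target_ip"
}

# 192.0.2.10 is a documentation-only placeholder address, not a real target.
netconsole_options em0 192.0.2.10
```

Something still has to load the module at boot (for example a modules-load.d entry), and whether the network interface is ready that early is worth testing on the actual machine.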

On the other machine I ran a simple script to capture output inside a screen session:

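#!/bin/sh
# Restart nc whenever it exits; tee -a appends, so a restart
# doesn't truncate messages that were already captured.
while :; do
   nc --recv-only -u -l 6666 |
      tee -a "$HOME/work/h-logs/netconsole"
done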

(The advantage of --recv-only is that nc won't complain if I hit CR a few times in the screen session to create blank lines, so new messages are more obvious.)

After a while, my home desktop locked up again and rebooted soon afterward. When I checked the netconsole log file on the other machine, I discovered that I had actually captured kernel log messages, and reasonably useful ones at that.

The kernel logs revealed that this appears to be a kernel 'soft lockup', where all cores had gone to 100% system usage during what appears to be TLB flushes or cross-core kernel communication. In several of the kernel stack backtraces, bpf_trace_run4 appears, so I suspect that there's an uncommon eBPF locking race or issue that's infrequently tickled by the eBPF metrics gathering programs I normally run on my desktop.

(It's probably not from the eBPF programs systemd uses for network access control, since those are used widely.)

Capturing these kernel messages doesn't give me a solution, but at least it gives me a way forward if the lockups get too frequent and annoying (I can try disabling my eBPF metrics collectors). And I couldn't have gotten these messages with anything else except a serial console, which I don't have available on my home desktop and anyway would have needed a second machine in physical proximity (which is awkward in my home setup).

My understanding is that netconsole isn't quite as reliable as a serial console for getting last gasp kernel panic messages out, since you need more kernel pieces to still be working to transmit network packets. But it's more reliable than anything short of a serial console, and serial consoles are generally in short supply on modern desktops and desktop-like things (including hand-built SLURM nodes). For one off, small scale use my listening script would be fine, although if we needed to use it on a larger scale, we'd need some infrastructure to collect netconsole logs from multiple machines.

(Some suggestions for that are in the comments on my earlier entry.)

Your Linux distribution may no longer auto-generate new SSH host keys

By: cks

All Linux distributions (and all systems) face the need to generate SSH host keys when your system gets installed. One traditional way this was done was if the system started and discovered it had no SSH host keys, it would generate new ones. One way this was handy was that if you wanted to generate new SSH host keys for some reason, you could remove the existing ones and either reboot or restart the SSH daemon (which would usually trigger this).

As I found out the hard way the other day, some Linux distributions don't do this any more. In particular, Ubuntu doesn't. If you remove your SSH host keys, your SSH daemon will refuse to (re)start, and as far as I know there's no convenient, simple way to regenerate the necessary keys. If you make this mistake (as I did), you'll get to have fun looking up the ssh-keygen arguments you need (and then typing them in on the system console or a serial connection).

Before I started writing this entry, I would have guessed that this was common behavior across multiple distributions, because in this day and age it makes sense for your SSH keys to be set up in the installer rather than (possibly) on system boot, in a situation where the kernel's random number generation may not have accumulated much entropy. However, it turns out that Fedora doesn't behave like this.

Fedora's OpenSSH package has an entire set of systemd units and a script to generate SSH host keys if any of them are missing. Fedora has a templated sshd-keygen@.service, which uses /usr/libexec/openssh/sshd-keygen to generate a host key of the appropriate type if it doesn't exist. Then Fedora's sshd.service unit 'wants' sshd-keygen.target, which in turn wants sshd-keygen@rsa.service, sshd-keygen@ecdsa.service, and sshd-keygen@ed25519.service, so before sshd starts, any missing host keys will be generated (whether or not your specific SSH server configuration uses them).

Since Ubuntu usually follows Debian, I assume Debian also doesn't automatically regenerate SSH host keys (and if it does, it doesn't seem to use the approach Fedora does). Fedora derived enterprise distributions probably follow Fedora, but I'm not even going to look. Other distributions may go either way; there probably isn't anything you could describe as a standard approach for this.

In the future, if I want to reset an Ubuntu machine's SSH host keys, the simplest thing for me will be to copy the Fedora sshd-keygen over to the system and run it (since my desktops are Fedora, I have convenient access to it). On a quick scan, the script itself is distribution-independent, so in theory you (I) could fish it out of Fedora in advance and stash a copy somewhere.

(Especially for servers, there's an argument that a missing SSH host key should be a fatal error for sshd, not something you should automatically fix up, since something is obviously badly wrong. If you generate new SSH host keys anyway so maybe people can SSH in to check the server, what you're effectively doing is training people to accept mismatched host keys in times of problems.)

Update: In a comment, Andreas pointed out 'ssh-keygen -A', which does exactly this system host key regeneration.
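A small sketch of what that looks like in practice, run against a scratch directory so it can be tried without touching a live system. The function name is mine; the '-f' prefix behaviour is as described in the ssh-keygen manual page, where the argument is prepended to the default /etc/ssh key paths.

```shell
#!/bin/sh
# Regenerate any missing default host keys under PREFIX/etc/ssh.
# With no -f at all, 'ssh-keygen -A' works on the real /etc/ssh (as root).
regen_host_keys() {
    prefix=$1
    # ssh-keygen -A does not create the directory itself.
    mkdir -p "$prefix/etc/ssh" || return 1
    ssh-keygen -A -f "$prefix"
}

# Try it against a scratch directory rather than the live system.
scratch=$(mktemp -d)
regen_host_keys "$scratch" && ls "$scratch/etc/ssh"
```

On the real system the equivalent is a plain 'ssh-keygen -A' as root, followed by restarting the SSH daemon. Existing keys are left alone; only missing ones are created.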

The easy way to switch my libvirt-based virtual machines to UEFI

By: cks

I mentioned before that I've been switching some libvirt-based virtual machines to UEFI. I've recently had to do some more things there, which has led me to discover what's important about the XML parts of your libvirt machine definitions for this. Or at least, what's important if you use virt-manager to change things.

(There's a long story that boils down to libvirt external snapshots not playing well with virtual CD-ROMs, BIOS PXE booting being annoying, and UEFI Secure Boot causing the Ubuntu 26.04 GRUB to refuse to touch the Ubuntu 22.04 installer kernel.)

As mentioned in the previous entry, what determines whether the machine boots into UEFI or BIOS is whether or not the <os> XML node has a "firmware='efi'" attribute set on it. Once you have UEFI firmware, the <os> XML node can have a '<firmware>' node with some '<feature>' nodes that tell it what to do about Secure Boot:

 <os firmware='efi'>
   <type arch='x86_64' machine='pc-q35-9.2'>hvm</type>
   <firmware>
     <feature enabled='yes' name='enrolled-keys'/>
     <feature enabled='yes' name='secure-boot'/>
   </firmware>
 </os>

By itself this isn't a fully specified UEFI set of attributes, because you need <loader> and <nvram> elements as well, and these vary based on your secure-boot and enrolled-keys settings.

Conveniently for me, if you edit your XML in virt-manager, don't have (or remove) the <loader> and <nvram> elements, and then pick the 'Apply' button, virt-manager will pick appropriate values for you based on your settings for Secure Boot (or the lack of it). This can be used when you're turning off Secure Boot (or turning it on), or when you're moving from BIOS to UEFI.

(This might also happen if you use 'virsh edit', but I haven't tested that. But I suspect it's virt-manager doing some convenient magic for you.)

So the easy way to convert a machine from BIOS booting to UEFI, with or without secure boot, is to add "firmware='efi'" to the <os> element and paste in an appropriate <firmware> block. The block above is for full Secure Boot. For full lack of Secure Boot, I want:

   <firmware>
     <feature enabled='no' name='enrolled-keys'/>
     <feature enabled='no' name='secure-boot'/>
   </firmware>

Apparently if you flip around between Secure Boot and non-Secure Boot, you may want to reset your NVRAM file. One way to do this is to remove the relevant NVRAM file that I will find in /var/lib/libvirt/qemu/nvram/. Another way is to use --reset-nvram with 'virsh start', eg 'virsh start foo --reset-nvram'. You can also use --reset-nvram with 'virsh snapshot-revert', and I may be doing that someday.

(You don't need to reset the NVRAM file when going from BIOS to UEFI because BIOS doesn't have a NVRAM file. If you go from UEFI to BIOS and then back to UEFI, probably you want to reset your NVRAM, but also maybe you want two separate VMs instead of switching between BIOS and UEFI all the time.)

Understanding the Ubuntu server installer initramfs

By: cks

I recently wrote about all of the various steps of a UEFI network install, where you have a whole collection of DHCP, GRUB fetching things via TFTP and HTTP, and so on, all to boot into your Ubuntu server install ISO image. Specifically, all of the GRUB stuff and much of the complicated DHCP stuff is there because we have to load the installer's kernel and initial ramdisk. Our primary usage for UEFI network installs is to reinstall physical servers that are now in inconvenient locations, so eventually it occurred to me that if we already have running Linux systems, there are simpler ways to boot into a specific kernel and initramfs with specific command line arguments. One way is to add new GRUB boot entries, and another way is kexec.

If we're already using a local kernel and initramfs, it might be convenient to get rid of the need for a DHCP server too, by copying the network parameters from the currently running server and embedding them in both the kernel boot parameters and, more importantly, the cloud-init files that the installer will use. To do this, we need to embed the cloud-init files in the initramfs (and then point to them with 'ds=nocloud;s=/whatever' in the kernel command lines). Well, that's the theory, but it turns out that this is not quite the practice.

The problem is that contrary to what you (I) might think, the Ubuntu server installer is not running from the initramfs. Instead, the initramfs constructs an in-memory root filesystem from various squashfs filesystem images that it gets from /casper on the installer ISO. As part of the initramfs boot, Casper mounts the ISO image (either via NFS or via a HTTP copy), finds those files on it in /casper, and then uses these files to construct the root filesystem that will then have the ISO image (still) mounted in it when Casper pivots the system into running from it. This means that while it's readily possible to add files to the initramfs, your added files are immediately discarded when Casper pivots to its pre-built root filesystem. Since the squashfs filesystem images come from the ISO image, they're generic across your systems and you can't use them to embed per-system configurations.

(In the process of this pivot, Casper will do things like switch to a standard systemd init environment.)

To deal with Casper dropping the initramfs, we must arrange to copy our injected initramfs contents into the root filesystem that Casper builds before Casper pivots into it and discards the initramfs (as far as I know, there's no way to access the initramfs after this, especially with it pre-mounted so that your cloud-init file can be immediately read). Sadly Casper makes this complicated and potentially specific to the Ubuntu server installer you're using.

As part of the Casper initramfs process, Casper will run a collection of scripts from /scripts/casper-bottom, so ideally we can just add our own script to that and have it copy things from the initramfs to appropriate places in /root (the real root filesystem to be). Unfortunately, Casper doesn't scan this directory for scripts to run; instead what scripts to run (in what order) is handled by /scripts/casper-bottom/ORDER (this is the standard Casper way and is used for other Casper 'directories of scripts'). So we have to add our script and also replace the ORDER file from the ISO's initrd with one that includes our script.

A Linux kernel initramfs is a collection of cpio archives, with the last archive (usually) compressed. You can put your own uncompressed cpio archive on the front, or (usually) compress your own cpio archive with the same compression method as the compressed archive and stick it on at the end. Files in later cpio archives overwrite files from earlier cpio archives, and since we need to overwrite /scripts/casper-bottom/ORDER, we have to put our cpio archive at the end. Starting no later than Ubuntu 22.04 LTS, the standard installers all have the last cpio archive compressed with zstd, so that's also what we need to compress our own cpio archive with.

(I believe there are potentially tricky issues with sticking compressed archives together this way, which I will leave to others to investigate. I made a 26.04 version work without problems but that could have been luck.)

To make this less annoying, we can use two local cpio archives. One archive contains only our additions and changes to /scripts/casper-bottom; it's zstd compressed and goes on the end of the initramfs, and we can even prepare generic, amended initramfs images with this already pre-built. Then the only per-machine addition we need to build is our cloud-init configuration files, which can go into an uncompressed cpio archive that we put on the front of our initramfs (perhaps the prepared, modified initramfs). This will give us a full initramfs that we can use as kexec's '--initrd' argument (or set up in a GRUB entry).
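The two-archive assembly might be sketched like this. The function name and the directory layout are assumptions: each input directory is expected to mirror the initramfs paths its files should land at, and what goes into the casper-bottom script and the replacement ORDER file is left out entirely. It needs GNU cpio and zstd.

```shell
#!/bin/sh
# build_initrd ORIG_INITRD FRONT_DIR TAIL_DIR OUTPUT
#   FRONT_DIR: per-machine files (such as the cloud-init files)
#   TAIL_DIR:  generic additions, e.g. scripts/casper-bottom/ORDER and
#              the extra casper-bottom script
build_initrd() {
    orig=$1 front_dir=$2 tail_dir=$3 out=$4
    tmp=$(mktemp -d) || return 1

    # Uncompressed cpio archive that goes on the front.
    (cd "$front_dir" && find . | cpio -o -H newc --quiet) \
        > "$tmp/front.cpio" || return 1

    # zstd-compressed cpio archive for the end. Later archives win, so a
    # replacement ORDER file in here overrides the ISO's copy.
    (cd "$tail_dir" && find . | cpio -o -H newc --quiet | zstd -q -c) \
        > "$tmp/tail.cpio.zst" || return 1

    cat "$tmp/front.cpio" "$orig" "$tmp/tail.cpio.zst" > "$out"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

A call would look like 'build_initrd casper/initrd ./per-machine ./casper-additions new-initrd'. Files are archived with whatever ownership they have on disk, so building as root (or otherwise fixing ownership) may matter for what ends up in the installer's root filesystem.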

(This is not quite enough by itself to enable a DHCP-less network boot and install, because we also have to configure the system's IP address and other details in Casper itself via the 'ip=' command line argument; see casper(7) for the format of that. With a proper ip= setting, Casper can find the ISO image and mount it, and with a proper cloud-init injected into the initramfs and then the installer root filesystem, the server installer will properly set up networking and keep it up so that you can go through the normal over the network installer operation.)

PS: Apparently I will go through quite a lot to not have to maintain and update DHCP server entries, even through scripts that the future me might have fun writing.

The various steps of a UEFI network install from an Ubuntu server ISO

By: cks

Suppose, not hypothetically, that you have a locally customized Ubuntu server install ISO image (and have for a while), and you also now have a number of UEFI based machines that it would be convenient to (re)install over the network without having to visit them in person (and they don't have IPMIs/BMCs that support virtual media). It turns out that you can take an Ubuntu ISO and install from it over the network, but how the various steps and stages connect together isn't obvious. Here are my notes on this, before I forget them all. I'll assume that you already have a modern Ubuntu server installer configuration setup, but you can also do this with a stock ISO image that will walk you through the full set of server installer questions.

The process of booting your ISO over the network goes like this (including recommended things):

  1. Your UEFI based server sends out a DHCP request that includes, among other things, its request for one of the UEFI network booting options.
  2. Your DHCP server answers with the server's IP and either a HTTP URL (for UEFI HTTP boot) to shimx64.efi, or a TFTP server and the (TFTP) path to shimx64.efi. Most of your machines will probably want the TFTP option. Provided that your DHCP server gave the server you're installing a usable gateway, this TFTP and HTTP server doesn't have to be on the same network as the server you're network installing.
  3. Shimx64.efi will load grubx64.efi (which must be the signed grubnetx64.efi) from the same server and (relative) path as it was loaded, eg if shimx64.efi was loaded from '/inst/2604/grub/shimx64.efi', it will load '/inst/2604/grub/grubx64.efi'.

    The shimx64.efi and grub(net)x64.efi don't have to be from the Ubuntu version you're booting, but your grubx64.efi should match the GRUB modules you're going to use with it. You probably want to use the latest GRUB you can conveniently get your hands on.

  4. GRUB will load '/grub/grub.cfg' and various other things in '/grub' from your TFTP or HTTP server. Unlike the shimx64 to grubx64 transition, GRUB (at least the Ubuntu version) insists on using an absolute path, not one relative to the directory it was loaded from. GRUB will expect to find various things in, for example, '/grub/x86_64-efi/'.

    In your /grub/grub.cfg, you can switch all future accesses to HTTP by using '(http)' in future references to things, perhaps with a prefix:

    set http=(http)/inst/
    

    Your grub.cfg can be universal for all of your machines, or you can go on to load a machine-specific one using some GRUB variables:

    source $http/grub/by-net/$net_default_ip
    

    (This trick comes from a co-worker, not me.)

    Some GRUB documentation will claim that GRUB will automatically search for a variety of grub.cfg names that are derived from the machine's IP address and other parameters. This is experimentally false for the Ubuntu 26.04 UEFI grubnetx64.efi; my server logs show no attempts for anything other than '/grub/grub.cfg'.

  5. Whatever GRUB configuration file you use now loads the appropriate installer ISO's kernel and initrd, ideally over HTTP instead of TFTP because you switched above. You can get both of these from the /casper directory on the ISO (along with things you don't need). Once you've put these where you want them, you can specify them as, say:

    linux $http/casper/vmlinuz ip=dhcp [other options to come] ---
    initrd $http/casper/initrd
    

    Because the Ubuntu ISO's initrd contains kernel modules, it's specifically tied to the ISO's kernel; you have to use a matching pair and can't just swap in a more modern kernel with better hardware support for your hardware.

  6. GRUB boots the installer kernel with the installer initrd, which makes its own DHCP request (and hopefully gets the same IP back), because once booted into Linux you no longer get to use UEFI services and the UEFI-obtained DHCP stuff. If you forgot to put 'ip=dhcp' into the kernel command line, the Ubuntu server installer initrd won't do DHCP, won't set up any networking, and everything else will fail.

    (It would be nice if the kernel automatically inherited all of the UEFI IP settings, including the TFTP or HTTP server information, but as far as I can tell it doesn't.)

  7. The initrd 'mounts' the ISO. You have two options for how this is done, which are covered in the casper(7) manual page. Either the .iso image itself can be fetched over HTTP, stuffed into RAM, and mounted as a ramdisk image, or you can NFS mount an extracted directory tree version of it from a suitable server (perhaps the very install server that you've been TFTP'ing and HTTP'ing from so far; GRUB's $net_default_server variable may be convenient for this).

    The simpler option is configuring a NFS mount. This is done with the (kernel) command line options:

    netboot=nfs nfsroot=W.X.Y.Z:/some/path/
    

    To fetch the ISO from a URL, the kernel command line parameter is 'iso-url=http://...', but by itself this will probably fail because the default ramdisk is too small. So instead you need to also specify a bigger ramdisk (the size appears to be superstition, cf, but it works for Ubuntu 26.04 beta):

    root=/dev/ram0 ramdisk_size=1500000 iso-url=....
    

    A potential advantage of directly loading the ISO is that once it's loaded, you don't really have to care about the network connection to the install server. With a NFS mount, if something resets the networking you're really up the creek. On the other hand, the NFS mount starts quickly and means you don't have to care about things like ramdisk sizes and how much RAM your servers have.

  8. Something fetches your installer configuration quite early on (I think it may be the installer proper, not the initrd). If you don't provide a configuration, all you've done is network booted the stock install ISO and it's now going to sit there asking you to interact with the installer on the system console (which might be good enough if the server has a BMC with KVM over IP support). To either automatically install your system or to allow you remote SSH access to the installer, you need a cloud-init configuration. I believe that you can use the version you've embedded in your ISO image with your regular ds= parameter, but you may find it more convenient to fetch it via HTTP with more kernel command line parameters:

    "ds=nocloud-net;s=http://..."
    

    (You have to put this in quotes or GRUB will break it at the ';'.)

    If your install isn't fully automated and you want remote access to it to configure the interactive sections, your cloud-init user-data must include a chpasswd section for the user 'installer', or a ssh_authorized_keys with an appropriate key (which will again be used for 'installer').

    (I found this long ago from here.)

    (It's possible that you can configure a kernel serial console and then use IPMI Serial Over LAN to talk to the installer, if you have an IPMI with SoL support but no KVM over IP.)

  9. The Ubuntu server installer will start up as normal, just as if it was booted from a real ISO, except that when the installer gets to configuring the network, it will reset networking and proceed according to your default configured networking, if any. This makes it critical to set 'dhcpv4: true' (or 'dhcpv6: true' if you're that sort of person) in your installer configuration, because otherwise your server will drop off the network, probably breaking its (network) install, especially if you opted to NFS-mount the ISO image's directory tree instead of fetching the ISO into RAM.
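Pulling the GRUB pieces of steps 4, 5, 7, and 8 together, a grub.cfg for the NFS-mount variant could be generated along these lines. This is a sketch, not a tested configuration: the function name, the menuentry wrapper and its title, and all of the addresses and paths in the example call are placeholders of mine.

```shell
#!/bin/sh
# gen_grub_cfg PREFIX NFSROOT SEED_URL
#   PREFIX:   path prefix on the HTTP server, e.g. /inst/
#   NFSROOT:  server:/path of the extracted ISO directory tree
#   SEED_URL: URL that the cloud-init configuration is fetched from
gen_grub_cfg() {
    prefix=$1 nfsroot=$2 seed_url=$3
    # \$http stays literal so that GRUB, not the shell, expands it.
    cat <<EOF
set http=(http)${prefix}

menuentry "Ubuntu server network install" {
    linux \$http/casper/vmlinuz ip=dhcp netboot=nfs nfsroot=${nfsroot} "ds=nocloud-net;s=${seed_url}" ---
    initrd \$http/casper/initrd
}
EOF
}

# Placeholder values only; 192.0.2.1 is a documentation address.
gen_grub_cfg /inst/ 192.0.2.1:/some/path/ http://192.0.2.1/seed/
```

The quotes around the ds= argument are emitted into the file on purpose, since GRUB would otherwise split it at the ';'.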

Provided that you've configured an appropriate cloud-init password or SSH key, you can SSH in to your network-booted server as 'installer' and be put in the regular server ISO installer environment, where you can go through whatever interactive steps you normally would with an in-person install. You'll want to use a big window and it needs to be a modern terminal program like gnome-terminal (don't try this with xterm). If you set 'network' as one of your interactive sections and you don't want to keep using DHCP in the installed system, you can switch from getting networking through DHCP to the same networking being set statically.

(You can also switch from DHCP to a static networking setup after the system has booted into its new local Ubuntu install; your install DHCP server is probably not going anywhere.)

Some of the kernel parameters here are confusing, because some of the time they can be interpreted by the kernel and some of the time they're ignored by the kernel and interpreted by things like casper(7). This is the case with the 'ip=' parameter, which in theory can be interpreted by the kernel but in practice is interpreted by Casper, with a different syntax. Since I just went on an extended digging session to find this out, I will tell you that the syntax Casper actually accepts for 'ip=' is the extended syntax used for klibc's ipconfig in its -d argument, because if your 'ip=' is something complicated, Casper winds up more or less passing it to ipconfig.

(This contradicts the Casper manual page but I extracted the 26.04 /casper/initrd to find this out. Not that it really matters, because in practice you mostly have to have DHCP working to get UEFI to network boot and then to keep your running install ISO on the network so you can talk to it.)

The minimal-changes version of going from an Ubuntu (server) ISO image to booting it over the network is the iso-url option, although you will need to extract /casper/vmlinuz and /casper/initrd from the ISO. This avoids setting up NFS service on your install server, and also avoids having to unpack the ISO (which is easy enough with the right tools, but you have to know what the right tools are). My personal view is that I prefer the NFS option, and if you're the right kind of person you can use Apache Alias directives to serve /casper right out of the ISO's extracted directory tree rather than copy them into your web server area.

PS: It's possible to do much the same with a BIOS PXE booting server, but you have to use PXELINUX instead of GRUB (in practice you'll want to use the 'lpxelinux.0' variant that understands HTTP). Once you're at the stage of loading and booting the kernel, everything is the same; you need to boot the /casper vmlinuz and initrd, with the same kernel command line options as in the UEFI case. The one gotcha is that you can't use the syslinux INITRD directive because it messes up the kernel command line.

I'm now using nftables for (new) static rulesets

By: cks

Over on the Fediverse, I said:

I feel I've now written enough Linux nftables configurations that I've come to like it. It's a more pf-like experience than iptables, that's for sure (and that's a good thing when you're writing a coherent ruleset instead of manipulating things on the fly).

I've had to write a few static IP filtering rulesets recently (on Ubuntu), and in each case I immediately reached for nftables and enjoyed the experience. The nftables documentation isn't what I consider great but I can navigate through it and get things done, and I even managed to get NAT working on a recent machine. I'm now mostly considering my iptables knowledge to be a legacy thing that I'll expect to use less and less in the future, although I'm not going to go out and convert iptables rulesets to nftables rulesets.

(Partly this is a conservation of attention thing. Both iptables and nftables have a lot of dim corners that I have to remember if I'm doing anything complicated with them, and I only have so much of a brain.)

One of the ways that nftables is nicer for me is that the natural way to write a nftables ruleset is to edit /etc/nftables.conf (or some other file). This lets you (me) see all of your rules in one place, think about all of them before you try to use them, revise them, and so on. You can even pre-write a nftables.conf elsewhere (in your home directory or whatever), and it's natural to put comments in. Nftables also has an acceptably PF-like concept of symbolic variables and 'anonymous sets' that can be used to write compact rules in straightforward cases in your nftables.conf; as far as I know there's no equivalent of this in iptables.

(In iptables you can use actual sets that you define and populate, or you can write shell scripts with 'for' loops and so on, but neither of these are entirely fun and as far as I know there's no great way to populate sets from nicely formatted files.)

However, this is only for whole, static rulesets. As I expected before, iptables is going to stay what I use if I need to add and remove rules on the fly, for example to block access to a service on startup or to add and remove some rules as network interfaces come and go. I know that you can do on the fly rule changes with nftables (and many of the nft examples in the manual page are of on the fly changes), but this is an area of nftables that I haven't explored and don't really want to. Unless I need to flip back and forth between two (or more) entire sets of rules, I'm going to keep using iptables for on the fly stuff.

(If I'm moving between several rulesets, 'nft -f /etc/some-file' is the easy way to flush and reload a coherent set of rules all at once, and I can write each ruleset as a coherent thing all in one place with helpful comments and so on.)

This is also only for new rulesets. Even with my new fondness for nftables, I'm not likely to rewrite existing, stable collections of iptables rules into nftables rules even if they can be expressed as a static collection of things. The one case where I can imagine doing a conversion is if I need to change existing iptables rules around substantially and rewriting them as nftables rules is easier than recovering iptables stuff that I may have forgotten by then.

PS: In fact the /etc/nftables.conf experience is sufficiently like the BSD pf experience that it fooled my mind recently. When I was working on the rules for the system with NAT, I kept adding filtering rules for the host to the 'forward' chain and then being confused when they didn't work. BSD pf doesn't have an input versus forward distinction, so my mind drifted into 'I'll just put the host rules here along with the forwarding rules'.

Some quick notes to myself on nftables 'symbolic variables'

By: cks

Nftables is the current generation Linux firewall rule system, supplanting iptables (which supplanted ipchains). As covered in the nft manual page, nftables has the concept of 'symbolic variables'. Since I'm used to BSD PF, I will crudely describe these as a combination of some parts of pf tables and PF macros. I personally feel that the nft manual page doesn't do a good job of documenting what's possible in these, so here are some notes.

The simple case is simple values:

define tundev = "tun0";
define outdev = "eno1";
define natip = 128.100.x.y
define tunnet = 172.29.0.0/16

(It turns out that the ';' here is decorative and I put it in out of superstition, judging from actually reading the "Lexical Conventions" section.)

I'm not sure of the rules of when you have to quote things and when you don't. As covered in the manual page, you use these symbolic values in the relevant nftables bits, for example a SNAT rule:

ip saddr $tunnet oifname $outdev counter snat to $natip;

Nftables also has the concept of 'anonymous sets', which are written in the obvious PF-like syntax of '{ ..., ..., ... }'. You can use symbolic variables to define anonymous sets, and if you do they can span multiple lines and have embedded comments, and of course you can have multiple elements on one line (not shown in this example):

define allowed_udp_ports = {
        # DNS
        53,
        # NTP
        123,
        # for HTTP/3 aka QUIC
        443
}

(I suspect that symbolic values written directly in nftables rules can also span multiple lines and have embedded comments, but I haven't checked.)

A comma on the last entry is optional. Unlike in BSD PF, elements must be separated by commas.

You can use this to define port numbers, IP address ranges, and no doubt other things. However, I don't know how efficient it is if you're defining large numbers of things, and of course you can't update your defined things without reloading your entire ruleset. If you need either of these features, you're going to have to figure out named nftables sets or maps.

There's no direct equivalent of the BSD PF syntax for defining a table from a file with eg 'table <SSH_IN> persist file "/etc/pf/SSH-ALLOWED"'. The closest you can come is to define an anonymous set in a file you 'include' in your nftables rules.

(I believe this is also the best you can do for loading named sets and maps from files.)
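As a rough sketch of the 'include' approach, a small script could turn a PF-style table file (one entry per line, with '#' comments) into an nftables file holding a single 'define' of an anonymous set, which your nftables.conf could then 'include'. The function name, the sample addresses, and the variable name 'ssh_allowed' are all made up for illustration; only the 'define name = { ..., ... }' syntax comes from the notes above.

```python
def render_define(name, entries_text):
    """Render 'define NAME = { ... }' from one-entry-per-line text.

    Blank lines and '#' comments are dropped. Elements are separated by
    commas, with no comma after the last one (it is optional anyway).
    """
    elements = []
    for raw in entries_text.splitlines():
        entry = raw.split("#", 1)[0].strip()
        if entry:
            elements.append(entry)
    if not elements:
        # I haven't checked whether nftables accepts an empty anonymous
        # set here, so refuse instead of guessing.
        raise ValueError("no elements found")
    body = ",\n".join("\t" + element for element in elements)
    return f"define {name} = {{\n{body}\n}}\n"


# Hypothetical contents of something like a PF table file.
sample_text = """\
# office
192.0.2.10
198.51.100.0/24   # VPN range
"""

print(render_define("ssh_allowed", sample_text), end="")
```

You would write the result to a file of its own and regenerate it (and reload the ruleset) whenever the source list changes.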

PS: Apparently there are also anonymous maps, to go with named ones.

Sidebar: Named sets in nftables

Since I just worked this out, well, found an example, here is how you write a set in your nftables.conf:

table inet filter {
    set allowed_tcp_ports {
       typeof tcp dport
       elements = { 22, 25, 80, 443 }
    }

    chain input {
       [...]
       meta iifname $outdev tcp dport @allowed_tcp_ports counter accept;
[...]

Now that I understand the use of 'typeof', I'll probably use it for all sets and maps rather than trying to look up the specific type involved (although nft can help with that with 'nft describe').

Systemd v258's 'systemctl -v restart' and its limitations

By: cks

If you've done much work with systemd services, you've probably gotten entirely used to the traditional dance of 'systemctl restart something; journalctl -f -u something' so you can see the shutdown and restart log messages of what you just theoretically restarted, assuming it's happy with life. In systemd v258, systemctl gained a new feature to help with this, systemctl -v. The help describes it reasonably well:

Display unit log output while executing unit operations.

(This means any unit operation; you can use it with 'systemctl stop', 'systemctl start', and 'systemctl reload' too.)

All of this is nice and I'm certainly going to enjoy using this feature on our future Ubuntu 26.04 machines and on my Fedora machines. However, it has an obvious limitation for 'restart', 'start', and 'reload' that in many cases is going to have me still using the journalctl stuff as well.

That limitation is right there in the description: 'while executing unit operations'. If you do 'systemctl -v restart something', systemctl stops following your service's log output the moment it considers your service to have started. In some services, this will be when the service has genuinely started and reported this to systemd, for example for a Type=notify service. In many others, for example 'Type=exec' services where you directly run some binary and it sits there doing things, systemd will consider the service started the moment your binary is running. Since systemd considers the service started, it will stop following the logs in 'systemctl -v restart'.

This is often not sufficient. Many services have a certain amount of post-exec work to do before they've genuinely started, such as loading configuration files, opening databases, initializing internal services, and so on. Some services can error out at this point, so that (as systemd sees it) they were successfully started but then immediately failed. Sometimes, the service itself intrinsically is only 'up' after it has talked to the outside world and established something, such as a DSL PPPoE link.

All of this isn't systemd's fault, but it means that 'systemctl -v restart' may only tell you the very early part of the story. And that's why for a lot of services I need to keep doing the 'journalctl' part too.

Users and session classes in Systemd v258 and later (and a gotcha)

By: cks

So I upgraded my home desktop from Fedora 42 to Fedora 43 and sound stopped working. Having your audio stop working is practically a rite of passage for Linux people, so I've been through the drill, but things rapidly turned weird when trying to restart sound daemons through 'systemctl --user restart ...' failed with systemd errors about not being able to contact the (systemd) user service manager.

Let me skip ahead and show you the culprit:

systemd-logind[2524]: New session '1' of user 'cks' with class 'user-light' and type 'tty'.

Establishing your user service manager when you log in is one of the jobs of pam_systemd. One of the things pam_systemd decides about your session is its class. In systemd v258 and later, one of the possible classes is 'user-light', for which systemd notes:

Similar to user, but sessions of this class will not pull in the user@.service(5) of the user, and thus possibly have no service manager of the user running.

(Emphasis mine.)

This 'possibly' is understated. What it means in practice is that a 'user-light' class session won't have a systemd user service manager running unless something else started it for you, for example another session that wasn't a 'user-light' one (because you only ever have one user service manager; it normally starts with your first session and exits after your last one). In turn, anything that runs as a systemd user service won't start and can't be started indirectly through, for example, systemd socket activation. And in modern Fedora, all of the sound infrastructure is handled as systemd user services (as is your user D-Bus session).

So how did we get here? Well, as the rest of the section notes:

If no session class is specified via either the PAM module option or via the $XDG_SESSION_CLASS environment variable, the class is automatically chosen, depending on various session parameters, such as the session type (if known), whether the session has a TTY or X11 display, and the user disposition.

(The 'user disposition' comes from systemd-userdbd and its JSON User Records. For normal /etc/passwd accounts, the user disposition is determined from their UID.)

The actual process pam_systemd follows is somewhat arcane. To simplify, all SSH logins are 'class=user', root is always 'user-early', and system users on the console are 'user-light'. So if you log in on the console (as I do, also) and you're considered a 'system user', you don't get a user service manager started automatically (and then things break).

Systemd is more or less hard coded to consider all UIDs up to SYS_UID_MAX in /etc/login.defs to be 'system users' (cf). On many machines, this will be all UIDs up to 999, and this number has been drifting upward over time. At various times in the past the first non-system UID and GID has been 200, and then later it was 500, so if you have logins created this long ago, systemd now considers them system users who get special handling. I have been using my Fedora desktops for a very long time, so even without even weirder things I would have fallen victim to this.

(Even on our servers, my UID is 915 and we have a significant number of people with UIDs under 1000. If pam_systemd ever stops forcing all SSH logins into class 'user', we're going to have a whole collection of problems. On my desktops, my 'natural' UID would be either 200 or 500, based on the GIDs that were created to go with it on my home and work desktops.)
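To make the simplified rules concrete, here is a small sketch of them. This is not pam_systemd's actual logic, which is more arcane; it only encodes the three-rule summary above, and the function names are invented. Since the threshold is described as more or less hard coded, reading SYS_UID_MAX out of login.defs text is only an approximation of what systemd will really use.

```python
def parse_sys_uid_max(login_defs_text, default=999):
    """Pull SYS_UID_MAX out of login.defs-style text, else use default."""
    for line in login_defs_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] == "SYS_UID_MAX":
            return int(fields[1])
    return default


def guess_session_class(uid, via_ssh, sys_uid_max=999):
    """Guess the session class using only the simplified rules above."""
    if uid == 0:
        return "user-early"   # root is always 'user-early'
    if via_ssh:
        return "user"         # all SSH logins are forced to 'user'
    if uid <= sys_uid_max:
        return "user-light"   # a 'system user' on the console
    return "user"


limit = parse_sys_uid_max("SYS_UID_MIN 201\nSYS_UID_MAX 999\n")
for uid, via_ssh in [(915, False), (915, True), (1000, False)]:
    print(uid, "ssh" if via_ssh else "console",
          guess_session_class(uid, via_ssh, limit))
```

With the usual limit of 999, a console login for UID 915 comes out as 'user-light', while the same UID over SSH is 'user'.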

Unfortunately there's no way to set a single account parameter in systemd-userdbd, so there's no way to keep using /etc/passwd but tag your historical, low-UID account to be a regular account. There's also no direct way to manipulate pam_systemd's hard coded class (re)mapping process; your only option is to completely override all class assignments with a 'class=' option on pam_systemd. This is made extra difficult on Fedora because (of course) pam_systemd is invoked in a number of generic PAM stacks such as 'system-auth', and you may not want to force all uses of pam_systemd through them to force a 'user' class for all accounts in all situations.

It's possible to work around this with sufficiently complex PAM conditionals (also). Or I could make /etc/pam.d/login use a different version of system-auth that's customized for it, although that would force root logins into class 'user' instead of 'user-early' unless I engaged in other PAM hacks.

PS: Given how much breaks without a user service manager, it feels like either pam_systemd or the 'login' PAM stack should specifically make it so that everyone who logs in on a console tty has one, with all system UIDs being class 'user-early', not just root.

PPS: I won't be working around this by changing my local UID, however peculiar it is. Partly this is because I can't fix it by adopting the same UID as we have on our servers, which would let me usefully NFS mount my home directory from our fileservers on my work desktop; as mentioned, that UID is also under the current Fedora SYS_UID_MAX.

(You can't truly fix NFS UID mapping issues with NFS v4 without Kerberos.)

Sidebar: Why my work machine didn't experience this

One reason I was willing to impulsively upgrade my home desktop last night was that the upgrade to Fedora 43 had gone fine on my work desktop, and it certainly had no sound problems afterward. My console login on my work machine was still a 'user-light' session, but the reason it had a systemd user service manager was that one had been created earlier and was sticking around. To cut a long story short, on my work desktop I was set up as a loginctl 'linger' account (/var/lib/systemd/linger says this happened May 21st 2021). Such a 'linger' account creates a session at system boot, which creates a user service manager as the result, and that session and user service manager remain until system shutdown.

Regardless of how many times you log in, you only ever have one systemd user service manager. So once a user service manager is created for any reason, including the user service manager that's started at boot for a 'linger' account, your console 'user-light' logins will still get access to that user service manager, Pipewire and other things will start normally, sound will work, and you (I) won't notice anything different.

In theory I could work around this today by setting myself up as a 'loginctl linger' account on my home machine too, and skip any PAM changes. In practice, I'm reluctant to assume that pam_systemd will always create systemd user service managers for system UIDs that are set 'linger'. It strikes me as rather the kind of thing that might get optimized some day, much as 'user-light' was optimized into systemd v258 (cf, also, also).

Finding out what your big RPMs are, in two different 'sizes'

By: cks

Suppose, not hypothetically, that you have an old Fedora system with a lot of packages installed and a 70 GByte root filesystem, which is now awkwardly small during system upgrades and so on. You would like to find out which of your roughly 7,500 packages are contributing the most to your space usage.

(The real solution is to move to a bigger pair of NVMe drives, but that involves various yak shaving and you want to upgrade to Fedora 43 today.)

The simple version of 'how big are your RPMs' is to ask rpm for the ordinary (binary) size of all of your installed binary RPMs:

rpm -qa --qf '%{SIZE:humaniec} %{N}-%{V}-%{R}.%{ARCH}\n' | sort -hr

This will tell you interesting things, like how the Fedora 43 version of wine-core-11.0-2.fc43.x86_64 is 1.3 GBytes all by itself. However, it's not necessarily the full answer for what is using up your disk space, because a single (source) package can create many binary packages (often these mostly get installed together and it's hard to split them apart in any useful way). For instance, on my work machine with the 70 GByte root partition, there are 263 'texlive' packages and 101 'perl' packages (and 66 'qemu' packages).

Often a more useful way to break down packages is by the total installed size for a particular source package. This is where I turn to my 'sumup' script, and also to 'numfmt', to get the following:

rpm -qa --qf '%{SIZE} %{SOURCERPM} %{N}-%{V}-%{R}.%{ARCH}\n' |
  sumup 2 1 | numfmt --format '%8.1f' --to iec 

This may reveal surprises that you didn't know. For example, my home desktop has 847 MBytes of packages derived from 'rocm-compilersupport', despite my home machine having no AMD GPU (it uses the integrated Intel GPU). These appear to be present as dependencies of Blender (based on what 'dnf remove' told me it wanted to do).

(It can also tell you that lots of binary packages derived from a single source package don't necessarily result in a lot of disk space being consumed. All of those 263 texlive packages amount to 289 MBytes, and those 101 Perl packages, 43 MBytes.)
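The 'sumup' script itself isn't shown here. If you don't have an equivalent, the following is a minimal sketch of what 'sumup 2 1' is assumed to do in this pipeline: group lines by field 2 and total the numbers in field 1, largest first. The function name and the sample lines are invented for illustration.

```python
from collections import defaultdict


def sum_by_key(lines, key_field=2, value_field=1):
    """Total value_field for each distinct key_field (fields are 1-based)."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < max(key_field, value_field):
            continue  # skip blank or short lines
        try:
            value = int(fields[value_field - 1])
        except ValueError:
            continue  # skip lines whose size isn't a plain number
        totals[fields[key_field - 1]] += value
    # Largest totals first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up lines in the '%{SIZE} %{SOURCERPM} %{N}-%{V}-%{R}.%{ARCH}' layout.
sample = [
    "1000 foo-1.0-1.fc43.src.rpm foo-1.0-1.fc43.x86_64",
    "250 foo-1.0-1.fc43.src.rpm foo-libs-1.0-1.fc43.x86_64",
    "400 bar-2.0-3.fc43.src.rpm bar-2.0-3.fc43.noarch",
]
for source_rpm, total in sum_by_key(sample):
    print(total, source_rpm)
```

Printing the total first keeps the output in the shape that the 'numfmt' step in the pipeline expects.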

I preserved the binary name, version, release, and architecture in the second command, even though it's not used, so that I can later copy and paste the 'rpm' command snippet to grep its output to find out all of the binary packages derived from a source package of interest. A smart approach to this would be to split this up into two commands:

rpm -qa --qf '%{SIZE} %{SOURCERPM} %{N}-%{V}-%{R}.%{ARCH}\n' >/tmp/foo
sumup 2 1 </tmp/foo | numfmt --format '%8.1f' --to iec

Putting the initial output in a file is useful because 'rpm -qa --qf ...' is not necessarily the fastest thing in the world, at least if you're asking it for the 'size' of RPMs. With the initial output saved in a file, I can just grep the file, which is going to be very fast.

PS: If your install of Fedora has been around for a while, this may also reveal various obsolete packages. I have llvm-libs packages that seem to go all the way back to Fedora 32. I probably don't need those any more, or at least I hope I don't. But cleaning up old RPMs from past Fedora releases is its own subject and doesn't at all fit in the margins of this entry.

Updating Ubuntu packages that you have local changes for with dgit

By: cks

Suppose, not entirely hypothetically, that you've made local changes to an Ubuntu package using dgit and now Ubuntu has come out with an update to that package that you want to switch to, with your local changes still on top. Back when I wrote about moving local changes to a new Ubuntu release with dgit, I wrote an appendix with a theory of how to do this, based on a conversation. Now that I've actually done this, I've discovered that there is a minor variation and I'm going to write it down explicitly (with additional notes because I forgot some things between then and now).

I'll assume we're starting from an existing dgit based repository with a full setup of local changes, including an updated debian/changelog. Our first step, for safety, is to make a branch to capture the current state of our repository. I suggest you name this branch after the current upstream package version that you're on top of, for example if the current upstream version you're adding local changes to can be summarized as 'ubuntu2.6':

git branch cslab-2.6

Making a branch allows you to use 'git diff cslab-2.6..' later to see exactly what changed between your versions. A useful thing to do here is to exclude the 'debian/' directory from diffs, which can be done with 'git diff cslab-2.6.. -- . :!debian', although your shell may require you to quote the '!' (cf).

Then we need to use dgit to fetch the upstream updates:

dgit fetch -d ubuntu

We need to use '-d ubuntu', at least in current versions of dgit, or 'dgit fetch' gets confused and fails. At this point we have the updated upstream in the remote tracking branch 'dgit/dgit/jammy,-security,-updates' but our local tree is still not updated.

(All of dgit's remote tracking branches start with 'dgit/dgit/', while all of its local branches start with just 'dgit/'. This is less than optimal for my clarity.)

Normally you would now rebase to shift your local changes on top of the new upstream, but we don't want to immediately do that. The problem is that our top commit is our own dgit-based change to debian/changelog, and we don't want to rebase that commit; instead we'll make a new version of it after we rebase our real local changes. So our first step is to discard our top commit:

git reset --hard HEAD~

(In my original theory I didn't realize we had to drop this commit before the rebase, not after, because otherwise things get confused. At a minimum, you wind up with debian/changelog out of order, and I don't know if dropping your HEAD commit after the rebase works right. It's possible you might get debian/changelog rebase conflicts as well, so I feel dropping your debian/changelog change before the rebase is cleaner.)

Now we can rebase, for which the simpler two-argument form does work (but not plain rebasing, or at least I didn't bother testing plain rebasing):

git rebase dgit/dgit/jammy,-security,-updates dgit/jammy,-security,-updates

(If you are wondering how this command possibly works, as I was part way through writing this entry, note that the first branch is 'dgit/dgit/...', ie our remote tracking branch, and the second branch is 'dgit/...', our local branch with our changes on it.)

At this point we should have all of our local changes stacked on top of the upstream changes, but no debian/changelog entry for them that will bump the package version. We create that with:

gbp dch --since dgit/dgit/jammy,-security,-updates --local .cslab. --ignore-branch --commit

Then we can build with 'dpkg-buildpackage -uc -b', and afterward do 'git clean -xdf; git reset --hard' to reset your tree back to its pristine state.

(My view is that while you can prepare a source package for your work if you want to, the 'source' artifact you really want to save is your dgit VCS repository. This will be (much) less bulky when you clean it up to get rid of all of the stuff (to be polite) that dpkg-buildpackage leaves behind.)

Canonical's Netplan is hard to deal with in automation

By: cks

Suppose, not entirely hypothetically, that you've traditionally used /etc/resolv.conf on your Ubuntu servers but you're considering switching to systemd-resolved, partly for fast failover if your normal primary DNS server is unavailable and partly because it feels increasingly dangerous not to, since resolved is the normal configuration and what software is likely to expect. One of the ways that resolv.conf is nice is that you can set the configuration by simply copying a single file that isn't used for anything else. On Ubuntu, this is unfortunately not the case for systemd-resolved.

Canonical expects you to operate all of your Ubuntu server networking through Canonical Netplan. In reality, Netplan will render things down to a systemd-networkd configuration, which has some important effects and creates some limitations. Part of that rendered networkd configuration is your DNS resolution settings, and the natural effect of this is that they have to be associated with some interface, because that's the resolved model of the world. As a result, Netplan attaches DNS server information to specific network interfaces in your Netplan configuration. To change DNS settings from a script, you must find the specific device name and then modify settings within it, and those settings are intermingled (in the same file) with settings you can't touch.

(Sometimes Netplan goes the other way, separating interface specific configuration out to a completely separate section.)

Netplan does not give you a way to do this; if anything, Netplan goes out of its way to not do so. For example, Netplan can dump its full or partial configuration, but it does so in YAML form with no option for JSON (which you could readily search through in a script with jq). However, if you want to modify the Netplan YAML without editing it by hand, 'netplan set' sometimes requires JSON as input. Lack of any good way to search or query Netplan's YAML matters because for things like DNS settings, you need to know the right interface name. Without support for this in Netplan, you wind up doing hacks to try to get the right interface name.
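One such hack is to load the YAML yourself and walk it. The sketch below assumes you've already parsed a Netplan file into ordinary Python dicts with whatever YAML library you have available (that part isn't shown), and that DNS servers live under each interface's 'nameservers' then 'addresses' keys. The function name and the sample data are made up.

```python
def interfaces_with_dns(config):
    """Return (section, interface) pairs that carry DNS server addresses."""
    found = []
    network = config.get("network") or {}
    for section in ("ethernets", "bonds", "bridges", "vlans"):
        for name, settings in (network.get(section) or {}).items():
            nameservers = (settings or {}).get("nameservers") or {}
            if nameservers.get("addresses"):
                found.append((section, name))
    return found


# Stand-in for an already-parsed Netplan file.
sample = {
    "network": {
        "version": 2,
        "ethernets": {
            "enp1s0": {
                "addresses": ["192.0.2.5/24"],
                "nameservers": {"addresses": ["192.0.2.1"]},
            },
            "enp2s0": {"dhcp4": True},
        },
    }
}

print(interfaces_with_dns(sample))
```

This only answers 'which interface has the DNS settings'; it does nothing about the harder problem of writing changes back.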

Netplan also doesn't provide you any good way to remove settings. The current Ubuntu 26.04 beta installer writes a Netplan configuration that locks your interfaces to specific MAC addresses:

  enp1s0:
    match:
      macaddress: "52:54:00:a5:d5:fb"
    [...]
    set-name: "enp1s0"

This is rather undesirable if you may someday swap network cards or transplant server disks from one chassis to another, so we would like to automatically take it out. Netplan provides no support for this; 'netplan set' can't be given a blank replacement, for example (and 'netplan set "network.ethernets.enp1s0.match={}"' doesn't do anything). If Netplan would give you all of the enp1s0 block in JSON format, maybe you could edit the JSON and replace the whole thing, but that's not available so far.

(For extra complication you also need to delete the set-name, which is only valid with a 'match:'.)
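If you're willing to rewrite the file yourself instead of going through 'netplan set', the edit itself is simple once the YAML has been parsed. This is a sketch under the same assumption of already-loaded Python dicts (loading and dumping the YAML isn't shown), with an invented function name. As I understand it, once 'match:' is gone Netplan falls back to treating the 'enp1s0' key as the interface name.

```python
def unlock_from_mac(config):
    """Drop 'match: macaddress' (and any orphaned set-name) in place.

    Returns the names of the ethernet entries that were changed.
    """
    changed = []
    ethernets = (config.get("network") or {}).get("ethernets") or {}
    for name, settings in ethernets.items():
        match = settings.get("match")
        if not match or "macaddress" not in match:
            continue
        del match["macaddress"]
        if not match:
            # Nothing else was being matched on, so remove the whole
            # block plus set-name, which is only valid with 'match:'.
            del settings["match"]
            settings.pop("set-name", None)
        changed.append(name)
    return changed


# Stand-in for the parsed installer-written configuration.
sample = {
    "network": {
        "version": 2,
        "ethernets": {
            "enp1s0": {
                "match": {"macaddress": "52:54:00:a5:d5:fb"},
                "dhcp4": True,
                "set-name": "enp1s0",
            }
        },
    }
}

changed = unlock_from_mac(sample)
print(changed, sample["network"]["ethernets"]["enp1s0"])
```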

Another effect of not being able to delete things in scripts is that you can't write scripts that move things out to a different Netplan .yaml file that has only your settings for what you care about. If you could reliably get the right interface name and you could delete DNS settings from the file the installer wrote, you could fairly readily create a '/etc/netplan/60-resolv.yaml' file that was something close to a drop-in /etc/resolv.conf. But as it is, you can't readily do that.

There are all sorts of modifications you might want to make through a script, such as automatically configuring a known set of VLANs to attach them to whatever the appropriate host interface is. Scripts are good for automation and they're also good for avoiding errors, especially if you're doing repetitive things with slight differences (such as setting up a dozen VLANs on your DHCP server). Netplan fights you almost all the way about doing anything like this.

My best guess is that all of Canonical's uses of Netplan either use internal tooling that reuses Netplan's (C) API or simply re-write Netplan files from scratch (based on, for example, cloud provider configuration information).

(To save other people the time, the netplan Python package on PyPI seems to be a third party package and was last updated in 2019. Which is a pity, because it theoretically has a quite useful command line tool.)

One bleakly amusing thing I've found out through using 'netplan set' on Ubuntu 26.04 is that the Ubuntu server installer and Netplan itself have slightly different views on how Netplan files should be written. The original installer version of the above didn't have the quotes around the strings; 'netplan set' added them.

(All of this would be better if there was a widely agreed on, generally shipped YAML equivalent of 'jq', or better yet something that could also modify YAML in place as well as query it in forms that were useful for automation. But the 'jq for YAML' ecosystem appears to be fragmented at best.)

Early notes on switching some libvirt-based virtual machines to UEFI

By: cks

I keep around a small collection of virtual machines so I don't have to drag out one of our spare physical servers to test things on. These virtual machines have historically used traditional MBR-based booting ('BIOS' in libvirt instead of 'UEFI'), partly because for a long time libvirt didn't support snapshots of UEFI based virtual machines and snapshots are very important for my use of these scratch virtual machines. However, I recently discovered that libvirt now can do snapshots of UEFI based virtual machines, and also all of our physical server installs are UEFI based, so in the past couple of days I've experimented with moving some of my Ubuntu scratch VMs from BIOS to UEFI.

As far as I know, virt-manager and virsh don't directly allow you to switch a virtual machine between BIOS and UEFI after it's been created, partly because the result is probably not going to boot (unless you deliberately set up the OS inside the VM with both an EFI boot and a BIOS MBR boot environment). Within virt-manager, you can only select BIOS or UEFI at setup time, so you have to destroy your virtual machine and recreate it. This works, but it's a bit annoying.

(On the other hand, if you've had some virtual machines sitting around for years and years, you might want to refresh all of their settings anyway.)

It's possible to change between BIOS and UEFI by directly editing the libvirt XML to transform the <os> node. You may want to remove any old snapshots first because I don't know what happens if you revert from a 'changed to UEFI' machine to a snapshot where your virtual machine was a BIOS one. In my view, the easiest way to get the necessary XML is to create (or recreate) another virtual machine with UEFI, and then dump and copy its XML with some minor alterations.

For me, on Fedora with the latest libvirt and company, the <os> XML of a BIOS booting machine is:

 <os>
   <type arch='x86_64' machine='pc-q35-6.1'>hvm</type>
 </os>

Here the 'machine=' is the machine type I picked, which I believe is the better of the two options virt-manager gives me.

My UEFI based machines look like this:

 <os firmware='efi'>
   <type arch='x86_64' machine='pc-q35-9.2'>hvm</type>
   <firmware>
     <feature enabled='yes' name='enrolled-keys'/>
     <feature enabled='yes' name='secure-boot'/>
   </firmware>
   <loader readonly='yes' secure='yes' type='pflash' format='qcow2'>/usr/share/edk2/ovmf/OVMF_CODE_4M.secboot.qcow2</loader>
   <nvram template='/usr/share/edk2/ovmf/OVMF_VARS_4M.secboot.qcow2' templateFormat='qcow2' format='qcow2'>/var/lib/libvirt/qemu/nvram/[machine name]_VARS.qcow2</nvram>
 </os>

Here the '[machine name]' bit is the libvirt name of my virtual machine, such as 'vmguest1'. This nvram file doesn't have to exist in advance; libvirt will create it the first time you start up the virtual machine. I believe it's used to provide snapshots of the UEFI variables and so on to go with snapshots of your physical disks and snapshots of the virtual machine configuration.

(This feature may have landed in libvirt 10.10.0, if I'm reading release notes correctly. Certainly reading the release notes suggests that I don't want to use anything before then with UEFI snapshots.)

Manually changing the XML on one of my scratch machines has worked fine to switch it from BIOS MBR to UEFI booting as far as I can tell, but I carefully cleared all of its disk state and removed all of its snapshots before I tried this. I suspect that I could switch it back to BIOS if I wanted to. Over time, I'll probably change over all of my as yet unchanged scratch virtual machines to UEFI through direct XML editing, because it's the less annoying approach for me. Now that I've looked this up, I'll probably do it through 'virsh edit ...' rather than virt-manager, because that way I get my real editor.
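The <os> transformation described above can also be scripted. The following is a minimal Python sketch, not a libvirt tool: the helper name 'bios_os_to_uefi' is made up for illustration, the firmware paths are copied from the Fedora XML shown earlier and will differ on other distributions, and the machine type is left alone. It only rewrites XML text; the result would still have to go back into libvirt through something like 'virsh edit'.

```python
import xml.etree.ElementTree as ET

# Paths copied from the Fedora example above; other distributions differ.
OVMF_CODE = "/usr/share/edk2/ovmf/OVMF_CODE_4M.secboot.qcow2"
OVMF_VARS = "/usr/share/edk2/ovmf/OVMF_VARS_4M.secboot.qcow2"


def bios_os_to_uefi(domain_xml: str, vm_name: str) -> str:
    """Return domain XML whose <os> node asks for UEFI with secure boot.

    Hypothetical helper: it mirrors the UEFI <os> layout shown above and
    does not change the machine type or anything outside <os>.
    """
    root = ET.fromstring(domain_xml)
    os_el = root.find("os")
    if os_el is None:
        raise ValueError("domain XML has no <os> element")
    type_el = os_el.find("type")
    if type_el is None:
        raise ValueError("<os> has no <type> element")

    os_el.set("firmware", "efi")
    # Drop any firmware-related children left over from a previous setup.
    for tag in ("firmware", "loader", "nvram"):
        for old in os_el.findall(tag):
            os_el.remove(old)

    firmware = ET.Element("firmware")
    for feature in ("enrolled-keys", "secure-boot"):
        ET.SubElement(firmware, "feature", enabled="yes", name=feature)

    loader = ET.Element(
        "loader", readonly="yes", secure="yes", type="pflash", format="qcow2"
    )
    loader.text = OVMF_CODE

    nvram = ET.Element(
        "nvram", template=OVMF_VARS, templateFormat="qcow2", format="qcow2"
    )
    nvram.text = f"/var/lib/libvirt/qemu/nvram/{vm_name}_VARS.qcow2"

    # Keep the same child order as the example: type, firmware, loader, nvram.
    position = list(os_el).index(type_el) + 1
    for offset, element in enumerate((firmware, loader, nvram)):
        os_el.insert(position + offset, element)
    return ET.tostring(root, encoding="unicode")
```

Any other children of <os> (a <boot> element, for example) are kept and end up after the new nodes.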

(This is the kind of entry I write for my future use because I don't want to have to re-derive this stuff.)

PS: Much of this comes from this question and answers.

Fedora's virt-manager started using external snapshots for me as of Fedora 41

By: cks

Today I made an unpleasant discovery about virt-manager on my (still) Fedora 42 machines that I shared on the Fediverse:

This is my face that Fedora virt-manager appears to have been defaulting to external snapshots for some time and SURPRISE, external snapshots can't be reverted by virsh. This is my face, especially as it seems to have completely screwed up even deleting snapshots on some virtual machines.

(I only discovered this today because today is the first time I tried to touch such a snapshot, either to revert to it or to clean it up. It's possible that there is some hidden default for what sort of snapshot to make and it's only been flipped for me.)

Neither virt-manager nor virsh will clearly tell you about this. In virt-manager you need to click on each snapshot and if it says 'external disk only', congratulations, you're in trouble. In virsh, 'virsh snapshot-list --external <vm>' will list external snapshots, and then 'virsh snapshot-list --tree <vm>' will tell you if they depend on any internal snapshots.

My largest problems came from virtual machines where I had earlier internal snapshots and then I took more snapshots, which became external snapshots from Fedora 41 onward. You definitely can't revert to an external snapshot in this situation, at least not with virsh or virt-manager, and the error messages I got were generic ones about not being able to revert external snapshots. I haven't tested reverting external snapshots for a VM with no internal ones.

(Not being able to revert to external snapshots is a long standing libvirt issue, but it's possible they now work if you only have external snapshots. Otherwise, Fedora 41 and Fedora 42 defaulting to external snapshots is extremely hard to understand (to be polite).)

Update: you can revert an external snapshot in the latest libvirt if all of your snapshots are external. You can't revert them if libvirt helpfully gave you external snapshots on top of internal ones by switching the default type of snapshots (probably in Fedora 41).
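The 'external on top of internal' condition is mechanical enough to check for. Here is a small Python sketch under stated assumptions: the function name and its inputs are hypothetical, and it works on plain data rather than calling libvirt. The parent mapping and the set of external names would have to be collected separately, for instance from 'virsh snapshot-list --external --name <vm>' and 'virsh snapshot-parent <vm> <snapshot>'. It flags any external snapshot with an internal ancestor, which is somewhat broader than the direct-parent case that libvirt's error message names.

```python
from typing import Dict, List, Optional, Set


def external_on_internal(
    parents: Dict[str, Optional[str]], external: Set[str]
) -> List[str]:
    """List external snapshots that have an internal snapshot as an ancestor.

    parents maps each snapshot name to its parent's name (None for a root);
    external is the set of snapshot names that are external snapshots.
    Anything not in external is treated as internal.
    """
    tangled = []
    for name in sorted(external):
        ancestor = parents.get(name)
        seen = set()  # guards against a malformed, cyclic parent mapping
        while ancestor is not None and ancestor not in seen:
            if ancestor not in external:
                tangled.append(name)
                break
            seen.add(ancestor)
            ancestor = parents.get(ancestor)
    return tangled
```

An empty result means the virtual machine has no external snapshot sitting on top of an internal one, which is the situation where reverting is reported to work.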

If you have an external snapshot that you need to revert to, all I can do is point to a libvirt wiki page on the topic (although it may be outdated by now) along with libvirt's documentation on its snapshot XML. I suspect that there is going to be suffering involved. I haven't tried to do this; when it came up today I could afford to throw away the external snapshot.

If you have internal snapshots and you're willing to throw away the external snapshot and what's built on it, you can use virsh or virt-manager to revert to an internal snapshot and then delete the external snapshot. This leaves the external snapshot's additional disk file or files dangling around for you to delete by hand.

If you have only an external snapshot, it appears that libvirt will let you delete the snapshot through 'virsh snapshot-delete <vm> <external-snapshot>', which preserves the current state of the machine's disks. This only helps if you don't want the snapshot any more, but this is one of my common cases (where I take precautionary snapshots before significant operations and then get rid of them later when I'm satisfied, or at least committed).

The worst situation appears to be if you have an external snapshot made after (and thus on top of) an earlier internal snapshot and you want to keep the live state of things while getting rid of the snapshots. As far as I can tell, it's impossible to do this through libvirt, although some of the documentation suggests that you should be able to. The process outlined in libvirt's Merging disk image chains didn't work for me (see also Disk image chains).

(If it worked, this operation would implicitly invalidate the snapshots and I don't know how you get rid of them inside libvirt, since you can't delete them normally. I suspect that to get rid of them, you need to shut down all of the libvirt daemons and then delete the XML files that (on Fedora) you'll find in /var/lib/libvirt/qemu/snapshot/<domain>.)

One reason to delete external snapshots you don't need is if you ever want to be able to easily revert snapshots in the future. I wouldn't trust making internal snapshots on top of external ones, if libvirt even lets you, so if you want to be able to easily revert, it currently appears that you need to have and use only internal snapshots. Certainly you can't mix new external snapshots with old internal snapshots, as I've seen.

(The 5.1.0 virt-manager release will warn you to not mix snapshot modes and defaults to whatever snapshot mode you're already using. I don't know what it defaults to if you don't have any snapshots; I haven't tried that yet.)

Sidebar: Cleaning this up on the most tangled virtual machine

I've tried the latest preview releases of the libvirt stuff, but it doesn't make a difference in the most tangled situation I have:

$ virsh snapshot-delete hl-fedora-36 fedora41-preupgrade
error: Failed to delete snapshot fedora41-preupgrade
error: Operation not supported: deleting external snapshot that has internal snapshot as parent not supported

This VM has an internal snapshot as the parent because I didn't clean up the first snapshot (taken before a Fedora 41 upgrade) before making the second one (taken before a Fedora 42 upgrade).

In theory one can use 'virsh blockcommit' to reduce everything down to a single file, per the knowledge base section on this. In practice it doesn't work in this situation:

$ virsh blockcommit hl-fedora-36 vda --verbose --pivot --active
error: invalid argument: could not find base image in chain for 'vda'

(I tried with --base too and that didn't help.)

I was going to attribute this to the internal snapshot but then I tried 'virsh blockcommit' on another virtual machine with only an external snapshot and it failed too. So I have no idea how this is supposed to work.

Since I could take a ZFS snapshot of the entire disk storage, I chose violence, which is to say direct usage of qemu-img. First, I determined that I couldn't trivially delete the internal snapshot before I did anything else:

$ qemu-img snapshot -d fedora40-preupgrade fedora35.fedora41-preupgrade
qemu-img: Could not delete snapshot 'fedora40-preupgrade': snapshot not found

The internal snapshot is in the underlying file 'fedora35.qcow2'. Maybe I could have deleted it safely even with an external thing sitting on top of it, but I decided not to do that yet and proceed to the main show:

$ qemu-img commit -d fedora35.fedora41-preupgrade
Image committed.
$ rm fedora35.fedora41-preupgrade

Using 'qemu-img info fedora35.qcow2' showed that the internal snapshot was still there, so I removed it with 'qemu-img snapshot -d' (this time on fedora35.qcow2).

All of this left libvirt's XML drastically out of step with the underlying disk situation. So I made sure no libvirt services were running, removed the XML for the snapshots (after saving a copy), and manually edited the VM's XML, where it turned out that all I needed to change was the name of the disk file. This appears to have worked fine.

I suspect that I could have skipped manually removing the internal snapshot and its XML and libvirt would then have been happy to see it and remove it.

(I'm writing all of the commands and results down partly for my future reference.)

Cleaning old GPG RPM keys that your Fedora install is keeping around

By: cks

Approximately all RPM packages are signed by GPG keys (or maybe they're supposed to be called PGP keys), which your system stores in the RPM database as pseudo-packages (because why not). If your Fedora install has been around long enough, as mine have, you will have accumulated a drift of old keys and sometimes you either want to clean them up or something unfortunate will happen to one of those keys (I'll get to one case for it).

One basic command to see your collection of GPG keys in the RPM database is (taken from this gist):

rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n'

On some systems this will give you a nice short list of keys. On others, your list may be very long.

Since Fedora 42 (cf), DNF has functionality (I believe more or less built in) that should offer to remove old GPG keys that have actually expired. This is in the 'expired PGP keys plugin', which comes from the 'libdnf5-plugin-expired-pgp-keys' package if you don't have it installed (with a brief manpage that's called 'libdnf5-expired-pgp-keys'). I believe there was a similar DNF4 plugin. However, there are two situations where this seems to not work correctly.

The first situation is now-obsolete GPG keys that haven't expired yet, for various reasons; these may be for past versions of Fedora, for example. These days, the metadata for every DNF repository you use should list a URL for its GPG keys (see the various .repo files in /etc/yum.repos.d/ and look for the 'gpgkey=' lines). So one way to clean up obsolete keys is to fetch all of the current keys for all of your current repositories (or at least the enabled ones), and then remove anything you have that isn't in that list. This process is automated for you by the 'clean-rpm-gpg-pubkey' command and package, which is mentioned in some Fedora upgrade instructions. This will generally clean out most of your obsolete keys, although rare people will have keys that are so old that it chokes on them.
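To make the 'fetch the current keys, then remove everything else' idea concrete, here is a rough Python sketch of its two bookkeeping halves. This is not how clean-rpm-gpg-pubkey is implemented; both function names are invented for illustration. It assumes .repo files parse as INI (repositories without an 'enabled=' line count as enabled), and that the VERSION field of a 'gpg-pubkey-VERSION-RELEASE' pseudo-package is the short key ID, which matches names like the ones the rpm query above prints but may not hold for every RPM version. Actually fetching the keys and working out their IDs is left out.

```python
import configparser
from typing import List, Set


def enabled_gpgkey_urls(repo_text: str) -> List[str]:
    """Collect gpgkey= URLs from the enabled sections of one .repo file."""
    # interpolation=None keeps '%' and '$releasever' style text untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    urls = []
    for section in parser.sections():
        if not parser.getboolean(section, "enabled", fallback=True):
            continue
        value = parser.get(section, "gpgkey", fallback="")
        # A gpgkey line may list several URLs split by spaces, commas or newlines.
        urls.extend(value.replace(",", " ").split())
    return urls


def stale_key_packages(rpm_output: str, current_ids: Set[str]) -> List[str]:
    """Pick gpg-pubkey pseudo-packages whose short key ID is not current.

    rpm_output is the tab-separated output of the rpm query shown above;
    current_ids holds lowercase short key IDs that are still wanted.
    """
    stale = []
    for line in rpm_output.splitlines():
        package = line.split("\t", 1)[0].strip()
        parts = package.split("-")
        if len(parts) != 4 or parts[:2] != ["gpg", "pubkey"]:
            continue  # not a line in the expected format
        if parts[2].lower() not in current_ids:
            stale.append(package)
    return stale
```

Each name that 'stale_key_packages' returns is a candidate for 'rpm -e --allmatches', after a human has looked at the list.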

The second situation is apparently a repository operator who is sufficiently clever to have re-issued an expired key using the same key ID and fingerprint but a new expiry date in the future; this fools RPM and related tools and everything chokes. This is unfortunate, since it will often stall all DNF updates unless you disable the repo. One repository operator who has done this is Google, for their Fedora Chrome repository. To fix this you'll have to manually remove the relevant GPG key or keys. Once you've used clean-rpm-gpg-pubkey to reduce your list of GPG keys to a reasonable level, you can use the RPM command I showed above to list all your remaining keys, spot the likely key or keys (based on who owns it, for example), and then use 'rpm -e --allmatches gpg-pubkey-d38b4796-570c8cd3' (or some other appropriate gpg-pubkey name) to manually scrub out the GPG key. Doing a DNF operation such as installing or upgrading a package from the repository should then re-import the current key.

(This also means that it's theoretically harmless to overshoot and remove the wrong key, because it will be fetched back the next time you need it.)

(When I wrote my Fediverse post about discovering clean-rpm-gpg-pubkey, I apparently thought I would remember it without further prompting. This was wrong, and in fact I didn't even remember to use it when I upgraded my home desktop. This time it will hopefully stick, and if not, I have it written down here where it will probably be easier to find.)

UEFI-only booting with GRUB has gone okay on our (Ubuntu 24.04) servers

By: cks

We've been operating Ubuntu servers for a long time and for most of that time we've booted them through traditional MBR BIOS boots. Initially it was entirely through MBR and then later it was still mostly through MBR (somewhat depending on who installed a particular server; my co-workers are more tolerant of UEFI than I am). But when we built the 24.04 version of our customized install media, my co-worker wound up making it UEFI only, and so for the past two years all of our 24.04 machines have been UEFI (with us switching BIOSes on old servers into UEFI mode as we updated them). The headline news is that it's gone okay, more or less as you'd expect and hope by now.

All of our servers have mirrored system disks, and the one UEFI thing we haven't really had to deal with so far is fixing Ubuntu's UEFI boot disk redundancy stuff after one disk fails. I think we know how to do it in theory but we haven't had to go through it in practice. It will probably work out okay but it does make me a bit nervous, along with the related issue that the Ubuntu installer makes it hard to be consistent about which disk your '/boot/efi' filesystem comes from.

(In the installer, /boot/efi winds up on the first disk that you set as the boot device, but the disks aren't always presented in order so you can do this on 'the first disk' in the installer and discover that the first disk it listed was /dev/sdb.)

The Ubuntu 24.04 default bootloader is GRUB, so that's what we've wound up with even though as a UEFI-only environment we could in theory use simpler ones, such as systemd-boot. I'm not particularly enthused about GRUB but in practice it does what we want, which is to reliably boot our servers, and it has the huge benefit that it's actively supported by Ubuntu (okay, Canonical) so they're going to make sure it works right, including with their UEFI disk redundancy stuff. If Ubuntu switches default UEFI bootloaders in their server installs, I expect we'll follow along.

(I don't know if Canonical has any plans to switch away from GRUB to something else. I suspect that they'll stick with GRUB for as long as they support MBR booting, which I suspect will be a while, especially as people look more and more likely to hold on to old hardware for much longer than normally expected.)

PS: One reason I'm writing this down is that I've been unenthused about UEFI for a long time, so I'm not sure I would have predicted our lack of troubles in advance. So I'm going to admit it, UEFI has been actually okay. And in its favour, UEFI has regularized some things that used to be pretty odd in the MBR BIOS era.

(I'm still not happy about the UEFI non-story around redundant system disks, but I've accepted that hacks like the Ubuntu approach are the best we're going to get. I don't know what distributions such as Fedora are doing here; my Fedora machines are MBR based and staying that way until the hardware gets replaced, which on current trends won't be any time soon.)

Installing Void Linux on ZFS with Hibernation Support

Introduction

FreeBSD continues to make strides in desktop support, but Linux still holds an advantage in hardware compatibility. After running openSUSE Tumbleweed on my mini PC for several months, I decided it was time to switch to a solution I could control more closely. Not because Tumbleweed doesn't work well - it works great! - but I prefer having direct control over what happens on my machine. And I want native ZFS, because I prefer it over btrfs and it allows me to manage snapshots, backups, and rollbacks just as I do on FreeBSD, using the same tools and procedures.

The choice of Void Linux comes from its BSD-like approach: modular and free of unnecessary complexity. This makes it an excellent solution for this type of setup.

ZFSBootMenu is an extremely powerful tool. It provides an experience similar to FreeBSD's boot loader and natively supports ZFS. I strongly recommend reading the documentation and exploring its features, as some of them - like the built-in SSH daemon - can be genuine lifesavers in recovery scenarios.

Prerequisites and Audience

This guide is not for absolute beginners. If you're new to Linux or Unix-like operating systems, you'd be better served by a ready-to-use distribution like openSUSE Leap (or Tumbleweed for a rolling distribution), Linux Mint, Debian, Ubuntu, or Manjaro. The purpose of this article is to demonstrate a stable, upgradeable, and reasonably secure base setup for users already comfortable with system administration. It uses the glibc variant of Void Linux. The musl version requires different commands, for example for locale generation.

Use at your own risk.

This guide synthesizes instructions from several sources:

If your setup differs from what's described here (NVMe disk, UEFI boot, Secure Boot disabled), consult the linked guides for explanations and variations.

Installation Script (Optional)

If you want to reproduce this setup quickly, I maintain a script that automates the procedure described in this guide: disk partitioning, ZFS pool and dataset creation, encrypted swap for hibernation resume, dracut configuration, and ZFSBootMenu EFI setup. An optional KDE Plasma desktop installation is also supported.

The script is interactive and will ask for the required parameters (target disk, timezone and keymap, passphrases, desktop options). Requirements, usage instructions, and known limitations are documented in the repository README.

That said, I still recommend going through the manual process at least once. Understanding each step is part of the value of this setup, especially when troubleshooting or adapting it to different hardware.

Boot Environment

Since ZFS isn't supported by the base Void Linux image, we'll use hrmpf, an excellent rescue system based on Void Linux that includes ZFS support out of the box.

After booting, you can either proceed directly or SSH into the machine to continue remotely. I generally prefer SSH since it makes copy-paste operations much easier - especially when dealing with UUIDs and long commands. To enable SSH access, set a root password and allow root login:

passwd

Edit /etc/ssh/sshd_config and enable:

PermitRootLogin yes

Restart the SSH daemon:

sv restart sshd

Find the machine's IP address:

ip addr

You can now connect via SSH from another device.

Initial Setup

Set up the environment variables and generate a host ID - we need it for ZFS:

source /etc/os-release
export ID

zgenhostid -f 0x00bab10c

Disk Configuration

Identify your target disk and set up the partition variables. This approach keeps everything consistent and reduces errors:

# Set the base disk - adjust this to match your system
export DISK="/dev/nvme0n1"

# For NVMe disks, partitions are named like nvme0n1p1, nvme0n1p2, etc.
# For SATA/SAS disks (sda, sdb), partitions are named sda1, sda2, etc.
# Set the partition separator accordingly:
export PART_SEP="p"  # Use "p" for NVMe, empty string "" for SATA/SAS

# Define partition numbers
export BOOT_PART="1"
export SWAP_PART="2"
export POOL_PART="3"

# Build full device paths
export BOOT_DEVICE="${DISK}${PART_SEP}${BOOT_PART}"
export SWAP_DEVICE="${DISK}${PART_SEP}${SWAP_PART}"
export POOL_DEVICE="${DISK}${PART_SEP}${POOL_PART}"

Verify your configuration before proceeding:

echo "Boot device: $BOOT_DEVICE"
echo "Swap device: $SWAP_DEVICE"
echo "Pool device: $POOL_DEVICE"
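The separator rule described in the comments above can be written down as a tiny function. This is only an illustrative Python sketch with a made-up name; it assumes plain kernel device names such as /dev/nvme0n1 or /dev/sda, where a "p" is inserted exactly when the disk name itself ends in a digit. It does not handle /dev/disk/by-id style paths.

```python
def partition_path(disk: str, number: int) -> str:
    """Build a partition device path from a whole-disk path.

    A "p" goes before the partition number when the disk name ends in a
    digit (nvme0n1 -> nvme0n1p1); otherwise the number is appended
    directly (sda -> sda1).
    """
    separator = "p" if disk[-1].isdigit() else ""
    return f"{disk}{separator}{number}"


# Example: the three partitions used in this guide on an NVMe disk.
for label, number in (("Boot", 1), ("Swap", 2), ("Pool", 3)):
    print(f"{label} device: {partition_path('/dev/nvme0n1', number)}")
```

This is the same decision that PART_SEP encodes by hand; setting it explicitly, as the guide does, has the advantage that you look at the value before anything gets wiped.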

Wipe the Disk

Warning: This operation will irreversibly destroy all data on the selected disk. Double-check that you've selected the correct disk and be sure to have a complete backup of your system!

zpool labelclear -f "$DISK"

wipefs -a "$DISK"
sgdisk --zap-all "$DISK"

Create Partitions

EFI System Partition

If you're not using UEFI boot, adapt this procedure following the appropriate guide linked at the beginning of this post:

sgdisk -n "${BOOT_PART}:1m:+512m" -t "${BOOT_PART}:ef00" "$DISK"

Swap Partition

The swap partition should be slightly larger than your RAM to support hibernation. When you hibernate, the entire contents of RAM are written to swap, so you need enough space to hold it all plus some overhead. In this example, I have 16 GB of RAM, so I'm creating an 18 GB swap partition:

sgdisk -n "${SWAP_PART}:0:+18g" -t "${SWAP_PART}:8200" "$DISK"
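As a worked version of the sizing rule, here is a short Python sketch. The function names are hypothetical, and the 12.5% overhead is an assumption chosen only because it reproduces the 16 GB to 18 GB example above; pick whatever margin you are comfortable with.

```python
import math
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Extract MemTotal from /proc/meminfo style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1)) / (1024 * 1024)


def suggested_swap_gib(ram_gib: float, overhead: float = 0.125) -> int:
    """Round RAM plus a safety margin up to a whole number of GiB."""
    return math.ceil(ram_gib * (1 + overhead))
```

On a running system you could feed it the contents of /proc/meminfo, for example 'suggested_swap_gib(mem_total_gib(open("/proc/meminfo").read()))', and use the result in the sgdisk size argument.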

ZFS Pool Partition

sgdisk -n "${POOL_PART}:0:-10m" -t "${POOL_PART}:bf00" "$DISK"

Set Up ZFS Encryption

Encrypting the disk is strongly recommended, especially for laptops. Replace SomeKeyphrase with a strong passphrase that's easy to type. Keep in mind that during early boot, the keyboard layout might default to US, so choose a passphrase that's easy to type on a US keyboard layout:

echo 'SomeKeyphrase' > /etc/zfs/zroot.key
chmod 000 /etc/zfs/zroot.key

Create the ZFS Pool

Create the pool with conservative, well-tested options:

zpool create -f -o ashift=12 \
 -O compression=lz4 \
 -O acltype=posixacl \
 -O xattr=sa \
 -O relatime=on \
 -O encryption=aes-256-gcm \
 -O keylocation=file:///etc/zfs/zroot.key \
 -O keyformat=passphrase \
 -o autotrim=on \
 -o compatibility=openzfs-2.2-linux \
 -m none zroot "$POOL_DEVICE"

Create ZFS Datasets

zfs create -o mountpoint=none zroot/ROOT
zfs create -o mountpoint=/ -o canmount=noauto zroot/ROOT/${ID}
zfs create -o mountpoint=/home zroot/home

zpool set bootfs=zroot/ROOT/${ID} zroot

Export and Reimport for Installation

zpool export zroot
zpool import -N -R /mnt zroot
zfs load-key -L prompt zroot

zfs mount zroot/ROOT/${ID}
zfs mount zroot/home

udevadm trigger

Install the Base System

XBPS_ARCH=x86_64 xbps-install \
  -S -R https://mirrors.servercentral.com/voidlinux/current \
  -r /mnt base-system

Copy Host Configuration

Copy the files we generated earlier to the new system:

cp /etc/hostid /mnt/etc
mkdir -p /mnt/etc/zfs
cp /etc/zfs/zroot.key /mnt/etc/zfs

Configure Encrypted Swap

Now we'll set up the encrypted swap partition. This is where the hibernation magic happens - by using a separate LUKS-encrypted partition instead of a ZFS zvol, we can properly resume from hibernation.

Format the swap partition with LUKS:

cryptsetup luksFormat --type luks1 "$SWAP_DEVICE"

Open the encrypted partition, create the swap filesystem, and activate it:

cryptsetup luksOpen "$SWAP_DEVICE" cryptswap
mkswap /dev/mapper/cryptswap
swapon /dev/mapper/cryptswap

Preserve Variables for Chroot

Before entering the chroot, save the disk variables so they remain available inside the new environment:

cat << EOF > /mnt/root/disk-vars.sh
export DISK="$DISK"
export PART_SEP="$PART_SEP"
export BOOT_PART="$BOOT_PART"
export SWAP_PART="$SWAP_PART"
export POOL_PART="$POOL_PART"
export BOOT_DEVICE="$BOOT_DEVICE"
export SWAP_DEVICE="$SWAP_DEVICE"
export POOL_DEVICE="$POOL_DEVICE"
export ID="$ID"
EOF

Enter the Chroot Environment

xchroot /mnt

From this point forward, all commands are executed inside the new system.

First, load the saved variables:

source /root/disk-vars.sh

Configure fstab

Add the swap entry to /etc/fstab:

/dev/mapper/cryptswap   none            swap            defaults        0 0

Set Up Automatic Swap Unlock

To avoid entering the swap password separately after unlocking the ZFS pool, we'll create a keyfile stored on the encrypted ZFS dataset. This is secure because the keyfile only becomes accessible after the ZFS pool is unlocked.

First, install cryptsetup in the new system:

xbps-install -S cryptsetup

Generate a random keyfile and add it to the LUKS partition:

dd bs=1 count=64 if=/dev/urandom of=/boot/volume.key

cryptsetup luksAddKey "$SWAP_DEVICE" /boot/volume.key

chmod 000 /boot/volume.key
chmod -R g-rwx,o-rwx /boot

Add the keyfile to /etc/crypttab:

echo "cryptswap   $SWAP_DEVICE   /boot/volume.key   luks" >> /etc/crypttab

Include the keyfile and crypttab in the initramfs. Create /etc/dracut.conf.d/10-crypt.conf:

install_items+=" /boot/volume.key /etc/crypttab "

Basic System Configuration

Configure keyboard layout and hardware clock. Adjust the keymap and timezone to match your location:

cat << EOF >> /etc/rc.conf
KEYMAP="us"
HARDWARECLOCK="UTC"
EOF

ln -sf /usr/share/zoneinfo/Europe/Rome /etc/localtime

Configure locales:

cat << EOF >> /etc/default/libc-locales
en_US.UTF-8 UTF-8
en_US ISO-8859-1
EOF

echo "LANG=en_US.UTF-8" > /etc/locale.conf

xbps-reconfigure -f glibc-locales

Set the root password:

passwd

Configure ZFS Boot Support

cat << EOF > /etc/dracut.conf.d/zol.conf
nofsck="yes"
add_dracutmodules+=" zfs "
omit_dracutmodules+=" btrfs "
install_items+=" /etc/zfs/zroot.key "
EOF

Install ZFS:

xbps-install -S zfs

Configure ZFSBootMenu

Set the basic boot properties:

zfs set org.zfsbootmenu:commandline="quiet" zroot/ROOT
zfs set org.zfsbootmenu:keysource="zroot/ROOT/${ID}" zroot

The Critical Step: Hibernation Support

Now we need to configure hibernation resume. This is the key insight that makes this setup work: normally, the encrypted ZFS root mounts first, and then it unlocks the swap partition. But when resuming from hibernation, the kernel needs to read the hibernation image from swap before mounting the root filesystem - otherwise, the saved state would be lost.

To solve this, we tell ZFSBootMenu to unlock the swap partition early, before mounting ZFS, by specifying its LUKS UUID.

Get the UUID of your swap partition:

blkid "$SWAP_DEVICE"

You'll see output like:

/dev/...: UUID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" TYPE="crypto_LUKS" PARTUUID="..."

Store the UUID in a variable for the next step:

SWAP_UUID=$(blkid -s UUID -o value "$SWAP_DEVICE")
echo "Swap UUID: $SWAP_UUID"

Now set the boot parameters using the captured UUID:

zfs set org.zfsbootmenu:commandline="rd.luks.uuid=$SWAP_UUID resume=/dev/mapper/cryptswap" zroot/ROOT/${ID}

Set Up EFI Boot

Create and mount the EFI partition:

mkfs.vfat -F32 "$BOOT_DEVICE"

mkdir -p /boot/efi

Add the EFI partition to /etc/fstab using its UUID:

BOOT_UUID=$(blkid -s UUID -o value "$BOOT_DEVICE")
echo "UUID=$BOOT_UUID    /boot/efi    vfat    defaults    0 0" >> /etc/fstab

Mount it:

mount /boot/efi

Install ZFSBootMenu

xbps-install -S curl

mkdir -p /boot/efi/EFI/ZBM
curl -o /boot/efi/EFI/ZBM/VMLINUZ.EFI -L https://get.zfsbootmenu.org/efi
cp /boot/efi/EFI/ZBM/VMLINUZ.EFI /boot/efi/EFI/ZBM/VMLINUZ-BACKUP.EFI

Configure the EFI boot entries:

xbps-install -S efibootmgr

efibootmgr -c -d "$DISK" -p "$BOOT_PART" \
  -L "ZFSBootMenu (Backup)" \
  -l '\EFI\ZBM\VMLINUZ-BACKUP.EFI'

efibootmgr -c -d "$DISK" -p "$BOOT_PART" \
  -L "ZFSBootMenu" \
  -l '\EFI\ZBM\VMLINUZ.EFI'

Microcode updates

Void Linux is modular, so you may need to install additional packages for your specific hardware. For example, the Intel microcode requires the non-free repository:

# For Intel CPUs
xbps-install -S void-repo-nonfree 
xbps-install -S intel-ucode

# For AMD CPUs/GPUs
xbps-install -S linux-firmware-amd

After installing microcode updates, regenerate the boot images:

xbps-reconfigure -fa

Desktop Installation (Optional)

If all you need is a minimal system or a server, you're done and ready to reboot. For a complete desktop environment, continue with the following steps.

Install Core Desktop Packages

xbps-install -S vim nano dbus elogind polkit xorg xorg-fonts xorg-video-drivers xorg-input-drivers dejavu-fonts-ttf terminus-font NetworkManager pipewire alsa-pipewire wireplumber xdg-user-dirs unzip gzip xz 7zip

Install KDE Plasma

xbps-install -S kde-plasma dolphin konsole firefox kdegraphics-thumbnailers ffmpegthumbs vlc ark kwrite discover kf6-purpose

Enable Services

ln -s /etc/sv/NetworkManager /etc/runit/runsvdir/default/
ln -s /etc/sv/dbus /etc/runit/runsvdir/default/
ln -s /etc/sv/udevd /etc/runit/runsvdir/default/
ln -s /etc/sv/polkitd /etc/runit/runsvdir/default/
ln -s /etc/sv/sddm /etc/runit/runsvdir/default/

Configure PipeWire Audio

mkdir -p /etc/xdg/autostart
ln -sf /usr/share/applications/pipewire.desktop /etc/xdg/autostart/

mkdir -p /etc/pipewire/pipewire.conf.d
ln -sf /usr/share/examples/wireplumber/10-wireplumber.conf /etc/pipewire/pipewire.conf.d/
ln -sf /usr/share/examples/pipewire/20-pipewire-pulse.conf /etc/pipewire/pipewire.conf.d/

mkdir -p /etc/alsa/conf.d
ln -sf /usr/share/alsa/alsa.conf.d/50-pipewire.conf /etc/alsa/conf.d
ln -sf /usr/share/alsa/alsa.conf.d/99-pipewire-default.conf /etc/alsa/conf.d

Enable Additional Repositories and Flatpak (Optional)

xbps-install -S void-repo-nonfree void-repo-multilib void-repo-multilib-nonfree

xbps-install -S flatpak
flatpak remote-add --if-not-exists flathub https://dl.flathub.org/repo/flathub.flatpakrepo

Create a Regular User and exit

For desktop use, create a non-root user with appropriate group memberships. Replace username with your desired username.

useradd -m username
passwd username
usermod -aG video,wheel,plugdev,kvm,audio,network username
exit

Fix for NetworkManager

xchroot bind mounts /etc/resolv.conf and leaves an empty file behind, which NetworkManager won't like. So let's clean it up:

umount -l /mnt/etc/resolv.conf 2>/dev/null || true

rm -f /mnt/etc/resolv.conf
ln -s /run/NetworkManager/resolv.conf /mnt/etc/resolv.conf

Exit and Reboot

umount -n -R /mnt
zpool export zroot
reboot

Post-Installation

If everything went well, after entering your ZFS encryption password, you'll be greeted by the SDDM login screen.

Testing Hibernation

To verify that hibernation works correctly, you can click the "Hibernate" button or run:

loginctl hibernate

The system should power off. When you turn it back on, ZFSBootMenu will prompt for the password, unlock the swap partition, detect the hibernation image, and resume your session exactly where you left off.

If resume fails, check that:

1. The LUKS UUID in the ZFS commandline property matches your swap partition
2. The swap partition is large enough for your RAM
3. The dracut configuration includes the crypttab and keyfile
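The first of these checks is easy to get wrong by eye, so here is a minimal Python sketch of it. The function name is invented, and it only inspects a command line string that you pass in (for instance the output of 'zfs get -H -o value org.zfsbootmenu:commandline zroot/ROOT/${ID}') against the UUID that blkid reports; it assumes the 'rd.luks.uuid=' and 'resume=' parameters are written exactly as in this guide.

```python
from typing import Dict, List


def resume_cmdline_problems(
    cmdline: str, swap_uuid: str, mapper_name: str = "cryptswap"
) -> List[str]:
    """Return human-readable problems found in a kernel command line."""
    options: Dict[str, List[str]] = {}
    for word in cmdline.split():
        key, _, value = word.partition("=")
        options.setdefault(key, []).append(value)

    problems = []
    luks_uuids = [value.lower() for value in options.get("rd.luks.uuid", [])]
    if swap_uuid.lower() not in luks_uuids:
        problems.append(f"rd.luks.uuid does not name swap UUID {swap_uuid}")
    resume_target = f"/dev/mapper/{mapper_name}"
    if resume_target not in options.get("resume", []):
        problems.append(f"resume= does not point at {resume_target}")
    return problems
```

An empty list means the command line names the right LUKS device and resume target; it says nothing about the other two checks.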

Conclusion

You now have a fully functional Void Linux system with native ZFS, full disk encryption, and working hibernation. The system is rolling, lightweight, and easy to maintain. Enjoy!

Restricting IP address access to specific ports in eBPF: a sketch

By: cks

The other day I covered how I think systemd's IPAddressAllow and IPAddressDeny restrictions work, which unfortunately lets you limit them to specific (local) ports only if you set up the sockets for those ports in a separate systemd.socket unit. Naturally this raises the question of whether there is a good, scalable way to restrict access to specific ports in eBPF that systemd (or other interested parties) could use. I think the answer is yes, so here is a sketch of how I think you'd do this.

Why we care about a 'scalable' way to do this is because systemd generates and installs its eBPF programs on the fly. Since tcpdump can do this sort of cross-port matching, we could write an eBPF program that did it directly. But such a program could get complex if we were matching a bunch of things, and that complexity might make it hard to generate on the fly (or at least make it complex enough that systemd and other programs didn't want to). So we'd like a way that still allows you to generate a simple eBPF program.

Systemd uses cgroup socket SKB eBPF programs, which attach to a cgroup and filter all network packets on ingress or egress. As far as I can understand from staring at code, these are implemented by extracting the IPv4 or IPv6 address of the other side from the SKB and then querying what eBPF calls a LPM (Longest Prefix Match) map. The normal way to use an LPM map is to use the CIDR prefix length and the start of the CIDR network as the key (for individual IPv4 addresses, the prefix length is 32), and then match against them, so this is what systemd's cgroup program does. This is a nicely scalable way to handle the problem; the eBPF program itself is basically constant, and you have a couple of eBPF maps (for the allow and deny sides) that systemd populates with the relevant information from IPAddressAllow and IPAddressDeny.

However, there's nothing in eBPF that requires the keys to be just CIDR prefixes plus IP addresses. A LPM map key has to start with a 32-bit prefix length, but the size of the rest of the key can vary. This means that we can make our keys be 16 bits longer and stick the port number in front of the IP address (and increase the CIDR prefix size appropriately). So to match packets to port 22 from 128.100.0.0/16, your key would be (u32) 32 for the prefix length then something like 0x00 0x16 0x80 0x64 0x00 0x00 (if I'm doing the math and understanding the structure right). When you query this LPM map, you put the appropriate port number in front of the IP address.
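The key arithmetic can be checked outside the kernel. The following is a minimal sketch in Python, not eBPF: port_lpm_key() and key_matches() are hypothetical helper names, only IPv4 is handled, and key_matches() merely imitates a prefix comparison against one entry rather than doing a real map lookup. It assumes the prefix length is stored as a native-endian u32, as in the kernel's struct bpf_lpm_trie_key.

```python
import ipaddress
import struct


def port_lpm_key(port, cidr):
    """Build a key: u32 prefix length, 2-byte port, then the IPv4 network."""
    net = ipaddress.IPv4Network(cidr)
    # All 16 bits of the port must match, so they count toward the prefix.
    prefixlen = 16 + net.prefixlen
    data = struct.pack("!H", port) + net.network_address.packed
    return struct.pack("=I", prefixlen) + data


def key_matches(key, port, addr):
    """Compare the first prefixlen bits of (port, addr) against one entry."""
    (prefixlen,) = struct.unpack("=I", key[:4])
    entry = int.from_bytes(key[4:], "big")
    query_bytes = struct.pack("!H", port) + ipaddress.IPv4Address(addr).packed
    query = int.from_bytes(query_bytes, "big")
    shift = 8 * len(key[4:]) - prefixlen
    return (entry >> shift) == (query >> shift)


key = port_lpm_key(22, "128.100.0.0/16")
print(key[4:].hex(" "))
print(key_matches(key, 22, "128.100.3.4"), key_matches(key, 80, "128.100.3.4"))
```

Under these assumptions the data part of the key comes out as the six bytes given above, and a query for a different port fails to match even when the source address is inside the network.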

This does mean that each separate port with a separate set of IP address restrictions needs its own set of map entries. If you wanted a set of ports to all have a common set of restrictions, you could use a normally structured LPM map and a second plain hash map where the keys are port numbers. Then you check the port and the IP address separately, rather than trying to combine them in one lookup. And there are more complex schemes if you need them.
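The two-lookup variant can be modelled the same way. This is only a sketch of the logic in ordinary Python, with a set standing in for the hash map of ports and a list of networks standing in for the LPM map; the names, ports, and networks are all made up for illustration.

```python
import ipaddress

# Hypothetical stand-ins for the two maps: a hash map keyed by port number
# and a normally structured LPM map of allowed networks.
RESTRICTED_PORTS = {22, 2222}
ALLOWED_NETS = [
    ipaddress.ip_network("128.100.0.0/16"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def allow_packet(port, src_addr):
    """Check the port first, then the address, as two separate lookups."""
    if port not in RESTRICTED_PORTS:
        return True  # this port has no IP restrictions at all
    addr = ipaddress.ip_address(src_addr)
    return any(addr in net for net in ALLOWED_NETS)


print(allow_packet(22, "128.100.1.1"), allow_packet(22, "10.0.0.1"))
```

Every port in the set shares the one list of networks, which is the point of this scheme.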

Which scheme you'd use depends on how you expect port based access restrictions to be used. Do you expect several different ports, each with its own set of IP access restrictions (or only one port)? Then my first scheme is only a minor change from systemd's current setup, and it's easy to extend it to general IP address controls as well (just use a port number of zero to mean 'this applies to all ports'). If you expect sets of ports to all use a common set of IP access controls, or several sets of ports with different restrictions for each set, then you might want a scheme with more maps.

(In theory you could write this eBPF program and set up these maps yourself, then use systemd resource control features to attach them to your .service unit. In practice, at that point you probably should write host firewall rules instead, it's likely to be simpler. But see this blog post and the related VCS repository, although that uses a more hard-coded approach.)

How I think systemd IP address restrictions on socket units works

By: cks

Among the systemd resource controls are IPAddressAllow= and IPAddressDeny=, which allow you to limit what IP addresses your systemd thing can interact with. This is implemented with eBPF. A limitation of these as applied to systemd .service units is that they restrict all traffic, both inbound connections and things your service initiates (like, say, DNS lookups), while you may want only a simple inbound connection filter. However, you can also set these on systemd.socket units. If you do, your IP address restrictions apply only to the socket (or sockets), not to the service unit that it starts. To quote the documentation:

Note that for socket-activated services, the IP access list configured on the socket unit applies to all sockets associated with it directly, but not to any sockets created by the ultimately activated services for it.

So if you have a systemd socket activated service, you can control who can access the socket without restricting who the service itself can talk to.

In general, systemd IP access controls are done through eBPF programs set up on cgroups. If you set up IP access controls on a socket, such as ssh.socket in Ubuntu 24.04, you do get such eBPF programs attached to the ssh.socket cgroup (and there is a ssh.socket cgroup, perhaps because of the eBPF programs):

# pwd
/sys/fs/cgroup/system.slice
# bpftool cgroup list ssh.socket
ID  AttachType      AttachFlags  Name
12  cgroup_inet_ingress   multi  sd_fw_ingress
11  cgroup_inet_egress    multi  sd_fw_egress

However, if you look there are no processes or threads in the ssh.socket cgroup, which is not really surprising but also means there is nothing there for these eBPF programs to apply to. And if you dump the eBPF program itself (with 'bpftool prog dump xlated id 12'), it doesn't really look like it checks for the port number.

What I think must be going on is that the eBPF filtering program is connected to the SSH socket itself. Since I can't find any relevant looking uses in the systemd code of the `SO_ATTACH_*' BPF related options from socket(7) (which would be used with setsockopt(2) to directly attach programs to a socket), I assume that what happens is that if you create or perhaps start using a socket within a cgroup, that socket gets tied to the cgroup and its eBPF programs, and this attachment stays when the socket is passed to another program in a different cgroup.

(I don't know if there's any way to see what eBPF programs are attached to a socket or a file descriptor for a socket.)

If this is what's going on, it unfortunately means that there's no way to extend this feature of socket units to get per-port IP access control in .service units. Systemd isn't writing special eBPF filter programs for socket units that only apply to those exact ports, which you could in theory reuse for a service unit; instead, it's arranging to connect (only) specific sockets to its general, broad IP access control eBPF programs. Programs that make their own listening sockets won't be doing anything to get eBPF programs attached to them (and only them), so we're out of luck.

(One could experiment with relocating programs between cgroups, with the initial cgroup in which the program creates its listening sockets restricted and the other not, but I will leave that up to interested parties.)

Systemd resource controls on user.slice and system.slice work fine

By: cks

We have a number of systems where we traditionally set strict overcommit handling, and for some time this has caused us some heartburn. Some years ago I speculated that we might want to use resource controls on user.slice or system.slice if they worked, and then recently in a comment here I speculated that this was the way to (relatively) safely limit memory use if it worked.

Well, it does (as far as I can tell, without deep testing). If you want to limit how much of the system's memory people who log in can use so that system services don't explode, you can set MemoryMin= on system.slice to guarantee some amount of memory to it and all things under it. Alternately, you can set MemoryMax= on user.slice, collectively limiting all user sessions to that amount of memory. In either case my view is that you might want to set MemorySwapMax= on user.slice so that user sessions don't spend all of their time swapping. Which one you set things on depends on which is easier and you trust more; my inclination is MemoryMax, although that means you need to dynamically size it depending on this machine's total memory.
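Sizing MemoryMax from the machine's total memory is simple arithmetic. Here is a minimal sketch, assuming you pick a fixed amount to reserve for everything outside user.slice; the function names and the 4 GiB reserve are illustrative assumptions, not tested recommendations.

```python
def memtotal_bytes(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("no MemTotal line found")


def user_slice_dropin(meminfo_text, reserve_bytes):
    """Build drop-in text limiting user.slice to total memory minus a reserve."""
    limit = memtotal_bytes(meminfo_text) - reserve_bytes
    if limit <= 0:
        raise ValueError("reserve is larger than total memory")
    return "[Slice]\nMemoryMax={}\n".format(limit)


# Example: a 16 GiB machine with 4 GiB kept back for the rest of the system.
sample = "MemTotal:       16777216 kB\n"
print(user_slice_dropin(sample, 4 * 1024**3), end="")
```

The output would go in a drop-in file for user.slice (for example under /etc/systemd/system/user.slice.d/, with a file name of your choice), followed by a 'systemctl daemon-reload'.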

(If you want to limit user memory use you'll need to make sure that things like user cron jobs are forced into user sessions, rather than running under cron.service in system.slice.)

Of course this is what you should expect, given systemd's documentation and the kernel documentation. On the other hand, the Linux kernel cgroup and memory system is sufficiently opaque and ever changing that I feel the need to verify that things actually do work (in our environment) as I expect them to. Sometimes there are surprises, or settings that nominally work but don't really affect things the way I expect.

This does raise the question of how much memory you want to reserve for the system. It would be nice if you could use systemd-cgtop to see how much memory your system.slice is currently using, but unfortunately the number it will show is potentially misleadingly high. This is because the memory attributed to any cgroup includes (much) more than program RAM usage. For example, on our systems it seems typical for system.slice to be using under a gigabyte of 'user' RAM but also several gigabytes of filesystem cache and other kernel memory. You probably want to allow for some of that in what memory you reserve for system.slice, but maybe not all of the current usage.
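One way to see the split is to read the cgroup's memory.stat file directly. This is a minimal sketch, assuming a cgroup v2 system where the 'anon', 'file', and 'slab' fields are present; the function names are invented here, and this is not the 'memdu' program.

```python
def parse_memory_stat(text):
    """Parse cgroup v2 memory.stat text into a dict of byte counts."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def summarize(text):
    """Pick out program ('anon') memory versus page cache and slab memory."""
    stats = parse_memory_stat(text)
    return {name: stats.get(name, 0) for name in ("anon", "file", "slab")}


# Made-up example figures; real data would come from a file such as
# /sys/fs/cgroup/system.slice/memory.stat
sample = "anon 800000000\nfile 3000000000\nslab 500000000\n"
for name, value in summarize(sample).items():
    print("{:5s} {:8.1f} MiB".format(name, value / 2**20))
```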

(You can get the current version of the 'memdu' program I use as memdu.py.)

Gnome, GSettings, gconf, and which one you want

By: cks

On the Fediverse a while back, I said:

Ah yes, GNOME, it is of course my mistake that I used gconf-editor instead of dconf-editor. But at least now Gnome-Terminal no longer intercepts F11, so I can possibly use g-t to enter F11 into serial consoles to get the attention of a BIOS. If everything works in UEFI land.

Gnome has had at least two settings systems, GSettings/dconf (also) and the older GConf. If you're using a modern Gnome program, especially a standard Gnome program like gnome-terminal, it will use GSettings and you will want to use dconf-editor to modify its settings outside of whatever Preferences dialogs it gives you (or doesn't give you). You can also use the gsettings or dconf programs from the command line.

(This can include Gnome-derived desktop environments like Cinnamon, which has updated to using GSettings.)

If the program you're using hasn't been updated to the latest things that Gnome is doing, for example Thunderbird (at least as of 2024), then it will still be using GConf. You need to edit its settings using gconf-editor or gconftool-2, or possibly you'll need to look at the GConf version of general Gnome settings. I don't know if there's anything in Gnome that synchronizes general Gnome GSettings settings into GConf settings for programs that haven't yet been updated.

(This is relevant for programs, like Thunderbird, that use general Gnome settings for things like 'how to open a particular sort of thing'. Although I think modern Gnome may not have very many settings for this because it always goes to the GTK GIO system, based on the Arch Wiki's page on Default Applications.)

Because I've made this mistake between gconf-editor and dconf-editor more than once, I've now created a personal gconf-editor cover script that prints an explanation of the situation when I run it without a special --really argument. Hopefully this will keep me sorted out the next time I run gconf-editor instead of dconf-editor.
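A cover script along these lines could look like the following. This is a hypothetical sketch rather than the actual script: the wording of the message, the function names, and the /usr/bin/gconf-editor path are all assumptions.

```python
import os
import sys

REAL_PROGRAM = "/usr/bin/gconf-editor"  # assumed location of the real program

EXPLANATION = (
    "gconf-editor edits the old GConf settings.\n"
    "Modern Gnome programs use GSettings; try dconf-editor or gsettings.\n"
    "Run this with --really to start the real gconf-editor anyway.\n"
)


def wants_real(argv):
    """True if the special --really argument was given."""
    return "--really" in argv[1:]


def real_argv(argv):
    """Arguments to pass on to the real program, minus --really."""
    return [REAL_PROGRAM] + [a for a in argv[1:] if a != "--really"]


def main(argv):
    if not wants_real(argv):
        sys.stderr.write(EXPLANATION)
        return 1
    os.execv(REAL_PROGRAM, real_argv(argv))  # does not return on success


# A real script would end with: sys.exit(main(sys.argv))
```

The script would have to be installed somewhere earlier on $PATH than the real gconf-editor.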

PS: Probably I want to use gsettings instead of dconf-editor and dconf as much as possible, since gsettings works through the GSettings layer and so apparently has more safety checks than dconf-editor and dconf do.

PPS: Don't ask me what the equivalents are for KDE. KDE settings are currently opaque to me.

Testing Linux memory limits is a bit of a pain

By: cks

For reasons outside of the scope of this entry, I want to test how various systemd memory resource limits work and interact with each other (which means that I'm really digging into cgroup v2 memory controls). When I started trying to do this, it turned out that I had no good test program (or programs), although I had some ones that gave me partial answers.

There are two complexities in memory usage testing programs in a cgroups environment. First, you may be able to allocate more memory than you can actually use, depending on your system's settings for strict overcommit. So it's not enough to see how much memory you can allocate using the mechanism of your choice (I tend to use mmap() rather than go through language allocators). After you've either determined how much memory you can allocate or allocated your target amount, you have to at least force the kernel to materialize your memory by writing something to every page of it. Since the kernel can probably swap out some amount of your memory, you may need to keep repeatedly reading all of it.

The second issue is that if you're not in strict overcommit (and sometimes even if you are), the kernel can let you allocate more memory than you can actually use and then, when you try to use it, hit you with the OOM killer. For my testing, I care about the actual usable amount of memory, not how much memory I can allocate, so I need to deal with this somehow (and this is where my current test programs are inadequate). Since the OOM killer can't be caught by a process (that's sort of the point), the simple approach is probably to have my test program progressively report on how much memory it's touched so far, so I can see how far it got before it was OOM-killed. A more complex approach would be to do the testing in a child process with progress reports back to the parent so it could try to narrow in on how much it could use rather than me guessing that I wanted progress reports every, say, 16 MBytes or 32 MBytes of memory touching.
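The simple approach might be sketched as follows. This is a minimal illustration, assuming Linux and Python's mmap module, with a made-up function name; it only touches each page once, so it doesn't cover repeatedly re-reading memory to fight swapping.

```python
import mmap


def touch_memory(total_bytes, report_every=16 * 1024 * 1024, report=None):
    """Map anonymous memory, write to every page, and report progress."""
    if report is None:
        # Flush so the last report is visible even if we get OOM-killed.
        report = lambda msg: print(msg, flush=True)
    page = mmap.PAGESIZE
    region = mmap.mmap(-1, total_bytes,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    touched = 0
    next_report = report_every
    try:
        for offset in range(0, total_bytes, page):
            region[offset] = 1  # forces the kernel to materialize this page
            touched = min(offset + page, total_bytes)
            if touched >= next_report:
                report("touched {} MiB".format(touched // 2**20))
                next_report += report_every
    finally:
        region.close()
    return touched


# Small demonstration: 64 MiB with a progress report every 16 MiB.
touch_memory(64 * 1024 * 1024)
```

Run inside a cgroup with a memory limit and given a size above that limit, the last progress line printed before the process dies shows roughly how much memory was actually usable.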

(Hopefully the OOM killer would only kill the child and not the parent, but with the OOM killer you can never be sure.)

I'm probably not the first person to have this sort of need, so I suspect that other people have written test programs and maybe even put them up somewhere. I don't expect to be able to find them in today's ambient Internet search noise, plus this is very close to the much more popular issue of testing your RAM memory.

(Will I put up my little test program when I hack it up? Probably not, it's too much work to do it properly, with actual documentation and so on. And these days I'm not very enthused about putting more repositories on Github, so I'd need to find some alternate place.)

Systemd and blocking connections to localhost, including via 'any'

By: cks

I recently discovered a surprising path to accessing localhost URLs and services, where instead of connecting to 127.0.0.1 or the IPv6 equivalent, you connected to 0.0.0.0 (or the IPv6 equivalent). In that entry I mentioned that I didn't know if systemd's IPAddressDeny would block this. I've now tested this, and the answer is that systemd's restrictions do block this. If you set 'IPAddressDeny=localhost', the service or whatever is blocked from the 0.0.0.0 variation as well (for both outbound and inbound connections). This is exactly the way it should be, so you might wonder why I was uncertain and felt I needed to test it.

There are a variety of ways at different levels that you might implement access controls on a process (or a group of processes) in Linux, for IP addresses or anything else. For example, you might create an eBPF program that filtered the system calls and system call arguments allowed and attach it to a process and all of its children using seccomp(2). Alternately, for filtering IP connections specifically, you might use a cgroup socket address eBPF program (also), which are among the cgroup program types that are available. Or perhaps you'd prefer to use a cgroup socket buffer program.

How a program such as systemd implements filtering has implications for what sort of things it has to consider and know about when doing the filtering. For example, if we reasonably conclude that the kernel will have mapped 0.0.0.0 to 127.0.0.1 by the time it invokes cgroup socket address eBPF programs, such a program doesn't need to have any special handling to block access to localhost by people using '0.0.0.0' as the target address to connect to. On the other hand, if you're filtering at the system call level, the kernel has almost certainly not done such mapping at the time it invokes you, so your connect() filter had better know that '0.0.0.0' is equivalent to 127.0.0.1 and it should block both.
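The extra knowledge a system call level filter needs is small but easy to forget. As a minimal sketch in Python (standing in for whatever language such a filter would really be written in, with a made-up function name), the check has to treat the unspecified 'any' address as equivalent to loopback:

```python
import ipaddress


def targets_localhost(addr):
    """True if connecting to addr would reach the local machine's loopback."""
    ip = ipaddress.ip_address(addr)
    # 0.0.0.0 and :: are the 'any' addresses; as a connect() target they
    # wind up at localhost, so a filter has to block them too.
    return ip.is_loopback or ip.is_unspecified


for target in ("127.0.0.1", "0.0.0.0", "::1", "::", "192.0.2.10"):
    print(target, targets_localhost(target))
```

A filter that runs after the kernel has already done this mapping could get away with just the loopback test.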

This diversity is why I felt I couldn't be completely sure about systemd's behavior without actually testing it. To be honest, I didn't know what the specific options were until I researched them for this entry. I knew systemd used eBPF for IPAddressDeny (because it mentions that in the manual page in passing), but I vaguely knew there are a lot of ways and places to use eBPF and I didn't know if systemd's way needed to know about 0.0.0.0 or if systemd did know.

Sidebar: What systemd uses

As I found out through use of 'bpftool cgroup list /sys/fs/cgroup/<relevant thing>' on a systemd service that I knew uses systemd IP address filtering, systemd uses cgroup socket buffer programs, and is presumably looking for good and bad IP addresses and netblocks in those programs. This unfortunately means that it would be hard for systemd to have different filtering for inbound connections as opposed to outgoing connections, because at the socket buffer level it's all packets.

(You'd have to go up a level to more complicated filters on socket address operations.)

Early Linux package manager history and patching upstream source releases

By: cks

One of the important roles of Linux system package managers like dpkg and RPM is providing a single interface to building programs from source even though the programs may use a wide assortment of build processes. One of the source building features that both dpkg and RPM included (I believe from the start) is patching the upstream source code, as well as providing additional files along with it. My impression is that today this is considered much less important in package managers, and some may make it at least somewhat awkward to patch the source release on the fly. Recently I realized that there may be a reason for this potential oddity in dpkg and RPM.

Both dpkg and RPM are very old (by Linux standards). As covered in Andrew Nesbitt's Package Manager Timeline, both date from the mid-1990s (dpkg in January 1994, RPM in September 1995). Linux itself was quite new at the time and the Unix world was still dominated by commercial Unixes (partly because the march of x86 PCs was only just starting). As a result, Linux was a minority target for a lot of general Unix free software (although obviously not for Linux specific software). I suspect that this was compounded by limitations in early Linux libc, where apparently it had some issues with standards (see eg this, also, also, also).

As a minority target, I suspect that Linux regularly had problems compiling upstream software, and for various reasons not all upstreams were interested in fixing (or changing) that (especially if it involved accepting patches to cope with a non standards compliant environment; one reply was to tell Linux to get standards compliant). This probably left early Linux distributions regularly patching software in order to make it build on (their) Linux, leading to first class support for patching upstream source code in early package managers.

(I don't know for sure because at that time I wasn't using Linux or x86 PCs, and I might have been vaguely in the incorrect 'Linux isn't Unix' camp. My first Linux came somewhat later.)

These days things have changed drastically. Linux is much more standards compliant and of course it's a major platform. Free software that works on non-Linux Unixes but doesn't build cleanly on Linux is a rarity, so it's much easier to imagine (or have) a package manager that is focused on building upstream source code unaltered and where patching is uncommon and not as easy (or trivial) as dpkg and RPM make it.

(You still need to be able to patch upstream releases to handle security patches and so on, since projects don't necessarily publish new releases for them. I believe some projects simply issue patches and tell you to apply them to their current release. And you may have to backport a patch yourself if you're sticking on an older release of the project that they no longer do patches for.)

Why Linux wound up with system package managers

By: cks

Yesterday I discussed the two sorts of program package managers, system package managers that manage the whole system and application package managers that mostly or entirely manage third party programs. Commercial Unix got application package managers in the very early 1990s, but Linux's first package managers were system package managers, in dpkg and RPM (or at least those seem to be the first Linux package managers).

The abstract way to describe why is to say that Linux distributions had to assemble a whole thing from separate pieces; the kernel came from one place, libc from another, coreutils from a third, and so on. The concrete version is to think about what problems you'd have without a package manager. Suppose that you assembled a directory tree of all of the source code of the kernel, libc, coreutils, GCC, and so on. Now you need to build all of these things (or rebuild, let's ignore bootstrapping for the moment).

Building everything is complicated partly because everything goes about it differently. The kernel has its own configuration and build system, a variety of things use autoconf but not necessarily with the same set of options to control things like features, GCC has a multi-stage build process, Perl has its own configuration and bootstrapping process, X is frankly weird and vaguely terrifying, and so on. Then not everyone uses 'make install' to actually install their software, so you have another set of variations for all of this.

(The less said about the build processes for either TeX or GNU Emacs in the early to mid 1990s, the better.)

If you do this at any scale, you need to keep track of all of this information (cf) and you want a uniform interface for 'turn this piece into a compiled and ready to unpack blob'. That is, you want a source package (which encapsulates all of the 'how to do it' knowledge) and a command that takes a source package and does a build with it. Once you're building things that you can turn into blobs, it's simpler to always ship a new version of the blob whenever you change anything.

(You want the 'install' part of 'build and install' to result in a blob rather than directly installing things on your running system because until it finishes, you're not entirely sure the build and install has fully worked. Also, this gives you an easy way to split the overall system up into multiple pieces, some of which people don't have to install. And in the very early days, to split them across multiple floppy disks, as SLS did.)

Now you almost have a system package manager with source packages and binary packages. You're building all of the pieces of your Linux distribution in a standard way from something that looks a lot like source packages, and you pretty much want to create binary blobs from them rather than dump everything into a filesystem. People will obviously want a command that takes a binary blob and 'installs' it by unpacking it on their system (and possibly extra stuff), rather than having to run 'tar whatever' all the time themselves, and they'll also want to automatically keep track of which of your packages they've installed rather than having to keep their own records. Now you have all of the essential parts of a system package manager.

(Both dpkg and RPM also keep track of which package installed what files, which is important for upgrading and removing packages, along with things having versions.)

Systemd-networkd and giving your virtual devices alternate names

By: cks

Recently I wrote about how Linux network interface names have a length limit, of 15 characters. You can work around this limit by giving network interfaces an 'altname' property, as exposed in (for example) 'ip link'. While you can't work around this at all in Canonical's Netplan, it looks like you can have this for your VLANs in systemd-networkd, since there's AlternativeName= in the systemd.link manual page.

Except, if you look at an actual VLAN configuration as materialized by Netplan (or written out by hand), you'll discover a problem. Your VLANs don't normally have .link files, only .netdev and .network files (and even your normal Ethernet links may not have .link files). The AlternativeName= setting is only valid in .link files, because networkd is like that.

(AlternativeName= is a '[Link]' section setting and .network files also have a '[Link]' section, but they allow completely different sets of '[Link]' settings. The .netdev file, which is where you define virtual interfaces, doesn't have a '[Link]' section at all, although settings like AlternativeName= apply to them just as much as to regular devices. Alternately, .netdev files could support setting altnames for virtual devices in the '[NetDev]' section alongside the mandatory 'Name=' setting.)

You can work around this indirectly, because you can create a .link file for a virtual network device and have it work:

[Match]
Type=vlan
OriginalName=vlan22-mlab

[Link]
AlternativeNamesPolicy=
AlternativeName=vlan22-matterlab

Networkd does the right thing here even though 'vlan22-mlab' doesn't exist when it starts up; when vlan22-mlab comes into existence, it matches the .link file and has the altname stapled on.

Given how awkward this is (and that not everything accepts or sees altnames), I think it's probably not worth bothering with unless you have a very compelling reason to give an altname to a virtual interface. In my case, this is clearly too much work simply to give a VLAN interface its 'proper' name.

Since I tested, I can also say that this works on a Netplan-based Ubuntu server where the underlying VLAN is specified in Netplan. You have to hand write the .link file and stick it in /etc/systemd/network, but after that it cooperates reasonably well with a Netplan VLAN setup.

Linux network interface names have a length limit, and Netplan

By: cks

Over on the Fediverse, I shared a discovery:

This is my (sad) face that Linux interfaces have a maximum name length. What do you mean I can't call this VLAN interface 'vlan22-matterlab'?

Also, this is my annoyed face that Canonical Netplan doesn't check or report this problem/restriction. Instead your VLAN interface just doesn't get created, and you have to go look at system logs to find systemd-networkd telling you about it.

(This is my face about Netplan in general, of course. The sooner it gets yeeted the better.)

Based on both some Internet searches and looking at kernel headers, I believe the limit is 15 characters for the primary name of an interface. In headers, you will find this called IFNAMSIZ (the kernel) or IF_NAMESIZE (glibc), and it's defined to be 16 but that includes the trailing zero byte for C strings.

(I can be confident that the limit is 15, not 16, because 'vlan22-matterlab' is exactly 16 characters long without a trailing zero byte. Take one character off and it works.)
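Since some tools won't check this, it's easy to check yourself before writing a name into a configuration file. A minimal sketch, assuming only the length rule described above (the kernel has other restrictions on interface names that this ignores); the function name is made up:

```python
IFNAMSIZ = 16  # from the kernel headers; includes the trailing zero byte


def valid_ifname_length(name):
    """True if name fits in IFNAMSIZ bytes with room for the zero byte."""
    length = len(name.encode())
    return 0 < length <= IFNAMSIZ - 1


for name in ("vlan22-matterlab", "vlan22-matterla", "vlan22-mlab"):
    print(name, len(name), valid_ifname_length(name))
```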

At the level of ip commands, the error message you get is on the unhelpful side:

# ip link add dev vlan22-matterlab type wireguard
Error: Attribute failed policy validation.

(I picked the type for illustration purposes.)

Systemd-networkd gives you a much better error message:

/run/systemd/network/10-netplan-vlan22-matterlab.netdev:2: Interface name is not valid or too long, ignoring assignment: vlan22-matterlab

(Then you get some additional errors because there's no name.)

As mentioned in my Fediverse post, Netplan tells you nothing. One direct consequence of this is that in any context where you're writing down your own network interface names, such as VLANs or WireGuard interfaces, simply having 'netplan try' or 'netplan apply' succeed without errors does not mean that your configuration actually works. You'll need to look at error logs and perhaps inventory all your network devices.

(This isn't the first time I've seen Netplan behave this way, and it remains just as dangerous.)

As covered in the ip link manual page, network interfaces can have either or both of aliases and 'altname' properties. These alternate names can be (much) longer than 16 characters, and the 'ip link property' altname property can be used in various contexts to make things convenient (I'm not sure what good aliases are, though). However this is somewhat irrelevant for people using Netplan, because the current Netplan YAML doesn't allow you to set interface altnames.

You can set altnames in networkd .link files, as covered in the systemd.link manual page. The direct thing you want is AlternativeName=, but apparently you may also want to set a blank alternative names policy, AlternativeNamesPolicy=. Of course this probably only helps if you're using systemd-networkd directly, instead of through Netplan.

PS: Netplan itself has the notion of Ethernet interfaces having symbolic names, such as 'vlanif0', but this is purely internal to Netplan; it's not manifested as an actual interface altname in the 'rendered' systemd-networkd control files that Netplan writes out.

(Technically this applies to all physical device types.)

An annoyance in how Netplan requires you to specify VLANs

By: cks

Netplan is Canonical's more or less mandatory method of specifying networking on Ubuntu. Netplan has a collection of limitations and irritations, and recently I ran into a new one, which is how VLANs can and can't be specified. To explain this, I can start with the YAML configuration language. To quote the top level version, it looks like:

network:
  version: NUMBER
  renderer: STRING
  [...]
  ethernets: MAPPING
  [...]
  vlans: MAPPING
  [...]

To translate this, you specify VLANs separately from your Ethernet or other networking devices. On the one hand, this is nicely flexible. On the other hand it creates a problem, because here is what you have to write for VLAN properties:

network:
  vlans:
    vlan123:
      id: 123
      link: enp5s0
      addresses: <something>

Every VLAN is on top of some networking device, and because VLANs are specified as a separate category of top level devices, you have to name the underlying device in every VLAN (which gets very annoying and old very fast if you have ten or twenty VLANs to specify). Did you decide to switch from a 1G network port to a 10G network port for the link with all of your VLANs on it? Congratulations, you get to go through every 'vlans:' entry and change its 'link:' value. We hope you don't overlook one.

(Or perhaps you had to move the system disks from one model of 1U server to another model of 1U server because the hardware failed. Or you would just like to write generic install instructions with a generic block of YAML that people can insert directly.)
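One way to dodge the repetition without changing Netplan is to generate the 'vlans:' section from a single link name. This is a minimal sketch with a hypothetical helper, not an existing tool, and it leaves out per-VLAN settings such as 'addresses:'.

```python
def netplan_vlans(link, vlan_ids):
    """Return Netplan YAML text for VLANs that all sit on one link device."""
    lines = ["network:", "  version: 2", "  vlans:"]
    for vid in vlan_ids:
        lines += [
            "    vlan{}:".format(vid),
            "      id: {}".format(vid),
            "      link: {}".format(link),
        ]
    return "\n".join(lines) + "\n"


# Changing the underlying device is now a one-word change.
print(netplan_vlans("enp5s0", [123, 124]), end="")
```

The output could be written to its own file in /etc/netplan, at the cost of having a generation step to remember.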

The best way for Netplan to deal with this would be to allow you to also specify VLANs as part of other devices, especially Ethernet devices. Then you could write:

network:
  ethernets:
    enp5s0:
      vlans:
        vlan123:
          id: 123
          addresses: <something>

Every VLAN specified in enp5s0's configuration would implicitly use enp5s0 as its underlying link device, and you could rename all of them trivially. This also matches how I think most people think of and deal with VLANs, which is that (obviously) they're tied to some underlying device, and you want to think of them as 'children' of the other device.

(You can have an approach to VLANs where they're more free-floating and the interface that delivers any specific VLAN to your server can change, for load balancing or whatever. But you could still do this, since Netplan will need to keep supporting the separate 'vlans:' section.)

If you want to work around this today, you have to go for the far less convenient approach of artificial network names.

network:
  ethernets:
    vlanif0:
      match:
        name: enp5s0

  vlans:
    vlan123:
      id: 123
      link: vlanif0
      addresses: <something>

This way you only need to change one thing if your VLAN network interface changes, but at the cost of using a non-standard way of setting up the base interface. (Yes, Netplan accepts it, but it's not how the Ubuntu installer will create your netplan files and who knows what other Canonical tools will have a problem with it as a result.)

We have one future Ubuntu server where we're going to need to set up a lot of VLANs on one underlying physical interface. I'm not sure which option we're going to pick, but the 'vlanif0' option is certainly tempting. If nothing else, it probably means we can put all of the VLANs into a separate, generic Netplan file.
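
As a rough sketch of what generating that separate, generic file could look like, here is a small shell function that prints a 'vlans:' section for a list of VLAN IDs. The 'vlanif0' ID is the one from the workaround above; the 'gen_vlans' name is a hypothetical helper, and the output leaves out addresses and everything else a real file would need, so treat it as a starting point rather than a finished configuration.

```shell
#!/bin/sh
# Sketch: print a Netplan 'vlans:' section for several VLAN IDs that all
# share one underlying link. 'gen_vlans' is a hypothetical helper name.
# Addresses are deliberately left out; add them per VLAN as needed.
gen_vlans() {
    link=$1
    shift
    printf 'network:\n  vlans:\n'
    for id in "$@"; do
        printf '    vlan%s:\n      id: %s\n      link: %s\n' "$id" "$id" "$link"
    done
}

# Example: three VLANs on top of the 'vlanif0' ID from the workaround.
gen_vlans vlanif0 123 124 200
```

Because every generated entry refers only to 'vlanif0', the one place that names the real interface is still the 'match:' block in the base file.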

Early experience with using Linux tc to fight bufferbloat latency

By: cks

Over on the Fediverse I mentioned something recently:

Current status: doing extremely "I don't know what I'm really doing, I'm copying from a website¹" things with Linux tc to see if I can improve my home Internet latency under load without doing too much damage to bandwidth or breaking my firewall rules. So far, it seems to work and things² claim to like the result.

¹ <documentation link>
² https://bufferbloat.libreqos.com/ via @davecb

What started this was running into a Fediverse post about the bufferbloat test, trying it, and discovering that (as expected) my home DSL link performed badly, with significantly increased latency during downloads, uploads, or both. My memory is that reported figures went up to the area of 400 milliseconds.

Conveniently for me, my Linux home desktop is also my DSL router; it speaks PPPoE directly through my DSL modem. This means that doing traffic shaping on my Linux desktop should cover everything, without any need to wrestle with a limited router OS environment. And there were some more or less cut and paste directions on the site.

So my outbound configuration was simple and obviously not harmful:

tc qdisc add root dev ppp0 cake bandwidth 7.6Mbit

The bandwidth is a guess, although one informed by checking both my raw DSL line rate and what testing sites told me.

The inbound configuration was copied from the documentation and it's where I don't understand what I'm doing:

ip link add name ifb4ppp0 type ifb
tc qdisc add dev ppp0 handle ffff: ingress
tc qdisc add dev ifb4ppp0 root cake bandwidth 40Mbit besteffort
ip link set ifb4ppp0 up
tc filter add dev ppp0 parent ffff: matchall action mirred egress redirect dev ifb4ppp0

(This order follows the documentation.)

Here is what I understand about this. As covered in the tc manual page, traffic shaping and scheduling happens only on 'egress', which is to say for outbound traffic. To handle inbound traffic, we need a level of indirection through a special ifb (Intermediate Functional Block) device, which is apparently used only for our (inbound) tc qdisc.

So we have two pieces. The first is the actual traffic shaping on the IFB link, ifb4ppp0, and setting the link 'up' so that it will actually handle traffic instead of throwing it away. The second is that we have to push inbound traffic on ppp0 through ifb4ppp0 to get its traffic shaping. To do this we add a special 'ingress' qdisc to ppp0, which applies to inbound traffic, and then we use a tc filter that matches all (ingress) traffic and redirects it to ifb4ppp0 as 'egress' traffic. Since it's now egress traffic, the tc shaping on ifb4ppp0 will now apply to it and do things.

When I set this up I wasn't certain if it was going to break my non-trivial firewall rules on the ppp0 interface. However, everything seems to be fine, and the only thing the tc redirect is affecting is traffic shaping. My firewall blocks and NAT rules are still working.

Applying these tc rules definitely improved my latency scores on the test site; my link went from an F rating to an A rating (and a C rating for downloads and uploads happening at once). Does this improve my latency in practice for things like interactive SSH connections while downloads and uploads are happening? It's hard for me to tell, partly because I don't do such downloads and uploads very often, especially while I'm doing interactive stuff over SSH.

(Of course partly this is because I've sort of conditioned myself out of trying to do interactive SSH while other things are happening on my DSL link.)

The most I can say is that this probably improves things, and that since my DSL connection has drifted into having relatively bad latency to start with (by my standards), it probably helps to minimize how much worse it gets under load.

I do seem to get slightly less bandwidth for transfers than I did before; experimentation says that how much less can be fiddled with by adjusting the tc 'bandwidth' settings, although that also changes latency (more bandwidth creates worse latency). Given that I rarely do large downloads or uploads, I'm willing to trade off slightly lower bandwidth for (much) less of a latency hit. One reason that my bandwidth numbers are approximate anyway is that I'm not sure how much PPPoE DSL framing compensation I need.
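
Since the 'bandwidth' figure is something to experiment with, here is a minimal sketch of a helper for re-applying CAKE with a different number. It assumes the ppp0 and ifb4ppp0 setup from above; the 'run' and 'retune_cake' names and the example figures are illustrative, not measured recommendations. DRY_RUN is left set in the example so the block only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: re-apply CAKE with a different bandwidth figure.
# 'run' and 'retune_cake' are hypothetical helper names.

# Print the command if DRY_RUN is set; otherwise execute it (needs root).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# retune_cake DEVICE BANDWIDTH [extra cake options...]
# 'tc qdisc replace' swaps in a fresh root qdisc, so any options you
# still want (such as 'besteffort') must be repeated here.
retune_cake() {
    dev=$1
    bw=$2
    shift 2
    run tc qdisc replace dev "$dev" root cake bandwidth "$bw" "$@"
}

# Example (figures are made up): slightly lower both directions.
DRY_RUN=1
retune_cake ppp0 7.2Mbit
retune_cake ifb4ppp0 38Mbit besteffort
```

Extra CAKE options go after the bandwidth. tc-cake(8) documents an 'overhead' parameter and named presets for link framing, but which of them, if any, fits a particular PPPoE DSL line is not something this sketch can tell you.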

(The Arch wiki has a page on advanced traffic control that has some discussion of tc.)

Sidebar: A rewritten command order for ingress traffic

If my understanding is correct, we can rewrite the commands to set up inbound traffic shaping to be more clearly ordered:

# Create and enable ifb link
ip link add name ifb4ppp0 type ifb
ip link set ifb4ppp0 up

# Set CAKE with bandwidth limits for
# our actual shaping, on ifb link.
tc qdisc add dev ifb4ppp0 root cake bandwidth 40Mbit besteffort

# Wire ifb link (with tc shaping) to inbound
# ppp0 traffic.
tc qdisc add dev ppp0 handle ffff: ingress
tc filter add dev ppp0 parent ffff: matchall action mirred egress redirect dev ifb4ppp0

The 'ifb4ppp0' name is arbitrary but conventional, set up as 'ifb4<whatever>'.
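
For completeness, here is a sketch of how the setup above could be inspected and then undone. The helper names ('run', 'show_shaping', 'undo_shaping') are hypothetical, and it assumes the ifb device follows the 'ifb4<whatever>' convention. DRY_RUN is set in the example, so it only prints the commands.

```shell
#!/bin/sh
# Sketch: inspect and then remove the shaping set up above.
# 'run', 'show_shaping' and 'undo_shaping' are hypothetical helper names.

# Print the command if DRY_RUN is set; otherwise execute it (needs root).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show qdisc statistics for the real link and its ifb device.
show_shaping() {
    run tc -s qdisc show dev "$1"
    run tc -s qdisc show dev "ifb4$1"
}

# Remove the outbound CAKE qdisc, the ingress qdisc (which takes its
# filter with it), and finally the ifb link itself.
undo_shaping() {
    run tc qdisc del dev "$1" root
    run tc qdisc del dev "$1" ingress
    run ip link delete "ifb4$1"
}

DRY_RUN=1
show_shaping ppp0
undo_shaping ppp0
```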

Distribution source packages and whether or not to embed in the source code

By: cks

When I described my current ideal Linux source package format, I said that it should be embedded in the source code of the software being packaged. In a comment, bitprophet had a perfectly reasonable and good preference the other way:

Re: other points: all else equal I think I vaguely prefer the Arch "repo contains just the extras/instructions + a reference to the upstream source" approach as it's cleaner overall, and makes it easier to do "more often than it ought to be" cursed things like "apply some form of newer packaging instructions against an older upstream version" (or vice versa).

The Arch approach is isomorphic to the source RPM format, which has various extras and instructions plus a pre-downloaded set of upstream sources. It's not really isomorphic to the Debian source format because you don't normally work with the split up version; the split up version is just a package distribution thing (as dgit shows).

(I believe the Arch approach is also how the FreeBSD and OpenBSD ports trees work. Also, the source package format you work in is not necessarily how you bundle up and distribute source packages, again as shown by Debian.)

Let's call these two packaging options the inline approach (Debian) and the out of line approach (Arch, RPM). My view is that which one you want depends on what you want to do with software and packages. The out of line approach makes it easier to build unmodified packages, and as bitprophet comments it's easy to do weird build things. If you start from a standard template for the type of build and install the software uses, you can practically write the packaging instructions yourself. And the files you need to keep are quite compact (and if you want, it's relatively easy to put a bunch of them into a single VCS repository, each in its own subdirectory).

However, the out of line approach makes modifying upstream software much more difficult than a good version of the inline approach (such as, for example, dgit). To modify upstream software in the out of line approach you have to go through some process similar to what you'd do in the inline approach, and then turn your modifications into patches that your packaging instructions apply on top of the pristine upstream. Moving changes from version to version may be painful in various ways, and in addition to those nice compact out of line 'extras/instructions' package repos, you may want to keep around your full VCS work tree that you built the patches from.

(Out of line versus inline is a separate issue from whether or not the upstream source code should include packaging instructions in any form; I think that generally the upstream should not.)

As a system administrator, I'm biased toward easy modification of upstream packages and thus upstream source because that's most of why I need to build my own packages. However, these days I'm not sure if that's what a Linux distribution should be focusing on. This is especially true for 'rolling' distributions that mostly deal with security issues and bugs not by patching their own version of the software but by moving to a new upstream version that has the security fix or bug fix. If most of what a distribution packages is unmodified from the upstream version, optimizing for that in your (working) source package format is perfectly sensible.

A small suggestion in modern Linux: take screenshots (before upgrades)

By: cks

Mike Hoye recently wrote Powering Up, which is in part about helping people install (desktop) Linux, and the Fediverse thread version of it reminded me of something that I don't do enough of:

A related thing I've taken to doing before potential lurching changes (like Linux distribution upgrades) is to take screenshots and window images. Because comparing a now and then image is a heck of a lot easier than restoring backups, and I can look at it repeatedly as I fix things on the new setup.

Linux distributions and the software they package have a long history of deciding to change things for your own good. They will tinker with font choices, font sizes, default DPI determinations, the size of UI elements, and so on, not quite at the drop of a hat but definitely when you do something like upgrade your distribution and bring in a bunch of significant package version changes (and new programs to replace old programs).

Some people are perfectly okay with these changes. Other people, like me, are quite attached to the specifics of how their current desktop environment looks and will notice and be unhappy about even relatively small changes. However, because we're fallible humans, people like me can't always recognize exactly what changed and remember exactly what the old version looked like (these two are related); instead, sometimes all we have is the sense that something changed but we're not quite sure exactly what or exactly how.

Screenshots and window images are the fix for that unspecific feeling. Has something changed? You can call up an old screenshot to check, and to examine what (and then maybe work out how to reverse it, or decide to live with the change). Screenshots aren't perfect; for example, they won't necessarily tell you what the old fonts were called or what sizes were being used. But they're a lot better than trying to rely on memory or other options.

It would probably also do me good to get into the habit of taking screenshots periodically, even outside of distribution upgrades. Looking back over time every so often is potentially useful to see more subtle, more long term changes, and perhaps ask myself either why I'm not doing something any more or why I'm still doing it.

(Currently I'm somewhat lackadaisical about taking screenshots even before distribution upgrades. I have a distribution upgrade process but I haven't made screenshots part of it, and I don't have an explicit checklist for the process. Which I definitely should create. Possibly I should also try to capture font information in text form, to the extent that I can find it.)

My ideal Linux source package format (at the moment)

By: cks

I've written recently on why source packages are complicated and why packages should be declarative (in contrast to Arch style shell scripts), but I haven't said anything about what I'd like in a source package format, which will mostly be from the perspective of a system administrator who sometimes needs to modify upstream packages or package things myself.

A source package format is a compromise. After my recent experiences with dgit, I now feel that the best option is that a source package is a VCS repository directory tree (Git by default) with special control files in a subdirectory. Normally this will be the upstream VCS repository with packaging control files and any local changes merged in as VCS commits. You perform normal builds in this checked out repository, which has the advantage of convenience and the disadvantage that you have to clean up the result, possibly with liberal use of 'git clean' and 'git reset'. Hermetic builds are done by some tool that copies the checked out files to a build area, or clones the repository, or some other option. If a binary package is built in an environment where this information is available, its metadata should include the exact current VCS commit it was built from, and I would make binary packages not build if there were uncommitted changes.

(Making the native source package a VCS tree with all of the source code makes it easy to work on but mingles package control files with the program source. In today's environment with good distributed VCSes I think this is the right tradeoff.)

The control files should be as declarative as possible, and they should directly express major package metadata such as version numbers (unlike the Debian package format, where the version number is derived from debian/changelog). There should be a changelog but it should be relatively free-form, like RPM changelogs. Changelogs are especially useful for local modifications because they go along with the installed binary package, which means that you can get an answer to 'what did we change in this locally modified package' without having to find your source. The main metadata file that controls everything should be kept simple; I would go as far as to say it should have a format that doesn't allow for multi-line strings, and anything that requires multi-line strings should go in additional separate files (including the package description). You could make it TOML but I don't think you should make it YAML.

Both the build time actions, such as configuring and compiling the source, and the binary package install time actions should by default be declarative; you should be able to say 'this is an autoconf based program and it should have the following additional options', and the build system will take care of everything else. Similarly you should be able to directly express that the binary package needs certain standard things done when it's installed, like adding system users and enabling services. However, this will never be enough so you should also be able to express additional shell script level things that are done to prepare, build, install, upgrade, and so on the package. Unlike RPM and Debian source packages but somewhat like Arch packages, these should be separate files in the control directory, eg 'pkgmeta/build.sh'. Making these separate files makes it much easier to do things like run shellcheck on them or edit them in syntax-aware editor environments.

(It should be possible to combine standard declarative prepare and build actions with additional shell or other language scripting. We want people to be able to do as much as possible with standard, declarative things. Also, although I used '.sh', you should be able to write these actions in other languages too, such as Python or Perl.)

I feel that like RPMs, you should have to at least default to explicitly declaring what files and directories are included in the binary package. Like RPMs, these installed files should be analyzed to determine the binary package dependencies rather than force you to try to declare them in the (source) package metadata (although you'll always have to declare build dependencies in the source package metadata). Like build and install scripts, these file lists should be in separate files, not in the main package metadata file. The RPM collection of magic ways to declare file locations is complex but useful so that, for example, you don't have to keep editing your file lists when the Python version changes. I also feel that you should have to specifically mark files in the file lists with unusual permissions, such as setuid or setgid bits.

The natural way to start packaging something new in this system would be to clone its repository and then start adding the package control files. The packaging system could make this easier by having additional tools that you ran in the root of your just-cloned repository and that looked around to find indications of things like the name, the version (based on repository tags), the build system in use, and so on, and then wrote out preliminary versions of the control files. More tools could be used incrementally for things like generating the file lists; you'd run the build and 'install' process, then have a tool inventory the installed files for you (and in the process it could recognize places where it should change absolute paths into specially encoded ones for things like 'the current Python package location').
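
To make the 'inventory the installed files' idea a bit more concrete, here is a minimal sketch of what such a tool could start from, assuming the 'install' step has put everything under a scratch directory. The function names and the one-path-per-line output are made up for illustration; a real tool would also need the path rewriting described above, which this does not attempt.

```shell
#!/bin/sh
# Sketch: inventory a scratch install tree to draft a file list.
# 'inventory_files' and 'special_perm_files' are hypothetical helpers.

# List files and symlinks relative to the scratch root, sorted.
inventory_files() {
    root=${1%/}
    (cd "$root" && find . \( -type f -o -type l \) -print) |
        sed 's|^\.||' | sort
}

# Report setuid or setgid files separately, since those are the ones
# that would have to be explicitly marked in the file lists.
special_perm_files() {
    root=${1%/}
    (cd "$root" && find . -type f \( -perm -4000 -o -perm -2000 \) -print) |
        sed 's|^\.||' | sort
}
```

You would point both functions at the scratch install directory and review the results by hand before turning them into the package's file lists.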

This sketch leaves a lot of questions open, such as what 'source packages' should look like when published by distributions. One answer is to publish the VCS repository but that's potentially quite heavyweight, so you might want a more minimal form. However, once you create a 'source only' minimal form without the VCS history, you're going to want a way to disentangle your local changes from the upstream source.

Linux distribution packaging should be as declarative as possible

By: cks

A commentator on my entry on why Debian and RPM (source) packages are complicated suggested looking at Arch Linux packaging, where most of the information is in a single file as more or less a shell script (example). Unfortunately, I'm not a fan of this sort of shell script or shell script like format, ultimately because it's only declarative by convention (although I suspect Arch enforces some of those conventions). One reason that declarative formats are important is that you can analyze and understand what they do without having to execute code. Another reason is that such formats naturally standardize things, which makes it much more likely that any divergence from the standard approach is something that matters, instead of a style difference.

Being able to analyze and manipulate declarative (source) packaging is useful for large scale changes within a distribution. The RPM source package format uses standard, more or less declarative macros to build most software, which I understand has made it relatively easy to build a lot of software with special C and C++ hardening options. You can inject similar things into a shell script based environment, but then you wind up with ad-hoc looking modifications in some circumstances, as we see in the Dovecot example.

Some things about declarative source packages versus Arch style minimalism are issues of what could be called 'hygiene'. RPM packages push you to list and categorize what files will be included in the built binary package, rather than simply assuming that everything installed into a scratch hierarchy should be packaged. This can be frustrating (and there are shortcuts), but it does give you a chance to avoid accidentally shipping unintended files. You could do this with shell script style minimal packaging if you wanted to, of course. Both RPM and Debian packages have standard and relatively declarative ways to modify a pristine upstream package, and while you can do that in Arch packages, it's not declarative, which hampers various sorts of things.

Basically my feeling is that at scale, you're likely to wind up with something that's essentially as formulaic as a declarative source package format without having its assured benefits. There will be standard templates that everyone is supposed to follow and they mostly will, and you'll be able to mostly analyze the result, and that 'mostly' qualification will be quietly annoying.

(On the positive side, the Arch package format does let you run shellcheck on your shell stanzas, which isn't straightforward to do in the RPM source format.)

Why Debian and RPM (source) packages are complicated

By: cks

A commentator on my early notes on dgit mentioned that they found packaging in Debian overly complicated (and I think perhaps RPMs as well) and would rather build and ship a container. On the one hand, this is in a way fair; my impression is that the process of specifying and building a container is rather easier than for source packages. On the other hand, Debian and RPM source packages are complicated for good reasons.

Any reasonably capable source package format needs to contain a number of things. A source package needs to supply the original upstream source code, some amount of distribution changes, instructions for building and 'installing' the source, a list of (some) dependencies (for either or both build time and install time), a list of files and directories it packages, and possibly additional instructions for things to do when the binary package is installed (such as creating users, enabling services, and so on). Then generally you need some system for 'hermetic' builds, ones that don't depend on things in your local (Linux) login environment. You'll also want some amount of metadata to go with the package, like a name, a version number, and a description. Good source package formats also support building multiple binary packages from a single source package, because sometimes you want to split up the built binary files to reduce the amount of stuff some people have to install. A built binary package contains a subset of this; it has (at least) the metadata, the dependencies, a file list, all of the files in the file list, and those install and upgrade time instructions.

Built containers are a self contained blob plus some metadata. You don't need file lists or dependencies or install and removal actions because all of those are about interaction with the rest of the system and by design containers don't interact with the rest of the system. To build a container you still need some of the same information that a source package has, but you need less and it's deliberately more self-contained and freeform. Since the built container is a self contained artifact you don't need a file list, I believe it's uncommon to modify upstream source code as part of the container build process (instead you patch it in advance in your local repository), and your addition of users, activation of services, and so on is mostly free form and at container build time; once built the container is supposed to be ready to go. And my impression is that in practice people mostly don't try to do things like multiple UIDs in a single container.

(You may still want or need to understand what things you install where in the container image, but that's your problem to keep track of; the container format itself only needs a little bit of information from you.)

Containers have also learned from source packages in that they can be layered, which is to say that you can build your container by starting from some other container, either literally or by sticking another level of build instructions on the end. Layered source packages don't make any sense when you're thinking like a distribution, but they make a lot of sense for people who need to modify the distribution's source packages (this is what dgit makes much easier, partly because Git is effectively a layering system; that's one way to look at a sequence of Git commits).

(My impression of container building is that it's a lot more ad-hoc than package building. Both Debian and RPM have tried to standardize and automate a lot of the standard source code building steps, like running autoconf, but the cost of this is that each of them has a bespoke set of 'convenient' automation to learn if you want to build a package from scratch. With containers, you can probably mostly copy the upstream's shell-based build instructions (or these days, their Dockerfile).)

Dgit based building of (potentially modified) Debian packages can be surprisingly close to the container building experience. Like containers, you first prepare your modifications in a repository and then you run some relatively simple commands to build the artifacts you'll actually use. Provided that your modifications don't change the dependencies, files to be packaged, and so on, you don't have to care about how Debian defines and manipulates those, plus you don't even need to know exactly how to build the software (the Debian stuff takes care of that for you, which is to say that the Debian package builders have already worked it out).

In general I don't think you can get much closer to the container build experience than with the dgit build experience or the general RPM experience (if you're starting from scratch). Packaging takes work because packages aren't isolated, self contained objects; they're objects that need to be integrated into a whole system in a reversible way (ie, you can uninstall them, or upgrade them even though the upgraded version has a somewhat different set of files). You need more information, more understanding, and a more complicated build process.

(Well, I suppose there are flatpaks (and snaps). But these mostly don't integrate with the rest of your system; they're explicitly designed to be self-contained, standalone artifacts that run in a somewhat less isolated environment than containers.)

Moving local package changes to a new Ubuntu release with dgit

By: cks

Suppose, not entirely hypothetically, that you've made local changes to an Ubuntu package on one Ubuntu release, such as 22.04 ('jammy'), and now you want to move to another Ubuntu release such as 24.04 ('noble'). If you're working with straight 'apt-get source' Ubuntu source packages, this is done by tediously copying all of your patches over (hopefully the package uses quilt) to duplicate and recreate your 22.04 work.

If you're using dgit, this is much easier. Partly this is because dgit is based on Git, but partly this is because dgit has an extremely convenient feature where it can have several different releases in the same Git repository. So here's what we want to do, assuming you have a dgit repository for your package already.

(For safety you may want to do this in a copy of your repository. I make rsync'd copies of Git repositories all the time for stuff like this.)

Our first step is to fetch the new 24.04 ('noble') version of the package into our dgit repository as a new dgit branch, and then check out the branch:

dgit fetch -d ubuntu noble,-security,-updates
dgit checkout noble,-security,-updates

We could do this in one operation but I'd rather do it in two, in case there are problems with the fetch.

The Git operation we want to do now is to cherry-pick our changes to the 22.04 version of the package onto the 24.04 version of the package. If this goes well the changes will apply cleanly and we're done. However, there is a complication. If we've followed the usual process for making dgit-based local changes, the last commit on our 22.04 version is an update to debian/changelog. We don't want that change, because we need to do our own 'gbp dch' on the 24.04 version after we've moved our own changes over to make our own 24.04 change to debian/changelog (among other things, the 22.04 changelog change has the wrong version number for the 24.04 package).

In general, cherry-picking all our local changes is 'git cherry-pick old-upstream..old-local'. To get all but the last change, we want 'old-local~' instead. Dgit has long and somewhat obscure branch names; its upstream for our 22.04 changes is 'dgit/dgit/jammy,-security,-updates' (ie, the full 'suite' name we had to use with 'dgit clone' and 'dgit fetch'), while our local branch is 'dgit/jammy,-security,-updates'. So our full command, with a 'git log' beforehand to be sure we're getting what we want, is:

git log dgit/dgit/jammy,-security,-updates..dgit/jammy,-security,-updates~
git cherry-pick dgit/dgit/jammy,-security,-updates..dgit/jammy,-security,-updates~

(We've seen this dgit/dgit/... stuff before when doing 'gbp dch'.)
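
Since the long branch names are easy to mistype, the range can be built from the suite name instead. This is only a sketch: 'dgit_pick_range' is a hypothetical helper name, and it assumes (as above) that the last commit on the local branch is the debian/changelog update you want to leave behind. It prints the two commands rather than running them.

```shell
#!/bin/sh
# Sketch: derive dgit's branch names and the cherry-pick range from a suite.
# 'dgit_pick_range' is a hypothetical helper name.
# Upstream branch: dgit/dgit/<suite>; local branch: dgit/<suite>.
# The trailing '~' drops the last local commit (the debian/changelog update).
dgit_pick_range() {
    suite=$1
    printf '%s..%s~\n' "dgit/dgit/$suite" "dgit/$suite"
}

range=$(dgit_pick_range 'jammy,-security,-updates')
echo "git log $range"
echo "git cherry-pick $range"
```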

Then we need to make our debian/changelog update. Here, as an important safety tip, don't blindly copy the command you used while building the 22.04 package, using 'jammy,...' in the --since argument, because that will try to create a very confused changelog of everything between the 22.04 version of the package and the 24.04 version. Instead, you obviously need to update it to your new 'noble' 24.04 upstream, making it:

gbp dch --since dgit/dgit/noble,-security,-updates --local .cslab. --ignore-branch --commit

('git reset --hard HEAD~' may be useful if you make a mistake here. As they say, ask me how I know.)

If the cherry-pick doesn't apply cleanly, you'll have to resolve that yourself. If the cherry-pick applies cleanly but the result doesn't build or perhaps doesn't work because the code has changed too much, you'll be using various ways to modify and update your changes. But at least this is a bunch easier than trying to sort out and update a quilt-based patch series.

Appendix: Dealing with Ubuntu package updates

Based on this conversation, if Ubuntu releases a new version of the package, what I think I need to do is to use 'dgit fetch' and then explicitly rebase:

dgit fetch -d ubuntu

You have to use '-d ubuntu' here or 'dgit fetch' gets confused and fails. There may be ways to fix this with git config settings, but setting them all is exhausting and if you miss one it explodes, so I'm going to have to use '-d ubuntu' all the time (unless dgit fixes this someday).

Dgit repositories don't have an explicit Git upstream set, so I don't think we can use plain rebase. Instead I think we need the more complicated form:

git rebase dgit/dgit/jammy,-security,-updates dgit/jammy,-security,-updates

(Until I do it for real, these arguments are speculative. I believe they should work if I understand 'git rebase' correctly, but I'm not completely sure. I might need the full three argument form and to make the 'upstream' a commit hash.)

Then, as above, we need to drop our debian/changelog change and redo it:

git reset --hard HEAD~
gbp dch --since dgit/dgit/jammy,-security,-updates --local .cslab. --ignore-branch --commit

(There may be a clever way to tell 'git rebase' to skip the last change, or you can do an interactive rebase (with '-i') instead of a non-interactive one and delete it yourself.)

Early notes about using dgit on Ubuntu (LTS)

By: cks

I recently read Ian Jackson's Debian’s git transition and had a reaction:

I would really like to be able to patch and rebuild Ubuntu packages from a git repository with our local changes (re)based on top of upstream git. It would be much better than quilt'ing and debuild'ing .dsc packages (I have non-complimentary opinions on the Debian source package format). This news gives me hope that it'll be possible someday, but especially for Ubuntu I have no idea how soon or how well documented it will be.

(It could even be better than RPMs.)

The subsequent discussion got me to try out dgit, especially since it had an attractive dgit-user(7) manual page that gave very simple directions on how to make a local change to an upstream package. It turns out that things aren't entirely smooth on Ubuntu, but they're workable.

The starting point is 'dgit clone', but on Ubuntu you currently get to use special arguments that aren't necessary on Debian:

dgit clone -d ubuntu dovecot jammy,-security,-updates

(You don't have to do this on a machine running 'jammy' (Ubuntu 22.04); it may be more convenient to do it from another one, perhaps with a more up to date dgit.)

The latest Ubuntu package for something may be in either their <release>-security or their <release>-updates 'suite', so you need both. I think this is equivalent to what 'apt-get source' gets you, but you might want to double check. Once you've gotten the source in a Git repository, you can modify it and commit those modifications as usual, for example through Magit. If you have an existing locally patched version of the package that you did with quilt, you can import all of the quilt patches, either one by one or all at once and then using Magit's selective commits to sort things out.

Having made your modifications, whether tentative or otherwise, you can now automatically modify debian/changelog:

gbp dch --since dgit/dgit/jammy,-security,-updates --local .cslab. --ignore-branch --commit

(You might want to use -S for snapshots when testing modifications and builds, I don't know. Our practice is to use --local to add a local suffix on the upstream package number, so we can keep our packages straight.)

The special bit is the 'dgit/dgit/<whatever you used in dgit clone>', which tells gbp-dch (part of the gbp suite of stuff) where to start the changelog from. Using --commit is optional; what I did was to first run 'gbp dch' without it, then use 'git diff' to inspect the resulting debian/changelog changes, and then 'git restore debian/changelog' and re-run it with a better set of options until eventually I added the '--commit'.

You can then install build-deps (if necessary) and build the binary packages with the dgit-user(7) recommended 'dpkg-buildpackage -uc -b'. Normally I'd say that you absolutely want to build source packages too, but since you have a Git repository with the state frozen that you can rebuild from, I don't think it's necessary here.

(After the build finishes you can admire 'git status' output that will tell you just how many files in your source tree the Debian or Ubuntu package building process modified. One of the nice things about using Git and building from a Git repository is that you can trivially fix them all, rather than the usual set of painful workarounds.)
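
As a hedged sketch of what 'fixing them all' can look like: the function below throws away every uncommitted change and every untracked or ignored file in the current Git working tree. That is what you want after a package build, but it is destructive anywhere else. The function name is made up for illustration, and it assumes your own changes are already committed.

```shell
# Sketch: return a package source tree to its last committed state after
# a build. reset_build_tree is an illustrative name, not a dgit command.
# WARNING: this discards all uncommitted changes and untracked files.
reset_build_tree() {
    # Undo modifications the build made to tracked files.
    git reset -q --hard HEAD &&
    # Remove untracked files and directories, including ignored ones.
    git clean -q -xdf
}
```

Run 'reset_build_tree' from the top of the source tree; afterward 'git status' should report a clean tree.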

The dgit-user(7) manual page suggests but doesn't confirm that if you're bold, you can build from a tree with uncommitted changes. Personally, even if I was in the process of developing changes I'd commit them and then make liberal use of rebasing, git-absorb, and so on to keep updating my (committed) changes.

It's not clear to me how to integrate upstream updates (for example, a new Ubuntu update to the Dovecot package) with your local changes. It's possible that 'dgit pull' will automatically rebase your changes, or give you the opportunity to do that. If not, you can always do another 'dgit clone' and then manually import your Git changes as patches.

(A disclaimer: at this point I've only cloned, modified, and built one package, although it's a real one we use. Still, I'm sold; the ability to reset the tree after a build is valuable all by itself, never mind having a better way than quilt to handle making changes.)

The systemd journal, message priorities, and (syslog) facilities

By: cks

If you use systemd units or systemd-run to conveniently capture output from scripts and programs into the systemd journal, one of the things that it looks like you don't get is message priorities and (syslog) facilities. Fortunately, systemd's journal support is a bit more sophisticated than that.

When you print out regular output and systemd captures it into the journal, systemd assigns it a default priority that's set with SyslogLevel=; this is normally 'info', which is a good default choice. Similarly, you can pick the syslog facility associated with your unit or your systemd-run invocation with SyslogFacility=. Systemd defaults to 'daemon', which may not entirely be what you want. On the other hand, the choice of syslog facility matters less if you're primarily working with journalctl, where what you usually care about is the systemd unit name.

(You can use journalctl to select messages by priority or syslog facility with the -p and --facility options. You can also select by syslog identifier with the -t option. This is probably going to be handy for searching the journal for messages from some of our programs that use syslog to report things.)

If you know that you're logging to systemd (or you don't care that your regular output looks a bit weird in spots), you can also print messages with special priority markers, as covered in sd-daemon(3). Now that I know about this, I may put it to use in some of our scripts and programs. Sadly, unlike the normal Linux logger and its --prio-prefix option, you can't change the syslog facility this way, but if you're doing pure journald logging you probably don't care about that.
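
A minimal sketch of the sd-daemon(3) priority markers: a line that starts with '<N>', where N is a syslog priority number from 0 (emerg) to 7 (debug), is logged at that priority when systemd is capturing the output. The helper names here (log_err, log_warn, log_info) are made up for illustration.

```shell
# Hypothetical helpers: prefix a message with an sd-daemon(3) style
# priority marker. Under systemd the marker is stripped and used as the
# journal priority; on a terminal it just shows up as literal text.
log_err()  { printf '<3>%s\n' "$*"; }   # 3 = err
log_warn() { printf '<4>%s\n' "$*"; }   # 4 = warning
log_info() { printf '<6>%s\n' "$*"; }   # 6 = info

log_info "starting frobulation"
log_warn "thing 17 was already frobulated"
log_err  "could not frobulate thing 23" >&2
```

With something like this in a script run from a unit, 'journalctl -p warning -u <unit>' should show only the warning and error lines.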

(It's possible that sd-daemon(3) actually supports the logger behavior of changing the syslog facility too, but if so it's not documented and you shouldn't count on it. Instead you should assume that you have to control the syslog facility through setting SyslogFacility=, which unfortunately means you can't log just authentication things to 'auth' and everything else to 'daemon' or some other appropriate facility.)

PS: Unfortunately, as far as I know journalctl has no way to augment its normal syslog-like output with some additional fields, such as the priority or the syslog facility. Instead you have to go all the way to a verbose dump of information in one of the supported formats for field selection.

Some notes on using systemd-run or systemd-cat for logging program output

By: cks

In response to yesterday's entry on using systemd (service) units for easy capturing of log output, a commentator drew my attention to systemd-run and systemd-cat. I spent a bit of time poking at both of them and so I've wound up with some things to remember and some opinions.

(The short summary is that you probably want to use systemd-run with a specific unit name that you pick.)

Systemd-cat is very roughly the systemd equivalent of logger. As you'd expect, things that it puts in the systemd journal flow through to anywhere that regular journal entries would, including things that directly get fed from the journal and syslog (including remote syslog destinations). The most convenient way to use systemd-cat is to just have it run a command, at which point it will capture all of the output from the command and put it in the journal. However, there is a little issue with using just 'systemd-cat /some/command', which is that the journal log identifiers that systemd-cat generates in this case will be the direct name of whatever program produced the output. If /some/command is a script that runs a variety of programs that produce output (perhaps it echoes some status information itself and then runs a program, which produces output on its own), you'll get a mixture of identifier names in the resulting log:

your-script[...]: >>> Frobulating the thing
some-prog[...]: Frobulation results: 23 processed, 0 errors

Journal logs written by systemd-cat also inherit whatever unit it was in (a session unit, cron.service, etc), and the combination can make it hard to clearly see all of the logs from running your script. To do better you need to give systemd-cat an explicit identifier, 'systemd-cat -t <something> /some/command', at which point everything is logged with that name, but still in whatever systemd unit systemd-cat ran in.

Generally you want your script to report all its logs under a single unit name, so you can find them and sort them out from all of the other things your system is logging. To do this you need to use systemd-run with an explicit unit name:

systemd-run -u myscript --quiet --wait -G /some/script

I believe you can then hook this into any systemd service unit infrastructure you want, such as sending email if the unit fails (if you do, you probably want to add '--service-type=oneshot'). Using systemd-run this way gets you the best of both systemd-cat worlds; all of the output from /some/script will be directly labeled with what program produced it, but you can find it all using the unit name.

Systemd-run will refuse to activate a unit with a name that duplicates an existing unit, including existing systemd-run units. In many cases this is a feature for script use, since you basically get 'run only one copy' locking for free (although the error message is noisy, so you may want to do your own quiet locking). If you want to always run your program even if another instance is running, you'll have to generate non-constant unit names (or let systemd-run do it for you).
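
One way to generate non-constant unit names yourself is sketched below, assuming a date, time and PID suffix is unique enough for your purposes. The function name and the 'myscript' prefix are made up for illustration.

```shell
# Sketch: run a command under systemd-run with a per-invocation unit name,
# so that overlapping runs don't collide. run_unique and the "myscript"
# prefix are illustrative names.
run_unique() {
    # Produces names like myscript-20250101-120000-12345
    # (date, time, PID of the calling shell).
    unit="myscript-$(date +%Y%m%d-%H%M%S)-$$"
    systemd-run -u "$unit" --quiet --wait -G "$@"
}
```

You would call this as 'run_unique /some/script'. Since journalctl's -u option accepts patterns, "journalctl -u 'myscript-*'" should still find all of the runs together.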

Systemd-cat has some features that systemd-run doesn't offer, such as setting the priority of messages (and setting a different priority for standard error output). If these features are important to you, I'd suggest nesting systemd-cat (with no '-t' argument) inside systemd-run, so you get both the searchable unit name and the systemd-cat features. If you're already in an environment with a useful unit name and you just need to divert log messages from wherever else the environment wants to send them into the system journal, bare systemd-cat will do the job.
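
A sketch of the nesting, assuming you want regular output at 'info' and standard error at 'warning'. The function name and the 'myscript' unit name are illustrative, and the priorities are just one plausible choice.

```shell
# Sketch: systemd-run supplies the unit name, systemd-cat (with no -t)
# supplies per-stream priorities. run_with_priorities and "myscript" are
# illustrative names.
run_with_priorities() {
    systemd-run -u myscript --quiet --wait -G \
        systemd-cat --priority=info --stderr-priority=warning "$@"
}
```

Called as 'run_with_priorities /some/script', this should let you pick out just the standard error output with 'journalctl -u myscript -p warning'.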

(Arguably this is the case for things run from cron, if you're content to look for all of them under cron.service (or crond.service, depending on your Linux distribution). Running things under systemd-cat puts their output in the journal instead of having them send you email, which may be good enough and saves you having to invent and then remember a bunch of unit names.)

Turning to systemd units for easy capturing of log output

By: cks

Suppose, not hypothetically, that you have a third party tool that you need to run periodically. This tool prints things to standard output (or standard error) that are potentially useful to capture somehow. You want this captured output to be associated with the program (or your general system for running the program) and timestamped, and it would be handy if the log output wound up in all of the usual places in your systems for output. Unix has traditionally had some solutions for this, such as logger for sending things to syslog, but they all have a certain amount of annoyances associated with them.

(If you directly run your script or program from cron, you will automatically capture the output in a nice dated form, but you'll also get email all the time. Let's assume we want a quieter experience than email from cron, because you don't need to regularly see the output, you just want it to be available if you go looking.)

On modern Linux systems, the easy and lazy thing to do is to run your script or program from a systemd service unit, because systemd will automatically do this for you and send the result into the systemd journal (and anything that pulls data from that) and, if configured, into whatever overall systems you have for handling syslog logs. You want a unit like this:

[Unit]
Description=Local: Do whatever
ConditionFileIsExecutable=/root/do-whatever

[Service]
Type=oneshot
ExecStart=/root/do-whatever

Unlike the usual setup for running scripts as systemd services, we don't set 'RemainAfterExit=True' because we want to be able to repeatedly trigger our script with, for example, 'systemctl start local-whatever.service'. You can even arrange to get email if this unit (ie, your script) fails.

You can run this directly from cron through suitable /etc/cron.d files that use 'systemctl start', or set up a systemd timer unit (possibly with a randomized start time). The advantage of a systemd timer unit is that you definitely won't ever get email about this unless you specifically configure it. If you're setting up a relatively unimportant and throwaway thing, it being reliably silent is probably a feature.

(Setting up a systemd timer unit also keeps everything within the systemd ecosystem rather than worrying about various aspects of running 'systemctl start' from scripts or crontabs or etc.)
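
A minimal sketch of a matching timer unit, written out by a small shell snippet. The daily schedule and one-hour random delay are assumptions for illustration, and UNIT_DIR defaults to a scratch directory rather than /etc/systemd/system so that nothing on a real system is touched.

```shell
# Sketch: write a timer unit that starts local-whatever.service once a
# day, with up to an hour of random delay. UNIT_DIR is an illustrative
# variable; on a real system the file belongs in /etc/systemd/system.
UNIT_DIR=${UNIT_DIR:-$(mktemp -d)}
cat > "$UNIT_DIR/local-whatever.timer" <<'EOF'
[Unit]
Description=Local: Periodically do whatever

[Timer]
OnCalendar=daily
RandomizedDelaySec=1h

[Install]
WantedBy=timers.target
EOF
echo "wrote $UNIT_DIR/local-whatever.timer"
```

A timer unit activates the service unit of the same name by default, so no explicit Unit= setting is needed here. Once the file is in place, 'systemctl daemon-reload' followed by 'systemctl enable --now local-whatever.timer' should arm it.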

On the one hand, it feels awkward to go all the way to a systemd service unit simply to get easy to handle logs; it feels like there should be a better solution somewhere. On the other hand, it works and it only needs one extra file over what you'd already need (the .service).

Why I (still) love Linux


I know, this title might come as a surprise to many. Or perhaps, for those who truly know me, it won’t. I am not a fanboy. The BSDs and the illumos distributions generally follow an approach to design and development that aligns more closely with the way I think, not to mention the wonderful communities around them, but that does not mean I do not use and appreciate other solutions. I usually publish articles about how much I love the BSDs or illumos distributions, but today I want to talk about Linux (or, better, GNU/Linux) and why, despite everything, it still holds a place in my heart. This will be the first in a series of articles where I’ll discuss other operating systems.

Where It All Began

I started right here, with GNU/Linux, back in 1996. It was my first real prompt after the Commodore 64 and DOS. It was my first step toward Unix systems, and it was love at first shell. I felt a sense of freedom - a freedom that the operating systems I had known up to that point (few, to be honest) had never given me. It was like a “blank sheet” (or rather, a black one) with a prompt on it. I understood immediately that this prompt, thanks to command chaining, pipes, and all the marvels of Unix and Unix-like systems, would allow me to do anything. And that sense of freedom is what makes me love Unix systems to this day.

I was young, but my intuition was correct. And even though I couldn't afford to keep a full Linux installation on that computer long-term due to hardware limitations, I realized that this would be my future. A year later, a new computer arrived, allowing me to use Linux daily, for everything. And successfully, without missing Windows at all (except for a small partition, strictly for gaming).

When I arrived at university, in 1998, I was one of the few who knew it. One of the few who appreciated it. One of the few who hoped to see a flourishing future for it. Everywhere. Widespread. A dream come true. I was a speaker at Linux Days, I actively participated in translation projects, and I wrote articles for Italian magazines. I was a purist regarding the "GNU/Linux" nomenclature because I felt it was wrong to ignore the GNU part - it was fundamental. Because perhaps the "Year of the Linux Desktop" never arrived, but Linux is now everywhere. On my desktop, without a doubt. But also on my smartphone (Android) and on those of hundreds of millions of people. Just as it is in my car. And in countless devices surrounding us - even if we don’t know it. And this is the true success. Let’s not focus too much on the complaint that "it’s not compatible with my device X". It is your device that is not compatible with Linux, not the other way around. Just like when, many years ago, people complained that their WinModems (modems that offloaded all processing to obscure, closed-source Windows drivers) didn't work on Linux. For "early adopters" like me, this concept has always been present, even though, fortunately, things have improved exponentially.

Linux was what companies accepted most willingly (not totally, but still...): the ongoing lawsuits against the BSDs hampered their spread, and Linux seemed like that "breath of fresh air" the world needed.

Linux and its distributions (especially those untethered from corporations, like Debian, Gentoo, Arch, etc.) allowed us to replicate expensive "commercial" setups at a fraction of the cost. Reliability was good, updating was simple, and there was a certain consistency. Not as marked as that of the BSDs, but sufficient.

The world was ready to accept it, albeit reluctantly. Linus Torvalds, despite his sometimes harsh and undiplomatic tone, carried forward the kernel development with continuity and coherence, making difficult decisions but always in line with the project. The "move fast and break things" model was almost necessary because there was still so much to build. I also remember the era when Linux - speaking of the kernel - was designed almost exclusively for x86. The other architectures, to simplify, worked thanks to a series of adaptations that brought most behavior back to what was expected for x86.

And the distributions, especially the more "arduous" ones to install, taught me a lot. The distro-hopping of the early 2000s made me truly understand partitioning, the boot procedure (LILO first, then GRUB, etc.), and for this, I must mainly thank Gentoo and Arch (and the FreeBSD handbook - but this is for another article). I learned the importance of backups the hard way, and I keep this lesson well in mind today. My Linux desktops ran mainly with Debian (initially), then Gentoo, Arch, and openSUSE (which, at the time, was still called "SUSE Linux"), Manjaro, etc. My old 25 MHz 486SX with 4 MB (yes, MB) of RAM, powered by Debian, allowed me to download emails (mutt and fetchmail), news (inn + suck), program in C, and create shell scripts - at the end of the 90s.

When Linux Conquered the World

Then the first Ubuntu was launched, and many things changed. I don't know if it was thanks to Ubuntu or simply because the time was ripe, but attention shifted to Linux on the desktop as well (albeit mainly on the computers of us enthusiasts), and many companies began to contribute actively to the system or distributions.

I am not against the participation of large companies in Open Source. Their contributions can be valuable for the development of Open Source itself, and if companies make money from it, good for them. If this ultimately leads to a more complete and valid Open Source product, then I welcome it! It is precisely thanks to mass adoption that Linux cleared the path for the acceptance of Open Source at all levels. I still remember when, just after graduating, I was told that Linux (and Open Source systems like the BSDs) were "toys for universities". I dare anyone to say that today!

But this must be done correctly: without spoiling the original idea of the project and without hijacking (voluntarily or not) development toward a different model. Toward a different evolution. The use of Open Source must not become a vehicle for a business model that tends to close, trap, or cage the user. Or harm anyone. And if it is oriented toward worsening the product solely for one's own gain, I can only be against it.

What Changed Along the Way

And this is where, unfortunately, I believe things have changed in the Linux world (if not in the kernel itself, at least in many distributions). Innovation used to be disruptive out of necessity. Today, in many cases, disruption happens without purpose, and stability is often sacrificed for changes that do not solve real problems. Sometimes, in the name of improved security or stability, a new, immature, and unstable product is created - effectively worsening the status quo.

To give an example, I am not against systemd on principle, but I consider it a tool distant from the original Unix principles - do one thing and do it well - full of features and functions that, frankly, I often do not need. I don't want systemd managing my containerization. For restarting stopped services? There are monit and supervisor - efficient, effective, and optional. And, I might add: services shouldn't crash; they should handle problems in a non-destructive way. My Raspberry Pi A+ doesn't need systemd, which occupies a huge amount of RAM (and precious clock cycles) for features that will never be useful or necessary on that platform.

But "move fast and break things" has arrived everywhere, and software is often written by gluing together unstable libraries or those laden with system vulnerabilities. Not to mention so-called "vibe coding" - which might give acceptable results at certain levels, but should not be used when security and confidentiality become primary necessities or, at least, without an understanding of what has been written.

We are losing much of the Unix philosophy, and many Linux distributions are now taking the path of distancing themselves from a concept of cross-compatibility ("if it works on Linux, I don't care about other operating systems"), of minimalism, of "do one thing and do it well". And, in my opinion, we are therefore losing many of the hallmarks that have distinguished its behavior over the years.

In my view, this depends on two factors: a development model linked to a concept of "disposable" electronics, applied even to software, and the pressure from some companies to push development where they want, not where the project should go. Therefore, in certain cases, the GPL becomes a double-edged sword: on one hand, it protects the software and ensures that contributions remain available. On the other, it risks creating a situation where the most "influential" player can totally direct development because - unable to close their product - they have an interest in the entire project going in the direction they have predisposed. In these cases, perhaps, BSD licenses actually protect the software itself more effectively. Because companies can take and use without an obligation to contribute. If they do, it is because they want to, as in the virtuous case of Netflix with FreeBSD. And this, while it may remove (sometimes precious) contributions to the operating system, guarantees that the steering wheel remains firmly in the hands of those in charge - whether foundations, groups, or individuals.

And Why I Still Care

And so yes, despite all this, I (still) love Linux.

Because it was the first Open Source project I truly believed in (and which truly succeeded), because it works, and because the entire world has developed around it. Because it is a platform on which tons of distributions have been built (and some, like Alpine Linux, still maintain that sense of minimalism that I consider correct for an operating system). Because it has distributions like openSUSE (and many others) that work immediately and without problems on my laptop (suspension and hibernation included) and on my miniPC, a fantastic tool I use daily. Because hardware support has improved immensely, and it is now rare to find incompatible hardware.

Because it has been my life companion for 30 years and has contributed significantly to putting food on the table and letting me sleep soundly. Because it allowed me to study without spending insane amounts on licenses or manuals. Because it taught me, first, to think outside the box. To be free.

So thank you, GNU/Linux.

Even if your btrfs, after almost 18 years, still eats data in spectacular fashion. Even if you rename my network interfaces after a reboot. Even though, at times, I get the feeling that you’re slowly turning into what you once wanted to defeat.

Even if you are not my first choice for many workloads, I foresee spending a lot of time with you for at least the next 30 years.

Static Web Hosting on the Intel N150: FreeBSD, SmartOS, NetBSD, OpenBSD and Linux Compared


Update: This post has been updated to include Docker benchmarks and a comparison of container overhead versus FreeBSD Jails and illumos Zones.

Note: Some operating systems (FreeBSD and Linux) support kernel TLS (kTLS) and the related SSL_sendfile path in nginx, which can improve HTTPS performance for static files. Since this feature is not available on all the systems included in the comparison (for example NetBSD, OpenBSD and illumos), the benchmarks were run with a common baseline configuration that does not rely on kTLS. The goal is to compare the systems under similar conditions rather than to measure OS specific optimizations.

I often get very specific infrastructure requests from clients. Most of the time it is some form of hosting. My job is usually to suggest and implement the setup that fits their goals, skills and long term plans.

If there are competent technicians on the other side, and they are willing to learn or already comfortable with Unix style systems, my first choices are usually one of the BSDs or an illumos distribution. If they need a control panel, or they already have a lot of experience with a particular stack that will clearly help them, I will happily use Linux and it usually delivers solid, reliable results.

Every now and then someone asks the question I like the least:

“But how does it perform compared to X or Y?”

I have never been a big fan of benchmarks. At best they capture a very specific workload on a very specific setup. They are almost never a perfect reflection of what will happen in the real world.

For example, I discovered that idle bhyve VMs seem to use fewer resources when the host is illumos than when the host is FreeBSD. It looks strange at first sight, but the illumos people are clearly working very hard on this, and the result is a very capable and efficient platform.

Despite my skepticism, from time to time I enjoy running some comparative tests. I already did it with Proxmox KVM versus FreeBSD bhyve, and I also compared Jails, Zones, bhyve and KVM on the same Intel N150 box. That led to the FreeBSD vs SmartOS article where I focused on CPU and memory performance on this small mini PC.

This time I wanted to do something simpler, but also closer to what I see every day: static web hosting.

Instead of synthetic CPU or I/O tests, I wanted to measure how different operating systems behave when they serve a small static site with nginx, both over HTTP and HTTPS.

This is not meant to be a super rigorous benchmark. I used the default nginx packages, almost default configuration, and did not tune any OS specific kernel settings. In my experience, careful tuning of kernel and network parameters can easily move numbers by several tens of percentage points. The problem is that very few people actually spend time chasing such optimizations. Much more often, once a limit is reached, someone yells “we need mooooar powaaaar” while the real fix would be to tune the existing stack a bit.

So the question I want to answer here is more modest and more practical:

With default nginx and a small static site, how much does the choice of host OS really matter on this Intel N150 mini PC?

Spoiler: less than people think, at least for plain HTTP. Things get more interesting once TLS enters the picture.


Disclaimer
These benchmarks are a snapshot of my specific hardware, network and configuration. They are useful to compare relative behavior on this setup. They are not a universal ranking of operating systems. Different CPUs, NICs, crypto extensions, kernel versions or nginx builds can completely change the picture.


Test setup

The hardware is the same Intel N150 mini PC I used in my previous tests: a small, low power box that still has enough cores to be interesting for lab and small production workloads.

On it, I installed several operating systems and environments, always on the bare metal, not nested inside each other. On each OS I installed nginx from the official packages.

Software under test

On the host:

SmartOS, with:
- a Debian 12 LX zone
- an Alpine Linux 3.22 LX zone
- a native SmartOS zone

FreeBSD 14.3-RELEASE:
- nginx running inside a native jail

OpenBSD 7.8:
- nginx on the host

NetBSD 10.1:
- nginx on the host

Debian 13.2:
- nginx on the host

Alpine Linux 3.22:
- nginx on the host
- Docker: Debian 13 container running on the Alpine host (ports mapped)

I also tried to include DragonFlyBSD, but the NIC in this box is not supported. Using a different NIC just for one OS would have made the comparison meaningless, so I excluded it.

nginx configuration

In all environments:

  • nginx was installed from the system packages
  • worker_processes was set to auto
  • the web root contained the same static content

The important part is that I used exactly the same nginx.conf file for all operating systems and all combinations in this article. I copied the same configuration file verbatim to every host, jail and zone. The only changes were the IP address and file paths where needed, for example for the TLS certificate and key.

The static content was a default build of the example site generated by BSSG, my Bash static site generator. The web root was the same logical structure on every OS and container type.

There is no OS specific tuning in the configuration and no kernel level tweaks. This is very close to a “package install plus minimal config” situation.

TLS configuration

For HTTPS I used a very simple configuration, identical on every host.

Self signed certificate created with:

openssl req -x509 -newkey rsa:4096 -nodes -keyout server.key -out server.crt -days 365 -subj "/CN=localhost"

Example nginx server block for HTTPS (simplified):

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;

    server_name _;

    ssl_certificate /etc/nginx/ssl/server.crt;
    ssl_certificate_key /etc/nginx/ssl/server.key;

    root /var/www/html;
    index index.html index.htm;

    location / {
        try_files $uri $uri/ =404;
    }
}

The HTTP virtual host is also the same everywhere, with the root pointing to the BSSG example site.

Load generator

The tests were run from my workstation on the same LAN:

  • client host: a mini PC machine connected at 2.5 Gbit/s
  • switch: 2.5 Gbit/s
  • test tool: wrk

For each target host I ran:

  • wrk -t4 -c50 -d10s http://IP
  • wrk -t4 -c10 -d10s http://IP
  • wrk -t4 -c50 -d10s https://IP
  • wrk -t4 -c10 -d10s https://IP

Each scenario was executed multiple times to reduce noise; the numbers below are medians (or very close to them) from the runs.
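
The repeated runs can be sketched as a small loop. This is not the script actually used for the article: TARGET, RUNS and DRY_RUN are illustrative names, the run count of 3 is an assumption, and 192.0.2.10 is a documentation address. By default it only prints the wrk commands it would run.

```shell
# Sketch: print (or run) the four wrk scenarios several times for one
# target. TARGET, RUNS and DRY_RUN are illustrative; set DRY_RUN=0 to
# actually invoke wrk.
TARGET=${TARGET:-192.0.2.10}
RUNS=${RUNS:-3}
DRY_RUN=${DRY_RUN:-1}

bench() {
    for scheme in http https; do
        for conns in 50 10; do
            i=1
            while [ "$i" -le "$RUNS" ]; do
                if [ "$DRY_RUN" = 1 ]; then
                    echo "wrk -t4 -c$conns -d10s $scheme://$TARGET"
                else
                    wrk -t4 -c"$conns" -d10s "$scheme://$TARGET"
                fi
                i=$((i + 1))
            done
        done
    done
}

bench
```

Taking the median of the reported requests per second for each scenario is then a matter of collecting the output of each run.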

The contenders

To keep things readable, I will refer to each setup as follows:

  • SmartOS Debian LX → SmartOS host, Debian 12 LX zone
  • SmartOS Alpine LX → SmartOS host, Alpine 3.22 LX zone
  • SmartOS Native → SmartOS host, native zone
  • FreeBSD Jail → FreeBSD 14.3-RELEASE, nginx in a jail
  • OpenBSD Host → OpenBSD 7.8, nginx on the host
  • NetBSD Host → NetBSD 10.1, nginx on the host
  • Debian Host → Debian 13.2, nginx on the host
  • Alpine Host → Alpine 3.22, nginx on the host
  • Docker Container → Alpine host, Debian 13 Docker container

Everything uses the same nginx configuration file and the same static site.

Static HTTP results

Let us start with plain HTTP, since this removes TLS from the picture and focuses on the kernel, network stack and nginx itself.

HTTP, 4 threads, 50 concurrent connections

Approximate median wrk results:

Environment          HTTP, 50 connections (req/s)
SmartOS Debian LX    ~46.2k
SmartOS Alpine LX    ~49.2k
SmartOS Native       ~63.7k
FreeBSD Jail         ~63.9k
OpenBSD Host         ~64.1k
NetBSD Host          ~64.0k
Debian Host          ~63.8k
Alpine Host          ~63.9k
Docker Container     ~63.7k

Two things stand out:

  1. Every setup that is not an LX zone (the bare-metal hosts, the FreeBSD jail, the SmartOS native zone and the Docker container) clusters around 63 to 64k requests per second.
  2. The two SmartOS LX zones sit lower, in the 46 to 49k range (roughly 23 to 28% below the rest), which is still very respectable for this hardware.

In other words, as long as you are on the host or in something very close to it (FreeBSD jail, SmartOS native zone, NetBSD, OpenBSD, Linux on bare metal), static HTTP on nginx will happily max out around 64k requests per second with this small Intel N150 CPU.

The Debian and Alpine LX zones on SmartOS are a bit slower, but not dramatically so. They still deliver close to 50k requests per second and, in a real world scenario, you would probably saturate the network or the client long before hitting those numbers.
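
To put a number on the gap, here is a small sketch that computes how far each LX zone lands below the SmartOS native zone, using the figures from the table above. The function name is made up for illustration.

```shell
# Sketch: how far below the SmartOS native zone do the LX zones land?
# Figures (thousands of requests per second) are copied from the table.
pct_below() {
    # pct_below BASELINE VALUE
    # Prints the percentage by which VALUE is below BASELINE.
    awk -v base="$1" -v val="$2" \
        'BEGIN { printf "%.1f\n", (1 - val / base) * 100 }'
}

echo "Debian LX: $(pct_below 63.7 46.2)% below native"
echo "Alpine LX: $(pct_below 63.7 49.2)% below native"
```

The same helper works for any pair of results, for example comparing the 10 connection runs.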

HTTP, 4 threads, 10 concurrent connections

With fewer concurrent connections, absolute throughput drops, but the relative picture is similar:

  • SmartOS Native around 44k
  • NetBSD and Alpine Host around 34 to 35k
  • FreeBSD, Debian, OpenBSD around 31 to 33k
  • The Docker Container sits slightly lower at ~30.2k req/s, showing a small overhead from the networking layer
  • The SmartOS LX zones sit below SmartOS Native, around 35 to 37k req/s

The important conclusion is simple:

For plain HTTP static hosting, once nginx is installed and correctly configured, the choice between these operating systems makes very little difference on this hardware. Zones and jails add negligible overhead, LX zones add a small one.

If you are only serving static content over HTTP, your choice of OS should be driven by other factors: ecosystem, tooling, update strategy, your own expertise and preference.

Static HTTPS results

TLS is where things start to diverge more clearly and where CPU utilization becomes interesting.

HTTPS, 4 threads, 50 concurrent connections

Approximate medians:

Environment | HTTPS, 50 connections (req/s) | CPU notes at 50 HTTPS connections
SmartOS Debian LX | ~51.4k | CPU saturated
SmartOS Alpine LX | ~40.4k | CPU saturated
SmartOS Native | ~52.8k | CPU saturated
FreeBSD Jail | ~62.9k | around 60% CPU idle
OpenBSD Host | ~39.7k | CPU saturated
NetBSD Host | ~40.4k | CPU saturated
Debian Host | ~62.8k | about 20% CPU idle
Alpine Host | ~62.4k | small idle headroom, around 7% idle
Docker Container | ~62.7k | CPU saturated

These numbers tell a more nuanced story.

  1. FreeBSD, Debian and Alpine on bare metal form a “fast TLS” group.
    All three sit around 62 to 63k requests per second with 50 concurrent HTTPS connections.

  2. FreeBSD does this while using significantly less CPU.
    During the HTTPS tests with 50 connections, the FreeBSD host still had around 60% CPU idle. It is the platform that handled TLS load most comfortably in terms of CPU headroom.

  3. Debian and Alpine are close in throughput, but push the CPU harder.
    Debian still had some idle time left, Alpine even less. In practice, all three are excellent here, but FreeBSD gives you more room before you hit the wall.

  4. SmartOS, NetBSD and OpenBSD form a “good but heavier” TLS group.
    Their HTTPS throughput is in the 40 to 52k req/s range and they reach full CPU usage at 50 concurrent connections. OpenBSD and NetBSD stabilize around 39 to 40k req/s. SmartOS native and the Debian LX zone manage slightly better (around 51 to 53k) but still with the CPU pegged.

HTTPS, 4 threads, 10 concurrent connections

With lower concurrency:

  • FreeBSD, Debian and Alpine still sit in roughly the 29 to 31k req/s range
  • SmartOS Native and LX zones are in the mid to high 30k range
  • The Docker Container drops slightly to ~27.8k req/s
  • NetBSD and OpenBSD sit around 26 to 27k req/s

The relative pattern is the same: for this TLS workload, FreeBSD and modern Linux distributions on bare metal appear to make better use of the cryptographic capabilities of the CPU, delivering higher throughput or more headroom or both.

What TLS seems to highlight

The HTTPS tests point to something that is not about nginx itself, but about the TLS stack and how well it can exploit the hardware.

On this Intel N150, my feeling is:

  • FreeBSD, with the userland and crypto stack I am running, is very efficient at TLS here. It delivers the highest throughput while keeping plenty of CPU in reserve.
  • Debian and Alpine, with their recent kernels and libraries, are also strong performers, close to FreeBSD in throughput, but with less idle CPU.
  • NetBSD, OpenBSD and SmartOS (native and LX) are still perfectly capable of serving a lot of HTTPS traffic, but they have to work harder to keep up and they hit 100% CPU much earlier.

This matches what I see in day to day operations: TLS performance is often less about “nginx vs something else” and more about the combination of:

  • the TLS library version and configuration
  • how well the OS uses the CPU crypto instructions
  • kernel level details in the network and crypto paths

I suspect the differences here are mostly due to how each system combines its TLS stack (OpenSSL, LibreSSL and friends), its kernel and its hardware acceleration support. It would take a deeper dive into profiling and configuration knobs to attribute the gaps precisely.

In any case, on this specific mini PC, if I had to pick a platform to handle a large amount of HTTPS static traffic, FreeBSD, Debian and Alpine would be my first candidates, in that order.

Zones, jails, containers and Docker: overhead in practice

Another interesting part of the story is the overhead introduced by different isolation technologies.

From these tests and the previous virtualization article on the same N150 machine, the picture is consistent:

  • FreeBSD jails behave almost like bare metal and are significantly more efficient than Docker.
    For both HTTP and HTTPS, running nginx in a jail on FreeBSD 14.3-RELEASE produces numbers practically identical to native hosts.
    The contrast with Docker is striking: while the Docker container required 100% CPU to reach its peak HTTP and HTTPS throughput, the FreeBSD jail delivered the same speed with ~60% of the CPU sitting idle. In terms of CPU cost per request, jails are drastically cheaper.

  • SmartOS native zones are also very close to the metal.
    Static HTTP performance reaches the same 64k req/s region and HTTPS is only slightly behind the "fast TLS" group, although with higher CPU usage.

  • SmartOS LX zones introduce a noticeable but modest overhead.
    Both Debian and Alpine LX zones on SmartOS perform slightly worse than the native zone or FreeBSD jails. For static HTTP they are still very fast. For HTTPS the Debian LX zone remains competitive but costs more CPU, while the Alpine LX zone is slower.

  • Docker on Linux performs efficiently but eats the margins. I ran an additional test using a Debian 13 Docker container running on the Alpine Linux host. At peak load (50 connections), the throughput was impressive and virtually identical to bare metal: ~63.7k req/s for HTTP and ~62.7k req/s for HTTPS. However, there is a clear cost. First, while the bare metal host maintained a small CPU buffer (~7% idle) during the HTTPS test, Docker saturated the CPU to 100%. Second, at lower concurrency (10 connections), the overhead became visible. The Docker container scored ~30.2k req/s for HTTP and ~27.8k req/s for HTTPS, slightly trailing the ~31-34k and ~29-31k range of the bare metal counterparts. The abstraction layers (NAT, bridging, namespaces) are extremely efficient, but they are not completely free.

This leads to a clear conclusion on efficiency: FreeBSD Jails provide the highest throughput with the lowest CPU cost. LX zones and Docker containers can match the speed (or come close), but they burn significantly more CPU cycles to do so.
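As a rough illustration of that efficiency gap, the Python sketch below normalizes the HTTPS throughput at 50 connections by the fraction of CPU that was actually busy. This is a crude back-of-the-envelope figure, not something that was measured: it assumes the approximate idle percentages quoted earlier are exact and that throughput would scale linearly with CPU use, and the names in the code are illustrative only.

```python
# (throughput in thousands of req/s, approximate idle CPU %) for HTTPS at
# 50 connections, taken from the HTTPS results earlier in the article.
HTTPS_50 = {
    "FreeBSD Jail": (62.9, 60),
    "Debian Host": (62.8, 20),
    "Alpine Host": (62.4, 7),
    "Docker Container": (62.7, 0),  # reported as CPU saturated
}


def throughput_per_busy_cpu(results):
    """Scale throughput by the busy CPU fraction (100% minus idle).

    Assumption: linear scaling with CPU use, which the tests did not verify.
    """
    return {
        name: kreq / ((100 - idle) / 100)
        for name, (kreq, idle) in results.items()
    }


if __name__ == "__main__":
    for name, value in throughput_per_busy_cpu(HTTPS_50).items():
        print(f"{name}: ~{value:.0f}k req/s per fully busy CPU")
```

Under those assumptions the FreeBSD jail works out to roughly 157k requests per second per fully busy CPU, against about 78k for the Debian host, 67k for the Alpine host and 63k for the Docker container.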

What this means for real workloads

It is easy to get lost in tables and percentages, so let us go back to the initial question.

A client wants static hosting.
Does the choice between FreeBSD, SmartOS, NetBSD or Linux matter in terms of performance?

For plain HTTP on this hardware, with nginx and the same configuration:

  • Not really.
    All the native hosts and FreeBSD jails deliver roughly the same maximum throughput, in the 63 to 64k req/s range. SmartOS LX zones are slightly slower but still strong.

For HTTPS:

  • Yes, it starts to matter a bit more.
  • FreeBSD stands out for how relaxed the CPU is under high TLS load.
  • Debian and Alpine are very close in throughput, with more CPU used but still with some headroom.
  • SmartOS, NetBSD and OpenBSD can still push a lot of HTTPS traffic, but they reach 100% CPU earlier and stabilize at lower request rates.

Does this mean you should always choose FreeBSD or Debian or Alpine for static HTTPS hosting?

Not necessarily.

In real deployments, the bottleneck is rarely the TLS performance of a single node serving a small static site. Network throughput, storage, logging, reverse proxies, CDNs and application layers all play a role.

However, knowing that FreeBSD and current Linux distributions can squeeze more out of a small CPU under TLS is useful when you are:

  • sizing hardware for small VPS nodes that must serve many HTTPS requests
  • planning to consolidate multiple services on a low power box
  • deciding whether you can afford to keep some CPU aside for other tasks (cache, background jobs, monitoring, and so on)

As always, the right answer depends on the complete picture: your skills, your tooling, your backups, your monitoring, the rest of your stack, and your tolerance for troubleshooting when things go sideways.

Final thoughts

From these small tests, my main takeaways are:

  1. Static HTTP is basically solved on all these platforms.
    On a modest Intel N150, every system tested can push around 64k static HTTP requests per second with nginx set to almost default settings. For many use cases, that is already more than enough.

  2. TLS performance is where the OS and crypto stack start to matter.
    FreeBSD, Debian and Alpine squeeze more HTTPS requests out of the N150, and FreeBSD in particular does it with a surprising amount of idle CPU left. NetBSD, OpenBSD and SmartOS need more CPU to reach similar speeds and stabilize at lower throughput once the CPU is saturated.

  3. Jails and native zones are essentially free, LX zones cost a bit more.
    FreeBSD jails and SmartOS native zones show very little overhead for this workload. SmartOS LX zones are still perfectly usable, but if you are chasing every last request per second you will see the cost of the translation layer.

  4. Benchmarks are only part of the story.
    If your team knows OpenBSD inside out and has tooling, scripts and workflows built around it, you might happily accept using more CPU on TLS in exchange for security features, simplicity and familiarity. The same goes for NetBSD or SmartOS in environments where their specific strengths shine.

I will not choose an operating system for a client just because a benchmark looks nicer. These numbers are one of the many inputs I consider. What matters most is always the combination of reliability, security, maintainability and the human beings who will have to operate the system at three in the morning when something goes wrong.

Still, it is nice to know that if you put a tiny Intel N150 in front of a static site and you pick FreeBSD or a modern Linux distribution for HTTPS, you are giving that little CPU a fair chance to shine.

In Linux, filesystems can and do have things with inode number zero

By: cks

A while back I wrote about how in POSIX you could theoretically use inode (number) zero. Not all Unixes consider inode zero to be valid; prominently, OpenBSD's getdents(2) doesn't return valid entries with an inode number of 0, and by extension, OpenBSD's filesystems won't have anything that uses inode zero. However, Linux is a different beast.

Recently, I saw a Go commit message with the interesting description of:

os: allow direntries to have zero inodes on Linux

Some Linux filesystems have been known to return valid entries with zero inodes. This new behavior also puts Go in agreement with recent glibc.

This fixes issue #76428, and the issue has a simple reproduction to create something with inode numbers of zero. According to the bug report:

[...] On a Linux system with libfuse 3.17.1 or later, you can do this easily with GVFS:

# Create many dir entries
(cd big && printf '%04x ' {0..1023} | xargs mkdir -p)
gio mount sftp://localhost/$PWD/big

The resulting filesystem mount is in /run/user/$UID/gvfs (see the issue for the exact long path) and can be experimentally verified to have entries with inode numbers of zero (well, as reported by reading the directory). On systems using glibc 2.37 and later, you can look at this directory with 'ls' and see the zero inode numbers.

(Interested parties can try their favorite non-C or non-glibc bindings to see if those environments correctly handle this case.)
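As one example of such a check, here is a minimal Python sketch that lists the directory entries whose reported inode number is zero. It relies on os.scandir(), whose DirEntry.inode() method returns the inode number from the directory entry itself on Unix. Since CPython reads directories through the C library, I'd expect the result to depend on the glibc version underneath it, but I haven't verified that; the function name is just for illustration.

```python
import os


def zero_inode_entries(path):
    """Return the names of entries in 'path' whose reported inode number is 0."""
    with os.scandir(path) as entries:
        return sorted(e.name for e in entries if e.inode() == 0)


if __name__ == "__main__":
    import sys

    for name in zero_inode_entries(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(name)
```

On an ordinary local filesystem this should print nothing; pointing it at the GVFS mount from the reproduction is where it gets interesting.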

That this requires glibc 2.37 is due to this glibc bug, first opened in 2010 (but rejected at the time for reasons you can read in the glibc bug) and then resurfaced in 2016 and eventually fixed in 2022 (and then again in 2024 for the thread safe version of readdir). The 2016 glibc issue has a bit of a discussion about the kernel side. As covered in the Go issue, libfuse returning a zero inode number may be a bug itself, but there are (many) versions of libfuse out in the wild that actually do this today.

Of course, libfuse (and gvfs) may not be the only Linux filesystems and filesystem environments that can create this effect. I believe there are alternate language bindings and APIs for the kernel FUSE (also, also) support, so they might have the same bug as libfuse does.

(Both Go and Rust have at least one native binding to the kernel FUSE driver. I haven't looked at either to see what they do about inode numbers.)

PS: My understanding of the Linux (kernel) situation is that if you have something inside the kernel that needs an inode number and you ask the kernel to give you one (through get_next_ino(), an internal function for this), the kernel will carefully avoid giving you inode number 0. A lot of things get inode numbers this way, so this makes life easier for everyone. However, a filesystem can decide on inode numbers itself, and when it does it can use inode number 0 (either explicitly or by zeroing out the d_ino field in the getdents(2) dirent structs that it returns, which I believe is what's happening in the libfuse situation).
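To illustrate the "carefully avoid" part, here is a deliberately simplified Python model of a wrapping 32-bit inode number counter that steps over zero. This is only a sketch of the idea as I understand it; it is not the kernel's code, it ignores any per-CPU handling the real function does, and next_ino() is a made-up name.

```python
UINT_MAX = 0xFFFFFFFF  # model the counter as an unsigned 32-bit value


def next_ino(last):
    """Return the number after 'last', skipping 0 when the counter wraps."""
    res = (last + 1) & UINT_MAX
    if res == 0:
        # A plain wrapping counter would hand out 0 here; skip it instead.
        res = 1
    return res
```

A filesystem that assigns its own inode numbers never goes through anything like this, which is how a zero can still show up.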

Making Polkit authenticate people like su does (with group wheel)

By: cks

Polkit is how a lot of things on modern Linux systems decide whether or not to let people do privileged operations, including systemd's run0, which effectively functions as another su or sudo. Polkit normally has a significantly different authentication model than su or sudo, where an arbitrary login can authenticate for privileged operations by giving the password of any 'administrator' account (accounts in group wheel or group admin, depending on your Linux distribution).

Suppose, not hypothetically, that you want a su like model in Polkit, one where people in group 'wheel' can authenticate by providing the root password, while people not in group 'wheel' cannot authenticate for privileged operations at all. In my earlier entry on learning about Polkit and adjusting it I put forward an untested Polkit stanza to do this. Now I've tested it and I can provide an actual working version.

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // must exist but have a locked password
        return ["unix-user:nobody"];
    }
});

(This goes in /etc/polkit-1/rules.d/50-default.rules, and the filename is important because it has to replace the standard version in /usr/share/polkit-1/rules.d.)

This doesn't quite work the way 'su' does, where it will just refuse to work for people not in group wheel. Instead, if you're not in group wheel you'll be prompted for the password of 'nobody' (or whatever other login you're using), which you can never successfully supply because the password is locked.

As I've experimentally determined, it doesn't work to return an empty list ('[]'), or a Unix group that doesn't exist ('unix-group:nosuchgroup'), or a Unix group that exists but has no members. In all cases my Fedora 42 system falls back to asking for the root password, which I assume is a built-in default for privileged authentication. Instead you apparently have to return something that Polkit thinks it can plausibly use to authenticate the person, even if that authentication can't succeed. Hopefully Polkit will never get smart enough to work that out and stop accepting accounts with locked passwords.

(If you want to be friendly and you expect people on your servers to run into this a lot, you should probably create a login with a more useful name and GECOS field, perhaps 'not-allowed' and 'You cannot authenticate for this operation', that has a locked password. People may or may not realize what's going on, but at least they have a chance.)

PS: This is with the Fedora 42 version of Polkit, which is version 126. This appears to be the most recent version from the upstream project.

Sidebar: Disabling Polkit entirely

Initially I assumed that Polkit had explicit rules somewhere that authorized the 'root' user. However, as far as I can tell this isn't true; there are no normal rules that specifically authorize root or any other UID 0 login name, and despite that root can perform actions that are restricted to groups that root isn't in. I believe this means that you can explicitly disable all discretionary Polkit authorization with a '00-disable.rules' file that contains:

polkit.addRule(function(action, subject) {
    return polkit.Result.NO;
});

Based on experimentation, this disables absolutely everything, even actions that are considered generally harmless (like libvirt's 'virsh list', which I think normally anyone can do).

A slightly more friendly version can be had by creating a situation where there are no allowed administrative users. I think this would be done with a 50-default.rules file that contained:

polkit.addAdminRule(function(action, subject) {
    // must exist but have a locked password
    return ["unix-user:nobody"];
});

You'd also want to make sure that no one is in any of the special groups that rules in /usr/share/polkit-1/rules.d use to allow automatic access. You can look for these by grep'ing for 'isInGroup'.
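If you want something a bit more structured than grep output, a small Python sketch like the following can pull the group names out of the rules files. It assumes that the group is written as a literal string inside isInGroup(...), as in the rules shown in this entry, so anything that builds the group name dynamically will be missed; the function name and regular expression are my own.

```python
import re
from pathlib import Path

# Matches isInGroup("name") or isInGroup('name') with a literal group name.
GROUP_RE = re.compile(r"""isInGroup\(\s*["']([^"']+)["']\s*\)""")


def groups_in_rules(rules_dir):
    """Map each group name found in *.rules files to the files mentioning it."""
    found = {}
    for path in sorted(Path(rules_dir).glob("*.rules")):
        text = path.read_text(errors="replace")
        for group in GROUP_RE.findall(text):
            files = found.setdefault(group, [])
            if path.name not in files:
                files.append(path.name)
    return found


if __name__ == "__main__":
    for group, files in groups_in_rules("/usr/share/polkit-1/rules.d").items():
        print(group, "<-", ", ".join(files))
```

You can then compare the reported groups against your group database to see who, if anyone, is in them.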

The (early) good and bad parts of Polkit for a system administrator

By: cks

At a high level, Polkit is how a lot of things on modern Linux systems decide whether or not to let you do privileged operations. After looking into it a bit, I've wound up feeling that Polkit has both good and bad aspects from the perspective of a system administrator (especially a system administrator with multi-user Linux systems, where most of the people using them aren't supposed to have any special privileges). While I've used (desktop) Linuxes with Polkit for a while and relied on it for a certain amount of what I was doing, I've done so blindly, effectively as a normal person. This is the first I've looked at the details of Polkit, which is why I'm calling this my early reactions.

On the good side, Polkit is a single source of authorization decisions, much like PAM. On a modern Linux system, there are a steadily increasing number of programs that do privileged things, even on servers (such as systemd's run0). These could all have their own bespoke custom authorization systems, much as how sudo has its own custom one, but instead most of them have centralized on Polkit. In theory Polkit gives you a single thing to look at and a single thing to learn, rather than learning systemd's authentication system, NetworkManager's authentication system, etc. It also means that programs have less of a temptation to hard-code (some of) their authentication rules, because Polkit is very flexible.

(In many cases programs couldn't feasibly use PAM instead, because they want certain actions to be automatically authorized. For example, in its standard configuration libvirt wants everyone in group 'libvirt' to be able to issue libvirt VM management commands without constantly having to authenticate. PAM could probably be extended to do this but it would start to get complicated, partly because PAM configuration files aren't a programming language and so implementing logic in PAM gets awkward in a hurry.)

On the bad side, Polkit is a non-declarative authorization system, and a complex one with its rules not in any single place (instead they're distributed through multiple files in two different formats). Authorization decisions are normally made in (JavaScript) code, which means that they can encode essentially arbitrary logic (although there are standard forms of things). This means that the only way to know who is authorized to do a particular thing is to read its XML 'action' file and then look through all of the JavaScript code to find and then understand things that apply to it.

(Even 'who is authorized' is imprecise by default. Polkit normally allows anyone to authenticate as any administrative account, provided that they know its password and possibly other authentication information. This makes the passwords of people in group wheel or group admin very dangerous things, since anyone who can get their hands on one can probably execute any Polkit-protected action.)

This creates a situation where there's no way in Polkit to get a global overview of who is authorized to do what, or what a particular person has authorization for, since this doesn't exist in a declarative form and instead has to be determined on the fly by evaluating code. Instead you have to know what's customary, like the group that's 'administrative' for your Linux distribution (wheel or admin, typically) and what special groups (like 'libvirt') do what, or you have to read and understand all of the JavaScript and XML involved.

In other words, there's no feasible way to audit what Polkit is allowing people to do on your system. You have to trust that programs have made sensible decisions in their Polkit configuration (ones that you agree with), or run the risk of system malfunctions by turning everything off (or allowing only root to be authorized to do things).

(Not even Polkit itself can give you visibility into why a decision was made or fully predict it in advance, because the JavaScript rules have no pre-filtering to narrow down what they apply to. The only way you find out what a rule really does is invoking it. Well, invoking the function that the addRule() or addAdminRule() added to the rule stack.)

This complexity (and the resulting opacity of authorization) is probably intrinsic in Polkit's goals. I even think they made the right decision by having you write logic in JavaScript rather than try to create their own language for it. However, I do wish Polkit had a declarative subset that could express all of the simple cases, reserving JavaScript rules only for complex ones. I think this would make the overall system much easier for system administrators to understand and analyze, so that we'd have a much better idea of (and much better control over) who is authorized for what.

Brief notes on learning and adjusting Polkit on modern Linuxes

By: cks

Polkit (also, also) is a multi-faceted user level thing used to control access to privileged operations. It's probably used by various D-Bus services on your system, which you can more or less get a list of with pkaction, and there's a pkexec program that's like su and sudo. There are two reasons that you might care about Polkit on your system. First, there might be tools you want to use that use Polkit, such as systemd's run0 (which is developing some interesting options). The other is that Polkit gives people an alternate way to get access to root or other privileges on your servers and you may have opinions about that and what authentication should be required.

Unfortunately, Polkit configuration is arcane and as far as I know, there aren't really any readily accessible options for it. For instance, if you want to force people to authenticate for root-level things using the root password instead of their password, as far as I know you're going to have to write some JavaScript yourself to define a suitable Administrator identity rule. The polkit manual page seems to document what you can put in the code reasonably well, but I'm not sure how you test your new rules and some areas seem underdocumented (for example, it's not clear how 'addAdminRule()' can be used to say that the current user cannot authenticate as an administrative user at all).

(If and when I wind up needing to test rules, I will probably try to do it in a scratch virtual machine that I can blow up. Fortunately Polkit is never likely to be my only way to authenticate things.)

Polkit also has some paper cuts in its current setup. For example, as far as I can see there's no easy way to tell Polkit-using programs that you want to immediately authenticate for administrative access as yourself, rather than be offered a menu of people in group wheel (yourself included) and having to pick yourself. It's also not clear to me (and I lack a test system) if the default setup blocks people who aren't in group wheel (or group admin, depending on your Linux distribution flavour) from administrative authentication or if instead they get to pick authenticating using one of your passwords. I suspect it's the latter.

(All of this makes Polkit seem like it's not really built for multi-user Linux systems, or at least multi-user systems where not everyone is an administrator.)

PS: Now that I've looked at it, I have some issues with Polkit from the perspective of a system administrator, but those are going to be for another entry.

Sidebar: Some options for Polkit (root) authentication

If you want everyone to authenticate as root for administrative actions, I think what you want is:

polkit.addAdminRule(function(action, subject) {
    return ["unix-user:0"];
});

If you want to restrict this to people in group wheel, I think you want something like:

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // might not work to say 'no'?
        return [];
    }
});

If you want people in group wheel to authenticate as themselves, not root, I think you return 'unix-user:' + subject.user instead of 'unix-user:0'. I don't know if people still get prompted by Polkit to pick a user if there's only one possible user.

Discovering orphaned binaries in /usr/sbin on Fedora 42

By: cks

Over on the Fediverse, I shared a somewhat unwelcome discovery I made after upgrading to Fedora 42:

This is my face when I have quite a few binaries in /usr/sbin on my office Fedora desktop that aren't owned by any package. Presumably they were once owned by packages, but the packages got removed without the files being removed with them, which isn't supposed to happen.

(My office Fedora install has been around for almost 20 years now without being reinstalled, so things have had time to happen. But some of these binaries date from 2021.)

There seem to be two sorts of these lingering, unowned /usr/sbin programs. One sort, such as /usr/sbin/getcaps, seems to have been left behind when its package moved things to /usr/bin, possibly due to this RPM bug (via). The other sort is genuinely unowned programs dating to anywhere from 2007 (at the oldest) to 2021 (at the newest), which have nothing else left of them sitting around. The newest programs are what I believe are wireless management programs: iwconfig, iwevent, iwgetid, iwlist, iwpriv, and iwspy, and also "ifrename" (which I believe was also part of a 'wireless-tools' package). I had the wireless-tools package installed on my office desktop until recently, but I removed it some time during Fedora 40, probably sparked by the /sbin to /usr/sbin migration, and it's possible that binaries didn't get cleaned up properly due to that migration.
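For what it's worth, a check for such files on the current system can be sketched in a few lines of Python. This version assumes that the rpm command is on your PATH and treats a non-zero exit status from 'rpm -qf' as meaning that no package owns the file; the function names are my own, and the ownership check is passed in as a parameter so that the directory walk can be tried out without rpm.

```python
import os
import subprocess


def rpm_owns(path):
    """True if 'rpm -qf' reports an owning package (assumes rpm is installed)."""
    result = subprocess.run(
        ["rpm", "-qf", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unowned_files(directory, is_owned=rpm_owns):
    """Return paths of regular files in 'directory' that nothing owns."""
    unowned = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            # Skip symlinks and directories; only real files are of interest.
            if entry.is_file(follow_symlinks=False) and not is_owned(entry.path):
                unowned.append(entry.path)
    return unowned


if __name__ == "__main__":
    for path in unowned_files("/usr/sbin"):
        print(path)
```

This runs one rpm process per file, which is slow but simple.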

The most interesting orphan is /usr/sbin/sln, dating from 2018, when apparently various people discovered it as an orphan on their system. Unlike all the other orphan programs, the sln manual page is still shipped as part of the standard 'man-pages' package and so you can read sln(8) online. Based on the manual page, it sounds like it may have been part of glibc at one point.

(Another orphaned program from 2018 is pam_tally, although it's coupled to pam_tally2.so, which did get removed.)

I don't know if there's any good way to get mappings from files to RPM packages for old Fedora versions. If there is, I'd certainly pick through it to try to find where various of these files came from originally. Unfortunately I suspect that for sufficiently old Fedora versions, much of this information is either offline or can't be processed by modern versions of things like dnf.

(The basic information is used by eg 'dnf provides' and can be built by hand from the raw RPMs, but I have no desire to download all of the RPMs for decade-old Fedora versions even if they're still available somewhere. I'm curious but not that curious.)

PS: At the moment I'm inclined to leave everything as it is until at least Fedora 43, since RPM bugs are still being sorted out here. I'll have to clean up genuinely orphaned files at some point but I don't think there's any rush. And I'm not removing any more old packages that use '/sbin/<whatever>', since that seems like it has some bugs.

Removing Fedora's selinux-policy-targeted package is mostly harmless so far

By: cks

A while back I discussed why I might want to remove the selinux-policy-targeted RPM package for a Fedora 42 upgrade. Today, I upgraded my office workstation from Fedora 41 to Fedora 42, and as part of preparing for that upgrade I removed the selinux-policy-targeted package (and all of the packages that depended on it). The result appears to work, although there were a few things that came up during the upgrade and I may reinstall at least selinux-policy-targeted itself to get rid of them (for now).

The root issue appears to be that when I removed the selinux-policy-targeted package, I probably should have edited /etc/selinux/config to set SELINUXTYPE to some bogus value, not left it set to "targeted". For entirely sensible reasons, various packages have postinstall scripts that assume that if your SELinux configuration says your SELinux type is 'targeted', they can do things that implicitly or explicitly require things from the package or from the selinux-policy package, which got removed when I removed selinux-policy-targeted.
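A quick way to spot this mismatch is to compare what /etc/selinux/config claims against what's actually on disk. The Python sketch below parses the simple KEY=value lines of the config file and then checks for a policy directory named after SELINUXTYPE. That the policy lives in a directory of that name under /etc/selinux is my assumption about the usual layout, and the function names are made up for this example.

```python
import os


def read_selinux_config(path="/etc/selinux/config"):
    """Parse KEY=value lines, ignoring blank lines and '#' comments."""
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings


def policy_tree_present(settings, root="/etc/selinux"):
    """Assumption: the policy for type T is installed under <root>/T."""
    policy_type = settings.get("SELINUXTYPE")
    return bool(policy_type) and os.path.isdir(os.path.join(root, policy_type))


if __name__ == "__main__":
    cfg = read_selinux_config()
    print(cfg.get("SELINUXTYPE"), "present:", policy_tree_present(cfg))
```

If this reports 'targeted' but no policy directory, you're in the situation that the postinstall scripts don't expect.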

I'm not sure if my change to SELINUXTYPE will completely fix things, because I suspect that there are other assumptions about SELinux policy programs and data files being present lurking in standard, still-installed package tools and so on. Some of these standard SELinux related packages definitely can't be removed without gutting Fedora of things that are important to me, so I'll either have to live with periodic failures of postinstall scripts or put selinux-policy-targeted and some other bits back. On the whole, reinstalling selinux-policy-targeted is probably the safest way and the issue that caused me to remove it only applies during Fedora version upgrades and might anyway be fixed in Fedora 42.

What this illustrates to me is that regardless of package dependencies, SELinux is not really optional on Fedora. The Fedora environment assumes that a functioning SELinux environment is there and if it isn't, things are likely to go wrong. I can't blame Fedora for this, or for not fully capturing this in package dependencies (and Fedora did protect the selinux-policy-targeted package from being removed; I overrode that by hand, so what happens afterward is on me).

(Although I haven't checked modern versions of Fedora, I suspect that there's no official way to install Fedora without getting a SELinux policy package installed, and possibly selinux-policy-targeted specifically.)

PS: I still plan to temporarily remove selinux-policy-targeted when I upgrade my home desktop to Fedora 42. A few package postinstall glitches are better than not being able to read DNF output due to the package's spam.

Modern Linux filesystem mounts are rather complex things

By: cks

Once upon a time, Unix filesystem mounts worked by putting one inode on top of another, and this was also how they worked in very early Linux. It wasn't wrong to say that mounts were really about inodes, with the names only being used to find the inodes. This is no longer how things work in Linux (and perhaps other Unixes, but Linux is what I'm most familiar with for this). Today, I believe that filesystem mounts in Linux are best understood as namespace operations.

Each separate (unmounted) filesystem is a tree of names (a namespace). At a broad level, filesystem mounts in Linux take some name from that filesystem tree and project it on top of something in an existing namespace, generally with some properties attached to the projection. A regular conventional mount takes the root name of the new filesystem and puts the whole tree somewhere, but for a long time Linux's bind mounts took some other name in the filesystem as their starting point (what we could call the root inode of the mount). In modern Linux, there can also be multiple mount namespaces in existence at one time, with different contents and properties. A filesystem mount does not necessarily appear in all of them, and different things can be mounted at the same spot in the tree of names in different mount namespaces.

(Some mount properties are still global to the filesystem as a whole, while other mount properties are specific to a particular mount. See mount(2) for a discussion of general mount properties. I don't know if there's a mechanism to handle filesystem specific mount properties on a per mount basis.)

This can't really be implemented with an inode-based view of mounts. You can somewhat implement traditional Linux bind mounts with an inode based approach, but mount namespaces have to be separate from the underlying inodes. At a minimum a mount point must be a pair of 'this inode in this namespace has something on top of it', instead of just 'this inode has something on top of it'.

(A pure inode based approach has problems going up the directory tree even in old bind mounts, because the parent directory of a particular directory depends on how you got to the directory. If /usr/share is part of /usr and you bind mounted /usr/share to /a/b, the value of '..' depends on if you're looking at '/usr/share/..' or '/a/b/..', even though /usr/share and /a/b are the same inode in the /usr filesystem.)

If I'm reading manual pages correctly, Linux still normally requires the initial mount of any particular filesystem be of its root name (its true root inode). Only after that initial mount is made can you make bind mounts to pull out some subset of its tree of names and then unmount the original full filesystem mount. I believe that a particular filesystem can provide ways to sidestep this with a filesystem specific mount option, such as btrfs's subvol= mount option that's covered in the btrfs(5) manual page (or 'btrfs subvolume set-default').

We don't update kernels without immediately rebooting the machine

By: cks

I've mentioned this before in passing (cf, also) but today I feel like saying it explicitly: our habit with all of our machines is to never apply a kernel update without immediately rebooting the machine into the new kernel. On our Ubuntu machines this is done by holding the relevant kernel packages; on my Fedora desktops I normally run 'dnf update --exclude "kernel*"' unless I'm willing to reboot on the spot.

The obvious reason for this is that we want to switch to the new kernel under controlled, attended conditions when we'll be able to take immediate action if something is wrong, rather than possibly have the new kernel activate at some random time without us present and paying attention if there's a power failure, a kernel panic, or whatever. This is especially acute on my desktops, where I use ZFS by building my own OpenZFS packages and kernel modules. If something goes wrong and the kernel modules don't load or don't work right, an unattended reboot can leave my desktops completely unusable and off the network until I can get to them. I'd rather avoid that if possible (sometimes it isn't).

(In general I prefer to reboot my Fedora machines with me present because weird things happen from time to time and sometimes I make mistakes, also.)

The less obvious reason is that when you reboot a machine right after applying a kernel update, it's clear in your mind that the machine has switched to a new kernel. If there are system problems in the days immediately after the update, you're relatively likely to remember this and at least consider the possibility that the new kernel is involved. If you apply a kernel update, walk away without rebooting, and the machine reboots a week and a half later for some unrelated reason, you may not remember that one of the things the reboot did was switch to a new kernel.

(Kernels aren't the only thing that this can happen with, since not all system updates and changes take effect immediately when made or applied. Perhaps one should reboot after making them, too.)

I'm assuming here that your Linux distribution's package management system is sensible, so there's no risk of losing old kernels (especially the one you're currently running) merely because you installed some new ones but didn't reboot into them. This is how Debian and Ubuntu behave (if you don't 'apt autoremove' kernels), but not quite how Fedora's dnf does it (as far as I know). Fedora dnf keeps the N most recent kernels around and probably doesn't let you remove the currently running kernel even if it's more than N kernels old, but I don't believe it tracks whether or not you've rebooted into those N kernels and stretches the N out if you haven't (or removes more recent installed kernels that you've never rebooted into, instead of older kernels that you did use at one point).

PS: Of course if kernel updates were perfect this wouldn't matter. However this isn't something you can assume for the Linux kernel (especially as patched by your distribution), as we've sometimes seen, although big issues like that are relatively uncommon.

Restarting or redoing something after a systemd service restarts

By: cks

Suppose, not hypothetically, that your system is running some systemd based service or daemon that resets or erases your carefully cultivated state when it restarts. One example is systemd-networkd, although you can turn that off (or parts of it off, at least), but there are likely others. To clean up after this happens, you'd like to automatically restart or redo something after a systemd unit is restarted. Systemd supports this, but I found it slightly unclear how you want to do this and today I poked at it, so it's time for notes.

(This is somewhat different from triggering one unit when another unit becomes active, which I think is still not possible in general.)

First, you need to put whatever you want to do into a script and a .service unit that will run the script. The traditional way to run a script through a .service unit is:

[Unit]
....

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/your/script/here

[Install]
WantedBy=multi-user.target

(The 'RemainAfterExit' is load-bearing, also.)

To get this unit to run after another unit is started or restarted, what you need is PartOf=, which causes your unit to be stopped and started when the other unit is, along with 'After=' so that your unit starts after the other unit instead of racing it (which could be counterproductive when what you want to do is fix up something from the other unit). So you add:

[Unit]
...
PartOf=systemd-networkd.service
After=systemd-networkd.service

(This is what works for me in light testing. This assumes that the unit you want to re-run after is normally always running, as systemd-networkd is.)

In testing, you don't need to have your unit specifically enabled by itself, although you may want it to be for clarity and other reasons. Even if your unit isn't specifically enabled, systemd will start it after the other unit because of the PartOf=. If the other unit is started all of the time (as is usually the case for systemd-networkd), this effectively makes your unit enabled, although not in an obvious way (which is why I think you should specifically 'systemctl enable' it, to make it obvious). I think you can have your .service unit enabled and active without having the other unit enabled, or even present.

You can declare yourself PartOf a .target unit, and some stock package systemd units do for various services. And a .target unit can be PartOf a .service; on Fedora, 'sshd-keygen.target' is PartOf sshd.service in a surprisingly clever little arrangement to generate only the necessary keys through a templated 'sshd-keygen@.service' unit.

I admit that the whole collection of Wants=, Requires=, Requisite=, BindsTo=, PartOf=, Upholds=, and so on is somewhat confusing to me. In the past, I've used the wrong version and suffered the consequences, and I'm not sure I have them entirely right in this entry.

Note that as far as I know, PartOf= has those Requires= consequences, where if the other unit is stopped, yours will be too. In a simple 'run a script after the other unit starts' situation, stopping your unit does nothing and can be ignored.

(If this seems complicated, well, I think it is, and I think one part of the complication is that we're trying to use systemd as an event-based system when it isn't one.)

Systemd-resolved's new 'DNS Server Delegation' feature (as of systemd 258)

By: cks

A while ago I wrote an entry about things that resolved wasn't for as of systemd 251. One of those things was arbitrary mappings of (DNS) names to DNS servers, for example if you always wanted *.internal.example.org to query a special DNS server. Systemd-resolved didn't have a direct feature for this and attempting to attach your DNS names to DNS server mappings to a network interface could go wrong in various ways. Well, time marches on and as of systemd v258 this is no longer the state of affairs.

Systemd v258 introduces systemd.dns-delegate files, which allow you to map DNS names to DNS servers independently from network interfaces. The release notes describe this as:

A new DNS "delegate zone" concept has been introduced, which are additional lookup scopes (on top of the existing per-interface and the one global scope so far supported in resolved), which carry one or more DNS server addresses and a DNS search/routing domain. It allows routing requests to specific domains to specific servers. Delegate zones can be configured via drop-ins below /etc/systemd/dns-delegate.d/*.dns-delegate.

Since systemd v258 is very new I don't have any machines where I can actually try this out, but based on the systemd.dns-delegate documentation, you can use this both for domains that you merely want diverted to some DNS server and also domains that you also want on your search path. Per resolved.conf's Domains= documentation, the latter is 'Domains=example.org' (example.org will be one of the domains that resolved tries to find single-label hostnames in, a search domain), and the former is 'Domains=~example.org' (where we merely send queries for everything under 'example.org' off to whatever DNS= you set, a route-only domain).
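To make the search domain versus route-only domain distinction concrete, here is a small Python sketch that sorts the entries of a 'Domains=' value into the two kinds, based purely on the '~' prefix convention described above. It is an illustration, not how resolved itself parses things, and the function name classify_domains is made up.

```python
def classify_domains(value: str) -> dict[str, list[str]]:
    """Split a Domains= value into search domains and route-only domains.

    Entries are whitespace separated; a leading '~' marks a route-only
    domain, and anything else is a search domain (which also routes).
    """
    result: dict[str, list[str]] = {"search": [], "route_only": []}
    for item in value.split():
        if item.startswith("~"):
            result["route_only"].append(item[1:])
        else:
            result["search"].append(item)
    return result


# Hypothetical setting: search in example.org, merely route internal.example.org.
kinds = classify_domains("example.org ~internal.example.org")
```

With this hypothetical value, 'example.org' is used to complete single-label hostnames while queries under 'internal.example.org' are only diverted to the configured DNS servers.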

(While resolved.conf's Domains= officially promises to check your search domains in the order you listed them, I believe this is strictly for a single 'Domains=' setting for a single interface. If you have multiple 'Domains=' settings, for example in a global resolved.conf, a network interface, and now in a delegation, I think systemd-resolved makes no promises.)

Right now, these DNS server delegations can only be set through static files, not manipulated through resolvectl. I believe fiddling with them through resolvectl is on the roadmap, but for now I guess we get to restart resolved if we need to change things. In fact resolvectl doesn't expose anything to do with them, although I believe read-only information is available via D-Bus and maybe varlink.

Given the timing of systemd v258's release relative to Fedora releases, I probably won't be able to use this feature until Fedora 44 in the spring (Fedora 42 is current and Fedora 43 is imminent, which won't have systemd v258 given that v258 was released only a couple of weeks ago). My current systemd-resolved setup is okay (if it wasn't I'd be doing something else), but I can probably find uses for these delegations to improve it.

FreeBSD vs. SmartOS: Who's Faster for Jails, Zones, and bhyve VMs?


Disclaimer
These benchmarks were performed on my specific hardware and tuned for the workloads I expect to run.
They should not be taken as absolute or universally applicable results.
Different CPUs, storage, networking setups, or workload profiles could produce very different outcomes.
What I’m sharing here is a faithful snapshot of my test environment and use case - a guidepost, not a final verdict.

Years ago, I installed a PCEngines APU at a client's site. It dutifully ran Proxmox with a few small VMs inside. It wasn't a speed demon, but it got the job done. Tasked with running in a closed, uncooled, and unsupervised server closet, it soldiered on for about seven years.

Then, while I was at BSDCan, I got the call. A series of power outages and surges had finally taken their toll, and the APU was dead. It was probably just the power supply, but given its age, we decided it was time for a replacement. I set up a remote bypass to keep them running, but I knew I'd need to install something more powerful soon.

I ordered a modern MiniPC based on the low-power Intel Processor N150 platform, but with 16GB of RAM and more than enough performance to serve as a decent workstation. I have a similar one in my office running openSUSE Tumbleweed, and it works beautifully.

This time, however, I decided to replace Proxmox with a different virtualization system. This decision wasn't made in a vacuum. In the past, I've put bhyve head-to-head with Proxmox, and my findings were clear: bhyve on FreeBSD is an extremely efficient hypervisor, often outperforming KVM on Proxmox in my tests.

This positive experience is what made FreeBSD with bhyve a top contender. The other path was a KVM-style approach (which would require fewer changes to the VMs), where my options would be NetBSD or an illumos-based OS like SmartOS. Since I had the new hardware on hand, I decided to run some tests to see how these different technologies stacked up against each other, and against the bare metal itself.

The Lineup: What I Put on the Test Bench

My goal was to test every reasonable option on this Intel N150 hardware. The final lineup covered the entire spectrum:

  • The Baseline:
    • FreeBSD 14.3-RELEASE Bare Metal: The ground truth for performance on this hardware.
  • OS-Level Virtualization (Containers):
    • SmartOS Native Zone: The baseline native container on SmartOS.
    • SmartOS LX Zone: Running Ubuntu 24.04 and Alpine Linux.
    • FreeBSD Native Jail: The baseline native container on FreeBSD.
    • FreeBSD Jail with Linux: A jail running an Ubuntu 22.04 userland.
  • Full Hardware Virtualization (HVM):
    • SmartOS bhyve Zone: A FreeBSD guest inside the bhyve hypervisor on a SmartOS host.
    • SmartOS KVM Zone: A FreeBSD guest inside the KVM hypervisor on a SmartOS host.
    • FreeBSD bhyve VM: A FreeBSD guest inside the bhyve hypervisor on a FreeBSD host.

The Benchmark: My sysbench Commands

To keep the comparison fair and simple, I used two core sysbench commands. To ensure consistency, I even compiled sysbench from scratch on the SmartOS native zone to match the versions and compile options on the other systems as closely as possible.

The commands I used in each environment were:

  • For CPU performance: sysbench --test=cpu --cpu-max-prime=20000 run
  • For memory performance: sysbench --test=memory run

First Look: CPU and Memory on the Intel N150

My initial tests on the Intel N150 hardware immediately revealed some interesting trends. The sysbench CPU results from any native FreeBSD environment (bare metal or jail) were on a completely different scale from the Linux and SmartOS guests, making a direct comparison meaningless.

However, by excluding the incompatible FreeBSD-native results, we get a very clear picture of the overhead between the various container technologies.

Valid CPU Performance Comparison (Single Thread, Intel N150)

| Host OS | Container Tech | Guest OS | CPU Performance (Events/sec) |
|---|---|---|---|
| FreeBSD | Jail (OS-level) | Ubuntu 22.04 | 1108.18 |
| SmartOS | LX Zone (OS-level) | Ubuntu 24.04 | 1107.13 |
| SmartOS | Native Zone (OS-level) | SmartOS | 1107.04 |
| SmartOS | LX Zone (OS-level) | Alpine Linux | 1022.81 |

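The takeaway here was clear: for CPU work, the overhead from these containers is basically a rounding error. The Ubuntu and native SmartOS results are within about 0.1% of each other; only the Alpine LX zone trails, by roughly 8%. For CPU-bound tasks, neither SmartOS Zones nor FreeBSD Jails will be a bottleneck.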

The memory results, which were consistent across all platforms, were far more revealing.

Overall Memory Performance Comparison (Intel Processor N150)

| Host OS | Virtualization Tech | Guest OS | Memory Performance (Transfer Rate) |
|---|---|---|---|
| SmartOS | LX Zone (OS-level) | Ubuntu 24.04 | 4970.54 MiB/sec |
| SmartOS | Native Zone (OS-level) | SmartOS (Native) | 4549.97 MiB/sec |
| FreeBSD | Jail (OS-level) | Ubuntu 22.04 | 4348.32 MiB/sec |
| FreeBSD | Bare Metal | FreeBSD (Native) | 4005.08 MiB/sec |
| FreeBSD | Native Jail (OS-level) | FreeBSD (Native) | 3990.13 MiB/sec |
| SmartOS | LX Zone (OS-level) | Alpine Linux | 3803.72 MiB/sec |
| FreeBSD | bhyve VM (Full HVM) | FreeBSD | 3636.01 MiB/sec |
| SmartOS | bhyve Zone (Full HVM) | FreeBSD | 3020.15 MiB/sec |
| SmartOS | KVM Zone (Full HVM) | FreeBSD | 205.18 MiB/sec |

These initial numbers led to a few conclusions: a virtual layer could be a performance boost, the userland matters, and bhyve clearly outclassed the legacy KVM on SmartOS. However, one result was nagging at me: the performance gap between FreeBSD bare metal (4005.08 MiB/sec) and a native bhyve VM (3636.01 MiB/sec) was about 9%. This was a larger drop than I expected. It prompted a new question: was this overhead inherent to bhyve, or was it a quirk of the new N150 hardware?

Going Deeper: Testing on an Intel i7-7500U

To see if more mature, better-supported hardware would tell a different story, I replicated the FreeBSD tests on an older Qotom Mini-PC powered by an Intel i7-7500U. The results were illuminating and dramatically changed the narrative.

CPU Performance Comparison (Intel i7-7500U)

Once again, the CPU tests produced strange results. The native FreeBSD environments all reported incredibly high numbers in the millions of events/sec, while the Ubuntu Linuxulator jail's result was on a completely different, incompatible scale. Frankly, given the massive discrepancy between FreeBSD-native and Linux-based environments, I'm unsure that the sysbench CPU figures can be considered totally reliable in absolute terms.

However, what is useful is comparing the native FreeBSD results against each other. This tells us about relative overhead.

| Platform | CPU Performance (Events/sec) | Overhead vs. Bare Metal |
|---|---|---|
| FreeBSD Bare Metal | 6,377,778 | Baseline |
| FreeBSD Native Jail | 6,379,271 | ~0.0% |
| FreeBSD bhyve VM | 6,346,852 | -0.48% |

Even if we're skeptical of the absolute numbers, the relative comparison is crystal clear: the CPU overhead of bhyve is less than half a percent. This is the key takeaway.

Memory Performance Comparison (Intel i7-7500U)

The memory benchmarks, in contrast, were consistent and highly informative.

| Platform | Memory Performance (Transfer Rate) | Overhead vs. Bare Metal |
|---|---|---|
| Ubuntu 22.04 Jail | 4856.23 MiB/sec | +7.55% |
| FreeBSD Native Jail | 4517.73 MiB/sec | +0.05% |
| FreeBSD Bare Metal | 4515.24 MiB/sec | Baseline |
| FreeBSD bhyve VM | 4491.60 MiB/sec | -0.52% |

This is where the real story is. The memory performance of a bhyve VM was a mere 0.52% slower than bare metal. This is the kind of near-native performance one hopes for from a top-tier hypervisor and stands in stark contrast to the 9% drop seen on the newer N150.
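For anyone who wants to check the percentages, the overhead figures are simply the relative difference from the bare-metal baseline. A minimal Python sketch using the memory transfer rates reported in this article (the function name overhead_percent is just an illustrative choice):

```python
def overhead_percent(measured: float, baseline: float) -> float:
    """Relative difference from the baseline, in percent.

    Negative values mean the measured environment was slower than bare metal.
    """
    return (measured - baseline) / baseline * 100.0


# Memory transfer rates in MiB/sec, copied from the tables in this article.
i7_bare_metal = 4515.24
i7_bhyve_vm = 4491.60
n150_bare_metal = 4005.08
n150_bhyve_vm = 3636.01

i7_overhead = overhead_percent(i7_bhyve_vm, i7_bare_metal)
n150_overhead = overhead_percent(n150_bhyve_vm, n150_bare_metal)

print(f"i7-7500U bhyve VM: {i7_overhead:+.2f}%")
print(f"N150 bhyve VM:     {n150_overhead:+.2f}%")
```

This works out to roughly half a percent of overhead on the i7-7500U and a bit over nine percent on the N150, matching the figures quoted in the text.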

Breaking Down the Results: What I Learned From Both Tests

This comprehensive two-platform analysis paints a much clearer picture.

1. Hardware Really Matters
Performance is not an absolute. The difference between the two platforms was stark: on the mature i7-7500U, bhyve’s overhead was less than 1%, while on the newer, budget N150, it was a more significant 9%. This suggests the performance dip is likely due to missing optimizations for that specific CPU architecture, rather than a fundamental flaw in bhyve itself.

2. bhyve's True Potential is Near-Native Speed
The i7 tests prove that bhyve is an exceptionally efficient hypervisor on well-supported hardware. The relative CPU overhead was a negligible -0.48%, and more importantly, the reliable memory benchmarks showed a performance drop of just 0.52% compared to bare metal. This is the gold standard for virtualization.

3. FreeBSD Jails are Feather-Light
On both platforms, native FreeBSD jails demonstrated almost zero performance overhead. On the i7, both CPU and memory performance were virtually identical to bare metal (a 0.05% difference). The N150 CPU tests further showed that FreeBSD's container implementation is so efficient that running a Linux userland inside a jail delivered the best CPU scores of the entire lineup.

4. SmartOS Zones Are Also Extremely Efficient
Just like Jails, SmartOS's native Zones proved to be remarkably lightweight. The N150 CPU tests confirm this, showing that native and LX zones have virtually identical, top-tier performance. On the memory front, the native Zone delivered performance over 13% faster than the FreeBSD bare-metal baseline, pointing to the high efficiency of the illumos kernel.

5. The Linux Userland Excels at Throughput
A clear pattern emerged on both testbeds: the Ubuntu userland consistently delivered excellent benchmark results. On the CPU front, Ubuntu on both FreeBSD and SmartOS delivered the highest, and nearly identical, performance scores on the N150. For memory, the story was even more dramatic: the Ubuntu LX Zone on SmartOS was the top performer, beating bare-metal FreeBSD by nearly 25%, while the Ubuntu jail on the i7 also surpassed its host by over 7%.

Final Thoughts: The Verdict for My Client's New Server

So, what's the bottom line for my client's new MiniPC? This benchmarking journey has made the path forward much clearer.

At the beginning of this process, my main question was whether to stick with a KVM-based setup or make the switch to bhyve. The performance data answers that decisively. The legacy KVM on SmartOS showed a crippling performance penalty, making it a non-starter. Given that, the extra effort to migrate the existing VMs to a bhyve-compatible format is absolutely worth it. The performance gain is just too significant to ignore.

The final question, then, is which host OS to use for bhyve: SmartOS or FreeBSD? This is a much tougher call, as both platforms demonstrated incredible strengths.

SmartOS, powered by the illumos kernel, was a true surprise. It delivered astonishing performance on the target N150 hardware. Its key advantage is the raw speed of its containerization for both CPU and memory tasks. The Ubuntu LX Zone not only ran flawlessly but delivered top-tier CPU scores and outperformed the bare-metal FreeBSD baseline in memory by roughly 24% (4970.54 versus 4005.08 MiB/sec). This points to a highly efficient kernel and offers the tantalizing prospect of running ultra-fast Linux containers alongside performant bhyve VMs on the same host.

On the other hand, FreeBSD proved its mastery of bhyve virtualization. The tests on the i7 hardware showed its implementation to be the gold standard, offering virtually zero performance overhead for full hardware virtualization. Its native Jails are equally efficient, and its Linux compatibility layer is so effective that an Ubuntu jail delivered the fastest CPU performance of all containers tested on FreeBSD. For workloads that must live in a full VM, FreeBSD offers the most performant and native bhyve experience, with the reasonable expectation that its support for newer hardware like the N150 will only improve over time.

Ultimately, the choice comes down to the primary workload. It's a decision between the raw container speed and Linux flexibility of SmartOS versus the pure, uncompromising HVM performance of FreeBSD.

But one thing is certain: thanks to this deep dive, the path forward is much clearer, and it's paved by bhyve.

These days, systemd can be a cause of restrictions on daemons

By: cks

One of the traditional rites of passage for Linux system administrators is having a daemon not work in the normal system configuration (eg, when you boot the system) but work when you manually run it as root. The classical cause of this on Unix was that $PATH wasn't fully set in the environment the daemon was running in but was in your root shell. On Linux, another traditional cause of this sort of thing has been SELinux and a more modern source (on Ubuntu) has sometimes been AppArmor. All of these create hard to see differences between your root shell (where the daemon works when run by hand) and the normal system environment (where the daemon doesn't work). These days, we can add another cause, an increasingly common one, and that is systemd service unit restrictions, many of which are covered in systemd.exec.

(One pernicious aspect of systemd as a cause of these restrictions is that they can appear in new releases of the same distribution. If a daemon has been running happily in an older release and now has surprise issues in a new Ubuntu LTS, I don't always remember to look at its .service file.)

Some of systemd's protective directives simply cause failures to do things, like access user home directories if ProtectHome= is set to something appropriate. Hopefully your daemon complains loudly here, reporting mysterious 'permission denied' or 'file not found' errors. Some systemd settings can have additional, confusing effects, like PrivateTmp=. A standard thing I do when troubleshooting a chain of programs executing programs executing programs is to shim in diagnostics that dump information to /tmp, but with PrivateTmp= on, my debugging dump files are mysteriously not there in the system-wide /tmp.

(On the other hand, a daemon may not complain about missing files if it's expected that the files aren't always there. A mailer usually can't really tell the difference between 'no one has .forward files' and 'I'm mysteriously not able to see people's home directories to find .forward files in them'.)

Sometimes you don't get explicit errors, just mysterious failures to do some things. For example, you might set IP address access restrictions with the intention of blocking inbound connections but wind up also blocking DNS queries (and this will also depend on whether or not you use systemd-resolved). The good news is that you're mostly not going to find standard systemd .service files for normal daemons shipped by your Linux distribution with IP address restrictions. The bad news is that at some point .service files may start showing up that impose IP address restrictions with the assumption that DNS resolution is being done via systemd-resolved as opposed to direct DNS queries.

(I expect some Linux distributions to resist this, for example Debian, but others may declare that using systemd-resolved is now mandatory in order to simplify things and let them harden service configurations.)

Right now, you can usually test if this is the problem by creating a version of the daemon's .service file with any systemd restrictions stripped out of it and then seeing if using that version makes life happy. In the future it's possible that some daemons will assume and require some systemd restrictions (for instance, assuming that they have a /tmp all of their own), making things harder to test.
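Before stripping anything out, it can help to see which restrictions a unit actually sets. The following Python sketch scans the text of a .service file for settings in its [Service] section that look like sandboxing or restriction directives. The list of prefixes is my own non-exhaustive selection of directive names from systemd.exec and systemd.resource-control, and the function name find_restrictions is made up; treat it as a starting point rather than a complete audit.

```python
# Non-exhaustive, hand-picked prefixes of systemd directives that restrict
# what a service can see or do (an assumption of this sketch, not a
# canonical list from systemd).
RESTRICTION_PREFIXES = (
    "Protect",            # ProtectHome=, ProtectSystem=, ...
    "Private",            # PrivateTmp=, PrivateDevices=, ...
    "Restrict",           # RestrictAddressFamilies=, ...
    "IPAddress",          # IPAddressAllow=, IPAddressDeny=
    "ReadOnlyPaths",
    "InaccessiblePaths",
    "NoNewPrivileges",
    "SystemCallFilter",
    "CapabilityBoundingSet",
)


def find_restrictions(unit_text: str) -> list[tuple[str, str]]:
    """Return (directive, value) pairs for restriction-like [Service] settings."""
    found = []
    in_service = False
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("["):
            in_service = line == "[Service]"
            continue
        if in_service and "=" in line:
            key, _, value = line.partition("=")
            key = key.strip()
            if key.startswith(RESTRICTION_PREFIXES):
                found.append((key, value.strip()))
    return found


# A made-up unit for illustration.
example_unit = """\
[Unit]
Description=Example daemon

[Service]
ExecStart=/usr/sbin/exampled
PrivateTmp=true
ProtectHome=read-only

[Install]
WantedBy=multi-user.target
"""
restrictions = find_restrictions(example_unit)
```

In practice you could feed this the output of 'systemctl cat <unit>', which also shows any drop-in files; the '# /path' header lines it prints are skipped as comments.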

Some stuff on how Linux consoles interact with the mouse

By: cks

On at least x86 PCs, Linux text consoles ('TTY' consoles or 'virtual consoles') support some surprising things. One of them is doing some useful stuff with your mouse, if you run an additional daemon such as gpm or the more modern consolation. This is supported on both framebuffer consoles and old 'VGA' text consoles. The experience is fairly straightforward; you install and activate one of the daemons, and afterward you can wave your mouse around, select and paste text, and so on. How it works and what you get is not as clear, and since I recently went diving into this area for reasons, I'm going to write down what I now know before I forget it (with a focus on how consolation works).

The quick summary is that the console TTY's mouse support is broadly like a terminal emulator. With a mouse daemon active, the TTY will do "copy and paste" selection stuff on its own. A mouse aware text mode program can put the console into a mode where mouse button presses are passed through to the program, just as happens in xterm or other terminal emulators.

The simplest TTY mode is when a non-mouse-aware program or shell is active, which is to say a program that wouldn't try to intercept mouse actions itself if it was run in a regular terminal window and would leave mouse stuff up to the terminal emulator. In this mode, your mouse daemon reads mouse input events and then uses sub-options of the TIOCLINUX ioctl to inject activities into the TTY, for example telling it to 'select' some text and then asking it to paste that selection to some file descriptor (normally the console itself, which delivers it to whatever foreground program is taking terminal input at the time).

(In theory you can use the mouse to scroll text back and forth, but in practice that was removed in 2020, both for the framebuffer console and for the VGA console. If I'm reading the code correctly, a VGA console might still have a little bit of scrollback support depending on how much spare VGA RAM you have for your VGA console size. But you're probably not using a VGA console any more.)

The other mode the console TTY can be in is one where some program has used standard xterm-derived escape sequences to ask for xterm-compatible "mouse tracking", which is the same thing it might ask for in a terminal emulator if it wanted to handle the mouse itself. What this does in the kernel TTY console driver is set a flag that your mouse daemon can query with TIOCL_GETMOUSEREPORTING; the kernel TTY driver still doesn't directly handle or look at mouse events. Instead, consolation (or gpm) reads the flag and, when the flag is set, uses the TIOCL_SELMOUSEREPORT sub-sub-option to TIOCLINUX's TIOCL_SETSEL sub-option to report the mouse position and button presses to the kernel (instead of handling mouse activity itself). The kernel then turns around and sends mouse reporting escape codes to the TTY, as the program asked for.
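From the program's side, this looks the same as it does in xterm. As a rough Python sketch (with made-up names, and assuming the classic xterm 'normal tracking' mode 1000 and its original three-byte report format, which is what I believe the console emulates), a program turns mouse tracking on by writing an escape sequence and then decodes the reports that arrive on its input:

```python
from typing import Optional, Tuple

# Standard xterm private mode 1000: report button press and release.
ENABLE_MOUSE_TRACKING = "\x1b[?1000h"
DISABLE_MOUSE_TRACKING = "\x1b[?1000l"


def decode_mouse_report(data: bytes) -> Optional[Tuple[int, int, int]]:
    """Decode a classic xterm mouse report of the form ESC [ M Cb Cx Cy.

    Each of the three trailing bytes is the real value plus 32, and the
    coordinates are 1-based. Returns (button_code, column, row), or None
    if the data isn't a report in this format.
    """
    if len(data) != 6 or not data.startswith(b"\x1b[M"):
        return None
    button_code = data[3] - 32  # 0, 1, 2 = buttons 1-3; 3 = release
    column = data[4] - 32
    row = data[5] - 32
    return button_code, column, row


# Button 1 pressed in the top left character cell.
report = decode_mouse_report(b"\x1b[M !!")
```

A real program would write ENABLE_MOUSE_TRACKING to the terminal, put the TTY into raw mode to read the reports, and write DISABLE_MOUSE_TRACKING before exiting; all of that is omitted here. Other, newer xterm report encodings exist and this sketch doesn't attempt to handle them.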

(As I discovered, we got a CVE this year related to this, where the kernel let too many people trigger sending programs 'mouse' events. See the stable kernel commit message for details.)

A mouse daemon like consolation doesn't have to pay attention to the kernel's TTY 'mouse reporting' flag. As far as I can tell from the current Linux kernel code, if the mouse daemon ignores the flag it can keep on doing all of its regular copy and paste selection and mouse button handling. However, sending mouse reports is only possible when a program has specifically asked for it; the kernel will report an error if you ask it to send a mouse report at the wrong time.

(As far as I can see there's no notification from the kernel to your mouse daemon that someone changed the 'mouse reporting' flag. Instead you have to poll it; it appears consolation does this every time through its event loop before it handles any mouse events.)

PS: Some documentation on console mouse reporting was written as a 2020 kernel documentation patch (alternate version) but it doesn't seem to have made it into the tree. According to various sources, eg, the mouse daemon side of things can only be used by actual mouse daemons, not by programs, although programs do sometimes use other bits of TIOCLINUX's mouse stuff.

PPS: It's useful to install a mouse daemon on your desktop or laptop even if you don't intend to ever use the text TTY. If you ever wind up in the text TTY for some reason, perhaps because your regular display environment has exploded, having mouse cut and paste is a lot nicer than not having it.

My Fedora machines need a cleanup of their /usr/sbin for Fedora 42

By: cks

One of the things that Fedora is trying to do in Fedora 42 is unifying /usr/bin and /usr/sbin. In an ideal (Fedora) world, your Fedora machines will have /usr/sbin be a symbolic link to /usr/bin after they're upgraded to Fedora 42. However, if your Fedora machines have been around for a while, or perhaps have some third party packages installed, what you'll actually wind up with is a /usr/sbin that is mostly symbolic links to /usr/bin but still has some actual programs left.
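A quick way to see what state a machine is in is to list the things in /usr/sbin that are still real files instead of symbolic links. This is a minimal Python sketch of that check; the function name leftover_programs is made up, and it only reports what is there, not which package (if any) owns each file.

```python
import os


def leftover_programs(sbin_dir: str = "/usr/sbin") -> list[str]:
    """List regular files in sbin_dir that are not symbolic links.

    If sbin_dir is itself a symbolic link (the fully merged case),
    there is nothing left over, so return an empty list.
    """
    if os.path.islink(sbin_dir):
        return []
    return sorted(
        entry.name
        for entry in os.scandir(sbin_dir)
        if entry.is_file(follow_symlinks=False)
    )


if __name__ == "__main__" and os.path.isdir("/usr/sbin"):
    for name in leftover_programs():
        print(name)
```

Each name this prints is a candidate for 'rpm -qf /usr/sbin/<name>' to find out where it came from.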

One source of these remaining /usr/sbin programs is old packages from past versions of Fedora that are no longer packaged in Fedora 41 and Fedora 42. Old packages are usually harmless, so it's easy for them to linger around if you're not disciplined; my home and office desktops (which have been around for a while) still have packages from as far back as Fedora 28.

(An added complication of tracking down file ownership is that some RPMs haven't been updated for the /sbin to /usr/sbin merge and so still believe that their files are /sbin/<whatever> instead of /usr/sbin/<whatever>. A 'rpm -qf /usr/sbin/<whatever>' won't find these.)
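As a rough sketch of the hunt described above, the following lists what is still a real file in /usr/sbin and asks RPM about each leftover under both the new and the old path. The helper names `sbin_leftovers` and `sbin_owner` are made up for illustration, and it assumes that querying the old /sbin path is what finds the RPMs that haven't been updated.

```shell
# List regular files (not symbolic links) directly inside a directory.
# On a fully merged Fedora 42 system, /usr/sbin should ideally have none.
# 'sbin_leftovers' is a hypothetical helper name.
sbin_leftovers() {
    find "${1:-/usr/sbin}" -maxdepth 1 -type f | sort
}

# Ask RPM which package owns a leftover program, trying /usr/sbin/<name>
# first and then /sbin/<name> for RPMs that still record the old path.
# 'sbin_owner' is a hypothetical helper name.
sbin_owner() {
    for _p in "/usr/sbin/$1" "/sbin/$1"; do
        if _o=$(rpm -qf "$_p" 2>/dev/null); then
            printf '%s\n' "$_o"
            return 0
        fi
    done
    return 1
}
```

Something like 'for f in $(sbin_leftovers); do sbin_owner "${f##*/}" || echo "unowned: $f"; done' would then give a first pass at which packages (if any) to look at.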

Obviously, you shouldn't remove old packages without being sure of whether or not they're important to you. I'm also not completely sure that all packages in the Fedora 41 (or 42) repositories are marked as '.fc41' or '.fc42' in their RPM versions, or if there are some RPMs that have been carried over from previous Fedora versions. Possibly this means I should wait until a few more Fedora versions have come to pass so that other people find and fix the exceptions.

(On what is probably my cleanest Fedora 42 test virtual machine, there are a number of packages that 'dnf list --extras' doesn't list that have '.fc41' in their RPM version. Some of them may have been retained un-rebuilt for binary compatibility reasons. There's also the 'shim' UEFI bootloaders, which date from 2024 and don't have Fedora releases in their RPM versions, but those I expect to basically never change once created. But some others are a bit mysterious, such as 'libblkio', and I suspect that they may have simply been missed by the Fedora 42 mass rebuild.)
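One way to get a list like this is to filter the installed packages on their RPM release field. This is only a sketch; the helper name `not_dist_tagged` is invented here, and as noted above, lacking the current release tag doesn't by itself mean a package is stale.

```shell
# Read "NAME RELEASE" lines on stdin and print the names of packages
# whose release field does not contain the given dist tag (eg 'fc42').
# 'not_dist_tagged' is a hypothetical helper name.
not_dist_tagged() {
    awk -v tag=".$1" 'index($2, tag) == 0 { print $1 }'
}

# Usage on a real system:
#   rpm -qa --qf '%{NAME} %{RELEASE}\n' | not_dist_tagged fc42 | sort
```

This will also report things that never carry a Fedora release tag at all, so the output needs to be read with some care.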

PS: In theory anyone with access to the full Fedora 42 RPM repository could sweep the entire thing to find packages that still install /usr/sbin files or even /sbin files, which would turn up any relevant not yet rebuilt packages. I don't know if there's any easy way to do this through dnf commands, although I think dnf does have access to a full file list for all packages (which is used for certain dnf queries).

My machines versus the Fedora selinux-policy-targeted package

By: cks

I upgrade Fedora on my office and home workstations through an online upgrade with dnf, and as part of this I read (or at least scan) DNF's output to look for problems. Usually this goes okay, but DNF5 has a general problem with script output and when I did a test upgrade from Fedora 41 to Fedora 42 on a virtual machine, it generated a huge amount of repeated output from a script run by selinux-policy-targeted, repeatedly reporting "Old compiled fcontext format, skipping" for various .bin files in /etc/selinux/targeted/contexts/files. The volume of output made the rest of DNF's output essentially unreadable. I would like to avoid this when I actually upgrade my office and home workstations to Fedora 42 (which I still haven't done, partly because of this issue).

(You can't make this output easier to read because DNF5 is too smart for you. This particular error message reportedly comes from 'semodule -B', per this Fedora discussion.)

The 'targeted' policy is one of several SELinux policies that are supported or at least packaged by Fedora (although I suspect I might see similar issues with the other policies too). My main machines don't use SELinux and I have it completely disabled, so in theory I should be able to remove the selinux-policy-targeted package to stop it from repeatedly complaining during the Fedora 42 upgrade process. In practice, selinux-policy-targeted is a 'protected' package that DNF will normally refuse to remove. Such packages are listed in /etc/dnf/protected.d/ in various .conf files; selinux-policy-targeted installs (well, includes) a .conf file to protect itself from removal once installed.
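To see which .conf file is doing the protecting, you can search the directory for the package name. This is a small sketch that assumes the .conf files simply list one package name per line; `protected_by` is a made-up helper name.

```shell
# Print the .conf files in a protected.d directory that list the given
# package name as a whole line. The directory defaults to DNF's usual
# /etc/dnf/protected.d. 'protected_by' is a hypothetical helper name.
protected_by() {
    grep -l -x -F -- "$1" "${2:-/etc/dnf/protected.d}"/*.conf 2>/dev/null
}

# Usage:
#   protected_by selinux-policy-targeted
```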

(Interestingly, sudo protects itself but there's nothing specifically protecting su and the rest of util-linux. I suspect util-linux is so pervasively a dependency that other protected things hold it down, or alternately no one has ever worried about people removing it and shooting themselves in the foot.)

I can obviously remove this .conf file and then DNF will let me remove selinux-policy-targeted, which will force the removal of some other SELinux policy packages (both selinux-policy packages themselves and some '*-selinux' sub-packages of other packages). I tried this on another Fedora 41 test virtual machine and nothing obvious broke, but that doesn't mean that nothing broke at all. It seems very likely that almost no one tests Fedora without the selinux-policy collective installed and I suspect it's not a supported configuration.

I could reduce my risks by removing the packages only just before I do the upgrade to Fedora 42 and put them back later (well, unless I run into a dnf issue as a result, although that issue is from 2024). Also, now that I've investigated this, I could in theory delete the .bin files in /etc/selinux/targeted/contexts/files before the upgrade, hopefully making it so that selinux-policy-targeted has less or nothing to complain about. Since I'm not using SELinux, hopefully the lack of these files won't cause any problems, but of course this is less certain a fix than removing selinux-policy-targeted (for example, perhaps the .bin files would get automatically rebuilt early on in the upgrade process as packages are shuffled around, and bring the problem back with them).

Really, though, I wish DNF5 didn't have its problem with script output. All of this is hackery to deal with that underlying issue.

Some thoughts on Ubuntu automatic ('unattended') package upgrades

By: cks

The default behavior of a stock Ubuntu LTS server install is that it enables 'unattended upgrades', by installing the package unattended-upgrades (which creates /etc/apt/apt.conf.d/20auto-upgrades, which controls this). Historically, we haven't believed in unattended automatic package upgrades and eventually built a complex semi-automated upgrades system (which has various special features). In theory this has various potential advantages; in practice it mostly results in package upgrades being applied after some delay that depends on when they come out relative to working days.
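If you want to check what a given machine is set to, a sketch like the following reads the setting out of that file. It assumes the usual 'APT::Periodic::Unattended-Upgrade "1";' line format, and `unattended_setting` is a name invented for this example.

```shell
# Print the value of APT::Periodic::Unattended-Upgrade from an apt
# configuration file ("1" means unattended upgrades are turned on).
# 'unattended_setting' is a hypothetical helper name.
unattended_setting() {
    sed -n 's/^[[:space:]]*APT::Periodic::Unattended-Upgrade[[:space:]]*"\([0-9]*\)";.*/\1/p' \
        "${1:-/etc/apt/apt.conf.d/20auto-upgrades}" | tail -n 1
}
```

Since other files in /etc/apt/apt.conf.d can override this one, 'apt-config dump | grep Unattended-Upgrade' is probably the more reliable way to see the setting that apt will actually use.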

I have a few machines that actually are stock Ubuntu servers, for reasons outside the scope of this entry. These machines naturally have automated upgrades turned on and one of them (in a cloud, using the cloud provider's standard Ubuntu LTS image) even appears to automatically reboot itself if kernel updates need that. These machines are all in undemanding roles (although one of them is my work IPv6 gateway), so they aren't necessarily indicative of what we'd see on more complex machines, but none of them have had any visible problems from these unattended upgrades.

(I also can't remember the last time that we ran into a problem with updates when we applied them. Ubuntu updates still sometimes have regressions and other problems, forcing them to be reverted or reissued, but so far we haven't seen problems ourselves; we find out about these problems only through the notices in the Ubuntu security lists.)

If we were starting from scratch today in a greenfield environment, I'm not sure we'd bother building our automation for manual package updates. Since we have the automation and it offers various extra features (even if they're rarely used), we're probably not going to switch over to automated upgrades (including in our local build of Ubuntu 26.04 LTS when that comes out next year).

(The advantage of switching over to standard unattended upgrades is that we'd get rid of a local tool that, like all local tools, is all our responsibility. The fewer local weird things we have, the better, especially since we have so many as it is.)

Getting the Cinnamon desktop environment to support "AppIndicator"

By: cks

The other day I wrote about what "AppIndicator" is (a protocol) and some things about how the Cinnamon desktop appeared to support it, except they weren't working for me. Now I actually understand what's going on, more or less, and how to solve my problem of a program complaining that it needed AppIndicator.

Cinnamon directly implements the AppIndicator notification protocol in xapp-sn-watcher, part of Cinnamon's xapp(s) package. Xapp-sn-watcher is started as part of your (Cinnamon) session. However, it has a little feature, namely that it will exit if no one is asking it to do anything:

XApp-Message: 22:03:57.352: (SnWatcher) watcher_startup: ../xapp-sn-watcher/xapp-sn-watcher.c:592: No active monitors, exiting in 30s

In a normally functioning Cinnamon environment, something will soon show up to be an active monitor and stop xapp-sn-watcher from exiting:

Cjs-Message: 22:03:57.957: JS LOG: [LookingGlass/info] Loaded applet xapp-status@cinnamon.org in 88 ms
[...]
XApp-Message: 22:03:58.129: (SnWatcher) name_owner_changed_signal: ../xapp-sn-watcher/xapp-sn-watcher.c:162: NameOwnerChanged signal received (n: org.x.StatusIconMonitor.cinnamon_0, old: , new: :1.60
XApp-Message: 22:03:58.129: (SnWatcher) handle_status_applet_name_owner_appeared: ../xapp-sn-watcher/xapp-sn-watcher.c:64: A monitor appeared on the bus, cancelling shutdown

This something is a standard Cinnamon desktop applet. In System Settings → Applets, it's way down at the bottom and is called "XApp Status Applet". If you've accidentally wound up with it not turned on, xapp-sn-watcher will (probably) not have a monitor active after 30 seconds, and then it will exit (and in the process of exiting, it will log alarming messages about failed GLib assertions). Not having this xapp-status applet turned on was my problem, and turning it on fixed things.

(I don't know how it got turned off. It's possible I went through the standard applets at some point and turned some of them off in an excess of ignorant enthusiasm.)
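It should be possible to check for this from the command line instead of scrolling through System Settings. The sketch below assumes that Cinnamon keeps its list of enabled applets in the 'enabled-applets' key of the 'org.cinnamon' gsettings schema, which I believe is the case but haven't verified across versions; `has_xapp_status` is a made-up helper name.

```shell
# Succeeds if the applet list on stdin mentions the XApp status applet,
# using the applet name that appears in the log message above.
# 'has_xapp_status' is a hypothetical helper name.
has_xapp_status() {
    grep -q -F 'xapp-status@cinnamon.org'
}

# Usage (the schema and key names are an assumption):
#   gsettings get org.cinnamon enabled-applets | has_xapp_status \
#       && echo "XApp Status Applet is enabled"
```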

As I found out from leigh scott in my Fedora bug report, the way to get this debugging output from xapp-sn-watcher is to run 'gsettings set org.x.apps.statusicon sn-watcher-debug true'. This will cause xapp-sn-watcher to log various helpful and verbose things to your ~/.xsession-errors (although apparently not the fact that it's actually exiting; you have to deduce that from the timestamps stopping 30 seconds later and that being the timestamps on the GLib assertion failures).

(I don't know why there's both a program and an applet involved in this and I've decided not to speculate.)

What an "AppIndicator" is in Linux desktops and some notes on it

By: cks

Suppose, not hypothetically, that you start up some program on your Fedora 42 Cinnamon desktop and it helpfully tells you "<X> requires AppIndicator to run. Please install the AppIndicator plugin for your desktop". You are likely confused, so here are some notes.

'AppIndicator' itself is the name of an application notification protocol, apparently originally from KDE, and some desktop environments may need a (third party) extension to support it, such as the Ubuntu one for GNOME Shell. Unfortunately for me, Cinnamon is not one of those desktops. It theoretically has native support for this, implemented in /usr/libexec/xapps/xapp-sn-watcher, part of Cinnamon's xapps package.

The actual 'AppIndicator' protocol is done over D-Bus, because that's the modern way. Since this started as a KDE thing, the D-Bus name is 'org.kde.StatusNotifierWatcher'. What provides certain D-Bus names is found in /usr/share/dbus-1/services, but not all names are mentioned there and 'org.kde.StatusNotifierWatcher' is one of the missing ones. In this case /etc/xdg/autostart/xapp-sn-watcher.desktop mentions the D-Bus name in its 'Comment=', but that's probably not something you can count on to find what your desktop is (theoretically) using to provide a given D-Bus name. I found xapp-sn-watcher somewhat through luck.

There are probably a number of ways to see what D-Bus names are currently registered and active. The one that I used when looking at this is 'dbus-send --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames'. As far as I know, there's no easy way to go from an error message about 'AppIndicator' to knowing that you want 'org.kde.StatusNotifierWatcher'; in my case I read the source of the thing complaining which was helpfully in Python.

(I used the error message to find the relevant section of code, which showed me what it wasn't finding.)
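Once you know the D-Bus name you're looking for, checking whether anything currently provides it can be scripted on top of the dbus-send command above. This is a sketch that relies on dbus-send printing each name as a 'string "..."' line; `bus_has_name` is a name made up for this example.

```shell
# Succeeds if the ListNames reply on stdin includes the given bus name.
# 'bus_has_name' is a hypothetical helper name.
bus_has_name() {
    grep -q -F "string \"$1\""
}

# Usage, against the session bus (dbus-send's default):
#   dbus-send --print-reply --dest=org.freedesktop.DBus \
#       /org/freedesktop/DBus org.freedesktop.DBus.ListNames |
#       bus_has_name org.kde.StatusNotifierWatcher || echo "no watcher"
```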

I have no idea how to actually fix the problem, or if there is a program that implements org.kde.StatusNotifierWatcher as a generic, more or less desktop independent program the way that stalonetray does for system tray stuff (or one generation of system tray stuff, I think there have been several iterations of it, cf).

(Yes, I filed a Fedora bug, but I believe Cinnamon isn't particularly supported by Fedora so I don't expect much. I also built the latest upstream xapps tree and it also appears to fail in the same way. Possibly this means something in the rest of the system isn't working right.)

Getting Linux nflog and tcpdump packet filters to sort of work together

By: cks

So, suppose that you have a brand new nflog version of OpenBSD's pflog, so you can use tcpdump to watch dropped packets (or in general, logged packets). And further suppose that you specifically want to see DNS requests to your port 53. So of course you do:

# tcpdump -n -i nflog:30 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented

Perhaps we can get clever by reading from the interface in one tcpdump and sending it to another to be interpreted, forcing the pcap filter to be handled entirely in user space instead of the kernel:

# tcpdump --immediate-mode -w - -U -i nflog:30 | tcpdump -r - 'port 53'
tcpdump: listening on nflog:30, link-type NFLOG (Linux netfilter log messages), snapshot length 262144 bytes
reading from file -, link-type NFLOG (Linux netfilter log messages), snapshot length 262144
tcpdump: NFLOG link-layer type filtering not implemented

Alas we can't.

As far as I can determine, what's going on here is that the netfilter log system, 'NFLOG', uses a 'packet' format that isn't the same as any of the regular formats (Ethernet, PPP, etc) and adds some additional (meta)data about the packet to every packet you capture. I believe the various attributes this metadata can contain are listed in the kernel's nfnetlink_log.h.

(I believe it's not technically correct to say that this additional stuff is 'before' the packet; instead I believe the packet is contained in a NFULA_PAYLOAD attribute.)

Unfortunately for us, tcpdump (or more exactly libpcap) doesn't know how to create packet capture filters for this format, not even ones that are interpreted entirely in user space (as happens when tcpdump reads from a file).

I believe that you have two options. First, you can use tshark with a display filter, not a capture filter:

# tshark -i nflog:30 -Y 'udp.port == 53 or tcp.port == 53'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'nflog:30'
[...]

(Tshark capture filters are subject to the same libpcap inability to work on NFLOG formatted packets as tcpdump has.)

Alternately and probably more conveniently, you can tell tcpdump to use the 'IPV4' datalink type instead of the default, as mentioned in (opaque) passing in the tcpdump manual page:

# tcpdump -i nflog:30 -L
Data link types for nflog:30 (use option -y to set):
  NFLOG (Linux netfilter log messages)
  IPV4 (Raw IPv4)
# tcpdump -i nflog:30 -y ipv4 -n 'port 53'
tcpdump: data link type IPV4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on nflog:30, link-type IPV4 (Raw IPv4), snapshot length 262144 bytes
[...]

Of course this is only applicable if you're only doing IPv4. If you have some IPv6 traffic that you want to care about, I think you have to use tshark display filters (which means learning how to write Wireshark display filters, something I've avoided so far).

I think there is some potentially useful information in the extra NFLOG data, but to get it or to filter on it I think you'll need to use tshark (or Wireshark) and consult the NFLOG display filter reference, although that doesn't seem to give you access to all of the NFLOG stuff that 'tshark -i nflog:30 -V' will print about packets.

(Or maybe the trick is that you need to match 'nflog.tlv_type == <whatever> and nflog.tlv_value == <whatever>'. I believe that some NFLOG attributes are available conveniently, such as 'nflog.prefix', which corresponds to NFULA_PREFIX. See packet-nflog.c.)

PS: There's some information on the NFLOG format in the NFLOG linktype documentation and tcpdump's supported data link types in the link-layer header types documentation.

Implementing a basic equivalent of OpenBSD's pflog in Linux nftables

By: cks

OpenBSD's and FreeBSD's PF system has a very convenient 'pflog' feature, where you put in a 'log' bit in a PF rule and this dumps a copy of any matching packets into a pflog pseudo-interface, where you can both see them with 'tcpdump -i pflog0' and have them automatically logged to disk by pflogd in pcap format. Typically we use this to log blocked packets, which gives us both immediate and after the fact visibility of what's getting blocked (and by what rule, also). It's possible to mostly duplicate this in Linux nftables, although it takes more work and there's less documentation on it.

The first thing you need is nftables rules with one or two log statements of the form 'log group <some number>'. If you want to be able to both log packets for later inspection and watch them live, you need two 'log group' statements with different numbers; otherwise you only need one. You can use different (group) numbers on different nftables rules if you want to be able to, say, look only at accepted but logged traffic or only dropped traffic. In the end this might wind up looking something like:

tcp dport ssh counter log group 30 log group 31 drop;

As the nft manual page will tell you, this uses the kernel 'nfnetlink_log' to forward the 'logs' (packets) to a netlink socket, where exactly one process (at most) can subscribe to a particular group to receive those logs (ie, those packets). If we want to both log the packets and be able to tcpdump them, we need two groups so we can have ulogd getting one and tcpdump getting the other.

To see packets from any particular log group, we use the special 'nflog:<N>' pseudo-interface that's hopefully supported by your Linux version of tcpdump. This is used as 'tcpdump -i nflog:30 ...' and works more or less like you'd want it to. However, as far as I know there's no way to see meta-information about the nftables filtering, such as what rule was involved or what the decision was; you just get the packet.

To log the packets to disk for later use, the default program is ulogd, which in Ubuntu is called 'ulogd2'. Ulogd(2) isn't as automatic as OpenBSD's and FreeBSD's pf logging; instead you have to configure it in /etc/ulogd.conf, and on Ubuntu make sure you have the 'ulogd2-pcap' package installed (along with ulogd2 itself). Based merely on getting it to work, what you want in /etc/ulogd.conf is the following three bits:

# A 'stack' of source, handling, and destination
stack=log31:NFLOG,base1:BASE,pcap31:PCAP

# The source: NFLOG group 31, for IPv4 traffic
[log31]
group=31
# addressfamily=10 for IPv6

# the file path is correct for Ubuntu
[pcap31]
file="/var/log/ulog/ulogd.pcap"
sync=0

(On Ubuntu 24.04, any .pcap files in /var/log/ulog will be automatically rotated by logrotate, although I think by default it's only weekly, so you might want to make it daily.)

The ulogd documentation suggests that you will need to capture IPv4 and IPv6 traffic separately, but I've only used this on IPv4 traffic so I don't know. This may imply that you need separate nftables rules to log (and drop) IPv6 traffic so that you can give it a separate group number for ulogd (I'm not sure if it needs a separate one for tcpdump or if tcpdump can sort it out).

Ulogd can also log to many things other than PCAP format, including JSON and databases. It's possible that there are ways to enrich the ulogd pcap logs, or maybe just the JSON logs, with additional useful information such as the network interface involved and other things. I find the ulogd documentation somewhat opaque on this (and also it's incomplete), and I haven't experimented.

(According to this, the JSON logs can be enriched or maybe default to that.)

Given the assorted limitations and other issues with ulogd, I'm tempted to not bother with it and only have our nftables setups support live tcpdump of dropped traffic with a single 'log group <N>'. This would save us from the assorted annoyances of ulogd2.

PS: One reason to log to pcap format files is that then you can use all of the tcpdump filters that you're already familiar with in order to narrow in on (blocked) traffic of interest, rather than having to put together a JSON search or something.

The 'nft' command may not show complete information for iptables rules

By: cks

These days, nftables is the Linux network firewall system that you want to use, and especially it's the system that Ubuntu will use by default even if you use the 'iptables' command. The nft command is the official interface to nftables, and it has a 'nft list ruleset' sub-command that will list your NFT rules. Since iptables rules are implemented with nftables, you might innocently expect that 'nft list ruleset' will show you the proper NFT syntax to achieve your current iptables rules.

Well, about that:

# iptables -vL INPUT
[...] target prot opt in  out  source   destination         
[...] ACCEPT tcp  --  any any  anywhere anywhere    match-set nfsports dst match-set nfsclients src
# nft list ruleset
[...]
      ip protocol tcp xt match "set" xt match "set" counter packets 0 bytes 0 accept
[...]

As they say, "yeah no". As the documentation tells you (eventually), somewhat reformatted:

xt TYPE NAME

TYPE := match | target | watcher

This represents an xt statement from xtables compat interface. It is a fallback if translation is not available or not complete. Seeing this means the ruleset (or parts of it) were created by iptables-nft and one should use that to manage it.

Nftables has a native set type (and also maps), but, quite reasonably, the old iptables 'ipset' stuff isn't translated to nftables sets by the iptables compatibility layer. Instead the compatibility layer uses this 'xt match' magic that the nft command can only imperfectly tell you about. To nft's credit, it prints a warning comment (which I've left out) that the rules are being managed by iptables-nft and you shouldn't touch them. Here, all of the 'xt match "set"' bits in the nft output are basically saying "opaque stuff happens here".
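If you want a quick idea of how much of a ruleset is in this state, you can count the rules that nft reports with 'xt' fallbacks. This is a minimal sketch that assumes your version of nft prints them in the 'xt match "set"' form shown above; `xt_fallback_rules` is an invented name.

```shell
# Count the lines (ie, rules) on stdin that contain at least one 'xt'
# compatibility statement. Prints 0 if there are none.
# 'xt_fallback_rules' is a hypothetical helper name.
xt_fallback_rules() {
    grep -c -E 'xt (match|target|watcher) ' || true
}

# Usage:
#   nft list ruleset | xt_fallback_rules
```

A non-zero count is a sign that 'nft list ruleset' isn't telling you everything and you should look at the rules through iptables instead.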

This still makes me a little bit sad because it makes it that bit harder to bootstrap my nftables knowledge from what iptables rules convert into. If I wanted to switch to nftables rules and nftables sets (for example for my now-simpler desktop firewall rules), I'd have to do that from relative scratch instead of getting to clean up what the various translation tools would produce or report.

(As a side effect it makes it less likely that I'll convert various iptables things to being natively nft/nftables based, because I can't do a fully mechanical conversion. If they still work with iptables-nft, I'm better off leaving them as is. Probably this also means that iptables-nft support is likely to have a long, long life.)

NFS v4 delegations on a Linux NFS server can act as mandatory locks

By: cks

Over on the Fediverse, I shared an unhappy learning experience:

Linux kernel NFS: we don't have mandatory locks.
Also Linux kernel NFS: if the server has delegated a file to a NFS client that's now not responding, good luck writing to the file from any other machine. Your writes will hang.

NFS v4 delegations are a feature where the NFS server, such as your Linux fileserver, hands a lot of authority over a particular file to a client that is using that file. There are various sorts of delegations, but even a basic read delegation will force the NFS server to recall the delegation if anything else wants to write to the file or to remove it. Recalling a delegation requires notifying the NFS v4 client that it has lost the delegation and then having the client accept and respond to that. NFS v4 clients have to respond to the loss of a delegation because they may be holding local state that needs to be flushed back to the NFS server before the delegation can be released.

(After all the NFS v4 server promised the client 'this file is yours to fiddle around with, I will consult you before touching it'.)

Under some circumstances, when the NFS v4 server is unable to contact the NFS v4 client, it will simply sit there waiting and as part of that will not allow you to do things that require the delegation to be released. I don't know if there's a delegation recall timeout, although I suspect that there is, and I don't know how to find out what the timeout is, but whatever the value is, it's substantial (it may be the 90 second 'default lease time' from nfsd4_init_leases_net(), or perhaps the 'grace', also probably 90 seconds, or perhaps the two added together).

(90 seconds is not what I consider a tolerable amount of time for my editor to completely freeze when I tell it to write out a new version of the file. When NFS is involved, I will typically assume that something has gone badly wrong well before then.)

As mentioned, the NFS v4 RFC also explicitly notes that NFS v4 clients may have to flush file state in order to release their delegation, and this itself may take some time. So even without an unavailable client machine, recalling a delegation may stall for some possibly arbitrary amount of time (depending on how the NFS v4 server behaves; the RFC encourages NFS v4 servers to not be hasty if the client seems to be making a good faith effort to clear its state). Both the slow client recall and the hung client recall can happen even in the absence of any actual file locks; in my case, the now-unavailable client merely having read from the file was enough to block things.

This blocking recall is effectively a mandatory lock, and it affects both remote operations over NFS and local operations on the fileserver itself. Short of waiting out whatever timeout applies, you have two realistic choices to deal with this (the non-realistic choice is to reboot the fileserver). First, you can bring the NFS client back to life, or at least something that's at its IP address and responds to the server with NFS v4 errors. Second, I believe you can force everything from the client to expire through /proc/fs/nfsd/clients/<ID>, by writing 'expire' to the client's 'ctl' file. You can find the right client ID by grep'ing for something in all of the clients/*/info files.
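The second option can be sketched as a pair of small shell functions. The helper names are made up, and this assumes that the client's IP address (or some other identifying string) appears in its 'info' file; on a real fileserver this all needs to be done as root.

```shell
# Print the nfsd client directories whose 'info' file matches a grep
# pattern. The base directory defaults to /proc/fs/nfsd/clients.
# 'nfsd_clients_matching' is a hypothetical helper name.
nfsd_clients_matching() {
    for _cdir in "${2:-/proc/fs/nfsd/clients}"/*/; do
        if grep -q -e "$1" "${_cdir}info" 2>/dev/null; then
            printf '%s\n' "${_cdir%/}"
        fi
    done
    return 0
}

# Force everything from one client to expire by writing 'expire' to
# the 'ctl' file in its directory.
# 'nfsd_expire' is a hypothetical helper name.
nfsd_expire() {
    echo expire > "$1/ctl"
}
```

In use, you'd run something like 'nfsd_clients_matching 10.1.2.3' (with the dead client's real address), check that it printed exactly one directory, and then hand that directory to nfsd_expire.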

Discovering this makes me somewhat more inclined than before to consider entirely disabling 'leases', the underlying kernel feature that is used to implement these NFS v4 delegations (I discovered how to do this when investigating NFS v4 client locks on the server). This will also affect local processes on the fileserver, but that now feels like a feature since hung NFS v4 delegation recalls will stall or stop even local operations.

Why Ubuntu 24.04's ls can show a puzzling error message on NFS filesystems

By: cks

Suppose that you're on Ubuntu 24.04, using NFS v4 filesystems mounted from a Linux NFS fileserver, and at some point you do a 'ls -l' or a 'ls -ld' of something you don't own. You may then be confused and angered:

; /bin/ls -ld ckstst
/bin/ls: ckstst: Permission denied
drwx------ 64 ckstst [...] 131 Jul 17 12:06 ckstst

(There are situations where this doesn't happen or doesn't repeat, which I don't understand but which I'm assuming are NFS caching in action.)

If you apply strace to the problem, you'll find that the failing system call is listxattr(2), which is trying to list 'extended attributes'. On Ubuntu 24.04, ls comes from Coreutils, and Coreutils apparently started using listxattr() in version 9.4.

The Linux NFS v4 code supports extended attributes (xattrs), which are from RFC 8276; they're supported in both the client and the server since mid-2020 if I'm reading git logs correctly. Both the normal Ubuntu 22.04 LTS and 24.04 LTS server kernels are recent enough to include this support on both the server and clients, and I don't believe there's any way to turn just them off in the kernel server (although if you disable NFS v4.2 they may disappear too).

However, the NFS v4 server doesn't treat listxattr() operations the way the kernel normally does. Normally, the kernel will let you do listxattr() on an object (a directory, a file, etc) that you don't have read permissions on, just as it will let you do stat() on it. However, the NFS v4 server code specifically requires that you have read access to the object. If you don't, you get EACCES (no second S).

(The sausage is made in nfsd_listxattr() in fs/nfsd/vfs.c, specifically in the fh_verify() call that uses NFSD_MAY_READ instead of NFSD_MAY_NOP, which is what eg GETATTR uses.)

In January of this year, Coreutils applied a workaround to this problem, which appeared in Coreutils 9.6 (and is mentioned in the release notes).
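Going only by the version numbers above (listxattr() use starting in 9.4, the workaround arriving in 9.6), a rough check for whether a machine's ls is in the affected range could look like the following. The helper name is invented, it needs a sort that supports -V, and a distribution may have patched its Coreutils without changing the version number, so treat the answer as a hint only.

```shell
# Succeeds if the given Coreutils version is at least 9.4 and below 9.6.
# 'ls_version_affected' is a hypothetical helper name.
ls_version_affected() {
    [ "$(printf '%s\n' 9.4 "$1" | sort -V | head -n 1)" = 9.4 ] &&
    [ "$1" != 9.6 ] &&
    [ "$(printf '%s\n' "$1" 9.6 | sort -V | head -n 1)" = "$1" ]
}

# Usage:
#   ls_version_affected "$(ls --version | sed -n '1s/.* //p')" \
#       && echo "this ls may complain on NFS v4"
```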

Normally we'd have found this last year, but we've been slow to roll out Ubuntu 24.04 LTS machines and apparently until now no one ever did a 'ls -l' of unreadable things on one of them (well, on a NFS mounted filesystem).

(This elaborates on a Fediverse post. Our patch is somewhat different than the official one.)

The development version of OpenZFS is sometimes dangerous, illustrated

By: cks

I've used OpenZFS on my office and home desktops (on Linux) for what is a long time now, and over that time I've consistently used the development version of OpenZFS, updating to the latest git tip on a regular basis (cf). There have been occasional issues but I've said, and continue to say, that the code that goes into the development version is generally well tested and I usually don't worry too much about it. But I do worry somewhat, and I do things like read every commit message for the development version and I sometimes hold off on updating my version if a particular significant change has recently landed.

But, well, sometimes things go wrong in a development version. As covered in Rob Norris's An (almost) catastrophic OpenZFS bug and the humans that made it (and Rust is here too) (via), there was a recently discovered bug in the development version of OpenZFS that could or would have corrupted RAIDZ vdevs. When I saw the fix commit go by in the development version, I felt extremely lucky that I use mirror vdevs, not raidz, and so avoided being affected by this.

(While I might have detected this at the first scrub after some data was corrupted, the data would have been gone and at a minimum I'd have had to restore it from backups. Which I don't currently have on my home desktop.)

In general this is a pointed reminder that the development version of OpenZFS isn't perfect, no matter how long I and other people have been lucky with it. You might want to think twice before running the development version in order to, for example, get support for the very latest kernels that are used by distributions like Fedora. Perhaps you're better off delaying your kernel upgrades a bit longer and sticking to released branches.

I don't know if this is going to change my practices around running the development version of OpenZFS on my desktops. It may make me more reluctant to update to the very latest version on my home desktop; it would be straightforward to have that run only time-delayed versions of what I've already run through at least one scrub cycle on my office desktop (where I have backups). And I probably won't switch to the next release version when it comes out, partly because of kernel support issues.

(Maybe) understanding how to use systemd-socket-proxyd

By: cks

I recently read systemd has been a complete, utter, unmitigated success (via among other places), where I found a mention of an interesting systemd piece that I'd previously been unaware of, systemd-socket-proxyd. As covered in the article, the major purpose of systemd-socket-proxyd is to bridge between systemd dynamic socket activation and a conventional program that listens on some socket, so that you can dynamically activate the program when a connection comes in. Unfortunately the systemd-socket-proxyd manual page is a little bit opaque about how it works for this purpose (and what the limitations are). Even though I'm familiar with systemd stuff, I had to think about it for a bit before things clicked.

A systemd socket unit activates the corresponding service unit when a connection comes in on the socket. For simple services that are activated separately for each connection (with 'Accept=yes'), this is actually a templated unit, but if you're using it to activate a regular daemon like sshd (with 'Accept=no') it will be a single .service unit. When systemd activates this unit, it will pass the socket to it either through systemd's native mechanism or an inetd-compatible mechanism using standard input. If your listening program supports either mechanism, you don't need systemd-socket-proxyd and your life is simple. But plenty of interesting programs don't; they expect to start up and bind to their listening socket themselves. To work with these programs, systemd-socket-proxyd accepts a socket (or several) from systemd and then proxies connections on that socket to the socket your program is actually listening to (which will not be the official socket, such as port 80 or 443).

All of this is perfectly fine and straightforward, but the question is, how do we get our real program to be automatically started when a connection comes in and triggers systemd's socket activation? The answer, which isn't explicitly described in the manual page but which appears in the examples, is that we make the socket's .service unit (which will run systemd-socket-proxyd) also depend on the .service unit for our real service with a 'Requires=' and an 'After='. When a connection comes in on the main socket that systemd is doing socket activation for, call it 'fred.socket', systemd will try to activate the corresponding .service unit, 'fred.service'. As it does this, it sees that fred.service depends on 'realthing.service' and must be started after it, so it will start 'realthing.service' first. Your real program will then start, bind to its local socket, and then have systemd-socket-proxyd proxy the first connection to it.

To automatically stop everything when things are idle, you set systemd-socket-proxyd's --exit-idle-time option and also set StopWhenUnneeded=true on your program's real service unit ('realthing.service' here). Then when systemd-socket-proxyd is idle for long enough, it will exit, systemd will notice that the 'fred.service' unit is no longer active, see that there's nothing that needs your real service unit any more, and shut that unit down too, causing your real program to exit.
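As a rough illustration of how the pieces fit together, here is a minimal Python sketch that renders the unit files described above. The 'fred' and 'realthing' names come from the discussion; the listening port, the backend address, the idle time, the drop-in file name, and the path to systemd-socket-proxyd (which differs between distributions) are all assumptions made for the example, not something the manual page mandates.

```python
from textwrap import dedent


def render_units(listen="80", backend="127.0.0.1:8080",
                 real_unit="realthing.service", idle="5min"):
    """Return {filename: contents} for the socket, proxy, and drop-in units.

    All default values are illustrative assumptions.
    """
    socket_unit = dedent(f"""\
        [Socket]
        ListenStream={listen}

        [Install]
        WantedBy=sockets.target
        """)
    proxy_unit = dedent(f"""\
        [Unit]
        # Pull in the real service and start the proxy only after it.
        Requires={real_unit}
        After={real_unit}

        [Service]
        # Assumed path; check where your distribution installs the binary.
        ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time={idle} {backend}
        """)
    # Drop-in for the real service so it stops once the proxy has exited.
    stop_dropin = dedent("""\
        [Unit]
        StopWhenUnneeded=true
        """)
    return {
        "fred.socket": socket_unit,
        "fred.service": proxy_unit,
        f"{real_unit}.d/idle.conf": stop_dropin,
    }


if __name__ == "__main__":
    for name, text in render_units().items():
        print(f"# {name}\n{text}")
```

The real program is assumed to listen on the backend address itself; only the proxy ever sees the socket that systemd is doing activation for.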

The obvious limitation of using systemd-socket-proxyd is that your real program no longer knows the actual source of the connection. If you use systemd-socket-proxyd to relay HTTP connections on port 80 to an nginx instance that's activated on demand (as shown in the examples in the systemd-socket-proxyd manual page), that nginx sees and will log all of the connections as local ones. There are usage patterns where this information will be added by something else (for example, a frontend server that is a reverse proxy to a bunch of activated on demand backend servers), but otherwise you're out of luck as far as I know.

Another potential issue is that systemd's idea of when the .service unit for your real program has 'started' and thus it can start running systemd-socket-proxyd may not match when your real program actually gets around to setting up its socket. I don't know if systemd-socket-proxyd will wait and try a bit to cope with the situation where it gets started a bit faster than your real program can get its socket ready.

(Systemd has ways that your real program can signal readiness, but if your program can use these ways it may well also support being passed sockets from systemd as a direct socket activated thing.)

Linux 'exportfs -r' stops on errors (well, problems)

By: cks

Linux's NFS export handling system has a very convenient option where you don't have to put all of your exports into one file, /etc/exports, but can instead write them into a bunch of separate files in /etc/exports.d. This is very convenient for allowing you to manage filesystem exports separately from each other and to add, remove, or modify only a single filesystem's exports. Also, one of the things that exportfs(8) can do is 'reexport' all current exports, synchronizing the system state to what is in /etc/exports and /etc/exports.d; this is 'exportfs -r', and is a handy thing to do after you've done various manipulations of files in /etc/exports.d.

Although it's not documented and not explicit in 'exportfs -v -r' (which will claim to be 'exporting ...' for various things), I have an important safety tip which I discovered today: exportfs does nothing on a re-export if you have any problems in your exports. In particular, if any single file in /etc/exports.d has a problem, no files from /etc/exports.d get processed and no exports are updated.

One potential problem with such files is syntax errors, which is fair enough as a 'problem'. But another problem is that they refer to directories that don't exist, for example because you have lingering exports for a ZFS pool that you've temporarily exported (which deletes the directories that the pool's filesystems may have previously been mounted on). A missing directory is an error even if the exportfs options include 'mountpoint', which only does the export if the directory is a mount point.

When I stubbed my toe on this I was surprised. What I'd vaguely expected was that the error would cause only the particular file in /etc/exports.d to not be processed, and that it wouldn't be a fatal error for the entire process. Exportfs itself prints no notices about this being a fatal problem, and it will happily continue to process other files in /etc/exports.d (as you can see with 'exportfs -v -r' with the right ordering of where the problem file is) and claim to be exporting them.
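One way to avoid being surprised by this is to check for missing directories before running 'exportfs -r'. The following is a minimal sketch of such a check, not a full parser for the exports format: it assumes unquoted paths without embedded spaces, and it only catches the missing-directory problem, not syntax errors. The function names are made up for the example.

```python
import glob
import os


def export_paths(text):
    """Yield the exported directory from each entry in an exports file."""
    # Join backslash-continued lines so client lists aren't seen as paths.
    text = text.replace("\\\n", " ")
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Simplification: the path is taken to be the first
        # whitespace-separated field (no quoting or escapes handled).
        yield line.split()[0]


def missing_exports(files):
    """Return (file, path) pairs whose exported directory doesn't exist."""
    problems = []
    for fname in files:
        with open(fname) as fh:
            for path in export_paths(fh.read()):
                if not os.path.isdir(path):
                    problems.append((fname, path))
    return problems


def default_files():
    """The files exportfs reads: /etc/exports plus /etc/exports.d/*.exports."""
    files = ["/etc/exports"] if os.path.exists("/etc/exports") else []
    return files + sorted(glob.glob("/etc/exports.d/*.exports"))
```

Running 'missing_exports(default_files())' before a re-export and refusing to go on if it returns anything would have caught the exported ZFS pool case.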

Oh well, now I know and hopefully it will stick.

Systemd user units, user sessions, and environment variables

By: cks

A variety of things in typical graphical desktop sessions communicate through the use of environment variables; for example, X's $DISPLAY environment variable. Somewhat famously, modern desktops run a lot of things as systemd user units, and it might be nice to do that yourself (cf). When you put these two facts together, you wind up with a question, namely how the environment works in systemd user units and what problems you're going to run into.

The simplest case is using systemd-run to run a user scope unit ('systemd-run --user --scope --'), for example to run a CPU heavy thing with low priority. In this situation, the new scope will inherit your entire current environment and nothing else. As far as I know, there's no way to do this with other sorts of things that systemd-run will start.

Non-scope user units by default inherit their environment from your user "systemd manager". I believe that there is always only a single user manager for all sessions of a particular user, regardless of how you've logged in. When starting things via 'systemd-run', you can selectively pass environment variables from your current environment with 'systemd-run --user -E <var> -E <var> -E ...'. If the variable is unset in your environment but set in the user systemd manager, this will unset it for the new systemd-run started unit. As you can tell, this will get very tedious if you want to pass a lot of variables from your current environment into the new unit.

You can manipulate your user "systemd manager environment block", as systemctl describes it in Environment Commands. In particular, you can export current environment settings to it with 'systemctl --user import-environment VAR VAR2 ...'. If you look at this with 'systemctl --user show-environment', you'll see that your desktop environment has pushed a lot of environment variables into the systemd manager environment block, including things like $DISPLAY (if you're on X). All of these environment variables for X, Wayland, DBus, and so on are probably part of how the assorted user units that are part of your desktop session talk to the display and so on.

You may now see a little problem. What happens if you're logged in with a desktop X session, and then you go elsewhere and SSH in to your machine (maybe with X forwarding) and try to start a graphical program as a systemd user unit? Since you only have a single systemd manager regardless of how many sessions you have, the systemd user unit you started from your SSH session will inherit all of the environment variables that your desktop session set and it will think it has graphics and open up a window on your desktop (which is hopefully locked, and in any case it's not useful to you over SSH). If you import the SSH session's $DISPLAY (or whatever) into the systemd manager's environment, you'll damage your desktop session.

For specific environment variables, you can override or remove them with 'systemd-run --user -E ...' (for example, to override or remove $DISPLAY). But hunting down all of the session environment variables that may trigger undesired effects is up to you, making systemd-run's user scope units by far the easiest way to deal with this.
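To make the tedium a bit more concrete, here is a small Python sketch that builds a 'systemd-run --user' command line passing a chosen set of variables from the current session. The function name and the choice of variables are hypothetical. For a variable that isn't set in the current environment it passes a bare '-E NAME', relying on the behavior described above where this clears the variable in the new unit.

```python
import os


def systemd_run_argv(command, pass_vars, environ=None):
    """Build a 'systemd-run --user' argv that copies selected variables
    from this session's environment into the new unit."""
    environ = os.environ if environ is None else environ
    argv = ["systemd-run", "--user"]
    for name in pass_vars:
        if name in environ:
            # Pass the session's value explicitly.
            argv += ["-E", f"{name}={environ[name]}"]
        else:
            # Not set here: a bare -E is expected to override whatever
            # the user manager's environment block has for it.
            argv += ["-E", name]
    return argv + ["--"] + list(command)


if __name__ == "__main__":
    # Illustrative list; the real set of session variables is up to you.
    print(systemd_run_argv(["xterm"], ["DISPLAY", "WAYLAND_DISPLAY", "XAUTHORITY"]))
```

The hard part, which this doesn't solve, is knowing which variables belong in the list in the first place.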

(I don't know if there's something extra-special about scope units that enables them and only them to be passed your entire environment, or if this is simply a limitation in systemd-run that it doesn't try to implement this for anything else.)

The reason I find all of this regrettable is that it makes putting applications and other session processes into their own units much harder than it should be. Systemd-run's scope units inherit your session environment but can't be detached, so at a minimum you have extra systemd-run processes sticking around (and putting everything into scopes when some of them might be services is unaesthetic). Other units can be detached but don't inherit your environment, requiring assorted contortions to make things work.

PS: Possibly I'm missing something obvious about how to do this correctly, or perhaps there's an existing helper that can be used generically for this purpose.

Current cups-browsed seems to be bad for central CUPS print servers

By: cks

Suppose, not hypothetically, that you have a central CUPS print server, and that people also have Linux desktops or laptops that they point at your print server to print to your printers. As of at least Ubuntu 24.04, if you're doing this you probably want to get people to turn off and disable cups-browsed on their machines. If you don't, your central print server may see a constant flood of connections from client machines running cups-browsed. You're probably running it, as I believe that cups-browsed is installed and activated by default these days in most desktop Linux environments.

(We didn't really notice this in prior Ubuntu versions, although it's possible cups-browsed was always doing something like this and what's changed in the Ubuntu 24.04 version is that it's doing it more and faster.)

I'm not entirely sure why this happens, and I'm also not sure what the CUPS requests typically involve, but one pattern that we see is that such clients will make a lot of requests to the CUPS server's /admin/ URL. I'm not sure what's in these requests, because CUPS immediately rejects them as unauthenticated. Another thing we've seen is frequent attempts to get printer attributes for printers that don't exist and that have name patterns that look like local printers. One of the reasons that the clients are hitting the /admin/ endpoint may be to somehow add these printers to our CUPS server, which is definitely not going to work.

(We've also seen signs that some Ubuntu 24.04 applications can repeatedly spam the CUPS server, probably with status requests for printers or print jobs. This may be something enabled or encouraged by cups-browsed.)

My impression is that modern Linux desktop software, things like cups-browsed included, is not really spending much time thinking about larger scale, managed Unix environments where there are a bunch of printers (or at least print queues), the 'print server' is not on your local machine and not run by you, anything random you pick up through broadcast on the local network is suspect, and so on. I broadly sympathize with this, because such environments are a small minority now, but it would be nice if client side CUPS software didn't cause problems in them.

(I suspect that cups-browsed and its friends are okay in an environment where either the 'print server' is local or it's operated by you and doesn't require authentication, there's only a few printers, everyone on the local network is friendly and if you see a printer it's definitely okay to use it, and so on. This describes a lot of Linux desktop environments, including my home desktop.)

Compute GPUs can have odd failures under Linux (still)

By: cks

Back in the early days of GPU computation, the hardware, drivers, and software were so relatively untrustworthy that our early GPU machines had to be specifically reserved by people and that reservation gave them the ability to remotely power cycle the machine to recover it (this was in the days before our SLURM cluster). Things have gotten much better since then, with things like hardware and driver changes so that programs with bugs couldn't hard-lock the GPU hardware. But every so often we run into odd failures where something funny is going on that we don't understand.

We have one particular SLURM GPU node that has been flaky for a while, with the specific issue being that every so often the NVIDIA GPU would throw up its hands and drop off the PCIe bus until we rebooted the system. This didn't happen every time it was used, or with any consistent pattern, although some people's jobs seemed to regularly trigger this behavior. Recently I dug up a simple to use GPU stress test program, and when this machine's GPU did its disappearing act this Saturday, I grabbed the machine, rebooted it, ran the stress test program, and promptly had the GPU disappear again. Success, I thought, and since it was Saturday, I stopped there, planning to repeat this process today (Monday) at work, while doing various monitoring things.

Since I'm writing a Wandering Thoughts entry about it, you can probably guess the punchline. Nothing has changed on this machine since Saturday, but all today the GPU stress test program could not make the GPU disappear. Not with the same basic usage I'd used Saturday, and not with a different usage that took the GPU to full power draw and a reported temperature of 80C (which was a higher temperature and power draw than the GPU had been at when it disappeared, based on our Prometheus metrics). If I'd been unable to reproduce the failure at all with the GPU stress program, that would have been one thing, but reproducing it once and then not again is just irritating.

(The machine is an assembled from parts one, with an RTX 4090 and a Ryzen Threadripper 1950X in an X399 Taichi motherboard that is probably not even vaguely running the latest BIOS, seeing as the base hardware was built many years ago, although the GPU has been swapped around since then. Everything is in a pretty roomy 4U case, but if the failure was consistent we'd have assumed cooling issues.)

I don't really have any theories for what could be going on, but I suppose I should try to find a GPU stress test program that exercises every last corner of the GPU's capabilities at full power rather than using only one or two parts at a time. On CPUs, different loads light up different functional units, and I assume the same is true on GPUs, so perhaps the problem is in one specific functional unit or a combination of them.

(Although this doesn't explain why the GPU stress test program was able to cause the problem on Saturday but not today, unless a full reboot didn't completely clear out the GPU's state. Possibly we should physically power this machine off entirely for long enough to dissipate any lingering things.)

What I've observed about Linux kernel WireGuard on 10G Ethernet so far

By: cks

I wrote about a performance mystery with WireGuard on 10G Ethernet, and since then I've done additional measurements with results that both give some clarity and leave me scratching my head a bit more. So here is what I know about the general performance characteristics of Linux kernel WireGuard on a mixture of Ubuntu 22.04 and 24.04 servers with stock settings, and using TCP streams inside the WireGuard tunnels (because the high bandwidth thing we care about runs over TCP).

  • CPU performance is important even when WireGuard isn't saturating the CPU.

  • CPU performance seems to be more important on the receiving side than on the sending side. If you have two machines, one faster than the other, you get more bandwidth sending a TCP stream from the slower machine to the faster one. I don't know if this is an artifact of the Linux kernel implementation or if the WireGuard protocol requires the receiver to do more work than the sender.

  • There seems to be a single-peer bandwidth limit (related to CPU speeds). You can increase the total WireGuard bandwidth of a given server by talking to more than one peer.

  • When talking to a single peer, there's both a unidirectional bandwidth limit and a bidirectional bandwidth limit. If you send and receive to a single peer at once, you don't get the sum of the unidirectional send and unidirectional receive; you get less.

  • There's probably also a total WireGuard bandwidth that, in our environment, falls short of 10G bandwidth (ie, a server talking WireGuard to multiple peers can't saturate its 10G connection, although maybe it could if I had enough peers in my test setup).

The best performance between a pair of WireGuard peers I've gotten is from two servers with Xeon E-2226G CPUs; these can push their 10G Ethernet to about 850 MBytes/sec of WireGuard bandwidth in one direction and about 630 MBytes/sec in each direction if they're both sending and receiving. These servers (and other servers with slower CPUs) can basically saturate their 10G-T network links with plain (non-WireGuard) TCP.
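To put those numbers in terms of the link, here is a small worked calculation in Python. It assumes that 'MBytes' means 10^6 bytes; if the figures are really in units of 2^20 bytes, pass that as the second argument and the fraction comes out a little higher.

```python
LINK_BITS_PER_SEC = 10e9  # nominal 10G Ethernet line rate


def fraction_of_link(mbytes_per_sec, bytes_per_mbyte=1_000_000):
    """Fraction of the 10G line rate that a payload rate represents."""
    return mbytes_per_sec * bytes_per_mbyte * 8 / LINK_BITS_PER_SEC


one_way = fraction_of_link(850)   # unidirectional figure from above
both_ways_total = 2 * 630         # MBytes/sec summed over both directions

print(f"one direction: {one_way:.0%} of line rate")
print(f"bidirectional total vs one direction: {both_ways_total / 850:.2f}x")
```

Either way, the bidirectional total is clearly more than the unidirectional limit but well short of double it, which is the 'you get less than the sum' effect from the list above.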

If I was to build a high performance 'WireGuard gateway' today, I'd build it with a fast CPU and dual 10G networks, with WireGuard traffic coming in (and going out) one 10G interface and the resulting gatewayed traffic using the other. WireGuard on fast CPUs can run fast enough that a single 10G interface could limit total bandwidth under the right (or wrong) circumstances; segmenting WireGuard and clear traffic onto different interfaces avoids that.

(A WireGuard gateway that only served clients at 1G or less would likely be perfectly fine with a single 10G interface and reasonably fast CPUs. But I'd want to test how many 1G clients it took to reach the total WireGuard bandwidth limit on a 10G WireGuard server before I was completely confident about that.)

A performance mystery with Linux WireGuard on 10G Ethernet

By: cks

As a followup on discovering that WireGuard can saturate a 1G Ethernet (on Linux), I set up WireGuard on some slower servers here that have 10G networking. This isn't an ideal test but it's more representative of what we would see with our actual fileservers, since I used spare fileserver hardware. What I got out of it was a performance and CPU usage mystery.

What I expected to see was that WireGuard performance would top out at some level above 1G as the slower CPUs on both the sending and the receiving host ran into their limits, and I definitely wouldn't see them drive the network as fast as they could without WireGuard. What I actually saw was that WireGuard did hit a speed limit but the CPU usage didn't seem to saturate, either for kernel WireGuard processing or for the iperf3 process. These machines can manage to come relatively close to 10G bandwidth with bare TCP, while with WireGuard they were running around 400 MBytes/sec of on the wire bandwidth (which translates to somewhat less inside the WireGuard connection, due to overheads).

One possible explanation for this is increased packet handling latency, where the introduction of WireGuard adds delays that keep things from running at full speed. Another possible explanation is that I'm running into CPU limits that aren't obvious from simple tools like top and htop. One interesting thing is that if I do a test in both directions at once (either an iperf3 bidirectional test or two iperf3 sessions, one in each direction), the bandwidth in each direction is slightly over half the unidirectional bandwidth (while a bidirectional test without WireGuard runs at full speed in both directions at once). This certainly makes it look like there's a total WireGuard bandwidth limit in these servers somewhere; unidirectional traffic gets basically all of it, while bidirectional traffic splits it fairly between each direction.

I looked at 'perf top' on the receiving 10G machine and kernel spin lock stuff seems to come in surprisingly high. I tried having a 1G test machine also send WireGuard traffic to the receiving 10G test machine at the same time and the incoming bandwidth does go up by about 100 MBytes/sec, so perhaps on these servers I'm running into a single-peer bandwidth limitation. I can probably arrange to test this tomorrow.

(I can't usefully try both of my 1G WireGuard test machines at once because they're both connected to the same 1G switch, with a 1G uplink into our 10G switch fabric.)

PS: The two 10G servers are running Ubuntu 24.04 and Ubuntu 22.04 respectively with standard kernels; the faster server with more CPUs was the 'receiving' server here, and is running 24.04. The two 1G test servers are running Ubuntu 24.04.

Linux kernel WireGuard can go 'fast' on decent hardware

By: cks

I'm used to thinking of encryption as a slow thing that can't deliver anywhere near to network saturation, even on basic gigabit Ethernet connections. This is broadly the experience we see with our current VPN servers, which struggle to turn in more than relatively anemic bandwidth with OpenVPN and L2TP, and so for a long time I assumed it would also be our experience with WireGuard if we tried to put anything serious behind it. I'd seen the 2023 Tailscale blog post about this but discounted it as something we were unlikely to see; their kernel throughput on powerful sounding AWS nodes was anemic by 10G standards, so I assumed our likely less powerful servers wouldn't even get 1G rates.

Today, for reasons beyond the scope of this entry, I wound up wondering how fast we could make WireGuard go. So I grabbed a couple of spare servers we had with reasonably modern CPUs (by our limited standards), put our standard Ubuntu 24.04 on them, and took a quick look to see how fast I could make them go over 1G networking. To my surprise, the answer is that WireGuard can saturate that 1G network with no particularly special tuning, and the system CPU usage is relatively low (4.5% on the client iperf3 side, 8% on the server iperf3 side; each server has a single Xeon E-2226G). The low usage suggests that we could push well over 1G of WireGuard bandwidth through a 10G link, which means that I'm going to set one up for testing at some point.

While the Xeon E-2226G is not a particularly impressive CPU, it's better than the CPUs our NFS fileservers have (the current hardware has Xeon Silver 4410Ys). But I suspect that we could sustain over 1G of WireGuard bandwidth even on them, if we wanted to terminate WireGuard on the fileservers instead of on a 'gateway' machine with a fast CPU (and a 10G link).

More broadly, I probably need to reset my assumptions about the relative speed of encryption as compared to network speeds. These days I suspect a lot of encryption methods can saturate a 1G network link, at least in theory, since I don't think WireGuard is exceptionally good in this respect (as I understand it, encryption speed wasn't particularly a design goal; it was designed to be secure first). Actual implementations may vary for various reasons so perhaps our VPN servers need some tuneups.

(The actual bandwidth achieved inside WireGuard is less than the 1G data rate because simply being encrypted adds some overhead. This is also something I'm going to have to remember when doing future testing; if I want to see how fast WireGuard is driving the underlying networking, I should look at the underlying networking data rate, not necessarily WireGuard's rate.)
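As a rough worked example of that overhead, here is a Python sketch that estimates how much of the Ethernet wire rate ends up as TCP payload, with and without WireGuard. It assumes IPv4 for both the inner and outer packets, TCP timestamps enabled, a 1500-byte Ethernet MTU, and an inner WireGuard MTU of 1420 bytes; it also ignores WireGuard's padding of packets to a multiple of 16 bytes. Real numbers will differ somewhat.

```python
WG_HEADER = 16        # type, reserved, receiver index, counter
WG_TAG = 16           # Poly1305 authentication tag
UDP = 8
IPV4 = 20
TCP = 20
TCP_TIMESTAMPS = 12   # assumption: TCP timestamps are enabled
ETH_WIRE = 38         # preamble 8 + header 14 + FCS 4 + inter-frame gap 12


def tcp_payload(mtu):
    """TCP payload bytes that fit in one IPv4 packet of the given MTU."""
    return mtu - IPV4 - TCP - TCP_TIMESTAMPS


def plain_efficiency(mtu=1500):
    """Fraction of the wire rate that is TCP payload without WireGuard."""
    return tcp_payload(mtu) / (mtu + ETH_WIRE)


def wireguard_efficiency(inner_mtu=1420):
    """Fraction of the wire rate that is TCP payload inside WireGuard."""
    outer_packet = inner_mtu + WG_HEADER + WG_TAG + UDP + IPV4
    return tcp_payload(inner_mtu) / (outer_packet + ETH_WIRE)


if __name__ == "__main__":
    print(f"plain TCP: {plain_efficiency():.1%} of wire rate")
    print(f"WireGuard: {wireguard_efficiency():.1%} of wire rate")
```

Under these assumptions the gap between the two is only a few percent, so a WireGuard rate that is far below the underlying network rate is not explained by encapsulation overhead alone.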

A silly systemd wish for moving new processes around systemd units

By: cks

Linux cgroups offer a bunch of robust features for limiting resource usage and handling resource contention between different groups of processes, which you can use to implement things like per-user memory and CPU resource limits. On a systemd based system, which is to say basically almost all Linuxes today, systemd more or less completely owns the cgroup hierarchy and using cgroups for resource limits requires that the processes involved be placed inside relevant systemd units, and for that matter that the systemd units exist.

Unfortunately, the mechanisms for doing this are a little bit under-developed. If you're dealing with something that goes through PAM and for which putting processes into user slices based on the UID running them is the right answer, you can use pam_systemd (which we do for various reasons). If you want a different hierarchy and things go through PAM, you can perhaps write a PAM session module that does this, copying code from pam_systemd, but I don't know if there's anything for that today. If you have processes that are started in ways that don't go through PAM, as far as I know you're currently out of luck. One case that's quite relevant for us is Apache CGI processes run through suexec.

It would be nice to be able to do better, since the odds that everything that starts processes will pick up the ability to talk to systemd to set up slices, sessions, and so on for them seem rather low. Some things have specific magic support for this, but I don't think the process is very well documented and I believe it requires that things change how they start programs (so eg suexec would have to know how to do this). This means that what I'm wishing for is a daemon that would be given some sort of rules and use them to move processes between systemd slices and other units, possibly creating things like user sessions on the fly. Then you could write a rule that said 'if a process is in the Apache system cgroup and its UID isn't <X>, put it in a slice in a user hierarchy'.
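To illustrate what such a rule might look like, here is a purely hypothetical Python sketch of the matching half of that daemon. It reads the information from /proc/<pid>/cgroup and /proc/<pid>/status and decides which user slice a process should go in. The Apache unit name and UID are assumptions (they happen to match a stock Ubuntu setup), and the sketch deliberately doesn't attempt the hard part, which is actually moving the process in a way that systemd will accept.

```python
def cgroup_v2_path(proc_cgroup_text):
    """Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        hierarchy, controllers, path = line.split(":", 2)
        if hierarchy == "0" and controllers == "":
            return path
    return None


def real_uid(proc_status_text):
    """Extract the real UID from the contents of /proc/<pid>/status."""
    for line in proc_status_text.splitlines():
        if line.startswith("Uid:"):
            return int(line.split()[1])
    return None


def target_slice(cgroup_path, uid,
                 service_cgroup="/system.slice/apache2.service",
                 service_uid=33):
    """Return the user slice a process should be moved to, or None.

    The default cgroup path and UID are assumptions for illustration.
    """
    if cgroup_path == service_cgroup and uid != service_uid:
        return f"user-{uid}.slice"
    return None
```

A real daemon would also have to notice new processes quickly enough for this to matter, which is where the cracks would come from.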

An extra problem is that this daemon probably wouldn't be perfect, since it would have to react to processes after they'd appeared rather than intercept their creation; some processes could slip through the cracks or otherwise do weird things. This would make it sort of a hack, rather than something that I suspect anyone would want as a proper feature.

(I don't know if a kernel LSM could make this more reliable by intercepting and acting on certain things, like setuid() calls.)

PS: Possibly the correct answer is to persuade the Apache people to make suexec consult PAM, even if the standard suexec PAM stack does nothing. Then you could in theory add pam_systemd or whatever there. It appears that Debian may have had a custom patch for this at one point but I believe they gave it up years and years ago.

Fedora's DNF 5 and the curse of mandatory too-smart output

By: cks

DNF is Fedora's high(er) level package management system, which pretty much any system administrator is going to have to use to install and upgrade packages. Fedora 41 and later have switched from DNF 4 to DNF 5 as their normal (and probably almost mandatory) version of DNF. I ran into some problems with this switch, and since then I've found other issues, all of which boil down to a simple issue: DNF 5 insists on doing too-smart output.

Regardless of what you set your $TERM to and what else you do, if DNF 5 is connected to a terminal (and perhaps if it isn't), it will pretty-print its output in an assortment of ways. As far as I can tell it simply assumes ANSI cursor addressability, among other things, and will always fit its output to the width of your terminal window, truncating output as necessary. This includes output from RPM package scripts that are running as part of the update. Did one of them print a line longer than your current terminal width? Tough, it was probably truncated. Are you using script so that you can capture and review all of the output from DNF and RPM package scripts? Again, tough, you can't turn off the progress bars and other things that will make a complete mess of the typescript.

(It's possible that you can find the information you want in /var/log/dnf5.log in un-truncated and readable form, but if so it's buried in debug output and I'm not sure I trust dnf5.log in general.)

DNF 5 is far from the only offender these days. An increasing number of command line programs simply assume that they should always produce 'smart' output (ideally only if they're connected to a terminal). They have no command line option to turn this off and since they always use 'ANSI' escape sequences, they ignore the tradition of '$TERM' and especially 'TERM=dumb' to turn that off. Some of them can specifically disable colour output (typically with one of a number of environment variables, which may or may not be documented, and sometimes with a command line option), but that's usually the limits of their willingness to stop doing things. The idea of printing one whole line at a time as you do things and not printing progress bars, interleaving output, and so on has increasingly become a non-starter for modern command line tools.
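For contrast, here is a minimal Python sketch of the sort of check that the traditional approach implies. This is not what DNF 5 or any other particular tool does; it's an illustration of honoring 'not a terminal', 'TERM=dumb', and the NO_COLOR environment variable convention before producing fancy output. The function names are made up.

```python
import os
import sys


def want_plain_output(stream=None, environ=None):
    """True if we should print plain, whole lines with no cursor tricks."""
    stream = sys.stdout if stream is None else stream
    environ = os.environ if environ is None else environ
    if not stream.isatty():
        return True
    # Treat an unset $TERM the same as TERM=dumb.
    return environ.get("TERM", "dumb") == "dumb"


def want_colour(stream=None, environ=None):
    """True if colour is acceptable: fancy output is okay and NO_COLOR is unset or empty."""
    environ = os.environ if environ is None else environ
    return not want_plain_output(stream, environ) and not environ.get("NO_COLOR")
```

A command line option to force plain output on top of this would cover the case of running under script, where the output is a terminal but you still want a readable typescript.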

(Another semi-offender is Debian's 'apt' and also 'apt-get' to some extent, although apt-get's progress bars can be turned off and 'apt' is explicitly a more user friendly front end to apt-get and friends.)

PS: I can't run DNF with its output directed into a file because it wants you to interact with it to approve things, and I don't feel like letting it run freely without that.

Netplan can only have WireGuard peers in one file

By: cks

We have started using WireGuard to build a small mesh network so that machines outside of our network can securely get at some services inside it (for example, to send syslog entries to our central syslog server). Since this is all on Ubuntu, we set it up through Netplan, which works but which I said 'has warts' in my first entry about it. Today I discovered another wart due to what I'll call the WireGuard provisioning problem:

Current status: provisioning WireGuard endpoints is exhausting, at least in Ubuntu 22.04 and 24.04 with netplan. So many netplan files to update. I wonder if Netplan will accept files that just define a single peer for a WG network, but I suspect not.

The core WireGuard provisioning problem is that when you add a new WireGuard peer, you have to tell all of the other peers about it (or at least all of the other peers you want to be able to talk to the new peer). When you're using Netplan, it would be convenient if you could put each peer in a separate file in /etc/netplan; then when you add a new peer, you just propagate the new Netplan file for the peer to everything (and do the special Netplan dance required to update peers).

(Apparently I should now call it 'Canonical Netplan', as that's what its front page calls it. At least that makes it clear exactly who is responsible for Netplan's state and how it's not going to be widely used.)

Unfortunately this doesn't work, and it doesn't work in a dangerous way, which is that Netplan only notices one set of WireGuard peers in one netplan file (at least on servers, using systemd-networkd as the backend). If you put each peer in its own file, only the first peer is picked up. If you define some peers in the file where you define your WireGuard private key, local address, and so on, and some peers in another file, only peers from whichever is first will be used (even if the first file only defines peers, which isn't enough to bring up a WireGuard device by itself). As far as I can see, Netplan doesn't report any errors or warnings to the system logs on boot about this situation; instead, you silently get incomplete WireGuard configurations.

This is visibly and clearly a Netplan issue, because on servers you can inspect the systemd-networkd files written by Netplan (in /run/systemd/network). When I do this, the WireGuard .netdev file has only the peers from one file defined in it (and the .netdev file matches the state of the WireGuard interface). This is especially striking when the netplan file with the private key and listening port (and some peers) is second; since the .netdev file contains the private key and so on, Netplan is clearly merging data from more than one netplan file, not completely ignoring everything except the first one. It's just ignoring any peers encountered after the first set of them.
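Since the failure is silent, it can be worth checking the generated .netdev file after changing Netplan files. The following is a minimal sketch that counts the [WireGuardPeer] sections in such a file. The function name is made up for illustration, and the file name in the usage note is an assumption about how Netplan names its generated files.

```shell
# Count the [WireGuardPeer] sections in a systemd-networkd .netdev file.
# Usage: count_wg_peers FILE
count_wg_peers() {
    grep -c '^\[WireGuardPeer\]' "$1"
}
```

For example, 'count_wg_peers /run/systemd/network/10-netplan-wg0.netdev' (hypothetical file name) should report as many peers as you defined across your Netplan files; if it reports fewer, some peers were silently dropped. The count can also be compared against 'wg show wg0 peers | wc -l' for the live interface.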

My overall conclusion is that in Netplan, you need to put all configuration for a given WireGuard interface into a single file, however tempting it might be to try splitting it up (for example, to put core WireGuard configuration stuff in one file and then list all peers in another one).

I don't know if this is an already filed Netplan bug and I don't plan on bothering to file one for it, partly because I don't expect Canonical to fix Netplan issues any more than I expect them to fix anything else and partly for other reasons.

PS: I'm aware that we could build a system to generate the Netplan WireGuard file, or maybe find a YAML manipulating program that could insert and delete blocks that matched some criteria. I'm not interested in building yet another bespoke custom system to deal with what is (for us) a minor problem, since we don't expect to be constantly deploying or removing WireGuard peers.

These days, Linux audio seems to just work (at least for me)

By: cks

For a long time, the common perception was that 'Linux audio' was the punchline for a not particularly funny joke. I sort of shared that belief; although audio had basically worked for me for a long time, I had a simple configuration and dreaded having to make more complex audio work in my unusual desktop environment. But these days, audio seems to just work for me, even in systems that have somewhat complex audio options.

On my office desktop, I've wound up with three potential audio outputs and two audio inputs: the motherboard's standard sound system, a USB headset with a microphone that I use for online meetings, the microphone on my USB webcam, and (to my surprise) a HDMI audio output because my LCD displays do in fact have tiny little speakers built in. In PulseAudio (or whatever is emulating it today), I have the program I use for online meetings set to use the USB headset and everything else plays sound through the motherboard's sound system (which I have basic desktop speakers plugged into). All of this works sufficiently seamlessly that I don't think about it, although I do keep a script around to reset the default audio destination.

On my home desktop, for a long time I had a simple single-output audio system that played through the motherboard's sound system (plus a microphone on a USB webcam that was mostly not connected). Recently I got an outboard USB DAC and, contrary to my fears, it basically plugged in and just worked. It was easy to set the USB DAC as the default output in pavucontrol and all of the settings related to it stick around even when I put it to sleep overnight and it drops off the USB bus. I was quite pleased by how painless the USB DAC was to get working, since I'd been expecting much more hassles.

(Normally I wouldn't bother meticulously switching the USB DAC to standby mode when I'm not using it for an extended time, but I noticed that the case is clearly cooler when it rests in standby mode.)

This is still a relatively simple audio configuration because it's basically static. I can imagine more complex ones, where you have audio outputs that aren't always present and that you want some programs (or more generally audio sources) to use when they are present, perhaps even with priorities. I don't know if the Linux audio systems that Linux distributions are using these days could cope with that, or if they did would give you any easy way to configure it.

(I'm aware that PulseAudio and so on can be fearsomely complex under the hood. As far as the current actual audio system goes, I believe that what my Fedora 41 machines are using for audio is PipeWire (also) with WirePlumber, based on what processes seem to be running. I think this is the current Fedora 41 audio configuration in general, but I'm not sure.)

My Cinnamon desktop customizations (as of 2025)

By: cks

A long time ago I wrote up some basic customizations of Cinnamon, shortly after I started using Cinnamon (also) on my laptop of the time. Since then, the laptop got replaced with another one and various things changed in both the land of Cinnamon and my customizations (eg, also). Today I feel like writing down a general outline of my current customizations, which fall into a number of areas from the modest but visible to the large but invisible.

The large but invisible category is that just like on my main fvwm-based desktop environment, I use xcape (plus a custom Cinnamon key binding for a weird key combination) to invoke my custom dmenu setup (1, 2) when I tap the CapsLock key. I have dmenu set to come up horizontally on the top of the display, which Cinnamon conveniently leaves alone in the default setup (it has its bar at the bottom). And of course I make CapsLock into an additional Control key when held.

(On the laptop I'm using a very old method of doing this. On more modern Cinnamon setups in virtual machines, I do this with Settings → Keyboard → Layout → Options, and then in the CapsLock section set CapsLock to be an additional Ctrl key.)

To start xcape up and do some other things, like load X resources, I have a personal entry in Settings → Startup Applications that runs a script in my ~/bin/X11. I could probably do this in a more modern way with an assortment of .desktop files in ~/.config/autostart (which is where my 'Startup Applications' settings actually wind up) that run each thing individually or perhaps some systemd user units. But the current approach works and is easy to modify if I want to add or remove things (I can just edit the script).

I have a number of Cinnamon 'applets' installed on my laptop and my other Cinnamon VM setups. The ones I have everywhere are Spices Update and Shutdown Applet, the latter because if I tell the (virtual) machine to log me off, shut down, or restart, I generally don't want to be nagged about it. On my laptop I also have CPU Frequency Applet (set to only display a summary) and CPU Temperature Indicator, for no compelling reason. In all environments I also pin launchers for Firefox and (Gnome) Terminal to the Cinnamon bottom bar, because I start both of them often enough. I position the Shutdown Applet on the left side, next to the launchers, because I think of it as a peculiar 'launcher' instead of an applet (on the right).

(The default Cinnamon keybindings also start a terminal with Ctrl + Alt + T, which you can still find through the same process from several years ago provided that you don't cleverly put something in .local/share/glib-2.0/schemas and then run 'glib-compile-schemas .' in that directory. If I was a smarter bear, I'd understand what I should have done when I was experimenting with something.)

On my virtual machines with Cinnamon, I don't bother with the whole xcape and dmenu framework, but I do set up the applets and the launchers and fix CapsLock.

(This entry was sort of inspired by someone I know who just became a Linux desktop user (after being a long time terminal user).)

Sidebar: My Cinnamon 'window manager' custom keybindings

I have these (on my laptop) and perpetually forget about them, so I'm going to write them down now so perhaps that will change.

move-to-corner-ne=['<Alt><Super>Right']
move-to-corner-nw=['<Alt><Super>Left']
move-to-corner-se=['<Primary><Alt><Super>Right']
move-to-corner-sw=['<Primary><Alt><Super>Left']
move-to-side-e=['<Shift><Alt><Super>Right']
move-to-side-n=['<Shift><Alt><Super>Up']
move-to-side-s=['<Shift><Alt><Super>Down']
move-to-side-w=['<Shift><Alt><Super>Left']

I have some other keybindings on the laptop but they're even less important, especially once I added dmenu.

Looking at what NFSv4 clients have locked on a Linux NFS(v4) server

By: cks

A while ago I wrote an entry about (not) finding which NFSv4 client owns a lock on a Linux NFS(v4) server, where the best I could do was pick awkwardly through the raw NFS v4 client information in /proc/fs/nfsd/clients. Recently I discovered an alternative to doing this by hand, which is the nfsdclnts program, and as a result of digging into it and what I was seeing when I tried it out, I now believe I have a better understanding of the entire situation (which was previously somewhat confusing).

The basic thing that nfsdclnts will do is list 'locks' and some information about them with 'nfsdclnts -t lock', in addition to listing other state information such as 'open', for open files, and 'deleg', for NFS v4 delegations. The information it lists is somewhat limited; for example, it will list the inode number but not the filesystem. On the good side, nfsdclnts is a Python program, so you can easily modify it to report any extra information that exists in the clients/#/states files. However, this information about locks is not complete, because of how file level locks appear to normally manifest in NFS v4 client state.

(The information in the states files is limited, although it contains somewhat more than nfsdclnts shows.)

Here is how I understand NFS v4 locking and states. To start with, NFS v4 has a feature called delegations where the NFS v4 server can hand a lot of authority over a file to a NFS v4 client. When a NFS v4 client accesses a file, the NFS v4 server likes to give it a delegation if this is possible; it normally will be if no one else has the file open or active. Once a NFS v4 client holds a delegation, it can lock the file without involving the NFS v4 server. At this point, the client's 'states' file will report an opaque 'type: deleg' entry for the file (and this entry may or may not have a filename or instead be what nfsdclnts will report as 'disconnected dentry').

While a NFS v4 client has the file delegated, if any other NFS v4 client does anything with the file, including simply opening it, the NFS v4 server will recall the delegation from the original client. As a result, the original client now has to tell the NFS v4 server that it has the file locked. At this point a 'type: lock' entry for the file appears in the first NFS v4 client's states file. If the first NFS v4 client releases its lock while the second NFS v4 client is trying to acquire it, the second NFS v4 client will not have a delegation for the file, so its lock will show up as an explicit 'type: lock' entry in its states file.

An additional wrinkle is that a NFS v4 client holding a delegation doesn't immediately release it once all processes have released their locks, closed the file, and so on. Instead the delegation may linger on for some time. If another NFS v4 client opens the file during this time, the first client will lose the delegation but the second NFS v4 client may not get a delegation from the NFS v4 server, so its lock will be visible as a 'type: lock' states file entry.

A third wrinkle is that multiple clients may hold read-only delegations for a file and have fcntl() read locks on it at once, with each of them having a 'type: deleg, access: r' entry for it in their states files. These will only become visible 'type: lock' states entries if the clients have to release their delegations.

So putting this all together:

  • If there is a 'type: lock' entry for the file in any states file (or it's listed in 'nfsdclnts -t lock'), the file is definitely locked by whoever has that entry.

  • If there are no 'type: deleg' or 'type: lock' entries for the file, it's definitely not locked; you can also see this by whether nfsdclnts lists it as having delegations or locks.

  • If there are 'type: deleg' entries for the file, it may or may not be locked by the NFS v4 client (or clients) with the delegation. If the delegation is an 'access: w' delegation, you can see if someone actually has the file locked by accessing the file on another NFS v4 client, which will force the NFS v4 server to recall the delegation and expose the lock if there is one.

If the delegation is 'access: r' and might have multiple read-only locks, you can't force the NFS v4 server to recall the delegation by merely opening the file read-only (for example with 'cat file' or 'less file'). Instead the server will only recall the delegation if you open the file read-write. A convenient way to do this is probably to use 'flock -x <file> -c /bin/true', although this does require you to have more permissions for the file than simply the ability to read it.
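The three cases above can be sketched as a small shell function. This is only an illustration under assumptions: it assumes each entry in a states file sits on a single line and contains the 'type: lock' or 'type: deleg' text quoted above, and that you can pick out a file's entries with some fixed string (such as its filename, when the entry has one). The function name is made up.

```shell
# Report the lock status of a file based on NFS v4 client states entries.
# Usage: nfs4_lock_status PATTERN STATES_FILE...
# PATTERN is a fixed string that identifies the file's entries.
nfs4_lock_status() {
    pattern=$1
    shift
    # Collect all entries that mention the file; no match is not an error.
    entries=$(cat "$@" 2>/dev/null | grep -F -- "$pattern" || true)
    if printf '%s\n' "$entries" | grep -q 'type: lock'; then
        echo "locked"
    elif printf '%s\n' "$entries" | grep -q 'type: deleg'; then
        # A delegation may or may not hide a lock held on the client.
        echo "maybe locked (delegation held)"
    else
        echo "not locked"
    fi
}
```

On the NFS server this would be run as root against /proc/fs/nfsd/clients/*/states. For the 'maybe' case you still have to force the delegation to be recalled, as described above, to find out whether there is a real lock.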

Sidebar: Disabling NFS v4 delegations on the server

Based on trawling various places, I believe this is done by writing a '0' to /proc/sys/fs/leases-enable (or the equivalent 'fs.leases-enable' sysctl) and then apparently restarting your NFS v4 server processes. This will disable all user level uses of fcntl()'s F_SETLEASE and F_GETLEASE as an additional effect, and I don't know if this will affect any important programs running on the NFS server itself. Based on a study of the kernel source code, I believe that you don't need to restart your NFS v4 server processes if it's sufficient for the NFS server to stop handing out new delegations but current delegations can stay until they're dropped.

(There have apparently been some NFS v4 server and client issues with delegations, cf, along with other NFS v4 issues. However, I don't know if the cure winds up being worse than the disease here, or if there's another way to deal with these stateid problems.)

Getting older, now-replaced Fedora package updates

By: cks

Over the history of a given Fedora version, Fedora will often release multiple updates to the same package (for example, kernels, but there are many others). When it does this, the older packages wind up being removed from the updates repository and are no longer readily available through mechanisms like 'dnf list --showduplicates <package>'. For a long time I used dnf's 'local' plugin to maintain a local archive of all packages I'd updated, so I could easily revert, but it turns out that as of Fedora 41's change to dnf5 (dnf version 5), that plugin is not available (presumably it hasn't been ported to dnf5, and may never be). So I decided to look into my other options for retrieving and installing older versions of packages, in case the most recent version has a bug that affects me (which has happened).

Before I take everyone on a long yak-shaving expedition, the simplest and best answer is to install the 'fedora-repos-archive' package, which installs an additional Fedora repository that has those replaced updates. After installing it, I suggest that you edit /etc/yum.repos.d/fedora-updates-archive.repo to disable it by default, which will save you time, bandwidth, and possibly aggravation. Then when you really want to see all possible versions of, say, Rust, you can do:

dnf list --showduplicates --enablerepo=updates-archive rust

You can then use 'dnf downgrade ...' as appropriate.

(Like the other Fedora repositories, updates-archive automatically knows your release version and picks packages from it. I think you can change this a bit with '--releasever=<NN>', but I'm not sure how deep the archive is.)
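Disabling the archive repository by default comes down to flipping 'enabled=1' to 'enabled=0' in the repo file. Here is a minimal sketch using GNU sed; the function name is made up, and it assumes the file uses the usual 'enabled=1' lines. An explicit '--enablerepo=updates-archive' on the command line still turns the repository on for that one command.

```shell
# Set every 'enabled=1' line in a .repo file to 'enabled=0', in place.
# Usage: disable_repo_file FILE   (needs GNU sed for -i)
disable_repo_file() {
    sed -i 's/^enabled[[:space:]]*=[[:space:]]*1[[:space:]]*$/enabled=0/' "$1"
}
```

As root, this would be 'disable_repo_file /etc/yum.repos.d/fedora-updates-archive.repo'.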

The other approach is to use Fedora Bodhi (also) and Fedora Koji (also) to fetch the packages for older builds, in much the same way as you can use Bodhi (and Koji) to fetch new builds that aren't in the updates or updates-testing repository yet. To start with, we're going to need to find out what's available. I think this can be done through either Bodhi or Koji, although Koji is presumably more authoritative. Let's do this for Rust in Fedora 41:

bodhi updates query --packages rust --releases f41
koji list-builds --state COMPLETE --no-draft --package rust --pattern '*.fc41'

Note that both of these listings are going to include package versions that were never released as updates for various reasons, and also versions built for the pre-release Fedora 41. Although Koji has a 'f41-updates' tag, I haven't been able to find a way to restrict 'koji list-builds' output to packages with that tag, so we're getting more than we'd like even after we use a pattern to restrict this to just Fedora 41.

(I think you may need to use the source package name, not a binary package one; if so, you can get it with 'rpm -qi rust' or whatever and looking at the 'Source RPM' line and name.)

Once you've found the package version you want, the easiest and fastest way to get it is through the koji command line client, following the directions in Installing Kernel from Koji with appropriate changes:

mkdir /tmp/scr
cd /tmp/scr
koji download-build --arch=x86_64 --arch=noarch rust-1.83.0-1.fc41

This will get you a bunch of RPMs, and then you can do 'dnf downgrade /tmp/scr/*.rpm' to have dnf do the right thing (only downgrading things you actually have installed).

One reason you might want to use Koji is that this gets you a local copy of the old package in case you want to go back and forth between it and the latest version for testing. If you use the dnf updates-archive approach, you'll be re-downloading the old version at every cycle. Of course at that point you can also use Koji to get a local copy of the latest update too, or 'dnf download ...', although Koji has the advantage that it gets all the related packages regardless of their names (so for Rust you get the 'cargo', 'clippy', and 'rustfmt' packages too).

(In theory you can work through the Fedora Bodhi website, but in practice it seems to be extremely overloaded at the moment and very slow. I suspect that the bot scraper plague is one contributing factor.)

PS: If you're using updates-archive and you just want to download the old packages, I think what you want is 'dnf download --enablerepo=updates-archive ...'.

Fedora 41 seems to have dropped an old XFT font 'property'

By: cks

Today I upgraded my office desktop from Fedora 40 to Fedora 41, and as traditional there was a little issue:

Current status: it has been '0' days since a Fedora upgrade caused X font problems, this time because xft apparently no longer accepts 'encoding=...' as a font specification argument/option.

One of the small issues with XFT fonts is that they don't really have canonical names. As covered in the "Font Name" section of fonts.conf, a given XFT font is a composite of a family, a size, and a number of attributes that may be used to narrow down the selection of the XFT font until there's only one option left (or no option left). One way to write that in textual form is, for example, 'Sans:Condensed Bold:size=13'.

For a long time, one of the 'name=value' properties that XFT font matching accepted was 'encoding=<something>'. For example, you might say 'encoding=iso10646-1' to specify 'Unicode' (and back in the long ago days, this apparently could make a difference for font rendering). Although I can't find 'encoding=' documented in historical fonts.conf stuff, I appear to have used it for more than a decade, dating back to when I first converted my fvwm configuration from XLFD fonts to XFT fonts. It's still accepted today on Fedora 40 (although I suspect it does nothing):

: f40 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
DejaVuSans.ttf: "DejaVu Sans" "Regular"

However, it's no longer accepted on Fedora 41:

: f41 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
Unable to parse the pattern

Initially I thought this had to be a change in fontconfig, but that doesn't seem to be the case; both Fedora 40 and Fedora 41 use the same version, '2.15.0', just with different build numbers (partly because of a mass rebuild for Fedora 41). Freetype itself went from version 2.13.2 to 2.13.3, but the release notes don't seem to have anything relevant. So I'm at a loss. At least it was easy to fix once I knew what had happened; I just had to take the ':encoding=iso10646-1' bit out from the places I had it.
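If you have the property scattered through a lot of configuration files, the fix can be mechanized. The following is a minimal sketch of a filter that strips any ':encoding=...' element from XFT font specifications on standard input. The function name is made up, and it assumes the encoding value contains no colons, quotes, or spaces.

```shell
# Remove ':encoding=<value>' from XFT font specifications read on stdin.
strip_xft_encoding() {
    sed "s/:encoding=[^:\"' ]*//g"
}
```

For example, "strip_xft_encoding < ~/.fvwm2rc" (or wherever your configuration lives) prints a cleaned-up copy that you can compare against the original before replacing it.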

(The visual manifestation was that all of my fvwm menus and window title bars switched to a tiny font. For historical reasons all of my XFT font specifications in my fvwm configuration file used 'encoding=...', so in Fedora 41 none of them worked and fvwm reported 'can't load font <whatever>' and fell back to its default of an XLFD font, which was tiny on my HiDPI display.)

PS: I suspect that this change will be coming in other Linux distributions sooner or later. Unsurprisingly, Ubuntu 24.04's fc-match still accepts 'encoding=...'.

PPS: Based on ltrace output, FcNameParse() appears to be what fails on Fedora 41.

I should learn systemd's features for restricting things

By: cks

Today, for reasons beyond the scope of this entry, I took something I'd been running by hand from the command line for testing and tried to set it up under systemd. This is normally straightforward, and it should have been extra straightforward because the thing came with a .service file. But that .service file used a lot of systemd's features for restricting what programs can do, and for my sins I'd decided to set up the program with its binary, configuration file, and so on in different places than it expected (and I think without some things it expected, like a supplementary group for permission to read some files). This was, unfortunately, an abject failure, so I wound up yanking all of the restrictions except 'DynamicUser=true'.

I'm confident that with enough time, I can (or could) sort out all of the problems (although I didn't feel like spending that time today). What this experience really points out is that systemd has a lot of options for really restricting what the programs you run can do, and I'm not particularly familiar with them. To get the service working with all of its original restrictions, I'd have to read my way through things like systemd.exec and understand what everything the .service file used did. Once I did that, I could understand what I needed to change to deal with my setup of the program.

(An expert probably could have fixed things in short order.)

That systemd has a lot of potential restrictions it can impose and that those restrictions are complex is not a flaw of systemd (or its fault). We already know that fine grained permissions are hard to set up and manage in any environment, especially if you don't know what you're doing (as I don't with systemd's restrictions). At the same time, fine grained restrictions are quite useful for being able to apply some restrictions to programs not designed for them.

(The simplicity of OpenBSD's 'pledge' system is great, but it needs the program's active cooperation. For better or worse, Linux doesn't have a native, fully supported equivalent; instead we have to build it out of more fine grained, lower level facilities, and that's what systemd exposes.)

Learning how to use the restrictions is probably worthwhile in general. We run plenty of things through locally written systemd .service units. Some amount of those things are potentially risky (although generally not too risky), and some of them could be more restricted than they are today if we wanted to do the work and knew what we were doing (and knew some of the gotchas involved).
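As a starting point for experimenting, here is a minimal sketch of a shell function that writes a drop-in with a few of the commonly used systemd.exec restrictions. The function name, the drop-in file name, and the particular selection of settings are illustrative choices, not a recommendation for any specific service.

```shell
# Write a small hardening drop-in into a unit's .d directory.
# Usage: write_hardening_dropin DIR
write_hardening_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/10-hardening.conf" <<'EOF'
[Service]
# No gaining privileges through setuid programs and the like.
NoNewPrivileges=true
# Private /tmp and /var/tmp for the service.
PrivateTmp=true
# Almost all of the filesystem becomes read-only for the service.
ProtectSystem=strict
# Home directories are made inaccessible.
ProtectHome=true
EOF
}
```

For a hypothetical example.service, this would be 'write_hardening_dropin /etc/systemd/system/example.service.d' followed by 'systemctl daemon-reload' and a restart of the service. One gotcha is that 'ProtectSystem=strict' usually needs matching 'ReadWritePaths=' settings for anything the service writes to. Running 'systemd-analyze security example.service' gives an overview of which restrictions a unit does and doesn't use.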

(And sooner or later we're going to run into more things with restrictions already in their .service units, and we're going to want to change some aspects of how they work.)

I'm working to switch from wget to curl (due to Fedora)

By: cks

I've been using wget for a long time now, which means that I've developed a lot of habits, reflexes and even little scripts around it. Then wget2 happened, or more exactly Fedora switched from wget to wget2 (and Ubuntu is probably going to follow along). I'm very much not a fan of wget2 (also); I find it has both worse behavior and worse output than classical wget, in ways that routinely get in my way. Or got in my way before I started retraining myself to use curl instead of wget.

(It's actually possible that Ubuntu won't follow Fedora here. Ubuntu 24.04's 'wget' is classic wget, and Debian unstable currently has the wget package still as classic wget. The wget to wget2 transition involves the kind of changes that I can see Debian developers rejecting, so maybe Debian will keep 'wget' as classic wget. The upstream has a wget 1.25.0 release as recently as November 2024 (cf); on the other hand, the main project page says that 'currently GNU wget2 is being developed', so it certainly sounds like the upstream wants to move.)

One tool for my switch is wcurl (also, via), which is a cover script to provide a wget-like interface to curl. But I don't have wcurl everywhere (it's not packaged in Ubuntu 24.04, although I think it's coming in 26.04), so I've also been working to remember things like curl's -L and -O options (for downloading things, these are basically 'do what I want' options; I almost always want curl to follow HTTP redirects). There's a number of other options I want to remember, so since I've been looking at the curl manual page, here's some notes to myself.

(If I download multiple URLs at once, I'll probably want to use '--remote-name-all' instead of repeating -O a lot. But I'm probably not going to remember that unless I write a script.)

My 'wcat' script is basically 'curl -L -sS <url>' (-s to not show the progress bar, -S to still show an error message if curl fails, -L to follow redirects). My related 'wretr' script, which is intended to show headers too, is 'curl -L -sS -i <url>' (-i includes headers), or 'curl -sS -i <url>' if I want to explicitly see any HTTP redirect rather than automatically follow it.
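Written as shell functions, the two scripts might look like the following minimal sketch. The CURL variable is an addition for illustration (it lets you substitute a different curl binary or a stub for testing); the options are the ones described above.

```shell
# Which curl to run; override CURL to use a different binary.
CURL=${CURL:-curl}

# Fetch a URL to standard output, following redirects, without a progress bar.
wcat() {
    "$CURL" -L -sS "$@"
}

# The same, but include the HTTP response headers in the output.
wretr() {
    "$CURL" -L -sS -i "$@"
}
```

Usage is 'wcat <url>' or 'wretr <url>'. To see a redirect itself instead of following it, run 'curl -sS -i <url>' directly.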

(What I'd like is an option to show HTTP headers only if there was an HTTP error, but curl is currently all or nothing here.)

Some of the time I'll want to fetch files with the -J option (along with -O), which is the curl equivalent of wget's --content-disposition; it names the local file from the server's Content-Disposition header instead of from the URL. This is necessary in cases where a project doesn't bother with good URLs for things. Possibly I also want to use '-R' to set the local downloaded file's timestamp based on the server provided timestamp, which is wget's traditional behavior (sometimes it's good, sometimes it's confusing).

PS: I care about wcurl being part of a standard Ubuntu package because then we can install it as part of one of our standard package sets. If it's a personal script, it's not pervasive, although that's still better than nothing.

PPS: I'm not going to blame Fedora for the switch from wget to wget2. Fedora has a consistent policy of marching forward in changes like this to stay in sync with what upstream is developing, even when they cause pain to people using Fedora. That's just what you sign up for when you choose Fedora (or drift into it, in my case; I've been using 'Fedora' since before it was Fedora).

Launching BSSG - My Journey from Dynamic CMS to Bash Static Site Generator

Photo by Patrick Fore on Unsplash

I've had my own website practically forever. Back in the late '90s, I already had a web page on my ISP's server, and since at least 2001, I've had my own homepage on my own server. I've never been a great graphic designer, let alone a skilled webmaster, so I've always tried to keep things minimal and compatible.

Initially, like many others, I wrote HTML pages by hand. Then I used WYSIWYG creation tools, and eventually, I landed on CMS (Content Management Systems).

The Era of Dynamic CMS

I liked CMS because they allowed me to focus on the content and not on the correctness of the generated HTML. Thanks to them, I started writing my first blog shortly afterward.

Over the years, I've used many tools like PHPNuke, FlatNuke (created and developed by my friend Simone Vellei), eventually moving through Joomla and WordPress. WordPress always seemed like the most suitable tool for the job, and I used it for many years. Even today, mainly on the sysadmin side, I manage hundreds of WordPress sites, and they are reasonably reliable, aside from the plugins (because the problem with WordPress isn't the software itself, but many of the external plugins).

But this is precisely the problem: all dynamic CMS require constant and continuous security updates because, without them, the chances of defacement are extremely high.

Discovering Static Site Generators

And that's precisely why, when I discovered Carlos Fenollosa's bashblog in 2014, it immediately became clear that, indeed, there was no reason to continue down the path of dynamic CMS. I don't write often, I don't update often, there's no reason to regenerate all the content with every visit. Sure, WordPress caching plugins are often quite effective, but they are still add-ons that need to be kept up to date. And I'm not a fan of adding things to streamline. Often, less is more.

So, I started using bashblog for some 'secondary' projects until, in 2015, I migrated my 'old' Italian blog from WordPress to Pelican. Shortly after, I moved from Pelican to Nikola, and that blog is still generated by Nikola, although (that blog's) updates are now extremely rare (so much so that I consider it almost abandoned). I also created the first Docker container for Nikola and, for a long time, it was listed among the deployment methods on their site.

Building My Own: BSSG

But bashblog continued to fascinate me. So in 2015, for fun, I started developing my own Static Site Generator from scratch. I called it (with little imagination), BSSG - Bash Static Site Generator. The plan was for it to be compatible with the main OSes I use, to remain sufficiently simple and straightforward (!!!), and to be tailored to my needs. I intended to use it only and exclusively for small private things, starting with a sort of diary of mine - more professional than personal - and leave the 'official' blogs to more tested and 'professional' tools.

As time went by, I added some small features I liked: theming support, archives, tags (initially absent). Over time, many functions were added, and the script grew large – large enough to make me pause and ask myself some questions about the long-term stability of this solution. So, it remained only for my 'diary', which, however, grew year after year to the point where I needed to devise some kind of optimization. I then developed (more for fun than out of real necessity) a caching system. On rebuild, only what needs to be rebuilt is reconstructed, making the operation sufficiently fast even as the number of posts grows. Obviously, there are limits: using bash and external tools, the efficiency cannot be compared to that of a proper programming language.
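To illustrate the idea of rebuilding only what has changed (this is not BSSG's actual code), here is a minimal sketch that decides per file based on modification times. The function name is made up, and the timestamp comparison is an assumption; a real caching system might compare content hashes instead.

```shell
# Succeed if OUTPUT is missing or older than SOURCE.
# Usage: needs_rebuild SOURCE OUTPUT
needs_rebuild() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}
```

A build loop can then skip unchanged posts with something like 'needs_rebuild "$src" "$out" || continue' before doing the expensive rendering step.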

Brief Detour: ITNBlog

And it's here that I decided, in preparation for opening a new blog (this one), to create a new tool called ITNBlog. I would develop it in Python and focus a bit more on performance and completeness. But ITNBlog stalled very quickly: time was limited, I'm not a full-time developer, so I realized I would spend too much time on development and too little on content creation.

Therefore, in 2018, I launched this blog but using Ghost, a solution that gave me good results, including performance-wise. I chose Ghost because I thought that, writing content also from my phone while on the go, a real CMS would be useful. Spoiler: no, it didn't turn out that way, so a few years later I decided to migrate this blog to Hugo. Nevertheless, I continued to develop ITNBlog on and off, as a hobby, without any particular ambitions.

At some point, however, I found myself in an unpleasant situation: Hugo deprecated some features, and the theme I had chosen moved forward. Using the latest version of Hugo and the current version of the theme would produce unacceptable output; staying with the old version of Hugo while waiting for the theme update meant making a compromise, because I actually build the blog from different devices, and they all have different versions of Hugo installed. Change the theme? Feasible, but I would have had to modify almost the entire site.

I considered migrating to manpageblog by gyptazy – I personally love its simplicity and retro look, and it was the main candidate to replace Hugo. I also created a script and migrated all my posts into the correct format.

BSSG to the Rescue (and ITNBlog's Role)

That's when I realized: I would implement the few missing features needed to make ITNBlog sufficiently complete, and this blog would be published using it, ensuring I'd be committed to its development. However, ITNBlog is not mature enough to be released publicly, so for now, it will remain the engine just for my blog. Then I thought again about BSSG – development had stalled some time ago, but it was still in use – and figured that perhaps, with a little tidying up, I could release it.

Because I'm tired of seeing people use dynamic CMS even to implement primarily static blogs or websites – and BSSG, despite its limitations and inefficiencies, works. And there are many themes to choose from. In short, you can install it and generate your blog in seconds.

Why Choose BSSG?

BSSG is the result of a 10-year evolution. The code isn't extremely consistent, some interesting features are missing (which I plan to implement), and it could use refactoring as the build script is monstrously large. But it works, it's portable (and much of the complexity increased precisely because of portability), and it generates sites that achieve very high accessibility and speed scores.

Here are some highlights:

  • ✅ Portability: Uses native OS tools (e.g., md5sum on Linux, md5 on OpenBSD and NetBSD). Portability itself added much of the complexity!
  • ✅ Simple Theming: Themes are just simple CSS files, so the structure remains the same – simplifying theme switching or creating new ones. More than 50 themes are already available!
  • ✅ Essential Features: Supports RSS feed generation, sitemap.xml, OpenGraph tags (to improve social sharing), internationalization (the blog can be in languages other than English – but not multilingual, at least for now), etc.
  • ✅ Built-in Backup and Restore script: It will just copy the configuration file, posts, and pages. Nothing else.
  • ✅ Minimal Dependencies.
  • ✅ Markdown Support: Posts and pages are in Markdown (CommonMark, Pandoc, and markdown.pl are supported).
  • ✅ Feature Images.
  • ✅ Optional GNU Parallel Integration: To speed up build times when there are many posts. This feature significantly impacts the code and has caused me numerous headaches over time. But it's optional (if parallel isn't found, it proceeds traditionally) and only provides benefits when the number of posts increases: with few posts, performance actually degrades.
  • ✅ High Accessibility and Performance Scores: Sites built with BSSG achieve excellent scores.
  • ✅ BSD Licensed: Released under a BSD license.
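To give an idea of what the portability point means in practice, here is a minimal sketch of a checksum wrapper. It is not taken from BSSG's code: the function name file_md5 is hypothetical, and it assumes that the BSD md5 tool prints only the digest when it reads from standard input.

```shell
#!/bin/sh
# Hypothetical helper (not BSSG's actual code): print the MD5 digest of a
# file using whichever native tool is installed.
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        # md5sum prints "<digest>  -" for standard input; keep the digest.
        md5sum < "$1" | cut -d ' ' -f 1
    elif command -v md5 >/dev/null 2>&1; then
        # Assumption: BSD md5 prints just the digest when reading stdin.
        md5 < "$1"
    else
        echo "file_md5: no md5sum or md5 found" >&2
        return 1
    fi
}
```

Every such difference between platforms needs a branch like this one, which is where much of the extra complexity comes from.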

One of the problems I've always had with all CMS and SSGs has been choosing a theme. In some cases (like Hugo), the theme heavily influences the output, which is both good and bad. Good because it makes each site unique, but bad because it makes switching themes difficult. In the past, I've sometimes found myself having to change themes because they were abandoned and no longer updated. BSSG works differently: theming comes from using a different CSS file, which makes its structure more rigid, but switching from one theme to another is trivial. To help with the choice, I created a script that will build your site using all the themes present in the themes directory, just like on the examples page of the official website. This way, it will be easy to see and test your site with all available themes. If you want to add a touch of originality, you can choose the 'random' theme, and one will be chosen randomly from the list at each site regeneration.

Admin Interface (Experimental)

BSSG is in production use by some clients (for their internal sites), for whom I also created a basic admin interface (using Node Express, partly to chew on a bit of Node), but I don't feel ready to release it immediately as it's not sufficiently tested. It has an integrated Markdown editor and allows post scheduling, generating the files and launching BSSG with the right options at the right time. This could be that connecting link between traditional CMS and SSGs. There are others, but this one is tightly integrated with BSSG.

BSSG is Available Today

Starting today, BSSG is publicly available. It's not perfect, it probably doesn't make sense to do something of this complexity in bash, development will proceed slowly – but it's here, available to anyone who might find it useful.

Happy blogging everyone!

How we handle debconf questions during our Ubuntu installs

By: cks

In a comment on How we automate installing extra packages during Ubuntu installs, David Magda asked how we dealt with the things that need debconf answers. This is a good question and we have two approaches that we use in combination. First, we have a prepared file of debconf selections for each Ubuntu version and we feed this into debconf-set-selections before we start installing packages. However in practice this file doesn't have much in it and we rarely remember to update it (and as a result, a bunch of it is somewhat obsolete). We generally only update this file if we discover debconf selections where the default doesn't work in our environment.

Second, we run apt-get with a bunch of environment variables set to muzzle debconf:

export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=<null address>@<our domain>
export DEBIAN_FRONTEND=noninteractive

Traditionally I've considered muzzling debconf this way to be too dangerous to do during package updates or installing packages by hand. However, I consider it not so much safe as safe enough to do this during our standard install process. To put it one way, we're not starting out with a working system and potentially breaking it by letting some new or updated package pick bad defaults. Instead we're starting with a non-working system and hopefully ending up with a working one. If some package picks bad defaults and we wind up with problems, that's not much worse than we started out with and we'll fix it by updating our file of debconf selections and then redoing the install.
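Since the muzzling is only meant for the install process, one way to keep it from leaking into later by-hand package work is to scope it to a single command instead of exporting it. This is a minimal sketch and not our actual install script; the quiet_debconf name is made up, and DEBCONF_ADMIN_EMAIL is left out because its value is site specific.

```shell
#!/bin/sh
# Hypothetical wrapper: run one command with debconf muzzled, without
# exporting the settings into the calling shell.
# DEBCONF_ADMIN_EMAIL is deliberately omitted; add your own address.
quiet_debconf() {
    env DEBCONF_TERSE=yes \
        DEBCONF_NOWARNINGS=yes \
        DEBIAN_FRONTEND=noninteractive \
        "$@"
}

# Intended use inside an install script (needs root, so not run here):
#   quiet_debconf apt-get -qq -y install $pkgs
```

Anything run outside the wrapper still gets debconf's normal behaviour, which matches the idea that this is only safe enough during installs.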

Also, in practice all of this gets worked out during our initial test installs of any new Ubuntu version (done on test virtual machines these days). By the time we're ready to start installing real servers with a new Ubuntu version, we've gone through most of the discovery process for debconf questions. Then the only time we're going to have problems during future system installs is if a package update either changes the default answer for a current question (to a bad one) or adds a new question with a bad default. As far as I can remember, we haven't had either happen.

(Some of our servers need additional packages installed, which we do by hand (as mentioned), and sometimes the packages will insist on stopping to ask us questions or give us warnings. This is annoying, but so far not annoying enough to fix it by augmenting our standard debconf selections to deal with it.)

How we automate installing extra packages during Ubuntu installs

By: cks

We have a local system for installing Ubuntu machines, and one of the important things it does is install various additional Ubuntu packages that we want as part of our standard installs. These days we have two sorts of standard installs, a 'base' set of packages that everything gets and a broader set of packages that login servers and compute servers get (to make them more useful and usable by people). Specialized machines need additional packages, and while we can automate installation of those too, they're generally a small enough set of packages that we document them in our install instructions for each machine and install them by hand.

There are probably clever ways to do bulk installs of Ubuntu packages, but if so, we don't use them. Our approach is instead a brute force one. We have files that contain lists of packages, such as a 'base' file, and these files just contain a list of packages with optional comments:

# Partial example of Basic package set
amanda-client
curl
jq
[...]

# decodes kernel MCE/machine check events
rasdaemon

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev devscripts automake 

(Like all of the rest of our configuration information, these package set files live in our central administrative filesystem. You could distribute them in some other way, for example fetching them with rsync or even HTTP.)

To install these packages, we use grep to extract the actual packages into a big list and feed the big list to apt-get. This is more or less:

pkgs=$(grep -v '^#' "$PKGDIR/$s" | grep -v '^[[:space:]]*$')
apt-get -qq -y install $pkgs

(This will abort if any of the packages we list aren't available. We consider this a feature, because it means we have an error in the list of packages.)
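A slightly more defensive version of the extraction step might look like the following. This is a sketch under assumptions rather than what we run: the read_pkglist name is hypothetical, and stripping trailing '#' comments from package lines goes beyond the file format described above, where comments are on their own lines.

```shell
#!/bin/sh
# Hypothetical helper: print one package name per line from one or more
# package set files, dropping comments and blank lines.
read_pkglist() {
    # Strip everything from '#' to the end of the line, turn runs of spaces
    # and tabs into single newlines, then drop any empty lines left over.
    sed -e 's/#.*$//' "$@" | tr -s ' \t' '\n\n' | grep -v '^$'
}

# Intended use (needs root, so not run here):
#   apt-get -qq -y install $(read_pkglist "$PKGDIR/base")
```

Having one package per line also makes it easy to sort the list or compare two package sets with standard tools.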

A more organized and minimal approach might be to add the '--no-install-recommends' option, but we started without it and we don't particularly want to go back to find which recommended packages we'd have to explicitly add to our package lists.

At least some of the 'base' package installs could be done during the initial system install process from our customized Ubuntu server ISO image, since you can specify additional packages to install. However, doing package installs that way would create a series of issues in practice. We'd probably need to more carefully track which package came from which Ubuntu collection, since only some of them are enabled during the server install process, it would be harder to update the lists, and the tools for handling the whole process would be a lot more limited, as would our ability to troubleshoot any problems.

Doing this additional package install in our 'postinstall' process means that we're doing it in a full Unix environment where we have all of the standard Unix tools, and we can easily look around the system if and when there's a problem. Generally we've found that the more of our installs we can defer to once the system is running normally, the better.

(Also, the less the Ubuntu installer does, the faster it finishes and the sooner we can get back to our desks.)

(This entry was inspired by parts of a blog post I read recently and reflecting about how we've made setting up new versions of machines pretty easy, assuming our core infrastructure is there.)

The mystery (to me) of tiny font sizes in KDE programs I run

By: cks

Over on the Fediverse I tried a KDE program and ran into a common issue for me:

It has been '0' days since a KDE app started up with too-small fonts on my bespoke fvwm based desktop, and had no text zoom. I guess I will go use a browser, at least I can zoom fonts there.

Maybe I could find a KDE settings thing and maybe find where and why KDE does this (it doesn't happen in GNOME apps), but honestly it's simpler to give up on KDE based programs and find other choices.

(The specific KDE program I was trying to use this time was NeoChat.)

My fvwm based desktop environment has an XSettings daemon running, which I use in part to set up a proper HiDPI environment (also, which doesn't talk about KDE fonts because I never figured that out). I suspect that my HiDPI display is part of why KDE programs often or always seem to pick tiny fonts, but I don't particularly know why. Based on the xsettingsd documentation and the registry, there doesn't seem to be any KDE specific font settings, and I'm setting the Gtk/FontName setting to a font that KDE doesn't seem to be using (which I could only verify once I found a way to see the font I was specifying).

After some searching I found the systemsettings program through the Arch wiki's page on KDE and was able to turn up its font sizes in a way that appears to be durable (ie, it stays after I stop and start systemsettings). However, this hasn't affected the fonts I see in NeoChat when I run it again. There are a bunch of font settings, but maybe NeoChat is using the 'small' font for some reason (apparently which app uses what font setting can be variable).

Qt (the underlying GUI toolkit of much or all of KDE) has its own set of environment variables for scaling things on HiDPI displays, and setting $QT_SCALE_FACTOR does size up NeoChat (apparently bits of Plasma ignore these variables, but I'm unlikely to run into that since I don't want to use KDE's desktop components).
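If I wanted to do this only for specific programs instead of setting the variable for my whole session, a small wrapper would be enough. This is a sketch; the run_scaled name is made up, the 1.5 factor is only an example value, and I'm assuming the NeoChat binary is called 'neochat'.

```shell
#!/bin/sh
# Hypothetical wrapper: run one program with a given Qt scale factor,
# leaving the rest of the session's environment alone.
run_scaled() {
    factor=$1
    shift
    env QT_SCALE_FACTOR="$factor" "$@"
}

# Example use (the factor is a guess that you would tune for your display):
#   run_scaled 1.5 neochat
```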

Some KDE applications have their own settings files with their own font sizes; one example I know of is kdiff3. This is quite helpful because if I'm determined enough, I can either adjust the font sizes in the program's settings or at least go edit the configuration file (in this case, .config/kdiff3rc, I think, not .kde/share/config/kdiff3rc). However, not all KDE applications allow you to change font sizes through either their GUI or a settings file, and NeoChat appears to be one of the ones that don't.

In theory now that I've done all of this research I could resize NeoChat and perhaps other KDE applications through $QT_SCALE_FACTOR. In practice I feel I would rather switch to applications that interoperate better with the rest of my environment unless for some reason the KDE application is either my only choice or the significantly superior one (as it has been so far for kdiff3 for my usage).

Using Netplan to set up WireGuard on Ubuntu 22.04 works, but has warts

By: cks

For reasons outside the scope of this entry, I recently needed to set up WireGuard on an Ubuntu 22.04 machine. When I did this before for an IPv6 gateway, I used systemd-networkd directly. This time around I wasn't going to set up a single peer and stop; I expected to iterate and add peers several times, which made netplan's ability to update and re-do your network configuration look attractive. Also, our machines are already using Netplan for their basic network configuration, so this would spare my co-workers from having to learn about systemd-networkd.

Conveniently, Netplan supports multiple configuration files so you can put your WireGuard configuration into a new .yaml file in your /etc/netplan. The basic version of a WireGuard endpoint with purely internal WireGuard IPs is straightforward:

network:
  version: 2
  tunnels:
    our-wg0:
      mode: wireguard
      addresses: [ 192.168.X.1/24 ]
      port: 51820
      key:
        private: '....'
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820

(You may want something larger than a /24 depending on how many other machines you think you'll be talking to. Also, this configuration doesn't enable IP forwarding, which is a feature in our particular situation.)

If you're using netplan's systemd-networkd backend, which you probably are on an Ubuntu server, you can apparently put your keys into files instead of needing to carefully guard the permissions of your WireGuard /etc/netplan file (which normally has your private key in it).

If you write this out and run 'netplan try' or 'netplan apply', it will duly apply all of the configuration and bring your 'our-wg0' WireGuard configuration up as you expect. The problems emerge when you change this configuration, perhaps to add another peer, and then re-do your 'netplan try', because when you look you'll find that your new peer hasn't been added. This is a sign of a general issue; as far as I can tell, netplan (at least in Ubuntu 22.04) can set up WireGuard devices from scratch but it can't update anything about their WireGuard configuration once they're created. This is probably a limitation in the Ubuntu 22.04 version of systemd-networkd that's only changed in the very latest systemd versions. In order to make WireGuard level changes, you need to remove the device, for example with 'ip link del dev our-wg0' and then re-run 'netplan try' (or 'netplan apply') to re-create the WireGuard device from scratch; the recreated version will include all of your changes.
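The delete and recreate sequence is simple enough to wrap in a small script. This is a minimal sketch, not something we have in production: the rebuild_wg and run names and the DRY_RUN switch are my inventions, and it uses 'netplan apply' rather than 'netplan try' so that it doesn't wait for confirmation.

```shell
#!/bin/sh
# Hypothetical helper: with DRY_RUN set, print commands instead of running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Delete a WireGuard device and let netplan recreate it from the current
# YAML, so that changed peers and keys are picked up.
rebuild_wg() {
    dev=$1
    run ip link del dev "$dev" ||
        echo "note: could not delete $dev (it may not exist yet)" >&2
    run netplan apply
}

# Example use (as root):  rebuild_wg our-wg0
```

Running it with DRY_RUN=1 first shows exactly what would be done, which is reassuring on a machine you reach over the network.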

(The latest online systemd.netdev manual page says that systemd-networkd will try to update netdev configurations if they change, and .netdev files are where WireGuard settings go. The best information I can find is that this change appeared in systemd v257, although the Fedora 41 systemd.netdev manual page has this same wording and it has systemd '256.11'. Maybe there was a backport into Fedora.)

In our specific situation, deleting and recreating the WireGuard device is harmless and we're not going to be doing it very often anyway. In other configurations things may not be so straightforward and so you may need to resort to other means to apply updates to your WireGuard configuration (including working directly through the 'wg' tool).

I'm not impressed by the state of NFS v4 in the Linux kernel

By: cks

Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 after running into a series of problems our people were experiencing with NFS (v3) locks (cf), since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.

(NFS v4 locks are handled relatively differently than NFS v3 locks.)

Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).

(The NFS v4 server issue we encountered may be the one fixed by this commit.)

What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).

We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.

PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses but the client's use is of whatever is convenient and useful for it, not necessarily by NFS v4 revision. If the major use of Linux NFS v4 servers is with v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).

The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)

By: cks

Over on the Fediverse I said:

This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.

When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats I can provide a better answer of what's missing. All of this applies to node_exporter as of version 1.8.2 (the current one as I write this).

(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)

Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.

In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in nfs/parse.go. For the server case, if we cross compare this to the kernel's include/linux/nfs4.h, what's missing from server stats is all NFS v4.1, v4.2, and RFC 8276 xattr operations, everything from operation 40 through operation 75 (as I write this).
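To see whether this gap matters on a particular NFS server, you can look directly at the 'proc4ops' line for activity in the operations that the host agent doesn't report. This is a rough sketch with a made-up function name; it assumes the line format is the 'proc4ops' label, a count of operations, and then one counter per operation number starting from 0.

```shell
#!/bin/sh
# Hypothetical helper: print "opnum count" for every NFS v4 server operation
# numbered 40 or higher that has a non-zero count. Field 1 is the 'proc4ops'
# label, field 2 is the number of operations, and field 3 is the counter
# for operation number 0.
nfsd_unreported_ops() {
    awk '$1 == "proc4ops" {
        for (i = 3; i <= NF; i++) {
            op = i - 3
            if (op >= 40 && $i > 0) print op, $i
        }
    }' "${1:-/proc/net/rpc/nfsd}"
}

# Example use on an NFS server:  nfsd_unreported_ops
```

Any output means the server is doing operations that won't show up in your node_exporter based graphs; the operation numbers can be looked up in the kernel's include/linux/nfs4.h.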

Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone', and is missing from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):

Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus

Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.

If you want numbers for scale, at the moment node_exporter reports on 50 out of 69 NFS v4 client operations, and is missing 36 NFS v4 server operations (reporting on what I believe is 36 out of 72). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, what client operations the server stats are missing.

(I haven't made a bug report about this (yet) and may not do so, because doing so would require making my Github account operable again, something I'm sort of annoyed by. Github's choice to require me to have MFA to make bug reports is not the incentive they think it is.)

Linux kernel NFSv4 server and client RPC operation statistics

By: cks

NFS servers and clients communicate using RPC, sending various NFS v3, v4, and possibly v2 (but we hope not) RPC operations to the server and getting replies. On Linux, the kernel exports statistics about these NFS RPC operations in various places, with a global summary in /proc/net/rpc/nfsd (for the NFS server side) and /proc/net/rpc/nfs (for the client side). Various tools will extract this information and convert it into things like metrics, or present it on the fly (for example, nfsstat(8)). However, as far as I know what is in those files and especially how RPC operations are reported is not well documented, and also confusing, which is a problem if you discover that something has an incomplete knowledge of NFSv4 RPC stats.

For a general discussion of /proc/net/rpc/nfsd, see Svenn D'Hert's nfsd stats explained article. I'm focusing on NFSv4, which is to say the 'proc4ops' line. This line is produced in nfsd_show in fs/nfsd/stats.c. The line starts with a count of how many operations there are, such as 'proc4ops 76', and then has one number for each operation. What are the operations and how many of them are there? That's more or less found in the nfs_opnum4 enum in include/linux/nfs4.h. You'll notice that there are some gaps in the operation numbers; for example, there's no 0, 1, or 2. Despite there being no such actual NFS v4 operations, 'proc4ops' starts with three 0s for them, because it works with an array numbered by nfs_opnum4 and like all C arrays, it starts at 0.

(The counts of other, real NFS v4 operations may be 0 because they're never done in your environment.)

For NFS v4 client operations, we look at the 'proc4' line in /proc/net/rpc/nfs. Like the server's 'proc4ops' line, it starts with a count of how many operations are being reported on, such as 'proc4 69', and then a count for each operation. Unfortunately for us and everyone else, these operations are not numbered the same as the NFS server operations. Instead the numbering is given in an anonymous and unnumbered enum in include/linux/nfs4.h that starts with 'NFSPROC4_CLNT_NULL = 0,' (as a spoiler, the 'null' operation is not unused, contrary to the include file's comment). The actual generation and output of /proc/net/rpc/nfs is done in rpc_proc_show in net/sunrpc/stats.c. The whole structure this code uses is set up in fs/nfs/nfs4xdr.c, and while there is a confusing level of indirection, I believe the structure corresponds directly with the NFSPROC4_CLNT_* enum values.
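Since both lines have the same basic shape, a single small awk wrapper can pull the non-zero counters out of either one. This is a sketch with a hypothetical function name, assuming the format described above (a label, a count of operations, then one counter per operation).

```shell
#!/bin/sh
# Hypothetical helper: list "index count" for each non-zero counter on a
# labelled stats line, such as 'proc4ops' in /proc/net/rpc/nfsd or 'proc4'
# in /proc/net/rpc/nfs. Index 0 is the first counter after the declared
# number of operations.
rpc_nonzero() {
    awk -v label="$1" '$1 == label {
        if ($2 != NF - 2)
            print "warning: declared " $2 " values, found " (NF - 2) > "/dev/stderr"
        for (i = 3; i <= NF; i++)
            if ($i > 0) print i - 3, $i
    }' "$2"
}

# Example use:
#   rpc_nonzero proc4ops /proc/net/rpc/nfsd
#   rpc_nonzero proc4 /proc/net/rpc/nfs
```

The index means different things for the two lines. For 'proc4ops' it is the nfs_opnum4 operation number, while for 'proc4' it is the position in the NFSPROC4_CLNT_* enum, so the two outputs can't be compared index by index.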

What I think is going on is that Linux has decided to optimize its NFSv4 client statistics to only include the NFS v4 operations that it actually uses, rather than take up a bit of extra memory to include all of the NFS v4 operations, including ones that will always have a '0' count. Because the Linux NFS v4 client started using different NFSv4 operations at different times, some of these operations (such as 'lookupp') are out of order; when the NFS v4 client started using them, they had to be added at the end of the 'proc4' line to preserve backward compatibility with existing programs that read /proc/net/rpc/nfs.

PS: As far as I can tell from a quick look at fs/nfs/nfs3xdr.c, include/uapi/linux/nfs3.h, and net/sunrpc/stats.c, the NFS v3 server and client stats cover all of the NFS v3 operations and are in the same order, the order of the NFS v3 operation numbers.

How Ubuntu 24.04's bad bpftrace package appears to have happened

By: cks

When I wrote about Ubuntu 24.04's completely broken bpftrace '0.20.2-1ubuntu4.2' package (which is now no longer available as an Ubuntu update), I said it was a disturbing mystery how a theoretical 24.04 bpftrace binary was built in such a way that it depended on a shared library that didn't exist in 24.04. Thanks to the discussion in bpftrace bug #2097317, we have somewhat of an answer, which in part shows some of the challenges of building software at scale.

The short version is that the broken bpftrace package wasn't built in a standard Ubuntu 24.04 environment that only had released packages. Instead, it was built in a '24.04' environment that included (some?) proposed updates, and one of the included proposed updates was an updated version of libllvm18 that had the new shared library. Apparently there are mechanisms that should have acted to make the new bpftrace depend on the new libllvm18 if everything went right, but some things didn't go right and the new bpftrace package didn't pick up that dependency.

On the one hand, if you're planning interconnected package updates, it's a good idea to make sure that they work with each other, which means you may want to mingle in some proposed updates into some of your build environments. On the other hand, if you allow your build environments to be contaminated with non-public packages this way, you really, really need to make sure that the dependencies work out. If you don't and packages become public in the wrong order, you get Ubuntu 24.04's result.

(While the RPM build process and package format would have avoided this specific problem, I'm pretty sure that there are similar ways to make it go wrong.)

Contaminating your build environment this way also makes testing your newly built packages harder. The built bpftrace binary would have run inside the build environment, because the build environment had the right shared library from the proposed libllvm18. To see the failure, you would have to run tests (including running the built binary) in a 'pure' 24.04 environment that had only publicly released package updates. This would require an extra package test step; I'm not clear if Ubuntu has this as part of their automated testing of proposed updates (there are some hints in the discussion that they do but that these tests were limited and didn't try to run the binary).
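As an illustration of how cheap a basic check could be, here is a minimal sketch of a shared library smoke test that you could run in a clean environment with only released packages installed. The function names are hypothetical, and I have no idea whether Ubuntu's test infrastructure does anything like this.

```shell
#!/bin/sh
# Hypothetical helper: pick out unresolved libraries from ldd-style output.
parse_missing() {
    awk '/=> not found/ { print $1 }'
}

# Print the shared libraries that a binary needs but that can't be found.
missing_libs() {
    ldd "$1" 2>/dev/null | parse_missing
}

# Example use in a clean 24.04 environment:
#   missing_libs /usr/bin/bpftrace
```

A check like this only catches unresolved shared libraries. It wouldn't notice a binary that was built against incompatible headers, so actually running the program is still necessary.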

An alarmingly bad official Ubuntu 24.04 bpftrace binary package

By: cks

Bpftrace is a more or less official part of Ubuntu; it's even in the Ubuntu 24.04 'main' repository, as opposed to one of the less supported ones. So I'll present things in the traditional illustrated form (slightly edited for line length reasons):

$ bpftrace
bpftrace: error while loading shared libraries: libLLVM-18.so.18.1: cannot open shared object file: No such file or directory
$ readelf -d /usr/bin/bpftrace | grep libLLVM
 0x0...01 (NEEDED)  Shared library: [libLLVM-18.so.18.1]
$ dpkg -L libllvm18 | grep libLLVM
/usr/lib/llvm-18/lib/libLLVM.so.1
/usr/lib/llvm-18/lib/libLLVM.so.18.1
/usr/lib/x86_64-linux-gnu/libLLVM-18.so
/usr/lib/x86_64-linux-gnu/libLLVM.so.18.1
$ dpkg -l bpftrace libllvm18
[...]
ii  bpftrace       0.20.2-1ubuntu4.2 amd64 [...]
ii  libllvm18:amd64 1:18.1.3-1ubuntu1 amd64 [...]

I originally mis-diagnosed this as a libllvm18 packaging failure, but this is in fact worse. Based on trawling through packages.ubuntu.com, only Ubuntu 24.10 and later have a 'libLLVM-18.so.18.1' in any package; in Ubuntu 24.04, the correct name for this is 'libLLVM.so.18.1'. If you rebuild the bpftrace source .deb on a genuine 24.04 machine, you get a bpftrace build (and binary .deb) that does correctly use 'libLLVM.so.18.1' instead of 'libLLVM-18.so.18.1'.

As far as I can see, there are two things that could have happened here. The first is that Canonical simply built a 24.10 (or later) bpftrace binary .deb and put it in 24.04 without bothering to check if the result actually worked. I would like to say that this shows shocking disregard for the functioning of an increasingly important observability tool from Canonical, but actually it's not shocking at all, it's Canonical being Canonical (and they would like us to pay for this for some reason). The second and worse option is that Canonical is building 'Ubuntu 24.04' packages in an environment that is contaminated with 24.10 or later packages, shared libraries, and so on. This isn't supposed to happen in a properly operating package building environment that intends to create reliable and reproducible results and casts doubt on the provenance and reliability of all Ubuntu 24.04 packages.

(I don't know if there's a way to inspect binary .debs to determine anything about the environment they were built in, the way you can get some information about RPMs. Also, I now have a new appreciation for Fedora putting the Fedora release version into the actual RPM's 'release' name. Ubuntu 24.10 and 24.04 don't have the same version of bpftrace, so this isn't quite as simple as Canonical copying the 24.10 package to 24.04; 24.10 has 0.21.2, while 24.04 is theoretically 0.20.2.)

Incidentally, this isn't an issue of the shared library having its name changed, because if you manually create a 'libLLVM-18.so.18.1' symbolic link to the 24.04 libllvm18's 'libLLVM.so.18.1' and run bpftrace, what you get is:

$ bpftrace
: CommandLine Error: Option 'debug-counter' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
abort

This appears to say that the Ubuntu 24.04 bpftrace binary is incompatible with the Ubuntu 24.04 libllvm18 shared libraries. I suspect that it was built against different LLVM 18 headers as well as different LLVM 18 shared libraries.
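One way to catch this sort of breakage before anyone runs the program is to look for unresolved shared libraries in the output of 'ldd'. The following is a minimal Python sketch of that check, not anything Ubuntu or bpftrace provides. The function name is made up for illustration, and the sample text is written by hand in ldd's usual format rather than captured from a real system.

```python
def missing_libraries(ldd_output: str) -> list[str]:
    """Return the shared library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd lines look like "<name> => <path> (<address>)" or "<name> => not found".
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip() == "not found":
            missing.append(name)
    return missing


# Hand-written illustration in ldd's format, not real captured output.
SAMPLE = """\
\tlibLLVM-18.so.18.1 => not found
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000000000000000)
"""

print(missing_libraries(SAMPLE))
```

In real use you would feed it the output of running 'ldd' on the bpftrace binary. Note that this only catches missing libraries; it would not catch the header mismatch that the symbolic link experiment exposes.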

The (potential) complexity of good runqueue latency measurement in Linux

By: cks

Run queue latency is the time between when a Linux task becomes ready to run and when it actually runs. If you want good responsiveness, you want a low runqueue latency, so for a while I've been tracking a histogram of it with eBPF, and I put some graphs of it up on some Grafana dashboards I look at. Then recently I improved the responsiveness of my desktop with the cgroup V2 'cpu.idle' setting, and questions came up about how this differs from process niceness. When I was looking at those questions, I realized that my run queue latency measurements were incomplete.

When I first set up my run queue latency tracking, I wasn't using either cgroup V2 cpu.idle or process niceness, and so I set up a single global runqueue latency histogram for all tasks regardless of their priority and scheduling class. Once I started using 'idle' CPU scheduling (and testing the effectiveness of niceness), this resulted in hopelessly muddled data that was effectively meaningless during the time that multiple types of scheduling or multiple nicenesses were running. Running CPU-consuming processes only when the system is otherwise idle is (hopefully) good for the runqueue latency of my regular desktop processes, but more terrible than usual for those 'run only when idle' processes, and generally there's going to be a lot more of them than my desktop processes.

The moment you introduce more than one 'class' of processes for scheduling, you need to split run queue latency measurements up between these classes if you want to really make sense of the results. What these classes are will depend on your environment. I could probably get away with a class for 'cpu.idle' tasks, a class for heavily nice'd tasks, a class for regular tasks, and perhaps a class for (system) processes running with very high priority. If you're doing fair share scheduling between logins, you might need a class per login (or you could ignore run queue latency as too noisy a measure).
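To make the idea of per-class measurement concrete, here is a minimal Python sketch of sorting latency samples into one histogram per class. It is only an illustration of the bookkeeping, not the eBPF code that gathers the samples. The class names, the niceness thresholds, and the function names are all assumptions picked for the example, and the sample latencies are invented.

```python
from collections import defaultdict

# Thresholds are arbitrary assumptions; pick ones that fit your environment.
HEAVY_NICE = 10
HIGH_PRIORITY_NICE = -10


def latency_class(in_idle_cgroup: bool, nice: int) -> str:
    """Pick the histogram a task's run queue latency should be counted in."""
    if in_idle_cgroup:
        return "cpu.idle"
    if nice >= HEAVY_NICE:
        return "niced"
    if nice <= HIGH_PRIORITY_NICE:
        return "high-priority"
    return "regular"


def record(hists, cls: str, latency_us: int) -> None:
    """Count one latency sample in a power-of-two bucket for its class."""
    bucket = latency_us.bit_length()  # 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, ...
    hists[cls][bucket] += 1


hists = defaultdict(lambda: defaultdict(int))
record(hists, latency_class(False, 0), 6)      # a regular desktop task
record(hists, latency_class(True, 0), 40000)   # a 'run only when idle' task
record(hists, latency_class(False, 19), 900)   # a heavily nice'd task
for cls, buckets in sorted(hists.items()):
    print(cls, dict(buckets))
```

With the samples kept apart like this, a long wait for a 'run only when idle' task no longer lands in the same buckets as the desktop processes you actually care about.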

I'm not sure I'd actually track all of my classes as Prometheus metrics. For my personal purposes, I don't care very much about the run queue latency of 'idle' or heavily nice'd processes, so perhaps I should update my personal metrics gathering to just ignore those. Alternately, I could write a bpftrace script that gathered the detailed class by class data, run it by hand when I was curious, and ignore the issue otherwise (continuing with my 'global' run queue latency histogram, which is at least honest in general).

The issue with DNF 5 and script output in Fedora 41

By: cks

These days Fedora uses DNF as its high(er) level package management software, replacing yum. However, there are multiple versions of DNF, which behave somewhat differently. Through Fedora 40, the default version of DNF was DNF 4; in Fedora 41, DNF is now DNF 5. DNF 5 brings a number of improvements but it has at least one issue that makes me unhappy with it in my specific situation. Over on the Fediverse I said:

Oh nice, DNF 5 in Fedora 41 has nicely improved the handling of output from RPM scriptlets, so that you can more easily see that it's scriptlet output instead of DNF messages.

[later]

I must retract my praise for DNF 5 in Fedora 41, because it has actually made the handling of output from RPM scriptlets *much* worse than in dnf 4. DNF 5 will repeatedly re-print the current output to date of scriptlets every time it updates a progress indicator of, for example, removing packages. This results in a flood of output for DKMS module builds during kernel updates. Dnf 5's cure is far worse than the disease, and there's no way to disable it.

<bugzilla 2331691>

(Fedora 41 specifically has dnf5-5.2.8.1, at least at the moment.)

This can be mostly worked around for kernel package upgrades and DKMS modules by manually removing and upgrading packages before the main kernel upgrade. You want to do this so that dnf is removing as few packages as possible while your DKMS modules are rebuilding. This is done with:

  1. Upgrade all of your non-kernel packages first:

    dnf upgrade --exclude 'kernel*'
    

  2. Remove the following packages for the old kernel:

    kernel kernel-core kernel-devel kernel-modules kernel-modules-core kernel-modules-extra

    (It's probably easier to do 'dnf remove kernel*<version>*' and let DNF sort it out.)

  3. Upgrade two kernel packages that you can do in advance:

    dnf upgrade kernel-tools kernel-tools-libs
    

Unfortunately in Fedora 41 this still leaves you with one RPM package that you can't upgrade in advance and that will be removed while your DKMS module is rebuilding, namely 'kernel-devel-matched'. To add extra annoyance, this is a virtual package that contains no files, and you can't remove it because a lot of things depend on it.
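The three preparation steps above can be collected in one place. This is a dry-run Python sketch that only builds and prints the commands from the list; it doesn't run DNF. The function name is hypothetical and the version string in the example is a made-up placeholder, so substitute the version of the old kernel you're actually removing.

```python
import shlex


def kernel_upgrade_prep(old_version: str) -> list[list[str]]:
    """Build the three preparatory dnf commands, in the order they should run."""
    return [
        # 1. Upgrade everything that isn't a kernel package.
        ["dnf", "upgrade", "--exclude", "kernel*"],
        # 2. Remove the packages for the old kernel, letting DNF expand the glob.
        ["dnf", "remove", f"kernel*{old_version}*"],
        # 3. Upgrade the kernel packages that can be done in advance.
        ["dnf", "upgrade", "kernel-tools", "kernel-tools-libs"],
    ]


# '6.0.0-1' is a placeholder, not a real Fedora kernel version.
for argv in kernel_upgrade_prep("6.0.0-1"):
    print(shlex.join(argv))
```

Printing the commands rather than running them leaves you free to look at what step 2 will match before you commit to it.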

As far as I can tell, DNF 5 has absolutely no way to shut off its progress bars. It completely ignores $TERM and I can't see anything else that leaves DNF usable. It would have been nice to have some command line switches to control this, but it seems pretty clear that this wasn't high on the DNF 5 road map.

(Although I don't expect this to be fixed in Fedora 41 over its lifetime, I am still deferring the Fedora 41 upgrades of my work and home desktops for as long as possible to minimize the amount of DNF 5 irritation I have to deal with.)

WireGuard's AllowedIPs aren't always the (WireGuard) routes you want

By: cks

A while back I wrote about understanding WireGuard's AllowedIPs, and also recently I wrote about how different sorts of WireGuard setups have different difficulties, where one of the challenges for some setups is setting up what you want routed through WireGuard connections. As Ian Z aka nobrowser recently noted in a comment on the first entry, these days many WireGuard related programs (such as wg-quick and NetworkManager) will automatically set routes for you based on AllowedIPs. Much of the time this will work fine, but there are situations where adding routes for all AllowedIPs ranges isn't what you want.

WireGuard's AllowedIPs setting for a particular peer controls two things at once: what (inside-WireGuard) source IP addresses you will accept from the peer, and what destination addresses WireGuard will send to that peer if the packet is sent to that WireGuard interface. However, it's the routing table that controls what destination addresses are sent to a particular WireGuard interface (or more likely a combination of IP policy routing rules and some routing table).

If your WireGuard IP address is only reachable from other WireGuard peers, you can sensibly bound your AllowedIPs so that the collection of all of them matches the routing table. This is also more or less doable if some of them are gateways for additional networks; hopefully your network design puts all of those networks under some subnet and the subnet isn't too big. However, if your WireGuard IP can wind up being reached by a broader range of source IPs, or even 'all of the Internet' (as is my case), then your AllowedIPs range is potentially much larger than what you want to always be routed to WireGuard.

A related case is if you have a 'work VPN' WireGuard configuration where you could route all of your traffic through your WireGuard connection but some of the time you only want to route traffic to specific (work) subnets. Unless you like changing AllowedIPs all of the time or constructing two different WireGuard interfaces and only activating the correct one, you'll want an AllowedIPs that accepts everything but some of the time you'll only route specific networks to the WireGuard interface.

(On the other hand, with the state of things in Linux, having two separate WireGuard interfaces might be the easiest way to manage this in NetworkManager or other tools.)
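The split between AllowedIPs and routes in the 'work VPN' case can be sketched in a few lines of Python. This is a toy model and not how WireGuard or the kernel implements anything. The interface name 'wg0', the 10.10.0.0/16 work subnet, and the function names are assumptions for the example, and 192.0.2.7 comes from a range reserved for documentation.

```python
import ipaddress


def networks(cidrs):
    """Parse a list of CIDR strings into network objects."""
    return [ipaddress.ip_network(c) for c in cidrs]


def covered(nets, address: str) -> bool:
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in nets)


# Hypothetical 'work VPN' setup.
ALLOWED_IPS = networks(["0.0.0.0/0"])    # what the peer may send us and be sent
WG_ROUTES = networks(["10.10.0.0/16"])   # what the routing table sends to wg0


def outbound_uses_wireguard(dst: str) -> bool:
    """The routing table must pick wg0 first; only then does AllowedIPs matter."""
    return covered(WG_ROUTES, dst) and covered(ALLOWED_IPS, dst)


def inbound_accepted(src: str) -> bool:
    """Inbound packets from the peer are checked only against AllowedIPs."""
    return covered(ALLOWED_IPS, src)


print(outbound_uses_wireguard("10.10.1.5"))   # work subnet: goes through wg0
print(outbound_uses_wireguard("192.0.2.7"))   # not routed to wg0 at all
print(inbound_accepted("192.0.2.7"))          # but accepted if the peer sends it
```

Switching to routing all traffic through the VPN would then only change WG_ROUTES; the AllowedIPs setting stays the same.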

I think that most people's use of WireGuard will probably involve AllowedIPs settings that also work for routing, provided that the tools involved handle the recursive routing problem. These days, NetworkManager handles that for you, although I don't know about wg-quick.

(This is one of the entries that I write partly to work it out in my own head. My own configuration requires a different AllowedIPs than the routes I send through the WireGuard tunnel. I make this work with policy based routing.)

Cgroup V2 memory limits and their potential for thrashing

By: cks

Recently I read 32 MiB Working Sets on a 64 GiB machine (via), which recounts how under some situations, Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.

(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)

Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit. What happens next depends on what sort of memory pages are being reclaimed from the process. If they are backed by files (for example, they're pages from the program, shared libraries, or memory mapped files), they will be dropped from the process's resident set but may stay in memory so it's only a soft page fault when they're next accessed. However, if they're anonymous pages of memory the process has allocated, they must be written to swap (if there's room for them) and I don't know if the original pages stay in memory afterward (and so are eligible for a soft page fault when next accessed). If the process keeps accessing anonymous pages that were previously reclaimed, it will thrash on either soft or hard page faults.

(The memory.high limit is set by systemd's MemoryHigh=.)

However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts for RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sort of memory that allocates. Some of this memory can also thrash like user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).

Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit and will kill programs in the cgroup if they push hard on going over it; however, the memory system will try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right. However, in our environments cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap available and exhausting the swap space too.

My view is that this generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', ie either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
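If you do want the sort of monitoring the kernel documentation suggests, one place to look is the cgroup's 'memory.events' file, which counts (among other things) how often the cgroup hit memory.high, hit memory.max, and had processes OOM-killed. Here is a minimal Python sketch that compares two readings of that file. The function names and the three summary labels are assumptions for illustration, and the sample readings are invented rather than taken from a real cgroup.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            events[key] = int(value)
    return events


def pressure_summary(before: dict[str, int], after: dict[str, int]) -> str:
    """Say what happened to the cgroup between two readings."""
    delta = {k: after[k] - before.get(k, 0) for k in after}
    if delta.get("oom_kill", 0) > 0:
        return "killed"       # something was OOM-killed
    if delta.get("high", 0) > 0 or delta.get("max", 0) > 0:
        return "reclaiming"   # limits were hit and reclaim ran, but nothing died
    return "quiet"


# Invented readings for illustration.
EARLIER = "low 0\nhigh 10\nmax 0\noom 0\noom_kill 0\n"
LATER = "low 0\nhigh 250\nmax 0\noom 0\noom_kill 0\n"

print(pressure_summary(parse_memory_events(EARLIER), parse_memory_events(LATER)))
```

A cgroup that keeps coming back as 'reclaiming' without ever being 'killed' is a candidate for the thrashing described above, although the counters alone can't tell you how much CPU time the reclaim is costing.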

A gotcha with importing ZFS pools and NFS exports on Linux (as of ZFS 2.3.0)

By: cks

Ever since its Solaris origins, ZFS has supported automatic NFS and CIFS sharing of ZFS filesystems through their 'sharenfs' and 'sharesmb' properties. Part of the idea of this is that you could automatically have NFS (and SMB) shares created and removed as you did things like import and export pools, rather than have to maintain a separate set of export information and keep it in sync with what ZFS filesystems were available. On Linux, OpenZFS still supports this, working through standard Linux NFS export permissions (which don't quite match the Solaris/Illumos model that's used for sharenfs) and standard tools like exportfs. A lot of this works more or less as you'd expect, but it turns out that there's a potentially unpleasant surprise lurking in how 'zpool import' and 'zpool export' work.

In the current code, if you import or export a ZFS pool that has no filesystems with sharenfs set, ZFS will still run 'exportfs -ra' at the end of the operation even though nothing could have changed in the NFS exports situation. An important effect that this has is that it will wipe out any manually added or changed NFS exports, reverting your NFS exports to what is currently in /etc/exports and /etc/exports.d. In many situations (including ours) this is a harmless operation, because /etc/exports and /etc/exports.d are how things are supposed to be. But in some environments you may have programs that maintain their own exports list and permissions through running 'exportfs' in various ways, and in these environments a ZFS pool import or export will destroy those exports.
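The effect on manually managed exports can be modelled in a few lines of Python. This is a toy model of the outcome (afterward, the active exports match the configuration files), not of how exportfs works internally. The function name, the paths, the clients, and the options are all hypothetical.

```python
def effect_of_reexport(active: dict, configured: dict):
    """Model 'exportfs -ra': report which active exports it drops or reverts.

    Both arguments map (path, client) to an options string.
    """
    dropped = sorted(k for k in active if k not in configured)
    reverted = sorted(
        k for k in active if k in configured and active[k] != configured[k]
    )
    return dropped, reverted


# Hypothetical state: one export added by hand, one changed by hand.
configured = {("/home", "*.example.org"): "ro"}
active = {
    ("/home", "*.example.org"): "rw",       # changed with exportfs by hand
    ("/srv/ha/data", "10.0.0.0/24"): "rw",  # added by an HA manager
}

dropped, reverted = effect_of_reexport(active, configured)
print("dropped:", dropped)
print("reverted:", reverted)
```

In this model, anything that only exists because some program ran 'exportfs' by hand shows up in one of the two lists, which is exactly what a pool import or export will undo.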

(Apparently one such environment is high availability systems, some of which manually manage NFS exports outside of /etc/exports (I maintain that this is a perfectly sensible design decision). These are also the kind of environment that might routinely import or export pools, as HA pools move between hosts.)

The current OpenZFS code runs 'exportfs -ra' entirely blindly. It doesn't matter if you don't NFS export any ZFS filesystems, much less any from the pool that you're importing or exporting. As long as an 'exportfs' binary is on the system and can be executed, ZFS will run it. Possibly this could be changed if someone was to submit an OpenZFS bug report, but for a number of reasons (including that we're not directly affected by this and aren't in a position to do any testing), that someone will not be me.

(As far as I can tell this is the state of the code in all Linux OpenZFS versions up through the current development version and 2.3.0-rc4, the latest 2.3.0 release candidate.)

Appendix: Where this is in the current OpenZFS source code

The exportfs execution is done in nfs_commit_shares() in lib/libshare/os/linux/nfs.c. This is called (indirectly) by sa_commit_shares() in lib/libshare/libshare.c, which is called by zfs_commit_shares() in lib/libzfs/libzfs_mount.c. In turn this is called by zpool_enable_datasets() and zpool_disable_datasets(), also in libzfs_mount.c, which are called as part of 'zpool import' and 'zpool export' respectively.

(As a piece of trivia, zpool_disable_datasets() will also be called during 'zpool destroy'.)
