
Restricting IP address access to specific ports in eBPF: a sketch

By: cks

The other day I covered how I think systemd's IPAddressAllow and IPAddressDeny restrictions work, which unfortunately allow you to limit this to specific (local) ports only if you set up the sockets for those ports in a separate systemd.socket unit. Naturally this raises the question of whether there is a good, scalable way to restrict access to specific ports in eBPF that systemd (or other interested parties) could use. I think the answer is yes, so here is a sketch of how I think you'd do this.

We care about a 'scalable' way to do this because systemd generates and installs its eBPF programs on the fly. Since tcpdump can do this sort of cross-port matching, we could write an eBPF program that did it directly. But such a program could get complex if we were matching a bunch of things, and that complexity might make it hard to generate on the fly (or at least make it complex enough that systemd and other programs wouldn't want to). So we'd like an approach that still lets you generate a simple eBPF program.

Systemd uses cgroup socket SKB eBPF programs, which attach to a cgroup and filter all network packets on ingress or egress. As far as I can understand from staring at code, these are implemented by extracting the IPv4 or IPv6 address of the other side from the SKB and then querying what eBPF calls an LPM (Longest Prefix Match) map. The normal way to use an LPM map is to use the CIDR prefix length and the start of the CIDR network as the key (for individual IPv4 addresses, the prefix length is 32) and then match addresses against it, which is what systemd's cgroup program does. This is a nicely scalable way to handle the problem; the eBPF program itself is basically constant, and you have a couple of eBPF maps (for the allow and deny sides) that systemd populates with the relevant information from IPAddressAllow and IPAddressDeny.

However, there's nothing in eBPF that requires the keys to be just CIDR prefixes plus IP addresses. An LPM map key has to start with a 32-bit prefix length, but the size of the rest of the key can vary. This means that we can make our keys 16 bits longer and stick the port number in front of the IP address (increasing the prefix length appropriately). So to match packets to port 22 from 128.100.0.0/16, your key would be (u32) 32 for the prefix length and then something like 0x00 0x16 0x80 0x64 0x00 0x00 (if I'm doing the math and understanding the structure right). When you query this LPM map, you put the appropriate port number in front of the IP address.
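As a concrete sketch of that key layout (the exact packing depends on how the eBPF program declares its key struct; the u32 prefix length is in host byte order, assumed little-endian here), you could build such a key like this:

```python
import ipaddress
import struct

def lpm_key(port: int, cidr: str) -> bytes:
    """Pack a (port, IPv4 network) pair as a hypothetical LPM trie key:
    a u32 prefix length (host byte order), then 16 bits of port and the
    network address in network byte order.  The prefix length covers
    the 16 port bits plus the CIDR prefix length."""
    net = ipaddress.ip_network(cidr)
    prefixlen = 16 + net.prefixlen
    return (struct.pack("<I", prefixlen)      # assumes a little-endian host
            + struct.pack(">H", port)         # port in network byte order
            + net.network_address.packed)     # IPv4 address bytes

# Port 22 on 128.100.0.0/16: prefix length 32, then 00 16 80 64 00 00
print(lpm_key(22, "128.100.0.0/16").hex())  # → 20000000001680640000
```

A lookup key for an incoming packet would be packed the same way, with the packet's port and a full-length prefix.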

This does mean that each separate port with a separate set of IP address restrictions needs its own set of map entries. If you wanted a set of ports to all have a common set of restrictions, you could use a normally structured LPM map and a second plain hash map where the keys are port numbers. Then you check the port and the IP address separately, rather than trying to combine them in one lookup. And there are more complex schemes if you need them.
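A toy model of that two-map scheme (plain Python dictionaries standing in for the eBPF maps) might look like:

```python
import ipaddress

def lpm_lookup(lpm_map, addr):
    """Toy longest-prefix match over {network: verdict}; a real eBPF
    LPM trie does this in a single map lookup."""
    ip = ipaddress.ip_address(addr)
    best_net, best_verdict = None, None
    for net, verdict in lpm_map.items():
        if ip in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_verdict = net, verdict
    return best_verdict

def allowed(port, addr, restricted_ports, lpm_map):
    """Check the port hash map first; only restricted ports then
    consult the shared LPM map of IP address restrictions."""
    if port not in restricted_ports:
        return True
    return lpm_lookup(lpm_map, addr) == "allow"

rules = {ipaddress.ip_network("128.100.0.0/16"): "allow",
         ipaddress.ip_network("128.100.3.0/24"): "deny"}
ports = {22, 443}
print(allowed(22, "128.100.1.5", ports, rules))  # → True
print(allowed(22, "128.100.3.5", ports, rules))  # → False (longest prefix wins)
print(allowed(80, "10.0.0.1", ports, rules))     # → True (port not restricted)
```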

Which scheme you'd use depends on how you expect port based access restrictions to be used. Do you expect several different ports, each with its own set of IP access restrictions (or only one port)? Then my first scheme is only a minor change from systemd's current setup, and it's easy to extend it to general IP address controls as well (just use a port number of zero to mean 'this applies to all ports'). If you expect sets of ports to all use a common set of IP access controls, or several sets of ports with different restrictions for each set, then you might want a scheme with more maps.

(In theory you could write this eBPF program and set up these maps yourself, then use systemd resource control features to attach them to your .service unit. In practice, at that point you probably should write host firewall rules instead, it's likely to be simpler. But see this blog post and the related VCS repository, although that uses a more hard-coded approach.)

How I think systemd IP address restrictions on socket units work

By: cks

Among the systemd resource controls are IPAddressAllow= and IPAddressDeny=, which allow you to limit what IP addresses your systemd thing can interact with. This is implemented with eBPF. A limitation of these as applied to systemd .service units is that they restrict all traffic, both inbound connections and things your service initiates (like, say, DNS lookups), while you may want only a simple inbound connection filter. However, you can also set these on systemd.socket units. If you do, your IP address restrictions apply only to the socket (or sockets), not to the service unit that it starts. To quote the documentation:

Note that for socket-activated services, the IP access list configured on the socket unit applies to all sockets associated with it directly, but not to any sockets created by the ultimately activated services for it.

So if you have a systemd socket activated service, you can control who can access the socket without restricting who the service itself can talk to.

In general, systemd IP access controls are done through eBPF programs set up on cgroups. If you set up IP access controls on a socket, such as ssh.socket in Ubuntu 24.04, you do get such eBPF programs attached to the ssh.socket cgroup (and there is a ssh.socket cgroup, perhaps because of the eBPF programs):

# pwd
/sys/fs/cgroup/system.slice
# bpftool cgroup list ssh.socket
ID  AttachType      AttachFlags  Name
12  cgroup_inet_ingress   multi  sd_fw_ingress
11  cgroup_inet_egress    multi  sd_fw_egress

However, if you look, there are no processes or threads in the ssh.socket cgroup, which is not really surprising but also means there is nothing there for these eBPF programs to apply to. And if you dump the eBPF program itself (with 'bpftool prog dump xlated id 12'), it doesn't really look like it checks for the port number.

What I think must be going on is that the eBPF filtering program is connected to the SSH socket itself. I can't find any relevant-looking uses in the systemd code of the 'SO_ATTACH_*' BPF options from socket(7) (which would be used with setsockopt(2) to directly attach programs to a socket). So I assume that when you create or perhaps start using a socket within a cgroup, that socket gets tied to the cgroup and its eBPF programs, and this attachment stays when the socket is passed to another program in a different cgroup.

(I don't know if there's any way to see what eBPF programs are attached to a socket or a file descriptor for a socket.)

If this is what's going on, it unfortunately means that there's no way to extend this feature of socket units to get per-port IP access control in .service units. Systemd isn't writing special eBPF filter programs for socket units that only apply to those exact ports, which you could in theory reuse for a service unit; instead, it's arranging to connect (only) specific sockets to its general, broad IP access control eBPF programs. Programs that make their own listening sockets won't be doing anything to get eBPF programs attached to them (and only them), so we're out of luck.

(One could experiment with relocating programs between cgroups, with the initial cgroup in which the program creates its listening sockets restricted and the other not, but I will leave that up to interested parties.)

Systemd resource controls on user.slice and system.slice work fine

By: cks

We have a number of systems where we traditionally set strict overcommit handling, and for some time this has caused us some heartburn. Some years ago I speculated that we might want to use resource controls on user.slice or system.slice if they worked, and then recently in a comment here I speculated that this was the way to (relatively) safely limit memory use if it worked.

Well, it does (as far as I can tell, without deep testing). If you want to limit how much of the system's memory people who log in can use so that system services don't explode, you can set MemoryMin= on system.slice to guarantee some amount of memory to it and all things under it. Alternately, you can set MemoryMax= on user.slice, collectively limiting all user sessions to that amount of memory. In either case my view is that you might want to set MemorySwapMax= on user.slice so that user sessions don't spend all of their time swapping. Which one you set things on depends on which is easier and which you trust more; my inclination is MemoryMax, although that means you need to dynamically size it based on each machine's total memory.
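For example, the user.slice variant could be a drop-in like this (the path and the 24G/2G figures are made-up placeholders that you'd size per machine):

```ini
# /etc/systemd/system/user.slice.d/90-memory.conf (hypothetical)
[Slice]
MemoryMax=24G
MemorySwapMax=2G
```

After a 'systemctl daemon-reload', 'systemctl show user.slice -p MemoryMax' will show whether the limit took effect.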

(If you want to limit user memory use you'll need to make sure that things like user cron jobs are forced into user sessions, rather than running under cron.service in system.slice.)

Of course this is what you should expect, given systemd's documentation and the kernel documentation. On the other hand, the Linux kernel cgroup and memory system is sufficiently opaque and ever changing that I feel the need to verify that things actually do work (in our environment) as I expect them to. Sometimes there are surprises, or settings that nominally work but don't really affect things the way I expect.

This does raise the question of how much memory you want to reserve for the system. It would be nice if you could use systemd-cgtop to see how much memory your system.slice is currently using, but unfortunately the number it will show is potentially misleadingly high. This is because the memory attributed to any cgroup includes (much) more than program RAM usage. For example, on our systems it seems typical for system.slice to be using under a gigabyte of 'user' RAM but also several gigabytes of filesystem cache and other kernel memory. You probably want to allow for some of that in what memory you reserve for system.slice, but maybe not all of the current usage.

(You can get the current version of the 'memdu' program I use as memdu.py.)

Gnome, GSettings, gconf, and which one you want

By: cks

On the Fediverse a while back, I said:

Ah yes, GNOME, it is of course my mistake that I used gconf-editor instead of dconf-editor. But at least now Gnome-Terminal no longer intercepts F11, so I can possibly use g-t to enter F11 into serial consoles to get the attention of a BIOS. If everything works in UEFI land.

Gnome has had at least two settings systems, GSettings/dconf (also) and the older GConf. If you're using a modern Gnome program, especially a standard Gnome program like gnome-terminal, it will use GSettings and you will want to use dconf-editor to modify its settings outside of whatever Preferences dialogs it gives you (or doesn't give you). You can also use the gsettings or dconf programs from the command line.

(This can include Gnome-derived desktop environments like Cinnamon, which has updated to using GSettings.)

If the program you're using hasn't been updated to the latest things that Gnome is doing, for example Thunderbird (at least as of 2024), then it will still be using GConf. You need to edit its settings using gconf-editor or gconftool-2, or possibly you'll need to look at the GConf version of general Gnome settings. I don't know if there's anything in Gnome that synchronizes general Gnome GSettings settings into GConf settings for programs that haven't yet been updated.

(This is relevant for programs, like Thunderbird, that use general Gnome settings for things like 'how to open a particular sort of thing'. Although I think modern Gnome may not have very many settings for this because it always goes to the GTK GIO system, based on the Arch Wiki's page on Default Applications.)

Because I've made this mistake between gconf-editor and dconf-editor more than once, I've now created a personal gconf-editor cover script that prints an explanation of the situation when I run it without a special --really argument. Hopefully this will keep me sorted out the next time I run gconf-editor instead of dconf-editor.
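My cover script isn't published; a minimal sketch of the idea (the wording and structure here are mine, not the actual script) could be:

```shell
#!/bin/sh
# Hypothetical gconf-editor cover script: refuse to proceed unless
# --really is given, to catch the common gconf-editor/dconf-editor mixup.
confirm_gconf() {
    if [ "$1" != "--really" ]; then
        echo "Modern Gnome programs use GSettings; you probably want dconf-editor." >&2
        echo "Re-run with --really if you actually mean the old GConf system." >&2
        return 1
    fi
}

if confirm_gconf "$@"; then
    shift
    exec gconf-editor "$@"   # the real binary, e.g. /usr/bin/gconf-editor
fi
```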

PS: Probably I want to use gsettings instead of dconf-editor and dconf as much as possible, since gsettings works through the GSettings layer and so apparently has more safety checks than dconf-editor and dconf do.

PPS: Don't ask me what the equivalents are for KDE. KDE settings are currently opaque to me.

Testing Linux memory limits is a bit of a pain

By: cks

For reasons outside of the scope of this entry, I want to test how various systemd memory resource limits work and interact with each other (which means that I'm really digging into cgroup v2 memory controls). When I started trying to do this, it turned out that I had no good test program (or programs), although I had some that gave me partial answers.

There are two complexities in memory usage testing programs in a cgroups environment. First, you may be able to allocate more memory than you can actually use, depending on your system's settings for strict overcommit. So it's not enough to see how much memory you can allocate using the mechanism of your choice (I tend to use mmap() rather than go through language allocators). After you've either determined how much memory you can allocate or allocated your target amount, you have to at least force the kernel to materialize your memory by writing something to every page of it. Since the kernel can probably swap out some amount of your memory, you may need to keep repeatedly reading all of it.

The second issue is that if you're not in strict overcommit (and sometimes even if you are), the kernel can let you allocate more memory than you can actually use and then, when you try to use it, hit you with the OOM killer. For my testing, I care about the actual usable amount of memory, not how much memory I can allocate, so I need to deal with this somehow (and this is where my current test programs are inadequate). Since the OOM killer can't be caught by a process (that's sort of the point), the simple approach is probably to have my test program progressively report on how much memory it's touched so far, so I can see how far it got before it was OOM-killed. A more complex approach would be to do the testing in a child process with progress reports back to the parent, so it could narrow in on how much memory could be used rather than me guessing that I wanted progress reports every, say, 16 MBytes or 32 MBytes of memory touched.

(Hopefully the OOM killer would only kill the child and not the parent, but with the OOM killer you can never be sure.)
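A minimal sketch of the simple approach (the sizes and the reporting interval are arbitrary choices of mine):

```python
import mmap

def touch_memory(total_bytes, report_every=16 << 20, page=4096):
    """Map total_bytes of anonymous memory and write one byte per page,
    forcing the kernel to materialize every page, printing progress as
    we go so we can see how far we got if the OOM killer strikes."""
    buf = mmap.mmap(-1, total_bytes)       # anonymous private mapping
    for off in range(0, total_bytes, page):
        buf[off] = 1                       # dirty this page
        if (off + page) % report_every == 0:
            print(f"touched {(off + page) >> 20} MiB", flush=True)
    return total_bytes

if __name__ == "__main__":
    touch_memory(64 << 20)  # try to actually use 64 MiB
```

To hold pages resident against swap-out, you'd then loop re-reading buf; the fancier version would os.fork() a child to do the touching and send progress back over a pipe.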

I'm probably not the first person to have this sort of need, so I suspect that other people have written test programs and maybe even put them up somewhere. I don't expect to be able to find them in today's ambient Internet search noise, plus this is very close to the much more popular issue of testing your RAM.

(Will I put up my little test program when I hack it up? Probably not, it's too much work to do it properly, with actual documentation and so on. And these days I'm not very enthused about putting more repositories on Github, so I'd need to find some alternate place.)

Systemd and blocking connections to localhost, including via 'any'

By: cks

I recently discovered a surprising path to accessing localhost URLs and services, where instead of connecting to 127.0.0.1 or the IPv6 equivalent, you connected to 0.0.0.0 (or the IPv6 equivalent). In that entry I mentioned that I didn't know if systemd's IPAddressDeny would block this. I've now tested this, and the answer is that systemd's restrictions do block this. If you set 'IPAddressDeny=localhost', the service or whatever is blocked from the 0.0.0.0 variation as well (for both outbound and inbound connections). This is exactly the way it should be, so you might wonder why I was uncertain and felt I needed to test it.
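The surprising path itself is easy to reproduce without systemd; on Linux, connecting to 0.0.0.0 is treated as connecting to loopback:

```python
import socket

# A listener bound only to 127.0.0.1 ...
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

# ... is reachable via the 'any' address on Linux.
cli = socket.socket()
cli.connect(("0.0.0.0", port))
conn, peer = srv.accept()
print(peer[0])  # → 127.0.0.1
```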

There are a variety of ways at different levels that you might implement access controls on a process (or a group of processes) in Linux, for IP addresses or anything else. For example, you might create an eBPF program that filtered the allowed system calls and system call arguments and attach it to a process and all of its children using seccomp(2). Alternately, for filtering IP connections specifically, you might use a cgroup socket address eBPF program (also), which are among the cgroup program types that are available. Or perhaps you'd prefer to use a cgroup socket buffer program.

How a program such as systemd implements filtering has implications for what sort of things it has to consider and know about when doing the filtering. For example, if we reasonably conclude that the kernel will have mapped 0.0.0.0 to 127.0.0.1 by the time it invokes cgroup socket address eBPF programs, such a program doesn't need to have any special handling to block access to localhost by people using '0.0.0.0' as the target address to connect to. On the other hand, if you're filtering at the system call level, the kernel has almost certainly not done such mapping at the time it invokes you, so your connect() filter had better know that '0.0.0.0' is equivalent to 127.0.0.1 and it should block both.

This diversity is why I felt I couldn't be completely sure about systemd's behavior without actually testing it. To be honest, I didn't know what the specific options were until I researched them for this entry. I knew systemd used eBPF for IPAddressDeny (because it mentions that in the manual page in passing), but I only vaguely knew that there are a lot of ways and places to use eBPF, and I didn't know whether systemd's way needed special handling for 0.0.0.0, or whether systemd had it.

Sidebar: What systemd uses

As I found out through use of 'bpftool cgroup list /sys/fs/cgroup/<relevant thing>' on a systemd service that I knew uses systemd IP address filtering, systemd uses cgroup socket buffer programs, and is presumably looking for good and bad IP addresses and netblocks in those programs. This unfortunately means that it would be hard for systemd to have different filtering for inbound connections as opposed to outgoing connections, because at the socket buffer level it's all packets.

(You'd have to go up a level to more complicated filters on socket address operations.)

Early Linux package manager history and patching upstream source releases

By: cks

One of the important roles of Linux system package managers like dpkg and RPM is providing a single interface to building programs from source even though the programs may use a wide assortment of build processes. One of the source building features that both dpkg and RPM included (I believe from the start) is patching the upstream source code, as well as providing additional files along with it. My impression is that today this is considered much less important in package managers, and some may make it at least somewhat awkward to patch the source release on the fly. Recently I realized that there may be a reason for this potential oddity in dpkg and RPM.

Both dpkg and RPM are very old (by Linux standards). As covered in Andrew Nesbitt's Package Manager Timeline, both date from the mid-1990s (dpkg in January 1994, RPM in September 1995). Linux itself was quite new at the time and the Unix world was still dominated by commercial Unixes (partly because the march of x86 PCs was only just starting). As a result, Linux was a minority target for a lot of general Unix free software (although obviously not for Linux specific software). I suspect that this was compounded by limitations in early Linux libc, where apparently it had some issues with standards (see eg this, also, also, also).

As a minority target, I suspect that Linux regularly had problems compiling upstream software, and for various reasons not all upstreams were interested in fixing (or changing) that (especially if it involved accepting patches to cope with a non standards compliant environment; one reply was to tell Linux to get standards compliant). This probably left early Linux distributions regularly patching software in order to make it build on (their) Linux, leading to first class support for patching upstream source code in early package managers.

(I don't know for sure because at that time I wasn't using Linux or x86 PCs, and I might have been vaguely in the incorrect 'Linux isn't Unix' camp. My first Linux came somewhat later.)

These days things have changed drastically. Linux is much more standards compliant and of course it's a major platform. Free software that works on non-Linux Unixes but doesn't build cleanly on Linux is a rarity, so it's much easier to imagine (or have) a package manager that is focused on building upstream source code unaltered and where patching is uncommon and not as easy (or trivial) as dpkg and RPM make it.

(You still need to be able to patch upstream releases to handle security patches and so on, since projects don't necessarily publish new releases for them. I believe some projects simply issue patches and tell you to apply them to their current release. And you may have to backport a patch yourself if you're sticking on an older release of the project that they no longer do patches for.)

Why Linux wound up with system package managers

By: cks

Yesterday I discussed the two sorts of program package managers: system package managers that manage the whole system, and application package managers that mostly or entirely manage third party programs. Commercial Unix got application package managers in the very early 1990s, but Linux's first package managers were system package managers, in dpkg and RPM (or at least those seem to be the first Linux package managers).

The abstract way to describe why is to say that Linux distributions had to assemble a whole thing from separate pieces; the kernel came from one place, libc from another, coreutils from a third, and so on. The concrete version is to think about what problems you'd have without a package manager. Suppose that you assembled a directory tree of all of the source code of the kernel, libc, coreutils, GCC, and so on. Now you need to build all of these things (or rebuild, let's ignore bootstrapping for the moment).

Building everything is complicated partly because everything goes about it differently. The kernel has its own configuration and build system, a variety of things use autoconf but not necessarily with the same set of options to control things like features, GCC has a multi-stage build process, Perl has its own configuration and bootstrapping process, X is frankly weird and vaguely terrifying, and so on. Then not everyone uses 'make install' to actually install their software, so you have another set of variations for all of this.

(The less said about the build processes for either TeX or GNU Emacs in the early to mid 1990s, the better.)

If you do this at any scale, you need to keep track of all of this information (cf) and you want a uniform interface for 'turn this piece into a compiled and ready to unpack blob'. That is, you want a source package (which encapsulates all of the 'how to do it' knowledge) and a command that takes a source package and does a build with it. Once you're building things that you can turn into blobs, it's simpler to always ship a new version of the blob whenever you change anything.

(You want the 'install' part of 'build and install' to result in a blob rather than directly installing things on your running system because until it finishes, you're not entirely sure the build and install has fully worked. Also, this gives you an easy way to split overall system up into multiple pieces, some of which people don't have to install. And in the very early days, to split them across multiple floppy disks, as SLS did.)

Now you almost have a system package manager with source packages and binary packages. You're building all of the pieces of your Linux distribution in a standard way from something that looks a lot like source packages, and you pretty much want to create binary blobs from them rather than dump everything into a filesystem. People will obviously want a command that takes a binary blob and 'installs' it by unpacking it on their system (and possibly extra stuff), rather than having to run 'tar whatever' all the time themselves, and they'll also want to automatically keep track of which of your packages they've installed rather than having to keep their own records. Now you have all of the essential parts of a system package manager.

(Both dpkg and RPM also keep track of which package installed what files, which is important for upgrading and removing packages, along with things having versions.)

Systemd-networkd and giving your virtual devices alternate names

By: cks

Recently I wrote about how Linux network interface names have a length limit of 15 characters. You can work around this limit by giving network interfaces an 'altname' property, as exposed in (for example) 'ip link'. While you can't work around this at all in Canonical's Netplan, it looks like you can have this for your VLANs in systemd-networkd, since there's AlternativeName= in the systemd.link manual page.

Except, if you look at an actual VLAN configuration as materialized by Netplan (or written out by hand), you'll discover a problem. Your VLANs don't normally have .link files, only .netdev and .network files (and even your normal Ethernet links may not have .link files). The AlternativeName= setting is only valid in .link files, because networkd is like that.

(The AlternativeName= is a '[Link]' section setting and .network files also have a '[Link]' section, but they allow completely different sets of '[Link]' settings. The .netdev file, which is where you define virtual interfaces, doesn't have a '[Link]' section at all, although settings like AlternativeName= apply to them just as much as to regular devices. Alternately, .netdev files could support setting altnames for virtual devices in the '[NetDev]' section along side the mandatory 'Name=' setting.)

You can work around this indirectly, because you can create a .link file for a virtual network device and have it work:

[Match]
Type=vlan
OriginalName=vlan22-mlab

[Link]
AlternativeNamesPolicy=
AlternativeName=vlan22-matterlab

Networkd does the right thing here even though 'vlan22-mlab' doesn't exist when it starts up; when vlan22-mlab comes into existence, it matches the .link file and has the altname stapled on.

Given how awkward this is (and that not everything accepts or sees altnames), I think it's probably not worth bothering with unless you have a very compelling reason to give an altname to a virtual interface. In my case, this is clearly too much work simply to give a VLAN interface its 'proper' name.

Since I tested, I can also say that this works on a Netplan-based Ubuntu server where the underlying VLAN is specified in Netplan. You have to hand write the .link file and stick it in /etc/systemd/network, but after that it cooperates reasonably well with a Netplan VLAN setup.

Linux network interface names have a length limit, and Netplan

By: cks

Over on the Fediverse, I shared a discovery:

This is my (sad) face that Linux interfaces have a maximum name length. What do you mean I can't call this VLAN interface 'vlan22-matterlab'?

Also, this is my annoyed face that Canonical Netplan doesn't check or report this problem/restriction. Instead your VLAN interface just doesn't get created, and you have to go look at system logs to find systemd-networkd telling you about it.

(This is my face about Netplan in general, of course. The sooner it gets yeeted the better.)

Based on both some Internet searches and looking at kernel headers, I believe the limit is 15 characters for the primary name of an interface. In headers, you will find this called IFNAMSIZ (the kernel) or IF_NAMESIZE (glibc), and it's defined to be 16 but that includes the trailing zero byte for C strings.

(I can be confident that the limit is 15, not 16, because 'vlan22-matterlab' is exactly 16 characters long without a trailing zero byte. Take one character off and it works.)
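A trivial check mirroring the kernel's rule (IF_NAMESIZE is glibc's name for the 16-byte buffer; the usable length is one less because of the NUL):

```python
IF_NAMESIZE = 16  # from <net/if.h>; includes the trailing NUL byte

def ifname_fits(name: str) -> bool:
    """True if name fits in the kernel's interface name buffer."""
    n = len(name.encode())
    return 0 < n <= IF_NAMESIZE - 1

print(ifname_fits("vlan22-matterlab"))  # → False (16 characters)
print(ifname_fits("vlan22-mlab"))       # → True (11 characters)
```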

At the level of ip commands, the error message you get is on the unhelpful side:

# ip link add dev vlan22-matterlab type wireguard
Error: Attribute failed policy validation.

(I picked the type for illustration purposes.)

Systemd-networkd gives you a much better error message:

/run/systemd/network/10-netplan-vlan22-matterlab.netdev:2: Interface name is not valid or too long, ignoring assignment: vlan22-matterlab

(Then you get some additional errors because there's no name.)

As mentioned in my Fediverse post, Netplan tells you nothing. One direct consequence of this is that in any context where you're writing down your own network interface names, such as VLANs or WireGuard interfaces, simply having 'netplan try' or 'netplan apply' succeed without errors does not mean that your configuration actually works. You'll need to look at error logs and perhaps inventory all your network devices.

(This isn't the first time I've seen Netplan behave this way, and it remains just as dangerous.)

As covered in the ip link manual page, network interfaces can have either or both of aliases and 'altname' properties. These alternate names can be (much) longer than 15 characters, and an altname (set with 'ip link property add') can be used in various contexts to make things convenient (I'm not sure what good aliases are, though). However, this is somewhat irrelevant for people using Netplan, because the current Netplan YAML doesn't allow you to set interface altnames.

You can set altnames in networkd .link files, as covered in the systemd.link manual page. The direct thing you want is AlternativeName=, but apparently you may also want to set a blank alternative names policy, AlternativeNamesPolicy=. Of course this probably only helps if you're using systemd-networkd directly, instead of through Netplan.

PS: Netplan itself has the notion of Ethernet interfaces having symbolic names, such as 'vlanif0', but this is purely internal to Netplan; it's not manifested as an actual interface altname in the 'rendered' systemd-networkd control files that Netplan writes out.

(Technically this applies to all physical device types.)

An annoyance in how Netplan requires you to specify VLANs

By: cks

Netplan is Canonical's more or less mandatory method of specifying networking on Ubuntu. Netplan has a collection of limitations and irritations, and recently I ran into a new one, which is how VLANs can and can't be specified. To explain this, I can start with the YAML configuration language. To quote the top level version, it looks like:

network:
  version: NUMBER
  renderer: STRING
  [...]
  ethernets: MAPPING
  [...]
  vlans: MAPPING
  [...]

To translate this, you specify VLANs separately from your Ethernet or other networking devices. On the one hand, this is nicely flexible. On the other hand it creates a problem, because here is what you have to write for VLAN properties:

network:
  vlans:
    vlan123:
      id: 123
      link: enp5s0
      addresses: <something>

Every VLAN is on top of some networking device, and because VLANs are specified as a separate category of top level devices, you have to name the underlying device in every VLAN (which gets very annoying and old very fast if you have ten or twenty VLANs to specify). Did you decide to switch from a 1G network port to a 10G network port for the link with all of your VLANs on it? Congratulations, you get to go through every 'vlans:' entry and change its 'link:' value. We hope you don't overlook one.

(Or perhaps you had to move the system disks from one model of 1U server to another model of 1U server because the hardware failed. Or you would just like to write generic install instructions with a generic block of YAML that people can insert directly.)

The best way for Netplan to deal with this would be to allow you to also specify VLANs as part of other devices, especially Ethernet devices. Then you could write:

network:
  ethernets:
    enp5s0:
      vlans:
        vlan123:
          id: 123
          addresses: <something>

Every VLAN specified in enp5s0's configuration would implicitly use enp5s0 as its underlying link device, and you could rename all of them trivially. This also matches how I think most people think of and deal with VLANs, which is that (obviously) they're tied to some underlying device, and you want to think of them as 'children' of the other device.

(You can have an approach to VLANs where they're more free-floating and the interface that delivers any specific VLAN to your server can change, for load balancing or whatever. But you could still do this, since Netplan will need to keep supporting the separate 'vlans:' section.)

If you want to work around this today, you have to go for the far less convenient approach of artificial network names.

network:
  ethernets:
    vlanif0:
      match:
        name: enp5s0

  vlans:
    vlan123:
      id: 123
      link: vlanif0
      addresses: <something>

This way you only need to change one thing if your VLAN network interface changes, but at the cost of setting up the base interface in a non-standard way. (Yes, Netplan accepts it, but it's not how the Ubuntu installer will create your netplan files and who knows what other Canonical tools will have a problem with it as a result.)

We have one future Ubuntu server where we're going to need to set up a lot of VLANs on one underlying physical interface. I'm not sure which option we're going to pick, but the 'vlanif0' option is certainly tempting. If nothing else, it probably means we can put all of the VLANs into a separate, generic Netplan file.
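To make the 'separate, generic Netplan file' idea concrete, here's a sketch of how it could be split (the file names are invented; as I understand it, Netplan merges everything under /etc/netplan/ in lexical order):

```yaml
# /etc/netplan/10-base.yaml: per-machine, names the real interface
network:
  version: 2
  ethernets:
    vlanif0:
      match:
        name: enp5s0
```

```yaml
# /etc/netplan/50-vlans.yaml: generic, identical on every such server
network:
  version: 2
  vlans:
    vlan123:
      id: 123
      link: vlanif0
      addresses: <something>
```

Only the small per-machine file mentions the physical interface; the big VLAN file can be copied around unchanged.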

Early experience with using Linux tc to fight bufferbloat latency

By: cks

Over on the Fediverse I mentioned something recently:

Current status: doing extremely "I don't know what I'm really doing, I'm copying from a website¹" things with Linux tc to see if I can improve my home Internet latency under load without doing too much damage to bandwidth or breaking my firewall rules. So far, it seems to work and things² claim to like the result.

¹ <documentation link>
² https://bufferbloat.libreqos.com/ via @davecb

What started this was running into a Fediverse post about the bufferbloat test, trying it, and discovering that (as expected) my home DSL link performed badly, with significantly increased latency during downloads, uploads, or both. My memory is that reported figures went up to the area of 400 milliseconds.

Conveniently for me, my Linux home desktop is also my DSL router; it speaks PPPoE directly through my DSL modem. This means that doing traffic shaping on my Linux desktop should cover everything, without any need to wrestle with a limited router OS environment. And there were some more or less cut and paste directions on the site.

So my outbound configuration was simple and obviously not harmful:

tc qdisc add root dev ppp0 cake bandwidth 7.6Mbit

The bandwidth is a guess, although one informed by checking both my raw DSL line rate and what testing sites told me.

The inbound configuration was copied from the documentation and it's where I don't understand what I'm doing:

ip link add name ifb4ppp0 type ifb
tc qdisc add dev ppp0 handle ffff: ingress
tc qdisc add dev ifb4ppp0 root cake bandwidth 40Mbit besteffort
ip link set ifb4ppp0 up
tc filter add dev ppp0 parent ffff: matchall action mirred egress redirect dev ifb4ppp0

(This order follows the documentation.)

Here is what I understand about this. As covered in the tc manual page, traffic shaping and scheduling happens only on 'egress', which is to say for outbound traffic. To handle inbound traffic, we need a level of indirection to a special ifb (Intermediate Functional Block) (also) device, which is apparently used only for our (inbound) tc qdisc.

So we have two pieces. The first is the actual traffic shaping on the IFB link, ifb4ppp0, and setting the link 'up' so that it will actually handle traffic instead of throwing it away. The second is that we have to push inbound traffic on ppp0 through ifb4ppp0 to get its traffic shaping. To do this we add a special 'ingress' qdisc to ppp0, which applies to inbound traffic, and then we use a tc filter that matches all (ingress) traffic and redirects it to ifb4ppp0 as 'egress' traffic. Since it's now egress traffic, the tc shaping on ifb4ppp0 will now apply to it and do things.

When I set this up I wasn't certain if it was going to break my non-trivial firewall rules on the ppp0 interface. However, everything seems to be fine, and the only thing the tc redirect is affecting is traffic shaping. My firewall block rules and NAT rules are still working.

Applying these tc rules definitely improved my latency scores on the test site; my link went from an F rating to an A rating (and a C rating for downloads and uploads happening at once). Does this improve my latency in practice for things like interactive SSH connections while downloads and uploads are happening? It's hard for me to tell, partly because I don't do such downloads and uploads very often, especially while I'm doing interactive stuff over SSH.

(Of course partly this is because I've sort of conditioned myself out of trying to do interactive SSH while other things are happening on my DSL link.)

The most I can say is that this probably improves things, and that since my DSL connection has drifted into having relatively bad latency to start with (by my standards), it probably helps to minimize how much worse it gets under load.

I do seem to get slightly less bandwidth for transfers than I did before; experimentation says that how much less can be fiddled with by adjusting the tc 'bandwidth' settings, although that also changes latency (more bandwidth creates worse latency). Given that I rarely do large downloads or uploads, I'm willing to trade off slightly lower bandwidth for (much) less of a latency hit. One reason that my bandwidth numbers are approximate anyway is that I'm not sure how much PPPoE DSL framing compensation I need.
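For a rough sense of the framing question, here's a back-of-the-envelope calculation; the per-packet overheads are the commonly cited figures for PPPoE over Ethernet framing, not measurements of my actual line:

```python
def goodput_fraction(mtu=1500, ptm=False):
    """Rough fraction of the raw line rate left for IP traffic when
    sending full-size packets. PPPoE adds 8 bytes per packet (6-byte
    PPPoE header plus 2-byte PPP protocol field) and Ethernet framing
    adds 18 more (14-byte header plus 4-byte FCS). VDSL2's PTM layer
    costs a further 1/65th (one sync byte per 65-byte codeword)."""
    wire = mtu + 8 + 18
    frac = mtu / wire
    if ptm:
        frac *= 64 / 65
    return frac

# On a nominal 45 Mbit line, PPPoE + PTM framing shaves a few percent
# off the rate you should hand to 'tc ... bandwidth'.
```

(I believe cake can also be told about framing directly, via the 'overhead' and related keywords described in tc-cake(8), which would make this sort of manual compensation unnecessary.)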

(The Arch wiki has a page on advanced traffic control that has some discussion of tc.)

Sidebar: A rewritten command order for ingress traffic

If my understanding is correct, we can rewrite the commands to set up inbound traffic shaping to be more clearly ordered:

# Create and enable ifb link
ip link add name ifb4ppp0 type ifb
ip link set ifb4ppp0 up

# Set CAKE with bandwidth limits for
# our actual shaping, on ifb link.
tc qdisc add dev ifb4ppp0 root cake bandwidth 40Mbit besteffort

# Wire ifb link (with tc shaping) to inbound
# ppp0 traffic.
tc qdisc add dev ppp0 handle ffff: ingress
tc filter add dev ppp0 parent ffff: matchall action mirred egress redirect dev ifb4ppp0

The 'ifb4ppp0' name is arbitrary but conventional, set up as 'ifb4<whatever>'.

Distribution source packages and whether or not to embed in the source code

By: cks

When I described my current ideal Linux source package format, I said that it should be embedded in the source code of the software being packaged. In a comment, bitprophet had a perfectly reasonable and good preference the other way:

Re: other points: all else equal I think I vaguely prefer the Arch "repo contains just the extras/instructions + a reference to the upstream source" approach as it's cleaner overall, and makes it easier to do "more often than it ought to be" cursed things like "apply some form of newer packaging instructions against an older upstream version" (or vice versa).

The Arch approach is isomorphic to the source RPM format, which has various extras and instructions plus a pre-downloaded set of upstream sources. It's not really isomorphic to the Debian source format because you don't normally work with the split up version; the split up version is just a package distribution thing (as dgit shows).

(I believe the Arch approach is also how the FreeBSD and OpenBSD ports trees work. Also, the source package format you work in is not necessarily how you bundle up and distribute source packages, again as shown by Debian.)

Let's call these two packaging options the inline approach (Debian) and the out of line approach (Arch, RPM). My view is that which one you want depends on what you want to do with software and packages. The out of line approach makes it easier to build unmodified packages, and as bitprophet comments it's easy to do weird build things. If you start from a standard template for the type of build and install the software uses, you can practically write the packaging instructions yourself. And the files you need to keep are quite compact (and if you want, it's relatively easy to put a bunch of them into a single VCS repository, each in its own subdirectory).

However, the out of line approach makes modifying upstream software much more difficult than a good version of the inline approach (such as, for example, dgit). To modify upstream software in the out of line approach you have to go through some process similar to what you'd do in the inline approach, and then turn your modifications into patches that your packaging instructions apply on top of the pristine upstream. Moving changes from version to version may be painful in various ways, and in addition to those nice compact out of line 'extras/instructions' package repos, you may want to keep around your full VCS work tree that you built the patches from.

(Out of line versus inline is a separate issue from whether or not the upstream source code should include packaging instructions in any form; I think that generally the upstream should not.)

As a system administrator, I'm biased toward easy modification of upstream packages and thus upstream source because that's most of why I need to build my own packages. However, these days I'm not sure if that's what a Linux distribution should be focusing on. This is especially true for 'rolling' distributions that mostly deal with security issues and bugs not by patching their own version of the software but by moving to a new upstream version that has the security fix or bug fix. If most of what a distribution packages is unmodified from the upstream version, optimizing for that in your (working) source package format is perfectly sensible.

A small suggestion in modern Linux: take screenshots (before upgrades)

By: cks

Mike Hoye recently wrote Powering Up, which is in part about helping people install (desktop) Linux, and the Fediverse thread version of it reminded me of something that I don't do enough of:

A related thing I've taken to doing before potential lurching changes (like Linux distribution upgrades) is to take screenshots and window images. Because comparing a now and then image is a heck of a lot easier than restoring backups, and I can look at it repeatedly as I fix things on the new setup.

Linux distributions and the software they package have a long history of deciding to change things for your own good. They will tinker with font choices, font sizes, default DPI determinations, the size of UI elements, and so on, not quite at the drop of a hat but definitely when you do something like upgrade your distribution and bring in a bunch of significant package version changes (and new programs to replace old programs).

Some people are perfectly okay with these changes. Other people, like me, are quite attached to the specifics of how their current desktop environment looks and will notice and be unhappy about even relatively small changes (eg, also). However, because we're fallible humans, people like me can't always recognize exactly what changed and remember exactly what the old version looked like (these two are related); instead, sometimes all we have is the sense that something changed but we're not quite sure exactly what or exactly how.

Screenshots and window images are the fix for that unspecific feeling. Has something changed? You can call up an old screenshot to check, and to examine what (and then maybe work out how to reverse it, or decide to live with the change). Screenshots aren't perfect; for example, they won't necessarily tell you what the old fonts were called or what sizes were being used. But they're a lot better than trying to rely on memory or other options.

It would probably also do me good to get into the habit of taking screenshots periodically, even outside of distribution upgrades. Looking back over time every so often is potentially useful to see more subtle, more long term changes, and perhaps ask myself either why I'm not doing something any more or why I'm still doing it.

(Currently I'm somewhat lackadaisical about taking screenshots even before distribution upgrades. I have a distribution upgrade process but I haven't made screenshots part of it, and I don't have an explicit checklist for the process. Which I definitely should create. Possibly I should also try to capture font information in text form, to the extent that I can find it.)

My ideal Linux source package format (at the moment)

By: cks

I've written recently on why source packages are complicated and why packages should be declarative (in contrast to Arch style shell scripts), but I haven't said anything about what I'd like in a source package format, which will mostly be from the perspective of a system administrator who sometimes needs to modify upstream packages or package things myself.

A source package format is a compromise. After my recent experiences with dgit, I now feel that the best option is that a source package is a VCS repository directory tree (Git by default) with special control files in a subdirectory. Normally this will be the upstream VCS repository with packaging control files and any local changes merged in as VCS commits. You perform normal builds in this checked out repository, which has the advantage of convenience and the disadvantage that you have to clean up the result, possibly with liberal use of 'git clean' and 'git reset'. Hermetic builds are done by some tool that copies the checked out files to a build area, or clones the repository, or some other option. If a binary package is built in an environment where this information is available, its metadata should include the exact current VCS commit it was built from, and I would make binary packages not build if there were uncommitted changes.

(Making the native source package a VCS tree with all of the source code makes it easy to work on but mingles package control files with the program source. In today's environment with good distributed VCSes I think this is the right tradeoff.)
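A minimal sketch of the 'refuse to build from a dirty tree' rule above (the git commands are standard; the policy itself is my proposal, not something an existing packaging tool does):

```python
import subprocess

def build_commit_id(run=subprocess.run):
    """Return the VCS commit to record in the binary package's metadata,
    refusing to proceed if the working tree has uncommitted changes.
    'run' is injectable so the logic can be exercised without a repo."""
    status = run(["git", "status", "--porcelain"],
                 capture_output=True, text=True)
    if status.stdout.strip():
        raise RuntimeError("uncommitted changes: refusing to build")
    head = run(["git", "rev-parse", "HEAD"],
               capture_output=True, text=True)
    return head.stdout.strip()
```

The build tool would stamp this commit ID into the binary package so you can always trace an installed package back to the exact tree it came from.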

The control files should be as declarative as possible, and they should directly express major package metadata such as version numbers (unlike the Debian package format, where the version number is derived from debian/changelog). There should be a changelog but it should be relatively free-form, like RPM changelogs. Changelogs are especially useful for local modifications because they go along with the installed binary package, which means that you can get an answer to 'what did we change in this locally modified package' without having to find your source. The main metadata file that controls everything should be kept simple; I would go as far as to say it should have a format that doesn't allow for multi-line strings, and anything that requires multi-line strings should go in additional separate files (including the package description). You could make it TOML but I don't think you should make it YAML.
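To make this concrete, here is a sketch of what such a deliberately restricted main metadata file might look like; the file name, the field names, and the choice of TOML are all my illustration, not an existing format:

```toml
# pkgmeta/package.toml -- hypothetical main metadata, single-line values only
name = "frob"
version = "1.2.3"
release = "1"
license = "GPL-2.0-or-later"
homepage = "https://example.org/frob"

# Anything multi-line lives in separate files alongside this one:
# pkgmeta/description, pkgmeta/changelog, pkgmeta/build.sh, and so on.
```

Keeping this file trivially parseable means tools can read and rewrite it (for example, to bump the version) without needing a full TOML or YAML implementation's edge cases.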

Both the build time actions, such as configuring and compiling the source, and the binary package install time actions should by default be declarative; you should be able to say 'this is an autoconf based program and it should have the following additional options', and the build system will take care of everything else. Similarly you should be able to directly express that the binary package needs certain standard things done when it's installed, like adding system users and enabling services. However, this will never be enough so you should also be able to express additional shell script level things that are done to prepare, build, install, upgrade, and so on the package. Unlike RPM and Debian source packages but somewhat like Arch packages, these should be separate files in the control directory, eg 'pkgmeta/build.sh'. Making these separate files makes it much easier to do things like run shellcheck on them or edit them in syntax-aware editor environments.

(It should be possible to combine standard declarative prepare and build actions with additional shell or other language scripting. We want people to be able to do as much as possible with standard, declarative things. Also, although I used '.sh', you should be able to write these actions in other languages too, such as Python or Perl.)

I feel that, like RPMs, you should at least default to explicitly declaring what files and directories are included in the binary package. Like RPMs, these installed files should be analyzed to determine the binary package dependencies rather than force you to try to declare them in the (source) package metadata (although you'll always have to declare build dependencies in the source package metadata). Like build and install scripts, these file lists should be in separate files, not in the main package metadata file. The RPM collection of magic ways to declare file locations is complex but useful so that, for example, you don't have to keep editing your file lists when the Python version changes. I also feel that you should have to specifically mark files in the file lists with unusual permissions, such as setuid or setgid bits.
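Continuing the sketch, a file list in this hypothetical format might borrow RPM's location macros and its explicit permission markers (all of the names here are invented for illustration):

```
# pkgmeta/files.main -- hypothetical file list with RPM-like macros
%{_bindir}/frob
%{_mandir}/man1/frob.1*
%config(noreplace) /etc/frob/frob.conf
# unusual permissions must be explicitly marked:
%attr(4755, root, root) %{_libexecdir}/frob-helper
```

The location macros are what keep the list stable across, say, a change in where a distribution puts Python packages.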

The natural way to start packaging something new in this system would be to clone its repository and then start adding the package control files. The packaging system could make this easier by having additional tools that you ran in the root of your just-cloned repository and looked around to find indications of things like the name, the version (based on repository tags), the build system in use, and so on, and then wrote out preliminary versions of the control files. More tools could be used incrementally for things like generating the file lists; you'd run the build and 'install' process, then have a tool inventory the installed files for you (and in the process it could recognize places where it should change absolute paths into specially encoded ones for things like 'the current Python package location').

This sketch leaves a lot of questions open, such as what 'source packages' should look like when published by distributions. One answer is to publish the VCS repository but that's potentially quite heavyweight, so you might want a more minimal form. However, once you create a 'source only' minimal form without the VCS history, you're going to want a way to disentangle your local changes from the upstream source.

Linux distribution packaging should be as declarative as possible

By: cks

A commentator on my entry on why Debian and RPM (source) packages are complicated suggested looking at Arch Linux packaging, where most of the information is in a single file as more or less a shell script (example). Unfortunately, I'm not a fan of this sort of shell script or shell script like format, ultimately because it's only declarative by convention (although I suspect Arch enforces some of those conventions). One reason that declarative formats are important is that you can analyze and understand what they do without having to execute code. Another reason is that such formats naturally standardize things, which makes it much more likely that any divergence from the standard approach is something that matters, instead of a style difference.

Being able to analyze and manipulate declarative (source) packaging is useful for large scale changes within a distribution. The RPM source package format uses standard, more or less declarative macros to build most software, which I understand has made it relatively easy to build a lot of software with special C and C++ hardening options. You can inject similar things into a shell script based environment, but then you wind up with ad-hoc looking modifications in some circumstances, as we see in the Dovecot example.
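As a concrete illustration, the build section of a typical autoconf-based RPM spec file is almost nothing but standard macros (this is the usual shape, slightly simplified, with an invented package name; the macros expand to include distribution-wide compiler flags, which is exactly the hook used for injecting hardening options):

```
%build
%configure
%make_build

%install
%make_install

%files
%{_bindir}/example
%{_mandir}/man1/example.1*
```

Because almost every package goes through the same %configure and %make_build macros, changing what those macros expand to changes how the whole distribution is built.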

Some things about declarative source packages versus Arch style minimalism are issues of what could be called 'hygiene'. RPM packages push you to list and categorize what files will be included in the built binary package, rather than simply assuming that everything installed into a scratch hierarchy should be packaged. This can be frustrating (and there are shortcuts), but it does give you a chance to avoid accidentally shipping unintended files. You could do this with shell script style minimal packaging if you wanted to, of course. Both RPM and Debian packages have standard and relatively declarative ways to modify a pristine upstream package, and while you can do that in Arch packages, it's not declarative, which hampers various sorts of things.

Basically my feeling is that at scale, you're likely to wind up with something that's essentially as formulaic as a declarative source package format without having its assured benefits. There will be standard templates that everyone is supposed to follow and they mostly will, and you'll be able to mostly analyze the result, and that 'mostly' qualification will be quietly annoying.

(On the positive side, the Arch package format does let you run shellcheck on your shell stanzas, which isn't straightforward to do in the RPM source format.)

Why Debian and RPM (source) packages are complicated

By: cks

A commentator on my early notes on dgit mentioned that they found packaging in Debian overly complicated (and I think perhaps RPMs as well) and would rather build and ship a container. On the one hand, this is in a way fair; my impression is that the process of specifying and building a container is rather easier than for source packages. On the other hand, Debian and RPM source packages are complicated for good reasons.

Any reasonably capable source package format needs to contain a number of things. A source package needs to supply the original upstream source code, some amount of distribution changes, instructions for building and 'installing' the source, a list of (some) dependencies (for either or both build time and install time), a list of files and directories it packages, and possibly additional instructions for things to do when the binary package is installed (such as creating users, enabling services, and so on). Then generally you need some system for 'hermetic' builds, ones that don't depend on things in your local (Linux) login environment. You'll also want some amount of metadata to go with the package, like a name, a version number, and a description. Good source package formats also support building multiple binary packages from a single source package, because sometimes you want to split up the built binary files to reduce the amount of stuff some people have to install. A built binary package contains a subset of this; it has (at least) the metadata, the dependencies, a file list, all of the files in the file list, and those install and upgrade time instructions.

Built containers are a self contained blob plus some metadata. You don't need file lists or dependencies or install and removal actions because all of those are about interaction with the rest of the system and by design containers don't interact with the rest of the system. To build a container you still need some of the same information that a source package has, but you need less and it's deliberately more self-contained and freeform. Since the built container is a self contained artifact you don't need a file list, I believe it's uncommon to modify upstream source code as part of the container build process (instead you patch it in advance in your local repository), and your addition of users, activation of services, and so on is mostly free form and at container build time; once built the container is supposed to be ready to go. And my impression is that in practice people mostly don't try to do things like multiple UIDs in a single container.

(You may still want or need to understand what things you install where in the container image, but that's your problem to keep track of; the container format itself only needs a little bit of information from you.)

Containers have also learned from source packages in that they can be layered, which is to say that you can build your container by starting from some other container, either literally or by sticking another level of build instructions on the end. Layered source packages don't make any sense when you're thinking like a distribution, but they make a lot of sense for people who need to modify the distribution's source packages (this is what dgit makes much easier, partly because Git is effectively a layering system; that's one way to look at a sequence of Git commits).

(My impression of container building is that it's a lot more ad-hoc than package building. Both Debian and RPM have tried to standardize and automate a lot of the standard source code building steps, like running autoconf, but the cost of this is that each of them has a bespoke set of 'convenient' automation to learn if you want to build a package from scratch. With containers, you can probably mostly copy the upstream's shell-based build instructions (or these days, their Dockerfile).)

Dgit based building of (potentially modified) Debian packages can be surprisingly close to the container building experience. Like containers, you first prepare your modifications in a repository and then you run some relatively simple commands to build the artifacts you'll actually use. Provided that your modifications don't change the dependencies, files to be packaged, and so on, you don't have to care about how Debian defines and manipulates those, plus you don't even need to know exactly how to build the software (the Debian stuff takes care of that for you, which is to say that the Debian package builders have already worked it out).

In general I don't think you can get much closer to the container build experience than the dgit build experience or the general RPM experience (if you're starting from scratch). Packaging takes work because packages aren't isolated, self contained objects; they're objects that need to be integrated into a whole system in a reversible way (ie, you can uninstall them, or upgrade them even though the upgraded version has a somewhat different set of files). You need more information, more understanding, and a more complicated build process.

(Well, I suppose there are flatpaks (and snaps). But these mostly don't integrate with the rest of your system; they're explicitly designed to be self-contained, standalone artifacts that run in a somewhat less isolated environment than containers.)

Moving local package changes to a new Ubuntu release with dgit

By: cks

Suppose, not entirely hypothetically, that you've made local changes to an Ubuntu package on one Ubuntu release, such as 22.04 ('jammy'), and now you want to move to another Ubuntu release such as 24.04 ('noble'). If you're working with straight 'apt-get source' Ubuntu source packages, this is done by tediously copying all of your patches over (hopefully the package uses quilt) to duplicate and recreate your 22.04 work.

If you're using dgit, this is much easier. Partly this is because dgit is based on Git, but partly this is because dgit has an extremely convenient feature where it can have several different releases in the same Git repository. So here's what we want to do, assuming you have a dgit repository for your package already.

(For safety you may want to do this in a copy of your repository. I make rsync'd copies of Git repositories all the time for stuff like this.)

Our first step is to fetch the new 24.04 ('noble') version of the package into our dgit repository as a new dgit branch, and then check out the branch:

dgit fetch -d ubuntu noble,-security,-updates
dgit checkout noble,-security,-updates

We could do this in one operation but I'd rather do it in two, in case there are problems with the fetch.

The Git operation we want to do now is to cherry-pick (also) our changes to the 22.04 version of the package onto the 24.04 version of the package. If this goes well the changes will apply cleanly and we're done. However, there is a complication. If we've followed the usual process for making dgit-based local changes, the last commit on our 22.04 version is an update to debian/changelog. We don't want that change, because we need to do our own 'gbp dch' on the 24.04 version after we've moved our own changes over to make our own 24.04 change to debian/changelog (among other things, the 22.04 changelog change has the wrong version number for the 24.04 package).

In general, cherry-picking all our local changes is 'git cherry-pick old-upstream..old-local'. To get all but the last change, we want 'old-local~' instead. Dgit has long and somewhat obscure branch names; its upstream for our 22.04 changes is 'dgit/dgit/jammy,-security,-updates' (ie, the full 'suite' name we had to use with 'dgit clone' and 'dgit fetch'), while our local branch is 'dgit/jammy,-security,-updates'. So our full command, with a 'git log' beforehand to be sure we're getting what we want, is:

git log dgit/dgit/jammy,-security,-updates..dgit/jammy,-security,-updates~
git cherry-pick dgit/dgit/jammy,-security,-updates..dgit/jammy,-security,-updates~

(We've seen this dgit/dgit/... stuff before when doing 'gbp dch'.)

Then we need to make our debian/changelog update. Here, as an important safety tip, don't blindly copy the command you used while building the 22.04 package, using 'jammy,...' in the --since argument, because that will try to create a very confused changelog of everything between the 22.04 version of the package and the 24.04 version. Instead, you obviously need to update it to your new 'noble' 24.04 upstream, making it:

gbp dch --since dgit/dgit/noble,-security,-updates --local .cslab. --ignore-branch --commit

('git reset --hard HEAD~' may be useful if you make a mistake here. As they say, ask me how I know.)

If the cherry-pick doesn't apply cleanly, you'll have to resolve that yourself. If the cherry-pick applies cleanly but the result doesn't build or perhaps doesn't work because the code has changed too much, you'll be using various ways to modify and update your changes. But at least this is a bunch easier than trying to sort out and update a quilt-based patch series.

Appendix: Dealing with Ubuntu package updates

Based on this conversation, if Ubuntu releases a new version of the package, what I think I need to do is to use 'dgit fetch' and then explicitly rebase:

dgit fetch -d ubuntu

You have to use '-d ubuntu' here or 'dgit fetch' gets confused and fails. There may be ways to fix this with git config settings, but setting them all is exhausting and if you miss one it explodes, so I'm going to have to use '-d ubuntu' all the time (unless dgit fixes this someday).

Dgit repositories don't have an explicit Git upstream set, so I don't think we can use plain rebase. Instead I think we need the more complicated form:

git rebase dgit/dgit/jammy,-security,-updates dgit/jammy,-security,-updates

(Until I do it for real, these arguments are speculative. I believe they should work if I understand 'git rebase' correctly, but I'm not completely sure. I might need the full three argument form and to make the 'upstream' a commit hash.)

Then, as above, we need to drop our debian/changelog change and redo it:

git reset --hard HEAD~
gbp dch --since dgit/dgit/jammy,-security,-updates --local .cslab. --ignore-branch --commit

(There may be a clever way to tell 'git rebase' to skip the last change, or you can do an interactive rebase (with '-i') instead of a non-interactive one and delete it yourself.)

Early notes about using dgit on Ubuntu (LTS)

By: cks

I recently read Ian Jackson's Debian’s git transition (via) and had a reaction:

I would really like to be able to patch and rebuild Ubuntu packages from a git repository with our local changes (re)based on top of upstream git. It would be much better than quilt'ing and debuild'ing .dsc packages (I have non-complimentary opinions on the Debian source package format). This news gives me hope that it'll be possible someday, but especially for Ubuntu I have no idea how soon or how well documented it will be.

(It could even be better than RPMs.)

The subsequent discussion got me to try out dgit, especially since it had an attractive dgit-user(7) manual page that gave very simple directions on how to make a local change to an upstream package. It turns out that things aren't entirely smooth on Ubuntu, but they're workable.

The starting point is 'dgit clone', but on Ubuntu you currently get to use special arguments that aren't necessary on Debian:

dgit clone -d ubuntu dovecot jammy,-security,-updates

(You don't have to do this on a machine running 'jammy' (Ubuntu 22.04); it may be more convenient to do it from another one, perhaps with a more up to date dgit.)

The latest Ubuntu package for something may be in either their <release>-security or their <release>-updates 'suite', so you need both. I think this is equivalent to what 'apt-get source' gets you, but you might want to double check. Once you've gotten the source in a Git repository, you can modify it and commit those modifications as usual, for example through Magit. If you have an existing locally patched version of the package that you did with quilt, you can import all of the quilt patches, either one by one or all at once and then using Magit's selective commits to sort things out.
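As a sketch of the 'all at once' quilt import (../old-pkg is a stand-in for your previous quilt-based tree, and this assumes a plain series file with no per-patch options or comment lines):

```shell
# Turn an existing quilt patch series into individual Git commits.
while read -r p; do
    git apply "../old-pkg/debian/patches/$p" &&
        git add -A &&
        git commit -m "Import quilt patch $p"
done < ../old-pkg/debian/patches/series
```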

Having made your modifications, whether tentative or otherwise, you can now automatically modify debian/changelog:

gbp dch --since dgit/dgit/jammy,-security,-updates --local .cslab. --ignore-branch --commit

(You might want to use -S for snapshots when testing modifications and builds; I don't know. Our practice is to use --local to add a local suffix to the upstream package version, so we can keep our packages straight.)

The special bit is the 'dgit/dgit/<whatever you used in dgit clone>', which tells gbp-dch (part of the gbp suite of stuff) where to start the changelog from. Using --commit is optional; what I did was to first run 'gbp dch' without it, then use 'git diff' to inspect the resulting debian/changelog changes, and then 'git restore debian/changelog' and re-run it with a better set of options until eventually I added the '--commit'.

You can then install build-deps (if necessary) and build the binary packages with the dgit-user(7) recommended 'dpkg-buildpackage -uc -b'. Normally I'd say that you absolutely want to build source packages too, but since you have a Git repository with the state frozen that you can rebuild from, I don't think it's necessary here.

(After the build finishes you can admire 'git status' output that will tell you just how many files in your source tree the Debian or Ubuntu package building process modified. One of the nice things about using Git and building from a Git repository is that you can trivially fix them all, rather than the usual set of painful workarounds.)
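A hedged sketch of that trivial fix-up after a build (git clean deletes untracked files for real, so preview with -n before committing to -f):

```shell
git status --short    # see what the build modified or left behind
git restore .         # put modified tracked files back as committed
git clean -nd         # preview untracked leftovers...
git clean -fd         # ...then actually remove them
```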

The dgit-user(7) manual page suggests but doesn't confirm that if you're bold, you can build from a tree with uncommitted changes. Personally, even if I was in the process of developing changes I'd commit them and then make liberal use of rebasing, git-absorb, and so on to keep updating my (committed) changes.

It's not clear to me how to integrate upstream updates (for example, a new Ubuntu update to the Dovecot package) with your local changes. It's possible that 'dgit pull' will automatically rebase your changes, or give you the opportunity to do that. If not, you can always do another 'dgit clone' and then manually import your Git changes as patches.

(A disclaimer: at this point I've only cloned, modified, and built one package, although it's a real one we use. Still, I'm sold; the ability to reset the tree after a build is valuable all by itself, never mind having a better way than quilt to handle making changes.)

The systemd journal, message priorities, and (syslog) facilities

By: cks

If you use systemd units or systemd-run to conveniently capture output from scripts and programs into the systemd journal, one of the things that it looks like you don't get is message priorities and (syslog) facilities. Fortunately, systemd's journal support is a bit more sophisticated than that.

When you print out regular output and systemd captures it into the journal, systemd assigns it a default priority that's set with SyslogLevel=; this is normally 'info', which is a good default choice. Similarly, you can pick the syslog facility associated with your unit or your systemd-run invocation with SyslogFacility=. Systemd defaults to 'daemon', which may not entirely be what you want. On the other hand, the choice of syslog facility matters less if you're primarily working with journalctl, where what you usually care about is the systemd unit name.

(You can use journalctl to select messages by priority or syslog facility with the -p and --facility options. You can also select by syslog identifier with the -t option. This is probably going to be handy for searching the journal for messages from some of our programs that use syslog to report things.)
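Put together, the unit-side settings mentioned above look like this in a service unit or drop-in (the values here are illustrative choices, not systemd's defaults):

```ini
[Service]
SyslogIdentifier=myscript
SyslogFacility=local0
SyslogLevel=notice
```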

If you know that you're logging to systemd (or you don't care that your regular output looks a bit weird in spots), you can also print messages with special priority markers, as covered in sd-daemon(3). Now that I know about this, I may put it to use in some of our scripts and programs. Sadly, unlike the normal Linux logger and its --prio-prefix option, you can't change the syslog facility this way, but if you're doing pure journald logging you probably don't care about that.
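The sd-daemon(3) markers are just '<N>' prefixes at the start of each output line, where N is the usual syslog level number. A minimal sketch for a script whose output systemd (or systemd-cat) is capturing:

```shell
# A leading "<N>" sets the journal priority of that line when systemd
# captures the output: 3 = err, 4 = warning, 6 = info.
echo "<6>starting the nightly cleanup"
echo "<4>cache directory was missing, recreated it"
echo "<3>cleanup failed, giving up"
```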

(It's possible that sd-daemon(3) actually supports the logger behavior of changing the syslog facility too, but if so it's not documented and you shouldn't count on it. Instead you should assume that you have to control the syslog facility through setting SyslogFacility=, which unfortunately means you can't log just authentication things to 'auth' and everything else to 'daemon' or some other appropriate facility.)

PS: Unfortunately, as far as I know journalctl has no way to augment its normal syslog-like output with some additional fields, such as the priority or the syslog facility. Instead you have to go all the way to a verbose dump of information in one of the supported formats for field selection.
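The workaround is to dump the structured fields and pick out the ones you want yourself; a sketch using jq ('myscript' is a stand-in unit name, and this assumes you have jq installed):

```shell
# Show priority, syslog facility, and message for each journal entry.
journalctl -u myscript -o json --no-pager |
    jq -r '"\(.PRIORITY) \(.SYSLOG_FACILITY // "-") \(.MESSAGE)"'
```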

Some notes on using systemd-run or systemd-cat for logging program output

By: cks

In response to yesterday's entry on using systemd (service) units for easy capturing of log output, a commentator drew my attention to systemd-run and systemd-cat. I spent a bit of time poking at both of them and so I've wound up with some things to remember and some opinions.

(The short summary is that you probably want to use systemd-run with a specific unit name that you pick.)

Systemd-cat is very roughly the systemd equivalent of logger. As you'd expect, things that it puts in the systemd journal flow through to anywhere that regular journal entries would, including things that directly get fed from the journal and syslog (including remote syslog destinations). The most convenient way to use systemd-cat is to just have it run a command, at which point it will capture all of the output from the command and put it in the journal. However, there is a little issue with using just 'systemd-cat /some/command', which is that the journal log identifiers that systemd-cat generates in this case will be the direct name of whatever program produced the output. If /some/command is a script that runs a variety of programs that produce output (perhaps it echos some status information itself then runs a program, which produces output on its own), you'll get a mixture of identifier names in the resulting log:

your-script[...]: >>> Frobulating the thing
some-prog[...]: Frobulation results: 23 processed, 0 errors

Journal logs written by systemd-cat also inherit whatever unit it was in (a session unit, cron.service, etc), and the combination can make it hard to clearly see all of the logs from running your script. To do better you need to give systemd-cat an explicit identifier, 'systemd-cat -t <something> /some/command', at which point everything is logged with that name, but still in whatever systemd unit systemd-cat ran in.

Generally you want your script to report all its logs under a single unit name, so you can find them and sort them out from all of the other things your system is logging. To do this you need to use systemd-run with an explicit unit name:

systemd-run -u myscript --quiet --wait -G /some/script

I believe you can then hook this into any systemd service unit infrastructure you want, such as sending email if the unit fails (if you do, you probably want to add '--service-type=oneshot'). Using systemd-run this way gets you the best of both systemd-cat worlds; all of the output from /some/script will be directly labeled with what program produced it, but you can find it all using the unit name.

Systemd-run will refuse to activate a unit with a name that duplicates an existing unit, including existing systemd-run units. In many cases this is a feature for script use, since you basically get 'run only one copy' locking for free (although the error message is noisy, so you may want to do your own quiet locking). If you want to always run your program even if another instance is running, you'll have to generate non-constant unit names (or let systemd-run do it for you).
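A sketch of exploiting that refusal as free single-copy locking ('myscript' and /some/script are stand-ins; note that with --wait a nonzero exit can also mean the script itself failed):

```shell
# The fixed -u name doubles as a "only one copy at a time" lock;
# 2>/dev/null hides systemd-run's noisy duplicate-name error.
if ! systemd-run -u myscript --quiet --wait -G /some/script 2>/dev/null; then
    echo "myscript didn't run cleanly (or another copy holds the name)" >&2
fi
```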

Systemd-cat has some features that systemd-run doesn't offer, such as setting the priority of messages (and setting a different priority for standard error output). If these features are important to you, I'd suggest nesting systemd-cat (with no '-t' argument) inside systemd-run, so you get both the searchable unit name and the systemd-cat features. If you're already in an environment with a useful unit name and you just need to divert log messages from wherever else the environment wants to send them into the system journal, bare systemd-cat will do the job.

(Arguably this is the case for things run from cron, if you're content to look for all of them under cron.service (or crond.service, depending on your Linux distribution). Running things under systemd-cat puts their output in the journal instead of having them send you email, which may be good enough and saves you having to invent and then remember a bunch of unit names.)

Turning to systemd units for easy capturing of log output

By: cks

Suppose, not hypothetically, that you have a third party tool that you need to run periodically. This tool prints things to standard output (or standard error) that are potentially useful to capture somehow. You want this captured output to be associated with the program (or your general system for running the program) and timestamped, and it would be handy if the log output wound up in all of the usual places in your systems for output. Unix has traditionally had some solutions for this, such as logger for sending things to syslog, but they all have a certain amount of annoyances associated with them.

(If you directly run your script or program from cron, you will automatically capture the output in a nice dated form, but you'll also get email all the time. Let's assume we want a quieter experience than email from cron, because you don't need to regularly see the output, you just want it to be available if you go looking.)

On modern Linux systems, the easy and lazy thing to do is to run your script or program from a systemd service unit, because systemd will automatically do this for you and send the result into the systemd journal (and anything that pulls data from that) and, if configured, into whatever overall systems you have for handling syslog logs. You want a unit like this:

[Unit]
Description=Local: Do whatever
ConditionFileIsExecutable=/root/do-whatever

[Service]
Type=oneshot
ExecStart=/root/do-whatever

Unlike the usual setup for running scripts as systemd services, we don't set 'RemainAfterExit=True' because we want to be able to repeatedly trigger our script with, for example, 'systemctl start local-whatever.service'. You can even arrange to get email if this unit (ie, your script) fails.

You can run this directly from cron through suitable /etc/cron.d files that use 'systemctl start', or set up a systemd timer unit (possibly with a randomized start time). The advantage of a systemd timer unit is that you definitely won't ever get email about this unless you specifically configure it. If you're setting up a relatively unimportant and throwaway thing, it being reliably silent is probably a feature.

(Setting up a systemd timer unit also keeps everything within the systemd ecosystem rather than worrying about various aspects of running 'systemctl start' from scripts or crontabs or etc.)
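A minimal matching timer unit (local-whatever.timer, the name assumed to pair with the service above) with a randomized start might look like:

```ini
[Unit]
Description=Local: Run do-whatever daily

[Timer]
OnCalendar=daily
RandomizedDelaySec=30min

[Install]
WantedBy=timers.target
```

After 'systemctl enable --now local-whatever.timer', the service runs once a day at a randomized offset, with no cron involvement at all.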

On the one hand, it feels awkward to go all the way to a systemd service unit simply to get easy-to-handle logs; it feels like there should be a better solution somewhere. On the other hand, it works and it only needs one extra file over what you'd already need (the .service).

Why I (still) love Linux

I usually publish articles about how much I love the BSDs or illumos distributions, but today I want to talk about Linux (or, better, GNU/Linux) and why, despite everything, it still holds a place in my heart.

In Linux, filesystems can and do have things with inode number zero

By: cks

A while back I wrote about how in POSIX you could theoretically use inode (number) zero. Not all Unixes consider inode zero to be valid; prominently, OpenBSD's getdents(2) doesn't return valid entries with an inode number of 0, and by extension, OpenBSD's filesystems won't have anything that uses inode zero. However, Linux is a different beast.

Recently, I saw a Go commit message with the interesting description of:

os: allow direntries to have zero inodes on Linux

Some Linux filesystems have been known to return valid entries with zero inodes. This new behavior also puts Go in agreement with recent glibc.

This fixes issue #76428, and the issue has a simple reproduction to create something with inode numbers of zero. According to the bug report:

[...] On a Linux system with libfuse 3.17.1 or later, you can do this easily with GVFS:

# Create many dir entries
(cd big && printf '%04x ' {0..1023} | xargs mkdir -p)
gio mount sftp://localhost/$PWD/big

The resulting filesystem mount is in /run/user/$UID/gvfs (see the issue for the exact long path) and can be experimentally verified to have entries with inode numbers of zero (well, as reported by reading the directory). On systems using glibc 2.37 and later, you can look at this directory with 'ls' and see the zero inode numbers.
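A quick way to scan for such entries yourself is to look for a zero in the inode column of 'ls -i' output; a small sketch (point it at the gvfs mount path from the issue, or any suspect directory):

```shell
#!/bin/sh
# Print directory entries whose reported inode number is zero.
# Defaults to the current directory; pass a directory as the argument.
dir="${1:-.}"
ls -ia "$dir" | awk '$1 == 0 { print $2 }'
```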

(Interested parties can try their favorite non-C or non-glibc bindings to see if those environments correctly handle this case.)

That this requires glibc 2.37 is due to this glibc bug, first opened in 2010 (but rejected at the time for reasons you can read in the glibc bug) and then resurfaced in 2016 and eventually fixed in 2022 (and then again in 2024 for the thread safe version of readdir). The 2016 glibc issue has a bit of a discussion about the kernel side. As covered in the Go issue, libfuse returning a zero inode number may be a bug itself, but there are (many) versions of libfuse out in the wild that actually do this today.

Of course, libfuse (and gvfs) may not be the only Linux filesystems and filesystem environments that can create this effect. I believe there are alternate language bindings and APIs for the kernel FUSE (also, also) support, so they might have the same bug as libfuse does.

(Both Go and Rust have at least one native binding to the kernel FUSE driver. I haven't looked at either to see what they do about inode numbers.)

PS: My understanding of the Linux (kernel) situation is that if you have something inside the kernel that needs an inode number and you ask the kernel to give you one (through get_next_ino(), an internal function for this), the kernel will carefully avoid giving you inode number 0. A lot of things get inode numbers this way, so this makes life easier for everyone. However, a filesystem can decide on inode numbers itself, and when it does it can use inode number 0 (either explicitly or by zeroing out the d_ino field in the getdents(2) dirent structs that it returns, which I believe is what's happening in the libfuse situation).

Making Polkit authenticate people like su does (with group wheel)

By: cks

Polkit is how a lot of things on modern Linux systems decide whether or not to let people do privileged operations, including systemd's run0, which effectively functions as another su or sudo. Polkit normally has a significantly different authentication model than su or sudo, where an arbitrary login can authenticate for privileged operations by giving the password of any 'administrator' account (accounts in group wheel or group admin, depending on your Linux distribution).

Suppose, not hypothetically, that you want a su like model in Polkit, one where people in group 'wheel' can authenticate by providing the root password, while people not in group 'wheel' cannot authenticate for privileged operations at all. In my earlier entry on learning about Polkit and adjusting it I put forward an untested Polkit stanza to do this. Now I've tested it and I can provide an actual working version.

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // must exist but have a locked password
        return ["unix-user:nobody"];
    }
});

(This goes in /etc/polkit-1/rules.d/50-default.rules, and the filename is important because it has to replace the standard version in /usr/share/polkit-1/rules.d.)

This doesn't quite work the way 'su' does, where it will just refuse to work for people not in group wheel. Instead, if you're not in group wheel you'll be prompted for the password of 'nobody' (or whatever other login your rule returns), which you can never successfully supply because the password is locked.

As I've experimentally determined, it doesn't work to return an empty list ('[]'), or a Unix group that doesn't exist ('unix-group:nosuchgroup'), or a Unix group that exists but has no members. In all cases my Fedora 42 system falls back to asking for the root password, which I assume is a built-in default for privileged authentication. Instead you apparently have to return something that Polkit thinks it can plausibly use to authenticate the person, even if that authentication can't succeed. Hopefully Polkit will never get smart enough to work that out and stop accepting accounts with locked passwords.

(If you want to be friendly and you expect people on your servers to run into this a lot, you should probably create a login with a more useful name and GECOS field, perhaps 'not-allowed' and 'You cannot authenticate for this operation', that has a locked password. People may or may not realize what's going on, but at least they have a chance.)

PS: This is with the Fedora 42 version of Polkit, which is version 126. This appears to be the most recent version from the upstream project.

Sidebar: Disabling Polkit entirely

Initially I assumed that Polkit had explicit rules somewhere that authorized the 'root' user. However, as far as I can tell this isn't true; there's no normal rules that specifically authorize root or any other UID 0 login name, and despite that root can perform actions that are restricted to groups that root isn't in. I believe this means that you can explicitly disable all discretionary Polkit authorization with an '00-disable.rules' file that contains:

polkit.addRule(function(action, subject) {
    return polkit.Result.NO;
});

Based on experimentation, this disables absolutely everything, even actions that are considered generally harmless (like libvirt's 'virsh list', which I think normally anyone can do).

A slightly more friendly version can be had by creating a situation where there are no allowed administrative users. I think this would be done with a 50-default.rules file that contained:

polkit.addAdminRule(function(action, subject) {
    // must exist but have a locked password
    return ["unix-user:nobody"];
});

You'd also want to make sure that nobody is in any special groups that rules in /usr/share/polkit-1/rules.d use to allow automatic access. You can look for these by grep'ing for 'isInGroup'.
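That grep, run over both the packaged rules and any local overrides (Polkit consults both directories):

```shell
# Find rules that grant or deny things based on group membership.
grep -r 'isInGroup' /usr/share/polkit-1/rules.d /etc/polkit-1/rules.d
```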

The (early) good and bad parts of Polkit for a system administrator

By: cks

At a high level, Polkit is how a lot of things on modern Linux systems decide whether or not to let you do privileged operations. After looking into it a bit, I've wound up feeling that Polkit has both good and bad aspects from the perspective of a system administrator (especially a system administrator with multi-user Linux systems, where most of the people using them aren't supposed to have any special privileges). While I've used (desktop) Linuxes with Polkit for a while and relied on it for a certain amount of what I was doing, I've done so blindly, effectively as a normal person. This is the first time I've looked at the details of Polkit, which is why I'm calling this my early reactions.

On the good side, Polkit is a single source of authorization decisions, much like PAM. On a modern Linux system, there are a steadily increasing number of programs that do privileged things, even on servers (such as systemd's run0). These could all have their own bespoke custom authorization systems, much as how sudo has its own custom one, but instead most of them have centralized on Polkit. In theory Polkit gives you a single thing to look at and a single thing to learn, rather than learning systemd's authentication system, NetworkManager's authentication system, etc. It also means that programs have less of a temptation to hard-code (some of) their authentication rules, because Polkit is very flexible.

(In many cases programs couldn't feasibly use PAM instead, because they want certain actions to be automatically authorized. For example, in its standard configuration libvirt wants everyone in group 'libvirt' to be able to issue libvirt VM management commands without constantly having to authenticate. PAM could probably be extended to do this but it would start to get complicated, partly because PAM configuration files aren't a programming language and so implementing logic in PAM gets awkward in a hurry.)

On the bad side, Polkit is a non-declarative authorization system, and a complex one with its rules not in any single place (instead they're distributed through multiple files in two different formats). Authorization decisions are normally made in (JavaScript) code, which means that they can encode essentially arbitrary logic (although there are standard forms of things). This means that the only way to know who is authorized to do a particular thing is to read its XML 'action' file and then look through all of the JavaScript code to find and then understand things that apply to it.

(Even 'who is authorized' is imprecise by default. Polkit normally allows anyone to authenticate as any administrative account, provided that they know its password and possibly other authentication information. This makes the passwords of people in group wheel or group admin very dangerous things, since anyone who can get their hands on one can probably execute any Polkit-protected action.)

This creates a situation where there's no way in Polkit to get a global overview of who is authorized to do what, or what a particular person has authorization for, since this doesn't exist in a declarative form and instead has to be determined on the fly by evaluating code. Instead you have to know what's customary, like the group that's 'administrative' for your Linux distribution (wheel or admin, typically) and what special groups (like 'libvirt') do what, or you have to read and understand all of the JavaScript and XML involved.

In other words, there's no feasible way to audit what Polkit is allowing people to do on your system. You have to trust that programs have made sensible decisions in their Polkit configuration (ones that you agree with), or run the risk of system malfunctions by turning everything off (or allowing only root to be authorized to do things).

(Not even Polkit itself can give you visibility into why a decision was made or fully predict it in advance, because the JavaScript rules have no pre-filtering to narrow down what they apply to. The only way you find out what a rule really does is invoking it. Well, invoking the function that the addRule() or addAdminRule() added to the rule stack.)

This complexity (and the resulting opacity of authorization) is probably intrinsic in Polkit's goals. I even think they made the right decision by having you write logic in JavaScript rather than try to create their own language for it. However, I do wish Polkit had a declarative subset that could express all of the simple cases, reserving JavaScript rules only for complex ones. I think this would make the overall system much easier for system administrators to understand and analyze, so we had a much better idea (and much better control) over who was authorized for what.

Brief notes on learning and adjusting Polkit on modern Linuxes

By: cks

Polkit (also, also) is a multi-faceted user level thing used to control access to privileged operations. It's probably used by various D-Bus services on your system, which you can more or less get a list of with pkaction, and there's a pkexec program that's like su and sudo. There are two reasons that you might care about Polkit on your system. First, there might be tools you want to use that use Polkit, such as systemd's run0 (which is developing some interesting options). The other is that Polkit gives people an alternate way to get access to root or other privileges on your servers and you may have opinions about that and what authentication should be required.

Unfortunately, Polkit configuration is arcane and as far as I know, there aren't really any readily accessible options for it. For instance, if you want to force people to authenticate for root-level things using the root password instead of their password, as far as I know you're going to have to write some JavaScript yourself to define a suitable Administrator identity rule. The polkit manual page seems to document what you can put in the code reasonably well, but I'm not sure how you test your new rules and some areas seem underdocumented (for example, it's not clear how 'addAdminRule()' can be used to say that the current user cannot authenticate as an administrative user at all).

(If and when I wind up needing to test rules, I will probably try to do it in a scratch virtual machine that I can blow up. Fortunately Polkit is never likely to be my only way to authenticate things.)

Polkit also has some paper cuts in its current setup. For example, as far as I can see there's no easy way to tell Polkit-using programs that you want to immediately authenticate for administrative access as yourself, rather than be offered a menu of people in group wheel (yourself included) and having to pick yourself. It's also not clear to me (and I lack a test system) if the default setup blocks people who aren't in group wheel (or group admin, depending on your Linux distribution flavour) from administrative authentication or if instead they get to pick authenticating using one of your passwords. I suspect it's the latter.

(All of this makes Polkit seem like it's not really built for multi-user Linux systems, or at least multi-user systems where not everyone is an administrator.)

PS: Now that I've looked at it, I have some issues with Polkit from the perspective of a system administrator, but those are going to be for another entry.

Sidebar: Some options for Polkit (root) authentication

If you want everyone to authenticate as root for administrative actions, I think what you want is:

polkit.addAdminRule(function(action, subject) {
    return ["unix-user:0"];
});

If you want to restrict this to people in group wheel, I think you want something like:

polkit.addAdminRule(function(action, subject) {
    if (subject.isInGroup("wheel")) {
        return ["unix-user:0"];
    } else {
        // might not work to say 'no'?
        return [];
    }
});

If you want people in group wheel to authenticate as themselves, not root, I think you return 'unix-user:' + subject.user instead of 'unix-user:0'. I don't know if people still get prompted by Polkit to pick a user if there's only one possible user.

Discovering orphaned binaries in /usr/sbin on Fedora 42

By: cks

Over on the Fediverse, I shared a somewhat unwelcome discovery I made after upgrading to Fedora 42:

This is my face when I have quite a few binaries in /usr/sbin on my office Fedora desktop that aren't owned by any package. Presumably they were once owned by packages, but the packages got removed without the files being removed with them, which isn't supposed to happen.

(My office Fedora install has been around for almost 20 years now without being reinstalled, so things have had time to happen. But some of these binaries date from 2021.)

There seem to be two sorts of these lingering, unowned /usr/sbin programs. One sort, such as /usr/sbin/getcaps, seems to have been left behind when its package moved things to /usr/bin, possibly due to this RPM bug (via). The other sort is genuinely unowned programs dating to anywhere from 2007 (at the oldest) to 2021 (at the newest), which have nothing else left of them sitting around. The newest programs are what I believe are wireless management programs: iwconfig, iwevent, iwgetid, iwlist, iwpriv, and iwspy, and also "ifrename" (which I believe was also part of a 'wireless-tools' package). I had the wireless-tools package installed on my office desktop until recently, but I removed it some time during Fedora 40, probably sparked by the /sbin to /usr/sbin migration, and it's possible that binaries didn't get cleaned up properly due to that migration.
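The check behind this discovery is easy to reproduce, since 'rpm -qf' exits nonzero for a file that no installed package claims:

```shell
# Print every /usr/sbin binary that isn't owned by any installed RPM.
for f in /usr/sbin/*; do
    rpm -qf "$f" >/dev/null 2>&1 || echo "$f"
done
```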

The most interesting orphan is /usr/sbin/sln, dating from 2018, when apparently various people discovered it as an orphan on their system. Unlike all the other orphan programs, the sln manual page is still shipped as part of the standard 'man-pages' package and so you can read sln(8) online. Based on the manual page, it sounds like it may have been part of glibc at one point.

(Another orphaned program from 2018 is pam_tally, although it's coupled to pam_tally2.so, which did get removed.)

I don't know if there's any good way to get mappings from files to RPM packages for old Fedora versions. If there is, I'd certainly pick through it to try to find where various of these files came from originally. Unfortunately I suspect that for sufficiently old Fedora versions, much of this information is either offline or can't be processed by modern versions of things like dnf.

(The basic information is used by eg 'dnf provides' and can be built by hand from the raw RPMs, but I have no desire to download all of the RPMs for decade-old Fedora versions even if they're still available somewhere. I'm curious but not that curious.)

PS: At the moment I'm inclined to leave everything as it is until at least Fedora 43, since RPM bugs are still being sorted out here. I'll have to clean up genuinely orphaned files at some point but I don't think there's any rush. And I'm not removing any more old packages that use '/sbin/<whatever>', since that seems like it has some bugs.

Removing Fedora's selinux-policy-targeted package is mostly harmless so far

By: cks

A while back I discussed why I might want to remove the selinux-policy-targeted RPM package for a Fedora 42 upgrade. Today, I upgraded my office workstation from Fedora 41 to Fedora 42, and as part of preparing for that upgrade I removed the selinux-policy-targeted package (and all of the packages that depended on it). The result appears to work, although there were a few things that came up during the upgrade and I may reinstall at least selinux-policy-targeted itself to get rid of them (for now).

The root issue appears to be that when I removed the selinux-policy-targeted package, I probably should have edited /etc/selinux/config to set SELINUXTYPE to some bogus value, not left it set to "targeted". For entirely sensible reasons, various packages have postinstall scripts that assume that if your SELinux configuration says your SELinux type is 'targeted', they can do things that implicitly or explicitly require things from the package or from the selinux-policy package, which got removed when I removed selinux-policy-targeted.
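To make the idea concrete, here is a sketch of what /etc/selinux/config might look like after such a removal (untested; 'none' is a made-up placeholder value, and the point is only that it not be 'targeted'):

```
# /etc/selinux/config (sketch, not a tested configuration)
SELINUX=disabled
# Deliberately not 'targeted', so package postinstall scripts won't
# assume the (removed) targeted policy is present. 'none' is a
# placeholder; any value that doesn't name an installed policy should do.
SELINUXTYPE=none
```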

I'm not sure if my change to SELINUXTYPE will completely fix things, because I suspect that there are other assumptions about SELinux policy programs and data files being present lurking in standard, still-installed package tools and so on. Some of these standard SELinux related packages definitely can't be removed without gutting Fedora of things that are important to me, so I'll either have to live with periodic failures of postinstall scripts or put selinux-policy-targeted and some other bits back. On the whole, reinstalling selinux-policy-targeted is probably the safest course; the issue that caused me to remove it only applies during Fedora version upgrades and may in any case be fixed in Fedora 42.

What this illustrates to me is that regardless of package dependencies, SELinux is not really optional on Fedora. The Fedora environment assumes that a functioning SELinux environment is there and if it isn't, things are likely to go wrong. I can't blame Fedora for this, or for not fully capturing this in package dependencies (and Fedora did protect the selinux-policy-targeted package from being removed; I overrode that by hand, so what happens afterward is on me).

(Although I haven't checked modern versions of Fedora, I suspect that there's no official way to install Fedora without getting a SELinux policy package installed, and possibly selinux-policy-targeted specifically.)

PS: I still plan to temporarily remove selinux-policy-targeted when I upgrade my home desktop to Fedora 42. A few package postinstall glitches is better than not being able to read DNF output due to the package's spam.

Modern Linux filesystem mounts are rather complex things

By: cks

Once upon a time, Unix filesystem mounts worked by putting one inode on top of another, and this was also how they worked in very early Linux. It wasn't wrong to say that mounts were really about inodes, with the names only being used to find the inodes. This is no longer how things work in Linux (and perhaps other Unixes, but Linux is what I'm most familiar with for this). Today, I believe that filesystem mounts in Linux are best understood as namespace operations.

Each separate (unmounted) filesystem is a tree of names (a namespace). At a broad level, filesystem mounts in Linux take some name from that filesystem tree and project it on top of something in an existing namespace, generally with some properties attached to the projection. A regular conventional mount takes the root name of the new filesystem and puts the whole tree somewhere, but for a long time Linux's bind mounts took some other name in the filesystem as their starting point (what we could call the root inode of the mount). In modern Linux, there can also be multiple mount namespaces in existence at one time, with different contents and properties. A filesystem mount does not necessarily appear in all of them, and different things can be mounted at the same spot in the tree of names in different mount namespaces.

(Some mount properties are still global to the filesystem as a whole, while other mount properties are specific to a particular mount. See mount(2) for a discussion of general mount properties. I don't know if there's a mechanism to handle filesystem specific mount properties on a per mount basis.)

This can't really be implemented with an inode-based view of mounts. You can somewhat implement traditional Linux bind mounts with an inode based approach, but mount namespaces have to be separate from the underlying inodes. At a minimum a mount point must be a pair of 'this inode in this namespace has something on top of it', instead of just 'this inode has something on top of it'.

(A pure inode based approach has problems going up the directory tree even in old bind mounts, because the parent directory of a particular directory depends on how you got to the directory. If /usr/share is part of /usr and you bind mounted /usr/share to /a/b, the value of '..' depends on if you're looking at '/usr/share/..' or '/a/b/..', even though /usr/share and /a/b are the same inode in the /usr filesystem.)

If I'm reading manual pages correctly, Linux still normally requires the initial mount of any particular filesystem be of its root name (its true root inode). Only after that initial mount is made can you make bind mounts to pull out some subset of its tree of names and then unmount the original full filesystem mount. I believe that a particular filesystem can provide ways to sidestep this with a filesystem specific mount option, such as btrfs's subvol= mount option that's covered in the btrfs(5) manual page (or 'btrfs subvolume set-default').
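You can see this namespace-centric view of mounts directly in /proc/self/mountinfo, where every mount records both where it appears (its mount point) and which name within its filesystem is its root; for bind mounts and subvolume mounts, that root field isn't '/'. As a small illustrative sketch (the field layout is documented in proc(5)):

```python
# Sketch: parse /proc/self/mountinfo to show that each mount has its own
# mount ID, a "root" name within its filesystem, and a mount point.
# Fields (per proc(5)): mount ID, parent ID, major:minor, root,
# mount point, mount options, optional fields..., '-', fstype, source, ...
def parse_mountinfo(path="/proc/self/mountinfo"):
    mounts = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            mounts.append({
                "mount_id": int(fields[0]),
                "root": fields[3],         # the mount's root name inside its fs
                "mount_point": fields[4],  # where it's projected in this namespace
            })
    return mounts

if __name__ == "__main__":
    # Bind mounts and subvolume mounts show up with a non-'/' root.
    for m in parse_mountinfo():
        if m["root"] != "/":
            print(m)
```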

We don't update kernels without immediately rebooting the machine

By: cks

I've mentioned this before in passing (cf, also) but today I feel like saying it explicitly: our habit with all of our machines is to never apply a kernel update without immediately rebooting the machine into the new kernel. On our Ubuntu machines this is done by holding the relevant kernel packages; on my Fedora desktops I normally run 'dnf update --exclude "kernel*"' unless I'm willing to reboot on the spot.

The obvious reason for this is that we want to switch to the new kernel under controlled, attended conditions when we'll be able to take immediate action if something is wrong, rather than possibly have the new kernel activate at some random time without us present and paying attention if there's a power failure, a kernel panic, or whatever. This is especially acute on my desktops, where I use ZFS by building my own OpenZFS packages and kernel modules. If something goes wrong and the kernel modules don't load or don't work right, an unattended reboot can leave my desktops completely unusable and off the network until I can get to them. I'd rather avoid that if possible (sometimes it isn't).

(In general I prefer to reboot my Fedora machines with me present because weird things happen from time to time and sometimes I make mistakes, also.)

The less obvious reason is that when you reboot a machine right after applying a kernel update, it's clear in your mind that the machine has switched to a new kernel. If there are system problems in the days immediately after the update, you're relatively likely to remember this and at least consider the possibility that the new kernel is involved. If you apply a kernel update, walk away without rebooting, and the machine reboots a week and a half later for some unrelated reason, you may not remember that one of the things the reboot did was switch to a new kernel.

(Kernels aren't the only thing that this can happen with, since not all system updates and changes take effect immediately when made or applied. Perhaps one should reboot after making them, too.)

I'm assuming here that your Linux distribution's package management system is sensible, so there's no risk of losing old kernels (especially the one you're currently running) merely because you installed some new ones but didn't reboot into them. This is how Debian and Ubuntu behave (if you don't 'apt autoremove' kernels), but not quite how Fedora's dnf does it (as far as I know). Fedora dnf keeps the N most recent kernels around and probably doesn't let you remove the currently running kernel even if it's more than N kernels old, but I don't believe it tracks whether or not you've rebooted into those N kernels and stretches the N out if you haven't (or removes more recent installed kernels that you've never rebooted into, instead of older kernels that you did use at one point).

PS: Of course if kernel updates were perfect this wouldn't matter. However this isn't something you can assume for the Linux kernel (especially as patched by your distribution), as we've sometimes seen. Although big issues like that are relatively uncommon.

Restarting or redoing something after a systemd service restarts

By: cks

Suppose, not hypothetically, that your system is running some systemd based service or daemon that resets or erases your carefully cultivated state when it restarts. One example is systemd-networkd, although you can turn that off (or parts of it off, at least), but there are likely others. To clean up after this happens, you'd like to automatically restart or redo something after a systemd unit is restarted. Systemd supports this, but I found it slightly unclear how you want to do this and today I poked at it, so it's time for notes.

(This is somewhat different from triggering one unit when another unit becomes active, which I think is still not possible in general.)

First, you need to put whatever you want to do into a script and a .service unit that will run the script. The traditional way to run a script through a .service unit is:

[Unit]
....

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/your/script/here

[Install]
WantedBy=multi-user.target

(The 'RemainAfterExit' is load-bearing, also.)

To get this unit to run after another unit is started or restarted, what you need is PartOf=, which causes your unit to be stopped and started when the other unit is, along with 'After=' so that your unit starts after the other unit instead of racing it (which could be counterproductive when what you want to do is fix up something from the other unit). So you add:

[Unit]
...
PartOf=systemd-networkd.service
After=systemd-networkd.service

(This is what works for me in light testing. This assumes that the unit you want to re-run after is normally always running, as systemd-networkd is.)
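Putting both pieces together, the whole unit might look like this (the unit name, description, and script path are all illustrative):

```
# /etc/systemd/system/networkd-fixup.service (illustrative name)
[Unit]
Description=Redo local network state after systemd-networkd restarts
PartOf=systemd-networkd.service
After=systemd-networkd.service

[Service]
Type=oneshot
RemainAfterExit=True
ExecStart=/usr/local/sbin/networkd-fixup

[Install]
WantedBy=multi-user.target
```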

In testing, you don't need to have your unit specifically enabled by itself, although you may want it to be for clarity and other reasons. Even if your unit isn't specifically enabled, systemd will start it after the other unit because of the PartOf=. If the other unit is started all of the time (as is usually the case for systemd-networkd), this effectively makes your unit enabled, although not in an obvious way (which is why I think you should specifically 'systemctl enable' it, to make it obvious). I think you can have your .service unit enabled and active without having the other unit enabled, or even present.

You can declare yourself PartOf a .target unit, and some stock package systemd units do for various services. And a .target unit can be PartOf a .service; on Fedora, 'sshd-keygen.target' is PartOf sshd.service in a surprisingly clever little arrangement to generate only the necessary keys through a templated 'sshd-keygen@.service' unit.

I admit that the whole collection of Wants=, Requires=, Requisite=, BindsTo=, PartOf=, Upholds=, and so on are somewhat confusing to me. In the past, I've used the wrong version and suffered the consequences, and I'm not sure I have them entirely right in this entry.

Note that as far as I know, PartOf= has those Requires= consequences, where if the other unit is stopped, yours will be too. In a simple 'run a script after the other unit starts' situation, stopping your unit does nothing and can be ignored.

(If this seems complicated, well, I think it is, and I think one part of the complication is that we're trying to use systemd as an event-based system when it isn't one.)

Systemd-resolved's new 'DNS Server Delegation' feature (as of systemd 258)

By: cks

A while ago I wrote an entry about things that resolved wasn't for as of systemd 251. One of those things was arbitrary mappings of (DNS) names to DNS servers, for example if you always wanted *.internal.example.org to query a special DNS server. Systemd-resolved didn't have a direct feature for this and attempting to attach your DNS names to DNS server mappings to a network interface could go wrong in various ways. Well, time marches on and as of systemd v258 this is no longer the state of affairs.

Systemd v258 introduces systemd.dns-delegate files, which allow you to map DNS names to DNS servers independently from network interfaces. The release notes describe this as:

A new DNS "delegate zone" concept has been introduced, which are additional lookup scopes (on top of the existing per-interface and the one global scope so far supported in resolved), which carry one or more DNS server addresses and a DNS search/routing domain. It allows routing requests to specific domains to specific servers. Delegate zones can be configured via drop-ins below /etc/systemd/dns-delegate.d/*.dns-delegate.

Since systemd v258 is very new I don't have any machines where I can actually try this out, but based on the systemd.dns-delegate documentation, you can use this both for domains that you merely want diverted to some DNS server and for domains that you also want on your search path. Per resolved.conf's Domains= documentation, the latter is 'Domains=example.org' (example.org will be one of the domains that resolved tries to find single-label hostnames in, a search domain), and the former is 'Domains=~example.org' (where we merely send queries for everything under 'example.org' off to whatever DNS= you set, a route-only domain).
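Going by the documentation, a delegate drop-in is a small ini-style file; a sketch of a route-only delegation might look like this (untested, since I have no v258 machine; the server address and domain are of course examples):

```
# /etc/systemd/dns-delegate.d/internal.dns-delegate (sketch)
[Delegate]
DNS=192.0.2.53
# The '~' makes this a route-only domain; drop it to also make
# internal.example.org a search domain for single-label names.
Domains=~internal.example.org
```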

(While resolved.conf's Domains= officially promises to check your search domains in the order you listed them, I believe this is strictly for a single 'Domains=' setting for a single interface. If you have multiple 'Domains=' settings, for example in a global resolved.conf, a network interface, and now in a delegation, I think systemd-resolved makes no promises.)

Right now, these DNS server delegations can only be set through static files, not manipulated through resolvectl. I believe fiddling with them through resolvectl is on the roadmap, but for now I guess we get to restart resolved if we need to change things. In fact resolvectl doesn't expose anything to do with them, although I believe read-only information is available via D-Bus and maybe varlink.

Given the timing of systemd v258's release relative to Fedora releases, I probably won't be able to use this feature until Fedora 44 in the spring (Fedora 42 is current and Fedora 43 is imminent, which won't have systemd v258 given that v258 was released only a couple of weeks ago). My current systemd-resolved setup is okay (if it wasn't I'd be doing something else), but I can probably find uses for these delegations to improve it.

These days, systemd can be a cause of restrictions on daemons

By: cks

One of the traditional rites of passage for Linux system administrators is having a daemon not work in the normal system configuration (eg, when you boot the system) but work when you manually run it as root. The classical cause of this on Unix was that $PATH wasn't fully set in the environment the daemon was running in but was in your root shell. On Linux, another traditional cause of this sort of thing has been SELinux and a more modern source (on Ubuntu) has sometimes been AppArmor. All of these create hard to see differences between your root shell (where the daemon works when run by hand) and the normal system environment (where the daemon doesn't work). These days, we can add another cause, an increasingly common one, and that is systemd service unit restrictions, many of which are covered in systemd.exec.

(One pernicious aspect of systemd as a cause of these restrictions is that they can appear in new releases of the same distribution. If a daemon has been running happily in an older release and now has surprise issues in a new Ubuntu LTS, I don't always remember to look at its .service file.)

Some of systemd's protective directives simply cause failures to do things, like access user home directories if ProtectHome= is set to something appropriate. Hopefully your daemon complains loudly here, reporting mysterious 'permission denied' or 'file not found' errors. Some systemd settings can have additional, confusing effects, like PrivateTmp=. A standard thing I do when troubleshooting a chain of programs executing programs executing programs is to shim in diagnostics that dump information to /tmp, but with PrivateTmp= on, my debugging dump files are mysteriously not there in the system-wide /tmp.

(On the other hand, a daemon may not complain about missing files if it's expected that the files aren't always there. A mailer usually can't really tell the difference between 'no one has .forward files' and 'I'm mysteriously not able to see people's home directories to find .forward files in them'.)

Sometimes you don't get explicit errors, just mysterious failures to do some things. For example, you might set IP address access restrictions with the intention of blocking inbound connections but wind up also blocking DNS queries (and this will also depend on whether or not you use systemd-resolved). The good news is that you're mostly not going to find standard systemd .service files for normal daemons shipped by your Linux distribution with IP address restrictions. The bad news is that at some point .service files may start showing up that impose IP address restrictions with the assumption that DNS resolution is being done via systemd-resolved as opposed to direct DNS queries.

(I expect some Linux distributions to resist this, for example Debian, but others may declare that using systemd-resolved is now mandatory in order to simplify things and let them harden service configurations.)

Right now, you can usually test if this is the problem by creating a version of the daemon's .service file with any systemd restrictions stripped out of it and then seeing if using that version makes life happy. In the future it's possible that some daemons will assume and require some systemd restrictions (for instance, assuming that they have a /tmp all of their own), making things harder to test.
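A drop-in is a convenient way to do this stripping without editing the packaged .service file. As a sketch (the daemon name is illustrative, and which lines you need depends on what the packaged unit actually sets):

```
# /etc/systemd/system/somedaemon.service.d/relax.conf (illustrative)
[Service]
# Empty assignments clear list-type settings such as the IP restrictions.
IPAddressAllow=
IPAddressDeny=
# Boolean settings have to be switched off explicitly.
PrivateTmp=no
ProtectHome=no
```

After a 'systemctl daemon-reload' and a restart of the daemon, 'systemctl show somedaemon -p PrivateTmp,ProtectHome' should confirm the relaxed settings took effect.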

Some stuff on how Linux consoles interact with the mouse

By: cks

On at least x86 PCs, Linux text consoles ('TTY' consoles or 'virtual consoles') support some surprising things. One of them is doing some useful stuff with your mouse, if you run an additional daemon such as gpm or the more modern consolation. This is supported on both framebuffer consoles and old 'VGA' text consoles. The experience is fairly straightforward; you install and activate one of the daemons, and afterward you can wave your mouse around, select and paste text, and so on. How it works and what you get is not as clear, and since I recently went diving into this area for reasons, I'm going to write down what I now know before I forget it (with a focus on how consolation works).

The quick summary is that the console TTY's mouse support is broadly like a terminal emulator. With a mouse daemon active, the TTY will do "copy and paste" selection stuff on its own. A mouse aware text mode program can put the console into a mode where mouse button presses are passed through to the program, just as happens in xterm or other terminal emulators.

The simplest TTY mode is when a non-mouse-aware program or shell is active, which is to say a program that wouldn't try to intercept mouse actions itself if it was run in a regular terminal window and would leave mouse stuff up to the terminal emulator. In this mode, your mouse daemon reads mouse input events and then uses sub-options of the TIOCLINUX ioctl to inject activities into the TTY, for example telling it to 'select' some text and then asking it to paste that selection to some file descriptor (normally the console itself, which delivers it to whatever foreground program is taking terminal input at the time).

(In theory you can use the mouse to scroll text back and forth, but in practice that was removed in 2020, both for the framebuffer console and for the VGA console. If I'm reading the code correctly, a VGA console might still have a little bit of scrollback support depending on how much spare VGA RAM you have for your VGA console size. But you're probably not using a VGA console any more.)

The other mode the console TTY can be in is one where some program has used standard xterm-derived escape sequences to ask for xterm-compatible "mouse tracking", which is the same thing it might ask for in a terminal emulator if it wanted to handle the mouse itself. What this does in the kernel TTY console driver is set a flag that your mouse daemon can query with TIOCL_GETMOUSEREPORTING; the kernel TTY driver still doesn't directly handle or look at mouse events. Instead, consolation (or gpm) reads the flag and, when the flag is set, uses the TIOCL_SELMOUSEREPORT sub-sub-option to TIOCLINUX's TIOCL_SETSEL sub-option to report the mouse position and button presses to the kernel (instead of handling mouse activity itself). The kernel then turns around and sends mouse reporting escape codes to the TTY, as the program asked for.
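The escape sequences involved are the standard xterm ones. As a small illustration (this assumes classic 'X11' mouse reporting, mode 1000, with the traditional single-byte report encoding; modern programs often ask for other variants):

```python
# Sketch of xterm-style mouse tracking, as the console TTY implements it.
# A program writes ENABLE_X11_MOUSE to ask for button-press reports and
# DISABLE_X11_MOUSE to turn them off again.
ENABLE_X11_MOUSE = "\x1b[?1000h"
DISABLE_X11_MOUSE = "\x1b[?1000l"

def decode_mouse_report(seq: bytes):
    # A classic report is ESC [ M Cb Cx Cy, with each value offset by 32
    # (and the coordinates 1-based).
    assert seq[:3] == b"\x1b[M" and len(seq) == 6
    button, col, row = seq[3] - 32, seq[4] - 32, seq[5] - 32
    return button, col, row

# Example: button 0 pressed at column 5, row 2.
print(decode_mouse_report(b"\x1b[M" + bytes([32 + 0, 32 + 5, 32 + 2])))
# → (0, 5, 2)
```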

(As I discovered, we got a CVE this year related to this, where the kernel let too many people trigger sending programs 'mouse' events. See the stable kernel commit message for details.)

A mouse daemon like consolation doesn't have to pay attention to the kernel's TTY 'mouse reporting' flag. As far as I can tell from the current Linux kernel code, if the mouse daemon ignores the flag it can keep on doing all of its regular copy and paste selection and mouse button handling. However, sending mouse reports is only possible when a program has specifically asked for it; the kernel will report an error if you ask it to send a mouse report at the wrong time.

(As far as I can see there's no notification from the kernel to your mouse daemon that someone changed the 'mouse reporting' flag. Instead you have to poll it; it appears consolation does this every time through its event loop before it handles any mouse events.)

PS: Some documentation on console mouse reporting was written as a 2020 kernel documentation patch (alternate version) but it doesn't seem to have made it into the tree. According to various sources, eg, the mouse daemon side of things can only be used by actual mouse daemons, not by programs, although programs do sometimes use other bits of TIOCLINUX's mouse stuff.

PPS: It's useful to install a mouse daemon on your desktop or laptop even if you don't intend to ever use the text TTY. If you ever wind up in the text TTY for some reason, perhaps because your regular display environment has exploded, having mouse cut and paste is a lot nicer than not having it.

My Fedora machines need a cleanup of their /usr/sbin for Fedora 42

By: cks

One of the things that Fedora is trying to do in Fedora 42 is unifying /usr/bin and /usr/sbin. In an ideal (Fedora) world, your Fedora machines will have /usr/sbin be a symbolic link to /usr/bin after they're upgraded to Fedora 42. However, if your Fedora machines have been around for a while, or perhaps have some third party packages installed, what you'll actually wind up with is a /usr/sbin that is mostly symbolic links to /usr/bin but still has some actual programs left.

One source of these remaining /usr/sbin programs is old packages from past versions of Fedora that are no longer packaged in Fedora 41 and Fedora 42. Old packages are usually harmless, so it's easy for them to linger around if you're not disciplined; my home and office desktops (which have been around for a while) still have packages from as far back as Fedora 28.

(An added complication of tracking down file ownership is that some RPMs haven't been updated for the /sbin to /usr/sbin merge and so still believe that their files are /sbin/<whatever> instead of /usr/sbin/<whatever>. A 'rpm -qf /usr/sbin/<whatever>' won't find these.)
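One way to take that complication into account when hunting for orphans is to ask rpm about both spellings of each name. A sketch in Python (this shells out to rpm, so it only does anything useful on an RPM-based system, and it deliberately skips the symbolic links that the merge leaves behind):

```python
# Hedged sketch: find /usr/sbin programs that no installed RPM claims,
# checking both the /usr/sbin path and the legacy /sbin path.
import os
import shutil
import subprocess

def candidate_names(path):
    # RPMs not yet rebuilt for the /sbin -> /usr/sbin merge may still
    # record their files as /sbin/<whatever>, so query both spellings.
    base = os.path.basename(path)
    return [f"/usr/sbin/{base}", f"/sbin/{base}"]

def is_owned(path):
    # 'rpm -qf' exits 0 if some installed package owns the file.
    for name in candidate_names(path):
        r = subprocess.run(["rpm", "-qf", name],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        if r.returncode == 0:
            return True
    return False

def unowned_sbin_programs():
    if shutil.which("rpm") is None or not os.path.isdir("/usr/sbin"):
        return []  # not an RPM-based system (or no /usr/sbin at all)
    return [p for p in (os.path.join("/usr/sbin", f)
                        for f in sorted(os.listdir("/usr/sbin")))
            if os.path.isfile(p) and not os.path.islink(p)
            and not is_owned(p)]

if __name__ == "__main__":
    for p in unowned_sbin_programs():
        print(p)
```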

Obviously, you shouldn't remove old packages without being sure of whether or not they're important to you. I'm also not completely sure that all packages in the Fedora 41 (or 42) repositories are marked as '.fc41' or '.fc42' in their RPM versions, or if there are some RPMs that have been carried over from previous Fedora versions. Possibly this means I should wait until a few more Fedora versions have come to pass so that other people find and fix the exceptions.

(On what is probably my cleanest Fedora 42 test virtual machine, there are a number of packages that 'dnf list --extras' doesn't list that have '.fc41' in their RPM version. Some of them may have been retained un-rebuilt for binary compatibility reasons. There's also the 'shim' UEFI bootloaders, which date from 2024 and don't have Fedora releases in their RPM versions, but those I expect to basically never change once created. But some others are a bit mysterious, such as 'libblkio', and I suspect that they may have simply been missed by the Fedora 42 mass rebuild.)

PS: In theory anyone with access to the full Fedora 42 RPM repository could sweep the entire thing to find packages that still install /usr/sbin files or even /sbin files, which would turn up any relevant not yet rebuilt packages. I don't know if there's any easy way to do this through dnf commands, although I think dnf does have access to a full file list for all packages (which is used for certain dnf queries).

My machines versus the Fedora selinux-policy-targeted package

By: cks

I upgrade Fedora on my office and home workstations through an online upgrade with dnf, and as part of this I read (or at least scan) DNF's output to look for problems. Usually this goes okay, but DNF5 has a general problem with script output and when I did a test upgrade from Fedora 41 to Fedora 42 on a virtual machine, it generated a huge amount of repeated output from a script run by selinux-policy-targeted, repeatedly reporting "Old compiled fcontext format, skipping" for various .bin files in /etc/selinux/targeted/contexts/files. The volume of output made the rest of DNF's output essentially unreadable. I would like to avoid this when I actually upgrade my office and home workstations to Fedora 42 (which I still haven't done, partly because of this issue).

(You can't make this output easier to read because DNF5 is too smart for you. This particular error message reportedly comes from 'semodule -B', per this Fedora discussion.)

The 'targeted' policy is one of several SELinux policies that are supported or at least packaged by Fedora (although I suspect I might see similar issues with the other policies too). My main machines don't use SELinux and I have it completely disabled, so in theory I should be able to remove the selinux-policy-targeted package to stop it from repeatedly complaining during the Fedora 42 upgrade process. In practice, selinux-policy-targeted is a 'protected' package that DNF will normally refuse to remove. Such packages are listed in /etc/dnf/protected.d/ in various .conf files; selinux-policy-targeted installs (well, includes) a .conf file to protect itself from removal once installed.
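These protection files are trivial: each is just a list of package names, one per line. The one that selinux-policy-targeted ships is roughly this (I haven't checked the exact filename):

```
# /etc/dnf/protected.d/selinux-policy-targeted.conf (approximate name)
selinux-policy-targeted
```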

(Interestingly, sudo protects itself but there's nothing specifically protecting su and the rest of util-linux. I suspect util-linux is so pervasively a dependency that other protected things hold it down, or alternately no one has ever worried about people removing it and shooting themselves in the foot.)

I can obviously remove this .conf file and then DNF will let me remove selinux-policy-targeted, which will force the removal of some other SELinux policy packages (both selinux-policy packages themselves and some '*-selinux' sub-packages of other packages). I tried this on another Fedora 41 test virtual machine and nothing obvious broke, but that doesn't mean that nothing broke at all. It seems very likely that almost no one tests Fedora without the selinux-policy collection installed and I suspect it's not a supported configuration.

I could reduce my risks by removing the packages only just before I do the upgrade to Fedora 42 and put them back later (well, unless I run into a dnf issue as a result, although that issue is from 2024). Also, now that I've investigated this, I could in theory delete the .bin files in /etc/selinux/targeted/contexts/files before the upgrade, hopefully making it so that selinux-policy-targeted has less or nothing to complain about. Since I'm not using SELinux, hopefully the lack of these files won't cause any problems, but of course this is less certain a fix than removing selinux-policy-targeted (for example, perhaps the .bin files would get automatically rebuilt early on in the upgrade process as packages are shuffled around, and bring the problem back with them).

Really, though, I wish DNF5 didn't have its problem with script output. All of this is hackery to deal with that underlying issue.

Some thoughts on Ubuntu automatic ('unattended') package upgrades

By: cks

The default behavior of a stock Ubuntu LTS server install is that it enables 'unattended upgrades', by installing the package unattended-upgrades (which creates /etc/apt/apt.conf.d/20auto-upgrades, which controls this). Historically, we haven't believed in unattended automatic package upgrades and eventually built a complex semi-automated upgrades system (which has various special features). In theory this has various potential advantages; in practice it mostly results in package upgrades being applied after some delay that depends on when they come out relative to working days.
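For reference, the /etc/apt/apt.conf.d/20auto-upgrades file that the unattended-upgrades package creates is short; setting either value to "0" turns that part off:

```
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```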

I have a few machines that actually are stock Ubuntu servers, for reasons outside the scope of this entry. These machines naturally have automated upgrades turned on and one of them (in a cloud, using the cloud provider's standard Ubuntu LTS image) even appears to automatically reboot itself if kernel updates need that. These machines are all in undemanding roles (although one of them is my work IPv6 gateway), so they aren't necessarily indicative of what we'd see on more complex machines, but none of them have had any visible problems from these unattended upgrades.

(I also can't remember the last time that we ran into a problem with updates when we applied them. Ubuntu updates still sometimes have regressions and other problems, forcing them to be reverted or reissued, but so far we haven't seen problems ourselves; we find out about these problems only through the notices in the Ubuntu security lists.)

If we were starting from scratch today in a greenfield environment, I'm not sure we'd bother building our automation for manual package updates. Since we have the automation and it offers various extra features (even if they're rarely used), we're probably not going to switch over to automated upgrades (including in our local build of Ubuntu 26.04 LTS when that comes out next year).

(The advantage of switching over to standard unattended upgrades is that we'd get rid of a local tool that, like all local tools, is entirely our responsibility. The fewer local weird things we have, the better, especially since we have so many as it is.)

Getting the Cinnamon desktop environment to support "AppIndicator"

By: cks

The other day I wrote about what "AppIndicator" is (a protocol) and some things about how the Cinnamon desktop appeared to support it, except they weren't working for me. Now I actually understand what's going on, more or less, and how to solve my problem of a program complaining that it needed AppIndicator.

Cinnamon directly implements the AppIndicator notification protocol in xapp-sn-watcher, part of Cinnamon's xapp(s) package. Xapp-sn-watcher is started as part of your (Cinnamon) session. However, it has a little feature, namely that it will exit if no one is asking it to do anything:

XApp-Message: 22:03:57.352: (SnWatcher) watcher_startup: ../xapp-sn-watcher/xapp-sn-watcher.c:592: No active monitors, exiting in 30s

In a normally functioning Cinnamon environment, something will soon show up to be an active monitor and stop xapp-sn-watcher from exiting:

Cjs-Message: 22:03:57.957: JS LOG: [LookingGlass/info] Loaded applet xapp-status@cinnamon.org in 88 ms
[...]
XApp-Message: 22:03:58.129: (SnWatcher) name_owner_changed_signal: ../xapp-sn-watcher/xapp-sn-watcher.c:162: NameOwnerChanged signal received (n: org.x.StatusIconMonitor.cinnamon_0, old: , new: :1.60
XApp-Message: 22:03:58.129: (SnWatcher) handle_status_applet_name_owner_appeared: ../xapp-sn-watcher/xapp-sn-watcher.c:64: A monitor appeared on the bus, cancelling shutdown

This something is a standard Cinnamon desktop applet. In System Settings → Applets, it's way down at the bottom and is called "XApp Status Applet". If you've accidentally wound up with it not turned on, xapp-sn-watcher will (probably) not have a monitor active after 30 seconds, and then it will exit (and in the process of exiting, it will log alarming messages about failed GLib assertions). Not having this xapp-status applet turned on was my problem, and turning it on fixed things.

(I don't know how it got turned off. It's possible I went through the standard applets at some point and turned some of them off in an excess of ignorant enthusiasm.)

As I found out from leigh scott in my Fedora bug report, the way to get this debugging output from xapp-sn-watcher is to run 'gsettings set org.x.apps.statusicon sn-watcher-debug true'. This will cause xapp-sn-watcher to log various helpful and verbose things to your ~/.xsession-errors (although apparently not the fact that it's actually exiting; you have to deduce that from the timestamps stopping 30 seconds later and that being the timestamps on the GLib assertion failures).

(I don't know why there's both a program and an applet involved in this and I've decided not to speculate.)

What an "AppIndicator" is in Linux desktops and some notes on it

By: cks

Suppose, not hypothetically, that you start up some program on your Fedora 42 Cinnamon desktop and it helpfully tells you "<X> requires AppIndicator to run. Please install the AppIndicator plugin for your desktop". You are likely confused, so here are some notes.

'AppIndicator' itself is the name of an application notification protocol, apparently originally from KDE, and some desktop environments may need a (third party) extension to support it, such as the Ubuntu one for GNOME Shell. Unfortunately for me, Cinnamon is not one of those desktops. It theoretically has native support for this, implemented in /usr/libexec/xapps/xapp-sn-watcher, part of Cinnamon's xapps package.

The actual 'AppIndicator' protocol is done over D-Bus, because that's the modern way. Since this started as a KDE thing, the D-Bus name is 'org.kde.StatusNotifierWatcher'. What provides certain D-Bus names is found in /usr/share/dbus-1/services, but not all names are mentioned there and 'org.kde.StatusNotifierWatcher' is one of the missing ones. In this case /etc/xdg/autostart/xapp-sn-watcher.desktop mentions the D-Bus name in its 'Comment=', but that's probably not something you can count on to find what your desktop is (theoretically) using to provide a given D-Bus name. I found xapp-sn-watcher somewhat through luck.

There are probably a number of ways to see what D-Bus names are currently registered and active. The one that I used when looking at this is 'dbus-send --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames'. As far as I know, there's no easy way to go from an error message about 'AppIndicator' to knowing that you want 'org.kde.StatusNotifierWatcher'; in my case I read the source of the thing complaining which was helpfully in Python.

(I used the error message to find the relevant section of code, which showed me what it wasn't finding.)

I have no idea how to actually fix the problem, or if there is a program that implements org.kde.StatusNotifierWatcher as a generic, more or less desktop independent program the way that stalonetray does for system tray stuff (or one generation of system tray stuff, I think there have been several iterations of it, cf).

(Yes, I filed a Fedora bug, but I believe Cinnamon isn't particularly supported by Fedora so I don't expect much. I also built the latest upstream xapps tree and it also appears to fail in the same way. Possibly this means something in the rest of the system isn't working right.)

Getting Linux nflog and tcpdump packet filters to sort of work together

By: cks

So, suppose that you have a brand new nflog version of OpenBSD's pflog, so you can use tcpdump to watch dropped packets (or in general, logged packets). And further suppose that you specifically want to see DNS requests to your port 53. So of course you do:

# tcpdump -n -i nflog:30 'port 53'
tcpdump: NFLOG link-layer type filtering not implemented

Perhaps we can get clever by reading from the interface in one tcpdump and sending it to another to be interpreted, forcing the pcap filter to be handled entirely in user space instead of the kernel:

# tcpdump --immediate-mode -w - -U -i nflog:30 | tcpdump -r - 'port 53'
tcpdump: listening on nflog:30, link-type NFLOG (Linux netfilter log messages), snapshot length 262144 bytes
reading from file -, link-type NFLOG (Linux netfilter log messages), snapshot length 262144
tcpdump: NFLOG link-layer type filtering not implemented

Alas we can't.

As far as I can determine, what's going on here is that the netfilter log system, 'NFLOG', uses a 'packet' format that isn't the same as any of the regular formats (Ethernet, PPP, etc) and adds some additional (meta)data about the packet to every packet you capture. I believe the various attributes this metadata can contain are listed in the kernel's nfnetlink_log.h.

(I believe it's not technically correct to say that this additional stuff is 'before' the packet; instead I believe the packet is contained in a NFULA_PAYLOAD attribute.)

Unfortunately for us, tcpdump (or more exactly libpcap) doesn't know how to create packet capture filters for this format, not even ones that are interpreted entirely in user space (as happens when tcpdump reads from a file).

I believe that you have two options. First, you can use tshark with a display filter, not a capture filter:

# tshark -i nflog:30 -Y 'udp.port == 53 or tcp.port == 53'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'nflog:30'
[...]

(Tshark capture filters are subject to the same libpcap inability to work on NFLOG formatted packets as tcpdump has.)

Alternately and probably more conveniently, you can tell tcpdump to use the 'IPV4' datalink type instead of the default, as mentioned in (opaque) passing in the tcpdump manual page:

# tcpdump -i nflog:30 -L
Data link types for nflog:30 (use option -y to set):
  NFLOG (Linux netfilter log messages)
  IPV4 (Raw IPv4)
# tcpdump -i nflog:30 -y ipv4 -n 'port 53'
tcpdump: data link type IPV4
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on nflog:30, link-type IPV4 (Raw IPv4), snapshot length 262144 bytes
[...]

Of course this is only applicable if you're only doing IPv4. If you have some IPv6 traffic that you care about, I think you have to use tshark display filters (which means learning how to write Wireshark display filters, something I've avoided so far).

I think there is some potentially useful information in the extra NFLOG data, but to get it or to filter on it I think you'll need to use tshark (or Wireshark) and consult the NFLOG display filter reference, although that doesn't seem to give you access to all of the NFLOG stuff that 'tshark -i nflog:30 -V' will print about packets.

(Or maybe the trick is that you need to match 'nflog.tlv_type == <whatever> and nflog.tlv_value == <whatever>'. I believe that some NFLOG attributes are available conveniently, such as 'nflog.prefix', which corresponds to NFULA_PREFIX. See packet-nflog.c.)

PS: There's some information on the NFLOG format in the NFLOG linktype documentation and tcpdump's supported data link types in the link-layer header types documentation.

Implementing a basic equivalent of OpenBSD's pflog in Linux nftables

By: cks

OpenBSD's and FreeBSD's PF system has a very convenient 'pflog' feature, where you put in a 'log' bit in a PF rule and this dumps a copy of any matching packets into a pflog pseudo-interface, where you can both see them with 'tcpdump -i pflog0' and have them automatically logged to disk by pflogd in pcap format. Typically we use this to log blocked packets, which gives us both immediate and after the fact visibility of what's getting blocked (and by what rule, also). It's possible to mostly duplicate this in Linux nftables, although it takes more work and there's less documentation on it.

The first thing you need is nftables rules with one or two log statements of the form 'log group <some number>'. If you want to be able to both log packets for later inspection and watch them live, you need two 'log group' statements with different numbers; otherwise you only need one. You can use different (group) numbers on different nftables rules if you want to be able to, say, look only at accepted but logged traffic or only dropped traffic. In the end this might wind up looking something like:

tcp dport ssh counter log group 30 log group 31 drop;

As the nft manual page will tell you, this uses the kernel 'nfnetlink_log' to forward the 'logs' (packets) to a netlink socket, where exactly one process (at most) can subscribe to a particular group to receive those logs (ie, those packets). If we want to both log the packets and be able to tcpdump them, we need two groups so we can have ulogd getting one and tcpdump getting the other.
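For context, here's a minimal sketch of a complete nftables table containing such a rule; the table and chain names, and the use of the inet family, are my choices for illustration rather than anything required:

```
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;

        # Log matching packets to group 30 (for live tcpdump) and
        # group 31 (for ulogd's pcap logging), then drop them.
        tcp dport ssh counter log group 30 log group 31 drop
    }
}
```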

To see packets from any particular log group, we use the special 'nflog:<N>' pseudo-interface that's hopefully supported by your Linux version of tcpdump. This is used as 'tcpdump -i nflog:30 ...' and works more or less like you'd want it to. However, as far as I know there's no way to see meta-information about the nftables filtering, such as what rule was involved or what the decision was; you just get the packet.

To log the packets to disk for later use, the default program is ulogd, which in Ubuntu is called 'ulogd2'. Ulogd(2) isn't as automatic as OpenBSD's and FreeBSD's pf logging; instead you have to configure it in /etc/ulogd.conf, and on Ubuntu make sure you have the 'ulogd2-pcap' package installed (along with ulogd2 itself). Based merely on getting it to work, what you want in /etc/ulogd.conf is the following three bits:

# A 'stack' of source, handling, and destination
stack=log31:NFLOG,base1:BASE,pcap31:PCAP

# The source: NFLOG group 31, for IPv4 traffic
[log31]
group=31
# addressfamily=10 for IPv6

# the file path is correct for Ubuntu
[pcap31]
file="/var/log/ulog/ulogd.pcap"
sync=0

(On Ubuntu 24.04, any .pcap files in /var/log/ulog will be automatically rotated by logrotate, although I think by default it's only weekly, so you might want to make it daily.)

The ulogd documentation suggests that you will need to capture IPv4 and IPv6 traffic separately, but I've only used this on IPv4 traffic so I don't know. This may imply that you need separate nftables rules to log (and drop) IPv6 traffic so that you can give it a separate group number for ulogd (I'm not sure if it needs a separate one for tcpdump or if tcpdump can sort it out).

Ulogd can also log to many different things than PCAP format, including JSON and databases. It's possible that there are ways to enrich the ulogd pcap logs, or maybe just the JSON logs, with additional useful information such as the network interface involved and other things. I find the ulogd documentation somewhat opaque on this (and also it's incomplete), and I haven't experimented.

(According to this, the JSON logs can be enriched or maybe default to that.)

Given the assorted limitations and other issues with ulogd, I'm tempted to not bother with it and only have our nftables setups support live tcpdump of dropped traffic with a single 'log group <N>'. This would save us from the assorted annoyances of ulogd2.

PS: One reason to log to pcap format files is that then you can use all of the tcpdump filters that you're already familiar with in order to narrow in on (blocked) traffic of interest, rather than having to put together a JSON search or something.

The 'nft' command may not show complete information for iptables rules

By: cks

These days, nftables is the Linux network firewall system that you want to use, and especially it's the system that Ubuntu will use by default even if you use the 'iptables' command. The nft command is the official interface to nftables, and it has a 'nft list ruleset' sub-command that will list your NFT rules. Since iptables rules are implemented with nftables, you might innocently expect that 'nft list ruleset' will show you the proper NFT syntax to achieve your current iptables rules.

Well, about that:

# iptables -vL INPUT
[...] target prot opt in  out  source   destination         
[...] ACCEPT tcp  --  any any  anywhere anywhere    match-set nfsports dst match-set nfsclients src
# nft list ruleset
[...]
      ip protocol tcp xt match "set" xt match "set" counter packets 0 bytes 0 accept
[...]

As they say, "yeah no". As the documentation tells you (eventually), somewhat reformatted:

xt TYPE NAME

TYPE := match | target | watcher

This represents an xt statement from xtables compat interface. It is a fallback if translation is not available or not complete. Seeing this means the ruleset (or parts of it) were created by iptables-nft and one should use that to manage it.

Nftables has a native set type (and also maps), but, quite reasonably, the old iptables 'ipset' stuff isn't translated to nftables sets by the iptables compatibility layer. Instead the compatibility layer uses this 'xt match' magic that the nft command can only imperfectly tell you about. To nft's credit, it prints a warning comment (which I've left out) that the rules are being managed by iptables-nft and you shouldn't touch them. Here, all of the 'xt match "set"' bits in the nft output are basically saying "opaque stuff happens here".

This still makes me a little bit sad because it makes it that bit harder to bootstrap my nftables knowledge from what iptables rules convert into. If I wanted to switch to nftables rules and nftables sets (for example for my now-simpler desktop firewall rules), I'd have to do that from relative scratch instead of getting to clean up what the various translation tools would produce or report.
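For what it's worth, a hand-written native nftables equivalent of the iptables/ipset rule above might look something like this. This is my reconstruction, not anything a conversion tool produced, and the set contents are placeholders:

```
table inet filter {
    set nfsports {
        type inet_service
        elements = { 111, 2049 }
    }
    set nfsclients {
        type ipv4_addr
        flags interval
        elements = { 192.0.2.0/24 }
    }
    chain input {
        type filter hook input priority 0; policy accept;
        tcp dport @nfsports ip saddr @nfsclients counter accept
    }
}
```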

(As a side effect it makes it less likely that I'll convert various iptables things to being natively nft/nftables based, because I can't do a fully mechanical conversion. If they still work with iptables-nft, I'm better off leaving them as is. Probably this also means that iptables-nft support is likely to have a long, long life.)

NFS v4 delegations on a Linux NFS server can act as mandatory locks

By: cks

Over on the Fediverse, I shared an unhappy learning experience:

Linux kernel NFS: we don't have mandatory locks.
Also Linux kernel NFS: if the server has delegated a file to a NFS client that's now not responding, good luck writing to the file from any other machine. Your writes will hang.

NFS v4 delegations are a feature where the NFS server, such as your Linux fileserver, hands a lot of authority over a particular file to a client that is using that file. There are various sorts of delegations, but even a basic read delegation will force the NFS server to recall the delegation if anything else wants to write to the file or to remove it. Recalling a delegation requires notifying the NFS v4 client that it has lost the delegation and then having the client accept and respond to that. NFS v4 clients have to respond to the loss of a delegation because they may be holding local state that needs to be flushed back to the NFS server before the delegation can be released.

(After all, the NFS v4 server promised the client 'this file is yours to fiddle around with, I will consult you before touching it'.)

Under some circumstances, when the NFS v4 server is unable to contact the NFS v4 client, it will simply sit there waiting and as part of that will not allow you to do things that require the delegation to be released. I don't know if there's a delegation recall timeout, although I suspect that there is, and I don't know how to find out what the timeout is, but whatever the value is, it's substantial (it may be the 90 second 'default lease time' from nfsd4_init_leases_net(), or perhaps the 'grace', also probably 90 seconds, or perhaps the two added together).

(90 seconds is not what I consider a tolerable amount of time for my editor to completely freeze when I tell it to write out a new version of the file. When NFS is involved, I will typically assume that something has gone badly wrong well before then.)

As mentioned, the NFS v4 RFC also explicitly notes that NFS v4 clients may have to flush file state in order to release their delegation, and this itself may take some time. So even without an unavailable client machine, recalling a delegation may stall for some possibly arbitrary amount of time (depending on how the NFS v4 server behaves; the RFC encourages NFS v4 servers to not be hasty if the client seems to be making a good faith effort to clear its state). Both the slow client recall and the hung client recall can happen even in the absence of any actual file locks; in my case, the now-unavailable client merely having read from the file was enough to block things.

This blocking recall is effectively a mandatory lock, and it affects both remote operations over NFS and local operations on the fileserver itself. Short of waiting out whatever timeout applies, you have two realistic choices to deal with this (the non-realistic choice is to reboot the fileserver). First, you can bring the NFS client back to life, or at least something that's at its IP address and responds to the server with NFS v4 errors. Second, I believe you can force everything from the client to expire through /proc/fs/nfsd/clients/<ID>, by writing 'expire' to the client's 'ctl' file. You can find the right client ID by grep'ing for something in all of the clients/*/info files.
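As a sketch of the second option, the expiry procedure might look like the following. I've parameterized the clients directory so the logic can be demonstrated against a mock directory tree; on a real fileserver you'd point it at /proc/fs/nfsd/clients, run it as root, and use a real client hostname (deadclient.example.com is a placeholder):

```shell
#!/bin/sh
# Sketch: expire all NFS v4 state (locks, opens, delegations) for a dead
# client, by finding its ID directory and writing 'expire' to its ctl file.
expire_nfs_client() {
    clients=$1 host=$2
    for info in "$clients"/*/info; do
        [ -e "$info" ] || continue
        if grep -q "$host" "$info" 2>/dev/null; then
            # Writing 'expire' to the ctl file tells the kernel NFS
            # server to drop everything this client holds.
            echo expire > "${info%/info}/ctl"
        fi
    done
}

# Demonstration against a mock layout; a real run would use
# /proc/fs/nfsd/clients instead.
mock=$(mktemp -d)
mkdir -p "$mock/17"
echo "address: 203.0.113.5, name: deadclient.example.com" > "$mock/17/info"
: > "$mock/17/ctl"
expire_nfs_client "$mock" deadclient.example.com
cat "$mock/17/ctl"   # prints "expire"
```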

Discovering this makes me somewhat more inclined than before to consider entirely disabling 'leases', the underlying kernel feature that is used to implement these NFS v4 delegations (I discovered how to do this when investigating NFS v4 client locks on the server). This will also affect local processes on the fileserver, but that now feels like a feature since hung NFS v4 delegation recalls will stall or stop even local operations.

Why Ubuntu 24.04's ls can show a puzzling error message on NFS filesystems

By: cks

Suppose that you're on Ubuntu 24.04, using NFS v4 filesystems mounted from a Linux NFS fileserver, and at some point you do a 'ls -l' or a 'ls -ld' of something you don't own. You may then be confused and angered:

; /bin/ls -ld ckstst
/bin/ls: ckstst: Permission denied
drwx------ 64 ckstst [...] 131 Jul 17 12:06 ckstst

(There are situations where this doesn't happen or doesn't repeat, which I don't understand but which I'm assuming are NFS caching in action.)

If you apply strace to the problem, you'll find that the failing system call is listxattr(2), which is trying to list 'extended attributes'. On Ubuntu 24.04, ls comes from Coreutils, and Coreutils apparently started using listxattr() in version 9.4.

The Linux NFS v4 code supports extended attributes (xattrs), which are from RFC 8276; they're supported in both the client and the server since mid-2020 if I'm reading git logs correctly. Both the normal Ubuntu 22.04 LTS and 24.04 LTS server kernels are recent enough to include this support on both the server and clients, and I don't believe there's any way to turn just them off in the kernel server (although if you disable NFS v4.2 they may disappear too).

However, the NFS v4 server doesn't treat listxattr() operations the way the kernel normally does. Normally, the kernel will let you do listxattr() on an object (a directory, a file, etc) that you don't have read permissions on, just as it will let you do stat() on it. However, the NFS v4 server code specifically requires that you have read access to the object. If you don't, you get EACCES (no second S).

(The sausage is made in nfsd_listxattr() in fs/nfsd/vfs.c, specifically in the fh_verify() call that uses NFSD_MAY_READ instead of NFSD_MAY_NOP, which is what eg GETATTR uses.)

In January of this year, Coreutils applied a workaround to this problem, which appeared in Coreutils 9.6 (and is mentioned in the release notes).

Normally we'd have found this last year, but we've been slow to roll out Ubuntu 24.04 LTS machines and apparently until now no one ever did a 'ls -l' of unreadable things on one of them (well, on a NFS mounted filesystem).

(This elaborates on a Fediverse post. Our patch is somewhat different than the official one.)

The development version of OpenZFS is sometimes dangerous, illustrated

By: cks

I've used OpenZFS on my office and home desktops (on Linux) for what is a long time now, and over that time I've consistently used the development version of OpenZFS, updating to the latest git tip on a regular basis (cf). There have been occasional issues but I've said, and continue to say, that the code that goes into the development version is generally well tested and I usually don't worry too much about it. But I do worry somewhat, and I do things like read every commit message for the development version and I sometimes hold off on updating my version if a particular significant change has recently landed.

But, well, sometimes things go wrong in a development version. As covered in Rob Norris's An (almost) catastrophic OpenZFS bug and the humans that made it (and Rust is here too) (via), there was a recently discovered bug in the development version of OpenZFS that could or would have corrupted RAIDZ vdevs. When I saw the fix commit go by in the development version, I felt extremely lucky that I use mirror vdevs, not raidz, and so avoided being affected by this.

(While I might have detected this at the first scrub after some data was corrupted, the data would have been gone and at a minimum I'd have had to restore it from backups. Which I don't currently have on my home desktop.)

In general this is a pointed reminder that the development version of OpenZFS isn't perfect, no matter how long I and other people have been lucky with it. You might want to think twice before running the development version in order to, for example, get support for the very latest kernels that are used by distributions like Fedora. Perhaps you're better off delaying your kernel upgrades a bit longer and sticking to released branches.

I don't know if this is going to change my practices around running the development version of OpenZFS on my desktops. It may make me more reluctant to update to the very latest version on my home desktop; it would be straightforward to have that run only time-delayed versions of what I've already run through at least one scrub cycle on my office desktop (where I have backups). And I probably won't switch to the next release version when it comes out, partly because of kernel support issues.

(Maybe) understanding how to use systemd-socket-proxyd

By: cks

I recently read systemd has been a complete, utter, unmitigated success (via among other places), where I found a mention of an interesting systemd piece that I'd previously been unaware of, systemd-socket-proxyd. As covered in the article, the major purpose of systemd-socket-proxyd is to bridge between systemd's dynamic socket activation and a conventional program that listens on some socket, so that you can dynamically activate the program when a connection comes in. Unfortunately the systemd-socket-proxyd manual page is a little bit opaque about how it works for this purpose (and what the limitations are). Even though I'm familiar with systemd stuff, I had to think about it for a bit before things clicked.

A systemd socket unit activates the corresponding service unit when a connection comes in on the socket. For simple services that are activated separately for each connection (with 'Accept=yes'), this is actually a templated unit, but if you're using it to activate a regular daemon like sshd (with 'Accept=no') it will be a single .service unit. When systemd activates this unit, it will pass the socket to it either through systemd's native mechanism or an inetd-compatible mechanism using standard input. If your listening program supports either mechanism, you don't need systemd-socket-proxyd and your life is simple. But plenty of interesting programs don't; they expect to start up and bind to their listening socket themselves. To work with these programs, systemd-socket-proxyd accepts a socket (or several) from systemd and then proxies connections on that socket to the socket your program is actually listening to (which will not be the official socket, such as port 80 or 443).

All of this is perfectly fine and straightforward, but the question is, how do we get our real program to be automatically started when a connection comes in and triggers systemd's socket activation? The answer, which isn't explicitly described in the manual page but which appears in the examples, is that we make the socket's .service unit (which will run systemd-socket-proxyd) also depend on the .service unit for our real service with a 'Requires=' and an 'After='. When a connection comes in on the main socket that systemd is doing socket activation for, call it 'fred.socket', systemd will try to activate the corresponding .service unit, 'fred.service'. As it does this, it sees that fred.service depends on 'realthing.service' and must be started after it, so it will start 'realthing.service' first. Your real program will then start, bind to its local socket, and then have systemd-socket-proxyd proxy the first connection to it.

To automatically stop everything when things are idle, you set systemd-socket-proxyd's --exit-idle-time option and also set StopWhenUnneeded=true on your program's real service unit ('realthing.service' here). Then when systemd-socket-proxyd is idle for long enough, it will exit, systemd will notice that the 'fred.service' unit is no longer active, see that there's nothing that needs your real service unit any more, and shut that unit down too, causing your real program to exit.
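Put together, the three units might look like this sketch; all of the names, ports, and paths here are placeholders (and the proxy binary's location varies by distribution):

```ini
# fred.socket -- owns the public port; systemd listens here
[Socket]
ListenStream=80

[Install]
WantedBy=sockets.target

# fred.service -- runs the proxy, pulling in the real service first
[Unit]
Requires=realthing.service
After=realthing.service

[Service]
ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=60s 127.0.0.1:8080

# realthing.service -- the real program, bound to a local-only port
[Unit]
StopWhenUnneeded=true

[Service]
ExecStart=/usr/sbin/realthing --listen 127.0.0.1:8080
```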

The obvious limitation of using systemd-socket-proxyd is that your real program no longer knows the actual source of the connection. If you use systemd-socket-proxyd to relay HTTP connections on port 80 to an nginx instance that's activated on demand (as shown in the examples in the systemd-socket-proxyd manual page), that nginx sees and will log all of the connections as local ones. There are usage patterns where this information will be added by something else (for example, a frontend server that is a reverse proxy to a bunch of activated on demand backend servers), but otherwise you're out of luck as far as I know.

Another potential issue is that systemd's idea of when the .service unit for your real program has 'started' and thus it can start running systemd-socket-proxyd may not match when your real program actually gets around to setting up its socket. I don't know if systemd-socket-proxyd will wait and try a bit to cope with the situation where it gets started a bit faster than your real program can get its socket ready.

(Systemd has ways that your real program can signal readiness, but if your program can use these ways it may well also support being passed sockets from systemd as a direct socket activated thing.)

Linux 'exportfs -r' stops on errors (well, problems)

By: cks

Linux's NFS export handling system has a very convenient option where you don't have to put all of your exports into one file, /etc/exports, but can instead write them into a bunch of separate files in /etc/exports.d. This is very convenient for allowing you to manage filesystem exports separately from each other and to add, remove, or modify only a single filesystem's exports. Also, one of the things that exportfs(8) can do is 'reexport' all current exports, synchronizing the system state to what is in /etc/exports and /etc/exports.d; this is 'exportfs -r', and is a handy thing to do after you've done various manipulations of files in /etc/exports.d.

Although it's not documented and not explicit in 'exportfs -v -r' (which will claim to be 'exporting ...' for various things), I have an important safety tip which I discovered today: exportfs does nothing on a re-export if you have any problems in your exports. In particular, if any single file in /etc/exports.d has a problem, no files from /etc/exports.d get processed and no exports are updated.

One potential problem with such files is syntax errors, which is fair enough as a 'problem'. But another problem is that they refer to directories that don't exist, for example because you have lingering exports for a ZFS pool that you've temporarily exported (which deletes the directories that the pool's filesystems may have previously been mounted on). A missing directory is an error even if the exportfs options include 'mountpoint', which only does the export if the directory is a mount point.
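To make this concrete, here is a hypothetical /etc/exports.d fragment (the path and client name are invented). Even with 'mountpoint' in the options, the directory itself must exist or 'exportfs -r' treats the whole re-export as a problem:

```
# /etc/exports.d/tank-fs1.exports (hypothetical)
/tank/fs1  nfsclient.example.com(rw,mountpoint)
```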

When I stubbed my toe on this I was surprised. What I'd vaguely expected was that the error would cause only the particular file in /etc/exports.d to not be processed, and that it wouldn't be a fatal error for the entire process. Exportfs itself prints no notices about this being a fatal problem, and it will happily continue to process other files in /etc/exports.d (as you can see with 'exportfs -v -r' with the right ordering of where the problem file is) and claim to be exporting them.

Oh well, now I know and hopefully it will stick.

Systemd user units, user sessions, and environment variables

By: cks

A variety of things in typical graphical desktop sessions communicate through the use of environment variables; for example, X's $DISPLAY environment variable. Somewhat famously, modern desktops run a lot of things as systemd user units, and it might be nice to do that yourself (cf). When you put these two facts together, you wind up with a question, namely how the environment works in systemd user units and what problems you're going to run into.

The simplest case is using systemd-run to run a user scope unit ('systemd-run --user --scope --'), for example to run a CPU heavy thing with low priority. In this situation, the new scope will inherit your entire current environment and nothing else. As far as I know, there's no way to do this with other sorts of things that systemd-run will start.

Non-scope user units by default inherit their environment from your user "systemd manager". I believe that there is always only a single user manager for all sessions of a particular user, regardless of how you've logged in. When starting things via 'systemd-run', you can selectively pass environment variables from your current environment with 'systemd-run --user -E <var> -E <var> -E ...'. If the variable is unset in your environment but set in the user systemd manager, this will unset it for the new systemd-run started unit. As you can tell, this will get very tedious if you want to pass a lot of variables from your current environment into the new unit.

You can manipulate your user "systemd manager environment block", as systemctl describes it in Environment Commands. In particular, you can export current environment settings to it with 'systemctl --user import-environment VAR VAR2 ...'. If you look at this with 'systemctl --user show-environment', you'll see that your desktop environment has pushed a lot of environment variables into the systemd manager environment block, including things like $DISPLAY (if you're on X). All of these environment variables for X, Wayland, DBus, and so on are probably part of how the assorted user units that are part of your desktop session talk to the display and so on.

You may now see a little problem. What happens if you're logged in with a desktop X session, and then you go elsewhere and SSH in to your machine (maybe with X forwarding) and try to start a graphical program as a systemd user unit? Since you only have a single systemd manager regardless of how many sessions you have, the systemd user unit you started from your SSH session will inherit all of the environment variables that your desktop session set and it will think it has graphics and open up a window on your desktop (which is hopefully locked, and in any case it's not useful to you over SSH). If you import the SSH session's $DISPLAY (or whatever) into the systemd manager's environment, you'll damage your desktop session.

For specific environment variables, you can override or remove them with 'systemd-run --user -E ...' (for example, to override or remove $DISPLAY). But hunting down all of the session environment variables that may trigger undesired effects is up to you, making systemd-run's user scope units by far the easiest way to deal with this.
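Pulling these commands together, here is a sketch of the main knobs. It assumes a running systemd user manager, and every command is guarded so it's harmless on a system without one:

```shell
#!/bin/sh
# Sketch of the environment-handling options discussed above.
# Guarded so it exits cleanly on systems without a user manager.

# A scope unit inherits your entire current environment:
systemd-run --user --scope -- nice -n 19 sleep 1 2>/dev/null || true

# Other unit types inherit the manager's environment block instead;
# -E copies (or unsets) individual variables from your environment:
systemd-run --user -E DISPLAY -E XAUTHORITY -- true 2>/dev/null || true

# Inspect and update the manager's environment block itself:
systemctl --user show-environment 2>/dev/null || true
systemctl --user import-environment DISPLAY 2>/dev/null || true
```

The tedium the text describes is visible here: every variable you want passed through (or overridden) has to be named individually.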

(I don't know if there's something extra-special about scope units that enables them and only them to be passed your entire environment, or if this is simply a limitation in systemd-run that it doesn't try to implement this for anything else.)

The reason I find all of this regrettable is that it makes putting applications and other session processes into their own units much harder than it should be. Systemd-run's scope units inherit your session environment but can't be detached, so at a minimum you have extra systemd-run processes sticking around (and putting everything into scopes when some of them might be services is unaesthetic). Other units can be detached but don't inherit your environment, requiring assorted contortions to make things work.

PS: Possibly I'm missing something obvious about how to do this correctly, or perhaps there's an existing helper that can be used generically for this purpose.

Current cups-browsed seems to be bad for central CUPS print servers

By: cks

Suppose, not hypothetically, that you have a central CUPS print server, and that people also have Linux desktops or laptops that they point at your print server to print to your printers. As of at least Ubuntu 24.04, if you're doing this you probably want to get people to turn off and disable cups-browsed on their machines. If you don't, your central print server may see a constant flood of connections from client machines running cups-browsed. They're probably running it, as I believe that cups-browsed is installed and activated by default these days in most desktop Linux environments.
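Turning it off on a client machine is quick; a sketch, assuming a systemd-based system (masking is optional insurance so that package updates can't quietly re-enable the unit):

```shell
#!/bin/sh
# Stop and disable cups-browsed on a client machine.
# Guarded: does nothing harmful where systemd or the unit is absent.
if command -v systemctl >/dev/null 2>&1; then
    systemctl disable --now cups-browsed.service 2>/dev/null || true
    systemctl mask cups-browsed.service 2>/dev/null || true
fi
```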

(We didn't really notice this in prior Ubuntu versions, although it's possible cups-browsed was always doing something like this and what's changed in the Ubuntu 24.04 version is that it's doing it more and faster.)

I'm not entirely sure why this happens, and I'm also not sure what the CUPS requests typically involve, but one pattern that we see is that such clients will make a lot of requests to the CUPS server's /admin/ URL. I'm not sure what's in these requests, because CUPS immediately rejects them as unauthenticated. Another thing we've seen is frequent attempts to get printer attributes for printers that don't exist and that have name patterns that look like local printers. One of the reasons that the clients are hitting the /admin/ endpoint may be to somehow add these printers to our CUPS server, which is definitely not going to work.

(We've also seen signs that some Ubuntu 24.04 applications can repeatedly spam the CUPS server, probably with status requests for printers or print jobs. This may be something enabled or encouraged by cups-browsed.)

My impression is that modern Linux desktop software, things like cups-browsed included, is not really spending much time thinking about larger scale, managed Unix environments where there are a bunch of printers (or at least print queues), the 'print server' is not on your local machine and not run by you, anything random you pick up through broadcast on the local network is suspect, and so on. I broadly sympathize with this, because such environments are a small minority now, but it would be nice if client side CUPS software didn't cause problems in them.

(I suspect that cups-browsed and its friends are okay in an environment where either the 'print server' is local or it's operated by you and doesn't require authentication, there's only a few printers, everyone on the local network is friendly and if you see a printer it's definitely okay to use it, and so on. This describes a lot of Linux desktop environments, including my home desktop.)

Compute GPUs can have odd failures under Linux (still)

By: cks

Back in the early days of GPU computation, the hardware, drivers, and software were so relatively untrustworthy that our early GPU machines had to be specifically reserved by people and that reservation gave them the ability to remotely power cycle the machine to recover it (this was in the days before our SLURM cluster). Things have gotten much better since then, with things like hardware and driver changes so that programs with bugs couldn't hard-lock the GPU hardware. But every so often we run into odd failures where something funny is going on that we don't understand.

We have one particular SLURM GPU node that has been flaky for a while, with the specific issue being that every so often the NVIDIA GPU would throw up its hands and drop off the PCIe bus until we rebooted the system. This didn't happen every time it was used, or with any consistent pattern, although some people's jobs seemed to regularly trigger this behavior. Recently I dug up a simple to use GPU stress test program, and when this machine's GPU did its disappearing act this Saturday, I grabbed the machine, rebooted it, ran the stress test program, and promptly had the GPU disappear again. Success, I thought, and since it was Saturday, I stopped there, planning to repeat this process today (Monday) at work, while doing various monitoring things.

Since I'm writing a Wandering Thoughts entry about it, you can probably guess the punchline. Nothing has changed on this machine since Saturday, but all today the GPU stress test program could not make the GPU disappear. Not with the same basic usage I'd used Saturday, and not with a different usage that took the GPU to full power draw and a reported temperature of 80C (which was a higher temperature and power draw than the GPU had been at when it disappeared, based on our Prometheus metrics). If I'd been unable to reproduce the failure at all with the GPU stress program, that would have been one thing, but reproducing it once and then not again is just irritating.
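For watching the GPU while a stress program runs, nvidia-smi can report the same temperature and power numbers directly; a sketch (the query fields are standard nvidia-smi ones, and it's guarded to do nothing on a machine without the NVIDIA tools):

```shell
#!/bin/sh
# Snapshot of GPU temperature, power draw, and clocks; wrap in watch(1)
# or add '-l 5' for continuous sampling during a stress run.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi \
        --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm,utilization.gpu \
        --format=csv
fi
```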

(The machine is one we assembled from parts, with an RTX 4090 and a Ryzen Threadripper 1950X in an X399 Taichi motherboard that is probably not even vaguely running the latest BIOS, seeing as the base hardware was built many years ago, although the GPU has been swapped around since then. Everything is in a pretty roomy 4U case, but if the failure was consistent we'd have assumed cooling issues.)

I don't really have any theories for what could be going on, but I suppose I should try to find a GPU stress test program that exercises every last corner of the GPU's capabilities at full power rather than using only one or two parts at a time. On CPUs, different loads light up different functional units, and I assume the same is true on GPUs, so perhaps the problem is in one specific functional unit or a combination of them.

(Although this doesn't explain why the GPU stress test program was able to cause the problem on Saturday but not today, unless a full reboot didn't completely clear out the GPU's state. Possibly we should physically power this machine off entirely for long enough to dissipate any lingering things.)

What I've observed about Linux kernel WireGuard on 10G Ethernet so far

By: cks

I wrote about a performance mystery with WireGuard on 10G Ethernet, and since then I've done additional measurements with results that both give some clarity and leave me scratching my head a bit more. So here is what I know about the general performance characteristics of Linux kernel WireGuard on a mixture of Ubuntu 22.04 and 24.04 servers with stock settings, and using TCP streams inside the WireGuard tunnels (because the high bandwidth thing we care about runs over TCP).

  • CPU performance is important even when WireGuard isn't saturating the CPU.

  • CPU performance seems to be more important on the receiving side than on the sending side. If you have two machines, one faster than the other, you get more bandwidth sending a TCP stream from the slower machine to the faster one. I don't know if this is an artifact of the Linux kernel implementation or if the WireGuard protocol requires the receiver to do more work than the sender.

  • There seems to be a single-peer bandwidth limit (related to CPU speeds). You can increase the total WireGuard bandwidth of a given server by talking to more than one peer.

  • When talking to a single peer, there's both a unidirectional bandwidth limit and a bidirectional bandwidth limit. If you send and receive to a single peer at once, you don't get the sum of the unidirectional send and unidirectional receive; you get less.

  • There's probably also a total WireGuard bandwidth that, in our environment, falls short of 10G bandwidth (ie, a server talking WireGuard to multiple peers can't saturate its 10G connection, although maybe it could if I had enough peers in my test setup).

The best performance between a pair of WireGuard peers I've gotten is from two servers with Xeon E-2226G CPUs; these can push their 10G Ethernet to about 850 MBytes/sec of WireGuard bandwidth in one direction and about 630 MBytes/sec in each direction if they're both sending and receiving. These servers (and other servers with slower CPUs) can basically saturate their 10G-T network links with plain (non-WireGuard) TCP.

If I was to build a high performance 'WireGuard gateway' today, I'd build it with a fast CPU and dual 10G networks, with WireGuard traffic coming in (and going out) one 10G interface and the resulting gatewayed traffic using the other. WireGuard on fast CPUs can run fast enough that a single 10G interface could limit total bandwidth under the right (or wrong) circumstances; segmenting WireGuard and clear traffic onto different interfaces avoids that.

(A WireGuard gateway that only served clients at 1G or less would likely be perfectly fine with a single 10G interface and reasonably fast CPUs. But I'd want to test how many 1G clients it took to reach the total WireGuard bandwidth limit on a 10G WireGuard server before I was completely confident about that.)

A performance mystery with Linux WireGuard on 10G Ethernet

By: cks

As a followup on discovering that WireGuard can saturate a 1G Ethernet (on Linux), I set up WireGuard on some slower servers here that have 10G networking. This isn't an ideal test but it's more representative of what we would see with our actual fileservers, since I used spare fileserver hardware. What I got out of it was a performance and CPU usage mystery.

What I expected to see was that WireGuard performance would top out at some level above 1G as the slower CPUs on both the sending and the receiving host ran into their limits, and I definitely wouldn't see them drive the network as fast as they could without WireGuard. What I actually saw was that WireGuard did hit a speed limit but the CPU usage didn't seem to saturate, either for kernel WireGuard processing or for the iperf3 process. These machines can manage to come relatively close to 10G bandwidth with bare TCP, while with WireGuard they were running around 400 MBytes/sec of on-the-wire bandwidth (which translates to somewhat less inside the WireGuard connection, due to overheads).

One possible explanation for this is increased packet handling latency, where the introduction of WireGuard adds delays that keep things from running at full speed. Another possible explanation is that I'm running into CPU limits that aren't obvious from simple tools like top and htop. One interesting thing is that if I do a test in both directions at once (either an iperf3 bidirectional test or two iperf3 sessions, one in each direction), the bandwidth in each direction is slightly over half the unidirectional bandwidth (while a bidirectional test without WireGuard runs at full speed in both directions at once). This certainly makes it look like there's a total WireGuard bandwidth limit in these servers somewhere; unidirectional traffic gets basically all of it, while bidirectional traffic splits it fairly between each direction.
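The unidirectional and bidirectional numbers came from iperf3 run over the tunnel; a sketch of the two modes (the peer address is a placeholder for the other end's WireGuard-internal IP, and the commands are guarded so this is safe to run where iperf3 or the tunnel is absent):

```shell
#!/bin/sh
# Compare one-way and both-ways TCP bandwidth through a WireGuard tunnel.
PEER=10.66.0.1    # placeholder: the peer's WireGuard-internal address

if command -v iperf3 >/dev/null 2>&1; then
    # unidirectional: this host sends to the peer
    iperf3 -c "$PEER" -t 10 2>/dev/null || true
    # bidirectional: send and receive at once (iperf3 3.7 and later)
    iperf3 -c "$PEER" -t 10 --bidir 2>/dev/null || true
fi
```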

I looked at 'perf top' on the receiving 10G machine and kernel spin lock stuff seems to come in surprisingly high. I tried having a 1G test machine also send WireGuard traffic to the receiving 10G test machine at the same time and the incoming bandwidth does go up by about 100 Mbytes/sec, so perhaps on these servers I'm running into a single-peer bandwidth limitation. I can probably arrange to test this tomorrow.

(I can't usefully try both of my 1G WireGuard test machines at once because they're both connected to the same 1G switch, with a 1G uplink into our 10G switch fabric.)

PS: The two 10G servers are running Ubuntu 24.04 and Ubuntu 22.04 respectively with standard kernels; the faster server with more CPUs was the 'receiving' server here, and is running 24.04. The two 1G test servers are running Ubuntu 24.04.

Linux kernel WireGuard can go 'fast' on decent hardware

By: cks

I'm used to thinking of encryption as a slow thing that can't deliver anywhere near to network saturation, even on basic gigabit Ethernet connections. This is broadly the experience we see with our current VPN servers, which struggle to turn in more than relatively anemic bandwidth with OpenVPN and L2TP, and so for a long time I assumed it would also be our experience with WireGuard if we tried to put anything serious behind it. I'd seen the 2023 Tailscale blog post about this but discounted it as something we were unlikely to see; their kernel throughput on powerful-sounding AWS nodes was anemic by 10G standards, so I assumed our likely less powerful servers wouldn't even get 1G rates.

Today, for reasons beyond the scope of this entry, I wound up wondering how fast we could make WireGuard go. So I grabbed a couple of spare servers we had with reasonably modern CPUs (by our limited standards), put our standard Ubuntu 24.04 on them, and took a quick look to see how fast I could make them go over 1G networking. To my surprise, the answer is that WireGuard can saturate that 1G network with no particularly special tuning, and the system CPU usage is relatively low (4.5% on the client iperf3 side, 8% on the server iperf3 side; each server has a single Xeon E-2226G). The low usage suggests that we could push well over 1G of WireGuard bandwidth through a 10G link, which means that I'm going to set one up for testing at some point.

While the Xeon E-2226G is not a particularly impressive CPU, it's better than the CPUs our NFS fileservers have (the current hardware has Xeon Silver 4410Ys). But I suspect that we could sustain over 1G of WireGuard bandwidth even on them, if we wanted to terminate WireGuard on the fileservers instead of on a 'gateway' machine with a fast CPU (and a 10G link).

More broadly, I probably need to reset my assumptions about the relative speed of encryption as compared to network speeds. These days I suspect a lot of encryption methods can saturate a 1G network link, at least in theory, since I don't think WireGuard is exceptionally good in this respect (as I understand it, encryption speed wasn't particularly a design goal; it was designed to be secure first). Actual implementations may vary for various reasons so perhaps our VPN servers need some tuneups.

(The actual bandwidth achieved inside WireGuard is less than the 1G data rate because simply being encrypted adds some overhead. This is also something I'm going to have to remember when doing future testing; if I want to see how fast WireGuard is driving the underlying networking, I should look at the underlying networking data rate, not necessarily WireGuard's rate.)

A silly systemd wish for moving new processes around systemd units

By: cks

Linux cgroups offer a bunch of robust features for limiting resource usage and handling resource contention between different groups of processes, which you can use to implement things like per-user memory and CPU resource limits. On a systemd based system, which is to say basically almost all Linuxes today, systemd more or less completely owns the cgroup hierarchy and using cgroups for resource limits requires that the processes involved be placed inside relevant systemd units, and for that matter that the systemd units exist.

Unfortunately, the mechanisms for doing this are a little bit under-developed. If you're dealing with something that goes through PAM and for which putting processes into user slices based on the UID running them is the right answer, you can use pam_systemd (which we do for various reasons). If you want a different hierarchy and things go through PAM, you can perhaps write a PAM session module that does this, copying code from pam_systemd, but I don't know if there's anything for that today. If you have processes that are started in ways that don't go through PAM, as far as I know you're currently out of luck. One case that's quite relevant for us is Apache CGI processes run through suexec.

It would be nice to be able to do better, since the odds that everything that starts processes will pick up the ability to talk to systemd to set up slices, sessions, and so on for them seem rather low. Some things have specific magic support for this, but I don't think the process is well documented and I believe it requires that things change how they start programs (so eg suexec would have to know how to do this). This means that what I'm wishing for is a daemon that would be given some sort of rules and use them to move processes between systemd slices and other units, possibly creating things like user sessions on the fly. Then you could write a rule that said 'if a process is in the Apache system cgroup and its UID isn't <X>, put it in a slice in a user hierarchy'.

An extra problem is that this daemon probably wouldn't be perfect, since it would have to react to processes after they'd appeared rather than intercept their creation; some processes could slip through the cracks or otherwise do weird things. This would make it sort of a hack, rather than something that I suspect anyone would want as a proper feature.

(I don't know if a kernel LSM could make this more reliable by intercepting and acting on certain things, like setuid() calls.)
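No such daemon exists as far as I know, but its core action would be small. Here is a hypothetical sketch under cgroup v2; the cgroup paths and the Apache UID are assumptions, and a real version would want to create sessions through systemd's APIs rather than writing into cgroup.procs behind its back:

```shell
#!/bin/sh
# Hypothetical rule: any process in Apache's cgroup whose UID isn't
# Apache's own gets moved into that user's slice.
APACHE_CG="/sys/fs/cgroup/system.slice/apache2.service"   # assumed path
APACHE_UID=33                                             # assumed UID

move_stray_procs() {
    [ -f "$APACHE_CG/cgroup.procs" ] || return 0
    while read -r pid; do
        uid=$(awk '/^Uid:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
        [ -n "$uid" ] || continue                 # process already gone
        [ "$uid" -ne "$APACHE_UID" ] || continue  # Apache's own process
        # Moving a process is just writing its PID into the target
        # cgroup.procs (needs privileges; the user slice must exist):
        dest="/sys/fs/cgroup/user.slice/user-$uid.slice/cgroup.procs"
        [ -w "$dest" ] && echo "$pid" > "$dest"
    done < "$APACHE_CG/cgroup.procs"
    return 0
}

move_stray_procs
```

This also illustrates the race the text mentions: the scan only sees processes after they appear, so short-lived ones can slip through.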

PS: Possibly the correct answer is to persuade the Apache people to make suexec consult PAM, even if the standard suexec PAM stack does nothing. Then you could in theory add pam_systemd or whatever there. It appears that Debian may have had a custom patch for this at one point, but I believe they gave it up years and years ago.

Fedora's DNF 5 and the curse of mandatory too-smart output

By: cks

DNF is Fedora's high(er) level package management system, which pretty much any system administrator is going to have to use to install and upgrade packages. Fedora 41 and later have switched from DNF 4 to DNF 5 as their normal (and probably almost mandatory) version of DNF. I ran into some problems with this switch, and since then I've found other issues, all of which boil down to a simple issue: DNF 5 insists on doing too-smart output.

Regardless of what you set your $TERM to and what else you do, if DNF 5 is connected to a terminal (and perhaps if it isn't), it will pretty-print its output in an assortment of ways. As far as I can tell it simply assumes ANSI cursor addressability, among other things, and will always fit its output to the width of your terminal window, truncating output as necessary. This includes output from RPM package scripts that are running as part of the update. Did one of them print a line longer than your current terminal width? Tough, it was probably truncated. Are you using script so that you can capture and review all of the output from DNF and RPM package scripts? Again, tough, you can't turn off the progress bars and other things that will make a complete mess of the typescript.

(It's possible that you can find the information you want in /var/log/dnf5.log in un-truncated and readable form, but if so it's buried in debug output and I'm not sure I trust dnf5.log in general.)

DNF 5 is far from the only offender these days. An increasing number of command line programs simply assume that they should always produce 'smart' output (ideally only if they're connected to a terminal). They have no command line option to turn this off, and since they always use 'ANSI' escape sequences, they ignore the tradition of '$TERM' and especially 'TERM=dumb' to turn that off. Some of them can specifically disable colour output (typically with one of a number of environment variables, which may or may not be documented, and sometimes with a command line option), but that's usually the limit of their willingness to stop doing things. The idea of printing one whole line at a time as you do things and not printing progress bars, interleaving output, and so on has increasingly become a non-starter for modern command line tools.

(Another semi-offender is Debian's 'apt' and also 'apt-get' to some extent, although apt-get's progress bars can be turned off and 'apt' is explicitly a more user friendly front end to apt-get and friends.)

PS: I can't run DNF with its output directed into a file because it wants you to interact with it to approve things, and I don't feel like letting it run freely without that.

Netplan can only have WireGuard peers in one file

By: cks

We have started using WireGuard to build a small mesh network so that machines outside of our network can securely get at some services inside it (for example, to send syslog entries to our central syslog server). Since this is all on Ubuntu, we set it up through Netplan, which works but which I said 'has warts' in my first entry about it. Today I discovered another wart due to what I'll call the WireGuard provisioning problem:

Current status: provisioning WireGuard endpoints is exhausting, at least in Ubuntu 22.04 and 24.04 with netplan. So many netplan files to update. I wonder if Netplan will accept files that just define a single peer for a WG network, but I suspect not.

The core WireGuard provisioning problem is that when you add a new WireGuard peer, you have to tell all of the other peers about it (or at least all of the other peers you want to be able to talk to the new peer). When you're using Netplan, it would be convenient if you could put each peer in a separate file in /etc/netplan; then when you add a new peer, you just propagate the new Netplan file for the peer to everything (and do the special Netplan dance required to update peers).

(Apparently I should now call it 'Canonical Netplan', as that's what its front page calls it. At least that makes it clear exactly who is responsible for Netplan's state and how it's not going to be widely used.)

Unfortunately this doesn't work, and it doesn't work in a dangerous way, which is that Netplan only picks up the WireGuard peers from a single netplan file (at least on servers, using systemd-networkd as the backend). If you put each peer in its own file, only the first peer is picked up. If you define some peers in the file where you define your WireGuard private key, local address, and so on, and some peers in another file, only the peers from whichever file comes first will be used (even if the first file only defines peers, which isn't enough to bring up a WireGuard device by itself). As far as I can see, Netplan doesn't report any errors or warnings to the system logs on boot about this situation; instead, you silently get incomplete WireGuard configurations.

This is visibly and clearly a Netplan issue, because on servers you can inspect the systemd-networkd files written by Netplan (in /run/systemd/network). When I do this, the WireGuard .netdev file has only the peers from one file defined in it (and the .netdev file matches the state of the WireGuard interface). This is especially striking when the netplan file with the private key and listening port (and some peers) is second; since the .netdev file contains the private key and so on, Netplan is clearly merging data from more than one netplan file, not completely ignoring everything except the first one. It's just ignoring any peers encountered after the first set of them.

My overall conclusion is that in Netplan, you need to put all configuration for a given WireGuard interface into a single file, however tempting it might be to try splitting it up (for example, to put core WireGuard configuration stuff in one file and then list all peers in another one).
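Concretely, the shape that works is one file with everything in it. A sketch (the names, addresses, and keys are placeholders; the layout follows Netplan's WireGuard tunnel schema):

```yaml
# /etc/netplan/50-wg0.yaml -- the entire wg0 configuration in one file
network:
  version: 2
  tunnels:
    wg0:
      mode: wireguard
      port: 51820
      key: PRIVATE-KEY-OF-THIS-HOST
      addresses: [10.66.0.2/24]
      peers:
        - keys:
            public: PUBLIC-KEY-OF-PEER-ONE
          allowed-ips: [10.66.0.1/32]
          endpoint: peer1.example.org:51820
        - keys:
            public: PUBLIC-KEY-OF-PEER-TWO
          allowed-ips: [10.66.0.3/32]
```

Adding a peer then means editing this one file on every machine, which is exactly the provisioning tedium described above.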

I don't know if this is an already filed Netplan bug and I don't plan on bothering to file one for it, partly because I don't expect Canonical to fix Netplan issues any more than I expect them to fix anything else and partly for other reasons.

PS: I'm aware that we could build a system to generate the Netplan WireGuard file, or maybe find a YAML manipulating program that could insert and delete blocks that matched some criteria. I'm not interested in building yet another bespoke custom system to deal with what is (for us) a minor problem, since we don't expect to be constantly deploying or removing WireGuard peers.

These days, Linux audio seems to just work (at least for me)

By: cks

For a long time, the common perception was that 'Linux audio' was the punchline for a not particularly funny joke. I sort of shared that belief; although audio had basically worked for me for a long time, I had a simple configuration and dreaded having to make more complex audio work in my unusual desktop environment. But these days, audio seems to just work for me, even in systems that have somewhat complex audio options.

On my office desktop, I've wound up with three potential audio outputs and two audio inputs: the motherboard's standard sound system, a USB headset with a microphone that I use for online meetings, the microphone on my USB webcam, and (to my surprise) a HDMI audio output because my LCD displays do in fact have tiny little speakers built in. In PulseAudio (or whatever is emulating it today), I have the program I use for online meetings set to use the USB headset and everything else plays sound through the motherboard's sound system (which I have basic desktop speakers plugged into). All of this works sufficiently seamlessly that I don't think about it, although I do keep a script around to reset the default audio destination.
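The reset script mentioned above can be tiny; a sketch (the sink name is an assumption standing in for a typical motherboard output, 'pactl list short sinks' shows the real names, and it's guarded to do nothing without pactl):

```shell
#!/bin/sh
# Reset the default audio output to the motherboard's analog sink.
SINK="alsa_output.pci-0000_00_1f.3.analog-stereo"   # assumed sink name

if command -v pactl >/dev/null 2>&1; then
    pactl set-default-sink "$SINK" 2>/dev/null || true
fi
```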

On my home desktop, for a long time I had a simple single-output audio system that played through the motherboard's sound system (plus a microphone on a USB webcam that was mostly not connected). Recently I got an outboard USB DAC and, contrary to my fears, it basically plugged in and just worked. It was easy to set the USB DAC as the default output in pavucontrol and all of the settings related to it stick around even when I put it to sleep overnight and it drops off the USB bus. I was quite pleased by how painless the USB DAC was to get working, since I'd been expecting much more hassles.

(Normally I wouldn't bother meticulously switching the USB DAC to standby mode when I'm not using it for an extended time, but I noticed that the case is clearly cooler when it rests in standby mode.)

This is still a relatively simple audio configuration because it's basically static. I can imagine more complex ones, where you have audio outputs that aren't always present and that you want some programs (or more generally audio sources) to use when they are present, perhaps even with priorities. I don't know if the Linux audio systems that Linux distributions are using these days could cope with that, or if they did would give you any easy way to configure it.

(I'm aware that PulseAudio and so on can be fearsomely complex under the hood. As far as the current actual audio system goes, I believe that what my Fedora 41 machines are using for audio is PipeWire (also) with WirePlumber, based on what processes seem to be running. I think this is the current Fedora 41 audio configuration in general, but I'm not sure.)

My Cinnamon desktop customizations (as of 2025)

By: cks

A long time ago I wrote up some basic customizations of Cinnamon, shortly after I started using Cinnamon (also) on my laptop of the time. Since then, the laptop got replaced with another one and various things changed in both the land of Cinnamon and my customizations (eg, also). Today I feel like writing down a general outline of my current customizations, which fall into a number of areas from the modest but visible to the large but invisible.

The large but invisible category is that just like on my main fvwm-based desktop environment, I use xcape (plus a custom Cinnamon key binding for a weird key combination) to invoke my custom dmenu setup (1, 2) when I tap the CapsLock key. I have dmenu set to come up horizontally on the top of the display, which Cinnamon conveniently leaves alone in the default setup (it has its bar at the bottom). And of course I make CapsLock into an additional Control key when held.

(On the laptop I'm using a very old method of doing this. On more modern Cinnamon setups in virtual machines, I do this with Settings → Keyboard → Layout → Options, and then in the CapsLock section set CapsLock to be an additional Ctrl key.)
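In shell terms, the CapsLock arrangement is roughly this sketch (the XKB option and the F13 tap key are assumptions standing in for my real setup, and it does nothing without an X session):

```shell
#!/bin/sh
# CapsLock: an extra Ctrl when held, a bindable key when tapped.
if [ -n "$DISPLAY" ] && command -v setxkbmap >/dev/null 2>&1; then
    setxkbmap -option caps:ctrl_modifier 2>/dev/null || true
    # xcape turns a tap into F13, which Cinnamon can bind to run dmenu:
    if command -v xcape >/dev/null 2>&1; then
        xcape -e 'Caps_Lock=F13' 2>/dev/null || true
    fi
fi
```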

To start xcape up and do some other things, like load X resources, I have a personal entry in Settings → Startup Applications that runs a script in my ~/bin/X11. I could probably do this in a more modern way with an assortment of .desktop files in ~/.config/autostart (which is where my 'Startup Applications' settings actually wind up) that run each thing individually or perhaps some systemd user units. But the current approach works and is easy to modify if I want to add or remove things (I can just edit the script).

I have a number of Cinnamon 'applets' installed on my laptop and my other Cinnamon VM setups. The ones I have everywhere are Spices Update and Shutdown Applet, the latter because if I tell the (virtual) machine to log me off, shut down, or restart, I generally don't want to be nagged about it. On my laptop I also have CPU Frequency Applet (set to only display a summary) and CPU Temperature Indicator, for no compelling reason. In all environments I also pin launchers for Firefox and (Gnome) Terminal to the Cinnamon bottom bar, because I start both of them often enough. I position the Shutdown Applet on the left side, next to the launchers, because I think of it as a peculiar 'launcher' instead of an applet (on the right).

(The default Cinnamon keybindings also start a terminal with Ctrl + Alt + T, which you can still find through the same process from several years ago provided that you don't cleverly put something in .local/share/glib-2.0/schemas and then run 'glib-compile-schemas .' in that directory. If I was a smarter bear, I'd understand what I should have done when I was experimenting with something.)

On my virtual machines with Cinnamon, I don't bother with the whole xcape and dmenu framework, but I do set up the applets and the launchers and fix CapsLock.

(This entry was sort of inspired by someone I know who just became a Linux desktop user (after being a long time terminal user).)

Sidebar: My Cinnamon 'window manager' custom keybindings

I have these (on my laptop) and perpetually forget about them, so I'm going to write them down now so perhaps that will change.

move-to-corner-ne=['<Alt><Super>Right']
move-to-corner-nw=['<Alt><Super>Left']
move-to-corner-se=['<Primary><Alt><Super>Right']
move-to-corner-sw=['<Primary><Alt><Super>Left']
move-to-side-e=['<Shift><Alt><Super>Right']
move-to-side-n=['<Shift><Alt><Super>Up']
move-to-side-s=['<Shift><Alt><Super>Down']
move-to-side-w=['<Shift><Alt><Super>Left']

I have some other keybindings on the laptop but they're even less important, especially once I added dmenu.

Looking at what NFSv4 clients have locked on a Linux NFS(v4) server

By: cks

A while ago I wrote an entry about (not) finding which NFSv4 client owns a lock on a Linux NFS(v4) server, where the best I could do was pick awkwardly through the raw NFS v4 client information in /proc/fs/nfsd/clients. Recently I discovered an alternative to doing this by hand, which is the nfsdclnts program, and as a result of digging into it and what I was seeing when I tried it out, I now believe I have a better understanding of the entire situation (which was previously somewhat confusing).

The basic thing that nfsdclnts will do is list 'locks' and some information about them with 'nfsdclnts -t lock', in addition to listing other state information such as 'open', for open files, and 'deleg', for NFS v4 delegations. The information it lists is somewhat limited - for example, it will list the inode number but not the filesystem - but on the good side, nfsdclnts is a Python program, so you can easily modify it to report any extra information that exists in the clients/#/states files. However, this information about locks is not complete, because of how file level locks appear to normally manifest in NFS v4 client state.

(The information in the states files is limited, although it contains somewhat more than nfsdclnts shows.)

Here is how I understand NFS v4 locking and states. To start with, NFS v4 has a feature called delegations where the NFS v4 server can hand a lot of authority over a file to a NFS v4 client. When a NFS v4 client accesses a file, the NFS v4 server likes to give it a delegation if this is possible; it normally will be if no one else has the file open or active. Once a NFS v4 client holds a delegation, it can lock the file without involving the NFS v4 server. At this point, the client's 'states' file will report an opaque 'type: deleg' entry for the file (and this entry may or may not have a filename or instead be what nfsdclnts will report as 'disconnected dentry').

While a NFS v4 client has the file delegated, if any other NFS v4 client does anything with the file, including simply opening it, the NFS v4 server will recall the delegation from the original client. As a result, the original client now has to tell the NFS v4 server that it has the file locked. At this point a 'type: lock' entry for the file appears in the first NFS v4 client's states file. If the first NFS v4 client releases its lock while the second NFS v4 client is trying to acquire it, the second NFS v4 client will not have a delegation for the file, so its lock will show up as an explicit 'type: lock' entry in its states file.

An additional wrinkle is that a NFS v4 client holding a delegation doesn't immediately release it once all processes have released their locks, closed the file, and so on. Instead the delegation may linger on for some time. If another NFS v4 client opens the file during this time, the first client will lose the delegation but the second NFS v4 client may not get a delegation from the NFS v4 server, so its lock will be visible as a 'type: lock' states file entry.

A third wrinkle is that multiple clients may hold read-only delegations for a file and have fcntl() read locks on it at once, with each of them having a 'type: deleg, access: r' entry for it in their states files. These will only become visible 'type: lock' states entries if the clients have to release their delegations.

So putting this all together:

  • If there is a 'type: lock' entry for the file in any states file (or it's listed in 'nfsdclnts -t lock'), the file is definitely locked by whoever has that entry.

  • If there are no 'type: deleg' or 'type: lock' entries for the file, it's definitely not locked; you can also see this by whether nfsdclnts lists it as having delegations or locks.

  • If there are 'type: deleg' entries for the file, it may or may not be locked by the NFS v4 client (or clients) with the delegation. If the delegation is an 'access: w' delegation, you can see if someone actually has the file locked by accessing the file on another NFS v4 client, which will force the NFS v4 server to recall the delegation and expose the lock if there is one.

If the delegation is 'access: r' and might have multiple read-only locks, you can't force the NFS v4 server to recall the delegation by merely opening the file read-only (for example with 'cat file' or 'less file'). Instead the server will only recall the delegation if you open the file read-write. A convenient way to do this is probably to use 'flock -x <file> -c /bin/true', although this does require you to have more permissions for the file than simply the ability to read it.
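The decision process above can be turned into a small shell sketch. This is hedged: it assumes the /proc/fs/nfsd/clients layout discussed earlier (one numbered directory per client, each with a 'states' file), the function name is mine, and the directory argument exists only so the function can be pointed somewhere else for testing:

```shell
# nfs_lock_holders: print which NFSv4 client directories hold at least one
# explicit 'type: lock' entry in their states file. Delegation-masked locks
# (the 'type: deleg' case described above) will not show up here.
nfs_lock_holders() {
    clientdir="${1:-/proc/fs/nfsd/clients}"
    for st in "$clientdir"/*/states; do
        [ -r "$st" ] || continue
        if grep -q 'type: lock' "$st"; then
            # The directory name is the client's number; its 'info' file
            # has more detail about who the client actually is.
            echo "client $(basename "$(dirname "$st")") holds locks:"
            grep 'type: lock' "$st" | sed 's/^/  /'
        fi
    done
}

# On a real NFS server you would run this as root:
#   nfs_lock_holders
```

This only reports the definite 'type: lock' case from the first bullet; the ambiguous delegation cases still need the manual poking described above.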

Sidebar: Disabling NFS v4 delegations on the server

Based on trawling various places, I believe this is done by writing a '0' to /proc/sys/fs/leases-enabled (or the equivalent 'fs.leases-enabled' sysctl) and then apparently restarting your NFS v4 server processes. This will disable all user level uses of fcntl()'s F_SETLEASE and F_GETLEASE as an additional effect, and I don't know if this will affect any important programs running on the NFS server itself. Based on a study of the kernel source code, I believe that you don't need to restart your NFS v4 server processes if it's sufficient for the NFS server to stop handing out new delegations but current delegations can stay until they're dropped.
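As a concrete sketch, the persistent version of this (assuming the usual sysctl.d mechanism; the file name here is my invention) would be something like:

```ini
# /etc/sysctl.d/90-no-leases.conf (hypothetical file name)
# Disabling leases stops the NFS v4 server from handing out new
# delegations; it also disables F_SETLEASE for local programs.
fs.leases-enabled = 0
```

For a one-off test you can instead run 'sysctl -w fs.leases-enabled=0' (or write '0' to /proc/sys/fs/leases-enabled) as root.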

(There have apparently been some NFS v4 server and client issues with delegations, cf, along with other NFS v4 issues. However, I don't know if the cure winds up being worse than the disease here, or if there's another way to deal with these stateid problems.)

Getting older, now-replaced Fedora package updates

By: cks

Over the history of a given Fedora version, Fedora will often release multiple updates to the same package (for example, kernels, but there are many others). When it does this, the older packages wind up being removed from the updates repository and are no longer readily available through mechanisms like 'dnf list --showduplicates <package>'. For a long time I used dnf's 'local' plugin to maintain a local archive of all packages I'd updated, so I could easily revert, but it turns out that as of Fedora 41's change to dnf5 (dnf version 5), that plugin is not available (presumably it hasn't been ported to dnf5, and may never be). So I decided to look into my other options for retrieving and installing older versions of packages, in case the most recent version has a bug that affects me (which has happened).

Before I take everyone on a long yak-shaving expedition, the simplest and best answer is to install the 'fedora-repos-archive' package, which installs an additional Fedora repository that has those replaced updates. After installing it, I suggest that you edit /etc/yum.repos.d/fedora-updates-archive.repo to disable it by default, which will save you time, bandwidth, and possibly aggravation. Then when you really want to see all possible versions of, say, Rust, you can do:

dnf list --showduplicates --enablerepo=updates-archive rust

You can then use 'dnf downgrade ...' as appropriate.

(Like the other Fedora repositories, updates-archive automatically knows your release version and picks packages from it. I think you can change this a bit with '--releasever=<NN>', but I'm not sure how deep the archive is.)

The other approach is to use Fedora Bodhi (also) and Fedora Koji (also) to fetch the packages for older builds, in much the same way as you can use Bodhi (and Koji) to fetch new builds that aren't in the updates or updates-testing repository yet. To start with, we're going to need to find out what's available. I think this can be done through either Bodhi or Koji, although Koji is presumably more authoritative. Let's do this for Rust in Fedora 41:

bodhi updates query --packages rust --releases f41
koji list-builds --state COMPLETE --no-draft --package rust --pattern '*.fc41'

Note that both of these listings are going to include package versions that were never released as updates for various reasons, and also versions built for the pre-release Fedora 41. Although Koji has a 'f41-updates' tag, I haven't been able to find a way to restrict 'koji list-builds' output to packages with that tag, so we're getting more than we'd like even after we use a pattern to restrict this to just Fedora 41.

(I think you may need to use the source package name, not a binary package one; if so, you can get it by running 'rpm -qi rust' or whatever and looking at the 'Source RPM' line and name.)

Once you've found the package version you want, the easiest and fastest way to get it is through the koji command line client, following the directions in Installing Kernel from Koji with appropriate changes:

mkdir /tmp/scr
cd /tmp/scr
koji download-build --arch=x86_64 --arch=noarch rust-1.83.0-1.fc41

This will get you a bunch of RPMs, and then you can do 'dnf downgrade /tmp/scr/*.rpm' to have dnf do the right thing (only downgrading things you actually have installed).

One reason you might want to use Koji is that this gets you a local copy of the old package in case you want to go back and forth between it and the latest version for testing. If you use the dnf updates-archive approach, you'll be re-downloading the old version at every cycle. Of course at that point you can also use Koji to get a local copy of the latest update too, or 'dnf download ...', although Koji has the advantage that it gets all the related packages regardless of their names (so for Rust you get the 'cargo', 'clippy', and 'rustfmt' packages too).

(In theory you can work through the Fedora Bodhi website, but in practice it seems to be extremely overloaded at the moment and very slow. I suspect that the bot scraper plague is one contributing factor.)

PS: If you're using updates-archive and you just want to download the old packages, I think what you want is 'dnf download --enablerepo=updates-archive ...'.

Fedora 41 seems to have dropped an old XFT font 'property'

By: cks

Today I upgraded my office desktop from Fedora 40 to Fedora 41, and as is traditional there was a little issue:

Current status: it has been '0' days since a Fedora upgrade caused X font problems, this time because xft apparently no longer accepts 'encoding=...' as a font specification argument/option.

One of the small issues with XFT fonts is that they don't really have canonical names. As covered in the "Font Name" section of fonts.conf, a given XFT font is a composite of a family, a size, and a number of attributes that may be used to narrow down the selection of the XFT font until there's only one option left (or no option left). One way to write that in textual form is, for example, 'Sans:Condensed Bold:size=13'.

For a long time, one of the 'name=value' properties that XFT font matching accepted was 'encoding=<something>'. For example, you might say 'encoding=iso10646-1' to specify 'Unicode' (and back in the long ago days, this apparently could make a difference for font rendering). Although I can't find 'encoding=' documented in historical fonts.conf stuff, I appear to have used it for more than a decade, dating back to when I first converted my fvwm configuration from XLFD fonts to XFT fonts. It's still accepted today on Fedora 40 (although I suspect it does nothing):

: f40 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
DejaVuSans.ttf: "DejaVu Sans" "Regular"

However, it's no longer accepted on Fedora 41:

: f41 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
Unable to parse the pattern

Initially I thought this had to be a change in fontconfig, but that doesn't seem to be the case; both Fedora 40 and Fedora 41 use the same version, '2.15.0', just with different build numbers (partly because of a mass rebuild for Fedora 41). Freetype itself went from version 2.13.2 to 2.13.3, but the release notes don't seem to have anything relevant. So I'm at a loss. At least it was easy to fix once I knew what had happened; I just had to take the ':encoding=iso10646-1' bit out from the places I had it.

(The visual manifestation was that all of my fvwm menus and window title bars switched to a tiny font. For historical reasons all of my XFT font specifications in my fvwm configuration file used 'encoding=...', so in Fedora 41 none of them worked and fvwm reported 'can't load font <whatever>' and fell back to its default of an XLFD font, which was tiny on my HiDPI display.)
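The fix itself can be scripted. Here's a hedged sketch (the function name is mine, and the regex assumes the 'encoding=' value never contains a colon or double quote, which held for my configuration):

```shell
# strip_xft_encoding: remove any ':encoding=<value>' property from the
# XFT font specifications in a configuration file, in place.
strip_xft_encoding() {
    sed -i 's/:encoding=[^:"]*//g' "$1"
}

# For example, on my fvwm configuration (path is illustrative):
#   strip_xft_encoding ~/.fvwm/config
```

This turns a spec like 'Sans:Condensed Bold:size=13:encoding=iso10646-1' back into 'Sans:Condensed Bold:size=13', which Fedora 41's fontconfig will parse.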

PS: I suspect that this change will be coming in other Linux distributions sooner or later. Unsurprisingly, Ubuntu 24.04's fc-match still accepts 'encoding=...'.

PPS: Based on ltrace output, FcNameParse() appears to be what fails on Fedora 41.

I should learn systemd's features for restricting things

By: cks

Today, for reasons beyond the scope of this entry, I took something I'd been running by hand from the command line for testing and tried to set it up under systemd. This is normally straightforward, and it should have been extra straightforward because the thing came with a .service file. But that .service file used a lot of systemd's features for restricting what programs can do, and for my sins I'd decided to set up the program with its binary, configuration file, and so on in different places than it expected (and I think without some things it expected, like a supplementary group for permission to read some files). This was, unfortunately, an abject failure, so I wound up yanking all of the restrictions except 'DynamicUser=true'.

I'm confident that with enough time, I can (or could) sort out all of the problems (although I didn't feel like spending that time today). What this experience really points out is that systemd has a lot of options for really restricting what programs you run can do, and I'm not particularly familiar with them. To get the service working with all of its original restrictions, I'd have to read my way through things like systemd.exec and understand what everything the .service file used did. Once I did that, I could have understood what I needed to change to deal with my setup of the program.

(An expert probably could have fixed things in short order.)

That systemd has a lot of potential restrictions it can impose and that those restrictions are complex is not a flaw of systemd (or its fault). We already know that fine grained permissions are hard to set up and manage in any environment, especially if you don't know what you're doing (as I don't with systemd's restrictions). At the same time, fine grained restrictions are quite useful for being able to apply some restrictions to programs not designed for them.
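To make this concrete, here is a hedged sketch of the kind of hardening directives such a .service file might contain. The directive names are real systemd options (see systemd.exec and systemd.service); the binary, path, and group are invented examples, not what the actual service used:

```ini
[Service]
ExecStart=/opt/example/bin/exampled
# Run as a transient, dynamically allocated user.
DynamicUser=true
# Make most of the filesystem read-only or inaccessible to the service.
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
NoNewPrivileges=true
# Carve out the one place it genuinely needs to write.
ReadWritePaths=/var/lib/example
# A supplementary group for reading shared files; forgetting to adapt
# this sort of thing to a relocated setup is exactly what tripped me up.
SupplementaryGroups=examplegroup
```

With ProtectSystem=strict in effect, moving the program's binary or configuration to an unexpected place silently breaks unless the relevant ReadWritePaths/ReadOnlyPaths lines are adjusted to match.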

(The simplicity of OpenBSD's 'pledge' system is great, but it needs the program's active cooperation. For better or worse, Linux doesn't have a native, fully supported equivalent; instead we have to build it out of more fine grained, lower level facilities, and that's what systemd exposes.)

Learning how to use the restrictions is probably worthwhile in general. We run plenty of things through locally written systemd .service units. Some of those things are potentially risky (although generally not too risky), and some of them could be more restricted than they are today if we wanted to do the work and knew what we were doing (and knew some of the gotchas involved).

(And sooner or later we're going to run into more things with restrictions already in their .service units, and we're going to want to change some aspects of how they work.)

I'm working to switch from wget to curl (due to Fedora)

By: cks

I've been using wget for a long time now, which means that I've developed a lot of habits, reflexes and even little scripts around it. Then wget2 happened, or more exactly Fedora switched from wget to wget2 (and Ubuntu is probably going to follow along). I'm very much not a fan of wget2 (also); I find it has both worse behavior and worse output than classical wget, in ways that routinely get in my way. Or got in my way before I started retraining myself to use curl instead of wget.

(It's actually possible that Ubuntu won't follow Fedora here. Ubuntu 24.04's 'wget' is classic wget, and Debian unstable currently has the wget package still as classic wget. The wget to wget2 transition involves the kind of changes that I can see Debian developers rejecting, so maybe Debian will keep 'wget' as classic wget. The upstream has a wget 1.25.0 release as recently as November 2024 (cf); on the other hand, the main project page says that 'currently GNU wget2 is being developed', so it certainly sounds like the upstream wants to move.)

One tool for my switch is wcurl (also, via), which is a cover script to provide a wget-like interface to curl. But I don't have wcurl everywhere (it's not packaged in Ubuntu 24.04, although I think it's coming in 26.04), so I've also been working to remember things like curl's -L and -O options (for downloading things, these are basically 'do what I want' options; I almost always want curl to follow HTTP redirects). There's a number of other options I want to remember, so since I've been looking at the curl manual page, here's some notes to myself.

(If I'm downloading multiple URLs at once, I'll probably want to use '--remote-name-all' instead of repeating -O a lot. But I'm probably not going to remember that unless I write a script.)

My 'wcat' script is basically 'curl -L -sS <url>' (-s to not show the progress bar, -S to include at least the HTTP payload on an error, -L to follow redirects). My related 'wretr' script, which is intended to show headers too, is 'curl -L -sS -i <url>' (-i includes headers), or 'curl -sS -i <url>' if I want to explicitly see any HTTP redirect rather than automatically follow it.

(What I'd like is an option to show HTTP headers only if there was an HTTP error, but curl is currently all or nothing here.)
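As a sketch, the two scripts reduce to something like the following shell functions (the function names match my scripts as described above; the bodies are just the curl options already discussed):

```shell
# wcat: fetch URL(s) to stdout. -L follows redirects, -s suppresses the
# progress meter, -S still prints an error message on failure.
wcat() {
    curl -L -sS "$@"
}

# wretr: the same, but -i also includes the HTTP response headers.
# Drop the -L to see any HTTP redirect instead of following it.
wretr() {
    curl -L -sS -i "$@"
}
```

Usage is simply 'wcat https://example.org/some/page' and so on.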

Some of the time I'll want to fetch files with the -J option, which is the curl equivalent of wget's --trust-server-names. This is necessary in cases where a project doesn't bother with good URLs for things. Possibly I also want to use '-R' to set the local downloaded file's timestamp based on the server provided timestamp, which is wget's traditional behavior (sometimes it's good, sometimes it's confusing).

PS: I care about wcurl being part of a standard Ubuntu package because then we can install it as part of one of our standard package sets. If it's a personal script, it's not pervasive, although that's still better than nothing.

PPS: I'm not going to blame Fedora for the switch from wget to wget2. Fedora has a consistent policy of marching forward in changes like this to stay in sync with what upstream is developing, even when they cause pain to people using Fedora. That's just what you sign up for when you choose Fedora (or drift into it, in my case; I've been using 'Fedora' since before it was Fedora).

How I discovered a hidden microphone on a Chinese NanoKVM

NanoKVM is a hardware KVM switch developed by the Chinese company Sipeed. Released last year, it enables remote control of a computer or server using a virtual keyboard, mouse, and monitor. Thanks to its compact size and low price, it quickly gained attention online, especially when the company promised to release its code as open-source. However, as we’ll see, the device has some serious security issues. But first, let’s start with the basics.

How Does the Device Work?

As mentioned, NanoKVM is a KVM switch designed for remotely controlling and managing computers or servers. It features an HDMI port, three USB-C ports, an Ethernet port for network connectivity, and a special serial interface. The package also includes a small accessory for managing the power of an external computer.

Using it is quite simple. First, you connect the device to the internet via an Ethernet cable. Once online, you can access it through a standard web browser (though JavaScript JIT must be enabled). The device supports Tailscale VPN, but with some effort (read: hacking), it can also be configured to work with your own VPN, such as WireGuard or OpenVPN server. Once set up, you can control it from anywhere in the world via your browser.

NanoKVM

The device is connected to the target computer using an HDMI cable, capturing the video output that would normally be displayed on a monitor. This allows you to view the computer’s screen directly in your browser, essentially acting as a virtual monitor.

Through the USB connection, NanoKVM can also emulate a keyboard, mouse, CD-ROM, USB drive, and even a USB network adapter. This means you can remotely control the computer as if you were physically sitting in front of it - but all through a web interface.

While it functions similarly to remote management tools like RDP or VNC, it has one key difference: there’s no need to install any software on the target computer. Simply plug in the device, and you’re ready to manage it remotely. NanoKVM even allows you to enter the BIOS, and with the additional accessory for power management, you can remotely turn the computer on, off, or reset it.

This makes it incredibly useful - you can power on a machine, access the BIOS, change settings, mount a virtual bootable CD, and install an operating system from scratch, just as if you were physically there. Even if the computer is on the other side of the world.

NanoKVM is also quite affordable. The fully-featured version, which includes all ports, a built-in mini screen, and a case, costs just over €60, while the stripped-down version is around €30. By comparison, a similar RaspberryPi-based device, PiKVM, costs around €400. However, PiKVM is significantly more powerful and reliable and, with a KVM splitter, can manage multiple devices simultaneously.

As mentioned earlier, the announcement of the device caused quite a stir online - not just because of its low price, but also due to its compact size and minimal power consumption. In fact, it can be powered directly from the target computer via a USB cable, which it also uses to simulate a keyboard, mouse, and other USB devices. So you need only one USB cable: in one direction it powers the NanoKVM, and in the other direction the NanoKVM uses it to simulate a keyboard, mouse, and other devices on the computer you want to manage.

The device is built on the open-source RISC-V processor architecture, and the manufacturer eventually did release the device’s software under an open-source license at the end of last year. (To be fair, one part of the code remains closed, but the community has already found a suitable open-source replacement, and the manufacturer has promised to open this portion soon.)

However, the real issue is security.

Understandably, the company was eager to release the device as soon as possible. In fact, an early version had a minor hardware design flaw - due to an incorrect circuit cable, the device sometimes failed to detect incoming HDMI signals. As a result, the company recalled and replaced all affected units free of charge. Software development also progressed rapidly, but in such cases, the primary focus is typically on getting basic functionality working, with security taking a backseat.

So, it’s not surprising that the developers made some serious missteps - rushed development often leads to stupid mistakes. But some of the security flaws I discovered in my quick (and by no means exhaustive) review are genuinely concerning.

One of the first security analyses revealed numerous vulnerabilities - and some rather bizarre discoveries. For instance, a security researcher even found an image of a cat embedded in the firmware. While the Sipeed developers acknowledged these issues and relatively quickly fixed at least some of them, many remain unresolved.

NanoKVM

After purchasing the device myself, I ran a quick security audit and found several alarming flaws. The device initially came with a default password, and SSH access was enabled using this preset password. I reported this to the manufacturer, and to their credit, they fixed it relatively quickly. However, many other issues persist.

The user interface is riddled with security flaws - there’s no CSRF protection, no way to invalidate sessions, and more. Worse yet, the encryption key used for password protection (when logging in via a browser) is hardcoded and identical across all devices. This is a major security oversight, as it allows an attacker to easily decrypt passwords. More problematically, this had to be explained to the developers. Multiple times.

Another concern is the device’s reliance on Chinese DNS servers, and configuring your own (custom) DNS settings is quite complicated. Additionally, the device communicates with Sipeed’s servers in China - downloading not only updates but also the closed-source component mentioned earlier. To download this closed-source component, it needs to verify an identification key, which is stored on the device in plain text. Alarmingly, the device does not verify the integrity of software updates, includes a strange version of the WireGuard VPN application (which does not work on some networks), and runs a heavily stripped-down version of Linux that lacks systemd and apt. And these are just a few of the issues.

Were these problems simply oversights? Possibly. But what additionally raised red flags was the presence of tcpdump and aircrack - tools commonly used for network packet analysis and wireless security testing. While these are useful for debugging and development, they are also hacking tools that can be dangerously exploited. I can understand why developers might use them during testing, but they have absolutely no place on a production version of the device.

A Hidden Microphone

And then I discovered something even more alarming - a tiny built-in microphone that isn’t clearly mentioned in the official documentation. It’s a miniature SMD component, measuring just 2 x 1 mm, yet capable of recording surprisingly high-quality audio.

What’s even more concerning is that all the necessary recording tools are already installed on the device! By simply connecting via SSH (remember, the device initially used default passwords!), I was able to start recording audio using the amixer and arecord tools. Once recorded, the audio file could be easily copied to another computer. With a little extra effort, it would even be possible to stream the audio over a network, allowing an attacker to eavesdrop in real time.

Hidden Microphone in NanoKVM

Physically removing the microphone is possible, but it’s not exactly straightforward. As seen in the image, disassembling the device is tricky, and due to the microphone’s tiny size, you’d need a microscope or magnifying glass to properly desolder it.

To summarize: the device is riddled with security flaws, originally shipped with default passwords, communicates with servers in China, comes preinstalled with hacking tools, and even includes a built-in microphone - fully equipped for recording audio - without clear mention of it in the documentation. Could it get any worse?

I am pretty sure these issues stem from extreme negligence and rushed development rather than malicious intent. However, that doesn’t make them any less concerning.

That said, these findings don’t mean the device is entirely unusable.

Since the device is open-source, it’s entirely possible to install custom software on it. In fact, one user has already begun porting his own Linux distribution - starting with Debian and later switching to Ubuntu. With a bit of luck, this work could soon lead to official Ubuntu Linux support for the device.

This custom Linux version already runs the manufacturer’s modified KVM code, and within a few months, we’ll likely have a fully independent and significantly more secure software alternative. The only minor inconvenience is that installing it requires physically opening the device, removing the built-in SD card, and flashing the new software onto it. However, in reality, this process isn’t too complicated.

And while you’re at it, you might also want to remove the microphone… or, if you prefer, connect a speaker. In my test, I used an 8-ohm, 0.5W speaker, which produced surprisingly good sound - essentially turning the NanoKVM into a tiny music player. Actually, the idea is not so bad: PiKVM also added two-way audio support for their devices at the end of last year.

Basic board with speaker

Final Thoughts

All this of course raises an interesting question: How many similar devices with hidden functionalities might be lurking in your home, just waiting to be discovered? And not just those of Chinese origin. Are you absolutely sure none of them have built-in miniature microphones or cameras?

You can start with your iPhone - last year Apple agreed to pay $95 million to settle a lawsuit alleging that its voice assistant Siri recorded private conversations, shared the data with third parties, and used it for targeted ads. “Unintentionally”, of course! Yes, that Apple, the one that cares about your privacy so much.

And Google is doing the same: they are facing a similar lawsuit over their voice assistant, but the litigation likely won’t be settled until this fall. So no, small Chinese startup companies are not the only problem. And if you are worried about Chinese companies’ obligations towards the Chinese government, let’s not forget that U.S. companies also have obligations to cooperate with the U.S. government. While Apple publicly claims they do not cooperate with the FBI and other U.S. agencies (because they care about your privacy so much), some media revealed that Apple was holding a series of secretive Global Police Summits at its Cupertino headquarters, where they taught police how to use their products for surveillance and policing work. And as one of the police officers pointed out, he has “never been part of an engagement that was so collaborative.” Yep.

P.S. How to Record Audio on NanoKVM

If you want to test the built-in microphone yourself, simply connect to the device via SSH and run the following two commands:

  • amixer -Dhw:0 cset name='ADC Capture Volume' 20 (this sets the microphone sensitivity to high; note that the value goes after the quoted control name)
  • arecord -Dhw:0,0 -d 3 -r 48000 -f S16_LE -t wav test.wav (this records three seconds of audio to a file named test.wav; drop the '-d 3' to record until you press Ctrl + C)

Now, speak or sing (perhaps the Chinese national anthem?) near the device, then, once the recording finishes, copy the test.wav file to your computer and listen to the recording.

Kako sem na mini kitajski napravi odkril skriti mikrofon

Lansko leto je kitajsko podjetje Sipeed izdalo zanimivo napravico za oddaljeno upravljanje računalnikov in strežnikov, ki sliši na ime NanoKVM. Gre za tim. KVM stikalo (angl. KVM switch), torej fizično napravo, ki omogoča oddaljeno upravljanje računalnika oz. strežnika preko virtualne tipkovnice, miške in monitorja.

Kako deluje?

Napravica ima en HDMI, tri USB-C priključke, Ethernet priključek za omrežni kabel in posebno “letvico”, kamor priključimo dodaten priložen vmesnik za upravljanje napajanja zunanjega računalnika. Kako zadeva deluje? Zelo preprosto. Napravico preko omrežnega Ethernet kabla povežemo na internet in se potem lahko nanjo s pomočjo navadnega spletnega brskalnika povežemo od koderkoli (je pa v brskalniku potrebno omogočiti JavaScript JIT). Vgrajena je sicer že tudi podpora za Tailscale VPN, a z malo truda oz. hekanja jo lahko povežemo tudi na svoj VPN (Wireguard ali OpenVPN). Torej lahko do nje preprosto dostopamo preko interneta od kjerkoli na svetu.

NanoKVM

We then connect the device to the computer we want to manage with an HDMI cable; the device captures the image (which would otherwise be shown on a monitor), and we can then see that image in the browser. The USB connection on the target computer simulates a keyboard, mouse, CD-ROM/USB stick and even a USB network card. The device thus allows managing a computer remotely as if we were sitting at it, while in reality we control the computer through a browser, over the internet. Unlike remote-desktop applications, nothing needs to be installed on the target computer here; it is enough to plug this device into it. Of course, with the help of this device we can also enter the target computer's BIOS, and with the additional adapter connected to the aforementioned header we can also power the remote computer off and on, or reset it.

Useful, since this way we can power the computer on, go into the BIOS and change settings there, then virtually insert a boot CD into it and even install an operating system. Even if the computer is on the other side of the world.

The device is quite cheap - the extended version, with all the ports, a built-in mini display and a cute case, costs a bit over 60 EUR, while the stripped-down version is around 30 EUR. For comparison, a similar device based on the Raspberry Pi, called PiKVM, costs around 400 EUR; that device is, admittedly, considerably more capable and reliable, and through a KVM splitter it also allows managing several machines at once.

What about security?

The announcement of the device generated quite a bit of excitement online, not only because of the low price, but also because it is really small and uses minimal power (it can be powered directly from the target computer through the USB cable with which, in the other direction, it simulates a keyboard, mouse and other USB devices). It is built on the open RISC-V processor architecture, and the manufacturer promised to open up the device's code, i.e. release it under an open-source license, which indeed happened at the end of last year. Well, one part is not yet fully open, but the community has already found a suitable open-source replacement, and the manufacturer has also promised to open up that part of the code.

The problem, though, is security.

The manufacturer naturally had an interest in getting the device to market as quickly as possible, and one of the first versions even had a minor flaw in the hardware design (due to a wrong cable used on the board, the device sometimes failed to detect the incoming HDMI signal), so all the devices were recalled and replaced free of charge. Software development was also quite intense, and it is clear that in such a situation a company's focus is primarily on developing the basic functionality, with security taking second place.

So it is no surprise that the developers were quite careless during development, which is of course a consequence of the rush. But some of the findings of my quick (and by no means comprehensive) security review are truly worrying.

Already one of the first quick security reviews uncovered numerous shortcomings and even outright bizarre things - among other things, a security researcher even found a picture of a cat in the device's firmware. Sipeed's developers acknowledged these flaws and fixed them - at least some of them - relatively quickly. But far from all of them.

An opened NanoKVM

I recently bought the device myself, and my quick review also uncovered numerous shortcomings. At the beginning, the device had a default password set, and SSH connections to the device were enabled with the same password. I notified the manufacturer and they fixed it relatively quickly. But numerous flaws remained.

The user interface still has a whole pile of shortcomings - there is no CSRF protection, sessions cannot be invalidated, and so on. The encryption key used to protect passwords (when logging in to the device through the browser) is hardcoded and identical for all devices. Which makes absolutely no sense, since an attacker can use this key to decrypt the password quite trivially. The problem is that this had to be specially explained to the developers. Several times.

Personally, I was bothered that the device uses certain Chinese DNS servers - while configuring your own DNS servers is rather complicated. The device also transfers data from the company's Chinese servers (essentially, it downloads from those servers the only still closed-source component, while checking the device's identification key, which is stored on the device in unencrypted form). The device does not verify the integrity of updates, has some odd build of the WireGuard VPN application installed, runs a rather stripped-down version of Linux without systemd and apt, and quite a few similar gems can be found. Teething problems?

Perhaps. But the device also ships with the tools tcpdump and aircrack, which are normally used for debugging and development support; still, they are hacking tools that can be dangerously abused. I fully understand why the developers use these two tools, but they really have no place in a production version of the device.

The hidden microphone

Then I also discovered a mini microphone on the device, one the documentation does not clearly mention. It is a miniature SMD component, 2 x 1 mm in size, which actually allows recording quite high-quality audio. What is additionally worrying is that all the tools for recording are already installed on the device! This makes it possible to connect to the device via SSH (remember, I mentioned at the beginning that the device used default passwords!) and then simply start recording audio with the amixer and arecord tools. The file with the recording can then simply be copied to your own computer. With a little effort, it would of course also be possible to implement streaming the audio over the network, which would allow an attacker to listen in, in real time.

The hidden microphone in the NanoKVM

The microphone could be removed, but to do that the device has to be physically taken apart and the microphone then desoldered from the board. As the picture shows, this is not entirely simple; in addition, you will need a microscope or a magnifying glass to help with the desoldering.

So, to sum up. The device has a pile of security shortcomings, at least initially it used default passwords, it communicates with servers in China, it has hacking tools installed, and it has a built-in microphone with full software support for recording audio, which the documentation does not clearly mention! Can it get any worse?

I am convinced that this is mostly a consequence of extreme carelessness and rushed development rather than malice, but it all still leaves a rather bad aftertaste.

On the other hand, these findings by no means mean that the device is not useful.

Since the design of the device is open, it is of course possible to install your own software on it. One user has thus started porting his own version of Linux to the device (first Debian, now he has switched to Ubuntu), and with a bit of luck this code will soon become the basis for official Ubuntu Linux support on these devices. The manufacturer's modified KVM code already runs on this version of Linux, and in a few months we will probably get completely independent software that will also be considerably more secure. A minor problem is that installing this software requires physically opening the device, taking out the built-in SD card and writing the alternative code to it. But this is really not too complicated. And while at it, we can also desolder the microphone... or hook up a speaker. For a test I used an 8-ohm, 0.5 W speaker, which can play quite decent sound - and so I got a mini music player. :)

The main board with a speaker attached

Finally, it is worth asking how many similar devices with hidden functionality a similar review would turn up in your homes. And not necessarily only of Chinese origin. Are you sure that none of them have miniature microphones or cameras built in?

P. S. To record, connect to the device via SSH and run the following two commands:

  • amixer -Dhw:0 cset name='ADC Capture Volume' 20 (this sets the microphone sensitivity to high)
  • arecord -Dhw:0,0 -d 3 -r 48000 -f S16_LE -t wav test.wav > /dev/null 2>&1 & (this captures the sound to a file named test.wav)

Now speak or sing next to the device (the Chinese national anthem, perhaps?), then press Ctrl-C and copy the test.wav file to your computer, where you can listen to it.

A Signal container

Signal is an application for secure and private messaging that is free, open source and easy to use. It uses strong end-to-end encryption and is used by numerous activists, journalists and whistleblowers, as well as state officials and business people. In short, by everyone who values their privacy. Signal runs on mobile phones with the Android and iOS operating systems, as well as on desktop computers (Linux, Windows, macOS) - with the desktop version designed to be linked with the mobile version of Signal. This lets us use all of Signal's functions both on the phone and on the desktop computer, and all messages, contacts, etc. are synchronized between the two devices. All well and good, but Signal is (unfortunately) tied to a phone number, and as a rule you can run only one copy of Signal on a given phone; the same goes for the desktop computer. Can this limitation be worked around? Certainly, but a small “hack” is needed. What kind, read on.

Running multiple instances of Signal on a phone

Running multiple instances of Signal on a phone is very easy - but only if you use GrapheneOS. GrapheneOS is an operating system for mobile phones with numerous built-in security mechanisms, designed to take the best possible care of the user's privacy. It is open source and highly compatible with Android, but with numerous improvements that make forensic seizure of data, as well as attacks with spyware of the Pegasus and Predator kind, extremely difficult or outright impossible.

GrapheneOS allows the use of multiple profiles (up to 31, plus a so-called guest profile), which are completely separated from one another. This means that you can install different applications in different profiles, keep entirely different contact lists, use one VPN in one profile and a different one (or none at all) in another, and so on.

The solution is therefore simple. On a phone with GrapheneOS we create a new profile, install a new copy of Signal there, insert a second SIM card into the phone, and link Signal to the new number.

Once the phone number is registered, we can remove the SIM card and put the old one back into the phone. Signal uses only data transfer for communication (and of course the phone can also be used without a SIM card, on WiFi alone). We now have two instances of Signal installed on the phone, tied to two different phone numbers, and we can send messages (even between the two of them!) or make calls from both.

Although the profiles are separated, we can arrange for notifications from the Signal application in the second profile to be delivered even while we are logged in to the first profile. Only for writing messages or making calls will we have to switch to the right profile on the phone.

Simple, isn't it?

Running multiple instances of Signal on a computer

Now we would of course like something similar on the computer as well. In short, we would like to be able to run two different instances of Signal on the computer, under a single user (each tied to its own phone number).

Well, at first glance things are slightly more complicated here, but with the help of virtualization the problem can be solved elegantly. Of course we won't run a whole new virtual machine on the computer just for Signal, but we can use a so-called container.

On the Linux operating system, we first install the systemd-container package (on Ubuntu systems it is already installed by default).

On the host computer we enable so-called unprivileged user namespaces: run sudo nano /etc/sysctl.d/nspawn.conf and write the following into the file:

kernel.unprivileged_userns_clone=1

Now the systemd service needs to be restarted:

sudo systemctl daemon-reload
sudo systemctl restart systemd-sysctl.service
sudo systemctl status systemd-sysctl.service

…then we can install debootstrap: sudo apt install debootstrap.

Now we create a new container into which we will install the Debian operating system (the stable release) - in reality, only the minimal required base of the operating system will be installed:

sudo debootstrap --include=systemd,dbus stable /var/lib/machines/debian

We get roughly the following output:

I: Keyring file not available at /usr/share/keyrings/debian-archive-keyring.gpg; switching to https mirror https://deb.debian.org/debian
I: Retrieving InRelease 
I: Retrieving Packages 
I: Validating Packages 
I: Resolving dependencies of required packages...
I: Resolving dependencies of base packages...
I: Checking component main on https://deb.debian.org/debian...
I: Retrieving adduser 3.134
I: Validating adduser 3.134
...
...
...
I: Configuring tasksel-data...
I: Configuring libc-bin...
I: Configuring ca-certificates...
I: Base system installed successfully.

The container with the Debian operating system is now installed. So we start it and set the root user's password:

sudo systemd-nspawn -D /var/lib/machines/debian -U --machine debian

We get the output:

Spawning container debian on /var/lib/machines/debian.
Press Ctrl-] three times within 1s to kill container.
Selected user namespace base 1766326272 and range 65536.
root@debian:~#

Now we connect to the operating system through the virtual terminal and enter the following two commands:

passwd
printf 'pts/0\npts/1\n' >> /etc/securetty 

With the first command we set the password, and the second one enables logging in via a so-called local terminal (TTY). At the end we type the logout command and log out, back to the host computer.

Now we need to set up the network that the container will use. The simplest approach is to just use the host computer's network. We enter the following two commands:

sudo mkdir /etc/systemd/nspawn
sudo nano /etc/systemd/nspawn/debian.nspawn

Into this file we enter:

[Network]
VirtualEthernet=no
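Beyond disabling the virtual Ethernet pair, the same .nspawn file can carry other per-container settings from systemd.nspawn(5); a sketch with illustrative values (the bind path is just an example, not something this setup requires):

```ini
[Exec]
# boot the container's own init rather than running a single command
Boot=yes

[Network]
VirtualEthernet=no

[Files]
# example read-only bind mount from the host; the path is illustrative
BindReadOnly=/tmp/.X11-unix
```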

Now we start the container again with the command sudo systemctl start systemd-nspawn@debian, or even more simply - machinectl start debian.

We can also list the running containers:

machinectl list
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES
debian  container systemd-nspawn debian 12      -        

1 machines listed.

Or we connect to this virtual container: machinectl login debian. We get the output:

Connected to machine debian. Press ^] three times within 1s to exit session.

Debian GNU/Linux 12 cryptopia pts/1

cryptopia login: root
Password: 

The output shows that we logged in as the root user, with the password we set earlier.

Now we install Signal Desktop in this container.

apt update
apt install wget gpg

wget -O- https://updates.signal.org/desktop/apt/keys.asc | gpg --dearmor > /usr/share/keyrings/signal-desktop-keyring.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/signal-desktop-keyring.gpg] https://updates.signal.org/desktop/apt xenial main' | tee /etc/apt/sources.list.d/signal-xenial.list

apt update
apt install --no-install-recommends signal-desktop
halt

With the last command we shut the container down. A fresh copy of the Signal Desktop application is now installed in it.

By the way, if we want, we can rename the container to a friendlier name, e.g. sudo machinectl rename debian debian-signal. Of course, we will then have to use that same name when starting the container (so, machinectl login debian-signal).

Now we create a script that will start the container and launch Signal Desktop in it in such a way that its window shows up on the host computer's desktop:

We create the file with nano /opt/runContainerSignal.sh (saving it, for example, in the /opt directory), with the following contents:

#!/bin/sh
xhost +local:
pkexec systemd-nspawn --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/  \
                      --private-users=pick \
                      --private-users-chown \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox
xhost -local:

With the first xhost command we allow connections to our display, but only from the local machine; the second xhost command blocks these connections (to the display) again. We make the script executable (chmod +x /opt/runContainerSignal.sh), and that's it.

Two Signal Desktop application icons

Well, not quite yet, since we would have to run the script in a terminal; launching it with a click on an icon is much more convenient.

So let's create a .desktop file: nano ~/.local/share/applications/runContainerSignal.desktop. We write the following content into it:

[Desktop Entry]
Type=Application
Name=Signal Container
Exec=/opt/runContainerSignal.sh
Icon=security-high
Terminal=false
Comment=Run Signal Container

…instead of the security-high icon, we can use a different one, for example:

Icon=/usr/share/icons/Yaru/scalable/status/security-high-symbolic.svg

Note: the file is stored in ~/.local/share/applications/, so it is accessible only to this specific user and not to all users on the computer.

Now we make the .desktop file executable: chmod +x ~/.local/share/applications/runContainerSignal.desktop

We refresh the so-called desktop entries: update-desktop-database ~/.local/share/applications/, and that's it!

Two instances of the Signal Desktop application

When we type “Signal Container” into the application search, the application's icon will appear, and clicking on it launches Signal in the container (though a password has to be entered to start it).

Now we just link this Signal Desktop with the copy of Signal on the phone, and we can use two copies of the Signal Desktop application on the computer.

What about…?

Unfortunately, in the setup described, access to the camera and audio does not work. So we will still have to make calls from the phone.

It turns out that connecting the container to the host's PipeWire audio system and camera is incredibly complicated (at least in my system setup). If you have a hint on how to solve this, do let me know. :)

How we handle debconf questions during our Ubuntu installs

By: cks

In a comment on How we automate installing extra packages during Ubuntu installs, David Magda asked how we dealt with the things that need debconf answers. This is a good question and we have two approaches that we use in combination. First, we have a prepared file of debconf selections for each Ubuntu version and we feed this into debconf-set-selections before we start installing packages. However in practice this file doesn't have much in it and we rarely remember to update it (and as a result, a bunch of it is somewhat obsolete). We generally only update this file if we discover debconf selections where the default doesn't work in our environment.
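For illustration, such a selections file is just lines of "package question type value" that get piped into debconf-set-selections before the installs start; the entries below are hypothetical examples of the format, not our actual file contents:

```
# hypothetical examples of pre-answered debconf questions
postfix  postfix/main_mailer_type          select   No configuration
libc6    libraries/restart-without-asking  boolean  true
```

This would be loaded with something like 'debconf-set-selections < our-selections' as part of the install process.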

Second, we run apt-get with a bunch of environment variables set to muzzle debconf:

export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=<null address>@<our domain>
export DEBIAN_FRONTEND=noninteractive

Traditionally I've considered muzzling debconf this way to be too dangerous to do during package updates or installing packages by hand. However, I consider it not so much safe as safe enough to do this during our standard install process. To put it one way, we're not starting out with a working system and potentially breaking it by letting some new or updated package pick bad defaults. Instead we're starting with a non-working system and hopefully ending up with a working one. If some package picks bad defaults and we wind up with problems, that's not much worse than we started out with and we'll fix it by updating our file of debconf selections and then redoing the install.
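Putting the pieces together, the muzzled install step amounts to something like this sketch (the DEBCONF_ADMIN_EMAIL value here is a placeholder, not our real null address, and the actual apt-get line is left commented out):

```shell
#!/bin/sh
# Sketch: run apt-get with debconf silenced, as in our standard install process.
export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=devnull@example.org   # placeholder null address
export DEBIAN_FRONTEND=noninteractive
# apt-get -qq -y install $pkgs   # the real install step; not run in this sketch
```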

Also, in practice all of this gets worked out during our initial test installs of any new Ubuntu version (done on test virtual machines these days). By the time we're ready to start installing real servers with a new Ubuntu version, we've gone through most of the discovery process for debconf questions. Then the only time we're going to have problems during future system installs is if a package update either changes the default answer for a current question (to a bad one) or adds a new question with a bad default. As far as I can remember, we haven't had either happen.

(Some of our servers need additional packages installed, which we do by hand (as mentioned), and sometimes the packages will insist on stopping to ask us questions or give us warnings. This is annoying, but so far not annoying enough to fix it by augmenting our standard debconf selections to deal with it.)

How we automate installing extra packages during Ubuntu installs

By: cks

We have a local system for installing Ubuntu machines, and one of the important things it does is install various additional Ubuntu packages that we want as part of our standard installs. These days we have two sorts of standard installs, a 'base' set of packages that everything gets and a broader set of packages that login servers and compute servers get (to make them more useful and usable by people). Specialized machines need additional packages, and while we can automate installation of those too, they're generally a small enough set of packages that we document them in our install instructions for each machine and install them by hand.

There are probably clever ways to do bulk installs of Ubuntu packages, but if so, we don't use them. Our approach is instead a brute force one. We have files that contain lists of packages, such as a 'base' file, and these files just contain a list of packages with optional comments:

# Partial example of Basic package set
amanda-client
curl
jq
[...]

# decodes kernel MCE/machine check events
rasdaemon

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev devscripts automake 

(Like all of the rest of our configuration information, these package set files live in our central administrative filesystem. You could distribute them in some other way, for example fetching them with rsync or even HTTP.)

To install these packages, we use grep to extract the actual packages into a big list and feed the big list to apt-get. This is more or less:

pkgs=$(cat $PKGDIR/$s | grep -v '^#' | grep -v '^[ \t]*$')
apt-get -qq -y install $pkgs

(This will abort if any of the packages we list aren't available. We consider this a feature, because it means we have an error in the list of packages.)
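The extraction step can be exercised on a throwaway file; this sketch uses a made-up temporary path and a trimmed package list:

```shell
#!/bin/sh
# Build a sample package-set file and run the same extraction pipeline on it.
PKGDIR=$(mktemp -d)
s=base
cat > "$PKGDIR/$s" <<'EOF'
# Partial example of Basic package set
amanda-client
curl
jq

# decodes kernel MCE/machine check events
rasdaemon
build-essential fakeroot
EOF

# Same pipeline as above: drop comment lines and blank lines, keep the rest.
pkgs=$(cat $PKGDIR/$s | grep -v '^#' | grep -v '^[ \t]*$')
echo $pkgs   # prints: amanda-client curl jq rasdaemon build-essential fakeroot
```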

A more organized and minimal approach might be to add the '--no-install-recommends' option, but we started without it and we don't particularly want to go back to find which recommended packages we'd have to explicitly add to our package lists.

At least some of the 'base' package installs could be done during the initial system install process from our customized Ubuntu server ISO image, since you can specify additional packages to install. However, doing package installs that way would create a series of issues in practice. We'd probably need to track more carefully which package came from which Ubuntu collection, since only some of the collections are enabled during the server install process; it would be harder to update the lists; and the tools for handling the whole process would be a lot more limited, as would our ability to troubleshoot any problems.

Doing this additional package install in our 'postinstall' process means that we're doing it in a full Unix environment where we have all of the standard Unix tools, and we can easily look around the system if and when there's a problem. Generally we've found that the more of our installs we can defer to once the system is running normally, the better.

(Also, the less the Ubuntu installer does, the faster it finishes and the sooner we can get back to our desks.)

(This entry was inspired by parts of a blog post I read recently and reflecting about how we've made setting up new versions of machines pretty easy, assuming our core infrastructure is there.)

The mystery (to me) of tiny font sizes in KDE programs I run

By: cks

Over on the Fediverse I tried a KDE program and ran into a common issue for me:

It has been '0' days since a KDE app started up with too-small fonts on my bespoke fvwm based desktop, and had no text zoom. I guess I will go use a browser, at least I can zoom fonts there.

Maybe I could find a KDE settings thing and maybe find where and why KDE does this (it doesn't happen in GNOME apps), but honestly it's simpler to give up on KDE based programs and find other choices.

(The specific KDE program I was trying to use this time was NeoChat.)

My fvwm based desktop environment has an XSettings daemon running, which I use in part to set up a proper HiDPI environment (also, which doesn't talk about KDE fonts because I never figured that out). I suspect that my HiDPI display is part of why KDE programs often or always seem to pick tiny fonts, but I don't particularly know why. Based on the xsettingsd documentation and the registry, there doesn't seem to be any KDE specific font settings, and I'm setting the Gtk/FontName setting to a font that KDE doesn't seem to be using (which I could only verify once I found a way to see the font I was specifying).

After some searching I found the systemsettings program through the Arch wiki's page on KDE and was able to turn up its font sizes in a way that appears to be durable (ie, it stays after I stop and start systemsettings). However, this hasn't affected the fonts I see in NeoChat when I run it again. There are a bunch of font settings, but maybe NeoChat is using the 'small' font for some reason (apparently which app uses what font setting can be variable).

QT (the underlying GUI toolkit of much or all of KDE) has its own set of environment variables for scaling things on HiDPI displays, and setting $QT_SCALE_FACTOR does size up NeoChat (although apparently bits of Plasma ignore these, though I think I'm unlikely to run into this since I don't want to use KDE's desktop components).
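In principle this can be wrapped in a tiny launcher script; here is a sketch (the script idea, the factor of 2, and NeoChat as the target are my assumptions, not a documented KDE mechanism, though QT_SCALE_FACTOR itself is a real Qt environment variable):

```shell
#!/bin/sh
# Hypothetical wrapper (e.g. saved as ~/bin/neochat-scaled): force Qt's scale
# factor before launching a Qt/KDE program. 2 is just a plausible HiDPI value;
# fractional values like 1.5 are also accepted by recent Qt.
QT_SCALE_FACTOR=2
export QT_SCALE_FACTOR
# exec neochat "$@"   # the real wrapper would end by exec'ing the program
```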

Some KDE applications have their own settings files with their own font sizes; one example I know of is kdiff3. This is quite helpful because if I'm determined enough, I can either adjust the font sizes in the program's settings or at least go edit the configuration file (in this case, .config/kdiff3rc, I think, not .kde/share/config/kdiff3rc). However, not all KDE applications allow you to change font sizes through either their GUI or a settings file, and NeoChat appears to be one of the ones that don't.

In theory now that I've done all of this research I could resize NeoChat and perhaps other KDE applications through $QT_SCALE_FACTOR. In practice I feel I would rather switch to applications that interoperate better with the rest of my environment unless for some reason the KDE application is either my only choice or the significantly superior one (as it has been so far for kdiff3 for my usage).

Using Netplan to set up WireGuard on Ubuntu 22.04 works, but has warts

By: cks

For reasons outside the scope of this entry, I recently needed to set up WireGuard on an Ubuntu 22.04 machine. When I did this before for an IPv6 gateway, I used systemd-networkd directly. This time around I wasn't going to set up a single peer and stop; I expected to iterate and add peers several times, which made netplan's ability to update and re-do your network configuration look attractive. Also, our machines are already using Netplan for their basic network configuration, so this would spare my co-workers from having to learn about systemd-networkd.

Conveniently, Netplan supports multiple configuration files so you can put your WireGuard configuration into a new .yaml file in your /etc/netplan. The basic version of a WireGuard endpoint with purely internal WireGuard IPs is straightforward:

network:
  version: 2
  tunnels:
    our-wg0:
      mode: wireguard
      addresses: [ 192.168.X.1/24 ]
      port: 51820
      key:
        private: '....'
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820

(You may want something larger than a /24 depending on how many other machines you think you'll be talking to. Also, this configuration doesn't enable IP forwarding, which is a feature in our particular situation.)

If you're using netplan's systemd-networkd backend, which you probably are on an Ubuntu server, you can apparently put your keys into files instead of needing to carefully guard the permissions of your WireGuard /etc/netplan file (which normally has your private key in it).

If you write this out and run 'netplan try' or 'netplan apply', it will duly apply all of the configuration and bring your 'our-wg0' WireGuard configuration up as you expect. The problems emerge when you change this configuration, perhaps to add another peer, and then re-do your 'netplan try', because when you look you'll find that your new peer hasn't been added. This is a sign of a general issue; as far as I can tell, netplan (at least in Ubuntu 22.04) can set up WireGuard devices from scratch but it can't update anything about their WireGuard configuration once they're created. This is probably a limitation in the Ubuntu 22.04 version of systemd-networkd that's only changed in the very latest systemd versions. In order to make WireGuard level changes, you need to remove the device, for example with 'ip link del dev our-wg0' and then re-run 'netplan try' (or 'netplan apply') to re-create the WireGuard device from scratch; the recreated version will include all of your changes.
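For concreteness, adding a second peer is just another list entry in the same .yaml file; the keys and addresses below are placeholders matching the example above:

```yaml
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.11/32 ]
          keepalive: 90
          endpoint: A.B.C.E:51820
```

After saving this, 'netplan try' alone won't make the new peer appear on an existing device; you have to delete the device and re-run it, as described above.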

(The latest online systemd.netdev manual page says that systemd-networkd will try to update netdev configurations if they change, and .netdev files are where WireGuard settings go. The best information I can find is that this change appeared in systemd v257, although the Fedora 41 systemd.netdev manual page has this same wording and it has systemd '256.11'. Maybe there was a backport into Fedora.)

In our specific situation, deleting and recreating the WireGuard device is harmless and we're not going to be doing it very often anyway. In other configurations things may not be so straightforward and so you may need to resort to other means to apply updates to your WireGuard configuration (including working directly through the 'wg' tool).

I'm not impressed by the state of NFS v4 in the Linux kernel

By: cks

Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 due to running into a series of problems our people were experiencing with NFS (v3) locks (cf); since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.

(NFS v4 locks are handled relatively differently than NFS v3 locks.)

Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).

(The NFS v4 server issue we encountered may be the one fixed by this commit.)

What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).

We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.

PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses, but the client adopts whatever operations are convenient and useful for it, not necessarily in order of NFS v4 revision. If the major use of Linux NFS v4 servers is with v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).

The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)

By: cks

Over on the Fediverse I said:

This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.

When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats I can provide a better answer of what's missing. All of this applies to node_exporter as of version 1.8.2 (the current one as I write this).

(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)

Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.

In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in nfs/parse.go. For the server case, if we cross-compare this to the kernel's include/linux/nfs4.h, what's missing from server stats is all NFS v4.1, v4.2, and RFC 8276 xattr operations, everything from operation 40 through operation 75 (as I write this).

Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone', and is missing from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):

Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus

Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.

If you want numbers for scale, at the moment node_exporter reports on 50 out of 69 NFS v4 client operations, and is missing 36 NFS v4 server operations (reporting on what I believe is 36 out of 72). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, what client operations the server stats are missing.
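The cross-check involved is essentially a set difference between the kernel's operation list and the parser's. A small Python sketch of the idea; both lists here are abbreviated stand-ins for illustration, not the real 69- and 76-entry tables:

```python
# Diff a stats parser's known NFS v4 client operation names against the
# kernel's list. These lists are invented, abbreviated examples.
kernel_client_ops = ["Null", "Read", "Write", "Lookupp", "Copy", "Getxattr"]
parser_client_ops = ["Null", "Read", "Write"]

# Operations the kernel reports that the parser silently drops.
missing = [op for op in kernel_client_ops if op not in parser_client_ops]
print(missing)  # ['Lookupp', 'Copy', 'Getxattr']
```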

(I haven't made a bug report about this (yet) and may not do so, because doing so would require making my Github account operable again, something I'm sort of annoyed by. Github's choice to require me to have MFA to make bug reports is not the incentive they think it is.)

Linux kernel NFSv4 server and client RPC operation statistics

By: cks

NFS servers and clients communicate using RPC, sending various NFS v3, v4, and possibly v2 (but we hope not) RPC operations to the server and getting replies. On Linux, the kernel exports statistics about these NFS RPC operations in various places, with a global summary in /proc/net/rpc/nfsd (for the NFS server side) and /proc/net/rpc/nfs (for the client side). Various tools will extract this information and convert it into things like metrics, or present it on the fly (for example, nfsstat(8)). However, as far as I know what is in those files and especially how RPC operations are reported is not well documented, and also confusing, which is a problem if you discover that something has an incomplete knowledge of NFSv4 RPC stats.

For a general discussion of /proc/net/rpc/nfsd, see Svenn D'Hert's nfsd stats explained article. I'm focusing on NFSv4, which is to say the 'proc4ops' line. This line is produced in nfsd_show in fs/nfsd/stats.c. The line starts with a count of how many operations there are, such as 'proc4ops 76', and then has one number for each operation. What are the operations and how many of them are there? That's more or less found in the nfs_opnum4 enum in include/linux/nfs4.h. You'll notice that there are some gaps in the operation numbers; for example, there's no 0, 1, or 2. Despite there being no such actual NFS v4 operations, 'proc4ops' starts with three 0s for them, because it works with an array indexed by nfs_opnum4, and like all C arrays that array starts at 0.

(The counts of other, real NFS v4 operations may be 0 because they're never done in your environment.)
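The format is simple enough to parse directly. Here's a minimal Python sketch; the sample line is invented and far shorter than a real 'proc4ops 76' line:

```python
# Sketch: parse the 'proc4ops' line from /proc/net/rpc/nfsd.
# The sample input is invented for illustration; real lines are longer.
def parse_proc4ops(line):
    fields = line.split()
    assert fields[0] == "proc4ops"
    count = int(fields[1])
    # One counter per slot, indexed by nfs_opnum4; slots 0-2 are always
    # zero because there are no NFS v4 operations numbered 0, 1, or 2.
    return [int(x) for x in fields[2:2 + count]]

sample = "proc4ops 6 0 0 0 17 4 25"
ops = parse_proc4ops(sample)
print(ops[3])  # counter for operation number 3 (OP_ACCESS); prints 17
```

The client 'proc4' line in /proc/net/rpc/nfs has the same "count, then one number per operation" shape, but its slots follow the NFSPROC4_CLNT_* ordering instead of nfs_opnum4.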

For NFS v4 client operations, we look at the 'proc4' line in /proc/net/rpc/nfs. Like the server's 'proc4ops' line, it starts with a count of how many operations are being reported on, such as 'proc4 69', and then a count for each operation. Unfortunately for us and everyone else, these operations are not numbered the same as the NFS server operations. Instead the numbering is given in an anonymous and unnumbered enum in include/linux/nfs4.h that starts with 'NFSPROC4_CLNT_NULL = 0,' (as a spoiler, the 'null' operation is not unused, contrary to the include file's comment). The actual generation and output of /proc/net/rpc/nfs is done in rpc_proc_show in net/sunrpc/stats.c. The whole structure this code uses is set up in fs/nfs/nfs4xdr.c, and while there is a confusing level of indirection, I believe the structure corresponds directly with the NFSPROC4_CLNT_* enum values.

What I think is going on is that Linux has decided to optimize its NFSv4 client statistics to only include the NFS v4 operations that it actually uses, rather than take up a bit of extra memory to include all of the NFS v4 operations, including ones that will always have a '0' count. Because the Linux NFS v4 client started using different NFSv4 operations at different times, some of these operations (such as 'lookupp') are out of order; when the NFS v4 client started using them, they had to be added at the end of the 'proc4' line to preserve backward compatibility with existing programs that read /proc/net/rpc/nfs.

PS: As far as I can tell from a quick look at fs/nfs/nfs3xdr.c, include/uapi/linux/nfs3.h, and net/sunrpc/stats.c, the NFS v3 server and client stats cover all of the NFS v3 operations and are in the same order, the order of the NFS v3 operation numbers.

How Ubuntu 24.04's bad bpftrace package appears to have happened

By: cks

When I wrote about Ubuntu 24.04's completely broken bpftrace '0.20.2-1ubuntu4.2' package (which is now no longer available as an Ubuntu update), I said it was a disturbing mystery how a theoretical 24.04 bpftrace binary was built in such a way that it depended on a shared library that didn't exist in 24.04. Thanks to the discussion in bpftrace bug #2097317, we have somewhat of an answer, which in part shows some of the challenges of building software at scale.

The short version is that the broken bpftrace package wasn't built in a standard Ubuntu 24.04 environment that only had released packages. Instead, it was built in a '24.04' environment that included (some?) proposed updates, and one of the included proposed updates was an updated version of libllvm18 that had the new shared library. Apparently there are mechanisms that should have acted to make the new bpftrace depend on the new libllvm18 if everything went right, but some things didn't go right and the new bpftrace package didn't pick up that dependency.

On the one hand, if you're planning interconnected package updates, it's a good idea to make sure that they work with each other, which means you may want to mingle in some proposed updates into some of your build environments. On the other hand, if you allow your build environments to be contaminated with non-public packages this way, you really, really need to make sure that the dependencies work out. If you don't and packages become public in the wrong order, you get Ubuntu 24.04's result.

(While the RPM build process and package format would have avoided this specific problem, I'm pretty sure that there are similar ways to make it go wrong.)

Contaminating your build environment this way also makes testing your newly built packages harder. The built bpftrace binary would have run inside the build environment, because the build environment had the right shared library from the proposed libllvm18. To see the failure, you would have to run tests (including running the built binary) in a 'pure' 24.04 environment that had only publicly released package updates. This would require an extra package test step; I'm not clear if Ubuntu has this as part of their automated testing of proposed updates (there's some hints in the discussion that they do but that these tests were limited and didn't try to run the binary).

An alarmingly bad official Ubuntu 24.04 bpftrace binary package

By: cks

Bpftrace is a more or less official part of Ubuntu; it's even in the Ubuntu 24.04 'main' repository, as opposed to one of the less supported ones. So I'll present things in the traditional illustrated form (slightly edited for line length reasons):

$ bpftrace
bpftrace: error while loading shared libraries: libLLVM-18.so.18.1: cannot open shared object file: No such file or directory
$ readelf -d /usr/bin/bpftrace | grep libLLVM
 0x0...01 (NEEDED)  Shared library: [libLLVM-18.so.18.1]
$ dpkg -L libllvm18 | grep libLLVM
/usr/lib/llvm-18/lib/libLLVM.so.1
/usr/lib/llvm-18/lib/libLLVM.so.18.1
/usr/lib/x86_64-linux-gnu/libLLVM-18.so
/usr/lib/x86_64-linux-gnu/libLLVM.so.18.1
$ dpkg -l bpftrace libllvm18
[...]
ii  bpftrace       0.20.2-1ubuntu4.2 amd64 [...]
ii  libllvm18:amd64 1:18.1.3-1ubuntu1 amd64 [...]

I originally mis-diagnosed this as a libllvm18 packaging failure, but this is in fact worse. Based on trawling through packages.ubuntu.com, only Ubuntu 24.10 and later have a 'libLLVM-18.so.18.1' in any package; in Ubuntu 24.04, the correct name for this is 'libLLVM.so.18.1'. If you rebuild the bpftrace source .deb on a genuine 24.04 machine, you get a bpftrace build (and binary .deb) that does correctly use 'libLLVM.so.18.1' instead of 'libLLVM-18.so.18.1'.
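If you want to script this sort of check across many binaries, the NEEDED entries can be pulled out of 'readelf -d' output with a simple pattern match. A Python sketch, using an invented sample of readelf output:

```python
import re

# Invented, abbreviated sample of 'readelf -d' output; a real dynamic
# section has many more entries.
SAMPLE = """\
Dynamic section at offset 0x3d1c8 contains 30 entries:
 0x0000000000000001 (NEEDED)             Shared library: [libLLVM-18.so.18.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
"""

def needed_libs(readelf_output):
    # Collect the names inside '(NEEDED) ... [libfoo.so.N]' entries.
    return re.findall(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]",
                      readelf_output)

print(needed_libs(SAMPLE))  # ['libLLVM-18.so.18.1', 'libc.so.6']
```

Each name could then be checked against what the distribution's packages actually ship (for example, against the output of 'dpkg -S' or the ldconfig cache).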

As far as I can see, there are two things that could have happened here. The first is that Canonical simply built a 24.10 (or later) bpftrace binary .deb and put it in 24.04 without bothering to check if the result actually worked. I would like to say that this shows shocking disregard for the functioning of an increasingly important observability tool from Canonical, but actually it's not shocking at all, it's Canonical being Canonical (and they would like us to pay for this for some reason). The second and worse option is that Canonical is building 'Ubuntu 24.04' packages in an environment that is contaminated with 24.10 or later packages, shared libraries, and so on. This isn't supposed to happen in a properly operating package building environment that intends to create reliable and reproducible results and casts doubt on the provenance and reliability of all Ubuntu 24.04 packages.

(I don't know if there's a way to inspect binary .debs to determine anything about the environment they were built in, the way you can get some information about RPMs. Also, I now have a new appreciation for Fedora putting the Fedora release version into the actual RPM's 'release' name. Ubuntu 24.10 and 24.04 don't have the same version of bpftrace, so this isn't quite as simple as Canonical copying the 24.10 package to 24.04; 24.10 has 0.21.2, while 24.04 is theoretically 0.20.2.)

Incidentally, this isn't an issue of the shared library having its name changed, because if you manually create a 'libLLVM-18.so.18.1' symbolic link to the 24.04 libllvm18's 'libLLVM.so.18.1' and run bpftrace, what you get is:

$ bpftrace
: CommandLine Error: Option 'debug-counter' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
abort

This appears to say that the Ubuntu 24.04 bpftrace binary is incompatible with the Ubuntu 24.04 libllvm18 shared libraries. I suspect that it was built against different LLVM 18 headers as well as different LLVM 18 shared libraries.

The (potential) complexity of good runqueue latency measurement in Linux

By: cks

Run queue latency is the time between when a Linux task becomes ready to run and when it actually runs. If you want good responsiveness, you want a low runqueue latency, so for a while I've been tracking a histogram of it with eBPF, and I put some graphs of it up on some Grafana dashboards I look at. Then recently I improved the responsiveness of my desktop with the cgroup V2 'cpu.idle' setting, and questions came up about how this differs from process niceness. When I was looking at those questions, I realized that my run queue latency measurements were incomplete.

When I first set up my run queue latency tracking, I wasn't using either cgroup V2 cpu.idle or process niceness, and so I set up a single global runqueue latency histogram for all tasks regardless of their priority and scheduling class. Once I started using 'idle' CPU scheduling (and testing the effectiveness of niceness), this resulted in hopelessly muddled data that was effectively meaningless while multiple types of scheduling or multiple nicenesses were in use. Running CPU-consuming processes only when the system is otherwise idle is (hopefully) good for the runqueue latency of my regular desktop processes, but more terrible than usual for those 'run only when idle' processes, and generally there's going to be a lot more of them than my desktop processes.

The moment you introduce more than one 'class' of processes for scheduling, you need to split run queue latency measurements up between these classes if you want to really make sense of the results. What these classes are will depend on your environment. I could probably get away with a class for 'cpu.idle' tasks, a class for heavily nice'd tasks, a class for regular tasks, and perhaps a class for (system) processes running with very high priority. If you're doing fair share scheduling between logins, you might need a class per login (or you could ignore run queue latency as too noisy a measure).
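For the niceness-based classes, the bucketing itself is trivial; a minimal Python sketch, with invented thresholds (note that the 'cpu.idle' class can't be told apart this way, since it's a property of the cgroup and scheduling policy rather than of niceness):

```python
import os

# Bucket a task into a latency-measurement class by its niceness.
# The thresholds here are invented for illustration.
def latency_class(nice):
    if nice >= 10:
        return "heavily-niced"
    if nice < 0:
        return "high-priority"
    return "regular"

# Classify the current process by its own niceness.
print(latency_class(os.getpriority(os.PRIO_PROCESS, 0)))
```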

I'm not sure I'd actually track all of my classes as Prometheus metrics. For my personal purposes, I don't care very much about the run queue latency of 'idle' or heavily nice'd processes, so perhaps I should update my personal metrics gathering to just ignore those. Alternately, I could write a bpftrace script that gathered the detailed class by class data, run it by hand when I was curious, and ignore the issue otherwise (continuing with my 'global' run queue latency histogram, which is at least honest in general).

The issue with DNF 5 and script output in Fedora 41

By: cks

These days Fedora uses DNF as its high(er) level package management software, replacing yum. However, there are multiple versions of DNF, which behave somewhat differently. Through Fedora 40, the default version of DNF was DNF 4; in Fedora 41, DNF is now DNF 5. DNF 5 brings a number of improvements but it has at least one issue that makes me unhappy with it in my specific situation. Over on the Fediverse I said:

Oh nice, DNF 5 in Fedora 41 has nicely improved the handling of output from RPM scriptlets, so that you can more easily see that it's scriptlet output instead of DNF messages.

[later]

I must retract my praise for DNF 5 in Fedora 41, because it has actually made the handling of output from RPM scriptlets *much* worse than in dnf 4. DNF 5 will repeatedly re-print the current output to date of scriptlets every time it updates a progress indicator of, for example, removing packages. This results in a flood of output for DKMS module builds during kernel updates. Dnf 5's cure is far worse than the disease, and there's no way to disable it.

<bugzilla 2331691>

(Fedora 41 specifically has dnf5-5.2.8.1, at least at the moment.)

This can be mostly worked around for kernel package upgrades and DKMS modules by manually removing and upgrading packages before the main kernel upgrade. You want to do this so that dnf is removing as few packages as possible while your DKMS modules are rebuilding. This is done with:

  1. Upgrade all of your non-kernel packages first:

    dnf upgrade --exclude 'kernel*'
    

  2. Remove the following packages for the old kernel:

    kernel kernel-core kernel-devel kernel-modules kernel-modules-core kernel-modules-extra

    (It's probably easier to do 'dnf remove kernel*<version>*' and let DNF sort it out.)

  3. Upgrade two kernel packages that you can do in advance:

    dnf upgrade kernel-tools kernel-tools-libs
    

Unfortunately in Fedora 41 this still leaves you with one RPM package that you can't upgrade in advance and that will be removed while your DKMS module is rebuilding, namely 'kernel-devel-matched'. To add extra annoyance, this is a virtual package that contains no files, and you can't remove it because a lot of things depend on it.

As far as I can tell, DNF 5 has absolutely no way to shut off its progress bars. It completely ignores $TERM and I can't see anything else that leaves DNF usable. It would have been nice to have some command line switches to control this, but it seems pretty clear that this wasn't high on the DNF 5 road map.

(Although I don't expect this to be fixed in Fedora 41 over its lifetime, I am still deferring the Fedora 41 upgrades of my work and home desktops for as long as possible to minimize the amount of DNF 5 irritation I have to deal with.)

WireGuard's AllowedIPs aren't always the (WireGuard) routes you want

By: cks

A while back I wrote about understanding WireGuard's AllowedIPs, and also recently I wrote about how different sorts of WireGuard setups have different difficulties, where one of the challenges for some setups is setting up what you want routed through WireGuard connections. As Ian Z aka nobrowser recently noted in a comment on the first entry, these days many WireGuard related programs (such as wg-quick and NetworkManager) will automatically set routes for you based on AllowedIPs. Much of the time this will work fine, but there are situations where adding routes for all AllowedIPs ranges isn't what you want.

WireGuard's AllowedIPs setting for a particular peer controls two things at once: what (inside-WireGuard) source IP addresses you will accept from the peer, and what destination addresses WireGuard will send to that peer if the packet is sent to that WireGuard interface. However, it's the routing table that controls what destination addresses are sent to a particular WireGuard interface (or more likely a combination of IP policy routing rules and some routing table).
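On the destination side, this matching behaves like a small routing table of its own: WireGuard sends a packet to the peer whose AllowedIPs contains the destination, preferring the most specific prefix. A Python sketch of that selection logic, with invented peer names and prefixes:

```python
import ipaddress

# Hypothetical peers and their AllowedIPs; peer-b is a catch-all
# gateway that also owns a more specific subnet.
peers = {
    "peer-a": [ipaddress.ip_network("10.10.0.0/16")],
    "peer-b": [ipaddress.ip_network("10.10.9.0/24"),
               ipaddress.ip_network("0.0.0.0/0")],
}

def peer_for(dest):
    # Longest-prefix match across all peers' AllowedIPs.
    dest = ipaddress.ip_address(dest)
    best = None
    for name, nets in peers.items():
        for net in nets:
            if dest in net and (best is None or net.prefixlen > best[1]):
                best = (name, net.prefixlen)
    return best[0] if best else None

print(peer_for("10.10.9.5"))   # most specific match wins: peer-b
print(peer_for("10.10.20.7"))  # peer-a
print(peer_for("192.0.2.1"))   # only the 0.0.0.0/0 catch-all: peer-b
```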

If your WireGuard IP address is only reachable from other WireGuard peers, you can sensibly bound your AllowedIPs so that the collection of all of them matches the routing table. This is also more or less doable if some of them are gateways for additional networks; hopefully your network design puts all of those networks under some subnet and the subnet isn't too big. However, if your WireGuard IP can wind up being reached by a broader range of source IPs, or even 'all of the Internet' (as is my case), then your AllowedIPs range is potentially much larger than what you want to always be routed to WireGuard.

A related case is if you have a 'work VPN' WireGuard configuration where you could route all of your traffic through your WireGuard connection but some of the time you only want to route traffic to specific (work) subnets. Unless you like changing AllowedIPs all of the time or constructing two different WireGuard interfaces and only activating the correct one, you'll want an AllowedIPs that accepts everything but some of the time you'll only route specific networks to the WireGuard interface.

(On the other hand, with the state of things in Linux, having two separate WireGuard interfaces might be the easiest way to manage this in NetworkManager or other tools.)

I think that most people's use of WireGuard will probably involve AllowedIPs settings that also work for routing, provided that the tools involved handle the recursive routing problem. These days, NetworkManager handles that for you, although I don't know about wg-quick.

(This is one of the entries that I write partly to work it out in my own head. My own configuration requires a different AllowedIPs than the routes I send through the WireGuard tunnel. I make this work with policy based routing.)

Cgroup V2 memory limits and their potential for thrashing

By: cks

Recently I read 32 MiB Working Sets on a 64 GiB machine (via), which recounts how under some situations, Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.

(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)

Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit. What happens next depends on what sort of memory pages are being reclaimed from the process. If they are backed by files (for example, they're pages from the program, shared libraries, or memory mapped files), they will be dropped from the process's resident set but may stay in memory so it's only a soft page fault when they're next accessed. However, if they're anonymous pages of memory the process has allocated, they must be written to swap (if there's room for them) and I don't know if the original pages stay in memory afterward (and so are eligible for a soft page fault when next accessed). If the process keeps accessing anonymous pages that were previously reclaimed, it will thrash on either soft or hard page faults.

(The memory.high limit is set by systemd's MemoryHigh=.)

However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts for RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sort of memory that allocates. Some of this memory can also thrash like user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).

Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit and will kill programs in the cgroup if they push hard on going over it; however, the memory system will try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right. However, in our environments cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap available and exhausting the swap space too.

My view is that this generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', ie either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
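If you do set up the monitoring the kernel documentation suggests, note that memory.max and memory.high each read back either the literal string 'max' (no limit) or a byte count, so anything reading them has to handle both forms. A small Python sketch:

```python
# Parse a cgroup v2 memory.max / memory.high value: the file contains
# either "max" (no limit set) or a byte count.
def parse_cgroup_mem(value):
    value = value.strip()
    return None if value == "max" else int(value)

print(parse_cgroup_mem("max"))        # None (no limit set)
print(parse_cgroup_mem("536870912"))  # 536870912 (a 512 MiB limit)
```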

A gotcha with importing ZFS pools and NFS exports on Linux (as of ZFS 2.3.0)

By: cks

Ever since its Solaris origins, ZFS has supported automatic NFS and CIFS sharing of ZFS filesystems through their 'sharenfs' and 'sharesmb' properties. Part of the idea of this is that you could automatically have NFS (and SMB) shares created and removed as you did things like import and export pools, rather than have to maintain a separate set of export information and keep it in sync with what ZFS filesystems were available. On Linux, OpenZFS still supports this, working through standard Linux NFS export permissions (which don't quite match the Solaris/Illumos model that's used for sharenfs) and standard tools like exportfs. A lot of this works more or less as you'd expect, but it turns out that there's a potentially unpleasant surprise lurking in how 'zpool import' and 'zpool export' work.

In the current code, if you import or export a ZFS pool that has no filesystems with a sharenfs set, ZFS will still run 'exportfs -ra' at the end of the operation even though nothing could have changed in the NFS exports situation. An important effect that this has is that it will wipe out any manually added or changed NFS exports, reverting your NFS exports to what is currently in /etc/exports and /etc/exports.d. In many situations (including ours) this is a harmless operation, because /etc/exports and /etc/exports.d are how things are supposed to be. But in some environments you may have programs that maintain their own exports list and permissions through running 'exportfs' in various ways, and in these environments a ZFS pool import or export will destroy those exports.

(Apparently one such environment is high availability systems, some of which manually manage NFS exports outside of /etc/exports (I maintain that this is a perfectly sensible design decision). These are also the kind of environment that might routinely import or export pools, as HA pools move between hosts.)

The current OpenZFS code runs 'exportfs -ra' entirely blindly. It doesn't matter if you don't NFS export any ZFS filesystems, much less any from the pool that you're importing or exporting. As long as an 'exportfs' binary is on the system and can be executed, ZFS will run it. Possibly this could be changed if someone was to submit an OpenZFS bug report, but for a number of reasons (including that we're not directly affected by this and aren't in a position to do any testing), that someone will not be me.

(As far as I can tell this is the state of the code in all Linux OpenZFS versions up through the current development version and 2.3.0-rc4, the latest 2.3.0 release candidate.)

Appendix: Where this is in the current OpenZFS source code

The exportfs execution is done in nfs_commit_shares() in lib/libshare/os/linux/nfs.c. This is called (indirectly) by sa_commit_shares() in lib/libshare/libshare.c, which is called by zfs_commit_shares() in lib/libzfs/libzfs_mount.c. In turn this is called by zpool_enable_datasets() and zpool_disable_datasets(), also in libzfs_mount.c, which are called as part of 'zpool import' and 'zpool export' respectively.

(As a piece of trivia, zpool_disable_datasets() will also be called during 'zpool destroy'.)

A Signal container

Signal is an application for secure and private messaging that is free, open source, and easy to use. It uses strong end-to-end encryption and is used by many activists, journalists, and whistleblowers, as well as government officials and business people. In short, by everyone who values their privacy. Signal runs on mobile phones with Android and iOS, and also on desktop computers (Linux, Windows, MacOS), where the desktop version is designed to be linked with the mobile copy of Signal. This lets us use all of Signal's features both on the phone and on the desktop, and all messages, contacts, and so on are synchronized between the two devices. All well and good, but Signal is (unfortunately) tied to a phone number, and as a rule you can run only one copy of Signal on a phone; the same goes for the desktop. Can this limitation be worked around? Certainly, but it takes a small "hack". Read on to find out how.

Running multiple copies of Signal on a phone

Running multiple copies of Signal on a phone is very easy, but only if you use GrapheneOS. GrapheneOS is a mobile operating system with many built-in security mechanisms, designed to take the best possible care of the user's privacy. It is open source and highly compatible with Android, but with numerous improvements that make forensic data extraction, as well as attacks with spyware such as Pegasus and Predator, extremely difficult or outright impossible.

GrapheneOS supports multiple profiles (up to 31, plus a so-called guest profile), which are completely separated from each other. This means you can install different applications in different profiles, keep entirely different contact lists, use one VPN in one profile and a different one (or none at all) in another, and so on.

The solution is therefore simple. On a phone running GrapheneOS we create a new profile, install a fresh copy of Signal there, insert a second SIM card into the phone, and register Signal with the new number.

Once the phone number is registered, we can remove the SIM card and put the old one back in; Signal only uses the data connection for its communication (and of course the phone can also be used without any SIM card, on WiFi alone). The phone now has two copies of Signal installed, tied to two different phone numbers, and we can send messages (even between the two of them!) or make calls from either.

Although the profiles are isolated, we can arrange for notifications from the Signal app in the second profile to be delivered even while we are logged into the first profile. Only for writing messages or making calls do we have to switch to the right profile on the phone.

Simple, isn't it?

Running multiple copies of Signal on a computer

Now we would of course like something similar on the computer: the ability to run two separate instances of Signal (each tied to its own phone number) on one computer, under a single user account.

At first glance this looks slightly more complicated, but with the help of virtualization the problem can be solved elegantly. We will not, of course, run an entire new virtual machine just for Signal; instead we can use a so-called container.

On Linux we first install the systemd-container package (on Ubuntu systems it is already installed by default).

On the host we enable so-called unprivileged user namespaces: we run sudo nano /etc/sysctl.d/nspawn.conf and write the following into the file:

kernel.unprivileged_userns_clone=1

Now the systemd-sysctl service needs to be restarted:

sudo systemctl daemon-reload
sudo systemctl restart systemd-sysctl.service
sudo systemctl status systemd-sysctl.service

…after which we can install debootstrap: sudo apt install debootstrap.

Now we create a new container into which we will install the Debian operating system (the stable release); in reality only the minimal required part of the operating system will be installed:

sudo debootstrap --include=systemd,dbus stable /var/lib/machines/debian

We get output roughly like this:
I: Keyring file not available at /usr/share/keyrings/debian-archive-keyring.gpg; switching to https mirror https://deb.debian.org/debian
I: Retrieving InRelease 
I: Retrieving Packages 
I: Validating Packages 
I: Resolving dependencies of required packages...
I: Resolving dependencies of base packages...
I: Checking component main on https://deb.debian.org/debian...
I: Retrieving adduser 3.134
I: Validating adduser 3.134
...
...
...
I: Configuring tasksel-data...
I: Configuring libc-bin...
I: Configuring ca-certificates...
I: Base system installed successfully.

The container with the Debian operating system is now installed, so we start it and set the root user's password:

sudo systemd-nspawn -D /var/lib/machines/debian -U --machine debian

We get the output:

Spawning container debian on /var/lib/machines/debian.
Press Ctrl-] three times within 1s to kill container.
Selected user namespace base 1766326272 and range 65536.
root@debian:~#

Now, connected to the container's operating system through this virtual terminal, we enter the following two commands:

passwd
printf 'pts/0\npts/1\n' >> /etc/securetty 

The first command sets the password, while the second enables logins via a so-called local terminal (TTY). Finally we type the logout command and return to the host.

Now we need to configure the networking the container will use. The simplest approach is to just use the host's network. We enter the following two commands:

sudo mkdir /etc/systemd/nspawn
sudo nano /etc/systemd/nspawn/debian.nspawn

Into the file we enter:

[Network]
VirtualEthernet=no

Now we start the container again with sudo systemctl start systemd-nspawn@debian, or even more simply with machinectl start debian.

We can also view the list of running containers:

machinectl list
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES
debian  container systemd-nspawn debian 12      -        

1 machines listed.

Or we log into this virtual container: machinectl login debian. We get:

Connected to machine debian. Press ^] three times within 1s to exit session.

Debian GNU/Linux 12 cryptopia pts/1

cryptopia login: root
Password: 

The output shows that we logged in as the root user with the password we set earlier.

Now we install Signal Desktop in this container:

apt update
apt install wget gpg

wget -O- https://updates.signal.org/desktop/apt/keys.asc | gpg --dearmor > /usr/share/keyrings/signal-desktop-keyring.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/signal-desktop-keyring.gpg] https://updates.signal.org/desktop/apt xenial main' | tee /etc/apt/sources.list.d/signal-xenial.list

apt update
apt install --no-install-recommends signal-desktop
halt

The last command shuts the container down. A fresh copy of the Signal Desktop application is now installed inside it.

Incidentally, if we want, we can rename the container to a friendlier name, e.g. sudo machinectl rename debian debian-signal. Of course, we will then also have to use that name when working with the container (that is, machinectl login debian-signal).

Now we create a script that starts the container and launches Signal Desktop inside it in such a way that its window shows up on the host's desktop.

We create the file with nano /opt/runContainerSignal.sh (storing it, for example, in /opt), with the following content:

#!/bin/sh
xhost +local:
pkexec systemd-nspawn --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/  \
                      --private-users=pick \
                      --private-users-chown \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox
xhost -local:

The first xhost command allows connections to our display, but only from the local machine; the second xhost command blocks those connections to the display again. We make the script executable (chmod +x runContainerSignal.sh), and that's it.

Two Signal Desktop application icons

Well, not quite: we would still have to run the script from a terminal, and launching it by clicking an icon is much more convenient.

So we create a .desktop file: nano ~/.local/share/applications/runContainerSignal.desktop. Into it we write the following content:

[Desktop Entry]
Type=Application
Name=Signal Container
Exec=/opt/runContainerSignal.sh
Icon=security-high
Terminal=false
Comment=Run Signal Container

…instead of the security-high icon we can use another one, for example:

Icon=/usr/share/icons/Yaru/scalable/status/security-high-symbolic.svg

Note: the .desktop file is stored in ~/.local/share/applications/, so it is accessible only to this specific user and not to all users on the computer.

Now we make the .desktop file executable: chmod +x ~/.local/share/applications/runContainerSignal.desktop

We refresh the so-called desktop entries: update-desktop-database ~/.local/share/applications/, and that's it!

Two instances of the Signal Desktop application

When we type "Signal Container" into the application launcher, the application's icon will appear; clicking it starts Signal in the container (though we will have to enter a password to start it).

Now we just link this Signal Desktop with the copy of Signal on the phone, and we can use two copies of the Signal Desktop application on the computer.

What about…?

Unfortunately, in the setup described, access to the camera and audio does not work. Calls will therefore still have to be made from the phone.

It turns out that connecting the container to the host's PipeWire audio system and camera is remarkably complicated (at least in my system setup). If you have a hint on how to solve this, do let me know. :)

Using systemd-run to limit something's memory usage in cgroups v2

By: cks

Once upon a time I wrote an entry about using systemd-run to limit something's RAM consumption. This was back in the days of cgroups v1 (also known as 'non-unified cgroups'), and we're now in the era of cgroups v2 ('unified cgroups') and also ZRAM based swap. This means we want to make some adjustments, especially if you're dealing with programs with obnoxiously large RAM usage.

As before, the basic thing you want to do is run your program or thing in a new systemd user scope, which is done with 'systemd-run --user --scope ...'. You may wish to give it a unit name as well, '--unit <name>', especially if you expect it to persist a while and you want to track it specifically. Systemd will normally automatically clean up this scope when everything in it exits, and the scope is normally connected to your current terminal and otherwise more or less acts normally as an interactive process.

To actually do anything with this, we need to set some systemd resource limits. To limit memory usage, the minimum is a MemoryMax= value. It may also work better to set MemoryHigh= to a value somewhat below the absolute limit of MemoryMax. If you're worried about whatever you're doing running your system out of memory and your system uses ZRAM based swap, you may also want to set a MemoryZSwapMax= value so that the program doesn't chew up all of your RAM by 'swapping' it to ZRAM and filling that up. Without a ZRAM swap limit, you might find that the program actually uses MemoryMax RAM plus your entire ZRAM swap RAM, which might be enough to trigger a more general OOM. So this might be:

systemd-run --user --scope -p MemoryHigh=7G -p MemoryMax=8G -p MemoryZSwapMax=1G ./mach build

(Good luck with building Firefox in merely 8 GBytes of RAM, though. And obviously if you do this regularly, you're going to want to script it.)
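Scripting this can be as simple as a tiny wrapper function; this is a sketch, and the `membuild` name and the 7G/8G/1G limits are made up for illustration, not recommendations:

```shell
# membuild: run a command in a new user scope with memory limits.
# Sketch only; size MemoryHigh/MemoryMax/MemoryZSwapMax to your machine.
membuild() {
    systemd-run --user --scope \
        -p MemoryHigh=7G \
        -p MemoryMax=8G \
        -p MemoryZSwapMax=1G \
        "$@"
}
```

With this in your shell setup, 'membuild ./mach build' reproduces the command above.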

If you normally use ZRAM based swap and you're worried about the program running you out of memory that way, you may want to create some actual swap space that the program can be turned loose on. These days, this is as simple as creating a 'swap.img' file somewhere and then swapping onto it:

cd /
dd if=/dev/zero of=swap.img bs=1MiB count=$((4*1024))
mkswap swap.img
swapon /swap.img

(You can use swapoff to stop swapping to this image file after you're done running your big program.)

Then you may want to also limit how much of this swap space the program can use, which is done with a MemorySwapMax= value. I've read both systemd's documentation and the kernel's cgroup v2 memory controller documentation, and I can't tell whether the ZRAM swap maximum is included in the swap maximum or is separate. I suspect that it's included in the swap maximum, but if it really matters you should experiment.

If you also want to limit the program's CPU usage, there are two options. The easiest one to set is CPUQuota=. The drawback of CPU quota limits is that programs may not realize that they're being restricted by such a limit and wind up running a lot more threads (or processes) than they should, increasing the chances of overloading things. The more complex but more legible to programs way is to restrict what CPUs they can run on using taskset(1).

(While systemd has AllowedCPUs=, this is a cgroup setting and doesn't show up in the interface used by taskset and sched_getaffinity(2).)

Systemd also has CPUWeight=, but I have limited experience with it; see fair share scheduling in cgroup v2 for what I know. You might want the special value 'idle' for very low priority programs.

What NFS server threads do in the Linux kernel

By: cks

If we ignore the network stack and take an abstract view, the Linux kernel NFS server needs to do things at various different levels in order to handle NFS client requests. There is NFS specific processing (to deal with things like the NFS protocol and NFS filehandles), general VFS processing (including maintaining general kernel information like dentries), then processing in whatever specific filesystem you're serving, and finally some actual IO if necessary. In the abstract, there are all sorts of ways to split up the responsibility for these various layers of processing. For example, if the Linux kernel supported fully asynchronous VFS operations (which it doesn't), the kernel NFS server could put all of the VFS operations in a queue and let the kernel's asynchronous 'IO' facilities handle them and notify it when a request's VFS operations were done. Even with synchronous VFS operations, you could split the responsibility between some front end threads that handled the NFS specific side of things and a backend pool of worker threads that handled the (synchronous) VFS operations.

(This would allow you to size the two pools differently, since ideally they have different constraints. The NFS processing is more or less CPU bound, and so sized based on how much of the server's CPU capacity you wanted to use for NFS; the VFS layer would ideally be IO bound, and could be sized based on how much simultaneous disk IO it was sensible to have. There is some hand-waving involved here.)

The actual, existing Linux kernel NFS server takes the much simpler approach. The kernel NFS server threads do everything. Each thread takes an incoming NFS client request (or a group of them), does NFS level things like decoding NFS filehandles, and then calls into the VFS to actually do operations. The VFS will call into the filesystem, still in the context of the NFS server thread, and if the filesystem winds up doing IO, the NFS server thread will wait for that IO to complete. When the thread of execution comes back out of the VFS, the NFS thread then does the NFS processing to generate replies and dispatch them to the network.

This unfortunately makes it challenging to answer the question of how many NFS server threads you want to use. The NFS server threads may be CPU bound (if they're handling NFS requests from RAM and the VFS's caches and data structures), or they may be IO bound (as they wait for filesystem IO to be performed, usually for reading and writing files). When you're IO bound, you probably want enough NFS server threads so that you can wait on all of the IO and still have some threads left over to handle the collection of routine NFS requests that can be satisfied from RAM. When you're CPU bound, you don't want any more NFS server threads than you have CPUs, and maybe you want a bit less.

If you're lucky, your workload is consistently and predictably one or the other. If you're not lucky (and we're not), your workload can be either of these at different times or (if we're really out of luck) both at once. Energetic people with NFS servers that have no other real activity can probably write something that automatically tunes the number of NFS threads up and down in response to a combination of the load average, the CPU utilization, and pressure stall information.
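As a very rough illustration of what such auto-tuning could look like, here is a sketch; the policy, the thresholds, and the floor of eight threads are entirely hypothetical assumptions, and real logic would also want to weigh CPU utilization and pressure stall information:

```python
import os

NFSD_THREADS = "/proc/fs/nfsd/threads"  # live kernel nfsd thread count

def pick_thread_count(load1, ncpus, floor=8):
    """Pick an nfsd thread count from the 1-minute load average.

    Hypothetical policy: when the machine looks CPU-saturated, back off
    to 'floor' threads; otherwise allow roughly one thread per CPU that
    isn't already busy, but never fewer than 'floor'.
    """
    if load1 >= ncpus:
        return floor
    return max(floor, ncpus - int(load1))

def retune():
    # Requires root; writing a number to the proc file resizes the
    # kernel NFS server thread pool on the fly.
    load1 = os.getloadavg()[0]
    want = pick_thread_count(load1, os.cpu_count())
    with open(NFSD_THREADS, "w") as f:
        f.write(str(want))
```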

(We're probably just going to set it to the number of system CPUs.)

(After yesterday's question I decided I wanted to know for sure what the kernel's NFS server threads were used for, just in case. So I read the kernel code, which did have some useful side effects such as causing me to learn that the various nfsd4_<operation> functions we sometimes use bpftrace on are doing less than I assumed they were.)

The question of how many NFS server threads you should use (on Linux)

By: cks

Today, not for the first time, I noticed that one of our NFS servers was sitting at a load average of 8 with roughly half of its overall CPU capacity used. People with experience in Linux NFS servers are now confidently predicting that this is a 16-CPU server, which is correct (it has 8 cores and 2 HT threads per core). They're making this prediction because the normal Linux default number of kernel NFS server threads to run is eight.

(Your distribution may have changed this, and if so it's most likely by changing what's in /etc/nfs.conf, which is the normal place to set this. It can be changed on the fly by writing a new value to /proc/fs/nfsd/threads.)
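In /etc/nfs.conf the setting lives in the [nfsd] section and takes effect when the NFS server is restarted (16 here is just an example value):

```ini
[nfsd]
# number of kernel NFS server threads; also changeable on the fly
# through /proc/fs/nfsd/threads
threads=16
```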

Our NFS server wasn't saturating its NFS server threads because someone on a NFS client was doing a ton of IO. That might actually have slowed the requests down. Instead, there were some number of programs that were constantly making some number of NFS requests that could be satisfied entirely from (server) RAM, which explains why all of the NFS kernel threads were busy using system CPU (mostly on a spinlock, apparently, according to 'perf top'). It's possible that some of these constant requests came from code that was trying to handle hot reloading, since this is one of the sources of constant NFS 'GetAttr' requests, but I believe there's other things going on.

(Since this is the research side of a university department, we have very little visibility into what the graduate students are running on places like our little SLURM cluster.)

If you search around the Internet, you can find all sorts of advice about what to set the number of NFS server threads to on your Linux NFS server. Many of them involve relatively large numbers (such as this 2024 SuSE advice of 128 threads). Having gone through this recent experience, my current belief is that it depends on what your problem is. In our case, with the NFS server threads all using kernel CPU time and not doing much else, running more threads than we have CPUs seems pointless; all it would do is create unproductive contention for CPU time. If NFS clients are going to totally saturate the fileserver with (CPU-eating) requests even at 16 threads, possibly we should run fewer threads than CPUs, so that user level management operations have some CPU available without contending against the voracious appetite of the kernel NFS server.

(Some advice suggests some number of server NFS kernel threads per NFS client. I suspect this advice is not used in places with tens or hundreds of NFS clients, which is our situation.)

To figure out what your NFS server's problem is, I think you're going to need to look at things like pressure stall information and information on the IO rate and the number of IO requests you're seeing. You can't rely on overall iowait numbers, because Linux iowait is a conservative lower bound. IO pressure stall information is much better for telling you if some NFS threads are blocked on IO even while others are active.
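The system-wide pressure stall numbers live under /proc/pressure; a quick way to look at them (a sketch; 'some' is the share of time at least one task was stalled on that resource, 'full' the share when all runnable tasks were, reported as avg10/avg60/avg300 percentages):

```shell
# Print system-wide pressure stall information, where the kernel
# provides it (CONFIG_PSI, kernel 4.20+).
for f in /proc/pressure/io /proc/pressure/cpu /proc/pressure/memory; do
    if [ -r "$f" ]; then
        echo "== $f"
        cat "$f"
    fi
done
```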

(Unfortunately the kernel NFS threads are not in a cgroup of their own, so you can't get per-cgroup pressure stall information for them. I don't know if you can manually move them into a cgroup, or if systemd would cooperate with this if you tried it.)

PS: In theory it looks like a potentially reasonable idea to run roughly at least as many NFS kernel threads as you have CPUs (maybe a few less so you have some user level CPU left over). However, if you have a lot of CPUs, as you might on modern servers, this might be too many if your NFS server gets flooded with an IO-heavy workload. Our next generation NFS fileserver hardware is dual socket, 12 cores per socket, and 2 threads per core, for a total of 48 CPUs, and I'm not sure we want to run anywhere near that many NFS kernel threads. Although we probably do want to run more than eight.

Ubuntu LTS (server) releases have become fairly similar to each other

By: cks

Ubuntu 24.04 LTS was released this past April, so one of the things we've been doing since then is building out our install system for 24.04 and then building a number of servers using 24.04, both new servers and servers that used to be built on 20.04 or 22.04. What has been quietly striking about this process is how few changes there have been for us between 20.04, 22.04, and 24.04. Our customization scripts needed only very small changes, and many of the instructions for specific machines could be revised by just searching and replacing either '20.04' or '22.04' with '24.04'.

Some of this lack of changes is illusory, because when I actually look at the differences between our 22.04 and 24.04 postinstall scripting, there are a number of changes, adjustments, and new fixes (and a big change in having to install Python 2 ourselves). Even when we didn't do anything there were decisions to be made, like whether or not we would stick with the Ubuntu 24.04 default of socket activated SSH (our decision so far is to stick with 24.04's default for less divergence from upstream). And there were also some changes to remove obsolete things and restructure how we change things like the system-wide SSH configuration; these aren't forced by the 22.04 to 24.04 change, but building the install setup for a new release is the right time to rethink existing pieces.

However, plenty of this lack of changes is real, and I credit a lot of that to systemd. Systemd has essentially standardized a lot of the init process and in the process, substantially reduced churn in it. For a relevant example, our locally developed systemd units almost never need updating between Ubuntu versions; if it worked in 20.04, it'll still work just as well in 24.04 (including its relationships to various other units). Another chunk of this lack of changes is that the current 20.04+ Ubuntu server installer has maintained a stable configuration file and relatively stable feature set (at least of features that we want to use), resulting in very little needing to be modified in our spin of it as we moved from 20.04 to 22.04 to 24.04. And the experience of going through the server installer has barely changed; if you showed me an installer screen from any of the three releases, I'm not sure I could tell you which it's from.

I generally feel that this is a good thing, at least on servers. A normal Linux server setup and the software that you run on it has broadly reached a place of stability, where there's no particular need to make really visible changes or to break backward compatibility. It's good for us that moving from 20.04 to 22.04 to 24.04 is mostly about getting more recent kernels and more up to date releases of various software packages, and sometimes having bugs fixed so that things like bpftrace work better.

(Whether this is 'welcome maturity' or 'unwelcome stasis' is probably somewhat in the eye of the observer. And there are quiet changes afoot behind the scenes, like the change from iptables to nftables.)

A rough equivalent to "return to last power state" for libvirt virtual machines

By: cks

Physical machines can generally be set in their BIOS so that if power is lost and then comes back, the machine returns to its previous state (either powered on or powered off). The actual mechanics of this are complicated (also), but the idealized version is easily understood and convenient. These days I have a revolving collection of libvirt based virtual machines running on a virtualization host that I periodically reboot due to things like kernel updates, and for a while I have quietly wished for some sort of similar libvirt setting for its virtual machines.

It turns out that this setting exists, sort of, in the form of the libvirt-guests systemd service. If enabled, it can be set to restart all guests that were running when the system was shut down, regardless of whether or not they're set to auto-start on boot (none of my VMs are). This is a global setting that applies to all virtual machines that were running at the time the system went down, not one that can be applied to only some VMs, but for my purposes this is sufficient; it makes it less of a hassle to reboot the virtual machine host.

Linux being Linux, life is not quite this simple in practice, as is illustrated by comparing my Ubuntu VM host machine with my Fedora desktops. On Ubuntu, libvirt-guests.service defaults to enabled, it is configured through /etc/default/libvirt-guests (the Debian standard), and it defaults to not automatically restarting virtual machines. On my Fedora desktops, libvirt-guests.service is not enabled by default, it is configured through /etc/sysconfig/libvirt-guests (as in the official documentation), and it defaults to automatically restarting virtual machines. Another difference is that Ubuntu has a /etc/default/libvirt-guests that has commented out default values, while Fedora has no /etc/sysconfig/libvirt-guests so you have to read the script to see what the defaults are (on Fedora, this is /usr/libexec/libvirt-guests.sh, on Ubuntu /usr/lib/libvirt/libvirt-guests.sh).

I've changed my Ubuntu VM host machine so that it will automatically restart previously running virtual machines on reboot, because generally I leave things running intentionally there. I haven't touched my Fedora machines so far because by and large I don't have any regularly running VMs, so if a VM is still running when I go to reboot the machine, it's most likely because I forgot I had it up and hadn't gotten around to shutting it off.
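For reference, the relevant settings in the configuration file (whichever path your distribution uses) look roughly like this sketch; check your distribution's libvirt-guests.sh for the actual variable defaults:

```ini
# Start guests that were running when the host went down.
ON_BOOT=start
# Optional stagger between guest starts, in seconds.
START_DELAY=0
# What to do with running guests at host shutdown: 'suspend' or 'shutdown'.
ON_SHUTDOWN=suspend
# How long to wait for guests to stop cleanly, in seconds.
SHUTDOWN_TIMEOUT=300
```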

(My pre-libvirt virtualization software was much too heavy-weight for me to leave a VM running without noticing, but libvirt VMs have a sufficiently low impact on my desktop experience that I can and have left them running without realizing it.)

Pam_unix and your system's supported password algorithms

By: cks

The Linux login passwords that wind up in /etc/shadow can be encrypted (well, hashed) with a variety of algorithms, which you can find listed (and sort of documented) in places like Debian's crypt(5) manual page. Generally the choice of which algorithm is used to hash (new) passwords (for example, when people change them) is determined by an option to the pam_unix PAM module.

You might innocently think, as I did, that all of the algorithms your system supports will all be supported by pam_unix, or more exactly will all be available for new passwords (ie, what you or your distribution control with an option to pam_unix). It turns out that this is not the case some of the time (or if it is actually the case, the pam_unix manual page can be inaccurate). This is surprising because pam_unix is the thing that handles hashed passwords (both validating them and changing them), and you'd think its handling of them would be symmetric.

As I found out today, this isn't necessarily so. As documented in the Ubuntu 20.04 crypt(5) manual page, 20.04 supports yescrypt in crypt(3) (sadly Ubuntu's manual page URL doesn't seem to work). This means that the Ubuntu 20.04 pam_unix can (or should) be able to accept yescrypt hashed passwords. However, the Ubuntu 20.04 pam_unix(8) manual page doesn't list yescrypt as one of the available options for hashing new passwords. If you look only at the 20.04 pam_unix manual page, you might (incorrectly) assume that a 20.04 system can't deal with yescrypt based passwords at all.

At one level, this makes sense once you know that pam_unix and crypt(3) come from different packages and handle different parts of the work of checking existing Unix passwords and hashing new ones. Roughly speaking, pam_unix can delegate checking passwords to crypt(3) without having to care how they're hashed, but to hash a new password with a specific algorithm it has to know about the algorithm, have a specific PAM option added for it, and call some functions in the right way. It's quite possible for crypt(3) to get ahead of pam_unix for a new password hashing algorithm, like yescrypt.

(Since they're separate packages, pam_unix may not want to implement this for a new algorithm until a crypt(3) that supports it is at least released, and then pam_unix itself will need a new release. And I don't know if linux-pam can detect whether or not yescrypt is supported by crypt(3) at build time (or at runtime).)

PS: If you have an environment with a shared set of accounts and passwords (whether via LDAP or your own custom mechanism) and a mixture of Ubuntu versions (maybe also with other Linux distribution versions), you may want to be careful about using new password hashing schemes, even once it's supported by pam_unix on your main systems. The older some of your Linuxes are, the more you'll want to check their crypt(3) and crypt(5) manual pages carefully.
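One quick way to see which hashing schemes are actually in use on a given machine is to look at the '$id$' prefixes of the hashes in /etc/shadow; a sketch (the helper name is made up, and the id-to-algorithm mapping is from crypt(5)):

```shell
# Count password hash algorithm ids in a shadow-format file (you will
# need root to read the real /etc/shadow). Field 2 is the hash; its
# $id$ prefix names the algorithm: y = yescrypt, 6 = SHA-512,
# 5 = SHA-256, 1 = MD5. Locked/passwordless entries are skipped.
shadow_hash_ids() {
    awk -F: '$2 ~ /^\$/ { split($2, a, "$"); print a[2] }' "$1" |
        sort | uniq -c
}
```

Run as 'shadow_hash_ids /etc/shadow' on each of your systems to see what you'd break by switching schemes.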

Linux's /dev/disk/by-id unfortunately often puts the transport in the name

By: cks

Filippo Valsorda ran into an issue that involved, in part, the naming of USB disk drives. To quote the relevant bit:

I can't quite get my head around the zfs import/export concept.

When I replace a drive I like to first resilver the new one as a USB drive, then swap it in. This changes the device name (even using by-id).

[...]

My first reaction was that something funny must be going on. My second reaction was to look at an actual /dev/disk/by-id with a USB disk, at which point I got a sinking feeling that I should have already recognized from a long time ago. If you look at your /dev/disk/by-id, you will mostly see names that start with things like 'ata-', 'scsi-OATA-', 'scsi-1ATA', and maybe 'usb-' (and perhaps 'nvme-', but that's a somewhat different kettle of fish). All of these names have the problem that they burn the transport (how you talk to the disk) into the /dev/disk/by-id, which is supposed to be a stable identifier for the disk as a standalone thing.

As Filippo Valsorda's case demonstrates, the problem is that some disks can move between transports. When this happens, the theoretically stable name of the disk changes; what was 'usb-' is now likely 'ata-' or vice versa, and in some cases other transformations may happen. Your attempt to use a stable name has failed and you will likely have problems.

Experimentally, there seem to be some /dev/disk/by-id names that are more stable. Some but not all of our disks have 'wwn-' names (one USB attached disk I can look at doesn't). Our Ubuntu based systems have 'scsi-<hex digits>' and 'scsi-SATA-<disk id>' names, but one of my Fedora systems with SATA drives has only the 'scsi-<hex>' names and the other one has neither. One system we have a USB disk on has no names for the disk other than 'usb-' ones. It seems clear that it's challenging at best to give general advice about how a random Linux user should pick truly stable /dev/disk/by-id names, especially if you have USB drives in the picture.

(See also Persistent block device naming in the Arch Wiki.)

This whole current situation seems less than ideal, to put it one way. It would be nice if disks (and partitions on them) had names that were as transport independent and usable as possible, especially since most disks have theoretically unique serial numbers and model names available (and if you're worried about cross-transport duplicates, you should already be at least as worried about duplicates within the same type of transport).

PS: You can find out what information udev knows about your disks with 'udevadm info --query=all --name=/dev/...' (from, via, by coincidence). The information for a SATA disk differs between my two Fedora machines (one of them has various SCSI_* and ID_SCSI* stuff and the other doesn't), but I can't see any obvious reason for this.

Using pam_access to sometimes not use another PAM module

By: cks

Suppose that you want to authenticate SSH logins to your Linux systems using some form of multi-factor authentication (MFA). The normal way to do this is to use 'password' authentication and then in the PAM stack for sshd, use both the regular PAM authentication module(s) of your system and an additional PAM module that requires your MFA (in another entry about this I used the module name pam_mfa). However, in your particular MFA environment it's been decided that you don't have to require MFA for logins from some of your other networks or systems, and you'd like to implement this.

Because your MFA happens through PAM and the details of this are opaque to OpenSSH's sshd, you can't directly implement skipping MFA through sshd configuration settings. If sshd winds up doing password based authentication at all, it will run your full PAM stack and that will challenge people for MFA. So you must implement sometimes skipping your MFA module in PAM itself. Fortunately there is a PAM module we can use for this, pam_access.

The usual way to use pam_access is to restrict or allow logins (possibly only some logins) based on things like the source address people are trying to log in from (in this, it's sort of a superset of the old tcpwrappers). How this works is configured through an access control file. We can (ab)use this basic matching in combination with the more advanced form of PAM controls to skip our PAM MFA module if pam_access matches something.

What we want looks like this:

auth  [success=1 default=ignore]  pam_access.so noaudit accessfile=/etc/security/access-nomfa.conf
auth  requisite  pam_mfa

Pam_access itself will 'succeed' as a PAM module if the result of processing our access-nomfa.conf file is positive. When this happens, we skip the next PAM module, which is our MFA module. If it 'fails', we ignore the result, and as part of ignoring the result we tell pam_access to not report failures.

Our access-nomfa.conf file will have things like:

# Everyone skips MFA for internal networks
+:ALL:192.168.0.0/16 127.0.0.1

# Ensure we fail otherwise.
-:ALL:ALL

We list the networks we want to allow password logins without MFA from, and then we have to force everything else to fail. (If you leave this off, everything passes, either explicitly or implicitly.)

As covered in the access.conf manual page, you can get quite sophisticated here. For example, you could have people who always had to use MFA, even from internal machines. If they were all in a group called 'mustmfa', you might start with:

-:(mustmfa):ALL

If you get at all creative with your access-nomfa.conf, I strongly suggest writing a lot of comments to explain everything. Your future self will thank you.
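Putting the pieces together, a complete hypothetical access-nomfa.conf combining both rules might look like this (pam_access stops at the first matching rule, so order matters; the group and network values are this entry's examples, not anything standard):

```
# People in the 'mustmfa' group always get an MFA challenge,
# even from internal networks.
-:(mustmfa):ALL

# Everyone else skips MFA for logins from internal networks.
+:ALL:192.168.0.0/16 127.0.0.1

# Ensure we fail otherwise.
-:ALL:ALL
```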

Unfortunately but entirely reasonably, the information about the remote source of a login session doesn't pass through to later PAM authentication done by sudo and su commands that you do in the session. This means that you can't use pam_access to not give MFA challenges on su or sudo to people who are logged in from 'trusted' areas.

(As far as I can tell, the only information 'pam_access' gets about the 'origin' of a su is the TTY, which is generally not going to be useful. You can probably use this to not require MFA on su or sudo that are directly done from logins on the machine's physical console or serial console.)
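If you wanted to try the TTY approach, a sketch of a separate access file for the su/sudo PAM stacks might look like the following. This is untested on my part: exactly what TTY string pam_access sees (for example 'tty1' versus '/dev/tty1') can vary, so you'd want to verify it on your own systems before relying on it.

```
# Hypothetical: skip MFA only for su/sudo done directly from
# the physical console or the serial console.
+:ALL:tty1 tty2 ttyS0
-:ALL:ALL
```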

Having an emergency backup DNS resolver with systemd-resolved

By: cks

At work we have a number of internal DNS resolvers, which you very much want to use to resolve DNS names if you're inside our networks for various reasons (including our split-horizon DNS setup). Purely internal DNS names aren't resolvable by the outside world at all, and some DNS names resolve differently. However, at the same time a lot of the host names that are very important to me are in our public DNS because they have public IPs (sort of for historical reasons), and so they can be properly resolved if you're using external DNS servers. This leaves me with a little bit of a paradox; on the one hand, my machines must resolve our DNS zones using our internal DNS servers, but on the other hand if our internal DNS servers aren't working for some reason (or my home machine can't reach them) it's very useful to still be able to resolve the DNS names of our servers, so I don't have to memorize their IP addresses.

A while back I switched to using systemd-resolved on my machines. Systemd-resolved has a number of interesting virtues, including that it has fast (and centralized) failover from one upstream DNS resolver to another. My systemd-resolved configuration is probably a bit unusual, in that I have a local resolver on my machines, so resolved's global DNS resolution goes to it and then I add a layer of (nominally) interface-specific DNS domain overrides that point to our internal DNS resolvers.

(This doesn't give me perfect DNS resolution, but it's more resilient and under my control than routing everything to our internal DNS resolvers, especially for my home machine.)

Somewhat recently, it occurred to me that I could deal with the problem of our internal DNS resolvers all being unavailable by adding '127.0.0.1' as an additional potential DNS server for my interface specific list of our domains. Obviously I put it at the end, where resolved won't normally use it. But with it there, if all of the other DNS servers are unavailable I can still try to resolve our public DNS names with my local DNS resolver, which will go out to the Internet to talk to various authoritative DNS servers for our zones.
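As a concrete sketch of this setup (the interface name, server addresses, and domains here are made-up examples, not our actual ones), the per-interface DNS servers and domain overrides can be set with resolvectl:

```
# Per-interface DNS servers for 'eth0', with the local resolver
# listed last as the emergency fallback.
resolvectl dns eth0 10.0.0.1 10.0.0.2 127.0.0.1

# Restrict these servers to our domains; the '~' prefix makes them
# routing domains rather than search domains.
resolvectl domain eth0 '~example.com' '~example.org'

# Later, to push resolved off 127.0.0.1 once the real resolvers are
# reachable again, set the server list again without it.
resolvectl dns eth0 10.0.0.1 10.0.0.2
```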

The drawback with this emergency backup approach is that systemd-resolved will stick with whatever DNS server it's currently using unless that DNS server stops responding. So if resolved switches to 127.0.0.1 for our zones, it's going to keep using it even after the other DNS resolvers become available again. I'll have to notice that and manually fiddle with the interface specific DNS server list to remove 127.0.0.1, which would force resolved to switch to some other server.

(As far as I can tell, the current systemd-resolved correctly handles the situation where an interface says that '127.0.0.1' is the DNS resolver for it, and doesn't try to force queries to 127.0.0.1:53 to go out that interface. My early 2013 notes say that this sometimes didn't work, but I failed to write down the specific circumstances.)

A surprise with /etc/cron.daily, run-parts, and files with '.' in their name

By: cks

Linux distributions have a long standing general cron feature where there are /etc/cron.hourly, /etc/cron.daily, and /etc/cron.weekly directories and if you put scripts in there, they will get run hourly, daily, or weekly (at some time set by the distribution). The actual running is generally implemented by a program called 'run-parts'. Since this is a standard Linux distribution feature, of course there is a single implementation of run-parts and its behavior is standardized, right?

Since I'm asking the question, you already know the answer: there are at least two different implementations of run-parts, and their behavior differs in at least one significant way (as well as several other probably less important ones).

In Debian, Ubuntu, and other Debian-derived distributions (and also I think Arch Linux), run-parts is a C program that is part of debianutils. In Fedora, Red Hat Enterprise Linux, and derived RPM-based distributions, run-parts is a shell script that's part of the crontabs package, which is part of cronie-cron. One somewhat unimportant way that these two versions differ is that the RPM version ignores some extensions that come from RPM packaging fun (you can see the current full list in the shell script code), while the Debian version only skips the Debian equivalents with a non-default option (and actually documents the behavior in the manual page).

A much more important difference is that the Debian version ignores files with a '.' in their name (this can be changed with a command line switch, but /etc/cron.daily and so on are not processed with this switch). As a non-hypothetical example, if you have a /etc/cron.daily/backup.sh script, a Debian based system will ignore this while a RHEL or Fedora based system will happily run it. If you are migrating a server from RHEL to Ubuntu, this may come as an unpleasant surprise, partly since the Debian version doesn't complain about skipping files.
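The Debian rule is simple to state: by default, run-parts only runs files whose names consist entirely of ASCII letters, digits, underscores, and hyphens (this is from the debianutils manual page). A quick sketch of the check in Python (mine, not debianutils' actual C code), which you can use to predict what a Debian system will silently skip:

```python
import re

# Debian run-parts' default filename acceptance rule, per its manual
# page: entirely ASCII letters, digits, underscores, and hyphens.
DEBIAN_OK = re.compile(r'^[a-zA-Z0-9_-]+$')

def debian_would_run(name):
    return bool(DEBIAN_OK.match(name))

for script in ['backup.sh', 'backup', 'logrotate', '0anacron']:
    print(script, debian_would_run(script))
```

Here 'backup.sh' fails the check because of the '.', while renaming it to plain 'backup' makes it acceptable to both the Debian and the RHEL versions.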

(Whether or not the restriction could be said to be clearly documented in the Debian manual page is a matter of taste. Debian does clearly state the allowed characters, but it does not point out that '.', a not uncommon character, is explicitly not accepted by default.)

Linux software RAID and changing your system's hostname

By: cks

Today, I changed the hostname of an old Linux system (for reasons) and rebooted it. To my surprise, the system did not come up afterward, but instead got stuck in systemd's emergency mode for a chain of reasons that boiled down to there being no '/dev/md0'. Changing the hostname back to its old value and rebooting the system again caused it to come up fine. After some diagnostic work, I believe I understand what happened and how to work around it if it affects us in the future.

One of the issues that Linux RAID auto-assembly faces is the question of what it should call the assembled array. People want their RAID array names to stay fixed (so /dev/md0 is always /dev/md0), and so the name is part of the RAID array's metadata, but at the same time you have the problem of what happens if you connect up two sets of disks that both want to be 'md0'. Part of the answer is mdadm.conf, which can give arrays names based on their UUID. If your mdadm.conf says 'ARRAY /dev/md10 ... UUID=<x>' and mdadm finds a matching array, then in theory it can be confident you want that one to be /dev/md10 and it should rename anything else that claims to be /dev/md10.

However, suppose that your array is not specified in mdadm.conf. In that case, another software RAID array feature kicks in, which is that arrays can have a 'home host'. If the array is on its home host, it will get the name it claims it has, such as '/dev/md0'. Otherwise, well, let me quote from the 'Auto-Assembly' section of the mdadm manual page:

[...] Arrays which do not obviously belong to this host are given names that are expected not to conflict with anything local, and are started "read-auto" so that nothing is written to any device until the array is written to. i.e. automatic resync etc is delayed.

As is covered in the documentation for the '--homehost' option in the mdadm manual page, on modern 1.x superblock formats the home host is embedded into the name of the RAID array. You can see this with 'mdadm --detail', which can report things like:

Name : ubuntu-server:0
Name : <host>:25  (local to host <host>)

Both of these have a 'home host'; in the first case the home host is 'ubuntu-server', and in the second case the home host is the current machine's hostname. Well, its 'hostname' as far as mdadm is concerned, which can be set in part through mdadm.conf's 'HOMEHOST' directive. Let me repeat that: mdadm by default identifies home hosts by their hostname, not by any more stable identifier.

So if you change a machine's hostname and you have arrays not in your mdadm.conf with home hosts, their /dev/mdN device names will get changed when you reboot. This is what happened to me, as we hadn't added the array to the machine's mdadm.conf.
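The straightforward workaround is to pin the array in mdadm.conf by UUID, so that the home host never matters. A sketch (the UUID here is a placeholder; the file is /etc/mdadm/mdadm.conf on Debian and Ubuntu systems and /etc/mdadm.conf on RHEL-derived ones):

```
# 'mdadm --detail --scan' will print a suitable ARRAY line for
# each currently assembled array.
ARRAY /dev/md0 metadata=1.2 UUID=01234567:89abcdef:01234567:89abcdef
```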

(Contrary to some ways to read the mdadm manual page, arrays are not renamed if they're in mdadm.conf. Otherwise we'd have noticed this a long time ago on our Ubuntu servers, where all of the arrays created in the installer have the home host of 'ubuntu-server', which is obviously not any machine's actual hostname.)

Setting the home host value to the machine's current hostname when an array is created is the mdadm default behavior, although you can turn this off with the right mdadm.conf HOMEHOST setting. You can also tell mdadm to consider all arrays to be on their home host, regardless of the home host embedded into their names.

(The latter is 'HOMEHOST <ignore>', the former by itself is 'HOMEHOST <none>', and it's currently valid to combine them both as 'HOMEHOST <ignore> <none>', although this isn't quite documented in the manual page.)

PS: Some uses of software RAID arrays won't care about their names. For example, if they're used for filesystems, and your /etc/fstab specifies the device of the filesystem using 'UUID=' or with '/dev/disk/by-id/md-uuid-...' (which seems to be common on Ubuntu).

PPS: For 1.x superblocks, the array name as a whole can only be 32 characters long, which obviously limits how long of a home host name you can have, especially since you need a ':' in there as well and an array number or the like. If you create a RAID array on a system with a too long hostname, the name of the resulting array will not be in the '<host>:<name>' format that creates an array with a home host; instead, mdadm will set the name of the RAID to the base name (either whatever name you specified, or the N of the 'mdN' device you told it to use).

(It turns out that I managed to do this by accident on my home desktop, which has a long fully qualified name, by creating an array with the name 'ssd root'. The combination turns out to be 33 characters long, so the RAID array just got the name 'ssd root' instead of '<host>:ssd root'.)
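If you want to check in advance whether a hostname will fit, the arithmetic is simple: the 1.x superblock name field is 32 characters, and the '<host>:<name>' form needs the hostname, a ':', and the array name. A small sketch of the check (my arithmetic from the limits described above, not mdadm's actual code):

```python
# The 1.x superblock name field is limited to 32 characters.
NAME_LIMIT = 32

def fits_with_homehost(hostname, array_name):
    # The '<host>:<name>' form needs hostname + ':' + array name.
    return len(hostname) + 1 + len(array_name) <= NAME_LIMIT

# A 24-character hostname plus ':' plus 'ssd root' is 33 characters,
# one over the limit, so the home host would be silently dropped.
print(fits_with_homehost('a' * 24, 'ssd root'))  # → False
```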
