How we handle debconf questions during our Ubuntu installs

By: cks

In a comment on How we automate installing extra packages during Ubuntu installs, David Magda asked how we dealt with the things that need debconf answers. This is a good question and we have two approaches that we use in combination. First, we have a prepared file of debconf selections for each Ubuntu version and we feed this into debconf-set-selections before we start installing packages. However, in practice this file doesn't have much in it and we rarely remember to update it (and as a result, a bunch of it is somewhat obsolete). We generally only update this file if we discover debconf selections where the default doesn't work in our environment.
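(As an illustration of the format only, not a line from our actual file: each entry in such a selections file is a 'package question type value' line, and the whole file is loaded with debconf-set-selections before any packages get installed. The postfix question below is just a commonly seen example.)

postfix postfix/main_mailer_type select No configuration

debconf-set-selections < our-debconf-selections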

Second, we run apt-get with a bunch of environment variables set to muzzle debconf:

export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=<null address>@<our domain>
export DEBIAN_FRONTEND=noninteractive

Traditionally I've considered muzzling debconf this way to be too dangerous to do during package updates or when installing packages by hand. However, I consider it not so much safe as safe enough during our standard install process. To put it one way, we're not starting out with a working system and potentially breaking it by letting some new or updated package pick bad defaults. Instead we're starting with a non-working system and hopefully ending up with a working one. If some package picks bad defaults and we wind up with problems, that's not much worse than where we started out, and we'll fix it by updating our file of debconf selections and then redoing the install.

Also, in practice all of this gets worked out during our initial test installs of any new Ubuntu version (done on test virtual machines these days). By the time we're ready to start installing real servers with a new Ubuntu version, we've gone through most of the discovery process for debconf questions. Then the only time we're going to have problems during future system installs is if a package update either changes the default answer for a current question (to a bad one) or adds a new question with a bad default. As far as I can remember, we haven't had either happen.

(Some of our servers need additional packages installed, which we do by hand (as mentioned), and sometimes the packages will insist on stopping to ask us questions or give us warnings. This is annoying, but so far not annoying enough to fix it by augmenting our standard debconf selections to deal with it.)

How we automate installing extra packages during Ubuntu installs

By: cks

We have a local system for installing Ubuntu machines, and one of the important things it does is install various additional Ubuntu packages that we want as part of our standard installs. These days we have two sorts of standard installs, a 'base' set of packages that everything gets and a broader set of packages that login servers and compute servers get (to make them more useful and usable by people). Specialized machines need additional packages, and while we can automate installation of those too, they're generally a small enough set of packages that we document them in our install instructions for each machine and install them by hand.

There are probably clever ways to do bulk installs of Ubuntu packages, but if so, we don't use them. Our approach is instead a brute force one. We have files that contain lists of packages, such as a 'base' file, and these files just contain a list of packages with optional comments:

# Partial example of Basic package set
amanda-client
curl
jq
[...]

# decodes kernel MCE/machine check events
rasdaemon

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev devscripts automake 

(Like all of the rest of our configuration information, these package set files live in our central administrative filesystem. You could distribute them in some other way, for example fetching them with rsync or even HTTP.)

To install these packages, we use grep to extract the actual packages into a big list and feed the big list to apt-get. This is more or less:

pkgs=$(cat $PKGDIR/$s | grep -v '^#' | grep -v '^[ \t]*$')
apt-get -qq -y install $pkgs

(This will abort if any of the packages we list aren't available. We consider this a feature, because it means we have an error in the list of packages.)

A more organized and minimal approach might be to add the '--no-install-recommends' option, but we started without it and we don't particularly want to go back to find which recommended packages we'd have to explicitly add to our package lists.
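(For illustration, the whole step can be a small loop over whichever package set files a machine should get. This is a sketch, not our actual script; $PKGSETS is a hypothetical list of set names, while $PKGDIR and the file format are as above.)

for s in $PKGSETS; do
    pkgs=$(grep -v '^#' "$PKGDIR/$s" | grep -v '^[ \t]*$')
    apt-get -qq -y install $pkgs
done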

At least some of the 'base' package installs could be done during the initial system install process from our customized Ubuntu server ISO image, since you can specify additional packages to install. However, doing package installs that way would create a series of issues in practice. We'd probably need to track more carefully which package came from which Ubuntu collection, since only some of them are enabled during the server install process; it would be harder to update the lists; and the tools for handling the whole process would be a lot more limited, as would our ability to troubleshoot any problems.

Doing this additional package install in our 'postinstall' process means that we're doing it in a full Unix environment where we have all of the standard Unix tools, and we can easily look around the system if and when there's a problem. Generally we've found that the more of our installs we can defer to once the system is running normally, the better.

(Also, the less the Ubuntu installer does, the faster it finishes and the sooner we can get back to our desks.)

(This entry was inspired by parts of a blog post I read recently and reflecting about how we've made setting up new versions of machines pretty easy, assuming our core infrastructure is there.)

The mystery (to me) of tiny font sizes in KDE programs I run

By: cks

Over on the Fediverse I tried a KDE program and ran into a common issue for me:

It has been '0' days since a KDE app started up with too-small fonts on my bespoke fvwm based desktop, and had no text zoom. I guess I will go use a browser, at least I can zoom fonts there.

Maybe I could find a KDE settings thing and maybe find where and why KDE does this (it doesn't happen in GNOME apps), but honestly it's simpler to give up on KDE based programs and find other choices.

(The specific KDE program I was trying to use this time was NeoChat.)

My fvwm based desktop environment has an XSettings daemon running, which I use in part to set up a proper HiDPI environment (a setup that doesn't cover KDE fonts, because I never figured that out). I suspect that my HiDPI display is part of why KDE programs often or always seem to pick tiny fonts, but I don't particularly know why. Based on the xsettingsd documentation and the registry, there doesn't seem to be any KDE specific font setting, and I'm setting the Gtk/FontName setting to a font that KDE doesn't seem to be using (which I could only verify once I found a way to see the font I was specifying).

After some searching I found the systemsettings program through the Arch wiki's page on KDE and was able to turn up its font sizes in a way that appears to be durable (ie, it stays after I stop and start systemsettings). However, this hasn't affected the fonts I see in NeoChat when I run it again. There are a bunch of font settings, but maybe NeoChat is using the 'small' font for some reason (apparently which app uses what font setting can be variable).

Qt (the underlying GUI toolkit of much or all of KDE) has its own set of environment variables for scaling things on HiDPI displays, and setting $QT_SCALE_FACTOR does size up NeoChat (though apparently bits of Plasma ignore these variables; I think I'm unlikely to run into that since I don't want to use KDE's desktop components).
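(For example, something like this is enough to get a readable window, assuming the NeoChat binary is called 'neochat' and picking whatever scale factor suits your display:)

QT_SCALE_FACTOR=1.5 neochat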

Some KDE applications have their own settings files with their own font sizes; one example I know of is kdiff3. This is quite helpful because if I'm determined enough, I can either adjust the font sizes in the program's settings or at least go edit the configuration file (in this case, .config/kdiff3rc, I think, not .kde/share/config/kdiff3rc). However, not all KDE applications allow you to change font sizes through either their GUI or a settings file, and NeoChat appears to be one of the ones that don't.

In theory now that I've done all of this research I could resize NeoChat and perhaps other KDE applications through $QT_SCALE_FACTOR. In practice I feel I would rather switch to applications that interoperate better with the rest of my environment unless for some reason the KDE application is either my only choice or the significantly superior one (as it has been so far for kdiff3 for my usage).

Using Netplan to set up WireGuard on Ubuntu 22.04 works, but has warts

By: cks

For reasons outside the scope of this entry, I recently needed to set up WireGuard on an Ubuntu 22.04 machine. When I did this before for an IPv6 gateway, I used systemd-networkd directly. This time around I wasn't going to set up a single peer and stop; I expected to iterate and add peers several times, which made netplan's ability to update and re-do your network configuration look attractive. Also, our machines are already using Netplan for their basic network configuration, so this would spare my co-workers from having to learn about systemd-networkd.

Conveniently, Netplan supports multiple configuration files so you can put your WireGuard configuration into a new .yaml file in your /etc/netplan. The basic version of a WireGuard endpoint with purely internal WireGuard IPs is straightforward:

network:
  version: 2
  tunnels:
    our-wg0:
      mode: wireguard
      addresses: [ 192.168.X.1/24 ]
      port: 51820
      key:
        private: '....'
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820

(You may want something larger than a /24 depending on how many other machines you think you'll be talking to. Also, this configuration doesn't enable IP forwarding, which is a feature in our particular situation.)

If you're using netplan's systemd-networkd backend, which you probably are on an Ubuntu server, you can apparently put your keys into files instead of needing to carefully guard the permissions of your WireGuard /etc/netplan file (which normally has your private key in it).

If you write this out and run 'netplan try' or 'netplan apply', it will duly apply all of the configuration and bring your 'our-wg0' WireGuard configuration up as you expect. The problems emerge when you change this configuration, perhaps to add another peer, and then re-do your 'netplan try', because when you look you'll find that your new peer hasn't been added. This is a sign of a general issue; as far as I can tell, netplan (at least in Ubuntu 22.04) can set up WireGuard devices from scratch but it can't update anything about their WireGuard configuration once they're created. This is probably a limitation in the Ubuntu 22.04 version of systemd-networkd that's only changed in the very latest systemd versions. In order to make WireGuard level changes, you need to remove the device, for example with 'ip link del dev our-wg0' and then re-run 'netplan try' (or 'netplan apply') to re-create the WireGuard device from scratch; the recreated version will include all of your changes.
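(Concretely, applying a WireGuard level change to the example configuration above amounts to the following two commands; 'netplan try' works in place of 'netplan apply'.)

ip link del dev our-wg0
netplan apply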

(The latest online systemd.netdev manual page says that systemd-networkd will try to update netdev configurations if they change, and .netdev files are where WireGuard settings go. The best information I can find is that this change appeared in systemd v257, although the Fedora 41 systemd.netdev manual page has this same wording and it has systemd '256.11'. Maybe there was a backport into Fedora.)

In our specific situation, deleting and recreating the WireGuard device is harmless and we're not going to be doing it very often anyway. In other configurations things may not be so straightforward and so you may need to resort to other means to apply updates to your WireGuard configuration (including working directly through the 'wg' tool).

I'm not impressed by the state of NFS v4 in the Linux kernel

By: cks

Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 due to running into a series of problems our people were experiencing with NFS (v3) locks (cf), since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.

(NFS v4 locks are handled relatively differently than NFS v3 locks.)

Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).

(The NFS v4 server issue we encountered may be the one fixed by this commit.)

What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).

We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.

PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses, and the client uses whatever operations are convenient and useful for it, not necessarily everything from a given NFS v4 revision. If the major use of Linux NFS v4 servers is with v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).

The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)

By: cks

Over on the Fediverse I said:

This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.

When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats I can provide a better answer of what's missing. All of this applies to node_exporter as of version 1.8.2 (the current one as I write this).

(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)

Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.

In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in nfs/parse.go. For the server case, if we cross compare this to the kernel's include/linux/nfs4.h, what's missing from server stats is all NFS v4.1, v4.2, and RFC 8276 xattr operations, everything from operation 40 through operation 75 (as I write this).

Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone', and is missing from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):

Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus

Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.

If you want numbers for scale, at the moment node_exporter reports on 50 out of 69 NFS v4 client operations, and is missing 36 NFS v4 server operations (reporting on what I believe is 36 out of 72). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, what client operations the server stats are missing.
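(If you want to see the asymmetry yourself, you can graph per-operation rates on both sides and compare. This assumes node_exporter's usual metric names for these stats, node_nfs_requests_total on clients and node_nfsd_requests_total on servers; check what your version actually exposes.)

rate(node_nfs_requests_total{proto="4"}[5m])
rate(node_nfsd_requests_total{proto="4"}[5m])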

(I haven't made a bug report about this (yet) and may not do so, because doing so would require making my Github account operable again, something I'm sort of annoyed by. Github's choice to require me to have MFA to make bug reports is not the incentive they think it is.)

Linux kernel NFSv4 server and client RPC operation statistics

By: cks

NFS servers and clients communicate using RPC, sending various NFS v3, v4, and possibly v2 (but we hope not) RPC operations to the server and getting replies. On Linux, the kernel exports statistics about these NFS RPC operations in various places, with a global summary in /proc/net/rpc/nfsd (for the NFS server side) and /proc/net/rpc/nfs (for the client side). Various tools will extract this information and convert it into things like metrics, or present it on the fly (for example, nfsstat(8)). However, as far as I know what is in those files and especially how RPC operations are reported is not well documented, and also confusing, which is a problem if you discover that something has an incomplete knowledge of NFSv4 RPC stats.

For a general discussion of /proc/net/rpc/nfsd, see Svenn D'Hert's nfsd stats explained article. I'm focusing on NFSv4, which is to say the 'proc4ops' line. This line is produced in nfsd_show in fs/nfsd/stats.c. The line starts with a count of how many operations there are, such as 'proc4ops 76', and then has one number for each operation. What are the operations and how many of them are there? That's more or less found in the nfs_opnum4 enum in include/linux/nfs4.h. You'll notice that there are some gaps in the operation numbers; for example, there's no 0, 1, or 2. Despite there being no such actual NFS v4 operations, 'proc4ops' starts with three 0s for them, because it works with an array numbered by nfs_opnum4 and like all C arrays, it starts at 0.

(The counts of other, real NFS v4 operations may be 0 because they're never done in your environment.)

For NFS v4 client operations, we look at the 'proc4' line in /proc/net/rpc/nfs. Like the server's 'proc4ops' line, it starts with a count of how many operations are being reported on, such as 'proc4 69', and then a count for each operation. Unfortunately for us and everyone else, these operations are not numbered the same as the NFS server operations. Instead the numbering is given in an anonymous and unnumbered enum in include/linux/nfs4.h that starts with 'NFSPROC4_CLNT_NULL = 0,' (as a spoiler, the 'null' operation is not unused, contrary to the include file's comment). The actual generation and output of /proc/net/rpc/nfs is done in rpc_proc_show in net/sunrpc/stats.c. The whole structure this code uses is set up in fs/nfs/nfs4xdr.c, and while there is a confusing level of indirection, I believe the structure corresponds directly with the NFSPROC4_CLNT_* enum values.
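(If you want to check what your own kernel reports, the count at the start of each line says how many per-operation counters follow; a quick way to peek at both sides is:)

awk '/^proc4ops/ {print "server ops:", $2}' /proc/net/rpc/nfsd
awk '/^proc4 /   {print "client ops:", $2}' /proc/net/rpc/nfs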

What I think is going on is that Linux has decided to optimize its NFSv4 client statistics to only include the NFS v4 operations that it actually uses, rather than take up a bit of extra memory to include all of the NFS v4 operations, including ones that will always have a '0' count. Because the Linux NFS v4 client started using different NFSv4 operations at different times, some of these operations (such as 'lookupp') are out of order; when the NFS v4 client started using them, they had to be added at the end of the 'proc4' line to preserve backward compatibility with existing programs that read /proc/net/rpc/nfs.

PS: As far as I can tell from a quick look at fs/nfs/nfs3xdr.c, include/uapi/linux/nfs3.h, and net/sunrpc/stats.c, the NFS v3 server and client stats cover all of the NFS v3 operations and are in the same order, the order of the NFS v3 operation numbers.

How Ubuntu 24.04's bad bpftrace package appears to have happened

By: cks

When I wrote about Ubuntu 24.04's completely broken bpftrace '0.20.2-1ubuntu4.2' package (which is now no longer available as an Ubuntu update), I said it was a disturbing mystery how a theoretical 24.04 bpftrace binary was built in such a way that it depended on a shared library that didn't exist in 24.04. Thanks to the discussion in bpftrace bug #2097317, we have somewhat of an answer, which in part shows some of the challenges of building software at scale.

The short version is that the broken bpftrace package wasn't built in a standard Ubuntu 24.04 environment that only had released packages. Instead, it was built in a '24.04' environment that included (some?) proposed updates, and one of the included proposed updates was an updated version of libllvm18 that had the new shared library. Apparently there are mechanisms that should have acted to make the new bpftrace depend on the new libllvm18 if everything went right, but some things didn't go right and the new bpftrace package didn't pick up that dependency.

On the one hand, if you're planning interconnected package updates, it's a good idea to make sure that they work with each other, which means you may want to mingle in some proposed updates into some of your build environments. On the other hand, if you allow your build environments to be contaminated with non-public packages this way, you really, really need to make sure that the dependencies work out. If you don't and packages become public in the wrong order, you get Ubuntu 24.04's result.

(While the RPM build process and package format would have avoided this specific problem, I'm pretty sure that there are similar ways to make it go wrong.)

Contaminating your build environment this way also makes testing your newly built packages harder. The built bpftrace binary would have run inside the build environment, because the build environment had the right shared library from the proposed libllvm18. To see the failure, you would have to run tests (including running the built binary) in a 'pure' 24.04 environment that had only publicly released package updates. This would require an extra package test step; I'm not clear if Ubuntu has this as part of their automated testing of proposed updates (there's some hints in the discussion that they do but that these tests were limited and didn't try to run the binary).

An alarmingly bad official Ubuntu 24.04 bpftrace binary package

By: cks

Bpftrace is a more or less official part of Ubuntu; it's even in the Ubuntu 24.04 'main' repository, as opposed to one of the less supported ones. So I'll present things in the traditional illustrated form (slightly edited for line length reasons):

$ bpftrace
bpftrace: error while loading shared libraries: libLLVM-18.so.18.1: cannot open shared object file: No such file or directory
$ readelf -d /usr/bin/bpftrace | grep libLLVM
 0x0...01 (NEEDED)  Shared library: [libLLVM-18.so.18.1]
$ dpkg -L libllvm18 | grep libLLVM
/usr/lib/llvm-18/lib/libLLVM.so.1
/usr/lib/llvm-18/lib/libLLVM.so.18.1
/usr/lib/x86_64-linux-gnu/libLLVM-18.so
/usr/lib/x86_64-linux-gnu/libLLVM.so.18.1
$ dpkg -l bpftrace libllvm18
[...]
ii  bpftrace       0.20.2-1ubuntu4.2 amd64 [...]
ii  libllvm18:amd64 1:18.1.3-1ubuntu1 amd64 [...]

I originally mis-diagnosed this as a libllvm18 packaging failure, but this is in fact worse. Based on trawling through packages.ubuntu.com, only Ubuntu 24.10 and later have a 'libLLVM-18.so.18.1' in any package; in Ubuntu 24.04, the correct name for this is 'libLLVM.so.18.1'. If you rebuild the bpftrace source .deb on a genuine 24.04 machine, you get a bpftrace build (and binary .deb) that does correctly use 'libLLVM.so.18.1' instead of 'libLLVM-18.so.18.1'.

As far as I can see, there are two things that could have happened here. The first is that Canonical simply built a 24.10 (or later) bpftrace binary .deb and put it in 24.04 without bothering to check if the result actually worked. I would like to say that this shows shocking disregard for the functioning of an increasingly important observability tool from Canonical, but actually it's not shocking at all, it's Canonical being Canonical (and they would like us to pay for this for some reason). The second and worse option is that Canonical is building 'Ubuntu 24.04' packages in an environment that is contaminated with 24.10 or later packages, shared libraries, and so on. This isn't supposed to happen in a properly operating package building environment that intends to create reliable and reproducible results, and it casts doubt on the provenance and reliability of all Ubuntu 24.04 packages.

(I don't know if there's a way to inspect binary .debs to determine anything about the environment they were built in, the way you can get some information about RPMs. Also, I now have a new appreciation for Fedora putting the Fedora release version into the actual RPM's 'release' name. Ubuntu 24.10 and 24.04 don't have the same version of bpftrace, so this isn't quite as simple as Canonical copying the 24.10 package to 24.04; 24.10 has 0.21.2, while 24.04 is theoretically 0.20.2.)

Incidentally, this isn't an issue of the shared library having its name changed, because if you manually create a 'libLLVM-18.so.18.1' symbolic link to the 24.04 libllvm18's 'libLLVM.so.18.1' and run bpftrace, what you get is:

$ bpftrace
: CommandLine Error: Option 'debug-counter' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
abort

This appears to say that the Ubuntu 24.04 bpftrace binary is incompatible with the Ubuntu 24.04 libllvm18 shared libraries. I suspect that it was built against different LLVM 18 headers as well as different LLVM 18 shared libraries.

The (potential) complexity of good runqueue latency measurement in Linux

By: cks

Run queue latency is the time between when a Linux task becomes ready to run and when it actually runs. If you want good responsiveness, you want a low runqueue latency, so for a while I've been tracking a histogram of it with eBPF, and I put some graphs of it up on some Grafana dashboards I look at. Then recently I improved the responsiveness of my desktop with the cgroup V2 'cpu.idle' setting, and questions came up about how this differs from process niceness. When I was looking at those questions, I realized that my run queue latency measurements were incomplete.

When I first set up my run queue latency tracking, I wasn't using either cgroup V2 cpu.idle or process niceness, and so I set up a single global runqueue latency histogram for all tasks regardless of their priority and scheduling class. Once I started using 'idle' CPU scheduling (and testing the effectiveness of niceness), this resulted in hopelessly muddled data that was effectively meaningless during the times that multiple types of scheduling or multiple nicenesses were in use. Running CPU-consuming processes only when the system is otherwise idle is (hopefully) good for the runqueue latency of my regular desktop processes, but more terrible than usual for those 'run only when idle' processes, and generally there's going to be a lot more of them than my desktop processes.

The moment you introduce more than one 'class' of processes for scheduling, you need to split run queue latency measurements up between these classes if you want to really make sense of the results. What these classes are will depend on your environment. I could probably get away with a class for 'cpu.idle' tasks, a class for heavily nice'd tasks, a class for regular tasks, and perhaps a class for (system) processes running with very high priority. If you're doing fair share scheduling between logins, you might need a class per login (or you could ignore run queue latency as too noisy a measure).

I'm not sure I'd actually track all of my classes as Prometheus metrics. For my personal purposes, I don't care very much about the run queue latency of 'idle' or heavily nice'd processes, so perhaps I should update my personal metrics gathering to just ignore those. Alternately, I could write a bpftrace script that gathered the detailed class by class data, run it by hand when I was curious, and ignore the issue otherwise (continuing with my 'global' run queue latency histogram, which is at least honest in general).

The issue with DNF 5 and script output in Fedora 41

By: cks

These days Fedora uses DNF as its high(er) level package management software, replacing yum. However, there are multiple versions of DNF, which behave somewhat differently. Through Fedora 40, the default version of DNF was DNF 4; in Fedora 41, DNF is now DNF 5. DNF 5 brings a number of improvements but it has at least one issue that makes me unhappy with it in my specific situation. Over on the Fediverse I said:

Oh nice, DNF 5 in Fedora 41 has nicely improved the handling of output from RPM scriptlets, so that you can more easily see that it's scriptlet output instead of DNF messages.

[later]

I must retract my praise for DNF 5 in Fedora 41, because it has actually made the handling of output from RPM scriptlets *much* worse than in dnf 4. DNF 5 will repeatedly re-print the current output to date of scriptlets every time it updates a progress indicator of, for example, removing packages. This results in a flood of output for DKMS module builds during kernel updates. Dnf 5's cure is far worse than the disease, and there's no way to disable it.

<bugzilla 2331691>

(Fedora 41 specifically has dnf5-5.2.8.1, at least at the moment.)

This can be mostly worked around for kernel package upgrades and DKMS modules by manually removing and upgrading packages before the main kernel upgrade. You want to do this so that dnf is removing as few packages as possible while your DKMS modules are rebuilding. This is done with:

  1. Upgrade all of your non-kernel packages first:

    dnf upgrade --exclude 'kernel*'
    

  2. Remove the following packages for the old kernel:

    kernel kernel-core kernel-devel kernel-modules kernel-modules-core kernel-modules-extra

    (It's probably easier to do 'dnf remove kernel*<version>*' and let DNF sort it out.)

  3. Upgrade two kernel packages that you can do in advance:

    dnf upgrade kernel-tools kernel-tools-libs
    

Unfortunately in Fedora 41 this still leaves you with one RPM package that you can't upgrade in advance and that will be removed while your DKMS module is rebuilding, namely 'kernel-devel-matched'. To add extra annoyance, this is a virtual package that contains no files, and you can't remove it because a lot of things depend on it.

As far as I can tell, DNF 5 has absolutely no way to shut off its progress bars. It completely ignores $TERM, and I can't see any other way to suppress them that leaves DNF usable. It would have been nice to have some command line switches to control this, but it seems pretty clear that this wasn't high on the DNF 5 road map.

(Although I don't expect this to be fixed in Fedora 41 over its lifetime, I am still deferring the Fedora 41 upgrades of my work and home desktops for as long as possible to minimize the amount of DNF 5 irritation I have to deal with.)

WireGuard's AllowedIPs aren't always the (WireGuard) routes you want

By: cks

A while back I wrote about understanding WireGuard's AllowedIPs, and also recently I wrote about how different sorts of WireGuard setups have different difficulties, where one of the challenges for some setups is setting up what you want routed through WireGuard connections. As Ian Z aka nobrowser recently noted in a comment on the first entry, these days many WireGuard related programs (such as wg-quick and NetworkManager) will automatically set routes for you based on AllowedIPs. Much of the time this will work fine, but there are situations where adding routes for all AllowedIPs ranges isn't what you want.

WireGuard's AllowedIPs setting for a particular peer controls two things at once: what (inside-WireGuard) source IP addresses you will accept from the peer, and what destination addresses WireGuard will send to that peer if the packet is sent to that WireGuard interface. However, it's the routing table that controls what destination addresses are sent to a particular WireGuard interface (or more likely a combination of IP policy routing rules and some routing table).

If your WireGuard IP address is only reachable from other WireGuard peers, you can sensibly bound your AllowedIPs so that the collection of all of them matches the routing table. This is also more or less doable if some of them are gateways for additional networks; hopefully your network design puts all of those networks under some subnet and the subnet isn't too big. However, if your WireGuard IP can wind up being reached by a broader range of source IPs, or even 'all of the Internet' (as is my case), then your AllowedIPs range is potentially much larger than what you want to always be routed to WireGuard.

A related case is if you have a 'work VPN' WireGuard configuration where you could route all of your traffic through your WireGuard connection but some of the time you only want to route traffic to specific (work) subnets. Unless you like changing AllowedIPs all of the time or constructing two different WireGuard interfaces and only activating the correct one, you'll want an AllowedIPs that accepts everything but some of the time you'll only route specific networks to the WireGuard interface.

(On the other hand, with the state of things in Linux, having two separate WireGuard interfaces might be the easiest way to manage this in NetworkManager or other tools.)

I think that most people's use of WireGuard will probably involve AllowedIPs settings that also work for routing, provided that the tools involved handle the recursive routing problem. These days, NetworkManager handles that for you, although I don't know about wg-quick.
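(As a sketch of the 'accept from anywhere, route only some destinations' case, wg-quick can be told not to create routes from AllowedIPs at all with 'Table = off', leaving you to add whatever routes you actually want. The addresses here are placeholders, and my own setup uses policy routing rather than exactly this.)

[Interface]
PrivateKey = ....
Address = 192.168.X.1/24
Table = off
PostUp = ip route add 192.168.Y.0/24 dev %i
PostDown = ip route del 192.168.Y.0/24 dev %i

[Peer]
PublicKey = ....
AllowedIPs = 0.0.0.0/0
Endpoint = A.B.C.D:51820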

(This is one of the entries that I write partly to work it out in my own head. My own configuration requires a different AllowedIPs than the routes I send through the WireGuard tunnel. I make this work with policy based routing.)

Cgroup V2 memory limits and their potential for thrashing

By: cks

Recently I read 32 MiB Working Sets on a 64 GiB machine (via), which recounts how under some situations, Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.

(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)

Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit. What happens next depends on what sort of memory pages are being reclaimed from the process. If they are backed by files (for example, they're pages from the program, shared libraries, or memory mapped files), they will be dropped from the process's resident set but may stay in memory so it's only a soft page fault when they're next accessed. However, if they're anonymous pages of memory the process has allocated, they must be written to swap (if there's room for them) and I don't know if the original pages stay in memory afterward (and so are eligible for a soft page fault when next accessed). If the process keeps accessing anonymous pages that were previously reclaimed, it will thrash on either soft or hard page faults.

(The memory.high limit is set by systemd's MemoryHigh=.)

However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts for RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sort of memory that allocates. Some of this memory can also thrash like user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).

Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit and will kill programs in the cgroup if they push hard on going over it; however, the memory system will try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right. However, in our environments cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap available and exhausting the swap space too.

My view is that this generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', ie either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
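(In systemd terms, my current preference therefore amounts to a plain hard limit in a unit or override, with made-up numbers here:)

[Service]
MemoryMax=8G
MemorySwapMax=1G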

A gotcha with importing ZFS pools and NFS exports on Linux (as of ZFS 2.3.0)

By: cks

Ever since its Solaris origins, ZFS has supported automatic NFS and CIFS sharing of ZFS filesystems through their 'sharenfs' and 'sharesmb' properties. Part of the idea of this is that you could automatically have NFS (and SMB) shares created and removed as you did things like import and export pools, rather than have to maintain a separate set of export information and keep it in sync with what ZFS filesystems were available. On Linux, OpenZFS still supports this, working through standard Linux NFS export permissions (which don't quite match the Solaris/Illumos model that's used for sharenfs) and standard tools like exportfs. A lot of this works more or less as you'd expect, but it turns out that there's a potentially unpleasant surprise lurking in how 'zpool import' and 'zpool export' work.

In the current code, if you import or export a ZFS pool that has no filesystems with a sharenfs set, ZFS will still run 'exportfs -ra' at the end of the operation even though nothing could have changed in the NFS exports situation. An important effect that this has is that it will wipe out any manually added or changed NFS exports, reverting your NFS exports to what is currently in /etc/exports and /etc/exports.d. In many situations (including ours) this is a harmless operation, because /etc/exports and /etc/exports.d are how things are supposed to be. But in some environments you may have programs that maintain their own exports list and permissions through running 'exportfs' in various ways, and in these environments a ZFS pool import or export will destroy those exports.

(Apparently one such environment is high availability systems, some of which manually manage NFS exports outside of /etc/exports (I maintain that this is a perfectly sensible design decision). These are also the kind of environment that might routinely import or export pools, as HA pools move between hosts.)
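(A sketch of how you might run into this, with a made-up export and pool name; the manual export isn't in /etc/exports, so the import's 'exportfs -ra' drops it:)

exportfs -o rw some-client:/srv/scratch   # manually added export
zpool import tank                         # pool with no sharenfs filesystems
exportfs -v                               # /srv/scratch is no longer exported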

The current OpenZFS code runs 'exportfs -ra' entirely blindly. It doesn't matter if you don't NFS export any ZFS filesystems, much less any from the pool that you're importing or exporting. As long as an 'exportfs' binary is on the system and can be executed, ZFS will run it. Possibly this could be changed if someone was to submit an OpenZFS bug report, but for a number of reasons (including that we're not directly affected by this and aren't in a position to do any testing), that someone will not be me.

(As far as I can tell this is the state of the code in all Linux OpenZFS versions up through the current development version and 2.3.0-rc4, the latest 2.3.0 release candidate.)

Appendix: Where this is in the current OpenZFS source code

The exportfs execution is done in nfs_commit_shares() in lib/libshare/os/linux/nfs.c. This is called (indirectly) by sa_commit_shares() in lib/libshare/libshare.c, which is called by zfs_commit_shares() in lib/libzfs/libzfs_mount.c. In turn this is called by zpool_enable_datasets() and zpool_disable_datasets(), also in libzfs_mount.c, which are called as part of 'zpool import' and 'zpool export' respectively.

(As a piece of trivia, zpool_disable_datasets() will also be called during 'zpool destroy'.)

Signal container

Signal is an application for secure and private messaging; it is free, open source, and easy to use. It uses strong end-to-end encryption and is used by many activists, journalists, and whistleblowers, as well as government officials and business people. In short, by everyone who values their privacy. Signal runs on mobile phones with Android and iOS, and also on desktop computers (Linux, Windows, MacOS), where the desktop version is designed to be linked with the copy of Signal on your phone. This lets us use all of Signal's features both on the phone and on the desktop computer, and all messages, contacts, and so on are synchronized between the two devices. All well and good, but Signal is (unfortunately) tied to a phone number, and as a rule you can run only one copy of Signal on a phone; the same goes for a desktop computer. Can this limitation be worked around? Certainly, but it takes a small "hack". Read on to find out how.

Running multiple copies of Signal on a phone

Running multiple copies of Signal on a phone is very easy, but only if you use GrapheneOS. GrapheneOS is an operating system for mobile phones with numerous built-in security mechanisms, designed to take the best possible care of the user's privacy. It is open source and highly compatible with Android, but with many improvements that make forensic data extraction, as well as attacks with spyware such as Pegasus and Predator, extremely difficult or outright impossible.

GrapheneOS supports multiple profiles (up to 31, plus a so-called guest profile), which are completely separated from one another. This means you can install different applications in different profiles, keep entirely different contact lists, use one VPN in one profile and a different one (or none at all) in another, and so on.

The solution is therefore simple. On a phone running GrapheneOS we create a new profile, install a new copy of Signal there, insert a second SIM card into the phone, and register Signal with the new number.

Once the phone number is registered, we can remove that SIM card and put the old one back in: Signal only uses the data connection for its communication (and of course the phone can also be used without a SIM card, on WiFi alone). The phone now has two copies of Signal installed, tied to two different phone numbers, and from both of them we can send messages (even between the two of them!) or make calls.

Although the profiles are separate, we can arrange for notifications from the Signal application in the second profile to be delivered even while we are logged into the first profile. Only to write messages or make calls do we have to switch to the right profile on the phone.

Simple, right?

Running multiple copies of Signal on a computer

Now we would of course like something similar on the computer as well. In short, we would like to be able to run two different instances of Signal (each tied to its own phone number) on one computer, under a single user.

At first glance this looks slightly more complicated, but with a bit of virtualization the problem can be solved elegantly. Of course we are not going to run a whole new virtual machine just for Signal; we can use a so-called container instead.

On Linux we first install the systemd-container package (on Ubuntu systems it is already installed by default).

On the host machine we enable so-called unprivileged user namespaces: run sudo nano /etc/sysctl.d/nspawn.conf and put the following into the file:

kernel.unprivileged_userns_clone=1

Now the systemd-sysctl service needs to be restarted:

sudo systemctl daemon-reload
sudo systemctl restart systemd-sysctl.service
sudo systemctl status systemd-sysctl.service

…then we can install debootstrap: sudo apt install debootstrap.

Now we create a new container into which we will install the Debian operating system (specifically the stable release); in reality only the minimally required parts of the operating system will be installed:

sudo debootstrap --include=systemd,dbus stable /var/lib/machines/debian

We get output roughly like this:

/var/lib/machines/debian
I: Keyring file not available at /usr/share/keyrings/debian-archive-keyring.gpg; switching to https mirror https://deb.debian.org/debian
I: Retrieving InRelease 
I: Retrieving Packages 
I: Validating Packages 
I: Resolving dependencies of required packages...
I: Resolving dependencies of base packages...
I: Checking component main on https://deb.debian.org/debian...
I: Retrieving adduser 3.134
I: Validating adduser 3.134
...
...
...
I: Configuring tasksel-data...
I: Configuring libc-bin...
I: Configuring ca-certificates...
I: Base system installed successfully.

The container with the Debian operating system is now installed. So we boot it and set the root user's password:

sudo systemd-nspawn -D /var/lib/machines/debian -U --machine debian

We get the following output:

Spawning container debian on /var/lib/machines/debian.
Press Ctrl-] three times within 1s to kill container.
Selected user namespace base 1766326272 and range 65536.
root@debian:~#

Now, over this virtual terminal, we log into the container's operating system and enter the following two commands:

passwd
printf 'pts/0\npts/1\n' >> /etc/securetty 

The first command sets the password, and the second allows logins over a so-called local terminal (TTY). Finally we enter the logout command and log out, back to the host machine.

Now we need to set up the network that the container will use. The simplest option is to just use the host machine's network. We enter the following two commands:

sudo mkdir /etc/systemd/nspawn
sudo nano /etc/systemd/nspawn/debian.nspawn

Into this file we put:

[Network]
VirtualEthernet=no

Now we start the container again with sudo systemctl start systemd-nspawn@debian or, even more simply, machinectl start debian.

We can also list the running containers:

machinectl list
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES
debian  container systemd-nspawn debian 12      -        

1 machines listed.

Or we can log into this virtual container with machinectl login debian. We get the following output:

Connected to machine debian. Press ^] three times within 1s to exit session.

Debian GNU/Linux 12 cryptopia pts/1

cryptopia login: root
Password: 

The output shows that we logged in as the root user, with the password we set earlier.

Now we install Signal Desktop in this container:

apt update
apt install wget gpg

wget -O- https://updates.signal.org/desktop/apt/keys.asc | gpg --dearmor > /usr/share/keyrings/signal-desktop-keyring.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/signal-desktop-keyring.gpg] https://updates.signal.org/desktop/apt xenial main' | tee /etc/apt/sources.list.d/signal-xenial.list

apt update
apt install --no-install-recommends signal-desktop
halt

The last command shuts the container down. It now has a fresh copy of the Signal Desktop application installed in it.

Incidentally, if we want we can rename the container to a friendlier name, for example sudo machinectl rename debian debian-signal. Of course, we then have to use the same name when working with the container (so, machinectl login debian-signal).

Now we write a script that will start the container and launch Signal Desktop inside it in such a way that its window shows up on the host machine's desktop.

We create the file with nano /opt/runContainerSignal.sh (saving it, for example, in the /opt directory); its contents are as follows:

#!/bin/sh
xhost +local:
pkexec systemd-nspawn --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/  \
                      --private-users=pick \
                      --private-users-chown \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox
xhost -local:

The first xhost command allows connections to our display, but only from the local machine; the second xhost command blocks those (display) connections again. We make the script executable (chmod +x runContainerSignal.sh), and that's it.

Two Signal Desktop application icons

Well, not quite yet, since we would have to run the script from a terminal; launching it by clicking an icon is much more convenient.

So we create a .desktop file with nano ~/.local/share/applications/runContainerSignal.desktop and put the following into it:

[Desktop Entry]
Type=Application
Name=Signal Container
Exec=/opt/runContainerSignal.sh
Icon=security-high
Terminal=false
Comment=Run Signal Container

…instead of the security-high icon we can use some other one, for example:

Icon=/usr/share/icons/Yaru/scalable/status/security-high-symbolic.svg

A note: this .desktop file is stored in ~/.local/share/applications/, so it is available only to this specific user and not to all users on the computer.

Now we make the .desktop file executable: chmod +x ~/.local/share/applications/runContainerSignal.desktop

Then we refresh the so-called desktop entries: update-desktop-database ~/.local/share/applications/, and that's it!

Two instances of the Signal Desktop application

When we type "Signal Container" into the application search, the application's icon will appear, and clicking it starts Signal in the container (we do have to enter a password to start it, though).

Now we just link this Signal Desktop with the copy of Signal on the phone, and we can use two copies of the Signal Desktop application on the computer.

But what about…?

Unfortunately, in the setup described above, access to the camera and to audio does not work. Calls will therefore still have to be made from the phone.

It turns out that connecting the container to the host machine's PipeWire audio system and camera is incredibly convoluted (at least in my system setup). If you have a hint on how to solve this, you are of course welcome to let me know. :)

Using systemd-run to limit something's memory usage in cgroups v2

By: cks

Once upon a time I wrote an entry about using systemd-run to limit something's RAM consumption. This was back in the days of cgroups v1 (also known as 'non-unified cgroups'), and we're now in the era of cgroups v2 ('unified cgroups') and also ZRAM based swap. This means we want to make some adjustments, especially if you're dealing with programs with obnoxiously large RAM usage.

As before, the basic thing you want to do is run your program or thing in a new systemd user scope, which is done with 'systemd-run --user --scope ...'. You may wish to give it a unit name as well, '--unit <name>', especially if you expect it to persist a while and you want to track it specifically. Systemd will normally automatically clean up this scope when everything in it exits, and the scope is normally connected to your current terminal and otherwise more or less acts normally as an interactive process.

To actually do anything with this, we need to set some systemd resource limits. To limit memory usage, the minimum is a MemoryMax= value. It may also work better to set MemoryHigh= to a value somewhat below the absolute limit of MemoryMax. If you're worried about whatever you're doing running your system out of memory and your system uses ZRAM based swap, you may also want to set a MemoryZSwapMax= value so that the program doesn't chew up all of your RAM by 'swapping' it to ZRAM and filling that up. Without a ZRAM swap limit, you might find that the program actually uses MemoryMax RAM plus your entire ZRAM swap RAM, which might be enough to trigger a more general OOM. So this might be:

systemd-run --user --scope -p MemoryHigh=7G -p MemoryMax=8G -p MemoryZSwapMax=1G ./mach build

(Good luck with building Firefox in merely 8 GBytes of RAM, though. And obviously if you do this regularly, you're going to want to script it.)

If you normally use ZRAM based swap and you're worried about the program running you out of memory that way, you may want to create some actual swap space that the program can be turned loose on. These days, this is as simple as creating a 'swap.img' file somewhere and then swapping onto it:

cd /
dd if=/dev/zero of=swap.img bs=1MiB count=$((4*1024))
mkswap swap.img
swapon /swap.img

(You can use swapoff to stop swapping to this image file after you're done running your big program.)

Then you may want to also limit how much of this swap space the program can use, which is done with a MemorySwapMax= value. I've read both systemd's documentation and the kernel's cgroup v2 memory controller documentation, and I can't tell whether the ZRAM swap maximum is included in the swap maximum or is separate. I suspect that it's included in the swap maximum, but if it really matters you should experiment.

If you also want to limit the program's CPU usage, there are two options. The easiest one to set is CPUQuota=. The drawback of CPU quota limits is that programs may not realize that they're being restricted by such a limit and wind up running a lot more threads (or processes) than they should, increasing the chances of overloading things. The more complex way, but one that is more legible to programs, is to restrict which CPUs they can run on using taskset(1).

(While systemd has AllowedCPUs=, this is a cgroup setting and doesn't show up in the interface used by taskset and sched_getaffinity(2).)

Systemd also has CPUWeight=, but I have limited experience with it; see fair share scheduling in cgroup v2 for what I know. You might want the special value 'idle' for very low priority programs.
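For completeness, here is what the two CPU-limiting approaches look like in practice; the numbers are arbitrary examples.

# cap the scope at (roughly) two CPUs worth of CPU time
systemd-run --user --scope -p CPUQuota=200% ./mach build

# or pin the program to specific CPUs, which it can see via
# sched_getaffinity() and use to size its thread pools
taskset -c 0-3 ./mach build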

What NFS server threads do in the Linux kernel

By: cks

If we ignore the network stack and take an abstract view, the Linux kernel NFS server needs to do things at various different levels in order to handle NFS client requests. There is NFS specific processing (to deal with things like the NFS protocol and NFS filehandles), general VFS processing (including maintaining general kernel information like dentries), then processing in whatever specific filesystem you're serving, and finally some actual IO if necessary. In the abstract, there are all sorts of ways to split up the responsibility for these various layers of processing. For example, if the Linux kernel supported fully asynchronous VFS operations (which it doesn't), the kernel NFS server could put all of the VFS operations in a queue and let the kernel's asynchronous 'IO' facilities handle them and notify it when a request's VFS operations were done. Even with synchronous VFS operations, you could split the responsibility between some front end threads that handled the NFS specific side of things and a backend pool of worker threads that handled the (synchronous) VFS operations.

(This would allow you to size the two pools differently, since ideally they have different constraints. The NFS processing is more or less CPU bound, and so sized based on how much of the server's CPU capacity you wanted to use for NFS; the VFS layer would ideally be IO bound, and could be sized based on how much simultaneous disk IO it was sensible to have. There is some hand-waving involved here.)

The actual, existing Linux kernel NFS server takes the much simpler approach. The kernel NFS server threads do everything. Each thread takes an incoming NFS client request (or a group of them), does NFS level things like decoding NFS filehandles, and then calls into the VFS to actually do operations. The VFS will call into the filesystem, still in the context of the NFS server thread, and if the filesystem winds up doing IO, the NFS server thread will wait for that IO to complete. When the thread of execution comes back out of the VFS, the NFS thread then does the NFS processing to generate replies and dispatch them to the network.

This unfortunately makes it challenging to answer the question of how many NFS server threads you want to use. The NFS server threads may be CPU bound (if they're handling NFS requests from RAM and the VFS's caches and data structures), or they may be IO bound (as they wait for filesystem IO to be performed, usually for reading and writing files). When you're IO bound, you probably want enough NFS server threads so that you can wait on all of the IO and still have some threads left over to handle the collection of routine NFS requests that can be satisfied from RAM. When you're CPU bound, you don't want any more NFS server threads than you have CPUs, and maybe you want a bit less.

If you're lucky, your workload is consistently and predictably one or the other. If you're not lucky (and we're not), your workload can be either of these at different times or (if we're really out of luck) both at once. Energetic people with NFS servers that have no other real activity can probably write something that automatically tunes the number of NFS threads up and down in response to a combination of the load average, the CPU utilization, and pressure stall information.

(We're probably just going to set it to the number of system CPUs.)

(After yesterday's question I decided I wanted to know for sure what the kernel's NFS server threads were used for, just in case. So I read the kernel code, which did have some useful side effects such as causing me to learn that the various nfsd4_<operation> functions we sometimes use bpftrace on are doing less than I assumed they were.)

The question of how many NFS server threads you should use (on Linux)

By: cks

Today, not for the first time, I noticed that one of our NFS servers was sitting at a load average of 8 with roughly half of its overall CPU capacity used. People with experience in Linux NFS servers are now confidently predicting that this is a 16-CPU server, which is correct (it has 8 cores and 2 HT threads per core). They're making this prediction because the normal Linux default number of kernel NFS server threads to run is eight.

(Your distribution may have changed this, and if so it's most likely by changing what's in /etc/nfs.conf, which is the normal place to set this. It can be changed on the fly by writing a new value to /proc/fs/nfsd/threads.)
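Concretely, both ways of changing the thread count look something like the following; 16 is just an example value.

# in /etc/nfs.conf (or a file in /etc/nfs.conf.d/)
[nfsd]
threads=16

# or on the fly, effective immediately
echo 16 >/proc/fs/nfsd/threads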

Our NFS server wasn't saturating its NFS server threads because someone on a NFS client was doing a ton of IO. That might actually have slowed the requests down. Instead, there were some number of programs that were constantly making some number of NFS requests that could be satisfied entirely from (server) RAM, which explains why all of the NFS kernel threads were busy using system CPU (mostly on a spinlock, apparently, according to 'perf top'). It's possible that some of these constant requests came from code that was trying to handle hot reloading, since this is one of the sources of constant NFS 'GetAttr' requests, but I believe there's other things going on.

(Since this is the research side of a university department, we have very little visibility into what the graduate students are running on places like our little SLURM cluster.)

If you search around the Internet, you can find all sorts of advice about what to set the number of NFS server threads to on your Linux NFS server. Many of them involve relatively large numbers (such as this 2024 SuSE advice of 128 threads). Having gone through this recent experience, my current belief is that it depends on what your problem is. In our case, with the NFS server threads all using kernel CPU time and not doing much else, running more threads than we have CPUs seems pointless; all it would do is create unproductive contention for CPU time. If NFS clients are going to totally saturate the fileserver with (CPU-eating) requests even at 16 threads, possibly we should run fewer threads than CPUs, so that user level management operations have some CPU available without contending against the voracious appetite of the kernel NFS server.

(Some advice suggests some number of server NFS kernel threads per NFS client. I suspect this advice is not used in places with tens or hundreds of NFS clients, which is our situation.)

To figure out what your NFS server's problem is, I think you're going to need to look at things like pressure stall information and information on the IO rate and the number of IO requests you're seeing. You can't rely on overall iowait numbers, because Linux iowait is a conservative lower bound. IO pressure stall information is much better for telling you if some NFS threads are blocked on IO even while others are active.
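The global pressure stall information can be read directly, which is a quick way to see whether the system as a whole is spending time stalled on IO or CPU:

cat /proc/pressure/io
cat /proc/pressure/cpu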

(Unfortunately the kernel NFS threads are not in a cgroup of their own, so you can't get per-cgroup pressure stall information for them. I don't know if you can manually move them into a cgroup, or if systemd would cooperate with this if you tried it.)

PS: In theory it looks like a potentially reasonable idea to run roughly at least as many NFS kernel threads as you have CPUs (maybe a few less so you have some user level CPU left over). However, if you have a lot of CPUs, as you might on modern servers, this might be too many if your NFS server gets flooded with an IO-heavy workload. Our next generation NFS fileserver hardware is dual socket, 12 cores per socket, and 2 threads per core, for a total of 48 CPUs, and I'm not sure we want to run anywhere near that many NFS kernel threads. Although we probably do want to run more than eight.

Ubuntu LTS (server) releases have become fairly similar to each other

By: cks

Ubuntu 24.04 LTS was released this past April, so one of the things we've been doing since then is building out our install system for 24.04 and then building a number of servers using 24.04, both new servers and servers that used to be built on 20.04 or 22.04. What has been quietly striking about this process is how few changes there have been for us between 20.04, 22.04, and 24.04. Our customization scripts needed only very small changes, and many of the instructions for specific machines could be revised by just searching and replacing either '20.04' or '22.04' with '24.04'.

Some of this lack of changes is illusory, because when I actually look at the differences between our 22.04 and 24.04 postinstall scripting, there are a number of changes, adjustments, and new fixes (and a big change in having to install Python 2 ourselves). Even when we didn't do anything there were decisions to be made, like whether or not we would stick with the Ubuntu 24.04 default of socket activated SSH (our decision so far is to stick with 24.04's default for less divergence from upstream). And there were also some changes to remove obsolete things and restructure how we change things like the system-wide SSH configuration; these aren't forced by the 22.04 to 24.04 change, but building the install setup for a new release is the right time to rethink existing pieces.

However, plenty of this lack of changes is real, and I credit a lot of that to systemd. Systemd has essentially standardized a lot of the init process and in the process, substantially reduced churn in it. For a relevant example, our locally developed systemd units almost never need updating between Ubuntu versions; if it worked in 20.04, it'll still work just as well in 24.04 (including its relationships to various other units). Another chunk of this lack of changes is that the current 20.04+ Ubuntu server installer has maintained a stable configuration file and relatively stable feature set (at least of features that we want to use), resulting in very little needing to be modified in our spin of it as we moved from 20.04 to 22.04 to 24.04. And the experience of going through the server installer has barely changed; if you showed me an installer screen from any of the three releases, I'm not sure I could tell you which it's from.

I generally feel that this is a good thing, at least on servers. A normal Linux server setup and the software that you run on it has broadly reached a place of stability, where there's no particular need to make really visible changes or to break backward compatibility. It's good for us that moving from 20.04 to 22.04 to 24.04 is mostly about getting more recent kernels and more up to date releases of various software packages, and sometimes having bugs fixed so that things like bpftrace work better.

(Whether this is 'welcome maturity' or 'unwelcome stasis' is probably somewhat in the eye of the observer. And there are quiet changes afoot behind the scenes, like the change from iptables to nftables.)

A rough equivalent to "return to last power state" for libvirt virtual machines

By: cks

Physical machines can generally be set in their BIOS so that if power is lost and then comes back, the machine returns to its previous state (either powered on or powered off). The actual mechanics of this are complicated (also), but the idealized version is easily understood and convenient. These days I have a revolving collection of libvirt based virtual machines running on a virtualization host that I periodically reboot due to things like kernel updates, and for a while I have quietly wished for some sort of similar libvirt setting for its virtual machines.

It turns out that this setting exists, sort of, in the form of the libvirt-guests systemd service. If enabled, it can be set to restart all guests that were running when the system was shut down, regardless of whether or not they're set to auto-start on boot (none of my VMs are). This is a global setting that applies to all virtual machines that were running at the time the system went down, not one that can be applied to only some VMs, but for my purposes this is sufficient; it makes it less of a hassle to reboot the virtual machine host.

Linux being Linux, life is not quite this simple in practice, as is illustrated by comparing my Ubuntu VM host machine with my Fedora desktops. On Ubuntu, libvirt-guests.service defaults to enabled, it is configured through /etc/default/libvirt-guests (the Debian standard), and it defaults to not automatically restarting virtual machines. On my Fedora desktops, libvirt-guests.service is not enabled by default, it is configured through /etc/sysconfig/libvirt-guests (as in the official documentation), and it defaults to automatically restarting virtual machines. Another difference is that Ubuntu ships a /etc/default/libvirt-guests with commented-out default values, while Fedora has no /etc/sysconfig/libvirt-guests at all, so you have to read the script to see what the defaults are (on Fedora, this is /usr/libexec/libvirt-guests.sh, on Ubuntu /usr/lib/libvirt/libvirt-guests.sh).

I've changed my Ubuntu VM host machine so that it will automatically restart previously running virtual machines on reboot, because generally I leave things running intentionally there. I haven't touched my Fedora machines so far because by and large I don't have any regularly running VMs, so if a VM is still running when I go to reboot the machine, it's most likely because I forgot I had it up and hadn't gotten around to shutting it off.
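For reference, the sort of settings involved in /etc/default/libvirt-guests look like this; the variable names come from the libvirt-guests script and the values here are just an illustration, so double-check them against your version:

# restart guests that were running when the host went down
ON_BOOT=start
# how long to wait for guests at host shutdown, in seconds
SHUTDOWN_TIMEOUT=300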

(My pre-libvirt virtualization software was much too heavy-weight for me to leave a VM running without noticing, but libvirt VMs have a sufficiently low impact on my desktop experience that I can and have left them running without realizing it.)

Pam_unix and your system's supported password algorithms

By: cks

The Linux login passwords that wind up in /etc/shadow can be encrypted (well, hashed) with a variety of algorithms, which you can find listed (and sort of documented) in places like Debian's crypt(5) manual page. Generally the choice of which algorithm is used to hash (new) passwords (for example, when people change them) is determined by an option to the pam_unix PAM module.

You might innocently think, as I did, that all of the algorithms your system supports will also be supported by pam_unix, or more exactly will all be available for new passwords (ie, what you or your distribution control with an option to pam_unix). It turns out that this is not the case some of the time (or if it is actually the case, the pam_unix manual page can be inaccurate). This is surprising because pam_unix is the thing that handles hashed passwords (both validating them and changing them), and you'd think its handling of them would be symmetric.

As I found out today, this isn't necessarily so. As documented in the Ubuntu 20.04 crypt(5) manual page, 20.04 supports yescrypt in crypt(3) (sadly Ubuntu's manual page URL doesn't seem to work). This means that the Ubuntu 20.04 pam_unix can (or should) be able to accept yescrypt hashed passwords. However, the Ubuntu 20.04 pam_unix(8) manual page doesn't list yescrypt as one of the available options for hashing new passwords. If you look only at the 20.04 pam_unix manual page, you might (incorrectly) assume that a 20.04 system can't deal with yescrypt based passwords at all.

At one level, this makes sense once you know that pam_unix and crypt(3) come from different packages and handle different parts of the work of checking existing Unix password and hashing new ones. Roughly speaking, pam_unix can delegate checking passwords to crypt(3) without having to care how they're hashed, but to hash a new password with a specific algorithm it has to know about the algorithm, have a specific PAM option added for it, and call some functions in the right way. It's quite possible for crypt(3) to get ahead of pam_unix for a new password hashing algorithm, like yescrypt.

(Since they're separate packages, pam_unix may not want to implement this for a new algorithm until a crypt(3) that supports it is at least released, and then pam_unix itself will need a new release. And I don't know if linux-pam can detect whether or not yescrypt is supported by crypt(3) at build time (or at runtime).)
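For reference, the algorithm choice for new passwords is an option on pam_unix's 'password' line. On a release where pam_unix does support yescrypt, the relevant line in a Debian-style common-password file looks roughly like the following; I've simplified the options from memory, so treat it as illustrative rather than exact:

password  [success=1 default=ignore]  pam_unix.so obscure yescrypt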

PS: If you have an environment with a shared set of accounts and passwords (whether via LDAP or your own custom mechanism) and a mixture of Ubuntu versions (maybe also with other Linux distribution versions), you may want to be careful about using new password hashing schemes, even once it's supported by pam_unix on your main systems. The older some of your Linuxes are, the more you'll want to check their crypt(3) and crypt(5) manual pages carefully.

Linux's /dev/disk/by-id unfortunately often puts the transport in the name

By: cks

Filippo Valsorda ran into an issue that involved, in part, the naming of USB disk drives. To quote the relevant bit:

I can't quite get my head around the zfs import/export concept.

When I replace a drive I like to first resilver the new one as a USB drive, then swap it in. This changes the device name (even using by-id).

[...]

My first reaction was that something funny must be going on. My second reaction was to look at an actual /dev/disk/by-id with a USB disk, at which point I got a sinking feeling that I should have already recognized from a long time ago. If you look at your /dev/disk/by-id, you will mostly see names that start with things like 'ata-', 'scsi-0ATA-', 'scsi-1ATA', and maybe 'usb-' (and perhaps 'nvme-', but that's a somewhat different kettle of fish). All of these names have the problem that they burn the transport (how you talk to the disk) into the /dev/disk/by-id name, which is supposed to be a stable identifier for the disk as a standalone thing.

As Filippo Valsorda's case demonstrates, the problem is that some disks can move between transports. When this happens, the theoretically stable name of the disk changes; what was 'usb-' is now likely 'ata-' or vice versa, and in some cases other transformations may happen. Your attempt to use a stable name has failed and you will likely have problems.

Experimentally, there seem to be some /dev/disk/by-id names that are more stable. Some but not all of our disks have 'wwn-' names (one USB attached disk I can look at doesn't). Our Ubuntu based systems have 'scsi-<hex digits>' and 'scsi-SATA-<disk id>' names, but one of my Fedora systems with SATA drives has only the 'scsi-<hex>' names and the other one has neither. One system we have a USB disk on has no names for the disk other than 'usb-' ones. It seems clear that it's challenging at best to give general advice about how a random Linux user should pick truly stable /dev/disk/by-id names, especially if you have USB drives in the picture.
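A quick way to survey the whole-disk names you have, ignoring the per-partition entries, is something like:

ls -l /dev/disk/by-id/ | grep -v -- '-part'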

(See also Persistent block device naming in the Arch Wiki.)

This whole current situation seems less than ideal, to put it one way. It would be nice if disks (and partitions on them) had names that were as transport independent and usable as possible, especially since most disks have theoretically unique serial numbers and model names available (and if you're worried about cross-transport duplicates, you should already be at least as worried about duplicates within the same type of transport).

PS: You can find out what information udev knows about your disks with 'udevadm info --query=all --name=/dev/...' (from, via, by coincidence). The information for a SATA disk differs between my two Fedora machines (one of them has various SCSI_* and ID_SCSI* stuff and the other doesn't), but I can't see any obvious reason for this.

Using pam_access to sometimes not use another PAM module

By: cks

Suppose that you want to authenticate SSH logins to your Linux systems using some form of multi-factor authentication (MFA). The normal way to do this is to use 'password' authentication and then in the PAM stack for sshd, use both the regular PAM authentication module(s) of your system and an additional PAM module that requires your MFA (in another entry about this I used the module name pam_mfa). However, in your particular MFA environment it's been decided that you don't have to require MFA for logins from some of your other networks or systems, and you'd like to implement this.

Because your MFA happens through PAM and the details of this are opaque to OpenSSH's sshd, you can't directly implement skipping MFA through sshd configuration settings. If sshd winds up doing password based authentication at all, it will run your full PAM stack and that will challenge people for MFA. So you must implement sometimes skipping your MFA module in PAM itself. Fortunately there is a PAM module we can use for this, pam_access.

The usual way to use pam_access is to restrict or allow logins (possibly only some logins) based on things like the source address people are trying to log in from (in this, it's sort of a superset of the old tcpwrappers). How this works is configured through an access control file. We can (ab)use this basic matching in combination with the more advanced form of PAM controls to skip our PAM MFA module if pam_access matches something.

What we want looks like this:

auth  [success=1 default=ignore]  pam_access.so noaudit accessfile=/etc/security/access-nomfa.conf
auth  requisite  pam_mfa

Pam_access itself will 'succeed' as a PAM module if the result of processing our access-nomfa.conf file is positive. When this happens, we skip the next PAM module, which is our MFA module. If it 'fails', we ignore the result, and as part of ignoring the result we tell pam_access to not report failures.

Our access-nomfa.conf file will have things like:

# Everyone skips MFA for internal networks
+:ALL:192.168.0.0/16 127.0.0.1

# Ensure we fail otherwise.
-:ALL:ALL

We list the networks we want to allow password logins without MFA from, and then we have to force everything else to fail. (If you leave this off, everything passes, either explicitly or implicitly.)

As covered in the access.conf manual page, you can get quite sophisticated here. For example, you could have people who always had to use MFA, even from internal machines. If they were all in a group called 'mustmfa', you might start with:

-:(mustmfa):ALL

If you get at all creative with your access-nomfa.conf, I strongly suggest writing a lot of comments to explain everything. Your future self will thank you.

Unfortunately but entirely reasonably, the information about the remote source of a login session doesn't pass through to later PAM authentication done by sudo and su commands that you do in the session. This means that you can't use pam_access to not give MFA challenges on su or sudo to people who are logged in from 'trusted' areas.

(As far as I can tell, the only information 'pam_access' gets about the 'origin' of a su is the TTY, which is generally not going to be useful. You can probably use this to not require MFA on su or sudo that are directly done from logins on the machine's physical console or serial console.)

Having an emergency backup DNS resolver with systemd-resolved

By: cks

At work we have a number of internal DNS resolvers, which you very much want to use to resolve DNS names if you're inside our networks for various reasons (including our split-horizon DNS setup). Purely internal DNS names aren't resolvable by the outside world at all, and some DNS names resolve differently. However, at the same time a lot of the host names that are very important to me are in our public DNS because they have public IPs (sort of for historical reasons), and so they can be properly resolved if you're using external DNS servers. This leaves me with a little bit of a paradox; on the one hand, my machines must resolve our DNS zones using our internal DNS servers, but on the other hand if our internal DNS servers aren't working for some reason (or my home machine can't reach them) it's very useful to still be able to resolve the DNS names of our servers, so I don't have to memorize their IP addresses.

A while back I switched to using systemd-resolved on my machines. Systemd-resolved has a number of interesting virtues, including that it has fast (and centralized) failover from one upstream DNS resolver to another. My systemd-resolved configuration is probably a bit unusual, in that I have a local resolver on my machines, so resolved's global DNS resolution goes to it and then I add a layer of (nominally) interface-specific DNS domain overrides that point to our internal DNS resolvers.

(This doesn't give me perfect DNS resolution, but it's more resilient and under my control than routing everything to our internal DNS resolvers, especially for my home machine.)

Somewhat recently, it occurred to me that I could deal with the problem of our internal DNS resolvers all being unavailable by adding '127.0.0.1' as an additional potential DNS server for my interface specific list of our domains. Obviously I put it at the end, where resolved won't normally use it. But with it there, if all of the other DNS servers are unavailable I can still try to resolve our public DNS names with my local DNS resolver, which will go out to the Internet to talk to various authoritative DNS servers for our zones.
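For illustration, the runtime version of this with resolvectl looks roughly like the following; the interface name, resolver addresses, and domain are all made up:

resolvectl dns enp5s0 10.1.0.53 10.2.0.53 127.0.0.1
resolvectl domain enp5s0 '~example.org'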

The drawback with this emergency backup approach is that systemd-resolved will stick with whatever DNS server it's currently using unless that DNS server stops responding. So if resolved switches to 127.0.0.1 for our zones, it's going to keep using it even after the other DNS resolvers become available again. I'll have to notice that and manually fiddle with the interface specific DNS server list to remove 127.0.0.1, which would force resolved to switch to some other server.

(As far as I can tell, the current systemd-resolved correctly handles the situation where an interface says that '127.0.0.1' is the DNS resolver for it, and doesn't try to force queries to 127.0.0.1:53 to go out that interface. My early 2013 notes say that this sometimes didn't work, but I failed to write down the specific circumstances.)

A surprise with /etc/cron.daily, run-parts, and files with '.' in their name

By: cks

Linux distributions have a long standing general cron feature where there are /etc/cron.hourly, /etc/cron.daily, and /etc/cron.weekly directories and if you put scripts in there, they will get run hourly, daily, or weekly (at some time set by the distribution). The actual running is generally implemented by a program called 'run-parts'. Since this is a standard Linux distribution feature, of course there is a single implementation of run-parts and its behavior is standardized, right?

Since I'm asking the question, you already know the answer: there are at least two different implementations of run-parts, and their behavior differs in at least one significant way (as well as several other probably less important ones).

In Debian, Ubuntu, and other Debian-derived distributions (and also I think Arch Linux), run-parts is a C program that is part of debianutils. In Fedora, Red Hat Enterprise Linux, and derived RPM-based distributions, run-parts is a shell script that's part of the crontabs package, which is part of cronie-cron. One somewhat unimportant way that these two versions differ is that the RPM version ignores some extensions that come from RPM packaging fun (you can see the current full list in the shell script code), while the Debian version only skips the Debian equivalents with a non-default option (and actually documents the behavior in the manual page).

A much more important difference is that the Debian version ignores files with a '.' in their name (this can be changed with a command line switch, but /etc/cron.daily and so on are not processed with this switch). As a non-hypothetical example, if you have a /etc/cron.daily/backup.sh script, a Debian based system will ignore this while a RHEL or Fedora based system will happily run it. If you are migrating a server from RHEL to Ubuntu, this may come as an unpleasant surprise, partly since the Debian version doesn't complain about skipping files.
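On Debian and Ubuntu you can see what run-parts will actually execute (and thus spot silently skipped scripts) with its --test option:

run-parts --test /etc/cron.daily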

(Whether or not the restriction could be said to be clearly documented in the Debian manual page is a matter of taste. Debian does clearly state the allowed characters, but it does not point out that '.', a not uncommon character, is explicitly not accepted by default.)

Linux software RAID and changing your system's hostname

By: cks

Today, I changed the hostname of an old Linux system (for reasons) and rebooted it. To my surprise, the system did not come up afterward, but instead got stuck in systemd's emergency mode for a chain of reasons that boiled down to there being no '/dev/md0'. Changing the hostname back to its old value and rebooting the system again caused it to come up fine. After some diagnostic work, I believe I understand what happened and how to work around it if it affects us in the future.

One of the issues that Linux RAID auto-assembly faces is the question of what it should call the assembled array. People want their RAID array names to stay fixed (so /dev/md0 is always /dev/md0), and so the name is part of the RAID array's metadata, but at the same time you have the problem of what happens if you connect up two sets of disks that both want to be 'md0'. Part of the answer is mdadm.conf, which can give arrays names based on their UUID. If your mdadm.conf says 'ARRAY /dev/md10 ... UUID=<x>' and mdadm finds a matching array, then in theory it can be confident you want that one to be /dev/md10 and it should rename anything else that claims to be /dev/md10.

However, suppose that your array is not specified in mdadm.conf. In that case, another software RAID array feature kicks in, which is that arrays can have a 'home host'. If the array is on its home host, it will get the name it claims it has, such as '/dev/md0'. Otherwise, well, let me quote from the 'Auto-Assembly' section of the mdadm manual page:

[...] Arrays which do not obviously belong to this host are given names that are expected not to conflict with anything local, and are started "read-auto" so that nothing is written to any device until the array is written to. i.e. automatic resync etc is delayed.

As is covered in the documentation for the '--homehost' option in the mdadm manual page, on modern 1.x superblock formats the home host is embedded into the name of the RAID array. You can see this with 'mdadm --detail', which can report things like:

Name : ubuntu-server:0
Name : <host>:25  (local to host <host>)

Both of these have a 'home host'; in the first case the home host is 'ubuntu-server', and in the second case the home host is the current machine's hostname. Well, its 'hostname' as far as mdadm is concerned, which can be set in part through mdadm.conf's 'HOMEHOST' directive. Let me repeat that, mdadm by default identifies home hosts by their hostname, not by any more stable identifier.

So if you change a machine's hostname and you have arrays not in your mdadm.conf with home hosts, their /dev/mdN device names will get changed when you reboot. This is what happened to me, as we hadn't added the array to the machine's mdadm.conf.

(Contrary to some ways to read the mdadm manual page, arrays are not renamed if they're in mdadm.conf. Otherwise we'd have noticed this a long time ago on our Ubuntu servers, where all of the arrays created in the installer have the home host of 'ubuntu-server', which is obviously not any machine's actual hostname.)

Setting the home host value to the machine's current hostname when an array is created is the mdadm default behavior, although you can turn this off with the right mdadm.conf HOMEHOST setting. You can also tell mdadm to consider all arrays to be on their home host, regardless of the home host embedded into their names.

(The latter is 'HOMEHOST <ignore>', the former by itself is 'HOMEHOST <none>', and it's currently valid to combine them both as 'HOMEHOST <ignore> <none>', although this isn't quite documented in the manual page.)
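The straightforward way to avoid this renaming is to make sure every array is listed in mdadm.conf. As a sketch (not a tested recipe):

# append ARRAY lines for all currently assembled arrays; the file is
# /etc/mdadm/mdadm.conf on Debian/Ubuntu and /etc/mdadm.conf on Fedora
mdadm --detail --scan >>/etc/mdadm/mdadm.conf

# on Debian/Ubuntu, the initramfs keeps its own copy, so regenerate it
update-initramfs -u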

PS: Some uses of software RAID arrays won't care about their names. For example, if they're used for filesystems, and your /etc/fstab specifies the device of the filesystem using 'UUID=' or with '/dev/disk/by-id/md-uuid-...' (which seems to be common on Ubuntu).

PPS: For 1.x superblocks, the array name as a whole can only be 32 characters long, which obviously limits how long of a home host name you can have, especially since you need a ':' in there as well and an array number or the like. If you create a RAID array on a system with a too long hostname, the name of the resulting array will not be in the '<host>:<name>' format that creates an array with a home host; instead, mdadm will set the name of the RAID to the base name (either whatever name you specified, or the N of the 'mdN' device you told it to use).

(It turns out that I managed to do this by accident on my home desktop, which has a long fully qualified name, by creating an array with the name 'ssd root'. The combination turns out to be 33 characters long, so the RAID array just got the name 'ssd root' instead of '<host>:ssd root'.)

Resetting the backoff restart delay for a systemd service

By: cks

Suppose, not hypothetically, that your Linux machine is your DSL PPPoE gateway, and you run the PPPoE software through a simple script to invoke pppd that's run as a systemd .service unit. Pppd itself will exit if the link fails for some reason, but generally you want to automatically try to establish it again. One way to do this (the simple way) is to set the systemd unit to 'Restart=always', with a restart delay.

Things like pppd generally benefit from a certain amount of backoff in their restart attempts, rather than restarting either slowly or rapidly all of the time. If your PPP(oE) link just dropped out briefly because of a hiccup, you want it back right away, not in five or ten minutes, but if there's a significant problem with the link, retrying every second doesn't help (and it may trigger things in your service provider's systems). Systemd supports this sort of backoff if you set 'RestartSteps' and 'RestartMaxDelaySec' to appropriate values. So you could wind up with, for example:

Restart=always
RestartSec=1s
RestartSteps=10
RestartMaxDelaySec=10m

This works fine in general, but there is a problem lurking. Suppose that one day you have a long outage in your service but it comes back, and then a few stable days later you have a brief service blip. To your surprise, your PPPoE session is not immediately restarted the way you expect. What's happened is that systemd doesn't reset its backoff timing just because your service has been up for a while.

To see the current state of your unit's backoff, you want to look at its properties, specifically 'NRestarts' and especially 'RestartUSecNext', which is the delay systemd will put on for the next restart. You see these with 'systemctl show <unit>', or perhaps 'systemctl show -p NRestarts,RestartUSecNext <unit>'. To reset your unit's dynamic backoff time, you run 'systemctl reset-failed <unit>'; this is the same thing you may need to do if you restart a unit too fast and the start stalls.

(I don't know if manually restarting your service with 'systemctl restart <unit>' bumps up the restart count and the backoff time, the way it can cause you to run into (re)start limits.)

At the moment, simply doing 'systemctl reset-failed' doesn't seem to be enough to immediately re-activate a unit that is slumbering in a long restart delay. So the full scale, completely reliable version is probably 'systemctl stop <unit>; systemctl reset-failed <unit>; systemctl start <unit>'. I don't know how you see that a unit is currently in a 'RestartUSecNext' delay, or how much time is left on the delay (such a delay doesn't seem to be a 'job' that appears in 'systemctl list-jobs', and it's not a timer unit so it doesn't show up in 'systemctl list-timers').

If you feel like making your start script more complicated (and it runs as root), I believe that you could keep track of how long this invocation of the service has been running, and if it's long enough, run a 'systemctl reset-failed <unit>' before the script exits. This would (manually) reset the backoff counter if the service has been up for long enough, which is often what you really want.
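A sketch of that, with a made-up unit name and pppd invocation, might look like this:

#!/bin/sh
# start script for our hypothetical pppoe.service unit
UNIT=pppoe.service
START=$(date +%s)

# however you actually invoke pppd goes here
/usr/sbin/pppd call dsl-provider nodetach
STATUS=$?

# If this run lasted a reasonably long time, consider the service
# healthy and clear systemd's restart backoff state for it.
if [ $(( $(date +%s) - START )) -ge 600 ]; then
    systemctl reset-failed "$UNIT"
fi
exit $STATUS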

(If systemd has a unit setting that will already do this, I was unable to spot it.)

Options for adding IPv6 networking to your libvirt based virtual machines

By: cks

Recently, my home ISP switched me from an IPv6 /64 allocation to a /56 allocation, which means that now I can have a bunch of proper /64s for different purposes. I promptly celebrated this by, in part, extending IPv6 to my libvirt based virtual machine, which is on a bridged internal virtual network (cf). Libvirt provides three different ways to provide (public) IPv6 to such virtual machines, all of which will require you to edit your network XML (either inside the virt-manager GUI or directly with command line tools). The three ways aren't exclusive; you can use two of them or even all three at the same time, in which case your VMs will have two or three public IPv6 addresses (at least).

(None of this applies if you're directly bridging your virtual machines onto some physical network. In that case, whatever the physical network has set up for IPv6 is what your VMs will get.)

First, in all cases you're probably going to want an IPv6 '<ip>' block that sets the IPv6 address for your host machine and implicitly specifies your /64. This is an active requirement for two of the options, and typically looks like this:

<ip family='ipv6' address='2001:19XX:0:1102::1' prefix='64'>
[...]
</ip>

Here my desktop will have 2001:19XX:0:1102::1/64 as its address on the internal libvirt network.

The option that is probably the least hassle is to give static IPv6 addresses to your VMs. This is done with <host> elements inside a <dhcp> element (inside your IPv6 <ip>, which I'm not going to repeat):

<dhcp>
  <host name='hl-fedora-36' ip='2001:XXXX:0:1102::189'/>
</dhcp>

Unlike with IPv4, you can't identify VMs by their MAC address because, to quote the network XML documentation:

[...] The IPv6 host element differs slightly from that for IPv4: there is no mac attribute since a MAC address has no defined meaning in IPv6. [...]

Instead you probably need to identify your virtual machines by their (DHCP) hostname. Libvirt has another option for this but it's not really well documented and your virtual machine may not be set up with the necessary bits to use it.

The second least hassle option is to provide a DHCP dynamic range of IPv6 addresses. In the current Fedora 40 libvirt, this has the undocumented limitation that the range can't include more than 65,535 IPv6 addresses, so you can't cover the entire /64. Instead you wind up with something like this:

<dhcp>
  <range start='2001:XXXX:0:1102::1000' end='2001:XXXX:0:1102::ffff'/>
</dhcp>

Famously, not everything in the world does DHCP6; some things only do SLAAC, and in general SLAAC will allocate random IPv6 IPs across your entire /64. Libvirt uses dnsmasq (also) to provide IP addresses to virtual machines, and dnsmasq can do SLAAC (see the dnsmasq manual page). However, libvirt currently provides no directly exposed controls to turn this on; instead, you need to use a special libvirt network XML namespace to directly set up the option in the dnsmasq configuration file that libvirt will generate.

What you need looks like:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
[...]
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-range=2001:XXXX:0:1102::,slaac,64'/>
  </dnsmasq:options>
</network>

(The 'xmlns:dnsmasq=' bit is what you have to add to the normal <network> element.)

I believe that this may not require you to declare an IPv6 <ip> section at all, although I haven't tested that. In my environment I want both SLAAC and a static IPv6 address, and I'm happy to not have DHCP6 as such, since SLAAC will allocate a much wider and more varied range of IPv6 addresses.

(You can combine a dnsmasq SLAAC dhcp-range with a regular DHCP6 range, in which case SLAAC-capable IPv6 virtual machines will get an IP address from both, possibly along with a third static IPv6 address.)

PS: Remember to set firewall rules to restrict access to those public IPv6 addresses, unless you want your virtual machines fully exposed on IPv6 (when they're probably protected on IPv4 by virtue of being NAT'd).
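As an illustration of the sort of thing I mean, a minimal nftables sketch for the VM host might look like the following, using the documentation prefix 2001:db8:0:1102::/64 to stand in for the real one and assuming that traffic to the VMs is routed (and thus forwarded) through the host:

nft add table inet vmfw
nft add chain inet vmfw forward '{ type filter hook forward priority 0; policy accept; }'
nft add rule inet vmfw forward ip6 daddr 2001:db8:0:1102::/64 ct state established,related accept
nft add rule inet vmfw forward ip6 daddr 2001:db8:0:1102::/64 meta l4proto ipv6-icmp accept
nft add rule inet vmfw forward ip6 daddr 2001:db8:0:1102::/64 drop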

Mostly getting redundant UEFI boot disks on modern Ubuntu (especially 24.04)

By: cks

When I wrote about how our primary goal for mirrored (system) disks is increased redundancy, including being able to reboot the system after the primary disk failed, vowhite asked in a comment if there was any trick to getting this working with UEFI. The answer is sort of, and it's mostly the same as you want to do with BIOS MBR booting.

In the Ubuntu installer, when you set up redundant system disks it's long been the case that you wanted to explicitly tell the installer to use the second disk as an additional boot device (in addition to setting up a software RAID mirror of the root filesystem across both disks). In the BIOS MBR world, this installed GRUB bootblocks on the disk; in the UEFI world, this causes the installer to set up an extra EFI System Partition (ESP) on the second drive and populate it with the same sort of things as the ESP on the first drive.

(The 'first' and the 'second' drive are not necessarily what you think they are, since the Ubuntu installer doesn't always present drives to you in their enumeration order.)

I believe that this dates from Ubuntu 22.04, when Ubuntu seems to have added support for multi-disk UEFI. Ubuntu will mount one of these ESPs (the one it considers the 'first') on /boot/efi, and as part of multi-disk UEFI support it will also arrange to update the other ESP. You can see what other disk Ubuntu expects to find this ESP on by looking at the debconf selection 'grub-efi/install_devices'. For perfectly sensible reasons this will identify disks by their disk IDs (as found in /dev/disk/by-id), and it normally lists both ESPs.

All of this is great but it leaves you with two problems if the disk with your primary ESP fails. The first is the question of whether your system's BIOS will automatically boot off the second ESP. I believe that UEFI firmware will often do this, and you can specifically set this up with EFI boot entries through things like efibootmgr (also); possibly current Ubuntu installers do this for you automatically if it seems necessary.

The bigger problem is the /boot/efi mount. If the primary disk fails, a mounted /boot/efi will start having disk IO errors and then if the system reboots, Ubuntu will probably be unable to find and mount /boot/efi from the now gone or error-prone primary disk. If this is a significant concern, I think you need to make the /boot/efi mount 'nofail' in /etc/fstab (per fstab(5)). Energetic people might want to go further and make it either 'noauto' so that it's not even mounted normally, or perhaps mark it as a systemd automounted filesystem with 'x-systemd.automount' (per systemd.mount).
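For example, a /boot/efi line with these changes might look like this; the UUID is a placeholder for your ESP's actual filesystem UUID:

UUID=XXXX-XXXX  /boot/efi  vfat  umask=0077,nofail,x-systemd.automount  0  1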

(The disclaimer is that I don't know how Ubuntu will react if /boot/efi isn't mounted at all or is a systemd automount mountpoint. I think that GRUB updates will cope with having it not mounted at all.)

If any disk with an ESP on it fails and has to be replaced, you have to recreate a new ESP on that disk and then, I believe, run 'dpkg-reconfigure grub-efi-amd64', which will ask you to select the ESPs you want to be automatically updated. You may then need to manually run '/usr/lib/grub/grub-multi-install --target=x86_64-efi', which will populate the new ESP (or it may be automatically run through the reconfigure). I'm not sure about this because we haven't had any UEFI system disks fail yet.

(The ESP is a vfat formatted filesystem, which can be set up with mkfs.vfat, and has specific requirements for its GUIDs and so on, which you'll have to set up by hand in the partitioning tool of your choice or perhaps automatically by copying the partitioning of the surviving system disk to your new disk.)
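We haven't actually had to do this yet, so treat the following as an untested sketch; the device names are examples and you should be very sure which disk is which before running anything destructive:

# copy the surviving disk's GPT partition table to the replacement disk,
# then give the replacement its own random GUIDs
sgdisk --replicate=/dev/sdb /dev/sda   # sda: surviving disk, sdb: replacement
sgdisk --randomize-guids /dev/sdb

# make a new vfat filesystem on the replacement's ESP partition
mkfs.vfat -F 32 /dev/sdb1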

If it was the primary disk that failed, you will probably want to update /etc/fstab to get /boot/efi from a place that still exists (probably with 'nofail' and perhaps with 'noauto'). This might be somewhat easy to overlook if the primary disk fails without the system rebooting, at which point you'd get an unpleasant surprise on the next system reboot.

The general difference between UEFI and BIOS MBR booting for this is that in BIOS MBR booting, there's no /boot/efi to cause problems and running 'grub-install' against your replacement disk is a lot easier than creating and setting up the ESP. As I found out, a properly set up BIOS MBR system also 'knows' in debconf what devices you have GRUB installed on, and you'll need to update this (probably with 'dpkg-reconfigure grub-pc') when you replace a system disk.

(We've been able to avoid this so far because in Ubuntu 20.04 and 22.04, 'grub-install' isn't run during GRUB package updates for BIOS MBR systems so no errors actually show up. If we install any 24.04 systems with BIOS MBR booting and they have system disk failures, we'll have to remember to deal with it.)

(See also my entry on multi-disk UEFI in Ubuntu 22.04, which goes deeper into some details. That entry was written before I knew that a 'grub-*/install_devices' setting of a software RAID array was actually an error on Ubuntu's part, although I'd still like GRUB's UEFI and BIOS MBR scripts to support it.)

Why my Fedora 40 systems stalled logins for ten seconds or so

By: cks

One of my peculiarities is that I reboot my Fedora 40 desktops by logging in as root on a text terminal and then running 'reboot' (sometimes or often also telling loginctl to terminate any remainders of my login session so that the reboot doesn't stall for irritating lengths of time). Recently, the simple process of logging in as root has been stalling for an alarmingly long time, enough time to make me think something was wrong with the system (it turns out that the stall was probably ten seconds or so, but even a couple of seconds is alarming for your root login not working). Today I hit this again and this time I dug into what was happening, partly because I was able to reproduce it with something other than a root login to reboot the machine.

My first step was to use the excellent extrace to find out what was taking so long, since this can trace all programs run from one top level process and report how long they took (along with the command line arguments). This revealed that the time consuming command was '/usr/libexec/pk-command-not-found compinit -c', and it was being run as part of quite a lot of commands being executed during shell startup. Specifically, Bash, because on Fedora root's login shell is Bash. This was happening because Bash's normal setup will source everything from /etc/profile.d/ in order to set up your new (interactive) Bash setup, and it turns out that there's a lot there. Using 'bash -xl' I was able to determine that pk-command-not-found was probably being run somehow in /usr/share/lmod/lmod/init/bash. If you're as puzzled as I was about that, lmod (also) is apparently a system for setting up paths for accessing Lua 'modules', so it wants to hook into shell startup to set up its environment variables.

It took me a bit of time to understand how the bits fit together, partly because there's no documentation for pk-command-not-found. The first step is that Bash has a feature that allows you to hook into what happens when a command isn't found (cf, see the discussion of the (potential) command_not_found_handle function), and PackageKit is doing this (in the PackageKit-command-not-found Fedora RPM package, which Fedora installs as a standard feature). It turns out that Bash will invoke this handler function not just for commands you run interactively, but also commands that aren't found while Bash is sourcing all of your shell startup. This handler is being triggered in Lmod's init/bash code because said code attempts to run 'compinit -c' to set up completion in zsh so that it can modify zsh's function search path. Compinit is a zsh thing (it's not technically a builtin), so there is no exposed 'compinit' command on the system. Running compinit outside of zsh is a bug; in this case, an expensive bug.

My solution was to remove both PackageKit-command-not-found, because I don't want this slow 'command not found' handling in general, and also the Lmod package, because I don't use Lmod. Because I'm a certain sort of person, I filed Lmod issue #725 to report the issue.

In some testing in a virtual machine, it appears that pk-command-not-found may be so slow only the first time it's invoked. This means that most people with these packages installed may not see or at least realize what's happening, because under normal circumstances they probably log in to Fedora machines graphically, at which point the login stall is hidden in the general graphical environment startup delay that everyone expects to be slow. I'm in the unusual circumstance that my login doesn't use any normal shell, so logging in as root is the first time my desktops will run Bash interactively and trigger pk-command-not-found.

(This elaborates on and cleans up a Fediverse thread I wrote as I poked around.)

I wish (Linux) WireGuard had a simple way to restrict peer public IPs

By: cks

WireGuard is an obvious tool to build encrypted, authenticated connections out of, over which you can run more or less any network service. For example, you might expose the rsync daemon only over a specific WireGuard interface, instead of running rsync over SSH. Unfortunately, if you want to use WireGuard as a SSH replacement in this fashion, it has one limitation; unlike SSH, there's no simple way to restrict the public IP address of a particular peer.

The rough equivalent of a WireGuard peer is a SSH keypair. In SSH, you can restrict where a keypair will be accepted from with the 'from="..."' restriction in your .ssh/authorized_keys. This provides an extra layer of protection against the key being compromised; not only does an attacker have to acquire the key, they have to be able to use it from exactly the same IP (or the expected IPs). However, more or less by design WireGuard doesn't have a particular restriction on where a WireGuard peer key can be used from. You can set an expected public IP for the peer, but if the peer contacts you from another IP, your (Linux kernel) WireGuard will update its idea of where the peer is. This is handy for WireGuard's usual usage cases but not what we necessarily want for a wired down connection where the IPs should never change.

(I don't think this is a technical restriction in the WireGuard protocol, just something not done in most or all implementations.)

The normal answer is firewall rules that restrict access to the WireGuard port, but this has two limitations. The first and lesser limitation is that it's external to WireGuard, so it's possible to have WireGuard active but your firewall rules not properly applied, theoretically allowing more access than you intend. The bigger limitation is that if you have more than one such wired down WireGuard peer, firewall rules can't tell which WireGuard peer key is being used by which external peer. So in a straightforward implementation of firewall rules, any peer public IP can impersonate any other (if it has the required WireGuard peer key), which is different from the SSH 'from="..."' situation, where each key is restricted separately.

(On the other hand, the firewall situation is better in one way in that you can't accidentally add a WireGuard peer that will be accepted from anywhere the way you can with a SSH key by forgetting to put in a 'from="..."' restriction.)

To get firewall rules that can tell peers apart, you need to use different listening ports for each peer on your end. Today, this requires different WireGuard interfaces (and probably different server keys) for each peer. I think you can probably give all of the interfaces the same internal IP to simplify your life, although I haven't tested this.
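As a sketch of the per-peer port idea, assuming an existing 'inet filter' table with an 'input' chain (all of the ports and source IPs here are made up):

# peer A uses the WireGuard interface listening on port 51821
nft add rule inet filter input udp dport 51821 ip saddr 192.0.2.10 accept
nft add rule inet filter input udp dport 51821 drop

# peer B gets its own interface on port 51822, with its own rules
nft add rule inet filter input udp dport 51822 ip saddr 198.51.100.20 accept
nft add rule inet filter input udp dport 51822 drop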

(Having written this entry, I now wonder if it would be possible to write an nftables or iptables extension that hooked into the kernel side of WireGuard enough to know peer identities and let you match on them. Existing extensions are already able to be aware of various things like cgroup membership, and there's an existing extension for IPsec. Possibly you could do this with eBPF programs, since there's a BPF/eBPF iptables extension.)

The problems (Open)ZFS can have on new Linux kernel versions

By: cks

Every so often, someone out there is using a normal released version of OpenZFS on Linux (currently ZFS 2.2.6, which was just released) on a distribution that uses very new kernels (such as Fedora). They may then read that their version of ZFS (such as 2.2.5) doesn't list the latest kernel (such as 6.10) as a 'supported platform'. They may then wonder why this is so.

Part of the answer is that OpenZFS developers are cautious people who don't want to list new kernels as officially supported until people have carefully inspected and tested the situation. Even if everything looks good, it's possible that there is some subtle problem in the interface between (Open)ZFS and the new kernel version. But another part of the answer comes down to how the Linux kernel has no stable internal API, which is also part of how you can get subtle problems in new kernels.

The Linux kernel is constantly changing how things work internally. Functions appear or go away (or simply mutate); fields are added or removed from C structs, or sometimes change their meaning; function arguments change; how you're supposed to do things shifts. It's up to any out of tree code, such as OpenZFS, to keep up with these changes (and that's why you want kernel modules to be in the main Linux kernel if possible, because then other people do some of this work). So to merely compile on a new kernel version, OpenZFS may need to change its own code to match the kernel changes. Sometimes this will be simple, requiring almost no changes; other times it may lead to a bunch of modifications.

(Two examples are the master pull request for 6.10, which had only a few changes, and the larger master pull request for 6.11, which may not even be quite complete yet since 6.11 is not yet released.)

Having things compiling is merely the first step. The OpenZFS developers need to make sure that they're making the right changes, and also they generally want to try to see if things have changed in a way that doesn't break compiling code. To quote a message from Rob Norris on the ZFS on Linux mailing list:

"Support" here means that the people involved with the OpenZFS are reasonably certain that the traditional OpenZFS goals of stability, durability, etc will hold when used with that kernel version. That usually means the test suites have passed, there's no significant new issues reported, and at least three people have looked at the kernel changes, the matching OpenZFS changes, and thought very hard about it.

As a practical matter (as Rob Norris notes), this often means that development versions of OpenZFS will often build and work on new kernel versions well before they're officially supported. Speaking from personal experience, it's possible to be using kernel versions that are not yet 'supported' without noticing until you hit an RPM version dependency surprise.

How not to upgrade (some) held packages on Ubuntu (and Debian)

By: cks

We hold a number of packages across our Ubuntu fleet (for good reasons), so that they're only upgraded under controlled circumstances. Which packages are held varies, but they always include the kernel packages (among other issues, we don't want machines to reboot into new kernels by surprise, for example after a crash or a power issue). Some of our hosts are used for testing, and I generally update their kernels (far) more often than our regular machines for various reasons. Until recently I did this with the obvious 'apt-get' command line:

apt-get -u upgrade --with-new-pkgs --ignore-hold

The problem with this is that it upgrades all held packages, not just the kernel. I have historically gotten away with this on the machines I do this on, but recently I got burned (well, it was more my co-workers who got burned); as part of a kernel upgrade I also upgraded another package that caused some problems.

Instead what you (I) need to do is to use 'apt-mark unhold <packages>' and then just 'apt-get -u upgrade --with-new-pkgs'. This is less convenient (but at least these days we have apt-mark). I continue to be sad that 'apt-get upgrade' doesn't accept a list of packages to upgrade and instead upgrades everything, so you can't do 'apt-get upgrade linux-image-*' to directly express what you (I) want here.

(Fedora's DNF will do this, along with the inverse option of 'dnf upgrade --exclude=...', and both of these are quite nice.)
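Put together, the safer sequence amounts to something like the following sketch; the package names here are purely illustrative (they're not our actual hold list), and you need to remember to re-apply the holds afterward:

# unhold only what you actually want to upgrade (example package names)
apt-mark unhold linux-image-generic linux-headers-generic
apt-get -u upgrade --with-new-pkgs
# put the holds back once you're done
apt-mark hold linux-image-generic linux-headers-generic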

You can do this with 'apt-get install', but if you're going to use wildcards in the package name for convenience, you need to be careful and add an extra option, --only-upgrade:

apt-get -u install --only-upgrade 'linux-*'

Otherwise, 'apt-get install ...' will faithfully do exactly what you told it to, which is install or upgrade all of the packages that match the wildcard. If you're using 'apt-get install' to upgrade held packages, you probably don't want that. Despite its name, the --only-upgrade option will install new packages that are required by the packages that you're upgrading, such as new kernel packages that are required by a new version of 'linux-image-generic'.

The one semi-virtue of explicitly unholding packages to upgrade them is that this makes it very obvious that the packages are in fact unheld. An 'apt-get install <packages>' or an 'apt-get upgrade --ignore-hold' will unhold the packages as a side effect. Fortunately we long ago modified our update system to automatically apply our standard package holds before it did anything else (after one too many accidents where we should have re-held a package but forgot).

(I'm sure you could write a cover script to handle all of this, if you wanted to. Currently I don't do this often enough to go that far.)

I used libvirt's 'virt-install' briefly and it worked nicely

By: cks

My normal way of using libvirt based virtual machines has been to initially create them in virt-manager using its convenient GUI, if necessary use virt-viewer to access their consoles, and use virsh for basic operations like starting and stopping VMs and rolling VMs back to snapshots, which I make heavy use of. Then recently I wrote about why and how I keep around spare virtual machines, and wound up discovering virt-install, which is supposed to let you easily create (and install) virtual machines from the command line. My first experience with it went well, so now I'm going to write myself some notes.

(I spun up a new virtual machine from scratch in order to poke at FreeBSD a bit.)

Due to having set up a number of VMs through virt-manager, I had already defined the network I wanted as well as a libvirt storage pool where the disks for the new virt-install VM could go. With those already existing, using virt-install was mostly a long list of arguments:

virt-install -n vmguest7 \
   --memory 8192 --vcpus 2 --cpu host \
   -c /virt/images/freebsd/FreeBSD-14.1-RELEASE-amd64-dvd1.iso \
   --osinfo freebsd14.0 \
   --disk size=20 --disk size=20 \
   -w network=netN-macvtap \
   --graphics spice --noautoconsole

(I think I should have used '--cpu host-passthrough' instead, because I think '--cpu host' caused virt-install to copy the host CPU features into the new VM instead of telling the new VM to just use whatever the host had.)

This created a VM with 8 GB of RAM (FreeBSD's minimum recommended amount for root on ZFS), two CPUs that are just like the host, two 20 GByte disks, the right sort of networking (using the already defined libvirt network), and not trying to start any sort of console since I was ssh'd in to the VM host. Once started, I used virt-viewer on my local machine to connect to the console and went through the standard FreeBSD installer in order to gain experience with it and see how it would go when I later did this on physical hardware.

This didn't create quite the same thing that I would normally get in virt-manager; for instance, this VM was created with an 'i440FX' (virtual) chipset instead of the Q35 chipset that I normally use and that may be better (this might be fixed with '--machine q35' or perhaps '--machine pc-q35-6.2'). The 'CDROM' it wound up with is an IDE one instead of a SATA one, although FreeBSD had no objections to it. All of the various differences don't seem to be particularly important, since the result worked and I'm only doing this for testing. The VM's new disks did get sensible file names, ie ones based on the VM's name.
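For my own future reference, a variant of the command above that folds in those guesses might look like this; it's untested as written, so treat it as a sketch rather than a recipe:

virt-install -n vmguest7 \
   --memory 8192 --vcpus 2 --cpu host-passthrough \
   --machine q35 \
   -c /virt/images/freebsd/FreeBSD-14.1-RELEASE-amd64-dvd1.iso \
   --osinfo freebsd14.0 \
   --disk size=20 --disk size=20 \
   -w network=netN-macvtap \
   --graphics spice --noautoconsole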

(When the install finished and rebooted, the VM powered off, but this might have been a peculiarity in how I did things.)

Virt-install can create transient VMs with --transient, but as its documentation notes, the disks for these VMs aren't deleted after the VM itself is cleaned up. There are probably ways to use virt-install and some additional tooling to get truly transient VMs, where even their disks are deleted afterward, but I haven't looked at that since right now it's not really a usage case I'm interested in. If I'm spinning up a VM today, I want it to stick around for at least a bit.

(I'm also not interested in virt-builder or the automatic install side of virt-install; to put it one way, I want virtual versions of our physical servers, and they're not installed through cloud-init or other completely automated ways. I do have a limited use for using guestfish to automatically modify VM filesystems.)

Why and how I keep around spare libvirt based virtual machines

By: cks

Recently I mentioned in passing that I keep around spare virtual machines, and in comments Todd quite reasonably asked how one has such a thing (and sort of why one would bother). There are two parts to the answer, a general one and a libvirt one.

The general part is that one set of my virtual machines is directly on the network, not NAT'd, using specifically assigned static IPs. In order to avoid ever having two VMs accidentally use the same IP, I pre-create a VM for each reserved IP, with the (libvirt) name of the VM being its hostname. This still requires configuring each VM's OS with the right IP, but at least accidents are a lot less likely (and in my dominant use for the VMs, I do an initial install of an Ubuntu version with the right IP and then snapshot it).

The libvirt specific part is that I find it a pain in the rear to create a virtual machine, complete with creating and tracking a disk or disks for it, setting various bits and pieces up, and so on. Clever people who do this a lot could probably script it or build generic XML files or similar things, but instead I do it as little as possible, which means that I almost never delete virtual machines even if I'm not using them (although I shut them down). Right now my office desktop has ten VMs configured, none of which are normally running.

(I call this libvirt specific because it's fundamentally a user interface issue, since I could fix it with some sort of provisioning and de-provisioning script that automated all of the fiddly bits for me.)

The most important part of how I keep such VMs as 'spares' is that every time I set up a new VM, I snapshot its initial configuration, complete with a blank initial disk (under the imaginative snapshot name of 'empty-initial'). Then if I want to test something from complete scratch I don't have to go through the effort of making a new VM or erasing the disk of a currently unused one; I just find a currently unused VM, do 'virsh snapshot-revert cksvm5 empty-initial', connect the virtual DVD to an appropriate image (such as the latest FreeBSD or OpenBSD), and then run 'virsh start cksvm5'.
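The snapshot itself is simple to make when a VM is newly defined; roughly (using the example VM name from above):

# take the 'empty-initial' snapshot right after defining the VM,
# while its disk is still blank
virsh snapshot-create-as cksvm5 empty-initial
# later, to reuse the VM from scratch
virsh snapshot-revert cksvm5 empty-initial
virsh start cksvm5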

(My earlier entry on how I've set up my libvirt based virtual machines covers the somewhat different way I handle having spare customized Ubuntu VMs that I can use to test things in our standard Ubuntu server environment.)

Using snapshots instead of creating and deleting VMs is probably a bit less efficient at the system level, but not enough for me to notice and care. Having written this, it occurs to me that I could get much the same effect by attaching and detaching virtual disks to the VMs, but with current tooling that would take more work. Libvirt's virsh command line tools make snapshots the easiest approach.

The Broadcom 'bnxt' Ethernet driver and RDMA (in Ubuntu 24.04)

By: cks

We have a number of Supermicro machines with dual 10G-T Broadcom based networking; specifically what they have is the 'BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller'. Under Ubuntu 22.04, everything is fine with these cards (or at least seems to be in non-production use), using the normal bnxt_en kernel driver module. Unfortunately this is not our experience in Ubuntu 24.04.

In Ubuntu 24.04, these machines also load an additional Broadcom bnxt driver, bnxt_re, which is the 'Broadcom NetXtreme-C/E RoCE' driver. RoCE is short for RDMA over Converged Ethernet, and to confuse you, this driver is found in the 'Infiniband' area of the Linux kernel drivers tree. Unfortunately, on our hardware the 24.04 bnxt_re doesn't work (or maybe the hardware doesn't work and bnxt_re is failing to detect that, although with 'RDMA' in the name of the hardware one sort of suspects it's supposed to work). The driver stalls during boot and spits out kernel messages like:

bnxt_en 0000:ab:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xf]=0x3 waited (102721 > 100000) msec active 1
bnxt_en 0000:ab:00.0 bnxt_re0: Failed to modify HW QP
infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
infiniband bnxt_re0: Couldn't start port
bnxt_en 0000:ab:00.0 bnxt_re0: Failed to destroy HW QP
[... more fun ensues ...]

This causes systemd-udev-settle.service to fail:

udevadm[1212]: Timed out for waiting the udev queue being empty.
systemd[1]: systemd-udev-settle.service: Main process exited, code=exited, status=1/FAILURE

This then causes Ubuntu 24.04's ZFS services to fail to completely start, which is a bad thing on hardware that we want to use for our ZFS fileservers.

We aren't the only people with this problem, so I was able to find various threads on the Internet, for example. These gave me the solution, which is to blacklist the bnxt_re kernel module, but at the time left me with the mystery of how and why the bnxt_re module was even being loaded in the first place.
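On Ubuntu, the blacklisting amounts to something like the following sketch (the file name is arbitrary, and rebuilding the initramfs matters if the module would otherwise get pulled in during early boot):

# keep udev's modalias matching from loading bnxt_re
echo 'blacklist bnxt_re' >/etc/modprobe.d/blacklist-bnxt-re.conf
# belt and suspenders: also refuse explicit or dependency loads
echo 'install bnxt_re /bin/false' >>/etc/modprobe.d/blacklist-bnxt-re.conf
# make the blacklist apply in the initramfs too
update-initramfs -u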

The answer is that bnxt_re is being loaded through the second sort of kernel driver module loading. It is an 'auxiliary' module for handling RDMA on top of the normal bnxt_en network driver, and the bnxt_en module basically asks for it to be loaded (which also suggests that at least the module thinks the hardware should be able to do RDMA properly). More specifically, bnxt_en asks for 'bnxt_en.rdma' to be loaded, and that name is an alias for bnxt_re. Fortunately you don't have to know all of this in order to block bnxt_re from loading.

We don't have any 22.04 installs on this specific hardware any more, so I can't be completely sure what happened under 22.04, but it appears that 22.04 didn't load the bnxt_re module on these servers. Running 'modinfo' on the 22.04 module shows that it doesn't have the bnxt_en.rdma module alias it does in 24.04, so maybe you had to manually load it if your hardware had RDMA and you wanted to use it.

(Looking at kernel source history, it appears that bnxt_re support for using this 'auxiliary driver interface' only appeared in kernel 6.3, which is much too late for Ubuntu 22.04's normal server kernel, which is based on 5.15.0.)

One of my lessons learned from this is that in today's Linux kernel environment, drivers may enable additional functionality that you neither asked for nor wanted, just because it's there. We don't use RDMA and never asked for anything related to RoCE, but because the hardware is (theoretically) capable of it, we got it anyway.

Rescuing ZFS

Last week I decided to upgrade the Debian operating system on one of my servers. In principle an upgrade is simple - you add the new package repositories to sources.list, then run apt-get -y update, apt-get -y upgrade and apt-get -y full-upgrade (plus a few small things). I did all of that nicely, and at the end all that remained was the reboot command to restart the system. A minute or two of waiting - and the server should wake up with an upgraded operating system. Except that this didn't happen. Not after five minutes, and not after ten. Which is... a sign for alarm. Especially when the server is at the other end of... Slovenia (or Europe, it hardly matters).

PiKVM

Fortunately, a PiKVM was connected to the server. This is a small device that provides remote access to and remote management of computers. PiKVM is basically an add-on (a so-called "hat") that you attach to a Raspberry Pi. The PiKVM is then connected to the computer in place of a monitor and keyboard/mouse - in this setup the PiKVM acts as a virtual monitor, virtual keyboard, mouse, CD, USB stick and so on. Through it you can manage the computer or server remotely (you can even enter the BIOS, virtually press the power button or the reset button) - and all of this through a web browser. The software is completely open source, and the device also supports connecting to a KVM switch, which lets you remotely manage several computers - ideal, for example, for mounting in a data center.

PiKVM as purchased.

In short, when the server had not responded for a while, I connected to the PiKVM and went to see what had actually happened. And what had happened was... a catastrophe.

The problem

After the reboot, the server had gotten stuck in the initramfs. Aaaaaa! At the bottom of the screen glowed one last warning before the system finally expired - ALERT! ZFS=rpool/ROOT/debian does not exists. Dropping to a shell!. In my despair I overlooked the "s" and read "hell"…

At that moment I remembered that the server's root partition was of course running ZFS - an encrypted ZFS at that - and that during the upgrade I had of course forgotten to manually enable the kernel modules that would let the operating system recognize ZFS at boot. And to make matters worse, the server was running (well, not any more) several virtual servers. All of which were now, of course, unreachable.

Note: ZFS (Zettabyte File System) is an advanced filesystem known for its reliability, its scalability, its use of advanced techniques for detecting and correcting errors (which ensures that data stays consistent and uncorrupted), its use of compression and deduplication, and so on. In short, ideal for server environments.

Good, now we know what the problem is, but how do we fix it?

A rescue plan

To recover at least a little from the shock, I first made myself a strong coffee. That decision turned out to be a strategic one, since rescuing the system dragged on late into the night (and into the following morning).

After a bit of thought, the following plan took shape in my head. First boot the system from a Debian live CD, install ZFS support on that temporary system, mount the ZFS disks, chroot into the old system, repair the damage there, and reboot the whole thing. And that's it!

At this point, in some old movie, I would just get on my horse and ride off into the sunset, but as it turned out, the road to the horse (and its saddle)... was still a rather thorny one. Let's take it step by step.

PiKVM in action.

First I transferred the file debian-live-12.6.0-amd64-standard.iso to the PiKVM, attached it as a virtual CD, and booted the server. This was genuinely easy, and the PiKVM once again proved to be worth its money.

Right at the start, though, it turned out that the server only recognized a US keyboard. And since mine is Slovenian, I first had to figure out which key to press to get exactly the special character I needed. Here are some of the characters I needed most often on the Slovenian keyboard and their "translations" to the US keyboard:

- /
? - 
Ž |
+ =
/ &

Light at the end of the tunnel

The next step was to add the contrib repository to the live system's /etc/apt/sources.list. After that I could install ZFS support: sudo apt update && sudo apt install linux-headers-amd64 zfsutils-linux zfs-dkms zfs-zed.

After a minute or two I could load the ZFS kernel modules: sudo modprobe zfs. The command zfs version showed that ZFS support was now working:

zfs-2.1.11-1
zfs-kmod-2.1.11-1

Well, the first step had succeeded; now "all" that remained was to attach the existing disks to the system. First I created a suitable directory to mount them on: sudo mkdir /sysroot.

Then I tried to attach my "rpool" ZFS pool to it. The commands below are only approximate (you probably have to do something more, such as setting the mountpoint), but they may serve as a guide for someone with similar problems. I should of course add that it did not go entirely smoothly, and quite a bit of gymnastics was needed to reach the final goal.

sudo zpool import -N -R /sysroot rpool -f

sudo zpool status
sudo zpool list
sudo zfs get mountpoint

At this point I entered the encryption passphrase: sudo zfs load-key rpool… and checked that ZFS was unlocked: sudo zfs get encryption,keystatus.

Now the mount: sudo zfs mount rpool/ROOT/debian. And there it was - the data was visible and, by the looks of it, nothing had been lost!

Reviving the "patient"…

Finally came the chroot into the old system:

sudo mkdir /sysroot/mnt
sudo mkdir /sysroot/mnt/dev
sudo mkdir /sysroot/mnt/proc
sudo mkdir /sysroot/mnt/sys
sudo mkdir /sysroot/mnt/run
sudo mount -t tmpfs tmpfs /sysroot/mnt/run
sudo mkdir /sysroot/mnt/run/lock

sudo mount --make-private --rbind /dev /sysroot/mnt/dev
sudo mount --make-private --rbind /proc /sysroot/mnt/proc
sudo mount --make-private --rbind /sys /sysroot/mnt/sys

sudo chroot /sysroot/mnt /usr/bin/env DISK=$DISK bash --login

So now I was successfully inside the old ("broken") system. First, ZFS support had to be installed into it:

apt install --yes dpkg-dev linux-headers-generic linux-image-generic
apt install --yes zfs-initramfs
echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf

…with minor complications

Of course another error popped up along the way: installing software was not possible because of a broken systemd package. I solved that with:

sudo rm /var/lib/dpkg/info/systemd*
sudo dpkg --configure -D 777 systemd
sudo apt -f install

Then, of course, unresolved dependencies appeared... I no longer remember exactly how I managed to sort that out, but the following commands helped (not necessarily in this order):

dpkg --force-all --configure -a
apt --fix-broken install
apt-get -f install

Next the EFI partition had to be mounted (and first I had to figure out where exactly it even was):

cp -r /boot /tmp
zpool import -a
lsblk
mount /dev/nvme0n1p2 /boot/efi
cd /tmp
cp * /boot/

Now for real!

Finally I could run the commands that added the ZFS kernel modules to the operating system's kernel:

update-initramfs -c -k all
dkms autoinstall
dkms status
update-grub
grub-install

And then, finally, came the reboot, after which the ZFS mountpoint still had to be fixed (zfs set mountpoint=/ rpool/ROOT/debian)… one more reboot - and the old system rose from the dead.

Cleaning up the damage afterwards

Because of all the frantic wizardry and the not-quite-finished upgrade, I had to install the missing packages, reinstall a few systemd packages, and remove old operating system kernels. All by hand, of course.

Oh, and for some reason the SSH server disappeared during the upgrade. But fixing that was a piece of cake by now.

Then a reboot, and another reboot, to check that everything really works.

All's well that ends well

And now it works. Oh, how nicely it works! ZFS is encrypted, the system boots up nicely after the unlock passphrase is entered, and the virtual servers start automatically as well. And the PiKVM has earned a very special place in my heart.

Until next time, or however the saying goes! :)

P.S. Thanks also to Jure for his help. Without his advice, all of this would have taken considerably longer.

How Linux kernel driver modules for hardware get loaded (I think)

By: cks

Once upon a time, a long time ago, the kernel modules for your hardware got loaded during boot because they were listed explicitly as 'load these modules' in configuration files somewhere. You can still explicitly list modules this way (and you may need to for things like IPMI drivers), but most hardware driver modules aren't loaded like this any more. Instead they get loaded through udev, through what I believe is two mechanisms.

The first mechanism is that as the kernel inventories things like PCIe devices, it generates udev events with 'MODALIAS' set in them in a way that incorporates the PCIe vendor and device/model numbers. At the same time, kernel modules declare all of the PCIe vendor and model values that they support, which are turned into (somewhat wild carded) module aliases that you can inspect with 'modinfo', for example:

$ modinfo bnxt_en
description: Broadcom BCM573xx network driver
license:     GPL
alias:       pci:v000014E4d0000D800sv*sd*bc*sc*i*
alias:       pci:v000014E4d00001809sv*sd*bc*sc*i*
[...]
alias:       pci:v000014E4d000016D8sv*sd*bc*sc*i*
[...]

(The other parts of the pci MODALIAS value are apparently, in order, the subsystem vendor, the subsystem device/model, the base class, the sub class, and the 'programming interface'. See the Arch Wiki entry on modalias.)

As I understand things (and udev rules), when udev processes a kernel udev event with a MODALIAS set, it will attempt to load a kernel module that matches the name. Usually this will be done through wild card matching against aliases, as in the case of Broadcom BCM573xx cards; a supported card will have its PCIe vendor and device listed as an alias, so udev will wind up loading bnxt_en for it.
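You can see this resolution yourself with modprobe's alias lookup; here the PCI address is just an arbitrary example, so substitute one of your own devices:

# what udev effectively does: take a device's MODALIAS string and
# ask which module(s) claim to match it
cat /sys/bus/pci/devices/0000:ab:00.0/modalias
modprobe --resolve-alias "$(cat /sys/bus/pci/devices/0000:ab:00.0/modalias)"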

The second mechanism is through something called the Auxiliary 'bus'. To put my own spin on it, this is a way for core hardware drivers to declare (possibly only under some situations) that loading an additional driver can enable extra functionality. When the main driver loads and registers itself, it will register a pseudo-device on the 'auxiliary bus'. This bus registration generates a udev event with a MODALIAS that starts with 'auxiliary:' and apparently is generally formatted as 'auxiliary:<core driver>.<some-feature>', for example 'auxiliary:bnxt_en.rdma'. When this pseudo-device is registered, the udev event goes out from the kernel, is picked up by udev, and triggers an attempt to load whatever kernel module has declared that name as an alias. For example:

$ modinfo bnxt_re
[...]
description: Broadcom NetXtreme-C/E RoCE Driver
[...]
alias:       auxiliary:bnxt_en.rdma
depends:     ib_uverbs,ib_core,bnxt_en
[...]

(Inside the kernel, the two kernel modules use this pseudo-device on the auxiliary bus to connect with each other.)

As far as I know, the main kernel driver modules don't explicitly publish information on what auxiliary bus things they may trigger; the information exists only in their code. You can attempt to go the other way by looking for modules that declare themselves as auxiliaries for something else. This is most conveniently done by looking for 'auxiliary:' in /lib/modules/<version>/modules.alias.
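For example, something like this will list them for your running kernel:

# list every module alias of the form 'auxiliary:<driver>.<feature>'
grep ' auxiliary:' /lib/modules/$(uname -r)/modules.alias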

(Your results may depend on the specific kernel versions and build options involved, and perhaps what additional packages have been added. On my Fedora 40 machine with 6.9.12, there are 37 auxiliary: aliases; on an Ubuntu 24.04 machine with '6.8.0-39', there are 49, with the extras coming from the peci_cputemp and peci_dimmtemp kernel modules.)

PS: PCI(e) devices aren't the only thing that this kernel module alias facility is used for. There are a whole collection of USB modaliases, a bunch of 'i2c' and 'of' ones, a number of 'hid' ones, and so on.

The challenges of working out how many CPUs your program can use on Linux

By: cks

In yesterday's entry, I talked about our giant (Linux) login server and how we limit each person to only a small portion of that server's CPUs and RAM. These limits sometimes expose issues in how programs attempt to work out how many CPUs they have available so that they can automatically parallelize themselves, or parallelize their build process. This crops up even in areas where you might not expect it; for example, both the Go and Rust compilers attempt to parallelize various parts of compilation using multiple threads within a single compiler process.

In Linux, there are at least three different ways to count the number of 'CPUs' that you might be able to use. First, your program can read /proc/cpuinfo and count up how many online CPUs there are; if code does this in our giant login server, it will get 112 CPUs. Second, your program can call sched_getaffinity() and count how many bits are set in the result; this will detect if you've been limited to a subset of the CPUs by a tool such as taskset(1). Finally, you can read /proc/self/cgroup and then try to find your cgroup to see if you've been given cgroup-based resource limits. These limits won't be phrased in terms of the number of CPUs, but you can work backward from any CPU quota you've been assigned.

In a shell script, you can do the second with nproc, which will also give you the full CPU count if there are no particular limits. As far as I know, there's no straightforward API or program that will give you information on your cgroup CPU quota if there is one. The closest you can come, it seems, is to use cgget (if it's even installed), but you have to walk all the way back up the cgroup hierarchy to check for CPU limits; they're not necessarily visible in the cgroup (or cgroups) listed in /proc/self/cgroup.
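As a rough illustration of checking the affinity and cgroup views from a shell (this sketch assumes a pure cgroup v2 setup mounted at /sys/fs/cgroup, which is what current Ubuntu and Fedora have):

# affinity-based count, which is what nproc and sched_getaffinity() report
nproc
taskset -cp $$

# walk our own cgroup and its parents looking for a CPU quota;
# cpu.max is 'max <period>' for no quota, or '<quota> <period>' in usec
cg=$(cut -d: -f3 /proc/self/cgroup)
while [ -n "$cg" ] && [ "$cg" != "/" ]; do
    [ -r "/sys/fs/cgroup$cg/cpu.max" ] && echo "$cg: $(cat /sys/fs/cgroup$cg/cpu.max)"
    cg=$(dirname "$cg")
done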

Given the existence of nproc and sched_getaffinity() (and how using them is easier than reading /proc/cpuinfo), I think a lot of scripts and programs will notice CPU affinity restrictions and restrict their parallelism accordingly. My experience suggests that almost nothing is looking for cgroup-based restrictions. This occasionally creates amusing load average situations on our giant login server when a program will see 112 CPUs 'available' and promptly try to use all of them, resulting in their CPU quota being massively over-subscribed and the load average going quite high without actually really affecting anyone else.

(I once did this myself on the login server by absently firing up a heavily parallel build process without realizing I was on the wrong machine for it.)

PS: The corollary of this is that if you want to limit the multi-CPU load impact of something, such as building Firefox from source, it's probably better to use taskset(1) than to do it with systemd features, because it's much more likely that things will notice the taskset limits and not flood your process table and spike the load average. This will work best on single-user machines, such as your desktop, where you don't have to worry about coordinating taskset CPU ranges with anyone or anything else.

The Linux Out-Of-Memory killer process list can be misleading

By: cks

Recently, we had a puzzling incident where the OOM killer was triggered for a cgroup, listed some processes, and then reported that it couldn't kill anything:

acctg_prof invoked oom-killer: gfp_mask=0x1100cca (GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
[...]
memory: usage 16777224kB, limit 16777216kB, failcnt 414319
swap: usage 1040kB, limit 16777216kB, failcnt 0
Memory cgroup stats for /system.slice/slurmstepd.scope/job_31944
[...]
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 252732]     0 252732     1443        0    53248        0             0 sleep
[ 252728]     0 252728    37095     1915    90112       54         -1000 slurmstepd
[ 252740]  NNNN 252740  7108532    17219 39829504        5             0 python3
[ 252735]     0 252735    53827     1886    94208      151         -1000 slurmstepd
Out of memory and no killable processes...

We scratched our heads a lot, especially as something seemed to be killing systemd-journald at the same time and the messages being logged suggested that it had been OOM-killed instead (although I'm no longer so sure). Why was the kernel saying that there were no killable processes when there was a giant Python process right there?

What was actually going on is that the OOM task state list leaves out a critical piece of information, namely whether or not the process in question had already been killed. A surprising number of minutes before this set of OOM messages, the kernel had done another round of a cgroup OOM kill for this cgroup and:

oom_reaper: reaped process 252740 (python3), now anon-rss:0kB, file-rss:68876kB, shmem-rss:0kB

So the real problem was that this Python process was doing something that had it stuck sitting there, using memory, even after it was OOM killed. The Python process was indeed not killable, for the reason that it had already been killed.

The whole series of events is probably sufficiently rare that it's not worth cluttering the tasks state listing with some form of 'task status' that would show if a particular process was already theoretically dead, just not cleaned up. Perhaps it could be done with some clever handling of the adjusted OOM score, for example marking such processes with a blank value or a '-'. This would make the field not parse as a number, but then kernel log messages aren't an API and can change as the kernel developers like.

(This happened on one of the GPU nodes of our SLURM cluster, so our suspicion is that some CUDA operation (or a GPU operation in general) was in progress and until it finished, the process could not be cleaned and collected. But there were other anomalies at the time so something even odder could be going on.)

Fedora 40 probably doesn't work with software RAID 0.90 format superblocks

By: cks

On my home machine, I have an old pair of HDDs that have (had) four old software RAID mirrors. Because these were old arrays, they were set up with the old 0.90 superblock metadata format. For years the arrays worked fine, although I haven't actively used them since I moved my home machine to all solid state storage. However, when I upgraded from Fedora 39 to Fedora 40, things went wrong. When Fedora 40 booted, rather than finding four software RAID arrays on sdc1+sdd1, sdc2+sdd2, sdc3+sdd3, and sdc4+sdd4 respectively, Fedora 40 decided that the fourth RAID array was all I had, and it was on sdc plus sdd (the entire disks). Since the fourth array had an LVM logical volume that I was still mounting filesystems from, things went wrong from there.

One of the observed symptoms during the issue was that my /dev had no entries for the sdc and sdd partitions, although the kernel messages said they had been recognized. This led me to stop the 'md53' array and run 'partprobe' on both sdc and sdd, which triggered an automatic assembly of the four RAID arrays. Of course this wasn't a long term solution, since I'd have to redo it (probably by hand) every time I rebooted my home machine. In the end I wound up pulling the old HDDs entirely, something I probably should have done a while back.

(This is filed as Fedora bug 2280699.)

Many of the ingredients of this issue seem straightforward. The old 0.90 superblock format is at the end of the object it's in, so a whole-disk superblock is at the same place as a superblock in the last partition on the disk, if the partition goes all the way to the end. If the entire disk has been assembled into a RAID array, it's reasonable to not register 'partitions' on it, since those are probably actually partitions inside the RAID array. But this doesn't explain why the bug started happening in Fedora 40; something seems to have changed so that Fedora 40's boot process 'sees' a whole disk RAID array based on the 0.90 format superblock at the end, where Fedora 39 did not.

I don't know if other Linux distributions have also picked up whatever change in whatever software is triggering this in Fedora 40, or if they will; it's possible that this is a Fedora specific issue. But the general moral I think people should take from this is that if you still have software RAID arrays using superblock format 0.90, you need a plan to change that. The Linux Raid Wiki has a somewhat dangerous looking in-place conversion process, but I wouldn't want to try that without backups. And if you have software RAID arrays that old, they probably contain old filesystems that you may want to recreate so they pick up new features (which isn't always possible with an in-place conversion).

Sidebar: how to tell what superblock format you have

The simple way is to look at /proc/mdstat. If the status line for a software RAID array mentions a superblock version, you have that version, for example:

md26 : active raid1 sda4[0] sdb4[1]
      94305280 blocks super 1.2 [2/2] [UU]

This is a superblock 1.2 RAID array.

If the status line doesn't mention a 'super' version, then you have an old 0.90 superblock. For example:

md53 : active raid1 sdd4[1] sdc4[0]
      2878268800 blocks [2/2] [UU]
      bitmap: 0/22 pages [0KB], 65536KB chunk

Unless you made your software RAID arrays a very long time ago and faithfully kept upgrading their system ever since, you probably don't have superblock 0.90 format arrays.

(Although you could have deliberately asked mdadm to make new arrays with 0.90 format superblocks.)
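Another way to check, if you prefer to ask mdadm itself (the device name here is just the example from above):

# 'Version : 0.90' in the output means the old superblock format
mdadm --detail /dev/md53 | grep -i 'version'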

Gtk 4 has decided to blow up some people's world on HiDPI displays

By: cks

Pavucontrol is my go-to GUI application for full volume control on my Fedora desktops. Recently (since updating to Fedora 40), pavucontrol started to choose giant font rendering, which made it more than a bit inconvenient to use. Today I attempted to diagnose this, without particular success, although I did find a fix; it still leaves pavucontrol with weird rendering issues. But investigating things more deeply led me to discover just what was going on.

Pavucontrol is one of the applications that is now based on Gtk version 4 ('Gtk-4') instead of Gtk version 3 ('Gtk-3'). On Fedora 40 systems, you can investigate which RPM packages want each of these with:

$ rpm -q --whatrequires 'libgtk-4.so.1()(64bit)' | sort
[... a relatively short list ...]
$ rpm -q --whatrequires 'libgtk-3.so.0()(64bit)' | sort
[... a much longer list including eg Thunderbird ...]

I don't use a standard desktop environment like Gnome or KDE, so HiDPI presented me with some additional hassles that required me to, among other things, run an XSettings daemon and set some X resources to communicate my high DPI. Back in the days of Gtk-3, Gtk-based applications did not notice these settings, and required their own modifications; first I had to scale them up in order to get icons right, and then I had to scale Gtk-3 text back down again because the core text rendering that was used by Gtk-3 did recognize my high DPI. So I needed 'GDK_SCALE=2' and 'GDK_DPI_SCALE=0.5'.

In Gtk-4, it turns out that they removed support for GDK_DPI_SCALE but not GDK_SCALE (via this KDE bug report). This makes life decidedly awkward; I can choose between having decent sized icons and UI elements along with giant text, or reasonable text and tiny icons. Gtk-4 has a settings file (the personal one is normally ~/.config/gtk-4.0/settings.ini), but as explicitly documented it's mostly ignored if you have XSettings active, which I do because I need it for other things. The current Arch Linux wiki page section on HiDPI in X suggests that there is a way to override XSettings values for Gtk(-4?), but this doesn't work for test Gtk-4 applications for me.

At the moment I'm unsetting both environment variables in a cover script for pavucontrol, which is acceptable for it because it has relatively few graphical elements that are scaled down to tiny sizes for this. If and when applications with more graphical elements move to Gtk-4, this is going to be a real problem for me and I don't know how I'll solve it.
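The cover script amounts to nothing more than something like this sketch (the path to pavucontrol may differ on your system):

#!/bin/sh
# drop the Gtk-3 era scaling variables before starting the real program
unset GDK_SCALE GDK_DPI_SCALE
exec /usr/bin/pavucontrol "$@"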

(When I started writing this entry I thought I had a mystery, but then more research turned up the direct answers, although not how I'm supposed to deal with this.)

Sidebar: The pavucontrol rendering problem I still have

No matter how wide I make the pavucontrol window, some text on the left side gets cut off. In the picture below, notice that the 'S' of 'Silence' isn't there.

The pavucontrol GUI showing output volume control, with a 'ilence' on the left side instead of 'Silence'

This doesn't happen in a test Cinnamon session in a Fedora 40 virtual machine I have handy, and I have no idea how one would go about figuring out what is wrong.

(This picture is with the sensible text sizes and thus small icons, and probably looks large in your browser because it comes from a HiDPI display.)

Fedora 40 and a natural but less than ideal outcome with 'alternatives'

By: cks

Fedora, like various Linux distributions, has a system of 'alternatives', where several programs from several different packages can provide alternative versions of the same thing, and which one is used is chosen through symbolic links in /etc/alternatives (Fedora's version appears to be this implementation). These alternatives can have priorities and by default the highest priority version of something gets to be that thing; however, if you manually choose another option, that option is supposed to stick. On a Fedora system, you can see the surprisingly large list of things handled this way with 'alternatives --list' and see about a particular thing with 'alternatives --display <thing>'.

As part of its attempt to move to Wayland, Fedora 40 has opted to package two different builds of Emacs, a "pure GTK" build that does not work well on X11 (or on plain terminals), and a "gtk+x11" build that does work well on X11 and on plain terminals. Which of the two versions gets to be 'emacs' is handled through the alternatives system, and the default is the "pure GTK" version (because of Fedora's love of Wayland). I don't use Wayland, so more or less as soon as I upgraded to Fedora 40, I ran 'alternatives --config emacs' and switched my 'emacs' to the gtk+x11 version, which also sets this alternative to be manually configured and thus it's supposed to be left alone.

Fedora 40 shipped with Emacs 29.3. Recently, Emacs 29.4 was released to deal with a security issue, so of course I updated to it when Fedora had an updated package available. To my unhappy surprise, after the update my 'emacs' was suddenly back to being the pure GTK, Wayland only version. Unfortunately, this outcome is actually a natural one given how everything works, because I left out a critical element of how Emacs works with the alternatives system in Fedora. You see, when I manually set my alternatives preferences, I did not set it to 'emacs-gtk+x11', because there is no such alternative. Instead, I set it to 'emacs-29.3-gtk+x11', and after the upgrade I had to reset it to 'emacs-29.4-gtk+x11', because that's what now existed.

Quite sensibly, if you have an alternative pointed to something that gets removed, Fedora de-selects that alternative rather than leave you with a definitely non-working configuration. So Fedora removed my (manually set) 'emacs-29.3-gtk+x11' alternative, along with what had been the default 'emacs-29.3' option, and after the package update it had 'emacs-29.4-gtk+x11' (at priority 75) and 'emacs-29.4' (at priority 80, and thus the default). With no manual alternative settings left, it picked the default to be 'emacs', and suddenly I was trying to use the non-working pure GTK version.

This is all perfectly natural and straightforward, and in this situation it more or less has to be implemented this way, but it results in a less than ideal outcome. To solve it, I think you'd need an additional level of alternatives indirection, where 'emacs' pointed to an 'emacs-gtk+x11' or 'emacs-pure' alternative, and then each of these pointed to the appropriate version. Then there would be a chance for the alternatives system to not forget your manual setting over a package upgrade.

(There might be a simpler scheme with some more cleverness in the package update actions, but I think it would still need the extra level of indirection through a 'emacs-gtk+x11' symbolic link.)

All things considered I'm not surprised that Fedora either overlooked this or opted not to go through the extra effort. But still, the current situation is yet another little example of "robot logic".

The systemd journal doesn't force you to not have plain text logs

By: cks

People are periodically grumpy that systemd's journal(d) doesn't store logs using 'plain text'. Sometimes this is used to imply that you can't have plain text logs with systemd's journal (or, more rarely, to state it). This is false. The systemd journal doesn't force you to not have plain text logs. In fact the systemd journal is often a practical improvement in the state of plain text logs, because your plain text logs will capture more information (if you keep them).

Of course the systemd journal won't write plain text logs directly. But modern syslog daemons on Linux will definitely read from the systemd journal and handle the result as more or less native syslog messages, including forwarding them to a central syslog server and writing them to whatever local files you want in the traditional syslog plain text format. Because the systemd journal generally captures things like the output printed by programs run in units, this stream of syslog'd messages will include more log data than a pure journal-free syslog environment would, which is normally a good thing.

It's perfectly possible to use the systemd journal basically in a pass-through mode; for example, see the discussion of journald.conf's Storage= setting, and also the runtime storage settings. Historically, Linux distributions varied in whether they made the systemd journal persistent on disk, and in the early days some certainly did not (sometimes to our irritation). And of course if you don't like the journal settings that your Linux distribution defaults to, you can change them in your system installation procedures.

(If you set 'Storage=none', I'm not sure if your syslog daemon can still read and log journal data; you should probably test that. But my personal view is that you should retain some amount of journal logs in RAM to enable various convenient things.)
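As an illustration (not a recommendation), a journald drop-in along these lines would keep the journal only in RAM, cap its size, and forward messages to a syslog daemon; the file name and the size are arbitrary:

# /etc/systemd/journald.conf.d/local.conf -- a sketch, not our config
[Journal]
Storage=volatile
RuntimeMaxUse=64M
ForwardToSyslog=yes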

Today, whether or not you get plain text logs as well as the systemd journal by default is a choice that Linux distributions make, not something that systemd itself decides. If the Linux distribution installs and enables a syslog daemon, you'll get plain text logs, possibly as well as a persistent systemd journal. You can look at the plain text logs most of the time, and turn to the journal for the kind of extra metadata that plain text logs aren't good at. And even if your Linux distribution doesn't install a syslog daemon, I believe most of them still package one, so you can install it yourself as a standard piece of additional software.

I wish systemd didn't require two units for each socket service

By: cks

Triggered by our recent (and repeated) issue with xinetd restarts, we're considering partially or completely replacing our use of xinetd with systemd socket units on our future Ubuntu 24.04 machines. Xinetd generally works okay today, but our perception is that it's fallen out of style and may not last forever as a maintained and packaged thing in Ubuntu (it's already a 'universe' package). By contrast, systemd socket units and socket activation is definitely sticking around. However, I have a little petty gripe about systemd socket units (which also applies to systemd timer units), which is that they require you to provide two unit files, not one.

As the systemd socket unit documentation spells out, socket units work by causing systemd to start another unit when the socket is connected to. Depending on the setting of Accept= this is either a regular unit or a template unit (which will be started with a unique name for every connection). However, in each case you need a second unit file for the .service unit. This is in contrast to the xinetd approach, where all you need is a single file placed into /etc/xinetd.d. As a system administrator, my bias is that the fewer files involved the better, because there's less chance for things to become de-synchronized with each other.
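A minimal sketch of what such a pair might look like; the names, port, and program here are entirely made up, and this is the Accept=yes style where a template service is started per connection:

# example.socket
[Unit]
Description=Example per-connection socket

[Socket]
ListenStream=10080
Accept=yes

[Install]
WantedBy=sockets.target

# example@.service -- the matching template unit, run once per connection
[Unit]
Description=Example connection handler

[Service]
ExecStart=/usr/local/sbin/example-handler
StandardInput=socket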

Systemd has a coherent internal model that more or less requires there be two units involved here, because it needs to keep track of the activation status of the socket as well as the program or programs involved in handling it. After all, one of the selling points of socket units is that the socket can be active without the associated program having been started. And in the systemd world, the way you get two units is to have two files, so systemd socket activation needs you to provide both a .socket file and a .service file.

(Systemd could probably work out some way to embed the service information in the .socket unit file if it wanted to, but it's probably better not to complicate the model. I did say my gripe was a little petty one.)

PS: The good news is that although you have to install both unit files, you only have to directly activate one (the socket unit).

The xinetd restart problem with binding ports that we run into

By: cks

Recently, I said something on the Fediverse:

It has been '0' days since I wished xinetd had an option to exit with a failed status if it couldn't bind some configured ports on startup. (Yes, yes, I know, replace it with systemd socket listeners or something.)

The job of xinetd is to listen on some number of TCP or UDP ports for you, and run things when people connect to those ports. This has traditionally been used to avoid having N different inactive daemons each listening to its own ports, and also so that people don't have to write those daemons at all; they can write something that gets started with a network connection handed to it and talks over that connection, which is generally simpler (you can even use shell scripts). At work, our primary use for xinetd is invoking Amanda daemons on all of the backup clients.

Every so often, Ubuntu releases a package update that wants to restart xinetd (for whatever reason, most obviously updating xinetd itself). When xinetd restarts, the old xinetd process stops listening on its ports and the new xinetd attempts to start listening on what you've configured. Unfortunately this is where a quirk of the BSD sockets API comes in. If there is an existing connection on some port, the new xinetd is unable to start listening on that port again. So, for example, if the Amanda client is running at the time that xinetd restarts, xinetd will not be able to start listening on the Amanda TCP port.

(See the description of SO_REUSEADDR in socket(7). The error you'll get is 'address already in use', EADDRINUSE. At one point I thought this was an issue only for UDP listening, but today it is clearly also a problem for TCP services.)

When this happens, the current version of xinetd will log messages but happily start up, which means that to general status monitoring it looks like a healthy service. This can make it hard to detect until, for example, some Amanda backups fail because xinetd on the machines to be backed up isn't actually listening for new Amanda connections. This happens to us every so often, which is why I wish xinetd had an option to fail in this situation (and then we'd get an alert, or we could have the systemd service unit set to auto-restart xinetd after a delay).
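For what it's worth, if xinetd did exit with a failure status here (or a wrapper made it do so), a systemd drop-in along these lines would handle the retry; this is a sketch, with an arbitrary delay:

# /etc/systemd/system/xinetd.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=60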

(Systemd socket units don't so much solve this as work around it by never closing and re-opening the listening socket as part of service changes or restarts.)

Some notes on ZFS's zstd compression kstats (on Linux)

By: cks

Like various other filesystems, (Open)ZFS can compress file data when it gets written. As covered in the documentation for the 'compression=' filesystem property, ZFS can use a number of compression algorithms for this, including Zstd. The zstd compression system in ZFS exposes some information about its activity in the form of ZFS kstats; on Linux, these are visible in /proc/spl/kstat/zfs/zstd, but are unfortunately a little underdocumented. For reasons beyond the scope of this blog entry I recently looked into this, so here is what I know about them from reading the code.

compress_level_invalid
The zstd compression level was invalid on compression.
compress_alloc_fail
We failed to allocate a zstd compression context.
compress_failed
We failed to compress a block (after allocating a compression context).

decompress_level_invalid
A compressed block had an invalid zstd compression level.
decompress_header_invalid
A compressed block had an invalid zstd header.
decompress_alloc_fail
We failed to allocate a zstd decompression context. This should not normally happen.
decompress_failed
A compressed block with a valid header failed to decode (after we allocated a decompression context).

The zstd code does some memory allocation for data buffers and contexts and so on. These have more granular information in the kstats:

alloc_fail
How many times allocation failed for either compression or decompression.
alloc_fallback
How many times decompression allocation had to fall back to a special emergency reserve in order to allow blocks to be decompressed. A fallback is also considered an allocation failure.

buffers
How many buffers zstd has in its internal memory pools. I don't understand what 'buffers' are in this context, but I think they're objects (such as contexts or data buffers), not pool arenas.
size
The total size of all buffers currently in zstd's internal memory pools.

If I'm reading the code correctly, compression and decompression have separate pools, and each of them can have up to 16 buffers in them. Buffers are only freed if they're old enough, and if the zstd code needs more buffers than this for various possible reasons, it allocates them directly outside of the memory pools. No current kstat tracks how many allocations happened outside of the memory pools (or how effective the pools are), although this information could be extracted with eBPF.

In the current version of OpenZFS, compressing things with zstd has a complex system of trying to figure out if things are compressible (a 'tiered early abort'). If the data is small enough (128 KB or less), ZFS just does the zstd compression (and then a higher level will discard the result if it didn't save enough). Otherwise, if you're using a high enough zstd level, ZFS tries a quick check with LZ4 and then zstd-1 to see if it can bail out quickly rather than try to compress the entire thing with zstd only to throw it away. How this goes is shown in some kstats:

passignored
How many times zstd didn't try the complex approach.
passignored_size
The amount of small data processed without extensive checks.
lz4pass_allowed
How many times the LZ4 pre-check passed things.
lz4pass_rejected
How many times the LZ4 pre-check rejected things.
zstdpass_allowed
How many times the quick zstd pre-check passed things.
zstdpass_rejected
How many times the quick zstd pre-check rejected things.

A number of these things will routinely be zero. Obviously you would like to see no compression and decompression failures, no allocation failures, and so on. And if you're compressing with only zstd-1 or zstd-2, you'll never trigger the complicated pre-checks; however, I believe that zstd-3 will trigger this, and that's the normal default if you set 'compression=zstd'.
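A quick way to glance at the failure-ish counters is to grep the kstat file mentioned above; the pattern here is just mine, not anything official:

grep -E 'fail|invalid' /proc/spl/kstat/zfs/zstd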

This particular sausage is all made in module/zstd/zfs_zstd.c, which has some comments about things.

Where Thunderbird seems to get your default browser from on Linux

By: cks

Today, I had a little unhappy experience with Thunderbird and the modern Linux desktop experience:

Dear Linux Thunderbird, why are you suddenly starting to open links I click on in Chrome? I do not want that in the least, and your documentation on how you obtain this information is nonexistent (I am not on a Gnome or KDE desktop environment). This worked until relatively recently and opened in the right browser, but now it is opaquely broken.

This is an example of why people want to set computers on fire.

(All of this is for Thunderbird 115.12 on Fedora 40. Other Linux distributions may differ, and things may go differently if you're using KDE.)

After some investigation, I now know where Thunderbird was getting this information from and why it wound up on Chrome, although I don't know what changed so that this started happening recently. A critical source for my journey was Kevin Locke's Changing the Default Browser in Thunderbird on Linux, originally written in 2012, revised in 2018, and still applicable today, almost six years later.

If you're an innocent person, you might think that Thunderbird would of course use xdg-open to open URLs, since it is theoretically the canonical desktop-independent way to open URLs in your browser of choice. A slightly less innocent person could expect Thunderbird to use the xdg-mime tool and database to find what .desktop file handles the 'x-scheme-handler/https' MIME type and then use it (although this would require Thunderbird to find and parse the .desktop file).

Thunderbird does neither of these. Instead, it uses the GTK/Gnome 'gconf' system (which is the old system, in contrast to the new GSettings), which gives Thunderbird (and anyone else who asks) the default command to run to open a URL. We can access the same information with 'gconftool-2' or 'gconf-editor' (don't confuse the latter with dconf-editor, which works on GSettings/dconf). So:

$ gconftool-2 --get /desktop/gnome/url-handlers/http/command
gio open %s

The 'gio' command provides command line access to the GTK GIO system, and is actually what xdg-open would probably use too if I was using a Gnome desktop instead of my own weird environment. We can check what .desktop file 'gio' will use and compare it to xdg-mime with:

$ xdg-mime query default x-scheme-handler/https
org.mozilla.firefox.desktop
$ gio mime x-scheme-handler/https
Default application for “x-scheme-handler/https”: google-chrome.desktop
Registered applications:
        google-chrome.desktop
        kfmclient_html.desktop
        org.midori_browser.Midori.desktop
        org.mozilla.firefox.desktop
Recommended applications:
        google-chrome.desktop
        kfmclient_html.desktop
        org.midori_browser.Midori.desktop
        org.mozilla.firefox.desktop

So GIO and xdg-mime disagree, and GIO is picking Chrome.

(In case you're wondering, Thunderbird really does run 'gio open ...'.)

What happened to me is foreshadowed by my 2016 entry on how xdg-mime searches for things. I had a lingering old set of files in /usr/local/share/applications and the 'defaults.list' there contained (among a few other things):

[Default Applications]
x-scheme-handler/http=mozilla-firefox.desktop;google-chrome.desktop
x-scheme-handler/https=mozilla-firefox.desktop;google-chrome.desktop

The problem with these entries is that there's no 'mozilla-firefox.desktop' (or 'firefox.desktop') any more; it was long since renamed to 'org.mozilla.firefox.desktop'. Since there is no 'mozilla-firefox.desktop', it is ignored and this line really picks 'google-chrome.desktop' (instead of ignoring Chrome). For a long time this seems to have been harmless, but then apparently GIO started deciding to pay attention to /usr/local/share/applications, although xdg-mime was ignoring it. Getting rid of those 2013-era files made 'gio mime ...' agree that org.mozilla.firefox.desktop was what it should be using.

(The Arch wiki page on Default Applications has some additional information and pointers. Note that all of this ignores /etc/mailcap, although some things will use it.)
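As an aside, gio can also set the default handler directly (I believe by writing it into your ~/.config/mimeapps.list), although in my case the real fix was removing the stale files:

gio mime x-scheme-handler/http org.mozilla.firefox.desktop
gio mime x-scheme-handler/https org.mozilla.firefox.desktop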

This is still not what I want (or what it used to be), but fixing that is an internal Thunderbird settings change, not Thunderbird getting weird system settings.

Sidebar: Fixing Thunderbird to use a browser of your choice

This comes from this superuser.com answer. Get into the Config Editor (the Thunderbird version of Firefox's 'about:config') and set network.protocol-handler.warn-external.http and network.protocol-handler.warn-external.https to 'true'. To be safe, quit and restart Thunderbird. Now click on a HTTPS or HTTP link in a message, and you should get the usual 'what to do with this' dialog, which will let you pick the program of your choice and select 'always use this'. Under some circumstances, post-restart you'll be able to find a 'https' or 'http' entry in the 'Files & Attachments' part of 'General' Settings, which you can change on the spot.

The Linux kernel NFS server and reconnecting client NFS filehandles

By: cks

Unlike some other Unix NFS servers, the Linux kernel NFS server attempts to solve the NFS server 'subtree' export problem, along with a related permissions problem that is covered in the exportfs(5) manual page section on no_subtree_check. To quote the manual page on this additional check:

subtree checking is also used to make sure that files inside directories to which only root has access can only be accessed if the filesystem is exported with no_root_squash (see below), even if the file itself allows more general access.

In general, both of these checks require finding a path that leads to the file obtained from a NFS filehandle. NFS filehandles don't contain paths; they normally only contain roughly the inode number, which is a flat, filesystem-wide reference to the file. The NFS server calls this 'reconnection', and it is somewhat complex and counterintuitive. It also differs for NFS filehandles of directories and files.

(All of this is as of kernel 6.10-rc3, although this area doesn't seem to change often.)

For directories, the kernel first gets the directory's dentry from the dentry cache (dcache); this dentry can be 'disconnected' (which mostly means it was newly created due to this lookup) or already connected (in general, already set up in the dcache). If the dentry is disconnected, the kernel immediately reconnects it. Reconnecting a specific directory dentry works like this:

  1. obtain the dentry's parent directory through a filesystem specific method (which may more or less look up what '..' is in the directory).
  2. search the parent directory to find the name of the directory entry that matches the inode number of the dentry you're trying to reconnect. (A few filesystems have special code to do this more efficiently.)
  3. using the dcache, look up that name in the parent directory to get the name's dentry.
  4. verify that this new dentry and your original dentry are the same (which guards against certain sorts of rename races).

It's possible to have multiple disconnected dentries on the way to the filesystem's mount point; if so, each level follows this process. The obvious happy path is that the dcache already has a fully connected dentry for the directory the NFS client is working on, in which case all of this can be skipped. This is frequently going to be the case if clients are repeatedly working on the same directories.

Once the directory's dentry is fully connected (ie, all of its parents are connected), the kernel NFS server code will check if it is 'acceptable'. If the export uses no_subtree_check (which is now the default), this acceptability check always answers 'yes'.

For files, things are more complicated. First, the kernel checks to see if the initial dentry for the file (and any aliases it may have) is 'acceptable'; if the export uses no_subtree_check the answer is always 'yes', and things stop. Otherwise, the kernel uses a filesystem specific method to obtain the (or a) directory the file is in, reconnects the directory using the same code as above, then does steps 2 through 4 of the 'directory reconnection' process for the file and its parent directory in order to check against renames (which will involve at least one scan of the parent directory to discover the file's name). Finally with all of this done and a verified, fully connected dentry for the file, the kernel does the acceptability check again and returns the result.

Because the kernel immediately reconnects the dentries of directory NFS file handles before looking at the status of subtree checks, you really want those directories to have dentries that are already in the dcache (and fully connected). Every directory NFS filehandle with a dentry that has to be freshly created in disconnected state means at least one scan of a possibly large parent directory, and more scans of more directories if the parent directory itself isn't in the dcache too.

I'm not sure how the dcache shrinks, and especially whether filesystems can trigger the removal of dcache entries because the filesystem itself wants to remove the inode entry. The general kernel code that shrinks a filesystem's associated dcache and inodes triggers dcache shrinking first and inode shrinking second, with the comment that the inode cache is pinned by the dcache.

Sidebar: Monitoring NFS filehandle reconnections

If you want to see how much reconnection is happening, you'll need to use bpftrace (or some equivalent). The total number of NFS filehandles being looked at is found by counting calls to exportfs_decode_fh_raw(). If you want to know how many reconnections are needed, you want to count calls to reconnect_path(); if you want to count how many path components had to be reconnected, you want to (also) count calls to reconnect_one(). All of these are in fs/exportfs/expfs.c. The exportfs_get_name() call searches a directory for the name matching a given inode, and the lookup_one_unlocked() call then does the name-to-dentry lookup needed for revalidation; I think that will probably fall through to a filesystem directory lookup.
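As a rough sketch (assuming these functions haven't been inlined away and so are visible as kprobe targets on your kernel), a bpftrace one-liner that prints cumulative counts every minute might look like:

# bpftrace -e '
  kprobe:exportfs_decode_fh_raw { @filehandles++ }
  kprobe:reconnect_path { @reconnections++ }
  kprobe:reconnect_one { @components++ }
  interval:s:60 { print(@filehandles); print(@reconnections); print(@components); }'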

(You can also look at general dcache stats, as covered in my entry on getting some dcache information, but I don't think this dcache lookup information covers all of the things you want to know here. I don't know how to track dentries being dropped and freed up, although prune_dcache_sb() is part of the puzzle and apparently returns a count of how many dentries were freed up for a particular filesystem superblock.)

Viewing and resetting the BIOS passwords on the RedmiBook 16

I recently lost the BIOS password for my Xiaomi RedmiBook 16. Luckily, viewing and even resetting the password from inside a Linux session turned out to be incredibly easy.

As it turns out, both the user and the system ("supervisor") passwords are not hashed in any way and are stored as plaintext inside EFI variables. Viewing these EFI variables is incredibly easy on a Linux system where efivarfs is enabled, even under a regular user account and even if secure boot is enabled:

$ uname -a
Linux book 5.10.7.a-1-hardened #1 SMP PREEMPT Tue, 12 Jan 2021 20:46:33 +0000 x86_64 GNU/Linux
$ whoami
xx
$ sudo dmesg | grep "Secure boot"
[    0.010717] Secure boot enabled

Reading the variables:

$ hexdump -C /sys/firmware/efi/efivars/SystemSupervisorPw*
00000000  07 00 00 00 0a 70 61 73 73 77 6f 72 64 31 32 20  |.....password12 |

$ hexdump -C /sys/firmware/efi/efivars/SystemUserPw*
00000000  07 00 00 00 0a 70 61 73 73 77 6f 72 64 31 31 21  |.....password11!|

If you have a root shell, removing the passwords entirely is also possible:

# chattr -i /sys/firmware/efi/efivars/SystemUserPw* /sys/firmware/efi/efivars/SystemSupervisorPw*

# rm /sys/firmware/efi/efivars/SystemUserPw* /sys/firmware/efi/efivars/SystemSupervisorPw*

Reboot, and the BIOS no longer asks for a password to enter setup, change secure boot settings, etc.

Patching ACPI tables to enable deep sleep on the RedmiBook 16

I recently purchased Xiaomi's RedmiBook 16. For the price, it's an excellent MacBook clone. Being a Ryzen-based laptop, Linux support works great out of the box, with one big caveat: deep sleep does not work. I decided to try and fix this.

Deep sleep?

To clear some confusion about what I mean by deep sleep, I need to explain a bit of how hibernation/suspending works.

There are a number of sleep states on modern machines.

The most basic of these is referred to as S0. It's implemented purely in software (i.e. the kernel), and doesn't do a very good job at preserving battery. While userland processes are suspended, the machine (and the CPU) is still running and using power. As S0 doesn't rely on hardware compatibility, it's enabled on all devices. Using this mode, my RedmiBook's battery drained to 0% overnight.

S1, also known as "shallow" sleep, is similar to S0 but enables some additional power saving features such as suspending power to nonboot CPUs. This mode still doesn't provide significant power saving, however.

S3 ("suspend-to-RAM") saves the system's state to memory and powers off everything but the memory itself. On boot, this state is restored and the system can resume from suspension. This mode is the one known as "deep sleep" and can provide acceptable levels of power saving. Overnight, this drains only about 3-5% battery on my laptop, which is perfectly fine for my needs.

S4 is known as "suspend-to-disk" and works a lot like S3, but instead, as you can probably tell by the name, saves the state to disk. This means you can remove power from the device completely and resuming from suspension would still work as the state is not stored in volatile memory.

ACPI

Modes S1 - S4 require hardware compatibility. This compatibility is usually advertised to the operating system's kernel using ACPI definitions. The kernel uses this information to know what suspension methods to provide to the user.

On some systems (such as the RedmiBook), the ACPI definitions declare no or only conditional support for some (or all) modes.

You can see what sleep states your machine supports by looking into /sys/power/mem_sleep. On my machine, only S0 ("s2idle") was supported:

$ cat /sys/power/mem_sleep
[s2idle]

Annoying. I knew deep sleep works on Windows, so it's not a case of missing hardware support. I suspected misconfigured ACPI tables to be at fault here.

Patching ACPI

Luckily, Linux supports loading "patched" ACPI tables during the boot process. It is possible to grab the currently used tables, decompile them, patch out the parts which block S3 from being supported, recompile, and embed the patched table into a cpio archive.

The specific ACPI component we're interested in is the DSDT table. We can dump this somewhere safe:

# cat /sys/firmware/acpi/tables/DSDT > dsdt.aml

We'll use iasl from the ACPICA software set to decompile the dumped table:

$ iasl -d dsdt.aml

If you get warnings about unresolved references to external control methods, it might be worth decompiling again, but this time including the SSDT tables. See this post at encryp.ch for more info.

You'll end up with a human-readable dsdt.dsl file. You'll want to peek into this and search for "S3 System State" to find what you're looking for. In my case, it was nested into two flag checks, which I simply deleted, so as to advertise S3 support even if the flag checks failed:

@@ -18,7 +18,7 @@
  *     Compiler ID      "    "
  *     Compiler Version 0x01000013 (16777235)
  */
-DefinitionBlock ("", "DSDT", 1, "XMCC  ", "XMCC1953", 0x00000002)
+DefinitionBlock ("", "DSDT", 1, "XMCC  ", "XMCC1953", 0x00000003)
 {
     /*
      * iASL Warning: There were 9 external control methods found during
@@ -769,19 +769,13 @@ DefinitionBlock ("", "DSDT", 1, "XMCC  ", "XMCC1953", 0x00000002)
         Zero,
         Zero
     })
-    If ((CNSB == Zero))
-    {
-        If ((DAS3 == One))
-        {
-            Name (_S3, Package (0x04)  // _S3_: S3 System State
-            {
-                0x03,
-                0x03,
-                Zero,
-                Zero
-            })
-        }
-    }
+    Name (_S3, Package (0x04)  // _S3_: S3 System State
+    {
+        0x03,
+        0x03,
+        Zero,
+        Zero
+    })

     Name (_S4, Package (0x04)  // _S4_: S4 System State
     {

You'll also want to increment the version number by one (as shown above) as the patched table wouldn't be loaded otherwise.

Once this is done, we can recompile it, again using iasl:

$ iasl dsdt.dsl

If this refuses to compile due to the compiler thinking Zero is not a valid type, check out the post at encryp.ch, where they shed some light on this.

Compiling using iasl overwrites the old .aml file. We'll need to create the proper directory tree in order to archive it in a manner which the kernel accepts:

$ mkdir -p kernel/firmware/acpi

Copy the patched table into place and create the archive using the cpio tool:

$ cp dsdt.aml kernel/firmware/acpi/.
$ find kernel | cpio -H newc --create > dsdt_patch

Copy the newly created archive into your boot directory:

# cp dsdt_patch /boot/.

You'll need to figure out how to get your bootloader to load this archive on boot. As I use systemd-boot, I modified my default entry and added the following initrd line before initramfs is loaded:

$ grep initrd /boot/loader/entries/arch.conf
initrd	/amd-ucode.img
initrd  /dsdt_patch
initrd	/initramfs-linux.img

For GRUB users, you'll need to edit the /boot/grub/grub.cfg file and add the archive to its initrd line.

I also recommend adding the following kernel parameter, as that makes sure that S3 is used by default instead of S0:

mem_sleep_default=deep
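With systemd-boot, this goes on the 'options' line of the same loader entry; a sketch, where the root= value is just a placeholder for whatever your entry already contains:

$ grep options /boot/loader/entries/arch.conf
options root=UUID=xxxx rw mem_sleep_default=deep

For GRUB, the usual place is GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by regenerating grub.cfg.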

After rebooting, peek into /sys/power/mem_sleep once again to make sure deep is supported and enabled as the current mode:

$ cat /sys/power/mem_sleep
s2idle [deep]

It's also a good idea to check whether the system properly suspends and resumes. In my case, there have been no issues and I get excellent battery life during sleep.

Some readers have tested this method and reported that it also works for the RedmiBook 14 and the Ryzen edition of the Xiaomi Notebook Pro 15, which have similar hardware.

chroot shenanigans 2: Running a full desktop environment on an Amazon Kindle

In my previous post, I described running Arch on an OpenWRT router. Today, I'll be taking it a step further and running Arch and a full LXDE installation natively on an Amazon Kindle, which can be interacted with directly using the touch screen. This is possible thanks to the Kindle's operating system being Linux!

You can see the end result in action here. Apologies for the shaky video - it was shot using my phone and no tripod.

If you want to follow along, make sure you've rooted your Kindle beforehand. This is essential – without root, it's impossible to run custom scripts or binaries.

I'm testing this on an 8th generation Kindle (KT3) – it should, however, work on all recent Kindles provided you have enough storage and are rooted. You also need to set up USBnetwork for SSH access, and optionally KUAL if you want a simple way of launching the chroot.

First things first: We need to set up a filesystem and extract an Arch installation into it, which we can later chroot into. The filesystem will be a file mounted as a loop device. The reason we're not extracting the Arch installation directly into a directory on the Kindle is that the Kindle's storage filesystem is FAT32. FAT32 doesn't support required features such as symbolic links, which would break the Arch installation. Please note that this also means your chroot filesystem can be at most 4 gigabytes in size. This can be worked around by mounting the real root inside the chroot filesystem, but that's still a hacky way to go about it. But I digress.

First, figure out how large your filesystem actually can be. SSH into your Kindle and see how much free space you have:

$ ssh root@192.168.15.244

kindle# df -k /mnt/base-us
Filesystem   1K-blocks  Used    Available  Use%  Mounted on
/dev/loop/0  3188640    361856  2826784    11%   /mnt/base-us

Seems like we have around 2800000K (around 2.8G) of space available. Let's make our filesystem 2.6G – it's enough to host our root filesystem and some extra applications, such as LXDE. Note that I'll be running the following commands on my PC and transferring the filesystem over later. You can also do all of this on the Kindle, but it's simply easier and faster this way.

Let's create a blank file of the wanted size. I'm using dd, but you can also use fallocate for this:

$ dd if=/dev/zero of=arch.img bs=1024 count=2600000
2600000+0 records in
2600000+0 records out
2662400000 bytes (2.7 GB, 2.5 GiB) copied, 6.92058 s, 385 MB/s

Let's create our filesystem on it. Since we're doing this on the PC, we need to disable the 64bit, metadata_csum, and huge_file filesystem options, as the Kindle kernel's ext4 driver doesn't support them.

$ mkfs.ext4 -O ^64bit,^metadata_csum,^huge_file arch.img
mke2fs 1.45.0 (6-Mar-2019)
Discarding device blocks: done                            
Creating filesystem with 650000 4k blocks and 162560 inodes
Filesystem UUID: a4e72620-368a-44b4-81bb-9e66b2903523
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

This is optional, but I'll also disable periodic filesystem checks on it:

$ tune2fs -c 0 -i 0 arch.img                               
tune2fs 1.45.0 (6-Mar-2019)         
Setting maximal mount count to -1
Setting interval between checks to 0 seconds

Next it's time to mount the filesystem:

$ mkdir rootfs
$ sudo mount -o loop arch.img rootfs/

The Kindle I'm using has a Cortex-A9-based processor, so let's download the ARMv7 version of Arch Linux ARM from here. You can download it and extract it afterwards, or simply download and extract at the same time:

$ curl -L http://os.archlinuxarm.org/os/ArchLinuxARM-armv7-latest.tar.gz | sudo tar xz -C rootfs/

sudo is required to extract as it sets up a lot of files with root permissions. You can ignore the errors about SCHILY.fflags. Verify that the files extracted successfully with ls -l rootfs/.

Let's prepare our Kindle for the filesystem. I opted for hosting the filesystem in extensions/karch as I want to use KUAL for easy launching:

$ ssh root@192.168.15.244

kindle# mkdir -p /mnt/base-us/extensions/karch

While we're here, it's also a good idea to stop the power daemon to prevent the Kindle from going into sleep mode while transferring the filesystem and interrupting our transfer:

kindle# stop powerd
powerd stop/waiting

Let's transfer our filesystem:

kindle# exit
Connection to 192.168.15.244 closed.

$ scp arch.img root@192.168.15.244:/mnt/base-us/extensions/karch/

This might take quite a bit of time, depending on your connection.

Once it's done, let's SSH in once again and set up our mountpoint:

$ ssh root@192.168.15.244

kindle# cd /mnt/base-us/extensions/karch/
kindle# mkdir system

I decided to set up my own loop device so I can have it named, but you can ignore this and opt to use /dev/loop/12 or similar instead. Just make sure it's not already in use by checking with mount.

Setting up a loop point and mounting the filesystem:

kindle# mknod -m0660 /dev/loop/karch b 7 250
kindle# mount -o loop=/dev/loop/karch -t ext4 arch.img system/

We should also mount some system directories into it:

kindle# mount -o bind /dev system/dev
kindle# mount -o bind /dev/pts system/dev/pts
kindle# mount -o bind /proc system/proc
kindle# mount -o bind /sys system/sys
kindle# mount -o bind /tmp system/tmp
kindle# cp /etc/hosts system/etc/

It's time to chroot into our new system and set it up for LXDE. You can also use this opportunity to set up whatever applications you need, such as an onscreen keyboard:

kindle# chroot system/ /bin/bash
chroot# echo 'en_US.UTF-8 UTF-8' > /etc/locale.gen 
chroot# locale-gen
chroot# rm /etc/resolv.conf 
chroot# echo 'nameserver 8.8.8.8' > /etc/resolv.conf
chroot# pacman-key --init # this will take a while
chroot# pacman-key --populate
chroot# pacman -Syu --noconfirm
chroot# pacman -S lxde xorg-server-xephyr --noconfirm

We use Xephyr because it's the easiest way to get our LXDE session up and running. Since the Kindle uses X11 natively, we can try using that. It's possible to stop the native window manager using stop lab126_gui outside the chroot, but then the Kindle will stop updating the screen with new data, leaving it blank – forcing you to use something like eips to refresh the screen. The X server still works, however, and you can confirm this by using something like x11vnc after running your own WM in it. Xephyr spawns a new X server inside the preexisting X server, which is not as efficient but a lot easier.

We can, however, stop everything else related to the native GUI, as we need the extra memory and we can't use it while LXDE is running anyway:

chroot# exit
kindle# SERVICES="framework pillow webreader kb contentpackd"
kindle# for service in ${SERVICES}; do stop ${service}; done

While we're here, we need to get the screen size for later:

kindle# eips -i | grep 'xres:' | awk '{print $2"x"$4}'
600x800

Let's chroot back into the system and see if we can get LXDE to run. Be sure to replace the screen size parameter if needed:

kindle# chroot system/ /bin/bash
chroot# export DISPLAY=:0
chroot# Xephyr :1 -title "L:A_N:application_ID:xephyr" -screen 600x800 -cc 4 -nocursor &
chroot# export DISPLAY=:1
chroot# lxsession &
chroot# xrandr -o right

If everything goes well, you should have LXDE visible on your Kindle's screen. Ta-da! Feel free to play around with it. I've found that the touch screen is surprisingly accurate, even though it uses an IR LED system to detect touches instead of a normal digitizer.

Once you're done in the chroot, press Ctrl-C and then Ctrl-D to exit it. We can then restore the Kindle UI by doing:

kindle# for service in ${SERVICES}; do start ${service}; done

It might take a while for anything to display again.

I've mentioned setting up a KUAL extension to automate the entering and exiting of the chroot. You can find that here. If you're interested in using this, make sure you've set up your filesystem first and copied it over to the same directory as the extension, and that it's named arch.img. Everything else is not mandatory - the extension will do it for you.

chroot shenanigans: Running Arch Linux on OpenWRT (LEDE) routers

Here's some notes on how to get Arch Linux running on OpenWRT devices. I'm using an Inteno IOPSYS (OpenWRT-based) DG400 for this, which has a Broadcom BCM963138 SoC - reportedly ARMv7 but not really (I'll get to that later).

I figured it would be fun to try running Arch on such an unconventional device. I ran into three issues, which I'll discuss below along with their workarounds.

I've already "hacked" my router and have direct root access to the system, so I won't be discussing that in this post. If you're interested, check out any of my older posts with a CVE label for more information, or if you're brave and want to compile and flash custom firmware on your Inteno router, check out this post.

I used the lovely Arch Linux ARM community project as the basis for this. The plan of action: Grab a tarball of a compiled system for my architecture (ARMv7), extract it on the router and use chroot to effectively "run" it as if it was the root filesystem. Seems simple enough.

Issue 1: Space

These sorts of devices are usually built with very limited storage to keep production costs down. The firmware just about fits on the onboard flash, with some extra space for temporary files. It's not meant to be used as a conventional system.

df -h reported that my root filesystem only had 304 KB of available space, and my tmp filesystem 100 MB. Considering that the Arch tarball itself is already over 500 MB, the device doesn't have nearly enough space to fit another OS on it.

The solution for this is quite simple: Use a USB drive. Indeed, my DG400 router has a USB 2.0 and a USB 3.0 port, presumably for sticking pen drives into. Evidently so, seeing as any inserted drives are automatically mounted under /mnt (I'm unsure whether this is done by OpenWRT by default or if it's an IOPSYS feature).

It's settled then. I used my PC to format a pen drive as ext4 (FAT won't work for this very well), downloaded the ARMv7 tarball and extracted it onto the pen drive:

# umount /dev/sdc1 # (replace with your USB drive)
# mkfs.ext4 /dev/sdc1
# mount /dev/sdc1 /mnt
# mkdir /mnt/archfs
# wget http://os.archlinuxarm.org/os/ArchLinuxARM-armv7-latest.tar.gz
# bsdtar -xpf ArchLinuxARM-armv7-latest.tar.gz -C /mnt/archfs

Done. After plugging the USB drive into the router, it got automatically mounted at /mnt/usb0 (the path might differ). However, it got mounted with the noexec flag, which prevents executables from being run. It's easy enough to remount it. On the router:

# mount /mnt/usb0 -o exec,remount

Great! It's time to test if we can now actually chroot into it:

# chroot /mnt/usb0/archfs /bin/bash
Illegal instruction (core dumped)

Uh oh. Looks like something is still wrong. Which brings us to…

Issue 2: Not all ARM is created equal

Looks like we're running into some instructions while running bash that our processor doesn't support. Let's see if we're still ARMv7 and I hadn't messed up:

# cat /proc/cpuinfo 
processor       : 0
model name      : ARMv7 Processor rev 1 (v7l)
BogoMIPS        : 1325.05
Features        : half thumb fastmult edsp tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x4
CPU part        : 0xc09
CPU revision    : 1

Strange. We're using the ARMv7 tarball, it should all be groovy. My custom firmware is compiled with GDB, which I could use to see exactly which instruction it's failing on. Since there's no way of running GDB + any of my Arch binaries natively without library mismatches, I opted to simply grab the core dump and use that instead. I looked into /proc/sys/kernel/core_pattern to identify the script responsible for handling coredumps and modified it to dump it to the root of my USB stick instead. I could then use GDB to look through the backtrace:

# gdb /mnt/usb0/archfs/bin/grep /mnt/usb0/coredump -q
Reading symbols from archfs/bin/grep...(no debugging symbols found)...done.
[New LWP 14713]

warning: Could not load shared library symbols for /lib/ld-linux-armhf.so.3.
Do you need "set solib-search-path" or "set sysroot"?
Core was generated by `/bin/grep'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0xb6fe5ba4 in ?? ()

I needed to set the proper sysroot as well, to fetch proper library symbols:

(gdb) set sysroot /mnt/usb0/archfs/
Reading symbols from /mnt/usb0/archfs/lib/ld-linux-armhf.so.3...(no debugging symbols found)...done.
(gdb) disas 0xb6fe5ba4
Dump of assembler code for function __sigsetjmp:
   0xb6fe5b70 <+0>:	movw	r12, #28028	; 0x6d7c
   0xb6fe5b74 <+4>:	movt	r12, #1
   0xb6fe5b78 <+8>:	ldr	r2, [pc, r12]
   0xb6fe5b7c <+12>:	mov	r12, r0
   0xb6fe5b80 <+16>:	mov	r3, sp
   0xb6fe5b84 <+20>:	eor	r3, r3, r2
   0xb6fe5b88 <+24>:	str	r3, [r12], #4
   0xb6fe5b8c <+28>:	eor	r3, lr, r2
   0xb6fe5b90 <+32>:	str	r3, [r12], #4
   0xb6fe5b94 <+36>:	stmia	r12!, {r4, r5, r6, r7, r8, r9, r10, r11}
   0xb6fe5b98 <+40>:	movw	r3, #28064	; 0x6da0
   0xb6fe5b9c <+44>:	movt	r3, #1
   0xb6fe5ba0 <+48>:	ldr	r2, [pc, r3]
=> 0xb6fe5ba4 <+52>:	vstmia	r12!, {d8-d15}
   0xb6fe5ba8 <+56>:	tst	r2, #512	; 0x200
   0xb6fe5bac <+60>:	beq	0xb6fe5bc8 <__sigsetjmp+88>
   0xb6fe5bb0 <+64>:	stfp	f2, [r12], #8
   0xb6fe5bb4 <+68>:	stfp	f3, [r12], #8
   0xb6fe5bb8 <+72>:	stfp	f4, [r12], #8
   0xb6fe5bbc <+76>:	stfp	f5, [r12], #8
   0xb6fe5bc0 <+80>:	stfp	f6, [r12], #8
   0xb6fe5bc4 <+84>:	stfp	f7, [r12], #8
   0xb6fe5bc8 <+88>:	b	0xb6fe39d8 <__sigjmp_save>
End of assembler dump.

Looks like our processor didn't like the vstmia instruction. Can't imagine why - it seems to be a valid ARMv7 instruction.

After reading through some reference manuals and consulting others online, it turned out that my SoC's processor is crippled: a whole set of instructions simply isn't supported by it. Luckily, thanks to those instructions not existing in ARMv5 and ARM being backwards-compatible, I could simply use the ARMv5-compiled system instead.
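On the PC, that boils down to repeating the earlier download and extraction with the other tarball (after clearing the old ARMv7 files out of /mnt/archfs first):

# wget http://os.archlinuxarm.org/os/ArchLinuxARM-armv5-latest.tar.gz
# bsdtar -xpf ArchLinuxARM-armv5-latest.tar.gz -C /mnt/archfs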

Repeating the steps to create the root filesystem, this time using the ArchLinuxARM-armv5-latest.tar.gz tarball instead, showed promising results. I could finally:

# chroot /mnt/usb0/archfs /bin/bash
[root@iopsys /]# cat /etc/os-release
NAME="Arch Linux ARM"
PRETTY_NAME="Arch Linux ARM"
ID=archarm

I exited the chroot after seeing it works. We still needed to mount some partitions so the chroot could see and interact with them and copy some files over. I wrote a helper script for all of that which you can find here.

Great, we can now initialise pacman and try upgrading the system.

# pacman-key --init
# pacman-key --populate archlinuxarm
# pacman -Syu

error: out of memory

Issue 3: Memory problems

Honestly, should've seen this one coming. free -m showed that I was working with around 100 Mb of usable memory, which is not much - no wonder pacman crapped out. Luckily, my device kernel was compiled with swap support. This essentially allows the system to "swap" memory contents out to the filesystem and load them later when necessary. It's very slow compared to real memory, but it gets the job done in a pinch. I created a 1G swapfile on my USB drive and activated it, whilst inside the chroot:

# truncate -s 0   /swapfile
# chattr +C       /swapfile
# fallocate -l 1G /swapfile
# chmod 600       /swapfile
# mkswap          /swapfile
# swapon          /swapfile

Running pacman again allowed me to continue upgrading the system, which it finished successfully.

At this point, I had a fully functional Arch Linux system which I could chroot into and utilise pretty much to the maximum. I've successfully set up Python bots, compiled software with gcc/g++, and so on: what you'd expect to be able to do on a normal system. I don't know why you would want to do this, but it's definitely possible.

I realise that it may not go this smoothly on other systems. For example, a large portion of routers utilise the MIPS architecture instead of ARM. If this is the case for you, it unfortunately means that Arch Linux is off the table, as it doesn't have any functioning MIPS builds. However, the Debian community maintains an active MIPS port of Debian which you might want to look into instead. Everything in this post should still pretty much apply to Debian/MIPS as well, with some minor differences.

This has also been done on other unconventional devices. Reddit user parkerlreed used a similar procedure to run Arch Linux on a Steamlink, which you can read here - it even has instructions on how to compile applications natively on it.

My wireguard cheatsheet

By: danman

I always search for this so I will write it down here.

Client

cd /etc/wireguard
wg genkey | tee privatekey | wg pubkey > publickey
vi wg0.conf
[Interface]
PrivateKey = 
Address = 10.9.0.X/24

[Peer]
PublicKey = dg1cKCId81d6h5cWUQ61BMHksBbi0FdFnitjxDuOuno=
Endpoint = vpn.danman.eu:51820
AllowedIPs = 10.9.0.0/24
PersistentKeepalive = 25

wg-quick up /etc/wireguard/wg0.conf
systemctl enable wg-quick@wg0.service
systemctl edit wg-quick@wg0.service

[Service]
Restart=on-failure
RestartSec=5s

Server

systemctl stop wg-quick@wg0.service
vi /etc/wireguard/wg0.conf
systemctl start wg-quick@wg0.service
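For reference, a minimal sketch of what the server side's /etc/wireguard/wg0.conf might look like; the keys are placeholders, and only the port and subnet are taken from the client config above:

[Interface]
PrivateKey = <server private key>
Address = 10.9.0.1/24
ListenPort = 51820

[Peer]
PublicKey = <client public key>
AllowedIPs = 10.9.0.X/32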

Reload

wg syncconf wg0 <(wg-quick strip wg0)

The Prometheus host agent's 'perf' collector can be kind of expensive

By: cks

Today I looked at some system statistics for the first time in quite a while and discovered that both my office and home desktops were handling about 12,000 to 15,000 interrupts a second. On my office desktop this used about 1.5% of the overall (multi-)CPU for IRQ handling; on my home desktop it was just over 4%. Eventually I traced this down to me having enabled the Prometheus host agent's 'perf' collector. This collector uses the Linux kernel's perf system (also) to collect CPU information like the number of instructions and cycles, hardware cache information on various sorts of hits and misses, and some kernel information like page faults, context switches, and CPU migrations (some of which is available from other sources).

The extra interrupts were specifically coming from what /proc/interrupts calls 'LOC' and labels as 'Local timer interrupts', and were distributed basically evenly across all CPUs. The underlying cause is a mystery to me; after a certain amount of delving into the relevant host agent code and doing some hackery, the best I can tell is that having enough specific perf CPU events enabled for collection at once appears to trigger this. It's possible that it happens in general when enough perf information sources are enabled at once.
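You can see the effect yourself without the host agent by sampling /proc/interrupts; a crude sketch that shows the per-CPU 'LOC' counters ten seconds apart is:

$ grep 'LOC:' /proc/interrupts; sleep 10; grep 'LOC:' /proc/interrupts

Dividing each per-CPU difference by ten gives that CPU's local timer interrupt rate per second.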

(If you're trying to test this yourself today with the host agent, note that in its current state you can't disable any 'hardware' (CPU) or 'software' (kernel) profilers. This is probably a bug in the code. I had to build a modified version that let me do this. However, this appears to reproduce in 'perf stat' if you enable enough things at once.)

The host agent's documentation does warn generically that the collectors that are disabled by default may have 'significant resource demands', so probably this isn't entirely surprising. The perf collector also isn't critical for my usage, because all I was using it for was to get the information for some 'total (CPU) instructions' and 'cycles per instruction' graphs on a personal Grafana dashboard about my desktops (I was previously collecting this information in another way).

PS: This is probably specific to some ranges of kernel versions, since I couldn't reproduce it with 'perf stat' on one of our Ubuntu machines. My desktops are both running the Fedora 40 6.8.9 kernel. That this is kernel and perhaps hardware dependent is a little bit irritating, since it means we'll have to keep an eye out for this on an ongoing basis if we ever enable the perf collector on our Ubuntu machines.

(This elaborates on some Fediverse posts.)

Some ideas on what Linux distributions can do about the new kernel situation

By: cks

In a comment on my entry on how the current Linux kernel CVE policy is sort of predictable, Ian Z aka nobrowser asked what a distribution like Debian is supposed to do today, now that the kernel developers are not going to be providing security analysis of fixes, especially for unsupported kernels (this is a concise way of describing the new kernel CVE policy). I don't particularly have answers, but I have some thoughts.

The options I can see today are:

  • More or less carrying on with a distribution specific kernel and backporting fixes into it if they seem important enough or otherwise are security relevant. This involves accepting that there will be some number of security issues in your kernel that are not in the upstream kernel, but this is already the case in reality today (cf).

  • Synchronize distribution releases to when the upstream kernel developers put out a LTS kernel version that will be supported for long enough, and then keep updating (LTS) kernel patch levels as new ones are released. Unfortunately the lifetime of LTS kernels is a little bit uncertain.

    My guess is that this will still leave distributions with any number of kernel security issues, because only bugfixes recognized as important are applied to LTS kernels. The Linux kernel developers are historically not great at recognizing when a bugfix has a security impact (cf again). However, once a security issue is recognized in your (LTS) kernel, at least the upstream LTS team are the ones who'll be fixing it, not you.

  • Give up on the idea of sticking with a single kernel version (much less a single patch level within that version) for the lifetime of a distribution release. Instead, expect to more or less track the currently supported kernels, or at least the LTS kernels (which would let you do releases whenever you want).

    (This is what Fedora currently does with the mainline kernel, although a distribution like Debian might want to be less aggressive about tracking the latest kernel version and patchlevel.)

Broadly, distributions are going to have to decide what is important to them. Just as we say 'good, cheap, fast, pick at most two', distributions are not going to be able to release whenever they want, use a single kernel version for years and years, and get perfect security in that kernel. Or at least they are not going to get that without doing a lot of work themselves.

(Again, the reality is that distributions didn't have this before; any old distribution kernel probably had a number of unrecognized security issues that had already been fixed upstream. That's kind of what it means for the average kernel CVE fix time between 2006 and 2018 to be '-100 days' and for 41% of kernel CVEs to have already been fixed when the CVE was issued.)

A volunteer-based distribution that prioritizes security almost certainly has no option other than closely tracking mainline kernels, and accepting whatever stability churn ensues (probably it's wise to turn off most or all configurable new features, freezing on the feature set of your initially released kernel). Commercial distributions like Red Hat Enterprise and Canonical Ubuntu can do whatever their companies are willing to pay for, but in general I don't think we're going to keep getting long term support for free.

(A volunteer based distribution that prioritizes not changing anything will have to accept that there are going to be security issues in their kernels and they will periodically scramble to find fixes or create fixes for them, and maybe get their own CVEs issued (and possibly have people write sad articles about how this distribution is using ancient kernels with security issues). I don't think this is a wise or attractive thing myself; I would rather keep up with kernel updates, at least LTS ones.)

Distributions don't have to jump on new kernel patchlevels (LTS or otherwise) or kernel versions immediately when they're released; not even Fedora does that. It's perfectly reasonable to do as much build farm testing as you can before rolling out a new LTS patch release or whatever, assuming that there are no obvious security issues that force a fast release.

The Linux kernel giving CVEs to all bugfixes is sort of predictable

By: cks

One of the controversial recent developments in the (Linux kernel) security world is that the Linux kernel developers have somewhat recently switched to a policy of issuing CVEs for basically all bugfixes made to stable kernels. This causes the kernel people to issue a lot of CVEs and means that every new stable kernel patch release officially fixes a bunch of them, and both of these are making some people annoyed. This development doesn't really surprise me (although I wouldn't have predicted it in advance), because I feel it's a natural result of the overall situation.

(This change happened in February when the Linux kernel became a CVE Numbering Authority; see also the LWN note and its comments.)

As I understand it, the story starts with all of the people who maintain their own version of the kernel, which in practice means that they're maintaining some old version of the kernel, one that's not supported any more by the main developers. For a long time, these third parties have wanted the main kernel to label all security fixes. They wanted this because they wanted to know what changes they should backport into their own kernels; they only wanted to do this for security fixes, not every bugfix or change the kernel makes at some point during development.

Reliably identifying that a kernel bug fix is also a security fix is a quite hard problem, possibly a more or less impossible one, and there are plenty of times when the security impact of a fix has been missed, such as CVE-2014-9940. In many cases, seriously trying to assess whether a bugfix is a security fix would take noticeable extra effort. Despite all of this, my impression is that third party people keep yelling at the main Linux kernel developers about not 'correctly' labeling security fixes, and have been yelling at them for years.

(Greg Kroah-Hartman's 2019 presentation about CVEs and the Linux kernel notes that between 2006 and 2018, 41% of the Linux kernel CVEs were fixed in official kernels before the CVE had been issued, and the average fix date was '-100 days' (that is, 100 days before the CVE was issued).)

These third party people are exactly that; third parties. They would like the kernel developers to do extra work (work that may be impossible in general), not to benefit the kernel developers, but to benefit themselves, the third parties. These third parties could take on the (substantial) effort of classifying every bug fix to the kernel and evaluating its security impact, either individually or collectively, but they don't want to do the work; they want the mainstream kernel developers to do it for them.

The Linux kernel is an open source project. The kernel developers work on what is interesting to them, or in some cases what they're paid by their employers to work on. They do not necessarily do free work for third parties, even (or especially) if the third parties yell at them. And if things become annoying enough (what with all of the yelling), then the kernel developers may take steps to make the whole issue go away. If every bug fix has a CVE, well, you can't say that the kernel isn't giving CVEs to security issues it fixes. Dealing with the result is your problem, not the kernel developers' problem. This is not a change in the status quo; it has always been your problem. It was just (more) possible to pretend otherwise until recently.

(This elaborates on part of something I said on the Fediverse.)

Sidebar: The other bits of the kernel's CVE's policy

There are two interesting other aspects of the current policy. First, the kernel developers will only issue CVEs for currently supported versions of the kernel. If you are using some other kernel and you find a security issue, the kernel people say you go to the provider of that kernel, but the resulting CVE won't be a 'kernel.org Linux kernel CVE', it will be an 'organization CVE'. Second, you can't automatically get CVEs assigned for unfixed issues; you have to ask the kernel's CVE team (after you've reported the security issue through the kernel's process for this). This means that you have to persuade the kernel developers that there actually is an issue, which I think is a reaction to junk kernel CVEs that people have gotten issued in the past.

It's very difficult to tell if a Linux kernel bug is a security issue

By: cks

One of the controversial recent developments in the (Linux kernel) security world is that the Linux kernel developers have somewhat recently switched to a policy of aggressively issuing CVEs for kernel changes. It's simplest to quote straight from the official kernel.org documentation:

Note, due to the layer at which the Linux kernel is in a system, almost any bug might be exploitable to compromise the security of the kernel, but the possibility of exploitation is often not evident when the bug is fixed. Because of this, the CVE assignment team is overly cautious and assign CVE numbers to any bugfix that they identify. [...]

Naturally this results in every new patch release of a stable kernel containing a bunch of CVE fixes, which has people upset. There are various things I have to say about this, but I'll start with the straightforward one: the kernel people are absolutely right about the difficulty of telling whether a kernel bug is also a security issue.

Modern exploit development technology has become terrifyingly capable, as have today's exploit developers. It's routine to chain multiple bugs together in complex ways to create an exploit, reliably achieving results that often seem like sorcery to an outsider like me (consider Google Project Zero's Analyzing a Modern In-the-wild Android Exploit or An analysis of an in-the-wild iOS Safari WebContent to GPU Process exploit). Modern exploit techniques have made entire classes of bugs previously considered relatively harmless into security bugs, like 'use after free' or 'double free' memory usage bugs; the assumption today is that any instance of one of those in the kernel can probably be weaponized as part of an exploit chain, even if no one has yet worked out how to do it for a specific instance.

Once upon a time, it was reasonably possible to immediately tell whether or not a bug had a security impact, and you could divide fixed kernel bugs into 'ordinary bug fixes' and 'fixes a security problem'. For many years now, that has not been the case; instead, the fact that a bugfix was also a security fix may only become clear years after it was made. Today, the line between the two is not so much narrow as invisible. Often the difference between 'a kernel bug' and 'a kernel bug that can be weaponized as part of an exploit chain' seems to be whether or not a highly skilled person (or team) has spent enough time and effort to come up with something sufficiently ingenious.

Are there bug fixes that are genuinely impossible to weaponize as part of a security exploit? Probably. But reliably identifying them has proven to be very challenging, or to put it another way the Linux kernel has tried to do it in the past and failed repeatedly, identifying bug fixes as non-security ones when they turned out to be bugs that could be weaponized.

(The Linux kernel still periodically has security bugs that are obvious once discovered and often straightforward to exploit, but these are relatively uncommon.)

(This expands a bit on something I said in a conversation on the Fediverse.)

Our likely long term future (not) with Ubuntu (as of early 2024)

By: cks

Over on the Fediverse I said something that's probably not particularly surprising:

In re Canonical and Ubuntu: at work we are still using Ubuntu LTS (and we're going to start using 24.04), but this is on servers where we don't have to deal with snaps (we turn them off, they don't work in our environment). But the Canonical monetization drive is obvious and the end point is inevitable, so I expect we'll wind up on Debian before too many more years (depending on what Canonical does to LTS releases). 2026? 2028? Who knows.

wrt: <a post by @feoh>

(Work is a university department where we use physical servers in our own machine rooms and don't have the funding to pay for commercial support for anywhere near all of those servers.)

The 2026 and 2028 dates come from the expected next Ubuntu LTS release dates (which since 2008 have been every two years toward the end of April). It's always possible that Canonical could do something that unexpectedly forces us off Ubuntu LTS 22.04 and 24.04 before 2026 comes around and we have to make a decision again, but it seems somewhat unlikely (the obvious change would be to lock a lot of security updates behind 'Ubuntu Pro', effectively making the non-paid versions of Ubuntu LTS unsupported for most security fixes).

One potential and seemingly likely change that would force us to move away from Ubuntu would be Canonical changing important non-GUI packages to be Snaps instead of .debs that can be installed through apt (they've already moved important GUI packages to Snaps, but we are so far living without them). Snaps simply don't work in our environment and if Canonical forced us, we would rather move to Debian than to try to hack up Ubuntu and our NFS based environment to make them work (for the moment, until Canonical changes something that breaks our hacks). Another potential change that I keep expecting is for Canonical to more or less break the server installer in non-cloud environments, or to require them to provide emulations of cloud facilities (such as something to supply system metadata).

But in the long term I don't think the specific breaking changes are worth trying to predict. The general situation is that Canonical is a commercial company that is out to make money (lots of money), and free Ubuntu LTS for servers (or for anything) is a loss leader. The arc of loss leaders bends towards death, whether it be through obvious discontinuation, deliberate crippling, or simply slow strangulation from lack of resources. Sooner or later we'll have to move off Ubuntu; the only big questions are how soon and how much notice we'll have.

Should we jump before we have to? That may be a question we'll be asking ourselves in 2026, or maybe 2025 when the next Debian release will probably come out.

(Our answer for Ubuntu 24.04 LTS is that there's nothing in 24.04 so far that forces us to think about it so we're going to roll on with the default of continuing with Ubuntu LTS releases.)

Libvirt's virt-viewer and (guest) screen blanking

By: cks

One of the things that I sometimes need to do with my libvirt-based virtual machines is connect to their 'graphical' consoles. There are a variety of ways to do this, but generally the most convenient way for me has been virt-viewer, followed by virt-manager. Virt-viewer is typically pretty great, but it has one little drawback that surfaces with some of my VMs, especially the Fedora ones that boot into graphics mode. The standard behavior for Fedora machines sitting idle in graphics mode, especially on the login screen, is that after a while they'll blank the screen, which winds up turning off video output.

On a physical machine, the way to un-blank the display is to send it some keyboard or mouse input. Unfortunately, once this happens, all virt-viewer will do is display a 'Connected to graphics server' message in an otherwise inactive display window. Typing, clicking the mouse buttons, or moving the mouse does nothing in my environment; virt-viewer seems to be unwilling to send keyboard input to a virtual machine with a powered-down display.

(Virt-viewer has a menu that will let you send various keystrokes to the virtual machine, but this menu is greyed out when the virt-viewer sees the screen as blanked.)

My traditional way to fix this was to briefly bring up the virtual machine's console in virt-manager, which would reliably unblank it. With the console display now active, I could switch back to virt-viewer. Recently I discovered a better way. One of the things that virsh can do is directly send keystrokes to a virtual machine guest with 'virsh send-key'. Sending any keys to the guest will cause it to un-blank the screen, which is the result I want.

(The process for sending keys is a little bit arcane, you'll want to consult either virtkeyname-linux or maybe virtkeycode-linux, both of which are part of the libvirt manual pages.)

What key or keys you want to send are up to you. Right now I'm sending an ESC, which feels harmless and is easy to remember. If I was clever I'd write an 'unblank' script that just took the virtual machine's name and did all of the necessary magic for it.
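That hypothetical 'unblank' script is small enough to sketch here (the script name and the choice of ESC are mine):

#!/bin/sh
# unblank: send an ESC keystroke to the named libvirt guest to wake its display
virsh send-key "$1" KEY_ESC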

(And someday hopefully this will all be unnecessary because virt-viewer will learn how to do this itself. Possibly I'm missing something in virt-viewer that would fix this, or something in libvirt machine configuration that would disable screen blanking.)

Making virtual machine network interfaces inactive in Linux libvirt

By: cks

Today, for reasons beyond the scope of this entry, I was interested in arranging to boot a libvirt-based virtual machine with a network interface that had no link signal, or at least lacked the virtual equivalent of it. It was not entirely obvious how to do this, and some of the ways I tried didn't work. So let's start with the easier thing to do, which is to set up a network interface that exists but doesn't talk to anything.

The easiest way I know of to do this is to create an 'isolated' libvirt network. An isolated libvirt network is essentially a virtual switch (technically a bridge) that is not connected to the outside world in any way. If your virtual machine's network interface is the only thing connected to this isolated network, it will have link signal but nothing out there to talk to. You can create such a network either by explicitly writing and loading the network XML yourself or through a GUI such as virt-manager (I recommend the GUI).

However, what I wanted was a network interface (a link) that was down, not up but connected to a non-functioning network. This is possible in several different ways through libvirt's various interfaces.

If a virtual machine is running, there are 'virsh' commands that will let you see the virtual machine's interfaces and manipulate their state. 'virsh domiflist <domain>' will give you the interface names, then 'domif-getlink <domain> <interface>' will get its current state and 'domif-setlink <domain> <interface> <state>' will change it. If the virtual machine is not running, you'll need to get the interface's MAC from 'domiflist', then use 'domif-setlink <domain> <iface> <state> --config' to affect the link state when the virtual machine starts up. However, you'll need to remember to later reset things with 'domif-setlink ... up --config' to make the interface be active on future boots.
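As a concrete sketch for a running guest (the domain name 'vmguest' and interface name 'vnet0' here are made up; take the real ones from the domiflist output):

$ virsh domiflist vmguest
$ virsh domif-getlink vmguest vnet0
$ virsh domif-setlink vmguest vnet0 down
$ virsh domif-setlink vmguest vnet0 up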

If you like virt-manager's GUI (which I do), the easier approach for a powered down virtual machine is to go into its hardware list, pick the network device, and untick the 'Link state: active' tickbox (then Apply this change). You can then start the VM, which will come up with the interface behaving as if it had no network cable connected. Later you can tick the box again (and apply it) to reconnect the interface. The same thing can be done by editing the domain XML for the virtual machine to modify virtual link state. I believe this is what 'domif-setlink ... --config' does behind the scenes, although I haven't dumped the XML after such a change to see.

(In general there's a fair amount of interesting things lurking in the virsh manual page. For instance, until today I didn't know about 'virsh console' to connect to the serial console of a virtual machine.)

Modern Linux mounts a lot of different types of virtual filesystems

By: cks

For reasons that don't fit in the margins of this entry, I was recently looking at the filesystems that we have mounted on our Ubuntu machines and their types. Some of these filesystems are expected and predictable, such as the root filesystem (ext4), any NFS mounts we have on a particular machine (which can be of two different filesystem types), maybe a tmpfs mount with a size limit that we've set up, and ZFS filesystems on our fileservers. But a modern Linux system doesn't stop there, and in fact has a dizzying variety of virtual filesystems on various mount points, with various different virtual filesystem types.

As of Ubuntu 22.04 on a system that boots using UEFI (and runs some eBPF stuff), here is what you get:

sysfs        /sys
proc         /proc
devtmpfs     /dev
devpts       /dev/pts
tmpfs        /run
securityfs   /sys/kernel/security
tmpfs        /dev/shm
tmpfs        /run/lock
cgroup2      /sys/fs/cgroup
pstore       /sys/fs/pstore
efivarfs     /sys/firmware/efi/efivars
bpf          /sys/fs/bpf
autofs       /proc/sys/fs/binfmt_misc
hugetlbfs    /dev/hugepages
mqueue       /dev/mqueue
debugfs      /sys/kernel/debug
tracefs      /sys/kernel/tracing
fusectl      /sys/fs/fuse/connections
configfs     /sys/kernel/config
ramfs        /run/credentials/systemd-sysusers.service
binfmt_misc  /proc/sys/fs/binfmt_misc
rpc_pipefs   /run/rpc_pipefs
tracefs      /sys/kernel/debug/tracing
tmpfs        /run/user/<UID>

That amounts to 20 different virtual filesystem types, or 19 if you don't count systemd's autofs /proc/sys/fs/binfmt_misc mount.

On the one hand, I'm sure all of these different virtual filesystem types exist for good reason, and it makes the life of kernel code simpler to have so many different ones. On the other hand, it makes it more difficult for people who want to exclude all virtual filesystems and only list 'real' ones. For many virtual filesystems and their mounts, the third field of /proc/self/mounts (the type) is the same as the first field (the theoretical 'source' of the mount), but there are exceptions:

udev devtmpfs
systemd-1 autofs
none ramfs
sunrpc rpc_pipefs

(On another system there is 'gvfsd-fuse' versus 'fuse.gvfsd-fuse' and 'portal' versus 'fuse.portal'.)

As a pragmatic matter, for things like our metrics system we're probably better off excluding by mount point; anything at or under /run, /sys, /proc, and /dev is most likely to be a virtual filesystem of some sort. Alternately, you can rely on things like metrics host agents to have a sensible default list, although if you have to modify that list yourself you're going to need to keep an eye on its defaults.
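A rough sketch of that mount point exclusion (the prefix list is just ours and may need adjusting for your systems) is to filter /proc/self/mounts on its second field:

$ awk '$2 !~ "^/(run|sys|proc|dev)(/|$)"' /proc/self/mounts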

PS: technically this is a mix of true virtual filesystems, which materialize their files and directories on demand from other information, and 'virtual' filesystem types that are merely purely RAM-based but do store inodes, files, and directories that you create and manipulate normally. Since the latter are ephemeral, they usually get lumped together with the former, but there is a core difference. And a virtual filesystem isn't necessarily volatile; both 'efivarfs' and 'pstore' are non-volatile.

Limiting the maximum size of Amanda debug logs with a Linux tmpfs mount

By: cks

Recently we had a little incident with our Amanda backup system, where the Amanda daemon on both the backup server and one particular backup client got into a state where they kept running and, more importantly, kept writing out plaintive debugging log messages. We discovered this when first the Amanda server and then an important Amanda client entirely filled up their root filesystems with what was, by that time, a several hundred gigabyte debug log file, which they each wrote into their /var/log/amanda directory tree. Afterward, we wanted to limit the size of Amanda debugging logs so that they couldn't fill up the root filesystem any more, especially on Amanda clients (which are our normal servers, especially our fileservers).

All of our root filesystems are ext4, which supports quotas for users, groups, and "projects", as sort of covered in the ext4 manual page. In theory we could have added a size limit on /var/log/amanda with project quotas. In practice this would have required updating the root filesystem's mount options in order to get it to take effect (and that means editing /etc/fstab too), plus we have no experience with ext4 quotas in general and especially with project quotas. Instead we realized that there was a simpler solution.

(We can't use user quotas on the user that Amanda runs as because Amanda also has to write and update various things outside of /var/log/amanda. We don't want those to be damaged if /var/log/amanda gets too big.)

The easiest way to get a small, size limited filesystem on Linux is with a tmpfs mount. Of course the contents of a tmpfs mount are ephemeral, but we almost never look at Amanda's debug logs and so we decided that it was okay to lose the past few days of them on a reboot or other event (Amanda defaults to only keeping four days of them). Better yet, with systemd you can add a tmpfs mount with a systemd unit and some systemd commands, without having to modify /etc/fstab in any way. Some quick checking showed that our /var/log/amanda directories were all normally quite small, with the largest ones being 25 Mbytes or so, so the extra memory needed for a tmpfs for them is fine.

Without comments, the resulting systemd var-log-amanda.mount file is:

[Unit]
Description=Temporary Amanda directory /var/log/amanda
# I am not 100% sure about this. It's copied from other
# tmpfs mount units.
DefaultDependencies=no
Conflicts=umount.target
Before=local-fs.target umount.target
After=swap.target

[Mount]
What=tmpfs
Where=/var/log/amanda
Type=tmpfs
Options=size=128m,nr_inodes=100000,\
    nosuid,nodev,strictatime,\
    mode=0770,uid=34,gid=34

[Install]
RequiredBy=local-fs.target

(The UID and GID are those of the standard fixed Ubuntu 'backup' user and group. Possibly we can specify these by name instead; I haven't experimented to see if that's supported by mount and tmpfs. The real Options= line isn't split across multiple lines this way; I did it here to not break web page layout.)
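For completeness, putting a unit like this into service is the usual systemd dance; note that a .mount unit's file name has to match its mount point, with '/' translated to '-':

cp var-log-amanda.mount /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now var-log-amanda.mount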

In theory it would be better to use zram for this, since Amanda's debug logs are all text and should compress nicely. In practice, setting up a zram device and a filesystem on it and getting it all mounted has more moving parts than a tmpfs mount, which can be done as a single .mount systemd unit.

If we wanted persistence, another option could be a loopback device that used an appropriately sized file on the root filesystem as its backing store. I suspect that the actual mounting can be set up in a single systemd mount unit with appropriate options (since mount has options for setting up the loop device for you given the backing file).
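As a sketch only (untested, and /var/log/amanda.img is a hypothetical pre-created filesystem image), the [Mount] section of such a unit might look like:

[Mount]
# a regular file used as the backing store; mount sets up the loop device
What=/var/log/amanda.img
Where=/var/log/amanda
Type=ext4
Options=loop,nosuid,nodev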

Getting the underlying disks of a Linux software RAID array

By: cks

Due to the pre-beta Ubuntu 24.04 issue I found with grub updates on systems with software RAID root filesystems and BIOS MBR booting, for a while I thought we'd need something that rewrote a debconf key to change it from naming the software RAID of the root filesystem to naming the devices it was on. So I spent a bit of time working out how best to do that, which I'm going to write down for any future use.

At one level this question seems silly, because the devices are right there in /proc/mdstat (once we know which software RAID device the root filesystem is mounted from). However, you have to parse them out and be careful to get it right, so we'd ideally like an easier way, which is to use lsblk:

# lsblk -n -p --list --output TYPE,NAME -s /dev/md0
raid1 /dev/md0
part  /dev/sda2
disk  /dev/sda
part  /dev/sdb2
disk  /dev/sdb

We want the 'disk' type devices. Having the basic /dev names is good enough for some purposes (for example, directly invoking grub-install), but we may want to use /dev/disk/by-id names in things like debconf keys for greater stability if our system has additional data disks and their 'sdX' names may get renumbered at some point.

To get the by-id names, you have two options, depending on how old your lsblk is. Sufficiently recent versions of lsblk support an 'ID-LINK' field, so you can use it to directly get the name you want (just add it as an output field in the lsblk invocation above). Otherwise, the easiest way to do this is with udevadm:

udevadm info -q symlink /dev/sda | fmt -1 | sort

Since there are a bunch of /dev/disk/by-id names, you'll need to decide which one you pick and which ones you exclude. For our systems, it looks like we'd exclude 'wwn-' and 'nvme-eui.' names, probably exclude any 'scsi-' name that was all hex digits, and then take the alphabetically first option. Since lsblk's 'ID-LINK' field basically does this sort of thing for you, it's the better option if you can use it.
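Putting the two steps together, something like this should print one by-id style name per underlying disk, assuming your lsblk is recent enough to have ID-LINK:

lsblk -n -p --list --output TYPE,NAME,ID-LINK -s /dev/md0 |
  awk '$1 == "disk" {print $3}'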

Going from a software RAID to the EFI System Partitions (ESPs) on its component disks is possible but harder (and you may need to do this if the relevant debconf settings have gotten scrambled). Given a disk, lsblk can report all of the components of it and what their partition type is:

# lsblk --list --output FSTYPE,PARTTYPE,NAME -n -p /dev/nvme0n1
ext4                                                   /dev/md0
                                                       /dev/nvme0n1
vfat              c12a7328-f81f-11d2-ba4b-00a0c93ec93b /dev/nvme0n1p1
linux_raid_member 0fc63daf-8483-4772-8e79-3d69d8477de4 /dev/nvme0n1p2

If a disk has an ESP, it will be a 'vfat' filesystem with the partition GUID shown here, which is the one assigned to indicate an ESP. In many Linux environments you can skip checking for the GUID and simply assume that any 'vfat' filesystem on your servers is there because it's the ESP. If you see this partition GUID but lsblk doesn't say that this is a vfat filesystem, what you have is a potential ESP that was set up during partitioning but then never formatted as a (vfat) filesystem. To do this completely properly you need to mount these filesystems to see if they have the right contents, but here we'd just assume that a vfat filesystem with the right partition GUID had been set up properly by the installer (or by whoever did the disk replacement).
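As a sketch, lsblk can do most of this filtering for you; this only matches partitions that both carry the ESP partition GUID and actually contain a vfat filesystem:

espguid=c12a7328-f81f-11d2-ba4b-00a0c93ec93b
lsblk --list -n -p --output FSTYPE,PARTTYPE,NAME /dev/nvme0n1 |
  awk -v g="$espguid" '$1 == "vfat" && $2 == g {print $3}'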

(A partition GUID of '21686148-6449-6e6f-744e-656564454649' is a BIOS boot partition, which is often present on modern installs that use BIOS MBR booting.)

It's far from clear how grub package updates work on Ubuntu

By: cks

Recently I ran across (and eventually reported) an issue on pre-beta Ubuntu 24.04 where a grub package update would fail for systems with software RAID root disks and BIOS MBR booting. The specific error was that grub-install could not install the new version of GRUB's boot-time code on '/dev/md0', the (nominal) device of the root filesystem, reporting an error to the effect of:

grub-install: warning: File system `ext2' doesn't support embedding.
grub-install: warning: Embedding is not possible. GRUB can only be installed in this setup by using blocklists. However, blocklists are UNRELIABLE and their use is discouraged..
grub-install: error: diskfilter writes are not supported.
  grub-install failure for /dev/md0

(You can work around this by reconfiguring the grub package to use the underlying disk devices, either by doing 'dpkg-reconfigure grub-pc' or by installing package updates in a manner where dpkg is allowed to ask you questions. Also, this is another case of grub-install having unclear error messages.)

One of the puzzling things about this entire situation is that the exact same configuration works on Ubuntu 22.04 and there are no obvious differences between 22.04 and 24.04 here. For instance, there are debconf keys for what the root filesystem device is and they are exactly the same between 22.04 and 24.04:

; debconf-show grub-pc
[...]
* grub-pc/install_devices: /dev/disk/by-id/md-name-ubuntu-server:0

At this point you might guess (as I did) that 'grub-install /dev/md0' works on Ubuntu 22.04. However, it does not; it fails with the same error as in 24.04. So presumably how grub-install is invoked during package updates is different between 22.04 and 24.04.

As far as I can tell, the 'grub-pc' package runs grub-install from its 'postinst' script, which you can find in /var/lib/dpkg/info/grub-pc.postinst. If you take a look at this script, you can see that it's a rather complex script that is quite embedded into the general Debian package update and debconf framework. If there are ways to run it as a standalone script so that you can understand what it's doing, those ways aren't at all obvious. It's also not obvious how the script is making or not making decisions, and the 22.04 and 24.04 versions seem pretty similar. Nor does scanning and searching either version of the script provide any smoking guns in the form of, for example, mentions of 'md-'.

(You have to know a reasonable amount about dpkg to even find /var/lib/dpkg/info and know that the 'grub-pc.postinst' file is what you're looking for. The dpkg manual page does mention that packages can have various scripts associated with them.)

All of this adds up to something that's almost impossible for ordinary people to troubleshoot or debug. All we can readily determine is that this worked in Ubuntu 20.04 LTS and 22.04 LTS, and doesn't work in the pre-beta 24.04 (and probably not in the beta 24.04, and most likely not in the released 24.04). The mechanisms of it working and not working are opaque, buried inside several layers of black boxes.

Part of this opacity is that it's not even clear what Ubuntu's grub package does or is supposed to do on package update. If you run a UEFI system with mirrored system disks, for example, you may be a little bit surprised to find out that Ubuntu's grub is probably quietly updating all your EFI system partitions when it does package updates.

PS: after much delving into things using various tools and the fact that I have various scratch virtual machines available, I now believe that the answer is that Ubuntu 20.04 and 22.04 don't run grub-install at all when the grub package (for MBR booting) is updated. This fact is casually semi-disguised in the 20.04 and 22.04 grub-pc postinst script. Presumably the 20.04 and 22.04 server installer should have set 'grub-pc/install_devices' to a different value, but that problem was being covered up by grub-install normally not running and using that value.

Some thoughts on switching daemons to be socket activated via systemd

By: cks

Socket activation is a systemd feature for network daemons where systemd is responsible for opening and monitoring the Internet or local socket for a daemon, and it only starts the actual daemon when a client connects. This behavior mimics the venerable inetd but with rather more sophistication and features. A number of Linux distributions are a little bit in love with switching various daemons over to being socket activated this way, from the traditional approach where the daemon handles listening for connections itself (along with the sockets involved). Sometimes this goes well, and sometimes it doesn't.
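Mechanically, socket activation is a pair of systemd units with matching names; here is a minimal hedged sketch with a made-up 'mydaemon' (a real daemon has to be able to take over an already-listening socket handed to it by systemd, for example via sd_listen_fds()):

# mydaemon.socket
[Unit]
Description=Listening socket for mydaemon

[Socket]
ListenStream=127.0.0.1:9999

[Install]
WantedBy=sockets.target

# mydaemon.service
[Unit]
Description=mydaemon, started on first connection

[Service]
# inherits the listening socket from systemd instead of opening its own
ExecStart=/usr/local/sbin/mydaemon

With 'systemctl enable --now mydaemon.socket', only the socket is set up at boot; the service unit is started the first time a client connects.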

There are a number of advantages to having a service (a daemon) activated by systemd through socket activation instead of running all the time:

  • Services can simplify their startup ordering because their socket is ready (and other services can start trying to talk to it) before the daemon itself is ready. In fact, systemd can reliably know when a socket is 'ready' instead of having to guess when a service has gotten that far in its startup.

  • Heavy-weight daemons don't have to be started until they're actually needed. As a consequence, these daemons and their possibly slow startup don't delay the startup of the overall system.

  • The service (daemon) responsible for handling a particular socket can often be restarted or swapped around without clients having to care.

  • The daemon responsible for the service can shut down after a while if there's no activity, reducing resource usage on the host; since systemd still has the socket active, the service will just get restarted if there's a new client that wants to talk to it.

Socket activated daemons don't have to ever time out and exit on their own; they can hang around until restarted or explicitly stopped if they want to. But it's common to make them exit on their own after a timeout, since this is seen as a general benefit. Often this is actually convenient, especially on typical systems. For example, I believe many libvirt daemons exit if they're unused; on my Fedora workstations, this means they're often not running (I'm usually not running VMs on my desktops).

Apart from another systemd unit and the daemon having a deeper involvement with systemd, the downside of socket activation is that your daemon isn't immediately started and sometimes it may not be running. The advantage of daemons immediately starting on boot is that you know right away whether or not they could start, and if they're always running you don't have to worry about whether they'll restart under the system's current conditions (and perhaps some updated configuration settings). If the daemon has an expensive startup process, socket activation can mean that you have to wait for that on the first connection (or the first connection after things go idle), as systemd starts the daemon to handle your connection and the daemon goes through its startup.

Similarly, having the theoretical possibility for a daemon to exit if it's unused for long enough doesn't matter if it will never be unused for that long once it starts. For example, if a daemon has a deactivation timeout of two minutes of idleness and your system monitoring connects to it for a health check every 59 seconds, it's never going to time out (and it's going to be started very soon after the system boots, when the first post-boot health check happens).

PS: If you want to see all currently enabled systemd socket activations on your machine, you want 'systemctl list-sockets'. Most of them will be local (Unix) sockets.

The Linux kernel.task_delayacct sysctl and why you might care about it

By: cks

If you run a recent enough version of iotop on a typical Linux system, it may nag at you to the effect of:

CONFIG_TASK_DELAY_ACCT and kernel.task_delayacct sysctl not enabled in kernel, cannot determine SWAPIN and IO %

You might wonder whether you should turn on this sysctl, how much you care, and why it was defaulted to being disabled in the first place.

This sysctl enables (Task) Delay accounting, which tracks things like how long things wait for the CPU or wait for their IO to complete on a per-task basis (which in Linux means 'thread', more or less). General system information will provide you an overall measure of this in things like 'iowait%' and pressure stall information, but those are aggregates; you may be interested in knowing things like how much specific processes are being delayed or are waiting for IO.

(Also, overall system iowait% is a conservative measure and won't give you a completely accurate picture of how much processes are waiting for IO. You can get per-cgroup pressure stall information, which in some cases can come close to a per-process number.)

In the context of iotop specifically, the major thing you will miss is 'IO %', which is the percent of the time that a particular process is waiting for IO. Task delay accounting can give you information about per-process (or task) run queue latency but I don't know if there are any tools similar to iotop that will give you this information. There is a program in the kernel source, tools/accounting/getdelays.c, that will dump the raw information on a one-time basis (and in some versions, compute averages for you, which may be informative). The (current) task delay accounting information you can theoretically get is documented in comments in include/uapi/linux/taskstats.h, or this version in the documentation. You may also want to look at include/linux/delayacct.h, which I think is the kernel internal version that tracks this information.

(You may need the version of getdelays.c from your kernel's source tree, as the current version may not be backward compatible to your kernel. This typically comes up as compile errors, which are at least obvious.)

How you can access this information yourself is sort of covered in Per-task statistics interface, but in practice you'll want to read the source code of getdelays.c or the Python source code of iotop. If you specifically want to track how long a task spends delaying for IO, there is also a field for it in /proc/<pid>/stat; per proc(5), field 42 is delayacct_blkio_ticks. As far as I can tell from the kernel source, this is the same information that the netlink interface will provide, although it only has the total time waiting for 'block' (filesystem) IO and doesn't have the count of block IO operations.
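For a quick one-off look at a single process, you can pull the field out by hand. This is only a sketch with a hypothetical PID; it strips the leading 'pid (comm)' part first so that process names containing spaces don't throw off the field count, which makes field 42 the 40th remaining field:

pid=12345
awk '{ sub(/.*\) /, ""); print "blkio delay ticks:", $40 }' /proc/$pid/stat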

Task delay accounting can theoretically be requested on a per-cgroup basis (as I saw in a previous entry on where the Linux load average comes from), but in practice this only works for cgroup v1. This (task) delay accounting has never been added to cgroup v2, which may be a sign that the whole feature is a bit neglected. I couldn't find much to say why delay accounting was changed (in 2021) to default to being off. The commit that made this change seems to imply it was defaulted to off on the assumption that it wasn't used much. Also see this kernel mailing list message and this reddit thread.

Now that I've discovered kernel.task_delayacct and played around with it a bit, I think it's useful enough for us for diagnosing issues that we're going to turn it on by default until and unless we see problems (performance or otherwise). Probably I'll stick to doing this with an /etc/sysctl.d/ drop in file, because I think that gets activated early enough in boot to cover most processes of interest.
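The drop-in itself is a one-liner; the file name here is just an illustration:

# /etc/sysctl.d/99-task-delayacct.conf
kernel.task_delayacct = 1

(You can apply it immediately with 'sysctl --system' or 'sysctl kernel.task_delayacct=1', although per the next paragraph only processes started afterward are fully covered.)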

(As covered somewhere, if you turn delay accounting on through the sysctl, it apparently only covers processes that were started after the sysctl was changed. Processes started before have no delay accounting information, or perhaps only 'CPU' delay accounting information. One such process is init, PID 1, which will always be started before the sysctl is set.)

PS: The per-task IO delays do include NFS IO, just as iowait does, which may make it more interesting if you have NFS clients. Sometimes it's obvious which programs are being affected by slow NFS servers, but sometimes not.

Reading the Linux cpufreq sysfs interface is (deliberately) slow

By: cks

The Linux kernel has a CPU frequency (management) system, called cpufreq. As part of this, Linux (on supported hardware) exposes various CPU frequency information under /sys/devices/system/cpu, as covered in Policy Interface in sysfs. Reading these files can provide you with some information about the state of your system's CPUs, especially their current frequency (more or less). This information is considered interesting enough that the Prometheus host agent collects (some) cpufreq information by default. However, there is a little caution here, which is that the kernel apparently deliberately slows down reading this information from /sys (as I learned recently). A comment in the relevant Prometheus code says that this delay is 50 milliseconds, but this comment dates from 2019 and may be out of date now (I wasn't able to spot the slowdown in the kernel code itself).

On a machine with only a few CPUs, reading this information is probably not going to slow things down enough that you really notice. On a machine with a lot of CPUs, the story can be very different. We have one AMD 512-CPU machine, and on this machine reading every CPU's scaling_cur_freq one at a time takes over ten seconds:

; cd /sys/devices/system/cpu/cpufreq
; time cat policy*/scaling_cur_freq >/dev/null
10.25 real 0.07 user 0.00 kernel

On a 112-CPU Xeon Gold server, things are not so bad at 2.24 seconds; a 128-Core AMD takes 2.56 seconds. A 64-CPU server is down to 1.28 seconds, a 32-CPU one 0.64 seconds, and on my 16-CPU and 12-CPU desktops (running Fedora instead of Ubuntu) the time is reported as '0.00 real'.

This potentially matters on high-CPU machines where you're running any sort of routine monitoring that tries to read this information, including the Prometheus host agent in its default configuration. The Prometheus host agent reduces the impact of this slowdown somewhat, but it's still noticeably slower to collect all of the system information if we have the 'cpufreq' collector enabled on these machines. As a result of discovering this, I've now disabled the Prometheus host agent's 'cpufreq' collector on anything with 64 cores or more, and we may reduce that in the future. We don't have a burning need to see CPU frequency information and we would like to avoid slow data collection and occasional apparent impacts on the rest of the system.
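For the Prometheus host agent this is just a command line flag, following its usual collector-disabling convention (I believe the flag name is as shown, but check your version's --help):

# disable only the cpufreq collector, leaving everything else at defaults
node_exporter --no-collector.cpufreq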

(Typical Prometheus configurations magnify the effect of the slowdown because it's common to query ('scrape') the host agent quite often, for example every fifteen seconds. Every time you do this, the host agent re-reads these cpufreq sysfs files and hits this delay.)

PS: I currently have no views on how useful the system's CPU frequencies are as a metric, and how much they might be perturbed by querying them (although the Prometheus host agent deliberately pretends it's running on a single-CPU machine, partly to avoid problems in this area). If you do, you might either universally not collect CPU frequency information or take the time impact to do so even on high-CPU machines.

Sorting out PIDs, Tgids, and tasks on Linux

By: cks

In the beginning, Unix only had processes and processes had process IDs (PIDs), and life was simple. Then people added (kernel-supported) threads, so processes could be multi-threaded. When you add threads, you need to give them some user-visible identifier. There are many options for what this identifier is and how it works (and how threads themselves work inside the kernel). The choice Linux made was that threads were just processes (that shared more than usual with other processes), and so their identifier was a process ID, allocated from the same global space of process IDs as regular independent processes. This has created some ambiguity in what programs and other tools mean by 'process ID' (including for me).

The true name for what used to be a 'process ID', which is to say the PID of the overall entity that is 'a process with all its threads', is a TGID (Thread or Task Group ID). The TGID of a process is the PID of the main thread; a single-threaded program will have a TGID that is the same as its PID. You can see this in the 'Tgid:' and 'Pid:' fields of /proc/<PID>/status. Although some places will talk about 'pids' as separate from 'tids' (eg some parts of proc(5)), the two types are both allocated from the same range of numbers because they're both 'PIDs'. If I just give you a 'PID' with no further detail, there's no way to know if it's a process's PID or a task's PID.
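You can see this on any multi-threaded process by looking at its status file (the PID here is a hypothetical example); 'Threads:' shows how many tasks are in the thread group:

grep -E '^(Tgid|Pid|Threads):' /proc/12345/status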

In every /proc/<PID> directory, there is a 'task' subdirectory; this contains the PIDs of all tasks (threads) that are part of the thread group (ie, have the same TGID). All PIDs have a /proc/<PID> directory, but for convenience things like 'ls /proc' only lists the PIDs of processes (which you can think of as TGIDs). The /proc/<PID> directories for other tasks aren't returned by the kernel when you ask for the directory contents of /proc, although you can use them if you access them directly (and you can also access or discover them through /proc/<PID>/task). I'm not sure what information in the /proc/<PID> directories for tasks is specific to the task itself or is in total across all tasks in the TGID. The proc(5) manual page sometimes talks about processes and sometimes about tasks, but I'm not sure that's comprehensive.

(Much of the time when you're looking at what is actually a TGID, you want the total information across all threads in the TGID. If /proc/<PID> always gave you only task information even for the 'process' PID/TGID, multi-threaded programs could report confusingly low numbers for things like CPU usage unless you went out of your way to sum /proc/<PID>/task/* information yourself.)

Various tools will normally return the PID (TGID) of the overall process, not the PID of a random task in a multi-threaded process. For example 'pidof <thing>' behaves this way. Depending on how the specific process works, this may or may not be the 'main thread' of the program (some multi-threaded programs more or less park their initial thread and do their main work on another one created later), and the program may not even have such a thing (I believe Go programs mostly don't, as they multiplex goroutines on to actual threads as needed).

If a tool or system offers you the choice to work on or with a 'PID' or a 'TGID', you are being given the choice to work with a single thread (task) or the overall process. Which one you want depends on what you're doing, but if you're doing things like asking for task delay information, using the TGID may better correspond to what you expect (since it will be the overall information for the entire process, not information for a specific thread). If a program only talks about PIDs, it's probably going to operate on or give you information about the entire process by default, although if you give it the PID of a task within the process (instead of the PID that is the TGID), you may get things specific to that task.

In a kernel context such as eBPF programs, I think you'll almost always want to track things by PID, not TGID. It is PIDs that do things like experience run queue scheduling latency, make system calls, and incur block IO delays, not TGIDs. However, if you're selecting what to report on, monitor, and so on, you'll most likely want to match on the TGID, not the PID, so that you report on all of the tasks in a multi-threaded program, not just one of them (unless you're specifically looking at tasks/threads, not 'a process').

(I'm writing this down partly to get it clear in my head, since I had some confusion recently when working with eBPF programs.)

Some more notes on Linux's ionice and kernel IO priorities

By: cks

In the long ago past, Linux gained some support for block IO priorities, with some limitations that I noticed the first time I looked into this. These days the Linux kernel has support for more IO scheduling and limitations, for example in cgroups v2 and its IO controller. However ionice is still there and now I want to note some more things, since I just looked at ionice again (for reasons outside the scope of this entry).

First, ionice and the IO priorities it sets are specifically only for read IO and synchronous write IO, per ioprio_set(2) (this is the underlying system call that ionice uses to set priorities). This is reasonable, since IO priorities are attached to processes and asynchronous write IO is generally actually issued by completely different kernel tasks and in situations where the urgency of doing the write is unrelated to the IO priority of the process that originally did the write. This is a somewhat unfortunate limitation since often it's write IO that is the slowest thing and the source of the largest impacts on overall performance.

IO priorities are only effective with some Linux kernel IO schedulers, such as BFQ. For obvious reasons they aren't effective with the 'none' scheduler, which is also the default scheduler for NVMe drives. I'm (still) unable to tell if IO priorities work if you're using software RAID instead of sitting your (supported) filesystem directly on top of a SATA, SAS, or NVMe disk. I believe that IO priorities are unlikely to work with ZFS, partly because ZFS often issues read IOs through its own kernel threads instead of directly from your process and those kernel threads probably aren't trying to copy around IO priorities.
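For reference, checking which scheduler a disk is using and setting IO priorities look like this (the device and PID are illustrative); ionice class 2 is 'best-effort' with levels 0 through 7, and class 3 is 'idle':

cat /sys/block/sda/queue/scheduler
ionice -c 2 -n 7 -p 12345        # lower a running process's IO priority
ionice -c 3 some-bulk-command    # run a command at 'idle' IO priority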

Even if they pass through software RAID, IO priorities apply at the level of disk devices (of course). This means that each side of a software RAID mirror will do IO priorities only 'locally', for IO issued to it, and I don't believe there will be any global priorities for read IO to the overall software RAID mirror. I don't know if this will matter in practice. Since IO priorities only apply to disks, they obviously don't apply (on the NFS client) to NFS read IO. Similarly, IO priorities don't apply to data read from the kernel's buffer/page caches, since this data is already in RAM and doesn't need to be read from disk. This can give you an ionice'd program that is still 'reading' lots of data (and that data will be less likely to be evicted from kernel caches).

Since we mostly use some combination of software RAID, ZFS, and NFS, I don't think ionice and IO priorities are likely to be of much use for us. If we want to limit the impact a program's IO has on the rest of the system, we need different measures.

Restarting systemd-networkd normally clears your 'ip rules' routing policies

By: cks

Here's something that I learned recently: if systemd-networkd restarts, for example because of a package update for it that includes an automatic daemon restart, it will clear your 'ip rules' routing policies (and also I think your routing table, although you may not notice that much). If you've set up policy based routing of your own (or some program has done that as part of its operation), this may produce unpleasant surprises.

Systemd-networkd does this fundamentally because you can set ip routing policies in .network files. When networkd is restarted, one of the things it does is re-set-up whatever routing policies you specified; if you didn't specify any, it clears them. This is a reasonably sensible decision, both to deal with changes from previously specified routing policies and to also give people a way to clean out their experiments and reset to a known good base state. Similar logic applies to routes.

This can be controlled through networkd.conf and its drop-in files, by setting ManageForeignRoutingPolicyRules=no and perhaps ManageForeignRoutes=no. Without testing it through a networkd restart, I believe that the settings I want are:

[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
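I believe the usual place for this is a drop-in under /etc/systemd/networkd.conf.d/ (the file name is up to you), taking effect at the next networkd restart:

mkdir -p /etc/systemd/networkd.conf.d
cat >/etc/systemd/networkd.conf.d/50-foreign.conf <<'EOF'
[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
EOF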

The minor downside of this for me is that certain sorts of route updates will have to be done by hand, instead of by updating .network files and then restarting networkd.

While having an option to do this sort of clearing is sensible, I am dubious about the current default. In practice, coherently specifying routing policies through .network files is so much of a pain that I suspect that few people do it that way; instead I suspect that most people either script it to issue the 'ip rule' commands (as I do) or use software that does it for them (and I know that such software exists). It would be great if networkd could create and manage high level policies for you (such as isolated interfaces), but the current approach is both verbose and limited in what you can do with it.

(As far as I know, networkd can't express rules for networks that can be brought up and torn down, because it's not an event-based system where you can have it react to the appearance of an interface or a configured network. It's possible I'm wrong, but if so it doesn't feel well documented.)

All of this is especially unfortunate on Ubuntu servers, which normally configure their networking through netplan. Netplan will more or less silently use networkd as the backend to actually implement what you wrote in your Netplan configuration, leaving you exposed to this, and on top of that Netplan itself has limitations on what routing policies you can express (pushing you even more towards running 'ip rule' yourself).

Scheduling latency, IO latency, and their role in Linux responsiveness

By: cks

One of the things that I do on my desktops and our servers is collect metrics that I hope will let me assess how responsive our systems are when people are trying to do things on them. For a long time I've been collecting disk IO latency histograms, and recently I've been collecting runqueue latency histograms (using the eBPF exporter and a modified version of libbpf/tools/runqlat.bpf.c). This has caused me to think about the various sorts of latency that affects responsiveness and how I can measure it.

Run queue latency is the latency between when a task becomes able to run (or when it got preempted in the middle of running) and when it does run. This latency is effectively the minimum (lack of) response from the system and is primarily affected by CPU contention, since the major reason tasks have to wait to run is other tasks using the CPU. For obvious reasons, high(er) run queue latency is related to CPU pressure stalls, but a histogram can show you more information than an aggregate number. I expect run queue latency to be what matters most for a lot of programs that mostly talk to things over some network (including talking to other programs on the same machine) and perhaps spend some of their time burning CPU furiously. If your web browser can't get its rendering process running promptly after the HTML comes in, or if it gets preempted while running all of that Javascript, this will show up in run queue latency. The same is true for your window manager, which is probably not doing much IO.

Disk IO latency is the lowest level indicator of things having to wait on IO; it sets a lower bound on how little latency processes doing IO can have (assuming that they do actual disk IO). However, direct disk IO is only one level of the Linux IO system, and the Linux IO system sits underneath filesystems. What actually matters for responsiveness and latency is generally how long user-level filesystem operations take. In an environment with sophisticated, multi-level filesystems that have complex internal behavior (such as ZFS and its ZIL), the actual disk IO time may only be a small portion of the user-level timing, especially for things like fsync().

(Some user-level operations may also not do any disk IO at all before they return from the kernel (for example). A read() might be satisfied from the kernel's caches, and a write() might simply copy the data into the kernel and schedule disk IO later. This is where histograms and related measurements become much more useful than averages.)

Measuring user level filesystem latency can be done through eBPF, to at least some degree; libbpf-tools/vfsstat.bpf.c hooks a number of kernel vfs_* functions in order to just count them, and you could convert this into some sort of histogram. Doing this on a 'per filesystem mount' basis is probably going to be rather harder. On the positive side for us, hooking the vfs_* functions does cover the activity a NFS server does for NFS clients as well as truly local user level activity. Because there are a number of systems where we really do care about the latency that people experience and want to monitor it, I'll probably build some kind of vfs operation latency histogram eBPF exporter program, although most likely only for selected VFS operations (since there are a lot of them).

I think that the straightforward way of measuring user level IO latency (by tracking the time between entering and exiting a top level vfs_* function) will wind up including run queue latency as well. You will get, basically, the time it takes to prepare and submit the IO inside the kernel, the time spent waiting for it, and then after the IO completes the time the task spends waiting inside the kernel before it's able to run.

Because of how Linux defines iowait, the higher your iowait numbers are, the lower the run queue latency portion of the total time will be, because iowait only happens on idle CPUs and idle CPUs are immediately available to run tasks when their IO completes. You may want to look at io pressure stall information for a more accurate track of when things are blocked on IO.

A complication of measuring user level IO latency is that not all user visible IO happens through read() and write(). Some of it happens through accessing mmap()'d objects, and under memory pressure some of it will be in the kernel paging things back in from wherever they wound up. I don't know if there's any particularly easy way to hook into this activity.

Some notes about the Cloudflare eBPF Prometheus exporter for Linux

By: cks

I've been a fan of the Cloudflare eBPF Prometheus exporter for some time, ever since I saw their example of per-disk IO latency histograms. And the general idea is extremely appealing; you can gather a lot of information with eBPF (usually from the kernel), and the ability to turn it into metrics is potentially quite powerful. However, actually using it has always been a bit arcane, especially if you were stepping outside the bounds of Cloudflare's canned examples. So here's some notes on the current version (which is more or less v2.4.0 as I write this), written in part for me in the future when I want to fiddle with eBPF-created metrics again.

If you build the ebpf_exporter yourself, you want to use their provided Makefile rather than try to do it directly. This Makefile will give you the choice to build a 'static' binary or a dynamic one (with 'make build-dynamic'); the static is the default. I put 'static' into quotes because of the glibc NSS problem; if you're on a glibc-using Linux, your static binary will still depend on your version of glibc. However, it will contain a statically linked libbpf, which will make your life easier. Unfortunately, building a static version is impossible on some Linux distributions, such as Fedora, because Fedora just doesn't provide static versions of some required libraries (as far as I can tell, libelf.a). If you have to build a dynamic executable, a normal ebpf_exporter build will depend on the libbpf shared library you can find in libbpf/dest/usr/lib. You'll need to set a LD_LIBRARY_PATH to find this copy of libbpf.so at runtime.

(You can try building with the system libbpf, but it may not be recent enough for ebpf_exporter.)

To get metrics from eBPF with ebpf_exporter, you need an eBPF program that collects the metrics and then a YAML configuration that tells ebpf_exporter how to handle what the eBPF program provides. The original version of ebpf_exporter had you specify eBPF programs in text in your (YAML) configuration file and then compiled them when it started. This approach has fallen out of favour, so now eBPF programs must be pre-compiled to special .o files that are loaded at runtime. I believe these .o files are relatively portable across systems; I've used ones built on Fedora 39 on Ubuntu 22.04. The simplest way to build either a provided example or your own one is to put it in the examples directory and then do 'make <name>.bpf.o'. Running 'make' in the examples directory will build all of the standard examples.

To run an eBPF program or programs, you copy their <name>.bpf.o and <name>.yaml to a configuration directory of your choice, specify this directory in the ebpf_exporter '--config.dir' argument, and then use '--config.names=<name>,<name2>,...' to say what programs to run. The suffix of the YAML configuration file and the eBPF object file are always fixed.
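Concretely, if you've copied a program's .bpf.o and .yaml files into, say, /etc/ebpf_exporter, the invocation looks something like this ('myprog' is a stand-in for whatever program name you actually used):

./ebpf_exporter --config.dir=/etc/ebpf_exporter --config.names=myprog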

The repository has some documentation on the YAML (and eBPF) that you have to write to get metrics. However, it is probably not sufficient to explain how to modify the examples or especially to write new ones. If you're doing this (for example, to revive an old example that was removed when the exporter moved to the current pre-compiled approach), you really want to read over existing examples and then copy their general structure more or less exactly. This is especially important because the main ebpf_exporter contains some special handling for at least histograms that assumes things are being done as in their examples. When reading examples, it helps to know that Cloudflare has a bunch of helpers that are in various header files in the examples directory. You want to use these helpers, not the normal, standard bpf helpers.

(However, although not documented in bpf-helpers(7), '__sync_fetch_and_add()' is a standard eBPF thing. It is not so much documented as mentioned in some kernel BPF documentation on arrays and maps and in bpf(2).)

One source of (e)BPF code to copy from that is generally similar to what you'll write for ebpf_exporter is bcc/libbpf-tools (in the <name>.bpf.c files). An eBPF program like runqlat.bpf.c will need restructuring to be used as an ebpf_exporter program, but it will show you what you can hook into with eBPF and how. Often these examples will be more elaborate than you need for ebpf_exporter, with more options and the ability to narrowly select things; you can take all of that out.

(When setting up things like the number of histogram slots, be careful to copy exactly what the examples do in both your .bpf.c and in your YAML, mysterious '+ 1's and all.)

Where and how Ubuntu kernels get their ZFS modules

By: cks

One of the interesting and convenient things about Ubuntu for people like us is that they provide pre-built and integrated ZFS kernel modules in their mainline kernels. If you want ZFS on your (our) ZFS fileservers, you don't have to add any extra PPA repositories or install any extra kernel module packages; it's just there. However, this leaves us with a little mystery, which is how the ZFS modules actually get there. The reason this is a mystery is that the ZFS modules are not in the Ubuntu kernel source, or at least not in the package source.

(One reason this matters is that you may want to see what patches Ubuntu has applied to their version of ZFS, because Ubuntu periodically backports patches to specific issues from upstream OpenZFS. If you go try to find ZFS patches, ZFS code, or a ZFS changelog in the regular Ubuntu kernel source, you will likely fail, and this will not be what you want.)

Ubuntu kernels are normally signed in order to work with Secure Boot. If you use 'apt source ...' on a signed kernel, what you get is not the kernel source but a 'source' that fetches specific unsigned kernels and does magic to sign them and generate new signed binary packages. To actually get the kernel source, you need to follow the directions in Build Your Own Kernel to get the source of the unsigned kernel package. However, as mentioned this kernel source does not include ZFS.

(You may be tempted to fetch the Git repository following the directions in Obtaining the kernel sources using git, but in my experience this may well leave you hunting around in confusion trying to find the branch that actually corresponds to even the current kernel for an Ubuntu release. Even if you have the Git repository cloned, downloading the source package can be easier.)

How ZFS modules get into the built Ubuntu kernel is that during the package build process, the Ubuntu kernel build downloads or copies a specific zfs-dkms package version and includes it in the tree that kernel modules are built from, which winds up including the built ZFS kernel modules in the binary kernel packages. Exactly what version of zfs-dkms will be included is specified in debian/dkms-versions, although good luck finding an accurate version of that file in the Git repository on any predictable branch or in any predictable location.

(The zfs-dkms package itself is the DKMS version of kernel ZFS modules, which means that it packages the source code of the modules along with directions for how DKMS should (re)build the binary kernel modules from the source.)

This means that if you want to know what specific version of the ZFS code is included in any particular Ubuntu kernel and what changed in it, you need to look at the source package for zfs-dkms, which is called zfs-linux and has its Git repository here. Don't ask me how the branches and tags in the Git repository are managed and how they correspond to released package versions. My current view is that I will be downloading specific zfs-linux source packages as needed (using 'apt source zfs-linux').

The zfs-linux source package is also used to build the zfsutils-linux binary package, which has the user space ZFS tools and libraries. You might ask if there is anything that makes zfsutils-linux versions stay in sync with the zfs-dkms versions included in Ubuntu kernels. The answer, as far as I can see, is no. Ubuntu is free to release new versions of zfsutils-linux and thus zfs-linux without updating the kernel's dkms-versions file to use the matching zfs-dkms version. Sufficiently cautious people may want to specifically install a matching version of zfsutils-linux and then hold the package.
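Holding the package is the standard apt mechanism, for example:

apt-mark hold zfsutils-linux     # 'apt-mark unhold' releases it again later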

I was going to write something about how you get the ZFS source for a particular kernel version, but it turns out that there is no straightforward way. Contrary to what the Ubuntu documentation suggests, if you do 'apt source linux-image-unsigned-$(uname -r)', you don't get the source package for that kernel version, you get the source package for the current version of the 'linux' kernel package, at whatever is the latest released version. Similarly, while you can inspect that source to see what zfs-dkms version it was built with, 'apt get source zfs-dkms' will only give you (easy) access to the current version of the zfs-linux source package. If you ask for an older version, apt will probably tell you it can't find it.

(Presumably Ubuntu has old source packages somewhere, but I don't know where.)

Fixing my problem of a stuck 'dnf updateinfo info' on Fedora Linux

By: cks

I apply Fedora updates only by hand, and as part of this I like to look at what 'dnf updateinfo info' will tell me about why they're being done. For some time, there's been an issue on my work desktop where 'dnf updateinfo info' would report on updates that I'd already applied, often drowning out information about the updates that I hadn't. This was a bit frustrating, because my home Fedora machine didn't do this but I couldn't spot anything obviously wrong (and at various times I'd cleaned all of the DNF caches that I could find).

(Now that I look, it seems I've been having some variant of this problem for a while.)

Recently I took another shot at troubleshooting this. In the system programmer way, I started by locating the Python source code of the DNF updateinfo subcommand and reading it. This showed me a bunch of subcommand specific options that I could have discovered by reading 'dnf updateinfo --help' and led me to find 'dnf updateinfo list', which lists which RPM (or RPMs) a particular update will update. When I used 'dnf updateinfo list' and looked at the list of RPMs, something immediately jumped out at me, and it turned out to be the cause.

My 'dnf updateinfo info' problems were because I had old Fedora 37 'debugsource' RPMs still installed (on a machine now running Fedora 39).

The '-debugsource' and '-debuginfo' RPMs for a given RPM contain symbol information and then source code that is used to allow better debugging (see Debuginfo packages and this change to create debugsource as well). I tend to wind up installing them if I'm trying to debug a crash in some standard packaged program, or sometimes code that heavily uses system libraries. Possibly these packages get automatically cleaned up if you update Fedora releases in one of the officially supported ways, but I do a live upgrade using DNF (following this Fedora documentation). Clearly, when I do such an upgrade, these packages are not removed or updated.

(It's possible that these packages are also not removed or updated within a specific Fedora release when you update their base packages, but since they were installed a long time ago I can't tell at this point.)

With these old debugsource packages hanging around, DNF appears to have reasonably seen more recent versions of them available and duly reported the information on the 'upgrade' (in practice the current version of the package) in 'dnf updateinfo info' when I asked for it. That the packages would not be updated if I did a 'dnf update' was not updateinfo's problem. Removing the debugsource packages eliminated this and now 'dnf updateinfo info' is properly only reporting actual pending updates.

('dnf updateinfo' has various options for what packages to select, but as covered in the updateinfo command documentation apparently they're mostly the same in practice.)

In the future I'm going to have to remember to remove all debugsource and debuginfo packages before upgrading Fedora releases. Possibly I should remove them after I'm done with whatever I installed them for. If I needed them again (in that Fedora release) I'd have to re-fetch them, but that's rare.
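Since dnf accepts globs, cleaning these out before an upgrade can be as simple as the following (checking the list before removing anything, of course):

dnf list installed '*-debuginfo' '*-debugsource'
dnf remove '*-debuginfo' '*-debugsource'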

PS: In reading the documentation, I've discovered that it's really 'dnf updateinfo --info'; updateinfo just accepts 'info' (and 'list') as equivalent to the switches.

(This elaborates on a Fediverse post I made at the time.)

What ZIL metrics are exposed by (Open)ZFS on Linux

By: cks

The ZFS Intent Log (ZIL) is effectively ZFS's version of a filesystem journal, writing out hopefully brief records of filesystem activity to make them durable on disk before their full version is committed to the ZFS pool. What the ZIL is doing and how it's performing can be important for the latency (and thus responsiveness) of various operations on a ZFS filesystem, since operations like fsync() on an important file must wait for the ZIL to write out (commit) their information before they can return from the kernel. On Linux, OpenZFS exposes global information about the ZIL in /proc/spl/kstat/zfs/zil, but this information can be hard to interpret without some knowledge of ZIL internals.

(In OpenZFS 2.2 and later, each dataset also has per-dataset ZIL information in its kstat file, /proc/spl/kstat/zfs/<pool>/objset-0xXXX, for some hexadecimal '0xXXX'. There's no overall per-pool ZIL information the way there is a global one, but for most purposes you can sum up the ZIL information from all of the pool's datasets.)
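For a rough per-pool total on OpenZFS 2.2+, you can sum the per-dataset numbers yourself. This is only a sketch; it assumes a pool called 'tank' and that the per-dataset entries use the same zil_* names and the usual three-column kstat layout of name, type, and value:

grep -h '^zil_' /proc/spl/kstat/zfs/tank/objset-0x* |
  awk '{ sum[$1] += $3 } END { for (s in sum) print s, sum[s] }'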

The basic background here is the flow of activity in the ZIL and also the comments in zil.h about the members of the zil_stats struct.

The (ZIL) data you can find in the "zil" file (and the per-dataset kstats in OpenZFS 2.2 and later) is as follows:

  • zil_commit_count counts how many times a ZIL commit has been requested through things like fsync().
  • zil_commit_writer_count counts how many times the ZIL has actually committed. More than one commit request can be merged into the same ZIL commit, if two people fsync() more or less at the same time.

  • zil_itx_count counts how many intent transactions (itxs) have been written as part of ZIL commits. Each separate operation (such as a write() or a file rename) gets its own separate transaction; these are aggregated together into log write blocks (lwbs) when a ZIL commit happens.

When ZFS needs to record file data into the ZIL, it has three options, which it calls 'indirect', 'copied', and 'needcopy' in ZIL metrics. Large enough amounts of file data are handled with an indirect write, which writes the data to its final location in the regular pool; the ZIL transaction only records its location, hence 'indirect'. In a copied write, the data is directly and immediately put in the ZIL transaction (itx), even before it's part of a ZIL commit; this is done if ZFS knows that the data is being written synchronously and it's not large enough to trigger an indirect write. In a needcopy write, the data just hangs around in RAM as part of ZFS's regular dirty data, and if a ZIL commit happens that needs that data, the process of adding its itx to the log write block will fetch the data from RAM and add it to the itx (or at least the lwb).

There are ZIL metrics about this:

  • zil_itx_indirect_count and zil_itx_indirect_bytes count how many indirect writes have been part of ZIL commits, and the total size of the indirect writes of file data (not of the 'itx' records themselves, per the comments in zil.h).

    Since these are indirect writes, the data written is not part of the ZIL (it's regular data blocks), although it is put on disk as part of a ZIL commit. However, unlike other ZIL data, the data written here would have been written even without a ZIL commit, as part of ZFS's regular transaction group commit process. A ZIL commit merely writes it out earlier than it otherwise would have been.

  • zil_itx_copied_count and zil_itx_copied_bytes count how many 'copied' writes have been part of ZIL commits and the total size of the file data written (and thus committed) this way.

  • zil_itx_needcopy_count and zil_itx_needcopy_bytes count how many 'needcopy' writes have been part of ZIL commits and the total size of the file data written (and thus committed) this way.

A regular system using ZFS may have little or no 'copied' activity. Our NFS servers all have significant amounts of it, presumably because some NFS data writes are done synchronously and so this trickles through to the ZFS stats.

In a given pool, the ZIL can potentially be written to either the main pool's disks or to a separate log device (a slog, which can also be mirrored). The ZIL metrics have a collection of zil_itx_metaslab_* metrics about data actually written to the ZIL in either the main pool ('normal' metrics) or to a slog (the 'slog' metrics).

  • zil_itx_metaslab_normal_count counts how many ZIL log write blocks (not ZIL records, itxs) have been committed to the ZIL in the main pool. There's a corresponding 'slog' version of this and all further zil_itx_metaslab metrics, with the same meaning.

  • zil_itx_metaslab_normal_bytes counts how many bytes have been 'used' in ZIL log write blocks (for ZIL commits in the main pool). This is a rough representation of how much space the ZIL log actually needed, but it doesn't necessarily represent either the actual IO performed or the space allocated for ZIL commits.

    As I understand things, this size includes the size of the intent transaction records themselves and also the size of the associated data for 'copied' and 'needcopy' data writes (because these are written into the ZIL as part of ZIL commits, and so use space in log write blocks). It doesn't include the data written directly to the pool as 'indirect' data writes.

If you don't use a slog in any of your pools, the 'slog' versions of these metrics will all be zero. I think that if you have only slogs, the 'normal' versions of these metrics will all be zero.

In ZFS 2.2 and later, there are two additional statistics for both normal and slog ZIL commits:

  • zil_itx_metaslab_normal_write counts how many bytes have actually been written in ZIL log write blocks. My understanding is that this includes padding and unused space at the end of a log write block that can't fit another record.

  • zil_itx_metaslab_normal_alloc counts how many bytes of space have been 'allocated' for ZIL log write blocks, including any rounding up to block sizes, alignments, and so on. I think this may also be the logical size before any compression done as part of IO, although I'm not sure if ZIL log write blocks are compressed.

You can see some additional commentary on these new stats (and the code) in the pull request and the commit itself.

PS: OpenZFS 2.2 and later has a currently undocumented 'zilstat' command, and its 'zilstat -v' output may provide some guidance on what ratios of these metrics the ZFS developers consider interesting. In its current state it will only work on 2.2 and later because it requires the two new stats listed above.

Sidebar: Some typical numbers

Here is the "zil" file from my office desktop, which has been up for long enough to make it interesting:

zil_commit_count                4    13840
zil_commit_writer_count         4    13836
zil_itx_count                   4    252953
zil_itx_indirect_count          4    27663
zil_itx_indirect_bytes          4    2788726148
zil_itx_copied_count            4    0
zil_itx_copied_bytes            4    0
zil_itx_needcopy_count          4    174881
zil_itx_needcopy_bytes          4    471605248
zil_itx_metaslab_normal_count   4    15247
zil_itx_metaslab_normal_bytes   4    517022712
zil_itx_metaslab_normal_write   4    555958272
zil_itx_metaslab_normal_alloc   4    798543872

With these numbers we can see interesting things, such as that the average number of ZIL transactions per commit is about 18 and that my machine has never done any synchronous data writes.

Here's an excerpt from one of our Ubuntu 22.04 ZFS fileservers:

zil_commit_count                4    155712298
zil_commit_writer_count         4    155500611
zil_itx_count                   4    200060221
zil_itx_indirect_count          4    60935526
zil_itx_indirect_bytes          4    7715170189188
zil_itx_copied_count            4    29870506
zil_itx_copied_bytes            4    74586588451
zil_itx_needcopy_count          4    1046737
zil_itx_needcopy_bytes          4    9042272696
zil_itx_metaslab_normal_count   4    126916250
zil_itx_metaslab_normal_bytes   4    136540509568

Here we can see the drastic impact of NFS synchronous writes (the significant 'copied' numbers), and also of large NFS writes in general (the high 'indirect' numbers). This machine has written many times more data as 'indirect' writes than it has written to the actual ZIL.

NetworkManager won't share network interfaces, which is a problem

By: cks

Today I upgraded my home desktop to Fedora 39. It didn't entirely go well; specifically, my DSL connection broke because Fedora stopped packaging some scripts with rp-pppoe, and Fedora's old ifup, which my very old-fashioned setup uses, still requires those scripts. After I got back on the Internet, I decided to try an idea I'd toyed with, namely using NetworkManager to handle (only) my DSL link. Unfortunately this did not go well:

audit: op="connection-activate" uuid="[...]" name="[...]" pid=458524 uid=0 result="fail" reason="Connection '[...]' is not available on device em0 because device is strictly unmanaged"

The reason that em0 is 'unmanaged' by NetworkManager is that it's managed by systemd-networkd, which I like much better. Well, also I specifically told NetworkManager not to touch it by setting it as 'unmanaged' instead of 'managed'.
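
For what it's worth, one common way to mark an interface as unmanaged is a NetworkManager configuration drop-in. This is only a sketch of the general idea (the file name is arbitrary), not necessarily exactly how my machine is set up:

# /etc/NetworkManager/conf.d/90-unmanaged-em0.conf
[keyfile]
unmanaged-devices=interface-name:em0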

Although I haven't tested it, I suspect that NetworkManager applies this restriction to all VPNs and other layered forms of networking, such that you can only run a NetworkManager-managed VPN over a network interface that NetworkManager is controlling. I find this quite unfortunate. There is nothing that NetworkManager needs to change on the underlying Ethernet link to run PPPoE or a VPN over it; the network is just a transport (a low-level transport in the case of PPPoE).

I don't know if it's theoretically possible to configure NetworkManager so that an interface is 'managed' but NetworkManager doesn't touch it at all, so that systemd-networkd and other things could continue to use em0 while NetworkManager was willing to run PPPoE on top of it. Even if it's possible in theory, I don't have much confidence that it would be problem-free in practice, either now or in the future, because fundamentally I'd be lying to NetworkManager and networkd. If NetworkManager really had an 'I will use this interface but not change its configuration' category, it would have a third option besides 'managed' and '(strictly) unmanaged'.

(My current solution is a hacked together script to start pppd and pppoe with magic options researched through extrace, and a systemd service that runs that script. I have assorted questions about how this is going to interact with various things, but someday I will get answers, or perhaps unpleasant surprises.)
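
For illustration, the general shape of such a setup looks something like the following. This is a sketch only, not my actual script or service; the pppd and pppoe options, the login, and the file paths are stand-ins and will differ per ISP and per machine:

# /etc/systemd/system/dsl-pppoe.service (a sketch)
[Unit]
Description=PPPoE DSL link over em0
After=network.target

[Service]
Type=simple
# pppd drives the userspace pppoe program over a pty; the credentials for
# 'myisplogin' are expected in /etc/ppp/pap-secrets or chap-secrets.
ExecStart=/usr/sbin/pppd pty "/usr/sbin/pppoe -I em0 -T 80 -m 1412" \
    user "myisplogin" noipdefault defaultroute usepeerdns \
    hide-password nodetach persist maxfail 0
Restart=on-failure

[Install]
WantedBy=multi-user.target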

PS: Where this may be a special problem someday is if I want to run a VPN over my DSL link. I can more or less handle running PPPoE by hand, but the last time I looked at a by-hand OpenVPN setup I rapidly dropped the idea. NetworkManager is or would be quite handy for this sort of 'not always there and complex' networking, but it apparently needs to own the entire stack down to the Ethernet device.

(To run a NetworkManager VPN over 'ppp0', I would have to have NetworkManager manage it, which would presumably require I have NetworkManager handle the PPPoE DSL, which requires NetworkManager not considering em0 to be unmanaged. It's NetworkManager all the way down.)

Digitally signing documents in LibreOffice

In a previous post we looked at how to digitally sign PDF documents on Linux (and Windows); in this post we will look at how to digitally sign LibreOffice documents with a SIGEN-CA certificate.

Preparation

First of all we need a SIGEN-CA digital certificate, the root certificate of the Slovenian public administration and the intermediate certificate of the SIGEN-CA issuer, and these certificates have to be imported into the NSS system store on Linux. On Windows we can use Thunderbird's certificate store instead.
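
As a quick sanity check that the certificates are where LibreOffice will look for them, we can list the per-user NSS store from a terminal; this is a sketch and assumes the usual database location:

certutil -L -d sql:$HOME/.pki/nssdb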

LibreOffice settings

Now start LibreOffice, choose Options from the Tools menu, and then select Security in the LibreOffice section.

Setting up the NSS database in LibreOffice.

Then, under Certificate Path, click the Certificate button and point LibreOffice at the NSS database.

LibreOffice will now need to be restarted.

Digital signing in LibreOffice

Now open the document you want to digitally sign in LibreOffice, choose File - Digital Signatures - Digital Signatures from the menu, and click the Sign Document button.

Digitally signing documents in LibreOffice.

In the next step, select your digital certificate and sign the document.

LibreOffice can also digitally sign PDF documents. This is done via the File menu, where we choose Digital Signatures and then Sign Existing PDF. LibreOffice Draw opens, and the existing PDF can be signed there.

Digital signatures can be checked via the File menu by choosing Digital Signatures - Digital Signatures.

Digital signatures of a document in LibreOffice.

LibreOffice shows whether the documents are validly signed.

A validly signed document in LibreOffice.

If the document has been modified after signing, LibreOffice will likewise show that immediately.

A digital document that was modified after signing.

Digitally signing PDF documents on Linux (and Windows)

One of the main advantages of digitally signing PDFs is that it reduces the need to print, mail and store paper documents. In Slovenia, PDF documents can be digitally signed with SIGEN-CA digital certificates, which are used for public administration e-services, filing income tax returns, accessing geodetic and cadastral data, and so on. Qualified SIGEN-CA digital certificates are issued by the state to natural persons (over 15 years old, with a Slovenian tax number) and to business entities. In this post we will look at how to sign PDF documents with SIGEN-CA certificates on Linux (in a graphical environment, not in the terminal) using Okular, a free and open source application, and at the end also at how to digitally sign PDF documents with Okular on Windows.

Digitally signing PDFs on Linux

The first step is to import our SIGEN-CA digital certificate, along with the root certificate of the Slovenian public administration and the intermediate certificate of the SIGEN-CA issuer, into the system store on Linux. This is the NSS (Network Security Services) store. NSS is a set of cryptographic libraries that is installed by default on Ubuntu Linux, but some additional tools are needed to make importing easier.

Installing the digital certificates into the NSS system store

First install pki-tools. Open a terminal and run: sudo apt install pki-tools

Then download the SI-TRUST Root certificate and the SIGEN-CA G2 intermediate certificate from the si-trust.gov.si website to your computer.

Both certificates (si-trust-root.crt and sigen-ca-g2.xcert.crt) then need to be converted from DER format to PEM:

openssl x509 -inform der -in si-trust-root.crt -out si-trust-root.pem

openssl x509 -inform der -in sigen-ca-g2.xcert.crt -out sigen-ca-g2.xcert.pem
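
The import commands below assume that the per-user NSS database at $HOME/.pki/nssdb already exists. If it does not, it can (I believe) be created first; a minimal sketch:

# Create an empty per-user NSS database (only needed if it doesn't exist yet).
mkdir -p $HOME/.pki/nssdb
certutil -N -d sql:$HOME/.pki/nssdb --empty-password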

Now import these two certificates (in PEM format) into the NSS system store (the commands below assume you saved the certificates directly in your home directory):

cd $HOME/.pki/nssdb
PKICertImport -d . -n "SI-TRUST Root - Republika Slovenija" -t "CT,C,C" -a -i $HOME/si-trust-root.pem -u L
PKICertImport -d . -n "SIGEN-CA G2 - Republika Slovenija" -t "CT,C,C" -a -i $HOME/sigen-ca-g2.xcert.pem -u L

Finally, import our personal SIGEN-CA certificate as well (this too should be saved directly in your home directory): pk12util -d ~/.pki/nssdb -i $HOME/Matej_Kovacic_SIGEN-CA.p12

We can now review the imported certificates: certutil -L -d sql:.

The output looks like this:

Certificate Nickname                                         Trust Attributes
                                                             SSL,S/MIME,JAR/XPI

Matej Kovačič’s Republika Slovenija ID                   u,u,u
SIGEN-CA G2 - Republika Slovenija                            CT,C,C
SI-TRUST Root - Republika Slovenija                          CT,C,C

The details of an individual certificate can be inspected with the following command, passing the so-called nickname (the name of the certificate we want to look at) as a parameter:

certutil -L -d sql:. -a -n "Matej Kovačič’s Republika Slovenija ID" | openssl x509 -text -noout

Important: the certutil command must be run in the $HOME/.pki/nssdb directory.

Two more useful commands. To rename a digital certificate (or rather, to rename its nickname), use:

certutil --rename -d sql:. -n "Matej Kovačič’s Republika Slovenija ID" --new-n "Matej Kovacic (staro potrdilo)"

The -n switch specifies the current nickname (as shown in the output of certutil -L -d sql:.), and the --new-n switch sets the new nickname.

And deleting a certificate:

certutil -D -d sql:. -n "Matej Kovacic (staro potrdilo)"

Installing Okular on Linux

Now we can install Okular. It can be installed via Snap, but in my case that version of the application couldn't access the NSS system certificate store. Another option is to install it via APT, where digital signing works without problems, but if you want the latest version of the application, it is best to install it via Flatpak.

This can be done through the Ubuntu Software (Programi Ubuntu) application.

Ubuntu Software (Programi Ubuntu).

If this application and the Flatpak plugin are not installed, they can be installed with:

sudo apt install flatpak
flatpak remote-add --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo
sudo apt install gnome-software-plugin-flatpak gnome-software

Installation is then simple: start Ubuntu Software, choose Flathub as the installation source and install Okular.

Installing Okular via Ubuntu Software.
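
Alternatively, Okular can also be installed from Flathub on the command line; a sketch, assuming org.kde.okular as the application ID:

flatpak install flathub org.kde.okular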

Digital signing on Linux

When Okular starts, go to Settings - Configure Backends and, on the PDF tab, check that the path to the NSS backend is correct (in my case it is /home/matej/.pki/nssdb) and that your digital certificate is visible.

Backend settings in Okular.

Now open the PDF document you want to digitally sign in Okular and choose Tools - Digitally Sign from the menu.

Digital signing in Okular.

First draw a rectangle on the document where the digital signature will be displayed, and in the next step choose which digital certificate to sign the document with (if you have more than one). You can also add a so-called visual signature, i.e. use your handwritten signature as the background of the digital signature badge. This does not affect the validity of the digital signature, but the feature can come in handy when the document is printed on paper.

A signed PDF document in Okular.

Note: the image does not show my actual signature; it was artificially created with a signature generator.

With this, the PDF document is digitally signed with SIGEN-CA.

Digital signing on Windows

Okular can also be installed on Windows; the digital certificate can be used there if it has been installed into Firefox's certificate store.

So first install Firefox and import the SIGEN-CA digital certificates into it, then install Okular from the Microsoft Store; it is the Store build of Okular that has support for digitally signing PDF documents.

Start Okular and, in Settings - Configure Backends, check on the PDF tab whether the path to the NSS backend is correct, i.e. whether your digital certificate appears under Available Certificates.

Backend settings in Okular.

If it isn't there, you need to set the path to the certificate database in your Firefox profile.

To find the Firefox settings folder, click the hamburger (three-line) icon in Firefox, choose Help and then More Troubleshooting Information. There you will see the location of the Profile Folder; copy that location into Okular's backend settings.

Location of the profile folder in Firefox.

Signing then works the same way as on Linux.

Digitally signing a PDF on Windows.

Validity of the digital signature

Under the eIDAS (electronic identification and trust services) standard there are three levels of signatures: the 'simple' electronic signature, the advanced electronic signature and the qualified electronic signature. According to the standard, only the third is equivalent to a handwritten signature.

A digital signature created with the procedure described here counts as an advanced electronic signature supported by a qualified certificate.

We can also verify this via the EU portal for validating electronic signatures, which recognizes this signature as an Advanced electronic Signature supported by a Qualified Certificate (AdES/QC).

Verifying the signature on the EC website.

If we wanted the highest level of signature, we would have to use a SIGEN-CA certificate stored on a so-called smart USB key. Even so, the signing method described here suffices in practice for the vast majority of official tasks.

Incidentally, a document signed on Linux can also be checked in the Adobe PDF reader on Windows, where we can likewise see that it is validly digitally signed.

Verifying the signature in the Adobe PDF reader.

As we have seen, digitally signing PDF files on Ubuntu Linux is really simple; only importing the digital certificates is somewhat clumsy. Fortunately, that step only has to be done once (or occasionally, when we replace our digital certificate), and I hope that in the near future this part will be simplified as well, or rather that the whole thing will be much more tightly integrated into the operating system.

The LibreOffice that ships with Ubuntu can save documents, spreadsheets and presentations to PDF with a single click, and those PDF documents can now be easily digitally signed with Okular.
