
The Myth and Reality of Mac OS X Snow Leopard

By: Nick Heer
28 March 2025 at 03:56

Jeff Johnson in November 2023:

When people wistfully proclaim that they wish for the next major macOS version to be a “Snow Leopard update”, they’re wishing for the wrong thing. No major update will solve Apple’s quality issues. Major updates are the cause of quality issues. The solution would be a long string of minor bug fix updates. What people should be wishing for are the two years of stability and bug fixes that occurred after the release of Snow Leopard. But I fear we’ll never see that again with Tim Cook in charge.

I read an article today from yet another person pining for a mythical Snow Leopard-style MacOS release. While I sympathize with the intent of their argument, the memory is largely fictional; as Johnson writes, it took about two years of Snow Leopard’s release cycle for it to become the release we want to remember:

It’s an iron law of software development that major updates always introduce more bugs than they fix. Mac OS X 10.6.0 was no exception, of course. The next major update, Mac OS X 10.7.0, was no exception either, and it was much buggier than 10.6.8 v1.1, even though both versions were released in the same week.

What I desperately miss is that period of stability after a few rounds of bug fixes. As I have previously complained about, my iMac cannot run any version of MacOS newer than Ventura, released in 2022. It is still getting bug and security fixes. In theory, this should mean I am running a solid operating system despite missing some features.

It is not. Apple’s engineering efforts quickly moved toward shipping MacOS Sonoma in 2023, and then Sequoia last year. It seems as though any bug fixes were folded into these new major versions and, even worse, new bugs were introduced late in the Ventura release cycle that have no hope of being fixed. My iMac seizes up when I try to view HDR media; because this Extended Dynamic Range is an undocumented enhancement, there is no preference to turn it off. Recent Safari releases have contained several bugs related to page rendering and scrolling. Weather sometimes does not display for my current location.

Ventura was by no means bug-free when it shipped, and I am disappointed that even its final form remains a mess. My MacBook Pro is running the latest public release of MacOS Sequoia and it, too, has new problems late in its development cycle; I reported a Safari page crashing bug earlier this week. These are on top of existing problems, like how there is no way to change the size of search results’ thumbnails in Photos.

Alas, I am not expecting many bugs to be fixed. It is, after all, nearly April, which means there are just two months until WWDC and the first semi-public builds of another new MacOS version. I am hesitant every year to upgrade. But it does not appear much effort is being put into the maintenance of any previous version. We all get the choice of many familiar bugs, or a blend of hopefully fewer old bugs plus some new ones.

⌥ Permalink

‘Adolescence’

By: Nick Heer
22 March 2025 at 21:54

Lucy Mangan, the Guardian:

There have been a few contenders for the crown [of “televisual perfection”] over the years, but none has come as close as Jack Thorne’s and Stephen Graham’s astonishing four-part series Adolescence, whose technical accomplishments – each episode is done in a single take – are matched by an array of award-worthy performances and a script that manages to be intensely naturalistic and hugely evocative at the same time. Adolescence is a deeply moving, deeply harrowing experience.

I did not intend to watch the whole four-part series today, maybe just the first and second episodes. But I could not turn away. The effectively unanimous praise for this is absolutely earned.

The oner format sounds like it could be a gimmick, the kind of thing that screams a bit too loud and overshadows what should be a tender and difficult narrative. Nothing could be further from the truth. The technical decisions force specific storytelling decisions, in the same way that a more maximalist production in the style of, say, David Fincher does. Fincher would shoot fifty versions of everything and then assemble the best performances into a tight machine — and I love that stuff. But I love this, too, little errors and all. It is better for these choices. The dialogue cannot get just a little bit tighter in the edit, or whatever. It is all just there.

I know nothing about reviewing television or movies but, so far as I can tell, everyone involved has pulled this off spectacularly. You can quibble with things like the rainbow party-like explanation of different emoji — something for which I cannot find any evidence — that has now become its own moral panic. I get that. Even so, this is one of the greatest storytelling achievements I have seen in years.

Update: Watch it on Netflix. See? The ability to edit means I can get away with not fully thinking this post through.

⌥ Permalink

How we handle debconf questions during our Ubuntu installs

By: cks
26 March 2025 at 02:37

In a comment on How we automate installing extra packages during Ubuntu installs, David Magda asked how we dealt with the things that need debconf answers. This is a good question, and we have two approaches that we use in combination. First, we have a prepared file of debconf selections for each Ubuntu version, and we feed this into debconf-set-selections before we start installing packages. However, in practice this file doesn't have much in it and we rarely remember to update it (and as a result, a bunch of it is somewhat obsolete). We generally only update this file if we discover debconf selections where the default doesn't work in our environment.

Second, we run apt-get with a bunch of environment variables set to muzzle debconf:

export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=<null address>@<our domain>
export DEBIAN_FRONTEND=noninteractive

Traditionally I've considered muzzling debconf this way to be too dangerous to do during package updates or installing packages by hand. However, I consider it not so much safe as safe enough to do this during our standard install process. To put it one way, we're not starting out with a working system and potentially breaking it by letting some new or updated package pick bad defaults. Instead we're starting with a non-working system and hopefully ending up with a working one. If some package picks bad defaults and we wind up with problems, that's not much worse than we started out with and we'll fix it by updating our file of debconf selections and then redoing the install.

Also, in practice all of this gets worked out during our initial test installs of any new Ubuntu version (done on test virtual machines these days). By the time we're ready to start installing real servers with a new Ubuntu version, we've gone through most of the discovery process for debconf questions. Then the only time we're going to have problems during future system installs is if a package update either changes the default answer for a current question (to a bad one) or adds a new question with a bad default. As far as I can remember, we haven't had either happen.

(Some of our servers need additional packages installed, which we do by hand (as mentioned), and sometimes the packages will insist on stopping to ask us questions or give us warnings. This is annoying, but so far not annoying enough to fix it by augmenting our standard debconf selections to deal with it.)

The pragmatics of doing fsync() after a re-open() of journals and logs

By: cks
25 March 2025 at 02:02

Recently I read Rob Norris' fsync() after open() is an elaborate no-op (via). This is a contrarian reaction to the CouchDB article that prompted my entry Always sync your log or journal files when you open them. At one level I can't disagree with Norris and the article; POSIX is indeed very limited about the guarantees it provides for a successful fsync() in a way that frustrates the 'fsync after open' case.

At another level, I disagree with the article. As Norris notes, there are systems that go beyond the minimum POSIX guarantees, and also the fsync() after open() approach is almost the best you can do and is much faster than your other (portable) option, which is to call sync() (on Linux you could call syncfs() instead). Under POSIX, sync() is allowed to return before the IO is complete, but at least sync() is supposed to definitely trigger flushing any unwritten data to disk, which is more than POSIX fsync() provides you (as Norris notes, POSIX permits fsync() to apply only to data written to that file descriptor, not all unwritten data for the underlying file). As far as fsync() goes, in practice I believe that almost all Unixes and Unix filesystems are going to be more generous than POSIX requires and fsync() all dirty data for a file, not just data written through your file descriptor.
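As a concrete illustration, here is a minimal Python sketch of the 'fsync() after open()' pattern for a journal or log file. It leans on the common, more-than-POSIX behavior discussed above, where fsync() flushes all dirty data for the file rather than just data written through this descriptor; the file path and function name are made up for the example.

```python
import os
import tempfile

def open_journal_synced(path):
    # Open for appending, then fsync() immediately. On filesystems that
    # flush all dirty data for the file (common in practice, though more
    # than POSIX guarantees), this pushes out anything a previous writer
    # left unflushed before we add our own entries.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    os.fsync(fd)
    return fd

path = os.path.join(tempfile.mkdtemp(), "journal.log")
fd = open_journal_synced(path)
os.write(fd, b"entry 1\n")
os.fsync(fd)   # flush our own write before relying on it
os.close(fd)
```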

Actually being as restrictive as POSIX allows would likely be a problem for Unix kernels. The kernel wants to index the filesystem cache by inode, including unwritten data. This makes it natural for fsync() to flush all unwritten data associated with the file regardless of who wrote it, because then the kernel needs no extra data to be attached to dirty buffers. If you wanted to be able to flush only dirty data associated with a file object or file descriptor, you'd need to either add metadata associated with dirty buffers or index the filesystem cache differently (which is clearly less natural and probably less efficient).

Adding metadata has an assortment of challenges and overheads. If you add it to dirty buffers themselves, you have to worry about clearing this metadata when a file descriptor is closed or a file object is deallocated (including when the process exits). If you instead attach metadata about dirty buffers to file descriptors or file objects, there's a variety of situations where other IO involving the buffer requires updating your metadata, including the kernel writing out dirty buffers on its own without a fsync() or a sync() and then perhaps deallocating the now clean buffer to free up memory.

Being as restrictive as POSIX allows probably also has low benefits in practice. To be a clear benefit, you would need to have multiple things writing significant amounts of data to the same file and fsync()'ing their data separately; this is when the file descriptor (or file object) specific fsync() saves you a bunch of data write traffic over the 'fsync() the entire file' approach. But as far as I know, this is a pretty unusual IO pattern. Much of the time, the thing fsync()'ing the file is the only writer, either because it's the only thing dealing with the file or because updates to the file are being coordinated through it so that processes don't step over each other.

PS: If you wanted to implement this, the simplest option would be to store the file descriptor and PID (as numbers) as additional metadata with each buffer. When the system fsync()'d a file, it could check the current file descriptor number and PID against the saved ones and only flush buffers where they matched, or where these values had been cleared to signal an uncertain owner. This would flush more than strictly necessary if the file descriptor number (or the process ID) had been reused or buffers had been touched in some way that caused the kernel to clear the metadata, but doing more work than POSIX strictly requires is relatively harmless.

Sidebar: fsync() and mmap() in POSIX

Under a strict reading of the POSIX fsync() specification, it's not entirely clear how you're properly supposed to fsync() data written through mmap() mappings. If 'all data for the open file descriptor' includes pages touched through mmap(), then you have to keep the file descriptor you used for mmap() open, despite POSIX mmap() otherwise implicitly allowing you to close it; my view is that this is at least surprising. If 'all data' only includes data directly written through the file descriptor with system calls, then there's no way to trigger a fsync() for mmap()'d data.
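In practice, the portable way to flush data written through a mmap() mapping is msync(), which sidesteps this ambiguity entirely; Python exposes it as mmap.flush(). A small sketch (the file path is invented for the demonstration):

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mapped")
with open(path, "wb") as f:
    f.write(b"\0" * 5)          # mmap() needs a non-empty file to map

fd = os.open(path, os.O_RDWR)
m = mmap.mmap(fd, 5)
os.close(fd)                    # POSIX lets us close the fd after mmap()
m[:] = b"hello"                 # write through the mapping
m.flush()                       # msync(): flushes the pages, no fsync() needed
m.close()
```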

The obviousness of indexing the Unix filesystem buffer cache by inodes

By: cks
24 March 2025 at 02:34

Like most operating systems, Unix has an in-memory cache of filesystem data. Originally this was a fixed size buffer cache that was maintained separately from the memory used by processes, but later it became a unified cache that was used for both memory mappings established through mmap() and regular read() and write() IO (for good reasons). Whenever you have a cache, one of the things you need to decide is how the cache is indexed. The more or less required answer for Unix is that the filesystem cache is indexed by inode (and thus filesystem, as inodes are almost always attached to some filesystem).

Unix has three levels of indirection for straightforward IO. Processes open and deal with file descriptors, which refer to underlying file objects, which in turn refer to an inode. There are various situations, such as calling dup(), where you will wind up with two file descriptors that refer to the same underlying file object. Some state is specific to file descriptors, but other state is held at the level of file objects, and some state has to be held at the inode level, such as the last modification time of the inode. For mmap()'d files, we have a 'virtual memory area', which is a separate level of indirection that is on top of the inode.
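The sharing at each level can be observed from user space. In this Python sketch, two descriptors produced by dup() share one file object and hence one file offset, while an independent open() of the same file gets its own:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "f")
with open(path, "wb") as f:
    f.write(b"abcdef")

# Two descriptors from dup() refer to the same file object,
# and thus share one file offset.
fd1 = os.open(path, os.O_RDONLY)
fd2 = os.dup(fd1)
os.read(fd1, 3)                 # advances the shared offset past b"abc"
shared = os.read(fd2, 3)        # continues where fd1 left off

# An independent open() creates a separate file object with its own offset.
fd3 = os.open(path, os.O_RDONLY)
independent = os.read(fd3, 3)   # starts from the beginning

for fd in (fd1, fd2, fd3):
    os.close(fd)
```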

The biggest reason to index the filesystem cache by inode instead of file descriptor or file object is coherence. If two processes separately open the same file, getting two separate file objects and two separate file descriptors, and then one process writes to the file while the other reads from it, we want the reading process to see the data that the writing process has written. The only thing the two processes naturally share is the inode of the file, so indexing the filesystem cache by inode is the easiest way to provide coherence. If the kernel indexed by file object or file descriptor, it would have to do extra work to propagate updates through all of the indirection. This includes the 'updates' of reading data off disk; if you index by inode, everyone reading from the file automatically sees fetched data with no extra work.

(Generally we also want this coherence for two processes that both mmap() the file, and for one process that mmap()s the file while another process read()s or write()s to it. Again this is easiest to achieve if everything is indexed by the inode.)
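This coherence is easy to see in practice. In the following Python sketch, a writer and a reader open the same file independently, so they share nothing but the inode, and the reader still sees unflushed data immediately through the inode-indexed cache:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shared")
open(path, "wb").close()

# Writer and reader have separate file descriptors and file objects;
# only the inode (and the inode-indexed cache) is shared between them.
wfd = os.open(path, os.O_WRONLY)
rfd = os.open(path, os.O_RDONLY)

os.write(wfd, b"hello")        # no fsync: data sits in the page cache
seen = os.read(rfd, 5)         # reader sees it anyway, via the shared cache

os.close(wfd)
os.close(rfd)
```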

Another reason to index by inode is how easy it is to handle various situations in the filesystem cache when things are closed or removed, especially when the filesystem cache holds writes that are being buffered in memory before being flushed to disk. Processes frequently close file descriptors and drop file objects, including by exiting, but any buffered writes still need to be findable so they can be flushed to disk before, say, the filesystem itself is unmounted. Similarly, if an inode is deleted we don't want to flush its pending buffered writes to disk (and certainly we can't allocate blocks for them, since there's nothing to own those blocks any more), and we want to discard any clean buffers associated with it to free up memory. If you index the cache by inode, all you need is for filesystems to be able to find all their inodes; everything else more or less falls out naturally.

This doesn't absolutely require a Unix to index its filesystem buffer caches by inode. But I think it's clearly easiest to index the filesystem cache by inode, instead of the other available references. The inode is the common point for all IO involving a file (partly because it's what filesystems deal with), which makes it the easiest index; everyone has an inode reference and in a properly implemented Unix, everyone is using the same inode reference.

(In fact all sorts of fun tend to happen in Unixes if they have a filesystem that gives out different in-kernel inodes that all refer to the same on-disk filesystem object. Usually this happens by accident or filesystem bugs.)

How we automate installing extra packages during Ubuntu installs

By: cks
23 March 2025 at 02:52

We have a local system for installing Ubuntu machines, and one of the important things it does is install various additional Ubuntu packages that we want as part of our standard installs. These days we have two sorts of standard installs, a 'base' set of packages that everything gets and a broader set of packages that login servers and compute servers get (to make them more useful and usable by people). Specialized machines need additional packages, and while we can automate installation of those too, they're generally a small enough set of packages that we document them in our install instructions for each machine and install them by hand.

There are probably clever ways to do bulk installs of Ubuntu packages, but if so, we don't use them. Our approach is instead a brute force one. We have files that contain lists of packages, such as a 'base' file, and these files just contain a list of packages with optional comments:

# Partial example of Basic package set
amanda-client
curl
jq
[...]

# decodes kernel MCE/machine check events
rasdaemon

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev devscripts automake 

(Like all of the rest of our configuration information, these package set files live in our central administrative filesystem. You could distribute them in some other way, for example fetching them with rsync or even HTTP.)

To install these packages, we use grep to extract the actual packages into a big list and feed the big list to apt-get. This is more or less:

pkgs=$(cat $PKGDIR/$s | grep -v '^#' | grep -v '^[ \t]*$')
apt-get -qq -y install $pkgs

(This will abort if any of the packages we list aren't available. We consider this a feature, because it means we have an error in the list of packages.)
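For illustration, the same comment-and-blank-line filtering can be sketched in Python (read_package_set is a made-up name; our real version is the shell pipeline above):

```python
def read_package_set(text):
    # Keep lines that are neither comments nor blank, splitting
    # space-separated names so multi-package lines work too.
    pkgs = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            pkgs.extend(line.split())
    return pkgs

sample = """\
# Partial example of Basic package set
amanda-client
curl

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev
"""
```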

A more organized and minimal approach might be to add the '--no-install-recommends' option, but we started without it and we don't particularly want to go back to find which recommended packages we'd have to explicitly add to our package lists.

At least some of the 'base' package installs could be done during the initial system install process from our customized Ubuntu server ISO image, since you can specify additional packages to install. However, doing package installs that way would create a series of issues in practice. We'd probably need to track more carefully which package came from which Ubuntu collection (since only some of them are enabled during the server install process), it would be harder to update the lists, and the tools for handling the whole process would be a lot more limited, as would our ability to troubleshoot any problems.

Doing this additional package install in our 'postinstall' process means that we're doing it in a full Unix environment where we have all of the standard Unix tools, and we can easily look around the system if and when there's a problem. Generally we've found that the more of our installs we can defer to once the system is running normally, the better.

(Also, the less the Ubuntu installer does, the faster it finishes and the sooner we can get back to our desks.)

(This entry was inspired by parts of a blog post I read recently and reflecting about how we've made setting up new versions of machines pretty easy, assuming our core infrastructure is there.)

The mystery (to me) of tiny font sizes in KDE programs I run

By: cks
22 March 2025 at 03:24

Over on the Fediverse I tried a KDE program and ran into a common issue for me:

It has been '0' days since a KDE app started up with too-small fonts on my bespoke fvwm based desktop, and had no text zoom. I guess I will go use a browser, at least I can zoom fonts there.

Maybe I could find a KDE settings thing and maybe find where and why KDE does this (it doesn't happen in GNOME apps), but honestly it's simpler to give up on KDE based programs and find other choices.

(The specific KDE program I was trying to use this time was NeoChat.)

My fvwm based desktop environment has an XSettings daemon running, which I use in part to set up a proper HiDPI environment (also); that writeup doesn't talk about KDE fonts because I never figured that out. I suspect that my HiDPI display is part of why KDE programs often or always seem to pick tiny fonts, but I don't particularly know why. Based on the xsettingsd documentation and the registry, there doesn't seem to be any KDE specific font settings, and I'm setting the Gtk/FontName setting to a font that KDE doesn't seem to be using (which I could only verify once I found a way to see the font I was specifying).

After some searching I found the systemsettings program through the Arch wiki's page on KDE and was able to turn up its font sizes in a way that appears to be durable (ie, it stays after I stop and start systemsettings). However, this hasn't affected the fonts I see in NeoChat when I run it again. There are a bunch of font settings, but maybe NeoChat is using the 'small' font for some reason (apparently which app uses what font setting can be variable).

Qt (the underlying GUI toolkit of much or all of KDE) has its own set of environment variables for scaling things on HiDPI displays, and setting $QT_SCALE_FACTOR does size up NeoChat (although apparently bits of Plasma ignore these; I think I'm unlikely to run into that since I don't want to use KDE's desktop components).

Some KDE applications have their own settings files with their own font sizes; one example I know of is kdiff3. This is quite helpful because if I'm determined enough, I can either adjust the font sizes in the program's settings or at least go edit the configuration file (in this case, .config/kdiff3rc, I think, not .kde/share/config/kdiff3rc). However, not all KDE applications allow you to change font sizes through either their GUI or a settings file, and NeoChat appears to be one of the ones that don't.

In theory now that I've done all of this research I could resize NeoChat and perhaps other KDE applications through $QT_SCALE_FACTOR. In practice I feel I would rather switch to applications that interoperate better with the rest of my environment unless for some reason the KDE application is either my only choice or the significantly superior one (as it has been so far for kdiff3 for my usage).

Using Netplan to set up WireGuard on Ubuntu 22.04 works, but has warts

By: cks
28 February 2025 at 04:07

For reasons outside the scope of this entry, I recently needed to set up WireGuard on an Ubuntu 22.04 machine. When I did this before for an IPv6 gateway, I used systemd-networkd directly. This time around I wasn't going to set up a single peer and stop; I expected to iterate and add peers several times, which made netplan's ability to update and re-do your network configuration look attractive. Also, our machines are already using Netplan for their basic network configuration, so this would spare my co-workers from having to learn about systemd-networkd.

Conveniently, Netplan supports multiple configuration files so you can put your WireGuard configuration into a new .yaml file in your /etc/netplan. The basic version of a WireGuard endpoint with purely internal WireGuard IPs is straightforward:

network:
  version: 2
  tunnels:
    our-wg0:
      mode: wireguard
      addresses: [ 192.168.X.1/24 ]
      port: 51820
      key:
        private: '....'
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820

(You may want something larger than a /24 depending on how many other machines you think you'll be talking to. Also, this configuration doesn't enable IP forwarding, which is a feature in our particular situation.)

If you're using netplan's systemd-networkd backend, which you probably are on an Ubuntu server, you can apparently put your keys into files instead of needing to carefully guard the permissions of your WireGuard /etc/netplan file (which normally has your private key in it).

If you write this out and run 'netplan try' or 'netplan apply', it will duly apply all of the configuration and bring your 'our-wg0' WireGuard configuration up as you expect. The problems emerge when you change this configuration, perhaps to add another peer, and then re-do your 'netplan try', because when you look you'll find that your new peer hasn't been added. This is a sign of a general issue; as far as I can tell, netplan (at least in Ubuntu 22.04) can set up WireGuard devices from scratch but it can't update anything about their WireGuard configuration once they're created. This is probably a limitation in the Ubuntu 22.04 version of systemd-networkd that's only changed in the very latest systemd versions. In order to make WireGuard level changes, you need to remove the device, for example with 'ip link del dev our-wg0' and then re-run 'netplan try' (or 'netplan apply') to re-create the WireGuard device from scratch; the recreated version will include all of your changes.

(The latest online systemd.netdev manual page says that systemd-networkd will try to update netdev configurations if they change, and .netdev files are where WireGuard settings go. The best information I can find is that this change appeared in systemd v257, although the Fedora 41 systemd.netdev manual page has this same wording and it has systemd '256.11'. Maybe there was a backport into Fedora.)

In our specific situation, deleting and recreating the WireGuard device is harmless and we're not going to be doing it very often anyway. In other configurations things may not be so straightforward and so you may need to resort to other means to apply updates to your WireGuard configuration (including working directly through the 'wg' tool).

I'm not impressed by the state of NFS v4 in the Linux kernel

By: cks
27 February 2025 at 04:15

Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 due to running into a series of problems our people were experiencing with NFS (v3) locks (cf), since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.

(NFS v4 locks are handled relatively differently than NFS v3 locks.)

Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).

(The NFS v4 server issue we encountered may be the one fixed by this commit.)

What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).

We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.

PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses but the client's use is of whatever is convenient and useful for it, not necessarily by NFS v4 revision. If the major use of Linux NFS v4 servers is with v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).

Why I have a little C program to filter a $PATH (more or less)

By: cks
16 February 2025 at 02:07

I use a non-standard shell and have for a long time, which means that I have to write and maintain my own set of dotfiles (which sometimes has advantages). In the long ago days when I started doing this, I had a bunch of accounts on different Unixes around the university (as was the fashion at the time, especially if you were a sysadmin). So I decided that I was going to simplify my life by having one set of dotfiles for rc that I used on all of my accounts, across a wide variety of Unixes and Unix environments. That way, when I made an improvement in a shell function I used, I could get it everywhere by just pushing out a new version of my dotfiles.

(This was long enough ago that my dotfile propagation was mostly manual, although I believe I used rdist for some of it.)

In the old days, one of the problems you faced if you wanted a common set of dotfiles across a wide variety of Unixes was that there were a lot of things that potentially could be in your $PATH. Different Unixes had different sets of standard directories, and local groups put local programs (that I definitely wanted access to) in different places. I could have put everything in $PATH (giving me a gigantic one) or tried to carefully scope out what system environment I was on and set an appropriate $PATH for each one, but I decided to take a more brute force approach. I started with a giant potential $PATH that listed every last directory that could appear in $PATH in any system I had an account on, and then I had a C program that filtered that potential $PATH down to only things that existed on the local system. Because it was written in C and had to stat() things anyways, I made it also keep track of what concrete directories it had seen and filter out duplicates, so that if there were symlinks from one name to another, I wouldn't get it twice in my $PATH.

(Looking at historical copies of the source code for this program, the filtering of duplicates was added a bit later; the very first version only cared about whether a directory existed or not.)

The reason I wrote a C program for this (imaginatively called 'isdirs') instead of using shell builtins to do this filtering (which is entirely possible) is primarily because this was so long ago that running a C program was definitely faster than using shell builtins in my shell. I did have a fallback shell builtin version in case my C program might not be compiled for the current system and architecture, although it didn't do the filtering of duplicates.

(Rc uses a real list for its equivalent of $PATH instead of the awkward ':' separated pseudo-list that other Unix shells use, so both my C program and my shell builtin could simply take a conventional argument list of directories rather than having to try to crack a $PATH apart.)
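A minimal Python sketch of what 'isdirs' does (filter_path is a hypothetical name; the real program is C and takes its directories as arguments):

```python
import os
import stat

def filter_path(dirs):
    # Keep only directories that exist, dropping duplicates that
    # resolve to the same underlying directory (for example via
    # symlinks), identified by (st_dev, st_ino) just as the C
    # program's stat() bookkeeping does.
    seen = set()
    kept = []
    for d in dirs:
        try:
            st = os.stat(d)
        except OSError:
            continue                      # doesn't exist: drop it
        if not stat.S_ISDIR(st.st_mode):
            continue                      # exists but isn't a directory
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept
```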

(This entry was inspired by Ben Zanin's trick(s) to filter out duplicate $PATH entries (also), which prompted me to mention my program.)

PS: rc technically only has one dotfile, .rcrc, but I split my version up into several files that did different parts of the work. One reason for this split was so that I could source only some parts to set up my environment in a non-interactive context (also).

Sidebar: the rc builtin version

Rc has very few builtins and those builtins don't include test, so this is a bit convoluted:

path=`{tpath=() pe=() {
        for (pe in $path)
           builtin cd $pe >[1=] >[2=] && tpath=($tpath $pe)
        echo $tpath
       } >[2]/dev/null}

In a conventional shell with a test builtin, you would just use 'test -d' to see if directories were there. In rc, the only builtin that will tell you if a directory exists is to try to cd to it. That we change directories is harmless because everything is running inside the equivalent of a Bourne shell $(...).

Keen-eyed people will have noticed that this version doesn't work if anything in $path has a space in it, because we pass the result back as a whitespace-separated string. This is a limitation shared with how I used the C program, but I never had to use a Unix where one of my $PATH entries needed a space in it.
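For comparison, here's a rough POSIX-shell sketch of the same filtering (the function name and approach are my own, not the original fallback). One real difference: 'isdirs' deduplicated by the concrete directory it stat()'d, so it also caught symlink aliases, while this only deduplicates identical names:

```shell
# Keep only directories that exist, drop name-level duplicates, and
# emit the result as a ':'-separated $PATH.
filter_path() {
    newpath=''
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        case ":$newpath:" in
            *":$dir:"*) ;;                           # seen already, skip
            *) newpath="${newpath:+$newpath:}$dir" ;;
        esac
    done
    printf '%s\n' "$newpath"
}

filter_path /usr/bin /no/such/dir /usr/bin /bin
```

On a typical Linux system this prints '/usr/bin:/bin': the nonexistent directory and the duplicate are both dropped.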

The profusion of things that could be in your $PATH on old Unixes

By: cks
15 February 2025 at 03:43

In the beginning, which is to say the early days of Bell Labs Research Unix, life was simple and there was only /bin. Soon afterwards that disk ran out of space and we got /usr/bin (and all of /usr), and some people might even have put /etc on their $PATH. When UCB released BSD Unix, they added /usr/ucb as a place for (some of) their new programs and put some more useful programs in /etc (and at some point there was also /usr/etc); now you had three or four $PATH entries. When window systems showed up, people gave them their own directories too, such as /usr/bin/X11 or /usr/openwin/bin, and this pattern was followed by other third party collections of programs, with (for example) /usr/bin/mh holding all of the (N)MH programs (if you installed them there). A bit later, SunOS 4.0 added /sbin and /usr/sbin and other Unixes soon copied them, adding yet more potential $PATH entries.

(Sometimes X11 wound up in /usr/X11/bin, or /usr/X11<release>/bin. OpenBSD still has a /usr/X11R6 directory tree, to my surprise.)

When Unix went out into the field, early system administrators soon learned that they didn't want to put local programs into /usr/bin, /usr/sbin, and so on. Of course there was no particular agreement on where to put things, so people came up with all sorts of options for the local hierarchy, including /usr/local, /local, /slocal, /<group name> (such as /csri or /dgp), and more. Often these /local/bin things had additional subdirectories for things like the locally built version of X11, which might be plain 'bin/X11' or have a version suffix, like 'bin/X11R4', 'bin/X11R5', or 'bin/X11R6'. Some places got more elaborate; rather than putting everything in a single hierarchy, they put separate things into separate directory hierarchies. When people used /opt for this, you could get /opt/gnu/bin, /opt/tk/bin, and so on.

(There were lots of variations, especially for locally built versions of X11. And a lot of people built X11 from source in those days, at least in the university circles I was in.)

Unix vendors didn't sit still either. As they began adding more optional pieces they started splitting them up into various directory trees, both for their own software and for third party software they felt like shipping. Third party software was often planted into either /usr/local or /usr/contrib, although there were other options, and vendor stuff could go in many places. A typical example is Solaris 9's $PATH for sysadmins (and I think that's not even fully complete, since I believe Solaris 9 had some stuff hiding under /usr/xpg4). Energetic Unix vendors could and did put various things in /opt under various names. By this point, commercial software vendors that shipped things for Unixes also often put them in /opt.

This led to three broad things for people using Unixes back in those days. First, you invariably had a large $PATH, between all of the standard locations, the vendor additions, and the local additions on top of those (and possibly personal 'bin' directories in your $HOME). Second, there was a lot of variation in the $PATH you wanted, both from Unix to Unix (with every vendor having their own collection of non-standard $PATH additions) and from site to site (with sysadmins making all sorts of decisions about where to put local things). Third, setting yourself up on a new Unix often required a bunch of exploration and digging. Unix vendors often didn't add everything that you wanted to their standard $PATH, for example. If you were lucky and got an account at a well run site, their local custom new account dotfiles would set you up with a correct and reasonably complete local $PATH. If you were a sysadmin exploring a new to you Unix, you might wind up writing a grumpy blog entry.

(This got much more complicated for sites that had a multi-Unix environment, especially with shared home directories.)

Modern Unix life is usually at least somewhat better. On Linux, you're typically down to two main directories (/usr/bin and /usr/sbin) and possibly some things in /opt, depending on local tastes. The *BSDs are a little more expansive but typically nowhere near the heights of, for example, Solaris 9's $PATH (see the comments on that entry too).

The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)

By: cks
9 February 2025 at 03:51

Over on the Fediverse I said:

This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.

When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats I can provide a better answer of what's missing. All of this applies to node_exporter as of version 1.8.2 (the current one as I write this).

(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)

Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.

In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in nfs/parse.go. For the server case, if we cross compare this to the kernel's include/linux/nfs4.h, what's missing from server stats is all NFS v4.1, v4.2, and RFC 8276 xattr operations, everything from operation 40 through operation 75 (as I write this).
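As a quick illustration of the format involved (the sample line here is invented and much shorter than a real one; on an actual NFS server you'd read /proc/net/rpc/nfsd itself), the field right after 'proc4ops' declares how many per-operation counters follow, so you can see at a glance how many operations your kernel reports versus how many a parser knows about:

```shell
# Compare the declared operation count with the counters actually present.
sample='proc4ops 5 0 0 0 17 42'
echo "$sample" | awk '/^proc4ops/ {
    printf "declared %d ops, %d counters present\n", $2, NF - 2
}'
```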

Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone', and is missing from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):

Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus

Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.

If you want numbers for scale, at the moment node_exporter reports on 50 out of 69 NFS v4 client operations, and is missing 36 NFS v4 server operations (reporting on what I believe is 36 out of 72). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, what client operations the server stats are missing.

(I haven't made a bug report about this (yet) and may not do so, because doing so would require making my Github account operable again, something I'm sort of annoyed by. Github's choice to require me to have MFA to make bug reports is not the incentive they think it is.)

Linux kernel NFSv4 server and client RPC operation statistics

By: cks
7 February 2025 at 02:59

NFS servers and clients communicate using RPC, sending various NFS v3, v4, and possibly v2 (but we hope not) RPC operations to the server and getting replies. On Linux, the kernel exports statistics about these NFS RPC operations in various places, with a global summary in /proc/net/rpc/nfsd (for the NFS server side) and /proc/net/rpc/nfs (for the client side). Various tools will extract this information and convert it into things like metrics, or present it on the fly (for example, nfsstat(8)). However, as far as I know what is in those files and especially how RPC operations are reported is not well documented, and also confusing, which is a problem if you discover that something has an incomplete knowledge of NFSv4 RPC stats.

For a general discussion of /proc/net/rpc/nfsd, see Svenn D'Hert's nfsd stats explained article. I'm focusing on NFSv4, which is to say the 'proc4ops' line. This line is produced in nfsd_show in fs/nfsd/stats.c. The line starts with a count of how many operations there are, such as 'proc4ops 76', and then has one number for each operation. What are the operations and how many of them are there? That's more or less found in the nfs_opnum4 enum in include/linux/nfs4.h. You'll notice that there are some gaps in the operation numbers; for example, there's no 0, 1, or 2. Despite there being no such actual NFS v4 operations, 'proc4ops' starts with three 0s for them, because it works with an array numbered by nfs_opnum4 and like all C arrays, it starts at 0.

(The counts of other, real NFS v4 operations may be 0 because they're never done in your environment.)
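To see this numbering in action, you can split the line into per-operation counts yourself (the sample line is invented and truncated; feed it /proc/net/rpc/nfsd on a real server). Operations 0 through 2 show up as permanent zeros, exactly as described above:

```shell
# Number each counter by its position, which corresponds to its
# nfs_opnum4 value on the server side.
sample='proc4ops 5 0 0 0 17 42'
echo "$sample" | awk '/^proc4ops/ {
    for (i = 3; i <= NF; i++) printf "op%d %d\n", i - 3, $i
}'
```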

For NFS v4 client operations, we look at the 'proc4' line in /proc/net/rpc/nfs. Like the server's 'proc4ops' line, it starts with a count of how many operations are being reported on, such as 'proc4 69', and then a count for each operation. Unfortunately for us and everyone else, these operations are not numbered the same as the NFS server operations. Instead the numbering is given in an anonymous and unnumbered enum in include/linux/nfs4.h that starts with 'NFSPROC4_CLNT_NULL = 0,' (as a spoiler, the 'null' operation is not unused, contrary to the include file's comment). The actual generation and output of /proc/net/rpc/nfs is done in rpc_proc_show in net/sunrpc/stats.c. The whole structure this code uses is set up in fs/nfs/nfs4xdr.c, and while there is a confusing level of indirection, I believe the structure corresponds directly with the NFSPROC4_CLNT_* enum values.

What I think is going on is that Linux has decided to optimize its NFSv4 client statistics to only include the NFS v4 operations that it actually uses, rather than take up a bit of extra memory to include all of the NFS v4 operations, including ones that will always have a '0' count. Because the Linux NFS v4 client started using different NFSv4 operations at different times, some of these operations (such as 'lookupp') are out of order; when the NFS v4 client started using them, they had to be added at the end of the 'proc4' line to preserve backward compatibility with existing programs that read /proc/net/rpc/nfs.

PS: As far as I can tell from a quick look at fs/nfs/nfs3xdr.c, include/uapi/linux/nfs3.h, and net/sunrpc/stats.c, the NFS v3 server and client stats cover all of the NFS v3 operations and are in the same order, the order of the NFS v3 operation numbers.

How Ubuntu 24.04's bad bpftrace package appears to have happened

By: cks
6 February 2025 at 02:39

When I wrote about Ubuntu 24.04's completely broken bpftrace '0.20.2-1ubuntu4.2' package (which is now no longer available as an Ubuntu update), I said it was a disturbing mystery how a theoretical 24.04 bpftrace binary was built in such a way that it depended on a shared library that didn't exist in 24.04. Thanks to the discussion in bpftrace bug #2097317, we have somewhat of an answer, which in part shows some of the challenges of building software at scale.

The short version is that the broken bpftrace package wasn't built in a standard Ubuntu 24.04 environment that only had released packages. Instead, it was built in a '24.04' environment that included (some?) proposed updates, and one of the included proposed updates was an updated version of libllvm18 that had the new shared library. Apparently there are mechanisms that should have acted to make the new bpftrace depend on the new libllvm18 if everything went right, but some things didn't go right and the new bpftrace package didn't pick up that dependency.

On the one hand, if you're planning interconnected package updates, it's a good idea to make sure that they work with each other, which means you may want to mingle in some proposed updates into some of your build environments. On the other hand, if you allow your build environments to be contaminated with non-public packages this way, you really, really need to make sure that the dependencies work out. If you don't and packages become public in the wrong order, you get Ubuntu 24.04's result.

(While the RPM build process and package format would have avoided this specific problem, I'm pretty sure that there are similar ways to make it go wrong.)

Contaminating your build environment this way also makes testing your newly built packages harder. The built bpftrace binary would have run inside the build environment, because the build environment had the right shared library from the proposed libllvm18. To see the failure, you would have to run tests (including running the built binary) in a 'pure' 24.04 environment that had only publicly released package updates. This would require an extra package test step; I'm not clear if Ubuntu has this as part of their automated testing of proposed updates (there's some hints in the discussion that they do but that these tests were limited and didn't try to run the binary).

An alarmingly bad official Ubuntu 24.04 bpftrace binary package

By: cks
2 February 2025 at 03:53

Bpftrace is a more or less official part of Ubuntu; it's even in the Ubuntu 24.04 'main' repository, as opposed to one of the less supported ones. So I'll present things in the traditional illustrated form (slightly edited for line length reasons):

$ bpftrace
bpftrace: error while loading shared libraries: libLLVM-18.so.18.1: cannot open shared object file: No such file or directory
$ readelf -d /usr/bin/bpftrace | grep libLLVM
 0x0...01 (NEEDED)  Shared library: [libLLVM-18.so.18.1]
$ dpkg -L libllvm18 | grep libLLVM
/usr/lib/llvm-18/lib/libLLVM.so.1
/usr/lib/llvm-18/lib/libLLVM.so.18.1
/usr/lib/x86_64-linux-gnu/libLLVM-18.so
/usr/lib/x86_64-linux-gnu/libLLVM.so.18.1
$ dpkg -l bpftrace libllvm18
[...]
ii  bpftrace       0.20.2-1ubuntu4.2 amd64 [...]
ii  libllvm18:amd64 1:18.1.3-1ubuntu1 amd64 [...]

I originally mis-diagnosed this as a libllvm18 packaging failure, but this is in fact worse. Based on trawling through packages.ubuntu.com, only Ubuntu 24.10 and later have a 'libLLVM-18.so.18.1' in any package; in Ubuntu 24.04, the correct name for this is 'libLLVM.so.18.1'. If you rebuild the bpftrace source .deb on a genuine 24.04 machine, you get a bpftrace build (and binary .deb) that does correctly use 'libLLVM.so.18.1' instead of 'libLLVM-18.so.18.1'.

As far as I can see, there are two things that could have happened here. The first is that Canonical simply built a 24.10 (or later) bpftrace binary .deb and put it in 24.04 without bothering to check if the result actually worked. I would like to say that this shows shocking disregard for the functioning of an increasingly important observability tool from Canonical, but actually it's not shocking at all, it's Canonical being Canonical (and they would like us to pay for this for some reason). The second and worse option is that Canonical is building 'Ubuntu 24.04' packages in an environment that is contaminated with 24.10 or later packages, shared libraries, and so on. This isn't supposed to happen in a properly operating package building environment that intends to create reliable and reproducible results and casts doubt on the provenance and reliability of all Ubuntu 24.04 packages.

(I don't know if there's a way to inspect binary .debs to determine anything about the environment they were built in, the way you can get some information about RPMs. Also, I now have a new appreciation for Fedora putting the Fedora release version into the actual RPM's 'release' name. Ubuntu 24.10 and 24.04 don't have the same version of bpftrace, so this isn't quite as simple as Canonical copying the 24.10 package to 24.04; 24.10 has 0.21.2, while 24.04 is theoretically 0.20.2.)

Incidentally, this isn't an issue of the shared library having its name changed, because if you manually create a 'libLLVM-18.so.18.1' symbolic link to the 24.04 libllvm18's 'libLLVM.so.18.1' and run bpftrace, what you get is:

$ bpftrace
: CommandLine Error: Option 'debug-counter' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
abort

This appears to say that the Ubuntu 24.04 bpftrace binary is incompatible with the Ubuntu 24.04 libllvm18 shared libraries. I suspect that it was built against different LLVM 18 headers as well as different LLVM 18 shared libraries.
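For what it's worth, a cheap way to catch this whole class of breakage before relying on a binary is to ask the dynamic linker whether everything it needs actually resolves. Here it's demonstrated on /bin/sh just because any dynamically linked binary works; run against the broken bpftrace, ldd would have reported libLLVM-18.so.18.1 as 'not found':

```shell
# Report any shared libraries a binary needs that the system can't supply.
check_libs() {
    if ldd "$1" | grep -q 'not found'; then
        echo "$1: MISSING:"
        ldd "$1" | grep 'not found'
        return 1
    fi
    echo "$1: all shared libraries resolve"
}

check_libs /bin/sh
```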

How to accidentally get yourself with 'find ... -name something*'

By: cks
28 January 2025 at 03:43

Suppose that you're in some subdirectory /a/b/c, and you want to search all of /a for the presence of files for any version of some program:

u@h:/a/b/c$ find /a -name program* -print

This reports '/a/b/c/program-1.2.tar' and '/a/b/f/program-1.2.tar', but you happen to know that there are other versions of the program under /a. What happened to a command that normally works fine?

As you may have already spotted, what happened is the shell's wildcard expansion. Because you ran your find in a directory that contained exactly one match for 'program*', the shell expanded it before you ran find, and what you actually ran was:

find /a -name program-1.2.tar -print

This reported the two instances of program-1.2.tar in the /a tree, but not the program-1.4.1.tar that was also in the /a tree.

If you'd run your find command in a directory without a shell match for the -name wildcard, the shell would (normally) pass the unexpanded wildcard through to find, which would do what you want. And if there had been only one instance of 'program-1.2.tar' in the tree, in your current directory, it might have been more obvious what went wrong; instead, the find returning more than one result made it look like it was working normally apart from inexplicably not finding and reporting 'program-1.4.1.tar'.

(If there were multiple matches for the wildcard in the current directory, 'find' would probably have complained and you'd have realized what was going on.)
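The whole thing is easy to reproduce in a scratch directory (the layout here is invented to mirror the example above):

```shell
cd "$(mktemp -d)"
mkdir -p a/b/c a/b/f a/d
touch a/b/c/program-1.2.tar a/b/f/program-1.2.tar a/d/program-1.4.1.tar
cd a/b/c

# Unquoted: the shell expands program* to program-1.2.tar before find runs,
# so find only ever looks for that one name (two matches).
find ../.. -name program* | sort
# Quoted: find sees the wildcard itself and matches everything (three).
find ../.. -name 'program*' | sort
```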

Some shells have options to cause failed wildcard expansions to be considered an error; Bash has the 'failglob' shopt, for example. People who turn these options on are probably not going to stumble into this because they've already been conditioned to quote wildcards for 'find -name' and other similar tools. Possibly this Bash option or its equivalent in other shells should be the default for new Unix accounts, just so everyone gets used to quoting wildcards that are supposed to be passed through to programs.

(Although I don't use a shell that makes failed wildcard expansions an error, I somehow long ago internalized the idea that I should quote all wildcards I want to pass to programs.)

The (potential) complexity of good runqueue latency measurement in Linux

By: cks
21 January 2025 at 04:16

Run queue latency is the time between when a Linux task becomes ready to run and when it actually runs. If you want good responsiveness, you want a low runqueue latency, so for a while I've been tracking a histogram of it with eBPF, and I put some graphs of it up on some Grafana dashboards I look at. Then recently I improved the responsiveness of my desktop with the cgroup V2 'cpu.idle' setting, and questions came up about how this differed from process niceness. When I was looking at those questions, I realized that my run queue latency measurements were incomplete.

When I first set up my run queue latency tracking, I wasn't using either cgroup V2 cpu.idle or process niceness, and so I set up a single global runqueue latency histogram for all tasks regardless of their priority and scheduling class. Once I started using 'idle' CPU scheduling (and testing the effectiveness of niceness), this resulted in hopelessly muddled data that was effectively meaningless during the time that multiple types of scheduling or multiple nicenesses were in use. Running CPU-consuming processes only when the system is otherwise idle is (hopefully) good for the runqueue latency of my regular desktop processes, but more terrible than usual for those 'run only when idle' processes, and generally there's going to be a lot more of them than my desktop processes.

The moment you introduce more than one 'class' of processes for scheduling, you need to split run queue latency measurements up between these classes if you want to really make sense of the results. What these classes are will depend on your environment. I could probably get away with a class for 'cpu.idle' tasks, a class for heavily nice'd tasks, a class for regular tasks, and perhaps a class for (system) processes running with very high priority. If you're doing fair share scheduling between logins, you might need a class per login (or you could ignore run queue latency as too noisy a measure).

I'm not sure I'd actually track all of my classes as Prometheus metrics. For my personal purposes, I don't care very much about the run queue latency of 'idle' or heavily nice'd processes, so perhaps I should update my personal metrics gathering to just ignore those. Alternately, I could write a bpftrace script that gathered the detailed class by class data, run it by hand when I was curious, and ignore the issue otherwise (continuing with my 'global' run queue latency histogram, which is at least honest in general).

The history and use of /etc/glob in early Unixes

By: cks
13 January 2025 at 04:41

One of the innovations that the V7 Bourne shell introduced was built in shell wildcard globbing, which is to say expanding things like *, ?, and so on. Of course Unix had shell wildcards well before V7, but in V6 and earlier, the shell didn't implement globbing itself; instead this was delegated to an external program, /etc/glob (this affects things like looking into the history of Unix shell wildcards, because you have to know to look at the glob source, not the shell).

As covered in places like the V6 glob(8) manual page, the glob program was passed a command and its arguments (already split up by the shell), and went through the arguments to expand any wildcards it found, then exec()'d the command with the now expanded arguments. The shell operated by scanning all of the arguments for (unescaped) wildcard characters. If any were found, the shell exec'd /etc/glob with the whole show; otherwise, it directly exec()'d the command with its arguments. Quoting wildcards used a hack that will be discussed later.

This basic /etc/glob behavior goes all the way back to Unix V1, where we have sh.s and in it we can see that invocation of /etc/glob. In V2, glob is one of the programs that have been rewritten in C (glob.c), and in V3 we have a sh.1 that mentions /etc/glob and has an interesting BUGS note about it:

If any argument contains a quoted "*", "?", or "[", then all instances of these characters must be quoted. This is because sh calls the glob routine whenever an unquoted "*", "?", or "[" is noticed; the fact that other instances of these characters occurred quoted is not noticed by glob.

This section has disappeared in the V4 sh.1 manual page, which suggests that the V4 shell and /etc/glob had acquired the hack they use in V5 and V6 to avoid this particular problem.

How escaping wildcards works in the V5 and V6 shell is that all characters in commands and arguments are restricted to being seven-bit ASCII. The shell and /etc/glob both use the 8th bit to mark quoted characters, which means that such quoted characters don't match their unquoted versions and won't be seen as wildcards by either the shell (when it's deciding whether or not it needs to run /etc/glob) or by /etc/glob itself (when it's deciding what to expand). However, obviously neither the shell nor /etc/glob can pass such 'marked as quoted' characters to actual commands, so each of them strips the high bit from all characters before exec()'ing actual commands.

(This is clearer in the V5 glob.c source; look for how cat() ands every character with octal 0177 (0x7f) to drop the high bit. You can also see it in the V5 sh.c source, where you want to look at trim(), and also the #define for 'quote' at the start of sh.c and how it's used later.)

PS: I don't know why expanding shell wildcards used a separate program in V6 and earlier, but part of it may have been to keep the shell smaller and more minimal so that it required less memory.

PPS: See also Stephen R. Bourne's 2015 presentation from BSDCan [PDF], which has a bunch of interesting things on the V7 shell and confirms that /etc/glob was there from V1.

What a FreeBSD kernel message about your bridge means

By: cks
8 January 2025 at 03:58

Suppose, not hypothetically, that you're operating a FreeBSD based bridging firewall (or some other bridge situation) and you see something like the following kernel message:

kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix0 to ix1
kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix1 to ix0

The bad news is that this message means what you think it means. Your FreeBSD bridge between ix0 and ix1 first saw this MAC address as the source address on a packet it received on the ix0 interface of the bridge, and then it saw the same MAC address as the source address of a packet received on ix1, and then it received another packet on ix0 with that MAC address as the source address. Either you have something echoing those packets back on one side, or there is a network path between the two sides that bypasses your bridge.

(If you're lucky this happens regularly. If you're not lucky it happens only some of the time.)

This particular message comes from bridge_rtupdate() in sys/net/if_bridge.c, which is called to update the bridge's 'routing entries', which here means MAC addresses, not IP addresses. This function is called from bridge_forward(), which forwards packets, which is itself called from bridge_input(), which handles received packets. All of this only happens if the underlying interfaces are in 'learning' mode, but this is the default.

As covered in the ifconfig manual page, you can inspect what MAC addresses have been learned on which device with 'ifconfig bridge0 addr' (covered in the 'Bridge Interface Parameters' section of the manual page). This may be useful to see if your bridge normally has a certain MAC address (perhaps the one that's moving) on the interface it should be on. If you want to go further, it's possible to set a static mapping for some MAC addresses, which will make them stick to one interface even if seen on another one.

Logging this message is controlled by the net.link.bridge.log_mac_flap sysctl, and it's rate limited to only being reported five times a second in general (using ppsratecheck()). That's five times total, even if each time is a different MAC address or even a different bridge. This 'five times a second' log count isn't controllable through a sysctl.

(I'm writing all of this down because I looked much of it up today. Sometimes I'm a system programmer who goes digging in the (FreeBSD) kernel source just to be sure.)

The issue with DNF 5 and script output in Fedora 41

By: cks
7 January 2025 at 04:45

These days Fedora uses DNF as its high(er) level package management software, replacing yum. However, there are multiple versions of DNF, which behave somewhat differently. Through Fedora 40, the default version of DNF was DNF 4; in Fedora 41, DNF is now DNF 5. DNF 5 brings a number of improvements but it has at least one issue that makes me unhappy with it in my specific situation. Over on the Fediverse I said:

Oh nice, DNF 5 in Fedora 41 has nicely improved the handling of output from RPM scriptlets, so that you can more easily see that it's scriptlet output instead of DNF messages.

[later]

I must retract my praise for DNF 5 in Fedora 41, because it has actually made the handling of output from RPM scriptlets *much* worse than in dnf 4. DNF 5 will repeatedly re-print the current output to date of scriptlets every time it updates a progress indicator of, for example, removing packages. This results in a flood of output for DKMS module builds during kernel updates. Dnf 5's cure is far worse than the disease, and there's no way to disable it.

<bugzilla 2331691>

(Fedora 41 specifically has dnf5-5.2.8.1, at least at the moment.)

This can be mostly worked around for kernel package upgrades and DKMS modules by manually removing and upgrading packages before the main kernel upgrade. You want to do this so that dnf is removing as few packages as possible while your DKMS modules are rebuilding. This is done with:

  1. Upgrade all of your non-kernel packages first:

    dnf upgrade --exclude 'kernel*'
    

  2. Remove the following packages for the old kernel:

    kernel kernel-core kernel-devel kernel-modules kernel-modules-core kernel-modules-extra

    (It's probably easier to do 'dnf remove kernel*<version>*' and let DNF sort it out.)

  3. Upgrade two kernel packages that you can do in advance:

    dnf upgrade kernel-tools kernel-tools-libs
    

Unfortunately in Fedora 41 this still leaves you with one RPM package that you can't upgrade in advance and that will be removed while your DKMS module is rebuilding, namely 'kernel-devel-matched'. To add extra annoyance, this is a virtual package that contains no files, and you can't remove it because a lot of things depend on it.

As far as I can tell, DNF 5 has absolutely no way to shut off its progress bars. It completely ignores $TERM and I can't see anything else that leaves DNF usable. It would have been nice to have some command line switches to control this, but it seems pretty clear that this wasn't high on the DNF 5 road map.

(Although I don't expect this to be fixed in Fedora 41 over its lifetime, I am still deferring the Fedora 41 upgrades of my work and home desktops for as long as possible to minimize the amount of DNF 5 irritation I have to deal with.)

WireGuard's AllowedIPs aren't always the (WireGuard) routes you want

By: cks
6 January 2025 at 04:35

A while back I wrote about understanding WireGuard's AllowedIPs, and also recently I wrote about how different sorts of WireGuard setups have different difficulties, where one of the challenges for some setups is setting up what you want routed through WireGuard connections. As Ian Z aka nobrowser recently noted in a comment on the first entry, these days many WireGuard related programs (such as wg-quick and NetworkManager) will automatically set routes for you based on AllowedIPs. Much of the time this will work fine, but there are situations where adding routes for all AllowedIPs ranges isn't what you want.

WireGuard's AllowedIPs setting for a particular peer controls two things at once: what (inside-WireGuard) source IP addresses you will accept from the peer, and what destination addresses WireGuard will send to that peer if the packet is sent to that WireGuard interface. However, it's the routing table that controls what destination addresses are sent to a particular WireGuard interface (or more likely a combination of IP policy routing rules and some routing table).

If your WireGuard IP address is only reachable from other WireGuard peers, you can sensibly bound your AllowedIPs so that the collection of all of them matches the routing table. This is also more or less doable if some of them are gateways for additional networks; hopefully your network design puts all of those networks under some subnet and the subnet isn't too big. However, if your WireGuard IP can wind up being reached by a broader range of source IPs, or even 'all of the Internet' (as is my case), then your AllowedIPs range is potentially much larger than what you want to always be routed to WireGuard.

A related case is if you have a 'work VPN' WireGuard configuration where you could route all of your traffic through your WireGuard connection but some of the time you only want to route traffic to specific (work) subnets. Unless you like changing AllowedIPs all of the time or constructing two different WireGuard interfaces and only activating the correct one, you'll want an AllowedIPs that accepts everything but some of the time you'll only route specific networks to the WireGuard interface.

(On the other hand, with the state of things in Linux, having two separate WireGuard interfaces might be the easiest way to manage this in NetworkManager or other tools.)
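With wg-quick specifically, one way to get this decoupling is its 'Table' setting; 'Table = off' stops wg-quick from creating routes from AllowedIPs, so you can add just the routes you want yourself. A sketch (all addresses and names here are made up):

```
[Interface]
PrivateKey = <your private key>
Address = 10.10.0.2/24
# Don't derive routes from AllowedIPs below; we add our own.
Table = off
# Route only the work subnet through the tunnel.
PostUp = ip route add 172.16.0.0/16 dev %i
PreDown = ip route del 172.16.0.0/16 dev %i

[Peer]
PublicKey = <peer public key>
Endpoint = vpn.example.org:51820
# Accept anything from inside the tunnel, and allow sending
# anything to it if we choose to route more later.
AllowedIPs = 0.0.0.0/0
```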

I think that most people's use of WireGuard will probably involve AllowedIPs settings that also work for routing, provided that the tools involved handle the recursive routing problem. These days, NetworkManager handles that for you, although I don't know about wg-quick.

(This is one of the entries that I write partly to work it out in my own head. My own configuration requires a different AllowedIPs than the routes I send through the WireGuard tunnel. I make this work with policy based routing.)

My unusual X desktop wasn't made 'from scratch' in a conventional sense

By: cks
1 January 2025 at 04:10

There are people out there who set up unusual (Unix) environments for themselves from scratch; for example, Mike Hoye recently wrote Idiosyncra. While I have an unusual desktop, I haven't built it from scratch in quite the same way that Mike Hoye and other people have; instead I've wound up with my desktop through a rather easier process.

It would be technically accurate to say that my current desktop environment has been built up gradually over time (including over the time I've been writing Wandering Thoughts, such as my addition of dmenu). But this isn't really how it happened, in that I didn't start from a normal desktop and slowly change it into my current one. The real story is that the core of my desktop dates from the days when everyone's X desktops looked like mine does. Technically there were what we would call full desktops back in those days, if you had licensed the necessary software from your Unix vendor and chose to run it, but hardware was sufficiently slow back then that people at universities almost always chose to run more lightweight environments (especially since they were often already using the inexpensive and slow versions of workstations).

(Depending on how much work your local university system administrators had done, your new Unix account might start out with the Unix vendor's X setup, or it could start out with what X11R<whatever> defaulted to when built from source, or it might be some locally customized setup. In all cases you often were left to learn about the local tastes in X desktops and how to improve yours from people around you.)

To show how far back this goes (which is to say how little of it has been built 'from scratch' recently), my 1996 SGI Indy desktop has much of the look and the behavior of my current desktop, and its look and behavior wasn't new then; it was an evolution of my desktop from earlier Unix workstations. When I started using Linux, I migrated my Indy X environment to my new (and better) x86 hardware, and then as Linux has evolved and added more and more things you have to run to have a usable desktop with things like volume control, your SSH agent, and automatically mounted removable media, I've added them piece by piece (and sometimes updated them as how you do this keeps changing).

(At some point I moved from twm as my window manager to fvwm, but that was merely redoing my twm configuration in fvwm, not designing a new configuration from scratch.)

I wouldn't want to start from scratch today to create a new custom desktop environment; it would be a lot of work (and the one time I looked at it I wound up giving up). Someday I will have to move from X, fvwm, dmenu, and so on to some sort of Wayland based environment, but even when I do I expect to make the result as similar to my current X setup as I can, rather than starting from a clean sheet design. I know what I want because I'm very used to my current environment and I've been using variants of it for a very long time now.

(This entry was sparked by Ian Z aka nobrowser's comment on my entry from yesterday.)

PS: Part of the long lineage and longevity of my X desktop is that I've been lucky and determined enough to use Unix and X continuously at work, and for a long time at home as well. So I've never had a time when I moved away from X on my desktop(s) and then had to come back to reconstruct an environment and catch it up to date.

PPS: This is one of the roots of my xdm heresy, where my desktops boot into a text console and I log in there to manually start X with a personal script that's a derivative of the ancient startx command.

In an unconfigured Vim, I want to do ':set paste' right away

By: cks
29 December 2024 at 03:53

Recently I wound up using a FreeBSD machine, where I promptly installed vim for my traditional reason. When I started modifying some files, I had contents to paste in from another xterm window, so I tapped my middle mouse button while in insert mode (ie, I did the standard xterm 'paste text' thing). You may imagine the 'this is my face' meme when what vim inserted was the last thing I'd deleted in vim on that FreeBSD machine, instead of my X text selection.

For my future use, the cure for this is ':set paste', which turns off basically all of vim's special handling of pasted text. I've traditionally used this to override things like vim auto-indenting or auto-commenting the text I'm pasting in, but it also turns off vim's special mouse handling, which is generally active in terminal windows, including over SSH.

(The defaults for ':set mouse' seem to vary from system to system and probably vim build to vim build. For whatever reason, this FreeBSD system and its vim defaulted to 'mouse=a', ie special mouse handling was active all the time. I've run into mouse handling limits in vim before, although things may have changed since then.)

In theory, as covered in Vim's X11 selection mechanism, I might be able to paste from another xterm (or whatever) using "*p (to use the '"*' register, which is the primary selection or the cut buffer if there's no primary selection). In practice I think this only works under limited circumstances (although I'm not sure what they are) and the Vim manual itself tells you to get used to using Shift with your middle mouse button. I would rather set paste mode, because that gets everything; a vim that has the mouse active probably has other things I don't want turned on too.

(Some day I'll put together a complete but minimal collection of vim settings to disable everything I want disabled, but that day isn't today.)
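As a starting point for that someday-collection, a minimal ~/.vimrc sketch (which options to disable is a matter of taste; these are just the ones this entry runs into):

```
" Turn off vim's terminal mouse handling entirely, so middle
" button paste goes back to being xterm's job.
set mouse=
" ':set paste' is too drastic to leave on all the time (it also
" disables insert-mode mappings and abbreviations), so make it
" easy to toggle instead.
set pastetoggle=<F2>
```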

PS: If I'm reading various things correctly, I think vim has to be built with the 'xterm_clipboard' option in order to pull out selection information from xterm. Xterm itself must have 'Window Ops' allowed, which is not a normal setting; with this turned on, vim (or any other program) can use the selection manipulation escape sequences that xterm documents in "Operating System Commands". These escape sequences don't require that vim have direct access to your X display, so they can be used over plain SSH connections. Support for these escape sequences is probably available in other terminal emulators too, and these terminal emulators may have them always enabled.

(Note that access to your selection is a potential security risk, which is probably part of why xterm doesn't allow it by default.)

Cgroup V2 memory limits and their potential for thrashing

By: cks
28 December 2024 at 04:10

Recently I read 32 MiB Working Sets on a 64 GiB machine (via), which recounts how under some situations, Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.

(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)

Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit. What happens next depends on what sort of memory pages are being reclaimed from the process. If they are backed by files (for example, they're pages from the program, shared libraries, or memory mapped files), they will be dropped from the process's resident set but may stay in memory so it's only a soft page fault when they're next accessed. However, if they're anonymous pages of memory the process has allocated, they must be written to swap (if there's room for them) and I don't know if the original pages stay in memory afterward (and so are eligible for a soft page fault when next accessed). If the process keeps accessing anonymous pages that were previously reclaimed, it will thrash on either soft or hard page faults.

(The memory.high limit is set by systemd's MemoryHigh=.)

However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts for RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sort of memory that allocates. Some of this memory can also thrash like user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).

Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit and will kill programs in the cgroup if they push hard on going over it; however, the memory system will try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right. However, in our environments cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap available and exhausting the swap space too.

My view is that this generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', ie either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
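As a concrete illustration of the difference, you can put a one-off command under either limit with systemd-run (the 256M figure and the command name are arbitrary examples):

```sh
# Hard limit: if the command pushes past 256M (and we deny it
# swap), the kernel OOM-kills something in the scope rather than
# leaving it to thrash.
systemd-run --scope -p MemoryMax=256M -p MemorySwapMax=0 \
    some-memory-hungry-command

# Soft limit: the command is throttled and heavily reclaimed at
# 256M but never killed, which is where thrashing can set in.
systemd-run --scope -p MemoryHigh=256M some-memory-hungry-command
```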

WireGuard on OpenBSD just works (at least as a VPN server)

By: cks
27 December 2024 at 04:12

A year or so ago I mentioned that I'd set up WireGuard on an Android and an iOS device in a straightforward VPN configuration. What I didn't mention in that entry is that the other end of the VPN was not on a Linux machine, but on one of our OpenBSD VPN servers. At the time it was running whatever was the then-current OpenBSD version, and today it's running OpenBSD 7.6, which is the current version at the moment. Over that time (and before it, since the smartphones weren't its first WireGuard clients), WireGuard on OpenBSD has been trouble free and has just worked.

In our configuration, OpenBSD WireGuard requires installing the 'wireguard-tools' package, setting up an /etc/wireguard/wg0.conf (perhaps plus additional files for generated keys), and creating an appropriate /etc/hostname.wg0. I believe that all of these are covered as part of the standard OpenBSD documentation for setting up WireGuard. For this VPN server I allocated a /24 inside the RFC 1918 range we use for VPN service to be used for WireGuard, since I don't expect too many clients on this server. The server NATs WireGuard connections just as it NATs connections from the other VPNs it supports, which requires nothing special for WireGuard in its /etc/pf.conf.
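Concretely, the OpenBSD side looks something like the following sketch (the addresses, port, and key placeholders are made-up illustrations, not our real configuration):

```
# /etc/wireguard/wg0.conf, in the usual wg(8) setconf format
[Interface]
PrivateKey = <server private key>
ListenPort = 51820

[Peer]
PublicKey = <client public key>
AllowedIPs = 172.30.40.10/32
```

```
# /etc/hostname.wg0: configure the interface at boot, then load
# the WireGuard peer configuration with the wireguard-tools wg(8).
inet 172.30.40.1 255.255.255.0
up
!/usr/local/bin/wg setconf wg0 /etc/wireguard/wg0.conf
```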

(I did have to remember to allow incoming traffic to the WireGuard UDP port. For this server, we allow WireGuard clients to send traffic to each other through the VPN server if they really want to, but in another one we might want to restrict that with additional pf rules.)

Everything I'd expect to work does work, both in terms of the WireGuard tools (I believe the information 'wg' prints is identical between Linux and OpenBSD, for example) and for basic system metrics (as read out by, for example, the OpenBSD version of the Prometheus host agent, which has overall metrics for the 'wg0' interface). If we wanted per-client statistics, I believe we could probably get them through this third party WireGuard Prometheus exporter, which uses an underlying package to talk to WireGuard that does apparently work on OpenBSD (although this particular exporter can potentially have label cardinality issues), or generate them ourselves by parsing 'wg' output (likely from 'wg show all dump').
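A sketch of the parsing approach: in 'wg show all dump' output, interface lines have five tab-separated fields while peer lines have nine, with the receive and transmit byte counters in fields 7 and 8. The canned two-line sample below stands in for live 'wg' output; real usage would pipe 'wg show all dump' into the same awk.

```shell
# Extract per-peer traffic counters from 'wg show all dump' style
# output; the peer key and counters here are invented sample data.
printf 'wg0\t(priv)\t(pub)\t51820\toff\nwg0\tPEERKEY\t(none)\t198.51.100.7:7777\t172.30.40.10/32\t1735300000\t12345\t67890\toff\n' |
  awk -F'\t' 'NF == 9 { printf "peer=%s rx=%s tx=%s\n", $2, $7, $8 }'
# -> peer=PEERKEY rx=12345 tx=67890
```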

This particular OpenBSD VPN server is sufficiently low usage that I haven't tried to measure either the possible bandwidth we can achieve with WireGuard or the CPU usage of WireGuard. Historically, neither are particularly critical for our VPNs in general, which have generally not been capable of particularly high bandwidth (with either OpenVPN or L2TP, our two general usage VPN types so far; our WireGuard VPN is for system staff only).

(In an ideal world, none of this should count as surprising. In this world, I like to note when things that are a bit out of the mainstream just work for me, with a straightforward setup process and trouble free operation.)

A gotcha with importing ZFS pools and NFS exports on Linux (as of ZFS 2.3.0)

By: cks
24 December 2024 at 03:41

Ever since its Solaris origins, ZFS has supported automatic NFS and CIFS sharing of ZFS filesystems through their 'sharenfs' and 'sharesmb' properties. Part of the idea of this is that you could automatically have NFS (and SMB) shares created and removed as you did things like import and export pools, rather than have to maintain a separate set of export information and keep it in sync with what ZFS filesystems were available. On Linux, OpenZFS still supports this, working through standard Linux NFS export permissions (which don't quite match the Solaris/Illumos model that's used for sharenfs) and standard tools like exportfs. A lot of this works more or less as you'd expect, but it turns out that there's a potentially unpleasant surprise lurking in how 'zpool import' and 'zpool export' work.
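For example, on Linux turning on NFS sharing for a filesystem is a single property change (the pool, filesystem, and subnet here are hypothetical):

```sh
# Share a filesystem read-write to one subnet; OpenZFS turns the
# exportfs-style option string into an actual exportfs invocation.
zfs set sharenfs='rw=@192.168.10.0/24' tank/data

# Inspect what ZFS thinks is shared, and what the kernel exports.
zfs get -r sharenfs tank
exportfs -v
```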

In the current code, if you import or export a ZFS pool that has no filesystems with a sharenfs set, ZFS will still run 'exportfs -ra' at the end of the operation even though nothing could have changed in the NFS exports situation. An important effect that this has is that it will wipe out any manually added or changed NFS exports, reverting your NFS exports to what is currently in /etc/exports and /etc/exports.d. In many situations (including ours) this is a harmless operation, because /etc/exports and /etc/exports.d are how things are supposed to be. But in some environments you may have programs that maintain their own exports list and permissions through running 'exportfs' in various ways, and in these environments a ZFS pool import or export will destroy those exports.

(Apparently one such environment is high availability systems, some of which manually manage NFS exports outside of /etc/exports (I maintain that this is a perfectly sensible design decision). These are also the kind of environment that might routinely import or export pools, as HA pools move between hosts.)

The current OpenZFS code runs 'exportfs -ra' entirely blindly. It doesn't matter if you don't NFS export any ZFS filesystems, much less any from the pool that you're importing or exporting. As long as an 'exportfs' binary is on the system and can be executed, ZFS will run it. Possibly this could be changed if someone was to submit an OpenZFS bug report, but for a number of reasons (including that we're not directly affected by this and aren't in a position to do any testing), that someone will not be me.

(As far as I can tell this is the state of the code in all Linux OpenZFS versions up through the current development version and 2.3.0-rc4, the latest 2.3.0 release candidate.)
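A sketch of how to see the effect (hypothetical pool, path, and client names; this needs a real ZFS pool and NFS server to run):

```sh
# Manually add an export that isn't in /etc/exports or exports.d.
exportfs -i -o rw client.example.org:/srv/scratch
exportfs -s     # the manual export shows up here

# Export and re-import a pool with no sharenfs filesystems at all...
zpool export tank
zpool import tank

# ...and the manual export is gone, because ZFS blindly ran
# 'exportfs -ra', rebuilding the list from /etc/exports*.
exportfs -s
```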

Appendix: Where this is in the current OpenZFS source code

The exportfs execution is done in nfs_commit_shares() in lib/libshare/os/linux/nfs.c. This is called (indirectly) by sa_commit_shares() in lib/libshare/libshare.c, which is called by zfs_commit_shares() in lib/libzfs/libzfs_mount.c. In turn this is called by zpool_enable_datasets() and zpool_disable_datasets(), also in libzfs_mount.c, which are called as part of 'zpool import' and 'zpool export' respectively.

(As a piece of trivia, zpool_disable_datasets() will also be called during 'zpool destroy'.)

On the US FAA's response to Falcon 9 debris

By: VM
4 March 2025 at 10:06

On February 1, SpaceX launched its Starlink 11-4 mission onboard a Falcon 9 rocket. The rocket's reusable first stage returned safely to the ground and the second stage remained in orbit after deploying the Starlink satellites. It was to deorbit later in a controlled procedure and land somewhere in the Pacific Ocean. But on February 19 it was seen breaking up in the skies over Denmark, England, Poland, and Sweden, with some larger pieces crashing into parts of Poland. After the Polish space agency determined the debris to belong to a SpaceX Falcon 9 rocket, the US Federal Aviation Administration (FAA) was asked about its liability. This was its response:

The FAA determined that all flight events for the SpaceX Starlink 11-4 mission occurred within the scope of SpaceX's licensed activities and that SpaceX satisfied safety at end-of-launch requirements. Per post-launch reporting requirements, SpaceX must identify any discrepancy or anomaly that occurred during the launch to the FAA within 90-days. The FAA has not identified any events that should be classified as a mishap at this time. Licensed flight activities and FAA oversight concluded upon SpaceX's last exercise of control over the Falcon 9 vehicle. SpaceX posted information on its website that the second stage from this launch reentered over Europe. The FAA is not investigating the uncontrolled reentry of the second stage nor the debris found in Poland.

I've spotted a lot of people on the internet (not trolls) describing this response as being in line with Donald Trump's "USA first" attitude and reckless disregard for the consequences of his government's actions and policies on other countries. It's understandable given how his meeting with Zelenskyy on February 28 played out as well as NASA acting administrator Janet Petro's disgusting comment about US plans to "dominate" lunar and cislunar space. However, the FAA's position has been unchanged since at least August 18, 2023, when it issued a "notice of proposed rulemaking" designated 88 FR 56546. Among other things:

The proposed rule would … update definitions relating to commercial space launch and reentry vehicles and occupants to reflect current legislative definitions … as well as implement clarifications to financial responsibility requirements in accordance with the United States Commercial Space Launch Competitiveness Act.

Under Section 401.5 2(i), the notice stated:

(1) Beginning of launch. (i) Under a license, launch begins with the arrival of a launch vehicle or payload at a U.S. launch site.

The FAA's position has likely stayed the same for some duration before the August 2023 date. According to Table 1 in the notice, the "effect of change" of the clarification of the term "Launch", under which Section 401.5 2(i) falls, is:

None. The FAA has been applying these definitions in accordance with the statute since the [US Commercial Space Launch Competitiveness Act 2015] went into effect. This change would now provide regulatory clarity.

Skipping back a bit further, the FAA issued a "final rule" on "Streamlined Launch and Reentry License Requirements" on September 30, 2020. The rule states (pp. 680-681) under Section 450.1 (b) 3:

(i) For an orbital launch of a vehicle without a reentry of the vehicle, launch ends after the licensee’s last exercise of control over its vehicle on orbit, after vehicle component impact or landing on Earth, after activities necessary to return the vehicle or component to a safe condition on the ground after impact or landing, or after activities necessary to return the site to a safe condition, whichever occurs latest;
(ii) For an orbital launch of a vehicle with a reentry of the vehicle, launch ends after deployment of all payloads, upon completion of the vehicle's first steady-state orbit if there is no payload deployment, after vehicle component impact or landing on Earth, after activities necessary to return the vehicle or component to a safe condition on the ground after impact or landing, or after activities necessary to return the site to a safe condition, whichever occurs latest; …

In part B of this document, under the heading "Detailed Discussion of the Final Rule" and further under the sub-heading "End of Launch", the FAA presents the following discussion:

[Commercial Spaceflight Federation] and SpaceX suggested that orbital launch without a reentry in proposed Β§450.3(b)(3)(i) did not need to be separately defined by the regulation, stating that, regardless of the type of launch, something always returns: Boosters land or are disposed, upper stages are disposed. CSF and SpaceX further requested that the FAA not distinguish between orbital and suborbital vehicles for end of launch.
The FAA does not agree because the distinctions in § 450.3(b)(3)(i) and (ii) are necessary due to the FAA's limited authority on orbit. For a launch vehicle that will eventually return to Earth as a reentry vehicle, its on-orbit activities after deployment of its payload or payloads, or completion of the vehicle's first steady-state orbit if there is no payload, are not licensed by the FAA. In addition, the disposal of an upper stage is not a reentry under 51 U.S.C. Chapter 509, because the upper stage does not return to Earth substantially intact.

From 51 USC Chapter 509, Section 401.7:

Reentry vehicle means a vehicle designed to return from Earth orbit or outer space to Earth substantially intact. A reusable launch vehicle that is designed to return from Earth orbit or outer space to Earth substantially intact is a reentry vehicle.

This means Section 450.1 (b) 3(i) under "Streamlined Launch and Reentry License Requirements" of 2020 applies to the uncontrolled deorbiting of the Falcon 9 upper stage in the Starlink 11-4 mission. In particular, according to the FAA, the launch ended "after the licensee’s last exercise of control over its vehicle on orbit", which was the latest relevant event.

Back to the "Detailed Discussion of the Final Rule":

Both CSF and SpaceX proposed β€œend of launch” should be defined on a case-by-case basis in pre-application consultation and specified in the license. The FAA disagrees, in part. The FAA only regulates on a case-by-case basis if the nature of an activity makes it impossible for the FAA to promulgate rules of general applicability. This need has not arisen, as evidenced by decades of FAA oversight of end-of-launch activities. That said, because the commercial space transportation industry continues to innovate, Β§450.3(a) gives the FAA the flexibility to adjust the scope of license, including end of launch, based on unique circumstances as agreed to by the Administrator.

The world currently doesn't have a specific international law or agreement dealing with accountability for space debris that crashes to the earth, including paying for the damages such debris wreaks and imposing penalties on offending launch operators. In light of this fact, it's important to remember the FAA's position β€” even if it seems disagreeable β€” has been unchanged for some time even as it has regularly updated its rulemaking to accommodate private sector innovation within the spirit of the existing law.

Trump is an ass and I'm not holding out for him to look out for the concerns of other countries when pieces of made-in-USA rockets descend in uncontrolled fashion over their territories, damaging property or even taking lives. But the fact that the FAA didn't develop its present position afresh under Trump 2.0, and that it was really developed with feedback from SpaceX and other US-based spaceflight operators, is important: it shows that the agency's attitude towards crashing debris goes beyond ideology, having enjoyed the support of both Democratic and Republican governments over the years.

A Signal container

10 November 2024 at 00:00

Signal is an app for secure and private messaging that is free, open source, and easy to use. It uses strong end-to-end encryption and is used by many activists, journalists, and whistleblowers, as well as by government officials and businesspeople. In short, by everyone who values their privacy. Signal runs on mobile phones with Android and iOS, and also on desktop computers (Linux, Windows, macOS); the desktop version is designed to be linked to your mobile copy of Signal. This lets you use all of Signal's features both on the phone and on the desktop, and all messages, contacts, and so on are synchronized between the two devices. All well and good, but Signal is (unfortunately) tied to a phone number, and as a rule you can run only one copy of Signal on a phone; the same goes for the desktop. Can this limitation be worked around? Certainly, but it takes a small hack. Read on to see how.

Running multiple copies of Signal on a phone

Running multiple copies of Signal on a phone is very easy, but only if you use GrapheneOS. GrapheneOS is a mobile operating system with numerous built-in security mechanisms, designed from the ground up to protect its user's privacy as well as possible. It is open source and highly compatible with Android, but with many improvements that make forensic data extraction, as well as attacks with spyware of the Pegasus and Predator variety, extremely difficult or outright impossible.

GrapheneOS supports multiple profiles (up to 31, plus a guest profile) that are completely isolated from one another. This means you can install different apps in different profiles, keep entirely different contact lists, use one VPN in one profile and a different one (or none at all) in another, and so on.

ReΕ‘itev je torej preprosta. V mobilnem telefonu z GrapheneOS si odpremo nov profil, tam namestimo novo kopijo Signala, v telefon vstavimo drugo SIM kartico in Signal poveΕΎemo z novo Ε‘tevilko.

Once the phone number is registered, you can remove that SIM card and put the old one back. Signal only uses data for its communication (and of course the phone can also be used without any SIM card, on WiFi alone). The phone now has two copies of Signal installed, tied to two different phone numbers, and you can send messages (even between the two of them!) or make calls from either.

Although the profiles are isolated, you can arrange for notifications from the Signal app in the second profile to be delivered even while you are logged into the first profile. Only to write messages or make calls do you have to switch to the right profile on the phone.

Simple, isn't it?

Running multiple copies of Signal on a computer

Now we would of course like something similar on the computer: the ability to run two separate instances of Signal (each tied to its own phone number) under a single user account.

At first glance this looks slightly more complicated, but with the help of virtualization the problem can be solved elegantly. Of course we won't run a whole new virtual machine just for Signal; a container will do.

On Linux, first install the systemd-container package (on Ubuntu systems it is already installed by default).

On the host machine, enable so-called unprivileged user namespaces: run sudo nano /etc/sysctl.d/nspawn.conf and put the following into the file:

kernel.unprivileged_userns_clone=1

Then the systemd sysctl service needs to be reloaded and restarted:

sudo systemctl daemon-reload
sudo systemctl restart systemd-sysctl.service
sudo systemctl status systemd-sysctl.service

…after which we can install debootstrap: sudo apt install debootstrap.

Now create a new container into which we will install Debian (specifically the stable release); in fact only the minimally required parts of the operating system will be installed:

sudo debootstrap --include=systemd,dbus stable /var/lib/machines/debian

We get output roughly like this:

I: Keyring file not available at /usr/share/keyrings/debian-archive-keyring.gpg; switching to https mirror https://deb.debian.org/debian
I: Retrieving InRelease 
I: Retrieving Packages 
I: Validating Packages 
I: Resolving dependencies of required packages...
I: Resolving dependencies of base packages...
I: Checking component main on https://deb.debian.org/debian...
I: Retrieving adduser 3.134
I: Validating adduser 3.134
...
...
...
I: Configuring tasksel-data...
I: Configuring libc-bin...
I: Configuring ca-certificates...
I: Base system installed successfully.

The container with the Debian operating system is now installed, so we start it and set the root user's password:

sudo systemd-nspawn -D /var/lib/machines/debian -U --machine debian

We get the output:

Spawning container debian on /var/lib/machines/debian.
Press Ctrl-] three times within 1s to kill container.
Selected user namespace base 1766326272 and range 65536.
root@debian:~#

Now we are connected to the container's operating system via a virtual terminal, and we enter the following two commands:

passwd
printf 'pts/0\npts/1\n' >> /etc/securetty 

The first command sets the password, and the second allows logins via a so-called local terminal (TTY). Finally we type logout and drop back to the host computer.

Now we need to set up the networking the container will use. The simplest option is to share the host computer's network. We enter the following two commands:

sudo mkdir /etc/systemd/nspawn
sudo nano /etc/systemd/nspawn/debian.nspawn

Into this file we put:

[Network]
VirtualEthernet=no

Now we start the container again with sudo systemctl start systemd-nspawn@debian, or even more simply with machinectl start debian.

We can also list the running containers:

machinectl list
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES
debian  container systemd-nspawn debian 12      -        

1 machines listed.

Oziroma se poveΕΎemo v ta virtualni kontejner: machinectl login debian. Dobimo izpis:

Connected to machine debian. Press ^] three times within 1s to exit session.

Debian GNU/Linux 12 cryptopia pts/1

cryptopia login: root
Password: 

The output shows that we logged in as the user root with the password we set earlier.

Now we install Signal Desktop inside this container:

apt update
apt install wget gpg

wget -O- https://updates.signal.org/desktop/apt/keys.asc | gpg --dearmor > /usr/share/keyrings/signal-desktop-keyring.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/signal-desktop-keyring.gpg] https://updates.signal.org/desktop/apt xenial main' | tee /etc/apt/sources.list.d/signal-xenial.list

apt update
apt install --no-install-recommends signal-desktop
halt

The last command shuts the container down. A fresh copy of the Signal Desktop application is now installed inside it.

Incidentally, we can rename the container to something friendlier if we like, e.g. sudo machinectl rename debian debian-signal. We will then, of course, have to use the same name when starting and entering the container (that is, machinectl login debian-signal).

Now we write a script that starts the container and launches Signal Desktop inside it in such a way that its window shows up on the host computer's desktop.

We create the file with nano /opt/runContainerSignal.sh (storing it, for example, in the /opt directory), with the following contents:

#!/bin/sh
xhost +local:
pkexec systemd-nspawn --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/  \
                      --private-users=pick \
                      --private-users-chown \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox
xhost -local:

The first xhost command allows connections to our display, but only from the local machine; the second xhost command blocks those connections to the display again. We make the script executable (chmod +x /opt/runContainerSignal.sh), and that's it.

Two icons of the Signal Desktop application

Well, not quite yet: we would have to run the script from a terminal, and launching it with a click on an icon is much more convenient.

So we create a .desktop file: nano ~/.local/share/applications/runContainerSignal.desktop. Into it we write the following:

[Desktop Entry]
Type=Application
Name=Signal Container
Exec=/opt/runContainerSignal.sh
Icon=security-high
Terminal=false
Comment=Run Signal Container

…instead of the security-high icon we can use some other one, for example:

Icon=/usr/share/icons/Yaru/scalable/status/security-high-symbolic.svg

Note: the .desktop file is stored in ~/.local/share/applications/, so it is available only to this specific user and not to all users on the computer.

Now we make the .desktop file executable: chmod +x ~/.local/share/applications/runContainerSignal.desktop

We refresh the so-called desktop entries (angl. Desktop Entries) with update-desktop-database ~/.local/share/applications/, and that's it!

Two instances of the Signal Desktop application

When we type β€œSignal Container” into the application launcher, the application icon will appear, and clicking on it starts Signal in the container (launching it will require entering a password).

Now we just link this Signal Desktop to the copy of Signal on the phone, and we can use two copies of the Signal Desktop application on the computer.

What about…?

Unfortunately, in the setup described, access to the camera and audio does not work. Calls will therefore still have to be made from the phone.

Izkaže se namreč, da je povezava kontejnerja z zvočnim sistemom PipeWire in kamero gostiteljskega računalnika neverjetno zapletena (vsaj v moji postavitvi sistema). Če imate namig kako zadevo reőiti, pa mi seveda lahko sporočite. :)

Mozilla Is Worried About the Proposed Fixes for Google’s Search Monopoly

By: Nick Heer
27 November 2024 at 00:46

Michael Kan, PC Magazine:

Mozilla points to a key but less eye-catching proposal from the DOJ to regulate Google’s search business, which a judge ruled as a monopoly in August. In their recommendations, federal prosecutors urged the court to ban Google from offering β€œsomething of value” to third-party companies to make Google the default search engine over their software or devices.

β€œThe proposed remedies are designed to end Google’s unlawful practices and open up the market for rivals and new entrants to emerge,” the DOJ told the court. The problem is that Mozilla earns most of its revenue from royalty deals β€” nearly 86% in 2022 β€” making Google the default Firefox browser search engine.

This is probably another reason why U.S. prosecutors want to jettison Chrome from Google: they want to reduce any benefit it may accrue from trying to fix its illegal search monopoly. But it seems Google’s position in the industry is so entrenched that correcting it will hurt lots of other businesses, too. That does not mean it should not be broken up or that the DOJ’s proposed remedies are wrong, however.

βŒ₯ Permalink

Unix's buffered IO in assembly and in C

By: cks
9 December 2024 at 02:44

Recently on the Fediverse, I said something related to Unix's pre-V7 situation with buffered IO:

[...]

(I think the V1 approach is right for an assembly based minimal OS, while the stdio approach kind of wants malloc() and friends.)

The V1 approach, as documented in its putc.3 and getw.3 manual pages, is that the caller to the buffered IO routines supplies the data area used for buffering, and the library functions merely initialize it and later use it. How you get the data area is up to you and your program; you might, for example, simply have a static block of memory in your BSS segment. You can dynamically allocate this area if you want to, but you don't have to. The V2 and later putchar have a similar approach but this time they contain a static buffer area and you just have to do a bit of initialization (possibly putchar was in V1 too, I don't know for sure).

Stdio of course has a completely different API. In stdio, you don't provide the data area; instead, stdio provides you an opaque reference (a 'FILE *') to the information and buffers it maintains internally. This is an interface that definitely wants some degree of dynamic memory allocation, for example for the actual buffers themselves, and in modern usage most of the FILE objects will be dynamically allocated too.

(The V7 stdio implementation had a fixed set of FILE structs and so would error out if you used too many of them. However, it did use malloc() for the buffer associated with them, in filbuf.c and flsbuf.c.)

You can certainly do dynamic memory allocation in assembly, but I think it's much more natural in C, and certainly the C standard library is more heavyweight than the relatively small and minimal assembly language stuff early Unix programs (written in assembly) seem to have required. So I think it makes a lot of sense that Unix started with a buffering approach where the caller supplies the buffer (and probably doesn't dynamically allocate it), then moved to one where the library does at least some allocation and supplies the buffer (and other data) itself.

Buffered IO in Unix before V7 introduced stdio

By: cks
6 December 2024 at 04:16

I recently read Julia Evans' Why pipes sometimes get "stuck": buffering. Part of the reason is that almost every Unix program does some amount of buffering for what it prints (or writes) to standard output and standard error. For C programs, this buffering is built into the standard library, specifically into stdio, which includes familiar functions like printf(). Stdio is one of the many things that appeared first in Research Unix V7. This might leave you wondering if this sort of IO was buffered in earlier versions of Research Unix and if it was, how it was done.

The very earliest version of Research Unix is V1, and in V1 there is putc.3 (at that point entirely about assembly, since C was yet to come). This set of routines allows you to set up and then use a 'struct' to implement IO buffering for output. There is a similar set of buffered functions for input, in getw.3, and I believe the memory blocks the two sets of functions use are compatible with each other. The V1 manual pages note it as a bug that the buffer wasn't 512 bytes, but also notes that several programs would break if the size was changed; the buffer size will be increased to 512 bytes by V3.

In V2, I believe we still have putc and getw, but we see the first appearance of another approach, in putchr.s. This implements putchar(), which is used by printf() and which (from later evidence) uses an internal buffer (under some circumstances) that has to be explicitly flush()'d by programs. In V3, there's manual pages for putc.3 and getc.3 that are very similar to the V1 versions, which is why I expect these were there in V2 as well. In V4, we have manual pages for both putc.3 (plus getc.3) and putch[a]r.3, and there is also a getch[a]r.3 that's the input version of putchar(). Since we have a V4 manual page for putchar(), we can finally see the somewhat tangled way it works, rather than having to read the PDP-11 assembly. I don't have links to V5 manuals, but the V5 library source says that we still have both approaches to buffered IO.

(If you want to see how the putchar() approach was used, you can look at, for example, the V6 grep.c, which starts out with the 'fout = dup(1);' that the manual page suggests for buffered putchar() usage, and then periodically calls flush().)

In V6, there is a third approach that was added, in /usr/source/iolib, although I don't know if any programs used it. Iolib has a global array of structs, that were statically associated with a limited number of low-numbered file descriptors; an iolib function such as cflush() would be passed a file descriptor and use that to look up the corresponding struct. One innovation iolib implicitly adds is that its copen() effectively 'allocates' the struct for you, in contrast to putc() and getc(), where you supply the memory area and fopen()/fcreate() merely initialize it with the correct information.

Finally V7 introduces stdio and sorts all of this out, at the cost of some code changes. There's still getc() and putc(), but now they take a FILE *, instead of their own structure, and you get the FILE * from things like fopen() instead of supplying it yourself and having a stdio function initialize it. Putchar() (and getchar()) still exist but are now redone to work with stdio buffering instead of their own buffering, and 'flush()' has become fflush() and takes an explicit FILE * argument instead of implicitly flushing putchar()'s buffer, and generally it's not necessary any more. The V7 grep.c still uses printf(), but now it doesn't explicitly flush anything by calling fflush(); it just trusts in stdio.

Using systemd-run to limit something's memory usage in cgroups v2

By: cks
1 December 2024 at 03:59

Once upon a time I wrote an entry about using systemd-run to limit something's RAM consumption. This was back in the days of cgroups v1 (also known as 'non-unified cgroups'), and we're now in the era of cgroups v2 ('unified cgroups') and also ZRAM based swap. This means we want to make some adjustments, especially if you're dealing with programs with obnoxiously large RAM usage.

As before, the basic thing you want to do is run your program or thing in a new systemd user scope, which is done with 'systemd-run --user --scope ...'. You may wish to give it a unit name as well, '--unit <name>', especially if you expect it to persist a while and you want to track it specifically. Systemd will normally automatically clean up this scope when everything in it exits, and the scope is normally connected to your current terminal and otherwise more or less acts normally as an interactive process.

To actually do anything with this, we need to set some systemd resource limits. To limit memory usage, the minimum is a MemoryMax= value. It may also work better to set MemoryHigh= to a value somewhat below the absolute limit of MemoryMax. If you're worried about whatever you're doing running your system out of memory and your system uses ZRAM based swap, you may also want to set a MemoryZSwapMax= value so that the program doesn't chew up all of your RAM by 'swapping' it to ZRAM and filling that up. Without a ZRAM swap limit, you might find that the program actually uses MemoryMax RAM plus your entire ZRAM swap RAM, which might be enough to trigger a more general OOM. So this might be:

systemd-run --user --scope -p MemoryHigh=7G -p MemoryMax=8G -p MemoryZSwapMax=1G ./mach build

(Good luck with building Firefox in merely 8 GBytes of RAM, though. And obviously if you do this regularly, you're going to want to script it.)

If you normally use ZRAM based swap and you're worried about the program running you out of memory that way, you may want to create some actual swap space that the program can be turned loose on. These days, this is as simple as creating a 'swap.img' file somewhere and then swapping onto it:

cd /
dd if=/dev/zero of=swap.img bs=1MiB count=$((4*1024))
mkswap swap.img
swapon /swap.img

(You can use swapoff to stop swapping to this image file after you're done running your big program.)

Then you may want to also limit how much of this swap space the program can use, which is done with a MemorySwapMax= value. I've read both systemd's documentation and the kernel's cgroup v2 memory controller documentation, and I can't tell whether the ZRAM swap maximum is included in the swap maximum or is separate. I suspect that it's included in the swap maximum, but if it really matters you should experiment.

If you also want to limit the program's CPU usage, there are two options. The easiest one to set is CPUQuota=. The drawback of CPU quota limits is that programs may not realize that they're being restricted by such a limit and wind up running a lot more threads (or processes) than they should, increasing the chances of overloading things. The more complex but more legible to programs way is to restrict what CPUs they can run on using taskset(1).

(While systemd has AllowedCPUs=, this is a cgroup setting and doesn't show up in the interface used by taskset and sched_getaffinity(2).)

Systemd also has CPUWeight=, but I have limited experience with it; see fair share scheduling in cgroup v2 for what I know. You might want the special value 'idle' for very low priority programs.

What NFS server threads do in the Linux kernel

By: cks
26 November 2024 at 03:40

If we ignore the network stack and take an abstract view, the Linux kernel NFS server needs to do things at various different levels in order to handle NFS client requests. There is NFS specific processing (to deal with things like the NFS protocol and NFS filehandles), general VFS processing (including maintaining general kernel information like dentries), then processing in whatever specific filesystem you're serving, and finally some actual IO if necessary. In the abstract, there are all sorts of ways to split up the responsibility for these various layers of processing. For example, if the Linux kernel supported fully asynchronous VFS operations (which it doesn't), the kernel NFS server could put all of the VFS operations in a queue and let the kernel's asynchronous 'IO' facilities handle them and notify it when a request's VFS operations were done. Even with synchronous VFS operations, you could split the responsibility between some front end threads that handled the NFS specific side of things and a backend pool of worker threads that handled the (synchronous) VFS operations.

(This would allow you to size the two pools differently, since ideally they have different constraints. The NFS processing is more or less CPU bound, and so sized based on how much of the server's CPU capacity you wanted to use for NFS; the VFS layer would ideally be IO bound, and could be sized based on how much simultaneous disk IO it was sensible to have. There is some hand-waving involved here.)

The actual, existing Linux kernel NFS server takes the much simpler approach. The kernel NFS server threads do everything. Each thread takes an incoming NFS client request (or a group of them), does NFS level things like decoding NFS filehandles, and then calls into the VFS to actually do operations. The VFS will call into the filesystem, still in the context of the NFS server thread, and if the filesystem winds up doing IO, the NFS server thread will wait for that IO to complete. When the thread of execution comes back out of the VFS, the NFS thread then does the NFS processing to generate replies and dispatch them to the network.

This unfortunately makes it challenging to answer the question of how many NFS server threads you want to use. The NFS server threads may be CPU bound (if they're handling NFS requests from RAM and the VFS's caches and data structures), or they may be IO bound (as they wait for filesystem IO to be performed, usually for reading and writing files). When you're IO bound, you probably want enough NFS server threads so that you can wait on all of the IO and still have some threads left over to handle the collection of routine NFS requests that can be satisfied from RAM. When you're CPU bound, you don't want any more NFS server threads than you have CPUs, and maybe you want a bit less.

If you're lucky, your workload is consistently and predictably one or the other. If you're not lucky (and we're not), your workload can be either of these at different times or (if we're really out of luck) both at once. Energetic people with NFS servers that have no other real activity can probably write something that automatically tunes the number of NFS threads up and down in response to a combination of the load average, the CPU utilization, and pressure stall information.

(We're probably just going to set it to the number of system CPUs.)

(After yesterday's question I decided I wanted to know for sure what the kernel's NFS server threads were used for, just in case. So I read the kernel code, which did have some useful side effects such as causing me to learn that the various nfsd4_<operation> functions we sometimes use bpftrace on are doing less than I assumed they were.)

The question of how many NFS server threads you should use (on Linux)

By: cks
25 November 2024 at 04:48

Today, not for the first time, I noticed that one of our NFS servers was sitting at a load average of 8 with roughly half of its overall CPU capacity used. People with experience in Linux NFS servers are now confidently predicting that this is a 16-CPU server, which is correct (it has 8 cores and 2 HT threads per core). They're making this prediction because the normal Linux default number of kernel NFS server threads to run is eight.

(Your distribution may have changed this, and if so it's most likely by changing what's in /etc/nfs.conf, which is the normal place to set this. It can be changed on the fly by writing a new value to /proc/fs/nfsd/threads.)

Our NFS server wasn't saturating its NFS server threads because someone on a NFS client was doing a ton of IO. That might actually have slowed the requests down. Instead, there were some number of programs that were constantly making some number of NFS requests that could be satisfied entirely from (server) RAM, which explains why all of the NFS kernel threads were busy using system CPU (mostly on a spinlock, apparently, according to 'perf top'). It's possible that some of these constant requests came from code that was trying to handle hot reloading, since this is one of the sources of constant NFS 'GetAttr' requests, but I believe there's other things going on.

(Since this is the research side of a university department, we have very little visibility into what the graduate students are running on places like our little SLURM cluster.)

If you search around the Internet, you can find all sorts of advice about what to set the number of NFS server threads to on your Linux NFS server. Many of them involve relatively large numbers (such as this 2024 SuSE advice of 128 threads). Having gone through this recent experience, my current belief is that it depends on what your problem is. In our case, with the NFS server threads all using kernel CPU time and not doing much else, running more threads than we have CPUs seems pointless; all it would do is create unproductive contention for CPU time. If NFS clients are going to totally saturate the fileserver with (CPU-eating) requests even at 16 threads, possibly we should run fewer threads than CPUs, so that user level management operations have some CPU available without contending against the voracious appetite of the kernel NFS server.

(Some advice suggests some number of server NFS kernel threads per NFS client. I suspect this advice is not used in places with tens or hundreds of NFS clients, which is our situation.)

To figure out what your NFS server's problem is, I think you're going to need to look at things like pressure stall information and information on the IO rate and the number of IO requests you're seeing. You can't rely on overall iowait numbers, because Linux iowait is a conservative lower bound. IO pressure stall information is much better for telling you if some NFS threads are blocked on IO even while others are active.

(Unfortunately the kernel NFS threads are not in a cgroup of their own, so you can't get per-cgroup pressure stall information for them. I don't know if you can manually move them into a cgroup, or if systemd would cooperate with this if you tried it.)

PS: In theory it looks like a potentially reasonable idea to run roughly at least as many NFS kernel threads as you have CPUs (maybe a few less so you have some user level CPU left over). However, if you have a lot of CPUs, as you might on modern servers, this might be too many if your NFS server gets flooded with an IO-heavy workload. Our next generation NFS fileserver hardware is dual socket, 12 cores per socket, and 2 threads per core, for a total of 48 CPUs, and I'm not sure we want to run anywhere near than many NFS kernel threads. Although we probably do want to run more than eight.

Ubuntu LTS (server) releases have become fairly similar to each other

By: cks
19 November 2024 at 04:15

Ubuntu 24.04 LTS was released this past April, so one of the things we've been doing since then is building out our install system for 24.04 and then building a number of servers using 24.04, both new servers and servers that used to be built on 20.04 or 22.04. What has been quietly striking about this process is how few changes there have been for us between 20.04, 22.04, and 24.04. Our customization scripts needed only very small changes, and many of the instructions for specific machines could be revised by just searching and replacing either '20.04' or '22.04' with '24.04'.

Some of this lack of changes is illusory, because when I actually look at the differences between our 22.04 and 24.04 postinstall scripting, there are a number of changes, adjustments, and new fixes (and a big change in having to install Python 2 ourselves). Even when we didn't do anything there were decisions to be made, like whether or not we would stick with the Ubuntu 24.04 default of socket activated SSH (our decision so far is to stick with 24.04's default for less divergence from upstream). And there were also some changes to remove obsolete things and restructure how we change things like the system-wide SSH configuration; these aren't forced by the 22.04 to 24.04 change, but building the install setup for a new release is the right time to rethink existing pieces.

However, plenty of this lack of changes is real, and I credit a lot of that to systemd. Systemd has essentially standardized a lot of the init process and in the process, substantially reduced churn in it. For a relevant example, our locally developed systemd units almost never need updating between Ubuntu versions; if it worked in 20.04, it'll still work just as well in 24.04 (including its relationships to various other units). Another chunk of this lack of changes is that the current 20.04+ Ubuntu server installer has maintained a stable configuration file and relatively stable feature set (at least of features that we want to use), resulting in very little needing to be modified in our spin of it as we moved from 20.04 to 22.04 to 24.04. And the experience of going through the server installer has barely changed; if you showed me an installer screen from any of the three releases, I'm not sure I could tell you which it's from.

I generally feel that this is a good thing, at least on servers. A normal Linux server setup and the software that you run on it has broadly reached a place of stability, where there's no particular need to make really visible changes or to break backward compatibility. It's good for us that moving from 20.04 to 22.04 to 24.04 is mostly about getting more recent kernels and more up to date releases of various software packages, and sometimes having bugs fixed so that things like bpftrace work better.

(Whether this is 'welcome maturity' or 'unwelcome stasis' is probably somewhat in the eye of the observer. And there are quiet changes afoot behind the scenes, like the change from iptables to nftables.)

Complications in supporting 'append to a file' in a NFS server

By: cks
8 November 2024 at 04:14

In the comments of my entry on the general problem of losing network based locks, an interesting side discussion has happened between commentator abel and me over NFS servers (not) supporting the Unix O_APPEND feature. The more I think about it, the more I think it's non-trivial to support well in an NFS server and that there are some subtle complications (and probably more than I haven't realized). I'm mostly going to restrict this to something like NFS v3, which is what I'm familiar with.

The basic Unix semantics of O_APPEND are that when you perform a write(), all of your data is immediately and atomically put at the current end of the file, and the file's size and maximum offset are immediately extended to the end of your data. If you and I do a single append write() of 128 Mbytes to the same file at the same time, either all of my 128 Mbytes winds up before yours or vice versa; your and my data will never wind up intermingled.

This basic semantics is already a problem for NFS because NFS (v3) connections have a maximum size for single NFS 'write' operations and that size may be (much) smaller than the user level write(). Without a multi-operation transaction of some sort, we can't reliably perform append write()s of more data than will fit in a NFS write operation; either we fail those 128 Mbyte writes, or we have the possibility that data from you and I will be intermingled in the file.

In NFS v2, all writes were synchronous (or were supposed to be, servers sometimes lied about this). NFS v3 introduced the idea of asynchronous, buffered writes that were later committed by clients. NFS servers are normally permitted to discard asynchronous writes that haven't yet been committed by the client; when the client tries to commit them later, the NFS server rejects the commit and the client resends the data. This works fine when the client's request has a definite position in the file, but it has issues if the client's request is a position-less append write. If two clients do append writes to the same file, first A and then B after it, the server discards both, and then client B is the first one to go through the 'COMMIT, fail, resend' process, where does its data wind up? It's not hard to wind up with situations where a third client that's repeatedly reading the file will see inconsistent results, where first it sees A's data then B's and then later either it sees B's data before A's or B's data without anything from A (not even a zero-filled gap in the file, the way you'd get with ordinary writes).

(While we can say that NFS servers shouldn't ever deliberately discard append writes, one of the ways that this happens is that the server crashes and reboots.)

You can get even more fun ordering issues created by retrying lost writes if there is another NFS client involved that is doing manual append writes by finding out the current end of file and writing at it. If A and B do append writes, C does a manual append write, all writes are lost before they're committed, B redoes, C redoes, and then A redoes, a natural implementation could easily wind up with B's data, an A data sized hole, C's data, and then A's data appended after C's.

This also creates server side ordering dependencies for potentially discarding uncommitted asynchronous write data, ones that a NFS server can normally make independently. If A appended a lot of data and then B appended a little bit, you probably don't want to discard A's data but not B's, because there's no guarantee that A will later show up to fail a COMMIT and resend it (A could have crashed, for example). And if B requests a COMMIT, you probably want to commit A's data as well, even if there's much more of it.

One way around this would be to adopt a more complex model of append writes over NFS, where instead of the client requesting an append write, it requests 'write this here but fail if this is not the current end of file'. This would give all NFS writes a definite position in the file at the cost of forcing client retries on the initial request (if the client later has to repeat the write because of a failed commit, it must carefully strip this flag off). Unfortunately a file being appended to from multiple clients at a high rate would probably result in a lot of client retries, with no guarantee that a given client would ever actually succeed.

(You could require all append writes to be synchronous, but then this would do terrible things to NFS server performance for potentially common use of append writes, like appending log lines to a shared log file from multiple machines. And people absolutely would write and operate programs like that if append writes over NFS were theoretically reliable.)

Losing NFS locks and the SunOS SIGLOST signal

By: cks
7 November 2024 at 02:48

NFS is a network filesystem that famously also has a network locking protocol associated with it (or part of it, for NFSv4). This means that NFS has to consider the issue of the NFS client losing a lock that it thinks it holds. In NFS, clients losing locks normally happens as part of NFS(v3) lock recovery, triggered when a NFS server reboots. On server reboot, clients are told to re-acquire all of their locks, and this re-acquisition can explicitly fail (as well as going wrong in various ways that are one way to get stuck NFS locks). When a NFS client's kernel attempts to reclaim a lock and this attempt fails, it has a problem. Some process on the local machine thinks that it holds a (NFS) lock, but as far as the NFS server and other NFS clients are concerned, it doesn't.

Sun's original version of NFS dealt with this problem with a special signal, SIGLOST. When the NFS client's kernel detected that a NFS lock had been lost, it sent SIGLOST to whatever process held the lock. SIGLOST was a regular signal, so by default the process would exit abruptly; a process that wanted to do something special could register a signal handler for SIGLOST and then do whatever it could. SIGLOST appeared no later than SunOS 3.4 (cf) and still lives on today in Illumos, where you can find this discussed in uts/common/klm/nlm_client.c and uts/common/fs/nfs/nfs4_recovery.c (and it's also mentioned in fcntl(2)). The popularity of actually handling SIGLOST may be indicated by the fact that no program in the Illumos source tree seems to set a signal handler for it.

Other versions of Unix mainly ignore the situation. The Linux kernel has a specific comment about this in fs/lockd/clntproc.c, which very briefly talks about the issue and picks ignoring it (apart from logging the kernel message "lockd: failed to reclaim lock for ..."). As far as I can tell from reading FreeBSD's sys/nlm/nlm_advlock.c, FreeBSD silently ignores any problems when it goes through the NFS client process of reclaiming locks.

(As far as I can see, NetBSD and OpenBSD don't support NFS locks on clients at all, rendering the issue moot. I don't know if POSIX locks fail on NFS mounted filesystems or if they work but create purely local locks on that particular NFS client, although I think it's the latter.)

On the surface this seems rather bad, and certainly worse than the Sun approach of SIGLOST. However, I'm not sure that SIGLOST is all that great either, because it has some problems. First, what you can do in a signal handler is very constrained; basically all that a SIGLOST handler can do is set a variable and hope that the rest of the code will check it before it does anything dangerous. Second, programs may hold multiple (NFS) locks and SIGLOST doesn't tell you which lock you lost; as far as I know, there's no way of telling. If your program gets a SIGLOST, all you can do is assume that you lost all of your locks. Third, file locking may quite reasonably be used inside libraries in a way that is hidden from callers by the library's API, but signals and handling signals is global to the entire program. If taking a file lock inside a library exposes the entire program to SIGLOST, you have a collection of problems (which ones depend on whether the program has its own file locks and whether or not it has installed a SIGLOST handler).
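
The first problem, 'set a flag and hope the rest of the code checks it', can be sketched like this. SIGLOST is not defined on Linux, so this illustration falls back to SIGUSR1 as a stand-in:

```python
import os
import signal

lost_lock = False

def on_siglost(signum, frame):
    # about the only safe and useful thing a handler can do: record the
    # fact; we are not told which lock (if any specific one) was lost
    global lost_lock
    lost_lock = True

# SIGLOST exists only on some systems (e.g. illumos); use a stand-in elsewhere
sig = getattr(signal, "SIGLOST", signal.SIGUSR1)
signal.signal(sig, on_siglost)

os.kill(os.getpid(), sig)  # simulate the kernel's notification
print(lost_lock)  # True, but we must assume *all* our locks are gone
```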

This collection of problems may go part of the way to explain why no Illumos programs actually set a SIGLOST handler and why other Unixes simply ignore the issue. A kernel that uses SIGLOST essentially means 'your program dies if it loses a lock', and it's not clear that this is better than 'your program optimistically continues', especially in an environment where a NFS client losing a NFS lock is rare (and letting the program continue is certainly simpler for the kernel).

A rough equivalent to "return to last power state" for libvirt virtual machines

By: cks
5 November 2024 at 04:13

Physical machines can generally be set in their BIOS so that if power is lost and then comes back, the machine returns to its previous state (either powered on or powered off). The actual mechanics of this are complicated (also), but the idealized version is easily understood and convenient. These days I have a revolving collection of libvirt based virtual machines running on a virtualization host that I periodically reboot due to things like kernel updates, and for a while I have quietly wished for some sort of similar libvirt setting for its virtual machines.

It turns out that this setting exists, sort of, in the form of the libvirt-guests systemd service. If enabled, it can be set to restart all guests that were running when the system was shut down, regardless of whether or not they're set to auto-start on boot (none of my VMs are). This is a global setting that applies to all virtual machines that were running at the time the system went down, not one that can be applied to only some VMs, but for my purposes this is sufficient; it makes it less of a hassle to reboot the virtual machine host.

Linux being Linux, life is not quite this simple in practice, as is illustrated by comparing my Ubuntu VM host machine with my Fedora desktops. On Ubuntu, libvirt-guests.service defaults to enabled, it is configured through /etc/default/libvirt-guests (the Debian standard), and it defaults to not automatically restarting virtual machines. On my Fedora desktops, libvirt-guests.service is not enabled by default, it is configured through /etc/sysconfig/libvirt-guests (as in the official documentation), and it defaults to automatically restarting virtual machines. Another difference is that Ubuntu has a /etc/default/libvirt-guests that has commented out default values, while Fedora has no /etc/sysconfig/libvirt-guests so you have to read the script to see what the defaults are (on Fedora, this is /usr/libexec/libvirt-guests.sh, on Ubuntu /usr/lib/libvirt/libvirt-guests.sh).

I've changed my Ubuntu VM host machine so that it will automatically restart previously running virtual machines on reboot, because generally I leave things running intentionally there. I haven't touched my Fedora machines so far because by and large I don't have any regularly running VMs, so if a VM is still running when I go to reboot the machine, it's most likely because I forgot I had it up and hadn't gotten around to shutting it off.
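
For reference, the change amounts to something like the following in /etc/default/libvirt-guests (these are the variable names the libvirt-guests script reads; check your distribution's copy of the script for the authoritative list and defaults):

```shell
# /etc/default/libvirt-guests (Fedora: /etc/sysconfig/libvirt-guests)

# On boot, restart all guests that were running at shutdown,
# regardless of their autostart flag.
ON_BOOT=start

# At host shutdown, cleanly shut guests down rather than suspending them.
ON_SHUTDOWN=shutdown
SHUTDOWN_TIMEOUT=300
```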

(My pre-libvirt virtualization software was much too heavy-weight for me to leave a VM running without noticing, but libvirt VMs have a sufficiently low impact on my desktop experience that I can and have left them running without realizing it.)

The history of Unix's ioctl and signal about window sizes

By: cks
4 November 2024 at 03:38

One of the somewhat obscure features of Unix is that the kernel has a specific interface to get (and set) the 'window size' of your terminal, and can also send a Unix signal to your process when that size changes. The official POSIX interface for the former is tcgetwinsize(), but in practice actual Unixes have a standard tty ioctl for this, TIOCGWINSZ (see eg Linux ioctl_tty(2) (also) or FreeBSD tty(4)). The signal is officially standardized by POSIX as SIGWINCH, which is the name it always has had. Due to a Fediverse conversation, I looked into the history of this today, and it turns out to be more interesting than I expected.

(The inclusion of these interfaces in POSIX turns out to be fairly recent.)
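
As a quick sketch of the interface in action, here's a small Python example that creates a pty pair (so it doesn't depend on being run from a real terminal), sets the window size with the companion TIOCSWINSZ ioctl, and reads it back with TIOCGWINSZ:

```python
import fcntl
import os
import struct
import termios

# a pty pair stands in for a real terminal window
master, slave = os.openpty()

# the winsize struct is four unsigned shorts: rows, cols, xpixel, ypixel
fcntl.ioctl(slave, termios.TIOCSWINSZ, struct.pack("HHHH", 24, 80, 0, 0))

# read the size back with TIOCGWINSZ, the ioctl discussed above
raw = fcntl.ioctl(slave, termios.TIOCGWINSZ, b"\0" * 8)
rows, cols, _, _ = struct.unpack("HHHH", raw)
print(rows, cols)  # 24 80
```

A full-screen program would additionally install a SIGWINCH handler and re-issue TIOCGWINSZ whenever the signal arrives.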

As far as I can tell, 4.2 BSD did not have either TIOCGWINSZ or SIGWINCH (based on its sigvec(2) and tty(4) manual pages). Both of these appear in the main BSD line in 4.3 BSD, where sigvec(2) has added SIGWINCH (as the first new signal along with some others) and tty(4) has TIOCGWINSZ. This timing makes a certain amount of sense in Unix history. At the time of 4.2 BSD's development and release, people were connecting to Unix systems using serial terminals, which had more or less fixed sizes that were covered by termcap's basic size information. By the time of 4.3 BSD in 1986, Unix workstations existed and with them, terminal windows that could have their size changed on the fly; a way of finding out (and changing) this size was an obvious need, along with a way for full-screen programs like vi to get notified if their terminal window was resized on the fly.

However, as far as I can tell 4.3 BSD itself did not originate SIGWINCH, although it may be the source of TIOCGWINSZ. The FreeBSD project has manual pages for a variety of Unixes, including 'Sun OS 0.4', which seems to be an extremely early release from early 1983. This release has a signal(2) with a SIGWINCH signal (using signal number 28, which is what 4.3 BSD will use for it), but no (documented) TIOCGWINSZ. However, it does have some programs that generate custom $TERMCAP values with the right current window sizes.

The Internet Archive has a variety of historical material from Sun Microsystems, including (some) documentation for both SunOS 2.0 and SunOS 3.0. This documentation makes it clear that the primary purpose of SIGWINCH was to tell graphical programs that their window (or one of them) had been changed, and they should repaint the window or otherwise refresh the contents (a program with multiple windows didn't get any indication of which window was damaged; the programming advice is to repaint them all). The SunOS 2.0 tgetent() termcap function will specifically update what it gives you with the current size of your window, but as far as I can tell there's no other documented support of getting window sizes; it's not mentioned in tty(4) or pty(4). Similar wording appears in the SunOS 3.0 Unix Interface Reference Manual.

(There are PDFs of some SunOS documentation online (eg), and up through SunOS 3.5 I can't find any mention of directly getting the 'window size'. In SunOS 4.0, we finally get a TIOCGWINSZ, documented in termio(4). However, I have access to SunOS 3.5 source, and it does have a TIOCGWINSZ ioctl, although that ioctl isn't documented. It's entirely likely that TIOCGWINSZ was added (well) before SunOS 3.5.)

According to this Git version of the original BSD development history, BSD itself added both SIGWINCH and TIOCGWINSZ at the end of 1984. The early SunOS had SIGWINCH and it may well have had TIOCGWINSZ as well, so it's possible that BSD got both from SunOS. It's also possible that early SunOS had a different (terminal) window size mechanism than TIOCGWINSZ, one more specific to their window system, and the UCB CSRG decided to create a more general mechanism that Sun then copied back by the time of SunOS 3.5 (possibly before the official release of 4.3 BSD, since I suspect everyone in the BSD world was talking to each other at that time).

PS: SunOS also appears to be the source of the mysteriously missing signal 29 in 4.3 BSD (mentioned in my entry on how old various Unix signals are). As described in the SunOS 3.4 sigvec() manual page, signal 29 is 'SIGLOST', "resource lost (see lockd(8C))". This appears to have been added at some point between the initial SunOS 3.0 release and SunOS 3.4, but I don't know exactly when.

Notes on the compatibility of crypted passwords across Unixes in late 2024

By: cks
2 November 2024 at 02:31

For years now, all sorts of Unixes have been able to support better password 'encryption' schemes than the basic old crypt(3) salted-mutant-DES approach that Unix started with (these days it's usually called 'password hashing'). However, the support for specific alternate schemes varies from Unix to Unix, and has for many years. Back in 2010 I wrote some notes on the situation at the time; today I want to look at the situation again, since password hashing is on my mind right now.

The most useful resource for cross-Unix password hash compatibility is Wikipedia's comparison table. For Linux, support varies by distribution based on their choice of C library and what version of libxcrypt they use, and you can usually see a list in crypt(5), and pam_unix may not support using all of them for new passwords. For FreeBSD, their support is documented in crypt(3). In OpenBSD, this is documented in crypt(3) and crypt_newhash(3), although there isn't much to read since current OpenBSD only lists support for 'Blowfish', which for password hashing is also known as bcrypt. On Illumos, things are more or less documented in crypt(3), crypt.conf(5), and crypt_unix(7) and associated manual pages; the Illumos section 7 index provides one way to see what seems to be supported.

System administrators not infrequently wind up wanting cross-Unix compatibility of their local encrypted passwords. If you don't care about your shared passwords working on OpenBSD (or NetBSD), then the 'sha512' scheme is your best bet; it basically works everywhere these days. If you do need to include OpenBSD or NetBSD, you're stuck with bcrypt, but even then there may be problems because bcrypt is actually several schemes, as Wikipedia covers.
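
You can experiment with sha512crypt hashes from Python's standard crypt module, assuming a libc that supports the scheme and a Python before 3.13 (where the module was removed after being deprecated in 3.11):

```python
import crypt  # stdlib on Unix; deprecated in 3.11, removed in 3.13

# hash a new password with sha512crypt, recognizable by its $6$ prefix
hashed = crypt.crypt("s3kr1t", crypt.mksalt(crypt.METHOD_SHA512))
print(hashed[:3])  # $6$

# checking needs no knowledge of the scheme: pass the stored hash as the
# 'salt' and the libc picks the algorithm from the $id$ prefix
print(crypt.crypt("s3kr1t", hashed) == hashed)  # True
```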

Some recent Linux distributions seem to be switching to 'yescrypt' by default (including Debian, which means downstream distributions like Ubuntu have also switched). Yescrypt in Ubuntu is now old enough that it's probably safe to use in an all-Ubuntu environment, although your distance may vary if you have 18.04 or earlier systems. Yescrypt is not yet available in FreeBSD and may never be added to OpenBSD or NetBSD (my impression is that OpenBSD is not a fan of having lots of different password hashing algorithms and prefers to focus on one that they consider secure).

(Compared to my old entry, I no longer particularly care about the non-free Unixes, including macOS. Even Wikipedia doesn't bother trying to cover AIX. For our local situation, we may someday want to share passwords to FreeBSD machines, but we're very unlikely to care about sharing passwords to OpenBSD machines since we currently only use them in situations where having their own stand-alone passwords is a feature, not a bug.)

Pam_unix and your system's supported password algorithms

By: cks
1 November 2024 at 03:15

The Linux login passwords that wind up in /etc/shadow can be encrypted (well, hashed) with a variety of algorithms, which you can find listed (and sort of documented) in places like Debian's crypt(5) manual page. Generally the choice of which algorithm is used to hash (new) passwords (for example, when people change them) is determined by an option to the pam_unix PAM module.

You might innocently think, as I did, that all of the algorithms your system supports will be supported by pam_unix, or more exactly will be available for new passwords (ie, what you or your distribution control with an option to pam_unix). It turns out that this is not the case some of the time (or if it is actually the case, the pam_unix manual page can be inaccurate). This is surprising because pam_unix is the thing that handles hashed passwords (both validating them and changing them), and you'd think its handling of them would be symmetric.

As I found out today, this isn't necessarily so. As documented in the Ubuntu 20.04 crypt(5) manual page, 20.04 supports yescrypt in crypt(3) (sadly Ubuntu's manual page URL doesn't seem to work). This means that the Ubuntu 20.04 pam_unix can (or should) be able to accept yescrypt hashed passwords. However, the Ubuntu 20.04 pam_unix(8) manual page doesn't list yescrypt as one of the available options for hashing new passwords. If you look only at the 20.04 pam_unix manual page, you might (incorrectly) assume that a 20.04 system can't deal with yescrypt based passwords at all.

At one level, this makes sense once you know that pam_unix and crypt(3) come from different packages and handle different parts of the work of checking existing Unix password and hashing new ones. Roughly speaking, pam_unix can delegate checking passwords to crypt(3) without having to care how they're hashed, but to hash a new password with a specific algorithm it has to know about the algorithm, have a specific PAM option added for it, and call some functions in the right way. It's quite possible for crypt(3) to get ahead of pam_unix for a new password hashing algorithm, like yescrypt.

(Since they're separate packages, pam_unix may not want to implement this for a new algorithm until a crypt(3) that supports it is at least released, and then pam_unix itself will need a new release. And I don't know if linux-pam can detect whether or not yescrypt is supported by crypt(3) at build time (or at runtime).)

PS: If you have an environment with a shared set of accounts and passwords (whether via LDAP or your own custom mechanism) and a mixture of Ubuntu versions (maybe also with other Linux distribution versions), you may want to be careful about using new password hashing schemes, even once it's supported by pam_unix on your main systems. The older some of your Linuxes are, the more you'll want to check their crypt(3) and crypt(5) manual pages carefully.

Linux's /dev/disk/by-id unfortunately often puts the transport in the name

By: cks
28 October 2024 at 03:24

Filippo Valsorda ran into an issue that involved, in part, the naming of USB disk drives. To quote the relevant bit:

I can't quite get my head around the zfs import/export concept.

When I replace a drive I like to first resilver the new one as a USB drive, then swap it in. This changes the device name (even using by-id).

[...]

My first reaction was that something funny must be going on. My second reaction was to look at an actual /dev/disk/by-id with a USB disk, at which point I got a sinking feeling that I should have already recognized from a long time ago. If you look at your /dev/disk/by-id, you will mostly see names that start with things like 'ata-', 'scsi-0ATA-', 'scsi-1ATA', and maybe 'usb-' (and perhaps 'nvme-', but that's a somewhat different kettle of fish). All of these names have the problem that they burn the transport (how you talk to the disk) into the /dev/disk/by-id, which is supposed to be a stable identifier for the disk as a standalone thing.

As Filippo Valsorda's case demonstrates, the problem is that some disks can move between transports. When this happens, the theoretically stable name of the disk changes; what was 'usb-' is now likely 'ata-' or vice versa, and in some cases other transformations may happen. Your attempt to use a stable name has failed and you will likely have problems.

Experimentally, there seem to be some /dev/disk/by-id names that are more stable. Some but not all of our disks have 'wwn-' names (one USB attached disk I can look at doesn't). Our Ubuntu based systems have 'scsi-<hex digits>' and 'scsi-SATA-<disk id>' names, but one of my Fedora systems with SATA drives has only the 'scsi-<hex>' names and the other one has neither. One system we have a USB disk on has no names for the disk other than 'usb-' ones. It seems clear that it's challenging at best to give general advice about how a random Linux user should pick truly stable /dev/disk/by-id names, especially if you have USB drives in the picture.
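
To make the failure mode concrete, here's an illustrative sketch of how the transport prefix dominates a by-id name (the disk model and serial below are made up):

```python
# the transport prefixes commonly seen in /dev/disk/by-id names
TRANSPORTS = ("ata-", "usb-", "scsi-", "nvme-")

def split_by_id(name):
    """Split a by-id name into (transport, disk identifier)."""
    for t in TRANSPORTS:
        if name.startswith(t):
            return t.rstrip("-"), name[len(t):]
    return None, name

# the same physical disk, seen over two transports, gets two by-id names
on_usb = split_by_id("usb-WDC_WD40EFRX_WD-WCC7K0000000")
on_sata = split_by_id("ata-WDC_WD40EFRX_WD-WCC7K0000000")
print(on_usb[1] == on_sata[1])  # True: same disk identifier...
print(on_usb[0], on_sata[0])    # usb ata (...two different 'stable' names)
```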

(See also Persistent block device naming in the Arch Wiki.)

This whole current situation seems less than ideal, to put it one way. It would be nice if disks (and partitions on them) had names that were as transport independent and usable as possible, especially since most disks have theoretically unique serial numbers and model names available (and if you're worried about cross-transport duplicates, you should already be at least as worried about duplicates within the same type of transport).

PS: You can find out what information udev knows about your disks with 'udevadm info --query=all --name=/dev/...' (from, via, by coincidence). The information for a SATA disk differs between my two Fedora machines (one of them has various SCSI_* and ID_SCSI* stuff and the other doesn't), but I can't see any obvious reason for this.

Using pam_access to sometimes not use another PAM module

By: cks
26 October 2024 at 02:40

Suppose that you want to authenticate SSH logins to your Linux systems using some form of multi-factor authentication (MFA). The normal way to do this is to use 'password' authentication and then in the PAM stack for sshd, use both the regular PAM authentication module(s) of your system and an additional PAM module that requires your MFA (in another entry about this I used the module name pam_mfa). However, in your particular MFA environment it's been decided that you don't have to require MFA for logins from some of your other networks or systems, and you'd like to implement this.

Because your MFA happens through PAM and the details of this are opaque to OpenSSH's sshd, you can't directly implement skipping MFA through sshd configuration settings. If sshd winds up doing password based authentication at all, it will run your full PAM stack and that will challenge people for MFA. So you must implement sometimes skipping your MFA module in PAM itself. Fortunately there is a PAM module we can use for this, pam_access.

The usual way to use pam_access is to restrict or allow logins (possibly only some logins) based on things like the source address people are trying to log in from (in this, it's sort of a superset of the old tcpwrappers). How this works is configured through an access control file. We can (ab)use this basic matching in combination with the more advanced form of PAM controls to skip our PAM MFA module if pam_access matches something.

What we want looks like this:

auth  [success=1 default=ignore]  pam_access.so noaudit accessfile=/etc/security/access-nomfa.conf
auth  requisite  pam_mfa

Pam_access itself will 'succeed' as a PAM module if the result of processing our access-nomfa.conf file is positive. When this happens, we skip the next PAM module, which is our MFA module. If it 'fails', we ignore the result, and as part of ignoring the result we tell pam_access to not report failures.

Our access-nomfa.conf file will have things like:

# Everyone skips MFA for internal networks
+:ALL:192.168.0.0/16 127.0.0.1

# Ensure we fail otherwise.
-:ALL:ALL

We list the networks we want to allow password logins without MFA from, and then we have to force everything else to fail. (If you leave this off, everything passes, either explicitly or implicitly.)

As covered in the access.conf manual page, you can get quite sophisticated here. For example, you could have people who always had to use MFA, even from internal machines. If they were all in a group called 'mustmfa', you might start with:

-:(mustmfa):ALL

If you get at all creative with your access-nomfa.conf, I strongly suggest writing a lot of comments to explain everything. Your future self will thank you.

Unfortunately but entirely reasonably, the information about the remote source of a login session doesn't pass through to later PAM authentication done by sudo and su commands that you do in the session. This means that you can't use pam_access to not give MFA challenges on su or sudo to people who are logged in from 'trusted' areas.

(As far as I can tell, the only information 'pam_access' gets about the 'origin' of a su is the TTY, which is generally not going to be useful. You can probably use this to not require MFA on su or sudo that are directly done from logins on the machine's physical console or serial console.)

Having an emergency backup DNS resolver with systemd-resolved

By: cks
25 October 2024 at 03:08

At work we have a number of internal DNS resolvers, which you very much want to use to resolve DNS names if you're inside our networks for various reasons (including our split-horizon DNS setup). Purely internal DNS names aren't resolvable by the outside world at all, and some DNS names resolve differently. However, at the same time a lot of the host names that are very important to me are in our public DNS because they have public IPs (sort of for historical reasons), and so they can be properly resolved if you're using external DNS servers. This leaves me with a little bit of a paradox; on the one hand, my machines must resolve our DNS zones using our internal DNS servers, but on the other hand if our internal DNS servers aren't working for some reason (or my home machine can't reach them) it's very useful to still be able to resolve the DNS names of our servers, so I don't have to memorize their IP addresses.

A while back I switched to using systemd-resolved on my machines. Systemd-resolved has a number of interesting virtues, including that it has fast (and centralized) failover from one upstream DNS resolver to another. My systemd-resolved configuration is probably a bit unusual, in that I have a local resolver on my machines, so resolved's global DNS resolution goes to it and then I add a layer of (nominally) interface-specific DNS domain overrides that point to our internal DNS resolvers.

(This doesn't give me perfect DNS resolution, but it's more resilient and under my control than routing everything to our internal DNS resolvers, especially for my home machine.)

Somewhat recently, it occurred to me that I could deal with the problem of our internal DNS resolvers all being unavailable by adding '127.0.0.1' as an additional potential DNS server for my interface specific list of our domains. Obviously I put it at the end, where resolved won't normally use it. But with it there, if all of the other DNS servers are unavailable I can still try to resolve our public DNS names with my local DNS resolver, which will go out to the Internet to talk to various authoritative DNS servers for our zones.

The drawback with this emergency backup approach is that systemd-resolved will stick with whatever DNS server it's currently using unless that DNS server stops responding. So if resolved switches to 127.0.0.1 for our zones, it's going to keep using it even after the other DNS resolvers become available again. I'll have to notice that and manually fiddle with the interface specific DNS server list to remove 127.0.0.1, which would force resolved to switch to some other server.
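
The interface-specific setup, and the later manual fiddling, can both be done with resolvectl. The interface name, server IPs, and domain below are placeholders for your own values:

```shell
# interface-specific DNS servers for our internal domains, with
# 127.0.0.1 (the local resolver) last as the emergency backup
resolvectl dns enp5s0 10.1.1.1 10.1.1.2 127.0.0.1
resolvectl domain enp5s0 '~example.org'

# later, to push resolved off 127.0.0.1 once the real servers are
# reachable again, set the list without it
resolvectl dns enp5s0 10.1.1.1 10.1.1.2
```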

(As far as I can tell, the current systemd-resolved correctly handles the situation where an interface says that '127.0.0.1' is the DNS resolver for it, and doesn't try to force queries to 127.0.0.1:53 to go out that interface. My early 2013 notes say that this sometimes didn't work, but I failed to write down the specific circumstances.)

Doing basic policy based routing on FreeBSD with PF rules

By: cks
24 October 2024 at 03:26

Suppose, not hypothetically, that you have a FreeBSD machine that has two interfaces and these two interfaces are reached through different firewalls. You would like to ping both of the interfaces from your monitoring server because both of them matter for the machine's proper operation, but to make this work you need replies to your pings to be routed out the right interface on the FreeBSD machine. This is broadly known as policy based routing and is often complicated to set up. Fortunately FreeBSD's version of PF supports a basic version of this, although it's not well explained in the FreeBSD pf.conf manual page.

To make our FreeBSD machine reply properly to our monitoring machine's ICMP pings, or in general to its traffic, we need a stateful 'pass' rule with a 'reply-to':

B_IF="emX"
B_IP="10.x.x.x"
B_GW="10.x.x.254"
B_SUBNET="10.x.x.0/24"

pass in quick on $B_IF \
  reply-to ($B_IF $B_GW) \
  inet from ! $B_SUBNET to $B_IP \
  keep state

(Here $B_IP is the machine's IP on this second interface, and we also need the second interface, the gateway for the second interface's subnet, and the subnet itself.)

As I discovered, you must put the 'reply-to' where it is here, although as far as I can tell the FreeBSD pf.conf manual page will only tell you that if you read the full BNF. If you put it at the end the way you might read the text description, you will get only opaque syntax errors.

We must specifically exclude traffic from the subnet itself to us, because otherwise this rule will faithfully send replies to other machines on the same subnet off to the gateway, which either won't work well or won't work at all. You can restrict the PF rule more narrowly, for example 'from { IP1 IP2 IP3 }' if those are the only off-subnet IPs that are supposed to be talking to your secondary interface.

(You may also want to match only some ports here, unless you want to give all incoming traffic on that interface the ability to talk to everything on the machine. This may require several versions of this rule, basically sticking the 'reply-to ...' bit into every 'pass in quick on ...' rule you have for that interface.)

This PF rule only handles incoming connections (including implicit ones from ICMP and UDP traffic). If we want to be able to route our outgoing traffic over our secondary interface by selecting a source address when you do things, we need a second PF rule:

pass out quick \
  route-to ($B_IF $B_GW) \
  inet from $B_IP to ! $B_SUBNET \
  keep state

Again we must specifically exclude traffic to our local network, because otherwise it will go flying off to our gateway, and also you can be more specific if you only want this machine to be able to connect to certain things using this gateway and firewall (eg 'to { IP1 IP2 SUBNET3/24 }', or you could use a port-based restriction).

(The PF rule can't be qualified with 'on $B_IF', because the situation where you need this rule is where the packet would not normally be going out that interface. Using 'on <the interface with your default route's gateway>' has some subtle differences in the semantics if you have more than two interfaces.)

Although you might innocently think otherwise, the second rule by itself isn't sufficient to make incoming connections to the second interface work correctly. If you want both incoming and outgoing connections to work, you need both rules. Possibly it would work if you matched incoming traffic on $B_IF without keeping state.

A surprise with /etc/cron.daily, run-parts, and files with '.' in their name

By: cks
16 October 2024 at 03:30

Linux distributions have a long standing general cron feature where there are /etc/cron.hourly, /etc/cron.daily, and /etc/cron.weekly directories and if you put scripts in there, they will get run hourly, daily, or weekly (at some time set by the distribution). The actual running is generally implemented by a program called 'run-parts'. Since this is a standard Linux distribution feature, of course there is a single implementation of run-parts and its behavior is standardized, right?

Since I'm asking the question, you already know the answer: there are at least two different implementations of run-parts, and their behavior differs in at least one significant way (as well as several other probably less important ones).

In Debian, Ubuntu, and other Debian-derived distributions (and also I think Arch Linux), run-parts is a C program that is part of debianutils. In Fedora, Red Hat Enterprise Linux, and derived RPM-based distributions, run-parts is a shell script that's part of the crontabs package, which is part of cronie-cron. One somewhat unimportant way that these two versions differ is that the RPM version ignores some extensions that come from RPM packaging fun (you can see the current full list in the shell script code), while the Debian version only skips the Debian equivalents with a non-default option (and actually documents the behavior in the manual page).

A much more important difference is that the Debian version ignores files with a '.' in their name (this can be changed with a command line switch, but /etc/cron.daily and so on are not processed with this switch). As a non-hypothetical example, if you have a /etc/cron.daily/backup.sh script, a Debian based system will ignore this while a RHEL or Fedora based system will happily run it. If you are migrating a server from RHEL to Ubuntu, this may come as an unpleasant surprise, partly since the Debian version doesn't complain about skipping files.

(Whether or not the restriction could be said to be clearly documented in the Debian manual page is a matter of taste. Debian does clearly state the allowed characters, but it does not point out that '.', a not uncommon character, is explicitly not accepted by default.)
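Debian's default acceptance rule can be sketched as a small shell function. To be clear, this is an emulation of the documented behavior, not the real run-parts code:

```shell
# A sketch of Debian run-parts' default name filter: only names made up of
# ASCII letters, digits, '_', and '-' are run; anything else (including a
# name with a '.') is silently skipped.
run_parts_would_run() {
    case "$1" in
        "")                return 1 ;;  # empty name: never run
        *[!A-Za-z0-9_-]*)  return 1 ;;  # contains a disallowed character
        *)                 return 0 ;;
    esac
}

run_parts_would_run backup    && echo "backup: would run"
run_parts_would_run backup.sh || echo "backup.sh: silently skipped"
```

Note that a RHEL or Fedora run-parts has no such filter (apart from the RPM packaging extensions mentioned above), which is exactly why a '/etc/cron.daily/backup.sh' behaves differently across the two families.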

Linux software RAID and changing your system's hostname

By: cks
11 October 2024 at 03:46

Today, I changed the hostname of an old Linux system (for reasons) and rebooted it. To my surprise, the system did not come up afterward, but instead got stuck in systemd's emergency mode for a chain of reasons that boiled down to there being no '/dev/md0'. Changing the hostname back to its old value and rebooting the system again caused it to come up fine. After some diagnostic work, I believe I understand what happened and how to work around it if it affects us in the future.

One of the issues that Linux RAID auto-assembly faces is the question of what it should call the assembled array. People want their RAID array names to stay fixed (so /dev/md0 is always /dev/md0), and so the name is part of the RAID array's metadata, but at the same time you have the problem of what happens if you connect up two sets of disks that both want to be 'md0'. Part of the answer is mdadm.conf, which can give arrays names based on their UUID. If your mdadm.conf says 'ARRAY /dev/md10 ... UUID=<x>' and mdadm finds a matching array, then in theory it can be confident you want that one to be /dev/md10 and it should rename anything else that claims to be /dev/md10.

However, suppose that your array is not specified in mdadm.conf. In that case, another software RAID array feature kicks in, which is that arrays can have a 'home host'. If the array is on its home host, it will get the name it claims it has, such as '/dev/md0'. Otherwise, well, let me quote from the 'Auto-Assembly' section of the mdadm manual page:

[...] Arrays which do not obviously belong to this host are given names that are expected not to conflict with anything local, and are started "read-auto" so that nothing is written to any device until the array is written to. i.e. automatic resync etc is delayed.

As is covered in the documentation for the '--homehost' option in the mdadm manual page, on modern 1.x superblock formats the home host is embedded into the name of the RAID array. You can see this with 'mdadm --detail', which can report things like:

Name : ubuntu-server:0
Name : <host>:25  (local to host <host>)

Both of these have a 'home host'; in the first case the home host is 'ubuntu-server', and in the second case the home host is the current machine's hostname. Well, its 'hostname' as far as mdadm is concerned, which can be set in part through mdadm.conf's 'HOMEHOST' directive. Let me repeat that: mdadm by default identifies home hosts by their hostname, not by any more stable identifier.

So if you change a machine's hostname and you have arrays not in your mdadm.conf with home hosts, their /dev/mdN device names will get changed when you reboot. This is what happened to me, as we hadn't added the array to the machine's mdadm.conf.

(Contrary to some ways to read the mdadm manual page, arrays are not renamed if they're in mdadm.conf. Otherwise we'd have noticed this a long time ago on our Ubuntu servers, where all of the arrays created in the installer have the home host of 'ubuntu-server', which is obviously not any machine's actual hostname.)

Setting the home host value to the machine's current hostname when an array is created is the mdadm default behavior, although you can turn this off with the right mdadm.conf HOMEHOST setting. You can also tell mdadm to consider all arrays to be on their home host, regardless of the home host embedded into their names.

(The latter is 'HOMEHOST <ignore>', the former by itself is 'HOMEHOST <none>', and it's currently valid to combine them both as 'HOMEHOST <ignore> <none>', although this isn't quite documented in the manual page.)
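Putting those pieces together, an mdadm.conf that both sidesteps the home-host check and pins an array's device name by UUID might look like this (the UUID here is a placeholder, not a real value):

```
# /etc/mdadm/mdadm.conf (illustrative fragment)
# Treat all arrays as if on their home host, and don't embed a home host
# in arrays created on this machine:
HOMEHOST <ignore> <none>
# Pin this array's device name by UUID so a hostname change can't rename it:
ARRAY /dev/md10 metadata=1.2 UUID=01234567:89abcdef:01234567:89abcdef
```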

PS: Some uses of software RAID arrays won't care about their names; for example, filesystems where your /etc/fstab specifies the device using 'UUID=' or '/dev/disk/by-id/md-uuid-...' (which seems to be common on Ubuntu).

PPS: For 1.x superblocks, the array name as a whole can only be 32 characters long, which obviously limits how long of a home host name you can have, especially since you need a ':' in there as well and an array number or the like. If you create a RAID array on a system with a too long hostname, the name of the resulting array will not be in the '<host>:<name>' format that creates an array with a home host; instead, mdadm will set the name of the RAID to the base name (either whatever name you specified, or the N of the 'mdN' device you told it to use).

(It turns out that I managed to do this by accident on my home desktop, which has a long fully qualified name, by creating an array with the name 'ssd root'. The combination turns out to be 33 characters long, so the RAID array just got the name 'ssd root' instead of '<host>:ssd root'.)

The history of inetd is more interesting than I expected

By: cks
10 October 2024 at 03:11

Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them. When inetd listens on a port, it can act in two different modes. In the simplest mode, it starts a separate copy of the configured program for every connection (much like the traditional HTTP CGI model), which is an easy way to implement small, low volume services but usually not good for bigger, higher volume ones. The second mode is more like modern 'socket activation'; when a connection comes in, inetd starts your program and passes it the master socket, leaving it to you to keep accepting and processing connections until you exit.

(In inetd terminology, the first mode is 'nowait' and the second is 'wait'; this describes whether inetd immediately resumes listening on the socket for connections or waits until the program exits.)
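For concreteness, inetd.conf entries for the two modes look roughly like this (the service choices and paths are illustrative, not taken from any particular system):

```
# service  socket  proto  wait/nowait  user  server program    arguments
ftp        stream  tcp    nowait       root  /usr/sbin/ftpd    ftpd
comsat     dgram   udp    wait         root  /usr/sbin/comsat  comsat
```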

Inetd turns out to have a more interesting history than I expected, and it's a history that's entwined with daemonization, especially with how the BSD r* commands daemonize themselves in 4.2 BSD. If you'd asked me before I started writing this entry, I'd have said that inetd was present in 4.2 BSD and was being used for various low-importance services. This turns out to be false in both respects. As far as I can tell, inetd was introduced in 4.3 BSD, and when it was introduced it was immediately put to use for important system daemons like rlogind, telnetd, ftpd, and so on, which were surprisingly run in the first style (with a copy of the relevant program started for each connection). You can see this in the 4.3 BSD /etc/inetd.conf, which has the various TCP daemons and lists them as 'nowait'.

(There are still network programs that are run as stand-alone daemons, per the 4.3 BSD /etc/rc and the 4.3 BSD /etc/rc.local. If we don't count syslogd, the standard 4.3 BSD tally seems to be rwhod, lpd, named, and sendmail.)

While I described inetd as having two modes and this is the modern state, the 4.3 BSD inetd(8) manual page says that only the 'start a copy of the program every time' mode ('nowait') is to be used for TCP programs like rlogind. I took a quick read over the 4.3 BSD inetd.c and it doesn't seem to outright reject a TCP service set up with 'wait', and the code looks like it might actually work with that. However, there's the warning in the manual page and there's no inetd.conf entry for a TCP service that is 'wait', so you'd be on your own.

The corollary of this is that in 4.3 BSD, programs like rlogind don't have the daemonization code that they did in 4.2 BSD. Instead, the 4.3 BSD rlogind.c shows that it can only be run under inetd or some equivalent, as rlogind immediately aborts if its standard input isn't a socket (and it expects the socket to be connected to some other end, which is true for the 'nowait' inetd mode but not how things would be for the 'wait' mode).

This 4.3 BSD inetd model seems to have rapidly propagated into BSD-derived systems like SunOS and Ultrix. I found traces that relatively early on, both of them had inherited the 4.3 style non-daemonizing rlogind and associated programs, along with an inetd-based setup for them. This is especially interesting for SunOS, because it was initially derived from 4.2 BSD (I'm less sure of Ultrix's origins, although I suspect it too started out as 4.2 BSD derived).

PS: I haven't looked to see if the various BSDs ever changed this mode of operation for rlogind et al, or if they carried the 'per connection' inetd based model all through until each of them removed the r* commands entirely.

OpenBSD kernel messages about memory conflicts on x86 machines

By: cks
9 October 2024 at 02:44

Suppose you boot up an OpenBSD machine that you think may be having problems, and as part of this boot you look at the kernel messages for the first time in a while (or perhaps ever), and when doing so you see messages that look like this:

3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000

Or maybe the messages are like this:

memory map conflict 0xe00fd000/0x1000
memory map conflict 0xfe000000/0x11000
[...]
3:0:0: mem address conflict 0xfffc0000/0x40000
3:0:1: mem address conflict 0xfffc0000/0x40000

This sounds alarming, but there's almost certainly no actual problem, and if you check logs you'll likely find that you've been getting messages like this for as long as you've had OpenBSD on the machine.

The short version is that both of these are reports from OpenBSD that it's finding conflicts in the memory map information it is getting from your BIOS. The messages that start with 'X:Y:Z' are about PCI(e) device memory specifically, while the 'memory map conflict' errors are about the general memory map the BIOS hands the system.

Generally, OpenBSD will report additional information immediately after about what the PCI(e) devices in question are. Here are the full kernel messages around the 'rom address conflict':

pci3 at ppb2 bus 3
3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000
bge0 at pci3 dev 0 function 0 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy0 at bge0 phy 1: BCM5720C 10/100/1000baseT PHY, rev. 0
bge1 at pci3 dev 0 function 1 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy1 at bge1 phy 2: BCM5720C 10/100/1000baseT PHY, rev. 0

Here these are two network ports on the same PCIe device (more or less), so it's not terribly surprising that the same ROM may be reused for both. I believe the two messages mean that both ROMs (at the same address) are conflicting with another unmentioned allocation. I'm not sure how you find out what the original allocation and device is that they're both conflicting with.

The PCI related messages come from sys/dev/pci/pci.c and in current OpenBSD come in a number of variations, depending on what sort of PCI address space is detected as in conflict in pci_reserve_resources(). Right now, I see 'mem address conflict', 'io address conflict', the already mentioned 'rom address conflict', 'bridge io address conflict', 'bridge mem address conflict' (in several spots in the code), and 'bridge bus conflict'. Interested parties can read the source for more because this exhausts my knowledge on the subject.

The 'memory map conflict' message comes from a different place; for most people it will come from sys/arch/amd64/pci/pci_machdep.c, in pci_init_extents(). If I'm understanding the code correctly, this is creating an initial set of reserved physical address space that PCI devices should not be using. It registers each piece of bios_memmap, which according to comments in sys/arch/amd64/amd64/machdep.c is "the memory map as the bios has returned it to us". I believe that a memory map conflict at this point says that two pieces of the BIOS memory map overlap each other (or one is entirely contained in the other).

I'm not sure it's correct to describe these messages as harmless. However, it's likely that they've been there for as long as your system's BIOS has been setting up its general memory map and the PCI devices as it has been, and you'd likely see the same address conflicts with another system (although Linux doesn't seem to complain about it; I don't know about FreeBSD).

Daemonization in Unix programs is probably about restarting programs

By: cks
6 October 2024 at 02:55

It's standard for Unix daemon programs to 'daemonize' themselves when they start, completely detaching from how they were run; this behavior is quite old and these days it's somewhat controversial and sometimes considered undesirable. At this point you might ask why programs even daemonize themselves in the first place, and while I don't know for sure, I do have an opinion. My belief is that daemonization is because of restarting daemon programs, not starting them at boot.

During system boot, programs don't need to daemonize in order to start properly. The general Unix boot time environment has long been able to detach programs into the background (although the V7 /etc/rc didn't bother to do this with /etc/update and /etc/cron, the 4.2BSD /etc/rc did do this for the new BSD network daemons). In general, programs started at boot time don't need to worry that they will be inheriting things like stray file descriptors or a controlling terminal. It's the job of the overall boot time environment to ensure that they start in a clean environment, and if there's a problem there you should fix it centrally, not make it every program's job to deal with the failure of your init and boot sequence.

However, init is not a service manager (not historically), which meant that for a long time, starting or restarting daemons after boot was entirely in your hands with no assistance from the system. Even if you remembered to restart a program as 'daemon &' so that it was backgrounded, the newly started program could inherit all sorts of things from your login session. It might have some random current directory, it might have stray file descriptors that were inherited from your shell or login environment, its standard input, output, and error would be connected to your terminal, and it would have a controlling terminal, leaving it exposed to various bad things happening to it when, for example, you logged out (which often would deliver a SIGHUP to it).

This is the sort of thing that even very old daemonization code deals with and fixes. The 4.2BSD daemonization code closes (stray) file descriptors and removes any controlling terminal the process may have, in addition to detaching itself from your shell (in case you forgot or didn't use the '&' when starting it). It's also easy to see how people writing Unix daemons might drift into adding this sort of code to them as people restarted the daemons (by hand) and ran into the various problems (cf). In fact the 4.2BSD code for it is conditional on 'DEBUG' not being defined; presumably if you were debugging, say, rlogind, you'd build a version that didn't detach itself on you so you could easily run it under a debugger or whatever.

It's a bit of a pity that 4.2 BSD and its successors didn't create a general 'daemonize' program that did all of this for you and then told people to restart daemons with 'daemonize <program>' instead of '<program>'. But we got the Unix that we have, not the Unix that we'd like to have, and Unixes did eventually grow various forms of service management that tried to encapsulate all of the things required to restart daemons in one place.

(Even then, I'm not sure that old System V init systems would properly daemonize something that you restarted through '/etc/init.d/<whatever> restart', or if it was up to the program to do things like close extra file descriptors and get rid of any controlling terminal.)

PS: Much later, people did write tools for this, such as daemonize. It's surprisingly handy to have such a program lying around for when you want or need it.

Traditionally, init on Unix was not a service manager as such

By: cks
5 October 2024 at 03:05

Init (the process) has historically had a number of roles but, perhaps surprisingly, being a 'service manager' (or a 'daemon manager') was not one of them in traditional init systems. In V7 Unix and continuing on into traditional 4.x BSD, init (sort of) started various daemons by running /etc/rc, but its only 'supervision' was of getty processes for the console and (other) serial lines. There was no supervision or management of daemons or services, even in the overall init system (stretching beyond PID 1, init itself). To restart a service, you killed its process and then re-ran it somehow; getting even the command line arguments right was up to you.

(It's conventional to say that init started daemons during boot, even though technically there are some intermediate processes involved since /etc/rc is a shell script.)

The System V init had a more general /etc/inittab that could in theory handle more than getty processes, but in practice it wasn't used for managing anything more than them. The System V init system as a whole did have a concept of managing daemons and services, in the form of its multi-file /etc/rc.d structure, but stopping and restarting services was handled outside of the PID 1 init itself. To stop a service you directly ran its init.d script with 'whatever stop', and the script used various approaches to find the processes and get them to stop. Similarly, (re)starting a daemon was done directly by its init.d script, without PID 1 being involved.

As a whole system the overall System V init system was a significant improvement on the more basic BSD approach, but it (still) didn't have init itself doing any service supervision. In fact there was nothing that actively did service supervision even in the System V model. I'm not sure what the first system to do active service supervision was, but it may have been daemontools. Extending the init process itself to do daemon supervision has a somewhat controversial history; there are Unix systems that don't do this through PID 1, although doing a good job of it has clearly become one of the major jobs of the init system as a whole.

That init itself didn't do service or daemon management is, in my view, connected to the history of (process) daemonization. But that's another entry.

(There's also my entry on how init (and the init system as a whole) wound up as Unix's daemon manager.)

(Unix) daemonization turns out to be quite old

By: cks
4 October 2024 at 02:51

In the Unix context, 'daemonization' means a program that totally detaches itself from how it was started. It was once very common and popular, but with modern init systems they're often no longer considered to be all that good an idea. I have some views on the history here, but today I'm going to confine myself to a much smaller subject, which is that in Unix, daemonization goes back much further than I expected. Some form of daemonization dates to Research Unix V5 or earlier, and an almost complete version appears in network daemons in 4.2 BSD.

As far back as Research Unix V5 (from 1974), /etc/rc is starting /etc/update (which does a periodic sync()) without explicitly backgrounding it. This is the giveaway sign that 'update' itself forks and exits in the parent, the initial version of daemonization, and indeed that's what we find in update.s (it wasn't yet a C program). The V6 update is still in assembler, but now the V6 update.s is clearly not just forking but also closing file descriptors 0, 1, and 2.

In the V7 /etc/rc, the new /etc/cron is also started without being explicitly put into the background. The V7 update.c seems to be a straight translation into C, but the V7 cron.d has a more elaborate version of daemonization. V7 cron forks, chdir's to /, does some odd things with standard input, output, and error, ignores some signals, and then starts doing cron things. This is pretty close to what you'd do in modern daemonization.

The first 'network daemons' appeared around the time of 4.2 BSD. The 4.2BSD /etc/rc explicitly backgrounds all of the r* daemons when it starts them, which in theory means they could have skipped having any daemonization code. In practice, rlogind.c, rshd.c, rexecd.c, and rwhod.c all have essentially identical code to do daemonization. The rlogind.c version is:

#ifndef DEBUG
	if (fork())
		exit(0);
	for (f = 0; f < 10; f++)
		(void) close(f);
	(void) open("/", 0);
	(void) dup2(0, 1);
	(void) dup2(0, 2);
	{ int tt = open("/dev/tty", 2);
	  if (tt > 0) {
		ioctl(tt, TIOCNOTTY, 0);
		close(tt);
	  }
	}
#endif

This forks with the parent exiting (detaching the child from the process hierarchy), then the child closes any (low-numbered) file descriptors it may have inherited, sets up non-working standard input, output, and error, and detaches itself from any controlling terminal before starting to do rlogind's real work. This is pretty close to the modern version of daemonization.

(Today, the ioctl() stuff is done by calling setsid() and you'd probably want to close more than the first ten file descriptors, although that's still a non-trivial problem.)

Resetting the backoff restart delay for a systemd service

By: cks
1 October 2024 at 02:48

Suppose, not hypothetically, that your Linux machine is your DSL PPPoE gateway, and you run the PPPoE software through a simple script to invoke pppd that's run as a systemd .service unit. Pppd itself will exit if the link fails for some reason, but generally you want to automatically try to establish it again. One way to do this (the simple way) is to set the systemd unit to 'Restart=always', with a restart delay.

Things like pppd generally benefit from a certain amount of backoff in their restart attempts, rather than restarting either slowly or rapidly all of the time. If your PPP(oE) link just dropped out briefly because of a hiccup, you want it back right away, not in five or ten minutes, but if there's a significant problem with the link, retrying every second doesn't help (and it may trigger things in your service provider's systems). Systemd supports this sort of backoff if you set 'RestartSteps' and 'RestartMaxDelaySec' to appropriate values. So you could wind up with, for example:

Restart=always
RestartSec=1s
RestartSteps=10
RestartMaxDelaySec=10m

This works fine in general, but there is a problem lurking. Suppose that one day you have a long outage in your service but it comes back, and then a few stable days later you have a brief service blip. To your surprise, your PPPoE session is not immediately restarted the way you expect. What's happened is that systemd doesn't reset its backoff timing just because your service has been up for a while.

To see the current state of your unit's backoff, you want to look at its properties, specifically 'NRestarts' and especially 'RestartUSecNext', which is the delay systemd will put on for the next restart. You see these with 'systemctl show <unit>', or perhaps 'systemctl show -p NRestarts,RestartUSecNext <unit>'. To reset your unit's dynamic backoff time, you run 'systemctl reset-failed <unit>'; this is the same thing you may need to do if you restart a unit too fast and the start stalls.

(I don't know if manually restarting your service with 'systemctl restart <unit>' bumps up the restart count and the backoff time, the way it can cause you to run into (re)start limits.)

At the moment, simply doing 'systemctl reset-failed' doesn't seem to be enough to immediately re-activate a unit that is slumbering in a long restart delay. So the full scale, completely reliable version is probably 'systemctl stop <unit>; systemctl reset-failed <unit>; systemctl start <unit>'. I don't know how you see that a unit is currently in a 'RestartUSecNext' delay, or how much time is left on the delay (such a delay doesn't seem to be a 'job' that appears in 'systemctl list-jobs', and it's not a timer unit so it doesn't show up in 'systemctl list-timers').

If you feel like making your start script more complicated (and it runs as root), I believe that you could keep track of how long this invocation of the service has been running, and if it's long enough, run a 'systemctl reset-failed <unit>' before the script exits. This would (manually) reset the backoff counter if the service has been up for long enough, which is often what you really want.

(If systemd has a unit setting that will already do this, I was unable to spot it.)
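A minimal sketch of such a wrapper script follows; the unit name, the threshold, and the commented-out service command are all assumptions for illustration, not from any real setup:

```shell
#!/bin/sh
# Hedged sketch: if this invocation of the service ran "long enough",
# clear systemd's accumulated restart backoff before the script exits.
UNIT=pppoe.service
THRESHOLD=600   # seconds of stable running that counts as "long enough"

ran_long_enough() {
    # $1 = start epoch, $2 = end epoch, $3 = threshold in seconds
    [ $(($2 - $1)) -ge "$3" ]
}

start=$(date +%s)
# The real (foreground) service command would run here, for example:
#   pppd call dsl-provider nodetach
if ran_long_enough "$start" "$(date +%s)" "$THRESHOLD"; then
    systemctl reset-failed "$UNIT" 2>/dev/null || true
fi
```

The '|| true' matters because the script's exit status is what systemd sees; you don't want a failed reset-failed to change how the unit's exit is interpreted.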

Options for adding IPv6 networking to your libvirt based virtual machines

By: cks
29 September 2024 at 02:47

Recently, my home ISP switched me from an IPv6 /64 allocation to a /56 allocation, which means that now I can have a bunch of proper /64s for different purposes. I promptly celebrated this by, in part, extending IPv6 to my libvirt based virtual machine, which is on a bridged internal virtual network (cf). Libvirt provides three different ways to provide (public) IPv6 to such virtual machines, all of which will require you to edit your network XML (either inside the virt-manager GUI or directly with command line tools). The three ways aren't exclusive; you can use two of them or even all three at the same time, in which case your VMs will have two or three public IPv6 addresses (at least).

(None of this applies if you're directly bridging your virtual machines onto some physical network. In that case, whatever the physical network has set up for IPv6 is what your VMs will get.)

First, in all cases you're probably going to want an IPv6 '<ip>' block that sets the IPv6 address for your host machine and implicitly specifies your /64. This is an active requirement for two of the options, and typically looks like this:

<ip family='ipv6' address='2001:19XX:0:1102::1' prefix='64'>
[...]
</ip>

Here my desktop will have 2001:19XX:0:1102::1/64 as its address on the internal libvirt network.

The option that is probably the least hassle is to give static IPv6 addresses to your VMs. This is done with <host> elements inside a <dhcp> element (inside your IPv6 <ip>, which I'm not going to repeat):

<dhcp>
  <host name='hl-fedora-36' ip='2001:XXXX:0:1102::189'/>
</dhcp>

Unlike with IPv4, you can't identify VMs by their MAC address because, to quote the network XML documentation:

[...] The IPv6 host element differs slightly from that for IPv4: there is no mac attribute since a MAC address has no defined meaning in IPv6. [...]

Instead you probably need to identify your virtual machines by their (DHCP) hostname. Libvirt has another option for this but it's not really well documented and your virtual machine may not be set up with the necessary bits to use it.

The second least hassle option is to provide a DHCP dynamic range of IPv6 addresses. In the current Fedora 40 libvirt, this has the undocumented limitation that the range can't include more than 65,535 IPv6 addresses, so you can't cover the entire /64. Instead you wind up with something like this:

<dhcp>
  <range start='2001:XXXX:0:1102::1000' end='2001:XXXX:0:1102::ffff'/>
</dhcp>

Famously, not everything in the world does DHCP6; some things only do SLAAC, and in general SLAAC will allocate random IPv6 IPs across your entire /64. Libvirt uses dnsmasq (also) to provide IP addresses to virtual machines, and dnsmasq can do SLAAC (see the dnsmasq manual page). However, libvirt currently provides no directly exposed controls to turn this on; instead, you need to use a special libvirt network XML namespace to directly set up the option in the dnsmasq configuration file that libvirt will generate.

What you need looks like:

<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
[...]
  <dnsmasq:options>
    <dnsmasq:option value='dhcp-range=2001:XXXX:0:1102::,slaac,64'/>
  </dnsmasq:options>
</network>

(The 'xmlns:dnsmasq=' bit is what you have to add to the normal <network> element.)

I believe that this may not require you to declare an IPv6 <ip> section at all, although I haven't tested that. In my environment I want both SLAAC and a static IPv6 address, and I'm happy to not have DHCP6 as such, since SLAAC will allocate a much wider and more varied range of IPv6 addresses.

(You can combine a dnsmasq SLAAC dhcp-range with a regular DHCP6 range, in which case SLAAC-capable IPv6 virtual machines will get an IP address from both, possibly along with a third static IPv6 address.)

PS: Remember to set firewall rules to restrict access to those public IPv6 addresses, unless you want your virtual machines fully exposed on IPv6 (when they're probably protected on IPv4 by virtue of being NAT'd).
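As a sketch of such rules in nftables syntax, something along these lines would allow reply traffic to the guest /64 while dropping unsolicited inbound connections (the table name and the 2001:db8: documentation prefix are placeholders for your own values):

```
# Illustrative nftables fragment; adjust the prefix to your actual /64.
table inet guestfw {
  chain forward {
    type filter hook forward priority filter; policy accept;
    ip6 daddr 2001:db8:0:1102::/64 ct state established,related accept
    ip6 daddr 2001:db8:0:1102::/64 ct state new,invalid drop
  }
}
```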

Mostly getting redundant UEFI boot disks on modern Ubuntu (especially 24.04)

By: cks
24 September 2024 at 02:44

When I wrote about how our primary goal for mirrored (system) disks is increased redundancy, including being able to reboot the system after the primary disk failed, vowhite asked in a comment if there was any trick to getting this working with UEFI. The answer is sort of, and it's mostly the same as you want to do with BIOS MBR booting.

In the Ubuntu installer, when you set up redundant system disks it's long been the case that you wanted to explicitly tell the installer to use the second disk as an additional boot device (in addition to setting up a software RAID mirror of the root filesystem across both disks). In the BIOS MBR world, this installed GRUB bootblocks on the disk; in the UEFI world, this causes the installer to set up an extra EFI System Partition (ESP) on the second drive and populate it with the same sort of things as the ESP on the first drive.

(The 'first' and the 'second' drive are not necessarily what you think they are, since the Ubuntu installer doesn't always present drives to you in their enumeration order.)

I believe that this dates from Ubuntu 22.04, when Ubuntu seems to have added support for multi-disk UEFI. Ubuntu will mount one of these ESPs (the one it considers the 'first') on /boot/efi, and as part of multi-disk UEFI support it will also arrange to update the other ESP. You can see what other disk Ubuntu expects to find this ESP on by looking at the debconf selection 'grub-efi/install_devices'. For perfectly sensible reasons this will identify disks by their disk IDs (as found in /dev/disk/by-id), and it normally lists both ESPs.

All of this is great but it leaves you with two problems if the disk with your primary ESP fails. The first is the question of whether your system's BIOS will automatically boot off the second ESP. I believe that UEFI firmware will often do this, and you can specifically set this up with EFI boot entries through things like efibootmgr (also); possibly current Ubuntu installers do this for you automatically if it seems necessary.

The bigger problem is the /boot/efi mount. If the primary disk fails, a mounted /boot/efi will start having disk IO errors and then if the system reboots, Ubuntu will probably be unable to find and mount /boot/efi from the now gone or error-prone primary disk. If this is a significant concern, I think you need to make the /boot/efi mount 'nofail' in /etc/fstab (per fstab(5)). Energetic people might want to go further and make it either 'noauto' so that it's not even mounted normally, or perhaps mark it as a systemd automounted filesystem with 'x-systemd.automount' (per systemd.mount).

(The disclaimer is that I don't know how Ubuntu will react if /boot/efi isn't mounted at all or is a systemd automount mountpoint. I think that GRUB updates will cope with having it not mounted at all.)
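For example, an /etc/fstab line with 'nofail' might look like this (the UUID shown is a placeholder for your ESP's actual vfat filesystem UUID):

```
# /etc/fstab (illustrative): don't fail the boot if this ESP is missing
UUID=ABCD-1234  /boot/efi  vfat  umask=0077,nofail  0  1
```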

If any disk with an ESP on it fails and has to be replaced, you have to recreate a new ESP on that disk and then, I believe, run 'dpkg-reconfigure grub-efi-amd64', which will ask you to select the ESPs you want to be automatically updated. You may then need to manually run '/usr/lib/grub/grub-multi-install --target=x86_64-efi', which will populate the new ESP (or it may be automatically run through the reconfigure). I'm not sure about this because we haven't had any UEFI system disks fail yet.

(The ESP is a vfat formatted filesystem, which can be set up with mkfs.vfat, and has specific requirements for its GUIDs and so on, which you'll have to set up by hand in the partitioning tool of your choice or perhaps automatically by copying the partitioning of the surviving system disk to your new disk.)

If it was the primary disk that failed, you will probably want to update /etc/fstab to get /boot/efi from a place that still exists (probably with 'nofail' and perhaps with 'noauto'). This might be somewhat easy to overlook if the primary disk fails without the system rebooting, at which point you'd get an unpleasant surprise on the next system reboot.

The general difference between UEFI and BIOS MBR booting for this is that in BIOS MBR booting, there's no /boot/efi to cause problems and running 'grub-install' against your replacement disk is a lot easier than creating and setting up the ESP. As I found out, a properly set up BIOS MBR system also 'knows' in debconf what devices you have GRUB installed on, and you'll need to update this (probably with 'dpkg-reconfigure grub-pc') when you replace a system disk.

(We've been able to avoid this so far because in Ubuntu 20.04 and 22.04, 'grub-install' isn't run during GRUB package updates for BIOS MBR systems so no errors actually show up. If we install any 24.04 systems with BIOS MBR booting and they have system disk failures, we'll have to remember to deal with it.)

(See also my entry on multi-disk UEFI in Ubuntu 22.04, which goes deeper into some details. That entry was written before I knew that a 'grub-*/install_devices' setting of a software RAID array was actually an error on Ubuntu's part, although I'd still like GRUB's UEFI and BIOS MBR scripts to support it.)

Old (Unix) workstations and servers tended to boot in the same ways

By: cks
23 September 2024 at 02:50

I somewhat recently read j. b. crawford's ipmi, where crawford talks in part about how old servers of the late 80s and 90s (Unix and otherwise) often had various features for management, like serial consoles. What makes something an old school 80s and 90s Unix server, and why they died off, is an interesting topic I have views on, but today I want to cover a much smaller one, which is that this sort of early boot environment and low level management system was generally also found on Unix workstations.

By and large, the various companies making both Unix servers and Unix workstations, such as Sun, SGI, and DEC, all used the same boot time system firmware on both workstation models and server models (presumably partly because that was usually easier and cheaper). Since most workstations also had serial ports, the general consequence of this was that you could set up a 'workstation' with a serial console if you wanted to. Some companies even sold the same core hardware as either a server or workstation depending on what additional options you put in it (and with appropriate additional hardware you could convert an old server into a relatively powerful workstation).

(The line between 'workstation' and 'server' was especially fuzzy for SGI hardware, where high end systems could be physically big enough to be found in definite server-sized boxes. Whether you considered these 'servers with very expensive graphics boards' or 'big workstations' could be a matter of perspective and how they were used.)

As far as the firmware was concerned, generally what distinguished a 'server' that would talk to its serial port to control booting and so on from a 'workstation' that had a graphical console of some sort was the presence of (working) graphics hardware. If the firmware saw a graphics board and no PROM boot variables had been set, it would assume the machine was a workstation; if there was no graphics hardware, you were a server.

As a side note, back in those days 'server' models were not necessarily rack-mountable and weren't always designed with the 'must be in a machine room to not deafen you' level of fans that modern servers tend to be found with. The larger servers were physically large and could require special power (and generate enough noise that you didn't want them around you), but the smaller 'server' models could look just like a desktop workstation (at least until you counted up how many SCSI disks were cabled to them).

Sidebar: An example of repurposing older servers as workstations

At one point, I worked with an environment that used DEC's MIPS-based DECstations. DEC's 5000/2xx series were available either as a server, without any graphics hardware, or as a workstation, with graphics hardware. At one point we replaced some servers with better ones; I think they would have been 5000/200s being replaced with 5000/240s. At the time I was using a DECstation 3100 as my system administrator workstation, so I successfully proposed taking one of the old 5000/200s, adding the basic colour graphics module, and making it my new workstation. It was a very nice upgrade.

OpenBSD versus FreeBSD pf.conf syntax for address translation rules

By: cks
20 September 2024 at 02:53

I mentioned recently that we're looking at FreeBSD as a potential replacement for OpenBSD for our PF-based firewalls (for the reasons, see that entry). One of the things that will determine how likely we are to try this is how similar the pf.conf configuration syntax and semantics are between OpenBSD pf.conf (which all of our current firewall rulesets are obviously written in) and FreeBSD pf.conf (which we'd have to move them to). I've only done preliminary exploration of this but the news has been relatively good so far.

I've already found one significant difference in syntax (and to some extent semantics) between the two PF ruleset dialects. OpenBSD does BINAT, redirection, and other such things by means of rule modifiers; you write a 'pass' or a 'match' rule and add 'binat-to', 'nat-to', 'rdr-to', and so on modifiers to it. In FreeBSD PF, this must instead be done as standalone translation rules that take effect before your filtering rules. In OpenBSD PF, strategically placed (ie early) 'match' BINAT, NAT, and RDR rules have much the same effect as FreeBSD translation rules, causing your later filtering rules to see the translated addresses; however, 'pass quick' rules with translation modifiers combine filtering and translation into one thing, and there's no exact FreeBSD equivalent.

That sounds abstract, so let's look at a somewhat hypothetical OpenBSD RDR rule:

pass in quick on $INT_IF proto {udp tcp} \
     from any to <old-DNS-IP> port = 53 \
     rdr-to <new-DNS-IP>

Here we want to redirect traffic to our deprecated old DNS resolver IP to the new DNS IP, but only DNS traffic.

In FreeBSD PF, the straightforward way would be two rules:

rdr on $INT_IF proto {udp tcp} \
    from any to <old-DNS-IP> port = 53 \
    -> <new-DNS-IP> port 53

pass in quick on $INT_IF proto {udp tcp} \
     from any to <new-DNS-IP> port = 53

In practice we would most likely already have the 'pass in' rule, and also you can write 'rdr pass' to immediately pass things and skip the filtering rules. However, 'rdr pass' is potentially dangerous because it skips all filtering. Do you have a single machine that is just hammering your DNS server through this redirection and you want to cut it off? You can't add a useful 'block in quick' rule for it if you have a 'rdr pass', because the 'pass' portion takes effect immediately. There are ways to work around this but they're not quite as straightforward.

(Probably this alone would push us to not using 'rdr pass'; there's also the potential confusion of passing traffic in two different sections of the pf.conf ruleset.)
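To make the difference concrete, here's a sketch of the plain-'rdr' form with a per-client block added; the blocked address is invented for illustration:

```
rdr on $INT_IF proto {udp tcp} \
    from any to <old-DNS-IP> port = 53 \
    -> <new-DNS-IP> port 53

# Because the rdr itself doesn't pass traffic, a later 'block in
# quick' can still cut off one abusive client, which 'rdr pass'
# would have let straight through.
block in quick on $INT_IF from 192.0.2.66 to <new-DNS-IP>
pass in quick on $INT_IF proto {udp tcp} \
     from any to <new-DNS-IP> port = 53
```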

Fortunately we have very few non-'match' translation rules. Turning OpenBSD 'match ... <whatever>-to <ip>' pf.conf rules into the equivalent FreeBSD '<whatever> ...' rules seems relatively mechanical. We'd have to make sure that the IP addresses our filtering rules saw continued to be the internal ones, but I think this would work out naturally; our firewalls that do NAT and BINAT translation do it on their external interfaces, and we usually filter with 'pass in' rules.
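To illustrate how mechanical the conversion is, here's a made-up outbound NAT example (the network and macros are invented, not from our rulesets):

```
# OpenBSD form: a 'match' rule with a 'nat-to' modifier.
match out on $EXT_IF from 10.0.0.0/24 nat-to $EXT_ADDR

# FreeBSD form: a standalone translation rule, placed before
# the filtering rules.
nat on $EXT_IF from 10.0.0.0/24 -> $EXT_ADDR
```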

(There may be more subtle semantic differences between OpenBSD and FreeBSD pf rules. A careful side by side reading of the two pf.conf manual pages might turn these up, but I'm not sure I can read the two manual pages that carefully.)

Why my Fedora 40 systems stalled logins for ten seconds or so

By: cks
17 September 2024 at 02:10

One of my peculiarities is that I reboot my Fedora 40 desktops by logging in as root on a text terminal and then running 'reboot' (sometimes or often also telling loginctl to terminate any remainders of my login session so that the reboot doesn't stall for irritating lengths of time). Recently, the simple process of logging in as root has been stalling for an alarmingly long time, enough to make me think something was wrong with the system (the stall turned out to be ten seconds or so, but even a couple of seconds of your root login apparently hanging is alarming). Today I hit this again and this time I dug into what was happening, partly because I was able to reproduce it with something other than a root login to reboot the machine.

My first step was to use the excellent extrace to find out what was taking so long, since this can trace all programs run from one top level process and report how long they took (along with the command line arguments). This revealed that the time consuming command was '/usr/libexec/pk-command-not-found compinit -c', and it was being run as part of quite a lot of commands being executed during shell startup. Specifically, Bash, because on Fedora root's login shell is Bash. This was happening because Bash's normal setup will source everything from /etc/profile.d/ in order to set up your new (interactive) Bash session, and it turns out that there's a lot there. Using 'bash -xl' I was able to determine that pk-command-not-found was probably being run somehow in /usr/share/lmod/lmod/init/bash. If you're as puzzled as I was about that, Lmod is apparently a Lua-based system for setting up paths and environment variables for optional software 'modules', so it wants to hook into shell startup to set up its environment variables.

It took me a bit of time to understand how the bits fit together, partly because there's no documentation for pk-command-not-found. The first step is that Bash has a feature that allows you to hook into what happens when a command isn't found (see the discussion of the (potential) command_not_found_handle function), and PackageKit is doing this (in the PackageKit-command-not-found Fedora RPM package, which Fedora installs as a standard feature). It turns out that Bash will invoke this handler function not just for commands you run interactively, but also for commands that aren't found while Bash is sourcing all of your shell startup. This handler is being triggered in Lmod's init/bash code because said code attempts to run 'compinit -c' to set up completion in zsh so that it can modify zsh's function search path. Compinit is a zsh thing (it's not technically a builtin), so there is no exposed 'compinit' command on the system. Running compinit outside of zsh is a bug; in this case, an expensive bug.
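The Bash hook involved is easy to demonstrate directly. This toy handler is mine, not PackageKit's; Bash runs whatever function you define under this name whenever a command lookup fails, including for commands run from startup files:

```shell
# A toy command_not_found_handle; $1 is the missing command's name.
# PackageKit's RPM installs a handler that execs the (slow)
# pk-command-not-found instead of just printing a message.
out=$(bash -c '
  command_not_found_handle() {
    echo "handler saw: $1"
    return 127
  }
  definitely-not-a-real-command arg1
' 2>/dev/null)
echo "$out"    # prints: handler saw: definitely-not-a-real-command
```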

My solution was to remove both PackageKit-command-not-found, because I don't want this slow 'command not found' handling in general, and also the Lmod package, because I don't use Lmod. Because I'm a certain sort of person, I filed Lmod issue #725 to report the issue.

In some testing in a virtual machine, it appears that pk-command-not-found may be so slow only the first time it's invoked. This means that most people with these packages installed may not see or at least realize what's happening, because under normal circumstances they probably log in to Fedora machines graphically, at which point the login stall is hidden in the general graphical environment startup delay that everyone expects to be slow. I'm in the unusual circumstance that my login doesn't use any normal shell, so logging in as root is the first time my desktops will run Bash interactively and trigger pk-command-not-found.

(This elaborates on and cleans up a Fediverse thread I wrote as I poked around.)

I wish (Linux) WireGuard had a simple way to restrict peer public IPs

By: cks
8 September 2024 at 02:32

WireGuard is an obvious tool to build encrypted, authenticated connections out of, over which you can run more or less any network service. For example, you might expose the rsync daemon only over a specific WireGuard interface, instead of running rsync over SSH. Unfortunately, if you want to use WireGuard as a SSH replacement in this fashion, it has one limitation; unlike SSH, there's no simple way to restrict the public IP address of a particular peer.

The rough equivalent of a WireGuard peer is an SSH keypair. In SSH, you can restrict where a keypair will be accepted from with the 'from="..."' restriction in your .ssh/authorized_keys. This provides an extra layer of protection against the key being compromised; not only does an attacker have to acquire the key, they have to be able to use it from exactly the same IP (or the expected IPs). However, more or less by design WireGuard doesn't have a particular restriction on where a WireGuard peer key can be used from. You can set an expected public IP for the peer, but if the peer contacts you from another IP, your (Linux kernel) WireGuard will update its idea of where the peer is. This is handy for WireGuard's usual use cases but not what we necessarily want for a wired down connection where the IPs should never change.

(I don't think this is a technical restriction in the WireGuard protocol, just something not done in most or all implementations.)

The normal answer is firewall rules that restrict access to the WireGuard port, but this has two limitations. The first and lesser limitation is that it's external to WireGuard, so it's possible to have WireGuard active but your firewall rules not properly applied, theoretically allowing more access than you intend. The bigger limitation is that if you have more than one such wired down WireGuard peer, firewall rules can't tell which WireGuard peer key is being used by which external peer. So in a straightforward implementation of firewall rules, any peer public IP can impersonate any other (if it has the required WireGuard peer key), which is different from the SSH 'from="..."' situation, where each key is restricted separately.

(On the other hand, the firewall situation is better in one way in that you can't accidentally add a WireGuard peer that will be accepted from anywhere the way you can with a SSH key by forgetting to put in a 'from="..."' restriction.)

To get firewall rules that can tell peers apart, you need to use different listening ports for each peer on your end. Today, this requires different WireGuard interfaces (and probably different server keys) for each peer. I think you can probably give all of the interfaces the same internal IP to simplify your life, although I haven't tested this.
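As an untested sketch of the per-peer-port scheme (the ports and addresses here are invented), the nftables side could look like:

```
table inet wgpeers {
  chain input {
    type filter hook input priority 0; policy accept;
    # Peer A may only talk to port 51820, peer B only to 51821,
    # so each peer key is effectively pinned to one public IP.
    udp dport 51820 ip saddr 192.0.2.10 accept
    udp dport 51821 ip saddr 198.51.100.20 accept
    udp dport { 51820, 51821 } drop
  }
}
```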

(Having written this entry, I now wonder if it would be possible to write an nftables or iptables extension that hooked into the kernel side of WireGuard enough to know peer identities and let you match on them. Existing extensions are already able to be aware of various things like cgroup membership, and there's an existing extension for IPsec. Possibly you could do this with eBPF programs, since there's a BPF/eBPF iptables extension.)

The problems (Open)ZFS can have on new Linux kernel versions

By: cks
6 September 2024 at 03:00

Every so often, someone out there is using a normal released version of OpenZFS on Linux (currently ZFS 2.2.6, which was just released) on a distribution that uses very new kernels (such as Fedora). They may then read that their version of ZFS (such as 2.2.5) doesn't list the latest kernel (such as 6.10) as a 'supported platform'. They may then wonder why this is so.

Part of the answer is that OpenZFS developers are cautious people who don't want to list new kernels as officially supported until people have carefully inspected and tested the situation. Even if everything looks good, it's possible that there is some subtle problem in the interface between (Open)ZFS and the new kernel version. But another part of the answer comes down to how the Linux kernel has no stable internal API, which is also part of how you can get subtle problems in new kernels.

The Linux kernel is constantly changing how things work internally. Functions appear or go away (or simply mutate); fields are added or removed from C structs, or sometimes change their meaning; function arguments change; how you're supposed to do things shifts. It's up to any out of tree code, such as OpenZFS, to keep up with these changes (and that's why you want kernel modules to be in the main Linux kernel if possible, because then other people do some of this work). So to merely compile on a new kernel version, OpenZFS may need to change its own code to match the kernel changes. Sometimes this will be simple, requiring almost no changes; other times it may lead to a bunch of modifications.

(Two examples are the master pull request for 6.10, which had only a few changes, and the larger master pull request for 6.11, which may not even be quite complete yet since 6.11 is not yet released.)
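The shape of the resulting compatibility code is roughly like this (an invented sketch, not actual OpenZFS code; the function names are hypothetical, and OpenZFS itself mostly probes for kernel features at configure time rather than comparing version numbers, though the effect is similar):

```c
#include <linux/version.h>

/* Hypothetical shim: the kernel changed an interface in 6.10, so an
 * out-of-tree module picks the right call at compile time. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 10, 0)
#define zmod_example_sync(ip)   new_kernel_sync(ip)
#else
#define zmod_example_sync(ip)   old_kernel_sync(ip)
#endif
```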

Having things compiling is merely the first step. The OpenZFS developers need to make sure that they're making the right changes, and they also want to catch kernel changes that don't break compilation but do change behavior. To quote a message from Rob Norris on the ZFS on Linux mailing list:

"Support" here means that the people involved with the OpenZFS are reasonably certain that the traditional OpenZFS goals of stability, durability, etc will hold when used with that kernel version. That usually means the test suites have passed, there's no significant new issues reported, and at least three people have looked at the kernel changes, the matching OpenZFS changes, and thought very hard about it.

As a practical matter (as Rob Norris notes), this often means that development versions of OpenZFS will often build and work on new kernel versions well before they're officially supported. Speaking from personal experience, it's possible to be using kernel versions that are not yet 'supported' without noticing until you hit an RPM version dependency surprise.

How not to upgrade (some) held packages on Ubuntu (and Debian)

By: cks
29 August 2024 at 03:38

We hold a number of packages across our Ubuntu fleet (for good reasons), so that they're only upgraded under controlled circumstances. Which packages are held varies, but they always include the kernel packages (among other issues, we don't want machines to reboot into new kernels by surprise, for example after a crash or a power issue). Some of our hosts are used for testing, and I generally update their kernels (far) more often than our regular machines for various reasons. Until recently I did this with the obvious 'apt-get' command line:

apt-get -u upgrade --with-new-pkgs --ignore-hold

The problem with this is that it upgrades all held packages, not just the kernel. I have historically gotten away with this on the machines I do this on, but recently I got burned (well, more burned my co-workers); as part of a kernel upgrade I also upgraded another package that caused some problems.

Instead what you (I) need to do is to use 'apt-mark unhold <packages>' and then just 'apt-get -u upgrade --with-new-pkgs'. This is less convenient (but at least these days we have apt-mark). I continue to be sad that 'apt-get upgrade' doesn't take package(s) to upgrade and will upgrade everything, so you can't do 'apt-get upgrade linux-image-*' to directly express what you (I) want here.
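Putting that together, the sequence I mean looks something like this (the package names are illustrative Ubuntu kernel metapackages; your held set will vary):

```
apt-mark unhold linux-generic linux-image-generic linux-headers-generic
apt-get -u upgrade --with-new-pkgs
apt-mark hold linux-generic linux-image-generic linux-headers-generic
```

The final re-hold matters because, as noted below, upgrading a package unholds it as a side effect.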

(Fedora's DNF will do this, along with the inverse option of 'dnf upgrade --exclude=...', and both of these are quite nice.)

You can do this with 'apt-get install', but if you're going to use wildcards in the package name for convenience, you need to be careful and add an extra option, --only-upgrade:

apt-get -u install --only-upgrade 'linux-*'

Otherwise, 'apt-get install ...' will faithfully do exactly what you told it to, which is install or upgrade all of the packages that match the wildcard. If you're using 'apt-get install' to upgrade held packages, you probably don't want that. Despite its name, the --only-upgrade option will install new packages that are required by the packages that you're upgrading, such as new kernel packages that are required by a new version of 'linux-image-generic'.

The one semi-virtue of explicitly unholding packages to upgrade them is that this makes it very obvious that the packages are in fact unheld. An 'apt-get install <packages>' or an 'apt-get upgrade --ignore-hold' will unhold the packages as a side effect. Fortunately we long ago modified our update system to automatically apply our standard package holds before it did anything else (after one too many accidents where we should have re-held a package but forgot).

(I'm sure you could write a cover script to handle all of this, if you wanted to. Currently I don't do this often enough to go that far.)

How to talk to a local IPMI under FreeBSD 14

By: cks
26 August 2024 at 03:15

Much like Linux and OpenBSD, FreeBSD is able to talk to a local IPMI using the ipmi kernel driver (or device, if you prefer). This is imprecise although widely understood terminology; in more precise terms, FreeBSD can talk to a machine's BMC (Baseboard Management Controller) that implements the IPMI specification in various ways which you seem to normally not need to care about (for information on 'KCS' and 'SMIC', see the "System Interfaces" section of OpenBSD's ipmi(4)).

Unlike in OpenBSD (covered earlier), the stock FreeBSD 14 kernel appears to report no messages if your machine has an IPMI interface but the driver hasn't been enabled in the kernel. To see if your machine has an IPMI interface that FreeBSD can talk to, you can temporarily load the ipmi module with 'kldload ipmi'. If this succeeds, you will see kernel messages that might look like this:

ipmi0: <IPMI System Interface> port 0xca8,0xcac irq 10 on acpi0
ipmi0: KCS mode found at io 0xca8 on acpi
ipmi0: IPMI device rev. 1, firmware rev. 7.10, version 2.0, device support mask 0xdf
ipmi0: Number of channels 2
ipmi0: Attached watchdog
ipmi0: Establishing power cycle handler

(On the one Dell server I've tried this on so far, the ipmi(4) driver found the IPMI without any special parameters.)

At this point you should have a /dev/ipmi0 device and you can 'pkg install ipmitool' and talk to your IPMI. To make this permanent, you edit /boot/loader.conf to load the driver on boot, by adding:

ipmi_load="YES"

While you're there, you may also want to load the coretemp(4) module or perhaps amdtemp(4). After updating loader.conf, you need to reboot to make it take full effect, although since you can kldload everything before then I don't think there's a rush.

In FreeBSD, IPMI sensor information isn't visible in sysctl (although information from coretemp or amdtemp is). You'll need ipmitool or another suitable program to query it. You can also use ipmitool to configure the basics of the IPMI's networking and set the IPMI administrator's password to something you know, as opposed to whatever unique value the machine's vendor set it to, which you may or may not have convenient access to.
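The sorts of ipmitool invocations involved look like the following; the channel number, user ID, and network values are illustrative and vary between BMCs:

```
ipmitool sdr list                        # dump sensor readings
ipmitool lan print 1                     # current BMC network settings
ipmitool lan set 1 ipsrc static
ipmitool lan set 1 ipaddr 192.0.2.50
ipmitool lan set 1 netmask 255.255.255.0
ipmitool user set password 2 NEWPASS     # user 2 is often the admin
```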

(As far as I can tell, ipmitool works the same on FreeBSD as it does on Linux, so if you have existing scripts and so on that use it for collecting data on your Linux hosts (as we do), they will probably be easy to make work on any FreeBSD machines you add.)

I used libvirt's 'virt-install' briefly and it worked nicely

By: cks
23 August 2024 at 03:17

My normal way of using libvirt based virtual machines has been to initially create them in virt-manager using its convenient GUI, if necessary use virt-viewer to access their consoles, and use virsh for basic operations like starting and stopping VMs and rolling VMs back to snapshots, which I make heavy use of. Then recently I wrote about why and how I keep around spare virtual machines, and wound up discovering virt-install, which is supposed to let you easily create (and install) virtual machines from the command line. My first experience with it went well, so now I'm going to write myself some notes.

(I spun up a new virtual machine from scratch in order to poke at FreeBSD a bit.)

Due to having set up a number of VMs through virt-manager, I had already defined the network I wanted as well as a libvirt storage pool where the disks for the new virt-install VM could go. With those already existing, using virt-install was mostly a long list of arguments:

virt-install -n vmguest7 \
   --memory 8192 --vcpus 2 --cpu host \
   -c /virt/images/freebsd/FreeBSD-14.1-RELEASE-amd64-dvd1.iso \
   --osinfo freebsd14.0 \
   --disk size=20 --disk size=20 \
   -w network=netN-macvtap \
   --graphics spice --noautoconsole

(I think I should have used '--cpu host-passthrough' instead, because I think '--cpu host' caused virt-install to copy the host CPU features into the new VM instead of telling the new VM to just use whatever the host had.)

This created a VM with 8 GB of RAM (FreeBSD's minimum recommended amount for root on ZFS), two CPUs that are just like the host, two 20 GByte disks, the right sort of networking (using the already defined libvirt network), and not trying to start any sort of console since I was ssh'd in to the VM host. Once started, I used virt-viewer on my local machine to connect to the console and went through the standard FreeBSD installer in order to gain experience with it and see how it would go when I later did this on physical hardware.

This didn't create quite the same thing that I would normally get in virt-manager; for instance, this VM was created with an 'i440FX' (virtual) chipset instead of the Q35 chipset that I normally use and that may be better (this might be fixed with '--machine q35' or perhaps '--machine pc-q35-6.2'). The 'CDROM' it wound up with is an IDE one instead of a SATA one, although FreeBSD had no objections to it. All of the various differences don't seem to be particularly important, since the result worked and I'm only doing this for testing. The VM's new disks did get sensible file names, ie ones based on the VM's name.

(When the install finished and rebooted, the VM powered off, but this might have been a peculiarity in how I did things.)

Virt-install can create transient VMs with --transient, but as its documentation notes, the disks for these VMs aren't deleted after the VM itself is cleaned up. There are probably ways to use virt-install and some additional tooling to get truly transient VMs, where even their disks are deleted afterward, but I haven't looked at that since right now it's not really a usage case I'm interested in. If I'm spinning up a VM today, I want it to stick around for at least a bit.

(I'm also not interested in virt-builder or the automatic install side of virt-install; to put it one way, I want virtual versions of our physical servers, and they're not installed through cloud-init or other completely automated ways. I do have a limited use for using guestfish to automatically modify VM filesystems.)

What a POSIX shell has to do with $PWD

By: cks
22 August 2024 at 02:24

It's reasonably well known about Unix people that '$PWD' is a shell variable with the name of the current working directory. Well, sort of, because sometimes $PWD isn't right or isn't even set (all of this is part of the broader subject of shells and the current directory). Until recently, I hadn't looked up what POSIX has to say about $PWD, and when I did I was surprised, partly because I didn't expect POSIX to say anything about it.

(Until I looked it up, I had the vague impression that $PWD was a common but non-POSIX Bourne shell thing.)

What POSIX has to say is in 2.5.3 Shell Variables part of the overall description of the POSIX shell. To put my own summary on what POSIX says, the shell creates and maintains $PWD in basically all circumstances, and is obliged to update $PWD when it does a 'cd', even in shell scripts. The only case where $PWD's value isn't specified in the shell environment is if you don't have access permissions for the current directory for some reason.

(As far as I can tell, the complicated POSIX wording boils down to that if you start the shell with a correct $PWD that uses symbolic links (eg '/u/cks' instead of '/h/281/cks'), the shell is allowed to update that to the post-symlink 'physical' version but doesn't have to. See how 'pwd -P' is described.)
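You can watch both behaviors from a script; this is my own illustration, using a throwaway directory and symlink:

```shell
# A POSIX shell updates $PWD on 'cd' and may keep the logical
# (symlinked) path, while 'pwd -P' reports the physical one.
tmp=$(mktemp -d)
mkdir "$tmp/real"
ln -s real "$tmp/link"
cd "$tmp/link"
echo "$PWD"    # ends in /link: the logical path we cd'd through
pwd -P         # ends in /real: the physical path
cd /
rm -rf "$tmp"
```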

However, $PWD is not necessarily correct when you're running a program written in C, because POSIX chdir() doesn't seem to be required to update $PWD for you (although it's a bit confusing, since POSIX's Environment Variables section seems to imply that POSIX utilities are entitled to believe $PWD is correct if it's in the environment). In fact I don't think that the POSIX shell is always obliged to export $PWD into the environment, which is why I called it a shell variable instead of an environment variable. I believe most actual Bourne shell implementations do always export $PWD, even if they're started in an environment with it undefined (where I believe POSIX allows it to not be exported).

(Bash, Dash, and FreeBSD's Almquist shell all allow $PWD to be unexported, although keeping it that way may be tricky in Dash and FreeBSD sh, which appear to re-export it any time you do a 'cd'.)

The upshot of this is that in a modern environment where /bin/sh is a POSIX shell, $PWD will almost always be correct. It pretty much has to be correct in your POSIX shell sessions and in your POSIX shell scripts. POSIX-compatible shells like Bash will keep it correct even in their more expansive modes, and non-Bourne shells have a strong motive to go along with this because people expect $PWD to work and be correct.

(However, this leaves me mystified about what the problem was in my specific circumstance this time around, since I'd expect $PWD to have gotten set correctly when my /bin/sh based script used 'cd'.)

Why and how I keep around spare libvirt based virtual machines

By: cks
18 August 2024 at 03:17

Recently I mentioned in passing that I keep around spare virtual machines, and in comments Todd quite reasonably asked how one has such a thing (and sort of why one would bother). There are two parts to the answer, a general one and a libvirt one.

The general part is that one sort of virtual machine I run sits directly on the network, not NAT'd, using specifically assigned static IPs. In order to avoid ever having two VMs accidentally use the same IP, I pre-create a VM for each reserved IP with the (libvirt) name of the VM being its hostname. This still requires configuring each VM's OS with the right IP, but at least accidents are a lot less likely (and in my dominant use for the VMs, I do an initial install of an Ubuntu version with the right IP and then snapshot it).

The libvirt specific part is that I find it a pain in the rear to create a virtual machine, complete with creating and tracking a disk or disks for it, setting various bits and pieces up, and so on. Clever people who do this a lot could probably script it or build generic XML files or similar things, but instead I do it as little as possible, which means that I almost never delete virtual machines even if I'm not using them (although I shut them down). Right now my office desktop has ten VMs configured, none of which are normally running.

(I call this libvirt specific because it's fundamentally a user interface issue, since I could fix it with some sort of provisioning and de-provisioning script that automated all of the fiddly bits for me.)

The most important part of how I keep such VMs as 'spares' is that every time I set up a new VM, I snapshot its initial configuration, complete with a blank initial disk (under the imaginative snapshot name of 'empty-initial'). Then if I want to test something from complete scratch I don't have to go through the effort of making a new VM or erasing the disk of a currently unused one; I just find a currently unused VM, do 'virsh snapshot-revert cksvm5 empty-initial', connect the virtual DVD to an appropriate image (such as the latest FreeBSD or OpenBSD), and then run 'virsh start cksvm5'.
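As a concrete sketch of that workflow (illustrative only: the VM name cksvm5 and the snapshot name empty-initial come from the entry, but the virtual CD device name and the ISO path here are made up and will vary per VM):

```shell
vm=cksvm5

# One-time step, right after creating the VM with its blank disk:
# record the pristine state under the snapshot name used above.
virsh snapshot-create-as "$vm" empty-initial

# Later, to reuse the VM from scratch: roll back to the blank state,
# point the virtual DVD at an installer image, and boot.
virsh snapshot-revert "$vm" empty-initial
virsh change-media "$vm" hdc /isos/FreeBSD-14.1-RELEASE-amd64-disc1.iso
virsh start "$vm"
```

These commands need a running libvirt daemon and an already-defined VM, so they are a sketch of the sequence rather than something you can run standalone.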

(My earlier entry on how I've set up my libvirt based virtual machines covers the somewhat different way I handle having spare customized Ubuntu VMs that I can use to test things in our standard Ubuntu server environment.)

Using snapshots instead of creating and deleting VMs is probably a bit less efficient at the system level, but not enough for me to notice and care. Having written this, it occurs to me that I could get much the same effect by attaching and detaching virtual disks to the VMs, but with current tooling that would take more work. Libvirt's virsh command line tools make snapshots the easiest approach.

FreeBSD's 'root on ZFS' default appeals to me for an odd reason

By: cks
17 August 2024 at 02:23

For reasons beyond the scope of this entry, we're probably going to take a look at FreeBSD as an alternative to OpenBSD for some of our uses of the latter. This got me to grab a 14.1 ISO image and try a quick install on a spare virtual machine (I keep spare VMs around for just such occasions). This caused me to discover that modern FreeBSD defaults to using ZFS for its root filesystem (although I didn't do this on my VM test install, because my VM has less than the recommended RAM for ZFS). FreeBSD using ZFS for its root filesystem makes me happy, but probably not quite for the reasons you're expecting.

Certainly, I like ZFS in general and I think it has a bunch of nice properties, even for a root filesystem. You get checksums for reliability, compression, the ability to easily add sub-filesystems if you want to limit the amount of space something can use (we have usage cases for this, but that's another entry), and so on. But these aren't what make me happy for it as a root filesystem on FreeBSD. The really nice thing about root on ZFS on FreeBSD for me is the easy mirroring.

A traditional thing with all of our non-Linux installs is that they don't have mirrored system disks. We've made some stabs at it in the past but at the time we found it complex and not clearly compelling, perhaps partly because we didn't have experience with their software mirroring systems. Well, we have a lot of experience with mirroring ZFS vdevs and it's trivial to set ZFS mirroring up after the fact or to revert back from a mirrored setup to a single-disk setup. So while we might not bother going through the hassles of learning a FreeBSD-specific software mirroring system, we're pretty likely to use ZFS mirroring on any production FreeBSD machines. And that will be a good thing for our FreeBSD machines in general.
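For illustration (not from this entry: 'zroot' is the FreeBSD installer's default root pool name, and the da0p4/da1p4 partition names are placeholders), converting a single-disk root pool to a mirror after the fact, and back again, is just:

```shell
# Attach a second device to the existing single-disk vdev; ZFS
# resilvers and the vdev becomes a two-way mirror.
zpool attach zroot da0p4 da1p4   # existing device first, new device second
zpool status zroot               # watch the resilver progress

# Reverting to a single-disk setup later is equally simple:
zpool detach zroot da1p4
```

(For a bootable mirror you also need boot blocks or an EFI partition on the second disk, which is a separate, FreeBSD-specific step.)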

(Using ZFS for the root filesystem also eliminates any chance that the server will ever stall in boot asking us to approve a fsck, something that has happened to our OpenBSD machines under rare circumstances.)

I'm also personally pleased to see a fully supported 'root on ZFS' in anything. My impression is that FreeBSD is reasonably well used, so their choice of ZFS for the default root filesystem setup may even be exposing a reasonable number of people to (Open)ZFS and its collection of nice things.

PS: our OpenBSD machines come in pairs and we've had very good luck with their root drives, or we might have looked into the OpenBSD bioctl(8) software mirroring system and how you install to a mirror.

The Broadcom 'bnxt' Ethernet driver and RDMA (in Ubuntu 24.04)

By: cks
10 August 2024 at 03:16

We have a number of Supermicro machines with dual 10G-T Broadcom based networking; specifically what they have is the 'BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller'. Under Ubuntu 22.04, everything is fine with these cards (or at least seems to be in non-production use), using the normal bnxt_en kernel driver module. Unfortunately this is not our experience in Ubuntu 24.04.

In Ubuntu 24.04, these machines also load an additional Broadcom bnxt driver, bnxt_re, which is the 'Broadcom NetXtreme-C/E RoCE' driver. RoCE is short for RDMA over Converged Ethernet, and to confuse you, this driver is found in the 'Infiniband' area of the Linux kernel drivers tree. Unfortunately, on our hardware the 24.04 bnxt_re doesn't work (or maybe the hardware doesn't work and bnxt_re is failing to detect that, although with 'RDMA' in the name of the hardware one sort of suspects it's supposed to work). The driver stalls during boot and spits out kernel messages like:

bnxt_en 0000:ab:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xf]=0x3 waited (102721 > 100000) msec active 1
bnxt_en 0000:ab:00.0 bnxt_re0: Failed to modify HW QP
infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
infiniband bnxt_re0: Couldn't start port
bnxt_en 0000:ab:00.0 bnxt_re0: Failed to destroy HW QP
[... more fun ensues ...]

This causes systemd-udev-settle.service to fail:

udevadm[1212]: Timed out for waiting the udev queue being empty.
systemd[1]: systemd-udev-settle.service: Main process exited, code=exited, status=1/FAILURE

This then causes Ubuntu 24.04's ZFS services to fail to completely start, which is a bad thing on hardware that we want to use for our ZFS fileservers.

We aren't the only people with this problem, so I was able to find various threads on the Internet, for example. These gave me the solution, which is to blacklist the bnxt_re kernel module, but at the time left me with the mystery of how and why the bnxt_re module was even being loaded in the first place.

The answer is that bnxt_re is being loaded through the second sort of kernel driver module loading. It is an 'auxiliary' module for handling RDMA on top of the normal bnxt_en network driver, and the bnxt_en module basically asks for it to be loaded (which also suggests that at least the module thinks the hardware should be able to do RDMA properly). More specifically, bnxt_en basically asks for bnxt_en.rdma to be loaded, and that name is an alias for bnxt_re. Fortunately you don't have to know all of this in order to block bnxt_re from loading.
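A sketch of that blocking as a modprobe.d fragment (the file name is our arbitrary choice, and the `install` line is an extra belt-and-suspenders measure, not something the entry mentions; it makes even a direct load attempt fail):

```shell
# /etc/modprobe.d/blacklist-bnxt-re.conf
# Ignore bnxt_re's aliases (including bnxt_en.rdma) ...
blacklist bnxt_re
# ... and make an explicit 'modprobe bnxt_re' a no-op as well.
install bnxt_re /bin/false
```

On Ubuntu you likely also want to regenerate the initramfs afterwards (`sudo update-initramfs -u`) so the early-boot environment sees the same configuration.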

We don't have any 22.04 installs on this specific hardware any more, so I can't be completely sure what happened under 22.04, but it appears that 22.04 didn't load the bnxt_re module on these servers. Running 'modinfo' on the 22.04 module shows that it doesn't have the bnxt_en.rdma module alias it does in 24.04, so maybe you had to manually load it if your hardware had RDMA and you wanted to use it.

(Looking at kernel source history, it appears that bnxt_re support for using this 'auxiliary driver interface' only appeared in kernel 6.3, which is much too late for Ubuntu 22.04's normal server kernel, which is based on 5.15.0.)

One of my lessons learned from this is that in today's Linux kernel environment, drivers may enable additional functionality that you neither asked for nor wanted, just because it's there. We don't use RDMA and never asked for anything related to RoCE, but because the hardware is (theoretically) capable of it, we got it anyway.

Review: 'Maharaja' (2024)

30 September 2024 at 07:04

Spoilers abound; trigger warning: sexual violence

In case you haven't watched the film and don't plan to, you can check out the plot description on Wikipedia.

Maharaja was bad for two reasons.

First, good films don’t lie to their viewers. Maharaja did in two instances. It lied when it led viewers to believe the Selvam/Sabari storyline was contemporaneous to the Maharaja/Lakshmi storyline. Towards the film’s middle it slowly dawns on us that something’s off, followed by the epiphany that the Selvam/Sabari storyline concluded before the Maharaja/Lakshmi storyline began. What was the purpose of this switch? I can’t think of any beyond the film introducing a twist for a twist’s sake, which is disingenuous because it had no other point to it. It's a sign of the film taking its viewership for granted.

It lied the second time it becomes clear Nallasivam was the fourth person in Maharaja's house that day and we realise an ostensibly comical passage of the film has become doubly redundant β€” until we stop and think: what was the purpose of the film depicting Inspector Varadharajan’s phone calls at night to the various crooks asking them to take the responsibility for pilfering the dustbin?

Varadharajan would have known by then that Nallasivam was the culprit. Even if one of the crooks he phoned had agreed to own up to the crime, Varadharajan’s plan (previously hidden from the audience) to deliver Nallasivam to Maharaja’s house would have imploded. Alternatively, if Varadharajan was only fake-calling the crooks, why did we have to spend time watching their reactions? Maharaja offers this passage as comic relief, yet such relief wasn’t necessary. In fact the film could have done itself a favour by presaging Varadharajan’s plot against Nallasivam instead of blindsiding viewers at the climax.


This review benefited from inputs from and feedback by Srividya Tadepalli.


Second, the sexual violence in the film is gratuitous. It was reminiscent of Visaranai (2015) and parts of Paatal Lok (2020). It was trauma porn. We realise Selvam, Dhana, and Nallasivam grievously injured Jothi before Nallasivam raped her multiple times. Rather than simply and directly establish that the three men perpetrated sexual violence, Maharaja split up each instance of Nallasivam raping the girl into a separate scene. We sit there and watch Nallasivam perform the act of seeking Selvam’s β€˜permission’, followed by Selvam’s drawling response, and Nallasivam making excuses for what he’s about to do.

It’s possible Maharaja’s writers presumed they had to lay the groundwork to justify Varadharajan’s and Maharaja’s actions later. And yet they fail when they refuse to admit a rape once is heinous enough and then fail again when they conclude people who commit heinous crimes deserve vigilante justice.

Such justice is an expression of anger, an attempt to deter future crimes with violence. But we should know by now it fails utterly when directed against sexual violence, which erupts most often in intimate settings: when the perpetrator and the survivor are familiar with each other, more broadly when the men think they can get away with it. And most of all vigilante justice fails because it punishes once the (or a rumoured) perpetrator is caught, yet most perpetrators aren’t, which led to the dismal upwelling of voices during #MeToo. The sexual crimes we hear about constitute a small minority of all such crimes out there, which is why the best way to mitigate them has been to improve social justice.

Yet films like Maharaja persist with a vengeful narrative that concludes once the violence is delivered. I fear the only outcome might be more faith in β€œencounter” killings. Visaranai claimed to be fact-based but the brutality in the film served no greater purpose than to illustrate such things happen. If the film was responding to a fourth estate that had failed to highlight the underlying police impunity and the powerlessness of those at society’s margins to defend themselves, it succeeded β€” yet it also failed when it didn’t bother to attempt any sort of triumph, of spirit if not of will. That’s why Paatal Lok and in fact Jai Bhim (2021) were better. But Maharaja is cut from Visaranai’s cloth, and worse for being a work of imagination.

In fact, Maharaja has a β€˜second’ climax during which we discover Jothi is really Ammu, Selvam’s biological daughter, and whom Maharaja has been raising since his daughter, his wife, and Selvam’s wife were killed in the same accident. There are some clues at the film’s beginning as to these (intra-narrative) facts but they're ambiguous at best and in fact just disingenuous β€” another lie like the other plot twist.

But further yet: why? So we can watch Selvam have his lightbulb moment when he realises Jothi was Ammu and feel bad about what he did? (This was also the climax of 2023's Iratta.) Or that men should desist from such crimes because they could be harming their own daughters? Or that viewers might be duped into thinking any kind of justice has been done when Jothi shames Selvam with boilerplate lines? Consider it a third failure.

Why having diverse interests is a virtue

26 September 2024 at 05:19

Paris Marx's recent experience on the Canadaland podcast alerted me to the importance of an oft-misunderstood part of journalism in practice. When Marx and his host Justin Ling were recording the podcast, Marx said something about Israel conducting a genocide in Gaza. After the show was recorded, the publisher of Canadaland, a fellow named Jesse Brown, edited that bit out. When Marx as well as Ling complained, Brown reinstated the comment by having Marx re-record it to attribute the claim to some specific sources. Now, following Marx's newsletter and Ling's statement about Brown's actions, Brown has been saying on Twitter that Marx's initial comment, that many people have been saying Israel is conducting a genocide in Gaza, wasn't specific enough and needed specific sources.

Different publications draw the line in different places on how much of their content they'd like to be attributed. And frankly, there's nothing wrong, unfair, or unethical about this. As the commentary and narratives around Israel's violence in West Asia have alerted us, the facts as we consider them are often not set in stone even when they have very clear definitions. We're seeing, in an obnoxious way (from our perspective), many people disputing the claim that Israel is conducting a genocide and contesting whether it is a fact that Israel's actions constitute a genocide. Depending on the community to and for which you are being a journalist, it becomes okay for some things to be attributed to no one and just generally considered true, and for others not so much.

This is fundamentally because each one of us has a different level of access to all the relevant information, and because the existence of facts other than those we can experience through our senses (i.e. empirically) is controlled by some social determinants as well.

This whole Canadaland episode alerted me to the people trying to repudiate the allegation that Israel is conducting a genocide, especially many who are journalists by vocation, by purporting to scrutinise the claims they are being presented with. Now, scrutiny in and of itself is a good thing; it's one of the cornerstones of scepticism, especially a reasonable exercise of scepticism. But what they're scrutinising also matters, and that is a subjective call. I use the word 'subjective' with deliberate intent. Scrutiny in journalism is a good thing (I'm treating Canadaland as a journalistic outlet here), yet it's important to cultivate a good sense of what can and ought to be scrutinised, versus a scrutiny of something that only suggests the scrutiniser is being obstinate or intends to waste time.

Many, if not all, journalists would have started off being told it's important to be alert, to be aware of scrutinising all the claims they encounter. Many journalists also cultivate this sense over time, and the process by which they do so allows subjective considerations to seep in; that is not in and of itself a bad thing. In fact it's good. I have often come across editors who, based solely on their news sense, predicted a particular story's popularity where others only saw a dud. This is not a clinical scientific technique; it's by all means a sense. Informing this sense are, among other things, the pulse of the people to whom you're trying to appeal, the things they value, the things they used to value but don't any more, and so forth. In other words this sense or pulse has an important socio-cultural component to it, and it is within this milieu that scrutiny happens.

Scrutinising something in and of itself is not always a virtue for this reason: in the process of scrutinising something, it's possible for you to end up appealing to things that people don't consider virtues or, worse, which they could interpret to mean you're vouching for something they consider antithetical to their spirit as a people.

This Marx-Ling-Brown incident is illustrative to the extent that it spotlights the many journalists waking up to a barrage of statements, claims, and assertions, both on and off the internet, that Israel is conducting a genocide in Gaza. These claims are stinging them, cutting at the heart of something they value, something they hold close to their hearts as a community. So they're responding by subjecting these claims to some tough scrutiny. Many of us have spent many years applying the same sort of tests to many, many other claims. For example, science journalists had to wade through a lot of bullshit before we could surmount the tide of climate denialism and climate pacifism to get to where we are today.

However, now we're seeing these other people, including journalists, subjecting, of all things, the claim that Israel is conducting a genocide in Gaza to especial scrutiny. I think they're waking up to the importance of scepticism and scrutiny through this particular news incident. Many of us woke up before, and many of us will wake up in future, through specific incidents that are close to us, incidents that we know, more keenly than most others do, will have a very bad effect on society. These incidents are a sort of catalyst but they are also more than that: a kind of awakening.

You learn how to scrutinise things in journalism school, you understand the theory of it very quickly. It's very simple. But in practice, it's a different beast. They say you need to fact check every claim in a reporter's copy. But over time, what you do is you draw the line somewhere and say, "Beyond this point, I'm not going to fact check this copy because the author is a very good reporter and my experience has been that they don't make any statements or claims that don't stand up to scrutiny beyond a particular level." You develop and accrue these habits of journalism in practice because you have to. There are time constraints and mental bandwidth constraints, so you come up with some shortcuts. This is a good thing, but acknowledging this is also important and valuable rather than sweeping it under the rug and pretending you don't do it.


If you want to be a good journalist, you have to cultivate for yourself the right conduits of awakening, and by "right" I mean those conduits that will awaken you to the pulses of the people and the beats you're responsible for rather than serve some counterproductive purpose. These conduits should specifically do two things. One: they should awaken you as quickly and with as much clarity as possible to what it means to fact-check or scrutinise something. They should teach you the purpose of it, why you do it, what good scrutiny looks like, and where the line is between scrutiny and nitpicking or pedantry. Two: they should alert you to, or alert others about, your personal sense of right and wrong, good and bad. That's why it's a virtue to cultivate as many conduits as possible, that is, to have diverse interests.

When we're interested in many things about the world, about the communities and the societies that we live in, we are over time awakened again and again. We learn how to subject different claims to different levels of scrutiny because that experience empirically teaches us what, when, and how to scrutinise and, importantly, why. Today we're seeing many of these people wake up and subject the claim that Israel is conducting a genocide to the tests that we've administered to climate denialism, the anti-vaccine movement, and various other pseudoscientific movements. When we look at them we see stubborn people who won't admit simple details that are staring us in the face. This disparity arises because of how we construct our facts, the virtues to which we would like to appeal, and the position of the line beyond which we say no further attribution is necessary.

Obviously there is no such thing as the view from nowhere, and I'm clear that I'm almost always appealing to the people who are not right-wingers. So from where I'm standing it seems more often than not as if the tests being administered to, say, the anti-vaccine movement are more valid instances of their use than the tests being administered against claims that Israel is conducting a genocide.

Such divisions arise when we don't cultivate ourselves as individuals, when we don't nurture ourselves and the things that we're interested in. Simply, it speaks to the importance of having diverse interests. It's like travelling the world, meeting many people, experiencing many cultures. Such experiences teach us about multiculturalism and why it's valuable, and they teach us the precise ways in which xenophobia, authoritarianism, and nationalism effect their baleful consequences. In a very similar way, diverse interests are good teachers about the moral landscape we all share and its normative standards that we co-define. They can quickly teach you how far you stand from where you might really like to be.

In fact, it's entirely possible for a right-winger to read this post and take away the idea that where they stand is right. As I said, there is no view from nowhere. Right and wrong depend on your vantage point, in most cases at least. I wanted to put these thoughts down because it seems that people who don't have many interests, or who have very limited interests, are also more likely to disengage from social issues earlier than others. Disengagement is the fundamental problem, the root cause. There are many reasons why it arises in the first place, but getting rid of it is entirely possible, and importantly something we need to do. A good way to do it is to cultivate many interests, to be interested in many problems, so that over time our experiences navigating those interests inevitably lead to a good sense of what we should and what we needn't scrutinise. It will teach us why some particular points of an argument are ill-founded. And if we're looking for it, it will give us a chance to fix that and even light the way.

Rescuing ZFS

16 July 2024 at 00:00

Last week I decided to upgrade the Debian operating system on one of my servers. In principle the upgrade is simple: you put the new package repositories into the sources.list file, then run apt-get -y update, apt-get -y upgrade and apt-get -y full-upgrade (plus a few small things). I did all of that nicely, and at the end only the reboot command remained, which restarts the system. A minute or two of waiting, and the server should have woken up with an upgraded operating system. Except that this did not happen. Not after five minutes, not after ten. Which is... an alarm signal. Especially when the server is on the other end of... Slovenia (or Europe, it makes no difference).

PiKVM

Fortunately, a PiKVM was attached to the server. This is a small device that provides remote access to and remote management of computers. The PiKVM is basically an add-on (a so-called "hat") that you attach to a Raspberry Pi. You then connect the PiKVM to a computer in place of its monitor and keyboard/mouse; the PiKVM then acts as a virtual monitor, virtual keyboard, mouse, CD, USB stick, and so on. Through it you can manage a computer or server remotely (you can even enter the BIOS, or virtually press the power or reset button), all from a web browser. The software is fully open source, and the device also supports being attached to a KVM switch, which lets you manage several computers remotely; that is ideal for mounting in a data center.

The PiKVM as purchased.

In short, once the server had been unresponsive for a while, I connected to the PiKVM and went to see what had actually happened. And what had happened was... a disaster.

The problem

After the reboot, the server was stuck in the initramfs. Aaaaaa! At the bottom of the screen glowed one last warning before the system finally expired: ALERT! ZFS=rpool/ROOT/debian does not exists. Dropping to a shell!. In my despair I overlooked the 's' and read "hell"...

At that moment I remembered that the server's root partition was, of course, on a ZFS filesystem, and an encrypted one at that, and that during the upgrade I had of course forgotten to manually enable the kernel modules that let the operating system recognize ZFS at boot. To make matters worse, the server ran (well, not any more) several virtual servers, which were now all unreachable.

Note: ZFS (Zettabyte File System) is an advanced filesystem known for its reliability, scalability, use of advanced error detection and correction techniques (which ensure data is always consistent and uncorrupted), compression and deduplication, and so on. In short, ideal for server environments.

Good, now we know what the problem is, but how do we fix it?

The rescue plan

To recover at least a little from the shock, I first made myself a strong coffee. That decision turned out to be strategic, since rescuing the system dragged on late into the night (and into the next morning).

After a short think, the following plan took shape in my head. First, boot the system from a Debian live CD, install ZFS support on that temporary system, attach the ZFS disks, chroot into the old system, repair the damage there, and reboot the whole thing. And that's it!

At this point, in an old movie, I would simply mount my horse and ride into the sunset, but as it turned out, the road to the horse (and its saddle) was still quite thorny. Let's take it step by step.

The PiKVM in action.

First I uploaded the file debian-live-12.6.0-amd64-standard.iso to the PiKVM, attached it as a virtual CD, and booted the server. This was genuinely easy, and the PiKVM once again proved itself worth its money.

Right at the start, however, it turned out that the server only recognized a US keyboard layout. Since mine is Slovenian, I first had to figure out which key to press to get exactly the special character I needed. Here are some of the characters I used most often on the Slovenian keyboard and their "translations" to the US keyboard:

- /
? - 
Ž |
+ =
/ &

Light at the end of the tunnel

The next step was to add the contrib repository to the live system's /etc/apt/sources.list. After that I could install ZFS support: sudo apt update && sudo apt install linux-headers-amd64 zfsutils-linux zfs-dkms zfs-zed.

After a minute or two I could load the ZFS kernel modules: sudo modprobe zfs. The command zfs version showed that ZFS support was now working:

zfs-2.1.11-1
zfs-kmod-2.1.11-1

So the first step succeeded; now I "only" had to attach the existing disks to the system. First I created a suitable directory to mount them on: sudo mkdir /sysroot.

Then I tried to attach my "rpool" ZFS pool to it. The commands below are only approximate (you probably need to do a bit more, such as setting the mountpoint), but they may serve as a guide for anyone with similar problems. I should add that it did not go entirely smoothly, and quite a bit of gymnastics was needed to reach the final goal.

sudo zpool import -N -R /sysroot rpool -f

sudo zpool status
sudo zpool list
sudo zfs get mountpoint

At this point I entered the encryption passphrase, sudo zfs load-key rpool... and verified that ZFS was unlocked: sudo zfs get encryption,keystatus.

Now the mount: sudo zfs mount rpool/ROOT/debian. And there it was: the data was visible and, by the look of it, nothing was lost!

Reviving the "patient"...

Finally came the chroot into the old system:

sudo mkdir /sysroot/mnt
sudo mkdir /sysroot/mnt/dev
sudo mkdir /sysroot/mnt/proc
sudo mkdir /sysroot/mnt/sys
sudo mkdir /sysroot/mnt/run
sudo mount -t tmpfs tmpfs /sysroot/mnt/run
sudo mkdir /sysroot/mnt/run/lock

sudo mount --make-private --rbind /dev /sysroot/mnt/dev
sudo mount --make-private --rbind /proc /sysroot/mnt/proc
sudo mount --make-private --rbind /sys /sysroot/mnt/sys

sudo chroot /sysroot/mnt /usr/bin/env DISK=$DISK bash --login

I was now successfully connected to the old ("broken") system. First, ZFS support had to be installed into it:

apt install --yes dpkg-dev linux-headers-generic linux-image-generic
apt install --yes zfs-initramfs
echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf

...with minor difficulties

Of course, another error appeared along the way: software could not be installed because of a broken systemd package. I solved that with:

sudo rm /var/lib/dpkg/info/systemd*
sudo dpkg --configure -D 777 systemd
sudo apt -f install

Then, of course, came unresolved dependencies... I no longer remember exactly how I managed to resolve them, but the following commands helped (not necessarily in this order):

dpkg --force-all --configure -a
apt --fix-broken install
apt-get -f install

Next, the EFI partition had to be mounted (after first figuring out where it actually was):

cp -r /boot /tmp
zpool import -a
lsblk
mount /dev/nvme0n1p2 /boot/efi
cd /tmp
cp * /boot/

Now for real!

Finally I could run the commands that add the ZFS kernel modules to the operating system kernel:

update-initramfs -c -k all
dkms autoinstall
dkms status
update-grub
grub-install

And then, at last, came the reboot, after which I still had to fix the ZFS mountpoint (zfs set mountpoint=/ rpool/ROOT/debian)... one more reboot, and the old system rose from the dead.

Repairing the damage afterwards

Because of all the wizardry and the not-quite-finished upgrade, I had to install the missing packages, reinstall a few systemd packages, and remove the old operating system kernels. All by hand, of course.

Oh, and for some reason the SSH server disappeared during the upgrade. But fixing that was now a piece of cake.

Then came a reboot, and one more reboot, to see whether everything really worked.

All's well that ends well

And now it works. Oh, how nicely it works! ZFS is encrypted, the system boots up nicely after the unlock passphrase is entered, and the virtual servers start automatically too. And the PiKVM has earned a very special place in my heart.

Until next time, as they say! :)

P.S. Thanks also to Jure for his help. Without his advice the whole thing would have taken considerably longer.

How Linux kernel driver modules for hardware get loaded (I think)

By: cks
9 August 2024 at 03:22

Once upon a time, a long time ago, the kernel modules for your hardware got loaded during boot because they were listed explicitly as 'load these modules' in configuration files somewhere. You can still explicitly list modules this way (and you may need to for things like IPMI drivers), but most hardware driver modules aren't loaded like this any more. Instead they get loaded through udev, through what I believe are two mechanisms.

The first mechanism is that as the kernel inventories things like PCIe devices, it generates udev events with 'MODALIAS' set in them in a way that incorporates the PCIe vendor and device/model numbers. At the same time, kernel modules declare all of the PCIe vendor and model values that they support, which are turned into (somewhat wild carded) module aliases that you can inspect with 'modinfo', for example:

$ modinfo bnxt_en
description: Broadcom BCM573xx network driver
license:     GPL
alias:       pci:v000014E4d0000D800sv*sd*bc*sc*i*
alias:       pci:v000014E4d00001809sv*sd*bc*sc*i*
[...]
alias:       pci:v000014E4d000016D8sv*sd*bc*sc*i*
[...]

(The other parts of the pci MODALIAS value are apparently, in order, the subsystem vendor, the subsystem device/model, the base class, the sub class, and the 'programming interface'. See the Arch Wiki entry on modalias.)
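
As a small illustration, here is one way to pull the vendor and device IDs back out of a modalias string with standard shell tools. The sample string uses the bnxt_en vendor/device values from above, but the subsystem and class fields in it are made up:

```shell
# Extract the PCIe vendor and device IDs from a "pci:" modalias string.
# Kernel-generated modaliases use uppercase hex, so [0-9A-F] suffices.
parse_modalias() {
  printf '%s\n' "$1" |
    sed -n 's/^pci:v\([0-9A-F]*\)d\([0-9A-F]*\).*/vendor=\1 device=\2/p'
}

# Sample: bnxt_en's vendor/device, with hypothetical sv/sd/bc/sc/i fields.
parse_modalias 'pci:v000014E4d000016D8sv00001028sd00001F2Abc02sc00i00'
# -> vendor=000014E4 device=000016D8
```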

As I understand things (and udev rules), when udev processes a kernel udev event with a MODALIAS set, it will attempt to load a kernel module that matches the name. Usually this will be done through wild card matching against aliases, as in the case of Broadcom BCM573xx cards; a supported card will have its PCIe vendor and device listed as an alias, so udev will wind up loading bnxt_en for the card.

The second mechanism is through something called the Auxiliary 'bus'. To put my own spin on it, this is a way for core hardware drivers to declare (possibly only under some situations) that loading an additional driver can enable extra functionality. When the main driver loads and registers itself, it will register a pseudo-device on the 'auxiliary bus'. This bus registration generates a udev event with a MODALIAS that starts with 'auxiliary:' and apparently is generally formatted as 'auxiliary:<core driver>.<some-feature>', for example 'auxiliary:bnxt_en.rdma'. When this pseudo-device is registered, the udev event goes out from the kernel, is picked up by udev, and triggers an attempt to load whatever kernel module has declared that name as an alias. For example:

$ modinfo bnxt_re
[...]
description: Broadcom NetXtreme-C/E RoCE Driver
[...]
alias:       auxiliary:bnxt_en.rdma
depends:     ib_uverbs,ib_core,bnxt_en
[...]

(Inside the kernel, the two kernel modules use this pseudo-device on the auxiliary bus to connect with each other.)

As far as I know, the main kernel driver modules don't explicitly publish information on what auxiliary bus things they may trigger; the information exists only in their code. You can attempt to go the other way by looking for modules that declare themselves as auxiliaries for something else. This is most conveniently done by looking for 'auxiliary:' in /lib/modules/<version>/modules.alias.

(Your results may depend on the specific kernel versions and build options involved, and perhaps what additional packages have been added. On my Fedora 40 machine with 6.9.12, there are 37 auxiliary: aliases; on an Ubuntu 24.04 machine with '6.8.0-39', there are 49, with the extras coming from the peci_cputemp and peci_dimmtemp kernel modules.)
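
As a sketch, filtering such aliases out of a modules.alias-style file can be done with awk. The sample input below reproduces the bnxt_re line from above; on a real system you would point awk at /lib/modules/$(uname -r)/modules.alias instead:

```shell
# Build a small modules.alias-style sample and list the modules that
# declare auxiliary-bus aliases, together with the alias they claim.
cat <<'EOF' >/tmp/modules.alias.sample
alias auxiliary:bnxt_en.rdma bnxt_re
alias pci:v000014E4d000016D8sv*sd*bc*sc*i* bnxt_en
EOF

awk '$1 == "alias" && $2 ~ /^auxiliary:/ { print $3 " <- " $2 }' \
    /tmp/modules.alias.sample
# -> bnxt_re <- auxiliary:bnxt_en.rdma
```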

PS: PCI(e) devices aren't the only thing that this kernel module alias facility is used for. There are a whole collection of USB modaliases, a bunch of 'i2c' and 'of' ones, a number of 'hid' ones, and so on.

Host names in syslog messages may not be quite what you expect

By: cks
7 August 2024 at 02:20

Over on the Fediverse, I said something:

It has been '0' days since I (re)discovered that the claimed hostname in syslog messages can be utter junk, and you may be going to live a fun life if you use it for anything much.

Suppose that on your central syslog server you see a syslog line of the form:

[...] alkyone exim[864974]: no host name found for IP address 115.187.17.119

You might reasonably assume that the host name 'alkyone' comes from the central syslog daemon knowing the host name of the host that sent the syslog message to it. Unfortunately, this is not what actually happens. As covered in places like RFC 5424 section 6.2.4 (or RFC 3164 section 4.1.2 for the nominal 'BSD' syslog format, which seems to not actually be what BSD used), syslog messages carry an embedded hostname in them. This hostname is generated by the machine that originated the message, and the machine can put anything it wants to in there. And generally, your syslog daemon (and the log format it's using) will write this hostname into the logs and otherwise use it if you ask for the message's 'hostname'.
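
To make the layout concrete, here is a minimal sketch of where the hostname sits in a message like the one above (the '<34>' priority value is made up); the receiving server simply takes this field at face value:

```shell
# An RFC 3164-style message: <PRI>, a timestamp, then the
# sender-supplied hostname. Strip the priority and skip the three
# timestamp fields to get at the hostname.
msg='<34>Aug  7 02:20:00 alkyone exim[864974]: no host name found for IP address 115.187.17.119'
host=$(printf '%s\n' "$msg" | sed 's/^<[0-9]*>//' | awk '{ print $4 }')
echo "$host"
# -> alkyone
```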

(Rsyslog and probably other syslog daemons can create per-host message files on your central syslog server, which can cause you to want a hostname for each message.)

The intent of this embedded hostname is noble; it's there so you can have syslog relays (which may happen accidentally), where the originating system sends its messages to host A and host A relays them to host B, and B records the hostname as the originating system, not host A. Unfortunately, in practice all sorts of things can go wrong, including a quite fun one.

The first thing that can go wrong is systems that have a different view of their hostname than you do. On Unix systems, the normal syslog hostname traditionally comes from whatever the general host name is set to, which isn't necessarily a fully qualified domain name and doesn't necessarily match what its IP address is (you can change the IP address of a system but forget to update its hostname). Some embedded systems will have an internally set host name instead of trying to deduce it from DNS lookups of whatever IP they have, which can cause them to use syslog hostnames like 'idrac-<asset-tag>' (for the BMC of a Dell server with that particular asset tag).

The most fun case is an interaction with a long-standing syslog feature (that I think is often disabled today):

<host> /bsd: arp: attempt to overwrite entry for [...]
last message repeated 2 times

You'll notice that the second message doesn't say '<host> last message repeated ...'. This is achieved with the extremely brute force method of setting the hostname in the message to 'last'. If your central syslog server then attempts to set up per-host syslog logs, you will wind up with a 'last' host (with extremely uninteresting logs).

Also, if people send not quite random garbage to your syslog server's listening network ports (perhaps because they are a vulnerability scanner or nmap or the like), your syslog daemon and your logs can wind up seeing all sorts of weird junk as the nominal hostname. The syslog message format is deliberately relatively liberal and syslog servers have traditionally been even more liberal about interpreting things that arrived on it, on the sensible grounds that it's usually better to record everything you get just in case.

Sidebar: Hostnames in syslog messages appear to be new-ish

In 4.2 BSD, the syslog daemon was part of the sendmail source code, and sendmail/aux/syslog.c doesn't get the hostname from the message but instead from the IP address it came from. I think this continues right through 4.4 BSD if I'm reading the code right. RFC 3164 dates from 2001, so presumably people augmented the syslog format some time before then.

Interestingly, RFC 3164 specifically says that the host name in the message must not include the domain name. I suspect that even at the time this was widely ignored in practice for good operational reasons.

The uncertain possible futures of Unix graphical desktops

By: cks
27 July 2024 at 02:40

Once upon a time, the future of Unix desktops looked fairly straightforward. Everyone ran on X, so the major threat to cross-Unix portability in major desktops was the use of Linux-only APIs, which increasingly meant D-Bus and systemd related things. Unix desktops that were less attached to tight integration with the Linux environment would probably stay easily available on FreeBSD, OpenBSD, and so on.

What happened to this nice simple vision was Wayland becoming the future of (Linux) graphics. Linux is the primary target of KDE and especially Gnome, so Wayland being the future on Linux has gotten Gnome developers to start moving toward a Wayland-only vision. Wayland is unapologetically not cross-platform the way X was, which leaves other Unixes with a problem and creates a number of possible futures for Unix desktops.

In one future, other Unixes imitate Linux, implementing enough APIs to run Wayland and the other Linux things that in practice it depends on, and as a result they can probably continue to provide the big Linux-focused desktop environments like Gnome. I believe that FreeBSD is working on this approach, although I don't know if Gnome on Wayland on FreeBSD works yet. This allows the other Unix to mostly look like Linux, desktop-wise. As an additional benefit, it allows the other Unix to also use other, more minimal Wayland compositors (ie, window managers) that people may like, such as Sway (the one everyone mentions).

In another future, other Unixes don't attempt to chase Linux by implementing APIs to get Wayland and Gnome and so on to run, and instead stick with X. As desktops, major toolkits, and applications drop support for X or break working on it through lack of use and lack of caring, these Unixes are likely to increasingly be left with old-fashioned X environments that are a lot more 'window manager' than they are 'desktop'. There are people, me included, who would be more or less happy with this state of affairs (in my case, as long as Firefox and a few other applications keep working). I suspect that this is the path that OpenBSD will stick with, and my guess is that anyone using OpenBSD for their desktop or laptop environment will be happy with this.

An unpleasant variant of this future comes about if Firefox and other applications are aggressive about dropping support for X. This would leave X-only Unixes as a backwater, stuck with (at best) old versions of important tools such as web browsers. There are some people who would still be happy with this, but probably not many.

Broadly, I think there is going to be a split between what you could call the Linux desktop (Wayland based with a major desktop environment such as Gnome, even if it's on FreeBSD instead of Linux), perhaps the Wayland desktop (Wayland based with a compositor like Sway instead of a full blown desktop environment), and an increasingly limited Unix desktop that over time will find itself having to move from being a desktop environment to being a window manager environment (as the desktop environments stop working well on X).

PS: One big question about the future of the Unix desktop is how many desktop environments will get good Wayland support and then abandon X. Right now, there are a fair number of desktop environments that have little or no Wayland support and a reasonable user base. The existence and popularity of these environments helps drive demand for continued X support in toolkits and so on. Of course, major Linux distributions may throw X-only desktops overboard someday, regardless of usage.

Seeing and matching pf rules when using tcpdump on OpenBSD's pflog interface

By: cks
24 July 2024 at 02:45

Last year I wrote about some special tcpdump filtering options for OpenBSD's pflog interface, including the 'rnr <number>' option for matching and showing only packets blocked by a specific rule. You might want to do this if, for example, you temporarily throw brute force attacker IPs into a table and want to take them out soon after they stop hitting you.

Assuming that you're watching live, the way you do this is to find the rule number with 'pfctl -vv -s rules | grep @ | grep <term>' for a suitable term, such as the table name (or look through the whole thing with a pager), and then run 'tcpdump -n -i pflog0 "rnr <number>"'. However, looking up rule numbers is annoying and a clever person might remember that the OpenBSD tcpdump can print the pf rule information for pflog packets, through the '-e' option (for pflog, this is considered the link-level header). So you might think that the easy way to achieve what you want is 'tcpdump -n -i pflog0 | grep <term>', which is to say you're dumping all pflog packets and then picking out the ones that matched your rule.

Unfortunately, the pflog 'link-level header' doesn't actually tell you this. What it has is the rule number, whether the packet was blocked or not (you can log without blocking), which direction the block was (in or out), and what interface (plus that the packet was blocked because it matched a rule):

21:20:43.525222 rule 231/(match) block in on ix1: [...]

Quite sensibly, you don't get the actual contents of the rule that blocked the packet, so you can't grep for it and my clever idea was not so clever. If you read all the way to the Link Level Headers section of the OpenBSD tcpdump manual page, it explicitly tells you this:

On the packet filter logging interface pflog(4), logging reason (rule match, bad-offset, fragment, bad-timestamp, short, normalize, memory), action taken (pass/block), direction (in/out) and interface information are printed out for each packet.

So don't be like me and waste your time with the 'grep the tcpdump output' approach. It isn't going to work and you're going to have to do it the hard way.

As far as I know there's no way to attach some sort of marker to rules in your pf.conf that will make them easy to pick out in pflog(4) packets. Based on the pflog(4) manual page, the packet format just doesn't have room for that. If you absolutely need to know this sort of thing for sure, even over rule changes, I think your only option is to log the packets to a non-default pflog(4) interface and then arrange for something to receive and store stuff from that interface.
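
A sketch of what that last approach looks like in pf.conf, with a hypothetical table name; blocks logged this way show up only on pflog1, so you can watch them without caring about rule numbers:

```
# pf.conf: log blocks of brute force attacker IPs to a dedicated
# pflog interface instead of the default pflog0.
# (The pflog1 interface must exist first: ifconfig pflog1 create)
table <bruteforce> persist
block in log (to pflog1) quick from <bruteforce>

# Then watch (or record) just those blocks with:
#   tcpdump -n -e -i pflog1
```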

The challenges of working out how many CPUs your program can use on Linux

By: cks
23 July 2024 at 02:20

In yesterday's entry, I talked about our giant (Linux) login server and how we limit each person to only a small portion of that server's CPUs and RAM. These limits sometimes expose issues in how programs attempt to work out how many CPUs they have available so that they can automatically parallelize themselves, or parallelize their build process. This crops up even in areas where you might not expect it; for example, both the Go and Rust compilers attempt to parallelize various parts of compilation using multiple threads within a single compiler process.

In Linux, there are at least three different ways to count the number of 'CPUs' that you might be able to use. First, your program can read /proc/cpuinfo and count up how many online CPUs there are; if code does this on our giant login server, it will get 112 CPUs. Second, your program can call sched_getaffinity() and count how many bits are set in the result; this will detect if you've been limited to a subset of the CPUs by a tool such as taskset(1). Finally, you can read /proc/self/cgroup and then try to find your cgroup to see if you've been given cgroup-based resource limits. These limits won't be phrased in terms of the number of CPUs, but you can work backward from any CPU quota you've been assigned.

In a shell script, you can do the second with nproc, which will also give you the full CPU count if there are no particular limits. As far as I know, there's no straightforward API or program that will give you information on your cgroup CPU quota if there is one. The closest you can apparently come is to use cgget (if it's even installed), but you have to walk all the way back up the cgroup hierarchy to check for CPU limits; it's not necessarily visible in the cgroup (or cgroups) listed in /proc/self/cgroup.
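
Here's a quick sketch of getting all three answers from a shell on a cgroup v2 system; the first two commands are standard, while the cpu.max path assumes the unified cgroup hierarchy is mounted at /sys/fs/cgroup:

```shell
# 1. Count every online CPU the machine has.
grep -c '^processor' /proc/cpuinfo

# 2. Count only the CPUs this process may run on (sched_getaffinity()).
nproc

# 3. Look for a cgroup v2 CPU quota; "max 100000" means no quota.
#    (This checks only our own cgroup, not any of its parents.)
cgpath=$(sed -n 's/^0:://p' /proc/self/cgroup)
cat "/sys/fs/cgroup${cgpath}/cpu.max" 2>/dev/null || true
```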

Given the existence of nproc and sched_getaffinity() (and how using them is easier than reading /proc/cpuinfo), I think a lot of scripts and programs will notice CPU affinity restrictions and restrict their parallelism accordingly. My experience suggests that almost nothing is looking for cgroup-based restrictions. This occasionally creates amusing load average situations on our giant login server when a program will see 112 CPUs 'available' and promptly try to use all of them, resulting in their CPU quota being massively over-subscribed and the load average going quite high without actually really affecting anyone else.

(I once did this myself on the login server by absently firing up a heavily parallel build process without realizing I was on the wrong machine for it.)

PS: The corollary of this is that if you want to limit the multi-CPU load impact of something, such as building Firefox from source, it's probably better to use taskset(1) than to do it with systemd features, because it's much more likely that things will notice the taskset limits and not flood your process table and spike the load average. This will work best on single-user machines, such as your desktop, where you don't have to worry about coordinating taskset CPU ranges with anyone or anything else.
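
For example, since nproc goes through sched_getaffinity(), a taskset restriction is immediately visible to it (and to build tools that check CPU availability the same way):

```shell
# Restrict a command to CPU 0 only; nproc then reports a single CPU,
# so parallel tools run under this taskset will limit themselves too.
taskset -c 0 nproc
# -> 1
```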

The Linux Out-Of-Memory killer process list can be misleading

By: cks
19 July 2024 at 03:30

Recently, we had a puzzling incident where the OOM killer was triggered for a cgroup, listed some processes, and then reported that it couldn't kill anything:

acctg_prof invoked oom-killer: gfp_mask=0x1100cca (GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
[...]
memory: usage 16777224kB, limit 16777216kB, failcnt 414319
swap: usage 1040kB, limit 16777216kB, failcnt 0
Memory cgroup stats for /system.slice/slurmstepd.scope/job_31944
[...]
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 252732]     0 252732     1443        0    53248        0             0 sleep
[ 252728]     0 252728    37095     1915    90112       54         -1000 slurmstepd
[ 252740]  NNNN 252740  7108532    17219 39829504        5             0 python3
[ 252735]     0 252735    53827     1886    94208      151         -1000 slurmstepd
Out of memory and no killable processes...

We scratched our heads a lot, especially as something seemed to be killing systemd-journald at the same time and the messages being logged suggested that it had been OOM-killed instead (although I'm no longer so sure). Why was the kernel saying that there were no killable processes when there was a giant Python process right there?

What was actually going on is that the OOM task state list leaves out a critical piece of information, namely whether or not the process in question had already been killed. A surprising number of minutes before this set of OOM messages, the kernel had done another round of a cgroup OOM kill for this cgroup and:

oom_reaper: reaped process 252740 (python3), now anon-rss:0kB, file-rss:68876kB, shmem-rss:0kB

So the real problem was that this Python process was doing something that had it stuck sitting there, using memory, even after it was OOM killed. The Python process was indeed not killable, for the reason that it had already been killed.

The whole series of events is probably sufficiently rare that it's not worth cluttering the tasks state listing with some form of 'task status' that would show if a particular process was already theoretically dead, just not cleaned up. Perhaps it could be done with some clever handling of the adjusted OOM score, for example marking such processes with a blank value or a '-'. This would make the field not parse as a number, but then kernel log messages aren't an API and can change as the kernel developers like.

(This happened on one of the GPU nodes of our SLURM cluster, so our suspicion is that some CUDA operation (or a GPU operation in general) was in progress and until it finished, the process could not be cleaned and collected. But there were other anomalies at the time so something even odder could be going on.)

Fedora 40 probably doesn't work with software RAID 0.90 format superblocks

By: cks
11 July 2024 at 02:47

On my home machine, I have an old pair of HDDs that have (had) four old software RAID mirrors. Because these were old arrays, they were set up with the old 0.90 superblock metadata format. For years the arrays worked fine, although I haven't actively used them since I moved my home machine to all solid state storage. However, when I upgraded from Fedora 39 to Fedora 40, things went wrong. When Fedora 40 booted, rather than finding four software RAID arrays on sdc1+sdd1, sdc2+sdd2, sdc3+sdd3, and sdc4+sdd4 respectively, Fedora 40 decided that the fourth RAID array was all I had, and it was on sdc plus sdd (the entire disks). Since the fourth array had an LVM logical volume that I was still mounting filesystems from, things went wrong from there.

One of the observed symptoms during the issue was that my /dev had no entries for the sdc and sdd partitions, although the kernel messages said they had been recognized. This led me to stopping the 'md53' array and running 'partprobe' on both sdc and sdd, which triggered an automatic assembly of the four RAID arrays. Of course this wasn't a long term solution, since I'd have to redo it (probably by hand) every time I rebooted my home machine. In the end I wound up pulling the old HDDs entirely, something I probably should have done a while back.

(This is filed as Fedora bug 2280699.)

Many of the ingredients of this issue seem straightforward. The old 0.90 superblock format is at the end of the object it's in, so a whole-disk superblock is at the same place as a superblock in the last partition on the disk, if the partition goes all the way to the end. If the entire disk has been assembled into a RAID array, it's reasonable to not register 'partitions' on it, since those are probably actually partitions inside the RAID array. But this doesn't explain why the bug started happening in Fedora 40; something seems to have changed so that Fedora 40's boot process 'sees' a whole disk RAID array based on the 0.90 format superblock at the end, where Fedora 39 did not.

I don't know if other Linux distributions have also picked up whatever change in whatever software is triggering this in Fedora 40, or if they will; it's possible that this is a Fedora specific issue. But the general moral I think people should take from this is that if you still have software RAID arrays using superblock format 0.90, you need a plan to change that. The Linux Raid Wiki has a somewhat dangerous looking in-place conversion process, but I wouldn't want to try that without backups. And if you have software RAID arrays that old, they probably contain old filesystems that you may want to recreate so they pick up new features (which isn't always possible with an in-place conversion).

Sidebar: how to tell what superblock format you have

The simple way is to look at /proc/mdstat. If the status line for a software RAID array mentions a superblock version, you have that version, for example:


md26 : active raid1 sda4[0] sdb4[1]
      94305280 blocks super 1.2 [2/2] [UU]

This is a superblock 1.2 RAID array.

If the status line doesn't mention a 'super' version, then you have an old 0.90 superblock. For example:

md53 : active raid1 sdd4[1] sdc4[0]
      2878268800 blocks [2/2] [UU]
      bitmap: 0/22 pages [0KB], 65536KB chunk

Unless you made your software RAID arrays a very long time ago and faithfully kept upgrading their system ever since, you probably don't have superblock 0.90 format arrays.

(Although you could have deliberately asked mdadm to make new arrays with 0.90 format superblocks.)
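
The check can also be scripted; this sketch classifies each array from /proc/mdstat-style output, using the two status lines above as sample input:

```shell
# Pair each mdN name with its following "blocks" status line and report
# the superblock format (no "super" marker means the old 0.90 format).
cat <<'EOF' >/tmp/mdstat.sample
md26 : active raid1 sda4[0] sdb4[1]
      94305280 blocks super 1.2 [2/2] [UU]
md53 : active raid1 sdd4[1] sdc4[0]
      2878268800 blocks [2/2] [UU]
EOF

awk '/^md/     { name = $1 }
     /blocks/ { if ($0 ~ / super /) print name ": superblock " $4
                else                print name ": old 0.90 superblock" }' \
    /tmp/mdstat.sample
# -> md26: superblock 1.2
# -> md53: old 0.90 superblock
```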

Gtk 4 has decided to blow up some people's world on HiDPI displays

By: cks
6 July 2024 at 03:00

Pavucontrol is my go-to GUI application for full volume control on my Fedora desktops. Recently (since updating to Fedora 40), pavucontrol started to choose giant font rendering, which made it more than a bit inconvenient to use. Today I attempted to diagnose this, without particular success; I did find a fix, although it still leaves pavucontrol with weird rendering issues. But investigating things more deeply led me to discover just what was going on.

Pavucontrol is one of the applications that is now based on Gtk version 4 ('Gtk-4') instead of Gtk version 3 ('Gtk-3'). On Fedora 40 systems, you can investigate which RPM packages want each of these with:

$ rpm -q --whatrequires 'libgtk-4.so.1()(64bit)' | sort
[... a relatively short list ...]
$ rpm -q --whatrequires 'libgtk-3.so.0()(64bit)' | sort
[... a much longer list including eg Thunderbird ...]

I don't use a standard desktop environment like Gnome or KDE, so HiDPI presented me with some additional hassles that required me to, among other things, run an XSettings daemon and set some X resources to communicate my high DPI. Back in the days of Gtk-3, Gtk-based applications did not notice these settings, and required their own modifications; first I had to scale them up in order to get icons right, and then I had to scale Gtk-3 text back down again because the core text rendering that was used by Gtk-3 did recognize my high DPI. So I needed 'GDK_SCALE=2' and 'GDK_DPI_SCALE=0.5'.

In Gtk-4, it turns out that they removed support for GDK_DPI_SCALE but not GDK_SCALE (via this KDE bug report). This makes life decidedly awkward; I can choose between having decent sized icons and UI elements along with giant text, or reasonable text and tiny icons. Gtk-4 has a settings file (the personal one is normally ~/.config/gtk-4.0/settings.ini), but as explicitly documented it's mostly ignored if you have XSettings active, which I do because I need it for other things. The current Arch Linux wiki page section on HiDPI in X suggests that there is a way to override XSettings values for Gtk(-4?), but this doesn't work for test Gtk-4 applications for me.

At the moment I'm unsetting both environment variables in a cover script for pavucontrol, which is acceptable for it because it has relatively few graphical elements that are scaled down to tiny sizes for this. If and when applications with more graphical elements move to Gtk-4, this is going to be a real problem for me and I don't know how I'll solve it.
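
The cover script itself is tiny; a sketch along the lines of what I'm doing (written to /tmp here purely for illustration):

```shell
# Create a wrapper that strips the Gtk scaling variables before running
# pavucontrol, since Gtk-4 handles GDK_SCALE but not GDK_DPI_SCALE.
cat >/tmp/pavucontrol-wrapper <<'EOF'
#!/bin/sh
unset GDK_SCALE GDK_DPI_SCALE
exec pavucontrol "$@"
EOF
chmod +x /tmp/pavucontrol-wrapper
```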

(When I started writing this entry I thought I had a mystery, but then more research turned up the direct answers, although not how I'm supposed to deal with this.)

Sidebar: The pavucontrol rendering problem I still have

No matter how wide I make the pavucontrol window, some text on the left side gets cut off. In the picture below, notice that the 'S' of 'Silence' isn't there.

The pavucontrol GUI showing output volume control, with a 'ilence' on the left side instead of 'Silence'

This doesn't happen in a test Cinnamon session in a Fedora 40 virtual machine I have handy, and I have no idea how one would go about figuring out what is wrong.

(This picture is with the sensible text sizes and thus small icons, and probably looks large in your browser because it comes from a HiDPI display.)

Fedora 40 and a natural but less than ideal outcome with 'alternatives'

By: cks
4 July 2024 at 01:03

Fedora, like various Linux distributions, has a system of 'alternatives', where several programs from several different packages can provide alternative versions of the same thing, and which one is used is chosen through symbolic links in /etc/alternatives (Fedora's version appears to be this implementation). These alternatives can have priorities and by default the highest priority version of something gets to be that thing; however, if you manually choose another option, that option is supposed to stick. On a Fedora system, you can see the surprisingly large list of things handled this way with 'alternatives --list' and see about a particular thing with 'alternatives --display <thing>'.

As part of its attempt to move to Wayland, Fedora 40 has opted to package two different builds of Emacs, a "pure GTK" build that does not work well on X11 (or on plain terminals), and a "gtk+x11" build that does work well on X11 and on plain terminals. Which of the two versions gets to be 'emacs' is handled through the alternatives system, and the default is the "pure GTK" version (because of Fedora's love of Wayland). I don't use Wayland, so more or less as soon as I upgraded to Fedora 40, I ran 'alternatives --config emacs' and switched my 'emacs' to the gtk+x11 version, which also sets this alternative to be manually configured and thus it's supposed to be left alone.

Fedora 40 shipped with Emacs 29.3. Recently, Emacs 29.4 was released to deal with a security issue, so of course I updated to it when Fedora had an updated package available. To my unhappy surprise, after the update my 'emacs' was suddenly back to being the pure GTK, Wayland only version. Unfortunately, this outcome is actually a natural one given how everything works, because I left out a critical element of how Emacs works with the alternative system in Fedora. You see, when I manually set my alternatives preferences, I did not set it to 'emacs-gtk+x11', because there is no such alternative. Instead, I set it to 'emacs-29.3-gtk+x11', and after the upgrade I had to reset it to 'emacs-29.4-gtk+x11', because that's what now existed.

Quite sensibly, if you have an alternative pointed to something that gets removed, Fedora de-selects that alternative rather than leave you with a definitely non-working configuration. So Fedora removed my (manually set) 'emacs-29.3-gtk+x11' alternative, along with what had been the default 'emacs-29.3' option, and after the package update it had 'emacs-29.4-gtk+x11' (at priority 75) and 'emacs-29.4' (at priority 80, and thus the default). With no manual alternative settings left, it picked the default to be 'emacs', and suddenly I was trying to use the non-working pure GTK version.

This is all perfectly natural and straightforward, and in this situation it more or less has to be implemented this way, but it results in a less than ideal outcome. To solve it, I think you'd need an additional level of alternatives indirection, where 'emacs' pointed to an 'emacs-gtk+x11' or 'emacs-pure' alternative, and then each of these pointed to the appropriate version. Then there would be a chance for the alternatives system to not forget your manual setting over a package upgrade.

(There might be a simpler scheme with some more cleverness in the package update actions, but I think it would still need the extra level of indirection through a 'emacs-gtk+x11' symbolic link.)

All things considered I'm not surprised that Fedora either overlooked this or opted not to go through the extra effort. But still, the current situation is yet another little example of "robot logic".

The systemd journal doesn't force you to not have plain text logs

By: cks
1 July 2024 at 03:27

People are periodically grumpy that systemd's journal(d) doesn't store logs using 'plain text'. Sometimes this is used to imply that you can't have plain text logs with systemd's journal (or, more rarely, to state it). This is false. The systemd journal doesn't force you to not have plain text logs. In fact the systemd journal is often a practical improvement in the state of plain text logs, because your plain text logs will capture more information (if you keep them).

Of course the systemd journal won't write plain text logs directly. But modern syslog daemons on Linux will definitely read from the systemd journal and handle the result as more or less native syslog messages, including forwarding them to a central syslog server and writing them to whatever local files you want in the traditional syslog plain text format. Because the systemd journal generally captures things like the output printed by programs run in units, this stream of syslog'd messages will include more log data than a pure journal-free syslog environment would, which is normally a good thing.

It's perfectly possible to use the systemd journal basically in a pass-through mode; for example, see the discussion of journald.conf's Storage= setting, and also the runtime storage settings. Historically, Linux distributions varied in whether they made the systemd journal persistent on disk, and in the early days some certainly did not (sometimes to our irritation). And of course if you don't like the journal settings that your Linux distribution defaults to, you can change them in your system installation procedures.

(If you set 'Storage=none', I'm not sure if your syslog daemon can still read and log journal data; you should probably test that. But my personal view is that you should retain some amount of journal logs in RAM to enable various convenient things.)

Today, whether or not you get plain text logs as well as the systemd journal by default is a choice that Linux distributions make, not something that systemd itself decides. If the Linux distribution installs and enables a syslog daemon, you'll get plain text logs, possibly as well as a persistent systemd journal. You can look at the plain text logs most of the time, and turn to the journal for the kind of extra metadata that plain text logs aren't good at. And even if your Linux distribution doesn't install a syslog daemon, I believe most of them still package one, so you can install it yourself as a standard piece of additional software.

I wish systemd didn't require two units for each socket service

By: cks
29 June 2024 at 01:37

Triggered by our recent (and repeated) issue with xinetd restarts, we're considering partially or completely replacing our use of xinetd with systemd socket units on our future Ubuntu 24.04 machines. Xinetd generally works okay today, but our perception is that it's fallen out of style and may not last forever as a maintained and packaged thing in Ubuntu (it's already a 'universe' package). By contrast, systemd socket units and socket activation is definitely sticking around. However, I have a little petty gripe about systemd socket units (which also applies to systemd timer units), which is that they require you to provide two unit files, not one.

As the systemd socket unit documentation spells out, socket units work by causing systemd to start another unit when the socket is connected to. Depending on the setting of Accept= this is either a regular unit or a template unit (which will be started with a unique name for every connection). However, in each case you need a second unit file for the .service unit. This is in contrast to the xinetd approach, where all you need is a single file placed into /etc/xinetd.d. As a system administrator, my bias is that the fewer files involved the better, because there's less chance for things to become de-synchronized with each other.
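
To make the two-file requirement concrete, here is a made-up sketch of a socket-activated service (the names, port, and program path are all invented for illustration):

```ini
# echo.socket -- the half that systemd opens and listens on
[Socket]
ListenStream=7777
Accept=yes

[Install]
WantedBy=sockets.target

# echo@.service -- a separate file; with Accept=yes it must be a template
# unit, started once per connection with the socket as standard input.
[Service]
ExecStart=/usr/local/sbin/echo-handler
StandardInput=socket
```

Both files have to exist and agree on the base unit name, which is exactly the de-synchronization opportunity I'm grumbling about.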

Systemd has a coherent internal model that more or less requires there be two units involved here, because it needs to keep track of the activation status of the socket as well as the program or programs involved in handling it. After all, one of the selling points of socket units is that the socket can be active without the associated program having been started. And in the systemd world, the way you get two units is to have two files, so systemd socket activation needs you to provide both a .socket file and a .service file.

(Systemd could probably work out some way to embed the service information in the .socket unit file if it wanted to, but it's probably better not to complicate the model. I did say my gripe was a little petty one.)

PS: The good news is that although you have to install both unit files, you only have to directly activate one (the socket unit).

The xinetd restart problem with binding ports that we run into

By: cks
27 June 2024 at 02:41

Recently, I said something on the Fediverse:

It has been '0' days since I wished xinetd had an option to exit with a failed status if it couldn't bind some configured ports on startup. (Yes, yes, I know, replace it with systemd socket listeners or something.)

The job of xinetd is to listen on some number of TCP or UDP ports for you, and run things when people connect to those ports. This has traditionally been used to avoid having N different inactive daemons each listening to its own ports, and also so that people don't have to write those daemons at all; they can write something that gets started with a network connection handed to it and talks over that connection, which is generally simpler (you can even use shell scripts). At work, our primary use for xinetd is invoking Amanda daemons on all of the backup clients.

Every so often, Ubuntu releases a package update that wants to restart xinetd (for whatever reason, most obviously updating xinetd itself). When xinetd restarts, the old xinetd process stops listening on its ports and the new xinetd attempts to start listening on what you've configured. Unfortunately this is where a quirk of the BSD sockets API comes in. If there is an existing connection on some port, the new xinetd is unable to start listening on that port again. So, for example, if the Amanda client is running at the time that xinetd restarts, xinetd will not be able to start listening on the Amanda TCP port.

(See the description of SO_REUSEADDR in socket(7). The error you'll get is 'address already in use', EADDRINUSE. At one point I thought this was an issue only for UDP listening, but today it is clearly also a problem for TCP services.)

When this happens, the current version of xinetd will log messages but happily start up, which means that to general status monitoring it looks like a healthy service. This can make it hard to detect until, for example, some Amanda backups fail because xinetd on the machines to be backed up isn't actually listening for new Amanda connections. This happens to us every so often, which is why I wish xinetd had an option to fail in this situation (and then we'd get an alert, or we could have the systemd service unit set to auto-restart xinetd after a delay).
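
One workaround sketch along these lines is a systemd drop-in that makes the unit fail if the port isn't actually being listened on (the drop-in path, the 'ss' check, and the Amanda port number are all my assumptions, not anything xinetd or Ubuntu provides):

```ini
# /etc/systemd/system/xinetd.service.d/verify-listen.conf
[Service]
# Fail the unit if the Amanda TCP port isn't actually being listened on
# shortly after startup, then have systemd retry after a delay.
ExecStartPost=/bin/sh -c 'sleep 2; ss -ltn | grep -q ":10080 "'
Restart=on-failure
RestartSec=30s
```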

(Systemd socket units don't so much solve this as work around it by never closing and re-opening the listening socket as part of service changes or restarts.)

Some notes on ZFS's zstd compression kstats (on Linux)

By: cks
24 June 2024 at 02:38

Like various other filesystems, (Open)ZFS can compress file data when it gets written. As covered in the documentation for the 'compression=' filesystem property, ZFS can use a number of compression algorithms for this, including Zstd. The zstd compression system in ZFS exposes some information about its activity in the form of ZFS kstats; on Linux, these are visible in /proc/spl/kstat/zfs/zstd, but are unfortunately a little underdocumented. For reasons beyond the scope of this blog entry I recently looked into this, so here is what I know about them from reading the code.

compress_level_invalid
The zstd compression level was invalid on compression.
compress_alloc_fail
We failed to allocate a zstd compression context.
compress_failed
We failed to compress a block (after allocating a compression context).

decompress_level_invalid
A compressed block had an invalid zstd compression level.
decompress_header_invalid
A compressed block had an invalid zstd header.
decompress_alloc_fail
We failed to allocate a zstd decompression context. This should not normally happen.
decompress_failed
A compressed block with a valid header failed to decode (after we allocated a decompression context).

The zstd code does some memory allocation for data buffers and contexts and so on. These have more granular information in the kstats:

alloc_fail
How many times allocation failed for either compression or decompression.
alloc_fallback
How many times decompression allocation had to fall back to a special emergency reserve in order to allow blocks to be decompressed. A fallback is also considered an allocation failure.

buffers
How many buffers zstd has in its internal memory pools. I don't understand what 'buffers' are in this context, but I think they're objects (such as contexts or data buffers), not pool arenas.
size
The total size of all buffers currently in zstd's internal memory pools.

If I'm reading the code correctly, compression and decompression have separate pools, and each of them can have up to 16 buffers in them. Buffers are only freed if they're old enough, and if the zstd code needs more buffers than this for various possible reasons, it allocates them directly outside of the memory pools. No current kstat tracks how many allocations happened outside of the memory pools (or how effective the pools are), although this information could be extracted with eBPF.

In the current version of OpenZFS, compressing things with zstd has a complex system of trying to figure out if things are compressible (a 'tiered early abort'). If the data is small enough (128 KB or less), ZFS just does the zstd compression (and then a higher level will discard the result if it didn't save enough). Otherwise, if you're using a high enough zstd level, ZFS tries a quick check with LZ4 and then zstd-1 to see if it can bail out quickly rather than try to compress the entire thing with zstd only to throw it away. How this goes is shown in some kstats:

passignored
How many times zstd didn't try the complex approach.
passignored_size
The amount of small data processed without extensive checks.
lz4pass_allowed
How many times the LZ4 pre-check passed things.
lz4pass_rejected
How many times the LZ4 pre-check rejected things.
zstdpass_allowed
How many times the quick zstd pre-check passed things.
zstdpass_rejected
How many times the quick zstd pre-check rejected things.

A number of these things will routinely be zero. Obviously you would like to see no compression and decompression failures, no allocation failures, and so on. And if you're compressing with only zstd-1 or zstd-2, you'll never trigger the complicated pre-checks; however, I believe that zstd-3 will trigger this, and that's the normal default if you set 'compression=zstd'.
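
If you want a quick look at just the failure counters, something like the following awk works on the kstat's 'name type data' line format. Here I run it against a fabricated sample file so the output is purely illustrative; on a real system you'd point it at /proc/spl/kstat/zfs/zstd instead.

```shell
# Print any failure/invalid counter that is nonzero.
zstd_failures() {
    awk '$1 ~ /fail|invalid/ && $3 != 0 { print $1, $3 }' "$1"
}

# A fabricated sample in the kstat layout, for illustration only.
cat >/tmp/zstd-kstat-sample <<'EOF'
name                        type data
compress_failed             4    0
decompress_header_invalid   4    2
alloc_fail                  4    0
EOF
zstd_failures /tmp/zstd-kstat-sample
# prints: decompress_header_invalid 2
```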

This particular sausage is all made in module/zstd/zfs_zstd.c, which has some comments about things.

Where Thunderbird seems to get your default browser from on Linux

By: cks
21 June 2024 at 03:02

Today, I had a little unhappy experience with Thunderbird and the modern Linux desktop experience:

Dear Linux Thunderbird, why are you suddenly starting to open links I click on in Chrome? I do not want that in the least, and your documentation on how you obtain this information is nonexistent (I am not on a Gnome or KDE desktop environment). This worked until relatively recently and opened in the right browser, but now it is opaquely broken.

This is an example of why people want to set computers on fire.

(All of this is for Thunderbird 115.12 on Fedora 40. Other Linux distributions may differ, and things may go differently if you're using KDE.)

After some investigation, I now know where Thunderbird was getting this information from and why it wound up on Chrome, although I don't know what changed so that this started happening recently. A critical source for my journey was Kevin Locke's Changing the Default Browser in Thunderbird on Linux, originally written in 2012, revised in 2018, and still applicable today, almost six years later.

If you're an innocent person, you might think that Thunderbird would of course use xdg-open to open URLs, since it is theoretically the canonical desktop-independent way to open URLs in your browser of choice. A slightly less innocent person could expect Thunderbird to use the xdg-mime tool and database to find what .desktop file handles the 'x-scheme-handler/https' MIME type and then use it (although this would require Thunderbird to find and parse the .desktop file).

Thunderbird does neither of these. Instead, it uses the GTK/Gnome 'gconf' system (which is the old system, in contrast to the new GSettings), which gives Thunderbird (and anyone else who asks) the default command to run to open a URL. We can access the same information with 'gconftool-2' or 'gconf-editor' (don't confuse the latter with dconf-editor, which works on GSettings/dconf). So:

$ gconftool-2 --get /desktop/gnome/url-handlers/http/command
gio open %s

The 'gio' command provides command line access to the GTK GIO system, and is actually what xdg-open would probably use too if I was using a Gnome desktop instead of my own weird environment. We can check what .desktop file 'gio' will use and compare it to xdg-mime with:

$ xdg-mime query default x-scheme-handler/https
org.mozilla.firefox.desktop
$ gio mime x-scheme-handler/https
Default application for β€œx-scheme-handler/https”: google-chrome.desktop
Registered applications:
        google-chrome.desktop
        kfmclient_html.desktop
        org.midori_browser.Midori.desktop
        org.mozilla.firefox.desktop
Recommended applications:
        google-chrome.desktop
        kfmclient_html.desktop
        org.midori_browser.Midori.desktop
        org.mozilla.firefox.desktop

So GIO and xdg-mime disagree, and GIO is picking Chrome.

(In case you're wondering, Thunderbird really does run 'gio open ...'.)

What happened to me is foreshadowed by my 2016 entry on how xdg-mime searches for things. I had a lingering old set of files in /usr/local/share/applications and the 'defaults.list' there contained (among a few other things):

[Default Applications]
x-scheme-handler/http=mozilla-firefox.desktop;google-chrome.desktop
x-scheme-handler/https=mozilla-firefox.desktop;google-chrome.desktop

The problem with these entries is that there's no 'mozilla-firefox.desktop' (or 'firefox.desktop') any more; it was long since renamed to 'org.mozilla.firefox.desktop'. Since there is no 'mozilla-firefox.desktop', it is ignored and this line really picks 'google-chrome.desktop' (instead of ignoring Chrome). For a long time this seems to have been harmless, but then apparently GIO started deciding to pay attention to /usr/local/share/applications, although xdg-mime was ignoring it. Getting rid of those 2013-era files made 'gio mime ...' agree that org.mozilla.firefox.desktop was what it should be using.
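
The effect is easy to reproduce outside of GIO. Here is a small sketch of the selection rule as I understand it (the first listed desktop file that actually exists wins), using entirely made-up paths:

```shell
# Pick the first desktop entry on a defaults.list line that exists in appdir.
pick_handler() {
    mimetype=$1 listfile=$2 appdir=$3
    line=$(grep "^$mimetype=" "$listfile") || return 1
    old_ifs=$IFS; IFS=';'
    set -- ${line#*=}        # split the entry list on ';'
    IFS=$old_ifs
    for d in "$@"; do
        if [ -e "$appdir/$d" ]; then echo "$d"; return 0; fi
    done
    return 1
}

mkdir -p /tmp/demo-apps
: > /tmp/demo-apps/google-chrome.desktop     # only Chrome's file exists
cat >/tmp/demo-defaults.list <<'EOF'
[Default Applications]
x-scheme-handler/https=mozilla-firefox.desktop;google-chrome.desktop
EOF
pick_handler x-scheme-handler/https /tmp/demo-defaults.list /tmp/demo-apps
# prints: google-chrome.desktop
```

Because 'mozilla-firefox.desktop' no longer exists, the nominally second choice is what actually gets used.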

(The Arch wiki page on Default Applications has some additional information and pointers. Note that all of this ignores /etc/mailcap, although some things will use it.)

This is still not what I want (or what it used to be), but fixing that is an internal Thunderbird settings change, not Thunderbird getting weird system settings.

Sidebar: Fixing Thunderbird to use a browser of your choice

This comes from this superuser.com answer. Get into the Config Editor (the Thunderbird version of Firefox's 'about:config') and set network.protocol-handler.warn-external.http and network.protocol-handler.warn-external.https to 'true'. To be safe, quit and restart Thunderbird. Now click on a HTTPS or HTTP link in a message, and you should get the usual 'what to do with this' dialog, which will let you pick the program of your choice and select 'always use this'. Under some circumstances, post-restart you'll be able to find a 'https' or 'http' entry in the 'Files & Attachments' part of 'General' Settings, which you can change on the spot.

The Linux kernel NFS server and reconnecting client NFS filehandles

By: cks
13 June 2024 at 03:11

Unlike some other Unix NFS servers, the Linux kernel NFS server attempts to solve the NFS server 'subtree' export problem, along with a related permissions problem that is covered in the exportfs(5) manual page section on no_subtree_check. To quote the manual page on this additional check:

subtree checking is also used to make sure that files inside directories to which only root has access can only be accessed if the filesystem is exported with no_root_squash (see below), even if the file itself allows more general access.

In general, both of these checks require finding a path that leads to the file obtained from a NFS filehandle. NFS filehandles don't contain paths; they normally only contain roughly the inode number, which is a flat, filesystem-wide reference to the file. The NFS server calls this 'reconnection', and it is somewhat complex and counterintuitive. It also differs for NFS filehandles of directories and files.

(All of this is as of kernel 6.10-rc3, although this area doesn't seem to change often.)

For directories, the kernel first gets the directory's dentry from the dentry cache (dcache); this dentry can be 'disconnected' (which mostly means it was newly created due to this lookup) or already connected (in general, already set up in the dcache). If the dentry is disconnected, the kernel immediately reconnects it. Reconnecting a specific directory dentry works like this:

  1. obtain the dentry's parent directory through a filesystem specific method (which may more or less look up what '..' is in the directory).
  2. search the parent directory to find the name of the directory entry that matches the inode number of the dentry you're trying to reconnect. (A few filesystems have special code to do this more efficiently.)
  3. using the dcache, look up that name in the parent directory to get the name's dentry.
  4. verify that this new dentry and your original dentry are the same (which guards against certain sorts of rename races).

It's possible to have multiple disconnected dentries on the way to the filesystem's mount point; if so, each level follows this process. The obvious happy path is that the dcache already has a fully connected dentry for the directory the NFS client is working on, in which case all of this can be skipped. This is frequently going to be the case if clients are repeatedly working on the same directories.

Once the directory's dentry is fully connected (ie, all of its parents are connected), the kernel NFS server code will check if it is 'acceptable'. If the export uses no_subtree_check (which is now the default), this acceptability check always answers 'yes'.

For files, things are more complicated. First, the kernel checks to see if the initial dentry for the file (and any aliases it may have) is 'acceptable'; if the export uses no_subtree_check the answer is always 'yes', and things stop. Otherwise, the kernel uses a filesystem specific method to obtain the (or a) directory the file is in, reconnects the directory using the same code as above, then does steps 2 through 4 of the 'directory reconnection' process for the file and its parent directory in order to check against renames (which will involve at least one scan of the parent directory to discover the file's name). Finally with all of this done and a verified, fully connected dentry for the file, the kernel does the acceptability check again and returns the result.

Because the kernel immediately reconnects the dentries of directory NFS file handles before looking at the status of subtree checks, you really want those directories to have dentries that are already in the dcache (and fully connected). Every directory NFS filehandle with a dentry that has to be freshly created in disconnected state means at least one scan of a possibly large parent directory, and more scans of more directories if the parent directory itself isn't in the dcache too.

I'm not sure of how the dcache shrinks, and especially if filesystems can trigger removing dcache entries because the filesystem itself wants to remove the inode entry. The general kernel code that shrinks a filesystem's associated dcache and inodes triggers dcache shrinking first and inode shrinking second, with the comment that the inode cache is pinned by the dcache.

Sidebar: Monitoring NFS filehandle reconnections

If you want to see how much reconnection is happening, you'll need to use bpftrace (or some equivalent). The total number of NFS filehandles being looked at is found by counting calls to exportfs_decode_fh_raw(). If you want to know how many reconnections are needed, you want to count calls to reconnect_path(); if you want to count how many path components had to be reconnected, you want to (also) count calls to reconnect_one(). All of these are in fs/exportfs/expfs.c. The exportfs_get_name() call searches for the name for a given inode in a directory, and then the lookup_one_unlocked() call does the name to dentry lookup needed for revalidation, and I think it will probably fall through to a filesystem directory lookup.

(You can also look at general dcache stats, as covered in my entry on getting some dcache information, but I don't think this dcache lookup information covers all of the things you want to know here. I don't know how to track dentries being dropped and freed up, although prune_dcache_sb() is part of the puzzle and apparently returns a count of how many dentries were freed up for a particular filesystem superblock.)

The NFS server 'subtree' export problem

By: cks
11 June 2024 at 03:06

NFS servers have a variety of interesting problems that ultimately exist because NFS was first defined a long time ago in a world where (Unix) filesystems were simpler and security was perhaps less of a concern. One of their classical problems is that how NFS clients identify files is surprisingly limited. Another problem is what I will call the 'subtree export' issue.

Suppose that you have a filesystem called '/special', and this filesystem contains directory trees '/special/a' and '/special/b'. The first directory tree is exported only to one NFS client, A, and the second directory tree is exported only to another NFS client, B. Now suppose that client A presents the NFS server with a request to read some file, which it identifies by an NFS filehandle. How does the NFS server know that this file is located under /special/a, the only part of the /special filesystem that client A is supposed to have access to? This is the subtree export issue.

(This problem comes up because NFS clients can forge their own NFS filehandles or copy NFS filehandles from other clients, and NFS filehandles generally don't contain the path to the object being accessed. Normally all the NFS server can recover from an NFS filehandle is the filesystem and some non-hierarchical identifier for the object, such as its inode number.)

The very early NFS servers ignored the entire problem because they started out with no NFS filehandle access checks at all. Even when NFS servers started applying some access checks to NFS filehandles, they generally ignored the subtree issue because they had no way to deal with it. If you exported a subtree of a filesystem to some client, in practice the client could access the entire filesystem if it made up appropriate valid NFS filehandles. The manual pages sometimes warned you about this.

(One modern version of this warning appears in the FreeBSD exports manual page. The Illumos share_nfs manual page doesn't seem to discuss this subtree issue, so I don't know how Illumos handles it.)

Some modern NFS servers try to do better, and in particular the Linux kernel NFS server does. Linux does this by trying to work out the full path within the filesystem of everything you access, leveraging the kernel directory entry caches and perhaps filesystem specific information about parent directories. On Linux, and in general on any system where the NFS server attempts to do this, checking for this subtree export issue adds some overhead to NFS operations and may possibly reject some valid NFS operations because the NFS server can't be sure that the request is within an allowed subtree. Because of this, Linux's exports(5) NFS options support a 'no_subtree_check' option that disables this check and in fact any security check that requires working out the parent of something.
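
For the earlier /special example, the Linux exports(5) entries might look like this (the client names are invented):

```text
# /etc/exports
/special/a  clienta.example.org(rw,no_subtree_check)
/special/b  clientb.example.org(rw,subtree_check)
```

With no_subtree_check the server skips the path reconstruction entirely for that export; with subtree_check it pays the extra path checking cost on every filehandle the client presents.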

Generally, the subtree export issue is only a problem if you think NFS clients can be compromised to present NFS filehandles that you didn't give them. If you only export a subtree of a filesystem to a NFS client, a properly operating NFS environment will deny the client's request to mount anything else in the filesystem, which will stop the client from getting NFS filehandles for anything outside its allowed subtree.

(This still leaves you with the corner case of moving a file or a directory tree from inside the client's allowed subtree to outside of it. If the NFS client is currently using the file or directory, it will still likely be able to keep accessing it until it stops and forgets the file's NFS filehandle.)

Obviously, life is simpler if you only export entire filesystems to NFS clients and don't try to restrict them to subtrees. If a NFS client only wants a subtree, it can do that itself.

Infinity in 15 kilograms

By: VM
19 April 2024 at 21:54

While space is hard, there are also different kinds of hardness. For example, on April 15, ISRO issued a press release saying it had successfully tested nozzles made of a carbon-carbon composite that would replace those made of Columbium alloy in the PSLV rocket’s fourth stage and thus increase the rocket’s payload capacity by 15 kg. Just 15 kg!

The successful testing of the C-C nozzle divergent marked a major milestone for ISRO. On March 19, 2024, a 60-second hot test was conducted at the High-Altitude Test (HAT) facility in ISRO Propulsion Complex (IPRC), Mahendragiri, confirming the system’s performance and hardware integrity. Subsequent tests, including a 200-second hot test on April 2, 2024, further validated the nozzle’s capabilities, with temperatures reaching 1216K, matching predictions.

Granted, the PSLV’s cost of launching a single kilogram to low-earth orbit is more than 8 lakh rupees (a very conservative estimate, I reckon) – meaning an additional 15 kg is worth at least another Rs 1.2 crore per launch. But finances alone are not a useful way to evaluate this addition: more payload mass could mean, say, one additional instrument onboard an indigenous spacecraft instead of waiting for a larger rocket to become available or postponing that instrument’s launch to a future mission.

But equally fascinating, and pride- and notice-worthy, to me is the fact that ISRO’s scientists and engineers were able to fine-tune the PSLV to this extent. This isn’t to say I’m surprised they were able to do it at all; on the contrary, it means the feat is as much about the benefits accruing to the rocket, and the Indian space programme by extension, as about R&D advances on the materials science front. It speaks to the oft-underestimated importance of the foundations on which a space programme is built.

Vikram Sarabhai Space Centre … has leveraged advanced materials like Carbon-Carbon (C-C) Composites to create a nozzle divergent that offers exceptional properties. By utilizing processes such as carbonization of green composites, Chemical Vapor Infiltration, and High-Temperature Treatment, it has produced a nozzle with low density, high specific strength, and excellent stiffness, capable of retaining mechanical properties even at elevated temperatures.

A key feature of the C-C nozzle is its special anti-oxidation coating of Silicon Carbide, which extends its operational limits in oxidizing environments. This innovation not only reduces thermally induced stresses but also enhances corrosion resistance, allowing for extended operational temperature limits in hostile environments.

The advances here draw from insights into metallurgy, crystallography, ceramic engineering, composite materials, numerical methods, etc., which in turn stand on the shoulders of people trained well enough in these areas, the educational institutions (and their teachers) that did so, and the schooling system and socio-economic support structures that brought them there. A country needs a lot to go right for achievements like squeezing an extra 15 kg into the payload capacity of an already highly fine-tuned machine to be possible. It’s a bummer that such advances are currently largely vertically restricted, except in the case of the Indian space programme, rather than diffusing freely across sectors.

Other enterprises ought to have these particular advantages ISRO enjoys. Even should one or two rockets fail, a test not work out or a spacecraft go kaput sooner than designed, the PSLV’s new carbon-carbon-composite nozzles stand for the idea that we have everything we need to keep trying, including the opportunity to do better next time. They represent the idea of how advances in one field of research can lead to advances in another, such that each field is no longer held back by the limitations of its starting conditions.

Maybe understanding uname(1)'s platform and machine fields

By: cks
5 June 2024 at 21:17

When I wrote about some history and limitations of uname(1) fields, I was puzzled by the differences between 'uname -m', 'uname -i', and 'uname -p' in the two variants of uname that have all three, Linux uname and Illumos uname. Illumos is descended from (Open)Solaris, and although I can't find manual pages for old Solaris versions of uname online, I suspect that Solaris is probably the origin of both '-i' and '-p' (the '-m' option comes from the original System V version that also led to POSIX uname). The Illumos manual page doesn't explain the difference, but it does refer to sysinfo(2), which has some quite helpful commentary if you read various bits and pieces. So here is my best guess at the original meanings of the three different options in Solaris.

Going from most general to most specific, it seems to be that on Solaris:

  • -p tells you the broad processor ISA or architecture, such as 'sparc', 'i386', or 'amd64' (or 'x86_64' if you like that label). This is what Illumos sysinfo(2) calls SI_ARCHITECTURE.

  • -m theoretically tells you a more specific processor and machine type. For SPARC specifically, you can get an idea of Solaris's list of these in the Debian Wiki SunSparc page (and also Wikipedia's Sun-4 architecture list).

  • -i theoretically tells you about the specific desktop or server platform you're on, potentially down to a relatively narrow model family; the Illumos sysinfo(2) section on SI_PLATFORM gives 'SUNW,Sun-Fire-T200' as one example.

Of course, 'uname -m' came first, and '-p' and '-i' were added later. I believe that Solaris started out being relatively specific in 'uname -m', going along with the System V and POSIX definition of it as the 'hardware type' or machine type. Once Solaris had done that, it couldn't change the output of 'uname -m' even as people started to want a broader processor ISA label, hence '-p' being the more generic version despite -m being the more portable option.

(GNU Coreutils started with only -m, added '-p' in 1996, and added '-i' in 2001. The implementation of both -p and -i initially only used sysinfo() to obtain the information.)

On x86 hardware, it seems that Unixes chose to interpret 'uname -m' generically, instead of trying to be specific about things like processor families. Especially in the early days of x86 Unix, the information needed for 'uname -i' probably just wasn't available, and also wasn't necessarily particularly useful. The Illumos sysinfo(2) section on SI_PLATFORM suggests that it just returns 'i86pc' on all conventional x86 platforms.

(GNU Coreutils theoretically supports '-i' and '-p' on Linux, but in practice both will normally report "unknown".)

Of course, once x86 Unixes started reporting generic things for 'uname -m', they were stuck with it due to backward compatibility with build scripts and other things that people had based on the existing output (and it's not clear what more specific x86 information would be useful for 'uname -m', although for 32-bit x86, people have done variant names of 'i586' and 'i686'). While there was some reason to support 'uname -p' for compatibility, it is probably not surprising that on both FreeBSD and OpenBSD, the output of 'uname -m' is probably mostly the same as 'uname -p'.

(OpenBSD draws a distinction between the kernel architecture and the application architecture, per the OpenBSD uname(1). FreeBSD draws a distinction between the 'hardware platform' (uname -m) and the 'processor architecture' (uname -p), per the FreeBSD uname(1), but on the FreeBSD x86 machine I have access to, they produce the same output. However, see the FreeBSD arch(7) manual page and this FreeBSD bug from 2017.)

PS: In a comment on my first uname entry, Phil Pennock showed that 'uname -m' and 'uname -p' differed in some macOS environments. I suspect the difference is following the FreeBSD model but I'm not sure.

Sidebar: The details of uname -p and uname -i on Linux

Linux distributions normally use the GNU Coreutils version of 'uname'. In its normal state, Coreutils' uname.c gets this information from either sysinfo() or sysctl(), if either supports obtaining it (see the end of src/uname.c). On Linux, the normal C library sysinfo() and sysctl() don't support this, so normally 'uname -p' and 'uname -i' will both report 'unknown', since uname is left with no code that can determine this information.

The Ubuntu (and perhaps Debian) package for coreutils carries a patch, originally from Fedora (but no longer used there), that uses the machine information that 'uname -m' would report to generate the information for -p and -i. The raw machine information can be modified a bit for both -i and -p. For -i, all 'i?86' results are turned into 'i386', and for -p, if the 'uname -m' result is 'i686', uname checks /proc/cpuinfo to see if you have an AMD and reports 'athlon' instead if you do (although this code may have decayed since it was written).
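A rough shell rendition of that '-i' mapping (illustrative only; the real logic is patched C code inside Coreutils' uname.c):

```shell
# Illustrative shell version of the Debian/Ubuntu '-i' mapping:
# any i?86 machine value is collapsed to plain 'i386'.
hwplatform() {
  case $1 in
    i?86) echo i386 ;;
    *)    echo "$1" ;;
  esac
}
# hwplatform "$(uname -m)" approximates 'uname -i' under that patch.
```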

Some history and limitations of uname(1) fields

By: cks
5 June 2024 at 02:16

Uname(1) is a command that hypothetically prints some potentially useful information about your system. In practice, what information it prints, how useful that information is, and what command line options it supports all vary widely both between different sorts of Unixes and between different versions of Linux (due to their using different versions of GNU Coreutils, with different patches). I was asked recently if this situation ever made any sense, and the general answer is 'maybe'.

In POSIX, uname(1) is more or less a program version of the uname() function. It supports only '-m', '-n', '-r', '-s', and '-v', and as a result of all of these arguments being required by POSIX, they are widely supported by the various versions of uname that are out there in various Unixes. All other arguments are non-standard and were added well after uname(1) initially came into being, which is one reason they are so divergent in presence and meaning; there is no enhanced ancestral 'uname' command for things to descend from.

The uname command itself comes from the System V side of Unix; it was first added as far back as at least System III, where the System III uname.c accepts -n, -r, -s, and -v with the modern meanings. System III gets the information from the kernel, in a utssys() system call. I believe that System V added the 'machine' information ('-m'), which then was copied straight into POSIX. On the BSD side, a uname command first appeared in 4.4 BSD, and the 4.4 BSD uname(1) manual page says that it also had the POSIX arguments, including -m. The actual implementation didn't use a uname() system call but instead extracted the information with sysctl() calls.

The modern versions of uname that I can find manual pages for are rather divergent; contrast Linux uname(1) (also), FreeBSD uname(1), OpenBSD uname(1), NetBSD uname(1), and Illumos uname(1) (manual pages for other Unixes are left as an exercise). For instance, take the '-i' argument, supported in Linux and Illumos to print a theoretical hardware platform and FreeBSD to print the 'kernel ident'. On top of that difference, on Linux distributions that use an unpatched build of Coreutils, I believe that 'uname -i' and 'uname -p' will both report 'unknown'.

(Based on how everyone has the -p argument for processor type, I suspect that it was one of the earliest non-POSIX additions to uname. How 'uname -m' differs from 'uname -p' in practice is something I don't know, but apparently people felt a need to distinguish the two at some point. Some Internet searches suggest that on Unixes such as Solaris, the processor type might be 'sparc' while the machine hardware name might be more specific, like 'sun4m'.)

On Linux and several other Unixes, much of the core information for uname comes from the kernel, which means that options like 'uname -r' and 'uname -v' have traditionally reported about the kernel version and build string, not anything to do with the general release of Linux (or Unix). On FreeBSD, the kernel release is usually fairly tightly connected to the userland version (although FreeBSD uname can tell you about the latter too), but on Linux it is not, and the Linux uname has no option to report a 'distribution' name or version.

In general, I suspect that the only useful fields you can count on from uname(1) are '-n' (some version of the hostname), '-s' (the broad operating system), and perhaps '-m', although you probably want to be wary about that. One of the cautions with 'uname -m' is that there is no agreement between Unixes about what the same hardware platform should be called; for example, OpenBSD uses 'amd64' for 64-bit x86 while Linux uses 'x86_64'. Illumos recommends using 'uname -p' instead of 'uname -m'.
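To see what your own system reports, you can loop over the portable POSIX flags; since these options are required by POSIX, the loop itself should work everywhere even though the values differ:

```shell
# Print each POSIX-required uname field alongside its flag.
for flag in -s -n -r -v -m; do
  printf '%-2s => %s\n' "$flag" "$(uname "$flag")"
done
```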

(This entry's topic was suggested to me by Hunter Matthews, although I suspect I haven't answered their questions about historical uname values.)

Sidebar: An options comparison for non-POSIX options

The common additional options are:

  • -i: semi-supported on Linux, supported on Illumos, and means something different on FreeBSD (it's the kernel identifier instead of the hardware platform).
  • -o: supported on Linux, where it is often different from 'uname -s', FreeBSD, where it is explicitly the same as 'uname -s', and Illumos, where I don't know how it relates to 'uname -s'.
  • -p: semi-supported on Linux and fully supported on FreeBSD, OpenBSD, NetBSD, and Illumos. Illumos specifically suggests using 'uname -p' instead of 'uname -m', which will generally make you sad on Linux.

FreeBSD has -b, -K, and -U as additional FreeBSD specific arguments.

How 'uname -i', 'uname -p', and 'uname -m' differ on Illumos is not something I know; they all report something about the hardware, but the Illumos uname manpage mostly doesn't illuminate the difference. It's possible that this is more or less covered in sysinfo(2).

(The moral is that you can't predict the result of these options without running uname on an applicable system, or at least having a lot of OS-specific knowledge.)

Infinity in 15 kilograms

By: V.M.
19 April 2024 at 16:24

While space is hard, there are also different kinds of hardness. For example, on April 15, ISRO issued a press release saying it had successfully tested nozzles made of a carbon-carbon composite that would replace those made of Columbium alloy in the PSLV rocket’s fourth stage and thus increase the rocket’s payload capacity by 15 kg. Just 15 kg!

The successful testing of the C-C nozzle divergent marked a major milestone for ISRO. On March 19, 2024, a 60-second hot test was conducted at the High-Altitude Test (HAT) facility in ISRO Propulsion Complex (IPRC), Mahendragiri, confirming the system’s performance and hardware integrity. Subsequent tests, including a 200-second hot test on April 2, 2024, further validated the nozzle’s capabilities, with temperatures reaching 1216K, matching predictions.

Granted, the PSLV’s cost of launching a single kilogram to low-earth orbit is more than 8 lakh rupees (a very conservative estimate, I reckon) – meaning an additional 15 kg means at least an additional Rs 1.2 crore per launch. But finances alone are not a useful way to evaluate this addition: more payload mass could mean, say, one additional instrument onboard an indigenous spacecraft instead of waiting for a larger rocket to become available or postponing that instrument’s launch to a future mission.

But equally fascinating, and pride- and notice-worthy, to me is the fact that ISRO’s scientists and engineers were able to fine-tune the PSLV to this extent. This isn’t to say I’m surprised they were able to do it at all; on the contrary, it means the feat is as much about the benefits accruing to the rocket, and the Indian space programme by extension, as about R&D advances on the materials science front. It speaks to the oft-underestimated importance of the foundations on which a space programme is built.

Vikram Sarabhai Space Centre … has leveraged advanced materials like Carbon-Carbon (C-C) Composites to create a nozzle divergent that offers exceptional properties. By utilizing processes such as carbonization of green composites, Chemical Vapor Infiltration, and High-Temperature Treatment, it has produced a nozzle with low density, high specific strength, and excellent stiffness, capable of retaining mechanical properties even at elevated temperatures.

A key feature of the C-C nozzle is its special anti-oxidation coating of Silicon Carbide, which extends its operational limits in oxidizing environments. This innovation not only reduces thermally induced stresses but also enhances corrosion resistance, allowing for extended operational temperature limits in hostile environments.

The advances here draw from insights into metallurgy, crystallography, ceramic engineering, composite materials, numerical methods, etc., which in turn stand on the shoulders of people trained well enough in these areas, the educational institutions (and their teachers) that did so, and the schooling system and socio-economic support structures that brought them there. A country needs a lot to go right for achievements like squeezing an extra 15 kg into the payload capacity of an already highly fine-tuned machine to be possible. It’s a bummer that such advances are currently largely vertically restricted, except in the case of the Indian space programme, rather than diffusing freely across sectors.

Other enterprises ought to have these particular advantages ISRO enjoys. Even should one or two rockets fail, a test not work out or a spacecraft go kaput sooner than designed, the PSLV’s new carbon-carbon-composite nozzles stand for the idea that we have everything we need to keep trying, including the opportunity to do better next time. They represent the idea of how advances in one field of research can lead to advances in another, such that each field is no longer held back by the limitations of its starting conditions.

Viewing and resetting the BIOS passwords on the RedmiBook 16

17 January 2021 at 23:00

I recently lost the BIOS password for my Xiaomi RedmiBook 16. Luckily, viewing and even resetting the password from inside a Linux session turned out to be incredibly easy.

As it turns out, both the user and the system ("supervisor") passwords are not hashed in any way and are stored as plaintext inside EFI variables. Viewing these EFI variables is incredibly easy on a Linux system where efivarfs is enabled, even under a regular user account and even with secure boot enabled:

$ uname -a
Linux book 5.10.7.a-1-hardened #1 SMP PREEMPT Tue, 12 Jan 2021 20:46:33 +0000 x86_64 GNU/Linux
$ whoami
xx
$ sudo dmesg | grep "Secure boot"
[    0.010717] Secure boot enabled

Reading the variables:

$ hexdump -C /sys/firmware/efi/efivars/SystemSupervisorPw*
00000000  07 00 00 00 0a 70 61 73 73 77 6f 72 64 31 32 20  |.....password12 |

$ hexdump -C /sys/firmware/efi/efivars/SystemUserPw*
00000000  07 00 00 00 0a 70 61 73 73 77 6f 72 64 31 31 21  |.....password11!|
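Given that layout (4 bytes of EFI variable attributes, one password-length byte, then the password itself), the value can be pulled out with a short shell helper. This is a sketch based on the dump above; the exact variable file names and GUID suffixes vary per machine.

```shell
# Decode a plaintext BIOS password from an efivarfs file.
# Layout (as observed above): 4 bytes of EFI attributes,
# 1 byte of password length, then the password bytes.
decode_efi_pw() {
  len=$(od -An -j4 -N1 -tu1 "$1" | tr -d ' ')
  dd if="$1" bs=1 skip=5 count="$len" 2>/dev/null
  echo
}
# Usage (path is an example; match your SystemUserPw* variable):
# decode_efi_pw /sys/firmware/efi/efivars/SystemUserPw-XXXX
```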

If you have a root shell, removing the passwords entirely is also possible:

# chattr -i /sys/firmware/efi/efivars/SystemUserPw* /sys/firmware/efi/efivars/SystemSupervisorPw*

# rm /sys/firmware/efi/efivars/SystemUserPw* /sys/firmware/efi/efivars/SystemSupervisorPw*

Reboot, and the BIOS no longer asks for a password to enter setup, change secure boot settings, etc.

Patching ACPI tables to enable deep sleep on the RedmiBook 16

18 October 2020 at 23:00

I recently purchased Xiaomi's RedmiBook 16. For the price, it's an excellent MacBook clone. Being a Ryzen-based laptop, Linux support works great out of the box, with one big caveat: deep sleep does not work. I decided to try and fix this.

Deep sleep?

To clear some confusion about what I mean by deep sleep, I need to explain a bit of how hibernation/suspending works.

There are a number of sleep states on modern machines.

The most basic of these is referred to as S0. It's implemented purely in software (i.e. the kernel), and doesn't do a very good job at preserving battery. While userland processes are suspended, the machine (and the CPU) is still running and using power. As S0 doesn't rely on hardware compatibility, it's enabled on all devices. Using this mode, my RedmiBook's battery drained to 0% overnight.

S1, also known as "shallow" sleep, is similar to S0 but enables some additional power saving features such as suspending power to nonboot CPUs. This mode still doesn't provide significant power saving, however.

S3 ("suspend-to-RAM") saves the system's state to memory and powers off everything but the memory itself. On boot, this state is restored and the system can resume from suspension. This mode is the one known as "deep sleep" and can provide acceptable levels of power saving. Overnight, this drains only about 3-5% battery on my laptop, which is perfectly fine for my needs.

S4 is known as "suspend-to-disk" and works a lot like S3, but instead, as you can probably tell by the name, saves the state to disk. This means you can remove power from the device completely and resuming from suspension would still work as the state is not stored in volatile memory.

ACPI

Modes S1 - S4 require hardware compatibility. This compatibility is usually advertised to the operating system's kernel using ACPI definitions. The kernel uses this information to know what suspension methods to provide to the user.

On some systems (such as the RedmiBook), the ACPI definitions declare no or only conditional support for some (or all) modes.

You can see what sleep states your machine supports by looking into /sys/power/mem_sleep. On my machine, only S0 ("s2idle") was supported:

$ cat /sys/power/mem_sleep
[s2idle]

Annoying. I knew deep sleep works on Windows, so it's not a case of missing hardware support. I suspected misconfigured ACPI tables to be at fault here.

Patching ACPI

Luckily, Linux supports loading "patched" ACPI tables during the boot process. It is possible to grab the currently used tables, decompile them, patch out the parts which block S3 from being supported, recompile, and embed the patched table into a cpio archive.

The specific ACPI component we're interested in is the DSDT table. We can dump this somewhere safe:

# cat /sys/firmware/acpi/tables/DSDT > dsdt.aml

We'll use iasl from the ACPICA software set to decompile the dumped table:

$ iasl -d dsdt.aml

If you get warnings about unresolved references to external control methods, it might be worth decompiling again, but this time including the SSDT tables. See this post at encryp.ch for more info.

You'll end up with a human-readable dsdt.dsl file. You'll want to peek into this and search for "S3 System State" to find what you're looking for. In my case, it was nested into two flag checks, which I simply deleted, so as to advertise S3 support even if the flag checks failed:

@@ -18,7 +18,7 @@
  *     Compiler ID      "    "
  *     Compiler Version 0x01000013 (16777235)
  */
-DefinitionBlock ("", "DSDT", 1, "XMCC  ", "XMCC1953", 0x00000002)
+DefinitionBlock ("", "DSDT", 1, "XMCC  ", "XMCC1953", 0x00000003)
 {
     /*
      * iASL Warning: There were 9 external control methods found during
@@ -769,19 +769,13 @@ DefinitionBlock ("", "DSDT", 1, "XMCC  ", "XMCC1953", 0x00000002)
         Zero,
         Zero
     })
-    If ((CNSB == Zero))
-    {
-        If ((DAS3 == One))
-        {
-            Name (_S3, Package (0x04)  // _S3_: S3 System State
-            {
-                0x03,
-                0x03,
-                Zero,
-                Zero
-            })
-        }
-    }
+    Name (_S3, Package (0x04)  // _S3_: S3 System State
+    {
+        0x03,
+        0x03,
+        Zero,
+        Zero
+    })

     Name (_S4, Package (0x04)  // _S4_: S4 System State
     {

You'll also want to increment the version number by one (as shown above) as the patched table wouldn't be loaded otherwise.

Once this is done, we can recompile it, again using iasl:

$ iasl dsdt.dsl

If this refuses to compile due to the compiler thinking Zero is not a valid type, check out the post at encryp.ch, where they shed some light on this.

Compiling using iasl overwrites the old .aml file. We'll need to create the proper directory tree in order to archive it in a manner which the kernel accepts:

$ mkdir -p kernel/firmware/acpi

Copy the patched table into place and create the archive using the cpio tool:

$ cp dsdt.aml kernel/firmware/acpi/.
$ find kernel | cpio -H newc --create > dsdt_patch

Copy the newly created archive into your boot directory:

# cp dsdt_patch /boot/.

You'll need to figure out how to get your bootloader to load this archive on boot. As I use systemd-boot, I modified my default entry and added the following initrd line before initramfs is loaded:

$ grep initrd /boot/loader/entries/arch.conf
initrd	/amd-ucode.img
initrd  /dsdt_patch
initrd	/initramfs-linux.img

For grub users, you'll need to edit the /boot/grub/grub.cfg file and add the same line.

I also recommend adding the following kernel parameter, as that makes sure that S3 is used by default instead of S0:

mem_sleep_default=deep

After rebooting, peek into /sys/power/mem_sleep once again to make sure deep is supported and enabled as the current mode:

$ cat /sys/power/mem_sleep
s2idle [deep]

It's also a good idea to check whether the system properly suspends and resumes. In my case, there have been no issues and I get excellent battery life during sleep.

Some readers have tested this and reported that the method also works for the RedmiBook 14 and the Ryzen edition of the Xiaomi Notebook Pro 15, which have similar hardware.

chroot shenanigans 2: Running a full desktop environment on an Amazon Kindle

14 April 2019 at 14:00

In my previous post, I described running Arch on an OpenWRT router. Today, I'll be taking it a step further and running Arch and a full LXDE installation natively on an Amazon Kindle, which can be interacted with directly using the touch screen. This is possible thanks to the Kindle's operating system being Linux!

You can see the end result in action here. Apologies for the shaky video - it was shot using my phone and no tripod.

If you're wanting to follow along, make sure you've rooted your Kindle beforehand. This is essential – without it, it's impossible to run custom scripts or binaries.

I'm testing this on an 8th generation Kindle (KT3) – it should, however, work for all recent Kindles given you've enough storage and are rooted. You also need to set up USBnetwork for SSH access and optionally KUAL if you want a simple way of launching the chroot.

First things first: We need to set up a filesystem and extract an Arch installation into it, which we can later chroot into. The filesystem will be a file which will be mounted as a loop device. The reason why we're not extracting the Arch installation directly into a directory on the Kindle is that the Kindle's storage filesystem is FAT32. FAT32 doesn't support required features such as symbolic links, which would break the Arch installation. Please note that this also means that your chroot filesystem can be at most 4 gigabytes large. This can be worked around by mounting the real root inside the chroot filesystem, but it's still a hacky way to go about it. But I digress.

First, figure out how large your filesystem actually can be. SSH into your Kindle and see how much free space you have:

$ ssh root@192.168.15.244

kindle# df -k /mnt/base-us
Filesystem   1K-blocks  Used    Available  Use%  Mounted on
/dev/loop/0  3188640    361856  2826784    11%   /mnt/base-us

Seems like we have around 2800000K (around 2.8G) of space available. Let's make our filesystem 2.6G – it's enough to host our root filesystem and some extra applications, such as LXDE. Note that I'll be running the following commands on my PC and transferring the filesystem over later. You can also do all of this on the Kindle, but it's simply easier and faster this way.
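If you want to script the sizing step, a hypothetical helper along these lines computes it from df output, leaving roughly 200 MB of headroom (the function name and headroom value are my own choices, not anything the Kindle requires):

```shell
# Hypothetical sizing helper: print a safe image size in KiB for the
# given mount point, leaving ~200 MB of free space as headroom.
safe_img_kb() {
  avail=$(df -k "$1" | awk 'NR==2 {print $4}')
  echo $((avail - 200000))
}
# e.g. safe_img_kb /mnt/base-us
```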

Let's create a blank file of the wanted size. I'm using dd, but you can also use fallocate for this:

$ dd if=/dev/zero of=arch.img bs=1024 count=2600000
2600000+0 records in
2600000+0 records out
2662400000 bytes (2.7 GB, 2.5 GiB) copied, 6.92058 s, 385 MB/s

Let's create our filesystem on it. Since we're doing this on the PC, we need to make it 32-bit and disable the metadata_csum and huge_file options on the filesystem, as the Kindle's kernel ext4 driver doesn't support them.

$ mkfs.ext4 -O ^64bit,^metadata_csum,^huge_file arch.img
mke2fs 1.45.0 (6-Mar-2019)
Discarding device blocks: done                            
Creating filesystem with 650000 4k blocks and 162560 inodes
Filesystem UUID: a4e72620-368a-44b4-81bb-9e66b2903523
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

This is optional, but I'll also disable periodic filesystem checks on it:

$ tune2fs -c 0 -i 0 arch.img                               
tune2fs 1.45.0 (6-Mar-2019)         
Setting maximal mount count to -1
Setting interval between checks to 0 seconds

Next it's time to mount the filesystem:

$ mkdir rootfs
$ sudo mount -o loop arch.img rootfs/

The Kindle I'm using has a Cortex-A9-based processor, so let's download the ARMv7 version of Arch Linux ARM from here. You can download it first and extract it afterwards, or simply download and extract at the same time:

$ curl -L http://os.archlinuxarm.org/os/ArchLinuxARM-armv7-latest.tar.gz | sudo tar xz -C rootfs/

sudo is required to extract as it sets up a lot of files with root permissions. You can ignore the errors about SCHILY.fflags. Verify that the files extracted successfully with ls -l rootfs/.

Let's prepare our Kindle for the filesystem. I opted for hosting the filesystem in extensions/karch as I want to use KUAL for easy launching:

$ ssh root@192.168.15.244

kindle# mkdir -p /mnt/base-us/extensions/karch

While we're here, it's also a good idea to stop the power daemon to prevent the Kindle from going into sleep mode while transferring the filesystem and interrupting our transfer:

kindle# stop powerd
powerd stop/waiting

Let's transfer our filesystem:

kindle# exit
Connection to 192.168.15.244 closed.

$ scp arch.img root@192.168.15.244:/mnt/base-us/extensions/karch/

This might take quite a bit of time, depending on your connection.

Once it's done, let's SSH in once again and set up our mountpoint:

$ ssh root@192.168.15.244

kindle# cd /mnt/base-us/extensions/karch/
kindle# mkdir system

I decided to set up my own loop device, so I can have it named, but you can ignore this and opt to use /dev/loop/12 or similar instead. Just make sure it's already not in use with mount.

Setting up a loop point and mounting the filesystem:

kindle# mknod -m0660 /dev/loop/karch b 7 250
kindle# mount -o loop=/dev/loop/karch -t ext4 arch.img system/

We should also mount some system directories into it:

kindle# mount -o bind /dev system/dev
kindle# mount -o bind /dev/pts system/dev/pts
kindle# mount -o bind /proc system/proc
kindle# mount -o bind /sys system/sys
kindle# mount -o bind /tmp system/tmp
kindle# cp /etc/hosts system/etc/

It's time to chroot into our new system and set it up for LXDE. You can also use this opportunity to set up whatever applications you need, such as an onscreen keyboard:

kindle# chroot system/ /bin/bash
chroot# echo 'en_US.UTF-8 UTF-8' > /etc/locale.gen 
chroot# locale-gen
chroot# rm /etc/resolv.conf 
chroot# echo 'nameserver 8.8.8.8' > /etc/resolv.conf
chroot# pacman-key --init # this will take a while
chroot# pacman-key --populate
chroot# pacman -Syu --noconfirm
chroot# pacman -S lxde xorg-server-xephyr --noconfirm

We use Xephyr because it's the easiest way to get our LXDE session up and running. Since the Kindle uses X11 natively, we can try using that. It's possible to stop the native window manager using stop lab126_gui outside the chroot, but then the Kindle will stop updating the screen with new data, leaving it blank – forcing you to use something like eips to refresh the screen. The X server still works, however, and you can confirm this by using something like x11vnc after running your own WM in it. Xephyr spawns a new X server inside the preexisting X server, which is not as efficient but a lot easier.

We can however stop everything else related to the native GUI, as we need the extra memory and we can't use it while LXDE is running anyways:

chroot# exit
kindle# SERVICES="framework pillow webreader kb contentpackd"
kindle# for service in ${SERVICES}; do stop ${service}; done

While we're here, we need to get the screen size for later:

kindle# eips -i | grep 'xres:' | awk '{print $2"x"$4}'
600x800

Let's chroot back into the system and see if we can get LXDE to run. Be sure to replace the screen size parameter if needed:

kindle# chroot system/ /bin/bash
chroot# export DISPLAY=:0
chroot# Xephyr :1 -title "L:A_N:application_ID:xephyr" -screen 600x800 -cc 4 -nocursor &
chroot# export DISPLAY=:1
chroot# lxsession &
chroot# xrandr -o right

If everything goes well, you should have LXDE visible on your Kindle's screen. Ta-da! Feel free to play around with it. I've found that the touch screen is surprisingly accurate, even though it is using an IR LED system to detect touches instead of a normal digitizer.

Once done in the chroot, Ctrl-C + Ctrl-D can be issued to exit the chroot. We can then restore the Kindle UI by doing:

kindle# for service in ${SERVICES}; do start ${service}; done

It might take a while for anything to display again.

I've mentioned setting up a KUAL extension to automate the entering and exiting of the chroot. You can find that here. If you're interested in using this, make sure you've set up your filesystem first, copied it over to the same directory as the extension, and named it arch.img. Nothing else is mandatory - the extension will do the rest for you.

chroot shenanigans: Running Arch Linux on OpenWRT (LEDE) routers

21 March 2019 at 14:45

Here's some notes on how to get Arch Linux running on OpenWRT devices. I'm using an Inteno IOPSYS (OpenWRT-based) DG400 for this, which has a Broadcom BCM963138 SoC - reportedly ARMv7 but not really (I'll get to that later).

I figured it would be fun trying to run Arch on such an unconventional device. I ran into 3 issues which I will be discussing, and the workarounds for them.

I've already "hacked" my router and have direct root access to the system, so I won't be discussing that in this post. If you're interested, check out any of my older posts with a CVE label for more information, or if you're brave and want to compile and flash custom firmware on your Inteno router, check out this post.

I used the lovely Arch Linux ARM community project as the basis for this. The plan of action: Grab a tarball of a compiled system for my architecture (ARMv7), extract it on the router and use chroot to effectively "run" it as if it was the root filesystem. Seems simple enough.

Issue 1: Space

These sort of devices are usually built with very limited storage to keep production costs down. The firmware just about fits on the onboard flash with some extra space for temporary files. It's not meant to be used as your conventional system.

df -h reported my root filesystem to only have 304 Kb of available space, and my tmp filesystem to have 100 Mb. Considering that the Arch tarball itself is already over 500 Mb, the device doesn't have nearly enough space to fit another OS on it.

The solution for this is quite simple: Use a USB drive. Indeed, my DG400 router has a USB2.0 and 3.0 port presumably for sticking pen drives into them. Evidently, seeing as any drives inserted are automatically mounted in /mnt (I'm unsure whether this is done by OpenWRT by default or if it's an IOPSYS feature).

It's settled then. I used my PC to format a pen drive as ext4 (FAT won't work for this very well), downloaded the ARMv7 tarball and extracted it onto the pen drive:

# umount /dev/sdc1 # (replace with your USB drive)
# mkfs.ext4 /dev/sdc1
# mount /dev/sdc1 /mnt
# mkdir /mnt/archfs
# wget http://os.archlinuxarm.org/os/ArchLinuxARM-armv7-latest.tar.gz
# bsdtar -xpf ArchLinuxARM-armv7-latest.tar.gz -C /mnt/archfs

Done. After plugging the USB drive into the router, it got automatically mounted at /mnt/usb0 (might differ). However, it got mounted with the noexec flag, which will prevent executables being run. It's easy enough to remount it. On the router:

# mount /mnt/usb0 -o exec,remount

Great! It's time to test if we can now actually chroot into it:

# chroot /mnt/usb0/archfs /bin/bash
Illegal instruction (core dumped)

Uh oh. Looks like something is still wrong. Which brings us to…

Issue 2: Not all ARM is created equal

Looks like we're running into some instructions while running bash that our processor doesn't support. Let's see if we're still ARMv7 and I hadn't messed up:

# cat /proc/cpuinfo 
processor       : 0
model name      : ARMv7 Processor rev 1 (v7l)
BogoMIPS        : 1325.05
Features        : half thumb fastmult edsp tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x4
CPU part        : 0xc09
CPU revision    : 1

Strange. We're using the ARMv7 tarball, so it should all be groovy. My custom firmware is compiled with GDB, which I could use to see exactly which instruction it's failing on. Since there's no way of running GDB plus any of my Arch binaries natively without library mismatches, I opted to simply grab the core dump and use that instead. I looked into /proc/sys/kernel/core_pattern to identify the script responsible for handling coredumps and modified it to dump cores to the root of my USB stick instead. I could then use GDB to look through the backtrace:

# gdb /mnt/usb0/archfs/bin/grep /mnt/usb0/coredump -q
Reading symbols from archfs/bin/grep...(no debugging symbols found)...done.
[New LWP 14713]

warning: Could not load shared library symbols for /lib/ld-linux-armhf.so.3.
Do you need "set solib-search-path" or "set sysroot"?
Core was generated by `/bin/grep'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0xb6fe5ba4 in ?? ()

I needed to set the proper sysroot as well, to fetch proper library symbols:

(gdb) set sysroot /mnt/usb0/archfs/
Reading symbols from /mnt/usb0/archfs/lib/ld-linux-armhf.so.3...(no debugging symbols found)...done.
(gdb) disas 0xb6fe5ba4
Dump of assembler code for function __sigsetjmp:
   0xb6fe5b70 <+0>:	movw	r12, #28028	; 0x6d7c
   0xb6fe5b74 <+4>:	movt	r12, #1
   0xb6fe5b78 <+8>:	ldr	r2, [pc, r12]
   0xb6fe5b7c <+12>:	mov	r12, r0
   0xb6fe5b80 <+16>:	mov	r3, sp
   0xb6fe5b84 <+20>:	eor	r3, r3, r2
   0xb6fe5b88 <+24>:	str	r3, [r12], #4
   0xb6fe5b8c <+28>:	eor	r3, lr, r2
   0xb6fe5b90 <+32>:	str	r3, [r12], #4
   0xb6fe5b94 <+36>:	stmia	r12!, {r4, r5, r6, r7, r8, r9, r10, r11}
   0xb6fe5b98 <+40>:	movw	r3, #28064	; 0x6da0
   0xb6fe5b9c <+44>:	movt	r3, #1
   0xb6fe5ba0 <+48>:	ldr	r2, [pc, r3]
=> 0xb6fe5ba4 <+52>:	vstmia	r12!, {d8-d15}
   0xb6fe5ba8 <+56>:	tst	r2, #512	; 0x200
   0xb6fe5bac <+60>:	beq	0xb6fe5bc8 <__sigsetjmp+88>
   0xb6fe5bb0 <+64>:	stfp	f2, [r12], #8
   0xb6fe5bb4 <+68>:	stfp	f3, [r12], #8
   0xb6fe5bb8 <+72>:	stfp	f4, [r12], #8
   0xb6fe5bbc <+76>:	stfp	f5, [r12], #8
   0xb6fe5bc0 <+80>:	stfp	f6, [r12], #8
   0xb6fe5bc4 <+84>:	stfp	f7, [r12], #8
   0xb6fe5bc8 <+88>:	b	0xb6fe39d8 <__sigjmp_save>
End of assembler dump.

Looks like our processor didn't like the vstmia instruction. Can't imagine why - it seems to be a valid ARMv7 instruction.

After reading through some reference manuals and consulting others online, it turned out that my SoC processor is crippled: A set of instructions simply wasn't supported by my processor. Luckily, thanks to those instructions not existing in ARMv5 and ARM being backwards-compatible, I could simply use the ARMv5-compiled system instead.

Repeating the steps to create the root filesystem, this time using the ArchLinuxARM-armv5-latest.tar.gz tarball instead, showed promising results. I could finally:

# chroot /mnt/usb0/archfs /bin/bash
[root@iopsys /]# cat /etc/os-release
NAME="Arch Linux ARM"
PRETTY_NAME="Arch Linux ARM"
ID=archarm

I exited the chroot after confirming it works. We still needed to mount a few filesystems so the chroot could see and interact with them, and copy some files over. I wrote a helper script for all of that, which you can find here.
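
The script itself isn't reproduced here, but a minimal chroot-prep helper usually boils down to something like this (a sketch of the typical steps, not the actual script; prepare_chroot is a name I made up):

```shell
# Sketch of a chroot-prep helper: bind-mount the pseudo-filesystems the
# chroot needs and copy DNS configuration over. Run as root.
prepare_chroot() {
    root=$1
    mount -t proc proc "$root/proc"      # process info for ps, pacman, etc.
    mount --rbind /sys "$root/sys"       # kernel/sysfs view
    mount --rbind /dev "$root/dev"       # device nodes, /dev/null, ptys
    cp /etc/resolv.conf "$root/etc/"     # so DNS works inside the chroot
}

# usage (as root):
#   prepare_chroot /mnt/usb0/archfs
#   chroot /mnt/usb0/archfs /bin/bash
```

Without /proc and /dev mounted, many tools inside the chroot (pacman included) fail in confusing ways, so this is worth doing before anything else.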

Great, we can now initialise pacman and try upgrading the system.

# pacman-key --init
# pacman-key --populate archlinuxarm
# pacman -Syu

error: out of memory

Issue 3: Memory problems

Honestly, I should've seen this one coming. free -m showed that I was working with around 100 MB of usable memory, which is not much - no wonder pacman crapped out. Luckily, my device kernel was compiled with swap support. This essentially allows the system to "swap" memory contents out to the filesystem and load them back later when necessary. It's very slow compared to real memory, but it gets the job done in a pinch. I created a 1G swapfile on my USB drive and activated it, whilst inside the chroot:

# truncate -s 0   /swapfile
# chattr +C       /swapfile
# fallocate -l 1G /swapfile
# chmod 600       /swapfile
# mkswap          /swapfile
# swapon          /swapfile
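
As a quick sanity check (my addition, not a step from the original write-up), free and swapon can confirm the file is active:

```shell
# After swapon succeeds, the Swap line in free's output should show
# roughly 1024 MB total instead of 0.
free -m
swapon --show   # lists /swapfile with its size and priority
```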

Running pacman again allowed me to continue upgrading the system, which it finished successfully.

At this point, I had a fully functional Arch Linux system which I could chroot into and utilise pretty much to the maximum. I've successfully set up Python bots, compiled software with gcc/g++, and so on - everything you'd expect from a normal system. I don't know why you would want to do this, but it's definitely possible.

I realise that it may not go this smoothly on other systems. For example, a large portion of routers utilise the MIPS architecture instead of ARM. If this is the case for you, it unfortunately means that Arch Linux is off the table, as it doesn't have any functioning MIPS builds. However, the Debian community maintains an active MIPS port of Debian which you might want to look into instead. Everything in this post should still pretty much apply to Debian/MIPS as well, with some minor differences.

This has also been done on other unconventional devices. Reddit user parkerlreed used a similar procedure to run Arch Linux on a Steam Link, which you can read about here - it even has instructions on how to compile applications natively on it.

Infinity in 15 kilograms

By: VM
19 April 2024 at 03:53

While space is hard, there are also different kinds of hardness. For example, on April 15, ISRO issued a press release saying it had successfully tested nozzles made of a carbon-carbon composite that would replace those made of Columbium alloy in the PSLV rocket's fourth stage and thus increase the rocket's payload capacity by 15 kg. Just 15 kg!

The successful testing of the C-C nozzle divergent marked a major milestone for ISRO. On March 19, 2024, a 60-second hot test was conducted at the High-Altitude Test (HAT) facility in ISRO Propulsion Complex (IPRC), Mahendragiri, confirming the system's performance and hardware integrity. Subsequent tests, including a 200-second hot test on April 2, 2024, further validated the nozzle's capabilities, with temperatures reaching 1216K, matching predictions.

Granted, the PSLV's cost of launching a single kilogram to low-earth orbit is more than 8 lakh rupees (a very conservative estimate, I reckon) – so an additional 15 kg translates to at least an additional Rs 1.2 crore per launch. But finances alone are not a useful way to evaluate this addition: more payload mass could mean, say, one additional instrument onboard an indigenous spacecraft instead of waiting for a larger rocket to become available or postponing that instrument's launch to a future mission.

But equally fascinating, and pride- and notice-worthy, to me is the fact that ISRO's scientists and engineers were able to fine-tune the PSLV to this extent. This isn't to say I'm surprised they were able to do it at all; on the contrary, it means the feat is as much about the benefits accruing to the rocket, and the Indian space programme by extension, as about R&D advances on the materials science front. It speaks to the oft-underestimated importance of the foundations on which a space programme is built.

Vikram Sarabhai Space Centre … has leveraged advanced materials like Carbon-Carbon (C-C) Composites to create a nozzle divergent that offers exceptional properties. By utilizing processes such as carbonization of green composites, Chemical Vapor Infiltration, and High-Temperature Treatment, it has produced a nozzle with low density, high specific strength, and excellent stiffness, capable of retaining mechanical properties even at elevated temperatures.
A key feature of the C-C nozzle is its special anti-oxidation coating of Silicon Carbide, which extends its operational limits in oxidizing environments. This innovation not only reduces thermally induced stresses but also enhances corrosion resistance, allowing for extended operational temperature limits in hostile environments.

The advances here draw from insights into metallurgy, crystallography, ceramic engineering, composite materials, numerical methods, etc., which in turn stand on the shoulders of people trained well enough in these areas, the educational institutions (and their teachers) that did so, and the schooling system and socio-economic support structures that brought them there. A country needs a lot to go right for achievements like squeezing an extra 15 kg into the payload capacity of an already highly fine-tuned machine to be possible. It's a bummer that such advances are currently largely vertically restricted, except in the case of the Indian space programme, rather than diffusing freely across sectors.

Other enterprises ought to have these particular advantages ISRO enjoys. Even should one or two rockets fail, a test not work out or a spacecraft go kaput sooner than designed, the PSLV's new carbon-carbon-composite nozzles stand for the idea that we have everything we need to keep trying, including the opportunity to do better next time. They represent the idea of how advances in one field of research can lead to advances in another, such that each field is no longer held back by the limitations of its starting conditions.
