Make Your Own Backup System – Part 2: Forging the FreeBSD Backup Stronghold

29 July 2025 at 06:00
Build a bulletproof backup server with FreeBSD, ZFS, and jails. Complete guide covering encryption, security hardening, and multiple backup strategies for enterprise-grade data protection.

Another thing V7 Unix gave us is environment variables

By: cks
29 July 2025 at 02:38

Simon Tatham recently wondered "Why is PATH called PATH?". This made me wonder the closely related question of when environment variables appeared in Unix, and the answer is that the environment and environment variables appeared in V7 Unix as another of the things that made it so important to Unix history (also).

Up through V6, the exec system call and family of system calls took two arguments, the path and the argument list; we can see this in both the V6 exec(2) manual page and the implementation of the system call in the kernel. As bonus trivia, it appears that the V6 exec() limited you to 510 characters of arguments (and probably V1 through V5 had a similarly low limit, but I haven't looked at their kernel code).

In V7, the exec(2) manual page now documents a possible third argument, and the kernel implementation is much more complex, plus there's an environ(5) manual page about it. Based on h/param.h, V7 also had a much higher size limit on the combined size of arguments and environment variables, which isn't all that surprising given the addition of the environment. Commands like login.c were updated to put some things into the new environment; login sets a default $PATH and a $HOME, for example, and environ(5) documents various other uses (which I haven't checked in the source code).

This implies that the V7 shell is where $PATH first appeared in Unix, where the manual page describes it as 'the search path for commands'. This might make you wonder how the V6 shell handled locating commands, and where it looked for them. The details are helpfully documented in the V6 shell manual page, and I'll just quote what it has to say:

If the first argument is the name of an executable file, it is invoked; otherwise the string `/bin/' is prepended to the argument. (In this way most standard commands, which reside in `/bin', are found.) If no such command is found, the string `/usr' is further prepended (to give `/usr/bin/command') and another attempt is made to execute the resulting file. (Certain lesser-used commands live in `/usr/bin'.)

('Invoked' here is carrying some extra freight, since this may not involve a direct kernel exec of the file. An executable file that the kernel didn't like would be directly run by the shell.)

I suspect that '$PATH' was given such a short name (instead of a longer, more explicit one) simply as a matter of Unix style at the time. Pretty much everything in V7 was terse and short in this style for various reasons, and verbose environment variable names would have eaten into that limited exec argument space.

Unix had good reasons to evolve since V7 (and had to)

By: cks
27 July 2025 at 03:21

There's a certain sort of person who feels that the platonic ideal of Unix is somewhere around Research Unix V7 and it's almost all been downhill since then (perhaps with the exception of further Research Unixes and then Plan 9, although very few people got their hands on any of them). For all that I like Unix and started using it long ago when it was simpler (although not as far back as V7), I reject this view and think it's completely mistaken.

V7 Unix was simple but it was also limited, both in its implementation (which often took shortcuts (also, also, also)) and in its overall features (such as short filenames). Obviously V7 didn't have networking, but it also lacked things that most people now think of as perfectly reasonable and good Unix features, like '#!' support for shell scripts in the kernel and processes being in multiple groups at once. That V7 was a simple and limited system meant that its choices were to grow to meet people's quite reasonable needs or to fall out of use.

(Some of these needs were for features and some of them were for performance. The original V7 filesystem was quite simple but also suffered from performance issues, ones that often got worse over time.)

I'll agree that the path that the growth of Unix has taken since V7 is not necessarily ideal; we can all point to various things about modern Unixes that we don't like. The particular flaws came about partly because people don't necessarily make ideal decisions and partly because they didn't necessarily have a perfect understanding of the problems at the time they had to do something; once they'd done something, they were constrained by backward compatibility.

(In some ways Plan 9 represents 'Unix without the constraint of backward compatibility', and while I think there are a variety of reasons that it failed to catch on in the world, that lack of compatibility is one of them. Even if you had access to Plan 9, you had to be fairly dedicated to do your work in a Plan 9 environment (and that was before the web made it worse).)

PS: It's my view that the people who are pushing various Unixes forward aren't incompetent, stupid, or foolish. They're rational and talented people who are doing their best in the circumstances that they find themselves in. If you want to throw stones, don't throw them at the people; throw them at the overall environment that constrains and shapes how everything in this world is pushed to evolve. Unix is far from the only thing shaped in potentially undesirable ways by these forces; consider, for example, C++.

(It's also clear that a lot of people involved in the historical evolution of BSD and other Unixes were really quite smart, even if you don't like, for example, the BSD sockets API.)

NFS v4 delegations on a Linux NFS server can act as mandatory locks

By: cks
23 July 2025 at 01:33

Over on the Fediverse, I shared an unhappy learning experience:

Linux kernel NFS: we don't have mandatory locks.
Also Linux kernel NFS: if the server has delegated a file to a NFS client that's now not responding, good luck writing to the file from any other machine. Your writes will hang.

NFS v4 delegations are a feature where the NFS server, such as your Linux fileserver, hands a lot of authority over a particular file to a client that is using that file. There are various sorts of delegations, but even a basic read delegation will force the NFS server to recall the delegation if anything else wants to write to the file or to remove it. Recalling a delegation requires notifying the NFS v4 client that it has lost the delegation and then having the client accept and respond to that. NFS v4 clients have to respond to the loss of a delegation because they may be holding local state that needs to be flushed back to the NFS server before the delegation can be released.

(After all, the NFS v4 server promised the client 'this file is yours to fiddle around with, I will consult you before touching it'.)

Under some circumstances, when the NFS v4 server is unable to contact the NFS v4 client, it will simply sit there waiting, and as part of that it will not allow you to do things that require the delegation to be released. I don't know if there's a delegation recall timeout, although I suspect that there is, and I don't know how to find out what it is; whatever the value is, it's substantial (it may be the 90 second 'default lease time' from nfsd4_init_leases_net(), or perhaps the 'grace' period, also probably 90 seconds, or perhaps the two added together).

(90 seconds is not what I consider a tolerable amount of time for my editor to completely freeze when I tell it to write out a new version of the file. When NFS is involved, I will typically assume that something has gone badly wrong well before then.)
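On the fileserver you can at least check the lease and grace times through procfs, assuming the nfsd filesystem is mounted in the usual place (the file names here are from my reading of the nfsd code, and the values shown are the apparent defaults):

; cat /proc/fs/nfsd/nfsv4leasetime
90
; cat /proc/fs/nfsd/nfsv4gracetime
90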

As mentioned, the NFS v4 RFC also explicitly notes that NFS v4 clients may have to flush file state in order to release their delegation, and this itself may take some time. So even without an unavailable client machine, recalling a delegation may stall for some possibly arbitrary amount of time (depending on how the NFS v4 server behaves; the RFC encourages NFS v4 servers to not be hasty if the client seems to be making a good faith effort to clear its state). Both the slow client recall and the hung client recall can happen even in the absence of any actual file locks; in my case, the now-unavailable client merely having read from the file was enough to block things.

This blocking recall is effectively a mandatory lock, and it affects both remote operations over NFS and local operations on the fileserver itself. Short of waiting out whatever timeout applies, you have two realistic choices to deal with this (the non-realistic choice is to reboot the fileserver). First, you can bring the NFS client back to life, or at least something that's at its IP address and responds to the server with NFS v4 errors. Second, I believe you can force everything from the client to expire through /proc/fs/nfsd/clients/<ID>, by writing 'expire' to the client's 'ctl' file. You can find the right client ID by grep'ing for something in all of the clients/*/info files.
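To make the second option concrete, here's a minimal sketch of that recovery (the client ID and the IP address are made up for illustration, and you need to be root):

; grep -l 128.100.3.50 /proc/fs/nfsd/clients/*/info
/proc/fs/nfsd/clients/17/info
; echo expire > /proc/fs/nfsd/clients/17/ctl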

Discovering this makes me somewhat more inclined than before to consider entirely disabling 'leases', the underlying kernel feature that is used to implement these NFS v4 delegations (I discovered how to do this when investigating NFS v4 client locks on the server). This will also affect local processes on the fileserver, but that now feels like a feature since hung NFS v4 delegation recalls will stall or stop even local operations.
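(If we do go that way, the knob involved is the generic kernel lease sysctl. As a sketch:

; sysctl -w fs.leases-enable=0

My understanding is that this stops new leases from being handed out rather than revoking existing ones, so it's something you want in place before clients show up.)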

Why Ubuntu 24.04's ls can show a puzzling error message on NFS filesystems

By: cks
18 July 2025 at 02:19

Suppose that you're on Ubuntu 24.04, using NFS v4 filesystems mounted from a Linux NFS fileserver, and at some point you do a 'ls -l' or a 'ls -ld' of something you don't own. You may then be confused and angered:

; /bin/ls -ld ckstst
/bin/ls: ckstst: Permission denied
drwx------ 64 ckstst [...] 131 Jul 17 12:06 ckstst

(There are situations where this doesn't happen or doesn't repeat, which I don't understand but which I'm assuming are NFS caching in action.)

If you apply strace to the problem, you'll find that the failing system call is listxattr(2), which is trying to list 'extended attributes'. On Ubuntu 24.04, ls comes from Coreutils, and Coreutils apparently started using listxattr() in version 9.4.

The Linux NFS v4 code supports extended attributes (xattrs), which are from RFC 8276; they're supported in both the client and the server since mid-2020, if I'm reading git logs correctly. Both the normal Ubuntu 22.04 LTS and 24.04 LTS server kernels are recent enough to include this support on both the server and clients, and I don't believe there's any way to turn just extended attributes off in the kernel server (although if you disable NFS v4.2 they may disappear too).

However, the NFS v4 server doesn't treat listxattr() operations the way the kernel normally does. Normally, the kernel will let you do listxattr() on an object (a directory, a file, etc) that you don't have read permissions on, just as it will let you do stat() on it. However, the NFS v4 server code specifically requires that you have read access to the object. If you don't, you get EACCES (no second S).

(The sausage is made in nfsd_listxattr() in fs/nfsd/vfs.c, specifically in the fh_verify() call that uses NFSD_MAY_READ instead of NFSD_MAY_NOP, which is what eg GETATTR uses.)
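You can see the NFS v4 behavior directly with getfattr(1), which uses listxattr(); run against an unreadable directory on an NFS v4 mount you get something like this (the directory name is hypothetical):

; getfattr -m - ckstst
getfattr: ckstst: Permission denied

The same command against an unreadable directory on a local filesystem succeeds, printing either nothing or the attribute names.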

In January of this year, Coreutils applied a workaround to this problem, which appeared in Coreutils 9.6 (and is mentioned in the release notes).

Normally we'd have found this last year, but we've been slow to roll out Ubuntu 24.04 LTS machines and apparently until now no one ever did a 'ls -l' of unreadable things on one of them (well, on a NFS mounted filesystem).

(This elaborates on a Fediverse post. Our patch is somewhat different than the official one.)

The development version of OpenZFS is sometimes dangerous, illustrated

By: cks
12 July 2025 at 02:41

I've used OpenZFS on my office and home desktops (on Linux) for what is a long time now, and over that time I've consistently used the development version of OpenZFS, updating to the latest git tip on a regular basis (cf). There have been occasional issues but I've said, and continue to say, that the code that goes into the development version is generally well tested and I usually don't worry too much about it. But I do worry somewhat, and I do things like read every commit message for the development version and I sometimes hold off on updating my version if a particular significant change has recently landed.

But, well, sometimes things go wrong in a development version. As covered in Rob Norris's An (almost) catastrophic OpenZFS bug and the humans that made it (and Rust is here too) (via), there was a recently discovered bug in the development version of OpenZFS that could or would have corrupted RAIDZ vdevs. When I saw the fix commit go by in the development version, I felt extremely lucky that I use mirror vdevs, not raidz, and so avoided being affected by this.

(While I might have detected this at the first scrub after some data was corrupted, the data would have been gone and at a minimum I'd have had to restore it from backups. Which I don't currently have on my home desktop.)

In general this is a pointed reminder that the development version of OpenZFS isn't perfect, no matter how long I and other people have been lucky with it. You might want to think twice before running the development version in order to, for example, get support for the very latest kernels that are used by distributions like Fedora. Perhaps you're better off delaying your kernel upgrades a bit longer and sticking to released branches.

I don't know if this is going to change my practices around running the development version of OpenZFS on my desktops. It may make me more reluctant to update to the very latest version on my home desktop; it would be straightforward to have that run only time-delayed versions of what I've already run through at least one scrub cycle on my office desktop (where I have backups). And I probably won't switch to the next release version when it comes out, partly because of kernel support issues.

(Maybe) understanding how to use systemd-socket-proxyd

By: cks
10 July 2025 at 03:47

I recently read systemd has been a complete, utter, unmitigated success (via among other places), where I found a mention of an interesting systemd piece that I'd previously been unaware of, systemd-socket-proxyd. As covered in the article, the major purpose of systemd-socket-proxyd is to bridge between systemd's dynamic socket activation and a conventional program that listens on some socket, so that you can dynamically activate the program when a connection comes in. Unfortunately the systemd-socket-proxyd manual page is a little bit opaque about how it works for this purpose (and what the limitations are). Even though I'm familiar with systemd stuff, I had to think about it for a bit before things clicked.

A systemd socket unit activates the corresponding service unit when a connection comes in on the socket. For simple services that are activated separately for each connection (with 'Accept=yes'), this is actually a templated unit, but if you're using it to activate a regular daemon like sshd (with 'Accept=no') it will be a single .service unit. When systemd activates this unit, it will pass the socket to it either through systemd's native mechanism or an inetd-compatible mechanism using standard input. If your listening program supports either mechanism, you don't need systemd-socket-proxyd and your life is simple. But plenty of interesting programs don't; they expect to start up and bind to their listening socket themselves. To work with these programs, systemd-socket-proxyd accepts a socket (or several) from systemd and then proxies connections on that socket to the socket your program is actually listening to (which will not be the official socket, such as port 80 or 443).

All of this is perfectly fine and straightforward, but the question is, how do we get our real program to be automatically started when a connection comes in and triggers systemd's socket activation? The answer, which isn't explicitly described in the manual page but which appears in the examples, is that we make the socket's .service unit (which will run systemd-socket-proxyd) also depend on the .service unit for our real service with a 'Requires=' and an 'After='. When a connection comes in on the main socket that systemd is doing socket activation for, call it 'fred.socket', systemd will try to activate the corresponding .service unit, 'fred.service'. As it does this, it sees that fred.service depends on 'realthing.service' and must be started after it, so it will start 'realthing.service' first. Your real program will then start, bind to its local socket, and then have systemd-socket-proxyd proxy the first connection to it.

To automatically stop everything when things are idle, you set systemd-socket-proxyd's --exit-idle-time option and also set StopWhenUnneeded=true on your program's real service unit ('realthing.service' here). Then when systemd-socket-proxyd is idle for long enough, it will exit, systemd will notice that the 'fred.service' unit is no longer active, see that there's nothing that needs your real service unit any more, and shut that unit down too, causing your real program to exit.
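Putting the pieces together, here is a sketch of the three units involved (the names, ports, and program paths are hypothetical, and the systemd-socket-proxyd binary lives in different places on different distributions):

# fred.socket
[Socket]
ListenStream=80

[Install]
WantedBy=sockets.target

# fred.service, which runs the proxy and drags in the real service
[Unit]
Requires=realthing.service
After=realthing.service

[Service]
ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=30s 127.0.0.1:8080

# realthing.service, the real program, which binds 127.0.0.1:8080 itself
[Unit]
StopWhenUnneeded=true

[Service]
ExecStart=/usr/sbin/some-real-daemon --listen 127.0.0.1:8080

With this in place, only 'fred.socket' needs to be enabled and started at boot; everything else comes up on demand.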

The obvious limitation of using systemd-socket-proxyd is that your real program no longer knows the actual source of the connection. If you use systemd-socket-proxyd to relay HTTP connections on port 80 to an nginx instance that's activated on demand (as shown in the examples in the systemd-socket-proxyd manual page), that nginx sees and will log all of the connections as local ones. There are usage patterns where this information will be added by something else (for example, a frontend server that is a reverse proxy to a bunch of activated on demand backend servers), but otherwise you're out of luck as far as I know.

Another potential issue is that systemd's idea of when the .service unit for your real program has 'started' and thus it can start running systemd-socket-proxyd may not match when your real program actually gets around to setting up its socket. I don't know if systemd-socket-proxyd will wait and try a bit to cope with the situation where it gets started a bit faster than your real program can get its socket ready.

(Systemd has ways that your real program can signal readiness, but if your program can use these ways it may well also support being passed sockets from systemd as a direct socket activated thing.)

Linux 'exportfs -r' stops on errors (well, problems)

By: cks
9 July 2025 at 03:26

Linux's NFS export handling system has a very convenient option where you don't have to put all of your exports into one file, /etc/exports, but can instead write them into a bunch of separate files in /etc/exports.d. This is very convenient for allowing you to manage filesystem exports separately from each other and to add, remove, or modify only a single filesystem's exports. Also, one of the things that exportfs(8) can do is 'reexport' all current exports, synchronizing the system state to what is in /etc/exports and /etc/exports.d; this is 'exportfs -r', and is a handy thing to do after you've done various manipulations of files in /etc/exports.d.

Although it's not documented and not explicit in 'exportfs -v -r' (which will claim to be 'exporting ...' for various things), I have an important safety tip which I discovered today: exportfs does nothing on a re-export if you have any problems in your exports. In particular, if any single file in /etc/exports.d has a problem, no files from /etc/exports.d get processed and no exports are updated.

One potential problem with such files is syntax errors, which is fair enough as a 'problem'. But another problem is that they refer to directories that don't exist, for example because you have lingering exports for a ZFS pool that you've temporarily exported (which deletes the directories that the pool's filesystems may have previously been mounted on). A missing directory is an error even if the exportfs options include 'mountpoint', which only does the export if the directory is a mount point.

When I stubbed my toe on this I was surprised. What I'd vaguely expected was that the error would cause only the particular file in /etc/exports.d to not be processed, and that it wouldn't be a fatal error for the entire process. Exportfs itself prints no notices about this being a fatal problem, and it will happily continue to process other files in /etc/exports.d (as you can see with 'exportfs -v -r' with the right ordering of where the problem file is) and claim to be exporting them.
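(The corollary safety tip is to verify that a re-export actually did something, for example by looking for your new or changed filesystem in the current export list afterward, with a hypothetical path here:

; exportfs -r
; exportfs -s | grep -F /w/scratch

No output from the grep means your 'exportfs -r' silently did nothing.)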

Oh well, now I know and hopefully it will stick.

Systemd user units, user sessions, and environment variables

By: cks
8 July 2025 at 03:17

A variety of things in typical graphical desktop sessions communicate through the use of environment variables; for example, X's $DISPLAY environment variable. Somewhat famously, modern desktops run a lot of things as systemd user units, and it might be nice to do that yourself (cf). When you put these two facts together, you wind up with a question, namely how the environment works in systemd user units and what problems you're going to run into.

The simplest case is using systemd-run to run a user scope unit ('systemd-run --user --scope --'), for example to run a CPU heavy thing with low priority. In this situation, the new scope will inherit your entire current environment and nothing else. As far as I know, there's no way to do this with other sorts of things that systemd-run will start.
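(As a tiny illustration of the scope case, with a made-up command:

; systemd-run --user --scope -- nice -n 19 make -j8

The make runs in a fresh scope unit but sees your full current environment.)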

Non-scope user units by default inherit their environment from your user "systemd manager". I believe that there is always only a single user manager for all sessions of a particular user, regardless of how you've logged in. When starting things via 'systemd-run', you can selectively pass environment variables from your current environment with 'systemd-run --user -E <var> -E <var> -E ...'. If the variable is unset in your environment but set in the user systemd manager, this will unset it for the new systemd-run started unit. As you can tell, this will get very tedious if you want to pass a lot of variables from your current environment into the new unit.

You can manipulate your user "systemd manager environment block", as systemctl describes it in Environment Commands. In particular, you can export current environment settings to it with 'systemctl --user import-environment VAR VAR2 ...'. If you look at this with 'systemctl --user show-environment', you'll see that your desktop environment has pushed a lot of environment variables into the systemd manager environment block, including things like $DISPLAY (if you're on X). All of these environment variables for X, Wayland, DBus, and so on are probably part of how the assorted user units that are part of your desktop session talk to the display and so on.
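A quick sketch of poking at the manager's environment block (which variables you care about will vary):

; systemctl --user show-environment | grep -E '^(DISPLAY|WAYLAND_DISPLAY)='
; systemctl --user import-environment DISPLAY XAUTHORITY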

You may now see a little problem. What happens if you're logged in with a desktop X session, and then you go elsewhere and SSH in to your machine (maybe with X forwarding) and try to start a graphical program as a systemd user unit? Since you only have a single systemd manager regardless of how many sessions you have, the systemd user unit you started from your SSH session will inherit all of the environment variables that your desktop session set and it will think it has graphics and open up a window on your desktop (which is hopefully locked, and in any case it's not useful to you over SSH). If you import the SSH session's $DISPLAY (or whatever) into the systemd manager's environment, you'll damage your desktop session.

For specific environment variables, you can override or remove them with 'systemd-run --user -E ...' (for example, to override or remove $DISPLAY). But hunting down all of the session environment variables that may trigger undesired effects is up to you, making systemd-run's user scope units by far the easiest way to deal with this.

(I don't know if there's something extra-special about scope units that enables them and only them to be passed your entire environment, or if this is simply a limitation in systemd-run that it doesn't try to implement this for anything else.)

The reason I find all of this regrettable is that it makes putting applications and other session processes into their own units much harder than it should be. Systemd-run's scope units inherit your session environment but can't be detached, so at a minimum you have extra systemd-run processes sticking around (and putting everything into scopes when some of them might be services is unaesthetic). Other units can be detached but don't inherit your environment, requiring assorted contortions to make things work.

PS: Possibly I'm missing something obvious about how to do this correctly, or perhaps there's an existing helper that can be used generically for this purpose.

What is going on in Unix with errno's limited nature

By: cks
4 July 2025 at 02:13

If you read manual pages, such as Linux's errno(3), you'll soon discover an important and peculiar seeming limitation of looking at errno. To quote the Linux version:

The value in errno is significant only when the return value of the call indicated an error (i.e., -1 from most system calls; -1 or NULL from most library functions); a function that succeeds is allowed to change errno. The value of errno is never set to zero by any system call or library function.

This is also more or less what POSIX says in errno, although in standards language that's less clear. All of this is a sign of what has traditionally been going on behind the scenes in Unix.

The classical Unix approach to kernel system calls doesn't involve returning multiple values, such as a regular return value plus errno. Instead, Unix kernels have traditionally returned either a success value or the errno value along with an indication of failure, telling them apart in various ways (such as the PDP-11 return method). At the C library level, the simple approach taken in early Unix was that system call wrappers only bothered to set the C level errno if the kernel signaled an error. See, for example, the V7 libc/crt/cerror.s combined with libc/sys/dup.s, where the dup() wrapper only jumps to cerror and sets errno if the kernel signals an error. The system call wrappers could all have explicitly set errno to 0 on success, but they didn't.

The next issue is that various C library calls may make a number of system calls themselves, some of which may fail without the library call itself failing. The classical case is stdio checking to see whether stdout is connected to a terminal and so should be line buffered, which was traditionally implemented by trying to do a terminal-only ioctl() to the file descriptors, which would fail with ENOTTY on non-terminal file descriptors. Even if stdio did a successful write() rather than only buffering your output, the write() system call wrapper wouldn't change the existing ENOTTY errno value from the failed ioctl(). So you can have a fwrite() (or printf() or puts() or other stdio call) that succeeds while 'setting' errno to some value such as ENOTTY.
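On a glibc system you can typically watch this happen with strace; the ioctl() fails with ENOTTY while the program as a whole succeeds (output trimmed):

; strace -e trace=ioctl,write /bin/echo hi > /tmp/out
ioctl(1, TCGETS, ...) = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "hi\n", 3) = 3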

When ANSI C and POSIX came along, they inherited this existing situation and there wasn't much they could do about it (POSIX was mostly documenting existing practice). I believe that they also wanted to allow a situation where POSIX functions were implemented on top of whatever oddball system calls you wanted to have your library code do, even if they set errno. So the only thing POSIX could really require was the traditional Unix behavior that if something failed and it was documented to set errno on failure, you could then look at errno and have it be meaningful.

(This was what existing Unixes were already mostly doing and specifying it put minimal constraints on any new POSIX environments, including POSIX environments on top of other operating systems.)

(This elaborates on a Fediverse post of mine, and you can run into this in non-C languages that have true multi-value returns under the right circumstances.)

How history works in the version of the rc shell that I use

By: cks
30 June 2025 at 03:15

Broadly, there have been three approaches to command history in Unix shells. In the beginning there was none, which was certainly simple but which led people to be unhappy. Then csh gave us in-memory command history, which could be recalled and edited with shell builtins like '!!' but which lasted only as long as that shell process did. Finally, people started putting 'readline style' interactive command editing into shells, which included some history of past commands that you could get back with cursor-up, and picked up the GNU Readline feature of a $HISTORY file. Broadly speaking, the shell would save the in-memory (readline) history to $HISTORY when it exited and load the in-memory (readline) history from $HISTORY when it started.

I use a reimplementation of rc, the shell created by Tom Duff, and my version of the shell started out with a rather different and more minimal mechanism for history. In the initial release of this rc, all the shell itself did was write every command executed to $history (if that variable was set). Inspecting and reusing commands from a $history file was left up to you, although rc provided a helper program that could be used in a variety of ways. For example, in a terminal window I commonly used '-p' to print the last command and then either copied and pasted it with the mouse or used an rc function I wrote to repeat it directly.

(You didn't have to set $history to the same file in every instance of rc. I arranged to have a per-shell history file that was removed when the shell exited, because I was only interested in short term 'repeat a previous command' usage of history.)

Later, the version of rc that I use got support for GNU Readline and other line editing environments (and I started using it). GNU Readline maintains its own in-memory command history, which is used for things like cursor-up to the previous line. In rc, this in-memory command history is distinct from the $history file history, and things can get confusing if you mix the two (for example, cursor-up to an invocation of your 'repeat the last command' function won't necessarily repeat the command you expect).

It turns out that at least for GNU Readline, the current implementation in rc does the obvious thing; if $history is set when rc starts, the commands from it are read into GNU Readline's in-memory history. This is one half of the traditional $HISTORY behavior. Rc's current GNU Readline code doesn't attempt to save its in-memory history back to $history on exit, because if $history is set the regular rc code has already been recording all of your commands there. Rc otherwise has no shell builtins to manipulate GNU Readline's command history, because GNU Readline and other line editing alternatives are just optional extra features that have relatively minimal hooks into the core of rc.

(In theory this allows an outside program to inject a synthetic command history into rc on startup, but it requires that program to know exactly how I handle my per-shell history file.)

Sidebar: How I create per-shell history in this version of rc

The version of rc that I use doesn't have an 'initialization' shell function that runs when the shell is started, but it does support a 'prompt' function that's run just before the prompt is printed. So my prompt function keeps track of the 'expected shell PID' in a variable and compares it to the actual PID. If there's a mismatch (including the variable being unset), the prompt function goes through a per-shell initialization, including setting up my per-shell $history value.
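A minimal sketch of the pattern (the variable name and history file location here are illustrative, not exactly what I use):

fn prompt {
	if (! ~ $"shellpid $pid) {
		shellpid=$pid
		history=/tmp/rc.history.$pid
		fn sigexit { rm -f $history }
	}
}

The nested 'fn sigexit' is what arranges for the per-shell history file to be removed when the shell exits.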

Current cups-browsed seems to be bad for central CUPS print servers

By: cks
28 June 2025 at 02:26

Suppose, not hypothetically, that you have a central CUPS print server, and that people also have Linux desktops or laptops that they point at your print server to print to your printers. As of at least Ubuntu 24.04, if you're doing this you probably want to get people to turn off and disable cups-browsed on their machines. If you don't, your central print server may see a constant flood of connections from client machines running cups-browsed. You're probably running it, as I believe that cups-browsed is installed and activated by default these days in most desktop Linux environments.

(We didn't really notice this in prior Ubuntu versions, although it's possible cups-browsed was always doing something like this and what's changed in the Ubuntu 24.04 version is that it's doing it more and faster.)

I'm not entirely sure why this happens, and I'm also not sure what the CUPS requests typically involve, but one pattern that we see is that such clients will make a lot of requests to the CUPS server's /admin/ URL. I'm not sure what's in these requests, because CUPS immediately rejects them as unauthenticated. Another thing we've seen is frequent attempts to get printer attributes for printers that don't exist and that have name patterns that look like local printers. One of the reasons that the clients are hitting the /admin/ endpoint may be to somehow add these printers to our CUPS server, which is definitely not going to work.

(We've also seen signs that some Ubuntu 24.04 applications can repeatedly spam the CUPS server, probably with status requests for printers or print jobs. This may be something enabled or encouraged by cups-browsed.)

My impression is that modern Linux desktop software, things like cups-browsed included, is not really spending much time thinking about larger scale, managed Unix environments where there are a bunch of printers (or at least print queues), the 'print server' is not on your local machine and not run by you, anything random you pick up through broadcast on the local network is suspect, and so on. I broadly sympathize with this, because such environments are a small minority now, but it would be nice if client side CUPS software didn't cause problems in them.

(I suspect that cups-browsed and its friends are okay in an environment where either the 'print server' is local or it's operated by you and doesn't require authentication, there's only a few printers, everyone on the local network is friendly and if you see a printer it's definitely okay to use it, and so on. This describes a lot of Linux desktop environments, including my home desktop.)

Some notes on X terminals in their heyday

By: cks
26 June 2025 at 03:41

I recently wrote about how the X Window System didn't immediately have (thin client) X terminals. X terminals are now a relatively obscure part of history and it may not be obvious to people today why they were a relatively significant deal at the time. So today I'm going to add some additional notes about X terminals in their heyday, from their introduction around 1989 through the mid 1990s.

One of the reactions to my entry that I've seen is to wonder if there was much point to X terminals, since it seems like they should be close in cost to much more functional normal computers and all you'd save is perhaps storage. Practically this wasn't the case in 1989 when they were introduced; NCD's initial models cost substantially less than, say, a Sparcstation 1 (also introduced in 1989); it appears they were less than half the cost of even a diskless Sparcstation 1. I believe that one reason for this is that memory was comparatively more expensive in those days and X terminals could get away with much, much less of it, since they didn't need to run a Unix kernel and enough of a Unix user space to boot up the X server (and I believe that some or all of the software was run directly from ROM instead of being loaded into precious RAM).

(The NCD16 apparently started at 1 MByte of RAM and the NCD19 at 2 MBytes, for example. You could apparently get a Sparcstation 1 with that little memory but you probably didn't want to use it for much.)

In one sense, early PCs were competition for X terminals in that they put computation on people's desks, but in another sense they weren't, because you couldn't use them as an inexpensive way to get Unix on people's desks. There eventually was at least one piece of software for this, DESQview/X, but it appeared later and you'd have needed to also buy the PC to run it on, as well as a 'high resolution' black and white display card and monitor. Of course, eventually the march of PCs made all of that cheap, which was part of the diminishing interest in X terminals in the later part of the 1990s and onward.

(I suspect that one reason that X terminals had lower hardware costs was that they probably had what today we would call a 'unified memory system', where the framebuffer's RAM was regular RAM instead of having to be separate because it came on a separate physical card.)

You might wonder how well X terminals worked over the 10 MBit Ethernet that was all you had at the time. With the right programs it could work pretty well, because the original approach of X was that you sent drawing commands to the X server, not rendered bitmaps. If you were using things that could send simple, compact rendering commands to your X terminal, such as xterm, 10M Ethernet could be perfectly okay. Anything that required shipping bitmapped graphics could be not as impressive, or even not something you'd want to touch, but for what you typically used monochrome X for between 1989 and 1995 or so, this was generally okay.

(Today many things on X want to ship bitmaps around, even for things like displaying text. But back in the day text was shipped as, well, text, and it was the X server that rendered the fonts.)

When looking at the servers you'd need for a given number of diskless Unix workstations or X terminals, the X terminals required less server side disk space but potentially more server side memory and CPU capacity, and were easier to administer. As noted by some commentators here, you might also save on commercial software licensing costs if you could license it only for your few servers instead of your lots of Unix workstations. I don't know how the system administration load actually compared to a similar number of PCs or Macs, but in my Unix circles we thought we scaled much better and could much more easily support many seats (and many potential users if you had, for example, many more students than lab desktops).

My perception is that what killed off X terminals as particularly attractive, even for Unix places, was that on the one hand the extra hardware capabilities PCs needed over X terminals kept getting cheaper and cheaper and on the other hand people started demanding more features and performance, like decent colour displays. That brought the X terminal 'advantage' more or less down to easier administration, and in the end that wasn't enough (although some X terminals and X 'thin client' setups clung on quite late, eg the SunRay, which we had some of in the 2000s).

Of course that's a Unix centric view. In a larger view, Unix was displaced on the desktop by PCs, which naturally limited the demand for both X terminals and dedicated Unix workstations (which were significantly marketed toward the basic end of Unix performance, and see also). By no later than the end of the 1990s, PCs were better basic Unix workstations than the other options and you could use them to run other software too if you wanted to, so they mostly ran over everything else even in the remaining holdouts.

(We ran what were effectively X terminals quite late, but the last few generations were basic PCs running LTSP not dedicated hardware. All our Sun Rays got retired well before the LTSP machines.)

(I think that the 'personal computer' model has or at least had some significant pragmatic advantages over the 'terminal' model, but that's something for another entry.)

Compute GPUs can have odd failures under Linux (still)

By: cks
24 June 2025 at 03:04

Back in the early days of GPU computation, the hardware, drivers, and software were so relatively untrustworthy that our early GPU machines had to be specifically reserved by people and that reservation gave them the ability to remotely power cycle the machine to recover it (this was in the days before our SLURM cluster). Things have gotten much better since then, with things like hardware and driver changes so that programs with bugs couldn't hard-lock the GPU hardware. But every so often we run into odd failures where something funny is going on that we don't understand.

We have one particular SLURM GPU node that has been flaky for a while, with the specific issue being that every so often the NVIDIA GPU would throw up its hands and drop off the PCIe bus until we rebooted the system. This didn't happen every time it was used, or with any consistent pattern, although some people's jobs seemed to regularly trigger this behavior. Recently I dug up a simple to use GPU stress test program, and when this machine's GPU did its disappearing act this Saturday, I grabbed the machine, rebooted it, ran the stress test program, and promptly had the GPU disappear again. Success, I thought, and since it was Saturday, I stopped there, planning to repeat this process today (Monday) at work, while doing various monitoring things.

Since I'm writing a Wandering Thoughts entry about it, you can probably guess the punchline. Nothing has changed on this machine since Saturday, but all today the GPU stress test program could not make the GPU disappear. Not with the same basic usage I'd used Saturday, and not with a different usage that took the GPU to full power draw and a reported temperature of 80C (which was a higher temperature and power draw than the GPU had been at when it disappeared, based on our Prometheus metrics). If I'd been unable to reproduce the failure at all with the GPU stress program, that would have been one thing, but reproducing it once and then not again is just irritating.
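For the next attempt I'll probably watch temperature and power draw live as well as through our Prometheus metrics, which nvidia-smi's query mode can do (the five second interval is arbitrary):

; nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv -l 5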

(The machine was assembled from parts, with an RTX 4090 and a Ryzen Threadripper 1950X in an X399 Taichi motherboard that is probably not even vaguely running the latest BIOS, seeing as the base hardware was built many years ago, although the GPU has been swapped around since then. Everything is in a pretty roomy 4U case, but if the failure was consistent we'd have assumed cooling issues.)

I don't really have any theories for what could be going on, but I suppose I should try to find a GPU stress test program that exercises every last corner of the GPU's capabilities at full power rather than using only one or two parts at a time. On CPUs, different loads light up different functional units, and I assume the same is true on GPUs, so perhaps the problem is in one specific functional unit or a combination of them.

(Although this doesn't explain why the GPU stress test program was able to cause the problem on Saturday but not today, unless a full reboot didn't completely clear out the GPU's state. Possibly we should physically power this machine off entirely for long enough to dissipate any lingering things.)

The X Window System didn't immediately have X terminals

By: cks
23 June 2025 at 03:21

For a while, X terminals were a reasonably popular way to give people comparatively inexpensive X desktops. These X terminals relied on X's network transparency so that only the X server had to run on the X terminal itself, with all of your terminal windows and other programs running on a server somewhere and just displaying on the X terminal. For a long time, using a big server and a lab full of X terminals was significantly cheaper than setting up a lab full of actual workstations (until inexpensive and capable PCs showed up). Given that X started with network transparency and X terminals are so obvious, you might be surprised to find out that X didn't start with them.

In the early days, X ran on workstations. Some of them were diskless workstations, and on some of them (especially the diskless ones), you would log in to a server somewhere to do a lot of your more heavy duty work. But they were full workstations, with a full local Unix environment and you expected to run your window manager and other programs locally even if you did your real work on servers. Although probably some people who had underpowered workstations sitting around experimented with only running the X server locally, with everything else done remotely (except perhaps the window manager).

The first X terminals arrived only once X was reasonably well established as the successful cross-vendor Unix windowing system. NCD, who I suspect were among the first people to make an X terminal, was founded only in 1987 and of course didn't immediately ship a product (it may have shipped its first product in 1989). One indication of the delay in X terminals is that XDM was only released with X11R3, in October of 1988. You technically didn't need XDM to have an X terminal, but it made life much easier, so its late arrival is a sign that X terminals didn't arrive much before then.

(It's quite possible that the possibility for an 'X terminal' was on people's minds even in the early days of X. The Bell Labs Blit was a 'graphical terminal' that had papers written and published about it sometime in 1983 or 1984, and the Blit was definitely known in various universities and so on. Bell Labs even gave people a few of them, which is part of how I wound up using one for a while. Sadly I'm not sure what happened to it in the end, although by now it would probably be a historical artifact.)

(This entry was prompted by a comment on a recent entry of mine.)

PS: A number of people seem to have introduced X terminals in 1989; I didn't spot any in 1988 or earlier.

Sidebar: Using an X terminal without XDM

If you didn't have XDM available or didn't want to have to rely on it, you could give your X terminal the ability to open up a local terminal window that ran a telnet client. To start up an X environment, people would telnet into their local server, set $DISPLAY (or have it automatically set by the site's login scripts), and start at least their window manager by hand. This required your X terminal to not use any access control (at least when you were doing the telnet thing), but strong access control wasn't exactly an X terminal feature in the first place.

What I've observed about Linux kernel WireGuard on 10G Ethernet so far

By: cks
20 June 2025 at 03:08

I wrote about a performance mystery with WireGuard on 10G Ethernet, and since then I've done additional measurements with results that both give some clarity and leave me scratching my head a bit more. So here is what I know about the general performance characteristics of Linux kernel WireGuard on a mixture of Ubuntu 22.04 and 24.04 servers with stock settings, and using TCP streams inside the WireGuard tunnels (because the high bandwidth thing we care about runs over TCP).

  • CPU performance is important even when WireGuard isn't saturating the CPU.

  • CPU performance seems to be more important on the receiving side than on the sending side. If you have two machines, one faster than the other, you get more bandwidth sending a TCP stream from the slower machine to the faster one. I don't know if this is an artifact of the Linux kernel implementation or if the WireGuard protocol requires the receiver to do more work than the sender.

  • There seems to be a single-peer bandwidth limit (related to CPU speeds). You can increase the total WireGuard bandwidth of a given server by talking to more than one peer.

  • When talking to a single peer, there's both a unidirectional bandwidth limit and a bidirectional bandwidth limit. If you send and receive to a single peer at once, you don't get the sum of the unidirectional send and unidirectional receive; you get less.

  • There's probably also a total WireGuard bandwidth that, in our environment, falls short of 10G bandwidth (ie, a server talking WireGuard to multiple peers can't saturate its 10G connection, although maybe it could if I had enough peers in my test setup).

The best performance between a pair of WireGuard peers I've gotten is from two servers with Xeon E-2226G CPUs; these can push their 10G Ethernet to about 850 MBytes/sec of WireGuard bandwidth in one direction and about 630 MBytes/sec in each direction if they're both sending and receiving. These servers (and other servers with slower CPUs) can basically saturate their 10G-T network links with plain (non-WireGuard) TCP.
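These numbers come from simple iperf3 tests of the obvious sort, run over the WireGuard-internal addresses (the address here is hypothetical):

; iperf3 -s
; iperf3 -c 10.10.0.2 -t 30
; iperf3 -c 10.10.0.2 -t 30 --bidir

The first is run on one peer and the others on the other peer; --bidir sends and receives at once.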

If I was to build a high performance 'WireGuard gateway' today, I'd build it with a fast CPU and dual 10G networks, with WireGuard traffic coming in (and going out) one 10G interface and the resulting gatewayed traffic using the other. WireGuard on fast CPUs can run fast enough that a single 10G interface could limit total bandwidth under the right (or wrong) circumstances; segmenting WireGuard and clear traffic onto different interfaces avoids that.

(A WireGuard gateway that only served clients at 1G or less would likely be perfectly fine with a single 10G interface and reasonably fast CPUs. But I'd want to test how many 1G clients it took to reach the total WireGuard bandwidth limit on a 10G WireGuard server before I was completely confident about that.)

A performance mystery with Linux WireGuard on 10G Ethernet

By: cks
18 June 2025 at 03:45

As a followup on discovering that WireGuard can saturate a 1G Ethernet (on Linux), I set up WireGuard on some slower servers here that have 10G networking. This isn't an ideal test but it's more representative of what we would see with our actual fileservers, since I used spare fileserver hardware. What I got out of it was a performance and CPU usage mystery.

What I expected to see was that WireGuard performance would top out at some level above 1G as the slower CPUs on both the sending and the receiving host ran into their limits, and I definitely wouldn't see them drive the network as fast as they could without WireGuard. What I actually saw was that WireGuard did hit a speed limit but the CPU usage didn't seem to saturate, either for kernel WireGuard processing or for the iperf3 process. These machines can manage to come relatively close to 10G bandwidth with bare TCP, while with WireGuard they were running around 400 MBytes/sec of on-the-wire bandwidth (which translates to somewhat less inside the WireGuard connection, due to overheads).

One possible explanation for this is increased packet handling latency, where the introduction of WireGuard adds delays that keep things from running at full speed. Another possible explanation is that I'm running into CPU limits that aren't obvious from simple tools like top and htop. One interesting thing is that if I do a test in both directions at once (either an iperf3 bidirectional test or two iperf3 sessions, one in each direction), the bandwidth in each direction is slightly over half the unidirectional bandwidth (while a bidirectional test without WireGuard runs at full speed in both directions at once). This certainly makes it look like there's a total WireGuard bandwidth limit in these servers somewhere; unidirectional traffic gets basically all of it, while bidirectional traffic splits it fairly between each direction.

I looked at 'perf top' on the receiving 10G machine and kernel spin lock stuff seems to come in surprisingly high. I tried having a 1G test machine also send WireGuard traffic to the receiving 10G test machine at the same time and the incoming bandwidth does go up by about 100 Mbytes/sec, so perhaps on these servers I'm running into a single-peer bandwidth limitation. I can probably arrange to test this tomorrow.

(I can't usefully try both of my 1G WireGuard test machines at once because they're both connected to the same 1G switch, with a 1G uplink into our 10G switch fabric.)

PS: The two 10G servers are running Ubuntu 24.04 and Ubuntu 22.04 respectively with standard kernels; the faster server with more CPUs was the 'receiving' server here, and is running 24.04. The two 1G test servers are running Ubuntu 24.04.

Linux kernel WireGuard can go 'fast' on decent hardware

By: cks
17 June 2025 at 02:19

I'm used to thinking of encryption as a slow thing that can't deliver anywhere near network saturation, even on basic gigabit Ethernet connections. This is broadly the experience we see with our current VPN servers, which struggle to turn in more than relatively anemic bandwidth with OpenVPN and L2TP, and so for a long time I assumed it would also be our experience with WireGuard if we tried to put anything serious behind it. I'd seen the 2023 Tailscale blog post about this but discounted it as something we were unlikely to see; their kernel throughput on powerful sounding AWS nodes was anemic by 10G standards, so I assumed our likely less powerful servers wouldn't even get 1G rates.

Today, for reasons beyond the scope of this entry, I wound up wondering how fast we could make WireGuard go. So I grabbed a couple of spare servers we had with reasonably modern CPUs (by our limited standards), put our standard Ubuntu 24.04 on them, and took a quick look to see how fast I could make them go over 1G networking. To my surprise, the answer is that WireGuard can saturate that 1G network with no particularly special tuning, and the system CPU usage is relatively low (4.5% on the client iperf3 side, 8% on the server iperf3 side; each server has a single Xeon E-2226G). The low usage suggests that we could push well over 1G of WireGuard bandwidth through a 10G link, which means that I'm going to set one up for testing at some point.

While the Xeon E-2226G is not a particularly impressive CPU, it's better than the CPUs our NFS fileservers have (the current hardware has Xeon Silver 4410Ys). But I suspect that we could sustain over 1G of WireGuard bandwidth even on them, if we wanted to terminate WireGuard on the fileservers instead of on a 'gateway' machine with a fast CPU (and a 10G link).

More broadly, I probably need to reset my assumptions about the relative speed of encryption as compared to network speeds. These days I suspect a lot of encryption methods can saturate a 1G network link, at least in theory, since I don't think WireGuard is exceptionally good in this respect (as I understand it, encryption speed wasn't particularly a design goal; it was designed to be secure first). Actual implementations may vary for various reasons so perhaps our VPN servers need some tuneups.

(The actual bandwidth achieved inside WireGuard is less than the 1G data rate because simply being encrypted adds some overhead. This is also something I'm going to have to remember when doing future testing; if I want to see how fast WireGuard is driving the underlying networking, I should look at the underlying networking data rate, not necessarily WireGuard's rate.)

Some thoughts on GNOME's systemd dependencies and non-Linux Unixes

By: cks
12 June 2025 at 02:17

One of the pieces of news of the time interval is (GNOME is) Introducing stronger dependencies on systemd (via). Back in the old days, GNOME was a reasonably cross-platform Unix desktop environment, one that you could run on, for example, FreeBSD. I believe that's been less and less true over time already (although the FreeBSD handbook has no disclaimers), but GNOME adding more relatively hard dependencies on systemd really puts a stake in it, since systemd is emphatically Linux-only.

It's possible that people in FreeBSD will do more and more work to create emulations of systemd things that GNOME uses, but personally I think this is quixotic and not sustainable. The more likely outcome is that FreeBSD and other Unixes will drop GNOME entirely, retaining only desktop environments that are more interested in being cross-Unix (although I'm not sure what those are these days; it's possible that GNOME is simply more visible in its Linux dependency).

An aspect of this shift that's of more interest to me is that GNOME is the desktop environment and (I believe) GUI toolkit that has been most vocal about wanting to drop support for X, and relatively soon. The current GDM has apparently already dropped support for starting non-Wayland sessions, for example (at least on Fedora, although it's possible that Fedora has been more aggressive than GNOME itself recommends). This loss of X support in GNOME has been a real force pushing for Wayland, probably including Wayland on FreeBSD. However, if FreeBSD no longer supports GNOME, the general importance of Wayland for FreeBSD may go down. Wayland's importance would especially go down if the general Unix desktop world splits into one camp that is increasingly Linux-dependent due to systemd and Wayland requirements, and another camp that is increasingly 'old school' non-systemd and X only. This second camp would become what you'd find on FreeBSD and other non-Linux Unixes.

(Despite what Wayland people may tell you, there are still a lot of desktop environments that have little or no Wayland support.)

However, this leaves the future of X GUI applications that use GTK somewhat up in the air. If GTK is going to seriously remain a cross-platform thing and the BSDs are effectively only doing X, then GTK needs to retain X support and GTK based applications will work on FreeBSD (at least as much as they ever do). But if the GNOME people decide that 'cross-platform' for GTK doesn't include X, the BSDs would be stuck in an awkward situation. One possibility is that there are enough people using FreeBSD (and X) with GTK applications that they would push the GNOME developers to keep GTK's X support.

(I care about this because I want to keep using X for as long as possible. One thing that would force me to Wayland is if important programs such as Firefox stop working on X because the GUI toolkits they use have dropped X support. The more pressure there is from FreeBSD people to keep the X support in toolkits, the better for me.)

A silly systemd wish for moving new processes around systemd units

By: cks
8 June 2025 at 02:39

Linux cgroups offer a bunch of robust features for limiting resource usage and handling resource contention between different groups of processes, which you can use to implement things like per-user memory and CPU resource limits. On a systemd based system, which is to say basically almost all Linuxes today, systemd more or less completely owns the cgroup hierarchy and using cgroups for resource limits requires that the processes involved be placed inside relevant systemd units, and for that matter that the systemd units exist.

Unfortunately, the mechanisms for doing this are a little bit under-developed. If you're dealing with something that goes through PAM and for which putting processes into user slices based on the UID running them is the right answer, you can use pam_systemd (which we do for various reasons). If you want a different hierarchy and things go through PAM, you can perhaps write a PAM session module that does this, copying code from pam_systemd, but I don't know if there's anything for that today. If you have processes that are started in ways that don't go through PAM, as far as I know you're currently out of luck. One case that's quite relevant for us is Apache CGI processes run through suexec.

It would be nice to be able to do better, since the odds that everything that starts processes will pick up the ability to talk to systemd to set up slices, sessions, and so on for them seem rather low. Some things have specific magic support for this, but I don't think the process is very documented and I believe it requires that things change how they start programs (so eg suexec would have to know how to do this). This means that what I'm wishing for is a daemon that would be given some sort of rules and use them to move processes between systemd slices and other units, possibly creating things like user sessions on the fly. Then you could write a rule that said 'if a process is in the Apache system cgroup and its UID isn't <X>, put it in a slice in a user hierarchy'.
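
For a sense of the primitive such a daemon could be built on, modern systemd exposes an AttachProcessesToUnit D-Bus method on its manager interface that can move an existing PID into a unit. A hedged sketch (the slice name and PID are made up, this needs appropriate privileges, and systemd places restrictions on which cgroup moves it will permit):

# move PID 12345 into user-915.slice; 'ssau' is the method
# signature (unit name, sub-cgroup, then an array of one PID)
busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
    org.freedesktop.systemd1.Manager AttachProcessesToUnit \
    ssau "user-915.slice" "" 1 12345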

An extra problem is that this daemon probably wouldn't be perfect, since it would have to react to processes after they'd appeared rather than intercept their creation; some processes could slip through the cracks or otherwise do weird things. This would make it sort of a hack, rather than something that I suspect anyone would want as a proper feature.

(I don't know if a kernel LSM could make this more reliable by intercepting and acting on certain things, like setuid() calls.)

PS: Possibly the correct answer is to persuade the Apache people to make suexec consult PAM, even if the standard suexec PAM stack does nothing. Then you could in theory add pam_systemd or whatever there. It appears that Debian may have had a custom patch for this at one point, but I believe they gave it up years and years ago.

In POSIX, you can theoretically use inode zero

By: cks
31 May 2025 at 01:17

When I wrote about the length of file names in early Unix, I noted that inode numbers were unsigned 16-bit integers and thus you could only have at most 65,536 inodes in any given filesystem. Over on the Fediverse, JdeBP correctly noted that I had an off by one error. The original Unix directory entry format used a zero inode number to mark deleted entries, which meant that you couldn't actually use inode zero for anything (not even the root directory of the filesystem, which needed a non-zero inode number in order to have a '.' entry).

(Contrary to what I said in that Fediverse thread, I think that V7 and earlier may not have actually had a zero inode. The magic happens in usr/sys/h/param.h in the itod() and itoo() macros. These give a result for inode 0, but I suspect they're never supposed to be used; if I'm doing it right, inode 1 is at offset 0 within block 2.)
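
For reference, here is what I believe those macros look like, quoted from memory of V7's usr/sys/h/param.h, so double-check against the real source. Each on-disk inode is 64 bytes, so eight fit in a 512-byte block, and the inode area starts at block 2:

#define itod(x) (daddr_t)((((unsigned)x+15)>>3))
#define itoo(x) (int)((x+15)&07)

Working through them, itod(1) is 2 and itoo(1) is 0, matching the 'inode 1 is at offset 0 within block 2' arithmetic above; itod(0) lands in block 1, the superblock, which supports the idea that inode 0 was never meant to be used.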

Since I'm the sort of person that I am, this made me wonder if you could legally use inode zero today in a POSIX compliant system. The Single Unix Specification, which is more or less POSIX, sets out that ino_t is some unsigned integer type, but it doesn't constrain its value. Instead, inode numbers are simply called the 'file serial number' in places like sys/stat.h and dirent.h, and the stat() family of functions, readdir() and posix_getdents() don't put any restrictions on the inode numbers except that st_dev and st_ino together uniquely identify a file. In the normal way to read standards, anything not explicitly commented on is allowed, so you're allowed to return a zero for the inode value in these things (provided that there is only one per st_dev, or at least that all of them are the same file, hardlinked together).

On Linux, I don't think there's any official restrictions on whether there can be a zero inode in some weird filesystem (eg, also), although kernel and user-space code may make assumptions. FreeBSD doesn't document any restrictions in stat(2) or getdirentries(2). The OpenBSD stat(2) and getdents(2) manual pages are similarly silent about whether or not the inode number can be zero.

(The tar archive format doesn't include inode numbers. The cpio archive format does, but I can't find a mention of a zero inode value having special meaning. The POSIX pax archive format is similarly silent; both cpio and pax use the inode number only as part of recording hardlinks.)

Whether it would be a good idea to make a little filesystem that returned a zero inode value for some file is another question. Enterprising people can try it and see, which these days might be possible with various 'filesystem in user space' features (although these might filter and restrict the inode numbers that you can use). My personal expectation is that there are a variety of things that expect non-zero inode numbers for any real file and might use a zero inode number as, for example, a 'no such file' signal value.

The length of file names in early Unix

By: cks
25 May 2025 at 01:33

If you use Unix today, you can enjoy relatively long file names on more or less any filesystem that you care to name. But it wasn't always this way. Research V7 had 14-byte filenames, and the System III/System V lineage continued this restriction until it merged with BSD Unix, which had significantly increased this limit as part of moving to a new filesystem (initially called the 'Fast File System', for good reasons). You might wonder where this unusual number came from, and for that matter, what the file name limit was on very early Unixes (it was 8 bytes, which surprised me; I vaguely assumed that it had been 14 from the start).

I've mentioned before that the early versions of Unix had a quite simple format for directory entries. In V7, we can find the directory structure specified in sys/dir.h (dir(5) helpfully directs you to sys/dir.h), which is so short that I will quote it in full:

#ifndef	DIRSIZ
#define	DIRSIZ	14
#endif
struct	direct
{
    ino_t    d_ino;
    char     d_name[DIRSIZ];
};

To fill in the last blank, ino_t was a 16-bit (two byte) unsigned integer (and field alignment on PDP-11s meant that this structure required no padding), for a total of 16 bytes. This directory structure goes back to V4 Unix. In V3 Unix and before, directory entries were only ten bytes long, with 8 byte file names.

(Unix V4 (the Fourth Edition) was when the kernel was rewritten in C, so that may have been considered a good time to do this change. I do have to wonder how they handled the move from the old directory format to the new one, since Unix at this time didn't have multiple filesystem types inside the kernel; you just had the filesystem, plus all of your user tools knew the directory structure.)

One benefit of the change in filename size is that 16-byte directory entries fit evenly in 512-byte disk blocks (or other powers-of-two buffer sizes). You never have a directory entry that spans two disk blocks, so you can deal with directories a block at a time. Ten byte directory entries don't have this property; eight-byte ones would, but then that would leave space for only six character file names, and presumably that was considered too small even in Unix V1.

PS: That inode numbers in V7 (and earlier) were 16-bit unsigned integers does mean what you think it means; there could only be at most 65,536 inodes in a single classical V7 filesystem. If you needed more files, you had better make more filesystems. Early Unix had a lot of low limits like that, some of them quite hard-coded.

Enfeebling the Indian space programme

By: VM
3 July 2025 at 13:15

There's no denying that there currently prevails a public culture in India that equates criticism, even well-reasoned criticism, with pooh-poohing. It's especially pronounced in certain geographies where the Bharatiya Janata Party (BJP) enjoys majority support, as well as vis-à-vis institutions that subscribers of Hindu politics consider ripe for international renown, especially in the eyes of the country's former colonial masters. The other side of the same cultural coin is the passive encouragement it offers to those who'd play up the feats of Indian enterprises even when they lack substantive evidence to back their claims up. While these tendencies are pronounced in many enterprises, I have encountered them most often in the spaceflight domain.

Through its feats of engineering and administration over the years, the Indian Space Research Organisation (ISRO) has cultivated a deserved reputation for setting a high bar for itself and meeting it. Its achievements are the reason why India is one of the few countries today with a functionally complete space programme. It operates launch vehicles, conducts spaceflight-related R&D, has facilities to develop as well as track satellites, and maintains data-processing pipelines to turn the data it collects from space into products usable by industry and academia. It is now embarking on a human spaceflight programme as well. ISRO has also launched interplanetary missions to the moon and Mars, with one destined for Venus in the works. In and of itself the organisation has an enviable legacy. Thus, unsurprisingly, many sections of the Hindutva brigade have latched onto ISRO's achievements to animate their own propaganda of India's greatness, both real and imagined.

The surest signs of this adoption are most visible when ISRO missions fail or succeed in unclear ways. The Chandrayaan 2 mission and the Axiom-4 mission respectively are illustrative examples. As if to forestall any allegations that the Chandrayaan 2 mission had failed, then ISRO chairman K. Sivan said right after its Vikram lander crashed on the moon that it had been a "98% success". Chandrayaan 2 was a technology demonstrator and it did successfully demonstrate most of the technologies onboard. The "98%" figure, however, was so disproportionate as to suggest Sivan was defending the mission less on its merits than on its ability to fit into reductive narratives of how good ISRO was. (Recall, similarly, when former DCGI V.G. Somani claimed the homegrown Covaxin vaccine was "110% safe" when safety data from its phase III clinical trials weren't even available.)

On the other hand, even as the Axiom-4 mission was about to kick off, neither ISRO nor the Department of Space (DoS) had articulated what Indian astronaut Shubhanshu Shukla's presence onboard the mission was expected to achieve. If these details didn't actually exist before the mission, to participate in which ISRO had paid Axiom Space more than Rs 500 crore, both ISRO and the DoS were effectively keeping the door open to picking a goalpost of their choosing to kick the ball through as the mission progressed. If they did have these details but had elected not to share them, their (in)actions raised, or ought to have raised, difficult questions about the terms on which these organisations believed they were accountable in a democratic country. Either way, the success of the Axiom-4 mission vis-à-vis Shukla's participation was something of an empty vessel: a ready receptacle for any narrative that could be placed inside it ex post facto.

At the same time, raising this question has often been construed as naysaying Shukla's activities altogether, in the public domain but especially on social media platforms, in response to arguments presented in the news, and in conversations among people interested in Indian spaceflight. By all means let's celebrate Shukla's, and by extension India's, 'citius, altius, fortius' moment in human spaceflight; the question is: what didn't ISRO/DoS share before Axiom-4 lifted off, and why? (Note that what journalists have been reporting since liftoff, while valuable, isn't the answer to the question posed here.) While it's tempting to think this pinched communication is a strategy developed by the powers that be to cope with insensitive reporting in the press, that reading would ignore the political capture other Indian institutions have already suffered, and which ISRO arguably has as well, during and after Sivan's term as chairman.

For just two examples of institutions that have historically enjoyed a popularity comparable in both scope and flavour to ISRO's, consider India's cricket administration and the Election Commission. During the 2024 men's T20 World Cup that India eventually won, the Indian team had the least amount of travel and the most foreknowledge of the ground on which it was to play its semifinal game. At the 2023 men's ODI World Cup, too, India played all its matches on Sundays, ensuring the highest attendance for its own contests rather than sharing that opportunity with all teams; the tournament is meant to be a celebration of the sport, after all. For good measure, police personnel were also deployed at various stadia to take away spectators' placards and flags in support of Pakistan in matches featuring the Pakistani team. The stage management of both World Cups only diminished, rather than heightened, the Indian team's victories.

It's been a similar story with the Election Commission of India, which has of late come under repeated attack from the Indian National Congress party and some of its allies for allegedly rigging their electronic voting machines and subsequently entire elections in favour of the BJP. While the Congress has failed to submit the extraordinary evidence required to support these extraordinary claims, doubts about the ECI's integrity have spread anyway because there are other, more overt ways in which the once-independent institution of Indian democracy favours the BJP, including scheduling elections according to the availability of party supremo Narendra Modi to speak at rallies.

Recently, a more obscure but nonetheless pertinent controversy erupted in some circles when, in an NDTV report, incumbent ISRO chairman V. Narayanan seemed to suggest that SpaceX called off one of the attempts to launch Axiom-4 because his team at ISRO had insisted that the company thoroughly check its rocket for bugs. The incident followed SpaceX engineers spotting a leak on the rocket. The point of egregiousness here is that while SpaceX had built and flown that very type of rocket hundreds of times, Narayanan and ambiguous wording in the NDTV report made it out that SpaceX would have flown the rocket if not for ISRO's insistence. What's more likely to have happened is that NASA and SpaceX engineers consulted ISRO, as they would have consulted the other agencies involved in the flight (ESA, HUNOR, and Axiom Space), about its stand, and the ISRO team in turn clarified its position: that SpaceX recheck the rocket before the next launch attempt. However, the narrative "if not for ISRO, SpaceX would've flown a bad rocket" took flight anyway.

Evidently these are not isolated incidents. The last three ISRO chairmen (Sivan, Somanath, and now Narayanan) have progressively curtailed the flow of information from the organisation to the press even as they have maintained a steady pro-Hindutva, pro-establishment rhetoric. All three leaders have also only served at ISRO's helm while the BJP has been in power at the Centre, wielding its tendency to centralise power by, among other things, centralising the permission to speak freely. Some enterprising journalists like Chethan Kumar and T.S. Subramanian and activists like r/Ohsin and X.com/@SolidBoosters have thus far kept the space establishment from resembling a black hole. But the overarching strategy is as simple as it is devious: while critical arguments become preoccupied with whataboutery and with fending off misguided accusations of neocolonialist thinking ("why should we measure an ISRO mission's success the way NASA measures its missions' successes?"), unconditional expressions of support and adulation spread freely through our shared communication networks. This can only keep up a false veil of greatness that crumbles the moment it brooks legitimate criticism, becoming desperate for yet another veil to replace itself.

But even that is beside the point: to echo the philosopher Bruno Latour, when criticism is blocked from attending to something we have all laboured to build, that something is deprived of the "care and caution" it needs to grow and to no longer be fragile. Yet that is exactly the fate the Indian space programme risks today. Here's a brand new case in point, from the tweets that prompted this post: according to an RTI query filed by @SolidBoosters, India's homegrown NavIC satellite navigation constellation is just one clock failure away from "complete operational collapse". The issue appears to be ISRO's subpar launch cadence and the consequently sluggish replacement of clocks that have already failed.

6/6 Root Cause Analysis for atomic clock failures has been completed but classified under RTI Act Section 8 as vital technical information. Meanwhile public transparency is limited while the constellation continues degrading. #NavIC #ISRO #RTI

– SolidBoosters (@SolidBoosters) July 2, 2025

Granted, rushed critiques and critiques designed to sting more than guide can only be expected to elicit defensive posturing. But to minimise one's exposure to all criticism altogether, especially criticism from learned quarters conveyed in respectful language, is to deprive oneself of the pressure and the drive to solve the right problems in the right ways, both drawing from and adding to India's democratic fabric. The end results are public speeches and commentary that are increasingly removed from reality as well as, more importantly, thicker walls between criticism and The Thing it strives to nurture.

Fedora's DNF 5 and the curse of mandatory too-smart output

By: cks
23 May 2025 at 02:49

DNF is Fedora's high(er) level package management system, which pretty much any system administrator is going to have to use to install and upgrade packages. Fedora 41 and later have switched from DNF 4 to DNF 5 as their normal (and probably almost mandatory) version of DNF. I ran into some problems with this switch, and since then I've found other issues, all of which boil down to a simple issue: DNF 5 insists on doing too-smart output.

Regardless of what you set your $TERM to and what else you do, if DNF 5 is connected to a terminal (and perhaps if it isn't), it will pretty-print its output in an assortment of ways. As far as I can tell it simply assumes ANSI cursor addressability, among other things, and will always fit its output to the width of your terminal window, truncating output as necessary. This includes output from RPM package scripts that are running as part of the update. Did one of them print a line longer than your current terminal width? Tough, it was probably truncated. Are you using script so that you can capture and review all of the output from DNF and RPM package scripts? Again, tough, you can't turn off the progress bars and other things that will make a complete mess of the typescript.

(It's possible that you can find the information you want in /var/log/dnf5.log in un-truncated and readable form, but if so it's buried in debug output and I'm not sure I trust dnf5.log in general.)

DNF 5 is far from the only offender these days. An increasing number of command line programs simply assume that they should always produce 'smart' output (ideally only if they're connected to a terminal). They have no command line option to turn this off, and since they always use 'ANSI' escape sequences, they ignore the tradition of '$TERM' and especially 'TERM=dumb' to turn that off. Some of them can specifically disable colour output (typically via one of a number of environment variables, which may or may not be documented, and sometimes with a command line option), but that's usually the limit of their willingness to stop doing things. The idea of printing one whole line at a time as you do things, without progress bars, interleaved output, and so on, has increasingly become a non-starter for modern command line tools.
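
For completeness, the traditional incantations look like this (a sketch; as covered above, DNF 5 ignores $TERM, and whether it honors the NO_COLOR convention in any given version is something you'd have to test):

TERM=dumb NO_COLOR=1 dnf5 upgrade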

(Another semi-offender is Debian's 'apt' and also 'apt-get' to some extent, although apt-get's progress bars can be turned off and 'apt' is explicitly a more user friendly front end to apt-get and friends.)

PS: I can't run DNF with its output directed into a file because it wants you to interact with it to approve things, and I don't feel like letting it run freely without that.

The lack of a good command line way to sort IPv6 addresses

By: cks
19 May 2025 at 02:54

A few years ago, I wrote about how 'sort -V' can sort IPv4 addresses into their natural order for you. Even back then I was smart enough to put in that 'IPv4' qualification and note that this didn't work with IPv6 addresses, and said that I didn't know of any way to handle IPv6 addresses with existing command line tools. As far as I know, that remains the case today, although you can probably build a Perl, Python, or other language program that does such sorting for you if you need to do this regularly.

Unix tools like 'sort' are pretty flexible, so you might innocently wonder why it can't be coerced into sorting IPv6 addresses. The first problem is that IPv6 addresses are written in hex without leading 0s, not decimal. Conventional sort will correctly sort hex numbers if all of the numbers are the same length, but IPv6 addresses are written in hex groups that conventionally drop leading zeros, so you will have 'ff' instead of '00ff' in common output (or '0' instead of '0000'). The second and bigger problem is the IPv6 '::' notation, which stands for the longest run of all-zero fields, ie some number of '0000' fields.

(I'm ignoring IPv6 scopes and zones for this, let's assume we have public IPv6 addresses.)

If IPv6 addresses were written out in full, with leading 0s on fields and all their 0000 fields, you could handle them as a simple conventional sort (you wouldn't even need to tell sort that the field separator was ':'). Unfortunately they almost never are, so you need to either transform them to that form, print them out, sort the output, and perhaps transform them back, or read them into a program as 128-bit numbers, sort the numbers, and print them back out as IPv6 addresses. Ideally your language of choice for this has a way to sort a collection of IPv6 addresses.
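
As a sketch of the second approach, here's a filter that leans on Python's ipaddress module to compare addresses as 128-bit numbers (assuming python3 is available; it expects one IPv6 address per line on standard input, and the 'addresses.txt' name is made up):

python3 -c '
import sys, ipaddress
addrs = [l.strip() for l in sys.stdin if l.strip()]
# sort by the numeric 128-bit value of each address
for a in sorted(addrs, key=lambda s: int(ipaddress.IPv6Address(s))):
    print(a)
' < addresses.txt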

The very determined can probably do this with awk with enough work (people have done amazing things in awk). But my feeling is that doing this in conventional Unix command line tools is a Turing tarpit; you might as well use a language where there's a type of IPv6 addresses that exposes the functionality that you need.

(And because IPv6 addresses are so complex, I suspect that GNU Sort will never support them directly. If you need GNU Sort to deal with them, the best option is a program that turns them into their full form.)

PS: People have probably written programs to sort IPv6 addresses, but with the state of the Internet today, the challenge is finding them.

Some notes on using 'join' to supplement one file with data from another

By: cks
10 May 2025 at 02:45

Recently I said something vaguely grumpy about the venerable Unix 'join' tool. As the POSIX specification page for join will unhelpfully tell you, join is a 'relational database operator', which means that it implements the rough equivalent of SQL joins. One way to use join is to add additional information for some lines in your input data.

Suppose, not entirely hypothetically, that we have an input file (or data stream) that starts with a login name and contains some additional information, and that for some logins (but not all of them) we have useful additional data about them in another file. Using join, the simple case of this is easy, if the 'master' and 'suppl' files are already sorted:

join -1 1 -2 1 -a 1 master suppl

(I'm sticking to POSIX syntax here. Some versions of join accept '-j 1' as an alternative to '-1 1 -2 1'.)

Our specific options tell join to join each line of 'master' and 'suppl' on the first field in each (the login) and print them, and also print all of the lines from 'master' that didn't have a login in 'suppl' (that's the '-a 1' argument). For lines with matching logins, we get all of the fields from 'master' and then all of the extra fields from 'suppl'; for lines from 'master' that don't match, we just get the fields from 'master'. Generally you'll tell apart which lines got supplemented and which ones didn't by how many fields they have.
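
To make this concrete, here's a tiny made-up example, with logins and shells in 'master' and full names for some of those logins in 'suppl':

; cat master
ava /bin/bash 1001
brin /bin/zsh 1002
; cat suppl
ava Ava Lovelace
; join -1 1 -2 1 -a 1 master suppl
ava /bin/bash 1001 Ava Lovelace
brin /bin/zsh 1002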

If we want something other than all of the fields in the order that they are in the existing data source, in theory we have the '-o <list>' option to tell join what fields from each source to output. However, this option has a little problem, which I will show you by quoting the important bit from the POSIX standard (emphasis mine):

The fields specified by list shall be written for all selected output lines. Fields selected by list that do not appear in the input shall be treated as empty output fields.

What that means is that if we're also printing non-joined lines from our 'master' file, our '-o' still applies and any fields we specified from 'suppl' will be blank and empty (unless you use '-e'). This can be inconvenient if you were re-ordering fields so that, for example, a field from 'suppl' was listed before some fields from 'master'. It also means that you want to use '1.1' to get the login from 'master', which is always going to be there, not '2.1', the login from 'suppl', which is only there some of the time.

(All of this assumes that your supplementary file is listed second and the master file first.)

On the other hand, using '-e' we can simplify life in some situations. Suppose that 'suppl' contains only one additional interesting piece of information, and it has a default value that you'll use if 'suppl' doesn't contain a line for the login. Then if 'master' has three fields and 'suppl' two, we can write:

join -1 1 -2 1 -a 1 -e "$DEFVALUE" -o '1.1,1.2,1.3,2.2' master suppl

Now we don't have to try to tell whether or not a line from 'master' was supplemented by counting how many fields it has; everything has the same number of fields, it's just sometimes the last (supplementary) field is the default value.

(This is harder to apply if you have multiple fields from the 'suppl' file, but possibly you can find a 'there is nothing here' value that works for the rest of your processing.)

Netplan can only have WireGuard peers in one file

By: cks
7 May 2025 at 02:43

We have started using WireGuard to build a small mesh network so that machines outside of our network can securely get at some services inside it (for example, to send syslog entries to our central syslog server). Since this is all on Ubuntu, we set it up through Netplan, which works but which I said 'has warts' in my first entry about it. Today I discovered another wart due to what I'll call the WireGuard provisioning problem:

Current status: provisioning WireGuard endpoints is exhausting, at least in Ubuntu 22.04 and 24.04 with netplan. So many netplan files to update. I wonder if Netplan will accept files that just define a single peer for a WG network, but I suspect not.

The core WireGuard provisioning problem is that when you add a new WireGuard peer, you have to tell all of the other peers about it (or at least all of the other peers you want to be able to talk to the new peer). When you're using Netplan, it would be convenient if you could put each peer in a separate file in /etc/netplan; then when you add a new peer, you just propagate the new Netplan file for the peer to everything (and do the special Netplan dance required to update peers).

(Apparently I should now call it 'Canonical Netplan', as that's what its front page calls it. At least that makes it clear exactly who is responsible for Netplan's state and how it's not going to be widely used.)

Unfortunately this doesn't work, and it fails in a dangerous way, which is that Netplan only notices one set of WireGuard peers in one netplan file (at least on servers, using systemd-networkd as the backend). If you put each peer in its own file, only the first peer is picked up. If you define some peers in the file where you define your WireGuard private key, local address, and so on, and some peers in another file, only peers from whichever file comes first will be used (even if the first file only defines peers, which isn't enough to bring up a WireGuard device by itself). As far as I can see, Netplan doesn't report any errors or warnings to the system logs on boot about this situation; instead, you silently get incomplete WireGuard configurations.

This is visibly and clearly a Netplan issue, because on servers you can inspect the systemd-networkd files written by Netplan (in /run/systemd/network). When I do this, the WireGuard .netdev file has only the peers from one file defined in it (and the .netdev file matches the state of the WireGuard interface). This is especially striking when the netplan file with the private key and listening port (and some peers) is second; since the .netdev file contains the private key and so on, Netplan is clearly merging data from more than one netplan file, not completely ignoring everything except the first one. It's just ignoring any peers encountered after the first set of them.
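
If you want to check what Netplan actually generated, you can count the peer sections directly (a sketch; the exact .netdev file name depends on what you called the interface in your netplan YAML, 'wg0' here):

grep -c '^\[WireGuardPeer\]' /run/systemd/network/10-netplan-wg0.netdev

If the count doesn't match the number of peers you've defined across all of your netplan files, you're seeing this issue.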

My overall conclusion is that in Netplan, you need to put all configuration for a given WireGuard interface into a single file, however tempting it might be to try splitting it up (for example, to put core WireGuard configuration stuff in one file and then list all peers in another one).

I don't know if this is an already filed Netplan bug and I don't plan on bothering to file one for it, partly because I don't expect Canonical to fix Netplan issues any more than I expect them to fix anything else and partly for other reasons.

PS: I'm aware that we could build a system to generate the Netplan WireGuard file, or maybe find a YAML manipulating program that could insert and delete blocks that matched some criteria. I'm not interested in building yet another bespoke custom system to deal with what is (for us) a minor problem, since we don't expect to be constantly deploying or removing WireGuard peers.

These days, Linux audio seems to just work (at least for me)

By: cks
4 May 2025 at 02:29

For a long time, the common perception was that 'Linux audio' was the punchline for a not particularly funny joke. I sort of shared that belief; although audio had basically worked for me for a long time, I had a simple configuration and dreaded having to make more complex audio work in my unusual desktop environment. But these days, audio seems to just work for me, even in systems that have somewhat complex audio options.

On my office desktop, I've wound up with three potential audio outputs and two audio inputs: the motherboard's standard sound system, a USB headset with a microphone that I use for online meetings, the microphone on my USB webcam, and (to my surprise) a HDMI audio output because my LCD displays do in fact have tiny little speakers built in. In PulseAudio (or whatever is emulating it today), I have the program I use for online meetings set to use the USB headset and everything else plays sound through the motherboard's sound system (which I have basic desktop speakers plugged into). All of this works sufficiently seamlessly that I don't think about it, although I do keep a script around to reset the default audio destination.
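
That script boils down to essentially a one-liner (a sketch; the sink name here is made up and specific to my hardware, and 'pactl list short sinks' will show you yours):

#!/bin/sh
# reset the default audio output to the motherboard sound
pactl set-default-sink alsa_output.pci-0000_00_1f.3.analog-stereo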

On my home desktop, for a long time I had a simple single-output audio system that played through the motherboard's sound system (plus a microphone on a USB webcam that was mostly not connected). Recently I got an outboard USB DAC and, contrary to my fears, it basically plugged in and just worked. It was easy to set the USB DAC as the default output in pavucontrol and all of the settings related to it stick around even when I put it to sleep overnight and it drops off the USB bus. I was quite pleased by how painless the USB DAC was to get working, since I'd been expecting much more hassles.

(Normally I wouldn't bother meticulously switching the USB DAC to standby mode when I'm not using it for an extended time, but I noticed that the case is clearly cooler when it rests in standby mode.)

This is still a relatively simple audio configuration because it's basically static. I can imagine more complex ones, where you have audio outputs that aren't always present and that you want some programs (or more generally audio sources) to use when they are present, perhaps even with priorities. I don't know if the Linux audio systems that Linux distributions are using these days could cope with that, or if they did would give you any easy way to configure it.

(I'm aware that PulseAudio and so on can be fearsomely complex under the hood. As far as the current actual audio system goes, I believe that what my Fedora 41 machines are using for audio is PipeWire (also) with WirePlumber, based on what processes seem to be running. I think this is the current Fedora 41 audio configuration in general, but I'm not sure.)

The appeal of keyboard launchers for (Unix) desktops

By: cks
30 April 2025 at 03:41

A keyboard launcher is a big part of my (modern) desktop, but over on the Fediverse I recently said something about them in general:

I don't necessarily suggest that people use dmenu or some equivalent. Keyboard launchers in GUI desktops are an acquired taste and you need to do a bunch of setup and infrastructure work before they really shine. But if you like driving things by the keyboard and will write scripts, dmenu or equivalents can be awesome.

The basic job of a pure keyboard launcher is to let you hit a key, start typing, and then select and do 'something'. Generally the keyboard launcher will make a window appear so that you can see what you're typing and maybe what you could complete it to or select.

The simplest and generally easiest way to use a keyboard launcher, and how many of them come configured to work, is to use it to select and run programs. You can find a version of this idea in GNOME, and even Windows has a pseudo-launcher in that you can hit a key to pop up the Start menu and the modern Start menu lets you type in stuff to search your programs (and other things). One problem with the GNOME version, and many basic versions, is that in practice you don't necessarily launch desktop programs all that often or launch very many different ones, so you can have easier ways to invoke the ones you care about. One problem with the Windows version (at least in my experience) is that it will do too much, which is to say that no matter what garbage you type into it by accident, it will do something with that garbage (such as launching a web search).

The happy spot for a keyboard launcher is somewhere in the middle, where it does a variety of things that are useful for you but not without limits. The best keyboard launcher for your desktop is one that gives you fast access to whatever things you do a lot, ideally with completion so you type as little as possible. When you have it tuned up and working smoothly the feel is magical; I tap a key, type a couple of characters and then hit tab, hit return, and the right thing happens without me thinking about it, all fast enough that I can and do type ahead blindly (which then goes wrong if the keyboard launcher doesn't start fast enough).

The problem with keyboard launchers, and why they're not for everyone, is that everyone has a different set of things that they do a lot and that are useful for them to trigger entirely through the keyboard. No keyboard launcher will come precisely set up for what you do a lot in their default installation, so at a minimum you need to spend the time and effort to curate what the launcher will do and how it does it. If you're more ambitious, you may need to build supporting scripts that give the launcher a list of things to complete and then act on them when you complete one. If you don't curate the launcher and throw in the kitchen sink, you wind up with the Windows experience where it will certainly do something when you type things but perhaps not really what you wanted.

(For example, I routinely ssh to a lot of our machines, so my particular keyboard launcher setup lets me type a machine name (with completion) to start a session to it. But I had to build all of that, including sourcing the machine names I wanted included from somewhere, and this isn't necessarily useful for people who aren't constantly ssh'ing to machines.)
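
A stripped-down sketch of that idea, assuming a hand-maintained file of host names and xterm as the terminal (both stand-ins for my real setup):

# dmenu reads choices from stdin and prints the selection
host="$(dmenu -l 20 < "$HOME/.config/ssh-hosts")" &&
    exec xterm -e ssh "$host"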

There are a variety of keyboard launchers for both X and Wayland, basically none of which I have any experience with. See the Arch Wiki section on application launchers. Someday I will have to get a Wayland equivalent to my particular modified dmenu, a thought that fills me with no more enthusiasm than any other part of replacing my whole X environment.

PS: Another issue with keyboard launchers is that sometimes you're wrong about what you want to do with them. I once built an entire keyboard launcher setup to select terminal windows and then later wound up abandoning it when I didn't use it enough.

My Cinnamon desktop customizations (as of 2025)

By: cks
22 April 2025 at 03:15

A long time ago I wrote up some basic customizations of Cinnamon, shortly after I started using Cinnamon (also) on my laptop of the time. Since then, the laptop got replaced with another one and various things changed in both the land of Cinnamon and my customizations (eg, also). Today I feel like writing down a general outline of my current customizations, which fall into a number of areas from the modest but visible to the large but invisible.

The large but invisible category is that just like on my main fvwm-based desktop environment, I use xcape (plus a custom Cinnamon key binding for a weird key combination) to invoke my custom dmenu setup (1, 2) when I tap the CapsLock key. I have dmenu set to come up horizontally on the top of the display, which Cinnamon conveniently leaves alone in the default setup (it has its bar at the bottom). And of course I make CapsLock into an additional Control key when held.

(On the laptop I'm using a very old method of doing this. On more modern Cinnamon setups in virtual machines, I do this with Settings → Keyboard → Layout → Options, and then in the CapsLock section set CapsLock to be an additional Ctrl key.)

To start xcape up and do some other things, like load X resources, I have a personal entry in Settings → Startup Applications that runs a script in my ~/bin/X11. I could probably do this in a more modern way with an assortment of .desktop files in ~/.config/autostart (which is where my 'Startup Applications' entries actually wind up) that run each thing individually, or perhaps some systemd user units. But the current approach works and is easy to modify if I want to add or remove things (I can just edit the script).

I have a number of Cinnamon 'applets' installed on my laptop and my other Cinnamon VM setups. The ones I have everywhere are Spices Update and Shutdown Applet, the latter because if I tell the (virtual) machine to log me off, shut down, or restart, I generally don't want to be nagged about it. On my laptop I also have CPU Frequency Applet (set to only display a summary) and CPU Temperature Indicator, for no compelling reason. In all environments I also pin launchers for Firefox and (Gnome) Terminal to the Cinnamon bottom bar, because I start both of them often enough. I position the Shutdown Applet on the left side, next to the launchers, because I think of it as a peculiar 'launcher' instead of an applet (on the right).

(The default Cinnamon keybindings also start a terminal with Ctrl + Alt + T, which you can still find through the same process from several years ago provided that you don't cleverly put something in .local/share/glib-2.0/schemas and then run 'glib-compile-schemas .' in that directory. If I was a smarter bear, I'd understand what I should have done when I was experimenting with something.)

On my virtual machines with Cinnamon, I don't bother with the whole xcape and dmenu framework, but I do set up the applets and the launchers and fix CapsLock.

(This entry was sort of inspired by someone I know who just became a Linux desktop user (after being a long time terminal user).)

Sidebar: My Cinnamon 'window manager' custom keybindings

I have these (on my laptop) and perpetually forget about them, so I'm going to write them down now so perhaps that will change.

move-to-corner-ne=['<Alt><Super>Right']
move-to-corner-nw=['<Alt><Super>Left']
move-to-corner-se=['<Primary><Alt><Super>Right']
move-to-corner-sw=['<Primary><Alt><Super>Left']
move-to-side-e=['<Shift><Alt><Super>Right']
move-to-side-n=['<Shift><Alt><Super>Up']
move-to-side-s=['<Shift><Alt><Super>Down']
move-to-side-w=['<Shift><Alt><Super>Left']

I have some other keybindings on the laptop but they're even less important, especially once I added dmenu.
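
If I ever want to set these from a script instead of through the settings GUI, I believe something like this gsettings invocation would do it, although I haven't verified the schema name and you should check it first with something like 'gsettings list-recursively | grep keybindings.wm':

gsettings set org.cinnamon.desktop.keybindings.wm move-to-side-e "['<Shift><Alt><Super>Right']"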

Looking at what NFSv4 clients have locked on a Linux NFS(v4) server

By: cks
17 April 2025 at 03:10

A while ago I wrote an entry about (not) finding which NFSv4 client owns a lock on a Linux NFS(v4) server, where the best I could do was pick awkwardly through the raw NFS v4 client information in /proc/fs/nfsd/clients. Recently I discovered an alternative to doing this by hand, which is the nfsdclnts program, and as a result of digging into it and what I was seeing when I tried it out, I now believe I have a better understanding of the entire situation (which was previously somewhat confusing).

The basic thing that nfsdclnts will do is list 'locks' and some information about them with 'nfsdclnts -t lock', in addition to listing other state information such as 'open', for open files, and 'deleg', for NFS v4 delegations. The information it lists is somewhat limited, for example it will list the inode number but not the filesystem, but on the good side nfsdclnts is a Python program so you can easily modify it to report any extra information that exists in the clients/#/states files. However, this information about locks is not complete, because of how file level locks appear to normally manifest in NFS v4 client state.
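
Concretely, the basic invocations look like this (probably run as root on the NFS server, since nfsdclnts reads /proc/fs/nfsd):

# what's locked, opened, and delegated, respectively
nfsdclnts -t lock
nfsdclnts -t open
nfsdclnts -t deleg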

(The information in the states files is limited, although it contains somewhat more than nfsdclnts shows.)

Here is how I understand NFS v4 locking and states. To start with, NFS v4 has a feature called delegations where the NFS v4 server can hand a lot of authority over a file to a NFS v4 client. When a NFS v4 client accesses a file, the NFS v4 server likes to give it a delegation if this is possible; it normally will be if no one else has the file open or active. Once a NFS v4 client holds a delegation, it can lock the file without involving the NFS v4 server. At this point, the client's 'states' file will report an opaque 'type: deleg' entry for the file (and this entry may or may not have a filename or instead be what nfsdclnts will report as 'disconnected dentry').

While a NFS v4 client has the file delegated, if any other NFS v4 client does anything with the file, including simply opening it, the NFS v4 server will recall the delegation from the original client. As a result, the original client now has to tell the NFS v4 server that it has the file locked. At this point a 'type: lock' entry for the file appears in the first NFS v4 client's states file. If the first NFS v4 client releases its lock while the second NFS v4 client is trying to acquire it, the second NFS v4 client will not have a delegation for the file, so its lock will show up as an explicit 'type: lock' entry in its states file.

An additional wrinkle is that a NFS v4 client holding a delegation doesn't immediately release it once all processes have released their locks, closed the file, and so on. Instead the delegation may linger on for some time. If another NFS v4 client opens the file during this time, the first client will lose the delegation but the second NFS v4 client may not get a delegation from the NFS v4 server, so its lock will be visible as a 'type: lock' states file entry.

A third wrinkle is that multiple clients may hold read-only delegations for a file and have fcntl() read locks on it at once, with each of them having a 'type: deleg, access: r' entry for it in their states files. These will only become visible 'type: lock' states entries if the clients have to release their delegations.

So putting this all together:

  • If there is a 'type: lock' entry for the file in any states file (or it's listed in 'nfsdclnts -t lock'), the file is definitely locked by whoever has that entry.

  • If there are no 'type: deleg' or 'type: lock' entries for the file, it's definitely not locked; you can also see this by whether nfsdclnts lists it as having delegations or locks.

  • If there are 'type: deleg' entries for the file, it may or may not be locked by the NFS v4 client (or clients) with the delegation. If the delegation is an 'access: w' delegation, you can see if someone actually has the file locked by accessing the file on another NFS v4 client, which will force the NFS v4 server to recall the delegation and expose the lock if there is one.

If the delegation is 'access: r' and might have multiple read-only locks, you can't force the NFS v4 server to recall the delegation by merely opening the file read-only (for example with 'cat file' or 'less file'). Instead the server will only recall the delegation if you open the file read-write. A convenient way to do this is probably to use 'flock -x <file> -c /bin/true', although this does require you to have more permissions for the file than simply the ability to read it.

Sidebar: Disabling NFS v4 delegations on the server

Based on trawling various places, I believe this is done by writing a '0' to /proc/sys/fs/leases-enabled (or the equivalent 'fs.leases-enabled' sysctl) and then apparently restarting your NFS v4 server processes. This will disable all user level uses of fcntl()'s F_SETLEASE and F_GETLEASE as an additional effect, and I don't know if this will affect any important programs running on the NFS server itself. Based on a study of the kernel source code, I believe that you don't need to restart your NFS v4 server processes if it's sufficient for the NFS server to stop handing out new delegations but current delegations can stay until they're dropped.

(There have apparently been some NFS v4 server and client issues with delegations, cf, along with other NFS v4 issues. However, I don't know if the cure winds up being worse than the disease here, or if there's another way to deal with these stateid problems.)

Unix files have (at least) two sizes

By: cks
14 April 2025 at 03:02

I'll start by presenting things in illustrated form:

; ls -l testfile
-rw-r--r-- 1 cks 262144 Apr 13 22:03 testfile
; ls -s testfile
1 testfile
; ls -slh testfile
512 -rw-r--r-- 1 cks 256K Apr 13 22:03 testfile

The two well known sizes that Unix files have are the logical 'size' in bytes and what stat.h describes as "the number of blocks allocated for this object", often converted to some number of bytes (as ls is doing here in the last command). A file's size in bytes is roughly speaking the last file offset that has been written to in the file, and not all of the bytes covered by it may have actually been written; when this is the case, the result is a sparse file. Sparse files are the traditional cause of a mismatch between the byte size and the number of blocks a file uses. However, that is not what is happening here.

This file is on a ZFS filesystem with ZFS's compression turned on, and it was created with 'dd if=/dev/zero of=testfile bs=1k count=256'. In ZFS, zeroes compress extremely well, and so ZFS has written basically no physical data blocks and faithfully reported that (minimal) number in the stat() st_blocks field. However, at the POSIX level we have indeed written data to all 256 KBytes of the file; it's not a sparse file. This is an extreme example of filesystem compression, and there are plenty of lesser ones.

This leaves us with a third size, which is the number of logical blocks for this file. When a filesystem is doing data compression, this number will be different from the number of physical blocks used. As far as I can tell, the POSIX stat.h description doesn't specify which one you have to report for st_blocks. As we can see, ZFS opts to report the physical block size of the file, which is probably the more useful number for the purposes of things like 'du'. However, it does leave us with no way of finding out the logical block size, which we may care about for various reasons (for example, if our backup system can skip unwritten sparse blocks but always writes out uncompressed blocks).
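
With GNU coreutils' stat you can see the two reported sizes directly; here %s is the byte size, %b the allocated block count, and %B the size of the units that %b is counted in (a sketch using the same test file):

; stat -c '%s bytes, %b blocks of %B bytes' testfile
262144 bytes, 1 blocks of 512 bytes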

This also implies that a non-sparse file can change its st_blocks number if you move it from one filesystem to another. One filesystem might have compression on and the other one have it off, or they might have different compression algorithms that give different results. In some cases this will cause the file's space usage to expand so that it doesn't actually fit into the new filesystem (or for a tree of files to expand their space usage).

(I don't know if there are any Unix filesystems that report the logical block size in st_blocks and only report the physical block size through a private filesystem API, if they report it at all.)

One way to set up local programs in a multi-architecture Unix environment

By: cks
11 April 2025 at 02:29

Back in the old days, it used to be reasonably routine to have 'multi-architecture' Unix environments with shared files (where here architecture was a combination of the process architecture and the Unix variant). The multi-architecture days have faded out, and with them fading, so has information about how people made this work with things like local binaries.

In the modern era of large local disks and build farms, the default approach is probably to simply build complete copies of '/local' for each architecture type and then distribute the result around somehow. In the old days people were a lot more interested in reducing disk space by sharing common elements and then doing things like NFS-mounting your entire '/local', which made life more tricky. There likely were many solutions to this, but the one I learned at the university as a young sprout worked like the following.

The canonical paths everyone used and had in their $PATH were things like /local/bin, /local/lib, /local/man, and /local/share. However, you didn't (NFS) mount /local; instead, you NFS mounted /local/mnt (which was sort of an arbitrary name, as we'll see). In /local/mnt there were 'share' and 'man' directories, and also a per-architecture directory for every architecture you supported, with names like 'solaris-sparc' or 'solaris-x86'. These per-architecture directories contained 'bin', 'lib', 'sbin', and so on subdirectories.

(These directories contained all of the locally installed programs, all jumbled together, which did have certain drawbacks that became more and more apparent as you added more programs.)

Each machine had a /local directory on its root filesystem that contained /local/mnt, symlinks from /local/share and /local/man to 'mnt/share' and 'mnt/man', and then symlinks for the rest of the directories that went to 'mnt/<arch>/bin' (or sbin or lib). Then everyone mounted /local/mnt on, well, /local/mnt. Since /local and its contents were local to the machine, you could have different symlinks on each machine that used the appropriate architecture (and you could even have built them on boot if you really wanted to, although in practice they were created when the machine was installed).
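
Mechanically, setting up a new machine's /local amounted to something like this (a sketch; the architecture name is whatever the machine actually was):

arch=solaris-sparc          # this machine's (example) architecture
mkdir -p /local/mnt
ln -s mnt/share /local/share
ln -s mnt/man /local/man
for d in bin sbin lib; do
    ln -s "mnt/$arch/$d" "/local/$d"
done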

When you built software for this environment, you told it that its prefix was /local, and let it install itself (on a suitable build server) using /local/bin, /local/lib, /local/share and so on as the canonical paths. You had to build (and install) software repeatedly, once for each architecture, and it was on the software (and you) to make sure that /local/share/<whatever> was in fact the same from architecture to architecture. System administrators used to get grumpy when people accidentally put architecture dependent things in their 'share' areas, but generally software was pretty good about this in the days when it mattered.

(In some variants of this scheme, the mount points were a bit different because the shared stuff came from one NFS server and the architecture dependent parts from another, or might even be local if your machine was the only instance of its particular architecture.)

There were much more complicated schemes that various places did (often universities), including ones that put each separate program or software system into its own directory tree and then glued things together in various ways. Interested parties can go through LISA proceedings from the 1980s and early 1990s.

Getting older, now-replaced Fedora package updates

By: cks
9 April 2025 at 02:05

Over the history of a given Fedora version, Fedora will often release multiple updates to the same package (for example, kernels, but there are many others). When it does this, the older packages wind up being removed from the updates repository and are no longer readily available through mechanisms like 'dnf list --showduplicates <package>'. For a long time I used dnf's 'local' plugin to maintain a local archive of all packages I'd updated, so I could easily revert, but it turns out that as of Fedora 41's change to dnf5 (dnf version 5), that plugin is not available (presumably it hasn't been ported to dnf5, and may never be). So I decided to look into my other options for retrieving and installing older versions of packages, in case the most recent version has a bug that affects me (which has happened).

Before I take everyone on a long yak-shaving expedition, the simplest and best answer is to install the 'fedora-repos-archive' package, which installs an additional Fedora repository that has those replaced updates. After installing it, I suggest that you edit /etc/yum.repos.d/fedora-updates-archive.repo to disable it by default, which will save you time, bandwidth, and possibly aggravation. Then when you really want to see all possible versions of, say, Rust, you can do:

dnf list --showduplicates --enablerepo=updates-archive rust

You can then use 'dnf downgrade ...' as appropriate.

(Like the other Fedora repositories, updates-archive automatically knows your release version and picks packages from it. I think you can change this a bit with '--releasever=<NN>', but I'm not sure how deep the archive is.)
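
(For what it's worth, disabling the repository by default is just a matter of flipping the 'enabled' line in its stanza in /etc/yum.repos.d/fedora-updates-archive.repo; I believe the stanza is named to match the '--enablerepo=updates-archive' usage above:

[updates-archive]
enabled=0

Everything else in the stanza can stay as shipped.)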

The other approach is to use Fedora Bodhi (also) and Fedora Koji (also) to fetch the packages for older builds, in much the same way as you can use Bodhi (and Koji) to fetch new builds that aren't in the updates or updates-testing repository yet. To start with, we're going to need to find out what's available. I think this can be done through either Bodhi or Koji, although Koji is presumably more authoritative. Let's do this for Rust in Fedora 41:

bodhi updates query --packages rust --releases f41
koji list-builds --state COMPLETE --no-draft --package rust --pattern '*.fc41'

Note that both of these listings are going to include package versions that were never released as updates for various reasons, and also versions built for the pre-release Fedora 41. Although Koji has a 'f41-updates' tag, I haven't been able to find a way to restrict 'koji list-builds' output to packages with that tag, so we're getting more than we'd like even after we use a pattern to restrict this to just Fedora 41.

(I think you may need to use the source package name, not a binary package one; if so, you can get it with 'rpm -qi rust' or whatever and looking at the 'Source RPM' line and name.)

Once you've found the package version you want, the easiest and fastest way to get it is through the koji command line client, following the directions in Installing Kernel from Koji with appropriate changes:

mkdir /tmp/scr
cd /tmp/scr
koji download-build --arch=x86_64 --arch=noarch rust-1.83.0-1.fc41

This will get you a bunch of RPMs, and then you can do 'dnf downgrade /tmp/scr/*.rpm' to have dnf do the right thing (only downgrading things you actually have installed).

One reason you might want to use Koji is that this gets you a local copy of the old package in case you want to go back and forth between it and the latest version for testing. If you use the dnf updates-archive approach, you'll be re-downloading the old version at every cycle. Of course at that point you can also use Koji to get a local copy of the latest update too, or 'dnf download ...', although Koji has the advantage that it gets all the related packages regardless of their names (so for Rust you get the 'cargo', 'clippy', and 'rustfmt' packages too).

(In theory you can work through the Fedora Bodhi website, but in practice it seems to be extremely overloaded at the moment and very slow. I suspect that the bot scraper plague is one contributing factor.)

PS: If you're using updates-archive and you just want to download the old packages, I think what you want is 'dnf download --enablerepo=updates-archive ...'.

Fedora 41 seems to have dropped an old XFT font 'property'

By: cks
8 April 2025 at 03:09

Today I upgraded my office desktop from Fedora 40 to Fedora 41, and as traditional there was a little issue:

Current status: it has been '0' days since a Fedora upgrade caused X font problems, this time because xft apparently no longer accepts 'encoding=...' as a font specification argument/option.

One of the small issues with XFT fonts is that they don't really have canonical names. As covered in the "Font Name" section of fonts.conf, a given XFT font is a composite of a family, a size, and a number of attributes that may be used to narrow down the selection of the XFT font until there's only one option left (or no option left). One way to write that in textual form is, for example, 'Sans:Condensed Bold:size=13'.

For a long time, one of the 'name=value' properties that XFT font matching accepted was 'encoding=<something>'. For example, you might say 'encoding=iso10646-1' to specify 'Unicode' (and back in the long ago days, this apparently could make a difference for font rendering). Although I can't find 'encoding=' documented in historical fonts.conf stuff, I appear to have used it for more than a decade, dating back to when I first converted my fvwm configuration from XLFD fonts to XFT fonts. It's still accepted today on Fedora 40 (although I suspect it does nothing):

: f40 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
DejaVuSans.ttf: "DejaVu Sans" "Regular"

However, it's no longer accepted on Fedora 41:

: f41 ; fc-match 'Sans:Condensed Bold:size=13:encoding=iso10646-1'
Unable to parse the pattern

Initially I thought this had to be a change in fontconfig, but that doesn't seem to be the case; both Fedora 40 and Fedora 41 use the same version, '2.15.0', just with different build numbers (partly because of a mass rebuild for Fedora 41). Freetype itself went from version 2.13.2 to 2.13.3, but the release notes don't seem to have anything relevant. So I'm at a loss. At least it was easy to fix once I knew what had happened; I just had to take the ':encoding=iso10646-1' bit out from the places I had it.
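
(If you have the 'encoding=' bit scattered through a configuration file, a one-liner like this will do the surgery; the path here is just a placeholder for wherever your fvwm configuration actually lives:

sed -i 's/:encoding=iso10646-1//g' ~/.fvwm/config

As usual with in-place sed, check the result before you trust it.)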

(The visual manifestation was that all of my fvwm menus and window title bars switched to a tiny font. For historical reasons all of my XFT font specifications in my fvwm configuration file used 'encoding=...', so in Fedora 41 none of them worked and fvwm reported 'can't load font <whatever>' and fell back to its default of an XLFD font, which was tiny on my HiDPI display.)

PS: I suspect that this change will be coming in other Linux distributions sooner or later. Unsurprisingly, Ubuntu 24.04's fc-match still accepts 'encoding=...'.

PPS: Based on ltrace output, FcNameParse() appears to be what fails on Fedora 41.

I should learn systemd's features for restricting things

By: cks
5 April 2025 at 03:19

Today, for reasons beyond the scope of this entry, I took something I'd been running by hand from the command line for testing and tried to set it up under systemd. This is normally straightforward, and it should have been extra straightforward because the thing came with a .service file. But that .service file used a lot of systemd's features for restricting what programs can do, and for my sins I'd decided to set up the program with its binary, configuration file, and so on in different places than it expected (and I think without some things it expected, like a supplementary group for permission to read some files). This was, unfortunately, an abject failure, so I wound up yanking all of the restrictions except 'DynamicUser=true'.

I'm confident that with enough time, I can (or could) sort out all of the problems (although I didn't feel like spending that time today). What this experience really points out is that systemd has a lot of options for really restricting what the programs you run can do, and I'm not particularly familiar with them. To get the service working with all of its original restrictions, I'd have to read my way through things like systemd.exec and understand everything the .service file used. Once I'd done that, I could have worked out what I needed to change to deal with my setup of the program.
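
To give a flavour of what's involved, here is an illustrative sketch of the sort of restriction stack such a .service file can carry (this is not the actual service I was fighting with, and the program and group names are made up):

[Service]
DynamicUser=true
SupplementaryGroups=somegroup
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/someprogram
ExecStart=/opt/someprogram/bin/someprogram

Each of these directives is documented in systemd.exec (or systemd.service), and each is something that can break a relocated setup; for instance, ProtectSystem=strict makes essentially the whole filesystem read-only to the service except for paths listed in ReadWritePaths.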

(An expert probably could have fixed things in short order.)

That systemd has a lot of potential restrictions it can impose and that those restrictions are complex is not a flaw of systemd (or its fault). We already know that fine grained permissions are hard to set up and manage in any environment, especially if you don't know what you're doing (as I don't with systemd's restrictions). At the same time, fine grained restrictions are quite useful for being able to apply some restrictions to programs not designed for them.

(The simplicity of OpenBSD's 'pledge' system is great, but it needs the program's active cooperation. For better or worse, Linux doesn't have a native, fully supported equivalent; instead we have to build it out of more fine grained, lower level facilities, and that's what systemd exposes.)

Learning how to use the restrictions is probably worthwhile in general. We run plenty of things through locally written systemd .service units. Some of those things are potentially risky (although generally not too risky), and some of them could be more restricted than they are today if we wanted to do the work and knew what we were doing (and knew some of the gotchas involved).

(And sooner or later we're going to run into more things with restrictions already in their .service units, and we're going to want to change some aspects of how they work.)

I'm working to switch from wget to curl (due to Fedora)

By: cks
1 April 2025 at 02:51

I've been using wget for a long time now, which means that I've developed a lot of habits, reflexes and even little scripts around it. Then wget2 happened, or more exactly Fedora switched from wget to wget2 (and Ubuntu is probably going to follow along). I'm very much not a fan of wget2 (also); I find it has both worse behavior and worse output than classical wget, in ways that routinely get in my way. Or got in my way before I started retraining myself to use curl instead of wget.

(It's actually possible that Ubuntu won't follow Fedora here. Ubuntu 24.04's 'wget' is classic wget, and Debian unstable currently has the wget package still as classic wget. The wget to wget2 transition involves the kind of changes that I can see Debian developers rejecting, so maybe Debian will keep 'wget' as classic wget. The upstream has a wget 1.25.0 release as recently as November 2024 (cf); on the other hand, the main project page says that 'currently GNU wget2 is being developed', so it certainly sounds like the upstream wants to move.)

One tool for my switch is wcurl (also, via), which is a cover script to provide a wget-like interface to curl. But I don't have wcurl everywhere (it's not packaged in Ubuntu 24.04, although I think it's coming in 26.04), so I've also been working to remember things like curl's -L and -O options (for downloading things, these are basically 'do what I want' options; I almost always want curl to follow HTTP redirects). There's a number of other options I want to remember, so since I've been looking at the curl manual page, here's some notes to myself.

(If I downloaded multiple URLs at once, I'll probably want to use '--remote-name-all' instead of repeating -O a lot. But I'm probably not going to remember that unless I write a script.)

My 'wcat' script is basically 'curl -L -sS <url>' (-s to not show the progress bar, -S to include at least the HTTP payload on an error, -L to follow redirects). My related 'wretr' script, which is intended to show headers too, is 'curl -L -sS -i <url>' (-i includes headers), or 'curl -sS -i <url>' if I want to explicitly see any HTTP redirect rather than automatically follow it.

(What I'd like is an option to show HTTP headers only if there was an HTTP error, but curl is currently all or nothing here.)

Some of the time I'll want to fetch files with the -J option, which is the curl equivalent of wget's --trust-server-names. This is necessary in cases where a project doesn't bother with good URLs for things. Possibly I also want to use '-R' to set the local downloaded file's timestamp based on the server provided timestamp, which is wget's traditional behavior (sometimes it's good, sometimes it's confusing).
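
Putting those pieces together, my eventual stand-in for wcurl will probably be a tiny script along these lines (a sketch, with -J and -R left in as a matter of taste):

#!/bin/sh
# follow redirects, stay quiet except on errors, save every URL under
# its remote name (trusting the server's suggestion) with its timestamp
exec curl -L -sS -R -J --remote-name-all "$@"

This also takes care of remembering '--remote-name-all' for me, per above.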

PS: I care about wcurl being part of a standard Ubuntu package because then we can install it as part of one of our standard package sets. If it's a personal script, it's not pervasive, although that's still better than nothing.

PPS: I'm not going to blame Fedora for the switch from wget to wget2. Fedora has a consistent policy of marching forward in changes like this to stay in sync with what upstream is developing, even when they cause pain to people using Fedora. That's just what you sign up for when you choose Fedora (or drift into it, in my case; I've been using 'Fedora' since before it was Fedora).

How I discovered a hidden microphone on a Chinese NanoKVM

10 February 2025 at 00:00

NanoKVM is a hardware KVM switch developed by the Chinese company Sipeed. Released last year, it enables remote control of a computer or server using a virtual keyboard, mouse, and monitor. Thanks to its compact size and low price, it quickly gained attention online, especially when the company promised to release its code as open-source. However, as we'll see, the device has some serious security issues. But first, let's start with the basics.

How Does the Device Work?

As mentioned, NanoKVM is a KVM switch designed for remotely controlling and managing computers or servers. It features an HDMI port, three USB-C ports, an Ethernet port for network connectivity, and a special serial interface. The package also includes a small accessory for managing the power of an external computer.

Using it is quite simple. First, you connect the device to the internet via an Ethernet cable. Once online, you can access it through a standard web browser (though JavaScript JIT must be enabled). The device supports Tailscale VPN, but with some effort (read: hacking), it can also be configured to work with your own VPN, such as a WireGuard or OpenVPN server. Once set up, you can control it from anywhere in the world via your browser.

NanoKVM

The device connects to the target computer using an HDMI cable, capturing the video output that would normally be displayed on a monitor. This allows you to view the computer's screen directly in your browser, essentially acting as a virtual monitor.

Through the USB connection, NanoKVM can also emulate a keyboard, mouse, CD-ROM, USB drive, and even a USB network adapter. This means you can remotely control the computer as if you were physically sitting in front of it - but all through a web interface.

While it functions similarly to remote management tools like RDP or VNC, it has one key difference: there's no need to install any software on the target computer. Simply plug in the device, and you're ready to manage it remotely. NanoKVM even allows you to enter the BIOS, and with the additional accessory for power management, you can remotely turn the computer on, off, or reset it.

This makes it incredibly useful - you can power on a machine, access the BIOS, change settings, mount a virtual bootable CD, and install an operating system from scratch, just as if you were physically there. Even if the computer is on the other side of the world.

NanoKVM is also quite affordable. The fully-featured version, which includes all ports, a built-in mini screen, and a case, costs just over €60, while the stripped-down version is around €30. By comparison, a similar Raspberry Pi-based device, PiKVM, costs around €400. However, PiKVM is significantly more powerful and reliable and, with a KVM splitter, can manage multiple devices simultaneously.

As mentioned earlier, the announcement of the device caused quite a stir online - not just because of its low price, but also due to its compact size and minimal power consumption. In fact, it can be powered directly from the target computer via a USB cable, the same cable it uses to simulate a keyboard, mouse, and other USB devices. So you have only one USB cable: in one direction it powers the NanoKVM, and in the other it carries the keyboard, mouse, and other devices that the NanoKVM simulates for the computer you want to manage.

The device is built on the open-source RISC-V processor architecture, and the manufacturer eventually did release the device's software under an open-source license at the end of last year. (To be fair, one part of the code remains closed, but the community has already found a suitable open-source replacement, and the manufacturer has promised to open this portion soon.)

However, the real issue is security.

Understandably, the company was eager to release the device as soon as possible. In fact, an early version had a minor hardware design flaw - due to an incorrect cable on the circuit board, the device sometimes failed to detect incoming HDMI signals. As a result, the company recalled and replaced all affected units free of charge. Software development also progressed rapidly, but in such cases, the primary focus is typically on getting basic functionality working, with security taking a backseat.

So, it's not surprising that the developers made some serious missteps - rushed development often leads to stupid mistakes. But some of the security flaws I discovered in my quick (and by no means exhaustive) review are genuinely concerning.

One of the first security analyses revealed numerous vulnerabilities - and some rather bizarre discoveries. For instance, a security researcher even found an image of a cat embedded in the firmware. While the Sipeed developers acknowledged these issues and relatively quickly fixed at least some of them, many remain unresolved.

NanoKVM

After purchasing the device myself, I ran a quick security audit and found several alarming flaws. The device initially came with a default password, and SSH access was enabled using this preset password. I reported this to the manufacturer, and to their credit, they fixed it relatively quickly. However, many other issues persist.

The user interface is riddled with security flaws - there's no CSRF protection, no way to invalidate sessions, and more. Worse yet, the encryption key used for password protection (when logging in via a browser) is hardcoded and identical across all devices. This is a major security oversight, as it allows an attacker to easily decrypt passwords. More problematic still, this had to be explained to the developers. Multiple times.

Another concern is the device's reliance on Chinese DNS servers, and configuring your own (custom) DNS settings is quite complicated. Additionally, the device communicates with Sipeed's servers in China - downloading not only updates but also the closed-source component mentioned earlier. To download this closed-source component, the device verifies an identification key, which is stored on it in plain text. Alarmingly, the device does not verify the integrity of software updates, includes a strange version of the WireGuard VPN application (which does not work on some networks), and runs a heavily stripped-down version of Linux that lacks systemd and apt. And these are just a few of the issues.

Were these problems simply oversights? Possibly. But what additionally raised red flags was the presence of tcpdump and aircrack - tools commonly used for network packet analysis and wireless security testing. While these are useful for debugging and development, they are also hacking tools that can be dangerously exploited. I can understand why developers might use them during testing, but they have absolutely no place on a production version of the device.

A Hidden Microphone

And then I discovered something even more alarming - a tiny built-in microphone that isn't clearly mentioned in the official documentation. It's a miniature SMD component, measuring just 2 x 1 mm, yet capable of recording surprisingly high-quality audio.

What's even more concerning is that all the necessary recording tools are already installed on the device! By simply connecting via SSH (remember, the device initially used default passwords!), I was able to start recording audio using the amixer and arecord tools. Once recorded, the audio file could be easily copied to another computer. With a little extra effort, it would even be possible to stream the audio over a network, allowing an attacker to eavesdrop in real time.
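
(To give a sense of how easy real-time eavesdropping would be, here is an untested sketch: from any machine with SSH access to the device, you could pipe arecord's output straight into a local player, with 'nanokvm' standing in for the device's address:

ssh root@nanokvm "arecord -Dhw:0,0 -r 48000 -f S16_LE -t wav -" | aplay

arecord writes the WAV stream to standard output when given '-' as the file name, and aplay plays whatever arrives on standard input.)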

Hidden Microphone in NanoKVM

Physically removing the microphone is possible, but it's not exactly straightforward. As seen in the image, disassembling the device is tricky, and due to the microphone's tiny size, you'd need a microscope or magnifying glass to properly desolder it.

To summarize: the device is riddled with security flaws, originally shipped with default passwords, communicates with servers in China, comes preinstalled with hacking tools, and even includes a built-in microphone - fully equipped for recording audio - without clear mention of it in the documentation. Could it get any worse?

I am pretty sure these issues stem from extreme negligence and rushed development rather than malicious intent. However, that doesn't make them any less concerning.

That said, these findings don't mean the device is entirely unusable.

Since the device is open-source, it's entirely possible to install custom software on it. In fact, one user has already begun porting his own Linux distribution - starting with Debian and later switching to Ubuntu. With a bit of luck, this work could soon lead to official Ubuntu Linux support for the device.

This custom Linux version already runs the manufacturer's modified KVM code, and within a few months, we'll likely have a fully independent and significantly more secure software alternative. The only minor inconvenience is that installing it requires physically opening the device, removing the built-in SD card, and flashing the new software onto it. However, in reality, this process isn't too complicated.
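
(Flashing an SD card image is the usual generic procedure; this is a hypothetical sketch where 'custom.img' and /dev/sdX stand in for the real image name and your card reader's device:

dd if=custom.img of=/dev/sdX bs=4M status=progress conv=fsync

Double-check the output device first, since dd will happily overwrite the wrong disk.)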

And while you're at it, you might also want to remove the microphone… or, if you prefer, connect a speaker. In my test, I used an 8-ohm, 0.5 W speaker, which produced surprisingly good sound - essentially turning the NanoKVM into a tiny music player. Actually, the idea is not so bad, because PiKVM also added two-way audio support to their devices at the end of last year.

Basic board with speaker

Final Thoughts

All this of course raises an interesting question: How many similar devices with hidden functionalities might be lurking in your home, just waiting to be discovered? And not just those of Chinese origin. Are you absolutely sure none of them have built-in miniature microphones or cameras?

You can start with your iPhone - last year Apple agreed to pay $95 million to settle a lawsuit alleging that its voice assistant Siri recorded private conversations. They shared the data with third parties and used it for targeted ads. "Unintentionally", of course! Yes, that Apple, the one that cares so much about your privacy.

And Google is doing the same. They are facing a similar lawsuit over their voice assistant, but the litigation likely won't be settled until this fall. So no, small Chinese startup companies are not the only problem. And if you are worried about Chinese companies' obligations towards the Chinese government, let's not forget that U.S. companies also have obligations to cooperate with the U.S. government. While Apple publicly claims they do not cooperate with the FBI and other U.S. agencies (because they care about your privacy so much), some media revealed that Apple was holding a series of secretive Global Police Summits at its Cupertino headquarters where they taught police how to use their products for surveillance and policing work. And as one of the police officers pointed out, he has "never been part of an engagement that was so collaborative". Yep.

P.S. How to Record Audio on NanoKVM

If you want to test the built-in microphone yourself, simply connect to the device via SSH and run the following two commands:

  • amixer -Dhw:0 cset name='ADC Capture Volume' 20 (this sets the microphone sensitivity to high)
  • arecord -Dhw:0,0 -d 3 -r 48000 -f S16_LE -t wav test.wav > /dev/null & (this captures the sound to a file named test.wav)

Now, speak or sing (perhaps the Chinese national anthem?) near the device, then press Ctrl + C, copy the test.wav file to your computer, and listen to the recording.

Signal container

10 November 2024 at 00:00

Signal is an application for secure and private messaging that is free, open source, and easy to use. It uses strong end-to-end encryption, and it is used by many activists, journalists, and whistleblowers, as well as government officials and business people. In short, everyone who values their privacy. Signal runs on Android and iOS mobile phones, and also on desktop computers (Linux, Windows, MacOS), where the desktop version is designed to be linked with your mobile copy of Signal. This lets us use all of Signal's features both on the phone and on the desktop, and all messages, contacts, and so on are synchronized between the two devices. All well and good, but Signal is (sadly) tied to a phone number, and as a rule you can run only one copy of Signal on one phone; the same goes for a desktop computer. Can this limitation be worked around? Certainly, but it takes a small "hack". Read on to see how.

Running multiple copies of Signal on a phone

Running multiple copies of Signal on a phone is very easy - but only if you use GrapheneOS. GrapheneOS is a mobile phone operating system with numerous built-in security mechanisms, designed to take the best possible care of the user's privacy. It is open source and highly compatible with Android, but with many improvements that make both forensic data seizure and attacks with spyware of the Pegasus and Predator variety extremely difficult or outright impossible.

GrapheneOS supports multiple profiles (up to 31, plus a so-called guest profile), which are completely separated from each other. This means you can install different applications in different profiles, keep entirely different contact lists, use one VPN in one profile and another (or none at all) in a second one, and so on.

The solution is therefore simple. On a phone running GrapheneOS we create a new profile, install a fresh copy of Signal in it, insert a second SIM card into the phone, and register Signal with the new number.

Once the phone number is registered, we can remove the SIM card and put the old one back in. Signal only uses data transfer for its communication (and of course the phone can also be used without a SIM card, on WiFi alone). The phone now has two copies of Signal installed, tied to two different phone numbers, and we can send messages (even between the two of them!) or make calls from both.

Although the profiles are separated, we can arrange to receive notifications from the Signal application in the second profile even while logged into the first profile. Only for writing messages or making calls do we have to switch to the right profile on the phone.

Simple, right?

Running multiple copies of Signal on a computer

Of course, we would now like something similar on a computer. In short, we want the ability to run two different instances of Signal on a computer under a single user (each tied to its own phone number).

At first glance things are slightly more complicated here, but with the help of virtualization the problem can be solved elegantly. Of course we are not going to run an entire new virtual machine just for Signal; instead we can use a so-called container.

On Linux, we first install the systemd-container package (on Ubuntu systems it is already installed by default).

On the host computer, enable so-called unprivileged user namespaces: run sudo nano /etc/sysctl.d/nspawn.conf and put the following in the file:

kernel.unprivileged_userns_clone=1

Now the systemd service needs to be restarted:

sudo systemctl daemon-reload
sudo systemctl restart systemd-sysctl.service
sudo systemctl status systemd-sysctl.service

…after which we can install debootstrap: sudo apt install debootstrap.

Now we create a new container into which we will install the Debian operating system (specifically the stable release) - in reality only the minimum required parts of the operating system will be installed:

sudo debootstrap --include=systemd,dbus stable /var/lib/machines/debian

We get output roughly like this:

I: Keyring file not available at /usr/share/keyrings/debian-archive-keyring.gpg; switching to https mirror https://deb.debian.org/debian
I: Retrieving InRelease 
I: Retrieving Packages 
I: Validating Packages 
I: Resolving dependencies of required packages...
I: Resolving dependencies of base packages...
I: Checking component main on https://deb.debian.org/debian...
I: Retrieving adduser 3.134
I: Validating adduser 3.134
...
...
...
I: Configuring tasksel-data...
I: Configuring libc-bin...
I: Configuring ca-certificates...
I: Base system installed successfully.

The container with the Debian operating system is now installed. So we start it and set the root user's password:

sudo systemd-nspawn -D /var/lib/machines/debian -U --machine debian

We get this output:

Spawning container debian on /var/lib/machines/debian.
Press Ctrl-] three times within 1s to kill container.
Selected user namespace base 1766326272 and range 65536.
root@debian:~#

Now we log into the operating system through the virtual terminal and enter the following two commands:

passwd
printf 'pts/0\npts/1\n' >> /etc/securetty 

The first command sets the password, while the second allows logins over a so-called local terminal (TTY). At the end we type the logout command and log out, back to the host computer.

Now we need to set up the network the container will use. The simplest approach is to just use the host computer's network. Enter the following two commands:

sudo mkdir /etc/systemd/nspawn
sudo nano /etc/systemd/nspawn/debian.nspawn

Put the following in the file:

[Network]
VirtualEthernet=no

Now we start the container again with sudo systemctl start systemd-nspawn@debian, or even more simply with machinectl start debian.

We can also list the running containers:

machinectl list
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES
debian  container systemd-nspawn debian 12      -        

1 machines listed.

Or we connect to this virtual container: machinectl login debian. We get:

Connected to machine debian. Press ^] three times within 1s to exit session.

Debian GNU/Linux 12 cryptopia pts/1

cryptopia login: root
Password: 

The output shows that we logged in as the root user with the password we set earlier.

Now we install Signal Desktop in this container.

apt update
apt install wget gpg

wget -O- https://updates.signal.org/desktop/apt/keys.asc | gpg --dearmor > /usr/share/keyrings/signal-desktop-keyring.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/signal-desktop-keyring.gpg] https://updates.signal.org/desktop/apt xenial main' | tee /etc/apt/sources.list.d/signal-xenial.list

apt update
apt install --no-install-recommends signal-desktop
halt

The last command shuts the container down. A fresh copy of Signal Desktop is now installed in it.

By the way, if we want, we can rename the container to a friendlier name, e.g. sudo machinectl rename debian debian-signal. Of course we then have to use the same name to start the container (so machinectl login debian-signal).

Now we create a script that starts the container and runs Signal Desktop inside it in such a way that its window appears on the host computer's desktop:

Create the file with nano /opt/runContainerSignal.sh (saving it, for example, in /opt); its contents are as follows:

#!/bin/sh
xhost +local:
pkexec systemd-nspawn --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/  \
                      --private-users=pick \
                      --private-users-chown \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox
xhost -local:

The first xhost command allows connections to our display, but only from the local machine; the second xhost command blocks these connections (to the display) again. Make the script executable (chmod +x runContainerSignal.sh), and that's it.

Two Signal Desktop application icons

Well, not quite yet, since we would have to run the script from a terminal; launching it by clicking an icon is much more convenient.

So we create a .desktop file: nano ~/.local/share/applications/runContainerSignal.desktop. Write the following into it:

[Desktop Entry]
Type=Application
Name=Signal Container
Exec=/opt/runContainerSignal.sh
Icon=security-high
Terminal=false
Comment=Run Signal Container

…instead of the security-high icon, we can use a different one, for example:

Icon=/usr/share/icons/Yaru/scalable/status/security-high-symbolic.svg

A note: the .desktop file is stored in ~/.local/share/applications/, so it is available only to that specific user and not to all users on the computer.

Now make the .desktop file executable: chmod +x ~/.local/share/applications/runContainerSignal.desktop

Refresh the so-called desktop entries: update-desktop-database ~/.local/share/applications/, and that's it!

Two instances of Signal Desktop

When we type "Signal Container" into the application search, the application's icon appears, and clicking on it starts Signal in the container (though we will have to enter a password to start it).

Now we just link this Signal Desktop with the copy of Signal on the phone, and we can use two copies of Signal Desktop on the computer.

What about…?

Unfortunately, in the setup described, access to the camera and audio does not work. So calls will still have to be made from the phone.

It turns out that connecting the container to the host computer's PipeWire sound system and camera is incredibly complicated (at least in my system's setup). If you have a hint on how to solve this, you are of course welcome to let me know. :)

The Myth and Reality of Mac OS X Snow Leopard

By: Nick Heer
28 March 2025 at 03:56

Jeff Johnson in November 2023:

When people wistfully proclaim that they wish for the next major macOS version to be a "Snow Leopard update", they're wishing for the wrong thing. No major update will solve Apple's quality issues. Major updates are the cause of quality issues. The solution would be a long string of minor bug fix updates. What people should be wishing for are the two years of stability and bug fixes that occurred after the release of Snow Leopard. But I fear we'll never see that again with Tim Cook in charge.

I read an article today from yet another person pining for a mythical Snow Leopard-style MacOS release. While I sympathize with the intent of their argument, it is largely fictional and, as Johnson writes, it took until about two years into Snow Leopard's release cycle for it to be the release we want to remember:

It's an iron law of software development that major updates always introduce more bugs than they fix. Mac OS X 10.6.0 was no exception, of course. The next major update, Mac OS X 10.7.0, was no exception either, and it was much buggier than 10.6.8 v1.1, even though both versions were released in the same week.

What I desperately miss is that period of stability after a few rounds of bug fixes. As I have previously complained about, my iMac cannot run any version of MacOS newer than Ventura, released in 2022. It is still getting bug and security fixes. In theory, this should mean I am running a solid operating system despite missing some features.

It is not. Apple's engineering efforts quickly moved toward shipping MacOS Sonoma in 2023, and then Sequoia last year. It seems as though any bug fixes were folded into these new major versions and, even worse, new bugs were introduced late in the Ventura release cycle that have no hope of being fixed. My iMac seizes up when I try to view HDR media; because this Extended Dynamic Range is an undocumented enhancement, there is no preference to turn it off. Recent Safari releases have contained several bugs related to page rendering and scrolling. Weather sometimes does not display for my current location.

Ventura was by no means bug-free when it shipped, and I am disappointed that even its final form remains a mess. My MacBook Pro is running the latest public release of MacOS Sequoia and it, too, has new problems late in its development cycle; I reported a Safari page crashing bug earlier this week. These are on top of existing problems, like how there is no way to change the size of search results' thumbnails in Photos.

Alas, I am not expecting many bugs to be fixed. It is, after all, nearly April, which means there are just two months until WWDC and the first semi-public builds of another new MacOS version. I am hesitant every year to upgrade. But it does not appear much effort is being put into the maintenance of any previous version. We all get the choice of many familiar bugs, or a blend of hopefully fewer old bugs plus some new ones.


'Adolescence'

By: Nick Heer
22 March 2025 at 21:54

Lucy Mangan, the Guardian:

There have been a few contenders for the crown [of "televisual perfection"] over the years, but none has come as close as Jack Thorne's and Stephen Graham's astonishing four-part series Adolescence, whose technical accomplishments – each episode is done in a single take – are matched by an array of award-worthy performances and a script that manages to be intensely naturalistic and hugely evocative at the same time. Adolescence is a deeply moving, deeply harrowing experience.

I did not intend on watching the whole four-part series today, maybe just the first and second episodes. But I could not turn away. The effectively unanimous praise for this is absolutely earned.

The oner format sounds like it could be a gimmick, the kind of thing that screams a bit too loud and overshadows what should be a tender and difficult narrative. Nothing could be further from the truth. The technical decisions force specific storytelling decisions, in the same way that a more maximalist production in the style of, say, David Fincher does. Fincher would shoot fifty versions of everything and then assemble the best performances into a tight machine, and I love that stuff. But I love this, too, little errors and all. It is better for these choices. The dialogue cannot get just a little bit tighter in the edit, or whatever. It is all just there.

I know nothing about reviewing television or movies but, so far as I can tell, everyone involved has pulled this off spectacularly. You can quibble with things like the rainbow party-like explanation of different emoji (something for which I cannot find any evidence) that has now become its own moral panic. I get that. Even so, this is one of the greatest storytelling achievements I have seen in years.

Update: Watch it on Netflix. See? The ability to edit means I can get away with not fully thinking this post through.


How we handle debconf questions during our Ubuntu installs

By: cks
26 March 2025 at 02:37

In a comment on How we automate installing extra packages during Ubuntu installs, David Magda asked how we dealt with the things that need debconf answers. This is a good question and we have two approaches that we use in combination. First, we have a prepared file of debconf selections for each Ubuntu version and we feed this into debconf-set-selections before we start installing packages. However in practice this file doesn't have much in it and we rarely remember to update it (and as a result, a bunch of it is somewhat obsolete). We generally only update this file if we discover debconf selections where the default doesn't work in our environment.
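
For illustration, lines in such a selections file use debconf's 'package question type value' format; these two entries are made-up but representative examples:

postfix postfix/main_mailer_type select Local only
tzdata tzdata/Areas select America

The whole file then gets loaded in one go with 'debconf-set-selections < our-selections' before package installation starts.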

Second, we run apt-get with a bunch of environment variables set to muzzle debconf:

export DEBCONF_TERSE=yes
export DEBCONF_NOWARNINGS=yes
export DEBCONF_ADMIN_EMAIL=<null address>@<our domain>
export DEBIAN_FRONTEND=noninteractive

Traditionally I've considered muzzling debconf this way to be too dangerous to do during package updates or installing packages by hand. However, I consider it not so much safe as safe enough to do this during our standard install process. To put it one way, we're not starting out with a working system and potentially breaking it by letting some new or updated package pick bad defaults. Instead we're starting with a non-working system and hopefully ending up with a working one. If some package picks bad defaults and we wind up with problems, that's not much worse than we started out with and we'll fix it by updating our file of debconf selections and then redoing the install.

Also, in practice all of this gets worked out during our initial test installs of any new Ubuntu version (done on test virtual machines these days). By the time we're ready to start installing real servers with a new Ubuntu version, we've gone through most of the discovery process for debconf questions. Then the only time we're going to have problems during future system installs is if a package update either changes the default answer for a current question (to a bad one) or adds a new question with a bad default. As far as I can remember, we haven't had either happen.

(Some of our servers need additional packages installed, which we do by hand (as mentioned), and sometimes the packages will insist on stopping to ask us questions or give us warnings. This is annoying, but so far not annoying enough to fix it by augmenting our standard debconf selections to deal with it.)

The pragmatics of doing fsync() after a re-open() of journals and logs

By: cks
25 March 2025 at 02:02

Recently I read Rob Norris' fsync() after open() is an elaborate no-op (via). This is a contrarian reaction to the CouchDB article that prompted my entry Always sync your log or journal files when you open them. At one level I can't disagree with Norris and the article; POSIX is indeed very limited about the guarantees it provides for a successful fsync() in a way that frustrates the 'fsync after open' case.

At another level, I disagree with the article. As Norris notes, there are systems that go beyond the minimum POSIX guarantees, and also the fsync() after open() approach is almost the best you can do and is much faster than your other (portable) option, which is to call sync() (on Linux you could call syncfs() instead). Under POSIX, sync() is allowed to return before the IO is complete, but at least sync() is supposed to definitely trigger flushing any unwritten data to disk, which is more than POSIX fsync() provides you (as Norris notes, POSIX permits fsync() to apply only to data written to that file descriptor, not all unwritten data for the underlying file). As far as fsync() goes, in practice I believe that almost all Unixes and Unix filesystems are going to be more generous than POSIX requires and fsync() all dirty data for a file, not just data written through your file descriptor.

Actually being as restrictive as POSIX allows would likely be a problem for Unix kernels. The kernel wants to index the filesystem cache by inode, including unwritten data. This makes it natural for fsync() to flush all unwritten data associated with the file regardless of who wrote it, because then the kernel needs no extra data to be attached to dirty buffers. If you wanted to be able to flush only dirty data associated with a file object or file descriptor, you'd need to either add metadata associated with dirty buffers or index the filesystem cache differently (which is clearly less natural and probably less efficient).

Adding metadata has an assortment of challenges and overheads. If you add it to dirty buffers themselves, you have to worry about clearing this metadata when a file descriptor is closed or a file object is deallocated (including when the process exits). If you instead attach metadata about dirty buffers to file descriptors or file objects, there's a variety of situations where other IO involving the buffer requires updating your metadata, including the kernel writing out dirty buffers on its own without a fsync() or a sync() and then perhaps deallocating the now clean buffer to free up memory.

Being as restrictive as POSIX allows probably also has low benefits in practice. To be a clear benefit, you would need to have multiple things writing significant amounts of data to the same file and fsync()'ing their data separately; this is when the file descriptor (or file object) specific fsync() saves you a bunch of data write traffic over the 'fsync() the entire file' approach. But as far as I know, this is a pretty unusual IO pattern. Much of the time, the thing fsync()'ing the file is the only writer, either because it's the only thing dealing with the file or because updates to the file are being coordinated through it so that processes don't step over each other.

PS: If you wanted to implement this, the simplest option would be to store the file descriptor and PID (as numbers) as additional metadata with each buffer. When the system fsync()'d a file, it could check the current file descriptor number and PID against the saved ones and only flush buffers where they matched, or where these values had been cleared to signal an uncertain owner. This would flush more than strictly necessary if the file descriptor number (or the process ID) had been reused or buffers had been touched in some way that caused the kernel to clear the metadata, but doing more work than POSIX strictly requires is relatively harmless.

Sidebar: fsync() and mmap() in POSIX

Under a strict reading of the POSIX fsync() specification, it's not entirely clear how you're properly supposed to fsync() data written through mmap() mappings. If 'all data for the open file descriptor' includes pages touched through mmap(), then you have to keep the file descriptor you used for mmap() open, despite POSIX mmap() otherwise implicitly allowing you to close it; my view is that this is at least surprising. If 'all data' only includes data directly written through the file descriptor with system calls, then there's no way to trigger a fsync() for mmap()'d data.

The obviousness of indexing the Unix filesystem buffer cache by inodes

By: cks
24 March 2025 at 02:34

Like most operating systems, Unix has an in-memory cache of filesystem data. Originally this was a fixed size buffer cache that was maintained separately from the memory used by processes, but later it became a unified cache that was used for both memory mappings established through mmap() and regular read() and write() IO (for good reasons). Whenever you have a cache, one of the things you need to decide is how the cache is indexed. The more or less required answer for Unix is that the filesystem cache is indexed by inode (and thus filesystem, as inodes are almost always attached to some filesystem).

Unix has three levels of indirection for straightforward IO. Processes open and deal with file descriptors, which refer to underlying file objects, which in turn refer to an inode. There are various situations, such as calling dup(), where you will wind up with two file descriptors that refer to the same underlying file object. Some state is specific to file descriptors, but other state is held at the level of file objects, and some state has to be held at the inode level, such as the last modification time of the inode. For mmap()'d files, we have a 'virtual memory area', which is a separate level of indirection that is on top of the inode.

The biggest reason to index the filesystem cache by inode instead of file descriptor or file object is coherence. If two processes separately open the same file, getting two separate file objects and two separate file descriptors, and then one process writes to the file while the other reads from it, we want the reading process to see the data that the writing process has written. The only thing the two processes naturally share is the inode of the file, so indexing the filesystem cache by inode is the easiest way to provide coherence. If the kernel indexed by file object or file descriptor, it would have to do extra work to propagate updates through all of the indirection. This includes the 'updates' of reading data off disk; if you index by inode, everyone reading from the file automatically sees fetched data with no extra work.

(Generally we also want this coherence for two processes that both mmap() the file, and for one process that mmap()s the file while another process read()s or write()s to it. Again this is easiest to achieve if everything is indexed by the inode.)

Another reason to index by inode is how easy it is to handle various situations in the filesystem cache when things are closed or removed, especially when the filesystem cache holds writes that are being buffered in memory before being flushed to disk. Processes frequently close file descriptors and drop file objects, including by exiting, but any buffered writes still need to be findable so they can be flushed to disk before, say, the filesystem itself is unmounted. Similarly, if an inode is deleted we don't want to flush its pending buffered writes to disk (and certainly we can't allocate blocks for them, since there's nothing to own those blocks any more), and we want to discard any clean buffers associated with it to free up memory. If you index the cache by inode, all you need is for filesystems to be able to find all their inodes; everything else more or less falls out naturally.

This doesn't absolutely require a Unix to index its filesystem buffer caches by inode. But I think it's clearly easiest to index the filesystem cache by inode, instead of the other available references. The inode is the common point for all IO involving a file (partly because it's what filesystems deal with), which makes it the easiest index; everyone has an inode reference and in a properly implemented Unix, everyone is using the same inode reference.

(In fact all sorts of fun tend to happen in Unixes if they have a filesystem that gives out different in-kernel inodes that all refer to the same on-disk filesystem object. Usually this happens by accident or filesystem bugs.)

How we automate installing extra packages during Ubuntu installs

By: cks
23 March 2025 at 02:52

We have a local system for installing Ubuntu machines, and one of the important things it does is install various additional Ubuntu packages that we want as part of our standard installs. These days we have two sorts of standard installs, a 'base' set of packages that everything gets and a broader set of packages that login servers and compute servers get (to make them more useful and usable by people). Specialized machines need additional packages, and while we can automate installation of those too, they're generally a small enough set of packages that we document them in our install instructions for each machine and install them by hand.

There are probably clever ways to do bulk installs of Ubuntu packages, but if so, we don't use them. Our approach is instead a brute force one. We have files that contain lists of packages, such as a 'base' file, and these files just contain a list of packages with optional comments:

# Partial example of Basic package set
amanda-client
curl
jq
[...]

# decodes kernel MCE/machine check events
rasdaemon

# Be able to build Debian (Ubuntu) packages on anything
build-essential fakeroot dpkg-dev devscripts automake 

(Like all of the rest of our configuration information, these package set files live in our central administrative filesystem. You could distribute them in some other way, for example fetching them with rsync or even HTTP.)

To install these packages, we use grep to extract the actual packages into a big list and feed the big list to apt-get. This is more or less:

pkgs=$(cat $PKGDIR/$s | grep -v '^#' | grep -v '^[ \t]*$')
apt-get -qq -y install $pkgs

(This will abort if any of the packages we list aren't available. We consider this a feature, because it means we have an error in the list of packages.)

A more organized and minimal approach might be to add the '--no-install-recommends' option, but we started without it and we don't particularly want to go back to find which recommended packages we'd have to explicitly add to our package lists.
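
(If we were starting over with '--no-install-recommends', the install step would presumably just become:

apt-get -qq -y --no-install-recommends install $pkgs

along with the work of finding the recommended packages we actually rely on and adding them to our lists.)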

At least some of the 'base' package installs could be done during the initial system install process from our customized Ubuntu server ISO image, since you can specify additional packages to install. However, doing package installs that way would create a series of issues in practice. We'd probably need to more carefully track which package came from which Ubuntu collection, since only some of them are enabled during the server install process; it would be harder to update the lists; and the tools for handling the whole process would be a lot more limited, as would our ability to troubleshoot any problems.

Doing this additional package install in our 'postinstall' process means that we're doing it in a full Unix environment where we have all of the standard Unix tools, and we can easily look around the system if and when there's a problem. Generally we've found that the more of our installs we can defer to once the system is running normally, the better.

(Also, the less the Ubuntu installer does, the faster it finishes and the sooner we can get back to our desks.)

(This entry was inspired by parts of a blog post I read recently and reflecting about how we've made setting up new versions of machines pretty easy, assuming our core infrastructure is there.)

The mystery (to me) of tiny font sizes in KDE programs I run

By: cks
22 March 2025 at 03:24

Over on the Fediverse I tried a KDE program and ran into a common issue for me:

It has been '0' days since a KDE app started up with too-small fonts on my bespoke fvwm based desktop, and had no text zoom. I guess I will go use a browser, at least I can zoom fonts there.

Maybe I could find a KDE settings thing and maybe find where and why KDE does this (it doesn't happen in GNOME apps), but honestly it's simpler to give up on KDE based programs and find other choices.

(The specific KDE program I was trying to use this time was NeoChat.)

My fvwm based desktop environment has an XSettings daemon running, which I use in part to set up a proper HiDPI environment (my notes on that don't talk about KDE fonts because I never figured them out). I suspect that my HiDPI display is part of why KDE programs often or always seem to pick tiny fonts, but I don't particularly know why. Based on the xsettingsd documentation and the registry, there don't seem to be any KDE specific font settings, and I'm setting the Gtk/FontName setting to a font that KDE doesn't seem to be using (which I could only verify once I found a way to see the font I was specifying).

After some searching I found the systemsettings program through the Arch wiki's page on KDE and was able to turn up its font sizes in a way that appears to be durable (ie, it stays after I stop and start systemsettings). However, this hasn't affected the fonts I see in NeoChat when I run it again. There are a bunch of font settings, but maybe NeoChat is using the 'small' font for some reason (apparently which app uses what font setting can be variable).

Qt (the underlying GUI toolkit of much or all of KDE) has its own set of environment variables for scaling things on HiDPI displays, and setting $QT_SCALE_FACTOR does size up NeoChat (although apparently bits of Plasma ignore these; I think I'm unlikely to run into that since I don't want to use KDE's desktop components).
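
For example, to try this out for a single program (the scale factor here is just an illustrative value):

QT_SCALE_FACTOR=1.5 neochat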

Some KDE applications have their own settings files with their own font sizes; one example I know of is kdiff3. This is quite helpful because if I'm determined enough, I can either adjust the font sizes in the program's settings or at least go edit the configuration file (in this case, .config/kdiff3rc, I think, not .kde/share/config/kdiff3rc). However, not all KDE applications allow you to change font sizes through either their GUI or a settings file, and NeoChat appears to be one of the ones that don't.

In theory now that I've done all of this research I could resize NeoChat and perhaps other KDE applications through $QT_SCALE_FACTOR. In practice I feel I would rather switch to applications that interoperate better with the rest of my environment unless for some reason the KDE application is either my only choice or the significantly superior one (as it has been so far for kdiff3 for my usage).

Using Netplan to set up WireGuard on Ubuntu 22.04 works, but has warts

By: cks
28 February 2025 at 04:07

For reasons outside the scope of this entry, I recently needed to set up WireGuard on an Ubuntu 22.04 machine. When I did this before for an IPv6 gateway, I used systemd-networkd directly. This time around I wasn't going to set up a single peer and stop; I expected to iterate and add peers several times, which made netplan's ability to update and re-do your network configuration look attractive. Also, our machines are already using Netplan for their basic network configuration, so this would spare my co-workers from having to learn about systemd-networkd.

Conveniently, Netplan supports multiple configuration files so you can put your WireGuard configuration into a new .yaml file in your /etc/netplan. The basic version of a WireGuard endpoint with purely internal WireGuard IPs is straightforward:

network:
  version: 2
  tunnels:
    our-wg0:
      mode: wireguard
      addresses: [ 192.168.X.1/24 ]
      port: 51820
      key:
        private: '....'
      peers:
        - keys:
            public: '....'
          allowed-ips: [ 192.168.X.10/32 ]
          keepalive: 90
          endpoint: A.B.C.D:51820

(You may want something larger than a /24 depending on how many other machines you think you'll be talking to. Also, this configuration doesn't enable IP forwarding, which is a feature in our particular situation.)

If you're using netplan's systemd-networkd backend, which you probably are on an Ubuntu server, you can apparently put your keys into files instead of needing to carefully guard the permissions of your WireGuard /etc/netplan file (which normally has your private key in it).

If you write this out and run 'netplan try' or 'netplan apply', it will duly apply all of the configuration and bring your 'our-wg0' WireGuard configuration up as you expect. The problems emerge when you change this configuration, perhaps to add another peer, and then re-do your 'netplan try', because when you look you'll find that your new peer hasn't been added. This is a sign of a general issue; as far as I can tell, netplan (at least in Ubuntu 22.04) can set up WireGuard devices from scratch but it can't update anything about their WireGuard configuration once they're created. This is probably a limitation in the Ubuntu 22.04 version of systemd-networkd that's only been changed in the very latest systemd versions. In order to make WireGuard level changes, you need to remove the device, for example with 'ip link del dev our-wg0', and then re-run 'netplan try' (or 'netplan apply') to re-create the WireGuard device from scratch; the recreated version will include all of your changes.

(The latest online systemd.netdev manual page says that systemd-networkd will try to update netdev configurations if they change, and .netdev files are where WireGuard settings go. The best information I can find is that this change appeared in systemd v257, although the Fedora 41 systemd.netdev manual page has this same wording and it has systemd '256.11'. Maybe there was a backport into Fedora.)

In our specific situation, deleting and recreating the WireGuard device is harmless and we're not going to be doing it very often anyway. In other configurations things may not be so straightforward and so you may need to resort to other means to apply updates to your WireGuard configuration (including working directly through the 'wg' tool).
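
As an illustrative sketch, adding a peer on the fly with 'wg' looks something like this (the public key is a placeholder, and changes made this way will be lost if the device is later deleted and recreated through netplan):

wg set our-wg0 peer '<peer-public-key>' allowed-ips 192.168.X.11/32 endpoint E.F.G.H:51820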

I'm not impressed by the state of NFS v4 in the Linux kernel

By: cks
27 February 2025 at 04:15

Although NFS v4 is (in theory) the latest great thing in NFS protocol versions, for a long time we only used NFS v3 for our fileservers and our Ubuntu NFS clients. A few years ago we switched to NFS v4 due to running into a series of problems our people were experiencing with NFS (v3) locks (cf), since NFS v4 locks are integrated into the protocol and NFS v4 is the 'modern' NFS version that's probably receiving more attention than anything to do with NFS v3.

(NFS v4 locks are handled relatively differently than NFS v3 locks.)

Moving to NFS v4 did fix our NFS lock issues in that stuck NFS locks went away, when before they'd been a regular issue on our IMAP server. However, all has not turned out to be roses, and the result has left me not really impressed with the state of NFS v4 in the Linux kernel. In Ubuntu 22.04's 5.15.x server kernel, we've now run into scalability issues in both the NFS server (which is what sparked our interest in how many NFS server threads to run and what NFS server threads do in the kernel), and now in the NFS v4 client (where I have notes that let me point to a specific commit with the fix).

(The NFS v4 server issue we encountered may be the one fixed by this commit.)

What our two issues have in common is that both are things that you only find under decent or even significant load. That these issues both seem to have still been present as late as kernels 6.1 (server) and 6.6 (client) suggests that neither the Linux NFS v4 server nor the Linux NFS v4 client had been put under serious load until then, or at least not by people who could diagnose their problems precisely enough to identify the problem and get kernel fixes made. While both issues are probably fixed now, their past presence leaves me wondering what other scalability issues are lurking in the kernel's NFS v4 support, partly because people have mostly been using NFS v3 until recently (like us).

We're not going to go back to NFS v3 in general (partly because of the clear improvement in locking), and the server problem we know about has been wiped away because we're moving our NFS fileservers to Ubuntu 24.04 (and some day the NFS clients will move as well). But I'm braced for further problems, including ones in 24.04 that we may be stuck with for a while.

PS: I suspect that part of the issues may come about because the Linux NFS v4 client and the Linux NFS v4 server don't add NFS v4 operations at the same time. As I found out, the server supports more operations than the client uses, but the client uses whatever operations are convenient and useful for it, not necessarily everything in a given NFS v4 revision. If the major use of Linux NFS v4 servers is with Linux NFS v4 clients, this could leave the server implementation of operations under-used until the client starts using them (and people upgrade clients to kernel versions with that support).

Why I have a little C program to filter a $PATH (more or less)

By: cks
16 February 2025 at 02:07

I use a non-standard shell and have for a long time, which means that I have to write and maintain my own set of dotfiles (which sometimes has advantages). In the long ago days when I started doing this, I had a bunch of accounts on different Unixes around the university (as was the fashion at the time, especially if you were a sysadmin). So I decided that I was going to simplify my life by having one set of dotfiles for rc that I used on all of my accounts, across a wide variety of Unixes and Unix environments. That way, when I made an improvement in a shell function I used, I could get it everywhere by just pushing out a new version of my dotfiles.

(This was long enough ago that my dotfile propagation was mostly manual, although I believe I used rdist for some of it.)

In the old days, one of the problems you faced if you wanted a common set of dotfiles across a wide variety of Unixes was that there were a lot of things that potentially could be in your $PATH. Different Unixes had different sets of standard directories, and local groups put local programs (that I definitely wanted access to) in different places. I could have put everything in $PATH (giving me a gigantic one) or tried to carefully scope out what system environment I was on and set an appropriate $PATH for each one, but I decided to take a more brute force approach. I started with a giant potential $PATH that listed every last directory that could appear in $PATH in any system I had an account on, and then I had a C program that filtered that potential $PATH down to only things that existed on the local system. Because it was written in C and had to stat() things anyways, I made it also keep track of what concrete directories it had seen and filter out duplicates, so that if there were symlinks from one name to another, I wouldn't get it twice in my $PATH.

(Looking at historical copies of the source code for this program, the filtering of duplicates was added a bit later; the very first version only cared about whether a directory existed or not.)

The reason I wrote a C program for this (imaginatively called 'isdirs') instead of using shell builtins to do this filtering (which is entirely possible) is primarily because this was so long ago that running a C program was definitely faster than using shell builtins in my shell. I did have a fallback shell builtin version in case my C program might not be compiled for the current system and architecture, although it didn't do the filtering of duplicates.

(Rc uses a real list for its equivalent of $PATH instead of the awkward ':' separated pseudo-list that other Unix shells use, so both my C program and my shell builtin could simply take a conventional argument list of directories rather than having to try to crack a $PATH apart.)

(This entry was inspired by Ben Zanin's trick(s) to filter out duplicate $PATH entries (also), which prompted me to mention my program.)

PS: rc technically only has one dotfile, .rcrc, but I split my version up into several files that did different parts of the work. One reason for this split was so that I could source only some parts to set up my environment in a non-interactive context (also).

Sidebar: the rc builtin version

Rc has very few builtins and those builtins don't include test, so this is a bit convoluted:

path=`{tpath=() pe=() {
        for (pe in $path)
           builtin cd $pe >[1=] >[2=] && tpath=($tpath $pe)
        echo $tpath
       } >[2]/dev/null}

In a conventional shell with a test builtin, you would just use 'test -d' to see if directories were there. In rc, the only builtin that will tell you if a directory exists is to try to cd to it. That we change directories is harmless because everything is running inside the equivalent of a Bourne shell $(...).

Keen eyed people will have noticed that this version doesn't work if anything in $path has a space in it, because we pass the result back as a whitespace-separated string. This is a limitation shared with how I used the C program, but I never had to use a Unix where one of my $PATH entries needed a space in it.
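
For comparison, a minimal sketch of the same idea in a conventional Bourne shell with test, without the duplicate filtering (and ignoring corner cases such as glob characters in $PATH entries):

newpath=
oldifs=$IFS; IFS=:
for pe in $PATH; do
        test -d "$pe" && newpath="$newpath:$pe"
done
IFS=$oldifs
PATH="${newpath#:}"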

The profusion of things that could be in your $PATH on old Unixes

By: cks
15 February 2025 at 03:43

In the beginning, which is to say the early days of Bell Labs Research Unix, life was simple and there was only /bin. Soon afterwards that disk ran out of space and we got /usr/bin (and all of /usr), and some people might even have put /etc on their $PATH. When UCB released BSD Unix, they added /usr/ucb as a place for (some of) their new programs and put some more useful programs in /etc (and at some point there was also /usr/etc); now you had three or four $PATH entries. When window systems showed up, people gave them their own directories too, such as /usr/bin/X11 or /usr/openwin/bin, and this pattern was followed by other third party collections of programs, with (for example) /usr/bin/mh holding all of the (N)MH programs (if you installed them there). A bit later, SunOS 4.0 added /sbin and /usr/sbin and other Unixes soon copied them, adding yet more potential $PATH entries.

(Sometimes X11 wound up in /usr/X11/bin, or /usr/X11<release>/bin. OpenBSD still has a /usr/X11R6 directory tree, to my surprise.)

When Unix went out into the field, early system administrators soon learned that they didn't want to put local programs into /usr/bin, /usr/sbin, and so on. Of course there was no particular agreement on where to put things, so people came up with all sorts of options for the local hierarchy, including /usr/local, /local, /slocal, /<group name> (such as /csri or /dgp), and more. Often these /local/bin things had additional subdirectories for things like the locally built version of X11, which might be plain 'bin/X11' or have a version suffix, like 'bin/X11R4', 'bin/X11R5', or 'bin/X11R6'. Some places got more elaborate; rather than putting everything in a single hierarchy, they put separate things into separate directory hierarchies. When people used /opt for this, you could get /opt/gnu/bin, /opt/tk/bin, and so on.

(There were lots of variations, especially for locally built versions of X11. And a lot of people built X11 from source in those days, at least in the university circles I was in.)

Unix vendors didn't sit still either. As they began adding more optional pieces they started splitting them up into various directory trees, both for their own software and for third party software they felt like shipping. Third party software was often planted into either /usr/local or /usr/contrib, although there were other options, and vendor stuff could go in many places. A typical example is Solaris 9's $PATH for sysadmins (and I think that's not even fully complete, since I believe Solaris 9 had some stuff hiding under /usr/xpg4). Energetic Unix vendors could and did put various things in /opt under various names. By this point, commercial software vendors that shipped things for Unixes also often put them in /opt.

This led to three broad things for people using Unixes back in those days. First, you invariably had a large $PATH, between all of the standard locations, the vendor additions, and the local additions on top of those (and possibly personal 'bin' directories in your $HOME). Second, there was a lot of variation in the $PATH you wanted, both from Unix to Unix (with every vendor having their own collection of non-standard $PATH additions) and from site to site (with sysadmins making all sorts of decisions about where to put local things). Third, setting yourself up on a new Unix often required a bunch of exploration and digging. Unix vendors often didn't add everything that you wanted to their standard $PATH, for example. If you were lucky and got an account at a well run site, their local custom new account dotfiles would set you up with a correct and reasonably complete local $PATH. If you were a sysadmin exploring a new to you Unix, you might wind up writing a grumpy blog entry.

(This got much more complicated for sites that had a multi-Unix environment, especially with shared home directories.)

Modern Unix life is usually at least somewhat better. On Linux, you're typically down to two main directories (/usr/bin and /usr/sbin) and possibly some things in /opt, depending on local tastes. The *BSDs are a little more expansive but typically nowhere near the heights of, for example, Solaris 9's $PATH (see the comments on that entry too).

The Prometheus host agent is missing some Linux NFSv4 RPC stats (as of 1.8.2)

By: cks
9 February 2025 at 03:51

Over on the Fediverse I said:

This is my face when the Prometheus host agent provides very incomplete monitoring of NFS v4 RPC operations on modern kernels that can likely hide problems. For NFS servers I believe that you get only NFS v4.0 ops, no NFS v4.1 or v4.2 ones. For NFS v4 clients things confuse me but you certainly don't get all of the stats as far as I can see.

When I wrote that Fediverse post, I hadn't peered far enough into the depths of the Linux kernel to be sure what was missing, but now that I understand the Linux kernel NFS v4 server and client RPC operations stats I can provide a better answer of what's missing. All of this applies to node_exporter as of version 1.8.2 (the current one as I write this).

(I now think 'very incomplete' is somewhat wrong, but not entirely so, especially on the server side.)

Importantly, what's missing is different for the server side and the client side, with the client side providing information on operations that the server side doesn't. This can make it very puzzling if you're trying to cross-compare two 'NFS RPC operations' graphs, one from a client and one from a server, because the client graph will show operations that the server graph doesn't.

In the host agent code, the actual stats are read from /proc/net/rpc/nfs and /proc/net/rpc/nfsd by a separate package, prometheus/procfs, and are parsed in nfs/parse.go. For the server case, if we cross compare this to the kernel's include/linux/nfs4.h, what's missing from server stats is all NFS v4.1, v4.2, and RFC 8276 xattr operations, everything from operation 40 through operation 75 (as I write this).

Because the Linux NFS v4 client stats are more confusing and aren't so nicely ordered, the picture there is more complex. The nfs/parse.go code handles everything up through 'Clone' and is missing everything from 'Copy' onward. However, both what it has and what it's missing are a mixture of NFS v4, v4.1, and v4.2 operations; for example, 'Allocate' and 'Clone' (both included) are v4.2 operations, while 'Lookupp', a v4.0 operation, is missing from client stats. If I'm reading the code correctly, the missing NFS v4 client operations are currently (using somewhat unofficial names):

Copy OffloadCancel Lookupp LayoutError CopyNotify Getxattr Setxattr Listxattrs Removexattr ReadPlus

Adding the missing operations to the Prometheus host agent would require updates to both prometheus/procfs (to add fields for them) and to node_exporter itself, to report the fields. The NFS client stats collector in collector/nfs_linux.go uses Go reflection to determine the metrics to report and so needs no updates, but the NFS server stats collector in collector/nfsd_linux.go directly knows about all 40 of the current operations and so would need code updates, either to add the new fields or to switch to using Go reflection.

If you want numbers for scale, at the moment node_exporter reports on 50 out of 69 NFS v4 client operations, and is missing 36 NFS v4 server operations (reporting on what I believe is 36 out of 72). My ability to decode what the kernel NFS v4 client and server code is doing is limited, so I can't say exactly how these operations match up and, for example, what client operations the server stats are missing.

(I haven't made a bug report about this (yet) and may not do so, because doing so would require making my Github account operable again, something I'm sort of annoyed by. Github's choice to require me to have MFA to make bug reports is not the incentive they think it is.)

Linux kernel NFSv4 server and client RPC operation statistics

By: cks
7 February 2025 at 02:59

NFS servers and clients communicate using RPC, sending various NFS v3, v4, and possibly v2 (but we hope not) RPC operations to the server and getting replies. On Linux, the kernel exports statistics about these NFS RPC operations in various places, with a global summary in /proc/net/rpc/nfsd (for the NFS server side) and /proc/net/rpc/nfs (for the client side). Various tools will extract this information and convert it into things like metrics, or present it on the fly (for example, nfsstat(8)). However, as far as I know what is in those files and especially how RPC operations are reported is not well documented, and also confusing, which is a problem if you discover that something has an incomplete knowledge of NFSv4 RPC stats.

For a general discussion of /proc/net/rpc/nfsd, see Svenn D'Hert's nfsd stats explained article. I'm focusing on NFSv4, which is to say the 'proc4ops' line. This line is produced in nfsd_show in fs/nfsd/stats.c. The line starts with a count of how many operations there are, such as 'proc4ops 76', and then has one number for each operation. What are the operations and how many of them are there? That's more or less found in the nfs_opnum4 enum in include/linux/nfs4.h. You'll notice that there are some gaps in the operation numbers; for example, there's no 0, 1, or 2. Despite there being no such actual NFS v4 operations, 'proc4ops' starts with three 0s for them, because it works with an array numbered by nfs_opnum4 and like all C arrays, it starts at 0.

(The counts of other, real NFS v4 operations may be 0 because they're never done in your environment.)

For NFS v4 client operations, we look at the 'proc4' line in /proc/net/rpc/nfs. Like the server's 'proc4ops' line, it starts with a count of how many operations are being reported on, such as 'proc4 69', and then a count for each operation. Unfortunately for us and everyone else, these operations are not numbered the same as the NFS server operations. Instead the numbering is given in an anonymous and unnumbered enum in include/linux/nfs4.h that starts with 'NFSPROC4_CLNT_NULL = 0,' (as a spoiler, the 'null' operation is not unused, contrary to the include file's comment). The actual generation and output of /proc/net/rpc/nfs is done in rpc_proc_show in net/sunrpc/stats.c. The whole structure this code uses is set up in fs/nfs/nfs4xdr.c, and while there is a confusing level of indirection, I believe the structure corresponds directly with the NFSPROC4_CLNT_* enum values.
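
If you want to eyeball the raw lines yourself:

grep '^proc4ops' /proc/net/rpc/nfsd   # server: ordered by nfs_opnum4
grep '^proc4 ' /proc/net/rpc/nfs      # client: ordered by the NFSPROC4_CLNT_* enum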

What I think is going on is that Linux has decided to optimize its NFSv4 client statistics to only include the NFS v4 operations that it actually uses, rather than take up a bit of extra memory to include all of the NFS v4 operations, including ones that will always have a '0' count. Because the Linux NFS v4 client started using different NFSv4 operations at different times, some of these operations (such as 'lookupp') are out of order; when the NFS v4 client started using them, they had to be added at the end of the 'proc4' line to preserve backward compatibility with existing programs that read /proc/net/rpc/nfs.

PS: As far as I can tell from a quick look at fs/nfs/nfs3xdr.c, include/uapi/linux/nfs3.h, and net/sunrpc/stats.c, the NFS v3 server and client stats cover all of the NFS v3 operations and are in the same order, the order of the NFS v3 operation numbers.

How Ubuntu 24.04's bad bpftrace package appears to have happened

By: cks
6 February 2025 at 02:39

When I wrote about Ubuntu 24.04's completely broken bpftrace '0.20.2-1ubuntu4.2' package (which is now no longer available as an Ubuntu update), I said it was a disturbing mystery how a theoretical 24.04 bpftrace binary was built in such a way that it depended on a shared library that didn't exist in 24.04. Thanks to the discussion in bpftrace bug #2097317, we have somewhat of an answer, which in part shows some of the challenges of building software at scale.

The short version is that the broken bpftrace package wasn't built in a standard Ubuntu 24.04 environment that only had released packages. Instead, it was built in a '24.04' environment that included (some?) proposed updates, and one of the included proposed updates was an updated version of libllvm18 that had the new shared library. Apparently there are mechanisms that should have acted to make the new bpftrace depend on the new libllvm18 if everything went right, but some things didn't go right and the new bpftrace package didn't pick up that dependency.

On the one hand, if you're planning interconnected package updates, it's a good idea to make sure that they work with each other, which means you may want to mingle in some proposed updates into some of your build environments. On the other hand, if you allow your build environments to be contaminated with non-public packages this way, you really, really need to make sure that the dependencies work out. If you don't and packages become public in the wrong order, you get Ubuntu 24.04's result.

(While the RPM build process and package format would have avoided this specific problem, I'm pretty sure that there are similar ways to make it go wrong.)

Contaminating your build environment this way also makes testing your newly built packages harder. The built bpftrace binary would have run inside the build environment, because the build environment had the right shared library from the proposed libllvm18. To see the failure, you would have to run tests (including running the built binary) in a 'pure' 24.04 environment that had only publicly released package updates. This would require an extra package test step; I'm not clear if Ubuntu has this as part of their automated testing of proposed updates (there's some hints in the discussion that they do but that these tests were limited and didn't try to run the binary).

An alarmingly bad official Ubuntu 24.04 bpftrace binary package

By: cks
2 February 2025 at 03:53

Bpftrace is a more or less official part of Ubuntu; it's even in the Ubuntu 24.04 'main' repository, as opposed to one of the less supported ones. So I'll present things in the traditional illustrated form (slightly edited for line length reasons):

$ bpftrace
bpftrace: error while loading shared libraries: libLLVM-18.so.18.1: cannot open shared object file: No such file or directory
$ readelf -d /usr/bin/bpftrace | grep libLLVM
 0x0...01 (NEEDED)  Shared library: [libLLVM-18.so.18.1]
$ dpkg -L libllvm18 | grep libLLVM
/usr/lib/llvm-18/lib/libLLVM.so.1
/usr/lib/llvm-18/lib/libLLVM.so.18.1
/usr/lib/x86_64-linux-gnu/libLLVM-18.so
/usr/lib/x86_64-linux-gnu/libLLVM.so.18.1
$ dpkg -l bpftrace libllvm18
[...]
ii  bpftrace       0.20.2-1ubuntu4.2 amd64 [...]
ii  libllvm18:amd64 1:18.1.3-1ubuntu1 amd64 [...]

I originally mis-diagnosed this as a libllvm18 packaging failure, but this is in fact worse. Based on trawling through packages.ubuntu.com, only Ubuntu 24.10 and later have a 'libLLVM-18.so.18.1' in any package; in Ubuntu 24.04, the correct name for this is 'libLLVM.so.18.1'. If you rebuild the bpftrace source .deb on a genuine 24.04 machine, you get a bpftrace build (and binary .deb) that does correctly use 'libLLVM.so.18.1' instead of 'libLLVM-18.so.18.1'.

As far as I can see, there are two things that could have happened here. The first is that Canonical simply built a 24.10 (or later) bpftrace binary .deb and put it in 24.04 without bothering to check if the result actually worked. I would like to say that this shows shocking disregard for the functioning of an increasingly important observability tool from Canonical, but actually it's not shocking at all, it's Canonical being Canonical (and they would like us to pay for this for some reason). The second and worse option is that Canonical is building 'Ubuntu 24.04' packages in an environment that is contaminated with 24.10 or later packages, shared libraries, and so on. This isn't supposed to happen in a properly operating package building environment that intends to create reliable and reproducible results and casts doubt on the provenance and reliability of all Ubuntu 24.04 packages.

(I don't know if there's a way to inspect binary .debs to determine anything about the environment they were built in, the way you can get some information about RPMs. Also, I now have a new appreciation for Fedora putting the Fedora release version into the actual RPM's 'release' name. Ubuntu 24.10 and 24.04 don't have the same version of bpftrace, so this isn't quite as simple as Canonical copying the 24.10 package to 24.04; 24.10 has 0.21.2, while 24.04 is theoretically 0.20.2.)

Incidentally, this isn't an issue of the shared library having its name changed, because if you manually create a 'libLLVM-18.so.18.1' symbolic link to the 24.04 libllvm18's 'libLLVM.so.18.1' and run bpftrace, what you get is:

$ bpftrace
: CommandLine Error: Option 'debug-counter' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
abort

This appears to say that the Ubuntu 24.04 bpftrace binary is incompatible with the Ubuntu 24.04 libllvm18 shared libraries. I suspect that it was built against different LLVM 18 headers as well as different LLVM 18 shared libraries.

How to accidentally get yourself with 'find ... -name something*'

By: cks
28 January 2025 at 03:43

Suppose that you're in some subdirectory /a/b/c, and you want to search all of /a for the presence of files for any version of some program:

u@h:/a/b/c$ find /a -name program* -print

This reports '/a/b/c/program-1.2.tar' and '/a/b/f/program-1.2.tar', but you happen to know that there are other versions of the program under /a. What happened to a command that normally works fine?

As you may have already spotted, what happened is the shell's wildcard expansion. Because you ran your find in a directory that contained exactly one match for 'program*', the shell expanded it before you ran find, and what you actually ran was:

find /a -name program-1.2.tar -print

This reported the two instances of program-1.2.tar in the /a tree, but not the program-1.4.1.tar that was also in the /a tree.

If you'd run your find command in a directory without a shell match for the -name wildcard, the shell would (normally) pass the unexpanded wildcard through to find, which would do what you want. And if there had been only one instance of 'program-1.2.tar' in the tree, in your current directory, it might have been more obvious what went wrong; instead, the find returning more than one result made it look like it was working normally apart from inexplicably not finding and reporting 'program-1.4.1.tar'.

(If there were multiple matches for the wildcard in the current directory, 'find' would probably have complained and you'd have realized what was going on.)
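
The fix is to quote the wildcard so that the shell passes it through to find unexpanded:

u@h:/a/b/c$ find /a -name 'program*' -print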

Some shells have options to cause failed wildcard expansions to be considered an error; Bash has the 'failglob' shopt, for example. People who turn these options on are probably not going to stumble into this because they've already been conditioned to quote wildcards for 'find -name' and other similar tools. Possibly this Bash option or its equivalent in other shells should be the default for new Unix accounts, just so everyone gets used to quoting wildcards that are supposed to be passed through to programs.

(Although I don't use a shell that makes failed wildcard expansions an error, I somehow long ago internalized the idea that I should quote all wildcards I want to pass to programs.)

The (potential) complexity of good runqueue latency measurement in Linux

By: cks
21 January 2025 at 04:16

Run queue latency is the time between when a Linux task becomes ready to run and when it actually runs. If you want good responsiveness, you want a low runqueue latency, so for a while I've been tracking a histogram of it with eBPF, and I put some graphs of it up on some Grafana dashboards I look at. Then recently I improved the responsiveness of my desktop with the cgroup V2 'cpu.idle' setting, and questions came up about how this differed from process niceness. When I was looking at those questions, I realized that my run queue latency measurements were incomplete.

When I first set up my run queue latency tracking, I wasn't using either cgroup V2 cpu.idle or process niceness, and so I set up a single global runqueue latency histogram for all tasks regardless of their priority and scheduling class. Once I started using 'idle' CPU scheduling (and testing the effectiveness of niceness), this resulted in hopelessly muddled data that was effectively meaningless during the time that multiple types of scheduling or multiple nicenesses were in use. Running CPU-consuming processes only when the system is otherwise idle is (hopefully) good for the runqueue latency of my regular desktop processes, but more terrible than usual for those 'run only when idle' processes, and generally there's going to be a lot more of them than my desktop processes.

The moment you introduce more than one 'class' of processes for scheduling, you need to split run queue latency measurements up between these classes if you want to really make sense of the results. What these classes are will depend on your environment. I could probably get away with a class for 'cpu.idle' tasks, a class for heavily nice'd tasks, a class for regular tasks, and perhaps a class for (system) processes running with very high priority. If you're doing fair share scheduling between logins, you might need a class per login (or you could ignore run queue latency as too noisy a measure).

I'm not sure I'd actually track all of my classes as Prometheus metrics. For my personal purposes, I don't care very much about the run queue latency of 'idle' or heavily nice'd processes, so perhaps I should update my personal metrics gathering to just ignore those. Alternately, I could write a bpftrace script that gathered the detailed class by class data, run it by hand when I was curious, and ignore the issue otherwise (continuing with my 'global' run queue latency histogram, which is at least honest in general).
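
If I do write that script, a minimal sketch of the idea is to key the latency histogram by something that approximates the scheduling class. This illustration splits by task priority, which separates nice'd tasks but not cgroup-based classes (an untested sketch, not a production script):

bpftrace -e '
tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
        @queued[args->pid] = nsecs;
}
tracepoint:sched:sched_switch {
        $ts = @queued[args->next_pid];
        if ($ts) {
                // histogram of microseconds queued, keyed by priority
                @rqlat[args->next_prio] = hist((nsecs - $ts) / 1000);
                delete(@queued[args->next_pid]);
        }
}'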

The history and use of /etc/glob in early Unixes

By: cks
13 January 2025 at 04:41

One of the innovations that the V7 Bourne shell introduced was built in shell wildcard globbing, which is to say expanding things like *, ?, and so on. Of course Unix had shell wildcards well before V7, but in V6 and earlier, the shell didn't implement globbing itself; instead this was delegated to an external program, /etc/glob (this affects things like looking into the history of Unix shell wildcards, because you have to know to look at the glob source, not the shell).

As covered in places like the V6 glob(8) manual page, the glob program was passed a command and its arguments (already split up by the shell), and went through the arguments to expand any wildcards it found, then exec()'d the command with the now expanded arguments. The shell operated by scanning all of the arguments for (unescaped) wildcard characters. If any were found, the shell exec'd /etc/glob with the whole show; otherwise, it directly exec()'d the command with its arguments. Quoting wildcards used a hack that will be discussed later.

This basic /etc/glob behavior goes all the way back to Unix V1, where we have sh.s and in it we can see that invocation of /etc/glob. In V2, glob is one of the programs that have been rewritten in C (glob.c), and in V3 we have a sh.1 that mentions /etc/glob and has an interesting BUGS note about it:

If any argument contains a quoted "*", "?", or "[", then all instances of these characters must be quoted. This is because sh calls the glob routine whenever an unquoted "*", "?", or "[" is noticed; the fact that other instances of these characters occurred quoted is not noticed by glob.

This section has disappeared in the V4 sh.1 manual page, which suggests that the V4 shell and /etc/glob had acquired the hack they use in V5 and V6 to avoid this particular problem.

How escaping wildcards works in the V5 and V6 shell is that all characters in commands and arguments are restricted to being seven-bit ASCII. The shell and /etc/glob both use the 8th bit to mark quoted characters, which means that such quoted characters don't match their unquoted versions and won't be seen as wildcards by either the shell (when it's deciding whether or not it needs to run /etc/glob) or by /etc/glob itself (when it's deciding what to expand). However, obviously neither the shell nor /etc/glob can pass such 'marked as quoted' characters to actual commands, so each of them strips the high bit from all characters before exec()'ing actual commands.

(This is clearer in the V5 glob.c source; look for how cat() ands every character with octal 0177 (0x7f) to drop the high bit. You can also see it in the V5 sh.c source, where you want to look at trim(), and also the #define for 'quote' at the start of sh.c and how it's used later.)

PS: I don't know why expanding shell wildcards used a separate program in V6 and earlier, but part of it may have been to keep the shell smaller and more minimal so that it required less memory.

PPS: See also Stephen R. Bourne's 2015 presentation from BSDCan [PDF], which has a bunch of interesting things on the V7 shell and confirms that /etc/glob was there from V1.

What a FreeBSD kernel message about your bridge means

By: cks
8 January 2025 at 03:58

Suppose, not hypothetically, that you're operating a FreeBSD based bridging firewall (or some other bridge situation) and you see something like the following kernel message:

kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix0 to ix1
kernel: bridge0: mac address 01:02:03:04:05:06 vlan 0 moved from ix1 to ix0

The bad news is that this message means what you think it means. Your FreeBSD bridge between ix0 and ix1 first saw this MAC address as the source address on a packet it received on the ix0 interface of the bridge, and then it saw the same MAC address as the source address of a packet received on ix1, and then it received another packet on ix0 with that MAC address as the source address. Either you have something echoing those packets back on one side, or there is a network path between the two sides that bypasses your bridge.

(If you're lucky this happens regularly. If you're not lucky it happens only some of the time.)

This particular message comes from bridge_rtupdate() in sys/net/if_bridge.c, which is called to update the bridge's 'routing entries', which here means MAC addresses, not IP addresses. This function is called from bridge_forward(), which forwards packets, which is itself called from bridge_input(), which handles received packets. All of this only happens if the underlying interfaces are in 'learning' mode, but this is the default.

As covered in the ifconfig manual page, you can inspect what MAC addresses have been learned on which device with 'ifconfig bridge0 addr' (covered in the 'Bridge Interface Parameters' section of the manual page). This may be useful to see if your bridge normally has a certain MAC address (perhaps the one that's moving) on the interface it should be on. If you want to go further, it's possible to set a static mapping for some MAC addresses, which will make them stick to one interface even if seen on another one.
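
For example, using the MAC address from the message above:

ifconfig bridge0 addr                           # list learned MACs and their interfaces
ifconfig bridge0 static ix0 01:02:03:04:05:06   # pin this MAC to ix0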

Logging this message is controlled by the net.link.bridge.log_mac_flap sysctl, and it's rate limited to only being reported five times a second in general (using ppsratecheck()). That's five times total, even if each time is a different MAC address or even a different bridge. This 'five times a second' log count isn't controllable through a sysctl.

(I'm writing all of this down because I looked much of it up today. Sometimes I'm a system programmer who goes digging in the (FreeBSD) kernel source just to be sure.)

The issue with DNF 5 and script output in Fedora 41

By: cks
7 January 2025 at 04:45

These days Fedora uses DNF as its high(er) level package management software, replacing yum. However, there are multiple versions of DNF, which behave somewhat differently. Through Fedora 40, the default version of DNF was DNF 4; in Fedora 41, DNF is now DNF 5. DNF 5 brings a number of improvements but it has at least one issue that makes me unhappy with it in my specific situation. Over on the Fediverse I said:

Oh nice, DNF 5 in Fedora 41 has nicely improved the handling of output from RPM scriptlets, so that you can more easily see that it's scriptlet output instead of DNF messages.

[later]

I must retract my praise for DNF 5 in Fedora 41, because it has actually made the handling of output from RPM scriptlets *much* worse than in dnf 4. DNF 5 will repeatedly re-print the current output to date of scriptlets every time it updates a progress indicator of, for example, removing packages. This results in a flood of output for DKMS module builds during kernel updates. Dnf 5's cure is far worse than the disease, and there's no way to disable it.

<bugzilla 2331691>

(Fedora 41 specifically has dnf5-5.2.8.1, at least at the moment.)

This can be mostly worked around for kernel package upgrades and DKMS modules by manually removing and upgrading packages before the main kernel upgrade. You want to do this so that dnf is removing as few packages as possible while your DKMS modules are rebuilding. This is done with:

  1. Upgrade all of your non-kernel packages first:

    dnf upgrade --exclude 'kernel*'
    

  2. Remove the following packages for the old kernel:

    kernel kernel-core kernel-devel kernel-modules kernel-modules-core kernel-modules-extra

    (It's probably easier to do 'dnf remove kernel*<version>*' and let DNF sort it out.)

  3. Upgrade two kernel packages that you can do in advance:

    dnf upgrade kernel-tools kernel-tools-libs
    

Unfortunately in Fedora 41 this still leaves you with one RPM package that you can't upgrade in advance and that will be removed while your DKMS module is rebuilding, namely 'kernel-devel-matched'. To add extra annoyance, this is a virtual package that contains no files, and you can't remove it because a lot of things depend on it.

As far as I can tell, DNF 5 has absolutely no way to shut off its progress bars. It completely ignores $TERM, and I can't see any other way of suppressing them that leaves DNF usable. It would have been nice to have some command line switches to control this, but it seems pretty clear that this wasn't high on the DNF 5 road map.

(Although I don't expect this to be fixed in Fedora 41 over its lifetime, I am still deferring the Fedora 41 upgrades of my work and home desktops for as long as possible to minimize the amount of DNF 5 irritation I have to deal with.)

WireGuard's AllowedIPs aren't always the (WireGuard) routes you want

By: cks
6 January 2025 at 04:35

A while back I wrote about understanding WireGuard's AllowedIPs, and also recently I wrote about how different sorts of WireGuard setups have different difficulties, where one of the challenges for some setups is setting up what you want routed through WireGuard connections. As Ian Z aka nobrowser recently noted in a comment on the first entry, these days many WireGuard related programs (such as wg-quick and NetworkManager) will automatically set routes for you based on AllowedIPs. Much of the time this will work fine, but there are situations where adding routes for all AllowedIPs ranges isn't what you want.

WireGuard's AllowedIPs setting for a particular peer controls two things at once: what (inside-WireGuard) source IP addresses you will accept from the peer, and what destination addresses WireGuard will send to that peer if the packet is sent to that WireGuard interface. However, it's the routing table that controls what destination addresses are sent to a particular WireGuard interface (or more likely a combination of IP policy routing rules and some routing table).

If your WireGuard IP address is only reachable from other WireGuard peers, you can sensibly bound your AllowedIPs so that the collection of all of them matches the routing table. This is also more or less doable if some of them are gateways for additional networks; hopefully your network design puts all of those networks under some subnet and the subnet isn't too big. However, if your WireGuard IP can wind up being reached by a broader range of source IPs, or even 'all of the Internet' (as is my case), then your AllowedIPs range is potentially much larger than what you want to always be routed to WireGuard.

A related case is if you have a 'work VPN' WireGuard configuration where you could route all of your traffic through your WireGuard connection but some of the time you only want to route traffic to specific (work) subnets. Unless you like changing AllowedIPs all of the time or constructing two different WireGuard interfaces and only activating the correct one, you'll want an AllowedIPs that accepts everything but some of the time you'll only route specific networks to the WireGuard interface.
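
As an illustrative sketch with made-up names and subnets, such a setup accepts everything from the peer while only routing the work subnets to it:

# accept any inside-WireGuard source IP from this peer
wg set work-wg0 peer '<peer-public-key>' allowed-ips 0.0.0.0/0
# but only send traffic for a work subnet through the tunnel
ip route add 10.W.0.0/16 dev work-wg0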

(On the other hand, with the state of things in Linux, having two separate WireGuard interfaces might be the easiest way to manage this in NetworkManager or other tools.)

I think that most people's use of WireGuard will probably involve AllowedIPs settings that also work for routing, provided that the tools involved handle the recursive routing problem. These days, NetworkManager handles that for you, although I don't know about wg-quick.

(This is one of the entries that I write partly to work it out in my own head. My own configuration requires a different AllowedIPs than the routes I send through the WireGuard tunnel. I make this work with policy based routing.)

My unusual X desktop wasn't made 'from scratch' in a conventional sense

By: cks
1 January 2025 at 04:10

There are people out there who set up unusual (Unix) environments for themselves from scratch; for example, Mike Hoye recently wrote Idiosyncra. While I have an unusual desktop, I haven't built it from scratch in quite the same way that Mike Hoye and other people have; instead I've wound up with my desktop through a rather easier process.

It would be technically accurate to say that my current desktop environment has been built up gradually over time (including over the time I've been writing Wandering Thoughts, such as my addition of dmenu). But this isn't really how it happened, in that I didn't start from a normal desktop and slowly change it into my current one. The real story is that the core of my desktop dates from the days when everyone's X desktops looked like mine does. Technically there were what we would call full desktops back in those days, if you had licensed the necessary software from your Unix vendor and chose to run it, but hardware was sufficiently slow back then that people at universities almost always chose to run more lightweight environments (especially since they were often already using the inexpensive and slow versions of workstations).

(Depending on how much work your local university system administrators had done, your new Unix account might start out with the Unix vendor's X setup, or it could start out with what X11R<whatever> defaulted to when built from source, or it might be some locally customized setup. In all cases you often were left to learn about the local tastes in X desktops and how to improve yours from people around you.)

To show how far back this goes (which is to say how little of it has been built 'from scratch' recently), my 1996 SGI Indy desktop has much of the look and the behavior of my current desktop, and its look and behavior wasn't new then; it was an evolution of my desktop from earlier Unix workstations. When I started using Linux, I migrated my Indy X environment to my new (and better) x86 hardware, and then as Linux has evolved and added more and more things you have to run to have a usable desktop with things like volume control, your SSH agent, and automatically mounted removable media, I've added them piece by piece (and sometimes updated them as how you do this keeps changing).

(At some point I moved from twm as my window manager to fvwm, but that was merely redoing my twm configuration in fvwm, not designing a new configuration from scratch.)

I wouldn't want to start from scratch today to create a new custom desktop environment; it would be a lot of work (and the one time I looked at it I wound up giving up). Someday I will have to move from X, fvwm, dmenu, and so on to some sort of Wayland based environment, but even when I do I expect to make the result as similar to my current X setup as I can, rather than starting from a clean sheet design. I know what I want because I'm very used to my current environment and I've been using variants of it for a very long time now.

(This entry was sparked by Ian Z aka nobrowser's comment on my entry from yesterday.)

PS: Part of the long lineage and longevity of my X desktop is that I've been lucky and determined enough to use Unix and X continuously at work, and for a long time at home as well. So I've never had a time when I moved away from X on my desktop(s) and then had to come back to reconstruct an environment and catch it up to date.

PPS: This is one of the roots of my xdm heresy, where my desktops boot into a text console and I log in there to manually start X with a personal script that's a derivative of the ancient startx command.

In an unconfigured Vim, I want to do ':set paste' right away

By: cks
29 December 2024 at 03:53

Recently I wound up using a FreeBSD machine, where I promptly installed vim for my traditional reason. When I started modifying some files, I had contents to paste in from another xterm window, so I tapped my middle mouse button while in insert mode (ie, I did the standard xterm 'paste text' thing). You may imagine the 'this is my face' meme when what vim inserted was the last thing I'd deleted in vim on that FreeBSD machine, instead of my X text selection.

For my future use, the cure for this is ':set paste', which turns off basically all of vim's special handling of pasted text. I've traditionally used this to override things like vim auto-indenting or auto-commenting the text I'm pasting in, but it also turns off vim's special mouse handling, which is generally active in terminal windows, including over SSH.

(The defaults for ':set mouse' seem to vary from system to system and probably vim build to vim build. For whatever reason, this FreeBSD system and its vim defaulted to 'mouse=a', ie special mouse handling was active all the time. I've run into mouse handling limits in vim before, although things may have changed since then.)

In theory, as covered in Vim's X11 selection mechanism, I might be able to paste from another xterm (or whatever) using "*p (to use the '"*' register, which is the primary selection or the cut buffer if there's no primary selection). In practice I think this only works under limited circumstances (although I'm not sure what they are) and the Vim manual itself tells you to get used to using Shift with your middle mouse button. I would rather set paste mode, because that gets everything; a vim that has the mouse active probably has other things I don't want turned on too.

(Some day I'll put together a complete but minimal collection of vim settings to disable everything I want disabled, but that day isn't today.)
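
As a starting point, a minimal version of that collection might be the following ~/.vimrc fragment. This is only a sketch of what this entry suggests I want turned off, not a tested or complete set, and 'set paste' as a standing default is a blunt instrument with side effects:

" turn off vim's special mouse handling entirely
set mouse=
" turn off special treatment of input, including pasted text
set paste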

PS: If I'm reading various things correctly, I think vim has to be built with the 'xterm_clipboard' option in order to pull out selection information from xterm. Xterm itself must have 'Window Ops' allowed, which is not a normal setting; with this turned on, vim (or any other program) can use the selection manipulation escape sequences that xterm documents in "Operating System Commands". These escape sequences don't require that vim have direct access to your X display, so they can be used over plain SSH connections. Support for these escape sequences is probably available in other terminal emulators too, and these terminal emulators may have them always enabled.

(Note that access to your selection is a potential security risk, which is probably part of why xterm doesn't allow it by default.)

Cgroup V2 memory limits and their potential for thrashing

By: cks
28 December 2024 at 04:10

Recently I read 32 MiB Working Sets on a 64 GiB machine (via), which recounts how under some situations, Windows could limit the working set ('resident set') of programs to 32 MiB, resulting in a lot of CPU time being spent on soft (or 'minor') page faults. On Linux, you can do similar things to limit memory usage of a program or an entire cgroup, for example through systemd, and it occurred to me to wonder if you can get the same thrashing effect with cgroup V2 memory limits. Broadly, I believe that the answer depends on what you're using the memory for and what you use to set limits, and it's certainly possible to wind up setting limits so that you get thrashing.

(As a result, this is now something that I'll want to think about when setting cgroup memory limits, and maybe watch out for.)

Cgroup V2 doesn't have anything that directly limits a cgroup's working set (what is usually called the 'resident set size' (RSS) on Unix systems). The closest it has is memory.high, which throttles a cgroup's memory usage and puts it under heavy memory reclaim pressure when it hits this high limit. What happens next depends on what sort of memory pages are being reclaimed from the process. If they are backed by files (for example, they're pages from the program, shared libraries, or memory mapped files), they will be dropped from the process's resident set but may stay in memory so it's only a soft page fault when they're next accessed. However, if they're anonymous pages of memory the process has allocated, they must be written to swap (if there's room for them) and I don't know if the original pages stay in memory afterward (and so are eligible for a soft page fault when next accessed). If the process keeps accessing anonymous pages that were previously reclaimed, it will thrash on either soft or hard page faults.

(The memory.high limit is set by systemd's MemoryHigh=.)

However, the memory usage of a cgroup is not necessarily in ordinary process memory that counts for RSS; it can be in all sorts of kernel caches and structures. The memory.high limit affects all of them and will generally shrink all of them, so in practice what it actually limits depends partly on what the processes in the cgroup are doing and what sort of memory that allocates. Some of this memory can also thrash like user memory does (for example, memory for disk cache), but some won't necessarily (I believe shrinking some sorts of memory usage discards the memory outright).

Since memory.high is to a certain degree advisory and doesn't guarantee that the cgroup never goes over this memory usage, I think people more commonly use memory.max (for example, via the systemd MemoryMax= setting). This is a hard limit and will kill programs in the cgroup if they push hard on going over it; however, the memory system will try to reduce usage with other measures, including pushing pages into swap space. In theory this could result in either swap thrashing or soft page fault thrashing, if the memory usage was just right. However, in our environments cgroups that hit memory.max generally wind up having programs killed rather than sitting there thrashing (at least for very long). This is probably partly because we don't configure much swap space on our servers, so there's not much room between hitting memory.max with swap available and exhausting the swap space too.

My view is that this generally makes it better to set memory.max than memory.high. If you have a cgroup that overruns whatever limit you're setting, using memory.high is much more likely to cause some sort of thrashing because it never kills processes (the kernel documentation even tells you that memory.high should be used with some sort of monitoring to 'alleviate heavy reclaim pressure', ie either raise the limit or actually kill things). In a past entry I set MemoryHigh= to a bit less than my MemoryMax setting, but I don't think I'll do that in the future; any gap between memory.high and memory.max is an opportunity for thrashing through that 'heavy reclaim pressure'.
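
If you want to watch a cgroup for signs of this sort of thrashing, the cgroup's own accounting is a decent first stop. A sketch (the cgroup path is illustrative, and memory.pressure requires a kernel with PSI enabled):

CG=/sys/fs/cgroup/system.slice/example.service
cat $CG/memory.events     # the 'high' and 'max' counters track limit hits
cat $CG/memory.pressure   # climbing 'some'/'full' averages suggest thrashing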

WireGuard on OpenBSD just works (at least as a VPN server)

By: cks
27 December 2024 at 04:12

A year or so ago I mentioned that I'd set up WireGuard on an Android and an iOS device in a straightforward VPN configuration. What I didn't mention in that entry is that the other end of the VPN was not on a Linux machine, but on one of our OpenBSD VPN servers. At the time it was running whatever was the then-current OpenBSD version, and today it's running OpenBSD 7.6, which is the current version at the moment. Over that time (and before it, since the smartphones weren't its first WireGuard clients), WireGuard on OpenBSD has been trouble free and has just worked.

In our configuration, OpenBSD WireGuard requires installing the 'wireguard-tools' package, setting up an /etc/wireguard/wg0.conf (perhaps plus additional files for generated keys), and creating an appropriate /etc/hostname.wg0. I believe that all of these are covered as part of the standard OpenBSD documentation for setting up WireGuard. For this VPN server I allocated a /24 inside the RFC 1918 range we use for VPN service to be used for WireGuard, since I don't expect too many clients on this server. The server NATs WireGuard connections just as it NATs connections from the other VPNs it supports, which requires nothing special for WireGuard in its /etc/pf.conf.

(I did have to remember to allow incoming traffic to the WireGuard UDP port. For this server, we allow WireGuard clients to send traffic to each other through the VPN server if they really want to, but in another one we might want to restrict that with additional pf rules.)
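
Concretely, a minimal version of the two files might look something like the following sketch (the keys, port, and addresses are illustrative, and this is one way of wiring things together, not necessarily exactly how we do it):

# /etc/wireguard/wg0.conf, loaded with 'wg setconf'
[Interface]
PrivateKey = <server private key>
ListenPort = 51820

[Peer]
PublicKey = <client public key>
AllowedIPs = 172.16.10.2/32

# /etc/hostname.wg0
inet 172.16.10.1 255.255.255.0
up
!/usr/local/bin/wg setconf wg0 /etc/wireguard/wg0.conf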

Everything I'd expect to work does work, both in terms of the WireGuard tools (I believe the information 'wg' prints is identical between Linux and OpenBSD, for example) and for basic system metrics (as read out by, for example, the OpenBSD version of the Prometheus host agent, which has overall metrics for the 'wg0' interface). If we wanted per-client statistics, I believe we could probably get them through this third party WireGuard Prometheus exporter, which uses an underlying package to talk to WireGuard that does apparently work on OpenBSD (although this particular exporter can potentially have label cardinality issues), or generate them ourselves by parsing 'wg' output (likely from 'wg show all dump').

This particular OpenBSD VPN server is sufficiently low usage that I haven't tried to measure either the possible bandwidth we can achieve with WireGuard or the CPU usage of WireGuard. Historically, neither are particularly critical for our VPNs in general, which have generally not been capable of particularly high bandwidth (with either OpenVPN or L2TP, our two general usage VPN types so far; our WireGuard VPN is for system staff only).

(In an ideal world, none of this should count as surprising. In this world, I like to note when things that are a bit out of the mainstream just work for me, with a straightforward setup process and trouble free operation.)

A gotcha with importing ZFS pools and NFS exports on Linux (as of ZFS 2.3.0)

By: cks
24 December 2024 at 03:41

Ever since its Solaris origins, ZFS has supported automatic NFS and CIFS sharing of ZFS filesystems through their 'sharenfs' and 'sharesmb' properties. Part of the idea of this is that you could automatically have NFS (and SMB) shares created and removed as you did things like import and export pools, rather than have to maintain a separate set of export information and keep it in sync with what ZFS filesystems were available. On Linux, OpenZFS still supports this, working through standard Linux NFS export permissions (which don't quite match the Solaris/Illumos model that's used for sharenfs) and standard tools like exportfs. A lot of this works more or less as you'd expect, but it turns out that there's a potentially unpleasant surprise lurking in how 'zpool import' and 'zpool export' work.

In the current code, if you import or export a ZFS pool that has no filesystems with a sharenfs set, ZFS will still run 'exportfs -ra' at the end of the operation even though nothing could have changed in the NFS exports situation. An important effect that this has is that it will wipe out any manually added or changed NFS exports, reverting your NFS exports to what is currently in /etc/exports and /etc/exports.d. In many situations (including ours) this is a harmless operation, because /etc/exports and /etc/exports.d are how things are supposed to be. But in some environments you may have programs that maintain their own exports list and permissions through running 'exportfs' in various ways, and in these environments a ZFS pool import or export will destroy those exports.

(Apparently one such environment is high availability systems, some of which manually manage NFS exports outside of /etc/exports (I maintain that this is a perfectly sensible design decision). These are also the kind of environment that might routinely import or export pools, as HA pools move between hosts.)
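
You can demonstrate the effect to yourself with something like the following sketch (the host, path, and pool name are made up):

# add an export by hand, outside /etc/exports and /etc/exports.d
exportfs -o rw somehost:/srv/scratch
exportfs -s    # the manual export shows up
# import (or export) any pool, even one with no 'sharenfs' filesystems
zpool import tank
exportfs -s    # the manual export is gone; only /etc/exports survives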

The current OpenZFS code runs 'exportfs -ra' entirely blindly. It doesn't matter if you don't NFS export any ZFS filesystems, much less any from the pool that you're importing or exporting. As long as an 'exportfs' binary is on the system and can be executed, ZFS will run it. Possibly this could be changed if someone was to submit an OpenZFS bug report, but for a number of reasons (including that we're not directly affected by this and aren't in a position to do any testing), that someone will not be me.

(As far as I can tell this is the state of the code in all Linux OpenZFS versions up through the current development version and 2.3.0-rc4, the latest 2.3.0 release candidate.)

Appendix: Where this is in the current OpenZFS source code

The exportfs execution is done in nfs_commit_shares() in lib/libshare/os/linux/nfs.c. This is called (indirectly) by sa_commit_shares() in lib/libshare/libshare.c, which is called by zfs_commit_shares() in lib/libzfs/libzfs_mount.c. In turn this is called by zpool_enable_datasets() and zpool_disable_datasets(), also in libzfs_mount.c, which are called as part of 'zpool import' and 'zpool export' respectively.

(As a piece of trivia, zpool_disable_datasets() will also be called during 'zpool destroy'.)

On the US FAA's response to Falcon 9 debris

By: VM
4 March 2025 at 10:06

On February 1, SpaceX launched its Starlink 11-4 mission onboard a Falcon 9 rocket. The rocket's reusable first stage returned safely to the ground and the second stage remained in orbit after deploying the Starlink satellites. It was to deorbit later in a controlled procedure and land somewhere in the Pacific Ocean. But on February 19 it was seen breaking up in the skies over Denmark, England, Poland, and Sweden, with some larger pieces crashing into parts of Poland. After the Polish space agency determined the debris to belong to a SpaceX Falcon 9 rocket, the US Federal Aviation Administration (FAA) was asked about its liability. This was its response:

The FAA determined that all flight events for the SpaceX Starlink 11-4 mission occurred within the scope of SpaceX's licensed activities and that SpaceX satisfied safety at end-of-launch requirements. Per post-launch reporting requirements, SpaceX must identify any discrepancy or anomaly that occurred during the launch to the FAA within 90-days. The FAA has not identified any events that should be classified as a mishap at this time. Licensed flight activities and FAA oversight concluded upon SpaceX's last exercise of control over the Falcon 9 vehicle. SpaceX posted information on its website that the second stage from this launch reentered over Europe. The FAA is not investigating the uncontrolled reentry of the second stage nor the debris found in Poland.

I've spotted a lot of people on the internet (not trolls) describing this response as being in line with Donald Trump's "USA first" attitude and reckless disregard for the consequences of his government's actions and policies on other countries. It's understandable given how his meeting with Zelenskyy on February 28 played out as well as NASA acting administrator Janet Petro's disgusting comment about US plans to "dominate" lunar and cislunar space. However, the FAA's position has been unchanged since at least August 18, 2023, when it issued a "notice of proposed rulemaking" designated 88 FR 56546. Among other things:

The proposed rule would … update definitions relating to commercial space launch and reentry vehicles and occupants to reflect current legislative definitions … as well as implement clarifications to financial responsibility requirements in accordance with the United States Commercial Space Launch Competitiveness Act.

Under Section 401.5 2(i), the notice stated:

(1) Beginning of launch. (i) Under a license, launch begins with the arrival of a launch vehicle or payload at a U.S. launch site.

The FAA's position has likely stayed the same for some duration before the August 2023 date. According to Table 1 in the notice, the "effect of change" of the clarification of the term "Launch", under which Section 401.5 2(i) falls, is:

None. The FAA has been applying these definitions in accordance with the statute since the [US Commercial Space Launch Competitiveness Act 2015] went into effect. This change would now provide regulatory clarity.

Skipping back a bit further, the FAA issued a "final rule" on "Streamlined Launch and Reentry License Requirements" on September 30, 2020. The rule states (pp. 680-681) under Section 450.1 (b) 3:

(i) For an orbital launch of a vehicle without a reentry of the vehicle, launch ends after the licensee’s last exercise of control over its vehicle on orbit, after vehicle component impact or landing on Earth, after activities necessary to return the vehicle or component to a safe condition on the ground after impact or landing, or after activities necessary to return the site to a safe condition, whichever occurs latest;
(ii) For an orbital launch of a vehicle with a reentry of the vehicle, launch ends after deployment of all payloads, upon completion of the vehicle's first steady-state orbit if there is no payload deployment, after vehicle component impact or landing on Earth, after activities necessary to return the vehicle or component to a safe condition on the ground after impact or landing, or after activities necessary to return the site to a safe condition, whichever occurs latest; …

In part B of this document, under the heading "Detailed Discussion of the Final Rule" and further under the sub-heading "End of Launch", the FAA presents the following discussion:

[Commercial Spaceflight Federation] and SpaceX suggested that orbital launch without a reentry in proposed §450.3(b)(3)(i) did not need to be separately defined by the regulation, stating that, regardless of the type of launch, something always returns: Boosters land or are disposed, upper stages are disposed. CSF and SpaceX further requested that the FAA not distinguish between orbital and suborbital vehicles for end of launch.
The FAA does not agree because the distinctions in § 450.3(b)(3)(i) and (ii) are necessary due to the FAA's limited authority on orbit. For a launch vehicle that will eventually return to Earth as a reentry vehicle, its on-orbit activities after deployment of its payload or payloads, or completion of the vehicle's first steady-state orbit if there is no payload, are not licensed by the FAA. In addition, the disposal of an upper stage is not a reentry under 51 U.S.C. Chapter 509, because the upper stage does not return to Earth substantially intact.

From 51 USC Chapter 509, Section 401.7:

Reentry vehicle means a vehicle designed to return from Earth orbit or outer space to Earth substantially intact. A reusable launch vehicle that is designed to return from Earth orbit or outer space to Earth substantially intact is a reentry vehicle.

This means Section 450.1 (b) 3(i) under "Streamlined Launch and Reentry License Requirements" of 2020 applies to the uncontrolled deorbiting of the Falcon 9 upper stage in the Starlink 11-4 mission. In particular, according to the FAA, the launch ended "after the licensee’s last exercise of control over its vehicle on orbit", which was the latest relevant event.

Back to the "Detailed Discussion of the Final Rule":

Both CSF and SpaceX proposed “end of launch” should be defined on a case-by-case basis in pre-application consultation and specified in the license. The FAA disagrees, in part. The FAA only regulates on a case-by-case basis if the nature of an activity makes it impossible for the FAA to promulgate rules of general applicability. This need has not arisen, as evidenced by decades of FAA oversight of end-of-launch activities. That said, because the commercial space transportation industry continues to innovate, §450.3(a) gives the FAA the flexibility to adjust the scope of license, including end of launch, based on unique circumstances as agreed to by the Administrator.

The world currently doesn't have a specific international law or agreement dealing with accountability for space debris that crashes to the earth, including paying for the damages such debris wreaks and imposing penalties on offending launch operators. In light of this fact, it's important to remember that the FAA's position, even if it seems disagreeable, has been unchanged for some time even as it has regularly updated its rulemaking to accommodate private sector innovation within the spirit of the existing law.

Trump is an ass and I'm not holding out for him to look out for the concerns of other countries when pieces of made-in-USA rockets descend in uncontrolled fashion over their territories, damaging property or even taking lives. But that the FAA didn't develop its present position afresh under Trump 2.0, and that it was really developed with feedback from SpaceX and other US-based spaceflight operators, is important to understand that its attitude towards crashing debris goes beyond ideology, encompassing the support of both Democrat and Republican governments over the years.

A Signal container

10 November 2024 at 00:00

Signal is an application for secure and private messaging that is free, open source, and easy to use. It uses strong end-to-end encryption and is used by many activists, journalists, and whistleblowers, as well as by government officials and business people; in short, by everyone who values their privacy. Signal runs on mobile phones with Android and iOS, and also on desktop computers (Linux, Windows, MacOS), where the desktop version is designed to be linked with the mobile copy of Signal. This lets us use all of Signal's features both on the phone and on the desktop computer, and all messages, contacts, and so on are synchronized between the two devices. All well and good, but Signal is (unfortunately) tied to a phone number, and as a rule you can run only one copy of Signal on one phone; the same goes for the desktop computer. Can this limitation be worked around? Certainly, but it takes a small "hack". Read on to find out how.

Running multiple copies of Signal on a phone

Running multiple copies of Signal on a phone is very easy, but only if you use GrapheneOS. GrapheneOS is a mobile operating system with numerous built-in security mechanisms, designed to protect the user's privacy as well as possible. It is open source and highly compatible with Android, with many improvements that make forensic seizure of data, as well as attacks with spyware like Pegasus and Predator, extremely difficult or outright impossible.

GrapheneOS supports multiple profiles (up to 31, plus a so-called guest user profile) that are completely separated from each other. This means you can install different applications in different profiles, keep entirely different contact lists, use one VPN in one profile and a different one (or none at all) in another, and so on.

The solution is therefore simple. On a phone running GrapheneOS we create a new profile, install a fresh copy of Signal there, insert a second SIM card into the phone, and link Signal to the new number.

Once the phone number is registered, we can remove that SIM card and put the old one back, because Signal only uses data transfer for its communication (and of course the phone can also be used without a SIM card at all, on WiFi only). The phone now has two copies of Signal installed, tied to two different phone numbers, and we can send messages (even between the two of them!) or make calls from both.

Although the profiles are separated, we can arrange for notifications from the Signal application in the second profile to arrive even while we are logged into the first profile. Only for writing messages or making calls do we have to switch to the right profile on the phone.

Simple, isn't it?

Running multiple copies of Signal on a computer

Now we would of course like something similar on a computer as well: the ability to run two different instances of Signal (each tied to its own phone number) on one computer, under a single user.

At first glance this looks slightly more complicated, but with the help of virtualization the problem can be solved elegantly. Of course we won't run a whole new virtual machine on the computer just for Signal; we can use a so-called container instead.

On Linux we first install the systemd-container package (on Ubuntu systems it is already installed by default).

On the host computer we enable so-called unprivileged user namespaces: we run sudo nano /etc/sysctl.d/nspawn.conf and put the following into the file:

kernel.unprivileged_userns_clone=1

Now the systemd sysctl service needs to be reloaded and restarted:

sudo systemctl daemon-reload
sudo systemctl restart systemd-sysctl.service
sudo systemctl status systemd-sysctl.service

…and then we can install debootstrap: sudo apt install debootstrap.

Now we create a new container into which we will install the Debian operating system (specifically its stable release); in reality only the minimally required code of the operating system gets installed:

sudo debootstrap --include=systemd,dbus stable /var/lib/machines/debian

We get output roughly like this:

I: Keyring file not available at /usr/share/keyrings/debian-archive-keyring.gpg; switching to https mirror https://deb.debian.org/debian
I: Retrieving InRelease 
I: Retrieving Packages 
I: Validating Packages 
I: Resolving dependencies of required packages...
I: Resolving dependencies of base packages...
I: Checking component main on https://deb.debian.org/debian...
I: Retrieving adduser 3.134
I: Validating adduser 3.134
...
...
...
I: Configuring tasksel-data...
I: Configuring libc-bin...
I: Configuring ca-certificates...
I: Base system installed successfully.

The container with the Debian operating system is now installed, so we start it and set the root user's password:

sudo systemd-nspawn -D /var/lib/machines/debian -U --machine debian

We get this output:

Spawning container debian on /var/lib/machines/debian.
Press Ctrl-] three times within 1s to kill container.
Selected user namespace base 1766326272 and range 65536.
root@debian:~#

Now we connect to the operating system through this virtual terminal and enter the following two commands:

passwd
printf 'pts/0\npts/1\n' >> /etc/securetty

The first command sets the password; the second enables logins over the local terminal (TTY). Finally we enter the logout command and drop back out to the host computer.

Now we need to set up the network the container will use. The simplest approach is to just use the host computer's network. We enter the following two commands:

sudo mkdir /etc/systemd/nspawn
sudo nano /etc/systemd/nspawn/debian.nspawn

We put the following into the file:

[Network]
VirtualEthernet=no

Now we start the container again with the command sudo systemctl start systemd-nspawn@debian, or even more simply with machinectl start debian.

We can also list the running containers:

machinectl list
MACHINE CLASS     SERVICE        OS     VERSION ADDRESSES
debian  container systemd-nspawn debian 12      -        

1 machines listed.

Or we connect to this virtual container: machinectl login debian. We get this output:

Connected to machine debian. Press ^] three times within 1s to exit session.

Debian GNU/Linux 12 cryptopia pts/1

cryptopia login: root
Password: 

The output shows that we connected as the root user, with the password we set earlier.

Now we install Signal Desktop in this container:

apt update
apt install wget gpg

wget -O- https://updates.signal.org/desktop/apt/keys.asc | gpg --dearmor > signal-desktop-keyring.gpg

echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/signal-desktop-keyring.gpg] https://updates.signal.org/desktop/apt xenial main' | tee /etc/apt/sources.list.d/signal-xenial.list

apt update
apt install --no-install-recommends signal-desktop
halt

The last command shuts the container down. A fresh copy of the Signal Desktop application is now installed in it.

By the way, if we want, we can rename the container to a friendlier name, e.g. sudo machinectl rename debian debian-signal. Of course we will then have to use that same name to start the container as well (that is, machinectl login debian-signal).

Now we write a script that starts the container and launches Signal Desktop in it in such a way that we see its window on the host computer's desktop:

We create the file with nano /opt/runContainerSignal.sh (saving it, for example, in the /opt directory); its contents are as follows:

#!/bin/sh
xhost +local:
pkexec systemd-nspawn --setenv=DISPLAY=:0 \
                      --bind-ro=/tmp/.X11-unix/  \
                      --private-users=pick \
                      --private-users-chown \
                      -D /var/lib/machines/debian-signal/ \
                      --as-pid2 signal-desktop --no-sandbox
xhost -local:

The first xhost command allows connections to our display, but only from the local computer; the second xhost command blocks those connections (to the display) again. We make the script executable (chmod +x runContainerSignal.sh), and that's it.

[Image: two icons of the Signal Desktop application]

Well, not quite yet, since we would have to run the script in a terminal; launching it with a click on an icon is much more convenient.

So let's create a .desktop file: nano ~/.local/share/applications/runContainerSignal.desktop. Into it we write the following contents:

[Desktop Entry]
Type=Application
Name=Signal Container
Exec=/opt/runContainerSignal.sh
Icon=security-high
Terminal=false
Comment=Run Signal Container

…instead of the security-high icon we can use some other one, for example:

Icon=/usr/share/icons/Yaru/scalable/status/security-high-symbolic.svg

A note: the file is stored in ~/.local/share/applications/, so it is accessible only to this specific user and not to all users on the computer.

Now we make the .desktop file executable: chmod +x ~/.local/share/applications/runContainerSignal.desktop

We refresh the so-called desktop entries with update-desktop-database ~/.local/share/applications/, and that's it!

[Image: two instances of the Signal Desktop application]

When we type "Signal Container" into the application launcher, the application's icon appears, and clicking on it starts Signal in the container (though we will have to enter a password to launch it).

Now we just link this Signal Desktop with a copy of Signal on the phone, and we can use two copies of the Signal Desktop application on the computer.

What about…?

Unfortunately, in the setup described, access to the camera and sound does not work, so we will still have to make calls from the phone.

It turns out that connecting the container to the host computer's PipeWire sound system and camera is incredibly complicated (at least in my setup). If you have a hint on how to solve this, you are of course welcome to let me know. :)

Mozilla Is Worried About the Proposed Fixes for Google’s Search Monopoly

By: Nick Heer
27 November 2024 at 00:46

Michael Kan, PC Magazine:

Mozilla points to a key but less eye-catching proposal from the DOJ to regulate Google’s search business, which a judge ruled as a monopoly in August. In their recommendations, federal prosecutors urged the court to ban Google from offering “something of value” to third-party companies to make Google the default search engine over their software or devices.

“The proposed remedies are designed to end Google’s unlawful practices and open up the market for rivals and new entrants to emerge,” the DOJ told the court. The problem is that Mozilla earns most of its revenue from royalty deals (nearly 86% in 2022), making Google the default Firefox browser search engine.

This is probably another reason why U.S. prosecutors want to jettison Chrome from Google: they want to reduce any benefit it may accrue from trying to fix its illegal search monopoly. But it seems Google’s position in the industry is so entrenched that correcting it will hurt lots of other businesses, too. That does not mean it should not be broken up or that the DOJ’s proposed remedies are wrong, however.


Unix's buffered IO in assembly and in C

By: cks
9 December 2024 at 02:44

Recently on the Fediverse, I said something related to Unix's pre-V7 situation with buffered IO:

[...]

(I think the V1 approach is right for an assembly based minimal OS, while the stdio approach kind of wants malloc() and friends.)

The V1 approach, as documented in its putc.3 and getw.3 manual pages, is that the caller to the buffered IO routines supplies the data area used for buffering, and the library functions merely initialize it and later use it. How you get the data area is up to you and your program; you might, for example, simply have a static block of memory in your BSS segment. You can dynamically allocate this area if you want to, but you don't have to. The V2 and later putchar have a similar approach but this time they contain a static buffer area and you just have to do a bit of initialization (possibly putchar was in V1 too, I don't know for sure).

Stdio of course has a completely different API. In stdio, you don't provide the data area; instead, stdio provides you an opaque reference (a 'FILE *') to the information and buffers it maintains internally. This is an interface that definitely wants some degree of dynamic memory allocation, for example for the actual buffers themselves, and in modern usage most of the FILE objects will be dynamically allocated too.

(The V7 stdio implementation had a fixed set of FILE structs and so would error out if you used too many of them. However, it did use malloc() for the buffer associated with them, in filbuf.c and flsbuf.c.)

You can certainly do dynamic memory allocation in assembly, but I think it's much more natural in C, and certainly the C standard library is more heavyweight than the relatively small and minimal assembly language stuff early Unix programs (written in assembly) seem to have required. So I think it makes a lot of sense that Unix started with a buffering approach where the caller supplies the buffer (and probably doesn't dynamically allocate it), then moved to one where the library does at least some allocation and supplies the buffer (and other data) itself.

Buffered IO in Unix before V7 introduced stdio

By: cks
6 December 2024 at 04:16

I recently read Julia Evans' Why pipes sometimes get "stuck": buffering. Part of the reason is that almost every Unix program does some amount of buffering for what it prints (or writes) to standard output and standard error. For C programs, this buffering is built into the standard library, specifically into stdio, which includes familiar functions like printf(). Stdio is one of the many things that appeared first in Research Unix V7. This might leave you wondering if this sort of IO was buffered in earlier versions of Research Unix and if it was, how it was done.

The very earliest version of Research Unix is V1, and in V1 there is putc.3 (at that point entirely about assembly, since C was yet to come). This set of routines allows you to set up and then use a 'struct' to implement IO buffering for output. There is a similar set of buffered functions for input, in getw.3, and I believe the memory blocks the two sets of functions use are compatible with each other. The V1 manual pages note it as a bug that the buffer wasn't 512 bytes, but also notes that several programs would break if the size was changed; the buffer size will be increased to 512 bytes by V3.

In V2, I believe we still have putc and getw, but we see the first appearance of another approach, in putchr.s. This implements putchar(), which is used by printf() and which (from later evidence) uses an internal buffer (under some circumstances) that has to be explicitly flush()'d by programs. In V3, there's manual pages for putc.3 and getc.3 that are very similar to the V1 versions, which is why I expect these were there in V2 as well. In V4, we have manual pages for both putc.3 (plus getc.3) and putch[a]r.3, and there is also a getch[a]r.3 that's the input version of putchar(). Since we have a V4 manual page for putchar(), we can finally see the somewhat tangled way it works, rather than having to read the PDP-11 assembly. I don't have links to V5 manuals, but the V5 library source says that we still have both approaches to buffered IO.

(If you want to see how the putchar() approach was used, you can look at, for example, the V6 grep.c, which starts out with the 'fout = dup(1);' that the manual page suggests for buffered putchar() usage, and then periodically calls flush().)

In V6, there is a third approach that was added, in /usr/source/iolib, although I don't know if any programs used it. Iolib has a global array of structs, that were statically associated with a limited number of low-numbered file descriptors; an iolib function such as cflush() would be passed a file descriptor and use that to look up the corresponding struct. One innovation iolib implicitly adds is that its copen() effectively 'allocates' the struct for you, in contrast to putc() and getc(), where you supply the memory area and fopen()/fcreate() merely initialize it with the correct information.

Finally V7 introduces stdio and sorts all of this out, at the cost of some code changes. There's still getc() and putc(), but now they take a FILE *, instead of their own structure, and you get the FILE * from things like fopen() instead of supplying it yourself and having a stdio function initialize it. Putchar() (and getchar()) still exist but are now redone to work with stdio buffering instead of their own buffering, and 'flush()' has become fflush() and takes an explicit FILE * argument instead of implicitly flushing putchar()'s buffer, and generally it's not necessary any more. The V7 grep.c still uses printf(), but now it doesn't explicitly flush anything by calling fflush(); it just trusts in stdio.

Using systemd-run to limit something's memory usage in cgroups v2

By: cks
1 December 2024 at 03:59

Once upon a time I wrote an entry about using systemd-run to limit something's RAM consumption. This was back in the days of cgroups v1 (also known as 'non-unified cgroups'), and we're now in the era of cgroups v2 ('unified cgroups') and also ZRAM based swap. This means we want to make some adjustments, especially if you're dealing with programs with obnoxiously large RAM usage.

As before, the basic thing you want to do is run your program or thing in a new systemd user scope, which is done with 'systemd-run --user --scope ...'. You may wish to give it a unit name as well, '--unit <name>', especially if you expect it to persist a while and you want to track it specifically. Systemd will normally automatically clean up this scope when everything in it exits, and the scope is normally connected to your current terminal and otherwise more or less acts normally as an interactive process.

To actually do anything with this, we need to set some systemd resource limits. To limit memory usage, the minimum is a MemoryMax= value. It may also work better to set MemoryHigh= to a value somewhat below the absolute limit of MemoryMax. If you're worried about whatever you're doing running your system out of memory and your system uses ZRAM based swap, you may also want to set a MemoryZSwapMax= value so that the program doesn't chew up all of your RAM by 'swapping' it to ZRAM and filling that up. Without a ZRAM swap limit, you might find that the program actually uses MemoryMax RAM plus your entire ZRAM swap RAM, which might be enough to trigger a more general OOM. So this might be:

systemd-run --user --scope -p MemoryHigh=7G -p MemoryMax=8G -p MemoryZSwapMax=1G ./mach build

(Good luck with building Firefox in merely 8 GBytes of RAM, though. And obviously if you do this regularly, you're going to want to script it.)

If you normally use ZRAM based swap and you're worried about the program running you out of memory that way, you may want to create some actual swap space that the program can be turned loose on. These days, this is as simple as creating a 'swap.img' file somewhere and then swapping onto it:

cd /
dd if=/dev/zero of=swap.img bs=1MiB count=$((4*1024))
mkswap swap.img
swapon /swap.img

(You can use swapoff to stop swapping to this image file after you're done running your big program.)

Then you may want to also limit how much of this swap space the program can use, which is done with a MemorySwapMax= value. I've read both systemd's documentation and the kernel's cgroup v2 memory controller documentation, and I can't tell whether the ZRAM swap maximum is included in the swap maximum or is separate. I suspect that it's included in the swap maximum, but if it really matters you should experiment.

If you also want to limit the program's CPU usage, there are two options. The easiest one to set is CPUQuota=. The drawback of CPU quota limits is that programs may not realize that they're being restricted by such a limit and may wind up running a lot more threads (or processes) than they should, increasing the chances of overloading things. The more complex approach, but one that is more legible to programs, is to restrict what CPUs they can run on using taskset(1).

(While systemd has AllowedCPUs=, this is a cgroup setting and doesn't show up in the interface used by taskset and sched_getaffinity(2).)
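
Concretely, the two approaches look like this (the quota and the CPU numbers are illustrative):

# a quota: at most four CPUs worth of time, spread over all CPUs
systemd-run --user --scope -p CPUQuota=400% -p MemoryMax=8G ./mach build
# affinity: programs that check sched_getaffinity() will see four CPUs
taskset -c 0-3 systemd-run --user --scope -p MemoryMax=8G ./mach build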

Systemd also has CPUWeight=, but I have limited experience with it; see fair share scheduling in cgroup v2 for what I know. You might want the special value 'idle' for very low priority programs.

What NFS server threads do in the Linux kernel

By: cks
26 November 2024 at 03:40

If we ignore the network stack and take an abstract view, the Linux kernel NFS server needs to do things at various different levels in order to handle NFS client requests. There is NFS specific processing (to deal with things like the NFS protocol and NFS filehandles), general VFS processing (including maintaining general kernel information like dentries), then processing in whatever specific filesystem you're serving, and finally some actual IO if necessary. In the abstract, there are all sorts of ways to split up the responsibility for these various layers of processing. For example, if the Linux kernel supported fully asynchronous VFS operations (which it doesn't), the kernel NFS server could put all of the VFS operations in a queue and let the kernel's asynchronous 'IO' facilities handle them and notify it when a request's VFS operations were done. Even with synchronous VFS operations, you could split the responsibility between some front end threads that handled the NFS specific side of things and a backend pool of worker threads that handled the (synchronous) VFS operations.

(This would allow you to size the two pools differently, since ideally they have different constraints. The NFS processing is more or less CPU bound, and so sized based on how much of the server's CPU capacity you wanted to use for NFS; the VFS layer would ideally be IO bound, and could be sized based on how much simultaneous disk IO it was sensible to have. There is some hand-waving involved here.)

The actual, existing Linux kernel NFS server takes the much simpler approach. The kernel NFS server threads do everything. Each thread takes an incoming NFS client request (or a group of them), does NFS level things like decoding NFS filehandles, and then calls into the VFS to actually do operations. The VFS will call into the filesystem, still in the context of the NFS server thread, and if the filesystem winds up doing IO, the NFS server thread will wait for that IO to complete. When the thread of execution comes back out of the VFS, the NFS thread then does the NFS processing to generate replies and dispatch them to the network.

This unfortunately makes it challenging to answer the question of how many NFS server threads you want to use. The NFS server threads may be CPU bound (if they're handling NFS requests from RAM and the VFS's caches and data structures), or they may be IO bound (as they wait for filesystem IO to be performed, usually for reading and writing files). When you're IO bound, you probably want enough NFS server threads so that you can wait on all of the IO and still have some threads left over to handle the collection of routine NFS requests that can be satisfied from RAM. When you're CPU bound, you don't want any more NFS server threads than you have CPUs, and maybe you want a bit less.

If you're lucky, your workload is consistently and predictably one or the other. If you're not lucky (and we're not), your workload can be either of these at different times or (if we're really out of luck) both at once. Energetic people with NFS servers that have no other real activity can probably write something that automatically tunes the number of NFS threads up and down in response to a combination of the load average, the CPU utilization, and pressure stall information.

(We're probably just going to set it to the number of system CPUs.)

(After yesterday's question I decided I wanted to know for sure what the kernel's NFS server threads were used for, just in case. So I read the kernel code, which did have some useful side effects such as causing me to learn that the various nfsd4_<operation> functions we sometimes use bpftrace on are doing less than I assumed they were.)

The question of how many NFS server threads you should use (on Linux)

By: cks
25 November 2024 at 04:48

Today, not for the first time, I noticed that one of our NFS servers was sitting at a load average of 8 with roughly half of its overall CPU capacity used. People with experience in Linux NFS servers are now confidently predicting that this is a 16-CPU server, which is correct (it has 8 cores and 2 HT threads per core). They're making this prediction because the normal Linux default number of kernel NFS server threads to run is eight.

(Your distribution may have changed this, and if so it's most likely by changing what's in /etc/nfs.conf, which is the normal place to set this. It can be changed on the fly by writing a new value to /proc/fs/nfsd/threads.)
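
For example (the thread count is illustrative):

cat /proc/fs/nfsd/threads        # the current number of NFS server threads
echo 16 >/proc/fs/nfsd/threads   # change it on the fly (as root)
# persistently, in /etc/nfs.conf:
#   [nfsd]
#   threads=16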

Our NFS server wasn't saturating its NFS server threads because someone on a NFS client was doing a ton of IO. That might actually have slowed the requests down. Instead, there were some number of programs that were constantly making some number of NFS requests that could be satisfied entirely from (server) RAM, which explains why all of the NFS kernel threads were busy using system CPU (mostly on a spinlock, apparently, according to 'perf top'). It's possible that some of these constant requests came from code that was trying to handle hot reloading, since this is one of the sources of constant NFS 'GetAttr' requests, but I believe there's other things going on.

(Since this is the research side of a university department, we have very little visibility into what the graduate students are running on places like our little SLURM cluster.)

If you search around the Internet, you can find all sorts of advice about what to set the number of NFS server threads to on your Linux NFS server. Many of them involve relatively large numbers (such as this 2024 SuSE advice of 128 threads). Having gone through this recent experience, my current belief is that it depends on what your problem is. In our case, with the NFS server threads all using kernel CPU time and not doing much else, running more threads than we have CPUs seems pointless; all it would do is create unproductive contention for CPU time. If NFS clients are going to totally saturate the fileserver with (CPU-eating) requests even at 16 threads, possibly we should run fewer threads than CPUs, so that user level management operations have some CPU available without contending against the voracious appetite of the kernel NFS server.

(Some advice suggests some number of server NFS kernel threads per NFS client. I suspect this advice is not used in places with tens or hundreds of NFS clients, which is our situation.)

To figure out what your NFS server's problem is, I think you're going to need to look at things like pressure stall information and information on the IO rate and the number of IO requests you're seeing. You can't rely on overall iowait numbers, because Linux iowait is a conservative lower bound. IO pressure stall information is much better for telling you if some NFS threads are blocked on IO even while others are active.
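
A quick check on a kernel with PSI enabled:

cat /proc/pressure/io    # a nonzero, climbing 'full' line means tasks were fully stalled on IO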

(Unfortunately the kernel NFS threads are not in a cgroup of their own, so you can't get per-cgroup pressure stall information for them. I don't know if you can manually move them into a cgroup, or if systemd would cooperate with this if you tried it.)

PS: In theory it looks like a potentially reasonable idea to run roughly at least as many NFS kernel threads as you have CPUs (maybe a few less so you have some user level CPU left over). However, if you have a lot of CPUs, as you might on modern servers, this might be too many if your NFS server gets flooded with an IO-heavy workload. Our next generation NFS fileserver hardware is dual socket, 12 cores per socket, and 2 threads per core, for a total of 48 CPUs, and I'm not sure we want to run anywhere near than many NFS kernel threads. Although we probably do want to run more than eight.

Ubuntu LTS (server) releases have become fairly similar to each other

By: cks
19 November 2024 at 04:15

Ubuntu 24.04 LTS was released this past April, so one of the things we've been doing since then is building out our install system for 24.04 and then building a number of servers using 24.04, both new servers and servers that used to be build on 20.04 or 22.04. What has been quietly striking about this process is how few changes there have been for us between 20.04, 22.04, and 24.04. Our customization scripts needed only very small changes, and many of the instructions for specific machines could be revised by just searching and replacing either '20.04' or '22.04' with '24.04'.

Some of this lack of changes is illusory, because when I actually look at the differences between our 22.04 and 24.04 postinstall scripting, there are a number of changes, adjustments, and new fixes (and a big change in having to install Python 2 ourselves). Even when we didn't do anything there were decisions to be made, like whether or not we would stick with the Ubuntu 24.04 default of socket activated SSH (our decision so far is to stick with 24.04's default for less divergence from upstream). And there were also some changes to remove obsolete things and restructure how we change things like the system-wide SSH configuration; these aren't forced by the 22.04 to 24.04 change, but building the install setup for a new release is the right time to rethink existing pieces.

However, plenty of this lack of changes is real, and I credit a lot of that to systemd. Systemd has essentially standardized a lot of the init process and in the process, substantially reduced churn in it. For a relevant example, our locally developed systemd units almost never need updating between Ubuntu versions; if it worked in 20.04, it'll still work just as well in 24.04 (including its relationships to various other units). Another chunk of this lack of changes is that the current 20.04+ Ubuntu server installer has maintained a stable configuration file and relatively stable feature set (at least of features that we want to use), resulting in very little needing to be modified in our spin of it as we moved from 20.04 to 22.04 to 24.04. And the experience of going through the server installer has barely changed; if you showed me an installer screen from any of the three releases, I'm not sure I could tell you which it's from.

I generally feel that this is a good thing, at least on servers. A normal Linux server setup and the software that you run on it has broadly reached a place of stability, where there's no particular need to make really visible changes or to break backward compatibility. It's good for us that moving from 20.04 to 22.04 to 24.04 is mostly about getting more recent kernels and more up to date releases of various software packages, and sometimes having bugs fixed so that things like bpftrace work better.

(Whether this is 'welcome maturity' or 'unwelcome stasis' is probably somewhat in the eye of the observer. And there are quiet changes afoot behind the scenes, like the change from iptables to nftables.)

Complications in supporting 'append to a file' in a NFS server

By: cks
8 November 2024 at 04:14

In the comments of my entry on the general problem of losing network based locks, an interesting side discussion has happened between commentator abel and me over NFS servers (not) supporting the Unix O_APPEND feature. The more I think about it, the more I think it's non-trivial to support well in an NFS server and that there are some subtle complications (and probably more than I haven't realized). I'm mostly going to restrict this to something like NFS v3, which is what I'm familiar with.

The basic Unix semantics of O_APPEND are that when you perform a write(), all of your data is immediately and atomically put at the current end of the file, and the file's size and maximum offset are immediately extended to the end of your data. If you and I do a single append write() of 128 Mbytes to the same file at the same time, either all of my 128 Mbytes winds up before yours or vice versa; your and my data will never wind up intermingled.

This basic semantics is already a problem for NFS because NFS (v3) connections have a maximum size for single NFS 'write' operations and that size may be (much) smaller than the user level write(). Without a multi-operation transaction of some sort, we can't reliably perform append write()s of more data than will fit in a NFS write operation; either we fail those 128 Mbyte writes, or we have the possibility that data from you and I will be intermingled in the file.

In NFS v2, all writes were synchronous (or were supposed to be, servers sometimes lied about this). NFS v3 introduced the idea of asynchronous, buffered writes that were later committed by clients. NFS servers are normally permitted to discard asynchronous writes that haven't yet been committed by the client; when the client tries to commit them later, the NFS server rejects the commit and the client resends the data. This works fine when the client's request has a definite position in the file, but it has issues if the client's request is a position-less append write. If two clients do append writes to the same file, first A and then B after it, the server discards both, and then client B is the first one to go through the 'COMMIT, fail, resend' process, where does its data wind up? It's not hard to wind up with situations where a third client that's repeatedly reading the file will see inconsistent results, where first it sees A's data then B's and then later either it sees B's data before A's or B's data without anything from A (not even a zero-filled gap in the file, the way you'd get with ordinary writes).

(While we can say that NFS servers shouldn't ever deliberately discard append writes, one of the ways that this happens is that the server crashes and reboots.)

You can get even more fun ordering issues created by retrying lost writes if there is another NFS client involved that is doing manual append writes by finding out the current end of file and writing at it. If A and B do append writes, C does a manual append write, all writes are lost before they're committed, B redoes, C redoes, and then A redoes, a natural implementation could easily wind up with B's data, an A data sized hole, C's data, and then A's data appended after C's.

This also creates server side ordering dependencies for potentially discarding uncommitted asynchronous write data, ones that a NFS server can normally make independently. If A appended a lot of data and then B appended a little bit, you probably don't want to discard A's data but not B's, because there's no guarantee that A will later show up to fail a COMMIT and resend it (A could have crashed, for example). And if B requests a COMMIT, you probably want to commit A's data as well, even if there's much more of it.

One way around this would be to adopt a more complex model of append writes over NFS, where instead of the client requesting an append write, it requests 'write this here but fail if this is not the current end of file'. This would give all NFS writes a definite position in the file at the cost of forcing client retries on the initial request (if the client later has to repeat the write because of a failed commit, it must carefully strip this flag off). Unfortunately a file being appended to from multiple clients at a high rate would probably result in a lot of client retries, with no guarantee that a given client would ever actually succeed.

(You could require all append writes to be synchronous, but then this would do terrible things to NFS server performance for potentially common use of append writes, like appending log lines to a shared log file from multiple machines. And people absolutely would write and operate programs like that if append writes over NFS were theoretically reliable.)

Losing NFS locks and the SunOS SIGLOST signal

By: cks
7 November 2024 at 02:48

NFS is a network filesystem that famously also has a network locking protocol associated with it (or part of it, for NFSv4). This means that NFS has to consider the issue of the NFS client losing a lock that it thinks it holds. In NFS, clients losing locks normally happens as part of NFS(v3) lock recovery, triggered when a NFS server reboots. On server reboot, clients are told to re-acquire all of their locks, and this re-acquisition can explicitly fail (as well as going wrong in various ways that are one way to get stuck NFS locks). When a NFS client's kernel attempts to reclaim a lock and this attempt fails, it has a problem. Some process on the local machine thinks that it holds a (NFS) lock, but as far as the NFS server and other NFS clients are concerned, it doesn't.

Sun's original version of NFS dealt with this problem with a special signal, SIGLOST. When the NFS client's kernel detected that a NFS lock had been lost, it sent SIGLOST to whatever process held the lock. SIGLOST was a regular signal, so by default the process would exit abruptly; a process that wanted to do something special could register a signal handler for SIGLOST and then do whatever it could. SIGLOST appeared no later than SunOS 3.4 (cf) and still lives on today in Illumos, where you can find this discussed in uts/common/klm/nlm_client.c and uts/common/fs/nfs/nfs4_recovery.c (and it's also mentioned in fcntl(2)). The popularity of actually handling SIGLOST may be indicated by the fact that no program in the Illumos source tree seems to set a signal handler for it.

Other versions of Unix mainly ignore the situation. The Linux kernel has a specific comment about this in fs/lockd/clntproc.c, which very briefly talks about the issue and picks ignoring it (apart from logging the kernel message "lockd: failed to reclaim lock for ..."). As far as I can tell from reading FreeBSD's sys/nlm/nlm_advlock.c, FreeBSD silently ignores any problems when it goes through the NFS client process of reclaiming locks.

(As far as I can see, NetBSD and OpenBSD don't support NFS locks on clients at all, rendering the issue moot. I don't know if POSIX locks fail on NFS mounted filesystems or if they work but create purely local locks on that particular NFS client, although I think it's the latter.)

On the surface this seems rather bad, and certainly worse than the Sun approach of SIGLOST. However, I'm not sure that SIGLOST is all that great either, because it has some problems. First, what you can do in a signal handler is very constrained; basically all that a SIGLOST handler can do is set a variable and hope that the rest of the code will check it before it does anything dangerous. Second, programs may hold multiple (NFS) locks and SIGLOST doesn't tell you which lock you lost; as far as I know, there's no way of telling. If your program gets a SIGLOST, all you can do is assume that you lost all of your locks. Third, file locking may quite reasonably be used inside libraries in a way that is hidden from callers by the library's API, but signals and handling signals is global to the entire program. If taking a file lock inside a library exposes the entire program to SIGLOST, you have a collection of problems (which ones depend on whether the program has its own file locks and whether or not it has installed a SIGLOST handler).
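
To illustrate the first problem, here is a minimal C sketch of about all that a SIGLOST handler can safely do; the #ifdef is because only some systems (such as Illumos) define SIGLOST:

#include <signal.h>
#include <string.h>

static volatile sig_atomic_t lost_locks;

static void on_siglost(int sig)
{
    (void)sig;
    /* We can't tell which lock was lost, so assume all of them. */
    lost_locks = 1;
}

int main(void)
{
#ifdef SIGLOST
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_siglost;
    sigaction(SIGLOST, &sa, NULL);
#endif
    /* ... take NFS locks and do work, checking lost_locks before
       doing anything dangerous ... */
    return 0;
}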

This collection of problems may go part of the way to explain why no Illumos programs actually set a SIGLOST handler and why other Unixes simply ignore the issue. A kernel that uses SIGLOST essentially means 'your program dies if it loses a lock', and it's not clear that this is better than 'your program optimistically continues', especially in an environment where a NFS client losing a NFS lock is rare (and letting the program continue is certainly simpler for the kernel).

A rough equivalent to "return to last power state" for libvirt virtual machines

By: cks
5 November 2024 at 04:13

Physical machines can generally be set in their BIOS so that if power is lost and then comes back, the machine returns to its previous state (either powered on or powered off). The actual mechanics of this are complicated (also), but the idealized version is easily understood and convenient. These days I have a revolving collection of libvirt based virtual machines running on a virtualization host that I periodically reboot due to things like kernel updates, and for a while I have quietly wished for some sort of similar libvirt setting for its virtual machines.

It turns out that this setting exists, sort of, in the form of the libvirt-guests systemd service. If enabled, it can be set to restart all guests that were running when the system was shut down, regardless of whether or not they're set to auto-start on boot (none of my VMs are). This is a global setting that applies to all virtual machines that were running at the time the system went down, not one that can be applied to only some VMs, but for my purposes this is sufficient; it makes it less of a hassle to reboot the virtual machine host.

Linux being Linux, life is not quite this simple in practice, as is illustrated by comparing my Ubuntu VM host machine with my Fedora desktops. On Ubuntu, libvirt-guests.service defaults to enabled, it is configured through /etc/default/libvirt-guests (the Debian standard), and it defaults to not automatically restarting virtual machines. On my Fedora desktops, libvirt-guests.service is not enabled by default, it is configured through /etc/sysconfig/libvirt-guests (as in the official documentation), and it defaults to automatically restarting virtual machines. Another difference is that Ubuntu ships a /etc/default/libvirt-guests with commented-out default values, while Fedora has no /etc/sysconfig/libvirt-guests, so you have to read the script to see what the defaults are (on Fedora, this is /usr/libexec/libvirt-guests.sh, on Ubuntu /usr/lib/libvirt/libvirt-guests.sh).

I've changed my Ubuntu VM host machine so that it will automatically restart previously running virtual machines on reboot, because generally I leave things running intentionally there. I haven't touched my Fedora machines so far because by and large I don't have any regularly running VMs, so if a VM is still running when I go to reboot the machine, it's most likely because I forgot I had it up and hadn't gotten around to shutting it off.
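
For illustration, my understanding is that the change amounts to something like this (variable names as I believe they are spelled; check the commented defaults or the libvirt-guests.sh script on your system):

# /etc/default/libvirt-guests on Ubuntu
# (/etc/sysconfig/libvirt-guests on Fedora)

# restart guests that were running when the host last shut down:
ON_BOOT=start
# and cleanly shut running guests down along with the host:
ON_SHUTDOWN=shutdown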

(My pre-libvirt virtualization software was much too heavy-weight for me to leave a VM running without noticing, but libvirt VMs have a sufficiently low impact on my desktop experience that I can and have left them running without realizing it.)

The history of Unix's ioctl and signal about window sizes

By: cks
4 November 2024 at 03:38

One of the somewhat obscure features of Unix is that the kernel has a specific interface to get (and set) the 'window size' of your terminal, and can also send a Unix signal to your process when that size changes. The official POSIX interface for the former is tcgetwinsize(), but in practice actual Unixes have a standard tty ioctl for this, TIOCGWINSZ (see eg Linux ioctl_tty(2) (also) or FreeBSD tty(4)). The signal is officially standardized by POSIX as SIGWINCH, which is the name it always has had. Due to a Fediverse conversation, I looked into the history of this today, and it turns out to be more interesting than I expected.

(The inclusion of these interfaces in POSIX turns out to be fairly recent.)
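
For the curious, the modern face of the interface looks something like this minimal C sketch:

#include <signal.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

static volatile sig_atomic_t resized;

static void on_winch(int sig)
{
    (void)sig;
    resized = 1;   /* a full-screen program would re-fetch the size */
}

int main(void)
{
    struct winsize ws;

    signal(SIGWINCH, on_winch);
    if (ioctl(STDIN_FILENO, TIOCGWINSZ, &ws) == 0)
        printf("%d rows, %d columns\n", ws.ws_row, ws.ws_col);
    return 0;
}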

As far as I can tell, 4.2 BSD did not have either TIOCGWINSZ or SIGWINCH (based on its sigvec(2) and tty(4) manual pages). Both of these appear in the main BSD line in 4.3 BSD, where sigvec(2) has added SIGWINCH (as the first new signal along with some others) and tty(4) has TIOCGWINSZ. This timing makes a certain amount of sense in Unix history. At the time of 4.2 BSD's development and release, people were connecting to Unix systems using serial terminals, which had more or less fixed sizes that were covered by termcap's basic size information. By the time of 4.3 BSD in 1986, Unix workstations existed and with them, terminal windows that could have their size changed on the fly; a way of finding out (and changing) this size was an obvious need, along with a way for full-screen programs like vi to get notified if their terminal window was resized on the fly.

However, as far as I can tell 4.3 BSD itself did not originate SIGWINCH, although it may be the source of TIOCGWINSZ. The FreeBSD project has manual pages for a variety of Unixes, including 'Sun OS 0.4', which seems to be an extremely early release from early 1983. This release has a signal(2) with a SIGWINCH signal (using signal number 28, which is what 4.3 BSD will use for it), but no (documented) TIOCGWINSZ. However, it does have some programs that generate custom $TERMCAP values with the right current window sizes.

The Internet Archive has a variety of historical material from Sun Microsystems, including (some) documentation for both SunOS 2.0 and SunOS 3.0. This documentation makes it clear that the primary purpose of SIGWINCH was to tell graphical programs that their window (or one of them) had been changed, and they should repaint the window or otherwise refresh the contents (a program with multiple windows didn't get any indication of which window was damaged; the programming advice is to repaint them all). The SunOS 2.0 tgetent() termcap function will specifically update what it gives you with the current size of your window, but as far as I can tell there's no other documented support of getting window sizes; it's not mentioned in tty(4) or pty(4). Similar wording appears in the SunOS 3.0 Unix Interface Reference Manual.

(There are PDFs of some SunOS documentation online (eg), and up through SunOS 3.5 I can't find any mention of directly getting the 'window size'. In SunOS 4.0, we finally get a TIOCGWINSZ, documented in termio(4). However, I have access to SunOS 3.5 source, and it does have a TIOCGWINSZ ioctl, although that ioctl isn't documented. It's entirely likely that TIOCGWINSZ was added (well) before SunOS 3.5.)

According to this Git version of the original BSD development history, BSD itself added both SIGWINCH and TIOCGWINSZ at the end of 1984. The early SunOS had SIGWINCH and it may well have had TIOCGWINSZ as well, so it's possible that BSD got both from SunOS. It's also possible that early SunOS had a different (terminal) window size mechanism than TIOCGWINSZ, one more specific to their window system, and the UCB CSRG decided to create a more general mechanism that Sun then copied back by the time of SunOS 3.5 (possibly before the official release of 4.3 BSD, since I suspect everyone in the BSD world was talking to each other at that time).

PS: SunOS also appears to be the source of the mysteriously missing signal 29 in 4.3 BSD (mentioned in my entry on how old various Unix signals are). As described in the SunOS 3.4 sigvec() manual page, signal 29 is 'SIGLOST', "resource lost (see lockd(8C))". This appears to have been added at some point between the initial SunOS 3.0 release and SunOS 3.4, but I don't know exactly when.

Notes on the compatibility of crypted passwords across Unixes in late 2024

By: cks
2 November 2024 at 02:31

For years now, all sorts of Unixes have been able to support better password 'encryption' schemes than the basic old crypt(3) salted-mutant-DES approach that Unix started with (these days it's usually called 'password hashing'). However, the support for specific alternate schemes varies from Unix to Unix, and has for many years. Back in 2010 I wrote some notes on the situation at the time; today I want to look at the situation again, since password hashing is on my mind right now.

The most useful resource for cross-Unix password hash compatibility is Wikipedia's comparison table. For Linux, support varies by distribution based on their choice of C library and what version of libxcrypt they use, and you can usually see a list in crypt(5), and pam_unix may not support using all of them for new passwords. For FreeBSD, their support is documented in crypt(3). In OpenBSD, this is documented in crypt(3) and crypt_newhash(3), although there isn't much to read since current OpenBSD only lists support for 'Blowfish', which for password hashing is also known as bcrypt. On Illumos, things are more or less documented in crypt(3), crypt.conf(5), and crypt_unix(7) and associated manual pages; the Illumos section 7 index provides one way to see what seems to be supported.

System administrators not infrequently wind up wanting cross-Unix compatibility of their local encrypted passwords. If you don't care about your shared passwords working on OpenBSD (or NetBSD), then the 'sha512' scheme is your best bet; it basically works everywhere these days. If you do need to include OpenBSD or NetBSD, you're stuck with bcrypt, but even then there may be problems because bcrypt is actually several schemes, as Wikipedia covers.

Some recent Linux distributions seem to be switching to 'yescrypt' by default (including Debian, which means downstream distributions like Ubuntu have also switched). Yescrypt in Ubuntu is now old enough that it's probably safe to use in an all-Ubuntu environment, although your mileage may vary if you have 18.04 or earlier systems. Yescrypt is not yet available in FreeBSD and may never be added to OpenBSD or NetBSD (my impression is that OpenBSD is not a fan of having lots of different password hashing algorithms and prefers to focus on one that they consider secure).

(Compared to my old entry, I no longer particularly care about the non-free Unixes, including macOS. Even Wikipedia doesn't bother trying to cover AIX. For our local situation, we may someday want to share passwords to FreeBSD machines, but we're very unlikely to care about sharing passwords to OpenBSD machines since we currently only use them in situations where having their own stand-alone passwords is a feature, not a bug.)

Pam_unix and your system's supported password algorithms

By: cks
1 November 2024 at 03:15

The Linux login passwords that wind up in /etc/shadow can be encrypted (well, hashed) with a variety of algorithms, which you can find listed (and sort of documented) in places like Debian's crypt(5) manual page. Generally the choice of which algorithm is used to hash (new) passwords (for example, when people change them) is determined by an option to the pam_unix PAM module.
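
As a concrete illustration, on a Debian-style system the choice is made by a line something like this (a sketch; the exact module options vary by distribution and release):

# /etc/pam.d/common-password
password  [success=1 default=ignore]  pam_unix.so obscure yescrypt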

You might innocently think, as I did, that all of the algorithms your system supports will be supported by pam_unix, or more exactly will all be available for new passwords (ie, what you or your distribution control with an option to pam_unix). It turns out that this is not the case some of the time (or if it is actually the case, the pam_unix manual page can be inaccurate). This is surprising because pam_unix is the thing that handles hashed passwords (both validating them and changing them), and you'd think its handling of them would be symmetric.

As I found out today, this isn't necessarily so. As documented in the Ubuntu 20.04 crypt(5) manual page, 20.04 supports yescrypt in crypt(3) (sadly Ubuntu's manual page URL doesn't seem to work). This means that the Ubuntu 20.04 pam_unix can (or should be able to) accept yescrypt hashed passwords. However, the Ubuntu 20.04 pam_unix(8) manual page doesn't list yescrypt as one of the available options for hashing new passwords. If you look only at the 20.04 pam_unix manual page, you might (incorrectly) assume that a 20.04 system can't deal with yescrypt based passwords at all.

At one level, this makes sense once you know that pam_unix and crypt(3) come from different packages and handle different parts of the work of checking existing Unix passwords and hashing new ones. Roughly speaking, pam_unix can delegate checking passwords to crypt(3) without having to care how they're hashed, but to hash a new password with a specific algorithm it has to know about the algorithm, have a specific PAM option added for it, and call some functions in the right way. It's quite possible for crypt(3) to get ahead of pam_unix for a new password hashing algorithm, like yescrypt.

(Since they're separate packages, pam_unix may not want to implement this for a new algorithm until a crypt(3) that supports it is at least released, and then pam_unix itself will need a new release. And I don't know if linux-pam can detect whether or not yescrypt is supported by crypt(3) at build time (or at runtime).)

PS: If you have an environment with a shared set of accounts and passwords (whether via LDAP or your own custom mechanism) and a mixture of Ubuntu versions (maybe also with other Linux distribution versions), you may want to be careful about using new password hashing schemes, even once it's supported by pam_unix on your main systems. The older some of your Linuxes are, the more you'll want to check their crypt(3) and crypt(5) manual pages carefully.

Linux's /dev/disk/by-id unfortunately often puts the transport in the name

By: cks
28 October 2024 at 03:24

Filippo Valsorda ran into an issue that involved, in part, the naming of USB disk drives. To quote the relevant bit:

I can't quite get my head around the zfs import/export concept.

When I replace a drive I like to first resilver the new one as a USB drive, then swap it in. This changes the device name (even using by-id).

[...]

My first reaction was that something funny must be going on. My second reaction was to look at an actual /dev/disk/by-id with a USB disk, at which point I got a sinking feeling that I should have already recognized from a long time ago. If you look at your /dev/disk/by-id, you will mostly see names that start with things like 'ata-', 'scsi-OATA-', 'scsi-1ATA', and maybe 'usb-' (and perhaps 'nvme-', but that's a somewhat different kettle of fish). All of these names have the problem that they burn the transport (how you talk to the disk) into the /dev/disk/by-id, which is supposed to be a stable identifier for the disk as a standalone thing.

As Filippo Valsorda's case demonstrates, the problem is that some disks can move between transports. When this happens, the theoretically stable name of the disk changes; what was 'usb-' is now likely 'ata-' or vice versa, and in some cases other transformations may happen. Your attempt to use a stable name has failed and you will likely have problems.

Experimentally, there seem to be some /dev/disk/by-id names that are more stable. Some but not all of our disks have 'wwn-' names (one USB attached disk I can look at doesn't). Our Ubuntu based systems have 'scsi-<hex digits>' and 'scsi-SATA-<disk id>' names, but one of my Fedora systems with SATA drives has only the 'scsi-<hex>' names and the other one has neither. One system we have a USB disk on has no names for the disk other than 'usb-' ones. It seems clear that it's challenging at best to give general advice about how a random Linux user should pick truly stable /dev/disk/by-id names, especially if you have USB drives in the picture.

(See also Persistent block device naming in the Arch Wiki.)

This whole current situation seems less than ideal, to put it one way. It would be nice if disks (and partitions on them) had names that were as transport independent and usable as possible, especially since most disks have theoretically unique serial numbers and model names available (and if you're worried about cross-transport duplicates, you should already be at least as worried about duplicates within the same type of transport).

PS: You can find out what information udev knows about your disks with 'udevadm info --query=all --name=/dev/...' (from, via, by coincidence). The information for a SATA disk differs between my two Fedora machines (one of them has various SCSI_* and ID_SCSI* stuff and the other doesn't), but I can't see any obvious reason for this.

Using pam_access to sometimes not use another PAM module

By: cks
26 October 2024 at 02:40

Suppose that you want to authenticate SSH logins to your Linux systems using some form of multi-factor authentication (MFA). The normal way to do this is to use 'password' authentication and then in the PAM stack for sshd, use both the regular PAM authentication module(s) of your system and an additional PAM module that requires your MFA (in another entry about this I used the module name pam_mfa). However, in your particular MFA environment it's been decided that you don't have to require MFA for logins from some of your other networks or systems, and you'd like to implement this.

Because your MFA happens through PAM and the details of this are opaque to OpenSSH's sshd, you can't directly implement skipping MFA through sshd configuration settings. If sshd winds up doing password based authentication at all, it will run your full PAM stack and that will challenge people for MFA. So you must implement sometimes skipping your MFA module in PAM itself. Fortunately there is a PAM module we can use for this, pam_access.

The usual way to use pam_access is to restrict or allow logins (possibly only some logins) based on things like the source address people are trying to log in from (in this, it's sort of a superset of the old tcpwrappers). How this works is configured through an access control file. We can (ab)use this basic matching in combination with the more advanced form of PAM controls to skip our PAM MFA module if pam_access matches something.

What we want looks like this:

auth  [success=1 default=ignore]  pam_access.so noaudit accessfile=/etc/security/access-nomfa.conf
auth  requisite  pam_mfa

Pam_access itself will 'succeed' as a PAM module if the result of processing our access-nomfa.conf file is positive. When this happens, we skip the next PAM module, which is our MFA module. If it 'fails', we ignore the result, and as part of ignoring the result we tell pam_access to not report failures.

Our access-nomfa.conf file will have things like:

# Everyone skips MFA for internal networks
+:ALL:192.168.0.0/16 127.0.0.1

# Insure we fail otherwise.
-:ALL:ALL

We list the networks we want to allow password logins without MFA from, and then we have to force everything else to fail. (If you leave this off, everything passes, either explicitly or implicitly.)

As covered in the access.conf manual page, you can get quite sophisticated here. For example, you could have people who always had to use MFA, even from internal machines. If they were all in a group called 'mustmfa', you might start with:

-:(mustmfa):ALL

If you get at all creative with your access-nomfa.conf, I strongly suggest writing a lot of comments to explain everything. Your future self will thank you.

Unfortunately but entirely reasonably, the information about the remote source of a login session doesn't pass through to later PAM authentication done by sudo and su commands that you do in the session. This means that you can't use pam_access to not give MFA challenges on su or sudo to people who are logged in from 'trusted' areas.

(As far as I can tell, the only information pam_access gets about the 'origin' of a su is the TTY, which is generally not going to be useful. You can probably use this to not require MFA on su or sudo that are directly done from logins on the machine's physical console or serial console.)

Having an emergency backup DNS resolver with systemd-resolved

By: cks
25 October 2024 at 03:08

At work we have a number of internal DNS resolvers, which you very much want to use to resolve DNS names if you're inside our networks for various reasons (including our split-horizon DNS setup). Purely internal DNS names aren't resolvable by the outside world at all, and some DNS names resolve differently. However, at the same time a lot of the host names that are very important to me are in our public DNS because they have public IPs (sort of for historical reasons), and so they can be properly resolved if you're using external DNS servers. This leaves me with a little bit of a paradox; on the one hand, my machines must resolve our DNS zones using our internal DNS servers, but on the other hand if our internal DNS servers aren't working for some reason (or my home machine can't reach them) it's very useful to still be able to resolve the DNS names of our servers, so I don't have to memorize their IP addresses.

A while back I switched to using systemd-resolved on my machines. Systemd-resolved has a number of interesting virtues, including that it has fast (and centralized) failover from one upstream DNS resolver to another. My systemd-resolved configuration is probably a bit unusual, in that I have a local resolver on my machines, so resolved's global DNS resolution goes to it and then I add a layer of (nominally) interface-specific DNS domain overrides that point to our internal DNS resolvers.

(This doesn't give me perfect DNS resolution, but it's more resilient and under my control than routing everything to our internal DNS resolvers, especially for my home machine.)

Somewhat recently, it occurred to me that I could deal with the problem of our internal DNS resolvers all being unavailable by adding '127.0.0.1' as an additional potential DNS server for my interface specific list of our domains. Obviously I put it at the end, where resolved won't normally use it. But with it there, if all of the other DNS servers are unavailable I can still try to resolve our public DNS names with my local DNS resolver, which will go out to the Internet to talk to various authoritative DNS servers for our zones.
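
As a sketch of how this looks at runtime (the interface name, IPs, and domain are made up for illustration), the equivalent resolvectl commands would be something like:

# interface-specific DNS servers for our domains, with 127.0.0.1
# as the emergency backup at the end:
resolvectl dns eth0 10.x.x.1 10.x.x.2 127.0.0.1
resolvectl domain eth0 '~example.com'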

The drawback with this emergency backup approach is that systemd-resolved will stick with whatever DNS server it's currently using unless that DNS server stops responding. So if resolved switches to 127.0.0.1 for our zones, it's going to keep using it even after the other DNS resolvers become available again. I'll have to notice that and manually fiddle with the interface specific DNS server list to remove 127.0.0.1, which would force resolved to switch to some other server.

(As far as I can tell, the current systemd-resolved correctly handles the situation where an interface says that '127.0.0.1' is the DNS resolver for it, and doesn't try to force queries to 127.0.0.1:53 to go out that interface. My early 2013 notes say that this sometimes didn't work, but I failed to write down the specific circumstances.)

Doing basic policy based routing on FreeBSD with PF rules

By: cks
24 October 2024 at 03:26

Suppose, not hypothetically, that you have a FreeBSD machine that has two interfaces and these two interfaces are reached through different firewalls. You would like to ping both of the interfaces from your monitoring server because both of them matter for the machine's proper operation, but to make this work you need replies to your pings to be routed out the right interface on the FreeBSD machine. This is broadly known as policy based routing and is often complicated to set up. Fortunately FreeBSD's version of PF supports a basic version of this, although it's not well explained in the FreeBSD pf.conf manual page.

To make our FreeBSD machine reply properly to our monitoring machine's ICMP pings, or in general to its traffic, we need a stateful 'pass' rule with a 'reply-to':

B_IF="emX"
B_IP="10.x.x.x"
B_GW="10.x.x.254"
B_SUBNET="10.x.x.0/24"

pass in quick on $B_IF \
  reply-to ($B_IF $B_GW) \
  inet from ! $B_SUBNET to $B_IP \
  keep state

(Here $B_IP is the machine's IP on this second interface, and we also need the second interface, the gateway for the second interface's subnet, and the subnet itself.)

As I discovered, you must put the 'reply-to' where it is here, although as far as I can tell the FreeBSD pf.conf manual page will only tell you that if you read the full BNF. If you put it at the end the way you might read the text description, you will get only opaque syntax errors.

We must specifically exclude traffic from the subnet itself to us, because otherwise this rule will faithfully send replies to other machines on the same subnet off to the gateway, which either won't work well or won't work at all. You can restrict the PF rule more narrowly, for example 'from { IP1 IP2 IP3 }' if those are the only off-subnet IPs that are supposed to be talking to your secondary interface.

(You may also want to match only some ports here, unless you want to give all incoming traffic on that interface the ability to talk to everything on the machine. This may require several versions of this rule, basically sticking the 'reply-to ...' bit into every 'pass in quick on ...' rule you have for that interface.)

This PF rule only handles incoming connections (including implicit ones from ICMP and UDP traffic). If we want to be able to route our outgoing traffic over our secondary interface by selecting a source address when you do things, we need a second PF rule:

pass out quick \
  route-to ($B_IF $B_GW) \
  inet from $B_IP to ! $B_SUBNET \
  keep state

Again we must specifically exclude traffic to our local network, because otherwise it will go flying off to our gateway, and also you can be more specific if you only want this machine to be able to connect to certain things using this gateway and firewall (eg 'to { IP1 IP2 SUBNET3/24 }', or you could use a port-based restriction).

(The PF rule can't be qualified with 'on $B_IF', because the situation where you need this rule is where the packet would not normally be going out that interface. Using 'on <the interface with your default route's gateway>' has some subtle differences in the semantics if you have more than two interfaces.)

Although you might innocently think otherwise, the second rule by itself isn't sufficient to make incoming connections to the second interface work correctly. If you want both incoming and outgoing connections to work, you need both rules. Possibly it would work if you matched incoming traffic on $B_IF without keeping state.

A surprise with /etc/cron.daily, run-parts, and files with '.' in their name

By: cks
16 October 2024 at 03:30

Linux distributions have a long-standing general cron feature where there are /etc/cron.hourly, /etc/cron.daily, and /etc/cron.weekly directories and if you put scripts in there, they will get run hourly, daily, or weekly (at some time set by the distribution). The actual running is generally implemented by a program called 'run-parts'. Since this is a standard Linux distribution feature, of course there is a single implementation of run-parts and its behavior is standardized, right?

Since I'm asking the question, you already know the answer: there are at least two different implementations of run-parts, and their behavior differs in at least one significant way (as well as several other probably less important ones).

In Debian, Ubuntu, and other Debian-derived distributions (and also I think Arch Linux), run-parts is a C program that is part of debianutils. In Fedora, Red Hat Enterprise Linux, and derived RPM-based distributions, run-parts is a shell script that's part of the crontabs package, which is part of cronie-cron. One somewhat unimportant way that these two versions differ is that the RPM version ignores some extensions that come from RPM packaging fun (you can see the current full list in the shell script code), while the Debian version only skips the Debian equivalents with a non-default option (and actually documents the behavior in the manual page).

A much more important difference is that the Debian version ignores files with a '.' in their name (this can be changed with a command line switch, but /etc/cron.daily and so on are not processed with this switch). As a non-hypothetical example, if you have a /etc/cron.daily/backup.sh script, a Debian based system will ignore this while a RHEL or Fedora based system will happily run it. If you are migrating a server from RHEL to Ubuntu, this may come as an unpleasant surprise, partly since the Debian version doesn't complain about skipping files.
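
On Debian-derived systems you can check what will actually happen with run-parts' --test option, which prints the files that would be run without running them. The output here is illustrative, but the point is that backup.sh would be silently absent:

$ run-parts --test /etc/cron.daily
/etc/cron.daily/apt-compat
/etc/cron.daily/logrotate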

(Whether or not the restriction could be said to be clearly documented in the Debian manual page is a matter of taste. Debian does clearly state the allowed characters, but it does not point out that '.', a not uncommon character, is explicitly not accepted by default.)

Linux software RAID and changing your system's hostname

By: cks
11 October 2024 at 03:46

Today, I changed the hostname of an old Linux system (for reasons) and rebooted it. To my surprise, the system did not come up afterward, but instead got stuck in systemd's emergency mode for a chain of reasons that boiled down to there being no '/dev/md0'. Changing the hostname back to its old value and rebooting the system again caused it to come up fine. After some diagnostic work, I believe I understand what happened and how to work around it if it affects us in the future.

One of the issues that Linux RAID auto-assembly faces is the question of what it should call the assembled array. People want their RAID array names to stay fixed (so /dev/md0 is always /dev/md0), and so the name is part of the RAID array's metadata, but at the same time you have the problem of what happens if you connect up two sets of disks that both want to be 'md0'. Part of the answer is mdadm.conf, which can give arrays names based on their UUID. If your mdadm.conf says 'ARRAY /dev/md10 ... UUID=<x>' and mdadm finds a matching array, then in theory it can be confident you want that one to be /dev/md10 and it should rename anything else that claims to be /dev/md10.

However, suppose that your array is not specified in mdadm.conf. In that case, another software RAID array feature kicks in, which is that arrays can have a 'home host'. If the array is on its home host, it will get the name it claims it has, such as '/dev/md0'. Otherwise, well, let me quote from the 'Auto-Assembly' section of the mdadm manual page:

[...] Arrays which do not obviously belong to this host are given names that are expected not to conflict with anything local, and are started "read-auto" so that nothing is written to any device until the array is written to. i.e. automatic resync etc is delayed.

As is covered in the documentation for the '--homehost' option in the mdadm manual page, on modern 1.x superblock formats the home host is embedded into the name of the RAID array. You can see this with 'mdadm --detail', which can report things like:

Name : ubuntu-server:0
Name : <host>:25  (local to host <host>)

Both of these have a 'home host'; in the first case the home host is 'ubuntu-server', and in the second case the home host is the current machine's hostname. Well, its 'hostname' as far as mdadm is concerned, which can be set in part through mdadm.conf's 'HOMEHOST' directive. Let me repeat that: mdadm by default identifies home hosts by their hostname, not by any more stable identifier.

So if you change a machine's hostname and you have arrays not in your mdadm.conf with home hosts, their /dev/mdN device names will get changed when you reboot. This is what happened to me, as we hadn't added the array to the machine's mdadm.conf.

(Contrary to some ways to read the mdadm manual page, arrays are not renamed if they're in mdadm.conf. Otherwise we'd have noticed this a long time ago on our Ubuntu servers, where all of the arrays created in the installer have the home host of 'ubuntu-server', which is obviously not any machine's actual hostname.)

Setting the home host value to the machine's current hostname when an array is created is the mdadm default behavior, although you can turn this off with the right mdadm.conf HOMEHOST setting. You can also tell mdadm to consider all arrays to be on their home host, regardless of the home host embedded into their names.

(The latter is 'HOMEHOST <ignore>', the former by itself is 'HOMEHOST <none>', and it's currently valid to combine them both as 'HOMEHOST <ignore> <none>', although this isn't quite documented in the manual page.)
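
Put together, a mdadm.conf that sidesteps the hostname issue might look something like this sketch (the UUID is a placeholder; 'mdadm --detail --scan' will generate real ARRAY lines for you):

# treat every array as being on its home host, whatever its name says
HOMEHOST <ignore>
# and pin the device name by UUID anyway
ARRAY /dev/md0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx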

PS: Some uses of software RAID arrays won't care about their names. For example, if they're used for filesystems, and your /etc/fstab specifies the device of the filesystem using 'UUID=' or with '/dev/disk/by-id/md-uuid-...' (which seems to be common on Ubuntu).

PPS: For 1.x superblocks, the array name as a whole can only be 32 characters long, which obviously limits how long of a home host name you can have, especially since you need a ':' in there as well and an array number or the like. If you create a RAID array on a system with a too long hostname, the name of the resulting array will not be in the '<host>:<name>' format that creates an array with a home host; instead, mdadm will set the name of the RAID to the base name (either whatever name you specified, or the N of the 'mdN' device you told it to use).

(It turns out that I managed to do this by accident on my home desktop, which has a long fully qualified name, by creating an array with the name 'ssd root'. The combination turns out to be 33 characters long, so the RAID array just got the name 'ssd root' instead of '<host>:ssd root'.)

The history of inetd is more interesting than I expected

By: cks
10 October 2024 at 03:11

Inetd is a traditional Unix 'super-server' that listens on multiple (IP) ports and runs programs in response to activity on them. When inetd listens on a port, it can act in two different modes. In the simplest mode, it starts a separate copy of the configured program for every connection (much like the traditional HTTP CGI model), which is an easy way to implement small, low volume services but usually not good for bigger, higher volume ones. The second mode is more like modern 'socket activation'; when a connection comes in, inetd starts your program and passes it the master socket, leaving it to you to keep accepting and processing connections until you exit.

(In inetd terminology, the first mode is 'nowait' and the second is 'wait'; this describes whether inetd immediate resumes listening on the socket for connections or waits until the program exits.)
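
As an illustration of the two modes, inetd.conf lines look something like this (the paths are modern ones and vary by system):

# service  socket  proto  wait/nowait  user  program               arguments
# one rlogind per connection, the 4.3 BSD way:
login    stream  tcp  nowait  root  /usr/libexec/rlogind  rlogind
# the program is handed the listening socket and keeps serving:
comsat   dgram   udp  wait    root  /usr/libexec/comsat   comsat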

Inetd turns out to have a more interesting history than I expected, and it's a history that's entwined with daemonization, especially with how the BSD r* commands daemonize themselves in 4.2 BSD. If you'd asked me before I started writing this entry, I'd have said that inetd was present in 4.2 BSD and was being used for various low-importance services. This turns out to be false in both respects. As far as I can tell, inetd was introduced in 4.3 BSD, and when it was introduced it was immediately put to use for important system daemons like rlogind, telnetd, ftpd, and so on, which were surprisingly run in the first style (with a copy of the relevant program started for each connection). You can see this in the 4.3 BSD /etc/inetd.conf, which has the various TCP daemons and lists them as 'nowait'.

(There are still network programs that are run as stand-alone daemons, per the 4.3 BSD /etc/rc and the 4.3 BSD /etc/rc.local. If we don't count syslogd, the standard 4.3 BSD tally seems to be rwhod, lpd, named, and sendmail.)

While I described inetd as having two modes and this is the modern state, the 4.3 BSD inetd(8) manual page says that only the 'start a copy of the program every time' mode ('nowait') is to be used for TCP programs like rlogind. I took a quick read over the 4.3 BSD inetd.c and it doesn't seem to outright reject a TCP service set up with 'wait', and the code looks like it might actually work with that. However, there's the warning in the manual page and there's no inetd.conf entry for a TCP service that is 'wait', so you'd be on your own.

The corollary of this is that in 4.3 BSD, programs like rlogind don't have the daemonization code that they did in 4.2 BSD. Instead, the 4.3 BSD rlogind.c shows that it can only be run under inetd or some equivalent, as rlogind immediately aborts if its standard input isn't a socket (and it expects the socket to be connected to some other end, which is true for the 'nowait' inetd mode but not how things would be for the 'wait' mode).

This 4.3 BSD inetd model seems to have rapidly propagated into BSD-derived systems like SunOS and Ultrix. I found traces that relatively early on, both of them had inherited the 4.3 style non-daemonizing rlogind and associated programs, along with an inetd-based setup for them. This is especially interesting for SunOS, because it was initially derived from 4.2 BSD (I'm less sure of Ultrix's origins, although I suspect it too started out as 4.2 BSD derived).

PS: I haven't looked to see if the various BSDs ever changed this mode of operation for rlogind et al, or if they carried the 'per connection' inetd based model all through until each of them removed the r* commands entirely.

OpenBSD kernel messages about memory conflicts on x86 machines

By: cks
9 October 2024 at 02:44

Suppose you boot up an OpenBSD machine that you think may be having problems, and as part of this boot you look at the kernel messages for the first time in a while (or perhaps ever), and when doing so you see messages that look like this:

3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000

Or maybe the messages are like this:

memory map conflict 0xe00fd000/0x1000
memory map conflict 0xfe000000/0x11000
[...]
3:0:0: mem address conflict 0xfffc0000/0x40000
3:0:1: mem address conflict 0xfffc0000/0x40000

This sounds alarming, but there's almost certainly no actual problem, and if you check logs you'll likely find that you've been getting messages like this for as long as you've had OpenBSD on the machine.

The short version is that both of these are reports from OpenBSD that it's finding conflicts in the memory map information it is getting from your BIOS. The messages that start with 'X:Y:Z' are about PCI(e) device memory specifically, while the 'memory map conflict' errors are about the general memory map the BIOS hands the system.

Generally, OpenBSD will report additional information immediately after about what the PCI(e) devices in question are. Here are the full kernel messages around the 'rom address conflict':

pci3 at ppb2 bus 3
3:0:0: rom address conflict 0xfffc0000/0x40000
3:0:1: rom address conflict 0xfffc0000/0x40000
bge0 at pci3 dev 0 function 0 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy0 at bge0 phy 1: BCM5720C 10/100/1000baseT PHY, rev. 0
bge1 at pci3 dev 0 function 1 "Broadcom BCM5720" rev 0x00, BCM5720 A0 (0x5720000), APE firmware NCSI 1.4.14.0: msi, address 50:9a:4c:xx:xx:xx
brgphy1 at bge1 phy 2: BCM5720C 10/100/1000baseT PHY, rev. 0

Here these are two network ports on the same PCIe device (more or less), so it's not terribly surprising that the same ROM is maybe being reused for both. I believe the two messages mean that both ROMs (at the same address) are conflicting with another unmentioned allocation. I'm not sure how you find out what the original allocation and device is that they're both conflicting with.

The PCI related messages come from sys/dev/pci/pci.c and in current OpenBSD come in a number of variations, depending on what sort of PCI address space is detected as in conflict in pci_reserve_resources(). Right now, I see 'mem address conflict', 'io address conflict', the already mentioned 'rom address conflict', 'bridge io address conflict', 'bridge mem address conflict' (in several spots in the code), and 'bridge bus conflict'. Interested parties can read the source for more because this exhausts my knowledge on the subject.

The 'memory map conflict' message comes from a different place; for most people it will come from sys/arch/amd64/pci/pci_machdep.c, in pci_init_extents(). If I'm understanding the code correctly, this is creating an initial set of reserved physical address space that PCI devices should not be using. It registers each piece of bios_memmap, which according to comments in sys/arch/amd64/amd64/machdep.c is "the memory map as the bios has returned it to us". I believe that a memory map conflict at this point says that two pieces of the BIOS memory map overlap each other (or one is entirely contained in the other).

I'm not sure it's correct to describe these messages as harmless. However, it's likely that they've been there for as long as your system's BIOS has been setting up its general memory map and the PCI devices as it has been, and you'd likely see the same address conflicts with another system (although Linux doesn't seem to complain about it; I don't know about FreeBSD).

Daemonization in Unix programs is probably about restarting programs

By: cks
6 October 2024 at 02:55

It's standard for Unix daemon programs to 'daemonize' themselves when they start, completely detaching from how they were run; this behavior is quite old and these days it's somewhat controversial and sometimes considered undesirable. At this point you might ask why programs even daemonize themselves in the first place, and while I don't know for sure, I do have an opinion. My belief is that daemonization is because of restarting daemon programs, not starting them at boot.

During system boot, programs don't need to daemonize in order to start properly. The general Unix boot time environment has long been able to detach programs into the background (although the V7 /etc/rc didn't bother to do this with /etc/update and /etc/cron, the 4.2BSD /etc/rc did do this for the new BSD network daemons). In general, programs started at boot time don't need to worry that they will be inheriting things like stray file descriptors or a controlling terminal. It's the job of the overall boot time environment to insure that they start in a clean environment, and if there's a problem there you should fix it centrally, not make it every program's job to deal with the failure of your init and boot sequence.

However, init is not a service manager (not historically), which meant that for a long time, starting or restarting daemons after boot was entirely in your hands with no assistance from the system. Even if you remembered to restart a program as 'daemon &' so that it was backgrounded, the newly started program could inherit all sorts of things from your login session. It might have some random current directory, it might have stray file descriptors that were inherited from your shell or login environment, its standard input, output, and error would be connected to your terminal, and it would have a controlling terminal, leaving it exposed to various bad things happening to it when, for example, you logged out (which often would deliver a SIGHUP to it).

This is the sort of thing that even very old daemonization code deals with, which is to say that it fixes. The 4.2BSD daemonization code closes (stray) file descriptors and removes any controlling terminal the process may have, in addition to detaching itself from your shell (in case you forgot or didn't use the '&' when starting it). It's also easy to see how people writing Unix daemons might drift into adding this sort of code to them as people restarted the daemons (by hand) and ran into the various problems (cf). In fact the 4.2BSD code for it is conditional on 'DEBUG' not being defined; presumably if you were debugging, say, rlogind, you'd build a version that didn't detach itself on you so you could easily run it under a debugger or whatever.
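
As a rough C sketch of what this sort of daemonization code does (illustrative, not the literal 4.2BSD code):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static void daemonize(void)
{
    int fd;

    if (fork() != 0)
        exit(0);        /* the parent returns to the shell */

    for (fd = 0; fd < 20; fd++)
        close(fd);      /* drop stray inherited file descriptors */

    /* Lose the controlling terminal; the 4.2BSD idiom was the
       TIOCNOTTY ioctl, the modern way is setsid(). */
    setsid();

    fd = open("/dev/null", O_RDWR);  /* becomes fd 0 */
    dup2(fd, 1);
    dup2(fd, 2);
}

int main(void)
{
    daemonize();
    /* ... the daemon's real work goes here ... */
    pause();
    return 0;
}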

It's a bit of a pity that 4.2 BSD and its successors didn't create a general 'daemonize' program that did all of this for you and then told people to restart daemons with 'daemonize <program>' instead of '<program>'. But we got the Unix that we have, not the Unix that we'd like to have, and Unixes did eventually grow various forms of service management that tried to encapsulate all of the things required to restart daemons in one place.

(Even then, I'm not sure that old System V init systems would properly daemonize something that you restarted through '/etc/init.d/<whatever> restart', or if it was up to the program to do things like close extra file descriptors and get rid of any controlling terminal.)

PS: Much later, people did write tools for this, such as daemonize. It's surprisingly handy to have such a program lying around for when you want or need it.

Traditionally, init on Unix was not a service manager as such

By: cks
5 October 2024 at 03:05

Init (the process) has historically had a number of roles but, perhaps surprisingly, being a 'service manager' (or a 'daemon manager') was not one of them in traditional init systems. In V7 Unix and continuing on into traditional 4.x BSD, init (sort of) started various daemons by running /etc/rc, but its only 'supervision' was of getty processes for the console and (other) serial lines. There was no supervision or management of daemons or services, even in the overall init system (stretching beyond PID 1, init itself). To restart a service, you killed its process and then re-ran it somehow; getting even the command line arguments right was up to you.

(It's conventional to say that init started daemons during boot, even though technically there are some intermediate processes involved since /etc/rc is a shell script.)

The System V init had a more general /etc/inittab that could in theory handle more than getty processes, but in practice it wasn't used for managing anything more than them. The System V init system as a whole did have a concept of managing daemons and services, in the form of its multi-file /etc/rc.d structure, but stopping and restarting services was handled outside of the PID 1 init itself. To stop a service you directly ran its init.d script with 'whatever stop', and the script used various approaches to find the processes and get them to stop. Similarly, (re)starting a daemon was done directly by its init.d script, without PID 1 being involved.

As a whole system the overall System V init system was a significant improvement on the more basic BSD approach, but it (still) didn't have init itself doing any service supervision. In fact there was nothing that actively did service supervision even in the System V model. I'm not sure what the first system to do active service supervision was, but it may have been daemontools. Extending the init process itself to do daemon supervision has a somewhat controversial history; there are Unix systems that don't do this through PID 1, although doing a good job of it has clearly become one of the major jobs of the init system as a whole.

That init itself didn't do service or daemon management is, in my view, connected to the history of (process) daemonization. But that's another entry.

(There's also my entry on how init (and the init system as a whole) wound up as Unix's daemon manager.)
